Commit Graph

153 Commits

Author SHA1 Message Date
wll
19c4a4815e
replace division with multiplication in GELU (#942) 2023-05-12 10:57:18 -04:00
Gregory Meyer (gregjm)
fcfbd23e26
Fix host compilation of cute::cast_smem_ptr_to_uint. (#940)
* Remove references to device-only intrinsics when compiling for host.

Currently, we attempt to use the `__device__`-only functions
`__cvta_generic_to_shared` and `__nvvm_get_smem_pointer` when compiling
`cute::cast_smem_ptr_to_uint` for the host on Clang. This results in a
compilation error, as expected. This commit changes the definition of
the `*_ACTIVATED` macros so that they are only true when `__CUDA_ARCH__`
is defined; that is, when compiling for the device.

Additionally, the declaration of `__nvvm_get_smem_pointer`
is currently only visible during the device compilation pass when
compiling with NVCC; this commit makes the declaration visible during
host compilation with the `__device__` annotation.

* Annotate cute::cast_smem_ptr_to_uint as device-only.

The implementation of `cute::cast_smem_ptr_to_uint` is currently an
unchecked failure on host code, and the only host implementation I can
think of -- casting a probably-64-bit pointer to 32 bits somehow --
doesn't make sense to implement. This commit marks this function as
device-only so that it can't be accidentally used on host code.

* small change

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-10 00:06:54 -04:00
Gregory Meyer (gregjm)
b250faccd3
Make operator() const-correct and add missing static functions. (#936)
* Make operator() const-correct and add missing static functions.

Currently, `*Converter::operator()` requires a mutable object to invoke,
and there are missing `static result_type convert(source_type const &
source)` overloads for certain partial specializations of `*Converter`
objects. This commit makes `operator()` const-correct and adds missing
function overloads where appropriate.

* minor changes

* format

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-09 16:33:01 -04:00
Janusz Lisiecki
24c8b7d8a2
Fix cuTE compilation with clang (#939)
- clang 1.14 complains about missing function from a host call:
  cutlass/include/cute/arch/util.hpp:106:32: error: no matching function for call to '__cvta_generic_to_shared'
  return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
- fixes this by defining CUTE_HOST_DEVICE for clang as well

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2023-05-09 09:51:45 -04:00
ANIKET SHIVAM
7c04f95415
Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
Gregory Meyer (gregjm)
6f8596ce3f
Add missing #include directive to get access to cutlass::epilogue:🧵:ScaleType. (#925)
Currently, the `LinearCombinationClamp` header file is not standalone,
and must have the definition of `cutlass::epilogue:🧵:ScaleType`
already available when it is `#include`d.
2023-04-28 20:02:41 -04:00
Adnan Akhundov
fe2f491dd7
Get SM count with cudaDeviceGetAttribute in KernelHardwareInfo (#927) 2023-04-28 13:23:23 -04:00
Jakub Szuppe
180c5629bf
Add missing checks for NVRTC in CuTe (#921) 2023-04-25 12:52:43 -04:00
Guray Ozen
43cfbe0086
Allow L2 prefect for clang compiler (#914) 2023-04-15 01:23:22 -04:00
ANIKET SHIVAM
d572cc1aab
CUTLASS 3.1 (#915)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-04-14 23:19:34 -04:00
Adnan Akhundov
0435979f59
Remove const from 3.x GemmUniversalAdapter::operator() (#905) 2023-04-03 20:30:51 -04:00
Gregory Meyer (gregjm)
ecbd24566c
Enable shared memory intrinsics and ldmatrix PTX on Clang. (#754)
* Enable shared memory intrinsics and ldmatrix PTX on Clang.

This commit adds preprocessor checks to enable the shared memory
intrinsics `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer`, as
well as the `ldmatrix` PTX instructions, on Clang. Preventing these
intrinsics from being used is a significant latency regression on Clang.

* refine the macro

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-31 21:42:24 -04:00
Feng Shijie
bc36122c3f
[layout] Fix AffineRank2ColumnMajor::packed() (#879)
* [layout] Fix AffineRank2ColumnMajor::packed()

* correct affine2row::packed

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-29 11:59:48 -04:00
Vijay Thakkar
15d9d31f1f
CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897) 2023-03-29 10:42:40 -04:00
ptrblck
1eef5c3cf1
add guards for __CUDA_ARCH__ >= 530 (#891)
* add guards for sm>=70

* drop guard to 530
2023-03-28 17:47:10 -04:00
Alexander Zinoviev
42290f5d1c
Fix for dangling pointers (#885) 2023-03-25 01:15:14 -04:00
Vijay Thakkar
209faf7b94
remove spurious comma (#871) 2023-03-20 17:25:27 -04:00
Jack Kosaian
6116706c96
Set batch_strides on Params::update (#883) 2023-03-20 17:07:47 -04:00
Nikita Shulga
2670b973dd
Fix sign-compare warning in reorder_array (#869)
`std::vector<T>::size_type` is unsigned type, so let's iterate over unsigned type as well


Discovered, while trying to enable PyTorch building without `-Wno-sign-compare` warning suppression, see https://github.com/pytorch/pytorch/actions/runs/4418987999/jobs/7746850762#step:10:10532
2023-03-20 17:07:24 -04:00
Vijay Thakkar
af332d4aa9
Add missing comma in cutlass/arch/mma_sm90.h (#862) 2023-03-14 12:04:28 -04:00
Edward Rees
86cae03cea
expose StoreT parameter for potential speed (#838)
* expose StoreT parameter for potential speed

* add storeT to more elementwise

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-10 12:58:17 -05:00
Stepan Tezyunichev
29801e348a
Hide streams and typinfo from nvrtc (#853)
* Hide streams and typinfo from nvrtc

* Use __CUDACC_RTC__ instead CUDA_ARCH for guard
2023-03-09 23:24:47 -05:00
Alexander Pivovarov
7e370c9637
Fix typos 2 (#842)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-03-09 23:22:56 -05:00
ANIKET SHIVAM
c4f6b8c6bc
Updates for 3.0 (#857)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-03-09 15:27:40 -05:00
psaab
a31b43b3f3
Re-enable aarch64 support lost in 277bd6e537 (#846) 2023-03-02 11:17:21 -05:00
dan_the_3rd
f396cdd15c
ex24[gemm_grouped]: Allow to change layout/dtype (#841)
* ex24[gemm_grouped]: Allow to change layout/dtype

* Address suggestion from @jackkosaian

---------

Co-authored-by: danthe3rd <danthe3rd>
2023-03-01 07:13:51 -05:00
Alexander Pivovarov
92ebbf1dc4
Fix typos (#839) 2023-02-27 11:17:58 -05:00
Haicheng Wu
65688c2a87
streamk fix (#836)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-23 16:35:08 -05:00
Yuxin Wu
95f673ecf7
Update base_grouped.h (#832) 2023-02-21 14:48:30 -05:00
Haicheng Wu
91b8de8d32
streamk fix (#830)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-20 11:03:16 -05:00
Sujan Kumar Gonugondla
d8359c804b
Changes to iterators to support s8 gemm with f16 outputs (#812)
* Changes to iterators to support s8 gemm with f16 outputs

* should work

---------

Co-authored-by: Sujan Gonugondla <gsujan@amaon.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-16 18:37:51 -05:00
Haicheng Wu
9fb38ac048
fix alignmentC=8 for imma N=128 (#822)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-15 12:06:00 -05:00
Shuai Shao
ce8597dc14
Fix type bug in conv2d/gemm with broadcast (#796)
add ElementVector

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-09 20:53:25 -05:00
Jack Kosaian
5ff5209ed5
Add acc2smem in epilogue/threadblock/epilogue.h (#806) 2023-02-06 22:04:16 -05:00
Jack Kosaian
5921043981
Re-enable all alignments for int accumulators (#807) 2023-02-06 22:01:15 -05:00
Mark Hoemmen
add4ba622f
Fix 8.4 + CUDA 11.4 build (#789)
Work around a likely GCC 8.x issue with fold expressions
and generic lambdas.

Only use the work-around when the host compiler is GCC 8.x.
This avoids any concerns about the work-around possibly
hindering inlining for a critical CuTe function (product).

Users can experiment with the work-around for other compilers
or compiler versions by defining the following macro.

CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND

Fixes https://github.com/NVIDIA/cutlass/issues/788

Co-authored-by: Mark Hoemmen <mhoemmen@nvidia.com>
2023-01-27 09:18:59 -05:00
Vijay Thakkar
277bd6e537
CUTLASS 3.0.0 (#786)
* CUTLASS 3.0.0
2023-01-23 20:55:28 -05:00
ANIKET SHIVAM
66d9cddc83
New updates for 2.11 (#775)
* New updates.

* Minor profiler updates

Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-01-20 16:32:57 -05:00
psaab
d49bef88f9
Enable aarch64 support (#779) 2023-01-20 15:51:58 -05:00
Haicheng Wu
764b840d6f
streamk example and performance tuning (#760)
* streamk example and performance tuning

* one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-10 16:10:02 -05:00
Ali Hassani
a1046d49c1
Adds missing semicolon (#759) 2023-01-09 21:50:46 -05:00
Gregory Meyer (gregjm)
7bdba07310
Add definitions for tag structs. (#752)
This commit changes the declarations of MMA operator class (SIMT, Tensor Core, WMMA Tensor Core) and operator type (multiply-add and so on) to definitions. This is done so that these tag structs are no longer incomplete types, which allows the `typeid` operator to be used on these tag structs. This is necessary for these tag structs to be used as type parameters in [GoogleTest typed tests](https://google.github.io/googletest/advanced.html#typed-tests).
2023-01-06 09:46:52 -05:00
Gregory Meyer (gregjm)
c54ede3a9e
Add const overloads for iterator functions. (#753)
This commit adds `const`-correct overloads for `Array::{begin,end,rbegin,rend}`. These overloads are necessary for usage with [the GMock Container Matchers](http://google.github.io/googletest/reference/matchers.html#container-matchers), which cast the `Container` argument to a constant reference.
2023-01-06 09:46:34 -05:00
Haicheng Wu
ff6e733fe1
restore the old epilogue for everything except streamk (#749)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-04 11:02:55 -05:00
Haicheng Wu
1e64f153b3
improve streamk load balance (#743)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-25 13:56:33 -05:00
Gregory Meyer (gregjm)
b85865d1ad
Add missing #include directives (#741)
This commit adds two `#include` directives so that the definitions of `cutlass::gemm::warp::WarpSize` from "cutlass/gemm/warp/mma.h" and `cutlass::arch::OpClassSimt` from "cutlass/arch/mma.h" are visible to "cutlass/epilogue/threadblock/default_epilogue_simt.h". Without them, there are compiler errors when building the header standalone:

```
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:32: error: no member named 'warp' in namespace 'cutlass::gemm'; did you mean simply 'warp'?
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                               ^
./cutlass/include/cutlass/epilogue/warp/tile_iterator_simt.h:49:11: note: 'warp' declared here
namespace warp {
          ^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:53: error: no member named 'WarpSize' in namespace 'cutlass::epilogue::warp'
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                                              ~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:68: error: no member named 'OpClassSimt' in namespace 'cutlass::arch'
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                                                             ~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:82: error: no member named 'value' in the global namespace
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                                                                               ~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:367:5: error: use of class template 'OutputTileThreadMap' requires template arguments
    OutputTileThreadMap,
    ^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
       ^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:391:5: error: use of class template 'OutputTileThreadMap' requires template arguments
    OutputTileThreadMap,
    ^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
       ^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:405:5: error: unknown type name 'OutputTileIterator'; did you mean 'WarpTileIterator'?
    OutputTileIterator,
    ^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:380:9: note: 'WarpTileIterator' declared here
  using WarpTileIterator = cutlass::epilogue::warp::TileIteratorSimtDirect2dConv<
        ^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:408:5: error: use of class template 'SharedLoadIterator' requires template arguments
    SharedLoadIterator,
    ^
./cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator.h:67:7: note: template is declared here
class SharedLoadIterator {
      ^
```
2022-12-21 11:40:20 -05:00
ANIKET SHIVAM
38193d76e3
Updates for stream-k (#728)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-12-08 23:48:10 -05:00
Gregory Meyer (gregjm)
1d7772f218
Add missing #include directive (#727) 2022-12-08 18:58:31 -05:00
Mike Iovine
d6117ca362
Relax stream K gemm alignment constraints (#717)
* Relax stream K gemm alignment constraints

The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.

* Revert "Relax stream K gemm alignment constraints"

This reverts commit 31e80a250e2b0ac4bda2e4b437b39dc5bcd5e845.

* Relax stream K gemm alignment constraints

The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-07 11:17:49 -05:00
Ali Hassani
9c0518608e
Fix typos in conv problem sizes (#720)
* Fix typos in conv problem sizes

* Typos
2022-12-05 15:54:58 -05:00