* Remove references to device-only intrinsics when compiling for host.
Currently, we attempt to use the `__device__`-only functions
`__cvta_generic_to_shared` and `__nvvm_get_smem_pointer` when compiling
`cute::cast_smem_ptr_to_uint` for the host on Clang. This results in a
compilation error, as expected. This commit changes the definition of
the `*_ACTIVATED` macros so that they are only true when `__CUDA_ARCH__`
is defined; that is, when compiling for the device.
Additionally, the declaration of `__nvvm_get_smem_pointer`
is currently only visible during the device compilation pass when
compiling with NVCC; this commit makes the declaration visible during
host compilation with the `__device__` annotation.
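A minimal sketch of the two changes, using an illustrative macro name (the real conditions in CUTLASS are more involved):
```
#include <cstdint>

// Gate the *_ACTIVATED macros on __CUDA_ARCH__, which is defined only
// during the device compilation pass (macro name is illustrative).
#if defined(__CUDA_ARCH__)
#  define CUTE_ARCH_SOME_FEATURE_ACTIVATED 1
#endif

// Annotating the declaration __device__ keeps it visible during the host
// pass while still being callable only from device code.
extern "C" __device__ std::uint32_t __nvvm_get_smem_pointer(void*);
```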
* Annotate cute::cast_smem_ptr_to_uint as device-only.
Invoking `cute::cast_smem_ptr_to_uint` from host code currently fails
without any compile-time diagnostic, and the only host implementation I can
think of -- somehow truncating a probably-64-bit pointer to 32 bits --
doesn't make sense to implement. This commit marks this function as
device-only so that it can't be accidentally used in host code.
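A sketch of the resulting definition, assuming CuTe's CUTE_DEVICE macro (which carries `__device__`):
```
// Device-only: host instantiation now fails at compile time instead of
// reaching an unimplementable 64-bit-to-32-bit pointer cast.
CUTE_DEVICE
uint32_t
cast_smem_ptr_to_uint(void const* const ptr)
{
  return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
}
```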
* small change
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Make operator() const-correct and add missing static functions.
Currently, `*Converter::operator()` can only be invoked on a mutable
object, and some partial specializations of the `*Converter` classes lack
the `static result_type convert(source_type const & source)` overload.
This commit makes `operator()` const-correct and adds the missing
function overloads where appropriate.
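A hedged sketch of the resulting pattern on a NumericConverter-style specialization (names simplified):
```
template <typename T, typename S>
struct Converter {
  using result_type = T;
  using source_type = S;

  // The static overload that some partial specializations were missing.
  static result_type convert(source_type const& source) {
    return static_cast<result_type>(source);
  }

  // const-qualified, so it can be invoked on a const Converter object.
  result_type operator()(source_type const& source) const {
    return convert(source);
  }
};
```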
* minor changes
* format
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
- Clang 14 complains about a missing function in a host call:
  cutlass/include/cute/arch/util.hpp:106:32: error: no matching function for call to '__cvta_generic_to_shared'
      return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
- Fixes this by defining CUTE_HOST_DEVICE for Clang as well
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
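A sketch of the described fix; the actual branch in cute/config.hpp carries more conditions:
```
// Let the Clang CUDA front end take the same path as NVCC when defining
// the host/device annotation macros.
#if defined(__CUDACC__) || defined(__clang__)
#  define CUTE_HOST_DEVICE __forceinline__ __host__ __device__
#  define CUTE_DEVICE      __forceinline__          __device__
#else
#  define CUTE_HOST_DEVICE inline
#  define CUTE_DEVICE      inline
#endif
```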
Currently, the `LinearCombinationClamp` header file is not standalone,
and must have the definition of `cutlass::epilogue::thread::ScaleType`
already available when it is `#include`d.
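Presumably the fix is a directive along these lines (the exact header that defines `ScaleType` is an assumption):
```
#include "cutlass/epilogue/thread/scale_type.h"  // cutlass::epilogue::thread::ScaleType
```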
* Enable shared memory intrinsics and ldmatrix PTX on Clang.
This commit adds preprocessor checks to enable the shared memory
intrinsics `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer`, as
well as the `ldmatrix` PTX instructions, on Clang. Leaving these
intrinsics disabled causes a significant latency regression on Clang.
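A sketch of the gating pattern, reusing the *_ACTIVATED convention mentioned earlier (the exact macro name and architecture condition are illustrative):
```
// Activate the ldmatrix path on the device pass for SM75+ under either
// NVCC or Clang.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 750) && \
    (defined(__CUDACC__) || defined(__clang__))
#  define CUTE_ARCH_LDSM_SM75_ACTIVATED 1
#endif
```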
* refine the macro
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Changes to iterators to support s8 gemm with f16 outputs (see the sketch after this entry)
* should work
---------
Co-authored-by: Sujan Gonugondla <gsujan@amaon.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
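An illustrative configuration of the kind this entry enables; these template arguments are a sketch, not the exact iterators the commit touches:
```
#include "cutlass/gemm/device/gemm.h"

// int8 operands, f16 outputs, int32 accumulation.
using GemmS8F16 = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,          // A
    int8_t, cutlass::layout::ColumnMajor,       // B
    cutlass::half_t, cutlass::layout::RowMajor, // C/D: f16 outputs
    int32_t>;                                   // accumulator
```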
Work around a likely GCC 8.x issue with fold expressions
and generic lambdas.
Only use the work-around when the host compiler is GCC 8.x.
This avoids any concerns about the work-around possibly
hindering inlining for a critical CuTe function (product).
Users can experiment with the work-around for other compilers
or compiler versions by defining the following macro.
CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND
Fixes https://github.com/NVIDIA/cutlass/issues/788
Co-authored-by: Mark Hoemmen <mhoemmen@nvidia.com>
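A minimal illustration of the problematic shape, not the actual CuTe code:
```
// A fold expression expanded inside a generic lambda; GCC 8.x can
// mis-handle this combination, hence the guarded workaround.
template <class... Ts>
constexpr auto product(Ts const&... ts) {
  auto mul = [](auto const&... vs) { return (vs * ... * 1); };
  return mul(ts...);
}
```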
This commit changes the declarations of MMA operator class (SIMT, Tensor Core, WMMA Tensor Core) and operator type (multiply-add and so on) to definitions. This is done so that these tag structs are no longer incomplete types, which allows the `typeid` operator to be used on these tag structs. This is necessary for these tag structs to be used as type parameters in [GoogleTest typed tests](https://google.github.io/googletest/advanced.html#typed-tests).
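An illustration of the declaration-to-definition change on one of these tags:
```
// Before: declaration only -- the tag is an incomplete type, so
// typeid(OpClassSimt) is ill-formed.
struct OpClassSimt;

// After: an empty definition completes the type, so typeid works.
struct OpClassSimt {};
```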
This commit adds two `#include` directives so that the definitions of `cutlass::gemm::warp::WarpSize` from "cutlass/gemm/warp/mma.h" and `cutlass::arch::OpClassSimt` from "cutlass/arch/mma.h" are visible to "cutlass/epilogue/threadblock/default_epilogue_simt.h". Without them, there are compiler errors when building the header standalone (the added directives are repeated as code after the error log):
```
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:32: error: no member named 'warp' in namespace 'cutlass::gemm'; did you mean simply 'warp'?
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
^
./cutlass/include/cutlass/epilogue/warp/tile_iterator_simt.h:49:11: note: 'warp' declared here
namespace warp {
^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:53: error: no member named 'WarpSize' in namespace 'cutlass::epilogue::warp'
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:68: error: no member named 'OpClassSimt' in namespace 'cutlass::arch'
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:82: error: no member named 'value' in the global namespace
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:367:5: error: use of class template 'OutputTileThreadMap' requires template arguments
OutputTileThreadMap,
^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:391:5: error: use of class template 'OutputTileThreadMap' requires template arguments
OutputTileThreadMap,
^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:405:5: error: unknown type name 'OutputTileIterator'; did you mean 'WarpTileIterator'?
OutputTileIterator,
^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:380:9: note: 'WarpTileIterator' declared here
using WarpTileIterator = cutlass::epilogue::warp::TileIteratorSimtDirect2dConv<
^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:408:5: error: use of class template 'SharedLoadIterator' requires template arguments
SharedLoadIterator,
^
./cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator.h:67:7: note: template is declared here
class SharedLoadIterator {
^
```
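The two directives in question, with the paths quoted above:
```
#include "cutlass/arch/mma.h"       // cutlass::arch::OpClassSimt
#include "cutlass/gemm/warp/mma.h"  // cutlass::gemm::warp::WarpSize
```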
* Relax stream K gemm alignment constraints
The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.
* Revert "Relax stream K gemm alignment constraints"
This reverts commit 31e80a250e2b0ac4bda2e4b437b39dc5bcd5e845.
* Relax stream K gemm alignment constraints
The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* add two missing files
* fix a bunch of bugs in the gemm-reducek fusion and add a device interface
* small changes
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* ex42: Fused MHA imported from xFormers
* Remove std:: references
* Support K>128 in the example
* Support causal option
* Support different head sizes for V, and different sequence lengths for KV
* Update FLOPS counter
* Remove bit_cast
* fix build: Replace M_LOG2E (see the sketch after this entry)
* Add doc
* Revert "Remove bit_cast"
This reverts commit 9662fa86bb7c57c1a015ac0bf52cb52940fbbf80.
* Explicit casts to int32_t for windows build
Co-authored-by: danthe3rd <danthe3rd>
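For the M_LOG2E item, one portable replacement is a literal constant, since M_LOG2E is a POSIX extension that MSVC only exposes behind _USE_MATH_DEFINES (the constant name is illustrative):
```
// log2(e), replacing the non-portable M_LOG2E macro.
static constexpr float kLog2e = 1.4426950408889634f;
```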
- adds missing commas
- fixes inconsistent usage of CUTLASS_DEVICE between the
  template declaration and its specializations
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Fixed template struct/class mismatch
* Use platform implementation instead of std::abs and std::conditional during nvrtc compilation
* Revert absolute_value() usage
* Fix separate compilation `-dc`
- when CUTLASS is included in multiple compilation units
  compiled with `-dc`, the OOB_NAN_F16x8 device constant is
  instantiated multiple times, causing a
  "Multiple definition of '_ZN7cutlass4arch13OOB_NAN_F16x8E'" error.
  This PR makes this variable a local constant, as it is not
  modified at runtime (see the sketch after this entry).
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
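A sketch of the shape of the fix; the function name and the exact NaN payload are assumptions:
```
__device__ void write_oob_nan(uint4& dst) {
  // Function-local constant: no external symbol is emitted, so multiple
  // translation units compiled with -dc no longer clash at device link time.
  // 0x7eff is an f16 NaN bit pattern; two are packed per 32-bit word.
  uint4 const OOB_NAN_F16x8 = make_uint4(0x7eff7effu, 0x7eff7effu,
                                         0x7eff7effu, 0x7eff7effu);
  dst = OOB_NAN_F16x8;
}
```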
* Fix
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Test GH
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Revert test GH
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels
* Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy constructor for device code (see the sketch after this entry)
* std => platform
* fix affine2
* really fix affine2
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
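A simplified sketch of the SFINAE guard on the converting constructor; the real TensorRef carries more template machinery:
```
#include <type_traits>

template <typename Element, typename Layout>
class TensorRef {
  Element* ptr_ = nullptr;

 public:
  using NonConstTensorRef =
      TensorRef<typename std::remove_const<Element>::type, Layout>;

  TensorRef() = default;
  __host__ __device__ Element* data() const { return ptr_; }

  // Enabled only when Element is const-qualified, i.e. when
  // NonConstTensorRef is a genuinely different type. Otherwise this would
  // be a user-provided copy constructor, making the type non-trivially
  // copyable and blocking device-side kernel launches.
  template <typename E = Element,
            typename std::enable_if<
                !std::is_same<E, typename std::remove_const<E>::type>::value,
                int>::type = 0>
  __host__ __device__ TensorRef(NonConstTensorRef const& ref)
      : ptr_(ref.data()) {}
};
```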
* Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray (an illustrative instantiation is sketched after this entry)
* Add a reference to GemmArray to the docs
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
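A hedged sketch of what such a demo instantiates (pointer-array batched GEMM; the exact template defaults may differ):
```
#include "cutlass/gemm/device/gemm_array.h"

// Each batch item supplies its own A/B/C/D pointer via device-side
// pointer arrays, rather than a fixed batch stride.
using BatchedGemm = cutlass::gemm::device::GemmArray<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor,   // C/D
    float>;                                // accumulator
```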