* Remove references to device-only intrinsics when compiling for host.
Currently, we attempt to use the `__device__`-only functions
`__cvta_generic_to_shared` and `__nvvm_get_smem_pointer` when compiling
`cute::cast_smem_ptr_to_uint` for the host on Clang. This results in a
compilation error, as expected. This commit changes the definition of
the `*_ACTIVATED` macros so that they are only true when `__CUDA_ARCH__`
is defined; that is, when compiling for the device.
Additionally, the declaration of `__nvvm_get_smem_pointer`
is currently only visible during the device compilation pass when
compiling with NVCC; this commit makes the declaration visible during
host compilation with the `__device__` annotation.
* Annotate `cute::cast_smem_ptr_to_uint` as device-only.
The implementation of `cute::cast_smem_ptr_to_uint` is currently an
unchecked failure on host code, and the only host implementation I can
think of -- casting a probably-64-bit pointer to 32 bits somehow --
doesn't make sense to implement. This commit marks this function as
device-only so that it can't be accidentally used on host code.
* small change
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
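
A minimal sketch of the gating described above, not the actual CuTe code. The names `SKETCH_SMEM_INTRINSIC_ACTIVATED`, `SKETCH_DEVICE`, `sketch_smem_intrinsic_activated` and `sketch_cast_smem_ptr_to_uint` are hypothetical stand-ins for the real `*_ACTIVATED` macros and `cute::cast_smem_ptr_to_uint`. The idea is that the intrinsic is referenced only when `__CUDA_ARCH__` is defined, so a host pass never sees the call:

```cpp
#include <cstdint>

// Hypothetical macro (assumption): true only in a device compilation pass,
// i.e. when __CUDA_ARCH__ is defined.
#if defined(__CUDA_ARCH__)
#  define SKETCH_SMEM_INTRINSIC_ACTIVATED 1
#else
#  define SKETCH_SMEM_INTRINSIC_ACTIVATED 0
#endif

// Hypothetical annotation macro (assumption): expands to __device__ only
// when a CUDA compiler is in use, so this file also builds as plain C++.
#if defined(__CUDACC__)
#  define SKETCH_DEVICE __device__
#else
#  define SKETCH_DEVICE
#endif

// Reports whether the intrinsic path was enabled for this pass.
constexpr bool sketch_smem_intrinsic_activated() {
  return SKETCH_SMEM_INTRINSIC_ACTIVATED != 0;
}

// Device-only in a CUDA build; the intrinsic is never named on the host.
SKETCH_DEVICE inline std::uint32_t sketch_cast_smem_ptr_to_uint(void const* ptr) {
#if SKETCH_SMEM_INTRINSIC_ACTIVATED
  // Device pass: convert a generic pointer to a shared-memory address.
  return static_cast<std::uint32_t>(__cvta_generic_to_shared(ptr));
#else
  // Non-device pass: placeholder so the sketch compiles without CUDA.
  // In a CUDA build the __device__ annotation keeps host code from calling this.
  (void)ptr;
  return 0u;
#endif
}
```

When built with an ordinary host compiler, `sketch_smem_intrinsic_activated()` is `false` and the intrinsic call is preprocessed away.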
* Make operator() const-correct and add missing static functions.
Currently, `*Converter::operator()` requires a mutable object to invoke,
and there are missing `static result_type convert(source_type const &
source)` overloads for certain partial specializations of `*Converter`
objects. This commit makes `operator()` const-correct and adds missing
function overloads where appropriate.
* minor changes
* format
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
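
As an illustration of the const-correctness change above, here is a minimal sketch using a hypothetical converter. `SketchFloatToInt8Converter` is not a CUTLASS type, and its clamp-and-truncate behavior is an assumption chosen only to keep the example small. It shows a `static convert` overload plus a `const`-qualified `operator()` that forwards to it:

```cpp
#include <cstdint>

// Hypothetical stand-in for a `*Converter` specialization (assumption).
struct SketchFloatToInt8Converter {
  using result_type = std::int8_t;
  using source_type = float;

  // Static entry point: callable without constructing a converter.
  static result_type convert(source_type const& source) {
    // Clamp to the int8 range, then truncate toward zero (assumed policy).
    float clamped = source < -128.0f ? -128.0f
                  : (source > 127.0f ? 127.0f : source);
    return static_cast<result_type>(clamped);
  }

  // const-qualified, so it can be invoked on const objects and const references.
  result_type operator()(source_type const& source) const {
    return convert(source);
  }
};
```

With the `const` qualifier in place, a declaration such as `SketchFloatToInt8Converter const conv{};` can be called directly as `conv(x)`. Without it, the call would fail to compile.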
- clang 1.14 complains about missing function from a host call:
cutlass/include/cute/arch/util.hpp:106:32: error: no matching function for call to '__cvta_generic_to_shared'
return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
- fixes this by defining CUTE_HOST_DEVICE for clang as well
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Currently, the `LinearCombinationClamp` header file is not standalone,
and must have the definition of `cutlass::epilogue::thread::ScaleType`
already available when it is `#include`d.
* added support for b2b bmm
* fixed arguments and params structures
* added batch_count argument
* removed SplitKSerial and added new test case with b2b bmm
* fixed support of Kbatched and added new test case with batch stride
* added batch support for bias and scale
* make test
* small changes
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Enable shared memory intrinsics and ldmatrix PTX on Clang.
This commit adds preprocessor checks to enable the shared memory
intrinsics `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer`, as
well as the `ldmatrix` PTX instructions, on Clang. Preventing these
intrinsics from being used causes a significant latency regression on Clang.
* refine the macro
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Changes to iterators to support s8 gemm with f16 outputs
* should work
---------
Co-authored-by: Sujan Gonugondla <gsujan@amaon.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>