Currently, the default constructor of
`PredicatedTileAccessIteratorParams` will invoke undefined behavior in
its invocation of the `initialize` function. Specifically, it will
attempt to read from the uninitialized variables
`desc.element_size_bits` and `desc.advance_rank`. This commit changes
the default constructors of both `*Params` and `*Desc` to
zero-initialize all uninitialized members.
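
A minimal sketch of the pattern described above, not the actual CUTLASS code. `SketchDesc`, `SketchParams`, `stride`, and `inc_advance` are hypothetical stand-ins; only the field names `element_size_bits` and `advance_rank` come from this message. Default member initializers ensure `initialize` never reads an indeterminate value, even from the default constructor.

```cpp
#include <cstdint>

// Hypothetical stand-in for a *Desc type. Only the two field names are taken
// from the commit message; the rest is assumed for illustration.
struct SketchDesc {
  int element_size_bits = 0;  // default member initializer: never indeterminate
  int advance_rank = 0;
};

// Hypothetical stand-in for a *Params type.
struct SketchParams {
  std::int64_t stride = 0;
  std::int64_t inc_advance = 0;

  // The default constructor passes a value-initialized descriptor, so
  // initialize() only ever reads zeros here.
  SketchParams() { initialize(0, SketchDesc()); }

  SketchParams(std::int64_t stride_, SketchDesc desc) { initialize(stride_, desc); }

  void initialize(std::int64_t stride_, SketchDesc desc) {
    stride = stride_;
    std::int64_t bytes = desc.element_size_bits / 8;
    // Reads both descriptor fields; this is where uninitialized members
    // would have caused undefined behavior.
    inc_advance = desc.advance_rank ? stride_ * bytes : bytes;
  }
};
```

With this layout, a default-constructed `SketchParams` has all members equal to zero.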
* Remove unused variables
* Qualify calls to make_fragment_? from templated base class.
Fixes clang build error.
* Add missing `#include <cstdio>`
* Various changes to fix clang compile errors.
* More changes to fix clang build.
Remaining issues:
- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.
* Fix remaining clang build issues.
* Qualify `cute::rank()` calls.
* Qualify some more calls that are otherwise ambiguous between the `cute` and `std` namespaces.
* Double-escape special registers in inline asm.
* small change
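
The "qualify calls from templated base class" item above can be illustrated with a small sketch. The types `SketchMmaBase` and `SketchMma` are hypothetical; only the `make_fragment_*` naming comes from this message. Inside a class template, an unqualified call to a member of a dependent base class is not found by ordinary lookup, so the call is qualified with `this->` or with the base class name.

```cpp
// Hypothetical templated base class providing make_fragment_* helpers.
template <class T>
struct SketchMmaBase {
  T make_fragment_A(T x) const { return x + 1; }
  T make_fragment_B(T x) const { return x * 2; }
};

// Hypothetical derived class template whose base depends on T.
template <class T>
struct SketchMma : SketchMmaBase<T> {
  using Base = SketchMmaBase<T>;

  T run(T x) const {
    // An unqualified `make_fragment_A(x)` is not looked up in the dependent
    // base class, so conforming compilers reject it. Either form of
    // qualification below makes the lookup happen at instantiation time.
    T a = this->make_fragment_A(x);
    T b = Base::make_fragment_B(x);
    return a + b;
  }
};
```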
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Followup to #1224.
A change in the stream-k threadblock swizzle ctor since 3.3 breaks
single source GEMM with fused epilogue and stream-k. Multi-source was
already corrected.
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
* Release 3.3.0
Adds support for mixed precision GEMMs on Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update
* set kIsHeavy member variables
* correct kIsHeavy value for Tanh
* set kIsHeavy=false for HardSwish
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Passing warp-level mixed input F16*(S8/U8) tests
* passing device-level mixed input F16*(S8/U8) tests
* add to profiler - I8 (111 TFLOPs), U8 (123 TFLOPs)
* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)
* Speedup reference compilation (REVERT THIS COMMIT)
* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)
* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs
* BF16 * S8 (142 TFLOPs)
* Handle mixed-input upcast on OperandA (Support [S8|U8]*[F16|BF16])
* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast
* Add device-level test and profiler support for upcast on operand A
* Move shfl before the cvt and reduce #shfls by 1/2
* fix smem_usage calculation for mixed_input types
* uncomment the stuff (getting ready for merge)
* profiler changes and mixed-input reference
* mixed input reference are in a new file
* use platform instead of std
* comments and typo only
* Use CreateGemmOperator and delete CreateMixedInputGemmOperator
* copyright for new files
* rebase follow-up
When I use cutlass::epilogue::thread::LinearCombinationSigmoid, I encounter this error:
cutlass/include/cutlass/array.h(1549): error: no operator "-" matches these operands
Moving operator "-" from line 1549 to 1548 can solve this error
* [WIP] GEMM StreamK w/ Fused Epilogue
* Adds Gemm Streamk with Fused Epilogue kernel level struct.
* Mostly based on Gemm with Fused Epilogue,
* Requires a new epilogue
* Work in progress
* [WIP] StreamK support for GemmUniversalWithBroadcast
* Just based off of how StreamK is allowed in GemmUniversal
* Untested and a work in progress
* Minor fixes
* [WIP] It compiles!
It is almost certainly incorrect, but we're past getting the templates
to match, so checkpointing.
* Correction to reference kernel
* Fix typo
* Added MSE measurement
* Switch back to reference kernel + host for loop
Still WIP. Now we're getting an even larger MSE, but it's both on
basic Split-K and Stream-K.
* Fix typos
* Fix broadcast vector + requested changes
* Comment typo
* Small int option and more
* Fix incorrect condition on source needed
* Requested changes
* I think I got it?
* Bias vector should be stride 0
* Two source added!
* Typos
* Merge examples
* Bring back vector row offset
Just to ensure consistency with universal gemm with fused epilogue
* Base arguments and params structs for StreamK
* StreamK epilogue with broadcast now inherits the original
* undo params_streamk_base.h
---------
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Remove references to device-only intrinsics when compiling for host.
Currently, we attempt to use the `__device__`-only functions
`__cvta_generic_to_shared` and `__nvvm_get_smem_pointer` when compiling
`cute::cast_smem_ptr_to_uint` for the host on Clang. This results in a
compilation error, as expected. This commit changes the definition of
the `*_ACTIVATED` macros so that they are only true when `__CUDA_ARCH__`
is defined; that is, when compiling for the device.
Additionally, the declaration of `__nvvm_get_smem_pointer`
is currently only visible during the device compilation pass when
compiling with NVCC; this commit makes the declaration visible during
host compilation with the `__device__` annotation.
* Annotate cute::cast_smem_ptr_to_uint as device-only.
The implementation of `cute::cast_smem_ptr_to_uint` is currently an
unchecked failure on host code, and the only host implementation I can
think of -- casting a probably-64-bit pointer to 32 bits somehow --
doesn't make sense to implement. This commit marks this function as
device-only so that it can't be accidentally used on host code.
* small change
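
A rough sketch of the macro gating described above, assuming a 0/1 style macro. The name `SKETCH_SMEM_CAST_ACTIVATED` and the helper function are hypothetical and only follow the `*_ACTIVATED` pattern; the real macros may be defined differently. The point is that the macro is true only when `__CUDA_ARCH__` is defined, that is, during the device compilation pass.

```cpp
// Hypothetical macro following the *_ACTIVATED naming pattern. It evaluates
// to 1 only in the device compilation pass, where __CUDA_ARCH__ is defined.
#if defined(__CUDA_ARCH__)
#  define SKETCH_SMEM_CAST_ACTIVATED 1
#else
#  define SKETCH_SMEM_CAST_ACTIVATED 0
#endif

// Hypothetical helper that reports whether the device-only path is enabled
// in the current compilation pass.
constexpr bool sketch_smem_cast_activated() {
  return SKETCH_SMEM_CAST_ACTIVATED != 0;
}
```

When this is compiled for the host, the helper returns `false`, so code guarded by the macro never references device-only intrinsics.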
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Make operator() const-correct and add missing static functions.
Currently, `*Converter::operator()` requires a mutable object to invoke,
and there are missing `static result_type convert(source_type const &
source)` overloads for certain partial specializations of `*Converter`
objects. This commit makes `operator()` const-correct and adds missing
function overloads where appropriate.
* minor changes
* format
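
A minimal sketch of the converter shape described above, not one of the actual `*Converter` specializations. `SketchFloatToIntConverter` is hypothetical and uses plain truncation instead of any rounding mode. It shows a `static convert` overload plus a `const`-qualified `operator()` that forwards to it, so the functor can be invoked through a `const` object.

```cpp
// Hypothetical converter; the real *Converter templates differ in detail.
struct SketchFloatToIntConverter {
  using result_type = int;
  using source_type = float;

  // Static entry point, usable without constructing a converter object.
  static result_type convert(source_type const & source) {
    return static_cast<result_type>(source);  // truncation, for illustration only
  }

  // const-qualified so that it can be called on a const converter.
  result_type operator()(source_type const & source) const {
    return convert(source);
  }
};
```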
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
- clang 1.14 complains about missing function from a host call:
cutlass/include/cute/arch/util.hpp:106:32: error: no matching function for call to '__cvta_generic_to_shared'
return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
- fixes this by defining CUTE_HOST_DEVICE for clang as well
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
Currently, the `LinearCombinationClamp` header file is not standalone,
and must have the definition of `cutlass::epilogue::thread::ScaleType`
already available when it is `#include`d.