* Release 3.3.0
Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update
* Passing warp-level mixed input F16*(S8/U8) tests
* passing device-level mixed input F16*(S8/U8) tests
* add to profiler - I8 (111 TFLOPs), U (123 TFLOPs)
* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)
* Speedup reference compilation (REVERT THIS COMMIT)
* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)
* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs
* BF16 * S8 (142 TFLOPs)
* Handle mixed-input upcast on OperandA (Support [S8|U8]*[F16|BF16]
* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast
* Add device-level test and profiler support for upcast on operand A
* Move shfl before the cvt and reduce #shfls by 1/2
* fix smem_usage calculation for mixed_input types
* uncomment the stuff (getting ready for merge)
* profiler changes and mixed-input reference
* mixed input reference are in a new file
* use platform instead of std
* comments and typo only
* Use CreateGemmOperator and delete CreateMixedInputGemmOperator
* copyright for new files
* rebase follow-up
* Split apart gemm reference templates into multiple TUs for parallel compilation
* remove old files
* better balancing of ref kernels across TUs
* remove 3 new added refcheck kernels and some un-necessary fp8 library instances to reduce lib size
* remove auto fp8 kernels
* remove some redundant kernels
* Support parallel split K mode for porfiling
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
* Parallel Split K support
1. find gemm kernel by preference key
2. switch m n for redution kernel
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
* parallel splitk for fp16 gemm
* add one missing file
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
CUTLASS 2.7
Mainloop fusion for GEMM: summation over A or B
Strided DGRAD (optimized iterators)
Half-precision GELU_taylor activation functions
Use these when accumulation and epilogue compute types are all cutlass::half_t
Tuning and bug fixes to fused GEMM + GEMM example
Support for smaller than 128b aligned Convolutions: see examples
Caching of results to accelerate Convolution unit tests
Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
Corrections and bug fixes reported by the CUTLASS community
Thank you for filing these issues!
authored-by: Haicheng Wu haichengw@nvidia.com, Manish Gupta manigupta@nvidia.com, Dustyn Blasig dblasig@nvidia.com, Andrew Kerr akerr@nvidia.com