* Add support for mixed 4-bit/8-bit data types GEMM
* fix ( and )
---------
Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Add couple configs into generator.py for mixed input MM
* change one unit test name; reenable 128x32 in the profiler
* Added U8/BF16 tests.
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Allow per-column bias in EpilogueTensorBroadcast
EpilogueTensorBroadcast only supports per-row vector broadcast, because
the bias stride is hardcoded.
It can easily support both if the bias stride is made conditional, and
the original behavior is maintained by defaulting to per-row.
* Add unit test for EpilogueTensorBroadcast with per-col bias
---------
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Ali Hassani <ali@hippoml.com>
* Remove unused variables
* Qualify calls to make_fragment_? from templated base class.
Fixes clang build error.
* Add missing `#include <cstdio>`
* Various changes to fix clang compile errors.
* More changes to fix clang build.
Remaining issues:
- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.
* Fix remaining clang build issues.
* Qualify `cute::rank()` calls.
* Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace.
* Double-escape special registers in inline asm.
* small change
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Release 3.3.0
Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update
* Passing warp-level mixed input F16*(S8/U8) tests
* passing device-level mixed input F16*(S8/U8) tests
* add to profiler - I8 (111 TFLOPs), U (123 TFLOPs)
* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)
* Speedup reference compilation (REVERT THIS COMMIT)
* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)
* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs
* BF16 * S8 (142 TFLOPs)
* Handle mixed-input upcast on OperandA (Support [S8|U8]*[F16|BF16]
* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast
* Add device-level test and profiler support for upcast on operand A
* Move shfl before the cvt and reduce #shfls by 1/2
* fix smem_usage calculation for mixed_input types
* uncomment the stuff (getting ready for merge)
* profiler changes and mixed-input reference
* mixed input reference are in a new file
* use platform instead of std
* comments and typo only
* Use CreateGemmOperator and delete CreateMixedInputGemmOperator
* copyright for new files
* rebase follow-up
* Remove redundant <fstream> includes
* Fix fstream in examples/
* Fix <fstream> in test/
* Use consistent order for <fstream> (always after <iostream>)
* Remove an unneeded include in a file where std::ofstream usage is commented out
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
* Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu
As title
* Update gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu
change the missing one
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, and small matrix classes, bug fixes, and performance enhancements.
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes
CUTLASS 2.0
Substantially refactored for
- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code
Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler
Native Turing Tensor Cores
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs
Note: this commit and all that follow require a host compiler supporting C++11 or greater.