* Split apart GEMM reference templates into multiple translation units (TUs) for parallel compilation
* Remove old files
* Better balancing of reference kernels across TUs
* Remove three newly added refcheck kernels and some unnecessary FP8 library instances to reduce library size
* Remove auto FP8 kernels
* Remove some redundant kernels
CUTLASS 2.3 adds GEMM kernels targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM kernels, small matrix classes, bug fixes, and performance enhancements.
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes
CUTLASS 2.0
Substantially refactored for:
- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code
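As a hedged illustration of those containers and numeric types (the values and variable names below are arbitrary, not taken from the release notes), `cutlass::half_t` and `cutlass::Array` are usable from ordinary host code as well as device code:

```cpp
#include <iostream>

#include "cutlass/array.h"
#include "cutlass/numeric_types.h"

int main() {
  // cutlass::half_t is a 16-bit floating-point type that behaves like a
  // built-in arithmetic type in both host and device code.
  cutlass::half_t x(2.25f);
  cutlass::half_t y(0.5f);

  // cutlass::Array is a statically sized container designed so device
  // code can load and store it with vectorized accesses.
  cutlass::Array<cutlass::half_t, 8> frag;
  for (int i = 0; i < 8; ++i) {
    frag[i] = x * y;  // arithmetic operators are overloaded for half_t
  }

  std::cout << "frag[0] = " << float(frag[0]) << std::endl;  // prints 1.125
  return 0;
}
```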
Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler
Native Turing Tensor Cores
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
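As a minimal sketch of how such a kernel is selected (following the instantiation pattern documented in the Quick Start guide, not a definitive or tuned configuration), the device-level GEMM template targets Turing Tensor Cores by naming the Tensor Core operator class and the SM75 architecture tag, leaving tile shapes and the epilogue to CUTLASS's defaults:

```cpp
#include "cutlass/gemm/device/gemm.h"

// Half-precision inputs with single-precision accumulation, using the
// native Turing Tensor Core path (mma.sync on SM75).
using TuringGemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::ColumnMajor,  // A
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B
    cutlass::half_t, cutlass::layout::ColumnMajor,  // C and D
    float,                                          // accumulator
    cutlass::arch::OpClassTensorOp,                 // use Tensor Cores
    cutlass::arch::Sm75                             // Turing
>;
```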
Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs (a basic SGEMM invocation is sketched after this list)
- Volta Tensor Cores through native mma.sync and through the WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs
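To make the device-level API concrete, here is a minimal SGEMM sketch in the style of CUTLASS's basic GEMM example. The `run_sgemm` wrapper is a hypothetical name, and `A`, `B`, `C` are assumed to be device-resident matrices allocated and initialized by the caller:

```cpp
#include "cutlass/gemm/device/gemm.h"

// Single-precision GEMM on CUDA cores (SIMT); everything beyond the
// element types and layouts is left to CUTLASS's defaults.
using ColumnMajor = cutlass::layout::ColumnMajor;
using SgemmOp = cutlass::gemm::device::Gemm<float, ColumnMajor,   // A
                                            float, ColumnMajor,   // B
                                            float, ColumnMajor>;  // C and D

// Computes C = alpha * A * B + beta * C for column-major device matrices.
cutlass::Status run_sgemm(int M, int N, int K, float alpha,
                          float const *A, int lda,
                          float const *B, int ldb,
                          float beta, float *C, int ldc) {
  SgemmOp gemm_op;
  SgemmOp::Arguments args({M, N, K},      // GEMM problem size
                          {A, lda},       // TensorRef to A
                          {B, ldb},       // TensorRef to B
                          {C, ldc},       // source matrix C
                          {C, ldc},       // destination D (updated in place)
                          {alpha, beta}); // epilogue scalars
  return gemm_op(args);  // launches the kernel on the default stream
}
```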
Note: this commit and all that follow require a host compiler supporting C++11 or greater.