* Passing warp-level mixed input F16*(S8/U8) tests
* passing device-level mixed input F16*(S8/U8) tests
* add to profiler - I8 (111 TFLOPs), U (123 TFLOPs)
* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)
* Speedup reference compilation (REVERT THIS COMMIT)
* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)
* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs
* BF16 * S8 (142 TFLOPs)
* Handle mixed-input upcast on OperandA (Support [S8|U8]*[F16|BF16]
* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast
* Add device-level test and profiler support for upcast on operand A
* Move shfl before the cvt and reduce #shfls by 1/2
* fix smem_usage calculation for mixed_input types
* uncomment the stuff (getting ready for merge)
* profiler changes and mixed-input reference
* mixed input reference are in a new file
* use platform instead of std
* comments and typo only
* Use CreateGemmOperator and delete CreateMixedInputGemmOperator
* copyright for new files
* rebase follow-up
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes
CUTLASS 2.0
Substantially refactored for
- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code
Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler
Native Turing Tensor Cores
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs
Note: this commit and all that follow require a host compiler supporting C++11 or greater.