* Add epilogue functor for residual block fusion
* Do not run split-k tests when ActivationOp is not Identity
* explain TestSplitK param
* return early
* Support half precision sigmoid activation
* introduce a vectorized variant using fast_tanh
* move the math to fast_math.h
* fixed compile
* .raw() -> .to_half()
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* refactored sigmoid using the new interface
* refactored gelu
* add silu activation
* add hardswish
* remove sigmoid for now
* add description to silu and hardswish, and other doc update
* Do not ignore Round
* use constant N
* Set isHeavy = true in sigmoid and silu epilogue
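
The activation entries above (sigmoid expressed through tanh, silu, hardswish) can be illustrated with a small host-side sketch. This is not the CUTLASS implementation: the function names `tanh_ref`, `sigmoid_via_tanh`, `silu`, and `hardswish` are hypothetical, `std::tanh` stands in for a fast tanh approximation, and plain `float` is used instead of `cutlass::half_t` so the code builds without CUDA.

```cpp
#include <cmath>

// Stand-in for a fast tanh approximation. Assumption: std::tanh is used here
// only so the sketch runs on the host; it is not the library's fast_tanh.
inline float tanh_ref(float x) { return std::tanh(x); }

// sigmoid(x) = 1 / (1 + exp(-x)) = 0.5 * tanh(0.5 * x) + 0.5
inline float sigmoid_via_tanh(float x) {
  return 0.5f * tanh_ref(0.5f * x) + 0.5f;
}

// silu(x) = x * sigmoid(x)
inline float silu(float x) { return x * sigmoid_via_tanh(x); }

// hardswish(x) = x * relu6(x + 3) / 6, where relu6 clamps to [0, 6]
inline float hardswish(float x) {
  float r = std::fmin(std::fmax(x + 3.0f, 0.0f), 6.0f);
  return x * r / 6.0f;
}
```

Rewriting sigmoid in terms of tanh means a single fast tanh routine can serve several activations, which is presumably the motivation for the shared interface mentioned in the entries above.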
CUTLASS 2.7
- Mainloop fusion for GEMM: summation over A or B
- Strided DGRAD (optimized iterators)
- Half-precision GELU_taylor activation functions
  - Use these when accumulation and epilogue compute types are all cutlass::half_t
- Tuning and bug fixes to fused GEMM + GEMM example
- Support for Convolutions with alignment smaller than 128b: see examples
- Caching of results to accelerate Convolution unit tests
  - Can be disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF (set ON to enable)
- Corrections and bug fixes reported by the CUTLASS community
  - Thank you for filing these issues!
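
As a rough illustration of the GELU_taylor item in the list above, the following host-side sketch assumes that the name refers to the widely used tanh-based GELU approximation. The function names `gelu_tanh_approx` and `gelu_exact` are hypothetical, and `float` with `std::tanh` is used in place of `cutlass::half_t` and a fast tanh so the example runs without CUDA.

```cpp
#include <cmath>

// tanh-based GELU approximation:
//   gelu(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
// Assumption: this is the common formulation, shown here for illustration only.
inline float gelu_tanh_approx(float x) {
  const float k0 = 0.7978845608f;  // sqrt(2 / pi)
  const float k1 = 0.044715f;      // cubic coefficient of the approximation
  return 0.5f * x * (1.0f + std::tanh(k0 * (x + k1 * x * x * x)));
}

// Reference GELU using the error function, for comparison.
inline float gelu_exact(float x) {
  return 0.5f * x * (1.0f + std::erf(x * 0.7071067812f));  // x / sqrt(2)
}
```

The approximation replaces `erf` with a single tanh evaluation, which pairs naturally with a fast tanh routine when all compute types are half precision.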
authored-by: Haicheng Wu <haichengw@nvidia.com>, Manish Gupta <manigupta@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, small matrix classes, bug fixes, and performance enhancements.
- Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>.
- Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out
- Added test_examples target to build and test all CUTLASS examples
- Minor edits to documentation to point to GTC 2020 webinar
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes