This commit adds two `#include` directives so that the definitions of `cutlass::gemm::warp::WarpSize` (from "cutlass/gemm/warp/mma.h") and `cutlass::arch::OpClassSimt` (from "cutlass/arch/mma.h") are visible in "cutlass/epilogue/threadblock/default_epilogue_simt.h". Without them, building the header standalone produces compiler errors:
```
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:32: error: no member named 'warp' in namespace 'cutlass::gemm'; did you mean simply 'warp'?
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
^
./cutlass/include/cutlass/epilogue/warp/tile_iterator_simt.h:49:11: note: 'warp' declared here
namespace warp {
^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:53: error: no member named 'WarpSize' in namespace 'cutlass::epilogue::warp'
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:68: error: no member named 'OpClassSimt' in namespace 'cutlass::arch'
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:82: error: no member named 'value' in the global namespace
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:367:5: error: use of class template 'OutputTileThreadMap' requires template arguments
OutputTileThreadMap,
^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:391:5: error: use of class template 'OutputTileThreadMap' requires template arguments
OutputTileThreadMap,
^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:405:5: error: unknown type name 'OutputTileIterator'; did you mean 'WarpTileIterator'?
OutputTileIterator,
^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:380:9: note: 'WarpTileIterator' declared here
using WarpTileIterator = cutlass::epilogue::warp::TileIteratorSimtDirect2dConv<
^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:408:5: error: use of class template 'SharedLoadIterator' requires template arguments
SharedLoadIterator,
^
./cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator.h:67:7: note: template is declared here
class SharedLoadIterator {
^
```
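The fix itself is two lines; as described above, the added directives in "cutlass/epilogue/threadblock/default_epilogue_simt.h" are:
```
#include "cutlass/arch/mma.h"       // defines cutlass::arch::OpClassSimt
#include "cutlass/gemm/warp/mma.h"  // defines cutlass::gemm::warp::WarpSize
```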
* Relax stream K gemm alignment constraints
The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.
* Revert "Relax stream K gemm alignment constraints"
This reverts commit 31e80a250e2b0ac4bda2e4b437b39dc5bcd5e845.
* Relax stream K gemm alignment constraints
The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* add two missing files
* fix a bunch of bugs in the gemm-reducek fusion and add a device interface
* small changes
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* ex42: Fused MHA imported from xFormers
* Remove std:: references
* Support K>128 in the example
* Support causal option
* Support a different head size for V and a different sequence length for KV
* Update FLOPS counter (a cost-model sketch follows this commit group)
* Remove bit_cast
* fix build: Replace M_LOG2E
* Add doc
* Revert "Remove bit_cast"
This reverts commit 9662fa86bb7c57c1a015ac0bf52cb52940fbbf80.
* Explicit casts to int32_t for windows build
Co-authored-by: danthe3rd <danthe3rd>
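For reference, a sketch of the FLOPs model such an example typically uses (the function and parameter names here are illustrative, not necessarily the exact counter in ex42):
```
#include <cstdint>

// Attention cost is dominated by two GEMMs, Q*K^T and P*V. The model covers
// the options added above: a distinct KV sequence length, a distinct head
// size for V, and a causal mask (which roughly halves the work).
int64_t attention_flops(int64_t batch, int64_t heads,
                        int64_t seq_q, int64_t seq_kv,
                        int64_t head_dim_qk, int64_t head_dim_v,
                        bool causal) {
  int64_t qk = 2 * batch * heads * seq_q * seq_kv * head_dim_qk;  // Q * K^T
  int64_t pv = 2 * batch * heads * seq_q * seq_kv * head_dim_v;   // P * V
  int64_t total = qk + pv;
  return causal ? total / 2 : total;
}
```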
- adds missing commas
- adjusts misaligned usage of CUTLASS_DEVICE between
template declaration and specializations
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Fixed template struct/class mismatch
* Use platform implementation instead of std::abs and std::conditional during nvrtc compilation
* Revert absolute_value() usage
* Fix separate compilation with `-dc`
- when CUTLASS is included in multiple compilation units compiled with
`-dc`, the OOB_NAN_F16x8 device constant is instantiated multiple times,
causing a "Multiple definition of '_ZN7cutlass4arch13OOB_NAN_F16x8E'"
error. This PR makes the variable a local constant, since it is never
modified at runtime; the pattern is sketched after this commit message.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
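A minimal CUDA sketch of the pattern behind that fix (the constant's value and the function below are illustrative, not CUTLASS's exact code):
```
#include <cstdint>

// Before (problematic): a namespace-scope __device__ definition in a header
// is emitted by every translation unit compiled with -dc, so the device
// linker sees multiple definitions of the same symbol:
//
//   __device__ uint32_t OOB_NAN_F16x8[4] = {0x7eff7eff, 0x7eff7eff,
//                                           0x7eff7eff, 0x7eff7eff};
//
// After: the constant is local to the function that reads it, so no external
// device symbol is emitted (0x7eff is an f16 NaN bit pattern; illustrative).
__device__ void fill_oob_nan(uint32_t (&dst)[4]) {
  uint32_t const OOB_NAN_F16x8[4] = {0x7eff7eff, 0x7eff7eff,
                                     0x7eff7eff, 0x7eff7eff};
  for (int i = 0; i < 4; ++i) {
    dst[i] = OOB_NAN_F16x8[i];
  }
}
```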
* Fix
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Test GH
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Revert test GH
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels
* Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy constructor for device code (see the sketch after this commit group)
* std => platform
* fix affine2
* really fix affine2
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
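A minimal sketch of the SFINAE trick (simplified, with `std::` traits standing in for CUTLASS's `platform::` equivalents): when `Element` is already non-const, `NonConstTensorRef` is the same type as `TensorRef` itself, so an unconstrained `TensorRef(NonConstTensorRef const&)` would be a user-declared copy constructor, making the type non-trivially copyable and blocking device-side launch. The constraint removes the overload in exactly that case:
```
#include <type_traits>

template <typename Element>
struct TensorRef {
  using NonConstTensorRef = TensorRef<typename std::remove_const<Element>::type>;

  Element* ptr = nullptr;

  TensorRef() = default;

  // Converting constructor from the non-const TensorRef. The enable_if
  // disables it when Element is already non-const, so the implicitly
  // declared trivial copy constructor is used for copies instead.
  template <typename T = Element,
            typename std::enable_if<std::is_const<T>::value, int>::type = 0>
  TensorRef(NonConstTensorRef const& ref) : ptr(ref.ptr) {}
};

// The non-const TensorRef stays trivially copyable, which device-side
// kernel launch requires of parameter classes.
static_assert(std::is_trivially_copyable<TensorRef<float> >::value, "");
```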
* Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray
* Add a reference to GemmArray to the docs
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
* Add epilogue functor for residual block fusion
* Do not run split-k tests when ActivationOp is not Identity
* explain TestSplitK param
* return early
* Support half precision sigmoid activation
* introduce a vectorized variant using fast_tanh (the identity is sketched after this commit group)
* move the math to fast_math.h
* fixed compile
* .raw() -> .to_half()
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
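The vectorized variant rests on a standard identity; a scalar sketch (the name below is illustrative, and on device the example's fast_tanh would stand in for std::tanh):
```
#include <cmath>

// sigmoid(x) = 1 / (1 + exp(-x)) = 0.5 * (1 + tanh(x / 2)),
// so a fast, vectorizable tanh doubles as a fast sigmoid.
inline float sigmoid_via_tanh(float x) {
  return 0.5f * (1.0f + std::tanh(0.5f * x));
}
```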
* Support half precision sigmoid activation
* introduce a vectorized variant using fast_tanh
* refactored sigmoid using the new interface
* refactored gelu
* add silu activation
* add hardswish
* remove sigmoid for now
* add descriptions for silu and hardswish, and other doc updates (reference definitions are sketched below)
* Do not ignore Round
* use constant N
* Set isHeavy = true in the sigmoid and silu epilogues
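For reference, plain-C++ definitions of the two activations added above (standard formulas; these are not the CUTLASS epilogue functors themselves):
```
#include <algorithm>
#include <cmath>

// SiLU (a.k.a. swish): x * sigmoid(x)
inline float silu(float x) {
  return x / (1.0f + std::exp(-x));
}

// HardSwish: x * clamp(x + 3, 0, 6) / 6
inline float hardswish(float x) {
  return x * std::min(std::max(x + 3.0f, 0.0f), 6.0f) / 6.0f;
}
```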
CUTLASS 2.7
- Mainloop fusion for GEMM: summation over A or B
- Strided DGRAD (optimized iterators)
- Half-precision GELU_taylor activation functions
  - Use these when accumulation and epilogue compute types are all cutlass::half_t
- Tuning and bug fixes to fused GEMM + GEMM example
- Support for smaller-than-128b aligned Convolutions: see examples
- Caching of results to accelerate Convolution unit tests
  - Can be enabled or disabled by running `cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF`
- Corrections and bug fixes reported by the CUTLASS community
  - Thank you for filing these issues!
Co-authored-by: Haicheng Wu haichengw@nvidia.com, Manish Gupta manigupta@nvidia.com, Dustyn Blasig dblasig@nvidia.com, Andrew Kerr akerr@nvidia.com
CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, and small matrix classes, along with bug fixes and performance enhancements.
- Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>.
- Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out
- Added test_examples target to build and test all CUTLASS examples
- Minor edits to documentation to point to GTC 2020 webinar
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes
CUTLASS 2.0
Substantially refactored for
- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code
Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler
Native Turing Tensor Cores
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs
Note: this commit and all that follow require a host compiler supporting C++11 or greater.