* Allow per-column bias in EpilogueTensorBroadcast
EpilogueTensorBroadcast only supports per-row vector broadcast, because
the bias stride is hardcoded.
It can easily support both if the bias stride is made conditional, and
the original behavior is maintained by defaulting to per-row.
* Add unit test for EpilogueTensorBroadcast with per-col bias
---------
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Ali Hassani <ali@hippoml.com>
* Remove unused variables
* Qualify calls to make_fragment_? from templated base class.
Fixes clang build error.
* Add missing `#include <cstdio>`
* Various changes to fix clang compile errors.
* More changes to fix clang build.
Remaining issues:
- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.
* Fix remaining clang build issues.
* Qualify `cute::rank()` calls.
* Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace.
* Double-escape special registers in inline asm.
* small change
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Release 3.3.0
Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update
* Correct typos in comments
Correct comments in code on type of generated distribution. Improve Gaussian RNG to take advantage of Box Muller method
* Inline Box Muller
Added inline function for the Box Muller algorithm and updated code comments to be more concise
* Update tensor_fill.h
* Update tensor_fill.h
* small changes to pass tests
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Remove redundant <fstream> includes
* Fix fstream in examples/
* Fix <fstream> in test/
* Use consistent order for <fstream> (always after <iostream>)
* Remove an unneeded include in a file where std::ofstream usage is commented out
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
`CUDA_PERROR_EXIT ` can lead to incorrect usage (see e.g. [this description](https://www.cs.technion.ac.il/users/yechiel/c++-faq/macros-with-if.html)) because it contains an incomplete `if` expression. Consider:
```
if (condition)
CUDA_PERROR_EXIT(cudaFree(x))
else
free(x);
```
The author of the code forgot to add a semicolon after the macro. In that case, the `else` will bind to the `if` inside the macro definition, leading to code that the author did not intend or expect. It the author does use a semicolon, the code will not compile, which is awkward.
The change adds a `do while` around the `if`, which always requires a semicolon.
This PR also adds the text of the failing expression to the printed error message.
CUTLASS 2.7
Mainloop fusion for GEMM: summation over A or B
Strided DGRAD (optimized iterators)
Half-precision GELU_taylor activation functions
Use these when accumulation and epilogue compute types are all cutlass::half_t
Tuning and bug fixes to fused GEMM + GEMM example
Support for smaller than 128b aligned Convolutions: see examples
Caching of results to accelerate Convolution unit tests
Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
Corrections and bug fixes reported by the CUTLASS community
Thank you for filing these issues!
authored-by: Haicheng Wu haichengw@nvidia.com, Manish Gupta manigupta@nvidia.com, Dustyn Blasig dblasig@nvidia.com, Andrew Kerr akerr@nvidia.com
CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, and small matrix classes, bug fixes, and performance enhancements.
- Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>.
- Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out
- Added test_examples target to build and test all CUTLASS examples
- Minor edits to documentation to point to GTC 2020 webinar
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes
CUTLASS 2.0
Substantially refactored for
- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code
Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler
Native Turing Tensor Cores
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs
Note: this commit and all that follow require a host compiler supporting C++11 or greater.
Requires a recent clang build (r359248 or newer).
Enable compilation with clang with these options:
cmake -DCUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=/path/to/clang++