Commit Graph

49 Commits

Author SHA1 Message Date
ANIKET SHIVAM
751eb9a885
Update license year (#1306) 2024-01-16 14:37:22 -05:00
ANIKET SHIVAM
2f589ffa76
Updates for 3.4 release. (#1305) 2024-01-16 13:42:51 -05:00
Ali Hassani
d4be5ab5d7
Allow per-column bias in EpilogueTensorBroadcast (#1275)
* Allow per-column bias in EpilogueTensorBroadcast

EpilogueTensorBroadcast only supports per-row vector broadcast, because
the bias stride is hardcoded.

It can easily support both if the bias stride is made conditional, and
the original behavior is maintained by defaulting to per-row.

* Add unit test for EpilogueTensorBroadcast with per-col bias

---------

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Ali Hassani <ali@hippoml.com>
2024-01-04 12:48:31 -05:00
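For readers unfamiliar with the change, the sketch below illustrates the idea of making the bias stride conditional instead of hardcoded, defaulting to per-row so existing behavior is preserved. All names here are hypothetical placeholders, not the actual EpilogueTensorBroadcast code.

```
#include <cstddef>

// Hypothetical sketch: select the broadcast direction at compile time.
enum class BiasBroadcast { kPerRow, kPerColumn };

template <typename Element, BiasBroadcast kMode = BiasBroadcast::kPerRow>
struct BiasIterator {
  Element const* ptr;

  // Per-row broadcast: one bias value per row, reused across all columns.
  // Per-column broadcast: one bias value per column, reused across all rows.
  Element at(std::size_t row, std::size_t col) const {
    std::size_t idx = (kMode == BiasBroadcast::kPerRow) ? row : col;
    return ptr[idx];
  }
};
```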
Pradeep Ramani
8236f30675
CUTLASS 3.4.0 (#1286)
* CUTLASS 3.4.0

* Update CHANGELOG.md

---------

Co-authored-by: Pradeep Ramani <prramani@nvidia.com>
2023-12-29 15:21:31 -05:00
Andrey Portnoy
30ec1a4649
Use size_t index to iterate up to std::vector::size() (#1251)
Fixes a compiler warning about comparing integers of different signedness.
2023-12-09 08:44:31 -05:00
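The warning in question is the standard signed/unsigned comparison diagnostic (`-Wsign-compare`). A generic sketch of the pattern change, not the actual diff; the function name is a placeholder:

```
#include <cstddef>
#include <vector>

float sum(std::vector<float> const& v) {
  float total = 0.0f;
  // Before: `for (int i = 0; i < v.size(); ++i)` compares a signed int with the
  // unsigned value returned by std::vector::size(), triggering -Wsign-compare.
  // Using std::size_t keeps both operands of the comparison unsigned.
  for (std::size_t i = 0; i < v.size(); ++i) {
    total += v[i];
  }
  return total;
}
```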
Christian Sigg
e1483d5fa0
Collection of changes to fix clang build. (#1200)
* Remove unused variables

* Qualify calls to make_fragment_? from templated base class.

Fixes clang build error.

* Add missing `#include <cstdio>`

* Various changes to fix clang compile errors.

* More changes to fix clang build.

Remaining issues:

- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.

* Fix remaining clang build issues.

* Qualify `cute::rank()` calls.

* Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace.

* Double-escape special registers in inline asm.

* small change

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-12-08 14:42:12 -05:00
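One recurring cause of such clang-only failures is C++ two-phase name lookup: a member inherited from a dependent base class is not found by unqualified lookup at template-definition time, so calls must go through `this->` or an explicit base qualifier. A minimal sketch with hypothetical names, not the CUTLASS sources:

```
template <typename T>
struct Base {
  T make_fragment_A() const { return T{}; }
};

template <typename T>
struct Derived : Base<T> {
  T get() const {
    // return make_fragment_A();      // rejected by clang: unqualified lookup does
                                      // not search the dependent base class
    return this->make_fragment_A();   // OK: lookup deferred to instantiation
  }
};
```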
Pradeep Ramani
c008b4aea8
CUTLASS 3.3.0 (#1167)
* Release 3.3.0

- Adds support for mixed-precision GEMMs on Hopper and Ampere
- Adds support for < 16B aligned GEMMs on Hopper
- Enhancements to EVT
- Enhancements to the Python interface
- Enhancements to sub-byte type handling in CuTe
- Several other bug fixes and performance improvements

* minor doc update
2023-11-02 11:09:05 -04:00
masahi
ff61a49dd1
Allow changing epsilon parameter in RMS norm kernel (#1112) 2023-10-02 20:40:28 -04:00
Lequn Chen
26986bbc60
Fix type typo in rmsnorm (#1119)
The variable `h4` is declared as `half4`, but its last two fields are never used. Based on the semantics and the context, I believe it should be `half2`.
2023-10-02 20:40:04 -04:00
ANIKET SHIVAM
90d3b0fb18
CUTLASS 3.2.1 (#1113)
* Updates for 3.2.1 release.

* Minor fix in gemm op profiler for raster order.

* Add scheduler mapping for raster order in the kernels.
2023-09-26 17:24:26 -04:00
ANIKET SHIVAM
4575443d44
CUTLASS 3.2 (#1024)
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
masahi
f679663224
Add RMS norm (#979) 2023-07-10 21:31:27 -04:00
ANIKET SHIVAM
d572cc1aab
CUTLASS 3.1 (#915)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-04-14 23:19:34 -04:00
Vijay Thakkar
15d9d31f1f
CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897) 2023-03-29 10:42:40 -04:00
Alexander Pivovarov
7e370c9637
Fix typos 2 (#842)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-03-09 23:22:56 -05:00
Vijay Thakkar
277bd6e537
CUTLASS 3.0.0 (#786)
* CUTLASS 3.0.0
2023-01-23 20:55:28 -05:00
ANIKET SHIVAM
66d9cddc83
New updates for 2.11 (#775)
* New updates.

* Minor profiler updates

Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-01-20 16:32:57 -05:00
Haicheng Wu
9f1f37aa21
misc (#719)
* misc

* minor

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-05 12:07:20 -05:00
Aditya Atluri
c975e2ccbb
release 2.11 (#703) 2022-11-19 09:02:15 -05:00
Alexander Freudenberg
cb539dab78
Correct typos in comments (#639)
* Correct typos in comments

Correct code comments on the type of generated distribution. Improve the Gaussian RNG to take advantage of the Box-Muller method.

* Inline Box Muller

Added an inline function for the Box-Muller algorithm and updated the code comments to be more concise.

* Update tensor_fill.h

* Update tensor_fill.h

* small changes to pass tests

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-09-30 22:51:30 -04:00
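For reference, a standalone sketch of the Box-Muller transform mentioned above, assuming the standard formulation; it is not the tensor_fill.h implementation:

```
#include <cmath>
#include <random>
#include <utility>

// Box-Muller transform: two independent uniform samples in (0, 1] map to two
// independent standard-normal samples.
inline std::pair<double, double> box_muller(double u1, double u2) {
  double const two_pi = 2.0 * 3.14159265358979323846;
  double r = std::sqrt(-2.0 * std::log(u1));   // u1 must lie in (0, 1]
  double theta = two_pi * u2;
  return {r * std::cos(theta), r * std::sin(theta)};
}

int main() {
  std::mt19937 rng(2022);
  std::uniform_real_distribution<double> uniform(1e-12, 1.0);
  auto [z0, z1] = box_muller(uniform(rng), uniform(rng));
  return (std::isfinite(z0) && std::isfinite(z1)) ? 0 : 1;
}
```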
Haicheng Wu
97bff52e8c
add two missing files (#636)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-09-21 15:42:42 -04:00
ANIKET SHIVAM
e773429f7e
CUTLASS 2.10 updates (#622)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-09-12 21:26:30 -04:00
ANIKET SHIVAM
b72cbf957d
CUTLASS 2.10 (#615)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-09-03 18:48:46 -04:00
Ivan Komarov
0b8cacd6f1
Remove redundant <fstream> includes (#563)
* Remove redundant <fstream> includes

* Fix fstream in examples/

* Fix <fstream> in test/

* Use consistent order for <fstream> (always after <iostream>)

* Remove an unneeded include in a file where std::ofstream usage is commented out

Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
2022-07-19 15:23:54 -04:00
Haicheng Wu
ec2b4fd85d
b2b bias vector support (#482)
* b2b bias vector support

* add files

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-04-30 04:16:15 -07:00
Andrei Alexandrescu
d7b499deff
Fix CUDA_PERROR_EXIT and print failing expression (#446)
`CUDA_PERROR_EXIT` can lead to incorrect usage (see e.g. [this description](https://www.cs.technion.ac.il/users/yechiel/c++-faq/macros-with-if.html)) because it contains an incomplete `if` statement. Consider:

```
if (condition)
    CUDA_PERROR_EXIT(cudaFree(x))
else
    free(x);
```

The author of the code forgot to add a semicolon after the macro. In that case, the `else` will bind to the `if` inside the macro definition, leading to code that the author did not intend or expect. If the author does use a semicolon, the code will not compile, which is awkward.

The change adds a `do while` around the `if`, which always requires a semicolon.

This PR also adds the text of the failing expression to the printed error message.
2022-04-24 16:29:43 -04:00
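The `do { ... } while (0)` idiom the change describes looks roughly like the sketch below; the macro name and message format here are illustrative, not the actual CUTLASS definition.

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative sketch of the idiom, not the macro as defined in the repository.
#define CUDA_PERROR_EXIT_SKETCH(expr)                                   \
  do {                                                                  \
    cudaError_t error_ = (expr);                                        \
    if (error_ != cudaSuccess) {                                        \
      std::fprintf(stderr, "'%s' failed: %s\n", #expr,                  \
                   cudaGetErrorString(error_));                         \
      std::exit(EXIT_FAILURE);                                          \
    }                                                                   \
  } while (0)

// Because the expansion is a single statement, the call site must end with a
// semicolon, and `if (cond) CUDA_PERROR_EXIT_SKETCH(...); else free(x);`
// binds the else to the outer if as intended.
```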
Fujun Han
4c0d6e1eb4
[BUGFIX]: Force-unrolling a loop whose trip count is not a compile-time constant is dangerous (#441)

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2022-04-24 16:28:32 -04:00
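A generic CUDA illustration of the hazard described above, not the patched CUTLASS code: full unrolling is only well-defined when the trip count is a compile-time constant, while a runtime-bounded loop should use at most a bounded partial unroll.

```
// Simplified single-thread kernels, for illustration only.
__global__ void scale(float* x, float alpha, int n) {
  // #pragma unroll            // risky: n is a runtime value, so the compiler
  //                           // cannot know how many copies of the body to emit
  #pragma unroll 4             // bounded partial unroll is safe for a runtime n
  for (int i = 0; i < n; ++i) {
    x[i] *= alpha;
  }
}

template <int N>
__global__ void scale_fixed(float* x, float alpha) {
  #pragma unroll               // fine: N is a compile-time constant
  for (int i = 0; i < N; ++i) {
    x[i] *= alpha;
  }
}
```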
Andrew Kerr
12f4108ac2
CUTLASS 2.9 (#468) 2022-04-23 15:02:38 -04:00
Manish Gupta
808c25337a
CUTLASS 2.8 (#363)
CUTLASS 2.8
2021-11-19 13:26:35 -08:00
Manish Gupta
2e07c4cc2f
CUTLASS 2.7 (#318)
CUTLASS 2.7

- Mainloop fusion for GEMM: summation over A or B
- Strided DGRAD (optimized iterators)
- Half-precision GELU_taylor activation functions
  - Use these when accumulation and epilogue compute types are all cutlass::half_t
- Tuning and bug fixes to fused GEMM + GEMM example
- Support for smaller than 128b aligned Convolutions: see examples
- Caching of results to accelerate Convolution unit tests
  - Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
- Corrections and bug fixes reported by the CUTLASS community
  - Thank you for filing these issues!

authored-by: Haicheng Wu haichengw@nvidia.com, Manish Gupta manigupta@nvidia.com, Dustyn Blasig dblasig@nvidia.com, Andrew Kerr akerr@nvidia.com
2021-09-20 11:02:22 -07:00
Manish Gupta
6c2f8f2fb8
CUTLASS 2.6.1 - functional and performance enhancements to strided DGRAD, fixes, and tuning
* cutlass 2.6 update

* remove debug prints

* cutlass 2.6.1 (minor update)

* Updated CHANGELOG.

* Minor edit to readme to indicate patch version.

* Minor edit to readme.

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
2021-09-03 10:26:15 -07:00
Manish Gupta
1ac4559d12
Cutlass 2.6 Update 1 (#301)
* cutlass 2.6 update

* remove debug prints
2021-07-27 17:58:30 -07:00
Manish Gupta
e5d51840e8
CUTLASS 2.6 (#298)
CUTLASS 2.6
2021-07-23 00:40:53 -04:00
Bernardo Covas
1d8372a8e2 fix typo in reference conv3d 2021-05-28 21:06:59 +01:00
Zheng Zeng
a8f6f8eb07
add a missing 'device_memory::' before a function 2021-04-25 20:05:39 +08:00
Andrew Kerr
0e13748649 CUTLASS 2.5 2021-02-26 09:58:26 -05:00
Manish Gupta
6615010cd0
CUTLASS 2.4 (Implicit GEMM convolution) (#147)
CUTLASS 2.4 (Implicit GEMM Convolution)

Co-authored-by: Manish Gupta <manigupta@nvidia.com>, Haicheng Wu <haichengw@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
2020-11-19 21:25:25 -08:00
Andrew Kerr
c53f3339bb
CUTLASS 2.3 initial commit (#134)
CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, small matrix classes, bug fixes, and performance enhancements.
2020-09-23 14:00:58 -07:00
Andrew Kerr
1ab1027954
Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. (#100)
- Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>.
- Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out
- Added test_examples target to build and test all CUTLASS examples
- Minor edits to documentation to point to GTC 2020 webinar
2020-06-15 10:47:01 -07:00
Andrew Kerr
86931fef85
CUTLASS 2.2 (#96)
Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.
2020-06-08 16:17:35 -07:00
Andrew Kerr
96dab34ad9
CUTLASS 2.1 (#83)
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes
2020-04-07 13:51:25 -07:00
Andrew Kerr
fb335f6a5f
CUTLASS 2.0 (#62)
CUTLASS 2.0

Substantially refactored for

- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code

Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler

Native Turing Tensor Cores
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands

Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs

Note: this commit and all that follow require a host compiler supporting C++11 or greater.
2019-11-19 16:55:34 -08:00
Artem Belevich
fb8b3a98b7 Addressed code review comments. 2019-05-10 10:24:52 -07:00
Artem Belevich
e18292db46 Make CUTLASS compilable with Clang.
Requires a recent clang build (r359248 or newer).

Enable compilation with clang with these options:
cmake -DCUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=/path/to/clang++
2019-05-02 11:00:22 -07:00
Andrew Kerr
877bdcace6
Cutlass 1.3 Release (#42)
CUTLASS 1.3 Release
- Efficient GEMM kernel targeting Volta Tensor Cores via mma.sync instruction added in CUDA 10.1.
2019-03-20 10:49:17 -07:00
akerr
74df0331f2 CUTLASS 1.2 2018-10-26 14:38:46 -07:00
akerr
461f417b9d Checkpointing CUTLASS 1.1 release. 2018-09-18 16:58:03 -07:00
akerr
374882be53 Replaced GoogleTest copy with submodule. Added updates to support intra-threadblock reductions. Added tests for same. 2018-06-11 11:47:15 -07:00
akerr
2028ebe120 CUTLASS v1.0 release 2018-05-16 11:44:56 -07:00