532 Commits

Author SHA1 Message Date
xws117
6e3df975a2
Modify comments in code examples/08_turing_tensorop_gemm/turing_tensorop_gemm.cu (#1325) 2024-01-31 21:41:30 -05:00
reed
8825fbf1ef
fix unrecognized print format specifier for int8/uint8 (#1303)
* fix unrecognized print format specifier for int8/uint8

* use c++ static_cast instead of c cast style
2024-01-29 21:22:40 -05:00
reed
092f14db05
fix tile_size_mnk compilation warning (#1294) 2024-01-29 21:21:15 -05:00
Haicheng Wu
9385141f19
Update PUBLICATIONS.md
ptq paper from goog
2024-01-19 14:17:55 -05:00
Haicheng Wu
b4b5b11070
Update PUBLICATIONS.md
add odyssey llm paper from metuan
2024-01-18 10:30:21 -05:00
jayhshah
139b93db61
update publications (#1308) 2024-01-17 14:06:46 -05:00
Aleksandar Samardžić
ca37d632c9
Remove sparse GEMM with row broadcasted bias vector (#1302)
This reverts commit d3e72719b4.

Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>
2024-01-17 14:06:27 -05:00
Chengquan Jiang
362abbf274
Support ElementD to be void for tma (#1153)
* Support void D with AuxStore

* refine get_element_aux
2024-01-16 18:15:42 -05:00
ANIKET SHIVAM
751eb9a885
Update license year (#1306) 2024-01-16 14:37:22 -05:00
ANIKET SHIVAM
2f589ffa76
Updates for 3.4 release. (#1305) 2024-01-16 13:42:51 -05:00
Tianao Ge
acba5beee5
Fix flops calculation and tensor b stride calculation in the example 36 (#1278)
* Fix flops calculation and tensor b stride calculation in the example 36

* Fix datatype

* Update gather_scatter_fusion.cu
2024-01-08 17:27:30 -05:00
Eugene Zhulenev
74d1f3e63a
Fix cute::array<T, 0> iterator (#1273) 2024-01-08 17:10:09 -05:00
Kun Wu
8ac2edc810
expose stream API in python kernel call interfaces (#1287)
* expose stream API in python kernel call interfaces

* add stream to ReductionArguments; document stream arg

* add stream argument to GemmGroupedArguments
2024-01-05 08:27:45 -05:00
Ali Hassani
d4be5ab5d7
Allow per-column bias in EpilogueTensorBroadcast (#1275)
* Allow per-column bias in EpilogueTensorBroadcast

EpilogueTensorBroadcast only supports per-row vector broadcast, because
the bias stride is hardcoded.

It can easily support both if the bias stride is made conditional, and
the original behavior is maintained by defaulting to per-row.

* Add unit test for EpilogueTensorBroadcast with per-col bias

---------

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Ali Hassani <ali@hippoml.com>
2024-01-04 12:48:31 -05:00
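
The conditional bias stride described in the commit above can be illustrated with a minimal sketch. This is not the CUTLASS implementation: `add_bias`, the `PerRowBias` flag, and the two stride names are hypothetical, chosen only to show how a single compile-time switch selects per-row or per-column broadcast while defaulting to per-row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not CUTLASS code: adds a bias vector to a row-major
// M x N matrix. With PerRowBias = true (the default) bias[m] is added to every
// element of row m; with false, bias[n] is added to every element of column n.
template <bool PerRowBias = true>
std::vector<float> add_bias(const std::vector<float>& acc,
                            const std::vector<float>& bias,
                            std::size_t M, std::size_t N) {
  assert(acc.size() == M * N);
  assert(bias.size() == (PerRowBias ? M : N));

  // The bias is indexed as m * stride_m + n * stride_n. Exactly one stride is
  // non-zero, so the other coordinate is broadcast.
  constexpr std::size_t stride_m = PerRowBias ? 1 : 0;
  constexpr std::size_t stride_n = PerRowBias ? 0 : 1;

  std::vector<float> out(acc.size());
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      out[m * N + n] = acc[m * N + n] + bias[m * stride_m + n * stride_n];
    }
  }
  return out;
}
```

Calling `add_bias<>(acc, bias, M, N)` keeps the per-row behavior, and `add_bias<false>(acc, bias, M, N)` opts in to per-column bias.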
Jee Li
c9591a694d
fix typo (#1279) 2024-01-04 12:41:39 -05:00
Aleksandar Samardžić
5c756eb774
Add support for sparse GEMM with visitor epilogue (#1189)
* Add support for sparse GEMM with visitor epilogue

* Refactor changes at the kernel level
2024-01-04 12:38:11 -05:00
Pradeep Ramani
8236f30675
CUTLASS 3.4.0 (#1286)
* CUTLASS 3.4.0

* Update CHANGELOG.md

---------

Co-authored-by: Pradeep Ramani <prramani@nvidia.com>
2023-12-29 15:21:31 -05:00
Christian Sigg
b7508e3379
Fix inline ptx escaping for predicates. (#1264)
* Fix inline ptx escaping for predicates.

Prevents `error: invalid % escape in inline assembly string` when compiling with clang.

* More double-quoting.
2023-12-14 11:16:15 -05:00
Gregory Meyer (gregjm)
f60786b536
Remove undefined behavior from default constructor of PredicatedTileAccessIteratorParams. (#1258)
Currently, the default constructor of
`PredicatedTileAccessIteratorParams` will invoke undefined behavior in
its invocation of the `initialize` function. Specifically, it will
attempt to read from the uninitialized variables
`desc.element_size_bits` and `desc.advance_rank`. This commit changes
the default constructors of both `*Params` and `*Desc` to
zero-initialize all uninitialized members.
2023-12-11 23:01:53 -05:00
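
The fix described in the commit above can be sketched roughly as follows. `Desc` and `Params` are simplified, hypothetical stand-ins for the `*Desc` and `*Params` types; only the member names `element_size_bits` and `advance_rank` come from the commit message, and the arithmetic in `initialize` is invented for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the *Desc type. Default member initializers ensure
// a default-constructed Desc never holds indeterminate values.
struct Desc {
  int element_size_bits = 0;
  int advance_rank = 0;
};

// Hypothetical stand-in for the *Params type.
struct Params {
  long long stride = 0;
  long long inc_advance = 0;

  // Safe: Desc{} is fully zero-initialized before initialize() reads it.
  Params() { initialize(Desc{}); }

  explicit Params(const Desc& desc) { initialize(desc); }

  void initialize(const Desc& desc) {
    // Reading these members would be undefined behavior if they were left
    // uninitialized, which is the situation the commit message describes.
    assert(desc.element_size_bits >= 0);
    stride = desc.element_size_bits / 8;
    inc_advance = stride * desc.advance_rank;
  }
};
```

With the initializers in place, `Params p;` yields well-defined zero values instead of depending on whatever happened to be in memory.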
Andrey Portnoy
30ec1a4649
Use size_t index to iterate up to std::vector::size() (#1251)
Fixes a different signedness compare warning.
2023-12-09 08:44:31 -05:00
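
As an illustration of the pattern named in the commit above, the sketch below iterates with a `std::size_t` index so the loop condition compares two unsigned values. `index_of_max` is a hypothetical helper, not code from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the index of the largest element.
inline std::size_t index_of_max(const std::vector<int>& values) {
  assert(!values.empty());
  std::size_t best = 0;
  // size() returns an unsigned type (std::size_t with the default allocator),
  // so a std::size_t index avoids a signed/unsigned comparison warning that an
  // int index would trigger.
  for (std::size_t i = 1; i < values.size(); ++i) {
    if (values[i] > values[best]) {
      best = i;
    }
  }
  return best;
}
```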
Christian Sigg
e1483d5fa0
Collection of changes to fix clang build. (#1200)
* Remove unused variables

* Qualify calls to make_fragment_? from templated base class.

Fixes clang build error.

* Add missing `#include <cstdio>`

* Various changes to fix clang compile errors.

* More changes to fix clang build.

Remaining issues:

- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.

* Fix remaining clang build issues.

* Qualify `cute::rank()` calls.

* Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace.

* Double-escape special registers in inline asm.

* small change

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-12-08 14:42:12 -05:00
Ali Hassani
f4a0216601
Fix bug in single source GEMM with residual + streamk (#1249)
Followup to #1224.

A change in the stream-k threadblock swizzle ctor since 3.3 breaks
single source GEMM with fused epilogue and stream-k. Multi-source was
already corrected.

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
2023-12-07 11:12:02 -05:00
Valeriy Fedyunin
f188f9b709
Fix typo in quickstart.md (#1257) 2023-12-07 09:49:52 -05:00
Haicheng Wu
9c9b51d35c
Update PUBLICATIONS.md 2023-12-07 00:02:36 -05:00
Ali Hassani
a75b4ac483
Fix Stream-K reduce bug in epilogue with broadcast (#1224)
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
2023-12-05 15:35:41 -05:00
Pradeep Ramani
e9e30c2304
Updates and Bug fixes to CUTLASS 3.3 (#1232) 2023-12-05 09:50:49 -05:00
Haicheng Wu
4a1709e17e
Fixed illegal PTX syntax (#1225) 2023-12-01 12:29:48 -05:00
Christian Sigg
bef1fbcbe6
Add missing #include <cstdio> (#1197)
* Add missing `#include <cstdio>`

* move to non nvrtc part

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-12-01 11:58:53 -05:00
Christian Sigg
2375a07d01
Qualify calls to make_fragment_? from templated base class. (#1196)
Fixes clang build error.
2023-12-01 09:52:57 -05:00
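
The kind of qualification mentioned in the commit above can be shown with a small, self-contained sketch. The class and function names here (`FragmentFactory`, `Mainloop`, `make_fragment`, `demo`) are hypothetical, and the use of `this->` is an assumption about one common way to qualify such calls, not a statement of what the commit changed.

```cpp
#include <cassert>

// Hypothetical templated base class providing a helper.
template <typename T>
struct FragmentFactory {
  T make_fragment(T seed) const { return seed + T(1); }
};

// Hypothetical derived class whose base depends on the template parameter T.
template <typename T>
struct Mainloop : FragmentFactory<T> {
  T run(T seed) const {
    // An unqualified make_fragment(seed) is not looked up in the dependent
    // base class FragmentFactory<T>, so a conforming compiler rejects it.
    // Qualifying the call with this-> defers the lookup to instantiation.
    return this->make_fragment(seed);
  }
};

// Small usage helper for the sketch.
inline int demo() {
  Mainloop<int> mainloop;
  int result = mainloop.run(41);
  assert(result == 42);
  return result;
}
```

Writing `FragmentFactory<T>::make_fragment(seed)` or adding a `using FragmentFactory<T>::make_fragment;` declaration in the derived class are alternative ways to make the name visible.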
Christian Sigg
60c8251b72
Remove unused variables (#1195) 2023-12-01 09:52:19 -05:00
cyyever
10b850f9c7
Fix some sign conversion warnings (#1172)
* Fix sign conversion warnings

* Fix type conversion warnings

* Fix sign conversion warnings

* Change smem_size_ to constexpr

* clang warnings

* undo cast change

* one miss change

* missing part

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-11-30 00:28:40 -05:00
Christian Sigg
99c4eebe3b
Explicitly cast blockIdx to uint3 (#1192)
This works around a clang issue where blockIdx is of a different type.
2023-11-30 00:26:23 -05:00
Christian Sigg
a759e85f5f
Add subclass declarations to generated files. (#1193) 2023-11-30 00:25:40 -05:00
Christian Sigg
56fc3df03b
Adding missing typename (#1191)
Fixes clang build failures.
2023-11-29 00:20:20 -05:00
reed
eb01d5449d
fix cp.async L2 prefetch typo (#1187) 2023-11-28 16:58:04 -05:00
Jack Kosaian
8098336d51
Updates to Python interface for PyPI packaging (#1209)
* Updates

* Updates to notebooks
2023-11-28 13:52:12 -05:00
Sergey Klevtsov
b5d8a5d9cc
Allow SM90 pingpong kernel to use custom tile schedulers (#1194)
Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2023-11-15 13:45:17 -05:00
reed
6e60b9b17c
enable L2::128B prefetch for cp.async by default (#1177) 2023-11-13 13:30:13 -05:00
Changho Hwang
1ab6cc7b68
Fix std::abs overloading for bfloat16_t (#1179) 2023-11-13 13:29:45 -05:00
Manish Gupta
5ae8133cfa
Doc only change changelog 3.3 (#1180) 2023-11-13 13:29:22 -05:00
reed
39c6a83f23
fix missing return warning (#1173) 2023-11-03 22:42:59 -04:00
wang-y-z
1d7f2a207e
Fix several broken links (#1168)
Co-authored-by: isaacw <isaacw@nvidia.com>
2023-11-03 00:01:25 -04:00
wang-y-z
557be3ab0e
Fix several typos (#1169)
Co-authored-by: isaacw <isaacw@nvidia.com>
2023-11-02 23:54:46 -04:00
Pradeep Ramani
c008b4aea8
CUTLASS 3.3.0 (#1167)
* Release 3.3.0

Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.

* minor doc update
2023-11-02 11:09:05 -04:00
reed
922fb5108b
clean the format (#1140) 2023-10-24 22:59:06 -04:00
cyyever
7a7796afae
Fix is_zero (#1147)
* Fix is_zero

* Use constexpr

* Add CUTLASS_PRAGMA_UNROLL to loops

* Avoid if branches in is_zero
2023-10-23 12:09:37 -04:00
milesvant
fb10fa5308
Fix broken pipeline link in docs (#1143) 2023-10-18 12:55:46 -04:00
Haicheng Wu
5e1a0a5adb
fix alignmentC for h16816_s8xf16 (#1146)
* fix alignmentC for h16816_s8xf16

* manish's change
2023-10-17 15:15:39 -04:00
Manish Gupta
757275f279
Adding more Threadblock Tiles for Mixed-input TensorOp (BF16 * S8) in cutlass_library (#1132)
* Adding more tiles in the cutlass_library for mixed-input support.

* fix rebase issue

* more tiles to upcast a
2023-10-13 11:33:15 -04:00
reed
fa8dfe631f
fix missing return warning for repeat and axpby (#1124) 2023-10-12 00:05:45 -04:00