Commit Graph

30 Commits

Author SHA1 Message Date
Aleksandar Samardžić
e1976daacc
Add support for mixed 4-bit/8-bit data types GEMM (#1413)
* Add support for mixed 4-bit/8-bit data types GEMM

* fix ( and )

---------

Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-08-29 23:11:06 -04:00
Aleksandar Samardžić
3f084f7f3c
Add couple configs into generator.py for mixed input MM (#1350)
* Add couple configs into generator.py for mixed input MM

* change one unit test name; reenable 128x32 in the profiler

* Added U8/BF16 tests.

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-08-16 00:59:29 -04:00
dePaul Miller
2049c6c5a2
5476 cutlass 3x gemm kernels (#1695)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2024-08-08 13:56:23 -04:00
chenwei
e22ba590cd
support data type w2 used in cutlass_library (#1517) 2024-08-06 11:15:18 -04:00
Ali Hassani
eee0cab26c
Stamp out 1x1x1 clusters, 128x256 CTA shape (#1665)
Adds 128x256 tile shapes to FP16/BF16 and FP8 generators.
Also adds 1x1x1 clusters to all existing FP16/BF16/FP8 generators.

NOTE: it is important to set kernel filter (--kernels /
CUTLASS_LIBRARY_KERNELS) to a non empty string and skip pruning to get
all of the new configurations.

If profiling exhaustively, they can be set to `*`.

Number of CUTLASS 3.X GEMMs before this commit: 2868
Number of CUTLASS 3.X GEMMs after this commit: 4016

Co-authored-by: Ali Hassani <ahassani@nvidia.com>
2024-07-31 20:22:29 -04:00
Vijay Thakkar
be60a0b272
CUTLASS 3.5.1 (#1623)
* CUTLASS 3.5.1

* updates, optimizations, fixes
2024-07-29 08:46:24 -04:00
Vijay Thakkar
7d49e6c7e2
Updates for CUTLASS 3.5.0 (#1468) 2024-04-11 21:33:40 -04:00
jeromeku
f9ece1b42c
Python Gemm tile_descriptions fix (#1439)
* fix python gemm tile descriptions

* fix formatting

* fix math_operation filtering

* fix formatting
2024-03-30 09:00:46 -04:00
Vijay Thakkar
629f4653c3
CUTLASS 3.5.0 (#1411) 2024-03-19 17:51:04 -04:00
ANIKET SHIVAM
bbe579a9e3
Updates for CUTLASS 3.4.1 (#1346)
* Updates for CUTLASS 3.4.1

* minor epi change
2024-02-15 15:48:34 -05:00
ANIKET SHIVAM
751eb9a885
Update license year (#1306) 2024-01-16 14:37:22 -05:00
ANIKET SHIVAM
2f589ffa76
Updates for 3.4 release. (#1305) 2024-01-16 13:42:51 -05:00
Kun Wu
8ac2edc810
expose stream API in python kernel call interfaces (#1287)
* expose stream API in python kernel call interfaces

* add stream to ReductionArguments; document stream arg

* add stream argument to GemmGroupedArguments
2024-01-05 08:27:45 -05:00
Pradeep Ramani
8236f30675
CUTLASS 3.4.0 (#1286)
* CUTLASS 3.4.0

* Update CHANGELOG.md

---------

Co-authored-by: Pradeep Ramani <prramani@nvidia.com>
2023-12-29 15:21:31 -05:00
Pradeep Ramani
e9e30c2304
Updates and Bug fixes to CUTLASS 3.3 (#1232) 2023-12-05 09:50:49 -05:00
Christian Sigg
a759e85f5f
Add subclass declarations to generated files. (#1193) 2023-11-30 00:25:40 -05:00
Jack Kosaian
8098336d51
Updates to Python interface for PyPI packaging (#1209)
* Updates

* Updates to notebooks
2023-11-28 13:52:12 -05:00
Pradeep Ramani
c008b4aea8
CUTLASS 3.3.0 (#1167)
* Release 3.3.0

Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.

* minor doc update
2023-11-02 11:09:05 -04:00
Haicheng Wu
5e1a0a5adb
fix alignmentC for h16816_s8xf16 (#1146)
* fix alignmentC for h16816_s8xf16

* manish's change
2023-10-17 15:15:39 -04:00
Manish Gupta
757275f279
Adding more Threadblock Tiles for Mixed-input TensorOp (BF16 * S8) in cutlass_library (#1132)
* Adding more tiles in the cutlass_library for mixed-input support.

* fix rebase issue

* more tiles to upcast a
2023-10-13 11:33:15 -04:00
Manish Gupta
7d8317a63e
Support for Mixed Input TensorOp (#1084)
* Passing warp-level mixed input F16*(S8/U8) tests

* passing device-level mixed input F16*(S8/U8) tests

* add to profiler - I8 (111 TFLOPs), U (123 TFLOPs)

* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)

* Speedup reference compilation (REVERT THIS COMMIT)

* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)

* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs

* BF16 * S8 (142 TFLOPs)

* Handle mixed-input upcast on OperandA (Support [S8|U8]*[F16|BF16]

* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast

* Add device-level test and profiler support for upcast on operand A

* Move shfl before the cvt and reduce #shfls by 1/2

* fix smem_usage calculation for mixed_input types

* uncomment the stuff (getting ready for merge)

* profiler changes and mixed-input reference

* mixed input reference are in a new file

* use platform instead of std

* comments and typo only

* Use CreateGemmOperator and delete CreateMixedInputGemmOperator

* copyright for new files

* rebase follow-up
2023-09-27 11:18:30 -04:00
ANIKET SHIVAM
90d3b0fb18
CUTLASS 3.2.1 (#1113)
* Updates for 3.2.1 release.

* Minor fix in gemm op profiler for raster order.

* Add scheduler mapping for raster order in the kernels.
2023-09-26 17:24:26 -04:00
ANIKET SHIVAM
a88c41cf8d
Updates for 3.2 release (#1065) 2023-08-25 23:05:46 -04:00
Sophia Wisdom
2d9a557427
torch.bfloat16 support in cutlass python (#1037)
* torch.bfloat16 support in cutlass python

* Update datatypes.py
2023-08-16 11:38:53 -04:00
ANIKET SHIVAM
4575443d44
CUTLASS 3.2 (#1024)
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
Tianqi Zhang (张天启)
8e85580859
fix layout bug (#1006) 2023-07-19 14:26:01 -04:00
q.yao
f6d42f2dd0
add library_dirs (#977) 2023-06-14 12:09:12 -04:00
ANIKET SHIVAM
7c04f95415
Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
Jack Kosaian
9a83bd3381
CUTLASS 3.1 Python interface documentation (#917)
* Add 12.1 Dockerfile

* Add 3.1 docs
2023-04-18 15:11:35 -04:00
ANIKET SHIVAM
d572cc1aab
CUTLASS 3.1 (#915)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-04-14 23:19:34 -04:00