Commit Graph

56 Commits

Author SHA1 Message Date
Pradeep Ramani
c008b4aea8
CUTLASS 3.3.0 (#1167)
* Release 3.3.0

Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.

* minor doc update
2023-11-02 11:09:05 -04:00
Manish Gupta
757275f279
Adding more Threadblock Tiles for Mixed-input TensorOp (BF16 * S8) in cutlass_library (#1132)
* Adding more tiles in the cutlass_library for mixed-input support.

* fix rebase issue

* more tiles to upcast a
2023-10-13 11:33:15 -04:00
Manish Gupta
ff02da2667
Fx parallel split-k (#1116) 2023-10-06 12:02:40 -04:00
Manish Gupta
7d8317a63e
Support for Mixed Input TensorOp (#1084)
* Passing warp-level mixed input F16*(S8/U8) tests

* passing device-level mixed input F16*(S8/U8) tests

* add to profiler - I8 (111 TFLOPs), U (123 TFLOPs)

* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)

* Speedup reference compilation (REVERT THIS COMMIT)

* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)

* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs

* BF16 * S8 (142 TFLOPs)

* Handle mixed-input upcast on OperandA (Support [S8|U8]*[F16|BF16]

* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast

* Add device-level test and profiler support for upcast on operand A

* Move shfl before the cvt and reduce #shfls by 1/2

* fix smem_usage calculation for mixed_input types

* uncomment the stuff (getting ready for merge)

* profiler changes and mixed-input reference

* mixed input reference are in a new file

* use platform instead of std

* comments and typo only

* Use CreateGemmOperator and delete CreateMixedInputGemmOperator

* copyright for new files

* rebase follow-up
2023-09-27 11:18:30 -04:00
ANIKET SHIVAM
90d3b0fb18
CUTLASS 3.2.1 (#1113)
* Updates for 3.2.1 release.

* Minor fix in gemm op profiler for raster order.

* Add scheduler mapping for raster order in the kernels.
2023-09-26 17:24:26 -04:00
ANIKET SHIVAM
34bbadd3ff
standarize fp8 generator (#1078) 2023-09-07 14:36:33 -04:00
Vijay Thakkar
e01b9b5029
Shard gemm reference templates into multiple TUs for parallel compilation (#1043)
* Split apart gemm reference templates into multiple TUs for parallel compilation

* remove old files

* better balancing of ref kernels across TUs

* remove 3 new added refcheck kernels and some un-necessary fp8 library instances to reduce lib size

* remove auto fp8 kernels

* remove some redundant kernels
2023-08-30 16:46:30 -04:00
Ying Zhang
3a8f57a3c8
Add simple hash and eq methods for gemm_operations. (#1053) 2023-08-27 20:41:57 -04:00
ANIKET SHIVAM
a88c41cf8d
Updates for 3.2 release (#1065) 2023-08-25 23:05:46 -04:00
ANIKET SHIVAM
4575443d44
CUTLASS 3.2 (#1024)
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
ANIKET SHIVAM
473a67073e
Fix Int8 and TF32 generator (#976) 2023-06-12 12:32:52 -04:00
ANIKET SHIVAM
f079619f5e
More updates for 3.1 (#958)
* Updates for 3.1

* Minor change

* doc link fix

* Minor updates
2023-05-24 10:17:16 -04:00
Manish Gupta
b97404837e
Adding 128x256 tile for 16b input datatype WGMMA gemm (#950) 2023-05-17 17:13:23 -04:00
ANIKET SHIVAM
7c04f95415
Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
Adnan Akhundov
df02482f1d
Add missing schedules argument in SM90 fp16 op generation (#920) 2023-04-26 16:44:49 -04:00
ANIKET SHIVAM
d572cc1aab
CUTLASS 3.1 (#915)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-04-14 23:19:34 -04:00
dan_the_3rd
9b8166e3f0
fMHA: Add backward pass (#844)
* fMHA: Add backward pass

* Better checks for strides/alignments

* Remove fb-internal URL

* torch.Tensor.untyped_storage requires pytorch 2.0+

* minor changes

* make test

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-04-06 20:44:58 -04:00
Shuai Shao
e2d439ee7e
Add tile_n=32 and tile_k=32 kernels in generator.py (#858) 2023-04-06 10:00:52 -04:00
Manish Gupta
660a05f581
fix split_k_mode and add reduction kernel for f16 input/accum/output (#896) 2023-03-30 15:31:08 -04:00
Alexander Pivovarov
7e370c9637
Fix typos 2 (#842)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-03-09 23:22:56 -05:00
ANIKET SHIVAM
c4f6b8c6bc
Updates for 3.0 (#857)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-03-09 15:27:40 -05:00
Yinghai Lu
a68e2f95f0
Reduce versbosity in manifest.py (#845) 2023-03-07 11:53:01 -05:00
Haicheng Wu
65688c2a87
streamk fix (#836)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-23 16:35:08 -05:00
Shuai Shao
9cdbe33570
Add fixed_channel and few_channel mode to int8 in generator (#829) 2023-02-21 21:15:39 -05:00
Vijay Thakkar
277bd6e537
CUTLASS 3.0.0 (#786)
* CUTLASS 3.0.0
2023-01-23 20:55:28 -05:00
ANIKET SHIVAM
66d9cddc83
New updates for 2.11 (#775)
* New updates.

* Minor profiler updates

Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-01-20 16:32:57 -05:00
Haicheng Wu
764b840d6f
streamk example and performance tuning (#760)
* streamk example and performance tuning

* one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-10 16:10:02 -05:00
Jack Kosaian
df81d847d7
Make Python interface work for non-SM80 targets (#726)
* Make Python interface work for non-SM80 targets

* Remove line in README
2022-12-07 21:53:33 -05:00
Aditya Atluri
c975e2ccbb
releaase 2.11 (#703) 2022-11-19 09:02:15 -05:00
seventh
168ea8b0e1
ensure singleton::get thread safe construct instance (#658)
* ensure singleton::get thread safe construct instance

* fix singleton return reference

Co-authored-by: xuweiqi <xuweiqi117@gmail.com>
2022-11-08 21:44:32 -05:00
Jack Kosaian
8c1bf9b784
Bump CUTLASS Python container version (#672)
* Update example 40 README

* Update CUTLASS Python README
2022-10-22 21:09:39 -04:00
ANIKET SHIVAM
e773429f7e
CUTLASS 2.10 updates (#622)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-09-12 21:26:30 -04:00
ANIKET SHIVAM
b72cbf957d
CUTLASS 2.10 (#615)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-09-03 18:48:46 -04:00
Mike Iovine
c4cf0dad82
Fix init-self compiler warnings (#493)
Fix a few errors caused by trying to initialize a class member
with itself. These errors can turn into errors if you compile
with `-Winit-self`.
2022-05-11 00:35:28 -04:00
Haicheng Wu
1604ebaf10
Update generator.py
stop generating analytical conv kernels to reduce kernel number
2022-05-08 21:47:15 -04:00
Andrew Kerr
12f4108ac2
CUTLASS 2.9 (#468) 2022-04-23 15:02:38 -04:00
Minmin Sun (孙敏敏)
eb0d4c9213
[library] pass pointer of arguments to get_host_workspace_size() in gemm_universal() (#412)
Otherwise GemmUniversalOperation::get_host_workspace_size() will fail on SegmentFault.
2022-03-22 12:36:34 -04:00
Fujun Han
1e4703cbab
Support parallel split K mode for porfiling (#277)
* Support parallel split K mode for porfiling

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>

* Parallel Split K support

  1. find gemm kernel by preference key
  2. switch m n for redution kernel

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>

* parallel splitk for fp16 gemm

* add one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-01-27 10:37:37 -05:00
Masahiro Masuda
d7c9cbf0b9
Fix typo in scripts/library.py (wrong data size for u8) (#393) 2022-01-07 13:29:56 -05:00
Haicheng Wu
f78994bb40
add the missing pieces (#392)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2021-12-25 04:29:54 -08:00
Andrew Kerr
ec4f7e5194
Updates to fused epilogue (#383)
* Enhancements and fixes to fused GEMM and Convolution epilogue.
* Need to explicitly list cudart as unit test library dependency.
2021-12-17 16:04:43 -05:00
Manish Gupta
808c25337a
CUTLASS 2.8 (#363)
CUTLASS 2.8
2021-11-19 13:26:35 -08:00
Manish Gupta
2e07c4cc2f
CUTLASS 2.7 (#318)
CUTLASS 2.7

Mainloop fusion for GEMM: summation over A or B
Strided DGRAD (optimized iterators)
Half-precision GELU_taylor activation functions
Use these when accumulation and epilogue compute types are all cutlass::half_t
Tuning and bug fixes to fused GEMM + GEMM example
Support for smaller than 128b aligned Convolutions: see examples
Caching of results to accelerate Convolution unit tests
Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
Corrections and bug fixes reported by the CUTLASS community
Thank you for filing these issues!

authored-by: Haicheng Wu haichengw@nvidia.com, Manish Gupta manigupta@nvidia.com, Dustyn Blasig dblasig@nvidia.com, Andrew Kerr akerr@nvidia.com
2021-09-20 11:02:22 -07:00
Haicheng Wu
59e2aa505a refine the implementation 2021-09-08 13:14:08 +00:00
Manish Gupta
6c2f8f2fb8
CUTLASS 2.6.1 - functional and performance enhancements to strided DGRAD, fixes, and tuning
* cutlass 2.6 update

* remove debug prints

* cutlass 2.6.1 (minor update)

* Updated CHANGELOG.

* Minor edit to readme to indicate patch version.

* Minor edit to readme.

Co-authored-by:  Haicheng Wu <haichengw@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
2021-09-03 10:26:15 -07:00
Manish Gupta
1ac4559d12
Cutlass 2.6 Update 1 (#301)
* cutlass 2.6 update

* remove debug prints
2021-07-27 17:58:30 -07:00
Manish Gupta
e5d51840e8
CUTLASS 2.6 (#298)
CUTLASS 2.6
2021-07-23 00:40:53 -04:00
Manikandan Ananth
75a4737cfe Fix for public issue #211
- Add a slice-K tile size to the profiler
- fix num warps calculations in implicit gemm header
2021-04-01 14:42:00 -07:00
Haicheng Wu
34a42e5620
Update generator.py (#192) 2021-03-02 12:21:48 -08:00
Andrew Kerr
0e13748649 CUTLASS 2.5 2021-02-26 09:58:26 -05:00