cutlass

Author	SHA1	Message	Date
ANIKET SHIVAM	751eb9a885	Update license year (#1306 )	2024-01-16 14:37:22 -05:00
ANIKET SHIVAM	2f589ffa76	Updates for 3.4 release. (#1305 )	2024-01-16 13:42:51 -05:00
Pradeep Ramani	8236f30675	CUTLASS 3.4.0 (#1286 ) * CUTLASS 3.4.0 * Update CHANGELOG.md --------- Co-authored-by: Pradeep Ramani <prramani@nvidia.com>	2023-12-29 15:21:31 -05:00
Pradeep Ramani	c008b4aea8	CUTLASS 3.3.0 (#1167 ) * Release 3.3.0 Adds support for mixed precision GEMMs On Hopper and Ampere Adds support for < 16B aligned GEMMs on Hopper Enhancements to EVT Enhancements to Python interface Enhancements to Sub-byte type handling in CuTe Several other bug-fixes and performance improvements. * minor doc update	2023-11-02 11:09:05 -04:00
Manish Gupta	757275f279	Adding more Threadblock Tiles for Mixed-input TensorOp (BF16 * S8) in cutlass_library (#1132 ) * Adding more tiles in the cutlass_library for mixed-input support. * fix rebase issue * more tiles to upcast a	2023-10-13 11:33:15 -04:00
Manish Gupta	ff02da2667	Fx parallel split-k (#1116 )	2023-10-06 12:02:40 -04:00
Manish Gupta	7d8317a63e	Support for Mixed Input TensorOp (#1084 ) * Passing warp-level mixed input F16(S8/U8) tests passing device-level mixed input F16(S8/U8) tests add to profiler - I8 (111 TFLOPs), U (123 TFLOPs) * fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs) * Speedup reference compilation (REVERT THIS COMMIT) * wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s) * Improve s8->f16 cvt and support bf16u8 @158 TFLOPs BF16 * S8 (142 TFLOPs) * Handle mixed-input upcast on OperandA (Support [S8\|U8][F16\|BF16] rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast * Add device-level test and profiler support for upcast on operand A * Move shfl before the cvt and reduce #shfls by 1/2 * fix smem_usage calculation for mixed_input types * uncomment the stuff (getting ready for merge) * profiler changes and mixed-input reference * mixed input reference are in a new file * use platform instead of std * comments and typo only * Use CreateGemmOperator and delete CreateMixedInputGemmOperator * copyright for new files * rebase follow-up	2023-09-27 11:18:30 -04:00
ANIKET SHIVAM	90d3b0fb18	CUTLASS 3.2.1 (#1113 ) * Updates for 3.2.1 release. * Minor fix in gemm op profiler for raster order. * Add scheduler mapping for raster order in the kernels.	2023-09-26 17:24:26 -04:00
ANIKET SHIVAM	34bbadd3ff	standarize fp8 generator (#1078 )	2023-09-07 14:36:33 -04:00
Vijay Thakkar	e01b9b5029	Shard gemm reference templates into multiple TUs for parallel compilation (#1043 ) * Split apart gemm reference templates into multiple TUs for parallel compilation * remove old files * better balancing of ref kernels across TUs * remove 3 new added refcheck kernels and some un-necessary fp8 library instances to reduce lib size * remove auto fp8 kernels * remove some redundant kernels	2023-08-30 16:46:30 -04:00
Ying Zhang	3a8f57a3c8	Add simple hash and eq methods for gemm_operations. (#1053 )	2023-08-27 20:41:57 -04:00
ANIKET SHIVAM	a88c41cf8d	Updates for 3.2 release (#1065 )	2023-08-25 23:05:46 -04:00
ANIKET SHIVAM	4575443d44	CUTLASS 3.2 (#1024 ) * CUTLASS 3.2	2023-08-07 20:50:32 -04:00
ANIKET SHIVAM	473a67073e	Fix Int8 and TF32 generator (#976 )	2023-06-12 12:32:52 -04:00
ANIKET SHIVAM	f079619f5e	More updates for 3.1 (#958 ) * Updates for 3.1 * Minor change * doc link fix * Minor updates	2023-05-24 10:17:16 -04:00
Manish Gupta	b97404837e	Adding 128x256 tile for 16b input datatype WGMMA gemm (#950 )	2023-05-17 17:13:23 -04:00
ANIKET SHIVAM	7c04f95415	Updates for 3.1 (#932 )	2023-04-29 09:34:27 -04:00
Adnan Akhundov	df02482f1d	Add missing schedules argument in SM90 fp16 op generation (#920 )	2023-04-26 16:44:49 -04:00
ANIKET SHIVAM	d572cc1aab	CUTLASS 3.1 (#915 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-04-14 23:19:34 -04:00
dan_the_3rd	9b8166e3f0	fMHA: Add backward pass (#844 ) * fMHA: Add backward pass * Better checks for strides/alignments * Remove fb-internal URL * torch.Tensor.untyped_storage requires pytorch 2.0+ * minor changes * make test --------- Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-04-06 20:44:58 -04:00
Shuai Shao	e2d439ee7e	Add tile_n=32 and tile_k=32 kernels in generator.py (#858 )	2023-04-06 10:00:52 -04:00
Manish Gupta	660a05f581	fix split_k_mode and add reduction kernel for f16 input/accum/output (#896 )	2023-03-30 15:31:08 -04:00
Alexander Pivovarov	7e370c9637	Fix typos 2 (#842 ) Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2023-03-09 23:22:56 -05:00
ANIKET SHIVAM	c4f6b8c6bc	Updates for 3.0 (#857 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-03-09 15:27:40 -05:00
Yinghai Lu	a68e2f95f0	Reduce versbosity in manifest.py (#845 )	2023-03-07 11:53:01 -05:00
Haicheng Wu	65688c2a87	streamk fix (#836 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-23 16:35:08 -05:00
Shuai Shao	9cdbe33570	Add fixed_channel and few_channel mode to int8 in generator (#829 )	2023-02-21 21:15:39 -05:00
Vijay Thakkar	277bd6e537	CUTLASS 3.0.0 (#786 ) * CUTLASS 3.0.0	2023-01-23 20:55:28 -05:00
ANIKET SHIVAM	66d9cddc83	New updates for 2.11 (#775 ) * New updates. * Minor profiler updates Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-01-20 16:32:57 -05:00
Haicheng Wu	764b840d6f	streamk example and performance tuning (#760 ) * streamk example and performance tuning * one missing file Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-10 16:10:02 -05:00
Jack Kosaian	df81d847d7	Make Python interface work for non-SM80 targets (#726 ) * Make Python interface work for non-SM80 targets * Remove line in README	2022-12-07 21:53:33 -05:00
Aditya Atluri	c975e2ccbb	releaase 2.11 (#703 )	2022-11-19 09:02:15 -05:00
seventh	168ea8b0e1	ensure singleton::get thread safe construct instance (#658 ) * ensure singleton::get thread safe construct instance * fix singleton return reference Co-authored-by: xuweiqi <xuweiqi117@gmail.com>	2022-11-08 21:44:32 -05:00
Jack Kosaian	8c1bf9b784	Bump CUTLASS Python container version (#672 ) * Update example 40 README * Update CUTLASS Python README	2022-10-22 21:09:39 -04:00
ANIKET SHIVAM	e773429f7e	CUTLASS 2.10 updates (#622 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-09-12 21:26:30 -04:00
ANIKET SHIVAM	b72cbf957d	CUTLASS 2.10 (#615 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-09-03 18:48:46 -04:00
Mike Iovine	c4cf0dad82	Fix init-self compiler warnings (#493 ) Fix a few errors caused by trying to initialize a class member with itself. These errors can turn into errors if you compile with `-Winit-self`.	2022-05-11 00:35:28 -04:00
Haicheng Wu	1604ebaf10	Update generator.py stop generating analytical conv kernels to reduce kernel number	2022-05-08 21:47:15 -04:00
Andrew Kerr	12f4108ac2	CUTLASS 2.9 (#468 )	2022-04-23 15:02:38 -04:00
Minmin Sun (孙敏敏)	eb0d4c9213	[library] pass pointer of arguments to get_host_workspace_size() in gemm_universal() (#412 ) Otherwise GemmUniversalOperation::get_host_workspace_size() will fail on SegmentFault.	2022-03-22 12:36:34 -04:00
Fujun Han	1e4703cbab	Support parallel split K mode for porfiling (#277 ) * Support parallel split K mode for porfiling Signed-off-by: Peter Han <fujun.han@iluvatar.ai> * Parallel Split K support 1. find gemm kernel by preference key 2. switch m n for redution kernel Signed-off-by: Peter Han <fujun.han@iluvatar.ai> * parallel splitk for fp16 gemm * add one missing file Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-01-27 10:37:37 -05:00
Masahiro Masuda	d7c9cbf0b9	Fix typo in scripts/library.py (wrong data size for u8) (#393 )	2022-01-07 13:29:56 -05:00
Haicheng Wu	f78994bb40	add the missing pieces (#392 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2021-12-25 04:29:54 -08:00
Andrew Kerr	ec4f7e5194	Updates to fused epilogue (#383 ) * Enhancements and fixes to fused GEMM and Convolution epilogue. * Need to explicitly list cudart as unit test library dependency.	2021-12-17 16:04:43 -05:00
Manish Gupta	808c25337a	CUTLASS 2.8 (#363 ) CUTLASS 2.8	2021-11-19 13:26:35 -08:00
Manish Gupta	2e07c4cc2f	CUTLASS 2.7 (#318 ) CUTLASS 2.7 Mainloop fusion for GEMM: summation over A or B Strided DGRAD (optimized iterators) Half-precision GELU_taylor activation functions Use these when accumulation and epilogue compute types are all cutlass::half_t Tuning and bug fixes to fused GEMM + GEMM example Support for smaller than 128b aligned Convolutions: see examples Caching of results to accelerate Convolution unit tests Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF Corrections and bug fixes reported by the CUTLASS community Thank you for filing these issues! authored-by: Haicheng Wu haichengw@nvidia.com, Manish Gupta manigupta@nvidia.com, Dustyn Blasig dblasig@nvidia.com, Andrew Kerr akerr@nvidia.com	2021-09-20 11:02:22 -07:00
Haicheng Wu	59e2aa505a	refine the implementation	2021-09-08 13:14:08 +00:00
Manish Gupta	6c2f8f2fb8	CUTLASS 2.6.1 - functional and performance enhancements to strided DGRAD, fixes, and tuning * cutlass 2.6 update * remove debug prints * cutlass 2.6.1 (minor update) * Updated CHANGELOG. * Minor edit to readme to indicate patch version. * Minor edit to readme. Co-authored-by: Haicheng Wu <haichengw@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2021-09-03 10:26:15 -07:00
Manish Gupta	1ac4559d12	Cutlass 2.6 Update 1 (#301 ) * cutlass 2.6 update * remove debug prints	2021-07-27 17:58:30 -07:00
Manish Gupta	e5d51840e8	CUTLASS 2.6 (#298 ) CUTLASS 2.6	2021-07-23 00:40:53 -04:00

59 Commits