cutlass

Author	SHA1	Message	Date
Aleksandar Samardžić	e1976daacc	Add support for mixed 4-bit/8-bit data types GEMM (#1413) * Add support for mixed 4-bit/8-bit data types GEMM * fix ( and ) --------- Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-08-29 23:11:06 -04:00
Aleksandar Samardžić	3f084f7f3c	Add couple configs into generator.py for mixed input MM (#1350) * Add couple configs into generator.py for mixed input MM * change one unit test name; reenable 128x32 in the profiler * Added U8/BF16 tests. --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2024-08-16 00:59:29 -04:00
dePaul Miller	2049c6c5a2	5476 cutlass 3x gemm kernels (#1695) Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>	2024-08-08 13:56:23 -04:00
chenwei	e22ba590cd	support data type w2 used in cutlass_library (#1517)	2024-08-06 11:15:18 -04:00
Ali Hassani	eee0cab26c	Stamp out 1x1x1 clusters, 128x256 CTA shape (#1665) Adds 128x256 tile shapes to FP16/BF16 and FP8 generators. Also adds 1x1x1 clusters to all existing FP16/BF16/FP8 generators. NOTE: it is important to set kernel filter (--kernels / CUTLASS_LIBRARY_KERNELS) to a non empty string and skip pruning to get all of the new configurations. If profiling exhaustively, they can be set to `*`. Number of CUTLASS 3.X GEMMs before this commit: 2868 Number of CUTLASS 3.X GEMMs after this commit: 4016 Co-authored-by: Ali Hassani <ahassani@nvidia.com>	2024-07-31 20:22:29 -04:00
Vijay Thakkar	be60a0b272	CUTLASS 3.5.1 (#1623) * CUTLASS 3.5.1 * updates, optimizations, fixes	2024-07-29 08:46:24 -04:00
Vijay Thakkar	7d49e6c7e2	Updates for CUTLASS 3.5.0 (#1468)	2024-04-11 21:33:40 -04:00
jeromeku	f9ece1b42c	Python `Gemm` `tile_descriptions` fix (#1439) * fix python gemm tile descriptions * fix formatting * fix math_operation filtering * fix formatting	2024-03-30 09:00:46 -04:00
Vijay Thakkar	629f4653c3	CUTLASS 3.5.0 (#1411)	2024-03-19 17:51:04 -04:00
ANIKET SHIVAM	bbe579a9e3	Updates for CUTLASS 3.4.1 (#1346) * Updates for CUTLASS 3.4.1 * minor epi change	2024-02-15 15:48:34 -05:00
ANIKET SHIVAM	751eb9a885	Update license year (#1306)	2024-01-16 14:37:22 -05:00
ANIKET SHIVAM	2f589ffa76	Updates for 3.4 release. (#1305)	2024-01-16 13:42:51 -05:00
Kun Wu	8ac2edc810	expose stream API in python kernel call interfaces (#1287) * expose stream API in python kernel call interfaces * add stream to ReductionArguments; document stream arg * add stream argument to GemmGroupedArguments	2024-01-05 08:27:45 -05:00
Pradeep Ramani	8236f30675	CUTLASS 3.4.0 (#1286) * CUTLASS 3.4.0 * Update CHANGELOG.md --------- Co-authored-by: Pradeep Ramani <prramani@nvidia.com>	2023-12-29 15:21:31 -05:00
Pradeep Ramani	e9e30c2304	Updates and Bug fixes to CUTLASS 3.3 (#1232)	2023-12-05 09:50:49 -05:00
Christian Sigg	a759e85f5f	Add subclass declarations to generated files. (#1193)	2023-11-30 00:25:40 -05:00
Jack Kosaian	8098336d51	Updates to Python interface for PyPI packaging (#1209) * Updates * Updates to notebooks	2023-11-28 13:52:12 -05:00
Pradeep Ramani	c008b4aea8	CUTLASS 3.3.0 (#1167) * Release 3.3.0 Adds support for mixed precision GEMMs On Hopper and Ampere Adds support for < 16B aligned GEMMs on Hopper Enhancements to EVT Enhancements to Python interface Enhancements to Sub-byte type handling in CuTe Several other bug-fixes and performance improvements. * minor doc update	2023-11-02 11:09:05 -04:00
Haicheng Wu	5e1a0a5adb	fix alignmentC for h16816_s8xf16 (#1146) * fix alignmentC for h16816_s8xf16 * manish's change	2023-10-17 15:15:39 -04:00
Manish Gupta	757275f279	Adding more Threadblock Tiles for Mixed-input TensorOp (BF16 * S8) in cutlass_library (#1132) * Adding more tiles in the cutlass_library for mixed-input support. * fix rebase issue * more tiles to upcast a	2023-10-13 11:33:15 -04:00
Manish Gupta	7d8317a63e	Support for Mixed Input TensorOp (#1084) * Passing warp-level mixed input F16(S8/U8) tests passing device-level mixed input F16(S8/U8) tests add to profiler - I8 (111 TFLOPs), U (123 TFLOPs) * fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs) * Speedup reference compilation (REVERT THIS COMMIT) * wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s) * Improve s8->f16 cvt and support bf16u8 @158 TFLOPs BF16 * S8 (142 TFLOPs) * Handle mixed-input upcast on OperandA (Support [S8|U8][F16|BF16] rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast * Add device-level test and profiler support for upcast on operand A * Move shfl before the cvt and reduce #shfls by 1/2 * fix smem_usage calculation for mixed_input types * uncomment the stuff (getting ready for merge) * profiler changes and mixed-input reference * mixed input reference are in a new file * use platform instead of std * comments and typo only * Use CreateGemmOperator and delete CreateMixedInputGemmOperator * copyright for new files * rebase follow-up	2023-09-27 11:18:30 -04:00
ANIKET SHIVAM	90d3b0fb18	CUTLASS 3.2.1 (#1113) * Updates for 3.2.1 release. * Minor fix in gemm op profiler for raster order. * Add scheduler mapping for raster order in the kernels.	2023-09-26 17:24:26 -04:00
ANIKET SHIVAM	a88c41cf8d	Updates for 3.2 release (#1065)	2023-08-25 23:05:46 -04:00
Sophia Wisdom	2d9a557427	torch.bfloat16 support in cutlass python (#1037) * torch.bfloat16 support in cutlass python * Update datatypes.py	2023-08-16 11:38:53 -04:00
ANIKET SHIVAM	4575443d44	CUTLASS 3.2 (#1024) * CUTLASS 3.2	2023-08-07 20:50:32 -04:00
Tianqi Zhang (张天启)	8e85580859	fix layout bug (#1006)	2023-07-19 14:26:01 -04:00
q.yao	f6d42f2dd0	add library_dirs (#977)	2023-06-14 12:09:12 -04:00
ANIKET SHIVAM	7c04f95415	Updates for 3.1 (#932)	2023-04-29 09:34:27 -04:00
Jack Kosaian	9a83bd3381	CUTLASS 3.1 Python interface documentation (#917) * Add 12.1 Dockerfile * Add 3.1 docs	2023-04-18 15:11:35 -04:00
ANIKET SHIVAM	d572cc1aab	CUTLASS 3.1 (#915) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-04-14 23:19:34 -04:00

30 Commits