cutlass

Author	SHA1	Message	Date
Wenlei Bao	44dae8b90e	Adjust profiler space for SM89 (#1553 )	2024-09-19 11:40:30 -04:00
reed	2991ce18d3	Add print_svg for mma (#1733 ) * add print_svg for mma * correct the code indentation	2024-09-18 10:37:24 -04:00
Chenggang Zhao	1ebda1ccef	Fix MMA promotion interval assertions (#1641 )	2024-09-16 12:38:42 -04:00
reed	9f68995de5	add publication: ‘EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree’ (#1526 ) Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2024-09-16 11:55:09 -04:00
John Shumway	3a8c01a18b	Prefix a member template name with the template keyword. (#1796 ) Fixes llvm build error.	2024-09-11 13:33:56 -04:00
Junkai-Wu	dbdae514e0	Support for TMA Epilogue for Group Gemm and add pingpong ptr array & Group Gemm (#1795 )	2024-09-11 00:07:31 -04:00
Sean Xiaowen Zhang	21d0534167	fix assertion (#1790 )	2024-09-09 14:05:27 -04:00
Tri Dao	323c8170bf	Support ComputeFn where output type differs from input type (#1771 ) This is useful for, e.g., a function that takes 2 float inputs and turns them into a complex number	2024-09-05 23:25:03 -04:00
Gabriel Wu	82f5075946	set_slice3x3 -> set_slice_3x3 (#1784 )	2024-09-05 23:24:10 -04:00
Saagar Jha	06e337758d	Remove extraneous comma in declaration (#1776 )	2024-09-05 17:14:15 -04:00
JiayuSun	7369adcaca	Add Sm90LinCombPerColBias (#1774 ) Co-authored-by: Jiayu Sun <jiayus@s4124-0071.nvidia.com>	2024-09-04 15:11:24 -04:00
Alchan Kim	6c3044136b	Update barrier.h (#1782 )	2024-09-04 14:52:11 -04:00
Aleksandar Samardžić	e1976daacc	Add support for mixed 4-bit/8-bit data types GEMM (#1413 ) * Add support for mixed 4-bit/8-bit data types GEMM * fix ( and ) --------- Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-08-29 23:11:06 -04:00
Shreya Gaur	f7b19de32c	minor fix for a double quote in CMakeLists.txt (#1727 )	2024-08-19 22:21:42 -04:00
shunfan-shao	4dbf5dbed2	Use CUDA runtime API to retrieve function pointer to driver API (#1700 ) * Query pfn to driver api * use default for older toolkits --------- Co-authored-by: shunfans <shunfans@nvidia.com>	2024-08-19 13:26:09 -04:00
Dustyn Blasig	f93a69134e	Merge pull request #1714 from NVIDIA/u128_div fix uint128	2024-08-16 07:14:59 -05:00
Aleksandar Samardžić	3f084f7f3c	Add couple configs into generator.py for mixed input MM (#1350 ) * Add couple configs into generator.py for mixed input MM * change one unit test name; reenable 128x32 in the profiler * Added U8/BF16 tests. --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2024-08-16 00:59:29 -04:00
Haicheng Wu	b0296bf682	fix uint128	2024-08-15 21:06:01 -07:00
Dustyn Blasig	865be73a97	Merge pull request #1713 from NVIDIA/351_sparse_update update 3.5.1 readme/changelog	2024-08-15 11:44:49 -05:00
Haicheng Wu	8d8cfdf375	update 3.5.1 readme/changelog	2024-08-14 21:12:44 -07:00
eqy	fb170439e8	Update half.h (#1709 )	2024-08-14 14:59:59 -04:00
dePaul Miller	4e5a8f6853	3.5.1 plots and updated readme (#1708 ) Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>	2024-08-12 18:55:55 -04:00
Tri Dao	7192f4ab23	Add CLayout_64x208 (#1680 ) Without this I get a compilation error when the extended shapes are enabled	2024-08-08 14:00:24 -04:00
dePaul Miller	2049c6c5a2	5476 cutlass 3x gemm kernels (#1695 ) Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>	2024-08-08 13:56:23 -04:00
chenwei	e22ba590cd	support data type w2 used in cutlass_library (#1517 )	2024-08-06 11:15:18 -04:00
Mark Hoemmen	19b4c5e065	Fix isnan namespace qualification in cutlass/functional.h (#1679 ) * Fix unrelated MSVC build warnings * Fix use of isnan in functional.h Correct namespace qualification of isnan in functional.h so that it invokes cutlass::isnan for half_t, instead of converting half_t to float and invoking std::isnan (on host, or ::isnan on device).	2024-08-05 14:28:13 -04:00
dePaul Miller	06b21349bc	1x1x1 cluster launch (#1673 )	2024-08-01 12:20:28 -04:00
Ali Hassani	eee0cab26c	Stamp out 1x1x1 clusters, 128x256 CTA shape (#1665 ) Adds 128x256 tile shapes to FP16/BF16 and FP8 generators. Also adds 1x1x1 clusters to all existing FP16/BF16/FP8 generators. NOTE: it is important to set kernel filter (--kernels / CUTLASS_LIBRARY_KERNELS) to a non empty string and skip pruning to get all of the new configurations. If profiling exhaustively, they can be set to `*`. Number of CUTLASS 3.X GEMMs before this commit: 2868 Number of CUTLASS 3.X GEMMs after this commit: 4016 Co-authored-by: Ali Hassani <ahassani@nvidia.com>	2024-07-31 20:22:29 -04:00
Sergey Klevtsov	36cbfcf483	Add extended wgmma shapes for all data types (#1666 )	2024-07-31 18:33:14 -04:00
Ali Hassani	1f2b590da6	Skip void-C kernels in the profiler when beta is non zero (#1661 ) * Skip void-C kernels in the profiler when beta is non zero CUTLASS profiler will only skip disposition for void-C kernels when beta is non zero, when it makes more sense to skip running it in the first place. Not all users are aware of void-C kernels (as far as I know it wasn't a thing in 2.X), and not everyone remembers to filter out voidC kernels when running the profiler with a non zero beta. The easiest solution (and as far as I can tell correct way of handling this) is that `can_implement` return `false` when beta is non zero (or whatever argument indicates an epilogue source) but we have a void-C kernel. Profiler already includes functionality to skip running kernels that fail `can_implement`. * Move checks to collectives instead --------- Co-authored-by: Ali Hassani <ahassani@nvidia.com>	2024-07-31 18:11:58 -04:00
dePaul Miller	8b2a0408bd	Profiler docs and argument update for raster order (#1667 )	2024-07-31 16:40:10 -04:00
eqy	fbd116c0e5	fix build on SM 5.2 (#1664 )	2024-07-31 09:54:57 -04:00
Tri Dao	5b283c872c	Add more GMMA shapes (#1630 ) * Add more GMMA shapes * Add more shapes for BF16	2024-07-29 19:09:51 -04:00
Vijay Thakkar	be60a0b272	CUTLASS 3.5.1 (#1623 ) * CUTLASS 3.5.1 * updates, optimizations, fixes	2024-07-29 08:46:24 -04:00
Chengquan Jiang	56b46e2d13	Fix grouped gemm invalid memory access to problem shapes (#1543 )	2024-07-10 11:55:22 -04:00
Kevin Tong	52fb43f30f	fix mbarrier invalidate (#1494 )	2024-07-10 11:35:26 -04:00
Joe Rowell	843adf0408	Fix SMEM index for C in CuTe examples (#1477 )	2024-07-10 11:14:15 -04:00
LiYu Lu	e48c7618e4	[bug] fix device thread `gemm.h` constructor (#1473 )	2024-07-10 11:12:36 -04:00
Ali Hassani	c5239d8312	Add Faster Neighborhood Attention to pubs (#1471 )	2024-07-10 11:09:13 -04:00
Daniel Richard G	d6580c3dc0	Support use of external/system GTest installation (#1469 ) * Support use of system/external GTest installation * Create working directory for tests explicitly	2024-07-10 11:07:57 -04:00
Andy Lo	81b06ee0e0	Fix B operand variable name and comments (#1458 )	2024-07-10 11:06:29 -04:00
Alexander Zinoviev	dbfced05e7	Fix typos in convolution tests (#1433 )	2024-07-10 11:00:52 -04:00
Raul	2448bb56e6	Update gemm_api_3x.md (#1386 ) Fixed what seems to be an obvious typo.	2024-07-10 10:59:02 -04:00
Nick John Eliopoulos	637b159063	Fix C++17 version detection in helper_macros.hpp (#1479 ) * It seems that __cplusplus can be inconsistent with _MSVC_LANG when discerning C++17 version. See https://github.com/NVIDIA/cutlass/issues/1474. Added switch to check _MSVC_LANG in addition to __cplusplus * Fixed typo. * Oops, another typo. * Changed incorrect logic, ifndef to ifdef * Define CUTLAS_CPLUSPLUS for language version testing Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com> --------- Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com>	2024-05-28 11:00:51 -04:00
Manish Gupta	033d9efd2d	[Documentation] Fixes the confusion between concatenated vs. composed layout in CuTe documentation (#1498 ) * Update 02_layout_algebra.md * Update 02_layout_algebra.md	2024-05-02 15:35:12 -04:00
Sin	acc3ee18a1	Fix typos in cute docs (#1486 ) * fix typos in 02_layout_algebra.md * fix typos in 03_tensor.md	2024-05-02 15:34:36 -04:00
djns99	5c447dd84f	Update packed_stride.hpp to add CUTLASS_HOST_DEVICE decorator to new functions (#1495 )	2024-04-19 12:07:57 -04:00
Vijay Thakkar	7d49e6c7e2	Updates for CUTLASS 3.5.0 (#1468 )	2024-04-11 21:33:40 -04:00
Mehdi Yazdani	a40e08e9d5	Update 02_layout_algebra.md (#1451 ) change line 348 to reflect correct layout.	2024-04-10 10:57:57 -04:00
lzw	8e7d9f483d	add missing header for size_t in `numeric_types.h` (#1420 ) * add missing header for size_t in `numeric_types.h` * make nvrtc happy * add missing header for int types in `cutlass/arch/memory.h` --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-04-09 14:15:48 -04:00


543 Commits