cutlass

Author	SHA1	Message	Date
Xinyu Yang	b0c09ed077	fix by adding public (#1753 )	2024-10-23 12:45:58 -04:00
Sergey Klevtsov	d65266a868	Add all supported GMMA shapes (#1890 )	2024-10-22 18:13:36 -04:00
Tri Dao	5b50a8faaf	Add GMMA shape m64n40k16 (#1864 )	2024-10-21 20:41:47 -04:00
Sergey Klevtsov	08101d9d0c	Improve sm90 mixed dtype kernel (#1883 )	2024-10-17 20:06:38 -04:00
Haicheng Wu	755194a7bd	add is_last_tile	2024-10-17 12:11:02 -07:00
Saagar Jha	53668799b2	Handle MNK Sm90{Row, Col}Reduction problem shapes (#1803 )	2024-10-14 19:46:20 -04:00
Yujia Zhai	cc3c29a81a	CUTLASS 3.6.0 (#1850 ) * v3.6 * update changelog * update readme * fix typo * fixing typos * hopper gemm with weight prefetch --------- Co-authored-by: yuzhai <yuzhai@nvidia.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-10-09 15:33:27 -04:00
Feng Shijie	0837a2a00a	Fix typo in comment (#1787 )	2024-10-07 12:39:59 -04:00
Junkai-Wu	e2b0789927	Add some can implement rules of hopper convolution. (#1835 )	2024-09-25 11:28:10 -04:00
reed	2991ce18d3	Add print_svg for mma (#1733 ) * add print_svg for mma * correct the code indentation	2024-09-18 10:37:24 -04:00
Chenggang Zhao	1ebda1ccef	Fix MMA promotion interval assertions (#1641 )	2024-09-16 12:38:42 -04:00
John Shumway	3a8c01a18b	Prefix a member template name with the template keyword. (#1796 ) Fixes llvm build error.	2024-09-11 13:33:56 -04:00
Junkai-Wu	dbdae514e0	Support for TMA Epilogue for Group Gemm and add pingpong ptr array & Group Gemm (#1795 )	2024-09-11 00:07:31 -04:00
Sean Xiaowen Zhang	21d0534167	fix assertion (#1790 )	2024-09-09 14:05:27 -04:00
Tri Dao	323c8170bf	Support ComputeFn where output type differs from input type (#1771 ) This is useful for e.g. function taking in 2 float inputs and turn them to complex	2024-09-05 23:25:03 -04:00
Gabriel Wu	82f5075946	set_slice3x3 -> set_slice_3x3 (#1784 )	2024-09-05 23:24:10 -04:00
Saagar Jha	06e337758d	Remove extraneous comma in declaration (#1776 )	2024-09-05 17:14:15 -04:00
JiayuSun	7369adcaca	Add Sm90LinCombPerColBias (#1774 ) Co-authored-by: Jiayu Sun <jiayus@s4124-0071.nvidia.com>	2024-09-04 15:11:24 -04:00
Alchan Kim	6c3044136b	Update barrier.h (#1782 )	2024-09-04 14:52:11 -04:00
Aleksandar Samardžić	e1976daacc	Add support for mixed 4-bit/8-bit data types GEMM (#1413 ) * Add support for mixed 4-bit/8-bit data types GEMM * fix ( and ) --------- Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-08-29 23:11:06 -04:00
shunfan-shao	4dbf5dbed2	Use CUDA runtime API to retrieve function pointer to driver API (#1700 ) * Query pfn to driver api * use default for older toolkits --------- Co-authored-by: shunfans <shunfans@nvidia.com>	2024-08-19 13:26:09 -04:00
Haicheng Wu	b0296bf682	fix uint128	2024-08-15 21:06:01 -07:00
eqy	fb170439e8	Update half.h (#1709 )	2024-08-14 14:59:59 -04:00
Tri Dao	7192f4ab23	Add CLayout_64x208 (#1680 ) Without this I get compilation error when the extended shapes are enabled	2024-08-08 14:00:24 -04:00
Mark Hoemmen	19b4c5e065	Fix isnan namespace qualification in cutlass/functional.h (#1679 ) * Fix unrelated MSVC build warnings * Fix use of isnan in functional.h Correct namespace qualification of isnan in functional.h so that it invokes cutlass::isnan for half_t, instead of converting half_t to float and invoking std::isnan (on host, or ::isnan on device).	2024-08-05 14:28:13 -04:00
dePaul Miller	06b21349bc	1x1x1 cluster launch (#1673 )	2024-08-01 12:20:28 -04:00
Sergey Klevtsov	36cbfcf483	Add extended wgmma shapes for all data types (#1666 )	2024-07-31 18:33:14 -04:00
Ali Hassani	1f2b590da6	Skip void-C kernels in the profiler when beta is non zero (#1661 ) * Skip void-C kernels in the profiler when beta is non zero CUTLASS profiler will only skip disposition for void-C kernels when beta is non zero, when it makes more sense to skip running it in the first place. Not all users are aware of void-C kernels (as far as I know it wasn't a thing in 2.X), and not everyone remembers to filter out voidC kernels when running the profiler with a non zero beta. The easiest solution (and as far as I can tell correct way of handling this) is that `can_implement` return `false` when beta is non zero (or whatever argument indicates an epilogue source) but we have a void-C kernel. Profiler already includes functionality to skip running kernels that fail `can_implement`. * Move checks to collectives instead --------- Co-authored-by: Ali Hassani <ahassani@nvidia.com>	2024-07-31 18:11:58 -04:00
eqy	fbd116c0e5	fix build on SM 5.2 (#1664 )	2024-07-31 09:54:57 -04:00
Tri Dao	5b283c872c	Add more GMMA shapes (#1630 ) * Add more GMMA shapes * Add more shapes for BF16	2024-07-29 19:09:51 -04:00
Vijay Thakkar	be60a0b272	CUTLASS 3.5.1 (#1623 ) * CUTLASS 3.5.1 * updates, optimizations, fixes	2024-07-29 08:46:24 -04:00
Chengquan Jiang	56b46e2d13	Fix grouped gemm invalid memory access to problem shapes (#1543 )	2024-07-10 11:55:22 -04:00
Kevin Tong	52fb43f30f	fix mbarrier invalidate (#1494 )	2024-07-10 11:35:26 -04:00
Andy Lo	81b06ee0e0	Fix B operand variable name and comments (#1458 )	2024-07-10 11:06:29 -04:00
Nick John Eliopoulos	637b159063	Fix C++17 version detection in helper_macros.hpp (#1479 ) * It seems that __cplusplus can be inconsistent with _MSVC_LANG when discerning C++17 version. See https://github.com/NVIDIA/cutlass/issues/1474. Added switch to check _MSVC_LANG in addition to __cplusplus * Fixed typo. * Oops, another typo. * Changed incorrect logic, ifndef to ifdef * Define CUTLASS_CPLUSPLUS for language version testing Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com> --------- Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com>	2024-05-28 11:00:51 -04:00
Vijay Thakkar	7d49e6c7e2	Updates for CUTLASS 3.5.0 (#1468 )	2024-04-11 21:33:40 -04:00
lzw	8e7d9f483d	add missing header for size_t in `numeric_types.h` (#1420 ) * add missing header for size_t in `numeric_types.h` * make nvrtc happy * add missing header for int types in `cutlass/arch/memory.h` --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-04-09 14:15:48 -04:00
reed	19f3cc33f1	Fix uint128 operator add (#1400 ) * fix uint128 operator add for 64-bit hilo implementation * add uint128 test for operator add * make clang happy --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-04-02 13:32:18 -04:00
reed	28cbacbf64	fix stride compilation warning (#1415 )	2024-03-29 23:50:33 -04:00
seventh	c4e3e122e2	group gemm set stride L = cute::Int<0> (#1416 )	2024-03-20 17:31:14 -04:00
Vijay Thakkar	629f4653c3	CUTLASS 3.5.0 (#1411 )	2024-03-19 17:51:04 -04:00
LiYu Lu	a8f2c80db0	fix `tile_size(TiledCopy<Args...> const&)` error (#1357 )	2024-02-24 00:33:01 -05:00
ANIKET SHIVAM	bbe579a9e3	Updates for CUTLASS 3.4.1 (#1346 ) * Updates for CUTLASS 3.4.1 * minor epi change	2024-02-15 15:48:34 -05:00
Driss Guessous	47a3ebbea9	Add a missing platform include (#1328 )	2024-02-03 01:30:32 -05:00
reed	8825fbf1ef	fix unrecognized print format specifier for int8/uint8 (#1303 ) * fix unrecognized print format specifier for int8/uint8 * use c++ static_cast instead of c cast style	2024-01-29 21:22:40 -05:00
reed	092f14db05	fix tile_size_mnk compilation warning (#1294 )	2024-01-29 21:21:15 -05:00
Aleksandar Samardžić	ca37d632c9	Remove sparse GEMM with row broadcasted bias vector (#1302 ) This reverts commit `d3e72719b4`. Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>	2024-01-17 14:06:27 -05:00
Chengquan Jiang	362abbf274	Support ElementD to be void for tma (#1153 ) * Support void D with AuxStore * refine get_element_aux	2024-01-16 18:15:42 -05:00
ANIKET SHIVAM	751eb9a885	Update license year (#1306 )	2024-01-16 14:37:22 -05:00
ANIKET SHIVAM	2f589ffa76	Updates for 3.4 release. (#1305 )	2024-01-16 13:42:51 -05:00

251 Commits