cutlass

Author	SHA1	Message	Date
eqy	fb170439e8	Update half.h (#1709 )	2024-08-14 14:59:59 -04:00
Tri Dao	7192f4ab23	Add CLayout_64x208 (#1680 ) Without this I get compilation error when the extended shapes are enabled	2024-08-08 14:00:24 -04:00
Mark Hoemmen	19b4c5e065	Fix isnan namespace qualification in cutlass/functional.h (#1679 ) * Fix unrelated MSVC build warnings * Fix use of isnan in functional.h Correct namespace qualification of isnan in functional.h so that it invokes cutlass::isnan for half_t, instead of converting half_t to float and invoking std::isnan (on host, or ::isnan on device).	2024-08-05 14:28:13 -04:00
dePaul Miller	06b21349bc	1x1x1 cluster launch (#1673 )	2024-08-01 12:20:28 -04:00
Sergey Klevtsov	36cbfcf483	Add extended wgmma shapes for all data types (#1666 )	2024-07-31 18:33:14 -04:00
Ali Hassani	1f2b590da6	Skip void-C kernels in the profiler when beta is non zero (#1661 ) * Skip void-C kernels in the profiler when beta is non zero CUTLASS profiler will only skip disposition for void-C kernels when beta is non zero, when it makes more sense to skip running it in the first place. Not all users are aware of void-C kernels (as far as I know it wasn't a thing in 2.X), and not everyone remembers to filter out voidC kernels when running the profiler with a non zero beta. The easiest solution (and as far as I can tell correct way of handling this) is that `can_implement` return `false` when beta is non zero (or whatever argument indicates an epilogue source) but we have a void-C kernel. Profiler already includes functionality to skip running kernels that fail `can_implement`. * Move checks to collectives instead --------- Co-authored-by: Ali Hassani <ahassani@nvidia.com>	2024-07-31 18:11:58 -04:00
eqy	fbd116c0e5	fix build on SM 5.2 (#1664 )	2024-07-31 09:54:57 -04:00
Tri Dao	5b283c872c	Add more GMMA shapes (#1630 ) * Add more GMMA shapes * Add more shapes for BF16	2024-07-29 19:09:51 -04:00
Vijay Thakkar	be60a0b272	CUTLASS 3.5.1 (#1623 ) * CUTLASS 3.5.1 * updates, optimizations, fixes	2024-07-29 08:46:24 -04:00
Chengquan Jiang	56b46e2d13	Fix grouped gemm invalid memory access to problem shapes (#1543 )	2024-07-10 11:55:22 -04:00
Kevin Tong	52fb43f30f	fix mbarrier invalidate (#1494 )	2024-07-10 11:35:26 -04:00
Andy Lo	81b06ee0e0	Fix B operand variable name and comments (#1458 )	2024-07-10 11:06:29 -04:00
Nick John Eliopoulos	637b159063	Fix C++17 version detection in helper_macros.hpp (#1479 ) * It seems that __cplusplus can be inconsistent with _MSVC_LANG when discerning C++17 version. See https://github.com/NVIDIA/cutlass/issues/1474. Added switch to check _MSVC_LANG in addition to __cplusplus * Fixed typo. * Oops, another typo. * Changed incorrect logic, ifndef to ifdef * Define CUTLASS_CPLUSPLUS for language version testing Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com> --------- Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com>	2024-05-28 11:00:51 -04:00
Vijay Thakkar	7d49e6c7e2	Updates for CUTLASS 3.5.0 (#1468 )	2024-04-11 21:33:40 -04:00
lzw	8e7d9f483d	add missing header for size_t in `numeric_types.h` (#1420 ) * add missing header for size_t in `numeric_types.h` * make nvrtc happy * add missing header for int types in `cutlass/arch/memory.h` --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-04-09 14:15:48 -04:00
reed	19f3cc33f1	Fix uint128 operator add (#1400 ) * fix uint128 operator add for 64-bit hilo implementation * add uint128 test for operator add * make clang happy --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-04-02 13:32:18 -04:00
reed	28cbacbf64	fix stride compilation warning (#1415 )	2024-03-29 23:50:33 -04:00
seventh	c4e3e122e2	group gemm set stride L = cute::Int<0> (#1416 )	2024-03-20 17:31:14 -04:00
Vijay Thakkar	629f4653c3	CUTLASS 3.5.0 (#1411 )	2024-03-19 17:51:04 -04:00
LiYu Lu	a8f2c80db0	fix `tile_size(TiledCopy<Args...> const&)` error (#1357 )	2024-02-24 00:33:01 -05:00
ANIKET SHIVAM	bbe579a9e3	Updates for CUTLASS 3.4.1 (#1346 ) * Updates for CUTLASS 3.4.1 * minor epi change	2024-02-15 15:48:34 -05:00
Driss Guessous	47a3ebbea9	Add a missing platform include (#1328 )	2024-02-03 01:30:32 -05:00
reed	8825fbf1ef	fix unrecognized print format specifier for int8/uint8 (#1303 ) * fix unrecognized print format specifier for int8/uint8 * use c++ static_cast instead of c cast style	2024-01-29 21:22:40 -05:00
reed	092f14db05	fix tile_size_mnk compilation warning (#1294 )	2024-01-29 21:21:15 -05:00
Aleksandar Samardžić	ca37d632c9	Remove sparse GEMM with row broadcasted bias vector (#1302 ) This reverts commit `d3e72719b4`. Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>	2024-01-17 14:06:27 -05:00
Chengquan Jiang	362abbf274	Support ElementD to be void for tma (#1153 ) * Support void D with AuxStore * refine get_element_aux	2024-01-16 18:15:42 -05:00
ANIKET SHIVAM	751eb9a885	Update license year (#1306 )	2024-01-16 14:37:22 -05:00
ANIKET SHIVAM	2f589ffa76	Updates for 3.4 release. (#1305 )	2024-01-16 13:42:51 -05:00
Eugene Zhulenev	74d1f3e63a	Fix cute::array<T, 0> iterator (#1273 )	2024-01-08 17:10:09 -05:00
Ali Hassani	d4be5ab5d7	Allow per-column bias in EpilogueTensorBroadcast (#1275 ) * Allow per-column bias in EpilogueTensorBroadcast EpilogueTensorBroadcast only supports per-row vector broadcast, because the bias stride is hardcoded. It can easily support both if the bias stride is made conditional, and the original behavior is maintained by defaulting to per-row. * Add unit test for EpilogueTensorBroadcast with per-col bias --------- Co-authored-by: Ali Hassani <ahassanijr@gmail.com> Co-authored-by: Ali Hassani <ali@hippoml.com>	2024-01-04 12:48:31 -05:00
Aleksandar Samardžić	5c756eb774	Add support for sparse GEMM with visitor epilogue (#1189 ) * Add support for sparse GEMM with visitor epilogue * Refactor changes at the kernel level	2024-01-04 12:38:11 -05:00
Pradeep Ramani	8236f30675	CUTLASS 3.4.0 (#1286 ) * CUTLASS 3.4.0 * Update CHANGELOG.md --------- Co-authored-by: Pradeep Ramani <prramani@nvidia.com>	2023-12-29 15:21:31 -05:00
Christian Sigg	b7508e3379	Fix inline ptx escaping for predicates. (#1264 ) * Fix inline ptx escaping for predicates. Prevents `error: invalid % escape in inline assembly string` when compiling with clang. * More double-quoting.	2023-12-14 11:16:15 -05:00
Gregory Meyer (gregjm)	f60786b536	Remove undefined behavior from default constructor of PredicatedTileAccessIteratorParams. (#1258 ) Currently, the default constructor of `PredicatedTileAccessIteratorParams` will invoke undefined behavior in its invocation of the `initialize` function. Specifically, it will attempt to read from the uninitialized variables `desc.element_size_bits` and `desc.advance_rank`. This commit changes the default constructors of both `Params` and `Desc` to zero-initialize all uninitialized members.	2023-12-11 23:01:53 -05:00
Christian Sigg	e1483d5fa0	Collection of changes to fix clang build. (#1200 ) * Remove unused variables * Qualify calls to make_fragment_? from templated base class. Fixes clang build error. * Add missing `#include <cstdio>` * Various changes to fix clang compile errors. * More changes to fix clang build. Remaining issues: - `params` initializer of `CollectiveEpilogue`. - `ops` initializer of `Sm90VisitorImplBase`. - `__usAtomicCAS` needs to be added to clang upstream. * Fix remaining clang build issues. * Qualify `cute::rank()` calls. * Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace. * Double-escape special registers in inline asm. * small change --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-12-08 14:42:12 -05:00
Ali Hassani	f4a0216601	Fix bug in single source GEMM with residual + streamk (#1249 ) Followup to #1224. A change in the stream-k threadblock swizzle ctor since 3.3 breaks single source GEMM with fused epilogue and stream-k. Multi-source was already corrected. Co-authored-by: Ali Hassani <ahassanijr@gmail.com>	2023-12-07 11:12:02 -05:00
Ali Hassani	a75b4ac483	Fix Stream-K reduce bug in epilogue with broadcast (#1224 ) Co-authored-by: Ali Hassani <ahassanijr@gmail.com>	2023-12-05 15:35:41 -05:00
Pradeep Ramani	e9e30c2304	Updates and Bug fixes to CUTLASS 3.3 (#1232 )	2023-12-05 09:50:49 -05:00
Haicheng Wu	4a1709e17e	Fixed illegal PTX syntax (#1225 )	2023-12-01 12:29:48 -05:00
Christian Sigg	bef1fbcbe6	Add missing `#include <cstdio>` (#1197 ) * Add missing `#include <cstdio>` * move to non nvrtc part --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-12-01 11:58:53 -05:00
Christian Sigg	2375a07d01	Qualify calls to make_fragment_? from templated base class. (#1196 ) Fixes clang build error.	2023-12-01 09:52:57 -05:00
cyyever	10b850f9c7	Fix some sign conversion warnings (#1172 ) * Fix sign conversion warnings * Fix type conversion warnings * Fix sign conversion warnings * Change smem_size_ to constexpr * clang warnings * undo cast change * one miss change * missing part --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-11-30 00:28:40 -05:00
Christian Sigg	99c4eebe3b	Explicitly cast `blockIdx` to `uint3` (#1192 ) This works around a clang issue where blockIdx is of a different type.	2023-11-30 00:26:23 -05:00
reed	eb01d5449d	fix cp.async L2 prefetch typo (#1187 )	2023-11-28 16:58:04 -05:00
Sergey Klevtsov	b5d8a5d9cc	Allow SM90 pingpong kernel to use custom tile schedulers (#1194 ) Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>	2023-11-15 13:45:17 -05:00
reed	6e60b9b17c	enable L2::128B prefetch for cp.async by default (#1177 )	2023-11-13 13:30:13 -05:00
Changho Hwang	1ab6cc7b68	Fix `std::abs` overloading for `bfloat16_t` (#1179 )	2023-11-13 13:29:45 -05:00
reed	39c6a83f23	fix missing return warning (#1173 )	2023-11-03 22:42:59 -04:00
wang-y-z	557be3ab0e	Fix several typos (#1169 ) Co-authored-by: isaacw <isaacw@nvidia.com>	2023-11-02 23:54:46 -04:00
Pradeep Ramani	c008b4aea8	CUTLASS 3.3.0 (#1167 ) * Release 3.3.0 Adds support for mixed precision GEMMs On Hopper and Ampere Adds support for < 16B aligned GEMMs on Hopper Enhancements to EVT Enhancements to Python interface Enhancements to Sub-byte type handling in CuTe Several other bug-fixes and performance improvements. * minor doc update	2023-11-02 11:09:05 -04:00

229 Commits