cutlass

Author	SHA1	Message	Date
Mark Hoemmen	19b4c5e065	Fix isnan namespace qualification in cutlass/functional.h (#1679). Fix unrelated MSVC build warnings. Fix use of isnan in functional.h. Correct namespace qualification of isnan in functional.h so that it invokes cutlass::isnan for half_t, instead of converting half_t to float and invoking std::isnan (on host, or ::isnan on device).	2024-08-05 14:28:13 -04:00
dePaul Miller	06b21349bc	1x1x1 cluster launch (#1673)	2024-08-01 12:20:28 -04:00
Ali Hassani	eee0cab26c	Stamp out 1x1x1 clusters, 128x256 CTA shape (#1665). Adds 128x256 tile shapes to FP16/BF16 and FP8 generators. Also adds 1x1x1 clusters to all existing FP16/BF16/FP8 generators. NOTE: it is important to set kernel filter (--kernels / CUTLASS_LIBRARY_KERNELS) to a non-empty string and skip pruning to get all of the new configurations. If profiling exhaustively, they can be set to `*`. Number of CUTLASS 3.X GEMMs before this commit: 2868. Number of CUTLASS 3.X GEMMs after this commit: 4016. Co-authored-by: Ali Hassani <ahassani@nvidia.com>	2024-07-31 20:22:29 -04:00
Sergey Klevtsov	36cbfcf483	Add extended wgmma shapes for all data types (#1666)	2024-07-31 18:33:14 -04:00
Ali Hassani	1f2b590da6	Skip void-C kernels in the profiler when beta is non-zero (#1661). The CUTLASS profiler only skips disposition for void-C kernels when beta is non-zero, when it makes more sense to skip running them in the first place. Not all users are aware of void-C kernels (as far as I know they weren't a thing in 2.X), and not everyone remembers to filter out void-C kernels when running the profiler with a non-zero beta. The easiest solution (and as far as I can tell the correct way of handling this) is for `can_implement` to return `false` when beta is non-zero (or whatever argument indicates an epilogue source) but we have a void-C kernel. The profiler already includes functionality to skip running kernels that fail `can_implement`. Move checks to collectives instead. Co-authored-by: Ali Hassani <ahassani@nvidia.com>	2024-07-31 18:11:58 -04:00
dePaul Miller	8b2a0408bd	Profiler docs and argument update for raster order (#1667)	2024-07-31 16:40:10 -04:00
eqy	fbd116c0e5	fix build on SM 5.2 (#1664)	2024-07-31 09:54:57 -04:00
Tri Dao	5b283c872c	Add more GMMA shapes (#1630). Add more shapes for BF16.	2024-07-29 19:09:51 -04:00
Vijay Thakkar	be60a0b272	CUTLASS 3.5.1 (#1623). Updates, optimizations, fixes.	2024-07-29 08:46:24 -04:00
Chengquan Jiang	56b46e2d13	Fix grouped gemm invalid memory access to problem shapes (#1543)	2024-07-10 11:55:22 -04:00
Kevin Tong	52fb43f30f	fix mbarrier invalidate (#1494)	2024-07-10 11:35:26 -04:00
Joe Rowell	843adf0408	Fix SMEM index for C in CuTe examples (#1477)	2024-07-10 11:14:15 -04:00
LiYu Lu	e48c7618e4	[bug] fix device thread `gemm.h` constructor (#1473)	2024-07-10 11:12:36 -04:00
Ali Hassani	c5239d8312	Add Faster Neighborhood Attention to pubs (#1471)	2024-07-10 11:09:13 -04:00
Daniel Richard G	d6580c3dc0	Support use of external/system GTest installation (#1469). Create working directory for tests explicitly.	2024-07-10 11:07:57 -04:00
Andy Lo	81b06ee0e0	Fix B operand variable name and comments (#1458)	2024-07-10 11:06:29 -04:00
Alexander Zinoviev	dbfced05e7	Fix typos in convolution tests (#1433)	2024-07-10 11:00:52 -04:00
Raul	2448bb56e6	Update gemm_api_3x.md (#1386). Fixed what seems to be an obvious typo.	2024-07-10 10:59:02 -04:00
Nick John Eliopoulos	637b159063	Fix C++17 version detection in helper_macros.hpp (#1479). It seems that __cplusplus can be inconsistent with _MSVC_LANG when discerning the C++17 version. See https://github.com/NVIDIA/cutlass/issues/1474. Added switch to check _MSVC_LANG in addition to __cplusplus. Fixed typo. Oops, another typo. Changed incorrect logic, ifndef to ifdef. Define CUTLASS_CPLUSPLUS for language version testing. Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com>	2024-05-28 11:00:51 -04:00
Manish Gupta	033d9efd2d	[Documentation] Fixes the confusion between concatenated vs. composed layout in CuTe documentation (#1498). Update 02_layout_algebra.md.	2024-05-02 15:35:12 -04:00
Sin	acc3ee18a1	Fix typos in cute docs (#1486). Fix typos in 02_layout_algebra.md. Fix typos in 03_tensor.md.	2024-05-02 15:34:36 -04:00
djns99	5c447dd84f	Update packed_stride.hpp to add CUTLASS_HOST_DEVICE decorator to new functions (#1495)	2024-04-19 12:07:57 -04:00
Vijay Thakkar	7d49e6c7e2	Updates for CUTLASS 3.5.0 (#1468)	2024-04-11 21:33:40 -04:00
Mehdi Yazdani	a40e08e9d5	Update 02_layout_algebra.md (#1451). Change line 348 to reflect correct layout.	2024-04-10 10:57:57 -04:00
lzw	8e7d9f483d	add missing header for size_t in `numeric_types.h` (#1420). Make nvrtc happy. Add missing header for int types in `cutlass/arch/memory.h`. Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-04-09 14:15:48 -04:00
reed	19f3cc33f1	Fix uint128 operator add (#1400). Fix uint128 operator add for 64-bit hilo implementation. Add uint128 test for operator add. Make clang happy. Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2024-04-02 13:32:18 -04:00
jeromeku	f9ece1b42c	Python `Gemm` `tile_descriptions` fix (#1439). Fix python gemm tile descriptions. Fix formatting. Fix math_operation filtering.	2024-03-30 09:00:46 -04:00
reed	28cbacbf64	fix stride compilation warning (#1415)	2024-03-29 23:50:33 -04:00
Tom Tan	8f7d2789b8	[NFC] improve doc: fix typo in mma doc (#1417)	2024-03-27 14:07:20 -04:00
seventh	c4e3e122e2	group gemm set stride L = cute::Int<0> (#1416)	2024-03-20 17:31:14 -04:00
Vijay Thakkar	629f4653c3	CUTLASS 3.5.0 (#1411)	2024-03-19 17:51:04 -04:00
lorenzo chelini	ffa34e7075	(NFC) improve doc: Add missing verb to sentence (#1377). Co-authored-by: lorenzo chelini <lchelini@nvidia.com>	2024-03-04 15:30:10 -05:00
LiYu Lu	a8f2c80db0	fix `tile_size(TiledCopy<Args...> const&)` error (#1357)	2024-02-24 00:33:01 -05:00
ANIKET SHIVAM	bbe579a9e3	Updates for CUTLASS 3.4.1 (#1346). Minor epi change.	2024-02-15 15:48:34 -05:00
Driss Guessous	47a3ebbea9	Add a missing platform include (#1328)	2024-02-03 01:30:32 -05:00
Chenggang Zhao	57e01e1a6b	Fix missing include file (#1318)	2024-02-03 01:29:32 -05:00
xws117	6e3df975a2	Modify comments in code examples/08_turing_tensorop_gemm/turing_tensorop_gemm.cu (#1325)	2024-01-31 21:41:30 -05:00
reed	8825fbf1ef	fix unrecognized print format specifier for int8/uint8 (#1303). Use C++ static_cast instead of a C-style cast.	2024-01-29 21:22:40 -05:00
reed	092f14db05	fix tile_size_mnk compilation warning (#1294)	2024-01-29 21:21:15 -05:00
Haicheng Wu	9385141f19	Update PUBLICATIONS.md ptq paper from goog	2024-01-19 14:17:55 -05:00
Haicheng Wu	b4b5b11070	Update PUBLICATIONS.md add odyssey llm paper from Meituan	2024-01-18 10:30:21 -05:00
jayhshah	139b93db61	update publications (#1308)	2024-01-17 14:06:46 -05:00
Aleksandar Samardžić	ca37d632c9	Remove sparse GEMM with row broadcasted bias vector (#1302). This reverts commit `d3e72719b4`. Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>	2024-01-17 14:06:27 -05:00
Chengquan Jiang	362abbf274	Support ElementD to be void for tma (#1153). Support void D with AuxStore. Refine get_element_aux.	2024-01-16 18:15:42 -05:00
ANIKET SHIVAM	751eb9a885	Update license year (#1306)	2024-01-16 14:37:22 -05:00
ANIKET SHIVAM	2f589ffa76	Updates for 3.4 release. (#1305)	2024-01-16 13:42:51 -05:00
Tianao Ge	acba5beee5	Fix flops calculation and tensor b stride calculation in the example 36 (#1278). Fix datatype. Update gather_scatter_fusion.cu.	2024-01-08 17:27:30 -05:00
Eugene Zhulenev	74d1f3e63a	Fix cute::array<T, 0> iterator (#1273)	2024-01-08 17:10:09 -05:00
Kun Wu	8ac2edc810	expose stream API in python kernel call interfaces (#1287). Add stream to ReductionArguments; document stream arg. Add stream argument to GemmGroupedArguments.	2024-01-05 08:27:45 -05:00
Ali Hassani	d4be5ab5d7	Allow per-column bias in EpilogueTensorBroadcast (#1275). EpilogueTensorBroadcast only supports per-row vector broadcast, because the bias stride is hardcoded. It can easily support both if the bias stride is made conditional, and the original behavior is maintained by defaulting to per-row. Add unit test for EpilogueTensorBroadcast with per-col bias. Co-authored-by: Ali Hassani <ahassanijr@gmail.com> Co-authored-by: Ali Hassani <ali@hippoml.com>	2024-01-04 12:48:31 -05:00
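
Note on 19f3cc33f1 (uint128 operator add): the log does not show the patch itself, so the following is only a generic sketch of how addition is usually done when a 128-bit value is stored as two 64-bit halves. The type `U128` and the functions `add` and `equal` are hypothetical names chosen for illustration; they are not taken from CUTLASS and may differ from what the commit actually changed. The point it illustrates is that the carry out of the low half has to be propagated into the high half.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 128-bit unsigned integer kept as two
// 64-bit halves ("hilo" storage). Not the CUTLASS type.
struct U128 {
  std::uint64_t lo;  // low 64 bits
  std::uint64_t hi;  // high 64 bits
};

// Add two values, propagating the carry from the low half.
constexpr U128 add(U128 a, U128 b) {
  std::uint64_t lo = a.lo + b.lo;             // wraps modulo 2^64
  std::uint64_t carry = (lo < a.lo) ? 1 : 0;  // wrapped iff the sum is smaller than an operand
  return U128{lo, a.hi + b.hi + carry};       // high half wraps modulo 2^64 as well
}

constexpr bool equal(U128 a, U128 b) {
  return a.lo == b.lo && a.hi == b.hi;
}

// Compile-time check: (2^64 - 1) + 1 must carry into the high half.
static_assert(equal(add(U128{UINT64_MAX, 0}, U128{1, 0}), U128{0, 1}),
              "carry must propagate from lo to hi");
```

Dropping the `carry` term is the typical way such an addition goes wrong: every sum that stays within the low half still looks correct, and only sums that overflow it are off by 2^64.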

518 Commits