* Allow per-column bias in EpilogueTensorBroadcast
EpilogueTensorBroadcast only supports per-row vector broadcast because
the bias stride is hardcoded.
It can easily support both if the bias stride is made conditional, with
the original behavior maintained by defaulting to per-row.
* Add unit test for EpilogueTensorBroadcast with per-col bias
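A minimal sketch of the conditional-stride idea (names are illustrative, not the actual CUTLASS template parameters): the broadcast direction decides which coordinate indexes the bias vector, which is equivalent to giving the other dimension a zero stride.

```cpp
// Illustrative sketch only; the real EpilogueTensorBroadcast differs.
// Per-row broadcast reads bias[row] (column stride 0); per-column
// broadcast reads bias[col] (row stride 0). Defaulting the flag to
// false preserves the original per-row behavior.
template <bool PerColumnBias = false>
struct BiasBroadcast {
  static constexpr int kRowStride = PerColumnBias ? 0 : 1;
  static constexpr int kColStride = PerColumnBias ? 1 : 0;

  template <class T>
  static T load(T const* bias_vector, int row, int col) {
    return bias_vector[row * kRowStride + col * kColStride];
  }
};
```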
---------
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Ali Hassani <ali@hippoml.com>
* Fix inline PTX escaping for predicates.
Prevents `error: invalid % escape in inline assembly string` when compiling with clang.
* More double-escaping.
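For context, a minimal sketch of the rule clang enforces (the function below is illustrative, not CUTLASS code): `%` starts an operand placeholder in an asm template string, so a literal PTX `%`, as in special-register or predicate names, must be doubled.

```cpp
// Illustrative only: reading the %laneid special register. Inside the
// template string, %0 is an operand placeholder, so the literal PTX
// register must be written %%laneid or clang rejects the string.
__device__ unsigned lane_id() {
  unsigned id;
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
  return id;
}
```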
Currently, the default constructor of
`PredicatedTileAccessIteratorParams` will invoke undefined behavior in
its invocation of the `initialize` function. Specifically, it will
attempt to read from the uninitialized variables
`desc.element_size_bits` and `desc.advance_rank`. This commit changes
the default constructors of both `*Params` and `*Desc` to
zero-initialize all uninitialized members.
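A condensed sketch of the bug and the fix (member names follow the commit text; the real structs are more elaborate):

```cpp
#include <cstdint>

// The fix: members are zero-initialized instead of left indeterminate,
// so the default constructor's call chain no longer reads garbage.
struct Desc {
  int element_size_bits = 0;
  int advance_rank = 0;
};

struct Params {
  int64_t stride = 0;

  void initialize(Desc const& desc) {
    // Reads desc.element_size_bits and desc.advance_rank; with the old
    // default-constructed Desc, these reads were undefined behavior.
    stride = (desc.advance_rank != 0) ? desc.element_size_bits / 8 : 0;
  }

  Params() { initialize(Desc{}); }
};
```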
* Remove unused variables
* Qualify calls to make_fragment_? from templated base class.
Fixes clang build error.
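The underlying C++ rule, sketched with simplified names: unqualified names are not looked up in a dependent base class during template parsing, and clang enforces this strictly, so the calls are qualified with `this->`.

```cpp
// Simplified sketch of the two-phase-lookup issue clang rejects.
template <class Fragment>
struct Base {
  Fragment make_fragment_A() const { return Fragment{}; }
};

template <class Fragment>
struct Derived : Base<Fragment> {
  Fragment run() const {
    // return make_fragment_A();     // error: name not found (dependent base)
    return this->make_fragment_A();  // qualified: resolved at instantiation
  }
};
```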
* Add missing `#include <cstdio>`
* Various changes to fix clang compile errors.
* More changes to fix clang build.
Remaining issues:
- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.
* Fix remaining clang build issues.
* Qualify `cute::rank()` calls.
* Qualify some more calls that are otherwise ambiguous between the `cute` and `std` namespaces.
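A representative instance of the pattern (simplified; CuTe's real overloads differ): when an argument is a std type, ADL pulls candidates from namespace `std`, and if `cute` is also visible the unqualified call becomes ambiguous, so call sites spell out the namespace.

```cpp
#include <cstddef>
#include <tuple>

namespace cute {
// Simplified stand-in for a CuTe utility overloaded on tuples.
template <std::size_t I, class... Ts>
constexpr decltype(auto) get(std::tuple<Ts...> const& t) {
  return std::get<I>(t);
}
}  // namespace cute

template <class Tuple>
constexpr auto first(Tuple const& t) {
  // With 'using namespace cute' in scope, an unqualified get<0>(t) also
  // finds std::get via ADL on std::tuple, making the call ambiguous.
  return cute::get<0>(t);
}
```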
* Double-escape special registers in inline asm.
* small change
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Follow-up to #1224.
A change in the stream-K threadblock swizzle constructor since 3.3 breaks
single-source GEMM with fused epilogue and stream-K. Multi-source was
already corrected.
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
* Release 3.3.0
Adds support for mixed-precision GEMMs on Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to the Python interface
Enhancements to sub-byte type handling in CuTe
Several other bug fixes and performance improvements.
* minor doc update
* set kIsHeavy member variables
* correct kIsHeavy value for Tanh
* set kIsHeavy=false for HardSwish
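`kIsHeavy` is a compile-time flag that epilogue code can use to treat expensive activations differently; a condensed sketch of the settings these commits make (functor bodies elided):

```cpp
// Condensed sketch of the activation-functor flags touched above.
struct Tanh {
  static constexpr bool kIsHeavy = true;   // transcendental: heavy
  // operator() elided
};

struct HardSwish {
  static constexpr bool kIsHeavy = false;  // a few FMAs and comparisons: cheap
  // operator() elided
};
```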
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Passing warp-level mixed input F16*(S8/U8) tests
* passing device-level mixed input F16*(S8/U8) tests
* add to profiler - I8 (111 TFLOPs), U8 (123 TFLOPs)
* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)
* Speedup reference compilation (REVERT THIS COMMIT)
* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)
* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs
* BF16 * S8 (142 TFLOPs)
* Handle mixed-input upcast on OperandA (support [S8|U8]*[F16|BF16]; see the sketch after this list)
* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast
* Add device-level test and profiler support for upcast on operand A
* Move shfl before the cvt and reduce #shfls by 1/2
* fix smem_usage calculation for mixed_input types
* uncomment the stuff (getting ready for merge)
* profiler changes and mixed-input reference
* mixed input reference are in a new file
* use platform instead of std
* comments and typo only
* Use CreateGemmOperator and delete CreateMixedInputGemmOperator
* copyright for new files
* rebase follow-up
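The core idea behind these commits, as a hedged sketch (the function below is illustrative, not the shipped conversion): the narrow integer operand is upcast in registers to the wider floating-point type so both operands can feed the F16/BF16 tensor-core path; the listed speedups come from replacing naive conversion with packed-arithmetic sequences and fewer shuffles.

```cpp
#include <cstdint>
#include <cuda_fp16.h>

// Illustrative only: naive unpacking of four packed s8 values to f16.
// The fast paths above instead use packed add/sub tricks on f16x2 and
// halve the number of shuffles, which is where the TFLOPs gains come from.
__device__ void upcast_s8x4_to_f16x4(uint32_t packed, __half2 out[2]) {
  int8_t v0 = static_cast<int8_t>(packed & 0xFF);
  int8_t v1 = static_cast<int8_t>((packed >> 8) & 0xFF);
  int8_t v2 = static_cast<int8_t>((packed >> 16) & 0xFF);
  int8_t v3 = static_cast<int8_t>((packed >> 24) & 0xFF);
  out[0] = __halves2half2(__int2half_rn(v0), __int2half_rn(v1));
  out[1] = __halves2half2(__int2half_rn(v2), __int2half_rn(v3));
}
```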
When I use cutlass::epilogue::thread::LinearCombinationSigmoid, I encounter this error:
cutlass/include/cutlass/array.h(1549): error: no operator "-" matches these operands
Moving operator "-" from line 1549 to 1548 solves this error.
* [WIP] GEMM StreamK w/ Fused Epilogue
* Adds Gemm Streamk with Fused Epilogue kernel level struct.
* Mostly based on Gemm with Fused Epilogue,
* Requires a new epilogue
* Work in progress
* [WIP] StreamK support for GemmUniversalWithBroadcast
* Based on how StreamK is allowed in GemmUniversal
* Untested and a work in progress
* Minor fixes
* [WIP] It compiles!
It is almost certainly incorrect, but we're past getting the templates
to match, so checkpointing.
* Correction to reference kernel
* Fix typo
* Added MSE measurement
* Switch back to reference kernel + host for loop
Still WIP. Now we're getting an even larger MSE, and it shows up on
both basic Split-K and Stream-K.
* Fix typos
* Fix broadcast vector + requested changes
* Comment typo
* Small int option and more
* Fix incorrect condition on source needed
* Requested changes
* I think I got it?
* Bias vector should be stride 0
* Two-source added!
* Typos
* Merge examples
* Bring back vector row offset
Just to ensure consistency with universal GEMM with fused epilogue
* Base arguments and params structs for StreamK
* StreamK epilogue with broadcast now inherits from the original
* undo params_streamk_base.h
---------
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Remove references to device-only intrinsics when compiling for host.
Currently, we attempt to use the `__device__`-only functions
`__cvta_generic_to_shared` and `__nvvm_get_smem_pointer` when compiling
`cute::cast_smem_ptr_to_uint` for the host on Clang. This results in a
compilation error, as expected. This commit changes the definition of
the `*_ACTIVATED` macros so that they are only true when `__CUDA_ARCH__`
is defined; that is, when compiling for the device.
Additionally, the declaration of `__nvvm_get_smem_pointer`
is currently only visible during the device compilation pass when
compiling with NVCC; this commit makes the declaration visible during
host compilation with the `__device__` annotation.
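A sketch of the two changes (the macro name and the exact signature are simplified; CUTLASS's actual `*_ACTIVATED` macros and declarations differ):

```cpp
#include <cstdint>

// Gating on __CUDA_ARCH__, which is only defined during the device
// compilation pass, keeps the intrinsic-using branch out of host builds.
#if defined(__CUDA_ARCH__)
#  define SMEM_PTR_CAST_ACTIVATED 1  // hypothetical macro name
#else
#  define SMEM_PTR_CAST_ACTIVATED 0
#endif

// Declared with __device__ so the declaration is visible (and harmless)
// during host compilation, while remaining callable only from device code.
extern "C" __device__ uint32_t __nvvm_get_smem_pointer(void* ptr);
```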
* Annotate cute::cast_smem_ptr_to_uint as device-only.
The implementation of `cute::cast_smem_ptr_to_uint` currently fails
unchecked when reached from host code, and the only host implementation I
can think of -- casting a probably-64-bit pointer to 32 bits somehow --
doesn't make sense to implement. This commit marks the function as
device-only so that it can't accidentally be used in host code.
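A condensed sketch of the annotation (signature simplified): marking the function `__device__` turns an accidental host call into a compile-time error rather than an unchecked failure.

```cpp
#include <cstdint>

// Device-only: truncating a generic 64-bit host pointer to 32 bits has
// no meaningful host equivalent, so no host overload is provided.
__device__ inline uint32_t cast_smem_ptr_to_uint(void const* ptr) {
  return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
}
```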
* small change
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>