cutlass

Author	SHA1	Message	Date
Manish Gupta	ff02da2667	Fx parallel split-k (#1116 )	2023-10-06 12:02:40 -04:00
Krzysztof Lecki	4082fed85a	Add missing int64 and uint64 overloads for conj (#1127 ) Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>	2023-10-05 20:01:44 -04:00
Fabian Schuetze	5f13dcad78	set kIsHeavy member variables (#1012 ) * set kIsHeavy member variables * correct kIsHeavy value for Tanh * set kIsHeavy=false for HardSwish --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-10-04 12:38:36 -04:00
Kyle Gerard Felker	61a38f83dc	Add #include <limits> to platform.h (#1121 ) Closes #1118	2023-10-02 21:41:25 -04:00
masahi	ff61a49dd1	Allow changing epsilon parameter in RMS norm kernel (#1112 )	2023-10-02 20:40:28 -04:00
Lequn Chen	26986bbc60	Fix type typo in rmsnorm (#1119 ) Initially the variable `h4` is `half4`, but its last two fields are not used. Based on the semantics and the context, I believe it should be `half2`.	2023-10-02 20:40:04 -04:00
Manish Gupta	7d8317a63e	Support for Mixed Input TensorOp (#1084 ) * Passing warp-level mixed input F16(S8/U8) tests passing device-level mixed input F16(S8/U8) tests add to profiler - I8 (111 TFLOPs), U (123 TFLOPs) * fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs) * Speedup reference compilation (REVERT THIS COMMIT) * wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s) * Improve s8->f16 cvt and support bf16u8 @158 TFLOPs BF16 * S8 (142 TFLOPs) * Handle mixed-input upcast on OperandA (Support [S8\|U8][F16\|BF16] rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast * Add device-level test and profiler support for upcast on operand A * Move shfl before the cvt and reduce #shfls by 1/2 * fix smem_usage calculation for mixed_input types * uncomment the stuff (getting ready for merge) * profiler changes and mixed-input reference * mixed input reference are in a new file * use platform instead of std * comments and typo only * Use CreateGemmOperator and delete CreateMixedInputGemmOperator * copyright for new files * rebase follow-up	2023-09-27 11:18:30 -04:00
Manish Gupta	5cd735c48e	Fix Parallel Split-K on Gemm Operation Profiler (#1109 ) * Debug and fix for parallel split-k in profiler * restore debug files and remove prints	2023-09-26 17:28:00 -04:00
xuhaoran	67ae8e0603	Change the position of minus sign in line1549 array.h (#1091 ) when I use cutlass::epilogue:🧵:LinearCombinationSigmoid, I encounter the this error: cutlass/include/cutlass/array.h(1549): error: no operator "-" matches these operands Moving operator "-" from line 1549 to 1548 can solve this error	2023-09-26 17:26:39 -04:00
ZCHNO	14f69bddc8	[fix] fix comparison operator for integer_subbyte (#1090 )	2023-09-26 17:26:12 -04:00
ANIKET SHIVAM	90d3b0fb18	CUTLASS 3.2.1 (#1113 ) * Updates for 3.2.1 release. * Minor fix in gemm op profiler for raster order. * Add scheduler mapping for raster order in the kernels.	2023-09-26 17:24:26 -04:00
reed	e0aaa3c3b3	fix GmmaDescriptor print format string error (#1102 )	2023-09-19 23:27:58 -04:00
Vadim Markovtsev	8783c41851	Replace 0x1f with 0xffffffff in __shfl_sync (#1097 ) This fixes compatibility with H100 and resolves #1094	2023-09-18 19:58:19 -04:00
Yujia Zhai	6407bcdf0a	fix matrix B indices (#1089 )	2023-09-12 14:04:18 -04:00
tpoisonooo	a77b2c9cb8	style(examples): typo (#1080 ) * Update ampere_tensorop_conv2dfprop.cu learning cutlass, PR a typo. * Update ampere_gemm_operand_reduction_fusion.cu	2023-09-11 10:13:22 -04:00
ANIKET SHIVAM	34bbadd3ff	standarize fp8 generator (#1078 )	2023-09-07 14:36:33 -04:00
Driss Guessous	88c0d7c726	make only visible on device (#1071 )	2023-09-07 13:00:46 -04:00
Vijay Thakkar	e01b9b5029	Shard gemm reference templates into multiple TUs for parallel compilation (#1043 ) * Split apart gemm reference templates into multiple TUs for parallel compilation * remove old files * better balancing of ref kernels across TUs * remove 3 new added refcheck kernels and some un-necessary fp8 library instances to reduce lib size * remove auto fp8 kernels * remove some redundant kernels	2023-08-30 16:46:30 -04:00
Aman Gupta Karmani	34fd98056b	fix cinttypes issue with STDC_FORMAT_MACROS (#1068 ) * fix cinttypes issue with STDC_FORMAT_MACROS * Update mma_sm90_desc.hpp * Update mma_sm90_desc.hpp --------- Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2023-08-29 14:59:33 -04:00
Ying Zhang	3a8f57a3c8	Add simple hash and eq methods for gemm_operations. (#1053 )	2023-08-27 20:41:57 -04:00
reed	6673df0e48	fix typos (#1059 )	2023-08-27 00:49:26 -04:00
Lufang Chen	7618e9bfd8	Fix numeric conversion warning (#1021 ) * fix numeric conversion unused var * update --------- Co-authored-by: Lufang CHEN 陈橹方 <lufang.chen@nio.com>	2023-08-27 00:42:44 -04:00
ANIKET SHIVAM	a88c41cf8d	Updates for 3.2 release (#1065 )	2023-08-25 23:05:46 -04:00
reed	27de343535	Add one Publication which is inspired by cutlass (#1022 )	2023-08-22 10:00:17 -04:00
Allard Hendriksen	2a9fa23e06	Avoid cute::print compiler warnings with -Wformat-security (#1041 ) Fixes issue #1040.	2023-08-18 14:38:27 -04:00
zhu jianjiang	2e56cfabee	fix typo (#1047 )	2023-08-18 14:08:26 -04:00
lorenzo chelini	3930f709ce	Fix typo in `0x_gemm_tutorial.md` (#1035 )	2023-08-17 10:52:20 -04:00
Haibin Lin	7e5ee8b7bf	[doc] fix: fix typos in the comment (#1049 )	2023-08-16 11:39:25 -04:00
Sophia Wisdom	2d9a557427	torch.bfloat16 support in cutlass python (#1037 ) * torch.bfloat16 support in cutlass python * Update datatypes.py	2023-08-16 11:38:53 -04:00
ANIKET SHIVAM	4575443d44	CUTLASS 3.2 (#1024 ) * CUTLASS 3.2	2023-08-07 20:50:32 -04:00
Xianyao Zhang	a0d787b746	Fix one publication (#1019 )	2023-07-28 11:40:17 -04:00
Sophia Wisdom	d20f3a9542	spelling (#1007 ) logicial -> logical	2023-07-20 14:41:11 -04:00
Tianqi Zhang (张天启)	8e85580859	fix layout bug (#1006 )	2023-07-19 14:26:01 -04:00
dan_the_3rd	146d314057	Update fMHA kernels (#992 ) * Update fMHA kernels Upstream recent changes to fMHA that we did in xFormers. Previous version in CUTLASS: facebookresearch/xformers@b6be33a Updating to: facebookresearch/xformers@55a4798 * minor changes * make var work --------- Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-07-12 22:30:46 -04:00
masahi	f679663224	Add RMS norm (#979 )	2023-07-10 21:31:27 -04:00
ChangyouSiom	e066ced33b	fix epilogue iterator error (#995 ) * fix epilogue iterator error * fix epilogue iterator error --------- Co-authored-by: maxiao <maxiao@cowarobot.com>	2023-07-10 21:30:31 -04:00
Nathan Wang	9b923dd4c4	fix minor typos (#984 )	2023-07-05 09:23:01 -04:00
q.yao	f6d42f2dd0	add library_dirs (#977 )	2023-06-14 12:09:12 -04:00
ANIKET SHIVAM	473a67073e	Fix Int8 and TF32 generator (#976 )	2023-06-12 12:32:52 -04:00
Jack Kosaian	87349d3496	Add grouped b2b GEMM (#970 )	2023-06-05 17:16:57 -04:00
Vijay Thakkar	fde824af21	Update Hopper performance plot for CUTLASS 3.1 + CTK 12.1 (#967 )	2023-06-01 14:52:40 -04:00
Jack Kosaian	7dbf423763	Add conversion from ElementBias to ElementCompute (#961 )	2023-05-26 23:08:36 -04:00
Haicheng Wu	6f47420213	Update README.md	2023-05-24 12:40:31 -04:00
Haicheng Wu	4638250469	Update CHANGELOG.md	2023-05-24 12:39:42 -04:00
Haicheng Wu	7859fe322a	Update PUBLICATIONS.md	2023-05-24 12:36:12 -04:00
Aleksandar Samardžić	d3e72719b4	Add support for sparse GEMM with row broadcasted bias vector (#951 )	2023-05-24 10:25:05 -04:00
Ali Hassani	b4ab501767	Adds CUDA path for x86-64 (#957 )	2023-05-24 10:21:25 -04:00
ANIKET SHIVAM	f079619f5e	More updates for 3.1 (#958 ) * Updates for 3.1 * Minor change * doc link fix * Minor updates	2023-05-24 10:17:16 -04:00
Ali Hassani	13f413493a	Stream-K with broadcast (#892 ) * [WIP] GEMM StreamK w/ Fused Epilogue * Adds Gemm Streamk with Fused Epilogue kernel level struct. * Mostly based on Gemm with Fused Epilogue, * Requires a new epilogue * Work in progress * [WIP] StreamK support for GemmUniversalWithBroadcast * Just based off of how StreamK is allowed in GemmUniversal * Untested and a work in progress * Minor fixes * [WIP] It compiles! It is almost certainly incorrect, but we're past getting the templates to match, so checkpointing. * Correction to reference kernel * Fix typo * Added MSE measurement * Switch back to reference kernel + host for loop Still WIP. Now we're getting even a larger MSE, but it's both on basic Split-K and Stream-K. * Fix typos * Fix broadcast vector + requested changes * Comment typo * Small int option and more * Fix incorrect condition on source needed * Requested changes * I think I got it? * Bias vector should be stride 0 * Two source added! * Typos * Merge examples * Bring back vector row offset Just to ensure consistency with universal gemm with fused epilogue * Base arguments and params structs for StreamK * StreamK epilogue with broadcast now inherits the original * undo params_streamk_base.h --------- Co-authored-by: Ali Hassani <ahassanijr@gmail.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-05-22 19:05:06 -04:00
Haicheng Wu	6fbc0d3380	Update layout.md	2023-05-17 20:12:58 -04:00

1 2 3 4 5 ...

431 Commits