cutlass

Author	SHA1	Message	Date
xuhaoran	67ae8e0603	Change the position of minus sign in line1549 array.h (#1091 ) when I use cutlass::epilogue::thread::LinearCombinationSigmoid, I encounter the this error: cutlass/include/cutlass/array.h(1549): error: no operator "-" matches these operands Moving operator "-" from line 1549 to 1548 can solve this error	2023-09-26 17:26:39 -04:00
ZCHNO	14f69bddc8	[fix] fix comparison operator for integer_subbyte (#1090 )	2023-09-26 17:26:12 -04:00
ANIKET SHIVAM	90d3b0fb18	CUTLASS 3.2.1 (#1113 ) * Updates for 3.2.1 release. * Minor fix in gemm op profiler for raster order. * Add scheduler mapping for raster order in the kernels.	2023-09-26 17:24:26 -04:00
reed	e0aaa3c3b3	fix GmmaDescriptor print format string error (#1102 )	2023-09-19 23:27:58 -04:00
Driss Guessous	88c0d7c726	make only visible on device (#1071 )	2023-09-07 13:00:46 -04:00
Aman Gupta Karmani	34fd98056b	fix cinttypes issue with STDC_FORMAT_MACROS (#1068 ) * fix cinttypes issue with STDC_FORMAT_MACROS * Update mma_sm90_desc.hpp * Update mma_sm90_desc.hpp --------- Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2023-08-29 14:59:33 -04:00
reed	6673df0e48	fix typos (#1059 )	2023-08-27 00:49:26 -04:00
Lufang Chen	7618e9bfd8	Fix numeric conversion warning (#1021 ) * fix numeric conversion unused var * update --------- Co-authored-by: Lufang CHEN 陈橹方 <lufang.chen@nio.com>	2023-08-27 00:42:44 -04:00
ANIKET SHIVAM	a88c41cf8d	Updates for 3.2 release (#1065 )	2023-08-25 23:05:46 -04:00
Allard Hendriksen	2a9fa23e06	Avoid cute::print compiler warnings with -Wformat-security (#1041 ) Fixes issue #1040.	2023-08-18 14:38:27 -04:00
Haibin Lin	7e5ee8b7bf	[doc] fix: fix typos in the comment (#1049 )	2023-08-16 11:39:25 -04:00
ANIKET SHIVAM	4575443d44	CUTLASS 3.2 (#1024 ) * CUTLASS 3.2	2023-08-07 20:50:32 -04:00
Sophia Wisdom	d20f3a9542	spelling (#1007 ) logicial -> logical	2023-07-20 14:41:11 -04:00
ChangyouSiom	e066ced33b	fix epilogue iterator error (#995 ) * fix epilogue iterator error * fix epilogue iterator error --------- Co-authored-by: maxiao <maxiao@cowarobot.com>	2023-07-10 21:30:31 -04:00
Jack Kosaian	87349d3496	Add grouped b2b GEMM (#970 )	2023-06-05 17:16:57 -04:00
Jack Kosaian	7dbf423763	Add conversion from ElementBias to ElementCompute (#961 )	2023-05-26 23:08:36 -04:00
Aleksandar Samardžić	d3e72719b4	Add support for sparse GEMM with row broadcasted bias vector (#951 )	2023-05-24 10:25:05 -04:00
ANIKET SHIVAM	f079619f5e	More updates for 3.1 (#958 ) * Updates for 3.1 * Minor change * doc link fix * Minor updates	2023-05-24 10:17:16 -04:00
Ali Hassani	13f413493a	Stream-K with broadcast (#892 ) * [WIP] GEMM StreamK w/ Fused Epilogue * Adds Gemm Streamk with Fused Epilogue kernel level struct. * Mostly based on Gemm with Fused Epilogue, * Requires a new epilogue * Work in progress * [WIP] StreamK support for GemmUniversalWithBroadcast * Just based off of how StreamK is allowed in GemmUniversal * Untested and a work in progress * Minor fixes * [WIP] It compiles! It is almost certainly incorrect, but we're past getting the templates to match, so checkpointing. * Correction to reference kernel * Fix typo * Added MSE measurement * Switch back to reference kernel + host for loop Still WIP. Now we're getting even a larger MSE, but it's both on basic Split-K and Stream-K. * Fix typos * Fix broadcast vector + requested changes * Comment typo * Small int option and more * Fix incorrect condition on source needed * Requested changes * I think I got it? * Bias vector should be stride 0 * Two source added! * Typos * Merge examples * Bring back vector row offset Just to ensure consistency with universal gemm with fused epilogue * Base arguments and params structs for StreamK * StreamK epilogue with broadcast now inherits the original * undo params_streamk_base.h --------- Co-authored-by: Ali Hassani <ahassanijr@gmail.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-05-22 19:05:06 -04:00
wll	19c4a4815e	replace division with multiplication in GELU (#942 )	2023-05-12 10:57:18 -04:00
Gregory Meyer (gregjm)	fcfbd23e26	Fix host compilation of cute::cast_smem_ptr_to_uint. (#940 ) * Remove references to device-only intrinsics when compiling for host. Currently, we attempt to use the `__device__`-only functions `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer` when compiling `cute::cast_smem_ptr_to_uint` for the host on Clang. This results in a compilation error, as expected. This commit changes the definition of the `_ACTIVATED` macros so that they are only true when `__CUDA_ARCH__` is defined; that is, when compiling for the device. Additionally, the declaration of `__nvvm_get_smem_pointer` is currently only visible during the device compilation pass when compiling with NVCC; this commit makes the declaration visible during host compilation with the `__device__` annotation. Annotate cute::cast_smem_ptr_to_uint as device-only. The implementation of `cute::cast_smem_ptr_to_uint` is currently an unchecked failure on host code, and the only host implementation I can think of -- casting a probably-64-bit pointer to 32 bits somehow -- doesn't make sense to implement. This commit marks this function as device-only so that it can't be accidentally used on host code. * small change --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-05-10 00:06:54 -04:00
Gregory Meyer (gregjm)	b250faccd3	Make operator() const-correct and add missing static functions. (#936 ) * Make operator() const-correct and add missing static functions. Currently, `Converter::operator()` requires a mutable object to invoke, and there are missing `static result_type convert(source_type const & source)` overloads for certain partial specializations of `Converter` objects. This commit makes `operator()` const-correct and adds missing function overloads where appropriate. * minor changes * format --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-05-09 16:33:01 -04:00
Janusz Lisiecki	24c8b7d8a2	Fix cuTE compilation with clang (#939 ) - clang 1.14 complains about missing function from a host call: cutlass/include/cute/arch/util.hpp:106:32: error: no matching function for call to '__cvta_generic_to_shared' return static_cast<uint32_t>(__cvta_generic_to_shared(ptr)); - fixes this by defining CUTE_HOST_DEVICE for clang as well Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>	2023-05-09 09:51:45 -04:00
ANIKET SHIVAM	7c04f95415	Updates for 3.1 (#932 )	2023-04-29 09:34:27 -04:00
Gregory Meyer (gregjm)	6f8596ce3f	Add missing #include directive to get access to cutlass::epilogue::thread::ScaleType. (#925 ) Currently, the `LinearCombinationClamp` header file is not standalone, and must have the definition of `cutlass::epilogue::thread::ScaleType` already available when it is `#include`d.	2023-04-28 20:02:41 -04:00
Adnan Akhundov	fe2f491dd7	Get SM count with cudaDeviceGetAttribute in KernelHardwareInfo (#927 )	2023-04-28 13:23:23 -04:00
Jakub Szuppe	180c5629bf	Add missing checks for NVRTC in CuTe (#921 )	2023-04-25 12:52:43 -04:00
Guray Ozen	43cfbe0086	Allow L2 prefect for clang compiler (#914 )	2023-04-15 01:23:22 -04:00
ANIKET SHIVAM	d572cc1aab	CUTLASS 3.1 (#915 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-04-14 23:19:34 -04:00
Adnan Akhundov	0435979f59	Remove const from 3.x GemmUniversalAdapter::operator() (#905 )	2023-04-03 20:30:51 -04:00
Gregory Meyer (gregjm)	ecbd24566c	Enable shared memory intrinsics and ldmatrix PTX on Clang. (#754 ) * Enable shared memory intrinsics and ldmatrix PTX on Clang. This commit adds preprocessor checks to enable the shared memory intrinsics `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer`, as well as the `ldmatrix` PTX instructions, on Clang. Preventing these intrinsics from being used is a significant latency regression on Clang. * refine the macro --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-03-31 21:42:24 -04:00
Feng Shijie	bc36122c3f	[layout] Fix AffineRank2ColumnMajor::packed() (#879 ) * [layout] Fix AffineRank2ColumnMajor::packed() * correct affine2row::packed --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-03-29 11:59:48 -04:00
Vijay Thakkar	15d9d31f1f	CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897 )	2023-03-29 10:42:40 -04:00
ptrblck	1eef5c3cf1	add guards for __CUDA_ARCH__ >= 530 (#891 ) * add guards for sm>=70 * drop guard to 530	2023-03-28 17:47:10 -04:00
Alexander Zinoviev	42290f5d1c	Fix for dangling pointers (#885 )	2023-03-25 01:15:14 -04:00
Vijay Thakkar	209faf7b94	remove spurious comma (#871 )	2023-03-20 17:25:27 -04:00
Jack Kosaian	6116706c96	Set batch_strides on Params::update (#883 )	2023-03-20 17:07:47 -04:00
Nikita Shulga	2670b973dd	Fix sign-compare warning in `reorder_array` (#869 ) `std::vector<T>::size_type` is unsigned type, so let's iterate over unsigned type as well Discovered, while trying to enable PyTorch building without `-Wno-sign-compare` warning suppression, see https://github.com/pytorch/pytorch/actions/runs/4418987999/jobs/7746850762#step:10:10532	2023-03-20 17:07:24 -04:00
Vijay Thakkar	af332d4aa9	Add missing comma in cutlass/arch/mma_sm90.h (#862 )	2023-03-14 12:04:28 -04:00
Edward Rees	86cae03cea	expose StoreT parameter for potential speed (#838 ) * expose StoreT parameter for potential speed * add storeT to more elementwise --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-03-10 12:58:17 -05:00
Stepan Tezyunichev	29801e348a	Hide streams and typinfo from nvrtc (#853 ) * Hide streams and typinfo from nvrtc * Use __CUDACC_RTC__ instead CUDA_ARCH for guard	2023-03-09 23:24:47 -05:00
Alexander Pivovarov	7e370c9637	Fix typos 2 (#842 ) Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2023-03-09 23:22:56 -05:00
ANIKET SHIVAM	c4f6b8c6bc	Updates for 3.0 (#857 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-03-09 15:27:40 -05:00
psaab	a31b43b3f3	Re-enable aarch64 support lost in `277bd6e537` (#846 )	2023-03-02 11:17:21 -05:00
dan_the_3rd	f396cdd15c	ex24[gemm_grouped]: Allow to change layout/dtype (#841 ) * ex24[gemm_grouped]: Allow to change layout/dtype * Address suggestion from @jackkosaian --------- Co-authored-by: danthe3rd <danthe3rd>	2023-03-01 07:13:51 -05:00
Alexander Pivovarov	92ebbf1dc4	Fix typos (#839 )	2023-02-27 11:17:58 -05:00
Haicheng Wu	65688c2a87	streamk fix (#836 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-23 16:35:08 -05:00
Yuxin Wu	95f673ecf7	Update base_grouped.h (#832 )	2023-02-21 14:48:30 -05:00
Haicheng Wu	91b8de8d32	streamk fix (#830 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-20 11:03:16 -05:00
Sujan Kumar Gonugondla	d8359c804b	Changes to iterators to support s8 gemm with f16 outputs (#812 ) * Changes to iterators to support s8 gemm with f16 outputs * should work --------- Co-authored-by: Sujan Gonugondla <gsujan@amaon.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-16 18:37:51 -05:00

172 Commits