cutlass

Author	SHA1	Message	Date
ANIKET SHIVAM	d572cc1aab	CUTLASS 3.1 (#915 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-04-14 23:19:34 -04:00
Adnan Akhundov	0435979f59	Remove const from 3.x GemmUniversalAdapter::operator() (#905 )	2023-04-03 20:30:51 -04:00
Gregory Meyer (gregjm)	ecbd24566c	Enable shared memory intrinsics and ldmatrix PTX on Clang. (#754 ) * Enable shared memory intrinsics and ldmatrix PTX on Clang. This commit adds preprocessor checks to enable the shared memory intrinsics `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer`, as well as the `ldmatrix` PTX instructions, on Clang. Preventing these intrinsics from being used is a significant latency regression on Clang. * refine the macro --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-03-31 21:42:24 -04:00
Feng Shijie	bc36122c3f	[layout] Fix AffineRank2ColumnMajor::packed() (#879 ) * [layout] Fix AffineRank2ColumnMajor::packed() * correct affine2row::packed --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-03-29 11:59:48 -04:00
Vijay Thakkar	15d9d31f1f	CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897 )	2023-03-29 10:42:40 -04:00
ptrblck	1eef5c3cf1	add guards for __CUDA_ARCH__ >= 530 (#891 ) * add guards for sm>=70 * drop guard to 530	2023-03-28 17:47:10 -04:00
Alexander Zinoviev	42290f5d1c	Fix for dangling pointers (#885 )	2023-03-25 01:15:14 -04:00
Vijay Thakkar	209faf7b94	remove spurious comma (#871 )	2023-03-20 17:25:27 -04:00
Jack Kosaian	6116706c96	Set batch_strides on Params::update (#883 )	2023-03-20 17:07:47 -04:00
Nikita Shulga	2670b973dd	Fix sign-compare warning in `reorder_array` (#869 ) `std::vector<T>::size_type` is an unsigned type, so iterate over an unsigned type as well. Discovered while trying to build PyTorch without the `-Wno-sign-compare` warning suppression; see https://github.com/pytorch/pytorch/actions/runs/4418987999/jobs/7746850762#step:10:10532	2023-03-20 17:07:24 -04:00
Vijay Thakkar	af332d4aa9	Add missing comma in cutlass/arch/mma_sm90.h (#862 )	2023-03-14 12:04:28 -04:00
Edward Rees	86cae03cea	expose StoreT parameter for potential speed (#838 ) * expose StoreT parameter for potential speed * add storeT to more elementwise --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-03-10 12:58:17 -05:00
Stepan Tezyunichev	29801e348a	Hide streams and typeinfo from nvrtc (#853 ) * Hide streams and typeinfo from nvrtc * Use __CUDACC_RTC__ instead of CUDA_ARCH for guard	2023-03-09 23:24:47 -05:00
Alexander Pivovarov	7e370c9637	Fix typos 2 (#842 ) Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2023-03-09 23:22:56 -05:00
ANIKET SHIVAM	c4f6b8c6bc	Updates for 3.0 (#857 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-03-09 15:27:40 -05:00
psaab	a31b43b3f3	Re-enable aarch64 support lost in `277bd6e537` (#846 )	2023-03-02 11:17:21 -05:00
dan_the_3rd	f396cdd15c	ex24[gemm_grouped]: Allow to change layout/dtype (#841 ) * ex24[gemm_grouped]: Allow to change layout/dtype * Address suggestion from @jackkosaian --------- Co-authored-by: danthe3rd <danthe3rd>	2023-03-01 07:13:51 -05:00
Alexander Pivovarov	92ebbf1dc4	Fix typos (#839 )	2023-02-27 11:17:58 -05:00
Haicheng Wu	65688c2a87	streamk fix (#836 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-23 16:35:08 -05:00
Yuxin Wu	95f673ecf7	Update base_grouped.h (#832 )	2023-02-21 14:48:30 -05:00
Haicheng Wu	91b8de8d32	streamk fix (#830 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-20 11:03:16 -05:00
Sujan Kumar Gonugondla	d8359c804b	Changes to iterators to support s8 gemm with f16 outputs (#812 ) * Changes to iterators to support s8 gemm with f16 outputs * should work --------- Co-authored-by: Sujan Gonugondla <gsujan@amaon.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-16 18:37:51 -05:00
Haicheng Wu	9fb38ac048	fix alignmentC=8 for imma N=128 (#822 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-15 12:06:00 -05:00
Shuai Shao	ce8597dc14	Fix type bug in conv2d/gemm with broadcast (#796 ) add ElementVector --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-09 20:53:25 -05:00
Jack Kosaian	5ff5209ed5	Add acc2smem in epilogue/threadblock/epilogue.h (#806 )	2023-02-06 22:04:16 -05:00
Jack Kosaian	5921043981	Re-enable all alignments for int accumulators (#807 )	2023-02-06 22:01:15 -05:00
Mark Hoemmen	add4ba622f	Fix GCC 8.4 + CUDA 11.4 build (#789 ) Work around a likely GCC 8.x issue with fold expressions and generic lambdas. Only use the work-around when the host compiler is GCC 8.x. This avoids any concerns about the work-around possibly hindering inlining for a critical CuTe function (product). Users can experiment with the work-around for other compilers or compiler versions by defining the following macro: CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND. Fixes https://github.com/NVIDIA/cutlass/issues/788 Co-authored-by: Mark Hoemmen <mhoemmen@nvidia.com>	2023-01-27 09:18:59 -05:00
Vijay Thakkar	277bd6e537	CUTLASS 3.0.0 (#786 ) * CUTLASS 3.0.0	2023-01-23 20:55:28 -05:00
ANIKET SHIVAM	66d9cddc83	New updates for 2.11 (#775 ) * New updates. * Minor profiler updates Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-01-20 16:32:57 -05:00
psaab	d49bef88f9	Enable aarch64 support (#779 )	2023-01-20 15:51:58 -05:00
Haicheng Wu	764b840d6f	streamk example and performance tuning (#760 ) * streamk example and performance tuning * one missing file Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-10 16:10:02 -05:00
Ali Hassani	a1046d49c1	Adds missing semicolon (#759 )	2023-01-09 21:50:46 -05:00
Gregory Meyer (gregjm)	7bdba07310	Add definitions for tag structs. (#752 ) This commit changes the declarations of MMA operator class (SIMT, Tensor Core, WMMA Tensor Core) and operator type (multiply-add and so on) to definitions. This is done so that these tag structs are no longer incomplete types, which allows the `typeid` operator to be used on these tag structs. This is necessary for these tag structs to be used as type parameters in [GoogleTest typed tests](https://google.github.io/googletest/advanced.html#typed-tests).	2023-01-06 09:46:52 -05:00
Gregory Meyer (gregjm)	c54ede3a9e	Add const overloads for iterator functions. (#753 ) This commit adds `const`-correct overloads for `Array::{begin,end,rbegin,rend}`. These overloads are necessary for usage with [the GMock Container Matchers](http://google.github.io/googletest/reference/matchers.html#container-matchers), which cast the `Container` argument to a constant reference.	2023-01-06 09:46:34 -05:00
Haicheng Wu	ff6e733fe1	restore the old epilogue for everything except streamk (#749 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-04 11:02:55 -05:00
Haicheng Wu	1e64f153b3	improve streamk load balance (#743 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-25 13:56:33 -05:00
Gregory Meyer (gregjm)	b85865d1ad	Add missing #include directives (#741 ) This commit adds two `#include` directives so that the definitions of `cutlass::gemm::warp::WarpSize` from "cutlass/gemm/warp/mma.h" and `cutlass::arch::OpClassSimt` from "cutlass/arch/mma.h" are visible to "cutlass/epilogue/threadblock/default_epilogue_simt.h". Without them, there are compiler errors when building the header standalone: ``` In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:32: error: no member named 'warp' in namespace 'cutlass::gemm'; did you mean simply 'warp'? static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ^ ./cutlass/include/cutlass/epilogue/warp/tile_iterator_simt.h:49:11: note: 'warp' declared here namespace warp { ^ In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:53: error: no member named 'WarpSize' in namespace 'cutlass::epilogue::warp' static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ~~~~~~^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:68: error: no member named 'OpClassSimt' in namespace 'cutlass::arch' static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ~~~~~~^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:82: error: no member named 'value' in the global namespace static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ~~^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:367:5: error: use of class template 'OutputTileThreadMap' requires template arguments OutputTileThreadMap, ^ ./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> { ^ In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:391:5: error: use of class template 'OutputTileThreadMap' requires template arguments OutputTileThreadMap, ^ ./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> { ^ In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:405:5: error: unknown type name 'OutputTileIterator'; did you mean 'WarpTileIterator'? OutputTileIterator, ^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:380:9: note: 'WarpTileIterator' declared here using WarpTileIterator = cutlass::epilogue::warp::TileIteratorSimtDirect2dConv< ^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:408:5: error: use of class template 'SharedLoadIterator' requires template arguments SharedLoadIterator, ^ ./cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator.h:67:7: note: template is declared here class SharedLoadIterator { ^ ```	2022-12-21 11:40:20 -05:00
ANIKET SHIVAM	38193d76e3	Updates for stream-k (#728 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-12-08 23:48:10 -05:00
Gregory Meyer (gregjm)	1d7772f218	Add missing #include directive (#727 )	2022-12-08 18:58:31 -05:00
Mike Iovine	d6117ca362	Relax stream K gemm alignment constraints (#717 ) * Relax stream K gemm alignment constraints The current alignment requirements are too strict. Make them identical to the checks for the regular universal gemm. * Revert "Relax stream K gemm alignment constraints" This reverts commit 31e80a250e2b0ac4bda2e4b437b39dc5bcd5e845. * Relax stream K gemm alignment constraints The current alignment requirements are too strict. Make them identical to the checks for the regular universal gemm. Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-07 11:17:49 -05:00
Ali Hassani	9c0518608e	Fix typos in conv problem sizes (#720 ) * Fix typos in conv problem sizes * Typos	2022-12-05 15:54:58 -05:00
Haicheng Wu	9f1f37aa21	misc (#719 ) * misc * minor Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-05 12:07:20 -05:00
Wenzhuo Liu	84213b0b8e	fix: make arch.h self contained (#714 )	2022-12-01 19:25:48 -05:00
Aditya Atluri	c975e2ccbb	release 2.11 (#703 )	2022-11-19 09:02:15 -05:00
seventh	06eb90cc0d	Fix identity sigmoid activation (#659 ) * activation support Identity * fix Sigmoid activation operator() with CUTLASS_HOST_DEVICE	2022-11-09 14:42:23 -05:00
Haicheng Wu	012c62c748	bug fixes and enhancement to gemm reductionK fusion (#682 ) * add two missing files * fix a bunch of bugs in gemm-reducek fusion and add a device interface * small changes Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-11-03 11:07:50 -04:00
dan_the_3rd	1b4e24470a	Example 43 - DualGemm (#670 ) * Ex50 wip * IS_PROFILING mode * MultiStage2 - but is slower * Add SwiGLU * Support SplitKSerial reduction Support not storing D0/D1 Cleanup code * Option to disable bias * Renumber example * Fix build * Remove references to pb_size_0 / pb_size_1 * Add support for bf16 inputs with float accum * small changes Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-10-26 14:04:42 -04:00
hlu1	9b47403b2d	Add missing CUTLASS_HOST_DEVICE (#671 )	2022-10-21 22:20:38 -04:00
dan_the_3rd	4db6a6140e	ex42: Fused MHA imported from xFormers (#662 ) * ex42: Fused MHA imported from xFormers * Remove std:: references * Support K>128 in the example * Support causal option * Support different head size for V, and different seqlength for KV * Update FLOPS counter * Remove bit_cast * fix build: Replace M_LOG2E * Add doc * Revert "Remove bit_cast" This reverts commit 9662fa86bb7c57c1a015ac0bf52cb52940fbbf80. * Explicit casts to int32_t for windows build Co-authored-by: danthe3rd <danthe3rd>	2022-10-17 10:49:33 -04:00
Ying Zhang	dadc881a96	Bug fix for gemm broadcast (#650 ) * gemm_universal_with_broadcast, +2 sources. * Revert "gemm_universal_with_broadcast, +2 sources." This reverts commit fb063251f2144a091f12c9abfce7e1713f2d1c9e. * gemm broadcast bug fix	2022-09-30 10:00:38 -04:00

144 Commits