cutlass

Author	SHA1	Message	Date
Haicheng Wu	764b840d6f	streamk example and performance tuning (#760 ) * streamk example and performance tuning * one missing file Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-10 16:10:02 -05:00
Ali Hassani	a1046d49c1	Adds missing semicolon (#759 )	2023-01-09 21:50:46 -05:00
Gregory Meyer (gregjm)	7bdba07310	Add definitions for tag structs. (#752 ) This commit changes the declarations of MMA operator class (SIMT, Tensor Core, WMMA Tensor Core) and operator type (multiply-add and so on) to definitions. This is done so that these tag structs are no longer incomplete types, which allows the `typeid` operator to be used on these tag structs. This is necessary for these tag structs to be used as type parameters in [GoogleTest typed tests](https://google.github.io/googletest/advanced.html#typed-tests).	2023-01-06 09:46:52 -05:00
Gregory Meyer (gregjm)	c54ede3a9e	Add const overloads for iterator functions. (#753 ) This commit adds `const`-correct overloads for `Array::{begin,end,rbegin,rend}`. These overloads are necessary for usage with [the GMock Container Matchers](http://google.github.io/googletest/reference/matchers.html#container-matchers), which cast the `Container` argument to a constant reference.	2023-01-06 09:46:34 -05:00
Haicheng Wu	ff6e733fe1	restore the old epilogue for everything except streamk (#749 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-04 11:02:55 -05:00
Haicheng Wu	1e64f153b3	improve streamk load balance (#743 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-25 13:56:33 -05:00
Gregory Meyer (gregjm)	b85865d1ad	Add missing #include directives (#741 ) This commit adds two `#include` directives so that the definitions of `cutlass::gemm::warp::WarpSize` from "cutlass/gemm/warp/mma.h" and `cutlass::arch::OpClassSimt` from "cutlass/arch/mma.h" are visible to "cutlass/epilogue/threadblock/default_epilogue_simt.h". Without them, there are compiler errors when building the header standalone: ``` In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:32: error: no member named 'warp' in namespace 'cutlass::gemm'; did you mean simply 'warp'? static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ^ ./cutlass/include/cutlass/epilogue/warp/tile_iterator_simt.h:49:11: note: 'warp' declared here namespace warp { ^ In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:53: error: no member named 'WarpSize' in namespace 'cutlass::epilogue::warp' static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ~~~~~~^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:68: error: no member named 'OpClassSimt' in namespace 'cutlass::arch' static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ~~~~~~^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:82: error: no member named 'value' in the global namespace static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value; ~~^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:367:5: error: use of class template 'OutputTileThreadMap' requires template arguments OutputTileThreadMap, ^ ./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> { ^ In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:391:5: error: use of class template 'OutputTileThreadMap' requires template arguments OutputTileThreadMap, ^ ./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> { ^ In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1: ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:405:5: error: unknown type name 'OutputTileIterator'; did you mean 'WarpTileIterator'? OutputTileIterator, ^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:380:9: note: 'WarpTileIterator' declared here using WarpTileIterator = cutlass::epilogue::warp::TileIteratorSimtDirect2dConv< ^ ./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:408:5: error: use of class template 'SharedLoadIterator' requires template arguments SharedLoadIterator, ^ ./cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator.h:67:7: note: template is declared here class SharedLoadIterator { ^ ```	2022-12-21 11:40:20 -05:00
ANIKET SHIVAM	38193d76e3	Updates for stream-k (#728 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-12-08 23:48:10 -05:00
Gregory Meyer (gregjm)	1d7772f218	Add missing #include directive (#727 )	2022-12-08 18:58:31 -05:00
Mike Iovine	d6117ca362	Relax stream K gemm alignment constraints (#717 ) * Relax stream K gemm alignment constraints The current alignment requirements are too strict. Make them identical to the checks for the regular universal gemm. * Revert "Relax stream K gemm alignment constraints" This reverts commit 31e80a250e2b0ac4bda2e4b437b39dc5bcd5e845. * Relax stream K gemm alignment constraints The current alignment requirements are too strict. Make them identical to the checks for the regular universal gemm. Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-07 11:17:49 -05:00
Ali Hassani	9c0518608e	Fix typos in conv problem sizes (#720 ) * Fix typos in conv problem sizes * Typos	2022-12-05 15:54:58 -05:00
Haicheng Wu	9f1f37aa21	misc (#719 ) * misc * minor Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-05 12:07:20 -05:00
Wenzhuo Liu	84213b0b8e	fix: make arch.h self contained (#714 )	2022-12-01 19:25:48 -05:00
Aditya Atluri	c975e2ccbb	release 2.11 (#703 )	2022-11-19 09:02:15 -05:00
seventh	06eb90cc0d	Fix identity sigmoid activation (#659 ) * activation support Identity * fix Sigmoid activation operator() with CUTLASS_HOST_DEVICE	2022-11-09 14:42:23 -05:00
Haicheng Wu	012c62c748	bug fixes and enhancement to gemm reductionK fusion (#682 ) * add two missing files * fix bunch of bugs of gemm-reducek fusion and add a device interface * small changes Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-11-03 11:07:50 -04:00
dan_the_3rd	1b4e24470a	Example 43 - DualGemm (#670 ) * Ex50 wip * IS_PROFILING mode * MultiStage2 - but is slower * Add SwiGLU * Support SplitKSerial reduction Support not storing D0/D1 Cleanup code * Option to disable bias * Renumber example * Fix build * Remove references to pb_size_0 / pb_size_1 * Add support for bf16 inputs with float accum * small changes Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-10-26 14:04:42 -04:00
hlu1	9b47403b2d	Add missing CUTLASS_HOST_DEVICE (#671 )	2022-10-21 22:20:38 -04:00
dan_the_3rd	4db6a6140e	ex42: Fused MHA imported from xFormers (#662 ) * ex42: Fused MHA imported from xFormers * Remove std:: references * Support K>128 in the example * Support causal option * Support different head size for V, and different seqlength for KV * Update FLOPS counter * Remove bit_cast * fix build: Replace M_LOG2E * Add doc * Revert "Remove bit_cast" This reverts commit 9662fa86bb7c57c1a015ac0bf52cb52940fbbf80. * Explicit casts to int32_t for windows build Co-authored-by: danthe3rd <danthe3rd>	2022-10-17 10:49:33 -04:00
Ying Zhang	dadc881a96	Bug fix for gemm broadcast (#650 ) * gemm_universal_with_broadcast, +2 sources. * Revert "gemm_universal_with_broadcast, +2 sources." This reverts commit fb063251f2144a091f12c9abfce7e1713f2d1c9e. * gemm broadcast bug fix	2022-09-30 10:00:38 -04:00
Wenzhuo Liu	cd37e82492	change unused class member to local var (#646 )	2022-09-28 23:52:35 -04:00
Wenzhuo Liu	7a458f00a6	fix(permute.h): incorrect comment in `Tensor5DPermute20314` (#637 ) * fix(permute.h): incorrect comment in `Tensor5DPermute20314` * typo in usage in example 39	2022-09-22 09:21:13 -04:00
Tianqi Zhang (张天启)	9f2e3faa69	fix call of GELU_Taylor in LinearCombinationGeneric (#634 )	2022-09-20 21:00:55 -04:00
Ying Zhang	a821280dc7	Gemm broadcast (#632 ) * gemm_universal_with_broadcast, +2 sources. * Revert "gemm_universal_with_broadcast, +2 sources." This reverts commit fb063251f2144a091f12c9abfce7e1713f2d1c9e. * gemm_universal_with_broadcast separated version. * Update copyright banner. * update banner	2022-09-20 10:37:12 -04:00
Andrew Kerr	fc9ebc645b	CUTLASS 2.10 bug fixes and minor updates. (#626 )	2022-09-15 16:20:33 -04:00
alexfreudenberg	2cc2c7ba1f	Add set_k_partition function (#624 ) A member function set_k_partition is required for the instantiation of cutlass::gemm::kernel::Gemm, even though SplitKSerial is false	2022-09-13 22:34:20 -04:00
ANIKET SHIVAM	e773429f7e	CUTLASS 2.10 updates (#622 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-09-12 21:26:30 -04:00
Jack Kosaian	f29d8f7ca9	Include vector in base_grouped.h (#618 )	2022-09-06 13:21:23 -04:00
ANIKET SHIVAM	b72cbf957d	CUTLASS 2.10 (#615 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-09-03 18:48:46 -04:00
Cliff Burdick	ca23ff7924	Fixed typo in class name (#608 )	2022-08-29 20:51:52 -04:00
Cliff Burdick	1c3d400b14	Added `value_type` trait to complex to make it an easier drop-in replacement for std::complex. (#607 )	2022-08-28 01:12:40 -04:00
Cliff Burdick	abafbf2afd	Missing comma in trmm header (#604 )	2022-08-25 16:07:33 -04:00
Haicheng Wu	497b499d9d	Add residual support for shmem staging iterator used in back-to-back GEMM fusion. This allows support of problem_size_0_n that is not multiple of 32. (#590 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-08-15 11:19:24 -04:00
dan_the_3rd	25ebf15d02	Ensure all arch::Mma specializations have ElementC set (#576 ) Co-authored-by: danthe3rd <danthe3rd@users.noreply.github.com>	2022-07-22 23:53:03 -04:00
Haicheng Wu	e7a61c761a	fix race condition when h < stride_h or w < stride_w (#562 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-07-12 16:37:08 -04:00
seventh	fb379eaa5b	epilogue leaky relu support ScaleType (#564 ) Co-authored-by: xuweiqi <xuweiqi117@gmail.com>	2022-07-11 17:30:55 -04:00
Bing Xu	1eb6355182	[activation] tanh (#550 ) Co-authored-by: Bing Xu <bingxu@fb.com>	2022-07-02 08:00:45 -04:00
Yujia Zhai	04a9777b87	Softmax (#546 ) * add test layernorm g-mem version * Delete include/configure directory * Delete examples/test_layernorm directory * Update gemm_with_softmax.h * Update gemm_softmax.cu * Update linear_combination.h * Update fast_math.h * remove redundant vars Co-authored-by: yujia.zhai <yujia.zhai@bytedance.com> Co-authored-by: yuzhai <yuzhai@nvidia.com>	2022-07-02 01:19:18 -04:00
Haicheng Wu	e45e773436	Update linear_combination_generic.h (#472 ) add `skip_elementwise_` to support serial splitk in `linear_combination_generic.h`	2022-06-28 07:29:38 -04:00
Haicheng Wu	9ab9110168	add leaky relu (#542 ) Authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-06-26 10:07:50 -04:00
Jack Kosaian	fa56763c25	Fix occupancy calculation for grouped GEMM (#532 )	2022-06-18 19:53:59 -04:00
LiuWei	25e26a6e51	fix bugs in linear_combination_generic.h missing include cutlass/epilogue/thread/scale_type.h (#531 )	2022-06-17 23:35:14 -04:00
Pei Sun	dceefe4f64	Increment stride correctly in warp iterator. (#516 ) Co-authored-by: peisun1115 <peis@google.com>	2022-06-06 12:33:36 -04:00
Pei Sun	c3881d097e	Fix a comment about LDSM layout. (#514 ) Co-authored-by: peisun1115 <peis@google.com>	2022-06-04 23:04:00 -04:00
Pei Sun	a29dfb1c63	Fix a bug to increment stride tile correctly (#503 ) * Fix a bug to increment stride tile correctly * Update regular_tile_access_iterator_tensor_op.h Co-authored-by: peisun1115 <peis@google.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2022-06-03 22:54:52 -04:00
Mike Iovine	c4cf0dad82	Fix init-self compiler warnings (#493 ) Fix a few warnings caused by trying to initialize a class member with itself. These warnings can turn into errors if you compile with `-Winit-self`.	2022-05-11 00:35:28 -04:00
TonyZhao	ddd8f9cf41	update float < int32_t * 4 (#488 ) Co-authored-by: 赵俊涛 <zhaojuntao@zhaojuntaos-MacBook-Pro.local>	2022-05-04 13:36:05 -04:00
Haicheng Wu	ec2b4fd85d	b2b bias vector support (#482 ) * b2b bias vector support * add files Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-04-30 04:16:15 -07:00
Stepan Tezyunichev	86ce09aed1	2.9 fixes for nvrtc (#480 ) * Use platform::is_same instead of std::is_same * Don't hide cuComplex include from nvrtc * Typo fixed * Remove comment rename	2022-04-29 09:06:52 -04:00
Janusz Lisiecki	8c339ac039	Fix compilation in clang (#478 ) - adds missing commas - adjusts misaligned usage of CUTLASS_DEVICE between template declaration and specializations Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>	2022-04-28 14:22:06 -04:00
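
Commit 7bdba07310 above turns the tag-struct declarations into definitions so that `typeid` can be applied to them. The following is a minimal sketch of that distinction, not the CUTLASS source: the `sketch` namespace, `DeclaredOnly`, `is_simt`, and `demo` are hypothetical names introduced for illustration, and the two tag structs are stand-ins that only borrow the operator-class naming.

```cpp
#include <cassert>
#include <typeinfo>

namespace sketch {

// Declaration only: an incomplete type. Writing typeid(DeclaredOnly)
// would be ill-formed, because typeid requires a complete class type.
struct DeclaredOnly;

// Definitions: still empty, but complete, so typeid works on them.
struct OpClassSimt {};
struct OpClassTensorOp {};

// Hypothetical helper: compares a tag type against OpClassSimt at run time.
template <typename Tag>
bool is_simt() {
  return typeid(Tag) == typeid(OpClassSimt);
}

// Hypothetical self-check exercising both tags.
inline void demo() {
  assert(is_simt<OpClassSimt>());
  assert(!is_simt<OpClassTensorOp>());
}

}  // namespace sketch
```

An empty definition costs nothing at run time; it only makes the type complete, which is the property the commit message says GoogleTest typed tests rely on.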


114 Commits
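
Commit c54ede3a9e in the log adds `const` overloads for `Array::{begin,end,rbegin,rend}` because container matchers take their argument by constant reference. The sketch below shows why such overloads matter, under stated assumptions: `sketch::Array` is a hypothetical stand-in that uses raw pointers as iterators (it is not the CUTLASS `Array`), and `sum` is an illustrative helper.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Hypothetical fixed-size container, not cutlass::Array.
template <typename T, std::size_t N>
struct Array {
  T storage[N];

  // Non-const accessors: usable only on a mutable Array.
  T* begin() { return storage; }
  T* end() { return storage + N; }

  // Const overloads: required when the Array is reached through a
  // const reference, as in sum() below.
  const T* begin() const { return storage; }
  const T* end() const { return storage + N; }
};

// Takes the container by const reference, so the range-for loop
// can only call the const begin()/end() overloads.
template <typename Container>
int sum(const Container& c) {
  int total = 0;
  for (const auto& x : c) {
    total += x;
  }
  return total;
}

// Hypothetical self-check.
inline void demo() {
  Array<int, 3> a{{1, 2, 3}};
  assert(sum(a) == 6);
}

}  // namespace sketch
```

Without the two `const` overloads, `sum` would fail to compile, since a non-const member function cannot be called on a `const Container&`.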