cutlass

Author	SHA1	Message	Date
ANIKET SHIVAM	90d3b0fb18	CUTLASS 3.2.1 (#1113 ) * Updates for 3.2.1 release. * Minor fix in gemm op profiler for raster order. * Add scheduler mapping for raster order in the kernels.	2023-09-26 17:24:26 -04:00
Vadim Markovtsev	8783c41851	Replace 0x1f with 0xffffffff in __shfl_sync (#1097 ) This fixes compatibility with H100 and resolves #1094	2023-09-18 19:58:19 -04:00
Yujia Zhai	6407bcdf0a	fix matrix B indices (#1089 )	2023-09-12 14:04:18 -04:00
tpoisonooo	a77b2c9cb8	style(examples): typo (#1080 ) * Update ampere_tensorop_conv2dfprop.cu learning cutlass, PR a typo. * Update ampere_gemm_operand_reduction_fusion.cu	2023-09-11 10:13:22 -04:00
ANIKET SHIVAM	a88c41cf8d	Updates for 3.2 release (#1065 )	2023-08-25 23:05:46 -04:00
ANIKET SHIVAM	4575443d44	CUTLASS 3.2 (#1024 ) * CUTLASS 3.2	2023-08-07 20:50:32 -04:00
dan_the_3rd	146d314057	Update fMHA kernels (#992 ) * Update fMHA kernels Upstream recent changes to fMHA that we did in xFormers. Previous version in CUTLASS: facebookresearch/xformers@b6be33a Updating to: facebookresearch/xformers@55a4798 * minor changes * make var work --------- Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-07-12 22:30:46 -04:00
Jack Kosaian	87349d3496	Add grouped b2b GEMM (#970 )	2023-06-05 17:16:57 -04:00
ANIKET SHIVAM	f079619f5e	More updates for 3.1 (#958 ) * Updates for 3.1 * Minor change * doc link fix * Minor updates	2023-05-24 10:17:16 -04:00
Ali Hassani	13f413493a	Stream-K with broadcast (#892 ) * [WIP] GEMM StreamK w/ Fused Epilogue * Adds Gemm Streamk with Fused Epilogue kernel level struct. * Mostly based on Gemm with Fused Epilogue, * Requires a new epilogue * Work in progress * [WIP] StreamK support for GemmUniversalWithBroadcast * Just based off of how StreamK is allowed in GemmUniversal * Untested and a work in progress * Minor fixes * [WIP] It compiles! It is almost certainly incorrect, but we're past getting the templates to match, so checkpointing. * Correction to reference kernel * Fix typo * Added MSE measurement * Switch back to reference kernel + host for loop Still WIP. Now we're getting even a larger MSE, but it's both on basic Split-K and Stream-K. * Fix typos * Fix broadcast vector + requested changes * Comment typo * Small int option and more * Fix incorrect condition on source needed * Requested changes * I think I got it? * Bias vector should be stride 0 * Two source added! * Typos * Merge examples * Bring back vector row offset Just to ensure consistency with universal gemm with fused epilogue * Base arguments and params structs for StreamK * StreamK epilogue with broadcast now inherits the original * undo params_streamk_base.h --------- Co-authored-by: Ali Hassani <ahassanijr@gmail.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-05-22 19:05:06 -04:00
ANIKET SHIVAM	7c04f95415	Updates for 3.1 (#932 )	2023-04-29 09:34:27 -04:00
Alexander Zinoviev	e36912f961	Fix for dangling references in the MHA example (#918 )	2023-04-19 21:35:46 -04:00
Aleksandr Pivovar	4a68cf748e	added support of b2b bmm (#849 ) * added support of b2b bmm * fixed arguments and params structures * added batch_count argument * removed SplitKSerial and added new test case with b2b bmm * fixed support of Kbatched and added new test case with batch stride * added batch support for bias and scale * make test * small changes --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-04-14 23:20:02 -04:00
ANIKET SHIVAM	d572cc1aab	CUTLASS 3.1 (#915 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-04-14 23:19:34 -04:00
dan_the_3rd	9b8166e3f0	fMHA: Add backward pass (#844 ) * fMHA: Add backward pass * Better checks for strides/alignments * Remove fb-internal URL * torch.Tensor.untyped_storage requires pytorch 2.0+ * minor changes * make test --------- Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-04-06 20:44:58 -04:00
Adnan Akhundov	2ba1ef10be	Increase max dynamic SMEM size in GemmSoftmax (#903 )	2023-04-03 10:01:12 -04:00
Vijay Thakkar	15d9d31f1f	CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897 )	2023-03-29 10:42:40 -04:00
Alexander Pivovarov	7e370c9637	Fix typos 2 (#842 ) Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2023-03-09 23:22:56 -05:00
dan_the_3rd	f396cdd15c	ex24[gemm_grouped]: Allow to change layout/dtype (#841 ) * ex24[gemm_grouped]: Allow to change layout/dtype * Address suggestion from @jackkosaian --------- Co-authored-by: danthe3rd <danthe3rd>	2023-03-01 07:13:51 -05:00
dan_the_3rd	f303889ed9	fMHA: Sync FW with xFormers (#828 ) * fMHA: Add support for bias+dropout in FW * Remove 'getMaximumSharedMemoryPerBlockKb' * fix comments --------- Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-22 23:25:31 -05:00
Haicheng Wu	34bed24af3	Update helper.h copyright banner	2023-02-16 16:50:04 -05:00
Haicheng Wu	8f5c242426	Update dual_gemm_common.h fix the copyright of a new file.	2023-02-13 15:35:33 -05:00
Adnan Akhundov	3c995c7606	Extend DualGemm: support batched mode + decouple B0/B1 layouts (#790 ) * Fix MHA kernel Summary: ATT Test Plan: Reviewers: Subscribers: Tasks: Tags: * Extend DualGemm to support batched mode (#5) Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set. * Decouple LayoutB0 and LayoutB1 in DualGemm The DualGemm template assumed the same layout, LayoutB, for both right operand matrices B0 and B1. This is problematic if the layout of the two matrices is different. In particular, this may be the case when one of the matrices is row-major, while the other is a (column) vector that has to be broadcasted in column-major with zero stride (e.g., as {B1.device_data(), 0}) for the DualGemm implementation to be able to process B0 and B1 simultaneously. In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). Additionally, the batch strides of B0 and B1 are also decoupled to accommodate the column vector B1 case described above. * Remove comment as no longer relevant * Revert Fix MHA kernel --------- Co-authored-by: mikeiovine <mikeiovine@fb.com>	2023-02-13 15:27:13 -05:00
dan_the_3rd	2e10404d26	xFormer updates to fMHA FW (#773 ) * xFormer updates to fMHA FW * Convert format to BMHK for '41_fused_multi_head_attention_fixed_seqlen' * Add missing files * Remove xFormers specific code * Update fused_multihead_attention_fixed_seqlen.cu * rebase and solve conflicts * remove white space --------- Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-08 23:00:10 -05:00
Vijay Thakkar	277bd6e537	CUTLASS 3.0.0 (#786 ) * CUTLASS 3.0.0	2023-01-23 20:55:28 -05:00
ANIKET SHIVAM	66d9cddc83	New updates for 2.11 (#775 ) * New updates. * Minor profiler updates Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-01-20 16:32:57 -05:00
Haicheng Wu	8b42e751c6	streamk paper link (#765 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-10 22:10:43 -05:00
Haicheng Wu	764b840d6f	streamk example and performance tuning (#760 ) * streamk example and performance tuning * one missing file Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-10 16:10:02 -05:00
Haicheng Wu	3f2bb17722	minor changes (#730 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-10 14:44:53 -05:00
Jack Kosaian	df81d847d7	Make Python interface work for non-SM80 targets (#726 ) * Make Python interface work for non-SM80 targets * Remove line in README	2022-12-07 21:53:33 -05:00
Aditya Atluri	c975e2ccbb	release 2.11 (#703 )	2022-11-19 09:02:15 -05:00
Wenzhuo Liu	3c90f6aea6	add `#pragma once` for header file in example 42 (#698 )	2022-11-15 22:50:24 -05:00
Haicheng Wu	012c62c748	bug fixes and enhancement to gemm reductionK fusion (#682 ) * add two missing files * fix bunch of bugs of gemm-reducek fusion and add a device interface * small changes Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-11-03 11:07:50 -04:00
dan_the_3rd	1b4e24470a	Example 43 - DualGemm (#670 ) * Ex50 wip * IS_PROFILING mode * MultiStage2 - but is slower * Add SwiGLU * Support SplitKSerial reduction Support not storing D0/D1 Cleanup code * Option to disable bias * Renumber example * Fix build * Remove references to pb_size_0 / pb_size_1 * Add support for bf16 inputs with float accum * small changes Co-authored-by: danthe3rd <danthe3rd> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-10-26 14:04:42 -04:00
Jack Kosaian	8c1bf9b784	Bump CUTLASS Python container version (#672 ) * Update example 40 README * Update CUTLASS Python README	2022-10-22 21:09:39 -04:00
Yuriy Chernyshov	7d0dd6706e	Remove excessive includes from examples/41_multi_head_attention (#669 ) The rationale behind this change is explained in #563	2022-10-21 22:23:15 -04:00
dan_the_3rd	4db6a6140e	ex42: Fused MHA imported from xFormers (#662 ) * ex42: Fused MHA imported from xFormers * Remove std:: references * Support K>128 in the example * Support causal option * Support different head size for V, and different seqlength for KV * Update FLOPS counter * Remove bit_cast * fix build: Replace M_LOG2E * Add doc * Revert "Remove bit_cast" This reverts commit 9662fa86bb7c57c1a015ac0bf52cb52940fbbf80. * Explicit casts to int32_t for windows build Co-authored-by: danthe3rd <danthe3rd>	2022-10-17 10:49:33 -04:00
Wenzhuo Liu	7a458f00a6	fix(permute.h): incorrect comment in `Tensor5DPermute20314` (#637 ) * fix(permute.h): incorrect comment in `Tensor5DPermute20314` * typo in usage in example 39	2022-09-22 09:21:13 -04:00
Wenzhuo Liu	f73374a1eb	fix:comment typo in example 23 (#633 )	2022-09-19 09:54:14 -04:00
Yujia Zhai	faab7536fc	add comment (#628 )	2022-09-17 21:40:30 -04:00
ANIKET SHIVAM	e773429f7e	CUTLASS 2.10 updates (#622 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-09-12 21:26:30 -04:00
Yujia Zhai	b1d3f9b2fd	upstream internal updates (#616 ) Co-authored-by: yuzhai <yuzhai@nvidia.com>	2022-09-04 23:05:09 -04:00
ANIKET SHIVAM	b72cbf957d	CUTLASS 2.10 (#615 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-09-03 18:48:46 -04:00
Haicheng Wu	497b499d9d	Add residual support for shmem staging iterator used in back-to-back GEMM fusion. This allows support of problem_size_0_n that is not multiple of 32. (#590 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-08-15 11:19:24 -04:00
Jack Kosaian	e66bfcb1f8	Fix for #596 (typo in example 03) (#597 ) * [examples] Fix typos in SYRK and TRMM examples * Fix typo in example 03	2022-08-09 09:58:36 -04:00
Michaël Benesty	1617685a77	fix: fix types in example 06 (#587 )	2022-07-29 12:46:06 -04:00
Shang Zhang	5d05808072	fix gather example (#574 )	2022-07-19 16:18:17 -04:00
Ivan Komarov	0b8cacd6f1	Remove redundant <fstream> includes (#563 ) * Remove redundant <fstream> includes * Fix fstream in examples/ * Fix <fstream> in test/ * Use consistent order for <fstream> (always after <iostream>) * Remove an unneeded include in a file where std::ofstream usage is commented out Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>	2022-07-19 15:23:54 -04:00
Yujia Zhai	04a9777b87	Softmax (#546 ) * add test layernorm g-mem version * Delete include/configure directory * Delete examples/test_layernorm directory * Update gemm_with_softmax.h * Update gemm_softmax.cu * Update linear_combination.h * Update fast_math.h * remove redundant vars Co-authored-by: yujia.zhai <yujia.zhai@bytedance.com> Co-authored-by: yuzhai <yuzhai@nvidia.com>	2022-07-02 01:19:18 -04:00
Jack Kosaian	fa56763c25	Fix occupancy calculation for grouped GEMM (#532 )	2022-06-18 19:53:59 -04:00

86 Commits