cutlass/examples
Ali Hassani 13f413493a
Stream-K with broadcast (#892)
* [WIP] GEMM StreamK w/ Fused Epilogue

* Adds Gemm StreamK with Fused Epilogue kernel-level struct.
  * Mostly based on Gemm with Fused Epilogue
  * Requires a new epilogue
  * Work in progress

* [WIP] StreamK support for GemmUniversalWithBroadcast

* Just based off of how StreamK is allowed in GemmUniversal
  * Untested and a work in progress

* Minor fixes

* [WIP] It compiles!

It is almost certainly incorrect, but we're past getting the templates
to match, so checkpointing.

* Correction to reference kernel

* Fix typo

* Added MSE measurement

* Switch back to reference kernel + host for loop

Still WIP. Now we're getting an even larger MSE, but it shows up with
both basic Split-K and Stream-K.

* Fix typos

* Fix broadcast vector + requested changes

* Comment typo

* Small int option and more

* Fix incorrect condition on source needed

* Requested changes

* I think I got it?

* Bias vector should be stride 0

* Two source added!

* Typos

* Merge examples

* Bring back vector row offset

Just to ensure consistency with universal gemm with fused epilogue

* Base arguments and params structs for StreamK

* StreamK epilogue with broadcast now inherits the original

* undo params_streamk_base.h

---------

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-22 19:05:06 -04:00
00_basic_gemm Fix typos 2 (#842) 2023-03-09 23:22:56 -05:00
01_cutlass_utilities New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
02_dump_reg_shmem New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
03_visualize_layout New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
04_tile_iterator New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
05_batched_gemm New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
06_splitK_gemm New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
07_volta_tensorop_gemm Fix typos 2 (#842) 2023-03-09 23:22:56 -05:00
08_turing_tensorop_gemm Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
09_turing_tensorop_conv2dfprop CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
10_planar_complex CUTLASS 3.0.0 (#786) 2023-01-23 20:55:28 -05:00
11_planar_complex_array CUTLASS 3.0.0 (#786) 2023-01-23 20:55:28 -05:00
12_gemm_bias_relu Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
13_two_tensor_op_fusion added support of b2b bmm (#849) 2023-04-14 23:20:02 -04:00
14_ampere_tf32_tensorop_gemm New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
15_ampere_sparse_tensorop_gemm New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
16_ampere_tensorop_conv2dfprop CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
17_fprop_per_channel_bias New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
18_ampere_fp64_tensorop_affine2_gemm New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
19_tensorop_canonical New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
20_simt_canonical New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
21_quaternion_gemm New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
22_quaternion_conv CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
23_ampere_gemm_operand_reduction_fusion Fix typos 2 (#842) 2023-03-09 23:22:56 -05:00
24_gemm_grouped CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
25_ampere_fprop_mainloop_fusion New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
26_ampere_wgrad_mainloop_fusion New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
27_ampere_3xtf32_fast_accurate_tensorop_gemm New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
28_ampere_3xtf32_fast_accurate_tensorop_fprop New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
30_wgrad_split_k New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
31_basic_syrk Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
32_basic_trmm Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
33_ampere_3xtf32_tensorop_symm New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
34_transposed_conv2d New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
35_gemm_softmax Increase max dynamic SMEM size in GemmSoftmax (#903) 2023-04-03 10:01:12 -04:00
36_gather_scatter_fusion New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
37_gemm_layernorm_gemm_fusion CUTLASS 3.0.0 (#786) 2023-01-23 20:55:28 -05:00
38_syr2k_grouped New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
39_gemm_permute CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
40_cutlass_py CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
41_fused_multi_head_attention Fix for dangling references in the MHA example (#918) 2023-04-19 21:35:46 -04:00
42_ampere_tensorop_group_conv New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
43_ell_block_sparse_gemm New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
44_multi_gemm_ir_and_codegen Fix typos 2 (#842) 2023-03-09 23:22:56 -05:00
45_dual_gemm CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
46_depthwise_simt_conv2dfprop CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
47_ampere_gemm_universal_streamk Stream-K with broadcast (#892) 2023-05-22 19:05:06 -04:00
48_hopper_warp_specialized_gemm CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
49_hopper_gemm_with_collective_builder Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
50_hopper_gemm_with_epilogue_swizzle CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
51_hopper_gett CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
60_cutlass_import New updates for 2.11 (#775) 2023-01-20 16:32:57 -05:00
common CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
cute CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
python CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00
CMakeLists.txt CUTLASS 3.1 (#915) 2023-04-14 23:19:34 -04:00