flash-attention/hopper/flash_fwd_hdim64_e4m3_sm90.cu
jayhshah 5018ac6ac5
Fp8 kernel with "in-kernel" transpose of V in producer (#1100)
* base version

* restructure pipelines, add special fp8 epilogue

* add variants

* add fp8 causal and modify dynamic tile scheduler

* better causal schedule

* maintain two schedules for non causal and causal

* remove macros

* fix regression

* clean up unneeded methods and variants

* fix mistake with NumProducerThreads

* use seqlen traits

* add fp8 .cu files and benchmark script

* fix merge issue

* fix merge issue

* fix merge issue

* remove duplicate code

* fix regression with varseqlen

* move varseqlen init into constexpr

* fix test script

* more constexpr on varseqlen and add max offset

* add back test cases
2024-07-30 14:14:14 -07:00
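
The "in-kernel" transpose in the title addresses an fp8-specific constraint: Hopper's fp8 WGMMA instructions accept only K-major operands, so V tiles must be transposed in shared memory by the producer warps before the consumer warpgroups run the P·V matmul. Below is a didactic sketch of a shared-memory tile transpose to illustrate the idea; the actual kernel does this with CuTe tiled copies inside the producer pipeline, and every name in the sketch is hypothetical, not taken from the repository.

#include <cuda_runtime.h>

constexpr int TILE = 32;

// Naive tile transpose through padded shared memory (illustration only).
__global__ void transpose_tile(const float* __restrict__ in, float* __restrict__ out,
                               int rows, int cols) {
    __shared__ float smem[TILE][TILE + 1];    // +1 column pads away bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;  // column in `in`
    int y = blockIdx.y * TILE + threadIdx.y;  // row in `in`
    if (x < cols && y < rows) smem[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;      // column in `out` (= row in `in`)
    y = blockIdx.x * TILE + threadIdx.y;      // row in `out` (= column in `in`)
    if (x < rows && y < cols) out[y * rows + x] = smem[threadIdx.x][threadIdx.y];
}

A caller would launch this as transpose_tile<<<dim3((cols + 31) / 32, (rows + 31) / 32), dim3(32, 32)>>>(in, out, rows, cols); the padded shared-memory column keeps the strided reads in the second phase conflict-free.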

// Copyright (c) 2024, Tri Dao.
// Splitting the different head dimensions to different files to speed up compilation.

#include "flash_fwd_launch_template.h"

template<>
void run_mha_fwd_<cutlass::float_e4m3_t, 64>(Flash_fwd_params &params, cudaStream_t stream) {
    run_mha_fwd_hdim64_fp8<cutlass::float_e4m3_t>(params, stream);
}
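
As the file's comment says, each head dimension gets its own translation unit so the heavy template instantiations compile in parallel, with each unit contributing one explicit specialization of run_mha_fwd_. The following is a self-contained sketch of that pattern with a hypothetical dispatcher and stand-in types; only run_mha_fwd_ and the shape of its specialization come from the file above, everything else is assumed for illustration.

#include <cuda_runtime.h>
#include <cstdio>

// Stand-ins for the real types (the real ones live in flash.h and CUTLASS).
struct Flash_fwd_params { int d; };            // d = head dimension
namespace cutlass { struct float_e4m3_t {}; }

// Primary template: declared once, defined only through explicit specializations,
// one per head-dimension .cu file, so each compiles in its own translation unit.
template <typename T, int kHeadDim>
void run_mha_fwd_(Flash_fwd_params &params, cudaStream_t stream);

// What a file like flash_fwd_hdim64_e4m3_sm90.cu contributes (launch stubbed out).
template <>
void run_mha_fwd_<cutlass::float_e4m3_t, 64>(Flash_fwd_params &params, cudaStream_t stream) {
    std::printf("launching fp8 hdim64 kernel\n");  // placeholder for the real kernel launch
}

// Hypothetical runtime dispatcher selecting the matching specialization.
void run_mha_fwd_fp8(Flash_fwd_params &params, cudaStream_t stream) {
    if (params.d == 64) { run_mha_fwd_<cutlass::float_e4m3_t, 64>(params, stream); }
    // other head dimensions would be handled by specializations in sibling .cu files
}

Keeping one specialization per .cu file bounds the instantiation cost of any single compile job, which is the rationale stated in the file's second comment line.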