Commit Graph

19 Commits

Author SHA1 Message Date
Cameron Shinn
3cea2fb6ee
Add ArchTag to pre/postprocess bwd kernels (#1180)
* Add ArchTag to pre/postprocess bwd kernels

* Type-dependent CC check for bwd pre/postprocess

* Fix CC >= 90 for bwd postprocess

---------

Co-authored-by: Cameron Shinn <cshinn@nvidia.com>
2024-08-28 00:20:47 -07:00
jayhshah
c92ca63268
FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173) 2024-08-25 12:18:04 -07:00
Ying Zhang
53537da422 add a unittest 2024-08-17 13:23:50 -07:00
Ying Zhang
a3a257c71d Fix out-of-bound writes for var-seq-len zero-length KVs 2024-08-16 01:17:40 -07:00
Ying Zhang
3669b25206
bwd benchmark + small fixes (#1129) 2024-08-05 21:27:52 -07:00
Tri Dao
5d5bfbb619 Remove contiguous checks 2024-08-05 14:47:07 -07:00
Tri Dao
3f6ff1c1c5 Remove struct : cute::aligned_struct to avoid error with gcc 12 2024-08-02 00:59:35 -07:00
Tri Dao
c33de664a1 Fix import in test 2024-08-01 02:14:25 -07:00
Tri Dao
bafe253042 [FA3] Bwd 2024-08-01 01:57:06 -07:00
Ying Zhang
c7f20a2d31 add cudnn benchmark for var-len 2024-07-31 22:33:29 -07:00
jayhshah
5018ac6ac5
Fp8 kernel with "in-kernel" transpose of V in producer (#1100)
* base version

* restructure pipelines, add special fp8 epilogue

* add variants

* add fp8 causal and modify dynamic tile scheduler

* better causal schedule

* maintain two schedules for non causal and causal

* removing macros

* fix regression

* clean up unneeded methods and variants

* fix mistake with NumProducerThreads

* base version

* restructure pipelines, add special fp8 epilogue

* add variants

* add fp8 causal and modify dynamic tile scheduler

* better causal schedule

* maintain two schedules for non causal and causal

* removing macros

* fix regression

* clean up unneeded methods and variants

* fix mistake with NumProducerThreads

* use seqlen traits

* add fp8 .cu files and benchmark script

* fix merge issue

* fix merge issue

* fix merge issue

* remove duplicate code

* fix regression with varseqlen

* move varseqlen init in constexpr

* fix test script

* more constexpr on varseqlen and add max offset

* add back test cases
2024-07-30 14:14:14 -07:00
Tri Dao
3aae9c18c1 Revert "Changes For FP8 (#1075)"
This reverts commit 1899c970c8.
2024-07-25 01:28:44 -07:00
ganeshcolfax
1899c970c8
Changes For FP8 (#1075)
* adding files for fp8 changes.

* removed contiguous check.

* enable all tests except odd-seq-lengths, where it crashes now.

* undid clang formatting.

* change to correct tile size for headdim=128.

* fixed odd-seq-len-k.

* minor formatting.

* minor reformatting.

---------

Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
2024-07-23 13:51:14 -07:00
janEbert
3c4053b75c
Make FA3 externally importable (#1053)
Library name to import is `flash_attn_interface`, which matches the
test.
2024-07-22 21:34:56 -07:00
Ying Zhang
dfe1a59e4b
Add var-seq-len to FA3 fp16 / bf16 fwd (#1072)
* fwd var-seq-len

* fixes

* benchmark

* fixes

---------

Co-authored-by: Tri Dao <tridao@users.noreply.github.com>
2024-07-22 21:32:41 -07:00
Cameron Shinn
cb516f855b
Remove torchlib dependency from cpp files (#1083) 2024-07-22 16:47:09 -07:00
youkaichao
ef3e358a25
remove lambda (#1056) 2024-07-21 23:24:38 -07:00
Tri Dao
74b0761ff7 [FA3] BF16 forward 2024-07-14 23:39:46 -07:00
Tri Dao
7f67966cc7 FA3 initial code release 2024-07-11 09:53:36 -07:00