Tri Dao
db2f80692c
Write zero to out / grad if seqlen_q or seqlen_k is zero
2023-11-19 22:20:01 -08:00
Tri Dao
43bb6d8aaa
Update cutlass to 3.2.2
2023-11-19 21:43:48 -08:00
Driss Guessous
dc4b9ad6c4
add checks ( #640 )
2023-11-19 20:43:27 -08:00
Tri Dao
5a83425442
Change constexpr int to constexpr static int
2023-10-08 16:26:33 -07:00
Tri Dao
e279bf8ed9
[Gen] Accept cache_batch_idx to index into the KV cache
2023-10-03 16:27:26 -07:00
Tri Dao
083e8f525f
Implement local attention
...
Co-authored-by: Timothee Lacroix <t@mistral.ai>
2023-09-26 16:31:08 -07:00
Tri Dao
812cb1c990
Switch cutlass to newer commit to avoid compilation warning
2023-09-24 00:42:50 -07:00
Tri Dao
65c234ed90
Don't over-allocate dq_accum in case of varlen
2023-09-24 00:36:07 -07:00
Tri Dao
1879e089c7
Reduce number of templates for headdim > 128
2023-09-23 22:24:30 -07:00
Tri Dao
2d8ea9a530
Swap seqlen_q and ngroups when seqlen_q=1 (h/t Daniel Haziza)
2023-09-20 23:38:22 -07:00
Tri Dao
dfe29f5e2b
[Gen] Don't use ft_attention, use flash_attn_with_kvcache instead
2023-09-18 15:29:06 -07:00
Tri Dao
3250ff3d82
Swap seqlen_q, nheads for MQA when seqlen_q=1 for fwd (h/t Daniel H)
2023-09-18 14:52:16 -07:00
Tri Dao
43617deab9
Remove template for (IsEvenMN=T, IsEvenK=F) to speed up compilation
2023-09-18 12:21:36 -07:00
Tri Dao
c984208ddb
Set block size to 64 x 64 for kvcache to avoid nvcc segfaults
2023-09-17 16:14:58 -07:00
Tri Dao
ccbb14f38e
Implement rotary embedding in flash_attn_with_kvcache
2023-09-16 01:20:16 -07:00
Tri Dao
5400fdc4ac
[CE] Implement CrossEntropyLoss in Triton
2023-09-15 20:05:28 -07:00
Tri Dao
56b7fc6ee0
Simplify the implementation of KVcache attn by appending KV first
2023-09-13 15:55:48 -07:00
Tri Dao
bb9beb3645
Remove some unused headers
2023-09-12 12:37:10 -07:00
Tri Dao
ee77b931b9
Swap seqlen_q and nheads for MQA to speed it up (h/t Daniel Haziza)
2023-09-10 22:56:33 -07:00
Tri Dao
37c6e05406
Implement flash_attn_with_kvcache
2023-09-04 00:11:44 -07:00
Tri Dao
6a89b2f121
Remove constexpr in launch template to fix CI compilation
2023-09-03 22:59:41 -07:00
Tri Dao
97ba7a62e9
Try switching back to Cutlass 3.2.0
2023-09-03 22:45:35 -07:00
Tri Dao
1dc1b6c8f2
Bump to v2.1.2
2023-09-03 22:23:05 -07:00
Tri Dao
5953c4f58c
Remove unused sdPsum in dot_do_o function
2023-09-03 20:44:07 -07:00
Tri Dao
26d7d92f3d
Fix splitKV combine function when local LSEs are all -inf
2023-09-03 11:39:09 -07:00
Sophia Wisdom
37e32febba
Remove commented out code in bwd ( #512 )
...
* Remove lots of comments
* Remove unused traits
2023-09-01 16:43:58 -07:00
Sophia Wisdom
dd8a754915
Remove old code in utils.h ( #511 )
2023-09-01 15:32:09 -07:00
Aman Gupta Karmani
866a9d33f9
bump cutlass submodule ( #504 )
2023-08-30 10:32:04 -07:00
Tri Dao
31920dda5f
Fix typo with lse_max == -INFINITY
2023-08-29 21:48:59 -07:00
Tri Dao
b1fbbd8337
Implement splitKV attention
2023-08-29 00:58:29 -07:00
Tri Dao
7a983df742
Use generate_kernels.py script from Driss Guessous
2023-08-28 13:34:12 -07:00
dan_the_3rd
c3f2a632aa
[ft_attention] Fix for seqlen=8136 ( #488 )
...
When seqlen=8136, `smem_sz = 48840`, and apparently starting the kernel returns an `invalid argument` CUDA error.
`48840 < 48 * 1024` but apparently it's still above the limit somehow..?
Tested on A100
2023-08-28 10:00:22 -07:00
Tri Dao
757058d4d3
Update Cutlass to v3.2.0
2023-08-27 23:47:28 -07:00
Tri Dao
9e5e8bc91e
Change causal mask to be aligned to bottom-right instead of top-left
2023-08-24 23:41:07 -07:00
BoxiangW
e07aa036db
Support flash attention 2 with causal masking when KV's seq length is longer than Q's seq length. ( #436 )
2023-08-24 16:42:34 -07:00
Tri Dao
bcfa7c9751
[FusedDense] Run black on fused_dense.py
2023-08-16 23:41:36 -07:00
Tri Dao
c65b5106ac
Fix Bwd NaN for varlen when seqlen_q >> seqlen_k and causal
2023-08-16 15:12:36 -07:00
Tri Dao
dbd7923782
Prepare for Cutlass 3.2
2023-08-13 15:24:32 -07:00
Tri Dao
3524e13c11
Update to Cutlass 3.1
2023-08-13 13:53:17 -07:00
Tri Dao
1c41d2b0e5
Fix race condition in bwd (overwriting sK)
2023-08-01 09:00:10 -07:00
Tri Dao
a4f148b6ab
Fix masking of bwd when seqlen is not divisible by 128
2023-07-31 17:46:34 -07:00
Kirthi Shankar Sivamani
a03f6f8e9e
Enable CUDA graphs ( #386 )
...
* Add RNG state to kernel launch params
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Save seed and offset for backward
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Single thread write to global mem
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* compute_dq_dk_dv_1colblock get seed and offset from launch params
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* compute_dq_dk_dv_1rowblock get seed and offset from launch params
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Change forward c++ APIs to save RNG state for backward
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Change backward c++ APIs to set RNG state for bprop launcher
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Bug fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Python side API changes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Bug fix; only save seeds instead of full offset
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Account for 3D grid size
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
2023-07-27 16:11:34 -07:00
Joel Lamy-Poirier
767b71ccf0
Fix random state for dropout_layer_norm ( #315 )
2023-07-23 15:05:13 -07:00
Tri Dao
a157cc8c9b
[FT] Implement MQA/GQA
2023-07-22 23:47:01 -07:00
Tri Dao
9ee0ff1d9b
Fix using dO stride for O, which can cause memory error in bwd
2023-07-20 17:39:57 -07:00
Ikko Eltociear Ashimine
dfc60f6b7d
[LayerNorm] Fix typo in ln_api.cpp
...
unintialized -> uninitialized
2023-07-20 01:16:16 +09:00
danthe3rd
538d570c96
Fix compile error on MSVC
...
See also: https://stackoverflow.com/questions/55136414/constexpr-variable-captured-inside-lambda-loses-its-constexpr-ness
2023-07-19 08:04:57 +00:00
Tri Dao
4f285b3547
FlashAttention-2 release
2023-07-17 06:21:34 -07:00
Tri Dao
2800efc71f
[FT] rotary_cos/sin should have batch_size dimension
2023-07-06 15:33:33 -07:00
Tri Dao
3a9bfd076f
[FT] rotary_cos/sin should have shape (dim) instead of (seqlen, dim)
2023-07-03 09:41:04 -07:00