Ivan Komarov
f692b98d80
Fix spurious re-compilations of rotary_kernel (#911)
In Triton, all integer parameters are specialized by default, so the two parameters
removed in this commit could trigger kernel re-compilation even though they were
completely unused.
2024-04-05 13:40:41 -07:00
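An illustrative sketch of the Triton behavior this fix targets (hypothetical copy_kernel, not the repo's actual rotary_kernel): the JIT specializes every integer argument, e.g. on divisibility by 16 and on equality to 1, so even an argument the kernel never reads can fork compilations.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_kernel(out_ptr, x_ptr, n, unused_int, BLOCK: tl.constexpr):
    # unused_int is never read, yet Triton still specializes on it.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)

x = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
# Identical shapes, but 16 is divisible by 16 and 17 is not, so the two
# calls land in different specializations and compile twice.
copy_kernel[(4,)](out, x, x.numel(), 16, BLOCK=256)
copy_kernel[(4,)](out, x, x.numel(), 17, BLOCK=256)
```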
Grigory Sizov
2a15840f09
Enable paged attention in varlen forward (#831)
* Enable paged attention in varlen forward
* Format + fix padding
2024-03-15 00:48:19 -07:00
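A hedged sketch of the paged varlen call this commit enables; the shapes, page size, and the interplay of cu_seqlens_k with block_table follow my reading of the flash_attn_varlen_func docstring, so treat it as illustrative rather than authoritative.

```python
import torch
from flash_attn import flash_attn_varlen_func

nheads, headdim, page = 8, 64, 256
# Two sequences of 3 query tokens each, packed into one tensor.
q = torch.randn(6, nheads, headdim, dtype=torch.float16, device="cuda")
cu_seqlens_q = torch.tensor([0, 3, 6], dtype=torch.int32, device="cuda")
# Paged KV pool: (num_pages, page_size, nheads, headdim).
k = torch.randn(4, page, nheads, headdim, dtype=torch.float16, device="cuda")
v = torch.randn_like(k)
# block_table[b, i] = pool index of sequence b's i-th page.
block_table = torch.tensor([[0, 1], [2, 3]], dtype=torch.int32, device="cuda")
cu_seqlens_k = torch.tensor([0, 300, 600], dtype=torch.int32, device="cuda")
out = flash_attn_varlen_func(
    q, k, v, cu_seqlens_q, cu_seqlens_k,
    max_seqlen_q=3, max_seqlen_k=300,
    causal=True, block_table=block_table,
)
```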
Tri Dao
2406f28805
Enable headdim 256 backward on consumer GPUs (Ampere, Ada)
2024-02-21 15:56:19 -08:00
Tri Dao
54e80a3829
Implement paged KV cache
Co-authored-by: ljss <450993438@qq.com>
2024-01-22 22:47:30 -08:00
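A hedged sketch of the paged cache in flash_attn_with_kvcache (page size and block_table layout per my reading of the docstring): the cache becomes a shared pool of fixed-size pages, and block_table maps each sequence to its pages.

```python
import torch
from flash_attn import flash_attn_with_kvcache

nheads, headdim, page = 8, 64, 256
# Shared pool of pages: (num_pages, page_size, nheads, headdim).
k_cache = torch.zeros(16, page, nheads, headdim, device="cuda", dtype=torch.float16)
v_cache = torch.zeros_like(k_cache)
q = torch.randn(2, 1, nheads, headdim, device="cuda", dtype=torch.float16)
# block_table[b, i] = pool index of sequence b's i-th page.
block_table = torch.tensor([[0, 1], [2, 3]], dtype=torch.int32, device="cuda")
cache_seqlens = torch.tensor([300, 180], dtype=torch.int32, device="cuda")
out = flash_attn_with_kvcache(q, k_cache, v_cache, cache_seqlens=cache_seqlens,
                              block_table=block_table, causal=True)
```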
Curtis "Fjord" Hawthorne
d8aacc510c
return z_loss (#768)
2024-01-21 15:23:41 -08:00
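Per my reading of this PR, CrossEntropyLoss gains a return_z_loss flag so the z-loss term (weighted by lse_square_scale) comes back alongside the main loss; the flag name and return convention here are my interpretation, not verified against the merged code.

```python
import torch
from flash_attn.losses.cross_entropy import CrossEntropyLoss

# return_z_loss=True (assumed flag) makes forward return (loss, z_loss).
loss_fn = CrossEntropyLoss(lse_square_scale=1e-4, return_z_loss=True)
logits = torch.randn(8, 50257, device="cuda", dtype=torch.float16, requires_grad=True)
labels = torch.randint(0, 50257, (8,), device="cuda")
loss, z_loss = loss_fn(logits, labels)
```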
Tri Dao
10dad61277
apply_dropout now takes a tensor of rowcol layout
2024-01-14 01:03:23 -08:00
Tri Dao
a7b66ae25a
Simplify writing softmax to gmem
2024-01-13 00:25:04 -08:00
Tri Dao
f5b308e258
[LayerNorm] Rename layernorm.py -> layer_norm.py
2024-01-05 00:21:03 -08:00
Tri Dao
665b55e2e2
[LayerNorm] Implement parallel layer norm in Triton
2024-01-04 23:15:35 -08:00
Tri Dao
aa5c6438c5
[LayerNorm] Implement rowscale in Triton layernorm
2024-01-04 01:07:03 -08:00
Tri Dao
73df3be7d5
Add test for BTLM init
2023-12-25 15:16:27 -08:00
Tri Dao
7ffba9a501
Implement BTLM model
2023-12-24 20:35:12 -08:00
Tri Dao
3f7d5786ba
Pass alibi slopes to flash_attn_with_kvcache during generation
2023-12-24 20:31:59 -08:00
Tri Dao
732654583c
Implement deterministic backward (thanks to Meituan)
2023-12-23 17:57:36 -08:00
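Usage is just a flag on the public API; a minimal sketch (shapes illustrative):

```python
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16, requires_grad=True)
k, v = torch.randn_like(q), torch.randn_like(q)
# deterministic=True selects the slower but bitwise-reproducible backward.
out = flash_attn_func(q, k, v, causal=True, deterministic=True)
out.sum().backward()
```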
Tri Dao
2c7d7b7396
Implement norm head for Baichuan2
2023-12-22 16:55:40 -08:00
Tri Dao
c3b2196652
Add Alibi to MHA, test with Baichuan-13B
2023-12-21 22:49:55 -08:00
Tri Dao
5ab9b3667b
Clean up alibi, implement non-causal alibi
2023-12-21 22:27:40 -08:00
Sanghun Cho
e4f726fc44
Support alibi, by Sanghun Cho from Kakao Brain
* hard-code alibi in fwd
* use params.h as num_heads
* hard-code alibi in bwd
* add alibi on/off option
* compute alibi_start, ratio outside of kernels
* fix minor merge conflict
* add test_alibi.py
* change apply_alibi() location before masking
* add alibi in splitkv kernel
* fix backward func # of returns
* add out-of-bound check in apply_alibi()
* update test_alibi.py
* update test_alibi.py for kvcache
* simplify alibi parameter interface
* fix performance issue by computing alibi outside of branch
* update test_flash_attn_varlen_func() for left padding
* implement alibi_slopes (b, nh) loading
* optimize apply_alibi() a bit
* update test cases for alibi_slopes loading
* reflect stylistic comments
* disable "seqlenq_ngroups_swapped" when using alibi
---------
Co-authored-by: monk.detective <monk.detective@kakaobrain.com>
2023-12-19 22:56:06 -08:00
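A minimal sketch of the resulting API: alibi_slopes is one float32 slope per head, shaped (nheads,) or (batch, nheads); the geometric slope schedule below is the one from the ALiBi paper, assumed here for illustration.

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 512, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
# Geometric slopes 2^-1 .. 2^-nheads, as in the ALiBi paper.
slopes = torch.tensor([2.0 ** -(i + 1) for i in range(nheads)], device="cuda")
out = flash_attn_func(q, k, v, causal=True, alibi_slopes=slopes)
```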
Tri Dao
cd089597fd
[LayerNorm] Implement dropout in fused residual + LN/RMSNorm
2023-12-19 16:26:07 -08:00
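A hedged sketch of the fused path after this commit, using layer_norm_fn from flash_attn/ops/triton/layer_norm.py (layernorm.py before the rename logged above); the dropout_p and prenorm semantics follow my reading of the function, so verify against the source.

```python
import torch
from flash_attn.ops.triton.layer_norm import layer_norm_fn

x = torch.randn(4, 1024, 768, device="cuda", dtype=torch.float16)
residual = torch.randn_like(x)
w = torch.ones(768, device="cuda", dtype=torch.float16)
b = torch.zeros(768, device="cuda", dtype=torch.float16)
# One kernel: dropout(x) + residual, then LayerNorm; with prenorm=True
# the updated residual stream is returned as well.
out, new_residual = layer_norm_fn(
    x, w, b, residual=residual, dropout_p=0.1,
    prenorm=True, residual_in_fp32=True,
)
```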
Tri Dao
713bd3aa9a
[CrossEntropy] Test longer sequences
2023-12-16 19:11:23 -08:00
Tri Dao
08124c8f9c
[CrossEntropy] Implement logit_scale option
2023-12-16 18:39:37 -08:00
Tri Dao
9356a1c038
[LayerNorm] Implement layer_norm_linear
2023-11-30 21:46:07 -08:00
Tri Dao
aaa1474129
[CrossEntropy] Simplify the case of large vocab with Tensor Parallel
2023-11-19 23:19:36 -08:00
Shijie
abf04a56e1
Fix flash cross-entropy with model parallel for large vocab (#673)
2023-11-19 23:01:07 -08:00
Tri Dao
017716451d
[LayerNorm] Add postnorm residual + LayerNorm/RMSNorm in Triton
2023-11-13 22:37:55 -08:00
Tri Dao
79bd1a2d5d
[LayerNorm] Implement residual + LayerNorm/RMSNorm in Triton
2023-11-13 02:04:49 -08:00
Tri Dao
e279bf8ed9
[Gen] Accept cache_batch_idx to index into the KV cache
2023-10-03 16:27:26 -07:00
Tri Dao
083e8f525f
Implement local attention
Co-authored-by: Timothee Lacroix <t@mistral.ai>
2023-09-26 16:31:08 -07:00
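Local attention is exposed through the window_size argument; a minimal sketch (shapes illustrative):

```python
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 2048, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
# window_size=(left, right): each query attends at most `left` tokens back
# and `right` tokens ahead; (-1, -1) means no restriction.
out = flash_attn_func(q, k, v, causal=True, window_size=(256, 0))
```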
Tri Dao
65c234ed90
Don't over-allocate dq_accum in case of varlen
2023-09-24 00:36:07 -07:00
Tri Dao
2d8ea9a530
Swap seqlen_q and ngroups when seqlen_q=1 (h/t Daniel Haziza)
2023-09-20 23:38:22 -07:00
Tri Dao
0705d2718d
[Llama] Fix some tests, add tests for Llama 2 and CodeLlama
2023-09-20 23:36:46 -07:00
Tri Dao
e0fbaa7016
[Gen] Simplify decode_speculative
2023-09-19 22:20:22 -07:00
Tri Dao
e6a8026489
[Gen] Rename max_sequence_len->max_seqlen, sequence_len_offset->seqlen_offset
2023-09-19 22:20:22 -07:00
Kevin Hu
42832575d4
Fix Llama GQA/MQA (#546)
* Fix llama MQA
* Fix permute shape
* Update llama.py
2023-09-19 22:15:59 -07:00
Tri Dao
dfe29f5e2b
[Gen] Don't use ft_attention, use flash_attn_with_kvcache instead
2023-09-18 15:29:06 -07:00
Tri Dao
3250ff3d82
Swap seqlen_q, nheads for MQA when seqlen_q=1 for fwd (h/t Daniel H)
2023-09-18 14:52:16 -07:00
Tri Dao
ccbb14f38e
Implement rotary embedding in flash_attn_with_kvcache
2023-09-16 01:20:16 -07:00
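With this commit the kernel can rotate q and the appended k on the fly; a hedged sketch, assuming the usual (seqlen, rotary_dim/2) cos/sin tables:

```python
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim, max_len = 2, 8, 64, 1024
k_cache = torch.zeros(batch, max_len, nheads, headdim, device="cuda", dtype=torch.float16)
v_cache = torch.zeros_like(k_cache)
q = torch.randn(batch, 1, nheads, headdim, device="cuda", dtype=torch.float16)
k_new, v_new = torch.randn_like(q), torch.randn_like(q)
inv_freq = 1.0 / (10000 ** (torch.arange(0, headdim, 2, device="cuda").float() / headdim))
freqs = torch.outer(torch.arange(max_len, device="cuda").float(), inv_freq)
cos, sin = freqs.cos().half(), freqs.sin().half()  # (max_len, headdim / 2)
cache_seqlens = torch.full((batch,), 128, dtype=torch.int32, device="cuda")
# q and the appended k_new are rotated at position cache_seqlens in-kernel.
out = flash_attn_with_kvcache(q, k_cache, v_cache, k=k_new, v=v_new,
                              rotary_cos=cos, rotary_sin=sin,
                              cache_seqlens=cache_seqlens, causal=True)
```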
Tri Dao
5400fdc4ac
[CE] Implement CrossEntropyLoss in Triton
2023-09-15 20:05:28 -07:00
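The Triton loss is a drop-in replacement for torch.nn.CrossEntropyLoss in the common cases; a minimal sketch, with logit_scale (from the later [CrossEntropy] commit logged above) included for illustration:

```python
import torch
from flash_attn.losses.cross_entropy import CrossEntropyLoss

# logit_scale multiplies logits before the softmax (assumed default 1.0).
loss_fn = CrossEntropyLoss(ignore_index=-100, label_smoothing=0.0, logit_scale=1.0)
logits = torch.randn(8, 50257, device="cuda", dtype=torch.float16, requires_grad=True)
labels = torch.randint(0, 50257, (8,), device="cuda")
loss = loss_fn(logits, labels)
loss.backward()
```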
Tri Dao
56b7fc6ee0
Simplify the implementation of KVcache attn by appending KV first
2023-09-13 15:55:48 -07:00
Tri Dao
d0032700d1
Add tests for Pythia, GPT-JT, and RedPajama models
2023-09-13 01:10:39 -07:00
Kevin Hu
07005806ff
Add BigCode converters (#532)
2023-09-10 17:24:50 -07:00
Tri Dao
8a733cbd53
[Gen] Fix calling update_graph_cache in tests
2023-09-10 17:22:37 -07:00
Kevin Hu
4c91621a5e
Invert state dict for BERT (#527)
2023-09-09 01:44:21 -07:00
Tri Dao
a86442f0f3
[Gen] Use flash_attn_with_kvcache in generation
2023-09-07 08:24:43 -07:00
Tri Dao
9795159082
[Rotary] Set device before launching Triton kernel to avoid error
2023-09-05 21:29:03 -07:00
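The underlying pattern, as I understand the fix (launch_kernel below is a hypothetical stand-in for any Triton launch): Triton targets the current CUDA device, so on multi-GPU systems the launch must happen under the tensor's device context.

```python
import torch

def launch_on_tensor_device(launch_kernel, x, *args):
    # Hypothetical helper: without this guard, a tensor on cuda:1 would be
    # handed to a kernel compiled and launched on the current device (cuda:0).
    with torch.cuda.device(x.device.index):
        return launch_kernel(x, *args)
```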
Tri Dao
fd20f16a4e
Support cache_seqlens being an integer
2023-09-05 11:27:48 -07:00
Tri Dao
913922cac5
[Gen] Refactor decoding function
2023-09-04 17:01:38 -07:00
Tri Dao
37c6e05406
Implement flash_attn_with_kvcache
2023-09-04 00:11:44 -07:00
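A minimal decoding-step sketch of the new function (shapes illustrative): the cache is updated in place and attention is computed against it in one call. cache_seqlens may also be a plain int, per the "Support cache_seqlens being an integer" commit logged above.

```python
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim, max_len = 2, 8, 64, 1024
k_cache = torch.zeros(batch, max_len, nheads, headdim, device="cuda", dtype=torch.float16)
v_cache = torch.zeros_like(k_cache)
q = torch.randn(batch, 1, nheads, headdim, device="cuda", dtype=torch.float16)
k_new, v_new = torch.randn_like(q), torch.randn_like(q)
# Current length of each sequence; the new k/v are appended there in place.
cache_seqlens = torch.full((batch,), 128, dtype=torch.int32, device="cuda")
out = flash_attn_with_kvcache(q, k_cache, v_cache, k=k_new, v=v_new,
                              cache_seqlens=cache_seqlens, causal=True)
```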
Tri Dao
0c04943fa2
Require CUDA 11.6+, clean up setup.py
2023-09-03 21:24:56 -07:00
Tri Dao
798858f9f1
Fix test_baichuan
2023-09-03 21:01:37 -07:00