Author | Commit | Message | Date
Tri Dao | e0fbaa7016 | [Gen] Simplify decode_speculative | 2023-09-19 22:20:22 -07:00
Tri Dao | e6a8026489 | [Gen] Rename max_sequence_len->max_seqlen, sequence_len_offset->seqlen_offset | 2023-09-19 22:20:22 -07:00
Kevin Hu | 42832575d4 | Fix Llama GQA/MQA (#546) | 2023-09-19 22:15:59 -07:00
    * Fix llama MQA
    * Fix permute shape
    * Update llama.py
Tri Dao | dfe29f5e2b | [Gen] Don't use ft_attention, use flash_attn_with_kvcache instead | 2023-09-18 15:29:06 -07:00
Tri Dao | 3250ff3d82 | Swap seqlen_q, nheads for MQA when seqlen_q=1 for fwd (h/t Daniel H) | 2023-09-18 14:52:16 -07:00
Tri Dao | 43617deab9 | Remove template for (IsEvenMN=T, IsEvenK=F) to speed up compilation | 2023-09-18 12:21:36 -07:00
Federico Berto | fa3ddcbaaa | [Minor] add nvcc note on bare_metal_version RuntimeError (#552) | 2023-09-18 11:48:15 -07:00
    * Add nvcc note on bare_metal_version `RuntimeError`
    * Run Black formatting
Tri Dao | 799f56fa90 | Don't compile for Pytorch 2.1 on CUDA 12.1 due to nvcc segfaults | 2023-09-17 22:15:38 -07:00
Tri Dao | c984208ddb | Set block size to 64 x 64 for kvcache to avoid nvcc segfaults | 2023-09-17 16:14:58 -07:00
Tri Dao | 8c8b4d36e1 | Bump to v2.2.3 | 2023-09-16 01:47:01 -07:00
Tri Dao | ccbb14f38e | Implement rotary embedding in flash_attn_with_kvcache | 2023-09-16 01:20:16 -07:00
Tri Dao | 5400fdc4ac | [CE] Implement CrossEntropyLoss in Triton | 2023-09-15 20:05:28 -07:00
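Note on 5400fdc4ac: a minimal usage sketch, assuming the Triton loss lives at flash_attn.losses.cross_entropy and mirrors torch.nn.CrossEntropyLoss's interface for (N, vocab) logits and (N,) labels; the module path and keyword argument are assumptions from the package layout, not guaranteed by this commit.

    import torch
    from flash_attn.losses.cross_entropy import CrossEntropyLoss

    # Assumed drop-in replacement for torch.nn.CrossEntropyLoss on CUDA tensors.
    loss_fn = CrossEntropyLoss(ignore_index=-100)  # same default as torch
    logits = torch.randn(8, 32000, device="cuda", dtype=torch.float16, requires_grad=True)
    labels = torch.randint(0, 32000, (8,), device="cuda")
    loss = loss_fn(logits, labels)
    loss.backward()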
Tri Dao | 56b7fc6ee0 | Simplify the implementation of KVcache attn by appending KV first | 2023-09-13 15:55:48 -07:00
Tri Dao | d0032700d1 | Add tests for Pythia, GPT-JT, and RedPajama models | 2023-09-13 01:10:39 -07:00
Tri Dao | bb9beb3645 | Remove some unused headers | 2023-09-12 12:37:10 -07:00
Tri Dao | 08c295c043 | Bump to v2.2.2 | 2023-09-10 23:48:12 -07:00
Tri Dao | ee77b931b9 | Swap seqlen_q and nheads for MQA to speed it up (h/t Daniel Haziza) | 2023-09-10 22:56:33 -07:00
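Note on ee77b931b9 / 3250ff3d82: the trick credited to Daniel Haziza is to fold query heads into the sequence axis when there is one query position, since all query heads in a group attend to the same K/V head. A shape-level illustration of the idea; the kernel does the equivalent internally, and the names below are illustrative only.

    import torch

    batch, nheads, nheads_k, d = 2, 32, 4, 128
    ngroups = nheads // nheads_k
    q = torch.randn(batch, 1, nheads, d)  # a single query position (decoding)
    # Regroup so each KV head's ngroups query heads line up along the sequence
    # axis, giving the kernel a larger seqlen_q to parallelize over.
    q_swapped = q.view(batch, nheads_k, ngroups, d).transpose(1, 2)
    assert q_swapped.shape == (batch, ngroups, nheads_k, d)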
Kevin Hu | 07005806ff | Add BigCode converters (#532) | 2023-09-10 17:24:50 -07:00
Tri Dao | 8a733cbd53 | [Gen] Fix calling update_graph_cache in tests | 2023-09-10 17:22:37 -07:00
Kevin Hu | 4c91621a5e | Inverse state dict for BERT (#527) | 2023-09-09 01:44:21 -07:00
Tri Dao | a86442f0f3 | [Gen] Use flash_attn_with_kvcache in generation | 2023-09-07 08:24:43 -07:00
Tri Dao | a1576ad1e8 | Bump to v2.2.1 | 2023-09-06 02:19:55 -07:00
Tri Dao | 9795159082 | [Rotary] Set device before launching Triton kernel to avoid error | 2023-09-05 21:29:03 -07:00
Tri Dao | 6d673cd961 | Bump to v2.2.0 | 2023-09-05 11:34:13 -07:00
Kyeongpil Kang | 8e893f0950 | Create __init__.py for ops/triton dir (#516) | 2023-09-05 11:29:03 -07:00
Tri Dao | fd20f16a4e | Support cache_seqlens being integer | 2023-09-05 11:27:48 -07:00
Tri Dao | 913922cac5 | [Gen] Refactor decoding function | 2023-09-04 17:01:38 -07:00
Tri Dao | 3557e0bb8f | [MLP] Implement SwiGLU with torch jiterator | 2023-09-04 15:43:53 -07:00
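Note on 3557e0bb8f: SwiGLU gates a linear branch with SiLU, i.e. swiglu(x, y) = silu(x) * y = x * sigmoid(x) * y. A sketch of the elementwise op in plain PyTorch, plus an assumed fusion via torch's (private) jiterator API; the code string is a guess at how such a kernel could be written, not this commit's exact implementation.

    import torch
    import torch.nn.functional as F

    def swiglu_ref(x, y):
        # Reference SwiGLU: silu(x) * y == x * sigmoid(x) * y
        return F.silu(x) * y

    # Assumed jiterator fusion of the same elementwise op into one CUDA kernel.
    swiglu_jit = torch.cuda.jiterator._create_jit_fn(
        "template <typename T> T swiglu(T x, T y) "
        "{ return float(x) * float(y) / (1.0f + ::exp(-float(x))); }"
    )

    x = torch.randn(4, 1024, device="cuda", dtype=torch.float16)
    y = torch.randn_like(x)
    torch.testing.assert_close(swiglu_jit(x, y), swiglu_ref(x, y), rtol=1e-2, atol=1e-2)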
Tri Dao | 37c6e05406 | Implement flash_attn_with_kvcache | 2023-09-04 00:11:44 -07:00
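Note on 37c6e05406: flash_attn_with_kvcache attends against a pre-allocated K/V cache, optionally appending new k/v at the current cache length first (see 56b7fc6ee0); per fd20f16a4e, cache_seqlens may be a plain integer as well as a per-sequence tensor, and per ccbb14f38e rotary embedding can be applied inside the kernel. A minimal decoding-step sketch, with shapes assumed from the public API:

    import torch
    from flash_attn import flash_attn_with_kvcache

    batch, max_seqlen, nheads, nheads_k, d = 2, 4096, 32, 4, 128
    k_cache = torch.zeros(batch, max_seqlen, nheads_k, d, device="cuda", dtype=torch.float16)
    v_cache = torch.zeros_like(k_cache)

    # One decoding step: a single new query token; its k/v are appended into
    # the cache at position cache_seqlens (an int here, per fd20f16a4e).
    q = torch.randn(batch, 1, nheads, d, device="cuda", dtype=torch.float16)
    k = torch.randn(batch, 1, nheads_k, d, device="cuda", dtype=torch.float16)
    v = torch.randn_like(k)
    out = flash_attn_with_kvcache(q, k_cache, v_cache, k=k, v=v, cache_seqlens=128, causal=True)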
Tri Dao | 4976650f74 | Set single threaded compilation for CUDA 12.2 so CI doesn't OOM | 2023-09-03 23:42:55 -07:00
Tri Dao | 6a89b2f121 | Remove constexpr in launch template to fix CI compilation | 2023-09-03 22:59:41 -07:00
Tri Dao | 97ba7a62e9 | Try switching back to Cutlass 3.2.0 | 2023-09-03 22:45:35 -07:00
Tri Dao | 1dc1b6c8f2 | Bump to v2.1.2 | 2023-09-03 22:23:05 -07:00
Tri Dao | 0c04943fa2 | Require CUDA 11.6+, clean up setup.py | 2023-09-03 21:24:56 -07:00
Tri Dao | 798858f9f1 | Fix test_baichuan | 2023-09-03 21:01:37 -07:00
Tri Dao | 7b33743a72 | [Gen] Add back num_last_tokens in gpt.py | 2023-09-03 20:44:40 -07:00
Tri Dao | 5953c4f58c | Remove unused sdPsum in dot_do_o function | 2023-09-03 20:44:07 -07:00
Tri Dao | b28ec236df | [Rotary] Implement varlen rotary | 2023-09-03 17:57:10 -07:00
Tri Dao | 861c82577d | [Rotary] Clean up rotary Triton implementation a bit | 2023-09-03 16:41:17 -07:00
Tri Dao | 1c523c1ce1 | [Rotary] Speed up rotary kernel when interleaved=True | 2023-09-03 16:24:37 -07:00
Tri Dao | 26d7d92f3d | Fix splitKV combine function when local LSEs are all -inf | 2023-09-03 11:39:09 -07:00
Tri Dao | de2949f37d | [Rotary] Pass max_seqlen from mha.py to rotary during inference | 2023-09-03 11:37:06 -07:00
Tri Dao | 942fcbf046 | [Rotary] Implement rotary in Triton | 2023-09-03 02:51:58 -07:00
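Note on 942fcbf046: the Triton rotary kernel is exposed through apply_rotary_emb, with varlen support added in b28ec236df (cu_seqlens/max_seqlen) and max_seqlen threaded through from mha.py in de2949f37d. A fixed-length sketch, with argument names and shapes assumed from flash_attn.layers.rotary:

    import torch
    from flash_attn.layers.rotary import apply_rotary_emb

    batch, seqlen, nheads, d, rotary_dim = 2, 512, 16, 64, 64
    x = torch.randn(batch, seqlen, nheads, d, device="cuda", dtype=torch.float16)
    # cos/sin tables of shape (seqlen, rotary_dim / 2)
    inv_freq = 1.0 / (10000 ** (torch.arange(0, rotary_dim, 2, device="cuda") / rotary_dim))
    freqs = torch.outer(torch.arange(seqlen, device="cuda"), inv_freq)
    out = apply_rotary_emb(x, freqs.cos().half(), freqs.sin().half(), interleaved=False)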
Tri Dao | 08e9847176 | [CI] Add CUDA 12.2 | 2023-09-03 02:45:42 -07:00
Sophia Wisdom | 37e32febba | Remove commented out code in bwd (#512) | 2023-09-01 16:43:58 -07:00
    * Remove lots of comments
    * Remove unused traits
Sophia Wisdom | dd8a754915 | Remove old code in utils.h (#511) | 2023-09-01 15:32:09 -07:00
Aman Gupta Karmani | 866a9d33f9 | bump cutlass submodule (#504) | 2023-08-30 10:32:04 -07:00
dan_the_3rd | c9d4a816fa | Support LLaMa2 and CodeLLaMa (#491) | 2023-08-30 10:31:14 -07:00
    Co-authored-by: danthe3rd <danthe3rd>
dan_the_3rd | 011ec323d6 | Support MQA + MP for decoding (#490) | 2023-08-30 10:29:54 -07:00
    Co-authored-by: danthe3rd <danthe3rd>
GAOXinyu | 0cb595ad94 | [bugfix] handle_x not define when using checkpoint_lvl = 2 (#502) | 2023-08-29 23:46:10 -07:00
    When using checkpoint_lvl=2, all_gather_raw(x) is called without
    async_op=True, so there is no handle to wait on; just skip the wait.