flash-attention

Author	SHA1	Message	Date
Tri Dao	dfe29f5e2b	[Gen] Don't use ft_attention, use flash_attn_with_kvcache instead	2023-09-18 15:29:06 -07:00
Tri Dao	ccbb14f38e	Implement rotary embedding in flash_attn_with_kvcache	2023-09-16 01:20:16 -07:00
Tri Dao	a86442f0f3	[Gen] Use flash_attn_with_kvcache in generation	2023-09-07 08:24:43 -07:00
Tri Dao	fd20f16a4e	Support cache_seqlens being integer	2023-09-05 11:27:48 -07:00
Tri Dao	913922cac5	[Gen] Refactor decoding function	2023-09-04 17:01:38 -07:00
dan_the_3rd	011ec323d6	Support MQA + MP for decoding (#490) Co-authored-by: danthe3rd <danthe3rd>	2023-08-30 10:29:54 -07:00
Tri Dao	9f42cb6e7a	[Gen] Clone logits before returning when cg=True	2023-08-27 23:19:58 -07:00
Tri Dao	f8aea6ead0	[GPT] Generalize last_token_only arg to num_last_tokens	2023-08-26 20:47:53 -07:00
Tri Dao	371e20658c	[GPT] Test generation when passing in multiple tokens	2023-08-26 13:56:41 -07:00
Tri Dao	c000c3a2c0	[GPT] Move more tests to test_gpt.py	2023-08-26 13:00:40 -07:00
Tri Dao	9b713872ea	[GPT] Move GPT and OPT generation tests to test_{gpt,opt}.py	2023-08-26 12:55:02 -07:00
Tri Dao	0e8c46ae08	Run isort and black on test files	2023-08-18 20:59:35 -07:00
Tri Dao	4d87e4d875	Implement GPT-J	2023-03-22 16:16:58 -07:00
Tri Dao	88173a1aaf	[FusedDense] Support relu, rename FusedDenseGeluDense -> FusedMLP	2023-01-17 18:12:27 -08:00
Tri Dao	ff34123bd4	Reorder LN in Block, support OPT	2023-01-15 22:14:31 -08:00
Tri Dao	63670fd84a	Implement generation for GPT	2022-12-27 21:01:50 -08:00
Tri Dao	9d797d8848	Support loading GPT2 weights from Huggingface	2022-12-27 11:22:48 -08:00

17 Commits