flash-attention

Author	SHA1	Message	Date
Tri Dao	9f42cb6e7a	[Gen] Clone logits before returning when cg=True	2023-08-27 23:19:58 -07:00
Tri Dao	f8aea6ead0	[GPT] Generalize last_token_only arg to num_last_tokens	2023-08-26 20:47:53 -07:00
Tri Dao	7a3bd55f1a	[Gen] Fix decode function not using top_p during iterative decoding	2023-08-26 15:14:41 -07:00
Tri Dao	847abe653c	[Gen] Refactor decode function a bit	2023-08-26 14:47:25 -07:00
Tri Dao	f1a73d0740	Run isort and black on python files	2023-08-18 14:22:11 -07:00
Tri Dao	fcab93b43a	[Gen] Minor tweak to allocate_inference_cache	2023-04-21 11:56:47 -07:00
Tri Dao	ba2fe7f378	[Gen] Move allocate_inference_cache to within the model	2023-04-20 18:15:12 -07:00
Tri Dao	3da42d24b1	[GPT] Add option to only return the logit for the last token	2023-04-20 17:21:08 -07:00
Tri Dao	311d6606bf	[Gen] Fix FT kernel smem size, CG when batch size changed	2023-04-20 17:03:13 -07:00
Tri Dao	605655bc66	[Gen] Fix FT kernel when using CG	2023-04-14 16:50:01 -07:00
Tri Dao	1c9ef9b399	[Gen] Measure prompt processing + decoding time, not just decoding	2023-04-13 15:39:56 -07:00
Tri Dao	f5d0fbd468	[FT] Fix FT's single query attention for bf16 hdim128 rotary	2023-03-28 21:27:00 -07:00
Tri Dao	4d87e4d875	Implement GPT-J	2023-03-22 16:16:58 -07:00
Tri Dao	78b7a1dc18	[OPT] Load fp16 weights on CPU before moving to GPU	2023-01-22 17:01:32 -08:00
Tri Dao	f68d41ec77	[Gen] Add OPT to generation test	2023-01-17 19:59:06 -08:00
Tri Dao	7c2191542a	[Gen] Make generation work with Tensor Parallel	2023-01-15 11:34:27 -08:00
Tri Dao	f95c2fc108	[Gen] Remove commented code	2023-01-07 19:06:39 -08:00
Tri Dao	b48599002a	[Gen] Add timing option	2023-01-07 19:05:09 -08:00
Tri Dao	e02fd588aa	[Gen] Implement top-k and top-p sampling	2023-01-07 17:00:02 -08:00
Tri Dao	a668890fcd	[Gen] Add option to run generation with FT attention kernel	2023-01-03 22:10:31 -08:00
Tri Dao	a6ec1782dc	Bump to v0.2.6	2022-12-27 22:05:20 -08:00
Tri Dao	63670fd84a	Implement generation for GPT	2022-12-27 21:01:50 -08:00

22 Commits