Commit Graph

284 Commits

Author SHA1 Message Date
Tri Dao
9e5e8bc91e Change causal mask to be aligned to bottom-right instead of top-left 2023-08-24 23:41:07 -07:00
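For context on this change: with bottom-right alignment, when seqlen_q != seqlen_k the causal mask is anchored so the last query position attends to the full key sequence, instead of the first query position starting at key 0. A minimal PyTorch sketch of the two conventions (illustration only, not the kernel code):

```
import torch

def causal_mask(seqlen_q, seqlen_k, align="bottom_right"):
    """True = allowed to attend. Shows the two alignment conventions."""
    i = torch.arange(seqlen_q)[:, None]
    j = torch.arange(seqlen_k)[None, :]
    if align == "top_left":
        return j <= i                           # row 0 sees only key 0
    return j <= i + (seqlen_k - seqlen_q)       # last row sees all keys

print(causal_mask(2, 5, "top_left").int())
print(causal_mask(2, 5, "bottom_right").int())
```

With 2 queries against 5 keys, top-left lets the first query see only key 0, while bottom-right lets the last query see all 5 keys.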
Aman Gupta Karmani
e0b09891c6
add llama support to GPTPreTrainedModel.from_pretrained (#479) 2023-08-24 16:31:16 -07:00
Tri Dao
6711b3bc40 Bump version to 2.0.9 2023-08-22 00:21:14 -07:00
Tri Dao
ef6d8c75d9 [GPT] Fix loading weights from HF hub 2023-08-21 22:56:02 -07:00
GAOXinyu
a8c35b4f57
FEAT: add code supporting baichuan-inc/Baichuan-7B (#425) 2023-08-21 11:05:06 -07:00
Xuechen Li
25d6b1dbcb
handle uneven heads across ranks when combining state_dicts; resolves #467 (#468)
* q

* add comment.
2023-08-20 14:57:34 -07:00
Tri Dao
d431f16751 Import torch before flash_attn_2_cuda 2023-08-19 21:07:33 -07:00
Xuechen Li
7fcd3e6a04
map custom model state_dict back to huggingface format (#465)
* fix name.

* set inv function.

* add map back function.

* handle gqa.

* add type annotation to avoid confusion.

* fix docstr.

* test inverse remap logic.
2023-08-18 20:51:39 -07:00
Tri Dao
f1a73d0740 Run isort and black on python files 2023-08-18 14:22:11 -07:00
Xuechen Li
bb4cded17b
support when num_heads is not divisible by world_size; resolves #459 (#461)
* unequal rank.

* trim.

* enable passing in number of heads for each rank.

* simplify.

* simplify.

* cleanup.

* fix col parallel.

* fix bug with row parallel.

* fit out proj.

* refac.

* fix sharding logic.

* refac sharding.

* refac.

* support multiple of.

* make fn reusable.

* fix bug in dimensions.

* scaffold.

* test uneven heads.

* fix test by adding barrier.

* refac.

* reuse code.

* clean up.
2023-08-18 14:10:35 -07:00
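As a rough illustration of the sharding that #461 (and the state_dict combining in #468 above) deals with: when num_heads is not divisible by world_size, one simple scheme gives the first num_heads % world_size ranks one extra head each. A sketch under that assumption, not the library's actual helper:

```
def heads_per_rank(num_heads: int, world_size: int) -> list[int]:
    """One way to split heads unevenly: earlier ranks take the remainder."""
    base, rem = divmod(num_heads, world_size)
    return [base + (1 if rank < rem else 0) for rank in range(world_size)]

def head_slice(num_heads: int, world_size: int, rank: int) -> slice:
    """Contiguous head indices owned by `rank` under the split above."""
    counts = heads_per_rank(num_heads, world_size)
    start = sum(counts[:rank])
    return slice(start, start + counts[rank])

assert heads_per_rank(14, 4) == [4, 4, 3, 3]
assert head_slice(14, 4, 2) == slice(8, 11)
```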
Tri Dao
ada4710d70 [ViT] Run black on vit.py 2023-08-17 17:45:09 -07:00
Tri Dao
a81900d4c1 [ViT] Minor fix so it runs 2023-08-17 17:25:34 -07:00
Tri Dao
4b661a569d [GPT] Run black on gpt.py 2023-08-16 23:47:50 -07:00
Tri Dao
bec5b3d374 [MHA] Run black on mha.py 2023-08-16 23:47:13 -07:00
Tri Dao
cb0daccc41 [FusedDense] Allow Row/ColumnParallelLinear to have uneven split 2023-08-16 23:43:35 -07:00
Tri Dao
bcfa7c9751 [FusedDense] Run black on fused_dense.py 2023-08-16 23:41:36 -07:00
Tri Dao
c65b5106ac Fix Bwd NaN for varlen when seqlen_q >> seqlen_k and causal 2023-08-16 15:12:36 -07:00
Xuechen Li
0f7853c6a1
enable loading hf llama checkpoints for training (#446)
* prelim.

* add hf conversion fn.

* mlp.

* change name.

* fix bug.

* inverse permute.

* change comment.

* revert style changes.

* fix.

* add doc.

* revert.

* enable load safe.

* fix safe load.

* fix import.

* fix typing-related lints.

* fix ckpt loading logic.

* make single gpu work.

* test with parallel.

* ckpt format.

* enable pretrained state dict.

* remove unused imports.

* remove unused.

* mark idea related.
2023-08-15 08:33:15 -07:00
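One piece of the HF LLaMA checkpoint remapping in #446: Hugging Face's conversion script permutes the q/k projection weights for its rotary layout, so loading an HF checkpoint into a model that expects the original layout needs the inverse permutation. The helper names and shapes below are illustrative assumptions, not the repository's exact code:

```
import torch

def permute(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Layout change applied by HF's LLaMA conversion script
    # (per head: interleaved rotary pairs -> first/second halves).
    dim1, dim2 = w.shape
    return (w.view(n_heads, dim1 // n_heads // 2, 2, dim2)
            .transpose(1, 2).reshape(dim1, dim2))

def inv_permute(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Undo that permutation when mapping an HF checkpoint back.
    dim1, dim2 = w.shape
    return (w.view(n_heads, 2, dim1 // n_heads // 2, dim2)
            .transpose(1, 2).reshape(dim1, dim2))

w = torch.randn(64, 64)  # e.g. 8 heads of dim 8
assert torch.equal(inv_permute(permute(w, 8), 8), w)
```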
Tri Dao
c60851a825 Bump to v2.0.7 2023-08-14 14:55:35 -07:00
Tri Dao
f8dccfc90a [CI] Fix MATRIX_CUDA_VERSION check 2023-08-14 10:27:26 -07:00
Tri Dao
9c531bdc0a Use single thread compilation for cuda12.1, torch2.1 to avoid OOM CI 2023-08-14 10:03:31 -07:00
Tri Dao
67ae6fd74b Bump to v2.0.6 2023-08-13 16:52:48 -07:00
Tri Dao
c5e87b11e9 Bump to v2.0.5 2023-08-13 13:55:04 -07:00
Tri Dao
364a5b4a71 [MLP] Change the check for out_features being None 2023-08-10 00:04:38 -07:00
Tri Dao
d30f2e1cd5 Bump to v2.0.4 2023-08-01 09:01:07 -07:00
Tri Dao
a4e5d1eddd Bump to v2.0.3 2023-07-31 17:49:23 -07:00
Tri Dao
8f4cd4c16b [Docs] Fix docstring about Q nheads being divisible by KV nheads 2023-07-31 17:47:03 -07:00
Tri Dao
184b992dcb [GPT] Implement parallel LLaMa 2023-07-28 15:52:48 -10:00
Tri Dao
840f7925a0 [Docs] Fix mention of MQA/GQA in qkvpacked functions 2023-07-28 12:26:29 -10:00
Tri Dao
60499abcfd [Benchmark] Add script to benchmark FlashAttention 2023-07-28 00:26:52 -10:00
Kirthi Shankar Sivamani
32a953f486
Request for v2.0.2 (#388)
* Bump version to 2.0.2

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update version in Dockerfile

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
2023-07-28 02:46:03 -07:00
Kirthi Shankar Sivamani
a03f6f8e9e
Enable CUDA graphs (#386)
* Add RNG state to kernel launch params

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Save seed and offset for backward

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Single thread write to global mem

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* compute_dq_dk_dv_1colblock get seed and offset from launch params

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* compute_dq_dk_dv_1rowblock get seed and offset from launch params

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change forward c++ APIs to save RNG state for backward

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Change backward c++ APIs to set RNG state for bprop launcher

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Bug fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Python side API changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Bug fix; only save seeds instead of full offset

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Account for 3D grid size

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
2023-07-27 16:11:34 -07:00
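The RNG-state changes in #386 are what make the kernels capturable with CUDA graphs: the dropout seed/offset must live in the kernel launch parameters instead of being read from host-side state at launch time. The standard PyTorch capture/replay pattern this enables looks roughly like the following (generic sketch; a model containing flash-attn layers would take the place of the Linear here):

```
import torch

model = torch.nn.Linear(1024, 1024, device="cuda", dtype=torch.float16)
static_x = torch.randn(8, 1024, device="cuda", dtype=torch.float16)

# Warm up on a side stream, then capture one forward pass into a graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = model(static_x)

# Replay with new data by copying into the captured input buffer.
static_x.copy_(torch.randn_like(static_x))
g.replay()
print(static_y.shape)
```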
Tri Dao
4c98d0b41f [MLP] Edit ParallelGatedMlp 2023-07-26 09:39:37 -10:00
Haodong Lyu
8ee62efca3
Implement ParallelGatedMlp (#251) 2023-07-26 12:14:15 -07:00
Tri Dao
b252072409 Bump to v2.0.1 2023-07-23 12:33:42 -10:00
Tri Dao
d38357dd2f [GPT] Implement Falcon 2023-07-23 10:32:29 -07:00
Kiarash Jamali
684196b8c5
Allow rotary embeddings for Bert (#363) 2023-07-23 00:21:45 -07:00
Tri Dao
425dbcb6c6 [MHA] Implement MQA/GQA 2023-07-23 00:06:58 -07:00
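In MQA/GQA, Q has more heads than K/V and each group of query heads shares one K/V head, which is why Q's head count must be divisible by the K/V head count (see the docstring fix in 8f4cd4c16b above). A plain reference sketch, not the fused kernel:

```
import torch

def gqa_attention(q, k, v):
    """q: (B, S, Hq, D); k, v: (B, S, Hkv, D) with Hq % Hkv == 0."""
    b, s, hq, d = q.shape
    hkv = k.shape[2]
    assert hq % hkv == 0, "Q heads must be a multiple of KV heads"
    # Expand each KV head to serve its group of query heads.
    k = k.repeat_interleave(hq // hkv, dim=2)
    v = v.repeat_interleave(hq // hkv, dim=2)
    attn = torch.einsum("bshd,bthd->bhst", q, k) / d ** 0.5
    attn = attn.softmax(dim=-1)
    return torch.einsum("bhst,bthd->bshd", attn, v)

q = torch.randn(2, 16, 8, 64)
k = v = torch.randn(2, 16, 2, 64)
print(gqa_attention(q, k, v).shape)  # torch.Size([2, 16, 8, 64])
```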
Tri Dao
ec9f74ab9a [Rotary] Don't store inv_freq in state_dict 2023-07-22 23:52:42 -07:00
Tri Dao
75e334d407 [MLP] Add ParallelMLP 2023-07-22 23:45:51 -07:00
Tri Dao
b3177dfaf6 [GPT] Enable FlashAttention for GPT-J 2023-07-21 17:29:10 -07:00
Tri Dao
6fc1e07da2 [Block] Re-enable DropPath 2023-07-21 16:39:23 -07:00
Tri Dao
b4cc152e97 Make sure dout is contiguous 2023-07-17 21:54:44 -07:00
Tri Dao
4f285b3547 FlashAttention-2 release 2023-07-17 06:21:34 -07:00
Tri Dao
6d48e14a6c Bump to v1.0.9 2023-07-17 03:16:40 -07:00
Volodymyr Kyrylov
70ab266a56 rotary: update cos/sin cache when switching from inference mode
This resolves RuntimeErrors after running evaluation in inference mode:

```
  File "/home/proger/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/proger/.local/lib/python3.10/site-packages/flash_attn/modules/mha.py", line 492, in forward
    qkv = self.rotary_emb(qkv)
  File "/home/proger/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/proger/.local/lib/python3.10/site-packages/flash_attn/layers/rotary.py", line 229, in forward
    return apply_rotary_emb_qkv_(
  File "/home/proger/.local/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
RuntimeError: Inference tensors cannot be saved for backward. To work around you can make a clone to get a normal tensor and use it in autograd.
```
2023-07-08 12:01:07 +02:00
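The fix rebuilds the rotary cos/sin cache if it was created under torch.inference_mode(), since inference tensors cannot be saved for backward. A rough sketch of the guard, with illustrative names rather than the module's exact code:

```
import torch

class RotaryCache(torch.nn.Module):
    def __init__(self, dim: int, base: float = 10000.0):
        super().__init__()
        self.dim, self.base = dim, base
        self._seq_len_cached = 0
        self._cos_cached = None
        self._sin_cached = None

    def _update_cos_sin_cache(self, seqlen: int, device, dtype):
        need_update = (
            seqlen > self._seq_len_cached
            or self._cos_cached is None
            or self._cos_cached.device != device
            or self._cos_cached.dtype != dtype
            # Key point of the fix: a cache built under inference_mode()
            # cannot be saved for backward, so rebuild it in normal mode.
            or (self._cos_cached.is_inference()
                and not torch.is_inference_mode_enabled())
        )
        if need_update:
            self._seq_len_cached = seqlen
            # Frequencies are computed in float32 and cast at the end.
            inv_freq = 1.0 / (self.base ** (
                torch.arange(0, self.dim, 2, device=device, dtype=torch.float32)
                / self.dim))
            t = torch.arange(seqlen, device=device, dtype=torch.float32)
            freqs = torch.outer(t, inv_freq)
            self._cos_cached = freqs.cos().to(dtype)
            self._sin_cached = freqs.sin().to(dtype)
        return self._cos_cached, self._sin_cached
```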
Tri Dao
d2f4324f4c [LayerNorm] Make sure memory addresses are aligned to 16 bytes 2023-07-04 14:53:12 -07:00
Tri Dao
e8a0b4acdd [Doc] Change total -> total_q 2023-07-02 17:23:52 -07:00
Tri Dao
9610114ce8 Bump to v1.0.8 2023-07-02 17:04:54 -07:00
Tri Dao
62e9814466 [Rotary] Make sure frequency calculation is in fp32 2023-07-02 16:39:39 -07:00
ljss
8e44c0eefb
Fix a bug 2023-06-02 13:46:19 +08:00
Tri Dao
85b51d61ee Bump version to 1.0.7 2023-05-30 14:18:44 -07:00
Tri Dao
48bc6eacd6 [Gen] Add rotary base as an argument to FT attention kernel 2023-05-30 13:38:34 -07:00
Kirthi Shankar Sivamani
dd9c3a1fc2 bump to v1.0.6
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
2023-05-26 17:44:10 -07:00
Max H. Gerlach
31f78a9814 Allow adding an optional local version to the package version 2023-05-19 17:27:41 +02:00
Federico Berto
69f5f7d0a2 [BugFix] cannot unpack non-iterable NoneType object 2023-05-07 03:07:44 +09:00
Federico Berto
3889ba168b [BugFix] cannot unpack non-iterable NoneType object 2023-05-07 03:07:30 +09:00
Tri Dao
a9a4b4e4f2 [LLaMa] Fix last norm layer to use RMSNorm instead of LayerNorm 2023-05-04 23:39:43 -07:00
Tri Dao
fcab93b43a [Gen] Minor tweak to allocate_inference_cache 2023-04-21 11:56:47 -07:00
Tri Dao
ba2fe7f378 [Gen] Move allocate_inference_cache to within the model 2023-04-20 18:15:12 -07:00
Tri Dao
3da42d24b1 [GPT] Add option to only return the logit for the last token 2023-04-20 17:21:08 -07:00
Tri Dao
311d6606bf [Gen] Fix FT kernel smem size, CG when batch size changed 2023-04-20 17:03:13 -07:00
Tri Dao
96d10f6545 Implement LLaMa 2023-04-18 21:51:35 -07:00
Tri Dao
b630aef53f Implement GatedMlp 2023-04-18 03:37:14 -07:00
Tri Dao
ac3b684cdb Have a separate nn.Dropout module in SelfAttention module 2023-04-17 22:34:05 -07:00
Kirthi Shankar Sivamani
a0997bc77c
Merge branch 'HazyResearch:main' into enable_cuda_graph_capture 2023-04-14 21:45:37 -07:00
Tri Dao
605655bc66 [Gen] Fix FT kernel when using CG 2023-04-14 16:50:01 -07:00
Kirthi Shankar Sivamani
081c2b012a
Merge branch 'HazyResearch:main' into enable_cuda_graph_capture 2023-04-13 19:36:45 -07:00
Tri Dao
1c9ef9b399 [Gen] Measure prompt processing + decoding time, not just decoding 2023-04-13 15:39:56 -07:00
Tri Dao
6f6e9a9aaf [FusedDense] Enable sqrelu activation in FusedMLP 2023-04-13 15:29:32 -07:00
Kirthi Shankar Sivamani
7d25a4ec4f Handle FlashAttnQKVPackedSplitFunc by making rng_state optional in backward
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
2023-04-13 06:25:52 +00:00
Kirthi Shankar Sivamani
315fd31f0c
Merge branch 'HazyResearch:main' into enable_cuda_graph_capture 2023-04-12 22:42:24 -07:00
Zhiyuan Chen
8c42415664
make mlp hidden_features defaults to 4*in_features 2023-04-13 11:08:21 +08:00
Kirthi Shankar Sivamani
31018c5fa0 Support CUDA graph capture
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
2023-04-12 16:53:22 -07:00
Tri Dao
d6fc860573
Merge pull request #147 from ksivaman/add_deterministic_execution_option
Add option for deterministic execution
2023-03-31 17:32:50 -04:00
Tri Dao
393882bc08 [LayerNorm] Implement LN with parallel residual, support dim 8k 2023-03-31 14:23:45 -07:00
Kirthi Shankar Sivamani
b6aa059bbf Add option for deterministic execution 2023-03-30 18:23:35 -07:00
Tri Dao
993d12448e Implement GPT-NeoX 2023-03-29 01:21:25 -07:00
Tri Dao
f5d0fbd468 [FT] Fix FT's single query attention for bf16 hdim128 rotary 2023-03-28 21:27:00 -07:00
Tri Dao
4d87e4d875 Implement GPT-J 2023-03-22 16:16:58 -07:00
Tri Dao
5d079fdd7a [Triton] Fix benchmark_causal, mention Triton version 2023-03-22 00:51:16 -07:00
Tri Dao
dc08ea1c33 Support H100 for other CUDA extensions 2023-03-15 16:59:27 -07:00
Vik Paruchuri
3165398074 Remove unused kwargs in flashattention 2023-03-15 10:36:19 -07:00
Tri Dao
e45a46a5b7 [Rotary] Implement GPT-J style (interleaved) rotary 2023-03-14 14:35:53 -07:00
Tri Dao
78b7a1dc18 [OPT] Load fp16 weights on CPU before moving to GPU 2023-01-22 17:01:32 -08:00
Tri Dao
eb33e587e9 [LayerNorm] Rename x1 -> residual 2023-01-19 13:07:27 -08:00
Tri Dao
f68d41ec77 [Gen] Add OPT to generation test 2023-01-17 19:59:06 -08:00
Tri Dao
88173a1aaf [FusedDense] Support relu, rename FusedDenseGeluDense -> FusedMLP 2023-01-17 18:12:27 -08:00
Tri Dao
780e8eeabb [ViT] Support timm checkpoint, add tests 2023-01-16 01:20:34 -08:00
Tri Dao
2ec7d3f72c
Merge pull request #105 from jamaliki/patch-1
Change default dropout value in documentation
2023-01-15 23:01:20 -08:00
Tri Dao
ef085cfcda [ViT] Fix extra norm_0, use new LN order in Block 2023-01-15 22:58:56 -08:00
Tri Dao
ff34123bd4 Reorder LN in Block, support OPT 2023-01-15 22:14:31 -08:00
Tri Dao
7c2191542a [Gen] Make generation work with Tensor Parallel 2023-01-15 11:34:27 -08:00
Kiarash Jamali
41cb909741
Change default dropout value in documentation
The documentation says the default is 0.1, but the code defaults attention_dropout to 0.0
2023-01-13 10:50:07 +00:00
Tri Dao
f95c2fc108 [Gen] Remove commented code 2023-01-07 19:06:39 -08:00
Tri Dao
b48599002a [Gen] Add timing option 2023-01-07 19:05:09 -08:00
Tri Dao
0938298e4c [Gen] Adjust shape of kv_cache when using FT 2023-01-07 17:27:54 -08:00
Tri Dao
e02fd588aa [Gen] Implement top-k and top-p sampling 2023-01-07 17:00:02 -08:00
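A reference sketch of combined temperature / top-k / top-p (nucleus) filtering, the standard sampling scheme this entry refers to (illustrative, not the repository's generation utilities):

```
import torch

def sample_next_token(logits, top_k=0, top_p=0.0, temperature=1.0):
    """logits: (batch, vocab). top_k=0 and top_p=0.0 disable each filter."""
    logits = logits / temperature
    if top_k > 0:
        kth_value = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    if top_p > 0.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        cum_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
        remove = cum_probs > top_p
        # Shift right so the first token past the threshold is kept.
        remove[..., 1:] = remove[..., :-1].clone()
        remove[..., 0] = False
        logits = logits.masked_fill(
            remove.scatter(-1, sorted_idx, remove), float("-inf"))
    probs = logits.softmax(dim=-1)
    return torch.multinomial(probs, num_samples=1)

next_tok = sample_next_token(torch.randn(2, 32000), top_k=50, top_p=0.9)
print(next_tok.shape)  # torch.Size([2, 1])
```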
Tri Dao
11be742aa3 [Gen] Test generation with rotary embedding 2023-01-07 14:37:54 -08:00
Tri Dao
8d9674ed08
Merge pull request #102 from Lamikins/main
fixed cross attention typeerror
2023-01-07 13:56:20 -08:00
Tri Dao
93383bd55b [TP] Implement TensorParallel without sequence parallel 2023-01-07 13:45:22 -08:00
Darius Lam
aec35fd67c fixed cross attention typeerror 2023-01-07 12:58:41 -08:00
Tri Dao
6738d9477d [LayerNorm] Implement RMS Norm 2023-01-06 17:34:22 -08:00
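RMSNorm drops LayerNorm's mean-centering and bias and normalizes by the root-mean-square alone: y = x / sqrt(mean(x^2) + eps) * weight. A plain PyTorch reference (the repository fuses this with dropout and the residual add in CUDA; this is just the math):

```
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No mean subtraction and no bias, unlike LayerNorm.
        rms = x.float().pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return (self.weight * x.float() * rms).to(x.dtype)

x = torch.randn(2, 16, 512, dtype=torch.float16)
print(RMSNorm(512)(x).shape)  # torch.Size([2, 16, 512])
```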
Tri Dao
a668890fcd [Gen] Add option to run generation with FT attention kernel 2023-01-03 22:10:31 -08:00
Tri Dao
4cab4de5ea [TP] Put parallel embeddings in separate modules 2023-01-02 08:47:48 -08:00
Tri Dao
1ec09ebd90 [FusedDense] Limit matrix dims to 2M (instead of 64k) 2023-01-01 17:06:39 -08:00
Tri Dao
714c1b4f0f [Bert] Fix embedding layer norm before embedding dropout 2023-01-01 10:38:05 -08:00
Tri Dao
ef1ba918c6 [GPT] Refactor function to shard state_dict for TensorParallel 2023-01-01 00:09:33 -08:00
Tri Dao
65b4064b2a [FusedDense] Kick off input all_gather before weight dtype conversion 2022-12-31 22:47:34 -08:00
Tri Dao
85b8e3d334 [Docs] Mention that XPos's scale_base is recommended to be 512 2022-12-29 20:25:02 -08:00
Tri Dao
a6ec1782dc Bump to v0.2.6 2022-12-27 22:05:20 -08:00
Tri Dao
63670fd84a Implement generation for GPT 2022-12-27 21:01:50 -08:00
Tri Dao
9d797d8848 Support loading GPT2 weights from Huggingface 2022-12-27 11:22:48 -08:00
Tri Dao
c6ecd40a59 Tweak CrossEntropyLoss to take process_group in init 2022-12-27 10:47:43 -08:00
Tri Dao
b4018a5028 Implement Tensor Parallel for GPT model 2022-12-26 16:22:43 -08:00
Tri Dao
78225c5366 Implement Tensor Parallel for GPT2Embeddings 2022-12-25 14:29:53 -08:00
Tri Dao
a8cfe51551 Implement Tensor Parallel for transformer Block 2022-12-25 14:08:21 -08:00
Tri Dao
1e712ea8b0 Implement TensorParallel for MHA 2022-12-25 11:39:55 -08:00
Tri Dao
226a1b721d Implement TensorParallel for FusedDense and FusedDenseGeluDense 2022-12-24 11:48:56 -08:00
Tri Dao
dff68c2b22 Add smoothing for CrossEntropyParallel, rename to CrossEntropyLoss 2022-12-23 14:51:08 -08:00
Tri Dao
e68ebbe89a Simplify FusedDense 2022-12-22 21:25:31 -08:00
Tri Dao
496e4f528c Implement XPos (Sun et al.) 2022-12-21 14:17:58 -08:00
Tri Dao
13cdceb377 Implement last_layer_subset optimization for BERT 2022-12-19 22:18:46 -08:00
Tri Dao
5fb6df0e04 Implement BERT 2022-12-18 21:47:27 -08:00
Alexander Ploshkin
ee8984d2be add asserts for sin shape 2022-12-17 13:34:57 +04:00
Alexander Ploshkin
c7c66976cc fix slicing dimensions 2022-12-16 15:39:06 +04:00
Alexander Ploshkin
96656b9323 Remove redundant shape asserts in rotary embeddings 2022-12-15 18:13:21 +04:00
Tri Dao
6b5f271c6d [Triton] Avoid einops repeat by using Tensor.expand 2022-12-14 14:48:41 -08:00
Tri Dao
88c4e5dbf6 Fix the case when dout is not contiguous 2022-12-13 13:58:17 -08:00
Tri Dao
5db330519a [LayerNorm] Support taking subset of input or subset of output 2022-12-12 22:16:14 -08:00
Tri Dao
ae137ed17a [LayerNorm] Fuse LayerScale 2022-12-10 23:28:23 -08:00
Tri Dao
8c6609ae1a [LayerNorm] Support all dimensions up to 6k (if divisible by 8) 2022-12-09 02:06:22 -08:00
Tri Dao
1feb94265c [ViT] Use dropout_add_ln for the 1st layer norm 2022-11-23 12:48:56 -08:00
Tri Dao
b8ccd20098 [Triton] Fix variable name from qkv to kv (h/t FrankZijlstra) 2022-11-22 02:07:32 -08:00
Tri Dao
054816177e Bump version to 0.2.1 2022-11-20 22:35:59 -08:00
Tri Dao
0fa5c0d7ef Add PatchEmbed 2022-11-17 16:56:06 -08:00
Tri Dao
ece539abd6 Add __init__.py files to subdirectories for installation 2022-11-17 16:55:44 -08:00
Tri Dao
71f674ae23 [Rotary] Customize base, support seqlen_offset 2022-11-17 11:43:36 -08:00
Tri Dao
2e33fc8e36 Add GPT and ViT models 2022-11-13 22:30:23 -08:00
Tri Dao
d4b320b31f Add MLP, MHA, Block, Embedding modules 2022-11-13 22:06:44 -08:00
Tri Dao
fa6d1ce44f Add fused_dense and dropout_add_layernorm CUDA extensions 2022-11-13 21:59:20 -08:00
Tri Dao
343492ec30 Make nccl operations async in CrossEntropyLossParallel 2022-11-13 17:27:26 -08:00
Tri Dao
7c9953815a Add fused cross entropy loss 2022-11-12 21:58:41 -08:00
Tri Dao
55797f32c9 Remove RotaryEmbedding from FlashAttention module
To avoid an import error when rotary_emb isn't installed
2022-11-10 11:54:36 -08:00
Tri Dao
908a5b2244 Set num_warps=4 for headdim=64 in Triton fw (h/t Michael Benesty) 2022-11-07 08:58:16 -08:00
Tri Dao
7479757191 Fix pipelining bug in Triton bwd with bias_type=matrix 2022-11-06 11:50:35 -08:00
Tri Dao
557781933d Parallelize CUDA bwd along seqlen_k instead of seqlen_q
This is faster since we only need to do atomic adds on dq, instead of atomic
adds on both dk and dv.
2022-11-05 16:26:17 -07:00
Tri Dao
ca81f32e04 Implement rotary embedding in CUDA 2022-11-04 22:42:01 -07:00
Tri Dao
62025e1aff Fix more race condition in Triton bwd when there's bias 2022-11-04 12:53:09 -07:00
Tri Dao
ff78ea4123 Fix race condition in Triton bwd when there's bias 2022-11-04 11:20:27 -07:00