Commit Graph

380 Commits

Author SHA1 Message Date
Tri Dao
c60851a825 Bump to v2.0.7 2023-08-14 14:55:35 -07:00
Aman Gupta Karmani
aab603af4f fix binary wheel installation when nvcc is not available (#448) 2023-08-14 14:54:26 -07:00
Tri Dao
f8dccfc90a [CI] Fix MATRIX_CUDA_VERSION check 2023-08-14 10:27:26 -07:00
Tri Dao
9c531bdc0a Use single thread compilation for cuda12.1, torch2.1 to avoid OOM CI 2023-08-14 10:03:31 -07:00
Tri Dao
67ae6fd74b Bump to v2.0.6 2023-08-13 16:52:48 -07:00
Tri Dao
2ddeaa406c Fix wheel building 2023-08-13 16:48:47 -07:00
Tri Dao
d8ec6a2f13 Merge branch 'piercefreeman-feature/demo-wheels' 2023-08-13 16:09:38 -07:00
* piercefreeman-feature/demo-wheels: (25 commits)
  Install standard non-wheel package
  Remove release creation
  Build wheel on each push
  Isolate 2.0.0 & cuda12
  Clean setup.py imports
  Remove builder project
  Bump version
  Add notes to github action workflow
  Add torch dependency to final build
  Exclude cuda erroring builds
  Exclude additional disallowed matrix params
  Full version matrix
  Add CUDA 11.7
  Release is actually unsupported
  echo OS version
  Temp disable deploy
  OS version build numbers
  Restore full build matrix
  Refactor and clean of setup.py
  Strip cuda name from torch version
  ...
Tri Dao
3c458cff77 Merge branch 'feature/demo-wheels' of https://github.com/piercefreeman/flash-attention into piercefreeman-feature/demo-wheels 2023-08-13 16:03:51 -07:00
* 'feature/demo-wheels' of https://github.com/piercefreeman/flash-attention: (25 commits)
  Install standard non-wheel package
  Remove release creation
  Build wheel on each push
  Isolate 2.0.0 & cuda12
  Clean setup.py imports
  Remove builder project
  Bump version
  Add notes to github action workflow
  Add torch dependency to final build
  Exclude cuda erroring builds
  Exclude additional disallowed matrix params
  Full version matrix
  Add CUDA 11.7
  Release is actually unsupported
  echo OS version
  Temp disable deploy
  OS version build numbers
  Restore full build matrix
  Refactor and clean of setup.py
  Strip cuda name from torch version
  ...
Tri Dao
dbd7923782 Prepare for Cutlass 3.2 2023-08-13 15:24:32 -07:00
Tri Dao
c5e87b11e9 Bump to v2.0.5 2023-08-13 13:55:04 -07:00
Tri Dao
3524e13c11 Update to Cutlass 3.1 2023-08-13 13:53:17 -07:00
Pierce Freeman
6ef3bd800e Install standard non-wheel package 2023-08-10 20:12:20 -07:00
Pierce Freeman
ecc6535443 Remove release creation 2023-08-10 19:56:24 -07:00
Pierce Freeman
bc6d4992f2 Build wheel on each push 2023-08-10 19:55:52 -07:00
Pierce Freeman
565615c603 Isolate 2.0.0 & cuda12 2023-08-10 19:54:29 -07:00
Tri Dao
364a5b4a71 [MLP] Change the check for out_features being None 2023-08-10 00:04:38 -07:00
Tri Dao
d30f2e1cd5 Bump to v2.0.4 2023-08-01 09:01:07 -07:00
Tri Dao
1c41d2b0e5 Fix race condition in bwd (overwriting sK) 2023-08-01 09:00:10 -07:00
Tri Dao
a4e5d1eddd Bump to v2.0.3 2023-07-31 17:49:23 -07:00
Tri Dao
8f4cd4c16b [Docs] Fix docstring about Q nheads being divisible by KV nheads 2023-07-31 17:47:03 -07:00
Tri Dao
a4f148b6ab Fix masking of bwd when seqlen is not divisible by 128 2023-07-31 17:46:34 -07:00
Tri Dao
184b992dcb [GPT] Implement parallel LLaMa 2023-07-28 15:52:48 -10:00
Tri Dao
840f7925a0 [Docs] Fix mention of MQA/GQA in qkvpacked functions 2023-07-28 12:26:29 -10:00
Tri Dao
60499abcfd [Benchmark] Add script to benchmark FlashAttention 2023-07-28 00:26:52 -10:00
Kirthi Shankar Sivamani
32a953f486 Request for v2.0.2 (#388) 2023-07-28 02:46:03 -07:00
  * Bump version to 2.0.2
  * Update version in Dockerfile
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Kirthi Shankar Sivamani
a03f6f8e9e Enable CUDA graphs (#386) 2023-07-27 16:11:34 -07:00
  * Add RNG state to kernel launch params
  * Save seed and offset for backward
  * Single thread write to global mem
  * compute_dq_dk_dv_1colblock get seed and offset from launch params
  * compute_dq_dk_dv_1rowblock get seed and offset from launch params
  * Change forward c++ APIs to save RNG state for backward
  * Change backward c++ APIs to set RNG state for bprop launcher
  * Bug fixes
  * Python side API changes
  * Bug fix; only save seeds instead of full offset
  * Account for 3D grid size
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
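Read together, the bullets in #386 describe one pattern: stop baking the dropout RNG seed and offset into kernel launch arguments and instead pass pointers that are dereferenced at run time, with the forward pass writing the Philox state it used to global memory (from a single thread) so backward kernels can reload it. That is what makes the kernels capturable by CUDA graphs, since a captured launch replays its arguments verbatim and a by-value seed would repeat on every replay. Below is a minimal CUDA C++ sketch of that contract (compile with nvcc); `rng_consumer` and its parameters are hypothetical stand-ins, not the flash-attention kernels:

```cpp
// Sketch only: a stand-in "kernel with dropout RNG" that reads its seed and
// offset through pointers at replay time instead of taking them by value.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void rng_consumer(const uint64_t* seed, const uint64_t* offset,
                             float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Stand-in for Philox-based dropout: *seed and *offset are read at
        // launch (or graph-replay) time, not at capture time.
        out[i] = (float)((*seed + *offset + (uint64_t)i) % 97u) / 97.0f;
    }
}

int main() {
    const int n = 1024;
    uint64_t *d_seed, *d_offset;
    float* d_out;
    cudaMalloc(&d_seed, sizeof(uint64_t));
    cudaMalloc(&d_offset, sizeof(uint64_t));
    cudaMemset(d_offset, 0, sizeof(uint64_t));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture one launch. The graph records the *pointer* arguments, so the
    // kernel sees whatever the seed slot holds when the graph is replayed.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    rng_consumer<<<(n + 255) / 256, 256, 0, stream>>>(d_seed, d_offset, d_out, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&exec, graph, 0);

    // Replay with fresh seeds; no re-capture is needed to change the RNG.
    for (uint64_t s = 1; s <= 2; ++s) {
        cudaMemcpyAsync(d_seed, &s, sizeof(s), cudaMemcpyHostToDevice, stream);
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_out); cudaFree(d_offset); cudaFree(d_seed);
    return 0;
}
```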
Tri Dao
4c98d0b41f [MLP] Edit ParallelGatedMlp 2023-07-26 09:39:37 -10:00
Haodong Lyu
8ee62efca3 Implement ParallelGatedMlp (#251) 2023-07-26 12:14:15 -07:00
Tri Dao
56ccaff126 [GPT] Add LLaMa-13B to test 2023-07-26 07:22:22 -10:00
Tri Dao
8e9820a55b [Rotary] Fix tests when loading state dict with rotary inv_freqs 2023-07-26 07:16:33 -10:00
Tri Dao
b252072409 Bump to v2.0.1 2023-07-23 12:33:42 -10:00
Tri Dao
2a2a3c4bfd [LayerNorm] Add test for randomness 2023-07-23 12:31:55 -10:00
Joel Lamy-Poirier
767b71ccf0 Fix random state for dropout_layer_norm (#315) 2023-07-23 15:05:13 -07:00
Tri Dao
d38357dd2f [GPT] Implement Falcon 2023-07-23 10:32:29 -07:00
Kiarash Jamali
684196b8c5 Allow rotary embeddings for Bert (#363) 2023-07-23 00:21:45 -07:00
Ian Timmis
cbf982afa5 README syntax highlighting (#365) 2023-07-23 00:21:30 -07:00
  * README syntax highlighting: adds syntax highlighting to README
  * Update README.md
Tri Dao
425dbcb6c6 [MHA] Implement MQA/GQA 2023-07-23 00:06:58 -07:00
Tri Dao
ec9f74ab9a [Rotary] Don't store inv_freq in state_dict 2023-07-22 23:52:42 -07:00
Tri Dao
a157cc8c9b [FT] Implement MQA/GQA 2023-07-22 23:47:01 -07:00
Tri Dao
75e334d407 [MLP] Add ParallelMLP 2023-07-22 23:45:51 -07:00
Tri Dao
b3177dfaf6 [GPT] Enable FlashAttention for GPT-J 2023-07-21 17:29:10 -07:00
Tri Dao
6fc1e07da2 [Block] Re-enable DropPath 2023-07-21 16:39:23 -07:00
Tri Dao
9ee0ff1d9b Fix using dO stride for O, which can cause memory error in bwd 2023-07-20 17:39:57 -07:00
Tri Dao
2dd87d0609 Merge pull request #360 from chuanli11/fix/dockerfile 2023-07-20 19:41:24 -04:00
  remove checkout v2.0.0.post1 from dockerfile
chuanli11
30fd8c17d8 remove checkout v2.0.0.post1 from dockerfile 2023-07-20 16:40:15 +00:00
Tri Dao
b8020d73c9 Merge pull request #348 from eltociear/patch-2 2023-07-19 17:25:37 -04:00
  [LayerNorm] Fix typo in ln_api.cpp
Ikko Eltociear Ashimine
dfc60f6b7d [LayerNorm] Fix typo in ln_api.cpp 2023-07-20 01:16:16 +09:00
  unintialized -> uninitialized
Tri Dao
31ae2488e6 Merge pull request #343 from danthe3rd/if_constexpr 2023-07-19 04:27:07 -04:00
  Fix compile error with `BOOL_SWITCH`
danthe3rd
538d570c96 Fix compile error on MSVC 2023-07-19 08:04:57 +00:00
  See also: https://stackoverflow.com/questions/55136414/constexpr-variable-captured-inside-lambda-loses-its-constexpr-ness
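The MSVC breakage in #343 is the pitfall described in the linked Stack Overflow thread: a `constexpr` local loses its constant-expression status once a nested lambda captures it. A minimal sketch of the `BOOL_SWITCH`-style dispatch pattern follows (hypothetical names, not the repo's exact macro), using the commonly suggested `static constexpr` workaround — statics are not captured, so inner lambdas may still use the value in constant expressions:

```cpp
// Sketch of a BOOL_SWITCH-style dispatch: a runtime bool selects between two
// instantiations of the body, inside which the value is a compile-time
// constant usable as a template argument.
#include <cstdio>

#define BOOL_SWITCH(COND, CONST_NAME, ...)      \
  [&] {                                         \
    if (COND) {                                 \
      static constexpr bool CONST_NAME = true;  \
      return __VA_ARGS__();                     \
    } else {                                    \
      static constexpr bool CONST_NAME = false; \
      return __VA_ARGS__();                     \
    }                                           \
  }()

template <bool IsCausal>
void run_kernel() { std::printf("causal=%d\n", int(IsCausal)); }

int main() {
    bool is_causal = true;  // known only at runtime
    BOOL_SWITCH(is_causal, kIsCausal, [&] {
        // With a plain (non-static) constexpr bool in the macro, MSVC
        // rejects this template-argument use once the lambda captures it;
        // `static constexpr` sidesteps the capture entirely.
        run_kernel<kIsCausal>();
    });
    return 0;
}
```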
Tri Dao
d1a3b52f17 Add instruction about limiting number of ninja jobs 2023-07-17 23:17:47 -07:00