Tri Dao
320fb59487
Update citation
2024-05-26 16:09:03 -07:00
Tri Dao
e2e4333c95
Limit to MAX_JOBS=1 with CUDA 12.2
2024-05-26 15:35:49 -07:00
Tri Dao
ce73503578
Bump to 2.5.9
2024-05-26 14:02:11 -07:00
Tri Dao
d732be1e67
Update to Cutlass 3.5
2024-05-26 12:49:33 -07:00
Tri Dao
af627063e3
[CI] Compile for pytorch 2.4.0.dev20240407 (for nvcr 24.05)
2024-05-26 12:41:17 -07:00
Wongboo
40e667236c
Update for python3.12 ( #870 )
2024-05-26 12:34:49 -07:00
Corey James Levinson
beb8b8ba9f
add exception to Timeout Error ( #963 )
When the connection times out, you get URLError: <urlopen error timed out>. In that case, build from source.
2024-05-26 12:33:03 -07:00
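The fallback this commit adds lives in the wheel-download path of setup.py. A minimal sketch of the pattern, not the actual setup.py (the URL and helper names here are hypothetical):

```python
import urllib.error
import urllib.request

WHEEL_URL = "https://example.com/flash_attn-prebuilt.whl"  # hypothetical URL

def build_from_source():
    # Placeholder for compiling the CUDA extension locally.
    print("prebuilt wheel unavailable; building flash-attn from source")

def fetch_or_build(dest="flash_attn.whl"):
    try:
        urllib.request.urlretrieve(WHEEL_URL, dest)
        return dest
    except urllib.error.HTTPError:
        pass  # no prebuilt wheel for this platform/CUDA/torch combination
    except urllib.error.URLError:
        pass  # "<urlopen error timed out>": the case this commit adds handling for
    return build_from_source()
```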
lancerts
22339db185
remove an unused import ( #960 )
2024-05-23 11:12:31 -07:00
Wei Ji
9c0e9ee86d
Move packaging and ninja from install_requires to setup_requires ( #937 )
Set `packaging` and `ninja` as build-time dependencies rather than runtime dependencies.
2024-05-06 09:45:54 -07:00
Tri Dao
9a11f440d3
Bump to v2.5.8
2024-04-26 10:54:52 -07:00
Tri Dao
35060e7450
[CI] Compile for pytorch 2.2.2 and 2.3.0
2024-04-26 10:53:24 -07:00
Tri Dao
ec6d22143b
[CrossEntropy] Change ignored_index -> ignore_index
2024-04-26 10:50:41 -07:00
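After this rename the parameter matches torch.nn.CrossEntropyLoss. A brief usage sketch, assuming flash_attn.losses.cross_entropy.CrossEntropyLoss and a CUDA device:

```python
import torch
from flash_attn.losses.cross_entropy import CrossEntropyLoss

# ignore_index (formerly ignored_index) marks label positions excluded from the loss.
loss_fn = CrossEntropyLoss(ignore_index=-100)
logits = torch.randn(8, 32000, device="cuda", dtype=torch.float16)
labels = torch.randint(0, 32000, (8,), device="cuda")
labels[0] = -100  # this position does not contribute to the loss
loss = loss_fn(logits, labels)
```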
Tri Dao
85881f547f
Bump to v2.5.7
2024-04-07 20:13:05 -07:00
Tri Dao
2aea958f89
[CI] Compile with torch 2.3.0.dev20240207
2024-04-07 20:11:52 -07:00
Tri Dao
656daef4ea
Use Cute's local_tile to get gQ, gK, gV
2024-04-07 20:10:19 -07:00
Tri Dao
9eb3d099c1
Transpose out when swapping seqlen_q and num_groups
2024-04-07 20:10:19 -07:00
Ivan Komarov
f692b98d80
Fix spurious re-compilations of rotary_kernel ( #911 )
All integer parameters are specialized by default, so the two parameters
removed in this commit could lead to kernel re-compilation, even if
they were completely unused.
2024-04-05 13:40:41 -07:00
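A minimal illustration of the issue (not the actual rotary_kernel), assuming Triton and a CUDA device:

```python
# Triton specializes integer arguments by default (e.g. on the value 1 and on
# divisibility by 16), so an unused integer argument whose value varies across
# calls can still trigger recompilation. The fix in #911 simply drops such arguments.
import torch
import triton
import triton.language as tl

@triton.jit
def add_one_kernel(x_ptr, n_elements, unused_int, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(x_ptr + offsets, x + 1, mask=mask)

x = torch.zeros(1024, device="cuda")
grid = (triton.cdiv(x.numel(), 256),)
# Passing unused_int = 1, 7, 16 can compile three separate specializations,
# even though the kernel never reads unused_int.
for unused_int in (1, 7, 16):
    add_one_kernel[grid](x, x.numel(), unused_int, BLOCK=256)
```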
Driss Guessous
23e8fa5a26
Add the option for the macro and note ( #893 )
2024-03-27 19:12:11 -07:00
ljss
3e9414f1c3
Minor fix in compute_attn_1rowblock_splitkv ( #900 )
2024-03-27 19:11:45 -07:00
Tri Dao
36587c01cb
[LayerNorm] Update layer_norm_linear
2024-03-18 23:15:33 -07:00
Markus Krimmel
6bbc532388
fix: cast the alibi slopes to torch.float32 ( #846 )
2024-03-15 00:49:40 -07:00
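A short sketch of the resulting calling convention, assuming flash_attn_func with the alibi_slopes argument and a CUDA device; the slope formula is only illustrative:

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Geometric ALiBi slopes (illustrative); cast to float32 regardless of how they were built,
# which is what the fix in #846 does.
alibi_slopes = torch.tensor([2 ** (-8 * (i + 1) / nheads) for i in range(nheads)],
                            device="cuda").to(torch.float32)
out = flash_attn_func(q, k, v, causal=True, alibi_slopes=alibi_slopes)
```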
Driss Guessous
4a73e903da
Add in macros for defining __grid_constant__ ( #852 )
2024-03-15 00:48:54 -07:00
Grigory Sizov
2a15840f09
Enable paged attention in varlen forward ( #831 )
* Enable paged attention in varlen forward
* Format + fix padding
2024-03-15 00:48:19 -07:00
Arvind Sundararajan
26c9e82743
Support ARM builds ( #757 )
2024-03-13 21:57:20 -07:00
Chirag Jain
50896ec574
Make nvcc threads configurable via environment variable ( #885 )
2024-03-13 20:46:57 -07:00
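A sketch of the pattern; the variable name NVCC_THREADS and the default of 4 are assumptions, not quoted from the PR:

```python
import os

def nvcc_threads_args():
    """Extra nvcc flags controlling its internal compilation thread count."""
    nvcc_threads = os.getenv("NVCC_THREADS") or "4"  # assumed variable name and default
    return ["--threads", nvcc_threads]

nvcc_flags = ["-O3", "--use_fast_math"] + nvcc_threads_args()
print(nvcc_flags)
```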
Tri Dao
6c9e60de56
Bump to v2.5.6
2024-03-01 22:09:56 -08:00
Tri Dao
6e2fa30797
[CI] Change torch 2.3.0.dev20240126 to 20240105 for nvcr 24.02
2024-03-01 22:08:10 -08:00
Tri Dao
87a1277653
Bump to v2.5.5
2024-02-21 15:58:23 -08:00
Tri Dao
2406f28805
Enable headdim 256 backward on consumer GPUs (Ampere, Ada)
2024-02-21 15:56:19 -08:00
Tri Dao
43950dda45
Bump to v2.5.4
2024-02-20 16:30:16 -08:00
Tri Dao
4d6b794b3c
Update Cutlass to v3.4.1
2024-02-20 16:28:21 -08:00
Tri Dao
b32efb1a4d
Don't need to reduce row_sum during online softmax
2024-02-20 13:33:38 -08:00
Qubitium
f45bbb4c94
Optimize compilation to (1) avoid OOM, (2) minimize swap usage, and (3) avoid thread starvation when ninja decides how many workers to spawn or MAX_JOBS is guessed manually. The logic takes the minimum of the MAX_JOBS values auto-calculated from two metrics: (1) CPU cores and (2) free memory. This should let flash-attn compile in close to the most efficient manner under any consumer/server environment. ( #832 )
2024-02-17 18:17:15 -08:00
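A minimal sketch of that heuristic; psutil and the per-job memory budget are assumptions for illustration, not values taken from the commit:

```python
import os
import psutil  # assumption: used here only to read available memory

def auto_max_jobs(mem_per_job_gb: float = 9.0) -> int:
    """Return min(MAX_JOBS by CPU cores, MAX_JOBS by free memory), at least 1."""
    cpu_jobs = os.cpu_count() or 1
    free_gb = psutil.virtual_memory().available / (1024 ** 3)
    mem_jobs = max(1, int(free_gb / mem_per_job_gb))  # rough per-job budget (assumed)
    return max(1, min(cpu_jobs, mem_jobs))

# Respect an explicit MAX_JOBS if the user set one; otherwise auto-calculate.
os.environ.setdefault("MAX_JOBS", str(auto_max_jobs()))
```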
Tri Dao
5cdabc2809
Bump to v2.5.3
2024-02-10 01:06:27 -08:00
Tri Dao
d9a5cb291c
Fix dv = torch::empty_like(k) for mha_bwd_varlen as well
2024-02-10 01:03:00 -08:00
Tri Dao
a190df011c
Add window_size option to ParallelMHA
2024-02-10 01:02:14 -08:00
Brian Hirsh
2423cca3ad
fix backward for when query and key have different contiguity ( #818 )
2024-02-10 01:01:27 -08:00
Grigory Sizov
4687936413
Fix Windows build ( #816 )
2024-02-07 17:41:53 -08:00
Tri Dao
61a7772479
Bump to v2.5.2
2024-01-31 02:44:24 -08:00
Tri Dao
6a5c053c3e
[CI] Compile with torch 2.2.0 instead of 2.2.0.dev20231106
2024-01-31 02:43:12 -08:00
Tri Dao
ef0ed10622
Add window_size option to MHA and GPT
2024-01-31 02:42:23 -08:00
Tri Dao
dc72d960a7
[CI] Install torch 2.3 using index
2024-01-30 14:32:29 -08:00
Tri Dao
daf37a9d8a
Bump to v2.5.1
2024-01-29 21:03:38 -08:00
Tri Dao
aa2eb8ddf2
[CI] Compile with pytorch 2.2.0.dev20231106
2024-01-29 20:49:18 -08:00
Jeremy Reizenstein
0658e320f6
Preprocessor switches to control functionality ( #788 )
For faster and smaller builds in some simple cases,
provide switches to allow disabling
-backward
-alibi
-uneven k
-dropout
-local attention
Co-authored-by: Jeremy Francis Reizenstein <bottler@users.noreply.github.com>
2024-01-29 20:44:23 -08:00
Christian Kadner
290596c544
[CI] Build wheels for Pytorch 2.3 (dev/nightly) ( #793 )
* [CI] Build wheels for Pytorch 2.3 (dev/nightly)
Resolves #790
* update TORCH_CUDA_VERSION
* revert torch 2.2 back to dev20231130
* add link to PyTorch compatibility matrix
Signed-off-by: Christian Kadner <ckadner@us.ibm.com>
2024-01-29 17:53:38 -08:00
Avelina9X
c94cd09744
Updated missing docstrings for args and returns in bert_padding.py ( #795 )
* Updated docstrings of bert_padding.py
Added docstrings for missing arguments in the unpad and pad methods.
* Update bert_padding.py
Fixed spelling mistakes
2024-01-27 09:16:25 -08:00
Tri Dao
ffc8682dd5
Add benchmarking code for Alibi (from Sanghun Cho)
2024-01-23 19:00:49 -08:00
Tao He
204c3c6d1b
Fixes an error in a comment ( #785 )
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-01-23 12:38:29 -08:00
Tri Dao
197f2083a2
Bump to v2.5.0
2024-01-22 23:40:10 -08:00