Commit Graph

632 Commits

Tri Dao
dca6d89da4 Don't support softcap and dropout at the same time
These tests are failing so I'm just disabling this case for now
2024-07-10 11:23:12 -07:00
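At the Python interface level, the restriction amounts to a guard along these lines (a sketch; the argument names follow the flash-attn API, but the exact check location is an assumption):
```
def _check_softcap_dropout(softcap: float, dropout_p: float):
    # Hypothetical guard: softcapping and dropout are mutually
    # exclusive while the combined case is disabled.
    if softcap > 0.0 and dropout_p > 0.0:
        raise ValueError(
            "softcap and dropout are not supported together; "
            "set softcap=0.0 or dropout_p=0.0"
        )
```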
Tri Dao
81e01efd4b More typo fixes 2024-07-10 10:19:17 -07:00
Tri Dao
72e27c6320 Fix typo with softcapping 2024-07-10 00:33:52 -07:00
Tri Dao
3d41db3e2c Only test backward if there's no softcapping 2024-07-10 00:27:45 -07:00
Tri Dao
908511b2b6 Split into more .cu files to speed up compilation 2024-07-10 00:24:04 -07:00
Tri Dao
1d536d7de5 Minor cleanup of softcapping 2024-07-09 22:57:03 -07:00
Tri Dao
beb2bf2a32 Drop support for pytorch 1.12, 1.13, and python 3.7 2024-07-09 22:13:15 -07:00
Phil Wang
f4628b43ec
missing commas and backwards return arguments (#1032)
* missing commas

* another fix
2024-07-09 10:56:29 -07:00
Nicolas Patry
8f873cc6ac
Implement softcapping. (#1025)
* Softcap v2 (fwd only).

* Some missing interface + remove overrides in tests.
2024-07-08 11:24:48 -07:00
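For reference, soft-capping squashes the pre-softmax attention scores into (-softcap, softcap) with a tanh. A minimal unfused PyTorch sketch of the math (not the fused kernel):
```
import torch

def softcap_scores(scores, softcap):
    # scores: raw scaled QK^T logits; output stays in (-softcap, softcap).
    return softcap * torch.tanh(scores / softcap)
```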
Jianwei Dong
4e8d60069f
Add the return_softmax_lse parameter to the flash_attn_with_kvcache function to allow returning the logsumexp of the attention scores. (#989) 2024-07-08 08:29:40 -07:00
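A usage sketch (tensor shapes follow the flash-attn kvcache convention; treat the details as assumptions):
```
import torch
from flash_attn import flash_attn_with_kvcache

q = torch.randn(1, 1, 8, 64, dtype=torch.float16, device="cuda")
k_cache = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
v_cache = torch.randn_like(k_cache)

# With return_softmax_lse=True, the call also returns the per-row
# logsumexp of the attention scores alongside the output.
out, lse = flash_attn_with_kvcache(q, k_cache, v_cache, return_softmax_lse=True)
```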
muoshuosha
6df7e0a02e
Fix the varlen deterministic test (#1023)
Co-authored-by: moshuosha <moshuosha@qq.com>
2024-07-03 11:07:57 -07:00
66RING
9486635c92
Fix typos of comments about shape. (#837) 2024-06-30 22:40:59 -07:00
JDKWangGuan
0d810cfb73
Fix KeyError handling for non-existing key in state_dict.pop() (#898)
Update handling for KeyError in state_dict.pop() for non-existing keys.
Changed state_dict.pop(f"h.{d}.attn.bias") to state_dict.pop(f"h.{d}.attn.bias", None) to prevent KeyError exceptions.


The following code can reproduce the issue:
```
from transformers import AutoTokenizer, GPT2Model, GPT2Config
from flash_attn.models.gpt import GPTLMHeadModel, GPTModel

# >>> transformers.__version__
# '4.38.2'

model_path = 'gpt2'
output_model_path = 'gpt2_model'
config = GPT2Config.from_pretrained(model_path, output_hidden_states=True)
model = GPT2Model.from_pretrained(model_path, from_tf=False, config=config)
'''
model fine-tuning here
'''
# dump the fine-tuned model
model.save_pretrained(output_model_path)

# load the fine-tuned model
config = GPT2Config.from_pretrained(output_model_path, output_hidden_states=True)
model = GPTModel.from_pretrained(output_model_path, config=config, strict=True)  # failed due to KeyError: 'h.0.attn.bias'
model = GPTLMHeadModel.from_pretrained(output_model_path, config=config, strict=True)  # failed due to KeyError: 'h.0.attn.bias'

```
2024-06-30 22:40:03 -07:00
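The fix itself is the standard `dict.pop` default pattern:
```
state_dict = {"h.0.attn.weight": 0}
# Passing a default returns None instead of raising KeyError
# when the checkpoint does not contain the key.
bias = state_dict.pop("h.0.attn.bias", None)  # None, no exception
```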
cao lei
6a2a16e994
fix typo (#974) 2024-06-30 22:39:39 -07:00
Nicolas Patry
5bf201966a
Fixing argument checking when using seqlenq_ngroups_swapped. (#976)
When the user passes `out` as a parameter and the other arguments trigger
the `seqlenq_ngroups_swapped` path, the CHECK_SHAPE is incorrect
(since the q shape has been modified).
2024-06-30 22:39:22 -07:00
Liang
ab59ec3590
remove the swizzle part of sV.data() to get a completely non-swizzled sVtNoSwizzle (#984)
Co-authored-by: zl <zl@deepseek.com>
2024-06-30 22:38:44 -07:00
Grigory Sizov
f816dee63c
Support unpadded LSE layout (#970)
* Support unpadded LSE layout.

Co-authored-by: Xinfeng Xie <xfxie.ceca@gmail.com>
Co-authored-by: Jianyu Huang <hjyahead@gmail.com>

* Cleanup

* Fix unpadded LSE on split-kv path

* Fix formatting and comments

* Fix inline vs forceinline

---------

Co-authored-by: Xinfeng Xie <xfxie.ceca@gmail.com>
Co-authored-by: Jianyu Huang <hjyahead@gmail.com>
2024-06-27 02:38:13 -07:00
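In rough terms, the padded layout stores the LSE as (batch, nheads, max_seqlen), wasting space for short sequences, while the unpadded layout packs all tokens into (nheads, total_tokens) indexed through cu_seqlens. A small sketch (the exact shapes are my reading of the PR; treat them as assumptions):
```
import torch

def lse_unpadded_at(lse_unpadded, cu_seqlens, b, h, i):
    # Packed equivalent of lse_padded[b, h, i]: all sequences' tokens
    # are concatenated, offset by the cumulative sequence lengths.
    return lse_unpadded[h, cu_seqlens[b] + i]

cu_seqlens = torch.tensor([0, 100, 160])  # two sequences: 100 and 60 tokens
lse_unpadded = torch.randn(8, 160)        # (nheads, total_tokens)
val = lse_unpadded_at(lse_unpadded, cu_seqlens, b=1, h=3, i=5)
```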
Tri Dao
320fb59487 Update citation 2024-05-26 16:09:03 -07:00
Tri Dao
e2e4333c95 Limit to MAX_JOBS=1 with CUDA 12.2 2024-05-26 15:35:49 -07:00
Tri Dao
ce73503578 Bump to 2.5.9 2024-05-26 14:02:11 -07:00
Tri Dao
d732be1e67 Update to Cutlass 3.5 2024-05-26 12:49:33 -07:00
Tri Dao
af627063e3 [CI] Compile for pytorch 2.4.0.dev20240407 (for nvcr 24.05) 2024-05-26 12:41:17 -07:00
Wongboo
40e667236c
Update for python3.12 (#870) 2024-05-26 12:34:49 -07:00
Corey James Levinson
beb8b8ba9f
add exception to Timeout Error (#963)
When the connection times out, you get `URLError: <urlopen error timed out>`. In that case, build it from source.
2024-05-26 12:33:03 -07:00
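The pattern in question looks roughly like this (a sketch of the fallback logic, not the exact setup.py code):
```
import urllib.error
import urllib.request

def fetch_prebuilt_wheel(url, dest="flash_attn.whl"):
    try:
        urllib.request.urlretrieve(url, dest)
        return True
    except urllib.error.URLError:
        # Covers <urlopen error timed out>: fall back to a source build.
        return False
```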
lancerts
22339db185
remove an unused import (#960) 2024-05-23 11:12:31 -07:00
Wei Ji
9c0e9ee86d
Move packaging and ninja from install_requires to setup_requires (#937)
Set `packaging` and `ninja` as build-time dependencies rather than runtime dependencies.
2024-05-06 09:45:54 -07:00
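In setup.py terms, the change has roughly this effect (a simplified sketch; the real dependency lists are longer):
```
from setuptools import setup

setup(
    name="flash-attn",
    # Needed to build the CUDA extension, but not at runtime:
    setup_requires=["packaging", "ninja"],
    # Runtime dependencies only:
    install_requires=["torch", "einops"],
)
```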
Tri Dao
9a11f440d3 Bump to v2.5.8 2024-04-26 10:54:52 -07:00
Tri Dao
35060e7450 [CI] Compile for pytorch 2.2.2 and 2.3.0 2024-04-26 10:53:24 -07:00
Tri Dao
ec6d22143b [CrossEntropy] Change ignored_index -> ignore_index 2024-04-26 10:50:41 -07:00
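The rename matches the keyword used by torch.nn.CrossEntropyLoss, so calls look like this (import path assumed):
```
from flash_attn.losses.cross_entropy import CrossEntropyLoss

# ignore_index now mirrors the torch.nn.CrossEntropyLoss argument name.
loss_fn = CrossEntropyLoss(ignore_index=-100)
```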
Tri Dao
85881f547f Bump to v2.5.7 2024-04-07 20:13:05 -07:00
Tri Dao
2aea958f89 [CI] Compile with torch 2.3.0.dev20240207 2024-04-07 20:11:52 -07:00
Tri Dao
656daef4ea Use Cute's local_tile to get gQ, gK, gV 2024-04-07 20:10:19 -07:00
Tri Dao
9eb3d099c1 Transpose out when swapping seqlen_q and num_groups 2024-04-07 20:10:19 -07:00
Ivan Komarov
f692b98d80
Fix spurious re-compilations of rotary_kernel (#911)
All integer parameters are specialized by default, so the two parameters
removed in this commit could lead to kernel re-compilation, even if
they were completely unused.
2024-04-05 13:40:41 -07:00
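For background: Triton's JIT specializes integer arguments by value pattern (e.g. equality to 1, divisibility by 16), so each new pattern of an integer argument can compile a fresh variant even if the argument is never read; deleting unused integer parameters avoids that. A toy illustration (a hypothetical kernel, not the actual rotary_kernel):
```
import triton
import triton.language as tl

@triton.jit
def copy_kernel(X, OUT, n, BLOCK: tl.constexpr):
    # `n` is read here, so it has to stay; an integer argument that is
    # never read should simply be removed, because it would still be
    # specialized on and trigger spurious re-compilations.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(OUT + offs, tl.load(X + offs, mask=mask), mask=mask)
```
For arguments that must remain but should not be specialized, triton.jit also accepts a `do_not_specialize` list.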
Driss Guessous
23e8fa5a26
Add the option for the macro and note (#893) 2024-03-27 19:12:11 -07:00
ljss
3e9414f1c3
Minor fix in compute_attn_1rowblock_splitkv (#900) 2024-03-27 19:11:45 -07:00
Tri Dao
36587c01cb [LayerNorm] Update layer_norm_linear 2024-03-18 23:15:33 -07:00
Markus Krimmel
6bbc532388
fix: cast the alibi slopes to torch.float32 (#846) 2024-03-15 00:49:40 -07:00
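The fix boils down to an explicit cast before the slopes reach the kernel:
```
import torch

# The kernel expects ALiBi slopes in fp32 regardless of the
# fp16/bf16 model dtype, so cast explicitly.
alibi_slopes = torch.tensor([1 / 2**i for i in range(1, 9)], dtype=torch.float16)
alibi_slopes = alibi_slopes.to(torch.float32)
```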
Driss Guessous
4a73e903da
Add in macros for defining __grid_constant__ (#852) 2024-03-15 00:48:54 -07:00
Grigory Sizov
2a15840f09
Enable paged attention in varlen forward (#831)
* Enable paged attention in varlen forward

* Format + fix padding
2024-03-15 00:48:19 -07:00
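A call sketch with a paged KV cache (the shapes and the block_table convention here are assumptions based on the PR description):
```
import torch
from flash_attn import flash_attn_varlen_func

nheads, headdim, block_size = 8, 64, 256
# Two packed sequences of lengths 100 and 60 (no padding).
cu_seqlens = torch.tensor([0, 100, 160], dtype=torch.int32, device="cuda")
q = torch.randn(160, nheads, headdim, dtype=torch.float16, device="cuda")
# Paged KV cache: (num_blocks, block_size, nheads, headdim); block_table
# maps each sequence to the indices of its cache blocks.
k_cache = torch.randn(2, block_size, nheads, headdim, dtype=torch.float16, device="cuda")
v_cache = torch.randn_like(k_cache)
block_table = torch.tensor([[0], [1]], dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k_cache, v_cache,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=100, max_seqlen_k=100,
    block_table=block_table,
)
```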
Arvind Sundararajan
26c9e82743
Support ARM builds (#757) 2024-03-13 21:57:20 -07:00
Chirag Jain
50896ec574
Make nvcc threads configurable via environment variable (#885) 2024-03-13 20:46:57 -07:00
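The wiring is roughly this (the NVCC_THREADS variable name comes from the PR; the surrounding plumbing is assumed):
```
import os

# nvcc's --threads flag parallelizes compilation across target archs;
# let the user override the default via an environment variable.
nvcc_threads = os.getenv("NVCC_THREADS") or "4"
nvcc_flags = ["-O3", "--threads", nvcc_threads]
```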
Tri Dao
6c9e60de56 Bump to v2.5.6 2024-03-01 22:09:56 -08:00
Tri Dao
6e2fa30797 [CI] Change torch 2.3.0.dev20240126 to 20240105 for nvcr 24.02 2024-03-01 22:08:10 -08:00
Tri Dao
87a1277653 Bump to v2.5.5 2024-02-21 15:58:23 -08:00
Tri Dao
2406f28805 Enable headdim 256 backward on consumer GPUs (Ampere, Ada) 2024-02-21 15:56:19 -08:00
Tri Dao
43950dda45 Bump to v2.5.4 2024-02-20 16:30:16 -08:00
Tri Dao
4d6b794b3c Update Cutlass to v3.4.1 2024-02-20 16:28:21 -08:00
Tri Dao
b32efb1a4d Don't need to reduce row_sum during online softmax 2024-02-20 13:33:38 -08:00
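For context, online softmax scans a row of scores block by block, keeping a running max and a rescaled running sum so the full row is never materialized. A plain PyTorch illustration of the algorithm itself (not the kernel change):
```
import torch

def online_softmax(blocks):
    m = torch.tensor(float("-inf"))  # running max
    s = torch.tensor(0.0)            # running sum of exp(x - m)
    for blk in blocks:
        m_new = torch.maximum(m, blk.max())
        # Rescale the old sum to the new max, then add this block's mass.
        s = s * torch.exp(m - m_new) + torch.exp(blk - m_new).sum()
        m = m_new
    return torch.cat([torch.exp(blk - m) for blk in blocks]) / s

x = torch.randn(8)
assert torch.allclose(online_softmax([x[:4], x[4:]]), torch.softmax(x, 0), atol=1e-6)
```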
Qubitium
f45bbb4c94
Optimize compilation to (1) avoid OOM, (2) minimize swap usage, and (3) avoid thread starvation, whether ninja decides how many workers to spawn or MAX_JOBS is a manual guess. The logic takes the minimum of two auto-calculated MAX_JOBS values, one derived from CPU cores and one from free memory. This should let flash-attn compile close to the most efficient manner under any consumer/server environment. (#832) 2024-02-17 18:17:15 -08:00
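A hedged sketch of the described heuristic (names and the per-job memory constant are illustrative; the real logic lives in setup.py):
```
import os
import psutil  # assumed available at build time

def auto_max_jobs(mem_per_job_gib: float = 9.0) -> int:
    # Cap parallel compile jobs by CPU cores and by free memory,
    # taking the minimum so neither resource is oversubscribed.
    by_cpu = os.cpu_count() or 1
    by_mem = int(psutil.virtual_memory().available / (mem_per_job_gib * 1024**3))
    return max(1, min(by_cpu, by_mem))
```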