Commit Graph

  • 7153673c1a Fix swiglu backwards return type (#1337) main Neil Tenenholtz 2024-11-15 19:23:40 -0500
  • 641db759ab [CI] Pytorch 2.5.1 does not support python 3.8 Tri Dao 2024-11-12 20:01:43 -0800
  • 7435839e3d Update README for FA3 Tri Dao 2024-11-12 20:01:07 -0800
  • 241c682c9f [CI] Switch back to CUDA 12.4 Tri Dao 2024-11-12 14:24:27 -0800
  • c555642172 Bump to v2.7.0 Tri Dao 2024-11-12 14:11:44 -0800
  • 6ffeb572b1 [CI] Still use CUDA 12.3 but pull the right pytorch version Tri Dao 2024-11-12 14:04:30 -0800
  • 42f2b8be34 Use CUDA 12.4 in the build system (#1326) Ethan Steinberg 2024-11-12 13:40:38 -0800
  • 2f6c633179 Drop support for Pytorch 2.0 Tri Dao 2024-11-12 11:58:16 -0800
  • 88d1657a14 [AMD ROCm] Fix KVcache bug and improve performance (#1328) rocking 2024-11-13 03:32:11 +0800
  • 284e2c6e5b Make FA3 paged attention ready for upgrade to Cutlass 3.6 (#1331) Kai Londenberg 2024-11-12 14:31:37 -0500
  • b443207c1f Paged Attention support for FA3 (#1268) Kai Londenberg 2024-11-10 02:05:01 +0100
  • f0bf3ed9ab Feat: Add support for PyTorch 2.5 in workflows (#1284) NanoCode012 2024-11-07 15:37:56 +0700
  • 478ee666cc Make namespace comment consistent (#1305) Son Nguyen 2024-10-31 13:32:49 +0800
  • c1d146cbd5 Fix copy-paste error in hopper tests (#1279) milesvant 2024-10-15 13:54:40 -0700
  • a5a75274bc FA3 kvcache + split kv + gqa parallelization (#1236) jayhshah 2024-10-15 00:21:22 -0700
  • bedf877467 [CrossEntropy] Fix where labels address not aligned to 16 bytes Tri Dao 2024-10-05 02:02:24 -0700
  • 53a4f34163 Hotfix due to change of upstream api (#1239) rocking 2024-09-21 03:45:25 +0800
  • 8476986721 Fix FAv3 compilation with MSVC (#1240) hlky 2024-09-20 20:44:59 +0100
  • 9cafd4ae14 Merge pull request #1233 from Dao-AILab/ipiszy/local_attn Ying Zhang 2024-09-19 23:14:45 -0700
  • 1c9717d699 address comments Ying Zhang 2024-09-19 22:00:41 -0700
  • 30e1ef0f79 minify torch.torch.int32 to torch.int32 (#1237) Zhihao Shen 2024-09-18 15:32:59 +0800
  • 83e41b3ca4 Add custom ops for compatibility with PT Compile (#1139) Antoni Viros 2024-09-17 19:49:26 -0700
  • be6c1b98c4 small fixes Ying Zhang 2024-09-16 15:50:55 -0700
  • dff976a84a fixes Ying Zhang 2024-08-30 17:20:18 -0700
  • 7b4e68e04f hopper local attention Ying Zhang 2024-08-29 22:34:31 -0700
  • af314d4006 Merge pull request #1182 from ipiszy/used_q Ying Zhang 2024-09-16 14:57:19 -0700
  • 8cbc8a042f small fixes Ying Zhang 2024-09-16 14:38:43 -0700
  • cdbbe844b1 minor changes to unpad_input test util func Ying Zhang 2024-09-13 17:10:37 -0700
  • db80387343 Add seqused_q in fwd / bwd and seqused_k in bwd. Ying Zhang 2024-08-27 21:41:21 -0700
  • e2182cc21d Support page kvcache in AMD ROCm (#1198) rocking 2024-09-16 14:17:28 +0800
  • cc1690d9d6 [Rotary] Add test for rotary when qkv are packed an there's GQA Tri Dao 2024-09-12 22:35:20 -0700
  • 8c20cfef49 [Rotary] Support qkv block layout from GQA Tri Dao 2024-09-11 10:39:18 -0700
  • bdf733be55 Add q, k, v descales to FA3 interface (#1210) Charlene Yang 2024-09-09 21:53:52 -0700
  • c7f32a8409 [CrossEntropy] Support precomputed LSE Tri Dao 2024-09-08 09:24:18 -0700
  • e371bea04f feat: change minimal supported CUDA version to 11.7 (#1206) juejuezi 2024-09-06 01:34:35 +0800
  • 3cea2fb6ee Add ArchTag to pre/postprocess bwd kernels (#1180) Cameron Shinn 2024-08-28 00:20:47 -0700
  • c92ca63268 FA3 FP8 qkv descales + restore max offset for h128 causal + added sync for producer WG (#1173) jayhshah 2024-08-25 12:18:04 -0700
  • d79f9b41a8 [CrossEntropy] Use online softmax to simplify implementation Tri Dao 2024-08-24 17:39:57 -0700
  • 32792d37ec add missing if condition for key_padding_mask in test_util.py Jay Shah 2024-08-19 11:17:17 -0700
  • 28e7f4ddbd Merge pull request #1155 from ipiszy/fix Ying Zhang 2024-08-17 13:34:06 -0700
  • 53537da422 add a unittest Ying Zhang 2024-08-17 13:23:50 -0700
  • a3a257c71d Fix out-of-bound writes for var-seq-len zero-length KVs Ying Zhang 2024-08-16 01:13:35 -0700
  • bcd918f275 [LayerNorm] Add option to write result to out and residual_out Tri Dao 2024-08-15 14:43:47 -0700
  • bd82d6c6eb Revert "[LayerNorm] Don't store x + residual if we don't need gradients" Tri Dao 2024-08-15 12:02:39 -0700
  • 800401847e [LayerNorm] Don't store x + residual if we don't need gradients Tri Dao 2024-08-15 11:07:46 -0700
  • 16025d8cc9 Clearer install instructions for CUDA and ROCm backends (#1147) Garrett Byrd 2024-08-14 01:21:56 -0400
  • 3669b25206 bwd benchmark + small fixes (#1129) Ying Zhang 2024-08-05 21:27:52 -0700
  • 5d5bfbb619 Remove contiguous checks Tri Dao 2024-08-05 14:46:46 -0700
  • 3f1b4d38e7 Fix: check the type of max_seqlen_k instead of checking max_seqlen twice (#1127) SueJane 2024-08-05 23:59:23 +0800
  • 3f6ff1c1c5 Remove struct : cute::aligned_struct to avoid error with gcc 12 Tri Dao 2024-08-02 00:59:35 -0700
  • c33de664a1 Fix import in test Tri Dao 2024-08-01 02:14:25 -0700
  • bafe253042 [FA3] Bwd Tri Dao 2024-08-01 01:57:06 -0700
  • abffb0f98c Merge pull request #1115 from ipiszy/bench Ying Zhang 2024-07-31 22:42:06 -0700
  • c7f20a2d31 add cudnn benchmark for var-len Ying Zhang 2024-07-31 22:33:29 -0700
  • 5018ac6ac5 Fp8 kernel with "in-kernel" transpose of V in producer (#1100) jayhshah 2024-07-30 14:14:14 -0700
  • c4b9015d74 Add benchmark_gemm.py Tri Dao 2024-07-27 11:13:18 -0700
  • 418d677192 Bump to v2.6.3 Tri Dao 2024-07-25 01:31:28 -0700
  • 65205d350e [CI] Compile for pytorch 2.4.0 Tri Dao 2024-07-25 01:30:34 -0700
  • 3aae9c18c1 Revert "Changes For FP8 (#1075)" Tri Dao 2024-07-25 01:28:44 -0700
  • 1899c970c8 Changes For FP8 (#1075) ganeshcolfax 2024-07-23 13:51:14 -0700
  • 59594f2a67 Bump to v2.6.2 Tri Dao 2024-07-23 02:30:05 -0700
  • 299563626f Fix test with alibi and cache_leftpad Tri Dao 2024-07-23 02:04:15 -0700
  • 4488acee8d [CI] Compile with torch 2.4.0.dev20240527 Tri Dao 2024-07-23 01:33:32 -0700
  • 65f723bb9a Split bwd into more .cu files to speed up compilation Tri Dao 2024-07-23 01:32:09 -0700
  • 5ca83a9c71 Clean up softcapping bwd a bit Tri Dao 2024-07-22 23:42:06 -0700
  • 751c762c9c Don't specialize for hdim 224 to speed up compilation Tri Dao 2024-07-22 23:40:31 -0700
  • 1c275eb070 Fix ima for split-kv kernel (#1085) Driss Guessous 2024-07-22 22:19:46 -0700
  • 3c4053b75c Make FA3 externally importable (#1053) janEbert 2024-07-23 04:34:56 +0000
  • d8f104e97a Support AMD ROCm on FlashAttention 2 (#1010) rocking 2024-07-23 12:34:37 +0800
  • dfe1a59e4b Add var-seq-len to FA3 fp16 / bf16 fwd (#1072) Ying Zhang 2024-07-22 21:32:41 -0700
  • cb516f855b Remove torchlib dependency from cpp files (#1083) Cameron Shinn 2024-07-22 16:47:09 -0700
  • 5f1ae4a34b backwards for softcapping (#1033) Phil Wang 2024-07-21 23:25:46 -0700
  • ef3e358a25 remove lambda (#1056) youkaichao 2024-07-21 23:24:38 -0700
  • 4df62e1440 catch typo (#1058) Jorge António 2024-07-22 07:24:15 +0100
  • 74b0761ff7 [FA3] BF16 forward Tri Dao 2024-07-14 23:39:46 -0700
  • 898dd4bbf2 Pass seqused_k to _flash_attn_varlen_forward Tri Dao 2024-07-13 00:08:27 -0700
  • 7ef24848cf Add FA3 image Tri Dao 2024-07-11 09:54:05 -0700
  • 7f67966cc7 FA3 initial code release Tri Dao 2024-07-11 09:53:36 -0700
  • b4a9dd6c9c Temporarily switch to cutlass fork for more shapes Tri Dao 2024-07-11 09:29:21 -0700
  • 7551202cb2 Bump to v2.6.1 Tri Dao 2024-07-11 08:28:32 -0700
  • 844912dca0 [CI] Switch from CUDA 12.2 to 12.3 Tri Dao 2024-07-11 08:20:09 -0700
  • 40e534a7f6 Implement cache_leftpad Tri Dao 2024-07-11 08:17:15 -0700
  • 116b05f9b0 [CI] Compile with pytorch 2.4.0.dev20240514 Tri Dao 2024-07-11 02:53:30 -0700
  • da11d1b853 Bump v2.6.0 Tri Dao 2024-07-10 21:34:58 -0700
  • d0787acc16 Relax dropout_fraction test Tri Dao 2024-07-10 11:49:40 -0700
  • dca6d89da4 Don't support softcap and dropout at the same time Tri Dao 2024-07-10 11:23:12 -0700
  • 81e01efd4b More typo fixes Tri Dao 2024-07-10 10:19:17 -0700
  • 72e27c6320 Fix typo with softcapping Tri Dao 2024-07-10 00:33:52 -0700
  • 3d41db3e2c Only test backward if there's no softcapping Tri Dao 2024-07-10 00:27:45 -0700
  • 908511b2b6 Split into more .cu files to speed up compilation Tri Dao 2024-07-10 00:24:04 -0700
  • 1d536d7de5 Minor cleanup of softcapping Tri Dao 2024-07-09 22:57:03 -0700
  • beb2bf2a32 Drop support for pytorch 1.12, 1.13, and python 3.7 Tri Dao 2024-07-09 22:13:15 -0700
  • f4628b43ec missing commas and backwards return arguments (#1032) Phil Wang 2024-07-09 10:56:29 -0700
  • 8f873cc6ac Implement softcapping. (#1025) Nicolas Patry 2024-07-08 20:24:48 +0200
  • 4e8d60069f Add the return_softmax_lse parameter to the flash_attn_with_kvcache function to allow returning the logsumexp of the attention scores. (#989) Jianwei Dong 2024-07-08 23:29:40 +0800
  • 6df7e0a02e Fix the varlen deterministic test (#1023) muoshuosha 2024-07-04 02:07:57 +0800
  • 9486635c92 Fix typos of comments about shape. (#837) 66RING 2024-07-01 13:40:59 +0800
  • 0d810cfb73 Fix KeyError handling for non-existing key in state_dict.pop() (#898) JDKWangGuan 2024-06-30 22:40:03 -0700
  • 6a2a16e994 fix typo (#974) cao lei 2024-06-30 22:39:39 -0700
  • 5bf201966a Fixing argument checking when using seqlenq_ngroups_swapped. (#976) Nicolas Patry 2024-07-01 07:39:22 +0200