Tri Dao
dbd7923782
Prepare for Cutlass 3.2
2023-08-13 15:24:32 -07:00
Tri Dao
c5e87b11e9
Bump to v2.0.5
2023-08-13 13:55:04 -07:00
Tri Dao
3524e13c11
Update to Cutlass 3.1
2023-08-13 13:53:17 -07:00
Pierce Freeman
6ef3bd800e
Install standard non-wheel package
2023-08-10 20:12:20 -07:00
Pierce Freeman
ecc6535443
Remove release creation
2023-08-10 19:56:24 -07:00
Pierce Freeman
bc6d4992f2
Build wheel on each push
2023-08-10 19:55:52 -07:00
Pierce Freeman
565615c603
Isolate 2.0.0 & cuda12
2023-08-10 19:54:29 -07:00
Tri Dao
364a5b4a71
[MLP] Change the check for out_features being None
2023-08-10 00:04:38 -07:00
Tri Dao
d30f2e1cd5
Bump to v2.0.4
2023-08-01 09:01:07 -07:00
Tri Dao
1c41d2b0e5
Fix race condition in bwd (overwriting sK)
2023-08-01 09:00:10 -07:00
Tri Dao
a4e5d1eddd
Bump to v2.0.3
2023-07-31 17:49:23 -07:00
Tri Dao
8f4cd4c16b
[Docs] Fix docstring about Q nheads being divisible by KV nheads
2023-07-31 17:47:03 -07:00
Tri Dao
a4f148b6ab
Fix masking of bwd when seqlen is not divisible by 128
2023-07-31 17:46:34 -07:00
Tri Dao
184b992dcb
[GPT] Implement parallel LLaMa
2023-07-28 15:52:48 -10:00
Tri Dao
840f7925a0
[Docs] Fix mention of MQA/GQA in qkvpacked functions
2023-07-28 12:26:29 -10:00
Tri Dao
60499abcfd
[Benchmark] Add script to benchmark FlashAttention
2023-07-28 00:26:52 -10:00
Kirthi Shankar Sivamani
32a953f486
Request for v2.0.2 (#388)
* Bump version to 2.0.2
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Update version in Dockerfile
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
2023-07-28 02:46:03 -07:00
Kirthi Shankar Sivamani
a03f6f8e9e
Enable CUDA graphs (#386)
* Add RNG state to kernel launch params
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Save seed and offset for backward
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Single thread write to global mem
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* compute_dq_dk_dv_1colblock get seed and offset from launch params
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* compute_dq_dk_dv_1rowblock get seed and offset from launch params
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Change forward c++ APIs to save RNG state for backward
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Change backward c++ APIs to set RNG state for bprop launcher
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Bug fixes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Python side API changes
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Bug fix; only save seeds instead of full offset
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Account for 3D grid size
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
2023-07-27 16:11:34 -07:00
Tri Dao
4c98d0b41f
[MLP] Edit ParallelGatedMlp
2023-07-26 09:39:37 -10:00
Haodong Lyu
8ee62efca3
Implement ParallelGatedMlp (#251)
2023-07-26 12:14:15 -07:00
Tri Dao
56ccaff126
[GPT] Add LLaMa-13B to test
2023-07-26 07:22:22 -10:00
Tri Dao
8e9820a55b
[Rotary] Fix tests when loading state dict with rotary inv_freqs
2023-07-26 07:16:33 -10:00
Tri Dao
b252072409
Bump to v2.0.1
2023-07-23 12:33:42 -10:00
Tri Dao
2a2a3c4bfd
[LayerNorm] Add test for randomness
2023-07-23 12:31:55 -10:00
Joel Lamy-Poirier
767b71ccf0
Fix random state for dropout_layer_norm (#315)
2023-07-23 15:05:13 -07:00
Tri Dao
d38357dd2f
[GPT] Implement Falcon
2023-07-23 10:32:29 -07:00
Kiarash Jamali
684196b8c5
Allow rotary embeddings for Bert (#363)
2023-07-23 00:21:45 -07:00
Ian Timmis
cbf982afa5
README syntax highlighting (#365)
* README syntax highlighting
Adds syntax highlighting to README
* Update README.md
2023-07-23 00:21:30 -07:00
Tri Dao
425dbcb6c6
[MHA] Implement MQA/GQA
2023-07-23 00:06:58 -07:00
Tri Dao
ec9f74ab9a
[Rotary] Don't store inv_freq in state_dict
2023-07-22 23:52:42 -07:00
Tri Dao
a157cc8c9b
[FT] Implement MQA/GQA
2023-07-22 23:47:01 -07:00
Tri Dao
75e334d407
[MLP] Add ParallelMLP
2023-07-22 23:45:51 -07:00
Tri Dao
b3177dfaf6
[GPT] Enable FlashAttention for GPT-J
2023-07-21 17:29:10 -07:00
Tri Dao
6fc1e07da2
[Block] Re-enable DropPath
2023-07-21 16:39:23 -07:00
Tri Dao
9ee0ff1d9b
Fix using dO stride for O, which can cause memory error in bwd
2023-07-20 17:39:57 -07:00
Tri Dao
2dd87d0609
Merge pull request #360 from chuanli11/fix/dockerfile
remove checkout v2.0.0.post1 from dockerfile
2023-07-20 19:41:24 -04:00
chuanli11
30fd8c17d8
remove checkout v2.0.0.post1 from dockerfile
2023-07-20 16:40:15 +00:00
Tri Dao
b8020d73c9
Merge pull request #348 from eltociear/patch-2
[LayerNorm] Fix typo in ln_api.cpp
2023-07-19 17:25:37 -04:00
Ikko Eltociear Ashimine
dfc60f6b7d
[LayerNorm] Fix typo in ln_api.cpp
unintialized -> uninitialized
2023-07-20 01:16:16 +09:00
Tri Dao
31ae2488e6
Merge pull request #343 from danthe3rd/if_constexpr
Fix compile error with `BOOL_SWITCH`
2023-07-19 04:27:07 -04:00
danthe3rd
538d570c96
Fix compile error on MSVC
See also: https://stackoverflow.com/questions/55136414/constexpr-variable-captured-inside-lambda-loses-its-constexpr-ness
2023-07-19 08:04:57 +00:00
Tri Dao
d1a3b52f17
Add instruction about limiting number of ninja jobs
2023-07-17 23:17:47 -07:00
Tri Dao
b4cc152e97
Make sure dout is contiguous
2023-07-17 21:54:44 -07:00
Tri Dao
4f285b3547
FlashAttention-2 release
2023-07-17 06:21:34 -07:00
Tri Dao
6d48e14a6c
Bump to v1.0.9
2023-07-17 03:16:40 -07:00
Tri Dao
01c40dacc4
Merge pull request #313 from philipturner/patch-1
Metal FlashAttention
2023-07-15 20:36:48 -04:00
Philip Turner
4dbcaa1443
Update usage.md
2023-07-15 08:40:46 -04:00
Philip Turner
905c13a2d9
Update usage.md
2023-07-15 01:55:43 -04:00
Philip Turner
6ababeb7db
Update usage.md
2023-07-15 01:34:24 -04:00
Tri Dao
72ad03eaa6
Merge pull request #299 from proger/rotary-inference-mode
rotary: update cos/sin cache when switching from inference mode
2023-07-08 12:16:51 -04:00