Commit Graph

380 Commits

Author SHA1 Message Date
Tri Dao
c60851a825 Bump to v2.0.7 2023-08-14 14:55:35 -07:00
Aman Gupta Karmani
aab603af4f fix binary wheel installation when nvcc is not available (#448) 2023-08-14 14:54:26 -07:00
Tri Dao
f8dccfc90a [CI] Fix MATRIX_CUDA_VERSION check 2023-08-14 10:27:26 -07:00
Tri Dao
9c531bdc0a Use single thread compilation for cuda12.1, torch2.1 to avoid OOM CI 2023-08-14 10:03:31 -07:00
Tri Dao
67ae6fd74b Bump to v2.0.6 2023-08-13 16:52:48 -07:00
Tri Dao
2ddeaa406c Fix wheel building 2023-08-13 16:48:47 -07:00
Tri Dao
d8ec6a2f13 Merge branch 'piercefreeman-feature/demo-wheels' 2023-08-13 16:09:38 -07:00
* piercefreeman-feature/demo-wheels: (25 commits)
  Install standard non-wheel package
  Remove release creation
  Build wheel on each push
  Isolate 2.0.0 & cuda12
  Clean setup.py imports
  Remove builder project
  Bump version
  Add notes to github action workflow
  Add torch dependency to final build
  Exclude cuda erroring builds
  Exclude additional disallowed matrix params
  Full version matrix
  Add CUDA 11.7
  Release is actually unsupported
  echo OS version
  Temp disable deploy
  OS version build numbers
  Restore full build matrix
  Refactor and clean of setup.py
  Strip cuda name from torch version
  ...
Tri Dao
3c458cff77 Merge branch 'feature/demo-wheels' of https://github.com/piercefreeman/flash-attention into piercefreeman-feature/demo-wheels 2023-08-13 16:03:51 -07:00
* 'feature/demo-wheels' of https://github.com/piercefreeman/flash-attention: (25 commits)
  Install standard non-wheel package
  Remove release creation
  Build wheel on each push
  Isolate 2.0.0 & cuda12
  Clean setup.py imports
  Remove builder project
  Bump version
  Add notes to github action workflow
  Add torch dependency to final build
  Exclude cuda erroring builds
  Exclude additional disallowed matrix params
  Full version matrix
  Add CUDA 11.7
  Release is actually unsupported
  echo OS version
  Temp disable deploy
  OS version build numbers
  Restore full build matrix
  Refactor and clean of setup.py
  Strip cuda name from torch version
  ...
Tri Dao
dbd7923782 Prepare for Cutlass 3.2 2023-08-13 15:24:32 -07:00
Tri Dao
c5e87b11e9 Bump to v2.0.5 2023-08-13 13:55:04 -07:00
Tri Dao
3524e13c11 Update to Cutlass 3.1 2023-08-13 13:53:17 -07:00
Pierce Freeman
6ef3bd800e Install standard non-wheel package 2023-08-10 20:12:20 -07:00
Pierce Freeman
ecc6535443 Remove release creation 2023-08-10 19:56:24 -07:00
Pierce Freeman
bc6d4992f2 Build wheel on each push 2023-08-10 19:55:52 -07:00
Pierce Freeman
565615c603 Isolate 2.0.0 & cuda12 2023-08-10 19:54:29 -07:00
Tri Dao
364a5b4a71 [MLP] Change the check for out_features being None 2023-08-10 00:04:38 -07:00
Tri Dao
d30f2e1cd5 Bump to v2.0.4 2023-08-01 09:01:07 -07:00
Tri Dao
1c41d2b0e5 Fix race condition in bwd (overwriting sK) 2023-08-01 09:00:10 -07:00
Tri Dao
a4e5d1eddd Bump to v2.0.3 2023-07-31 17:49:23 -07:00
Tri Dao
8f4cd4c16b [Docs] Fix docstring about Q nheads being divisible by KV nheads 2023-07-31 17:47:03 -07:00
Tri Dao
a4f148b6ab Fix masking of bwd when seqlen is not divisible by 128 2023-07-31 17:46:34 -07:00
Tri Dao
184b992dcb [GPT] Implement parallel LLaMa 2023-07-28 15:52:48 -10:00
Tri Dao
840f7925a0 [Docs] Fix mention of MQA/GQA in qkvpacked functions 2023-07-28 12:26:29 -10:00
Tri Dao
60499abcfd [Benchmark] Add script to benchmark FlashAttention 2023-07-28 00:26:52 -10:00
Kirthi Shankar Sivamani
32a953f486 Request for v2.0.2 (#388) 2023-07-28 02:46:03 -07:00
  * Bump version to 2.0.2
  * Update version in Dockerfile
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Kirthi Shankar Sivamani
a03f6f8e9e Enable CUDA graphs (#386) 2023-07-27 16:11:34 -07:00
  * Add RNG state to kernel launch params
  * Save seed and offset for backward
  * Single thread write to global mem
  * compute_dq_dk_dv_1colblock get seed and offset from launch params
  * compute_dq_dk_dv_1rowblock get seed and offset from launch params
  * Change forward c++ APIs to save RNG state for backward
  * Change backward c++ APIs to set RNG state for bprop launcher
  * Bug fixes
  * Python side API changes
  * Bug fix; only save seeds instead of full offset
  * Account for 3D grid size
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
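Read together, the bullets in #386 describe one pattern: stop baking the dropout RNG seed and offset into kernel launch arguments and instead pass pointers that are dereferenced at run time, with the forward pass writing the Philox state it used to global memory (from a single thread) so backward kernels can reload it. That is what makes the kernels capturable by CUDA graphs, since a captured launch replays its arguments verbatim and a by-value seed would repeat on every replay. Below is a minimal CUDA C++ sketch of that contract (compile with nvcc); `rng_consumer` and its parameters are hypothetical stand-ins, not the flash-attention kernels:

```cpp
// Sketch only: a stand-in "kernel with dropout RNG" that reads its seed and
// offset through pointers at replay time instead of taking them by value.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void rng_consumer(const uint64_t* seed, const uint64_t* offset,
                             float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Stand-in for Philox-based dropout: *seed and *offset are read at
        // launch (or graph-replay) time, not at capture time.
        out[i] = (float)((*seed + *offset + (uint64_t)i) % 97u) / 97.0f;
    }
}

int main() {
    const int n = 1024;
    uint64_t *d_seed, *d_offset;
    float* d_out;
    cudaMalloc(&d_seed, sizeof(uint64_t));
    cudaMalloc(&d_offset, sizeof(uint64_t));
    cudaMemset(d_offset, 0, sizeof(uint64_t));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture one launch. The graph records the *pointer* arguments, so the
    // kernel sees whatever the seed slot holds when the graph is replayed.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    rng_consumer<<<(n + 255) / 256, 256, 0, stream>>>(d_seed, d_offset, d_out, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&exec, graph, 0);

    // Replay with fresh seeds; no re-capture is needed to change the RNG.
    for (uint64_t s = 1; s <= 2; ++s) {
        cudaMemcpyAsync(d_seed, &s, sizeof(s), cudaMemcpyHostToDevice, stream);
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_out); cudaFree(d_offset); cudaFree(d_seed);
    return 0;
}
```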
Tri Dao
4c98d0b41f [MLP] Edit ParallelGatedMlp 2023-07-26 09:39:37 -10:00
Haodong Lyu
8ee62efca3 Implement ParallelGatedMlp (#251) 2023-07-26 12:14:15 -07:00
Tri Dao
56ccaff126 [GPT] Add LLaMa-13B to test 2023-07-26 07:22:22 -10:00
Tri Dao
8e9820a55b [Rotary] Fix tests when loading state dict with rotary inv_freqs 2023-07-26 07:16:33 -10:00
Tri Dao
b252072409 Bump to v2.0.1 2023-07-23 12:33:42 -10:00
Tri Dao
2a2a3c4bfd [LayerNorm] Add test for randomness 2023-07-23 12:31:55 -10:00
Joel Lamy-Poirier
767b71ccf0 Fix random state for dropout_layer_norm (#315) 2023-07-23 15:05:13 -07:00
Tri Dao
d38357dd2f [GPT] Implement Falcon 2023-07-23 10:32:29 -07:00
Kiarash Jamali
684196b8c5 Allow rotary embeddings for Bert (#363) 2023-07-23 00:21:45 -07:00
Ian Timmis
cbf982afa5 README syntax highlighting (#365) 2023-07-23 00:21:30 -07:00
  * README syntax highlighting: adds syntax highlighting to README
  * Update README.md
Tri Dao
425dbcb6c6 [MHA] Implement MQA/GQA 2023-07-23 00:06:58 -07:00
Tri Dao
ec9f74ab9a [Rotary] Don't store inv_freq in state_dict 2023-07-22 23:52:42 -07:00
Tri Dao
a157cc8c9b [FT] Implement MQA/GQA 2023-07-22 23:47:01 -07:00
Tri Dao
75e334d407 [MLP] Add ParallelMLP 2023-07-22 23:45:51 -07:00
Tri Dao
b3177dfaf6 [GPT] Enable FlashAttention for GPT-J 2023-07-21 17:29:10 -07:00
Tri Dao
6fc1e07da2 [Block] Re-enable DropPath 2023-07-21 16:39:23 -07:00
Tri Dao
9ee0ff1d9b Fix using dO stride for O, which can cause memory error in bwd 2023-07-20 17:39:57 -07:00
Tri Dao
2dd87d0609 Merge pull request #360 from chuanli11/fix/dockerfile 2023-07-20 19:41:24 -04:00
  remove checkout v2.0.0.post1 from dockerfile
chuanli11
30fd8c17d8 remove checkout v2.0.0.post1 from dockerfile 2023-07-20 16:40:15 +00:00
Tri Dao
b8020d73c9 Merge pull request #348 from eltociear/patch-2 2023-07-19 17:25:37 -04:00
  [LayerNorm] Fix typo in ln_api.cpp
Ikko Eltociear Ashimine
dfc60f6b7d [LayerNorm] Fix typo in ln_api.cpp 2023-07-20 01:16:16 +09:00
  unintialized -> uninitialized
Tri Dao
31ae2488e6 Merge pull request #343 from danthe3rd/if_constexpr 2023-07-19 04:27:07 -04:00
  Fix compile error with `BOOL_SWITCH`
danthe3rd
538d570c96 Fix compile error on MSVC 2023-07-19 08:04:57 +00:00
  See also: https://stackoverflow.com/questions/55136414/constexpr-variable-captured-inside-lambda-loses-its-constexpr-ness
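The MSVC breakage in #343 is the pitfall described in the linked Stack Overflow thread: a `constexpr` local loses its constant-expression status once a nested lambda captures it. A minimal sketch of the `BOOL_SWITCH`-style dispatch pattern follows (hypothetical names, not the repo's exact macro), using the commonly suggested `static constexpr` workaround — statics are not captured, so inner lambdas may still use the value in constant expressions:

```cpp
// Sketch of a BOOL_SWITCH-style dispatch: a runtime bool selects between two
// instantiations of the body, inside which the value is a compile-time
// constant usable as a template argument.
#include <cstdio>

#define BOOL_SWITCH(COND, CONST_NAME, ...)      \
  [&] {                                         \
    if (COND) {                                 \
      static constexpr bool CONST_NAME = true;  \
      return __VA_ARGS__();                     \
    } else {                                    \
      static constexpr bool CONST_NAME = false; \
      return __VA_ARGS__();                     \
    }                                           \
  }()

template <bool IsCausal>
void run_kernel() { std::printf("causal=%d\n", int(IsCausal)); }

int main() {
    bool is_causal = true;  // known only at runtime
    BOOL_SWITCH(is_causal, kIsCausal, [&] {
        // With a plain (non-static) constexpr bool in the macro, MSVC
        // rejects this template-argument use once the lambda captures it;
        // `static constexpr` sidesteps the capture entirely.
        run_kernel<kIsCausal>();
    });
    return 0;
}
```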
Tri Dao
d1a3b52f17 Add instruction about limiting number of ninja jobs 2023-07-17 23:17:47 -07:00