# FlashAttention adoption
We've been very happy to see FlashAttention adopted by many organizations
and research labs to speed up their training / inference, all within 6 months of
FlashAttention's release (at the time of writing).
This page contains a partial list of places where FlashAttention is being used.
If you'd like to add links to your organization / product / codebase, please open a
PR or email us. We'd very much like to hear from you!
## Integrated into machine learning frameworks
- PyTorch: [integrated](https://github.com/pytorch/pytorch/pull/81434) into core PyTorch in `nn.Transformer` (see the sketch at the end of this list).
- Hugging Face's [transformers](https://github.com/huggingface/transformers)
library. Integration is [ongoing](https://github.com/huggingface/transformers/pull/18439),
with a blog post coming soon.
- Microsoft's [DeepSpeed](https://github.com/microsoft/DeepSpeed):
FlashAttention is [integrated](https://github.com/microsoft/DeepSpeed/blob/ec13da6ba7cabc44bb4745a64a208b8580792954/deepspeed/ops/transformer/inference/triton_ops.py) into DeepSpeed's inference engine.
- Nvidia's [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/pull/267). This
library is a popular framework for training large transformer language models at scale.
- MosaicML [Composer](https://github.com/mosaicml/composer)
[library](https://www.mosaicml.com/blog/gpt-3-quality-for-500k). Composer is a
library for efficient neural network training.
- EleutherAI's [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/pull/725). This is a research library for training large transformer language models at scale, based on NVIDIA's Megatron-LM and Microsoft's DeepSpeed.
- PaddlePaddle: integrated into the framework with the [API](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/nn/functional/flash_attention.py) `paddle.nn.functional.flash_attention`.
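
To make the PyTorch integration above concrete, here is a minimal, hedged sketch (not taken from the PyTorch docs). It assumes a CUDA GPU and fp16 inputs; whether `nn.TransformerEncoderLayer` actually dispatches to the fused attention kernel depends on the PyTorch version and the exact configuration.

```python
import torch

# Minimal sketch: the fused attention kernels are exercised through the standard
# nn.Transformer / nn.TransformerEncoderLayer modules. Eval mode on a CUDA GPU
# with fp16 and batch_first=True is the kind of setup the fast path targets;
# the exact dispatch conditions vary across PyTorch versions.
layer = torch.nn.TransformerEncoderLayer(
    d_model=1024, nhead=16, batch_first=True, device="cuda", dtype=torch.float16
).eval()

x = torch.randn(8, 512, 1024, device="cuda", dtype=torch.float16)  # (batch, seq, dim)
with torch.no_grad():
    out = layer(x)  # attention inside may run the fused (FlashAttention-style) kernel
print(out.shape)  # torch.Size([8, 512, 1024])
```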
## MLPerf benchmarks
[MLPerf](https://mlcommons.org/en/) is a competitive machine learning performance benchmark. FlashAttention
yields the fastest BERT training on cloud instances in MLPerf training 2.0 (June
2022) and MLPerf training 2.1 (November 2022).
- MLPerf 2.0: [IEEE Spectrum](https://spectrum.ieee.org/mlperf-rankings-2022) and [Forbes](https://www.forbes.com/sites/moorinsights/2022/07/12/google-dethrones-nvidia-in-latest-artificial-intelligence-benchmarking-tests/) articles about our submission to the MLPerf 2.0 benchmark using FlashAttention.
- MLPerf 2.1 - collaboration between [Azure and Hazy Research](https://techcommunity.microsoft.com/t5/azure-high-performance-computing/azure-collaborates-with-hazy-research-and-nvidia-to-achieve/ba-p/3667511): for the first time, we can train MLPerf BERT in under 2 minutes on 16 nodes.
- MLPerf 2.1 - [Nvidia](https://developer.nvidia.com/blog/leading-mlperf-training-2-1-with-full-stack-optimizations-for-ai/): Nvidia uses techniques from FlashAttention to make their (already extremely optimized) BERT implementation go even faster.
- MLPerf 2.1 - [MosaicML](https://www.mosaicml.com/blog/mlperf-nlp-nov2022): FlashAttention helps train BERT 2.7x faster in the open division.
## Language model training & inference
- [PubMedGPT 2.7B](https://crfm.stanford.edu/2022/12/15/pubmedgpt.html), a
domain-specific LLM for biomedicine, trained by Stanford CRFM on
[MosaicML](https://www.mosaicml.com/blog/introducing-pubmed-gpt) Cloud. Using
FlashAttention alone nearly halves the total training time.
- Meta's
[AITemplate](https://ai.facebook.com/blog/gpu-inference-engine-nvidia-amd-open-source/)
uses FlashAttention as part of their approach to speed up Transformer
inference (up to 5.3x on BERT).
- Nvidia's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) is a
state-of-the-art Transformer inference library. As of version
[5.2](https://github.com/NVIDIA/FasterTransformer/commit/b672f49e256ba7a2d4fc9691d270b60b7fc1a2ff),
FlashAttention is used as a component of FasterTransformer to speed up GPT inference.
- [Kernl](https://github.com/ELS-RD/kernl) is a library for fast Transformer
inference. They use FlashAttention as part of their
[approach](https://twitter.com/pommedeterre33/status/1585284221014245377) to
speed up Transformers by up to 12x.
## Diffusion model training and inference
- Hugging Face's [diffusers](https://github.com/huggingface/diffusers) library
for diffusion models. FlashAttention is integrated into [diffusers
v0.7.0](https://github.com/huggingface/diffusers/releases/tag/v0.7.0), yielding
up to 2x faster inference and lower memory usage.
- Colossal-AI's
[implementation](https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion)
of Stable Diffusion: with FlashAttention as one of its components, it speeds up
pretraining by up to 6.5x, and reduces the hardware cost of fine-tuning by 7x.
- Meta's
[AITemplate](https://ai.facebook.com/blog/gpu-inference-engine-nvidia-amd-open-source/),
with FlashAttention as one of its components, is currently the [fastest](https://twitter.com/bing_xu_/status/1590447334055632897) Stable
Diffusion inference engine that we know of.
- Stable Diffusion inference from
[Labml.ai](https://twitter.com/labmlai/status/1573634095732490240): 50% speedup.
- Our own Stable Diffusion [fork](https://twitter.com/realDanFu/status/1580641495991754752) uses FlashAttention to get a 3-4x speedup compared
to the original version.
## Other models
- [Uni-Fold](https://github.com/dptech-corp/Uni-Fold): an
open-source platform for developing protein models beyond AlphaFold. With
FlashAttention, Uni-Fold is 2.6x
[faster](https://twitter.com/guolin_ke/status/1580532071901995008) than AlphaFold.
- [OpenFold](https://github.com/aqlaboratory/openfold): a trainable,
memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2. With
FlashAttention as one of its
[components](https://twitter.com/gahdritz/status/1595420944880779266), it is
up to 3x faster than AlphaFold2 at running inference on short sequences, and can
predict 2x longer structures.
## Different implementations
- [Triton](https://github.com/openai/triton): an [implementation](https://github.com/openai/triton/blob/master/python/tutorials/06-fused-attention.py) of
FlashAttention in Triton by Phil Tillet from OpenAI. Triton is a Python-based
language and compiler for parallel programming.
- [xformers](https://github.com/facebookresearch/xformers): The xformers team
has implemented [memory-efficient
attention](https://twitter.com/fvsmassa/status/1580229170629849089) in a
similar spirit to FlashAttention.
xformers dynamically dispatches to whichever implementation is available and faster
(a usage sketch appears at the end of this section).
- [JAX](https://github.com/google/jax): an [implementation](https://github.com/lucidrains/flash-attention-jax)
in JAX by [lucidrains](https://github.com/lucidrains/).
- [Metal](https://developer.apple.com/metal): an [implementation](https://github.com/philipturner/metal-flash-attention) in Metal by Philip Turner. This ports FlashAttention to mobile GPU architectures such as Apple silicon.
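
To make the xformers entry above more concrete, here is a minimal, hedged sketch of its dispatching attention API. It assumes a recent xformers build with CUDA support and fp16 tensors; the accepted tensor layout ((batch, seqlen, heads, head_dim) here) and the set of kernels it can pick from depend on the installed version and the GPU.

```python
import torch
import xformers.ops as xops

# Sketch only: xformers.ops.memory_efficient_attention picks the fastest backend
# available for the given inputs and hardware (e.g. a FlashAttention-style kernel
# or a memory-efficient CUTLASS kernel).
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)  # output has the same shape as q
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```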