From 79160a69a944c22dc170327e9d7c5f0264c4123e Mon Sep 17 00:00:00 2001
From: Tri Dao
Date: Sun, 13 Nov 2022 16:40:18 -0800
Subject: [PATCH] Add a page on where FlashAttention is being used

---
 usage.md | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 usage.md

diff --git a/usage.md b/usage.md
new file mode 100644
index 0000000..f240195
--- /dev/null
+++ b/usage.md
@@ -0,0 +1,86 @@
+# FlashAttention adoption
+
+We've been very happy to see FlashAttention adopted by many organizations and
+research labs to speed up their training and inference, all within 6 months of
+FlashAttention's release at the time of writing.
+This page contains a partial list of places where FlashAttention is being used.
+If you'd like to add links to your organization / product / codebase, please open a
+PR or email us. We'd very much like to hear from you!
+
+## Integrated into machine learning frameworks
+
+- PyTorch: [integrated](https://github.com/pytorch/pytorch/pull/81434) into core
+  PyTorch in `nn.Transformer`; a minimal usage sketch follows this list.
+
+- Hugging Face's [transformers](https://github.com/huggingface/transformers) library.
+  Integration is [ongoing](https://github.com/huggingface/transformers/pull/18439),
+  with a blog post coming soon.
+
+- MosaicML's [Composer](https://github.com/mosaicml/composer)
+  [library](https://www.mosaicml.com/blog/gpt-3-quality-for-500k). Composer is a
+  library for efficient neural network training.
+
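+Below is a minimal sketch of the `nn.TransformerEncoder` module that this
+integration targets. Whether the fused FlashAttention kernel is actually used
+depends on the PyTorch version, device, dtype, and masking, so the fast-path
+conditions noted in the comments are assumptions rather than guarantees.
+
+```python
+import torch
+import torch.nn as nn
+
+# Standard encoder stack; on recent PyTorch builds its attention can be routed
+# through fused kernels on CUDA when the fast-path conditions are met
+# (inference mode, fp16/bf16 inputs, no unsupported masks, etc.).
+layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
+encoder = nn.TransformerEncoder(layer, num_layers=6).cuda().half().eval()
+
+x = torch.randn(4, 1024, 512, device="cuda", dtype=torch.float16)
+with torch.no_grad():
+    out = encoder(x)
+print(out.shape)  # torch.Size([4, 1024, 512])
+```
+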
+## MLPerf benchmarks
+
+[MLPerf](https://mlcommons.org/en/) is a competitive machine learning performance
+benchmark. FlashAttention yields the fastest BERT training on cloud instances in
+MLPerf Training 2.0 (June 2022) and MLPerf Training 2.1 (November 2022).
+
+- MLPerf 2.0: IEEE Spectrum [article](https://spectrum.ieee.org/mlperf-rankings-2022)
+  about our submission to the MLPerf 2.0 benchmark using FlashAttention.
+
+- MLPerf 2.1 - collaboration between
+  [Azure and Hazy Research](https://techcommunity.microsoft.com/t5/azure-high-performance-computing/azure-collaborates-with-hazy-research-and-nvidia-to-achieve/ba-p/3667511):
+  for the first time, we can train MLPerf BERT in under 2 minutes on 16 nodes.
+
+- MLPerf 2.1 -
+  [Nvidia](https://developer.nvidia.com/blog/leading-mlperf-training-2-1-with-full-stack-optimizations-for-ai/):
+  Nvidia uses techniques from FlashAttention to make their (already extremely
+  optimized) BERT implementation go even faster.
+
+- MLPerf 2.1 - [MosaicML](https://www.mosaicml.com/blog/mlperf-nlp-nov2022):
+  FlashAttention helps train BERT 2.7x faster in the open division.
+
+## Language model training & inference
+
+- Meta's
+  [AITemplate](https://ai.facebook.com/blog/gpu-inference-engine-nvidia-amd-open-source/)
+  uses FlashAttention as part of its approach to speed up Transformer inference
+  (up to 5.3x on BERT).
+
+- [Kernl](https://github.com/ELS-RD/kernl) is a library for fast Transformer
+  inference. They use FlashAttention as part of their
+  [approach](https://twitter.com/pommedeterre33/status/1585284221014245377) to
+  speed up Transformers by up to 12x.
+
+## Diffusion model training and inference
+
+- Hugging Face's [diffusers](https://github.com/huggingface/diffusers) library
+  for diffusion models. FlashAttention is integrated into
+  [diffusers v0.7.0](https://github.com/huggingface/diffusers/releases/tag/v0.7.0),
+  giving up to 2x faster inference and lower memory usage.
+
+- Stable Diffusion inference from
+  [Labml.ai](https://twitter.com/labmlai/status/1573634095732490240): 50% speedup.
+
+- Our own Stable Diffusion [fork](https://twitter.com/realDanFu/status/1580641495991754752)
+  uses FlashAttention to get a 3-4x speedup compared to the original version.
+
+## Other models
+
+- [Uni-Fold](https://github.com/dptech-corp/Uni-Fold): Uni-Fold is an
+  open-source platform for developing protein models beyond AlphaFold. With
+  FlashAttention, Uni-Fold is 2.6x
+  [faster](https://twitter.com/guolin_ke/status/1580532071901995008) than AlphaFold.
+
+## Different implementations
+
+- [Triton](https://github.com/openai/triton): an
+  [implementation](https://github.com/openai/triton/blob/master/python/tutorials/06-fused-attention.py)
+  of FlashAttention in Triton by Phil Tillet from OpenAI. Triton is a Python-based
+  language and compiler for parallel programming.
+
+- [xformers](https://github.com/facebookresearch/xformers): the xformers team has
+  implemented
+  [memory-efficient attention](https://twitter.com/fvsmassa/status/1580229170629849089)
+  in a similar spirit to FlashAttention; a short usage sketch appears at the end
+  of this page.
+
+- [JAX](https://github.com/google/jax): an
+  [implementation](https://github.com/lucidrains/flash-attention-jax) in JAX by
+  [lucidrains](https://github.com/lucidrains/).
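+
+As a rough illustration of the xformers kernel mentioned above, here is a
+minimal sketch of calling `memory_efficient_attention`. The expected tensor
+layout has changed across xformers releases, so the
+`(batch, seq_len, heads, head_dim)` layout used here is an assumption; check
+the documentation of your installed version.
+
+```python
+import torch
+import xformers.ops as xops
+
+# Assumed layout: (batch, seq_len, num_heads, head_dim). Older releases may
+# expect a different shape, so verify against your xformers version.
+q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
+k = torch.randn_like(q)
+v = torch.randn_like(q)
+
+# Memory-efficient attention in the same spirit as FlashAttention: the full
+# (seq_len x seq_len) attention matrix is never materialized in GPU memory.
+out = xops.memory_efficient_attention(q, k, v)
+print(out.shape)  # torch.Size([2, 1024, 8, 64])
+```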