From 79160a69a944c22dc170327e9d7c5f0264c4123e Mon Sep 17 00:00:00 2001
From: Tri Dao
Date: Sun, 13 Nov 2022 16:40:18 -0800
Subject: [PATCH] Add a page on where FlashAttention is being used

---
 usage.md | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 usage.md

diff --git a/usage.md b/usage.md
new file mode 100644
index 0000000..f240195
--- /dev/null
+++ b/usage.md
@@ -0,0 +1,86 @@
+# FlashAttention adoption
+
+We've been very happy to see FlashAttention adopted by many organizations and
+research labs to speed up their training and inference, all within 6 months of
+FlashAttention's release at the time of writing.
+This page contains a partial list of places where FlashAttention is being used.
+If you'd like to add links to your organization / product / codebase, please open a
+PR or email us. We'd very much like to hear from you!
+
+## Integrated into machine learning frameworks
+
+- PyTorch: [integrated](https://github.com/pytorch/pytorch/pull/81434) into core
+  PyTorch in `nn.Transformer`; a minimal usage sketch follows this list.
+
+- Hugging Face's [transformers](https://github.com/huggingface/transformers) library.
+  Integration is [ongoing](https://github.com/huggingface/transformers/pull/18439),
+  with a blog post coming soon.
+
+- MosaicML's [Composer](https://github.com/mosaicml/composer)
+  [library](https://www.mosaicml.com/blog/gpt-3-quality-for-500k). Composer is a
+  library for efficient neural network training.
+
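+Below is a minimal sketch of the `nn.TransformerEncoder` module that this
+integration targets. Whether the fused FlashAttention kernel is actually used
+depends on the PyTorch version, device, dtype, and masking, so the fast-path
+conditions noted in the comments are assumptions rather than guarantees.
+
+```python
+import torch
+import torch.nn as nn
+
+# Standard encoder stack; on recent PyTorch builds its attention can be routed
+# through fused kernels on CUDA when the fast-path conditions are met
+# (inference mode, fp16/bf16 inputs, no unsupported masks, etc.).
+layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
+encoder = nn.TransformerEncoder(layer, num_layers=6).cuda().half().eval()
+
+x = torch.randn(4, 1024, 512, device="cuda", dtype=torch.float16)
+with torch.no_grad():
+    out = encoder(x)
+print(out.shape)  # torch.Size([4, 1024, 512])
+```
+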
+## MLPerf benchmarks
+
+[MLPerf](https://mlcommons.org/en/) is a competitive machine learning performance
+benchmark. FlashAttention yields the fastest BERT training on cloud instances in
+MLPerf Training 2.0 (June 2022) and MLPerf Training 2.1 (November 2022).
+
+- MLPerf 2.0: IEEE Spectrum [article](https://spectrum.ieee.org/mlperf-rankings-2022)
+  about our submission to the MLPerf 2.0 benchmark using FlashAttention.
+
+- MLPerf 2.1 - collaboration between
+  [Azure and Hazy Research](https://techcommunity.microsoft.com/t5/azure-high-performance-computing/azure-collaborates-with-hazy-research-and-nvidia-to-achieve/ba-p/3667511):
+  for the first time, we can train MLPerf BERT in under 2 minutes on 16 nodes.
+
+- MLPerf 2.1 -
+  [Nvidia](https://developer.nvidia.com/blog/leading-mlperf-training-2-1-with-full-stack-optimizations-for-ai/):
+  Nvidia uses techniques from FlashAttention to make their (already extremely
+  optimized) BERT implementation go even faster.
+
+- MLPerf 2.1 - [MosaicML](https://www.mosaicml.com/blog/mlperf-nlp-nov2022):
+  FlashAttention helps train BERT 2.7x faster in the open division.
+
+## Language model training & inference
+
+- Meta's
+  [AITemplate](https://ai.facebook.com/blog/gpu-inference-engine-nvidia-amd-open-source/)
+  uses FlashAttention as part of its approach to speed up Transformer inference
+  (up to 5.3x on BERT).
+
+- [Kernl](https://github.com/ELS-RD/kernl) is a library for fast Transformer
+  inference. They use FlashAttention as part of their
+  [approach](https://twitter.com/pommedeterre33/status/1585284221014245377) to
+  speed up Transformers by up to 12x.
+
+## Diffusion model training and inference
+
+- Hugging Face's [diffusers](https://github.com/huggingface/diffusers) library
+  for diffusion models. FlashAttention is integrated into
+  [diffusers v0.7.0](https://github.com/huggingface/diffusers/releases/tag/v0.7.0),
+  giving up to 2x faster inference and lower memory usage.
+
+- Stable Diffusion inference from
+  [Labml.ai](https://twitter.com/labmlai/status/1573634095732490240): 50% speedup.
+
+- Our own Stable Diffusion [fork](https://twitter.com/realDanFu/status/1580641495991754752)
+  uses FlashAttention to get a 3-4x speedup compared to the original version.
+
+## Other models
+
+- [Uni-Fold](https://github.com/dptech-corp/Uni-Fold): Uni-Fold is an
+  open-source platform for developing protein models beyond AlphaFold. With
+  FlashAttention, Uni-Fold is 2.6x
+  [faster](https://twitter.com/guolin_ke/status/1580532071901995008) than AlphaFold.
+
+## Different implementations
+
+- [Triton](https://github.com/openai/triton): an
+  [implementation](https://github.com/openai/triton/blob/master/python/tutorials/06-fused-attention.py)
+  of FlashAttention in Triton by Phil Tillet from OpenAI. Triton is a Python-based
+  language and compiler for parallel programming.
+
+- [xformers](https://github.com/facebookresearch/xformers): the xformers team has
+  implemented
+  [memory-efficient attention](https://twitter.com/fvsmassa/status/1580229170629849089)
+  in a similar spirit to FlashAttention; a short usage sketch appears at the end
+  of this page.
+
+- [JAX](https://github.com/google/jax): an
+  [implementation](https://github.com/lucidrains/flash-attention-jax) in JAX by
+  [lucidrains](https://github.com/lucidrains/).
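+
+As a rough illustration of the xformers kernel mentioned above, here is a
+minimal sketch of calling `memory_efficient_attention`. The expected tensor
+layout has changed across xformers releases, so the
+`(batch, seq_len, heads, head_dim)` layout used here is an assumption; check
+the documentation of your installed version.
+
+```python
+import torch
+import xformers.ops as xops
+
+# Assumed layout: (batch, seq_len, num_heads, head_dim). Older releases may
+# expect a different shape, so verify against your xformers version.
+q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
+k = torch.randn_like(q)
+v = torch.randn_like(q)
+
+# Memory-efficient attention in the same spirit as FlashAttention: the full
+# (seq_len x seq_len) attention matrix is never materialized in GPU memory.
+out = xops.memory_efficient_attention(q, k, v)
+print(out.shape)  # torch.Size([2, 1024, 8, 64])
+```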