# FlashAttention adoption
We've been very happy to see FlashAttention adopted by many organizations and research labs to speed up their training and inference, all within 6 months of its release at the time of writing. This page contains a partial list of places where FlashAttention is being used. If you'd like to add links to your organization, product, or codebase, please open a PR or email us. We'd very much like to hear from you!
## Integrated into machine learning frameworks
- PyTorch: FlashAttention is integrated into core PyTorch's nn.Transformer (see the sketch after this list).
- Huggingface's transformers library: integration is ongoing, with a blog post coming soon.
- MosaicML Composer library. Composer is a library for efficient neural network training.
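For reference, here is what the PyTorch integration looks like from the user side. This is a minimal sketch, assuming PyTorch 2.0+, where the fused FlashAttention-style kernels behind the nn.Transformer fastpath are also exposed directly via `torch.nn.functional.scaled_dot_product_attention`; the shapes and dtypes below are illustrative.

```python
# Minimal sketch (assumes PyTorch >= 2.0 and, for the fused kernel, a CUDA GPU).
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
# The fused CUDA kernels expect fp16/bf16; on CPU this falls back to the
# unfused math implementation.
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
v = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)

# Computes softmax(q @ k^T / sqrt(head_dim)) @ v without materializing the
# full (seq_len x seq_len) attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```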
## MLPerf benchmarks
MLPerf is a competitive machine learning performance benchmark. FlashAttention yields the fastest BERT training on cloud instances in MLPerf Training 2.0 (June 2022) and MLPerf Training 2.1 (November 2022).
- MLPerf 2.0: IEEE Spectrum article about our submission to the MLPerf 2.0 benchmark using FlashAttention.
- MLPerf 2.1 - collaboration between Azure and Hazy Research: for the first time, we can train MLPerf BERT in under 2 minutes on 16 nodes.
- MLPerf 2.1 - Nvidia: Nvidia uses techniques from FlashAttention to make their (already extremely optimized) BERT implementation go even faster.
- MLPerf 2.1 - MosaicML: FlashAttention helps train BERT 2.7x faster in the open division.
## Language model training & inference
- Meta's AITemplate uses FlashAttention as part of their approach to speed up Transformer inference (up to 5.3x on BERT).
- Kernl is a library for fast Transformer inference. They use FlashAttention as part of their approach to speed up Transformers by up to 12x.
## Diffusion model training and inference
- Huggingface's diffusers library for diffusion models. FlashAttention is integrated into diffusers v0.7.0, giving up to 2x faster inference and lower memory usage (a usage sketch follows this list).
- Colossal-AI's implementation of Stable Diffusion: with FlashAttention as one of its components, it speeds up pretraining by up to 6.5x and reduces the hardware cost of fine-tuning by 7x.
- Stable Diffusion inference from Labml.ai: 50% speedup.
- Our own Stable Diffusion fork uses FlashAttention to get a 3-4x speedup compared to the original version.
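As an illustration of the diffusers integration above, here is a minimal sketch, assuming diffusers v0.7.0+ with xformers installed on a CUDA machine; the model id and prompt are placeholders:

```python
# Minimal sketch (assumes diffusers >= 0.7.0, xformers, and a CUDA GPU).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder model id
    torch_dtype=torch.float16,
).to("cuda")

# Swap the attention modules for the memory-efficient
# (FlashAttention-style) implementation.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("an astronaut riding a horse").images[0]  # placeholder prompt
image.save("astronaut.png")
```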
## Other models
- Uni-Fold: an open-source platform for developing protein models beyond AlphaFold. With FlashAttention, Uni-Fold is 2.6x faster than AlphaFold.
## Different implementations
- Triton: an implementation of FlashAttention in Triton by Phil Tillet from OpenAI. Triton is a Python-based language and compiler for parallel programming.
- xformers: the xformers team has implemented memory-efficient attention in a similar spirit to FlashAttention (see the sketch below).
- JAX: an implementation in JAX by lucidrains.
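For a sense of how these alternative implementations are called, here is a minimal sketch of xformers' memory-efficient attention, assuming xformers is installed with a CUDA build; the shapes below are illustrative:

```python
# Minimal sketch (assumes xformers with a CUDA build and a GPU).
import torch
import xformers.ops as xops

batch, seq_len, heads, head_dim = 2, 1024, 8, 64
q = torch.randn(batch, seq_len, heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Inputs are (batch, seq_len, num_heads, head_dim); like FlashAttention,
# the full attention matrix is never materialized, so memory scales
# linearly in seq_len.
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```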