# FlashAttention adoption
We've been very happy to see FlashAttention adopted by many organizations and research labs to speed up their training and inference, all within 6 months of its release at the time of writing. This page contains a partial list of places where FlashAttention is being used. If you'd like to add links to your organization, product, or codebase, please open a PR or email us. We'd very much like to hear from you!
## Integrated into machine learning frameworks
- PyTorch: FlashAttention is integrated into core PyTorch's nn.Transformer (see the sketch after this list).
- Huggingface's transformers library: integration is ongoing, with a blog post coming soon.
- MosaicML Composer library. Composer is a library for efficient neural network training.
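For reference, here is what the PyTorch integration looks like from the user side. This is a minimal sketch, assuming PyTorch 2.0+, where the fused FlashAttention-style kernels behind the nn.Transformer fastpath are also exposed directly via `torch.nn.functional.scaled_dot_product_attention`; the shapes and dtypes below are illustrative.

```python
# Minimal sketch (assumes PyTorch >= 2.0 and, for the fused kernel, a CUDA GPU).
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
# The fused CUDA kernels expect fp16/bf16; on CPU this falls back to the
# unfused math implementation.
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
v = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)

# Computes softmax(q @ k^T / sqrt(head_dim)) @ v without materializing the
# full (seq_len x seq_len) attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```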
## MLPerf benchmarks
MLPerf is a competitive machine learning performance benchmark. FlashAttention yields the fastest BERT training on cloud instances in MLPerf Training 2.0 (June 2022) and MLPerf Training 2.1 (November 2022).
- MLPerf 2.0: IEEE Spectrum article about our submission to the MLPerf 2.0 benchmark using FlashAttention.
- MLPerf 2.1 - collaboration between Azure and Hazy Research: for the first time, we can train MLPerf BERT in under 2 minutes on 16 nodes.
- MLPerf 2.1 - Nvidia: Nvidia uses techniques from FlashAttention to make their (already extremely optimized) BERT implementation go even faster.
- MLPerf 2.1 - MosaicML: FlashAttention helps train BERT 2.7x faster in the open division.
## Language model training & inference
- Meta's AITemplate uses FlashAttention as part of their approach to speed up Transformer inference (up to 5.3x on BERT).
- Kernl is a library for fast Transformer inference. They use FlashAttention as part of their approach to speed up Transformers by up to 12x.
## Diffusion model training and inference
- Huggingface's diffusers library for diffusion models. FlashAttention is integrated into diffusers v0.7.0, giving up to 2x faster inference and lower memory usage (a usage sketch follows this list).
- Colossal-AI's implementation of Stable Diffusion: with FlashAttention as one of its components, it speeds up pretraining by up to 6.5x and reduces the hardware cost of fine-tuning by 7x.
- Stable Diffusion inference from Labml.ai: 50% speedup.
- Our own Stable Diffusion fork uses FlashAttention to get a 3-4x speedup compared to the original version.
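As an illustration of the diffusers integration above, here is a minimal sketch, assuming diffusers v0.7.0+ with xformers installed on a CUDA machine; the model id and prompt are placeholders:

```python
# Minimal sketch (assumes diffusers >= 0.7.0, xformers, and a CUDA GPU).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder model id
    torch_dtype=torch.float16,
).to("cuda")

# Swap the attention modules for the memory-efficient
# (FlashAttention-style) implementation.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("an astronaut riding a horse").images[0]  # placeholder prompt
image.save("astronaut.png")
```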
## Other models
- Uni-Fold: an open-source platform for developing protein models beyond AlphaFold. With FlashAttention, Uni-Fold is 2.6x faster than AlphaFold.
## Different implementations
- Triton: an implementation of FlashAttention in Triton by Phil Tillet from OpenAI. Triton is a Python-based language and compiler for parallel programming.
- xformers: the xformers team has implemented memory-efficient attention in a similar spirit to FlashAttention (see the sketch below).
- JAX: an implementation in JAX by lucidrains.
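For a sense of how these alternative implementations are called, here is a minimal sketch of xformers' memory-efficient attention, assuming xformers is installed with a CUDA build; the shapes below are illustrative:

```python
# Minimal sketch (assumes xformers with a CUDA build and a GPU).
import torch
import xformers.ops as xops

batch, seq_len, heads, head_dim = 2, 1024, 8, 64
q = torch.randn(batch, seq_len, heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Inputs are (batch, seq_len, num_heads, head_dim); like FlashAttention,
# the full attention matrix is never materialized, so memory scales
# linearly in seq_len.
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```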