FlashAttention adoption

We've been very happy to see FlashAttention adopted by many organizations and research labs to speed up their training and inference, all within six months of FlashAttention's release (at the time of writing). This page contains a partial list of places where FlashAttention is being used. If you'd like to add links to your organization / product / codebase, please open a PR or email us. We'd very much like to hear from you!

Integrated into machine learning frameworks

MLPerf benchmarks

MLPerf is a competitive machine learning performance benchmark. FlashAttention yields the fastest BERT training on cloud instances in MLPerf Training 2.0 (June 2022) and MLPerf Training 2.1 (November 2022).

  • MLPerf 2.0: IEEE Spectrum article about our submission to the MLPerf 2.0 benchmark using FlashAttention.

  • MLPerf 2.1 - collaboration between Azure and Hazy Research: for the first time, we can train MLPerf BERT in under 2 minutes on 16 nodes.

  • MLPerf 2.1 - Nvidia: Nvidia uses techniques from FlashAttention to make their (already extremely optimized) BERT implementation go even faster.

  • MLPerf 2.1 - MosaicML: FlashAttention helps train BERT 2.7x faster in the open division.

Language model training & inference

  • Meta's AITemplate uses FlashAttention as part of its approach to speed up Transformer inference (up to 5.3x on BERT).

  • Kernl is a library for fast Transformer inference. It uses FlashAttention as part of its approach to speed up Transformers by up to 12x (a sketch of calling FlashAttention's kernel directly appears after this list).
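
Both of these integrations ultimately dispatch to FlashAttention's fused CUDA kernel. As a minimal sketch of what that looks like, here is one way to call the kernel directly through the flash-attn Python package, assuming a CUDA device, fp16 inputs, and a recent flash-attn release (the exact import path has varied across versions):

```python
import torch
from flash_attn import flash_attn_func  # top-level import in recent flash-attn releases

# Toy self-attention inputs with shape (batch, seqlen, nheads, headdim).
# FlashAttention requires fp16/bf16 tensors on a CUDA device.
batch, seqlen, nheads, headdim = 2, 1024, 12, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention: computes softmax(QK^T / sqrt(headdim)) V without ever
# materializing the full (seqlen x seqlen) attention matrix in GPU HBM.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
print(out.shape)  # torch.Size([2, 1024, 12, 64])
```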

Diffusion model training and inference

  • Huggingface's diffusers library for diffusion models: FlashAttention is integrated as of diffusers v0.7.0, giving up to 2x faster inference and lower memory usage (see the sketch after this list).

  • Colossal-AI's implementation of Stable Diffusion: with FlashAttention as one of its components, it speeds up pretraining by up to 6.5x, and reduces the hardware cost of fine-tuning by 7x.

  • Stable Diffusion inference from Labml.ai: 50% speedup.

  • Our own Stable Diffusion fork uses FlashAttention to get 3-4x speedup compared to the original version.
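
As a concrete example of the diffusers integration above, here is a minimal sketch of enabling FlashAttention-backed memory-efficient attention in a Stable Diffusion pipeline. It assumes diffusers (>= 0.7.0) and xformers are installed and a CUDA GPU is available; the model ID and prompt are illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion pipeline in half precision on the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Route attention through xformers' memory-efficient kernels, which include
# FlashAttention. Requires the xformers package to be installed.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```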

Other models

  • Uni-Fold: an open-source platform for developing protein models beyond AlphaFold. With FlashAttention, Uni-Fold is 2.6x faster than AlphaFold.

Different implementations