diff --git a/README.md b/README.md
index 5ac1ab5..8bfc85c 100644
--- a/README.md
+++ b/README.md
@@ -71,6 +71,14 @@ Memory savings are proportional to sequence length -- since standard attention h
 We see 10X memory savings at sequence length 2K, and 20X at 4K.
 As a result, FlashAttention can scale to much longer sequence lengths.
 
+#### Head Dimension 128
+
+![FlashAttention speedup, head dimension 128](assets/flashattn_speedup_a100_d128.jpg)
+
+We show speedup with head dimension 128.
+Here we use batch size 16 with 12 heads.
+Speedup is lower than with the smaller head sizes, but still significant -- especially with a causal mask.
+
 ### RTX 3090
 
 For the RTX 3090, we use batch size 12 with 12 attention heads.
diff --git a/assets/flashattn_speedup_a100_d128.jpg b/assets/flashattn_speedup_a100_d128.jpg
new file mode 100644
index 0000000..ac6a677
Binary files a/assets/flashattn_speedup_a100_d128.jpg and b/assets/flashattn_speedup_a100_d128.jpg differ
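
For context, the head-dimension-128 configuration described in the added section (batch size 16, 12 heads) could be exercised with a timing harness along these lines. This is a minimal sketch, not part of the patch: the `flash_attn_func` entry point, its `(batch, seqlen, nheads, headdim)` layout, and the sequence length of 2048 are assumptions that may differ from the benchmark script and the installed flash-attn version.

```python
import torch
from flash_attn import flash_attn_func  # assumed public entry point; may differ by version

# Shapes from the section above: batch 16, 12 heads, head dimension 128.
# Sequence length 2048 is an assumption for illustration.
batch, seqlen, nheads, headdim = 16, 2048, 12, 128
q, k, v = [torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
           for _ in range(3)]

def bench(causal, iters=30):
    """Return the average forward-pass time in milliseconds."""
    for _ in range(10):  # warm-up iterations
        flash_attn_func(q, k, v, causal=causal)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        flash_attn_func(q, k, v, causal=causal)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print(f"non-causal: {bench(False):.2f} ms/iter, causal: {bench(True):.2f} ms/iter")
```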