FlashAttention

Standard attention is not slow because of FLOPs. It is slow because it materialises the n x n attention matrix in HBM, then reads it back to multiply by V. On an A100, HBM delivers roughly 1.5 TB/s while on-chip SRAM delivers ~19 TB/s, so most of the wall-clock time goes to waiting on HBM traffic, not to the tensor cores. FlashAttention (Dao et al., 2022) reframes attention as an IO problem and solves it with three classic techniques: tiling, online softmax, and recomputation in the backward pass.

The bandwidth bottleneck

A textbook attention forward pass does roughly:

S = Q K^T          # write n*n matrix to HBM
P = softmax(S)     # read S, write P
O = P V            # read P, write O

For n = 8192 in fp16, each n x n matrix is 128 MiB per head (8192^2 elements x 2 bytes), so S and P together account for 256 MiB that is written to HBM and then read back. The matmuls themselves are cheap relative to those round-trips, so at long context lengths the standard implementation is bound by HBM bandwidth rather than by compute.
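The arithmetic behind that estimate can be sketched in a few lines. This is a rough model, not a measurement: it counts only the S and P traffic for one head of one sequence, assumes an unfused softmax (four passes over n x n elements), and uses the ~1.5 TB/s HBM figure quoted earlier. The function name `attention_matrix_traffic` is made up for illustration.

```python
def attention_matrix_traffic(n, bytes_per_elem=2, hbm_bytes_per_s=1.5e12):
    """Return (bytes per n x n matrix, HBM bytes moved for S and P, seconds).

    Assumption: S is written then read, and P is written then read,
    so the n x n data crosses the HBM bus four times per head.
    """
    matrix_bytes = n * n * bytes_per_elem
    total_bytes = 4 * matrix_bytes
    return matrix_bytes, total_bytes, total_bytes / hbm_bytes_per_s


matrix_bytes, total_bytes, seconds = attention_matrix_traffic(8192)
print(matrix_bytes / 2**20, "MiB per matrix")
print(total_bytes / 2**20, "MiB moved per head")
print(seconds * 1e6, "microseconds at 1.5 TB/s")
```

Multiply the result by batch size and head count to get a feel for a full layer; the traffic for Q, K, V and O is ignored here because it grows only linearly in n.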

The fix: tiling + online softmax

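FlashAttention loops over blocks of Q and K/V, holding small tiles in SRAM and computing softmax incrementally with an online algorithm (Milakov & Gimelshein 2018) that keeps a running max and a running sum of exponentials for each row. The pseudocode below uses the Q-outer, K/V-inner ordering of FlashAttention-2; the original kernel put the K/V blocks in the outer loop. Either way, the full n x n matrix never exists in HBM, and the forward pass writes only O and the per-row softmax statistics (m, l), which together encode the log-sum-exp.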

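for Qi in blocks(Q):
    mi, li = -inf, 0                 # per-row running max and running sum of exp
    Oi = 0                           # unnormalised output accumulator
    for Kj, Vj in blocks(K, V):
        Sij = Qi @ Kj.T / sqrt(d)
        mij = max(mi, rowmax(Sij))   # updated running max
        Pij = exp(Sij - mij)
        li  = exp(mi - mij) * li + rowsum(Pij)   # rescale old sum to the new max
        Oi  = exp(mi - mij) * Oi + Pij @ Vj      # rescale old output the same way
        mi  = mij
    write Oi / li, (mi, li)          # normalise each row once, then store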
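The loop above can be checked numerically against textbook attention. What follows is a minimal NumPy sketch for a single head, not the CUDA kernel: it runs on the CPU, so nothing here is actually resident in SRAM, and the function names and block sizes are illustrative assumptions. Note that the accumulated output has to be divided by the running sum l before it is written out.

```python
import numpy as np


def reference_attention(Q, K, V):
    """Textbook attention that materialises the full n x n matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V


def tiled_attention(Q, K, V, block_q=16, block_k=16):
    """Block-wise attention with online softmax; returns O and row stats (m, l)."""
    n, d = Q.shape
    O = np.zeros((n, V.shape[1]))
    m_all = np.full(n, -np.inf)
    l_all = np.zeros(n)
    for qs in range(0, n, block_q):
        Qi = Q[qs:qs + block_q]
        mi = np.full(Qi.shape[0], -np.inf)         # running row max
        li = np.zeros(Qi.shape[0])                 # running sum of exp
        Oi = np.zeros((Qi.shape[0], V.shape[1]))   # unnormalised output
        for ks in range(0, K.shape[0], block_k):
            Kj, Vj = K[ks:ks + block_k], V[ks:ks + block_k]
            Sij = Qi @ Kj.T / np.sqrt(d)
            m_new = np.maximum(mi, Sij.max(axis=-1))
            Pij = np.exp(Sij - m_new[:, None])
            scale = np.exp(mi - m_new)             # rescales earlier partial results
            li = scale * li + Pij.sum(axis=-1)
            Oi = scale[:, None] * Oi + Pij @ Vj
            mi = m_new
        O[qs:qs + block_q] = Oi / li[:, None]      # normalise once at the end
        m_all[qs:qs + block_q] = mi
        l_all[qs:qs + block_q] = li
    return O, m_all, l_all


rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
O, m, l = tiled_attention(Q, K, V)
print(np.allclose(O, reference_attention(Q, K, V)))
```

The two results should agree to floating-point tolerance for any block sizes, including ones that do not divide n evenly, because the rescaling by exp(mi - m_new) keeps every partial sum expressed relative to the same running max.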

For the backward pass, the textbook approach stores the full attention probabilities. FlashAttention does not: it recomputes them block by block from Q, K and the saved (m, l) statistics. Recomputation is cheap because the FLOPs were never the bottleneck.
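The recomputation step can be illustrated in isolation. The sketch below, again in NumPy with hypothetical function names, rebuilds one tile of the probability matrix from Q, K and the per-row statistics. For brevity the statistics are derived here from a full score matrix; in the actual algorithm they would come from the forward pass's running values, and the full matrix is built only so the tile has something to be compared with.

```python
import numpy as np


def softmax_stats(Q, K):
    """Row max m and sum-of-exp l, standing in for what the forward pass saves."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    m = S.max(axis=-1)
    l = np.exp(S - m[:, None]).sum(axis=-1)
    return m, l


def recompute_block(Qi, Kj, mi, li):
    """Rebuild one tile of attention probabilities from the saved statistics."""
    Sij = Qi @ Kj.T / np.sqrt(Qi.shape[-1])
    return np.exp(Sij - mi[:, None]) / li[:, None]


rng = np.random.default_rng(0)
Q = rng.standard_normal((64, 32))
K = rng.standard_normal((64, 32))
m, l = softmax_stats(Q, K)

# Full softmax, materialised here only as a reference for the check.
S = Q @ K.T / np.sqrt(Q.shape[-1])
E = np.exp(S - S.max(axis=-1, keepdims=True))
P = E / E.sum(axis=-1, keepdims=True)

rows, cols = slice(16, 32), slice(48, 64)
P_tile = recompute_block(Q[rows], K[cols], m[rows], l[rows])
print(np.allclose(P_tile, P[rows, cols]))
```

Each tile costs one small matmul and one exponential to rebuild, which is the extra compute the backward pass pays in exchange for never reading an n x n matrix from HBM.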

Why it is faster and uses less memory

The counterintuitive bit: FlashAttention does more FLOPs (because of the recomputation in the backward pass) but is 2-4x faster in wall-clock time and needs O(n) extra memory instead of O(n^2). The win comes from doing those extra FLOPs on data that lives in SRAM, where bandwidth is an order of magnitude higher than HBM. You trade abundant compute for scarce bandwidth, which is exactly the trade modern GPUs reward.
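To put numbers on the O(n) versus O(n^2) claim, here is a back-of-the-envelope comparison of what each approach keeps around for the backward pass, per head. The assumptions are mine, not the source's: the standard path stores P in fp16, the tiled path stores the two statistics m and l in fp32, and Q, K, V and O are excluded because both approaches need them. The helper name is hypothetical.

```python
def saved_for_backward_bytes(n, elem_bytes=2, stat_bytes=4):
    """Per-head bytes kept for backward, beyond Q, K, V and O.

    Assumptions: standard attention keeps the n x n matrix P (fp16);
    the tiled version keeps only row max m and row sum l (fp32 each).
    """
    standard = n * n * elem_bytes
    flash = 2 * n * stat_bytes
    return standard, flash


for n in (1024, 8192, 65536):
    standard, flash = saved_for_backward_bytes(n)
    print(n, standard / 2**20, "MiB vs", flash / 2**10, "KiB")
```

Doubling n quadruples the first number and only doubles the second, which is why the gap matters most at long context lengths.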

The 1 -> 2 -> 3 progression

Version	Year	Headline win	What changed
FlashAttention	2022	2-4x over standard PyTorch attention	Tiling, online softmax, recomputation
FlashAttention-2	2023	2x over v1, 50-73% of peak FLOPs/s on A100	Reordered loops, fewer non-matmul ops, better warp partitioning
FlashAttention-3	2024	1.5-2x over v2 on H100, up to 740 TFLOPS fp16	Async WGMMA, TMA-driven copies, FP8 path
