Why FlashAttention Is a Kernel Story

Standard scaled dot-product attention on a sequence of length N requires materialising an N x N matrix: the full attention score grid. On an A100 GPU with 80 GB of HBM running at roughly 2 TB/s, writing and re-reading that matrix for a 2048-token sequence with 64 heads costs on the order of a millisecond of pure memory time per attention layer, for every sequence in the batch, even though the arithmetic per element is trivial. FlashAttention (Dao et al., 2022) reports a 3x wall-clock speedup on GPT-2 style training while still computing exact attention rather than an approximation. The source of that speedup is not new maths. It is a kernel rewrite.

The real bottleneck: memory bandwidth, not FLOPs

Modern GPU compute has outrun its memory system. An A100 delivers around 312 TFLOPS of FP16 throughput but only ~2 TB/s of HBM bandwidth. A simple back-of-envelope: reading a 1 GB tensor from HBM takes ~0.5 ms, while executing 10^12 FLOPs takes ~3 ms. Dividing the two peaks gives a crossover of about 156 FLOPs per byte moved. The ratio of arithmetic operations to bytes moved is called arithmetic intensity. For workloads above the crossover, which are compute-bound, bandwidth barely matters. For workloads well below it, bandwidth is everything.
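The back-of-envelope above can be reproduced in a few lines. This is only a sketch using the two peak figures quoted in the text; the function names are made up for illustration, and real kernels rarely reach either peak.

```python
# Peak figures quoted in the text for an A100 (rounded, not measured).
PEAK_FLOPS = 312e12      # FP16 FLOPs per second
HBM_BANDWIDTH = 2e12     # bytes per second


def memory_time_s(num_bytes):
    """Lower bound on the time to move num_bytes through HBM."""
    return num_bytes / HBM_BANDWIDTH


def compute_time_s(flops):
    """Lower bound on the time to execute flops at peak throughput."""
    return flops / PEAK_FLOPS


def crossover_intensity():
    """FLOPs per byte at which compute time equals memory time."""
    return PEAK_FLOPS / HBM_BANDWIDTH


def is_memory_bound(flops, num_bytes):
    """True when arithmetic intensity falls below the crossover."""
    return flops / num_bytes < crossover_intensity()


print(f"read 1 GB:        {memory_time_s(1e9) * 1e3:.2f} ms")
print(f"run 1e12 FLOPs:   {compute_time_s(1e12) * 1e3:.2f} ms")
print(f"crossover:        {crossover_intensity():.0f} FLOPs/byte")
# Illustrative assumption: an elementwise pass doing ~5 FLOPs per FP16
# value that is read once and written once (4 bytes moved per value).
print("elementwise pass memory-bound?", is_memory_bound(5, 4))
```

An operation that does only a few FLOPs per byte sits far below the crossover, so its run time is set almost entirely by how many bytes it moves.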

Attention has terrible arithmetic intensity in its naive form. The forward pass for one head reads Q and K (each O(Nd)), writes S = QK^T (O(N^2)), reads S again for softmax, writes P (O(N^2)), reads P and V, and writes O (O(Nd)). The N^2 matrices dominate for long sequences. Each one makes a full round-trip through HBM, even though each byte takes part in only a handful of multiplications and additions.

Standard attention IO (per head, sequence length N, head dim d):
  Read:   Q, K              -> 2Nd   floats from HBM
  Write:  S = QK^T          -> N^2   floats to   HBM
  Read:   S  (for softmax)  -> N^2   floats from HBM
  Write:  P = softmax(S)    -> N^2   floats to   HBM
  Read:   P, V              -> N^2 + Nd floats from HBM
  Write:  O = PV            -> Nd    floats to   HBM
  Total reads + writes: ~4N^2 + 4Nd floats

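For N=2048, d=64, that is roughly 17 M floats (about 35 MB in FP16) flowing through HBM per head, almost all of it from the N^2 terms. With 64 heads that is over 2 GB per attention layer, repeated for every layer and every sequence in the batch, so it adds up fast.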
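To make the count concrete, here is a small sketch that tallies HBM traffic step by step for unfused attention. It assumes Q and K are read for the first matmul and V is read only where the second matmul consumes it; the function name is hypothetical.

```python
def standard_attention_io(n, d):
    """Floats moved through HBM per head for unfused attention, by step."""
    return {
        "read Q, K": 2 * n * d,
        "write S": n * n,
        "read S": n * n,
        "write P": n * n,
        "read P, V": n * n + n * d,
        "write O": n * d,
    }


n, d = 2048, 64
steps = standard_attention_io(n, d)
total_floats = sum(steps.values())
total_bytes_fp16 = 2 * total_floats          # 2 bytes per FP16 value
quadratic_share = 4 * n * n / total_floats   # fraction due to the N x N matrices

for name, count in steps.items():
    print(f"{name:10s} {count:>12,d} floats")
print(f"total      {total_floats:>12,d} floats "
      f"({total_bytes_fp16 / 1e6:.1f} MB in FP16)")
print(f"share from N^2 terms: {quadratic_share:.1%}")
```

Doubling N roughly quadruples the total, because the four N^2 passes account for nearly all of the traffic at this sequence length.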

Tiling: fitting the work inside SRAM

The key insight is that softmax and the output accumulation can be computed in tiles without ever materialising the full N x N score matrix in HBM, as long as a numerically stable online softmax update is used.
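A minimal pure-Python sketch of that online softmax update, for a single query row, is shown below. It is not the FlashAttention kernel: it only illustrates that visiting K and V one block at a time, while keeping a running maximum, a running normaliser, and an output accumulator, reproduces the same result as ordinary softmax attention. The function names and the block size are illustrative assumptions.

```python
import math


def attention_reference(q, keys, values):
    """Ordinary softmax attention for one query row; holds all N scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_tiled(q, keys, values, block=2):
    """Same output, visiting K/V in blocks with only O(d) running state."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalised output, relative to running_max
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Small deterministic example: 7 keys, query dim 4, value dim 3.
q = [0.3, -1.2, 0.7, 2.0]
keys = [[math.sin(i + j) for j in range(4)] for i in range(7)]
values = [[math.cos(i * j) for j in range(3)] for i in range(7)]
full = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block=3)
print(max(abs(a - b) for a, b in zip(full, tiled)))  # tiny rounding difference
```

The tiled version never holds more than one block of scores at a time, which is what lets a fused kernel keep the whole computation in on-chip memory.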

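The SRAM of a single A100 streaming multiprocessor (SM) is around 192 KB. A tile of Q with block size Br holds Br * d floats and a tile of K with block size Bc holds Bc * d floats, so together they occupy (Br + Bc) * d floats. For Br = Bc = 64 and d = 64, that is (64 + 64) * 64 * 2 bytes = 16 KB in FP16. Adding a matching V tile, the Br x Bc block of scores, and the Br x d output accumulator brings the working set to about 40 KB if everything is held in FP16, comfortably inside the budget. By contrast, the full score matrix for N = 2048 is 8 MB per head in FP16 and could never fit. In practice FlashAttention uses Br, Bc around 32-64 and fits each tile inside the shared memory budget.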
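The tile budget can be checked with a short calculation. The sketch below assumes that one Br x d tile of Q, one Bc x d tile each of K and V, the Br x Bc block of scores, and a Br x d output accumulator are all resident at once and stored in FP16. That is a simplification: a real kernel may hold some of these in registers or at higher precision. The function name and the 192 KB constant are taken as assumptions from the figure in the text.

```python
SRAM_BYTES_PER_SM = 192 * 1024   # approximate A100 figure quoted in the text


def tile_working_set_bytes(br, bc, d, bytes_per_value=2):
    """Bytes needed to hold one set of tiles on chip, by component."""
    sizes = {
        "Q tile (Br x d)": br * d * bytes_per_value,
        "K tile (Bc x d)": bc * d * bytes_per_value,
        "V tile (Bc x d)": bc * d * bytes_per_value,
        "S tile (Br x Bc)": br * bc * bytes_per_value,
        "O tile (Br x d)": br * d * bytes_per_value,
    }
    sizes["total"] = sum(sizes.values())
    return sizes


for br, bc in [(32, 32), (64, 64), (128, 128)]:
    total = tile_working_set_bytes(br, bc, d=64)["total"]
    fits = total <= SRAM_BYTES_PER_SM
    print(f"Br={br:3d} Bc={bc:3d}: {total / 1024:6.1f} KB  fits={fits}")

# For comparison: the full N x N score matrix for N = 2048 in FP16.
full_scores_bytes = 2048 * 2048 * 2
print(f"full S matrix: {full_scores_bytes / 1024 / 1024:.0f} MB")
```

The working set depends on the block sizes and d but not on N, which is why the same tiles fit on chip no matter how long the sequence is.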
