Why FlashAttention Is a Kernel Story

FlashAttention achieves its speedups not by reducing FLOPs but by restructuring the attention computation into a single tiled CUDA kernel that keeps its working set in on-chip SRAM, eliminating the dominant cost of round-tripping the N x N intermediate through GPU HBM.

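Standard scaled dot-product attention on a sequence of length N materialises an N x N matrix: the full grid of attention scores. For a 2048-token sequence with 64 heads, that is about 0.5 GB of FP16 scores per layer. On an A100 GPU with 80 GB of HBM running at roughly 2 TB/s, each write or read pass over that matrix takes about a quarter of a millisecond, and an unfused implementation makes several such passes per layer (store the scores, reload them for the softmax, store the probabilities, reload them to weight the values), even though the arithmetic performed on each element is trivial. Multiplied across layers and batch elements, this reaches tens of milliseconds of pure memory time. FlashAttention (Dao et al., 2022) reports a 3x wall-clock speedup on GPT-2 style training while computing exact attention: its outputs match standard attention up to floating-point rounding. The source of that speedup is not new maths. It is a kernel rewrite.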
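To get a feel for the sizes involved, the following minimal sketch estimates how large the materialised score matrix is and how long one pass over it takes. The function names are illustrative, and the defaults are assumptions rather than measurements: FP16 storage (2 bytes per element), a single sequence in a single layer, and a flat 2 TB/s of achievable HBM bandwidth.

```python
def score_matrix_bytes(seq_len: int, num_heads: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one seq_len x seq_len score matrix per head.

    bytes_per_element defaults to 2, assuming FP16 storage.
    """
    return seq_len * seq_len * num_heads * bytes_per_element


def hbm_pass_ms(num_bytes: int, bandwidth_bytes_per_s: float = 2e12) -> float:
    """Milliseconds for one full read OR write pass over num_bytes.

    Assumes the idealised peak bandwidth is reached; real kernels get less.
    """
    return num_bytes / bandwidth_bytes_per_s * 1e3


size = score_matrix_bytes(seq_len=2048, num_heads=64)
print(size / 2**30)                 # 0.5 (GiB)
print(f"{hbm_pass_ms(size):.2f}")   # 0.27 (ms per pass)
```

Because the size grows with the square of the sequence length, doubling N to 4096 quadruples both numbers under the same assumptions.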

The real bottleneck: memory bandwidth, not FLOPs

Modern GPU compute has outrun its memory system. An A100 delivers around 312 TFLOPS of FP16 throughput but only ~2 TB/s of HBM bandwidth, so it can perform roughly 156 floating-point operations in the time it takes to move one byte. A simple back-of-envelope: reading a 1 GB tensor takes ~0.5 ms; executing 10^12 FLOPs on it takes ~3 ms. The ratio of arithmetic operations to bytes moved is called arithmetic intensity. For workloads where it is high, the kernel is compute-bound and bandwidth barely matters. For workloads where it is low, bandwidth is everything.
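The back-of-envelope reasoning above can be written as a small roofline-style check. This is a sketch under stated assumptions, not a profiler: the two hardware constants are the approximate A100 figures quoted in the text, the helper names are hypothetical, and the model assumes compute and memory transfers overlap perfectly.

```python
PEAK_FLOPS = 312e12      # approximate A100 FP16 throughput, FLOPs per second
HBM_BANDWIDTH = 2e12     # approximate A100 HBM bandwidth, bytes per second


def ridge_point(peak_flops: float = PEAK_FLOPS, bandwidth: float = HBM_BANDWIDTH) -> float:
    """Arithmetic intensity (FLOPs per byte) at which compute and memory time are equal."""
    return peak_flops / bandwidth


def estimate(flops: float, bytes_moved: float) -> dict:
    """Idealised compute time, memory time, and which of the two is the bound."""
    compute_ms = flops / PEAK_FLOPS * 1e3
    memory_ms = bytes_moved / HBM_BANDWIDTH * 1e3
    intensity = flops / bytes_moved
    bound = "compute-bound" if intensity >= ridge_point() else "memory-bound"
    return {"compute_ms": compute_ms, "memory_ms": memory_ms,
            "intensity": intensity, "bound": bound}


# The example from the text: 10^12 FLOPs on a 1 GB tensor.
heavy = estimate(flops=1e12, bytes_moved=1e9)
print(heavy["bound"])    # compute-bound

# An elementwise pass: about one FLOP per FP16 element read and written back.
light = estimate(flops=0.5e9, bytes_moved=2e9)
print(light["bound"])    # memory-bound
```

The second case is the regime the softmax over a materialised score matrix falls into: very little arithmetic per byte, so its runtime is set almost entirely by how many bytes cross the HBM boundary.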
