The Memory Wall and Arithmetic Intensity

A 100-billion-parameter model running on an A100 GPU is doing, in aggregate, an astronomical number of floating-point multiplications per second. Yet token generation is often limited not by those multiplications but by how fast the GPU can shuttle numbers in from DRAM. The chip is fast; the wires feeding it are the bottleneck. This gap between compute throughput and memory bandwidth is called the memory wall, and understanding it precisely changes how you think about every optimisation decision from quantisation to batching to operator fusion.

What the Roofline Model Actually Says

The roofline model is a two-line performance ceiling drawn on a log-log plot of achieved performance (FLOP/s) against arithmetic intensity (FLOP/byte). Arithmetic intensity for a kernel is:

I = (total floating-point operations) / (total bytes read from + written to memory)

Every chip has two hard limits:

Peak compute throughput P (FLOP/s), set by transistor count and clock speed.
Peak memory bandwidth B (bytes/s), set by the memory bus width and frequency.

The ridge point of the roofline is the intensity threshold I* = P / B. Below I*, the kernel is memory-bound: performance scales with bandwidth, not compute. Above I*, the kernel is compute-bound: performance scales with FLOP/s, not bandwidth.

For an NVIDIA V100, P ≈ 125 TFLOP/s (FP16 with Tensor Cores) and B ≈ 900 GB/s, giving I* ≈ 139 FLOP/byte. On the A100, P ≈ 312 TFLOP/s and B ≈ 2,000 GB/s (HBM2e), giving I* ≈ 156 FLOP/byte.
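The two roofs and the ridge point can be sketched in a few lines of Python. This is a minimal illustration rather than a profiling tool: the function names and the GPUS dictionary are made up for this example, and the peak figures are simply the ones quoted above.

```python
def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline ceiling: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * bandwidth)


# Peak figures quoted in the text (FP16 Tensor Core throughput, HBM bandwidth).
# The dictionary layout is an illustrative assumption.
GPUS = {
    "V100": {"peak_flops": 125e12, "bandwidth": 900e9},
    "A100": {"peak_flops": 312e12, "bandwidth": 2000e9},
}

if __name__ == "__main__":
    for name, spec in GPUS.items():
        ridge = ridge_point(**spec)
        print(f"{name}: ridge point = {ridge:.0f} FLOP/byte")
        for intensity in (1, 10, 100, 1000):
            ceiling = attainable_flops(intensity, **spec)
            print(f"  I = {intensity:>4} FLOP/byte -> at most {ceiling / 1e12:.1f} TFLOP/s")
```

Below the ridge point the ceiling grows linearly with intensity; above it the ceiling is flat at P, which is the shape that gives the model its name.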

The practical implication: a kernel must perform roughly 140-160 floating-point operations per byte it touches just to keep the compute units busy. Most neural network operations do nowhere near that.

The Intensity of Real Operations

Here is where the gap becomes concrete. NVIDIA's deep-learning performance guide (see Further Reading) gives measured arithmetic intensities for common operations on a V100:

Operation	Arithmetic Intensity (FLOP/byte)	Regime
ReLU activation	~0.25	Deeply memory-bound
Layer normalisation	~5	Memory-bound
Softmax (seq-len 512)	~4	Memory-bound
GEMM, batch=1, hidden=4096	~1	Memory-bound
GEMM, batch=512, hidden=4096	~315	Compute-bound
Large square matmul (8192^3)	~2730	Strongly compute-bound
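Several of these figures can be approximated from first principles by counting FLOPs and bytes. The sketch below assumes FP16 storage and an idealised kernel that reads each operand once and writes the output once. The helper names are hypothetical, and the linear-layer shape (1024 inputs, 4096 outputs) is an assumption chosen because it reproduces the ~315 figure for batch=512.

```python
BYTES_FP16 = 2  # bytes per element in half precision


def gemm_intensity(m: int, k: int, n: int, bytes_per_elem: int = BYTES_FP16) -> float:
    """FLOP/byte of an (m x k) @ (k x n) matmul.

    Assumes each input matrix is read once and the output is written once,
    i.e. perfect on-chip reuse. Real kernels move at least this many bytes.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_intensity(flops_per_elem: int = 1, bytes_per_elem: int = BYTES_FP16) -> float:
    """FLOP/byte of an elementwise op that reads one value and writes one value."""
    return flops_per_elem / (2 * bytes_per_elem)


if __name__ == "__main__":
    print(f"ReLU (FP16):                {elementwise_intensity():.2f}")
    print(f"Linear 1024->4096, batch 1:   {gemm_intensity(1, 1024, 4096):.2f}")
    print(f"Linear 1024->4096, batch 512: {gemm_intensity(512, 1024, 4096):.0f}")
    print(f"Square matmul, N = 8192:      {gemm_intensity(8192, 8192, 8192):.0f}")
```

For a square N x N x N matmul the expression simplifies to N/3 FLOP/byte in FP16, which is why large matrix multiplications sit so far to the right of the ridge point.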

The batch=1 GEMM row is the one that matters for autoregressive inference. When generating tokens one at a time, the weight matrix (e.g., a 4096 x 4096 projection) is loaded from HBM for each forward pass, but only a single vector (the current hidden state) is multiplied against it. The ratio of arithmetic to bytes is tiny. The GPU is mostly waiting for data.

A useful back-of-envelope: for a decoder-only transformer with hidden size H, an attention projection of shape (H, H) holds H^2 parameters. In FP16 that is 2H^2 bytes. The GEMM for a single token does 2H^2 floating-point operations (one multiply and one add per parameter). Ignoring the comparatively tiny input and output vectors, intensity = 2H^2 / 2H^2 = 1 FLOP/byte, which is roughly 140x below the V100 ridge point. Under the roofline model, the chip can therefore reach at most about 1/140th of its peak FLOP/s during single-token generation.
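The same arithmetic extends to batches larger than one, which shows how many tokens must share a single weight load before the projection stops being memory-bound. This is a rough sketch under the same idealised assumptions (FP16, weights read once, activations read and written once). The function names are hypothetical, and the default peak figures are the V100 numbers quoted earlier.

```python
def decode_intensity(hidden: int, batch: int = 1, bytes_per_elem: int = 2) -> float:
    """FLOP/byte of multiplying `batch` token vectors by one (hidden x hidden) weight."""
    flops = 2 * batch * hidden * hidden
    # Weights are read once; input and output activations are each batch * hidden elements.
    bytes_moved = bytes_per_elem * (hidden * hidden + 2 * batch * hidden)
    return flops / bytes_moved


def fraction_of_peak(intensity: float, peak_flops: float = 125e12,
                     bandwidth: float = 900e9) -> float:
    """Upper bound on achieved / peak FLOP/s according to the roofline model."""
    return min(1.0, intensity * bandwidth / peak_flops)


def min_batch_for_compute_bound(hidden: int, peak_flops: float = 125e12,
                                bandwidth: float = 900e9,
                                bytes_per_elem: int = 2) -> int:
    """Smallest batch whose intensity reaches the ridge point P / B."""
    ridge = peak_flops / bandwidth
    # Intensity approaches hidden / bytes_per_elem as batch grows; guard against
    # a ridge point that can never be reached, which would loop forever.
    if ridge >= hidden / bytes_per_elem:
        raise ValueError("ridge point is unreachable for this hidden size")
    batch = 1
    while decode_intensity(hidden, batch, bytes_per_elem) < ridge:
        batch += 1
    return batch


if __name__ == "__main__":
    h = 4096
    i1 = decode_intensity(h, batch=1)
    print(f"batch 1: {i1:.3f} FLOP/byte, at most {fraction_of_peak(i1):.2%} of peak")
    print(f"batch needed to reach the ridge point: {min_batch_for_compute_bound(h)}")
```

Because intensity is roughly equal to the batch size while the batch is much smaller than H, the break-even batch lands slightly above the ridge-point value itself. The estimate ignores KV-cache traffic and attention over the context, so it is only a guide to the weight-loading cost.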
