Compute-Bound vs Memory-Bound Kernels

A 100-TFLOP/s GPU can spend 90% of its time waiting for data. Add more matrix multiply units and nothing changes. The hardware is already idle. This is not a GPU problem; it is a kernel classification problem, and getting it wrong sends engineers chasing the wrong bottleneck.

Every kernel falls on a spectrum between two extremes: compute-bound (the ALUs are saturated, memory can keep up) and memory-bound (the ALUs are idle, waiting for bytes to arrive). The distinction is not academic. It dictates which optimisation strategies are even physically capable of helping.

Arithmetic Intensity and the Roofline Model

The key quantity is arithmetic intensity (AI), the ratio of floating-point operations to bytes of memory traffic:

AI = FLOPs executed / bytes moved to and from DRAM
     (units: FLOP/byte)

GPUs have two hard ceilings. Let peak compute be P (FLOP/s) and peak memory bandwidth be B (byte/s). The maximum throughput a kernel can achieve is:

Attainable performance = min(P, AI × B)

This is the Roofline model. Plotting attainable performance against AI gives a shape with two regions:

Left of the ridge point (AI < P/B): the kernel is memory-bound. Doubling compute units does nothing; you need more bandwidth or less data movement.
Right of the ridge point (AI > P/B): the kernel is compute-bound. Reducing memory pressure yields little; you need faster ALUs or more parallelism.

On an NVIDIA A100 SXM, peak FP16 tensor-core throughput is roughly 312 TFLOP/s and HBM2e bandwidth is roughly 2 TB/s, giving a ridge point near 312 / 2 = 156 FLOP/byte. A kernel must perform about 156 floating-point operations for every byte it moves to or from DRAM before it becomes compute-limited.
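The Roofline bound and the ridge point can be sketched in a few lines of Python. This is a minimal illustration, not a profiling tool: the function names are hypothetical, and the two constants are the approximate A100 figures quoted above rather than measured values.

```python
def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) at which the two ceilings meet: P / B."""
    return peak_flops / bandwidth


def attainable_flops(ai, peak_flops, bandwidth):
    """Roofline bound in FLOP/s: min(P, AI * B)."""
    return min(peak_flops, ai * bandwidth)


def regime(ai, peak_flops, bandwidth):
    """Classify a kernel by which side of the ridge point its AI falls on."""
    if ai < ridge_point(peak_flops, bandwidth):
        return "memory-bound"
    return "compute-bound"


# Approximate A100 SXM figures from the text (assumed, not measured).
A100_PEAK_FP16 = 312e12  # FLOP/s, FP16 tensor cores
A100_BANDWIDTH = 2e12    # byte/s, HBM2e

print(f"ridge point: {ridge_point(A100_PEAK_FP16, A100_BANDWIDTH):.0f} FLOP/byte")
for ai in (0.5, 5, 60, 156, 500):
    tflops = attainable_flops(ai, A100_PEAK_FP16, A100_BANDWIDTH) / 1e12
    label = regime(ai, A100_PEAK_FP16, A100_BANDWIDTH)
    print(f"AI = {ai:>5} FLOP/byte -> at most {tflops:6.1f} TFLOP/s ({label})")
```

Below the ridge point the bound scales with AI, so halving memory traffic doubles the ceiling; above it the bound is flat at P, and only more compute raises it.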

Operation	Typical AI (FP16)	Regime
Layer normalisation	~1-3	Memory-bound
Elementwise activation (ReLU, GELU)	~0.5-1	Memory-bound
Attention softmax (naive)	~2-5	Memory-bound
Large matrix multiply (M,N,K all 4096+)	200-1000+	Compute-bound
FlashAttention (tiled)	40-80	Transitional

Why Matrix Multiplication Is Compute-Bound

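A GEMM computing C = A × B with dimensions (M, K) × (K, N) does 2MKN FLOPs while moving MK + KN + MN elements (A and B read, C written), assuming each matrix crosses DRAM only once. For large square matrices of size N:

AI ≈ 2N³ / 3N² = (2/3)N FLOP per element

Dividing by the element size converts this to FLOP/byte, and AI still grows linearly with N. At N = 512 in FP16 (2 bytes per element), AI ≈ 341 / 2 ≈ 171 FLOP/byte, just above the A100 ridge point of ~156; at N = 4096 it is ≈ 1365 FLOP/byte, far above it. Once past the ridge, the tensor cores stay busy and memory is not the constraint. This is why increasing matrix size almost always improves GPU utilisation for GEMMs.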
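The GEMM intensity can be checked numerically, counting bytes rather than elements. The sketch below assumes the idealised traffic model used here (each of A, B and C crosses DRAM exactly once); real kernels move more data when tiles do not fit on chip. The function name is hypothetical.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOP/byte for C = A x B, with A of shape (m, k) and B of shape (k, n).

    Assumes each matrix is transferred to or from DRAM exactly once.
    """
    flops = 2 * m * n * k                                   # one multiply + one add per term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_element
    return flops / bytes_moved


for size in (128, 512, 4096):
    fp16 = gemm_arithmetic_intensity(size, size, size, bytes_per_element=2)
    fp32 = gemm_arithmetic_intensity(size, size, size, bytes_per_element=4)
    print(f"N = {size:>4}: FP16 {fp16:7.1f} FLOP/byte, FP32 {fp32:7.1f} FLOP/byte")
```

For square matrices the FP16 result reduces to N/3 FLOP/byte, so N = 512 lands only slightly above a 156 FLOP/byte ridge point, while N = 4096 clears it by a wide margin.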

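Contrast this with an elementwise addition C = A + B: two reads, one write, and one FLOP per element. In FP32 that is 12 bytes of traffic per element, so AI = 1/12 ≈ 0.08 FLOP/byte. That is more than 100 times below the A100's FP32 ridge point of roughly 10 FLOP/byte (19.5 TFLOP/s non-tensor peak over ~2 TB/s). More FP32 units would be wasted.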
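The same bookkeeping works for any elementwise kernel. The sketch below counts one addition per output element and assumes no data reuse, so every operand comes from DRAM. The helper name is hypothetical, and the 19.5 TFLOP/s FP32 peak is taken from NVIDIA's published A100 specifications, not from measurements.

```python
def elementwise_intensity(flops_per_element, arrays_read, arrays_written, bytes_per_element):
    """FLOP/byte for an elementwise kernel with no data reuse."""
    bytes_per_output = (arrays_read + arrays_written) * bytes_per_element
    return flops_per_element / bytes_per_output


BANDWIDTH = 2e12     # byte/s, approximate A100 HBM2e bandwidth from the text
FP32_PEAK = 19.5e12  # FLOP/s, A100 non-tensor FP32 peak (assumed from the datasheet)

# C = A + B in FP32: one add, two arrays read, one written, 4 bytes per element.
add_ai = elementwise_intensity(1, arrays_read=2, arrays_written=1, bytes_per_element=4)
bandwidth_bound = add_ai * BANDWIDTH  # memory-side ceiling of the roofline

print(f"AI = {add_ai:.3f} FLOP/byte")
print(f"bandwidth-limited ceiling = {bandwidth_bound / 1e12:.2f} TFLOP/s")
print(f"fraction of FP32 peak = {bandwidth_bound / FP32_PEAK:.2%}")
```

Even at full memory bandwidth the addition cannot use more than about 1% of the FP32 units, which is why fusing elementwise operations into neighbouring kernels helps more than adding compute.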
