The Roofline Model

The Roofline Model bounds attainable hardware performance using two ceilings - peak compute throughput and peak memory bandwidth - letting you diagnose whether a kernel wastes silicon or waits for data.

intermediate · 8 min read

A single A100 GPU delivers a peak of 312 TFLOPS of FP16 throughput, yet real transformer workloads routinely sustain only 30-60 TFLOPS. The gap is not a bug; it is a consequence of how much arithmetic each kernel does per byte it moves from memory. The Roofline Model makes that gap visible and actionable.

The two-ceiling intuition

Every kernel has an arithmetic intensity (AI): the number of floating-point operations it performs divided by the number of bytes it reads from and writes to memory.

Arithmetic Intensity (AI) = FLOPs executed / Bytes transferred (DRAM)
                           [ops/byte]

The hardware offers two hard ceilings:

  1. Peak compute (Pi, in FLOPS/s): the maximum throughput of the execution units.
  2. Peak bandwidth (beta, in bytes/s): the maximum rate at which data can flow from DRAM to the chip.

The attainable performance of a kernel cannot exceed either ceiling, so the lower of the two applies:

Attainable FLOPS/s = min( Pi,  beta * AI )

Plotting attainable performance against AI on a log-log graph produces the characteristic "roofline" shape: a diagonal ramp (bandwidth-bound region) that bends flat at the compute ceiling. The ridge point - the AI at which the ramp meets the flat - separates the two regimes.

Attainable FLOPS/s (log)
  ^
  |             Peak compute (Pi)
  |             ───────────────────────────────
  |            /:
  |           / :
  |          /  :   ramp slope = beta (bandwidth)
  |         /   :
  +--------/----:------------------------------> AI (log)
                ridge: AI = Pi / beta

For an NVIDIA A100 SXM with FP16 Tensor Cores: Pi ~= 312 TFLOPS, HBM2e bandwidth ~= 2 TB/s, giving a ridge at 312e12 / 2e12 = 156 ops/byte. Any kernel below 156 ops/byte is bandwidth-bound regardless of how many CUDA cores are idle.
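
The ridge-point arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function and constant names are made up here, and the hardware figures are the approximate A100 numbers quoted above rather than measurements.

```python
# Approximate A100 SXM figures quoted in the text (assumed, not measured).
PEAK_FLOPS = 312e12   # Pi: FP16 Tensor Core peak, FLOP/s
PEAK_BW = 2e12        # beta: HBM2e bandwidth, bytes/s


def attainable_flops(ai, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Roofline bound min(Pi, beta * AI) for an AI given in ops/byte."""
    return min(peak_flops, peak_bw * ai)


def ridge_point(peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """AI at which the bandwidth ramp meets the compute ceiling."""
    return peak_flops / peak_bw


def regime(ai, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Name the ceiling that limits a kernel with the given AI."""
    if ai < ridge_point(peak_flops, peak_bw):
        return "bandwidth-bound"
    return "compute-bound"


print("ridge:", ridge_point(), "ops/byte")
for ai in (1, 50, 156, 400):
    print(ai, regime(ai), attainable_flops(ai) / 1e12, "TFLOP/s")
```

Swapping in the peak figures for a different accelerator moves the ridge, which is why the same kernel can be bandwidth-bound on one device and compute-bound on another.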

Placing common neural network kernels on the roof

Kernel                                 Typical AI (ops/byte)   Regime on A100
Element-wise activation (ReLU, GELU)   0.25 - 1                Deep bandwidth-bound
Layer normalisation                    1 - 5                   Bandwidth-bound
Attention softmax (small batch)        2 - 10                  Bandwidth-bound
GEMM, batch=1 (inference)              5 - 30                  Bandwidth-bound
GEMM, large batch (training)           100 - 500               Compute-bound
Fused Flash Attention (large seq)      50 - 150                Near ridge

The table explains why batching matters so much for GPU efficiency: scaling batch size raises AI for GEMMs, pushing the kernel from the diagonal ramp toward the flat roof where compute units are actually busy.
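
To see why more rows raise AI, here is a rough sketch of an idealised GEMM byte count. It assumes each matrix crosses DRAM exactly once and uses FP16 elements; real kernels re-read tiles, so measured AI will differ. The function name and the 4096 x 4096 weight shape are hypothetical choices for illustration.

```python
def gemm_ai(m, n, k, bytes_per_element=2):
    """Idealised AI of C[m,n] = A[m,k] @ B[k,n].

    Assumes A and B are each read once and C is written once.
    """
    flops = 2 * m * n * k                                      # multiply + add per MAC
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved


# m is the number of activation rows (batch size times tokens per sequence).
for m in (8, 64, 512):
    print(f"m={m:4d}  AI ~ {gemm_ai(m, 4096, 4096):6.1f} ops/byte")
```

Under these assumptions, 8 rows give roughly 8 ops/byte and 512 rows roughly 410 ops/byte: the weight matrix dominates the byte count at small m, so each extra row adds FLOPs almost for free.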

Using the model to guide optimisation

Step 1 - measure, do not guess. Profile with Nsight Compute. It reports both FLOPs executed and DRAM bytes transferred, so you can compute the empirical AI. Place the measured point on the roofline.

Step 2 - identify the ceiling. If the point is far left of the ridge, bandwidth is the bottleneck. Possible fixes:

  • Fuse consecutive memory-bound kernels (e.g., bias add + activation + dropout into one pass) so intermediate tensors never leave L2 or registers.
  • Use quantisation to shrink weight footprint (INT8 halves bytes moved relative to FP16, so AI doubles if the FLOP count is unchanged).
  • Tile computations to exploit the L2 or shared-memory bandwidth ceiling, which is 5-20x higher than DRAM bandwidth.
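
The effect of fusion and narrower data types on AI can be estimated with a back-of-the-envelope sketch like the one below. It works per element and assumes one FLOP per operation, which understates activations such as GELU; the function name is illustrative only.

```python
def chain_ai(num_ops, fused, flops_per_op=1, bytes_per_element=2):
    """Per-element AI of a chain of element-wise operations.

    Unfused: every op reads its input from DRAM and writes its output back.
    Fused: one read at the start and one write at the end.
    """
    flops = num_ops * flops_per_op
    passes = 1 if fused else num_ops
    bytes_moved = passes * 2 * bytes_per_element   # one read + one write per pass
    return flops / bytes_moved


unfused = chain_ai(3, fused=False)   # bias add, activation, dropout as three kernels
fused = chain_ai(3, fused=True)      # the same three ops in a single pass
fused_int8 = chain_ai(3, fused=True, bytes_per_element=1)
print(unfused, fused, fused_int8)
```

With three FP16 operations the estimate moves from 0.25 to 0.75 ops/byte when fused, and halving the element size doubles it again. The kernel remains far left of the ridge, but its bandwidth-bound runtime shrinks in proportion.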

If the point sits at or to the right of the ridge but below Pi, you are compute-bound. Options:

  • Use lower-precision formats (FP16 or BF16 feeds Tensor Cores; FP32 does not on most hardware).
  • Increase occupancy to hide instruction latency.
  • Check for instruction bottlenecks (special functions like exp, sqrt have lower throughput than multiply-accumulate).

Step 3 - add the hierarchy. A single roofline using DRAM bandwidth tells only part of the story. Real hardware has L2 cache, shared memory (SMEM), and registers, each with a distinct bandwidth ceiling. If your data fits in L2 (40 MB on A100), the effective bandwidth available to the kernel is 5-10x higher, shifting the ridge point left and potentially flipping a "bandwidth-bound" diagnosis into "compute-bound". This is the hierarchical roofline extension: draw one roofline per memory level and place the kernel on whichever hierarchy level actually supplies its data.
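
A small sketch shows how the diagnosis can flip between memory levels. The DRAM bandwidth is the figure quoted earlier; the L2 bandwidth is an assumed 5x multiple (the low end of the range above), not a measured value, and the names are hypothetical.

```python
PEAK_FLOPS = 312e12   # FP16 Tensor Core peak quoted in the text

# Bandwidth per level in bytes/s. The L2 entry is an ASSUMED 5x multiple
# of DRAM bandwidth, chosen only to illustrate the idea.
LEVEL_BW = {
    "DRAM": 2e12,
    "L2": 5 * 2e12,
}


def diagnose(ai, level):
    """Return (ridge, attainable FLOP/s, regime) for the level supplying the data."""
    bw = LEVEL_BW[level]
    ridge = PEAK_FLOPS / bw
    bound = min(PEAK_FLOPS, bw * ai)
    regime = "bandwidth-bound" if ai < ridge else "compute-bound"
    return ridge, bound, regime


for level in LEVEL_BW:
    ridge, bound, regime = diagnose(50, level)
    print(f"{level:5s} ridge={ridge:6.1f}  bound={bound / 1e12:6.1f} TFLOP/s  {regime}")
```

A kernel at 50 ops/byte is bandwidth-bound against the DRAM roof but compute-bound against the assumed L2 roof. In a full hierarchical analysis the AI itself is recomputed per level from the bytes moved at that level; the sketch holds it fixed for simplicity.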

Reading the model correctly

A short example. Suppose a layer-norm kernel on an A100 measures:

  • FLOPs executed: 50 MFLOP
  • DRAM bytes: 20 MB
  • Elapsed time: 0.1 ms

Empirical FLOPS/s = 50e6 / 0.1e-3 = 500 GFLOPS/s
AI = 50e6 / 20e6 = 2.5 ops/byte

Bandwidth ceiling at AI=2.5: 2e12 bytes/s * 2.5 = 5 TFLOPS/s
The kernel is far to the left of the ridge (156 ops/byte); the relevant ceiling is 5 TFLOPS/s. Attaining 0.5 TFLOPS/s means the kernel reaches only 10% of even the bandwidth roof. The correct next question is: is the bandwidth itself being wasted? Perhaps uncoalesced memory access or a serialised reduction is leaving DRAM bandwidth on the table.
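
The same reading can be automated once a profiler has given you AI and achieved throughput. The helper below is a hypothetical sketch that takes those two numbers directly and uses the approximate A100 peaks from earlier in the article.

```python
PEAK_FLOPS = 312e12   # FP16 Tensor Core peak, FLOP/s
PEAK_BW = 2e12        # DRAM bandwidth, bytes/s


def roofline_report(ai, achieved_flops, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Compare a measured point against the roof above it."""
    roof = min(peak_flops, peak_bw * ai)
    sustained_bw = achieved_flops / ai            # bytes/s the kernel actually moved
    return {
        "roof_flops": roof,
        "fraction_of_roof": achieved_flops / roof,
        "sustained_bw": sustained_bw,
        "fraction_of_peak_bw": sustained_bw / peak_bw,
    }


report = roofline_report(ai=2.5, achieved_flops=0.5e12)
for key, value in report.items():
    print(f"{key:20s} {value:.3g}")
```

For the layer-norm point this reports a 5 TFLOPS/s roof, 10% of that roof attained, and about 200 GB/s of sustained DRAM traffic, which is the number to chase with coalescing or reduction fixes.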

When it falls down

The model ignores latency. For very small tensors (single-token decoding with batch=1), neither bandwidth nor compute is the bottleneck; kernel launch overhead and memory-access latency dominate. The roofline predicts higher attainable performance than you can actually reach.
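
One way to see the size of this effect is to add a fixed per-launch cost to the roofline time. The 5 microsecond overhead below is an assumed placeholder, not a measured figure, and the kernel sizes are invented for illustration.

```python
PEAK_FLOPS = 312e12        # FP16 Tensor Core peak, FLOP/s
PEAK_BW = 2e12             # DRAM bandwidth, bytes/s
LAUNCH_OVERHEAD_S = 5e-6   # ASSUMED fixed cost per kernel launch; measure your own


def roofline_time(flops, bytes_moved):
    """Lower bound on runtime from the two ceilings alone."""
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)


def time_with_overhead(flops, bytes_moved, overhead=LAUNCH_OVERHEAD_S):
    """Roofline time plus a fixed launch cost."""
    return overhead + roofline_time(flops, bytes_moved)


kernels = {
    "tiny": dict(flops=1e5, bytes_moved=1e5),     # hypothetical small decode op
    "large": dict(flops=1e12, bytes_moved=1e10),  # hypothetical big training GEMM
}
for name, k in kernels.items():
    ratio = time_with_overhead(**k) / roofline_time(**k)
    print(f"{name:6s} runtime is {ratio:.3f}x the roofline estimate")
```

Under these assumptions the tiny kernel runs about two orders of magnitude slower than the roofline predicts, while the large one is essentially unaffected.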

Tensor Core utilisation is not automatic. The compute ceiling for Tensor Cores applies only when the GEMM dimensions are multiples of 16 (or 8 for some formats) and when the data type is correct. An FP16 GEMM with M=7 will not use Tensor Cores efficiently; the effective compute ceiling is much lower, making a kernel appear compute-bound when in reality it is under-utilising the hardware.

Mixed-precision complicates byte counting. If weights are stored in INT4 but accumulated in FP32, the "bytes transferred" depends on which format you count. Inconsistent accounting produces misleading AI values; always be explicit about which buffers you include.
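
The sketch below shows how much the accounting choice matters for a single matrix-vector product. The function name, the 4096 x 4096 shape, and the FP16 activation size are assumptions for illustration.

```python
def gemv_ai(n, k, weight_bytes, act_bytes=2.0, count_activations=True):
    """AI of y[n] = W[n,k] @ x[k] under an explicit byte-counting convention.

    weight_bytes is the size of one stored weight (0.5 for INT4, 2.0 for FP16).
    """
    flops = 2 * n * k
    total_bytes = n * k * weight_bytes
    if count_activations:
        total_bytes += (k + n) * act_bytes   # read x, write y
    return flops / total_bytes


n = k = 4096
print("INT4 weights, weights only :", gemv_ai(n, k, 0.5, count_activations=False))
print("FP16 weights, weights only :", gemv_ai(n, k, 2.0, count_activations=False))
print("INT4 weights + activations :", round(gemv_ai(n, k, 0.5), 3))
```

Counting the weights at their stored INT4 size gives 4 ops/byte, while counting them as FP16 gives 1 ops/byte for the same kernel, so a reported AI is only meaningful alongside the convention that produced it.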

The model is per-kernel, not end-to-end. A pipeline of individually efficient kernels can still be slow because of synchronisation barriers, PCIe transfers between CPU and GPU, or pipeline bubbles in a multi-GPU setup. Roofline analysis diagnoses single kernels; system-level bottlenecks require a different lens.

Cache effects shift the roof. If your working set fits in L2, using the DRAM roofline is pessimistic; you will observe sustained performance above the DRAM roof, which looks like "exceeding the model" but is simply the wrong roofline being applied.
