Roofline-Guided Kernel Optimisation

The A100 SXM4 delivers up to 312 TFLOP/s of dense FP16 tensor-core throughput and roughly 2 TB/s of HBM2e bandwidth. The ratio of the two is 156 FLOP/byte. Any kernel that performs fewer than 156 FLOPs per byte of DRAM traffic will exhaust bandwidth before it exhausts compute, no matter how clever the thread scheduling. That single arithmetic fact is the seed of the roofline model, and it makes most ad-hoc optimisation intuition unnecessary.

The Roofline Model, Precisely

Plot arithmetic intensity (AI) on the x-axis: AI = total floating-point operations / total bytes transferred to/from DRAM, measured in FLOP/byte. Plot achieved performance on the y-axis in FLOP/s. Two ceilings dominate:

Memory-bandwidth roof: performance = AI * BW_peak. Below the ridge point, performance scales linearly with AI.
Compute roof: performance = FLOP_peak. Above the ridge point, extra bandwidth or reduced memory traffic does not help; only doing fewer operations, or executing them closer to the peak rate, does.

The ridge point is at AI_ridge = FLOP_peak / BW_peak. For the A100 example above, AI_ridge = 156 FLOP/byte. A fused attention kernel (FlashAttention) achieves AI well above that ridge; a naive layer-normalisation kernel that reads and writes the same buffer twice sits far to the left.
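
The two ceilings and the ridge point reduce to a few lines of arithmetic. The following is a minimal sketch in Python using the A100 figures quoted earlier. The function names (ridge_point, attainable_flops) and the FLOP and byte counts of the elementwise kernel are illustrative assumptions, not measurements.

```python
# Minimal roofline arithmetic. Peak figures are the A100 numbers quoted in the text.
PEAK_FLOPS = 312e12   # FP16 tensor-core peak, FLOP/s
PEAK_BW = 2e12        # HBM2e bandwidth, byte/s


def ridge_point(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / peak_bw


def attainable_flops(ai: float, peak_flops: float = PEAK_FLOPS,
                     peak_bw: float = PEAK_BW) -> float:
    """Upper bound on performance for a kernel with arithmetic intensity `ai`."""
    return min(peak_flops, ai * peak_bw)


# Hypothetical elementwise kernel (assumed numbers): about 8 FLOPs per FP16
# element, with one 2-byte read and one 2-byte write per element.
elementwise_ai = 8 / 4  # 2 FLOP/byte

print(ridge_point(PEAK_FLOPS, PEAK_BW))          # 156.0
print(attainable_flops(elementwise_ai) / 1e12)   # 4.0 (TFLOP/s)
```

Under these assumptions the elementwise kernel is capped at 4 TFLOP/s, a little over 1% of the compute roof, regardless of how its instructions are scheduled.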

The model has a second layer: L2 bandwidth, shared-memory bandwidth, and instruction throughput each impose their own ceiling, giving a hierarchy of roofs alongside the DRAM roof. Each memory level has its own arithmetic intensity, computed from the bytes moved at that level. A tiled GEMM whose working set fits in L2 is limited by the L2 roof, not the DRAM roof, which is why it looks "super-efficient" relative to the DRAM ceiling but still misses the compute roof.
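
A hierarchical roofline can be sketched by computing one bound per memory level and taking the lowest. The Python below is an illustration only: the L2 bandwidth is a placeholder chosen for the example, not a vendor figure, and the GEMM traffic numbers and helper names (level_bounds, binding_level) are hypothetical. Substitute measured values before drawing conclusions.

```python
# Hierarchical roofline sketch: one slanted roof per memory level.
PEAK_FLOPS = 312e12  # FLOP/s, from the text

# Bandwidths in byte/s. DRAM is the figure quoted in the text;
# the L2 value is a hypothetical placeholder, not a vendor specification.
BANDWIDTH = {"dram": 2e12, "l2": 5e12}


def level_bounds(flops: float, bytes_moved: dict) -> dict:
    """Per-level performance bound: min(compute roof, AI_level * BW_level)."""
    bounds = {}
    for level, nbytes in bytes_moved.items():
        ai = flops / nbytes  # arithmetic intensity measured at this level
        bounds[level] = min(PEAK_FLOPS, ai * BANDWIDTH[level])
    return bounds


def binding_level(bounds: dict) -> str:
    """The level with the lowest bound is the one that limits the kernel."""
    return min(bounds, key=bounds.get)


# Hypothetical tiled GEMM: tiles are re-read from L2 more often than from DRAM.
traffic = {"dram": 1e9, "l2": 8e9}   # bytes moved at each level (assumed)
bounds = level_bounds(flops=2e11, bytes_moved=traffic)
print(bounds, binding_level(bounds))
```

With these assumed numbers the DRAM bound equals the compute roof while the L2 bound is lower, so the kernel looks unconstrained against DRAM yet is held back by L2.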

Nsight Compute's roofline chart (section 2.9 of the profiling guide) renders these ceilings automatically and plots each kernel as a dot. A dot sitting far below the nearest relevant roof signals actionable headroom.

Reading the Kernel's Position and What to Do

Kernel position	Diagnosis	Remedy
Far left of ridge, near DRAM roof	Memory-bound; low AI	Fuse ops, tile into shared memory, eliminate redundant loads
Left of ridge, below DRAM roof	Memory-bound AND inefficient memory access	Fix coalescing, align accesses, remove strided loads
Right of ridge, below compute roof	Compute-bound; underutilised tensor cores	Use tensor-core intrinsics, pad tiles to warp-tile multiples, reduce branch divergence
Right of ridge, near compute roof	Close to optimal; further gain requires algorithmic change	Profile instruction mix; consider mixed precision or sparsity
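
The table can be read as a small decision procedure. The sketch below encodes it in Python; the function name diagnose and the 0.8 threshold for "near the roof" are assumptions made for illustration, and a real cutoff depends on the kernel and the hardware.

```python
def diagnose(ai, achieved_flops, peak_flops=312e12, peak_bw=2e12, near=0.8):
    """Map a kernel's roofline position to one of the table's four cases.

    `near` is an assumed threshold: a kernel at or above this fraction
    of its applicable roof counts as "near" it.
    """
    ridge = peak_flops / peak_bw
    roof = min(peak_flops, ai * peak_bw)   # applicable ceiling at this AI
    efficiency = achieved_flops / roof
    if ai < ridge:
        if efficiency >= near:
            return "memory-bound, near DRAM roof: raise AI (fuse, tile)"
        return "memory-bound, below DRAM roof: fix memory access pattern"
    if efficiency >= near:
        return "compute-bound, near compute roof: consider algorithmic change"
    return "compute-bound, below compute roof: improve tensor-core utilisation"


# Hypothetical measurements: (arithmetic intensity, achieved FLOP/s).
print(diagnose(2.0, 3.6e12))
print(diagnose(2.0, 1.0e12))
print(diagnose(300.0, 100e12))
print(diagnose(300.0, 290e12))
```

The four calls exercise the four rows of the table in turn, under the assumed threshold.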

The most common mistake is applying compute-side optimisations (loop unrolling, instruction-level parallelism) to a memory-bound kernel. Roofline makes that mistake visible before a single line of code is touched.

Raising Arithmetic Intensity: Fusion and Tiling

The principal tool for moving a kernel rightward on the roofline chart is operator fusion: combining two or more operations so that intermediate results live in registers or shared memory rather than round-tripping through DRAM.
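
The effect of fusion on arithmetic intensity can be estimated by counting DRAM bytes before and after. The Python sketch below does this for a hypothetical bias-add followed by an activation on FP16 data. The per-element FLOP count is an assumed figure, the helper names are illustrative, and traffic for the small bias vector is ignored.

```python
BYTES_FP16 = 2


def dram_bytes_unfused(n: int) -> int:
    """Two kernels: the add writes a temporary, the activation reads it back."""
    add_kernel = 2 * n * BYTES_FP16   # read x, write tmp
    act_kernel = 2 * n * BYTES_FP16   # read tmp, write y
    return add_kernel + act_kernel


def dram_bytes_fused(n: int) -> int:
    """One kernel: the intermediate stays in registers."""
    return 2 * n * BYTES_FP16         # read x, write y


def arithmetic_intensity(flops_per_elem: int, n: int, nbytes: int) -> float:
    """FLOP/byte for n elements moving nbytes through DRAM."""
    return flops_per_elem * n / nbytes


n = 1 << 20
flops_per_elem = 12  # assumed cost of bias-add plus activation; illustrative only
ai_unfused = arithmetic_intensity(flops_per_elem, n, dram_bytes_unfused(n))
ai_fused = arithmetic_intensity(flops_per_elem, n, dram_bytes_fused(n))
print(ai_unfused, ai_fused)  # 1.5 3.0
```

Fusion halves the DRAM traffic and doubles AI here. Both values remain far left of the 156 FLOP/byte ridge, so the fused kernel is still memory-bound, but its bandwidth-limited ceiling is twice as high.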
