Roofline-Guided Kernel Optimisation
The roofline model maps a kernel's arithmetic intensity against hardware ceilings to diagnose whether compute or memory bandwidth is the binding constraint, and directs every subsequent optimisation decision.
An NVIDIA A100 SXM4 delivers 312 TFLOP/s of dense FP16 tensor-core throughput and about 2 TB/s of HBM2e bandwidth. The ratio is roughly 156 FLOP/byte. Any kernel that performs fewer than 156 FLOPs per byte of DRAM traffic will saturate memory bandwidth before it saturates compute, no matter how clever the thread scheduling. That single arithmetic fact is the seed of the roofline model, and it makes most ad-hoc optimisation intuition unnecessary.
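The ratio above can be checked in a few lines. This is a minimal sketch, not a profiling tool: the names `ridge_point` and `is_memory_bound` are hypothetical, and the constants are the rounded figures quoted in the text, with 2 TB/s treated as 2e12 bytes/s.

```python
# Peak figures as quoted in the text (rounded, decimal units assumed).
PEAK_FP16_FLOPS = 312e12       # 312 TFLOP/s, FP16 tensor core
PEAK_HBM_BYTES_PER_S = 2e12    # 2 TB/s HBM2e


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the two hardware limits meet."""
    return peak_flops / peak_bandwidth


def is_memory_bound(ai: float,
                    peak_flops: float = PEAK_FP16_FLOPS,
                    peak_bandwidth: float = PEAK_HBM_BYTES_PER_S) -> bool:
    """True when a kernel's intensity falls below the ridge point."""
    return ai < ridge_point(peak_flops, peak_bandwidth)


if __name__ == "__main__":
    print(ridge_point(PEAK_FP16_FLOPS, PEAK_HBM_BYTES_PER_S))  # FLOP/byte
    print(is_memory_bound(0.25))    # a low-intensity kernel
    print(is_memory_bound(200.0))   # a high-intensity kernel
```

A kernel with an intensity of 0.25 FLOP/byte falls far below the ridge point and is classified as memory-bound; one at 200 FLOP/byte sits above it.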
The Roofline Model, Precisely
Plot arithmetic intensity (AI) on the x-axis: AI = total floating-point operations / total bytes transferred to/from DRAM, measured in FLOP/byte. Plot achieved performance on the y-axis in FLOP/s. Two ceilings dominate:
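In the standard roofline formulation, those ceilings are the peak compute rate (a horizontal line) and the memory-bandwidth bound (a diagonal whose slope is the peak bandwidth), so attainable performance is min(peak FLOP/s, AI × peak bandwidth). The sketch below applies that formula to two example kernels. The kernels, the function names, and the idealised traffic counts (each operand read or written from DRAM exactly once, 2 bytes per FP16 element) are assumptions chosen for illustration; real kernels usually move more data than this lower bound.

```python
# Rounded A100 SXM4 figures from the text (decimal units assumed).
PEAK_FLOPS = 312e12   # FLOP/s, FP16 tensor core
PEAK_BW = 2e12        # bytes/s, HBM2e


def arithmetic_intensity(flops: float, dram_bytes: float) -> float:
    """AI = total floating-point operations / total DRAM bytes moved."""
    return flops / dram_bytes


def attainable_flops(ai: float,
                     peak_flops: float = PEAK_FLOPS,
                     peak_bw: float = PEAK_BW) -> float:
    """Roofline bound: the lower of the compute and bandwidth ceilings."""
    return min(peak_flops, ai * peak_bw)


def vector_add_ai(n: int, bytes_per_elem: int = 2) -> float:
    """c = a + b: n FLOPs, two reads and one write per element."""
    return arithmetic_intensity(n, 3 * n * bytes_per_elem)


def square_gemm_ai(n: int, bytes_per_elem: int = 2) -> float:
    """C = A @ B for n x n matrices: 2n^3 FLOPs, three matrices moved once."""
    return arithmetic_intensity(2 * n**3, 3 * n**2 * bytes_per_elem)


if __name__ == "__main__":
    for name, ai in [("vector add", vector_add_ai(1 << 20)),
                     ("GEMM n=4096", square_gemm_ai(4096))]:
        bound = attainable_flops(ai)
        kind = "memory-bound" if bound < PEAK_FLOPS else "compute-bound"
        print(f"{name}: AI={ai:.3f} FLOP/byte, "
              f"bound={bound / 1e12:.2f} TFLOP/s ({kind})")
```

Under these assumptions the vector add has an intensity of 1/6 FLOP/byte regardless of size, so it can never approach the compute peak, while the large GEMM has an intensity of n/3 and lands on the flat part of the roof.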