Compute-Bound vs Memory-Bound Kernels
A kernel's performance ceiling is set by whichever runs out first, compute throughput (FLOP/s) or memory bandwidth. Misidentifying which one applies wastes optimisation effort on changes that cannot help.
A GPU rated at 100 TFLOP/s can spend 90% of its time waiting for data. Add more matrix-multiply units and nothing changes, because the units it already has are idle. This is not a GPU problem; it is a kernel classification problem, and getting it wrong sends engineers chasing the wrong bottleneck.
Every kernel falls on a spectrum between two extremes: compute-bound (the ALUs are saturated, memory can keep up) and memory-bound (the ALUs are idle, waiting for bytes to arrive). The distinction is not academic. It dictates which optimisation strategies are even physically capable of helping.
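One common way to place a kernel on this spectrum is to compare its arithmetic intensity (FLOPs performed per byte moved to or from memory) against the device's ratio of peak FLOP/s to memory bandwidth. The sketch below is illustrative only. The 100 TFLOP/s peak comes from the example above, but the 2 TB/s bandwidth is an assumed figure, and `Device`, `attainable_flops`, and `classify` are hypothetical helpers rather than part of any library. The byte counts also assume ideal traffic, with every operand read or written exactly once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) at which compute becomes the limit."""
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_flops(device: Device, intensity: float) -> float:
    """Upper bound on FLOP/s: the lower of the compute and bandwidth limits."""
    return min(device.peak_flops, intensity * device.mem_bandwidth)


def classify(device: Device, intensity: float) -> str:
    return "compute-bound" if intensity >= device.ridge_point else "memory-bound"


# Assumed device: 100 TFLOP/s (from the text), 2 TB/s bandwidth (hypothetical).
gpu = Device(peak_flops=100e12, mem_bandwidth=2e12)

# FP32 vector add, c = a + b: 1 FLOP per element,
# 12 bytes per element (two 4-byte reads, one 4-byte write).
n = 1 << 20
vec_add_ai = arithmetic_intensity(flops=n, bytes_moved=12 * n)

# FP32 square matmul, N x N: 2*N^3 FLOPs,
# ideal traffic of three N x N matrices at 4 bytes per element.
N = 4096
matmul_ai = arithmetic_intensity(flops=2 * N**3, bytes_moved=3 * N * N * 4)

for name, ai in [("vector add", vec_add_ai), ("matmul", matmul_ai)]:
    bound = attainable_flops(gpu, ai) / 1e12
    print(f"{name}: {ai:.3f} FLOP/byte, {classify(gpu, ai)}, "
          f"at most {bound:.2f} TFLOP/s")
```

Under these assumed numbers the vector add is capped at roughly 0.17 TFLOP/s however many ALUs the device has, while the large matmul can in principle reach the compute peak. Real kernels usually move more bytes than the ideal count, so measured intensity tends to be lower than this estimate.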
Arithmetic Intensity and the Roofline Model