
Applied LLMs

The Memory Wall and Arithmetic Intensity

Arithmetic intensity determines whether a GPU kernel is memory-bound or compute-bound, and almost every LLM inference operation sits on the memory-bound side of that line.

intermediate · 8 min read

A 100-billion-parameter model running on an A100 GPU is doing, in aggregate, an astronomical number of floating-point multiplications per second. Yet token generation is often limited not by those multiplications but by how fast the GPU can shuttle numbers in from DRAM. The chip is fast; the wires feeding it are the bottleneck. This gap between compute throughput and memory bandwidth is called the memory wall, and understanding it precisely changes how you think about every optimisation decision from quantisation to batching to operator fusion.

What the Roofline Model Actually Says

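The roofline model is a two-line performance ceiling drawn on a log-log plot of achieved performance (FLOP/s) against arithmetic intensity (FLOP/byte): a sloped line set by memory bandwidth and a flat line set by peak compute. Arithmetic intensity for a kernel is the number of floating-point operations it performs divided by the number of bytes it moves to and from memory.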
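As a rough sketch, arithmetic intensity is FLOPs performed divided by bytes moved to and from memory, and the roofline ceiling is the smaller of peak compute and intensity times bandwidth. The Python below illustrates this under stated assumptions: the hardware figures are round numbers in the range of an A100 80GB (about 312 TFLOP/s dense FP16 tensor-core throughput, about 2 TB/s of HBM bandwidth), the function names are hypothetical, and the matmul traffic model assumes each operand is read once and the output written once, with no cache reuse effects.

```python
# Assumed round-number hardware figures (roughly an A100 80GB); not measured values.
PEAK_FLOPS = 312e12      # FLOP/s, dense FP16 tensor-core peak (assumption)
PEAK_BANDWIDTH = 2.0e12  # bytes/s, HBM bandwidth (assumption)


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def roofline(intensity: float,
             peak_flops: float = PEAK_FLOPS,
             peak_bandwidth: float = PEAK_BANDWIDTH) -> float:
    """Attainable FLOP/s: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * peak_bandwidth)


def matmul_intensity(m: int, k: int, n: int, bytes_per_element: int = 2) -> float:
    """Intensity of an (m x k) @ (k x n) matmul under an idealised traffic model.

    Assumes each input is read once and the output is written once.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return arithmetic_intensity(flops, bytes_moved)


# Ridge point: the intensity at which the two roofline segments meet.
ridge = PEAK_FLOPS / PEAK_BANDWIDTH

# Single-token decode: one activation row against an 8192 x 8192 weight matrix.
decode = matmul_intensity(1, 8192, 8192)
# Batched case: 256 rows share the same weight matrix read.
batched = matmul_intensity(256, 8192, 8192)

for name, ai in [("decode (m=1)", decode), ("batched (m=256)", batched)]:
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(f"{name}: {ai:.1f} FLOP/byte, ceiling {roofline(ai) / 1e12:.1f} TFLOP/s, {bound}")
```

With these assumed figures the ridge point is 156 FLOP/byte. The single-row matmul lands near 1 FLOP/byte, far to the left of the ridge, while the 256-row version clears it, which is the arithmetic behind batching as a lever against the memory wall.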
