Applied LLMs
HBM Bandwidth and Capacity
High Bandwidth Memory sets a hard ceiling on how fast a GPU can feed its compute units, and most LLM operations live squarely against that ceiling.
intermediate · 8 min read
An NVIDIA A100 can execute roughly 312 TFLOP/s of dense FP16 arithmetic, yet for many practical LLM workloads it delivers a small fraction of that figure. The reason is not broken hardware; it is that the chip can only pull data from its attached DRAM at ~2 TB/s, and fetching a weight from DRAM takes far longer than multiplying by it. The compute units finish their work and then sit idle waiting for the next tile of data to arrive. This is the memory wall, and High Bandwidth Memory (HBM) is the industry's current answer to it.
What HBM actually is
Conventional GDDR memory sits beside the GPU package on the PCB as discrete chips, connected over a bus that is typically a few hundred bits wide in total. HBM stacks multiple DRAM dies vertically in a single package, then places that stack on a silicon interposer next to the GPU die. Through-silicon vias connect the dies within the stack, and the stack reaches the GPU through dense interposer wiring rather than PCB traces, giving a bus width of 1024 bits or more per stack.
The result is dramatically higher bandwidth at lower power per bit than GDDR. Representative figures across recent generations:
| GPU | HBM gen | Capacity | Bandwidth |
|---|---|---|---|
| V100 (2017) | HBM2 | 16/32 GB | ~900 GB/s |
| A100 (2020) | HBM2 / HBM2e | 40 / 80 GB | ~1555 / ~2039 GB/s |
| H100 SXM (2022) | HBM3 | 80 GB | ~3350 GB/s |
| H200 (2024) | HBM3e | 141 GB | ~4800 GB/s |
Capacity matters independently of bandwidth: a model that does not fit in HBM must be quantised, split across several GPUs, or offloaded to much slower host memory, regardless of how fast the memory bus is.
The roofline model: why most LLM ops are memory-bound
The roofline model frames performance with a single ratio called arithmetic intensity (AI):
Arithmetic Intensity = FLOPs performed / bytes read from DRAM
(units: FLOP / byte)
Each GPU has a hardware ridge point: the ratio of peak FLOP/s to peak bytes/s. For an A100 SXM4 in FP16:
Ridge point = 312 TFLOP/s / 2.0 TB/s = 156 FLOP/byte
Any kernel with arithmetic intensity below 156 is memory-bound: even if it executes perfectly, DRAM bandwidth will be the bottleneck. Any kernel above 156 is compute-bound.
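The roofline bound itself is a one-line formula, and evaluating it for a few intensities makes the two regimes concrete. The sketch below uses the A100 figures quoted above; the function names and the sample intensities are illustrative choices, not part of any library.

```python
def ridge_point(peak_flops: float, peak_bytes_per_s: float) -> float:
    """FLOP/byte at which a kernel moves from memory-bound to compute-bound."""
    return peak_flops / peak_bytes_per_s


def attainable_flops(intensity: float, peak_flops: float, peak_bytes_per_s: float) -> float:
    """Roofline upper bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * peak_bytes_per_s)


# A100 SXM4 figures quoted in the text (dense FP16, HBM bandwidth).
A100_PEAK_FLOPS = 312e12
A100_PEAK_BW = 2.0e12

ridge = ridge_point(A100_PEAK_FLOPS, A100_PEAK_BW)

# Sample intensities are arbitrary illustration points.
for intensity in (1, 16, 156, 512):
    bound = attainable_flops(intensity, A100_PEAK_FLOPS, A100_PEAK_BW)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"AI={intensity:>4} FLOP/byte -> at most {bound / 1e12:6.1f} TFLOP/s ({regime})")
```

At an intensity of 1 FLOP/byte the bound is 2 TFLOP/s, under 1% of the compute roof, which is the situation batch-1 decoding finds itself in.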
Where do LLM operations land?
- Autoregressive decoding (batch=1). Each forward pass loads all weights once and does one multiply-accumulate per weight. For a linear layer of shape `[d_model, d_ffn]`, the FLOPs are `2 * d_model * d_ffn` and the bytes are `2 * d_model * d_ffn` (FP16). Arithmetic intensity is approximately 1 FLOP/byte - more than 100x below the ridge point. The GPU is almost entirely waiting for memory.
- Large-batch prefill. With batch size B, the same linear layer does `2 * B * d_model * d_ffn` FLOPs against the same weight bytes (weights are loaded once). AI scales linearly with B. At B=156+ the operation crosses the ridge point and becomes compute-bound.
- Attention with long sequences. The QK^T and softmax(...)V products operate on the KV cache, which scales with sequence length. On long sequences these become bandwidth-intensive again, motivating techniques like FlashAttention that restructure the computation to stay in on-chip SRAM.
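The batch-size scaling in the first two bullets can be checked numerically. The sketch below computes the intensity of one FP16 linear layer; unlike the weights-only count above, it also charges the bytes for reading inputs and writing outputs, which is an added assumption. The layer dimensions (8192 by 28672) are illustrative values, not taken from the text.

```python
def linear_layer_intensity(batch: int, d_in: int, d_out: int, bytes_per_value: int = 2) -> float:
    """FLOP/byte for y = x @ W with x of shape [batch, d_in] and W of shape [d_in, d_out]."""
    flops = 2 * batch * d_in * d_out                          # one multiply and one add per weight per row
    weight_bytes = d_in * d_out * bytes_per_value             # weights are read once
    activation_bytes = batch * (d_in + d_out) * bytes_per_value  # read inputs, write outputs
    return flops / (weight_bytes + activation_bytes)


# Illustrative feed-forward shape (assumed for this example).
D_MODEL, D_FFN = 8192, 28672

for b in (1, 8, 64, 156, 256):
    print(f"batch={b:>3}: {linear_layer_intensity(b, D_MODEL, D_FFN):7.2f} FLOP/byte")
```

With activation traffic included, intensity at B=156 comes out slightly under 156, so the crossover sits a little above the weights-only estimate.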
The practical consequence: inference throughput at batch=1 is almost entirely gated by HBM bandwidth, not FLOP count. Doubling the FLOP budget of an H100 without changing HBM bandwidth would not make single-sequence decoding faster.
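A rough lower bound on per-token decode latency follows directly: every weight must be streamed from HBM at least once per step. The sketch below assumes a hypothetical 7-billion-parameter FP16 model and uses the bandwidth figures from the table; it ignores KV-cache reads, activations, and kernel overheads, so real latency will be higher.

```python
def decode_step_lower_bound_s(n_params: float, bytes_per_param: float, hbm_bytes_per_s: float) -> float:
    """Seconds needed to read every weight from HBM once (weights only)."""
    return n_params * bytes_per_param / hbm_bytes_per_s


# Hypothetical model size chosen for illustration; bandwidths are the table's approximate figures.
N_PARAMS = 7e9
BYTES_PER_PARAM = 2  # FP16

for name, bandwidth in (("A100", 2.0e12), ("H100 SXM", 3.35e12), ("H200", 4.8e12)):
    t = decode_step_lower_bound_s(N_PARAMS, BYTES_PER_PARAM, bandwidth)
    print(f"{name:>8}: at least {t * 1e3:.1f} ms per token, at most {1 / t:.0f} tokens/s")
```

Note that peak FLOP/s does not appear anywhere in the bound, which is the point of the paragraph above.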
Capacity constraints in practice
A 70B parameter model in BF16 occupies 140 GB. The H100 SXM has 80 GB of HBM. The model does not fit on one GPU and must be split across devices with tensor or pipeline parallelism, introducing inter-GPU communication overhead (NVLink or InfiniBand). Reducing the model to INT8 brings it to 70 GB, which barely fits on one H100 and removes that communication entirely, though only about 10 GB then remains for the KV cache and activations.
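The fit check is simple arithmetic and worth scripting when comparing precisions. This is a minimal sketch that counts weight bytes only; the helper names are illustrative, and a real deployment also needs headroom for the KV cache, activations, and framework overhead.

```python
import math

BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(n_params: float, dtype: str) -> float:
    """Weight footprint in GB (10^9 bytes) at the given precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus_for_weights(n_params: float, dtype: str, hbm_gb: float) -> int:
    """Smallest GPU count whose combined HBM holds the weights alone."""
    return math.ceil(weight_gb(n_params, dtype) / hbm_gb)


for dtype in ("bf16", "int8", "int4"):
    size = weight_gb(70e9, dtype)
    gpus = min_gpus_for_weights(70e9, dtype, hbm_gb=80)
    print(f"70B in {dtype}: {size:.0f} GB of weights, needs >= {gpus} x 80 GB GPU(s)")
```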
KV cache competes with weights for the same HBM pool. During a long inference session with batch size B and sequence length S, the KV cache for a transformer with L layers and H key/value heads (fewer than the query heads under grouped-query attention) grows to:
KV cache bytes = 2 * L * H * d_head * S * B * sizeof(dtype)
For Llama-3 70B (L=80, H=8 key/value heads under GQA, d_head=128) in FP16, each token costs 320 KiB of cache. At S=32k, B=1 that is roughly 10.7 GB; at B=32, S=8k it is roughly 86 GB, more than the entire HBM of an H100. These numbers compound quickly and often limit the achievable batch size more than compute does.
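Evaluating the formula directly is a useful sanity check when sizing a deployment. The sketch below is a plain transcription of it, with H read as the number of key/value heads; the function name and the GB conversion (10^9 bytes) are choices made here.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int,
                   seq_len: int, batch: int, bytes_per_value: int = 2) -> int:
    """Bytes held by keys and values across all layers (the leading 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_value


# Configuration from the text: L=80, H=8 (GQA groups), d_head=128, FP16.
LLAMA3_70B = dict(n_layers=80, n_kv_heads=8, d_head=128)

per_token = kv_cache_bytes(**LLAMA3_70B, seq_len=1, batch=1)
print(f"per token: {per_token / 1024:.0f} KiB")

for seq_len, batch in ((32 * 1024, 1), (8 * 1024, 32)):
    total = kv_cache_bytes(**LLAMA3_70B, seq_len=seq_len, batch=batch)
    print(f"S={seq_len}, B={batch}: {total / 1e9:.1f} GB")
```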
Why FlashAttention changed the arithmetic
Standard attention materialises the full S x S attention matrix in HBM. For S=8192, that is 8192^2 * 2 bytes = 128 MB written and read back per attention head, per layer. FlashAttention (Dao et al., 2022) tiles the computation into blocks that fit in on-chip SRAM (192 KB per SM on an A100), fusing the softmax and the value-weighted sum into a single kernel pass. The attention scores never hit HBM; only tiles of Q, K, V, and the output move between HBM and SRAM. This dramatically reduces the effective memory bandwidth demand for the attention operation and makes long-context training and inference practical on existing hardware.
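The trick that lets the softmax be computed block by block is a running maximum and running sum that are rescaled as each new block arrives. The sketch below shows that idea for a single query vector in pure Python. It is a toy illustration of the online-softmax recurrence, not the FlashAttention kernel: there is no GPU tiling, no masking, and the block size is an arbitrary assumption.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_row_reference(q, keys, values):
    """Softmax attention for one query; holds every score at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def attention_row_tiled(q, keys, values, block_size=2):
    """Same result, visiting keys and values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalised output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [_dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


# Small made-up inputs to compare the two paths.
q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, -0.3, 1.5], [-1.0, 2.0, 0.0], [0.7, 0.7, 0.7], [3.0, -2.0, 1.0]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [-2.0, 3.0], [0.3, 0.3]]

print(attention_row_reference(q, keys, values))
print(attention_row_tiled(q, keys, values, block_size=2))
```

The tiled version only ever holds `block_size` scores at a time, which is what allows the full score matrix to stay out of HBM.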
When it falls down
Quantisation accuracy cliffs. Reducing weights from FP16 to INT4 cuts the weight bytes read from HBM by 4x and increases effective arithmetic intensity proportionally, but per-channel quantisation error accumulates, and outlier activations in large models (a well-documented phenomenon in LLMs) cause significant quality degradation below INT8 for some tasks without careful calibration.
Bandwidth saturation at large batch. Once arithmetic intensity crosses the ridge point, additional HBM bandwidth improvements yield diminishing returns. A future GPU with 8 TB/s HBM would not speed up a batch-512 training run that is already compute-bound; only higher FP16 throughput matters there.
NVLink vs. HBM balance. Multi-GPU tensor parallelism moves activations between GPUs at NVLink speeds (about 900 GB/s per GPU for fourth-generation NVLink on H100, 600 GB/s for third-generation NVLink on A100), several times slower than local HBM. A tensor-parallel degree chosen to fit the model or raise throughput can therefore hit the NVLink bandwidth wall instead, shifting the bottleneck off-chip entirely.
Memory fragmentation in long-context serving. HBM is a flat address space managed by the CUDA allocator. Systems like vLLM introduce paged KV-cache management to avoid fragmentation when serving variable-length requests, because fragmented allocation in an 80 GB pool wastes a surprisingly large fraction of usable capacity.
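A toy accounting model shows where the waste comes from. The sketch below compares reserving a contiguous maximum-length region per request against handing out fixed-size blocks on demand. It is not vLLM's API or allocator; the request lengths, the 2048-token maximum, and the 16-token block size are all assumptions made for illustration.

```python
import math


def reserved_tokens_contiguous(request_lengths, max_len):
    """Token slots reserved when every request pre-allocates room for max_len tokens."""
    return len(request_lengths) * max_len


def reserved_tokens_paged(request_lengths, block_tokens):
    """Token slots reserved when each request holds only the whole blocks it has filled."""
    return sum(math.ceil(n / block_tokens) * block_tokens for n in request_lengths)


def utilisation(request_lengths, reserved):
    """Fraction of reserved slots that actually hold token state."""
    return sum(request_lengths) / reserved


# Hypothetical current lengths of five in-flight requests.
lengths = [120, 900, 2048, 35, 512]

contiguous = reserved_tokens_contiguous(lengths, max_len=2048)
paged = reserved_tokens_paged(lengths, block_tokens=16)

print(f"contiguous: {contiguous} slots, utilisation {utilisation(lengths, contiguous):.1%}")
print(f"paged:      {paged} slots, utilisation {utilisation(lengths, paged):.1%}")
```

In the paged scheme the only waste is the partially filled last block of each request, so it is bounded by one block per request regardless of how lengths vary.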
HBM thermal limits. At sustained 3+ TB/s read rates, HBM generates significant heat within its stacked dies. Sustained workloads at full memory bandwidth can trigger thermal throttling, reducing effective bandwidth below the datasheet peak.
Further reading
- NVIDIA Deep Learning Performance: GPU Background - official treatment of arithmetic intensity, ops:byte ratio, and the A100's memory specifications.
- NVIDIA Deep Learning Performance: Matrix Multiplication - roofline analysis worked through concrete GEMM shapes with tile and wave quantisation effects.
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022) - the canonical paper showing how IO-aware kernel design reduces HBM traffic for attention.