HBM Bandwidth and Capacity

High Bandwidth Memory sets a hard ceiling on how fast a GPU can feed its compute units, and most LLM operations live squarely against that ceiling.

intermediate · 8 min read

An NVIDIA A100 can execute roughly 312 TFLOP/s of dense FP16 arithmetic, yet for many practical LLM workloads it delivers a small fraction of that figure. The reason is not broken hardware; it is that the chip can pull data from its attached DRAM at only ~2 TB/s, so fetching a weight takes far longer than multiplying by it. The compute units finish their work and then sit idle waiting for the next tile of data to arrive. This is the memory wall, and High Bandwidth Memory (HBM) is the industry's current answer to it.

What HBM actually is

Conventional GDDR memory sits beside the GPU die on a PCB, connected by a comparatively narrow bus of fast PCB traces. HBM stacks multiple DRAM dies vertically in a single package, links them with thousands of tiny through-silicon vias, and places that stack on a silicon interposer next to the GPU die. The GPU reaches the stack through dense interposer wiring rather than PCB traces, giving an interface width of 1024 bits or more per stack.

The result is dramatically higher bandwidth at lower power per bit than GDDR. Representative figures across recent generations:

GPU               HBM gen        Capacity      Bandwidth
V100 (2017)       HBM2           16 / 32 GB    ~900 GB/s
A100 (2020)       HBM2 / HBM2e   40 / 80 GB    ~1555 / ~2039 GB/s
H100 SXM (2022)   HBM3           80 GB         ~3350 GB/s
H200 (2024)       HBM3e          141 GB        ~4800 GB/s

Capacity matters independently from bandwidth: a model that does not fit in HBM requires either quantisation or expensive off-chip offloading, regardless of how fast the memory bus is.

The roofline model: why most LLM ops are memory-bound

The roofline model frames performance with a single ratio called arithmetic intensity (AI):

Arithmetic Intensity = FLOPs performed / bytes read from DRAM
                     (units: FLOP / byte)

Each GPU has a hardware ridge point: the ratio of peak FLOP/s to peak bytes/s. For an A100 SXM4 in FP16:

Ridge point = 312 TFLOP/s / 2.0 TB/s = 156 FLOP/byte

Any kernel with arithmetic intensity below 156 is memory-bound: even if it executes perfectly, DRAM bandwidth will be the bottleneck. Any kernel above 156 is compute-bound.
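
The ridge-point arithmetic can be sketched in a few lines of Python. The function names are illustrative, and the peak figures are the rounded A100 numbers used above rather than measured values.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """FLOP/byte at which a kernel moves from memory-bound to compute-bound."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline ceiling: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * peak_bandwidth)


A100_PEAK_FLOPS = 312e12      # dense FP16, FLOP/s (rounded datasheet figure)
A100_PEAK_BANDWIDTH = 2.0e12  # HBM, bytes/s (rounded)

ridge = ridge_point(A100_PEAK_FLOPS, A100_PEAK_BANDWIDTH)
print(f"ridge point: {ridge:.0f} FLOP/byte")

for ai in (1, 16, 128, 512):
    ceiling = attainable_flops(ai, A100_PEAK_FLOPS, A100_PEAK_BANDWIDTH)
    bound = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:>4}: at most {ceiling / 1e12:6.1f} TFLOP/s ({bound})")
```

The ceiling is an upper bound only; a real kernel can fall well below it for reasons the roofline model does not capture, such as poor occupancy or uncoalesced access.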

Where do LLM operations land?

  • Autoregressive decoding (batch=1). Each forward pass loads all weights once and does one multiply-accumulate per weight. For a linear layer of shape [d_model, d_ffn], the FLOPs are 2 * d_model * d_ffn and the bytes are 2 * d_model * d_ffn (FP16). Arithmetic intensity is approximately 1 FLOP/byte, more than 100x below the ridge point. The GPU is almost entirely waiting for memory.
  • Large-batch prefill. With B tokens processed together (batch size times prompt length during prefill), the same linear layer does 2 * B * d_model * d_ffn FLOPs against the same weight bytes (weights are loaded once). Ignoring activation traffic, AI scales linearly with B, so at around B=156 the operation crosses the ridge point and becomes compute-bound.
  • Attention with long sequences. The QK^T and softmax(...)V products operate on the KV cache, which scales with sequence length. On long sequences these become bandwidth-intensive again, motivating techniques like FlashAttention that restructure the computation to stay in on-chip SRAM.
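
As a rough sketch of the linear-layer cases above, the function below estimates arithmetic intensity as a function of B. Unlike the idealised figures in the bullets, it also counts the bytes for reading input activations and writing outputs, so the crossover lands a little later than the idealised estimate. The function name and the layer dimensions (d_model=8192, d_ffn=28672) are illustrative assumptions.

```python
def linear_intensity(batch: int, d_model: int, d_ffn: int, bytes_per_elem: int = 2) -> float:
    """Estimated FLOP/byte for one [d_model, d_ffn] linear layer over `batch` tokens."""
    flops = 2 * batch * d_model * d_ffn                        # one multiply-add per weight per token
    weight_bytes = bytes_per_elem * d_model * d_ffn            # weights are read once
    activation_bytes = bytes_per_elem * batch * (d_model + d_ffn)  # inputs read, outputs written
    return flops / (weight_bytes + activation_bytes)


RIDGE = 156.0  # A100 FP16 ridge point from the text, FLOP/byte

for b in (1, 8, 64, 128, 256, 1024):
    ai = linear_intensity(b, d_model=8192, d_ffn=28672)
    regime = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(f"B={b:>5}: AI ~ {ai:7.1f} FLOP/byte ({regime})")
```

This ignores caching effects and any kernel fusion, so it should be read as an order-of-magnitude guide rather than a prediction.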

The practical consequence: inference throughput at batch=1 is almost entirely gated by HBM bandwidth, not FLOP count. Doubling the FLOP budget of an H100 without changing HBM bandwidth would not make single-sequence decoding faster.
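
One way to make this concrete is a bandwidth-only bound on decode speed: if every generated token must stream all weights from HBM once, the token rate cannot exceed bandwidth divided by weight bytes. The sketch below assumes exactly that and ignores KV-cache reads and kernel overhead; the 7B-parameter model is a hypothetical example, not one discussed in this article.

```python
def max_decode_tokens_per_s(n_params: float, bytes_per_param: float, hbm_bandwidth: float) -> float:
    """Upper bound on batch=1 decode rate when each token streams all weights once."""
    weight_bytes = n_params * bytes_per_param
    return hbm_bandwidth / weight_bytes


# Hypothetical 7B-parameter model in FP16 on a GPU with ~2 TB/s of HBM bandwidth.
bound = max_decode_tokens_per_s(n_params=7e9, bytes_per_param=2, hbm_bandwidth=2.0e12)
print(f"at most ~{bound:.0f} tokens/s per sequence")
```

Note that peak FLOP/s does not appear anywhere in the bound, which is the point of the paragraph above.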

Capacity constraints in practice

A 70B-parameter model in BF16 occupies 140 GB. The H100 SXM has 80 GB of HBM, so the model does not fit on one GPU and must be split across devices with tensor or pipeline parallelism, introducing inter-GPU communication overhead (NVLink or InfiniBand). Quantising the weights to INT8 brings them to 70 GB, which barely fits on one H100 and eliminates that communication entirely, though it leaves only about 10 GB for the KV cache and activations.
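
The fit check is simple enough to script. The sketch below is a minimal version that counts weights only; the helper names and the `reserve_gb` parameter (headroom held back for KV cache and activations) are assumptions made for illustration.

```python
import math

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(n_params: float, dtype: str) -> float:
    """Weight footprint in GB (10^9 bytes), ignoring any quantisation metadata."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(n_params: float, dtype: str, hbm_gb: float, reserve_gb: float = 0.0) -> int:
    """Smallest GPU count whose usable HBM holds the weights."""
    usable = hbm_gb - reserve_gb
    return math.ceil(weight_gb(n_params, dtype) / usable)


for dtype in ("bf16", "int8", "int4"):
    gb = weight_gb(70e9, dtype)
    print(f"70B in {dtype}: {gb:.0f} GB, needs {min_gpus(70e9, dtype, hbm_gb=80)} x 80 GB GPU(s)")
```

Setting `reserve_gb` to a realistic KV-cache budget often pushes the INT8 case back to two GPUs, which is why "barely fitting" deserves caution.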

KV cache competes with weights for the same HBM pool. During a long inference session with batch size B and sequence length S, the KV cache for a transformer with L layers and H key-value heads (fewer than the number of query heads when grouped-query attention is used) grows to:

KV cache bytes = 2 * L * H * d_head * S * B * sizeof(dtype)

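For Llama-3 70B (L=80, H=8 KV heads under GQA, d_head=128) in FP16, each token costs 2 * 80 * 8 * 128 * 2 bytes = 320 KiB. At S=32k, B=1 that is roughly 10.7 GB. At B=32, S=8k it is roughly 86 GB, more than the entire HBM of an H100 before any weights are loaded. These numbers compound quickly and often limit the achievable batch size more than compute does.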
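
The formula translates directly into code. The following is a small sketch using the Llama-3 70B shape quoted above; the function and parameter names are illustrative.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: keys and values (factor 2) for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem


# Llama-3 70B shape from the text: 80 layers, 8 KV heads, head dimension 128, FP16.
shape = dict(n_layers=80, n_kv_heads=8, d_head=128)

for seq_len, batch in ((32 * 1024, 1), (8 * 1024, 32)):
    size = kv_cache_bytes(seq_len=seq_len, batch=batch, **shape)
    print(f"S={seq_len}, B={batch}: {size / 1e9:.1f} GB")
```

Because the size is linear in both S and B, halving either one halves the cache, which is the trade-off a serving system makes when it caps context length to admit more concurrent requests.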

Why FlashAttention changed the arithmetic

Standard attention materialises the full S x S attention matrix in HBM. For S=8192, that is 8192^2 * 2 bytes = 128 MB written and read back per head, per layer. FlashAttention (Dao et al., 2022) tiles the computation into blocks that fit in on-chip SRAM (192 KB per SM on an A100), fusing the softmax and the value-weighted sum into a single kernel pass. The attention scores never hit HBM; only Q, K, V, and the output move through HBM, block by block. This dramatically reduces the effective memory bandwidth demand for the attention operation and makes long-context training and inference practical on existing hardware.
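
The key trick that makes tiling possible is a running ("online") softmax: the kernel keeps only a running maximum, a running normaliser, and an unnormalised output, and rescales them as each new block of keys arrives. The pure-Python sketch below illustrates that idea for a single query vector. It is a teaching sketch, not the FlashAttention kernel itself: it omits the 1/sqrt(d) scaling, masking, and all GPU-specific blocking, and the function names are made up for this example.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute every score for one query, then softmax and mix values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(d)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, visiting K/V in blocks and keeping only running statistics."""
    d = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * d  # unnormalised output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block_size=2))
```

The tiled version never holds more than `block_size` scores at a time, which is the property that lets a real kernel keep the scores in SRAM.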

When it falls down

Quantisation accuracy cliffs. Reducing weights from FP16 to INT4 cuts HBM bandwidth demand by 4x and increases effective arithmetic intensity proportionally, but per-channel quantisation error accumulates, and outlier activations in large models (a well-documented phenomenon in LLMs) cause significant quality degradation below INT8 for some tasks without careful calibration.
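
The effect of outliers is easy to reproduce on a toy tensor. The sketch below uses plain symmetric absmax quantisation with one scale for the whole tensor, which is only one simple scheme and not necessarily what any particular library does; the data is synthetic and the function names are illustrative.

```python
def quantize_dequantize(xs, bits):
    """Symmetric absmax quantisation to `bits` bits, then back to floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax  # one scale shared by the whole tensor
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in xs]


def mean_abs_error(xs, bits):
    """Average absolute round-trip error introduced by quantisation."""
    ys = quantize_dequantize(xs, bits)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Synthetic values in [-0.5, 0.5], with and without a single large outlier.
values = [0.01 * i for i in range(-50, 51)]
with_outlier = values + [20.0]

for bits in (8, 4):
    print(f"INT{bits}: error without outlier {mean_abs_error(values, bits):.4f}, "
          f"with outlier {mean_abs_error(with_outlier, bits):.4f}")
```

A single outlier stretches the shared scale so far that, at 4 bits, every ordinary value rounds to zero. Finer-grained scales and calibration exist largely to contain this.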

Bandwidth saturation at large batch. Once arithmetic intensity crosses the ridge point, additional HBM bandwidth improvements yield diminishing returns. A future GPU with 8 TB/s HBM would not speed up a batch-512 training run that is already compute-bound; only higher FP16 throughput matters there.

NVLink vs. HBM balance. Multi-GPU tensor parallelism moves activations between GPUs at NVLink speeds (900 GB/s per GPU for NVLink 4 on H100, 600 GB/s for NVLink 3 on A100), several times slower than local HBM. Even when each GPU's local kernels are compute-bound, a high enough parallelism degree can hit the NVLink bandwidth wall instead, shifting the bottleneck off-chip entirely.

Memory fragmentation in long-context serving. HBM is a flat address space managed by the CUDA allocator. Systems like vLLM introduce paged KV-cache management to avoid fragmentation when serving variable-length requests, because reserving contiguous, worst-case-sized regions in an 80 GB pool wastes a surprisingly large fraction of usable capacity.
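
To see where the waste comes from, compare reserving a worst-case contiguous region per request with handing out small fixed-size blocks on demand. The sketch below only counts token slots. It is not vLLM's implementation, and the request lengths, maximum length, and block size of 16 tokens are hypothetical values chosen for illustration.

```python
import math


def contiguous_slots(request_lengths, max_len):
    """Reserve max_len token slots per request up front."""
    return len(request_lengths) * max_len


def paged_slots(request_lengths, block_size):
    """Allocate fixed-size blocks on demand; only each request's last block is partly empty."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)


def wasted_fraction(reserved, request_lengths):
    """Share of reserved slots that hold no token."""
    return 1 - sum(request_lengths) / reserved


lengths = [120, 900, 35, 2048, 400]  # hypothetical current sequence lengths, in tokens

contig = contiguous_slots(lengths, max_len=2048)
paged = paged_slots(lengths, block_size=16)
print(f"contiguous: {contig} slots, {wasted_fraction(contig, lengths):.1%} wasted")
print(f"paged:      {paged} slots, {wasted_fraction(paged, lengths):.1%} wasted")
```

With paging, the waste per request is bounded by one block, regardless of how far the request falls short of the maximum length.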

HBM thermal limits. At sustained 3+ TB/s read rates, HBM generates significant heat within its stacked dies. Sustained workloads at full memory bandwidth can trigger thermal throttling, reducing effective bandwidth below the datasheet peak.
