Applied LLMs
Reading an Accelerator Datasheet
A datasheet number means nothing without the four unit-aware ratios that reveal whether your workload will actually be compute-bound or memory-bound on that chip.
intermediate · 8 min read
An H100 SXM5 advertises 989 TFLOPS of FP16 compute and 3.35 TB/s of HBM3 memory bandwidth. A naive reading says the chip is fast. A correct reading asks: fast at what? Those two numbers, divided, reveal the machine's arithmetic intensity ridge point at roughly 295 FLOPS/byte. Every operation your model performs either lives above that ridge (compute-bound) or below it (memory-bound), and the datasheet tells you which regime you are in before you run a single kernel.
This concept walks through the four ratios that matter, what they expose about real workloads, and where the datasheet picture breaks down.
The Four Numbers That Actually Matter
Every accelerator datasheet, whether for an NVIDIA GPU, Google TPU, or AMD Instinct card, ultimately contains four quantities worth extracting:
| Quantity | Symbol | Example (H100 SXM5) |
|---|---|---|
| Peak compute (at your dtype) | F | 989 TFLOPS (FP16) |
| Memory bandwidth (HBM) | BW_mem | 3.35 TB/s |
| On-chip / shared-memory bandwidth | BW_sram | ~33 TB/s (est.) |
| NVLink / interconnect bandwidth | BW_link | 900 GB/s (bidirectional) |
From these you derive:
Ridge point (also called the operational intensity threshold):
I* = F / BW_mem [FLOPS/byte]
For H100: I* = 989e12 / 3.35e12 ≈ 295 FLOPS/byte
Any kernel whose arithmetic intensity I = FLOPS / bytes_accessed satisfies:
- I > I*: compute-bound; bottleneck is the tensor cores, not DRAM.
- I < I*: memory-bound; the ALUs sit idle waiting for data.
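A matrix multiply of shape (M, N, K) with 2-byte elements has arithmetic intensity ≈ 2MNK / (2(MN + MK + NK)), where the denominator counts each of the three matrices once. For the large square GEMMs typical in transformer layers (M=N=K=4096), I ≈ 1365 FLOPS/byte, comfortably compute-bound. For an elementwise activation, which does about one FLOP per element while reading and writing it, I is at most about 1 FLOP/byte, deeply memory-bound. Same chip; completely different bottleneck.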
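The comparison against the ridge point can be sketched in a few lines of Python. The function names are illustrative, the default of 2 bytes per element assumes FP16/BF16, and the traffic model counts each matrix exactly once, which ignores caching and tiling effects. The M=1 case is added here as an assumed example of a matrix-vector product.

```python
def ridge_point(peak_flops: float, mem_bw_bytes: float) -> float:
    """Arithmetic intensity (FLOP/byte) where compute time equals memory time."""
    return peak_flops / mem_bw_bytes


def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOP per byte for C[m,n] = A[m,k] @ B[k,n], each matrix touched once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * n + m * k + k * n)
    return flops / bytes_moved


def regime(intensity: float, ridge: float) -> str:
    """Classify a kernel by which side of the ridge it falls on."""
    return "compute-bound" if intensity > ridge else "memory-bound"


h100_ridge = ridge_point(989e12, 3.35e12)      # datasheet figures from the table
big_gemm = gemm_intensity(4096, 4096, 4096)    # large square GEMM
gemv = gemm_intensity(1, 4096, 4096)           # assumed example: matrix-vector product

print(f"ridge       : {h100_ridge:7.1f} FLOP/byte")
print(f"4096^3 GEMM : {big_gemm:7.1f} -> {regime(big_gemm, h100_ridge)}")
print(f"M=1 GEMV    : {gemv:7.1f} -> {regime(gemv, h100_ridge)}")
```

The same weights can land on either side of the ridge depending on M, which is why the classification has to be done per kernel shape rather than per model.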
Decoding the Dtype Table
Datasheets list peak TFLOPS for multiple formats: FP64, TF32, FP16, BF16, FP8, INT8. On tensor cores the rate roughly doubles each time the operand width halves (TF32 to FP16/BF16 to FP8 on H100), because twice as many operands fit through the same datapath; FP64 sits outside that pattern. The catch: these are dense tensor-core rates, which assume matrix dimensions aligned to the hardware tile sizes. The sparse (2:4 sparsity) numbers, sometimes also listed, are exactly double the dense figures.
Key reads:
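- FP32 (CUDA cores) - useful for scalar or non-tensor-core code, roughly 15x slower than dense BF16 tensor-core peak on H100 (67 vs. 989 TFLOPS).
- TF32 - tensor-core math mode for float32 matmuls and convolutions on Ampere and later, separate from AMP (which casts to FP16 or BF16); PyTorch exposes it through torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.allow_tf32. 10-bit mantissa, so numerics differ from full FP32.
- FP8 E4M3 / E5M2 - Hopper and newer; halved memory footprint and doubled tensor throughput relative to FP16, but requires explicit scaling because the dynamic range is narrow (E4M3 tops out at 448 and has no infinity encoding).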
A workload with BF16 operands and FP32 accumulation (the common case) runs its matmuls at the BF16 tensor-core rate; the FP32 rate applies only to work done outside the tensor cores in FP32, such as reductions. Reporting "I ran in BF16" conflates the two.
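Because peak compute depends on the format, so does the ridge point. The sketch below recomputes it per dtype. Only the FP16 figure and the HBM bandwidth come from this article; the other dense rates are assumed values recalled from public H100 SXM5 spec tables, so confirm them against the datasheet for your exact SKU.

```python
HBM_BW = 3.35e12  # bytes/s, H100 SXM5 figure used in this article

# Dense peak rates in FLOP/s. Only FP16/BF16 is taken from the article;
# the rest are assumptions for illustration and should be checked per SKU.
PEAK_DENSE = {
    "FP32 (CUDA cores)": 67e12,
    "TF32": 494e12,
    "FP16/BF16": 989e12,
    "FP8": 1979e12,
}


def ridge_points(peaks: dict, mem_bw: float) -> dict:
    """Ridge point in FLOP/byte for each format's peak rate."""
    return {name: flops / mem_bw for name, flops in peaks.items()}


ridges = ridge_points(PEAK_DENSE, HBM_BW)
for name, ridge in ridges.items():
    print(f"{name:18s} ridge = {ridge:6.1f} FLOP/byte")
```

A compute-bound or memory-bound label is therefore only meaningful for a stated format. Narrower operands also shrink the bytes moved, so a kernel's own intensity shifts along with the ridge.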
Memory Hierarchy and Effective Bandwidth
HBM bandwidth is the headline number, but there are three other bandwidths the datasheet buries or omits:
- L2 cache bandwidth. H100 has 50 MB L2 at roughly 12 TB/s effective bandwidth. A kernel that reuses data within L2 is shielded from HBM entirely.
- Shared memory (SRAM) bandwidth. Each SM has 228 KB shared memory; aggregate bandwidth across 132 SMs exceeds 30 TB/s. Fused kernels (FlashAttention, fused LayerNorm) exploit this.
- NVLink / NVSwitch. For multi-GPU inference or pipeline parallelism, the 900 GB/s NVLink figure replaces PCIe 5.0 x16 (about 64 GB/s per direction) as the relevant interconnect. All-reduce over 8 GPUs sees ~450 GB/s effective throughput (half ring, one direction).
The practical rule: if your operation's working set fits in L2, the HBM bandwidth is irrelevant. This is why kernel fusion matters so much; chaining multiple memory-bound ops into one kernel keeps activations in fast SRAM and only pays the HBM cost once.
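The benefit of fusion can be estimated by counting HBM round trips. The sketch below is a simplified traffic model, not a measurement: it assumes a chain of elementwise ops over one tensor, 2-byte elements, and a hypothetical activation shape chosen only for illustration.

```python
def hbm_traffic_bytes(num_elems: int, num_ops: int,
                      bytes_per_elem: int = 2, fused: bool = False) -> int:
    """HBM bytes moved by a chain of elementwise ops over one tensor.

    Unfused: every op reads its input from HBM and writes its output back.
    Fused: one read at the start and one write at the end; intermediates
    are assumed to stay on chip.
    """
    passes = 1 if fused else num_ops
    return passes * 2 * num_elems * bytes_per_elem


elems = 32 * 8192 * 8192   # hypothetical activation: batch 32, seq 8192, hidden 8192
unfused = hbm_traffic_bytes(elems, num_ops=3)
fused = hbm_traffic_bytes(elems, num_ops=3, fused=True)

bw = 3.35e12  # bytes/s, H100 HBM3 headline figure
print(f"unfused: {unfused / 1e9:.1f} GB, {unfused / bw * 1e3:.2f} ms at peak bandwidth")
print(f"fused  : {fused / 1e9:.1f} GB, {fused / bw * 1e3:.2f} ms at peak bandwidth")
```

Under this model the saving scales with the number of ops fused, since each removed intermediate eliminates one write and one read of the full tensor.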
A simple capacity check:
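```python
# Does the KV-cache for one attention head fit in L2?
seq_len = 8192
d_head = 128
bytes_per_elem = 2      # BF16
l2_bytes = 50 * 2**20   # H100 L2: 50 MB

kv_bytes = 2 * seq_len * d_head * bytes_per_elem  # K and V
# = 2 * 8192 * 128 * 2 = 4 MiB

# If it fits, repeated reads of this head's K/V can be served from L2, not HBM.
fits = kv_bytes <= l2_bytes
print(f"KV-cache: {kv_bytes / 2**20:.0f} MiB, fits in L2: {fits}")
```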
Interconnect Bandwidth and the Scaling Cliff
When a model is sharded across multiple accelerators, the interconnect bandwidth becomes a first-class constraint that the single-chip datasheet does not capture.
For tensor parallelism with TP=8 and a Llama-style all-reduce after each MLP, the communication volume per token per layer is 2 * hidden_dim * bytes_per_element. At hidden 8192, BF16, that is 32 KB per layer per token. With 80 layers and batch 32, one forward pass moves:
comm_volume = 80 * 32 * 32e3 ≈ 81.9 MB
time_comm = 81.9e6 / 450e9 ≈ 0.18 ms (NVLink)
time_comm = 81.9e6 / 32e9 ≈ 2.56 ms (PCIe 4.0 x16)
NVLink makes TP-8 viable inside a node; PCIe makes it painful. The datasheet number you need is not always on the front page.
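The estimate above can be parameterised so that other model sizes and links are easy to try. This is a bandwidth-only sketch: it ignores link latency, collective algorithm overhead, and compute/communication overlap. The function and parameter names are illustrative, and the reading of the leading factor of 2 as a per-layer multiplier is an assumption. It uses the exact 32,768 bytes per layer per token rather than the rounded 32e3, so results come out slightly higher.

```python
def tp_comm_bytes(hidden_dim: int, num_layers: int, tokens: int,
                  bytes_per_elem: int = 2, per_layer_factor: int = 2) -> int:
    """Bytes exchanged per forward pass under the article's volume formula.

    per_layer_factor is the article's leading factor of 2; what it stands
    for physically is an assumption here.
    """
    per_token_per_layer = per_layer_factor * hidden_dim * bytes_per_elem
    return per_token_per_layer * num_layers * tokens


def comm_time_s(volume_bytes: float, link_bw_bytes_per_s: float) -> float:
    """Bandwidth-only transfer time; no latency or overlap modelled."""
    return volume_bytes / link_bw_bytes_per_s


volume = tp_comm_bytes(hidden_dim=8192, num_layers=80, tokens=32)
links = {"NVLink (effective)": 450e9, "PCIe 4.0 x16": 32e9}

print(f"volume per forward pass: {volume / 1e6:.1f} MB")
for name, bw in links.items():
    print(f"{name:20s} {comm_time_s(volume, bw) * 1e3:.2f} ms")
```

The ratio between the two links is fixed at roughly 14x regardless of model size; what changes with the model is whether the slower figure is small or large relative to the compute time of one step.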
When It Falls Down
Sustained vs. burst bandwidth. The HBM bandwidth in the datasheet is a theoretical peak derived from bus width and data rate; even STREAM-like sequential access lands somewhat below it. Real workloads with irregular access patterns (sparse attention, embedding lookups, KV-cache with variable sequence lengths) see 40-70% of headline bandwidth. Budget accordingly.
Thermal throttling. Sustained workloads on air-cooled or poorly configured liquid-cooled systems can cause GPU Boost clocks to drop 10-20% below the maximum boost clock that NVIDIA's datasheet TFLOPS figure assumes, so actual sustained throughput is correspondingly lower.
Tile quantisation. Tensor-core efficiency drops for matrix dimensions not aligned to multiples of 16 bytes, i.e. 8 elements for FP16 or 16 for FP8. A matrix of shape (4097, 4096) needs an entire extra tile row to cover its one leftover row. Real batch sizes and sequence lengths are rarely round numbers. The NVIDIA performance guide shows efficiency dropping to 70-85% with misaligned dimensions.
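The geometric part of this loss is easy to estimate. The sketch below computes what fraction of the launched tile area holds real output elements. The 128x128 tile size is an assumption (actual tile shapes depend on the kernel the library selects), and the model ignores wave quantisation and alignment penalties inside a tile, so measured efficiency can be lower than this number.

```python
import math


def tile_utilization(m: int, n: int, tile_m: int = 128, tile_n: int = 128) -> float:
    """Fraction of launched output-tile area that holds real elements."""
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    return (m * n) / (tiles * tile_m * tile_n)


for shape in [(4096, 4096), (4097, 4096), (129, 128)]:
    print(shape, f"{tile_utilization(*shape):.3f}")
```

For a large matrix one extra tile row costs only a few percent in this model, while the same one-row overshoot on a small matrix wastes about half of the launched work. Padding dimensions up to the tile multiple is the usual remedy.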
Memory capacity vs. bandwidth tradeoff in consumer chips. NVIDIA's consumer lineup (RTX 4090) ships with GDDR6X at 1 TB/s but only 24 GB, versus the H100's 80 GB HBM3 at 3.35 TB/s. The GDDR card has lower bandwidth but comparable L2 size; long-context inference whose working set spills out of L2 pays a steeper penalty there because device-memory bandwidth is lower and capacity is smaller.
Multi-die and chiplet designs. AMD MI300X is a multi-die package with 192 GB HBM3 shared across eight compute dies (XCDs). Die-to-die bandwidth is listed separately from HBM bandwidth. Naive TFLOPS comparison against a monolithic die ignores cross-die latency for operations that span the chiplet boundary.
Further Reading
- NVIDIA Deep Learning Performance: GPU Background - NVIDIA's own framework for compute vs. memory-bound analysis, with the A100 as the running example.
- NVIDIA Deep Learning Performance: Matrix Multiplication - Tile quantisation, tensor-core alignment rules, and arithmetic intensity for GEMMs.
- Google TPU System Architecture - MXU systolic array, ICI interconnect, and HBM layout across TPU generations.
- Benchmarking TPU, GPU, and CPU Platforms for Deep Learning (arXiv:1907.10701) - Empirical comparison of TPU v2/v3, V100, and Skylake revealing real bottlenecks behind headline specs.