Applied LLMs
Floating-Point Formats for ML
ML accelerators expose a menu of floating-point formats that trade numerical range and precision for throughput and memory bandwidth; choosing the wrong one silently degrades accuracy or leaves peak FLOPS on the table.
intermediate · 8 min read
A100 GPUs deliver 312 TFLOPS in BF16 but only 19.5 TFLOPS in FP32. That is not a rounding error: it is a 16x gap driven entirely by format choice. Before you reach for torch.autocast, it helps to understand what you are actually trading away.
The anatomy of a floating-point number
Every IEEE 754 floating-point value is three fields packed into a fixed number of bits:
[sign | exponent | mantissa (significand)]
The value is: (-1)^sign * 2^(exponent - bias) * (1 + mantissa)
The exponent controls dynamic range (the ratio of the largest to smallest representable number). The mantissa controls precision (how finely you can distinguish two nearby values). Every bit you move from one field to the other buys range at the cost of precision, or vice versa.
The formats used in ML practice:
| Format | Sign | Exponent | Mantissa | Total bits | Dynamic range |
|---|---|---|---|---|---|
| FP32 | 1 | 8 | 23 | 32 | ~1.2e-38 to 3.4e+38 |
| FP16 | 1 | 5 | 10 | 16 | ~6e-8 to 65504 |
| BF16 | 1 | 8 | 7 | 16 | same as FP32 |
| FP8 E4M3 | 1 | 4 | 3 | 8 | ~1.9e-3 to 448 |
| FP8 E5M2 | 1 | 5 | 2 | 8 | ~1.5e-14 to 57344 |
FP32 is the reference. Everything else is a deliberate lossy compression of its number line.
FP16: the original training accelerant
FP16 (IEEE 754 binary16) halves memory and, on hardware with native FP16 units, doubles the arithmetic throughput relative to FP32. NVIDIA's Volta Tensor Cores were the first widely deployed hardware that made FP16 matrix multiply a first-class operation.
The problem is its tiny dynamic range. With a maximum value of 65504, gradient accumulations that exceed this threshold simply overflow to infinity. Underflow is the symmetric danger: gradients smaller than ~6e-8 flush to zero and vanish. The fix, introduced in the mixed-precision training paper (Micikevicius et al., ICLR 2018), is loss scaling: multiply the loss by a large scalar (e.g. 2^16) before backprop, then divide the gradients back before the weight update. This shifts the gradient distribution into the representable range.
The mixed-precision recipe that became standard:
1. Forward pass in FP16 (fast Tensor Core path).
2. Backward pass in FP16 with loss scaling.
3. Gradient unscale + overflow check in FP32.
4. Weight update in FP32 (a "master copy" of weights is kept in FP32).
5. Cast weights back to FP16 for the next forward pass.
The master copy exists because small weight updates applied directly to FP16 values can be rounded to zero; FP32 has enough mantissa bits to accumulate those tiny deltas.
BF16: range first, precision second
Google designed BF16 (Brain Float 16) for the original TPU. The format keeps FP32's 8-bit exponent but shrinks the mantissa from 23 bits to 7. The result is a 16-bit format with FP32's dynamic range and roughly FP32's overflow/underflow behaviour, but with far coarser precision (about 2 decimal digits versus FP32's 7).
Why does coarser precision matter less than range? Empirically, neural network weights and activations cluster in a relatively narrow dynamic range during training, but occasional large gradient spikes can exceed FP16's ceiling. BF16 sidesteps the overflow problem entirely: no loss scaling is needed. The tradeoff is that nearby values that FP16 could distinguish, BF16 rounds to the same number.
BF16 became popular in transformer training partly because it removes the loss-scaling bookkeeping. PyTorch's torch.autocast(device_type='cuda', dtype=torch.bfloat16) is now a one-liner for Ampere GPUs (A100) and newer. AMD's CDNA and Intel Gaudi accelerators also treat BF16 as the default mixed-precision type.
A useful mental model: FP16 is precision-first, BF16 is range-first. Both are lossy relative to FP32, just in different directions.
FP8: the inference and training frontier
The push to 8-bit floating point is driven by the arithmetic intensity argument. Halving precision doubles the number of multiply-accumulate operations you can pack into the same silicon area, and halves bandwidth for memory-bound operations.
Two FP8 variants are used in practice, formalised in Micikevicius et al. 2022:
- E4M3 (4-bit exponent, 3-bit mantissa): higher precision in a compressed range. Used for forward pass activations and weights, where precision matters more than catching extreme values.
- E5M2 (5-bit exponent, 2-bit mantissa): wider range, coarser precision. Used for backward pass gradients, which can spike more aggressively.
The two-format split mirrors the logic of FP16/BF16 specialisation but at 8 bits. NVIDIA H100 Tensor Cores natively support FP8 matrix multiply with FP16/FP32 accumulation. Training large language models (up to 175B parameters) with FP8 has been demonstrated to match FP16 baselines, though it requires per-tensor or per-block scaling factors to keep values in the narrow representable range.
FP8 inference is now common in production: vLLM, TensorRT-LLM, and Transformer Engine all support it. The throughput gain over BF16 is roughly 2x, at the cost of careful quantisation calibration.
When it falls down
FP16 overflow during training. Loss scaling mitigates this, but dynamic loss scaling (automatically doubling the scale when no overflow occurs, halving it on overflow) can interact badly with gradient clipping. A scale that oscillates too frequently is a sign the model's gradient distribution is pathological, not just a numerical artefact.
BF16 precision loss in sensitive layers. Softmax attention, layer normalisation, and cross-entropy loss all involve subtracting nearly equal numbers, a pattern where mantissa bits matter most. Implementations typically keep these reductions in FP32 regardless of the outer autocast context. If you bypass this (e.g. a custom kernel that stays in BF16 throughout), you can see training instability that is hard to attribute to a specific layer.
FP8 saturation. E4M3's maximum representable value is 448. A batch of activations with outlier channels (common in large transformers) will saturate FP8 without per-tensor scaling, producing clipped values that corrupt the forward pass. The scaling factors must be updated frequently (typically per-step or per-batch), which adds overhead and complexity.
Format mismatch across hardware. A model trained in BF16 on an A100 then deployed in FP16 on an older T4 (which does not support BF16 natively, falling back to FP32) will run at FP32 speed - completely erasing the intended throughput gain. Format support is hardware-specific and must be verified before deployment.
Accumulator width is still FP32 (or FP16) in most hardware. Tensor Cores compute the dot product in the narrow format but accumulate into a wider type. This is intentional: accumulated errors from low-precision multiplications would compound catastrophically without the wider accumulator. The throughput numbers advertised are for the multiply step, not the accumulate.