Mixed-Precision Kernels

A single A100 GPU delivers 312 TFLOPS of dense FP16/BF16 tensor-core throughput but only 19.5 TFLOPS in standard FP32 (156 TFLOPS with TF32 tensor cores). That 16x gap is not free: you earn it by convincing every kernel in your training or inference stack to operate at reduced precision without corrupting the model. Closing the distance between available compute and what naive code actually uses is the central problem mixed-precision engineering solves.

The Precision Hierarchy and What Each Format Costs

Modern GPU kernels work across at least four floating-point formats. Understanding their bit layouts explains which operations break at low precision and which do not.

Format	Sign bits	Exponent bits	Mantissa bits	Range (min normal to max)	Notes
FP32	1	8	23	~1e-38 to 3e38	Default training dtype
FP16	1	5	10	~6e-5 to 65504	Fast; narrow range is a trap
BF16	1	8	7	~1e-38 to 3e38	Same range as FP32, fewer mantissa bits
FP8	1	4 or 5	3 or 2	varies	H100+; two sub-formats (E4M3, E5M2)
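
The range column follows directly from the field widths. As a rough sketch (the helper name `format_limits` is made up for this illustration), the limits of any format that reserves its top exponent code for inf/NaN can be computed as below. That assumption holds for FP32, FP16, BF16, and FP8 E5M2; E4M3 reuses most of its top exponent code for ordinary values, so it is deliberately left out.

```python
def format_limits(exp_bits: int, man_bits: int) -> dict:
    """Limits of an IEEE-754-style binary format with the given field widths.

    Assumes the top exponent code is reserved for inf/NaN, as in FP32, FP16,
    BF16, and FP8 E5M2. FP8 E4M3 breaks this rule, so it is not covered here.
    """
    bias = 2 ** (exp_bits - 1) - 1
    return {
        "max": (2.0 - 2.0 ** -man_bits) * 2.0 ** bias,  # largest finite value
        "min_normal": 2.0 ** (1 - bias),                # smallest normal value
        "epsilon": 2.0 ** -man_bits,                    # gap between 1.0 and the next value
    }


for name, e, m in [("FP32", 8, 23), ("FP16", 5, 10), ("BF16", 8, 7), ("E5M2", 5, 2)]:
    lim = format_limits(e, m)
    print(f"{name}: max={lim['max']:.4g} "
          f"min_normal={lim['min_normal']:.4g} eps={lim['epsilon']:.4g}")
```

The exponent width sets the range and the mantissa width sets the spacing between neighbouring values, which is why BF16 and FP16 fail in different ways.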

BF16 has the same exponent width as FP32, so a value that is finite in FP32 almost never overflows in BF16. FP16's 5-bit exponent caps finite values at 65504 and its normal values bottom out near 6e-5, which is why large activations and gradient norms overflow, and small gradients underflow to zero, during training without intervention. FP8 pushes further: the E4M3 sub-format keeps more mantissa bits and is typically used for forward-pass activations, while E5M2 keeps more range and is typically used for backward-pass gradients.
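
To see both failure modes concretely, here is a small sketch that rounds Python floats through FP16 and an emulated BF16 using only the standard library. The helper names `to_fp16` and `to_bf16` are invented for this example, and the BF16 path is an emulation (round an FP32 bit pattern to its top 16 bits), not a call into any GPU library.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value (IEEE binary16), saturating to inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # struct refuses values that round past 65504
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate BF16 by rounding an FP32 bit pattern to its top 16 bits.

    Intended for finite inputs inside FP32 range; NaN handling is omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_fp16(70000.0))  # inf: past the FP16 maximum of 65504
print(to_bf16(70000.0))  # 70144.0: in range, but only 8 significant bits
print(to_fp16(1e-8))     # 0.0: below the smallest FP16 subnormal
print(to_bf16(1e-8))     # a nonzero value close to 1e-8
```

FP16 loses the value entirely at both ends of its range, while BF16 keeps the magnitude and pays with coarser rounding.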

For matrix multiplications, peak tensor-core throughput scales roughly inversely with operand width: each halving of the input format roughly doubles the available FLOPS. Accuracy survives because the tensor cores natively accumulate in a wider format. An A100's FP16 tensor core multiplies two FP16 inputs but accumulates into FP32; the final write can be downcast. This accumulate-in-higher-precision pattern is the foundation all mixed-precision kernels exploit.
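
The value of the wider accumulator shows up even in a toy dot product. The sketch below is a CPU emulation under stated assumptions, not tensor-core code: `to_fp16` and `to_fp32` are illustrative helpers that round a Python float to the named format, and `dot` rounds the running sum after every addition to mimic an accumulator of that width.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value (inputs here stay in range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, round_acc):
    """Dot product of FP16 inputs with an accumulator rounded by round_acc."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(x) * to_fp16(y)  # FP16 x FP16 product is exact in a double
        acc = round_acc(acc + prod)     # emulate the accumulator's precision
    return acc


ones = [1.0] * 4096
print(dot(ones, ones, to_fp16))  # 2048.0: FP16 spacing is 2 above 2048, so +1 is lost
print(dot(ones, ones, to_fp32))  # 4096.0: the wider accumulator keeps every term
```

The inputs are identical in both runs; only the accumulator width changes, and that alone decides whether half of the sum disappears.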

The Three-Tier Storage Pattern

Micikevicius et al. (ICLR 2018) codified the canonical pattern for mixed-precision training: store weights in FP32 as a "master copy", cast them to FP16 for each forward and backward pass, accumulate gradients in FP32, and update the master copy. The memory cost of carrying both weight copies is usually outweighed by halving the size, and therefore the bandwidth, of every activation tensor.

Master weights  (FP32, on DRAM)
       |
   cast to FP16
       |
Forward pass    (FP16 compute, FP32 accumulation in tensor cores)
       |
Activations     (FP16, checkpointed or recomputed)
       |
Backward pass   (FP16 compute)
       |
Gradients       (FP16, scaled by S) ─── cast to FP32, divide by S ──► FP32 grads ──► FP32 weight update
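
Why keep the FP32 master copy at all? A minimal sketch, assuming a made-up learning rate of 1e-4 and a constant gradient of 1.0, shows what happens when the update is smaller than the FP16 spacing around the weight. The helpers `to_fp16` and `to_fp32` are illustrative CPU emulations of each format's rounding.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value (inputs here stay in range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


lr, grad = 1e-4, 1.0          # hypothetical values chosen for the demo
w_fp16_only = to_fp16(1.0)    # weight stored and updated purely in FP16
master = to_fp32(1.0)         # FP32 master copy

for _ in range(1000):
    # FP16-only update: 1.0 - 1e-4 rounds straight back to 1.0
    w_fp16_only = to_fp16(w_fp16_only - to_fp16(lr * grad))
    # Mixed precision: update the master, recast for the next forward pass
    master = to_fp32(master - lr * grad)
    w_working = to_fp16(master)

print(w_fp16_only)  # 1.0: every update was rounded away
print(master)       # close to 0.9: the updates accumulated
print(w_working)    # the FP16 working copy tracks the master
```

Each individual update is invisible in FP16, but a thousand of them move the master weight by about 10 percent, and the recast working copy follows it.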

Loss scaling multiplies the loss by a large constant S (commonly 2^8 to 2^15) before the backward pass, shifting small gradient magnitudes up into the representable FP16 range, then divides the accumulated FP32 gradients by S before the weight update. Dynamic loss scaling raises S (typically by doubling it) after N consecutive steps without an inf/NaN, and on any inf/NaN detection it skips that step's weight update and halves S immediately.
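
The scaling logic fits in a few lines. The class below is a hypothetical minimal scaler written for this article, not the API of any framework: the name `DynamicLossScaler`, the doubling growth step, and the default interval of 2000 steps are all assumptions. It works on plain Python floats, with `to_fp16` standing in for the FP16 storage of a gradient.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value, saturating to inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


class DynamicLossScaler:
    """Illustrative scaler: double S after N clean steps, halve it on inf/NaN."""

    def __init__(self, init_scale: float = 2.0 ** 15, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval  # the N in the text
        self._good_steps = 0

    def unscale(self, scaled_grads):
        """Divide by S in full precision; report whether any value is inf/NaN."""
        found_inf = any(not math.isfinite(g) for g in scaled_grads)
        return [g / self.scale for g in scaled_grads], found_inf

    def update(self, found_inf: bool) -> None:
        if found_inf:
            self.scale /= 2.0  # back off at once; caller skips this step's update
            self._good_steps = 0
            return
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2.0
            self._good_steps = 0


true_grad = 1e-8  # hypothetical gradient, too small for FP16
scaler = DynamicLossScaler(init_scale=2.0 ** 15, growth_interval=3)

lost = to_fp16(true_grad)                     # 0.0 without scaling
stored = to_fp16(true_grad * scaler.scale)    # survives in FP16 when scaled
(recovered,), found_inf = scaler.unscale([stored])
print(lost, recovered, found_inf)
```

If `unscale` reports an inf/NaN, the caller skips the optimizer step and calls `update(True)`; otherwise it applies the unscaled gradients to the FP32 master weights and calls `update(False)`.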
