Applied LLMs
Mixed-Precision Kernels
Mixed-precision kernels reduce memory bandwidth and arithmetic cost by storing and computing in lower-precision formats while selectively preserving full precision where numerical stability demands it.
advanced · 8 min read · Premium
A single A100 GPU delivers 312 TFLOPS in FP16/BF16 but only 77 TFLOPS in FP32. That 4x gap is not free: you earn it by convincing every kernel in your training or inference stack to operate at reduced precision without corrupting the model. That gap between available compute and what naive code actually uses is the central tension mixed-precision engineering resolves.
The Precision Hierarchy and What Each Format Costs
Modern GPU kernels work across at least four floating-point formats. Understanding their bit layouts explains which operations break at low precision and which do not.
Keep reading with Pro.
You're reading the preview. Unlock the full concept plus the library, study plans, the AI mentor, and daily emails.