Applied LLMs
Writing a Fused Softmax
A fused softmax kernel collapses three separate memory-bound passes over a matrix row into one, cutting HBM traffic by roughly 4x and turning a memory-bound operation into a compute-limited one.
intermediate · 8 min read · Premium
A naively written softmax reads a row of floats from HBM three times: once for the max, once for the exponentials, once for the normalisation. On an A100, HBM bandwidth is roughly 2 TB/s while the chip can execute hundreds of TFLOP/s. For softmax - which does almost no arithmetic - those three round-trips are the entire cost. Fusing them into a single kernel pass is not an optimisation nicety; it is most of the work.
Why Softmax Is Memory-Bound by Default
Softmax along a row of length N is:
Keep reading with Pro.
You're reading the preview. Unlock the full concept plus the library, study plans, the AI mentor, and daily emails.