Applied LLMs

Writing a Fused Softmax

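A fused softmax kernel collapses three separate memory-bound passes over a matrix row into one, cutting HBM traffic by as much as roughly 4x (the exact factor depends on how many intermediate results the unfused version writes back). The operation stays memory-bound, but it now moves close to the minimum amount of data: one read and one write per element.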

intermediate · 8 min read · Premium

A naively written softmax reads a row of floats from HBM three times: once for the max, once for the exponentials, once for the normalisation. On an A100, HBM bandwidth is roughly 2 TB/s (on the 80 GB part), while the tensor cores peak at around 312 TFLOP/s of dense FP16. For softmax, which does almost no arithmetic per element loaded, those three round-trips are the entire cost. Fusing them into a single kernel pass is not an optimisation nicety; it is most of the work.
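To make the traffic argument concrete, here is a minimal sketch in plain Python rather than a real GPU kernel. The `CountingRow` class and both function names are hypothetical, introduced only for illustration: `CountingRow` stands in for a row stored in HBM and counts every element read and write, while an ordinary local list plays the role of registers or shared memory.

```python
import math


class CountingRow:
    """Hypothetical stand-in for a row in HBM; counts element reads and writes."""

    def __init__(self, values):
        self.values = list(values)
        self.reads = 0
        self.writes = 0

    def load(self, i):
        self.reads += 1
        return self.values[i]

    def store(self, i, v):
        self.writes += 1
        self.values[i] = v


def softmax_three_pass(src, dst):
    """Unfused version: every pass re-reads the row from 'HBM'."""
    n = len(src.values)
    # Pass 1: read the row to find its maximum.
    m = max(src.load(i) for i in range(n))
    # Pass 2: read the row again to sum the shifted exponentials.
    s = sum(math.exp(src.load(i) - m) for i in range(n))
    # Pass 3: read the row a third time, normalise, and write the result.
    for i in range(n):
        dst.store(i, math.exp(src.load(i) - m) / s)


def softmax_fused(src, dst):
    """Fused version: one read of the row, all arithmetic on the local copy."""
    n = len(src.values)
    row = [src.load(i) for i in range(n)]  # the single read from 'HBM'
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    for i in range(n):
        dst.store(i, exps[i] / s)  # the single write back


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    for fn in (softmax_three_pass, softmax_fused):
        src, dst = CountingRow(data), CountingRow([0.0] * len(data))
        fn(src, dst)
        print(fn.__name__, "reads:", src.reads, "writes:", dst.writes)
```

In this toy model the unfused version moves 4N elements (3N reads plus N writes) and the fused one moves 2N. Unfused implementations that also write intermediate results back to HBM between passes move more still. The sketch assumes the whole row fits in on-chip memory, which is what lets the fused version read it only once.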

Why Softmax Is Memory-Bound by Default

Softmax along a row of length N is:
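In its standard numerically stable form (the usual textbook definition, given here as an assumption about the intended formula), with $x_1, \dots, x_N$ the row entries and $m$ the row maximum:

$$
m = \max_{1 \le j \le N} x_j, \qquad
\operatorname{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_{j=1}^{N} e^{x_j - m}}, \qquad i = 1, \dots, N.
$$

The three terms map onto the three passes described above: $m$ is the max, the sum in the denominator is the pass over the exponentials, and the division is the normalisation. Subtracting $m$ does not change the result, since the factor $e^{-m}$ cancels between numerator and denominator, but it keeps every exponent at or below zero so that $e^{x_i - m}$ cannot overflow.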
