Writing a Fused Softmax

A naively written softmax reads a row of floats from HBM three times: once for the max, once for the exponentials, once for the normalisation. On an A100, HBM bandwidth is roughly 2 TB/s while the chip can execute hundreds of TFLOP/s. For softmax, which does almost no arithmetic per element, those three round-trips are essentially the entire cost. Fusing them into a single kernel pass is not an optimisation nicety; it is most of the work.

Why Softmax Is Memory-Bound by Default

Softmax along a row of length N is:

y_i = exp(x_i - max(x)) / sum_j(exp(x_j - max(x)))

The subtraction of max(x) is numerically important: without it, exp(x_i) overflows for large logits. But it creates a dependency: you cannot compute exp(x_i - max(x)) until you have seen the entire row to find max(x). A straightforward implementation therefore runs three separate GPU kernels:

1. Reduce over the row to compute m = max(x).
2. Compute e_i = exp(x_i - m) and write a temporary tensor to HBM.
3. Reduce again to compute s = sum(e_i), then divide.
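
The three steps above can be sketched on the CPU in plain Python. This is only an illustration of the data flow, not GPU code: the function name softmax_three_pass and the sample row are assumptions made for the example, and each list comprehension or reduction stands in for one kernel launch.

```python
import math

def softmax_three_pass(row):
    """Softmax of one row, written as three separate passes over the data."""
    # Pass 1: reduce over the row to find the maximum.
    m = max(row)
    # Pass 2: exponentiate the shifted values into a temporary buffer
    # (on a GPU this buffer would sit in HBM between kernels).
    e = [math.exp(x - m) for x in row]
    # Pass 3: reduce the temporary buffer to get the sum, then divide.
    s = sum(e)
    return [v / s for v in e]

# Hypothetical logits, large enough that exp(x) alone would overflow.
row = [1000.0, 1001.0, 1002.0]
y = softmax_three_pass(row)
```

Calling math.exp(1000.0) directly raises OverflowError in Python, which is the failure the max subtraction avoids: after the shift, the largest exponent is exactly 0.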

Each kernel reads or writes the full row. For a matrix of shape (M, N), the Triton tutorial's accounting for an unfused, op-by-op PyTorch implementation is 5MN + 2M element reads plus 3MN + 2M element writes. A fused kernel that keeps values in registers and shared memory across the reductions needs only MN reads and MN writes. The ratio of the two totals, (8MN + 4M) / 2MN, is about 4, and for a bandwidth-bound kernel that ratio is the expected speedup.
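
To make the accounting concrete, here is a small sketch that plugs a shape into those element counts. The function names and the shape (4096, 1024) are hypothetical, chosen only for illustration; the formulas are the ones quoted above.

```python
def naive_traffic(M, N):
    """Elements moved to and from HBM by the unfused version."""
    reads = 5 * M * N + 2 * M
    writes = 3 * M * N + 2 * M
    return reads + writes

def fused_traffic(M, N):
    """Elements moved by a fused kernel: one read and one write per element."""
    return 2 * M * N

M, N = 4096, 1024  # hypothetical shape, for illustration only
ratio = naive_traffic(M, N) / fused_traffic(M, N)
print(f"traffic ratio: {ratio:.3f}")
```

Algebraically the ratio is 4 + 2/N, so it sits just above 4 for any realistic row length and does not depend on M.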

The Triton fused-softmax tutorial benchmarks this directly and finds the Triton implementation is roughly 4x faster than torch.jit.script-ed PyTorch and competitive with (or faster than) torch.softmax for large enough rows.

The Online Softmax Algorithm

The core algorithmic insight is the online softmax trick, which lets you compute max and sum in a single streaming pass rather than two:

# Streaming over blocks of the row
m = -inf
d = 0.0
for each block b of x:
    m_new = max(m, max(b))
    d = d * exp(m - m_new) + sum(exp(b - m_new))
    m = m_new

After the pass, m holds the row maximum and d holds the correct normalisation denominator sum(exp(x_i - m)). A second pass writes the outputs:

for each block b of x:
    y_b = exp(b - m) / d
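
The two loops above can be turned into a runnable CPU sketch. The function name online_softmax, the block_size parameter, and the sample row are assumptions made for this example; it assumes finite inputs and makes no attempt to model GPU memory.

```python
import math

def online_softmax(row, block_size=4):
    """Two-pass softmax using the streaming max/sum update."""
    m = -math.inf  # running maximum
    d = 0.0        # running denominator, always expressed relative to m
    # Pass 1: a single streaming read produces both m and d.
    for start in range(0, len(row), block_size):
        b = row[start:start + block_size]
        m_new = max(m, max(b))
        # Rescale the old sum to the new maximum, then add this block.
        d = d * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in b)
        m = m_new
    # Pass 2: a second read writes the normalised outputs.
    return [math.exp(x - m) / d for x in row]

# Hypothetical row whose maximum arrives in the second block.
row = [0.5, -1.0, 3.0, 2.0, 10.0, -4.0, 7.5]
y = online_softmax(row, block_size=3)
```

On the first block m is -inf, so exp(m - m_new) evaluates to 0 and the empty running sum contributes nothing; no special case is needed for initialisation.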

This version reads the row from HBM twice and writes it once, which saves one full read and one full write relative to the naive three-kernel version. When a whole row fits in registers or shared memory, a Triton or CUDA kernel can go further: load the row once, keep it on chip while computing m and d, and write the outputs from the on-chip copy. That reduces HBM traffic to one read and one write per element.

The rescaling factor exp(m - m_new) is worth checking once. When a block raises the running maximum from m to m_new, every previously accumulated term exp(x_i - m) should have been exp(x_i - m_new) = exp(x_i - m) * exp(m - m_new), so each one is too large by a factor of exp(m_new - m). Multiplying the running sum d by exp(m - m_new) corrects all of them at once, without re-reading the old data. If the maximum does not change, the factor is exp(0) = 1 and d is left untouched.
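
A quick numerical check of that argument, as a sketch with made-up values: the list seen and the new maximum 6.0 are hypothetical. It compares the rescaled running sum against a sum recomputed from scratch with the new maximum.

```python
import math

seen = [1.0, 2.5, -0.5]  # values already folded into the running sum
m = max(seen)            # old running maximum
d = sum(math.exp(x - m) for x in seen)

m_new = 6.0              # hypothetical larger maximum from a later block
rescaled = d * math.exp(m - m_new)                    # no re-read of `seen`
recomputed = sum(math.exp(x - m_new) for x in seen)   # full re-read
print(rescaled, recomputed)
```

The two values agree up to floating-point rounding, and rescaled is smaller than d because the factor exp(m - m_new) is below 1 whenever the maximum grows.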
