Normalisation: BatchNorm, LayerNorm, RMSNorm

Normalisation layers rescale activations so that downstream layers receive inputs with stable statistics. The technique can shorten training of deep networks substantially and is now standard in most modern architectures. The choice of normaliser is not a detail: it determines whether your model can be trained with batch size 1, whether you can serve with batches of varying size, and how fast each forward pass is.

BatchNorm

Introduced by Ioffe and Szegedy (2015) and first applied to CNNs. For each feature channel, compute mean and variance across the batch dimension (and, for convolutional feature maps, the spatial dimensions as well), then normalise:

mu_c = mean over batch of x[:, c]
var_c = var over batch of x[:, c]
x_hat = (x - mu_c) / sqrt(var_c + eps)
y = gamma_c * x_hat + beta_c

gamma and beta are learnable per-channel scale and shift. At inference, running averages of mean and variance are used instead of batch statistics.
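The difference between training and inference behaviour can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the class name `BatchNorm1d`, the `momentum` update rule for the running averages, and the use of the biased variance everywhere are assumptions made for the example, and gamma and beta are held fixed rather than learned.

```python
import math

class BatchNorm1d:
    """Toy per-channel batch normalisation over a batch of feature vectors."""

    def __init__(self, num_channels, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum               # assumed update rate for running averages
        self.gamma = [1.0] * num_channels      # scale (learnable in practice, fixed here)
        self.beta = [0.0] * num_channels       # shift (learnable in practice, fixed here)
        self.running_mean = [0.0] * num_channels
        self.running_var = [1.0] * num_channels

    def __call__(self, x, training):
        # x is a list of rows; each row holds one value per channel.
        n = len(x)
        out = [[0.0] * len(self.gamma) for _ in range(n)]
        for c in range(len(self.gamma)):
            if training:
                column = [row[c] for row in x]
                mu = sum(column) / n
                var = sum((v - mu) ** 2 for v in column) / n  # biased variance
                # Exponential moving average of the batch statistics.
                self.running_mean[c] += self.momentum * (mu - self.running_mean[c])
                self.running_var[c] += self.momentum * (var - self.running_var[c])
            else:
                # Inference: ignore the batch, use the stored averages.
                mu, var = self.running_mean[c], self.running_var[c]
            scale = self.gamma[c] / math.sqrt(var + self.eps)
            for i in range(n):
                out[i][c] = scale * (x[i][c] - mu) + self.beta[c]
        return out

bn = BatchNorm1d(num_channels=2)
batch = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
train_out = bn(batch, training=True)           # normalised with batch statistics
eval_out = bn([[3.0, 30.0]], training=False)   # batch size 1, running statistics
```

In training mode each output column has zero mean, while the same input row passed in inference mode is normalised with the running averages and comes out different. After a single update the running averages are still far from the batch statistics, which exaggerates the gap, but the same mismatch exists in milder form in a trained model.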

The original justification was "reducing internal covariate shift" - keeping the distribution of layer inputs stable across training. Santurkar et al. (2018) argued that this story does not hold up: in their experiments BatchNorm's benefit was not tied to reducing covariate shift. Instead, BatchNorm smooths the loss landscape, making gradients more predictable and allowing larger learning rates. The technique was right; the explanation was not.

Why transformers do not use BatchNorm

Three problems make BatchNorm a poor fit for sequence models:

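Variable batch shapes at inference. Generation often runs with batch size 1, where batch statistics are meaningless, so BatchNorm must fall back on running statistics, and these can differ from the per-batch statistics the model saw during training.
Padding tokens corrupt statistics. In a batch of variable-length sequences, padded positions skew the mean and variance unless they are explicitly masked out.
Distributed training pain. Synchronising batch statistics across data-parallel workers requires an all-reduce per BN layer; skipping the synchronisation leaves each worker with noisy statistics from its small local batch.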
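The padding problem is easy to see with numbers. The sketch below assumes statistics are pooled over every position in the batch, that padding is represented by 0.0, and that a 0/1 `mask` marks real tokens; all three are illustrative choices, not taken from any particular library.

```python
# Two sequences of one scalar feature, padded with 0.0 to length 4.
PAD = 0.0
batch = [[2.0, 4.0, 6.0, 8.0], [5.0, PAD, PAD, PAD]]
mask = [[1, 1, 1, 1], [1, 0, 0, 0]]  # 1 = real token, 0 = padding

def mean_var(values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

# Naive statistics treat padded positions as data.
all_positions = [v for row in batch for v in row]
# Masked statistics keep only the real tokens.
real_positions = [
    v for row, m_row in zip(batch, mask) for v, m in zip(row, m_row) if m
]

naive_mu, naive_var = mean_var(all_positions)
masked_mu, masked_var = mean_var(real_positions)
```

Here three padded positions out of eight pull the mean from 5.0 down to 3.125 and roughly double the variance. Masking fixes the numbers but adds bookkeeping to every normalisation layer.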

LayerNorm (Ba, Kiros, Hinton 2016) normalises across the feature dimension for each token independently:

mu = mean over features of x_t
var = var over features of x_t
x_hat = (x_t - mu) / sqrt(var + eps)
y = gamma * x_hat + beta

No batch dependency, no padding issue, no all-reduce. LayerNorm became the default for sequence models: the original Transformer, BERT, GPT-2 and GPT-3 all use it. GPT-4's architecture has not been published, and some later models use RMSNorm instead.
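A minimal plain-Python sketch of the per-token computation makes the batch independence concrete. The function name `layer_norm` and the example vectors are illustrative; a real implementation would operate on tensors and learn gamma and beta.

```python
import math

def layer_norm(x_t, gamma, beta, eps=1e-5):
    """Normalise one token's feature vector x_t across its features."""
    d = len(x_t)
    mu = sum(x_t) / d
    var = sum((v - mu) ** 2 for v in x_t) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x_t, gamma, beta)]

gamma = [1.0, 1.0, 1.0, 1.0]
beta = [0.0, 0.0, 0.0, 0.0]
token = [1.0, 2.0, 3.0, 4.0]
other = [100.0, -50.0, 0.0, 7.0]

# The same token, normalised alone and as part of a two-token batch.
alone = layer_norm(token, gamma, beta)
in_batch = [layer_norm(t, gamma, beta) for t in [token, other]][0]
```

Because the statistics come only from the token's own features, `alone` and `in_batch` are identical no matter what else is in the batch, and no running averages need to be stored for inference.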

Pre-norm vs post-norm

The original Transformer placed LayerNorm after the residual addition:

y = LayerNorm(x + Sublayer(x))   # post-norm

Modern transformers put it before:

y = x + Sublayer(LayerNorm(x))   # pre-norm

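Pre-norm is considerably easier to train at depth: the residual path is an identity connection, so gradients reach early layers without passing through a normaliser. The trade-off is that a well-tuned post-norm model can be slightly more accurate at shallow depths, such as the original 6-layer translation setup. That gap matters little for deep stacks, where post-norm often needs careful learning-rate warmup to train stably.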
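The two placements can be compared in a small plain-Python sketch. Everything here is illustrative: `toy_sublayer` is a hypothetical stand-in for attention or an MLP, and `layer_norm` omits the learnable gamma and beta to keep the example short.

```python
import math

def layer_norm(x, eps=1e-5):
    # LayerNorm without learnable parameters.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def post_norm_block(x, sublayer):
    # Normaliser sits on the residual path: every layer rescales x itself.
    return layer_norm(add(x, sublayer(x)))

def pre_norm_block(x, sublayer):
    # Normaliser sits on the branch: x is carried forward by identity.
    return add(x, sublayer(layer_norm(x)))

def toy_sublayer(x):
    # Hypothetical stand-in for attention or an MLP.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post = x
pre = x
for _ in range(4):  # stack four blocks of each kind
    post = post_norm_block(post, toy_sublayer)
    pre = pre_norm_block(pre, toy_sublayer)
```

After four blocks the post-norm output has zero mean, because the last operation is always a normaliser, while the pre-norm output still carries the input's mean of 2.5 along the untouched residual path. Because that residual stream is never normalised, pre-norm stacks typically add one final normaliser after the last block.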

RMSNorm

Zhang and Sennrich (2019) hypothesised that LayerNorm's mean-centring contributes little and that the rescaling does most of the work. RMSNorm drops the mean subtraction and the shift beta, which also saves computation:

rms = sqrt(mean(x^2) + eps)
y = gamma * x / rms
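The relationship to LayerNorm can be checked with a short plain-Python sketch. The function names and example vectors are illustrative, and `layer_norm` omits beta so that the two functions are directly comparable.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    # Divide by the root mean square; no mean subtraction, no shift.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

def layer_norm(x, gamma, eps=1e-6):
    # LayerNorm without beta, for comparison.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) for v, g in zip(x, gamma)]

gamma = [1.0, 1.0, 1.0, 1.0]
centred = [-3.0, -1.0, 1.0, 3.0]          # mean is exactly zero
shifted = [v + 10.0 for v in centred]     # same shape, mean of 10

rms_centred = rms_norm(centred, gamma)
ln_centred = layer_norm(centred, gamma)
rms_shifted = rms_norm(shifted, gamma)
ln_shifted = layer_norm(shifted, gamma)
```

On zero-mean input the two normalisers agree, since the variance then equals the mean of squares. On the shifted input LayerNorm removes the offset and returns the same result as before, while RMSNorm keeps it, so its output has a non-zero mean.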
