Dropout and Modern Regularisation

Regularisation is anything that reduces the gap between training loss and test loss. The 2012-2018 deep learning era leaned heavily on explicit, noise-based regularisers like dropout. The 2020+ LLM era leans mostly on more data. Understanding why the shift happened tells you when to reach for each tool.

Dropout as approximate ensembling

Srivastava et al (2014). During training, randomly zero each activation with probability p (typically 0.1-0.5), then scale the survivors by 1/(1-p) so the expected output is unchanged. At inference, run the full network.

mask = bernoulli(1 - p)   # 1 with prob 1-p, else 0
y = (x * mask) / (1 - p)

The intuition: you are training an ensemble of 2^N sub-networks (one per dropout mask over N droppable units) that share weights. At test time, the un-masked forward pass approximates the geometric mean of their predictions; the match is exact only for a single softmax layer and an approximation for deeper networks. Co-adaptation between specific neurons is discouraged because no neuron can rely on a specific other neuron being present.
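
A minimal runnable sketch of the masking-and-rescaling step above, using only the Python standard library. The function name `dropout` and its `training` and `rng` parameters are illustrative choices, not taken from any particular framework.

```python
import random

def dropout(x, p, training=True, rng=random):
    """Inverted dropout on a list of floats; p is the drop probability."""
    if not training or p == 0.0:
        return list(x)  # inference: full network, no rescaling needed
    keep = 1.0 - p
    # Each unit survives with probability `keep` and is scaled by 1 / keep,
    # so the expected value of every output equals its input.
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

rng = random.Random(0)
x = [1.0] * 100_000
y = dropout(x, p=0.5, rng=rng)

mean_train = sum(y) / len(y)                                  # close to 1.0
mean_eval = sum(dropout(x, p=0.5, training=False)) / len(x)   # exactly 1.0
```

With p = 0.5 every surviving activation becomes 2.0 and the rest become 0.0, so the training-time mean stays near the inference-time mean without any correction at test time.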

It worked spectacularly for dense MLPs and helped CNNs slightly. For RNNs and transformers, the picture is more nuanced.

Why modern LLMs barely use it

Open the Llama 3 or Mistral source code and you find dropout set to 0 across attention and FFN layers. Three reasons:

Data is the regulariser. When you train on 15 trillion tokens, you do not need to inject noise to prevent the model from memorising. The model does not have the capacity to memorise a corpus that size wholesale, even though it can still memorise passages that are repeated often.
It does not help at scale. When compute and data are scaled together, the model is underfitting rather than overfitting, and in that regime dropout is generally reported to have zero or negative effect on validation loss.
Throughput tax. Generating dropout masks and scaling activations is a measurable cost on the critical path of every layer.

You still see small amounts of attention dropout in fine-tuning recipes where the dataset is small and overfitting is real.

Label smoothing

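Szegedy et al (Inception-v3, 2015). Instead of training with one-hot targets [0, 0, 1, 0], mix them with a uniform distribution to get [eps/K, eps/K, 1 - eps + eps/K, eps/K], where K is the number of classes and eps is typically 0.1. For K = 4 that is [0.025, 0.025, 0.925, 0.025], which still sums to 1. Because no finite logits can match a one-hot target but moderate logits can match the smoothed one, the model is discouraged from producing arbitrarily confident outputs.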
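
A small sketch of the smoothing step and its effect on the loss, assuming the standard formulation in which the one-hot target is mixed with a uniform distribution (so the true class receives 1 - eps + eps/K). The helper names `smooth_labels` and `cross_entropy` and the example logits are hypothetical.

```python
import math

def smooth_labels(target, num_classes, eps=0.1):
    """Mix a one-hot target with the uniform distribution over classes."""
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == target else uniform
            for k in range(num_classes)]

def cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(target_dist, logits))

soft = smooth_labels(target=2, num_classes=4, eps=0.1)
hard = [0.0, 0.0, 1.0, 0.0]

confident = [0.0, 0.0, 20.0, 0.0]   # extreme logits for the true class
moderate = [0.0, 0.0, 4.0, 0.0]     # less extreme logits

# (loss for confident logits, loss for moderate logits) under each target
loss_hard = (cross_entropy(confident, hard), cross_entropy(moderate, hard))
loss_soft = (cross_entropy(confident, soft), cross_entropy(moderate, soft))
```

Under the one-hot target the extreme logits have the lower loss; under the smoothed target the ordering flips and the moderate logits win, which is the mechanism that caps confidence.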

Effects:

Calibration improves. Predicted probabilities better match empirical frequencies.
Slight accuracy gain on classification benchmarks.
KL distillation from a label-smoothed teacher gets harder: the teacher's outputs carry less of the inter-class similarity information (the "dark knowledge") that a student would otherwise learn from. Modern LLM distillation usually disables label smoothing.

Weight decay vs L2 regularisation

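In plain SGD the two are equivalent: adding (lambda / 2) * w^2 to the loss contributes lambda * w to the gradient, so each gradient-descent step is the same as multiplying w by (1 - lr * lambda) and then applying the usual gradient update. (If the penalty is written as lambda * w^2 without the 1/2, the factor becomes 1 - 2 * lr * lambda.)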
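
The equivalence can be checked numerically on a single scalar weight. This sketch assumes the penalty is written as (lambda / 2) * w^2, so that its gradient is lambda * w; the function names and the example numbers are illustrative.

```python
def sgd_l2_step(w, grad, lr, lam):
    """SGD on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)

def sgd_weight_decay_step(w, grad, lr, lam):
    """Shrink w by (1 - lr * lam), then take the plain gradient step."""
    return (1.0 - lr * lam) * w - lr * grad

w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01
via_l2 = sgd_l2_step(w, grad, lr, lam)
via_decay = sgd_weight_decay_step(w, grad, lr, lam)
```

Both calls produce the same updated weight up to floating-point rounding, because the two expressions are algebraically identical.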

In Adam they are not. Adam divides each gradient component by sqrt(v_t), where v_t is the running average of squared gradients (an uncentred second moment, not a variance). An L2 term added to the loss passes through the same division, so parameters with large gradient magnitudes get less effective decay than parameters with small ones. AdamW (covered in the optimisers note) decouples decay from the gradient: it applies w <- w - lr * lambda * w outside the adaptive step, so every parameter shrinks at the same relative rate. This is why almost every LLM trainer uses AdamW, not Adam-with-L2.
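
A toy simulation of the difference, under stated assumptions: one scalar weight starting at 1.0, a zero-mean data gradient that alternates between +grad_scale and -grad_scale, and default-style Adam hyperparameters. The function `run_adam` and all constants are hypothetical; the decoupled branch is AdamW-style rather than a copy of any library's implementation.

```python
import math

def run_adam(grad_scale, lam, decoupled, steps=2000, lr=1e-3,
             beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the final weight after `steps` Adam updates on one scalar."""
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_scale if t % 2 else -grad_scale  # zero-mean data gradient
        if not decoupled:
            g += lam * w                  # L2: decay term enters the moments
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)      # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)      # bias-corrected second moment
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            w -= lr * lam * w             # decay applied outside the moments
    return w

lam = 0.1
l2_big = run_adam(10.0, lam, decoupled=False)     # large gradients, Adam + L2
l2_small = run_adam(0.1, lam, decoupled=False)    # small gradients, Adam + L2
adamw_big = run_adam(10.0, lam, decoupled=True)   # large gradients, decoupled
adamw_small = run_adam(0.1, lam, decoupled=True)  # small gradients, decoupled
```

In this toy run the two decoupled weights end up essentially equal, while under Adam-with-L2 the weight with the larger gradients keeps most of its value and the weight with the smaller gradients decays much further.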

Stochastic depth

Huang et al (2016). Like dropout but on entire residual blocks. With probability p_L for layer L, skip the block and pass the input through unchanged:

y = x + (bernoulli(1 - p_L) / (1 - p_L)) * F(x)

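p_L usually increases linearly with depth, so deeper blocks are dropped more often. The 1/(1 - p_L) factor above is the inverted-scaling form common in drop-path implementations; the original paper leaves training unscaled and instead multiplies F(x) by the survival probability 1 - p_L at test time. The authors used it to train a 1202-layer ResNet on CIFAR-10 that kept improving with depth, where the same network trained at constant depth overfit. Vision Transformers (DeiT, Swin) use the same trick under the name "drop path."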
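
A minimal sketch of the formula above with a linear drop schedule, on scalar activations. The schedule here runs from 0 for the first block to p_max for the last, which is one common choice and an assumption of this example; the names `drop_probs`, `residual_block`, and `forward`, and the constant toy branch F(x) = 0.5, are hypothetical.

```python
import random

def drop_probs(num_blocks, p_max):
    """Linear schedule: first block never dropped, last dropped with p_max."""
    if num_blocks == 1:
        return [p_max]
    return [p_max * i / (num_blocks - 1) for i in range(num_blocks)]

def residual_block(x, f, p_drop, training, rng):
    """y = x + (b / (1 - p_drop)) * f(x), with b ~ Bernoulli(1 - p_drop)."""
    if not training or p_drop == 0.0:
        return x + f(x)                 # inference: every block is active
    if rng.random() < p_drop:
        return x                        # block skipped: identity path only
    return x + f(x) / (1.0 - p_drop)    # survivor rescaled to keep the mean

def forward(x, blocks, probs, training, rng):
    for f, p in zip(blocks, probs):
        x = residual_block(x, f, p, training, rng)
    return x

rng = random.Random(0)
blocks = [lambda x: 0.5] * 4            # toy residual branches
probs = drop_probs(len(blocks), p_max=0.5)

eval_out = forward(0.0, blocks, probs, training=False, rng=rng)
runs = 20_000
train_mean = sum(forward(0.0, blocks, probs, True, rng)
                 for _ in range(runs)) / runs
```

Averaged over many random skips, the training-time output stays close to the inference-time output, which is what the 1/(1 - p_L) rescaling is for.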

What modern training relies on instead

Technique	What it does	Where used
Massive pretraining data	Removes the overfitting regime entirely	All frontier LLMs
Weight decay (AdamW)	Shrinks every weight toward zero unless the gradient pushes back	Universal
Data augmentation	Synthetic input diversity	Vision, speech
Mixup / CutMix	Convex combinations of training examples	Vision pretraining
Stochastic depth	Random layer skipping	Deep ViTs
Early stopping	Stop once validation loss stops improving	Fine-tuning
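
The "convex combinations" row can be made concrete with a short mixup sketch. It assumes the usual recipe of drawing the mixing weight from a Beta(alpha, alpha) distribution and blending both the inputs and their one-hot labels; the function name `mixup`, the alpha value, and the toy vectors are illustrative.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two (input, one-hot label) pairs with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)
x, y, lam = mixup([1.0, 0.0], [1.0, 0.0],   # example A and its label
                  [0.0, 1.0], [0.0, 1.0],   # example B and its label
                  rng=rng)
```

The mixed label is still a valid probability distribution, and the same weight lam is applied to inputs and labels so the target stays consistent with the blended input.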

The general lesson: regularisation that worked when models were 100M parameters and data was 1M examples mostly disappears when models are 100B parameters and data is 10T tokens. Different regime, different toolbox.