Deep Learning
Dropout and Modern Regularisation
Why dropout was the dominant regulariser for a decade and why modern LLM training mostly skips it in favour of letting data do the work.
beginner · 6 min read
Regularisation is anything that reduces the gap between training loss and test loss. The 2012-2018 deep learning era leaned heavily on architectural regularisers like dropout. The 2020+ LLM era leans on more data. Understanding why the shift happened tells you when to reach for each tool.
Dropout as approximate ensembling
Srivastava et al (2014). During training, randomly zero each activation with probability p (typically 0.1-0.5), then scale the survivors by 1/(1-p) so the expected output is unchanged. At inference, run the full network.
mask = bernoulli(1 - p) # 1 with prob 1-p, else 0
y = (x * mask) / (1 - p)
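The two-line pseudocode above can be turned into a runnable sketch. The version below uses only the Python standard library and plain lists; the function name `dropout` and its `training` flag are illustrative choices, not taken from any particular framework.

```python
import random

def dropout(x, p, rng, training=True):
    """Inverted dropout on a list of floats; p is the drop probability."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(x)  # inference: full network, no masking or scaling
    keep = 1.0 - p
    # Each unit survives with probability keep; survivors are scaled by 1/keep.
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

rng = random.Random(0)
x = [1.0] * 100_000
y = dropout(x, p=0.3, rng=rng)
mean_y = sum(y) / len(y)            # should stay close to the mean of x
frac_zero = y.count(0.0) / len(y)   # should be close to p
```

On a long vector of ones, the mean of `y` should stay close to 1.0 and roughly a fraction p of the entries should be zero, which is the "expected output is unchanged" property in practice.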
The intuition: with N droppable units, you are training an ensemble of up to 2^N sub-networks (one per dropout mask) that share weights. At test time, the unmasked forward pass approximates the geometric mean of their predictions. Co-adaptation between specific neurons is discouraged because no neuron can rely on a specific other neuron being present.
It worked spectacularly for dense MLPs and helped CNNs slightly. For RNNs and transformers, the picture is more nuanced.
Why modern LLMs barely use it
Open the Llama 3 or Mistral source code and you find dropout set to 0 across attention and FFN layers. Three reasons:
- Data is the regulariser. When you train on 15 trillion tokens, you do not need to inject noise to prevent the model from memorising. There simply is not enough capacity to memorise that much.
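- It stops helping at scale. When nearly every token is seen only once, there is little overfitting left for dropout to remove, so the injected noise tends to leave validation loss flat or slightly worse. This matches the choice in published recipes such as Llama to train with dropout disabled.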
- Throughput tax. Generating dropout masks and scaling activations is a measurable cost on the critical path of every layer.
You still see small amounts of attention dropout in fine-tuning recipes where the dataset is small and overfitting is real.
Label smoothing
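Szegedy et al (Inception-v3, 2015). Instead of training with one-hot targets [0, 0, 1, 0], mix them with the uniform distribution over the K classes: [eps/K, eps/K, 1 - eps + eps/K, eps/K], typically with eps = 0.1. For K = 4 that is [0.025, 0.025, 0.925, 0.025], which still sums to 1. The model is discouraged from producing arbitrarily confident outputs.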
Effects:
- Calibration improves. Predicted probabilities better match empirical frequencies.
- Slight accuracy gain on classification benchmarks.
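- Distillation gets harder. A teacher trained with label smoothing produces outputs that carry less information about how classes relate to one another, which is the "dark knowledge" a student would otherwise learn from. Modern LLM distillation usually disables label smoothing.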
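A minimal sketch of the smoothing rule, assuming the usual mixture form (1 - eps) * one_hot + eps / K so that the target still sums to 1. The helper names `smooth_targets` and `cross_entropy` are hypothetical, and the logits are made-up values chosen to contrast a very confident prediction with a moderate one.

```python
import math

def smooth_targets(true_class, num_classes, eps=0.1):
    """Mix a one-hot target with the uniform distribution over K classes."""
    target = [eps / num_classes] * num_classes
    target[true_class] += 1.0 - eps
    return target

def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(target, logits))

target = smooth_targets(true_class=2, num_classes=4, eps=0.1)

confident = [0.0, 0.0, 20.0, 0.0]   # near-certain prediction for class 2
moderate = [0.0, 0.0, 5.0, 0.0]     # same argmax, less extreme logit
loss_confident = cross_entropy(target, confident)
loss_moderate = cross_entropy(target, moderate)
```

With smoothed targets the over-confident logits incur the larger loss, because the small mass placed on the wrong classes is multiplied by very negative log-probabilities. With a one-hot target the ordering would be reversed.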
Weight decay vs L2 regularisation
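In plain SGD the two are equivalent: adding (lambda / 2) * w^2 to the loss and taking a gradient step is the same as multiplying w by (1 - lr * lambda) before the usual update at each step. With a bare lambda * w^2 penalty the factor becomes (1 - 2 * lr * lambda).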
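The SGD equivalence is easy to check numerically. The sketch below assumes the penalty is written as (lambda / 2) * w^2, so that its gradient is lambda * w, and uses a hypothetical one-dimensional toy loss; both update rules should trace the same trajectory up to floating-point rounding.

```python
def sgd_l2_step(w, grad, lr, lam):
    """SGD on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)

def sgd_weight_decay_step(w, grad, lr, lam):
    """Shrink w by (1 - lr * lam), then apply the usual gradient step."""
    return (1.0 - lr * lam) * w - lr * grad

def toy_grad(w):
    """Gradient of the hypothetical toy loss 1.5 * w**2 - w."""
    return 3.0 * w - 1.0

w_l2 = w_wd = 2.0
lr, lam = 0.1, 0.01
for _ in range(100):
    w_l2 = sgd_l2_step(w_l2, toy_grad(w_l2), lr, lam)
    w_wd = sgd_weight_decay_step(w_wd, toy_grad(w_wd), lr, lam)
```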
In Adam the two are not equivalent. Adam divides each gradient by sqrt(v_t), where v_t is a running average of squared gradients, so an L2 term added to the loss gets divided too: parameters with large recent gradients receive less effective decay than parameters with small ones. AdamW (covered in the optimisers note) decouples the decay from the gradient, applying it to the weights directly so every parameter shrinks by the same relative factor. This is why nearly every LLM training recipe uses AdamW rather than Adam with L2.
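The difference shows up in a toy experiment. The sketch below is an illustrative scalar Adam loop, not a framework implementation: the function `adam_run`, its hyperparameters, and the alternating zero-mean gradient are all assumptions chosen so that the only systematic pull on the weight is the decay term.

```python
import math

def adam_run(grad_scale, decoupled, steps=1000, lr=0.01, lam=0.1,
             beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on one scalar weight whose data gradient alternates +/- grad_scale.

    decoupled=False adds lam * w to the gradient (Adam with L2).
    decoupled=True shrinks w directly by lr * lam (AdamW-style decay).
    """
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_scale if t % 2 == 1 else -grad_scale  # zero-mean toy gradient
        if decoupled:
            w *= 1.0 - lr * lam      # decay bypasses the adaptive scaling
        else:
            g += lam * w             # decay term is divided by sqrt(v_hat) below
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Two weights that differ only in gradient magnitude (10.0 vs 0.1).
l2_large, l2_small = adam_run(10.0, False), adam_run(0.1, False)
adamw_large, adamw_small = adam_run(10.0, True), adam_run(0.1, True)
```

In this setup the large-gradient weight should barely shrink under Adam with L2 while the small-gradient weight collapses toward zero. With decoupled decay the two weights should end at essentially the same value.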
Stochastic depth
Huang et al (2016). Like dropout but on entire residual blocks. With probability p_L for layer L, skip the block and pass the input through unchanged:
y = x + (bernoulli(1 - p_L) / (1 - p_L)) * F(x)
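p_L usually increases linearly with depth, so deeper blocks are dropped more often. Huang et al use it to train a ResNet with more than 1200 layers on CIFAR-10 that improves on its shallower counterparts, where the same depth trained conventionally overfits. Vision Transformers (DeiT, Swin) use a similar trick under the name "drop path."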
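A minimal sketch of the formula above with a linear schedule, using scalars in place of tensors. The names `drop_probs` and `residual_forward` are hypothetical. Note that this follows the training-time rescaling written above (as drop-path implementations commonly do), whereas the original paper leaves training outputs unscaled and instead scales each branch by its survival probability at test time.

```python
import random

def drop_probs(num_blocks, p_last):
    """Linear schedule: drop probability grows from p_last / num_blocks to p_last."""
    return [p_last * (i + 1) / num_blocks for i in range(num_blocks)]

def residual_forward(x, blocks, probs, rng, training=True):
    """Apply residual blocks y = x + F(x), randomly skipping F during training."""
    for block, p in zip(blocks, probs):
        if not training:
            x = x + block(x)                # inference: every block is active
        elif rng.random() >= p:             # block survives with probability 1 - p
            x = x + block(x) / (1.0 - p)    # rescale so the branch keeps its expected value
        # otherwise the block is dropped and x passes through unchanged
    return x

rng = random.Random(0)
blocks = [lambda x: 1.0] * 4                    # toy branch F(x) = 1 for every block
probs = drop_probs(len(blocks), p_last=0.5)     # deeper blocks drop more often
eval_out = residual_forward(0.0, blocks, probs, rng, training=False)
trials = [residual_forward(0.0, blocks, probs, rng) for _ in range(20_000)]
train_mean = sum(trials) / len(trials)
```

With a constant branch, the average training output over many random masks should match the inference output, which is what the 1/(1 - p_L) factor is for.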
What modern training relies on instead
| Technique | What it does | Where used |
|---|---|---|
| Massive pretraining data | Removes the overfitting regime entirely | All frontier LLMs |
| Weight decay (AdamW) | Shrinks weights toward zero unless the gradient pushes back | Nearly all LLM and ViT training |
| Data augmentation | Synthetic input diversity | Vision, speech |
| Mixup / CutMix | Convex combinations of training examples | Vision pretraining |
| Stochastic depth | Random layer skipping | Deep ViTs |
| Early stopping | Bail before validation diverges | Fine-tuning |
The general lesson: regularisation that worked when models were 100M parameters and data was 1M examples mostly disappears when models are 100B parameters and data is 10T tokens. Different regime, different toolbox.
Further reading
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting - the original Srivastava et al paper.
- Rethinking the Inception Architecture for Computer Vision - Szegedy et al, introduces label smoothing in Section 7.
- Deep Networks with Stochastic Depth - Huang et al on dropping residual blocks.