Optimisers: SGD, Adam, AdamW, Lion

The optimiser is the inner loop of every training run. The choice between SGD with momentum and AdamW changes loss curves, memory footprint, and the kind of hyperparameter tuning you need. Understanding the lineage from SGD through Adam clarifies why AdamW is the LLM default and why newer optimisers like Lion and Muon are starting to displace it.

SGD with momentum

Plain SGD updates each parameter by -lr * grad. Momentum replaces the raw gradient with a decaying running sum of past gradients:

v_t = mu * v_{t-1} + grad
w_t = w_{t-1} - lr * v_t

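mu (usually 0.9) smooths noisy gradients and accelerates progress along directions where successive gradients agree. With a constant gradient, v_t approaches grad / (1 - mu), so mu = 0.9 gives up to a 10x larger effective step. SGD+momentum was the workhorse of computer vision through 2018. It has a reputation for generalising well, commonly attributed to an implicit bias toward flat minima that helps held-out accuracy.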
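The two update lines above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `sgd_momentum_step`, the helper `run`, and the toy quadratic objective (one steep and one shallow direction) are assumptions made for the example.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """One SGD+momentum update: v accumulates gradients, w moves along v."""
    v = mu * v + grad
    w = w - lr * v
    return w, v

# Toy objective (an assumption for this sketch):
# f(w) = 0.5 * sum(curvature * w^2), so grad = curvature * w.
curvature = np.array([1.0, 0.01])  # one steep and one shallow direction

def run(mu, steps=200, lr=0.1):
    """Minimise the toy objective and return the final loss."""
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        w, v = sgd_momentum_step(w, v, curvature * w, lr=lr, mu=mu)
    return 0.5 * np.sum(curvature * w**2)

plain_loss = run(mu=0.0)     # mu = 0 reduces to plain SGD
momentum_loss = run(mu=0.9)
print(plain_loss, momentum_loss)
```

On this toy problem the momentum run should finish at a lower loss than plain SGD with the same learning rate, because the accumulated velocity takes larger effective steps along the shallow direction.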

The downside is hand-tuning. The right learning rate varies by orders of magnitude across architectures, layers, and training phases.

RMSProp

RMSProp comes from Geoffrey Hinton's lecture slides and was never formally published. The idea: divide each parameter's update by the root of a running average of its squared gradient.

v_t = beta * v_{t-1} + (1 - beta) * grad^2
w_t = w_{t-1} - lr * grad / (sqrt(v_t) + eps)

This is a per-parameter adaptive learning rate. Parameters with consistently large gradients get smaller steps; parameters with small or sparse gradients get larger ones. This matters for RNNs, where different parts of the network see wildly different gradient magnitudes.
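A minimal NumPy sketch of the RMSProp update makes the normalisation visible. The function name `rmsprop_step` and the constant toy gradients are assumptions for illustration; the defaults shown are common choices, not values from the text above.

```python
import numpy as np

def rmsprop_step(w, v, grad, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update: scale the gradient by the root of its running square."""
    v = beta * v + (1 - beta) * grad**2
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v

# Two parameters whose gradients differ by six orders of magnitude
# (held constant here purely to keep the example simple).
grad = np.array([1e3, 1e-3])
w = np.zeros(2)
v = np.zeros(2)
for _ in range(100):
    w_prev = w
    w, v = rmsprop_step(w, v, grad)

step = np.abs(w - w_prev)  # size of the most recent update per parameter
print(step)
```

Once the running average has warmed up, both parameters should move by roughly lr per step despite the difference in raw gradient scale.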

Adam

Kingma and Ba (2014) combined momentum and RMSProp:

m_t = beta1 * m_{t-1} + (1 - beta1) * grad       # first moment
v_t = beta2 * v_{t-1} + (1 - beta2) * grad^2     # second moment

m_hat = m_t / (1 - beta1^t)                       # bias correction
v_hat = v_t / (1 - beta2^t)

w_t = w_{t-1} - lr * m_hat / (sqrt(v_hat) + eps)

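Defaults are beta1=0.9, beta2=0.999. Bias correction matters early in training: m and v start at zero, so without it the first estimates would be biased toward zero. Because the update m_hat / sqrt(v_hat) is normalised to a magnitude of roughly 1 per parameter, lr sets the step size fairly directly, which makes Adam robust to learning rate choice across architectures. Adam took over within two years of publication because it just works - no per-layer LR scheduling, modest sensitivity to the global LR, fast convergence.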
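The Adam equations above translate almost line for line into NumPy. This is a sketch under stated assumptions: the function name `adam_step`, the eps default of 1e-8, and the toy values are illustrative, and `t` is taken to be a 1-based step counter.

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad       # first moment
    v = beta2 * v + (1 - beta2) * grad**2    # second moment
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w0 = np.array([0.5, -2.0, 3.0])
grad = np.array([10.0, -0.1, 0.001])         # gradients on very different scales
w1, m1, v1 = adam_step(w0, np.zeros(3), np.zeros(3), grad, t=1)

first_step = w1 - w0
print(first_step)
```

On the very first step the bias-corrected moments reduce to grad and grad^2, so each parameter should move by roughly lr in the direction opposite its gradient, whatever the gradient's magnitude.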

Cost: two extra tensors per parameter (m and v). For a 70B-parameter model in fp32, that is 560 GB of optimiser state, vs 280 GB for the weights themselves.
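The arithmetic behind those figures is short enough to check directly. The helper name `adam_memory_gb` is hypothetical, and the sketch assumes 4 bytes per fp32 value and 1 GB = 1e9 bytes; it ignores gradients and activations.

```python
def adam_memory_gb(n_params, bytes_per_value=4):
    """Return (weights_gb, optimiser_state_gb) for Adam, using 1 GB = 1e9 bytes."""
    weights_gb = n_params * bytes_per_value / 1e9
    state_gb = 2 * weights_gb  # one m tensor and one v tensor per weight tensor
    return weights_gb, state_gb

weights_gb, state_gb = adam_memory_gb(70_000_000_000)
print(weights_gb, state_gb)  # 280.0 560.0
```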

The AdamW fix

Loshchilov and Hutter (2017) noticed that L2 regularisation does not behave like weight decay under Adam. Adding an L2 penalty (lambda / 2) * w^2 to the loss adds lambda * w to the gradient, so the decay term gets divided by sqrt(v_t) along with the rest of the gradient. Parameters that already have large v_t get less decay than they should.
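A small numerical sketch shows the effect. It looks only at the first Adam step, where the bias-corrected second moment equals grad^2, and the values of `lr`, `lam`, the weights, and the loss gradients are all assumptions chosen for illustration.

```python
import numpy as np

lr, lam, eps = 1e-3, 0.1, 1e-8
w = np.array([1.0, 1.0])             # two weights with the same value
loss_grad = np.array([10.0, 0.1])    # but very different loss-gradient scales

grad = loss_grad + lam * w           # L2 penalty folded into the gradient
v_hat = grad**2                      # bias-corrected second moment at step 1

# Portion of the update attributable to the decay term in each scheme.
coupled_decay = lr * lam * w / (np.sqrt(v_hat) + eps)  # L2 inside Adam
decoupled_decay = lr * lam * w                         # decay applied to w directly
print(coupled_decay, decoupled_decay)
```

With the penalty folded into the gradient, the weight with the larger gradient should receive far less decay than its equally sized neighbour; applying the decay directly to the weights treats both the same.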

AdamW decouples the decay from the gradient computation:

w_t = w_{t-1} - lr * (m_hat / (sqrt(v_hat) + eps) + lambda * w_{t-1})

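The decay is applied directly to the weights, untouched by adaptive scaling. Every parameter shrinks by the same fraction, lr * lambda, on each step. AdamW narrowed much of the generalisation gap with SGD. Adam with weight decay is the optimiser reported for GPT-3 and Llama, and AdamW is widely treated as the default for modern LLM training; for models such as Claude and Mistral, where training details are not fully published, it is the commonly assumed choice.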
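The decoupled update can be sketched by adding one term to an Adam step. The function name `adamw_step`, the eps default, and the toy values are assumptions for illustration; production implementations may arrange the same arithmetic differently.

```python
import numpy as np

def adamw_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update; weight_decay plays the role of lambda in the text."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # The decay term sits outside the adaptive scaling.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# With a zero gradient, only the decay acts on the weights.
w0 = np.array([4.0, -0.5, 0.02])
w1, _, _ = adamw_step(w0, np.zeros(3), np.zeros(3), np.zeros(3),
                      t=1, lr=0.1, weight_decay=0.1)

shrink = w1 / w0
print(shrink)
```

With the gradient set to zero, every weight should shrink by the same factor, 1 - lr * weight_decay, regardless of its size.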
