Optimisation Theory

Classical optimisation theory says you should not be able to train a neural network. The loss is wildly non-convex, has exponentially many critical points, and gradient descent comes with no guarantee of reaching a global minimum. Yet SGD on a 100B-parameter transformer reliably finds solutions that generalise. Understanding why this works, and where the classical intuitions break, is what separates a practitioner who can train models from one who can debug them.

Convex vs non-convex landscapes

A function f is convex if for all x, y and lambda in [0, 1]:

f(lambda x + (1 - lambda) y) <= lambda f(x) + (1 - lambda) f(y)

For a twice-differentiable f, this is equivalent to the Hessian being positive semi-definite everywhere. One consequence is that every local minimum is a global minimum. Convex optimisation has a well-developed theory: for smooth convex functions, gradient descent converges at rate O(1/t) and accelerated methods at O(1/t^2), and if f is strictly convex the minimiser is unique. Linear regression, logistic regression and SVMs are all convex problems.
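As a rough illustration of the O(1/t) rate, the sketch below runs gradient descent on a small least-squares problem and compares the suboptimality gap with the standard bound L ||x_0 - x*||^2 / (2t) for step size 1/L. The problem data (A, b), the dimensions and the step count are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical least-squares problem: f(x) = 0.5 * ||A x - b||^2 (convex and smooth).
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)

def f(x):
    r = A @ x - b
    return 0.5 * r @ r

def grad(x):
    return A.T @ (A @ x - b)

# Smoothness constant L: the largest eigenvalue of the Hessian A^T A.
L = np.linalg.eigvalsh(A.T @ A).max()

# Reference minimiser from a direct solver, used only to measure the gap.
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f_star = f(x_star)

x0 = np.zeros(10)
x = x0.copy()
gaps, bounds = [], []
for t in range(1, 201):
    x = x - grad(x) / L                                       # gradient step, step size 1/L
    gaps.append(f(x) - f_star)                                # suboptimality after t steps
    bounds.append(L * np.sum((x0 - x_star) ** 2) / (2 * t))   # O(1/t) bound

# Spot-check the defining convexity inequality at one random pair of points.
u, v, lam = rng.normal(size=10), rng.normal(size=10), rng.uniform()
convex_ok = f(lam * u + (1 - lam) * v) <= lam * f(u) + (1 - lam) * f(v) + 1e-9

print(f"gap after 200 steps: {gaps[-1]:.3e}, bound: {bounds[-1]:.3e}")
```

On a well-conditioned problem like this one the gap typically shrinks much faster than the bound. O(1/t) is a worst-case guarantee over all smooth convex functions, not a prediction for any particular one.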

Neural networks are aggressively non-convex. The loss landscape has:

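Permutation symmetries. Swap any two hidden units in a layer, together with their incoming and outgoing weights, and the loss is unchanged. A 1000-unit layer therefore gives every minimum 1000! equivalent copies in parameter space.
Scaling symmetries. In a ReLU network, multiply one layer's weights and biases by c > 0 and divide the next layer's weights by c. Same function, different parameters.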
Saddle points at every scale. Far more saddles than minima, especially in high dimension.
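The two symmetries can be checked numerically. This is a minimal sketch on a hypothetical one-hidden-layer ReLU network; the layer sizes, the random weights and the scale factor c are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-hidden-layer ReLU network: y = W2 @ relu(W1 @ x + b1) + b2.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)
x = rng.normal(size=4)

def forward(W1, b1, W2, b2, x):
    h = np.maximum(W1 @ x + b1, 0.0)   # ReLU hidden activations
    return W2 @ h + b2

y = forward(W1, b1, W2, b2, x)

# Permutation symmetry: reorder the hidden units, moving each unit's incoming
# row (and bias) and its outgoing column together.
perm = rng.permutation(8)
y_perm = forward(W1[perm], b1[perm], W2[:, perm], b2, x)

# Scaling symmetry: relu(c * z) = c * relu(z) for c > 0, so scaling the first
# layer by c and the second layer's weights by 1/c cancels out.
c = 3.7
y_scaled = forward(c * W1, c * b1, W2 / c, b2, x)

print(np.allclose(y, y_perm), np.allclose(y, y_scaled))
```

Both comparisons should agree up to floating-point error. The scaling trick relies on c being positive: a negative c changes which units the ReLU switches off.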

Despite this, SGD finds solutions that perform well on held-out data. Why?

Why SGD works on non-convex landscapes

Three empirical observations, none fully explained theoretically:

Loss-surface findings. Visualisations of large-network loss landscapes (Goodfellow et al, Li et al) show that the path from initialisation to the trained solution is essentially monotonic, with no high-loss ridges to cross. The geometry near the solution is wide and flat in most directions.
Local minima are mostly equivalent. Choromanska et al (2015) and others argued, with caveats, that for large enough networks the local minima found by SGD all sit at similar loss values. There is no "vastly better" minimum being missed.
The lottery-ticket framing. Frankle and Carbin (2018) showed that dense networks contain sparse subnetworks ("winning tickets") that, when trained from the original initialisation, match or exceed the dense network's performance. The implication: overparameterisation gives SGD many independent paths to a good solution.
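The prune-and-rewind step behind a "winning ticket" can be sketched as follows. This is an illustration only: no training is run, w_trained is a random stand-in for trained weights, and the helper name winning_ticket and the 80% pruning fraction are assumptions. In the full procedure the surviving subnetwork would then be retrained from these rewound values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for one layer's weights: w_init at initialisation, w_trained after
# training. Both are random here purely for illustration (no training is run).
w_init = rng.normal(size=(64, 64))
w_trained = w_init + 0.5 * rng.normal(size=(64, 64))

def winning_ticket(w_init, w_trained, prune_frac):
    """Keep the largest-magnitude trained weights, rewound to their initial values."""
    threshold = np.quantile(np.abs(w_trained), prune_frac)
    mask = np.abs(w_trained) > threshold     # True where the weight survives
    return mask, w_init * mask               # survivors reset to initialisation

mask, w_ticket = winning_ticket(w_init, w_trained, prune_frac=0.8)
print(f"kept {mask.mean():.1%} of weights")
```

Which weights survive is decided by the trained magnitudes, but the values they restart from are the original initialisation.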

The working intuition: for a high-loss critical point to trap SGD, it would have to be a true local minimum, meaning every eigenvalue of the Hessian there is positive. In very high dimension the probability of this drops sharply, so most "bad" critical points have at least one direction of negative curvature to escape along.

Saddle points dominate

Dauphin et al (2014) argued that critical points in high-dimensional non-convex landscapes are overwhelmingly saddles, not local minima. The heuristic reasoning: if each of the n Hessian eigenvalues at a random critical point is treated as independently and roughly equally likely to be positive or negative, the probability that all n are positive (a local minimum) is about 2^{-n}. Any mix of signs gives a saddle.
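A toy simulation makes the counting argument concrete. The sketch below draws random symmetric Gaussian matrices as stand-in "Hessians" and counts how often all eigenvalues are positive. The random-matrix model, the dimensions and the trial count are assumptions for illustration, not a model of a real network's Hessian.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_all_positive(n, trials=2000):
    """Fraction of random symmetric n x n matrices whose eigenvalues are all positive."""
    count = 0
    for _ in range(trials):
        a = rng.normal(size=(n, n))
        h = (a + a.T) / 2                      # random symmetric stand-in "Hessian"
        if np.linalg.eigvalsh(h).min() > 0:    # all eigenvalues positive: a minimum
            count += 1
    return count / trials

dims = [1, 2, 4, 8]
fracs = {n: frac_all_positive(n) for n in dims}
for n in dims:
    print(f"n={n}: minima fraction {fracs[n]:.4f}, coin-flip estimate {2.0 ** -n:.4f}")
```

In this toy model the eigenvalues are not actually independent, so the simulated fraction should fall even faster than the 2^{-n} coin-flip estimate. Either way, minima become vanishingly rare among random critical points well before n reaches the size of a real network.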
