Mathematical Foundations
Calculus and Gradients
Partial derivatives, the chain rule as the engine of backprop, why second-order methods are rare in deep learning, and what gradient clipping actually does.
Backprop is just the chain rule, applied repeatedly in a particular order. Gradient clipping is a hack to fight Lipschitz blowup. The reason we do not use Newton's method on a 70B model is not philosophical: the Hessian has 7e10 * 7e10 entries and will not fit in any datacentre. The calculus you need to understand modern ML is shallow, but you have to know it cold.
Partial derivatives and gradients
For a scalar function f: R^n -> R, the gradient is the vector of partial derivatives:
grad f = [df/dx_1, df/dx_2, ..., df/dx_n]
The gradient points in the direction of steepest ascent, with magnitude equal to the maximum rate of change. Gradient descent moves in the opposite direction:
x_{t+1} = x_t - lr * grad f(x_t)
The gradient is a local first-order linearisation. It tells you the best direction now, says nothing about curvature, and stops being useful once the step size exceeds the scale on which f is approximately linear. That mismatch is why learning rate is the single most sensitive hyperparameter.
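A minimal sketch of that sensitivity, assuming a toy quadratic f(x, y) = 0.5 * (x^2 + 10 y^2) chosen for illustration (the function, the starting point and both learning rates are assumptions, not part of the note). The steepest curvature is 10, so the update is stable only while lr stays below 2 / 10 = 0.2.

```python
def f(x, y):
    # Toy quadratic bowl: curvature 1 along x, curvature 10 along y.
    return 0.5 * (x * x + 10.0 * y * y)


def grad_f(x, y):
    # Vector of partial derivatives [df/dx, df/dy].
    return (x, 10.0 * y)


def gradient_descent(lr, steps=50, start=(1.0, 1.0)):
    # Repeatedly apply x_{t+1} = x_t - lr * grad f(x_t); return the final loss.
    x, y = start
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - lr * gx, y - lr * gy
    return f(x, y)


loss_small_lr = gradient_descent(lr=0.1)   # inside the stable range
loss_large_lr = gradient_descent(lr=0.25)  # just past 2 / 10, so y oscillates and grows

print(f"lr=0.10 -> final loss {loss_small_lr:.3e}")
print(f"lr=0.25 -> final loss {loss_large_lr:.3e}")
```

The two runs differ only in the learning rate: once the step exceeds the scale on which the linearisation holds along the steepest axis, the same update rule diverges.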
Jacobian and Hessian
For a vector function f: R^n -> R^m, the Jacobian is the (m, n) matrix of all first partial derivatives:
J_{ij} = df_i / dx_j
For a scalar function f: R^n -> R, the Hessian is the (n, n) matrix of second partial derivatives:
H_{ij} = d^2 f / (dx_i dx_j)
The Hessian is symmetric (for nice functions, by Clairaut's theorem) and encodes local curvature. At a critical point, where the gradient vanishes, its eigenvalues tell you the shape of the loss surface:
- All positive: local minimum (positive-definite Hessian).
- All negative: local maximum.
- Mixed signs: saddle point. The dominant feature in high-dimensional non-convex losses.
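The eigenvalue test above can be sketched for the 2x2 case, where a symmetric matrix has closed-form eigenvalues. The helper names and the three example Hessians are illustrative assumptions. The "inconclusive" branch for zero eigenvalues is an addition of this sketch, since the second-order test says nothing in that case.

```python
import math


def eigenvalues_sym_2x2(a, b, d):
    # Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first.
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return (mean - radius, mean + radius)


def classify_critical_point(eigs, tol=1e-12):
    # Assumes the gradient is already zero at the point being classified.
    if all(e > tol for e in eigs):
        return "local minimum"
    if all(e < -tol for e in eigs):
        return "local maximum"
    if any(e > tol for e in eigs) and any(e < -tol for e in eigs):
        return "saddle point"
    return "inconclusive"


# Hessians at the origin of three toy functions.
bowl = eigenvalues_sym_2x2(2.0, 0.0, 2.0)     # f = x^2 + y^2
saddle = eigenvalues_sym_2x2(2.0, 0.0, -2.0)  # f = x^2 - y^2
dome = eigenvalues_sym_2x2(-2.0, 1.0, -2.0)   # f = -x^2 + x*y - y^2

for name, eigs in [("bowl", bowl), ("saddle", saddle), ("dome", dome)]:
    print(name, eigs, classify_critical_point(eigs))
```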
The chain rule as the engine of backprop
Given y = f(g(h(x))):
dy/dx = (df/dg) * (dg/dh) * (dh/dx)
For vector-valued intermediates these become Jacobian-Jacobian-Jacobian products. For a scalar output, reverse-mode autodiff evaluates this product left-to-right, starting from the output side, so every intermediate result is a row vector rather than a full matrix. That is exactly what you want for a loss function: one backward pass yields all parameter gradients regardless of how many parameters there are.
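As a scalar illustration, here is a hand-written forward and backward pass for one concrete choice of the three functions: h(x) = x^2, g(u) = sin(u), f(v) = exp(v). These choices, and the names `forward` and `backward`, are assumptions made for the sketch. The backward pass starts at the output and multiplies in one local derivative at a time, and the result is compared against a central difference.

```python
import math


def forward(x):
    # Forward pass: compute y = exp(sin(x^2)) and keep the intermediates.
    u = x * x
    v = math.sin(u)
    y = math.exp(v)
    return y, (x, u, v, y)


def backward(cache):
    # Backward pass: start from dy/dy = 1 at the output and walk towards x.
    x, u, v, y = cache
    dy_dv = y                     # d exp(v) / dv = exp(v), already stored as y
    dy_du = dy_dv * math.cos(u)   # multiply by d sin(u) / du
    dy_dx = dy_du * 2.0 * x       # multiply by d (x^2) / dx
    return dy_dx


x0 = 0.7
y0, cache = forward(x0)
analytic = backward(cache)

# Independent check with a central difference.
h = 1e-5
numeric = (forward(x0 + h)[0] - forward(x0 - h)[0]) / (2.0 * h)

print(f"analytic {analytic:.10f}  numeric {numeric:.10f}")
```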
The cost asymmetry is dramatic:
- Forward mode: O(n) evaluations for n inputs.
- Reverse mode: O(1) extra passes for a scalar output, regardless of n.
This is why no major framework uses forward mode for training. See the backprop-autodiff note for the implementation details.
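A rough way to see the asymmetry is to count multiply-adds for a three-factor Jacobian product with shapes (1, d), (d, d), (d, n), under the simplifying assumption that every Jacobian is dense. The shapes and helper names are illustrative, and real layers have structure that changes the constants. Starting from the output side keeps every intermediate a row vector. Starting from the input side builds a (d, n) matrix first.

```python
def matmul_cost(p, q, r):
    # Multiply-adds for a dense (p, q) @ (q, r) product.
    return p * q * r


def output_side_first(d, n):
    # (1, d) @ (d, d) -> (1, d), then (1, d) @ (d, n) -> (1, n).
    return matmul_cost(1, d, d) + matmul_cost(1, d, n)


def input_side_first(d, n):
    # (d, d) @ (d, n) -> (d, n), then (1, d) @ (d, n) -> (1, n).
    return matmul_cost(d, d, n) + matmul_cost(1, d, n)


d, n = 1_000, 1_000_000  # hidden width and parameter count, chosen arbitrarily
reverse_like = output_side_first(d, n)
forward_like = input_side_first(d, n)

print(f"output side first: {reverse_like:.3e} multiply-adds")
print(f"input side first:  {forward_like:.3e} multiply-adds")
print(f"ratio: {forward_like / reverse_like:.0f}x")
```

Both orders produce the same (1, n) gradient, and only the amount of intermediate work differs.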
Why second-order methods are rare in deep learning
Newton's method updates with H^{-1} grad. For smooth, strongly convex problems this converges quadratically near the optimum - dramatically faster than first-order methods. Yet no one trains GPT-4 with Newton's method. Three reasons:
- Memory. A 70B-parameter model has a (7e10, 7e10) Hessian. That is about 4.9e21 floats, or roughly 2e22 bytes in fp32: around 20 zettabytes, which is tens of thousands of exabytes and far beyond any storage system you could attach to a training cluster.
- Compute. Even a single Hessian-vector product costs about as much as a backward pass, and Newton's method needs many of them per step.
- Non-convexity. Vanilla Newton converges to stationary points, not minima. In a saddle-dominated landscape it heads straight for the nearest saddle.
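The memory figure can be checked with integer arithmetic. The only assumptions in this sketch are 4 bytes per entry (fp32) and decimal units (1 EB = 1e18 bytes, 1 ZB = 1e21 bytes). Exploiting symmetry would halve the numbers without changing the conclusion.

```python
n_params = 70_000_000_000          # 70B parameters

hessian_entries = n_params ** 2    # full (n, n) matrix
bytes_fp32 = 4 * hessian_entries   # assumption: 4 bytes per entry

EXABYTE = 10 ** 18
ZETTABYTE = 10 ** 21

print(f"entries:    {hessian_entries:.2e}")
print(f"bytes:      {bytes_fp32:.2e}")
print(f"exabytes:   {bytes_fp32 / EXABYTE:,.0f}")
print(f"zettabytes: {bytes_fp32 / ZETTABYTE:.1f}")
```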
Workarounds exist - K-FAC, Shampoo, Sophia, the natural-gradient family - and a few production runs use them. But the cost-benefit math is so brutal that AdamW (cheap, approximate diagonal preconditioner) wins for almost every workload.
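The non-convexity point in the list above can be seen on the smallest possible saddle, f(x, y) = x^2 - y^2. The function, the starting point and the learning rate are assumptions made for this sketch. Its Hessian is the constant matrix diag(2, -2), so the Newton step can be written out by hand.

```python
def grad(x, y):
    # Gradient of f(x, y) = x^2 - y^2.
    return (2.0 * x, -2.0 * y)


# The Hessian of f is constant and diagonal: diag(2, -2).
HESSIAN_DIAG = (2.0, -2.0)


def newton_step(x, y):
    # x - H^{-1} grad; for a diagonal H the inverse is elementwise division.
    gx, gy = grad(x, y)
    return (x - gx / HESSIAN_DIAG[0], y - gy / HESSIAN_DIAG[1])


def gd_step(x, y, lr=0.1):
    # Plain gradient descent step for comparison.
    gx, gy = grad(x, y)
    return (x - lr * gx, y - lr * gy)


start = (0.5, 0.3)
newton_point = newton_step(*start)  # jumps onto the saddle at the origin
gd_point = gd_step(*start)          # shrinks x, grows |y|, so it moves away along y

print("Newton:", newton_point)
print("GD:    ", gd_point)
```

Newton's step solves for a point where the gradient is zero and is indifferent to whether that point is a minimum. Gradient descent at least moves downhill along the negative-curvature direction.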
Gradient norms and clipping
The norm of the gradient at step t:
||grad||_2 = sqrt(sum_i grad_i^2)
When this explodes (RNNs, deep transformers without warmup, end of training), parameters take a huge step and the loss spikes or NaNs. Gradient clipping rescales:
if ||grad||_2 > threshold:
    grad = grad * (threshold / ||grad||_2)
This caps the gradient norm (and, with plain SGD, the step magnitude) while preserving the direction. Threshold of 1.0 is the most common default in transformer training. It is a crude safety net; in practice it triggers a few times early in training and then almost never. If you see it firing constantly, your learning rate is too high or your initialisation is bad.
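A minimal sketch of the rule on a flat list of gradient values. The function name `clip_by_global_norm` is an illustrative assumption; frameworks ship their own utilities for this, for example `torch.nn.utils.clip_grad_norm_` in PyTorch.

```python
import math


def clip_by_global_norm(grads, threshold):
    # L2 norm over every gradient entry, treated as one long vector.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        # Rescale so the norm equals the threshold; the direction is unchanged.
        scale = threshold / norm
        grads = [g * scale for g in grads]
    return grads, norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], threshold=1.0)
untouched, small_norm = clip_by_global_norm([0.3, 0.4], threshold=1.0)

print(f"norm before: {norm_before}, clipped: {clipped}")
print(f"norm before: {small_norm}, unchanged: {untouched}")
```

Returning the pre-clip norm is a deliberate choice in this sketch: logging it is the easiest way to notice that clipping is firing on every step.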
Gradient clipping fixes explosion. It cannot fix vanishing - you cannot rescale your way out of a number that is already zero. Architectural changes (residual connections, LayerNorm, gating) are the actual fix for vanishing gradients.
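A scalar toy makes the contrast concrete. It assumes a stack of identical sigmoid "layers" with no weights, which is far simpler than a real network, and the function names are illustrative. The gradient of the output with respect to the input is the product of the local derivatives: each plain sigmoid contributes a factor of at most 0.25, while a residual layer h + sigmoid(h) contributes a factor of at least 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, residual, x=0.5):
    # d(output) / d(input) for `depth` stacked scalar layers, via the chain rule.
    grad = 1.0
    h = x
    for _ in range(depth):
        s = sigmoid(h)
        local = s * (1.0 - s)      # sigmoid'(h), never larger than 0.25
        if residual:
            grad *= 1.0 + local    # derivative of h + sigmoid(h)
            h = h + s
        else:
            grad *= local          # derivative of sigmoid(h)
            h = s
    return grad


plain = input_gradient(depth=30, residual=False)
with_skip = input_gradient(depth=30, residual=True)

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {with_skip:.3e}")
```

The plain-stack gradient is far below any clipping threshold, so clipping leaves it untouched. Only the change in architecture moves it.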
Directional derivatives and JVPs / VJPs
A directional derivative is the gradient projected onto a direction v:
D_v f(x) = grad f . v
In autodiff jargon this is a JVP (Jacobian-vector product). The reverse - a row vector times the Jacobian - is a VJP (vector-Jacobian product). PyTorch's .backward() computes VJPs. JAX exposes both via jax.jvp and jax.vjp. Applying a JVP to the gradient function, which is itself computed by a VJP, gives you a Hessian-vector product without ever materialising the Hessian - the trick that Hessian-free optimisation and conjugate-gradient natural-gradient solvers build on.
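A dependency-free sketch of the Hessian-vector product idea. The `Dual` class is a toy forward-mode implementation written for this example, not a library API, and it supports only addition and multiplication. The gradient of f(x, y) = x^2 * y + y^3 is written by hand here; in a framework it would come from reverse mode. Pushing a tangent v through the gradient function is a JVP of the gradient, which equals H v.

```python
class Dual:
    # A value plus a tangent; just enough arithmetic for this example.
    def __init__(self, val, tan=0.0):
        self.val = val
        self.tan = tan

    @staticmethod
    def lift(other):
        # Treat plain numbers as constants with zero tangent.
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule on the tangent part.
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

    __rmul__ = __mul__


def grad_f(x, y):
    # Hand-written gradient of f(x, y) = x^2 * y + y^3.
    return (2 * x * y, x * x + 3 * y * y)


def hvp(x, y, v):
    # JVP of the gradient along v: seed the tangents with v, read them back out.
    gx, gy = grad_f(Dual(x, v[0]), Dual(y, v[1]))
    return (gx.tan, gy.tan)


# At (1, 2) the Hessian is [[2y, 2x], [2x, 6y]] = [[4, 2], [2, 12]].
col0 = hvp(1.0, 2.0, (1.0, 0.0))
col1 = hvp(1.0, 2.0, (0.0, 1.0))
print(col0, col1)
```

The full Hessian never appears: each call costs about as much as one gradient evaluation and returns one Hessian-vector product. In JAX the same forward-over-reverse composition is usually written as `jax.jvp(jax.grad(f), (x,), (v,))[1]`.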
Common pitfalls
- Confusing the gradient with the steepest-descent direction in a non-Euclidean metric. Natural-gradient methods use F^{-1} grad where F is the Fisher information. The "right" direction depends on what metric you measure distance in.
- Forgetting that softmax + cross-entropy has a beautiful joint gradient with respect to the logits (q - p, where q is the softmax output and p is the target distribution). Computing them separately and composing the gradients is numerically unstable; always use the fused version.
- Gradient checking with finite differences at the wrong scale. Forward difference (f(x+h) - f(x)) / h has O(h) truncation error and O(1/h) floating-point error. Central differences with h = 1e-5 in double precision are the standard sanity check.
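The last two pitfalls can be exercised together: compute the fused softmax + cross-entropy gradient q - p, then verify it with central differences at h = 1e-5 in double precision. The helper names, logits and target below are assumptions made for the sketch. The loss goes through a log-sum-exp with the maximum subtracted, which is the usual way to keep the fused form stable.

```python
import math


def log_softmax(z):
    # Stable log-softmax: subtract the max before exponentiating.
    m = max(z)
    lse = m + math.log(sum(math.exp(zi - m) for zi in z))
    return [zi - lse for zi in z]


def cross_entropy(z, p):
    # Loss for logits z against a target distribution p (entries sum to 1).
    return -sum(pi * lsi for pi, lsi in zip(p, log_softmax(z)))


def fused_grad(z, p):
    # Joint gradient with respect to the logits: softmax(z) - p.
    q = [math.exp(lsi) for lsi in log_softmax(z)]
    return [qi - pi for qi, pi in zip(q, p)]


def central_diff_grad(z, p, h=1e-5):
    # (f(z + h e_i) - f(z - h e_i)) / (2h) for each coordinate i.
    grads = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += h
        z_minus[i] -= h
        grads.append((cross_entropy(z_plus, p) - cross_entropy(z_minus, p)) / (2.0 * h))
    return grads


logits = [2.0, -1.0, 0.5]
target = [0.0, 1.0, 0.0]

analytic = fused_grad(logits, target)
numeric = central_diff_grad(logits, target)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))

print("analytic:", analytic)
print("numeric: ", numeric)
print(f"max abs error: {max_err:.2e}")
```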
Further reading
- Deep Learning Book - Chapter 4: Numerical Computation - Goodfellow et al, gradients and conditioning in ML context.
- On the difficulty of training Recurrent Neural Networks - Pascanu, Mikolov, Bengio; the gradient clipping paper.
- Identifying and attacking the saddle point problem - Dauphin et al, why second-order methods see saddles everywhere.