Calculus and Gradients

Backprop is the chain rule applied repeatedly, executed in a particular order. Gradient clipping is a hack to fight Lipschitz blowup. The reason we do not use Newton's method on a 70B model is not philosophical - it is that the Hessian has 7e10 * 7e10 entries and will not fit in any datacentre. The calculus you need to understand modern ML is shallow, but you have to know it cold.

Partial derivatives and gradients

For a scalar function f: R^n -> R, the gradient is the vector of partial derivatives:

grad f = [df/dx_1, df/dx_2, ..., df/dx_n]

The gradient points in the direction of steepest ascent, with magnitude equal to the maximum rate of change. Gradient descent moves in the opposite direction:

x_{t+1} = x_t - lr * grad f(x_t)

The gradient is a local first-order linearisation. It tells you the best direction now, says nothing about curvature, and stops being useful once the step size exceeds the scale on which f is approximately linear. That mismatch is why the learning rate is usually the most sensitive hyperparameter.
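To make the step-size point concrete, here is a minimal sketch on the one-dimensional quadratic f(x) = 0.5 * a * x^2, whose gradient is a * x. The function name `descend` and the curvature a = 10 are illustrative choices, not taken from any library. For this particular f the update contracts only when lr < 2 / a; past that, every step overshoots further than the last.

```python
def descend(lr, a=10.0, x0=1.0, steps=50):
    """Run plain gradient descent on f(x) = 0.5 * a * x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = a * x          # exact gradient of the quadratic
        x = x - lr * grad     # x_{t+1} = x_t - lr * grad f(x_t)
    return x

small = descend(lr=0.05)   # 0.05 < 2 / a = 0.2, so |x| shrinks every step
large = descend(lr=0.25)   # 0.25 > 2 / a, so |x| grows every step
print(abs(small), abs(large))
```

The same gradient is used in both runs; only the step size relative to the curvature decides whether the iteration converges or diverges.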

Jacobian and Hessian

For a vector function f: R^n -> R^m, the Jacobian is the (m, n) matrix of all first partial derivatives:

J_{ij} = df_i / dx_j

For a scalar function f: R^n -> R, the Hessian is the (n, n) matrix of second partial derivatives:

H_{ij} = d^2 f / (dx_i dx_j)

The Hessian is symmetric whenever the second partial derivatives are continuous (Clairaut's theorem) and encodes local curvature. At a critical point, where grad f = 0, its eigenvalues tell you the shape of the loss surface:

All positive: local minimum (positive-definite Hessian).
All negative: local maximum.
Mixed signs: saddle point. The dominant feature in high-dimensional non-convex losses.
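
As a small illustration of the eigenvalue test, the sketch below classifies a critical point from a 2x2 symmetric Hessian using the closed-form eigenvalues. The helper names `eig_sym_2x2` and `classify` are hypothetical, and the example Hessians are those of x^2 + y^2, -(x^2 + y^2), and x^2 - y^2 at the origin. A zero eigenvalue is reported as inconclusive, since the second-order test alone cannot decide that case.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return mean - radius, mean + radius

def classify(a, b, d, tol=1e-12):
    """Classify a critical point from its 2x2 Hessian [[a, b], [b, d]]."""
    lo, hi = eig_sym_2x2(a, b, d)
    if lo > tol:
        return "local minimum"
    if hi < -tol:
        return "local maximum"
    if lo < -tol and hi > tol:
        return "saddle point"
    return "inconclusive (zero eigenvalue)"

print(classify(2.0, 0.0, 2.0))    # Hessian of x^2 + y^2
print(classify(-2.0, 0.0, -2.0))  # Hessian of -(x^2 + y^2)
print(classify(2.0, 0.0, -2.0))   # Hessian of x^2 - y^2
```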

The chain rule as the engine of backprop

Given y = f(g(h(x))):

dy/dx = (df/dg) * (dg/dh) * (dh/dx)

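For vector-valued intermediates each factor is a Jacobian and the product is a chain of matrix multiplications. Reverse-mode autodiff evaluates this chain left-to-right as written, starting from the output factor df/dg. For a scalar output that factor is a row vector, so every intermediate result stays a row vector, which is exactly what you want for a loss function. One backward pass yields all parameter gradients regardless of how many parameters there are.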

The cost asymmetry is dramatic:

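Forward mode: O(n) passes to get the full gradient for n inputs, one per input direction.
Reverse mode: one backward pass for a scalar output, regardless of n. The price is memory: activations from the forward pass must be kept until the backward pass uses them.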

This is why no major framework uses forward mode for training. See the backprop-autodiff note for the implementation details.
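
The asymmetry can be seen by counting multiplications in the Jacobian chain. The sketch below assumes a scalar output, two intermediate vectors of width k, and n inputs, so the three factors df/dg, dg/dh, dh/dx have shapes (1, k), (k, k), and (k, n). The function names and sizes are illustrative. Starting from the output-side factor keeps every intermediate a row vector, while starting from the input-side factor builds a full (k, n) matrix first.

```python
def matmul_cost(rows, inner, cols):
    """Scalar multiplications for a dense (rows, inner) @ (inner, cols) product."""
    return rows * inner * cols

def output_first_cost(k, n):
    """Reverse-mode order: (df/dg @ dg/dh) @ dh/dx."""
    first = matmul_cost(1, k, k)      # (1,k) @ (k,k) -> (1,k)
    second = matmul_cost(1, k, n)     # (1,k) @ (k,n) -> (1,n)
    return first + second

def input_first_cost(k, n):
    """Forward-mode order: df/dg @ (dg/dh @ dh/dx)."""
    first = matmul_cost(k, k, n)      # (k,k) @ (k,n) -> (k,n)
    second = matmul_cost(1, k, n)     # (1,k) @ (k,n) -> (1,n)
    return first + second

k, n = 1000, 1000
print(output_first_cost(k, n))  # 2000000
print(input_first_cost(k, n))   # 1001000000
```

This counts dense matrix products only. A real forward-mode implementation pushes one tangent vector per input rather than forming matrices, but the total work scales the same way with n.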

Why second-order methods are rare in deep learning

Newton's method updates with H^{-1} grad. For smooth, strongly convex problems it converges quadratically near the optimum - dramatically faster than first-order methods. Yet no one trains GPT-4 with Newton's method. Three reasons:

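Memory. A 70B-parameter model has a (7e10, 7e10) Hessian. That is about 4.9e21 entries, or roughly 2e22 bytes (about 20 zettabytes) in fp32. Estimates of all storage installed worldwide are on the order of zettabytes, so a single dense Hessian would need about as much as the planet has, or more.
Compute. A single Hessian-vector product costs about as much as a few gradient evaluations, and solving for H^{-1} grad iteratively needs many of them per step.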
Non-convexity. Vanilla Newton converges to stationary points, not minima. In a saddle-dominated landscape it heads straight for the nearest saddle.
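
The saddle-seeking behaviour is easy to reproduce on f(x, y) = x^2 - y^2, whose only stationary point is a saddle at the origin. This is a hand-derived sketch, not a general optimiser: the gradient and the constant Hessian diag(2, -2) are written out by hand, and the function names are illustrative.

```python
def grad(x, y):
    """Gradient of f(x, y) = x**2 - y**2."""
    return 2.0 * x, -2.0 * y

def newton_step(x, y):
    """One Newton step; the Hessian is diag(2, -2), so H^{-1} is diag(0.5, -0.5)."""
    gx, gy = grad(x, y)
    return x - 0.5 * gx, y - (-0.5) * gy

def gd_step(x, y, lr=0.1):
    """One plain gradient-descent step for comparison."""
    gx, gy = grad(x, y)
    return x - lr * gx, y - lr * gy

start = (1.0, 0.5)
newton_point = newton_step(*start)   # lands exactly on the saddle at (0, 0)

gd_point = start
for _ in range(20):
    gd_point = gd_step(*gd_point)    # x shrinks while |y| grows along the descent direction

print(newton_point, gd_point)
```

Newton jumps to the stationary point in one step because f is quadratic, and that point is the saddle. Gradient descent instead slides away from it along the negative-curvature direction.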

Workarounds exist - K-FAC, Shampoo, Sophia, the natural-gradient family - and a few production runs use them. But the cost-benefit math is so brutal that AdamW (cheap, approximate diagonal preconditioner) wins for almost every workload.

Gradient norms and clipping

The norm of the gradient at step t:

||grad||_2 = sqrt(sum_i grad_i^2)

When this explodes (RNNs, deep transformers without warmup, end of training), parameters take a huge step and the loss spikes or NaNs. Gradient clipping rescales:

if ||grad||_2 > threshold:
    grad = grad * (threshold / ||grad||_2)

This caps the gradient norm (and with it the size of a plain SGD step) while preserving the direction. A threshold of 1.0 is the most common default in transformer training. It is a crude safety net; in practice it triggers a few times early in training and then almost never. If you see it firing constantly, your learning rate is too high or your initialisation is bad.
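
A runnable version of the rule above, sketched over a flat list of floats. The name `clip_by_global_norm` is illustrative; frameworks ship their own utilities (for example `torch.nn.utils.clip_grad_norm_` in PyTorch), which operate on parameter tensors rather than on a list.

```python
import math

def clip_by_global_norm(grad, threshold=1.0):
    """Rescale grad so its L2 norm is at most threshold; direction is unchanged.

    Returns the (possibly rescaled) gradient and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad], norm
    return list(grad), norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], threshold=1.0)
print(norm_before)  # 5.0
print(clipped)      # approximately [0.6, 0.8]
```

Returning the pre-clip norm is a deliberate choice here: logging it is the easiest way to notice that clipping is firing on every step.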

Gradient clipping fixes explosion. It cannot fix vanishing - you cannot rescale your way out of a number that is already zero. Architectural changes (residual connections, LayerNorm, gating) are the actual fix for vanishing gradients.

Directional derivatives and JVPs / VJPs

A directional derivative is the gradient projected onto a direction v:

D_v f(x) = grad f . v

In autodiff jargon this is a JVP (Jacobian-vector product): the Jacobian times a column vector. The transpose operation - a row vector times the Jacobian - is a VJP (vector-Jacobian product). PyTorch's .backward() computes VJPs. JAX exposes both via jax.jvp and jax.vjp. Applying a JVP to the gradient function (itself computed by a VJP) gives you a Hessian-vector product without ever materialising the Hessian - the trick that Hessian-free optimisation and other matrix-free second-order methods build on.
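
To show the Hessian-vector-product idea without depending on a framework, the sketch below uses a tiny hand-rolled dual-number class as the JVP and a hand-written gradient standing in for the VJP. Everything here (`Dual`, `grad_f`, `hvp`) is hypothetical illustration code; in JAX the same composition would typically be written as `jax.jvp` applied to `jax.grad(f)`. The test function is f(x, y) = x^2 * y + y^3, whose Hessian is [[2y, 2x], [2x, 6y]].

```python
class Dual:
    """A number val + tan*eps with eps**2 = 0; tan carries the directional derivative."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)
    __rmul__ = __mul__

def grad_f(x, y):
    """Hand-written gradient of f(x, y) = x**2 * y + y**3."""
    return 2 * x * y, x * x + 3 * y * y

def hvp(x, y, vx, vy):
    """Hessian-vector product H(x, y) @ (vx, vy), computed as a JVP through grad_f."""
    gx, gy = grad_f(Dual(x, vx), Dual(y, vy))
    return gx.tan, gy.tan

print(hvp(1.0, 2.0, 1.0, 0.0))  # first column of the Hessian at (1, 2)
print(hvp(1.0, 2.0, 0.0, 1.0))  # second column of the Hessian at (1, 2)
```

Each call costs one pass through the gradient code, and no 2x2 matrix is ever formed; that is the property that matters when the matrix would be (7e10, 7e10).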

Common pitfalls

Confusing the gradient with the steepest-descent direction in a non-Euclidean metric. Natural-gradient methods use F^{-1} grad where F is the Fisher information. The "right" direction depends on what metric you measure distance in.
Forgetting that softmax + cross-entropy has a beautiful joint gradient with respect to the logits: q - p, where q is the softmax output and p is the target distribution. Computing them separately and composing the gradients is numerically unstable; always use the fused version.
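Gradient checking with finite differences at the wrong scale. The forward difference (f(x+h) - f(x)) / h has O(h) truncation error and O(eps/h) floating-point error, where eps is machine precision, so an h that is too large or too small both give garbage. The central difference (f(x+h) - f(x-h)) / (2h) has O(h^2) truncation error; using it with h = 1e-5 in double precision is the standard sanity check.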
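
The last two pitfalls can be exercised together: the sketch below checks the fused softmax cross-entropy gradient against central differences. It assumes q is the softmax of the logits z, p is a target distribution that sums to 1, and the loss is written in its stable log-sum-exp form, loss = logsumexp(z) - p . z. The function names are illustrative.

```python
import math

def logsumexp(z):
    """Stable log(sum(exp(z))): subtract the max before exponentiating."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def loss(z, p):
    """Cross-entropy of softmax(z) against target p, in fused form."""
    return logsumexp(z) - sum(pi * zi for pi, zi in zip(p, z))

def analytic_grad(z, p):
    """Gradient with respect to the logits: softmax(z) - p."""
    lse = logsumexp(z)
    return [math.exp(v - lse) - pi for v, pi in zip(z, p)]

def central_diff_grad(z, p, h=1e-5):
    """Numerical gradient via (f(z + h e_i) - f(z - h e_i)) / (2 h)."""
    grads = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + h] + z[i + 1:]
        down = z[:i] + [z[i] - h] + z[i + 1:]
        grads.append((loss(up, p) - loss(down, p)) / (2 * h))
    return grads

z = [2.0, -1.0, 0.5]
p = [0.0, 1.0, 0.0]
max_err = max(abs(a - n)
              for a, n in zip(analytic_grad(z, p), central_diff_grad(z, p)))
print(max_err)
```

If the analytic gradient were wrong, the disagreement would be many orders of magnitude larger than the small residual printed here. The fused form also stays finite for extreme logits such as [1000.0, 0.0], where a naive exp would overflow.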