Mathematical Foundations
Probability and Information Theory
Distributions, expectations, entropy, KL, and why softmax + cross-entropy is the canonical pair that secretly underlies almost every LLM loss.
Cross-entropy is the pretraining loss of essentially every language model trained today. Perplexity, the headline number on many LLM benchmarks, is just exp(cross-entropy). The KL penalty that holds RLHF policies near the reference model is a divergence between two probability distributions. If you do not have a working mental model of these quantities, large chunks of modern ML papers will read as opaque ritual.
Discrete and continuous distributions
A probability distribution assigns mass to outcomes. Discrete distributions use a probability mass function that sums to 1 over a countable set:
P(X = k) = p_k, sum_k p_k = 1
Continuous distributions use a probability density function that integrates to 1:
integral p(x) dx = 1
Densities can exceed 1 at a point - they are not probabilities, only probability per unit measure. This trips up most newcomers when comparing log-densities of generative models.
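A quick numerical check makes the density-versus-probability distinction concrete. The sketch below assumes a Gaussian with standard deviation 0.05 (an arbitrary value chosen for illustration) and uses a hand-rolled midpoint-rule integrator; both are illustrative choices, not something the text above prescribes.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=0.05):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

peak_density = gaussian_pdf(0.0)                        # about 7.98, well above 1
total_mass = integrate(gaussian_pdf, -1.0, 1.0)         # close to 1 (covers +/- 20 sigma)
mass_near_peak = integrate(gaussian_pdf, -0.01, 0.01)   # a probability, so below 1

print(peak_density, total_mass, mass_near_peak)
```

The density at the peak is far above 1, yet the mass of any interval, including a narrow one around the peak, stays between 0 and 1.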
Expectation and variance
For a function f and random variable X:
E[f(X)] = sum_k f(k) p_k # discrete
E[f(X)] = integral f(x) p(x) dx # continuous
Var(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2
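As a small worked case of the discrete formulas, here is a sketch for a fair six-sided die. The helper names `expectation` and `variance` are made up for this example, and the distribution is stored as a plain dict from outcome to probability.

```python
def expectation(f, pmf):
    """E[f(X)] for a discrete distribution given as {outcome: probability}."""
    return sum(f(k) * p for k, p in pmf.items())

def variance(pmf):
    """Var(X) computed from the definition E[(X - E[X])^2]."""
    mean = expectation(lambda k: k, pmf)
    return expectation(lambda k: (k - mean) ** 2, pmf)

die = {k: 1 / 6 for k in range(1, 7)}

mean = expectation(lambda k: k, die)                          # 3.5
var_definition = variance(die)                                # 35/12
var_shortcut = expectation(lambda k: k * k, die) - mean ** 2  # E[X^2] - E[X]^2

print(mean, var_definition, var_shortcut)
```

Both routes to the variance agree up to floating-point rounding, which is a handy sanity check when implementing either one.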
Two practical consequences:
- Linearity of expectation: E[X + Y] = E[X] + E[Y] even when X and Y are dependent. This is why the minibatch gradient is an unbiased estimator of the full gradient when examples are sampled uniformly.
- Variance does not add in general; it adds for independent variables. Averaging B independent samples scales gradient variance by 1/B, hence the 1/sqrt(B) standard error scaling.
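The two consequences can be checked with a toy simulation. This is a minimal sketch that assumes a scalar "gradient" with Gaussian per-example noise; `TRUE_GRADIENT`, `NOISE_STD`, and the batch sizes are hypothetical values picked for illustration, not taken from any real training run.

```python
import random
import statistics

random.seed(0)
TRUE_GRADIENT = 2.0   # hypothetical full-batch gradient (scalar for simplicity)
NOISE_STD = 1.0       # hypothetical per-example noise level

def minibatch_gradient(batch_size):
    """Average of batch_size independent noisy per-example gradients."""
    samples = [random.gauss(TRUE_GRADIENT, NOISE_STD) for _ in range(batch_size)]
    return sum(samples) / batch_size

results = {}
for batch_size in (1, 4, 16, 64):
    estimates = [minibatch_gradient(batch_size) for _ in range(2000)]
    results[batch_size] = (statistics.mean(estimates),
                           statistics.pvariance(estimates))

for batch_size, (mean, var) in results.items():
    # mean stays near TRUE_GRADIENT; var * batch_size stays near NOISE_STD ** 2
    print(batch_size, round(mean, 3), round(var * batch_size, 3))
```

The mean of the estimates stays near the true value at every batch size (unbiasedness), while the variance shrinks roughly in proportion to 1/B.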
Entropy
Shannon entropy of a discrete distribution p:
H(p) = - sum_k p_k log p_k
This is the average number of bits (or nats, with natural log) needed to encode samples from p using an optimal code. Over a finite set of outcomes, the uniform distribution has maximum entropy and a deterministic distribution has zero. Temperature scaling in language model sampling is literally entropy control: as temperature approaches 0 the distribution collapses to argmax (zero entropy), while high temperature flattens it toward uniform (max entropy).
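The link between temperature and entropy can be seen directly on a small example. The sketch below assumes a made-up four-token logit vector and measures entropy in nats; the `softmax` and `entropy` helpers are written out by hand rather than taken from a library.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

logits = [2.0, 1.0, 0.5, -1.0]          # hypothetical logits over 4 tokens
temperatures = (0.05, 0.5, 1.0, 2.0, 100.0)
entropies = {t: entropy(softmax(logits, t)) for t in temperatures}
max_entropy = math.log(len(logits))     # entropy of the uniform distribution

for t in temperatures:
    print(t, entropies[t])
print("uniform:", max_entropy)
```

Entropy rises monotonically with temperature here, from nearly 0 at the low end to nearly log(4) at the high end.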
Cross-entropy and KL divergence
Cross-entropy of q relative to p:
H(p, q) = - sum_k p_k log q_k
This is the number of bits you actually spend if you encode samples from p using a code optimised for q. It is minimised when q = p. KL divergence is the excess:
KL(p || q) = H(p, q) - H(p) = sum_k p_k log (p_k / q_k)
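The decomposition H(p, q) = H(p) + KL(p || q) is easy to verify numerically. A minimal sketch, assuming two arbitrary three-outcome distributions chosen for illustration and natural logarithms (nats):

```python
import math

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k; assumes q_k > 0 wherever p_k > 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0.0)

def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k log(p_k / q_k); same assumption on q."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

p = [0.7, 0.2, 0.1]   # illustrative "true" distribution
q = [0.5, 0.3, 0.2]   # illustrative model distribution

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)

print(h_pq, h_p + kl_pq)   # the two numbers agree up to rounding
```

Setting q = p makes the KL term vanish, leaving the cross-entropy equal to the entropy of p, its minimum over q.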
KL is always non-negative and zero iff p = q. It is not symmetric and not a metric. The asymmetry matters in practice:
- KL(p_data || p_model) (forward KL) penalises the model for putting low mass where data has high mass. Encourages mode-covering.
- KL(p_model || p_data) (reverse KL) penalises the model for putting high mass where data has low mass. Encourages mode-seeking.
Maximum-likelihood training minimises forward KL. RLHF KL penalties typically use the reverse direction, KL(policy || reference), because you want the policy to stay inside the reference's support.
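A toy comparison shows the two directions disagreeing about which approximation is better. The distributions below are invented for illustration: `p_data` is bimodal over four outcomes, `q_cover` spreads mass evenly, and `q_seek` commits to a single mode.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

p_data = [0.49, 0.01, 0.01, 0.49]    # two modes, at the first and last outcome
q_cover = [0.25, 0.25, 0.25, 0.25]   # covers both modes, wastes mass in between
q_seek = [0.97, 0.01, 0.01, 0.01]    # locks onto one mode, ignores the other

forward = {name: kl_divergence(p_data, q)
           for name, q in (("cover", q_cover), ("seek", q_seek))}
reverse = {name: kl_divergence(q, p_data)
           for name, q in (("cover", q_cover), ("seek", q_seek))}

print("forward KL(p_data || q):", forward)   # smaller for q_cover
print("reverse KL(q || p_data):", reverse)   # smaller for q_seek
```

Forward KL prefers the covering candidate because missing a data mode is expensive; reverse KL prefers the mode-seeking candidate because placing mass where the data has little is expensive.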
Cross-entropy is negative log-likelihood
If your "true" distribution p is a one-hot label, cross-entropy collapses to:
H(p, q) = - log q_{correct_class}
This is the negative log-likelihood of the correct class. The softmax + cross-entropy loss in a transformer and the autoregressive language modelling loss are exactly this quantity, averaged over examples or tokens, and the policy gradient log-prob term is built from the same log-probability.
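The collapse to a single term can be checked in a few lines. This sketch assumes a three-class prediction `q` with made-up probabilities and compares the full cross-entropy sum against the direct negative log-probability of the correct class.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k, skipping terms where p_k = 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0.0)

q = [0.1, 0.7, 0.2]   # hypothetical model probabilities over 3 classes
correct_class = 1
one_hot = [1.0 if k == correct_class else 0.0 for k in range(len(q))]

full_sum = cross_entropy(one_hot, q)
nll = -math.log(q[correct_class])

print(full_sum, nll)   # identical: only the correct class contributes
```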
Perplexity
perplexity = exp(cross_entropy_in_nats)
If a model has perplexity 20 on held-out text, it is on average "as confused as if it had to choose uniformly among 20 tokens at each step." Lower is better. The exponential scale is why a perplexity drop from 30 to 20 is a much bigger deal than a drop from 100 to 90.
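Both claims can be sanity-checked numerically. The sketch below assumes perplexity is computed from the probabilities a model assigned to the observed tokens; the `perplexity` helper and the 50-token toy sequence are illustrative.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns probability 1/20 to every observed token.
uniform_20 = perplexity([1 / 20] * 50)

# Cross-entropy saved (in nats per token) by each perplexity drop.
gain_30_to_20 = math.log(30) - math.log(20)    # about 0.405
gain_100_to_90 = math.log(100) - math.log(90)  # about 0.105

print(uniform_20, gain_30_to_20, gain_100_to_90)
```

In cross-entropy terms the 30 to 20 drop saves nearly four times as many nats per token as the 100 to 90 drop, even though both are a 10-point change in perplexity.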
Softmax + cross-entropy as the canonical pair
Why is the softmax-then-cross-entropy combo everywhere? Two reasons:
- It is the maximum-entropy distribution given linear constraints. Softmax over logits is the distribution with maximum entropy subject to the expected logit matching a target. There is a derivation route from exponential families that lands here naturally.
- The gradient is beautiful. For softmax output q and one-hot target p:
dL/dlogits = q - p
Once q is computed, the gradient needs no further exponentials, divisions, or special cases. This is part of why training is numerically well-behaved, and why frameworks typically fuse the two operations into one kernel.
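The q - p gradient can be verified against finite differences. This is a minimal sketch with hand-written helpers and an arbitrary logit vector; the step size `eps` is an assumption chosen to balance truncation and rounding error.

```python
import math

def softmax(logits):
    m = max(logits)                       # shift by the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, target):
    """Cross-entropy against a one-hot target: -log q[target]."""
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    """dL/dlogits = q - p, with p the one-hot target."""
    q = softmax(logits)
    return [qk - (1.0 if k == target else 0.0) for k, qk in enumerate(q)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        grads.append((loss(up, target) - loss(down, target)) / (2 * eps))
    return grads

logits, target = [1.5, -0.3, 0.8, 0.1], 2
g_analytic = analytic_grad(logits, target)
g_numeric = numeric_grad(logits, target)
max_error = max(abs(a - n) for a, n in zip(g_analytic, g_numeric))
print(g_analytic, max_error)
```

The gradient components sum to zero because q and p both sum to 1, so adding a constant to all logits leaves the loss unchanged.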
Mutual information
I(X; Y) = KL(p(x, y) || p(x) p(y)) = H(X) - H(X | Y)
How much knowing Y reduces uncertainty about X. Zero iff X and Y are independent. Used in:
- InfoNCE contrastive losses (SimCLR, CLIP) - a lower bound on I(view_1; view_2).
- Information bottleneck explanations of representation learning.
- Feature selection via mutual information ranking.
Estimating mutual information from samples is notoriously hard in high dimensions; the InfoNCE bound sidesteps this by only requiring a discriminator.
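For a small discrete joint table, by contrast, mutual information can be evaluated exactly from its definition. The sketch below assumes a made-up 2x2 joint distribution (rows index X, columns index Y) and checks that the KL form and the entropy form give the same number.

```python
import math

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def mutual_information(joint):
    """I(X; Y) = KL(p(x, y) || p(x) p(y)) for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0.0)

def conditional_entropy_x_given_y(joint):
    """H(X | Y) = sum_y p(y) H(X | Y = y)."""
    total = 0.0
    for col in zip(*joint):
        py = sum(col)
        if py > 0.0:
            total += py * entropy([pxy / py for pxy in col])
    return total

dependent = [[0.4, 0.1], [0.1, 0.4]]          # X and Y usually agree
independent = [[0.25, 0.25], [0.25, 0.25]]    # product of two fair coins

mi_kl = mutual_information(dependent)
h_x = entropy([sum(row) for row in dependent])
mi_entropy = h_x - conditional_entropy_x_given_y(dependent)
mi_indep = mutual_information(independent)

print(mi_kl, mi_entropy, mi_indep)
```

This exact computation needs the full joint table, which is what becomes infeasible in high dimensions.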
Common pitfalls
- Confusing density with probability. p(x) = 5 at a point is fine for a continuous distribution; it just means high density there. You can only get probabilities by integrating.
- Comparing log-likelihoods across different parameterisations. Change of variables introduces a Jacobian term. Flow-based generative models hinge on tracking this correctly.
- KL with zero denominators. KL(p || q) blows up if q(x) = 0 where p(x) > 0. Always smooth your distributions or add a small epsilon when computing KL on empirical estimates.
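The zero-denominator pitfall and one possible remedy can be sketched as follows. Additive smoothing is only one of several options, and the counts and the `epsilon` value here are hypothetical; the `smooth` helper is written for this example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats, returning inf when q lacks support that p has."""
    total = 0.0
    for pk, qk in zip(p, q):
        if pk == 0.0:
            continue            # 0 * log(0 / q_k) is taken as 0
        if qk == 0.0:
            return math.inf     # p has mass where q has none
        total += pk * math.log(pk / qk)
    return total

def smooth(counts, epsilon=1e-3):
    """Additive smoothing: add epsilon to every count, then renormalise."""
    total = sum(counts) + epsilon * len(counts)
    return [(c + epsilon) / total for c in counts]

p_counts = [5, 3, 2]   # hypothetical empirical counts for p
q_counts = [6, 4, 0]   # q never observed the third outcome

p = [c / sum(p_counts) for c in p_counts]
q_raw = [c / sum(q_counts) for c in q_counts]
q_smoothed = smooth(q_counts)

kl_raw = kl_divergence(p, q_raw)            # infinite
kl_smoothed = kl_divergence(p, q_smoothed)  # finite, but depends on epsilon

print(kl_raw, kl_smoothed)
```

The smoothed value is sensitive to the choice of epsilon, so it is worth reporting the smoothing constant alongside any KL computed this way.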
Further reading
- Visual Information Theory - Christopher Olah on entropy, cross-entropy, KL, with the clearest pictures available.
- Deep Learning Book - Chapter 3: Probability and Information Theory - Goodfellow, Bengio, Courville.
- Flow-based Deep Generative Models - Lilian Weng; uses change-of-variables and KL throughout.