Probability and Information Theory

Cross-entropy is the training loss of essentially every language model in use today. Perplexity, a standard language-modelling metric, is just exp(cross-entropy). The KL penalty that holds RLHF policies near the reference model is a divergence between two probability distributions. If you do not have a working mental model of these quantities, large chunks of modern ML papers will read as opaque ritual.

Discrete and continuous distributions

A probability distribution assigns mass to outcomes. Discrete distributions use a probability mass function that sums to 1 over a countable set:

P(X = k) = p_k,   sum_k p_k = 1

Continuous distributions use a probability density function that integrates to 1:

integral p(x) dx = 1

Densities can exceed 1 at a point - they are not probabilities, only probability per unit measure. This trips up most newcomers when comparing log-densities of generative models.
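To make the density-versus-probability point concrete, here is a minimal sketch using a Gaussian with a small standard deviation. The helper name normal_pdf and the choice sigma = 0.1 are illustrative assumptions, not anything prescribed above. The density at the mean is well above 1, yet a simple midpoint Riemann sum still integrates to approximately 1.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

sigma = 0.1                          # illustrative choice: a narrow Gaussian
peak = normal_pdf(0.0, sigma=sigma)  # density at the mean

# Midpoint Riemann sum over [-1, 1], i.e. ten standard deviations each side.
n = 20000
dx = 2.0 / n
total = sum(normal_pdf(-1.0 + (i + 0.5) * dx, sigma=sigma) * dx for i in range(n))

print(f"peak density = {peak:.3f}")   # greater than 1
print(f"integral     = {total:.6f}")  # approximately 1
```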

Expectation and variance

For a function f and random variable X:

E[f(X)] = sum_k f(k) p_k        # discrete
E[f(X)] = integral f(x) p(x) dx  # continuous

Var(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2
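As a quick check of the formulas above, the following sketch computes the mean and variance of a fair six-sided die both ways. The die and the helper name expectation are illustrative choices.

```python
# Fair six-sided die: outcomes 1..6, each with mass 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
pmf = [1.0 / 6.0] * 6

def expectation(f, outcomes, pmf):
    """E[f(X)] = sum_k f(k) p_k for a discrete distribution."""
    return sum(f(k) * p for k, p in zip(outcomes, pmf))

mean = expectation(lambda k: k, outcomes, pmf)

# Variance as the expected squared deviation from the mean ...
var_central = expectation(lambda k: (k - mean) ** 2, outcomes, pmf)
# ... and as E[X^2] - E[X]^2. The two should agree up to rounding.
var_moments = expectation(lambda k: k * k, outcomes, pmf) - mean ** 2

print(mean, var_central, var_moments)
```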

Two practical consequences:

Linearity of expectation: E[X + Y] = E[X] + E[Y] even when X and Y are dependent. This is why the gradient of a uniformly sampled minibatch is an unbiased estimator of the full-dataset gradient.
Variance does not add in general; it adds for independent variables. Averaging B independent per-example gradients divides the variance of the estimate by B, hence the 1/sqrt(B) standard error scaling.
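Both consequences can be seen in a small simulation. This is only a sketch under simplifying assumptions: the "gradient" is a scalar, per-example gradients are independent Gaussian draws around a hypothetical true value TRUE_GRAD, and the names minibatch_grad and summarize are made up for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is reproducible
TRUE_GRAD = 2.0         # hypothetical full-dataset gradient (scalar for simplicity)

def minibatch_grad(batch_size):
    """Average of batch_size independent noisy per-example gradients."""
    samples = [rng.gauss(TRUE_GRAD, 1.0) for _ in range(batch_size)]
    return sum(samples) / batch_size

def summarize(batch_size, trials=4000):
    """Empirical mean and variance of the minibatch estimate over many draws."""
    estimates = [minibatch_grad(batch_size) for _ in range(trials)]
    return statistics.mean(estimates), statistics.pvariance(estimates)

results = {b: summarize(b) for b in (1, 4, 16, 64)}
for b, (mean, var) in results.items():
    # The mean stays near TRUE_GRAD (unbiased); var * B stays near 1 (variance ~ 1/B).
    print(f"B={b:3d}  mean={mean:.3f}  var={var:.4f}  var*B={var * b:.3f}")
```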

Entropy

Shannon entropy of a discrete distribution p:

H(p) = - sum_k p_k log p_k

Entropy is the average number of bits (or nats, with natural log) needed to encode samples from p using an optimal code, with the convention 0 log 0 = 0. The uniform distribution has maximum entropy (log K for K outcomes); deterministic distributions have zero. Temperature scaling in language model sampling is literally entropy control: as temperature approaches 0, sampling collapses to argmax (zero entropy), while high temperature flattens the distribution toward uniform (max entropy).
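The link between temperature and entropy is easy to check numerically. The sketch below uses made-up logits and a plain-Python softmax; the function names are illustrative and not taken from any particular library.

```python
import math

def entropy(p):
    """Shannon entropy in nats; terms with p_k = 0 contribute 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up logits for four tokens

for t in (0.1, 0.5, 1.0, 2.0, 10.0):
    h = entropy(softmax(logits, t))
    print(f"T={t:5.1f}  entropy={h:.4f} nats")

print(f"uniform limit: log(4) = {math.log(4):.4f} nats")
```

Entropy rises monotonically with temperature here, from near 0 at low temperature toward log(4) at high temperature.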

Cross-entropy and KL divergence

Cross-entropy of q relative to p:

H(p, q) = - sum_k p_k log q_k

This is the average number of bits you actually spend if you encode samples from p using a code optimised for q. For fixed p it is minimised at q = p, where it equals H(p). KL divergence is the excess:

KL(p || q) = H(p, q) - H(p) = sum_k p_k log (p_k / q_k)
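A minimal sketch of these three quantities on two made-up distributions, checking the identity KL(p || q) = H(p, q) - H(p) and printing KL in both directions. The values of p and q are arbitrary illustrations.

```python
import math

def entropy(p):
    """H(p) in nats."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def cross_entropy(p, q):
    """H(p, q) in nats; assumes q_k > 0 wherever p_k > 0."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0.0)

def kl(p, q):
    """KL(p || q) in nats; assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

p = [0.7, 0.2, 0.1]  # illustrative "true" distribution
q = [0.5, 0.3, 0.2]  # illustrative model distribution

print(f"H(p)       = {entropy(p):.4f}")
print(f"H(p, q)    = {cross_entropy(p, q):.4f}")
print(f"KL(p || q) = {kl(p, q):.4f}")
print(f"KL(q || p) = {kl(q, p):.4f}")  # differs from KL(p || q)
```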

KL is always non-negative and zero iff p = q. It is not symmetric and not a metric. The asymmetry matters in practice:

KL(p_data || p_model) (forward KL) penalises the model for putting low mass where data has high mass. Encourages mode-covering.
KL(p_model || p_data) (reverse KL) penalises the model for putting high mass where data has low mass. Encourages mode-seeking.

Maximum-likelihood training minimises forward KL. RLHF KL penalties typically use the reverse direction, KL(policy || reference), because you want the policy to stay inside the reference's support.
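The mode-covering versus mode-seeking behaviour can be illustrated with a toy grid search. This is a sketch under strong assumptions: the target is a made-up two-mode mixture on a discretised grid, the model family is a single discretised Gaussian, and a brute-force search over (mu, sigma) stands in for real optimisation. All names are hypothetical.

```python
import math

xs = [-8.0 + 0.1 * i for i in range(161)]  # grid on [-8, 8]

def discretized_gaussian(mu, sigma):
    """Gaussian evaluated on the grid and renormalised to sum to 1."""
    w = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]
    total = sum(w)
    return [wi / total for wi in w]

def kl(p, q):
    """KL(p || q) in nats; infinite if q has zero mass where p does not."""
    total = 0.0
    for pk, qk in zip(p, q):
        if pk > 0.0:
            if qk == 0.0:
                return math.inf
            total += pk * (math.log(pk) - math.log(qk))
    return total

# Target: equal mixture of two narrow modes at -3 and +3.
left = discretized_gaussian(-3.0, 0.5)
right = discretized_gaussian(3.0, 0.5)
target = [0.5 * a + 0.5 * b for a, b in zip(left, right)]

# Candidate models: one Gaussian, mu in [-4, 4], sigma in [0.3, 4.0].
candidates = [(-4.0 + 0.5 * i, 0.3 + 0.1 * j) for i in range(17) for j in range(38)]

# Forward KL(target || model) versus reverse KL(model || target).
forward_best = min(candidates, key=lambda c: kl(target, discretized_gaussian(*c)))
reverse_best = min(candidates, key=lambda c: kl(discretized_gaussian(*c), target))

print("forward KL picks (mu, sigma) =", forward_best)  # wide, centred between modes
print("reverse KL picks (mu, sigma) =", reverse_best)  # narrow, sitting on one mode
```

In this toy setup the forward-KL fit spreads over both modes, while the reverse-KL fit commits to a single mode.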

Cross-entropy is negative log-likelihood

If your "true" distribution p is a one-hot label, cross-entropy collapses to:

H(p, q) = - log q_{correct_class}

This is the negative log-likelihood of the correct class. The standard classification losses are built from this same quantity: softmax + cross-entropy in a transformer and the autoregressive language modelling loss are both averages of it, and the log-prob term in a policy gradient is the same log q evaluated at the sampled action.
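The collapse to a single log term can be verified directly. In this sketch the predicted probabilities q and the index correct_class are made-up values.

```python
import math

def cross_entropy(p, q):
    """H(p, q) in nats; terms with p_k = 0 contribute nothing."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0.0)

q = [0.1, 0.7, 0.2]  # model's predicted class probabilities (made up)
correct_class = 1
one_hot = [1.0 if k == correct_class else 0.0 for k in range(len(q))]

full = cross_entropy(one_hot, q)       # sum over all classes
shortcut = -math.log(q[correct_class])  # only the correct class survives

print(full, shortcut)
```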

Perplexity

perplexity = exp(cross_entropy_in_nats)

If a model has perplexity 20 on held-out text, it is on average "as confused as if it had to choose uniformly among 20 tokens at each step." Lower is better. The exponential scale is why a perplexity drop from 30 to 20 (about 0.41 nats per token) is a much bigger deal than a drop from 100 to 90 (about 0.11 nats per token).
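A short sketch of the perplexity calculation, assuming we already have the probability the model assigned to each observed token. The per-token probabilities below are hypothetical, and the function name perplexity is illustrative.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that is uniform over 20 tokens at every step has perplexity 20.
uniform_20 = perplexity([1.0 / 20.0] * 50)

# Hypothetical probabilities assigned to four observed tokens.
mixed = perplexity([0.5, 0.25, 0.125, 0.5])

# Cross-entropy improvement (nats per token) behind two perplexity drops.
gain_30_to_20 = math.log(30.0) - math.log(20.0)
gain_100_to_90 = math.log(100.0) - math.log(90.0)

print(f"uniform over 20: {uniform_20:.2f}")
print(f"mixed example:   {mixed:.4f}")
print(f"30 -> 20:  {gain_30_to_20:.3f} nats")
print(f"100 -> 90: {gain_100_to_90:.3f} nats")
```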

Softmax + cross-entropy as the canonical pair

Why is the softmax-then-cross-entropy combo everywhere? Two reasons:

It is the maximum-entropy distribution given linear constraints. Softmax over logits is the distribution with maximum entropy subject to the expected logit matching a target. There is a derivation route from exponential families that lands here naturally.
The gradient is beautiful. For softmax output q and one-hot target p:

dL/dlogits = q - p

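Once q has been computed in the forward pass, the backward pass needs no further exponentials, divisions, or special cases. This simple gradient, together with the numerical stability of computing log-softmax via log-sum-exp, is why frameworks typically fuse the two operations into one.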
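The q - p gradient can be checked against central finite differences. This sketch assumes a one-hot target given as a class index and computes the loss via log-sum-exp; the logits are arbitrary, and the function names are illustrative rather than any framework's API.

```python
import math

def softmax(logits):
    """Softmax with the max subtracted for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, target):
    """Cross-entropy with a one-hot target: log-sum-exp(logits) - logits[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

logits = [2.0, -1.0, 0.5]  # arbitrary example logits
target = 2                 # index of the correct class

# Analytic gradient: softmax output minus one-hot target.
q = softmax(logits)
analytic = [qk - (1.0 if k == target else 0.0) for k, qk in enumerate(q)]

# Numerical gradient by central differences.
eps = 1e-6
numeric = []
for k in range(len(logits)):
    up = list(logits)
    down = list(logits)
    up[k] += eps
    down[k] -= eps
    numeric.append((loss(up, target) - loss(down, target)) / (2 * eps))

print("analytic:", analytic)
print("numeric: ", numeric)
```

Because the loss is computed through log-sum-exp, large logits such as 1000.0 do not overflow.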

Mutual information

I(X; Y) = KL(p(x, y) || p(x) p(y)) = H(X) - H(X | Y)

Mutual information measures how much knowing Y reduces uncertainty about X. It is zero iff X and Y are independent. Used in:

InfoNCE contrastive losses (SimCLR, CLIP) - a lower bound on I(view_1; view_2).
Information bottleneck explanations of representation learning.
Feature selection via mutual information ranking.

Estimating mutual information from samples is notoriously hard in high dimensions; the InfoNCE bound sidesteps this by only requiring a discriminator.
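For small discrete tables, where the joint distribution is known exactly, the definition at the top of this section can be computed directly. The sketch below uses two made-up 2x2 joint tables (one dependent, one independent) and checks that the KL form agrees with H(X) - H(X | Y). It says nothing about the hard high-dimensional case.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)

def mutual_information(joint):
    """I(X; Y) as KL(p(x, y) || p(x) p(y)) for a table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0.0
    )

def conditional_entropy_x_given_y(joint):
    """H(X | Y) = sum_y p(y) H(X | Y = y)."""
    total = 0.0
    for col in zip(*joint):
        py = sum(col)
        if py > 0.0:
            total += py * entropy([v / py for v in col])
    return total

dependent = [[0.4, 0.1], [0.1, 0.4]]  # X and Y tend to agree
independent = [[0.3 * 0.6, 0.3 * 0.4], [0.7 * 0.6, 0.7 * 0.4]]  # product of marginals

for name, joint in (("dependent", dependent), ("independent", independent)):
    hx = entropy([sum(row) for row in joint])
    mi_kl = mutual_information(joint)
    mi_h = hx - conditional_entropy_x_given_y(joint)
    print(f"{name:12s} I via KL = {mi_kl:.6f}   H(X) - H(X|Y) = {mi_h:.6f}")
```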

Common pitfalls

Confusing density with probability. p(x) = 5 at a point is fine for a continuous distribution; it just means high density there. You can only get probabilities by integrating.
Comparing log-likelihoods across different parameterisations. Change of variables introduces a Jacobian term. Flow-based generative models hinge on tracking this correctly.
KL with zero denominators. KL(p || q) blows up if q(x) = 0 where p(x) > 0. Always smooth your distributions or add a small epsilon when computing KL on empirical estimates.
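The zero-denominator pitfall and the epsilon fix can be sketched as follows. The counts are made up, and additive smoothing with a hand-picked eps is just one possible choice; note that the smoothed KL value depends on that choice.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; infinite if q has zero mass where p does not."""
    total = 0.0
    for pk, qk in zip(p, q):
        if pk > 0.0:
            if qk == 0.0:
                return math.inf
            total += pk * (math.log(pk) - math.log(qk))
    return total

def normalize(counts):
    """Raw empirical frequencies."""
    total = sum(counts)
    return [c / total for c in counts]

def smooth(counts, eps=1e-3):
    """Additive smoothing: add eps to every count, then normalise."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

counts_p = [5, 3, 2, 0]  # hypothetical empirical counts for p
counts_q = [6, 4, 0, 0]  # q never observed outcome 2, which p did

raw = kl(normalize(counts_p), normalize(counts_q))  # blows up to infinity
smoothed = kl(smooth(counts_p), smooth(counts_q))   # finite

print("raw KL:     ", raw)
print("smoothed KL:", smoothed)
# A smaller eps gives a larger value, so report eps alongside the result.
print("eps = 1e-6: ", kl(smooth(counts_p, 1e-6), smooth(counts_q, 1e-6)))
```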