Reward Models as Learned Rewards

When OpenAI released InstructGPT in 2022, a 1.3 billion parameter model outperformed the raw 175 billion parameter GPT-3 on human preference evaluations. The size gap is a distraction; the real story is the reward signal. In its final training stage, the model was not optimised against a fixed objective like next-token cross-entropy. It was trained to maximise a score produced by a second neural network, one that had learned, from roughly 50,000 human comparisons, what people judge to be "a good answer". That second network is the reward model, and understanding it is central to understanding modern LLM alignment.

What a reward model actually is

A reward model (RM) is a language model with its final unembedding head replaced by a single linear layer projecting to a scalar. Given a prompt \(x\) and a completion \(y\), it outputs \(r_\theta(x, y) \in \mathbb{R}\). Nothing about the architecture forces the output to mean anything useful; that meaning is instilled by training.
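To make the "single linear layer projecting to a scalar" concrete, here is a minimal sketch in plain Python. The backbone is replaced by a random vector standing in for a hidden state, and reading the score from the final token's hidden state is an assumption (a common choice, but not something stated above). The names `scalar_reward_head`, `last_hidden`, and `weights` are hypothetical.

```python
import random

def scalar_reward_head(last_hidden, weights, bias=0.0):
    """Project one hidden-state vector to a single scalar reward."""
    if len(last_hidden) != len(weights):
        raise ValueError("hidden state and weight vector must have the same length")
    return sum(h * w for h, w in zip(last_hidden, weights)) + bias

# Toy stand-in for the backbone's hidden state at the final token
# (hypothetical hidden size of 8; real models use thousands of dimensions).
rng = random.Random(0)
hidden_size = 8
last_hidden = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]

# Freshly initialised head: small random weights, zero bias.
weights = [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]

reward = scalar_reward_head(last_hidden, weights)
print(f"untrained reward: {reward:.4f}")
```

The printed number is arbitrary, which is the point: before training on preferences, the scalar carries no information about quality.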

Training uses a preference dataset of triples \((x, y_w, y_l)\), where \(y_w\) is the human-preferred completion and \(y_l\) the dispreferred one. The Bradley-Terry model gives the probability that \(y_w\) is preferred as:

\[P(y_w \succ y_l \mid x) = \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\]
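As a quick numerical illustration of the formula above, the sketch below maps a score gap to a preference probability. The function name `preference_probability` and the example scores are made up for illustration.

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the first completion is preferred."""
    gap = reward_chosen - reward_rejected
    # Numerically stable logistic sigmoid of the score gap.
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    z = math.exp(gap)
    return z / (1.0 + z)

# Equal scores give a coin flip; larger gaps give more confident predictions.
for gap in (0.0, 1.0, 3.0):
    print(gap, round(preference_probability(gap, 0.0), 3))

# Only the difference matters: shifting both scores by 10 changes nothing.
print(preference_probability(2.0, 1.0) == preference_probability(12.0, 11.0))
```

A gap of 1 corresponds to about a 73% preference probability and a gap of 3 to about 95%, so the scale of the scores has a direct probabilistic reading.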

The training loss is the negative log-likelihood of the observed human choices:

\[\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]\]
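The loss above can be sketched in a few lines of plain Python, with the expectation replaced by a batch mean. This is an illustrative sketch operating on precomputed scores, not a training loop; `log_sigmoid` and `pairwise_loss` are hypothetical helper names.

```python
import math

def log_sigmoid(z):
    """Stable log(sigmoid(z)), avoiding overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pairwise_loss(chosen_rewards, rejected_rewards):
    """Mean negative log-likelihood of the observed human preferences."""
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need two non-empty score lists of equal length")
    losses = [
        -log_sigmoid(r_w - r_l)
        for r_w, r_l in zip(chosen_rewards, rejected_rewards)
    ]
    return sum(losses) / len(losses)

# An RM that cannot tell the pair apart pays log(2) per comparison.
print(pairwise_loss([0.0], [0.0]))

# Ranking the pair correctly lowers the loss; ranking it wrongly raises it.
print(pairwise_loss([2.0], [0.0]) < pairwise_loss([0.0], [2.0]))
```

The value log(2), roughly 0.693, is a useful baseline: a reward model whose validation loss sits near it has learned nothing about the preferences.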

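This is structurally identical to the binary cross-entropy loss of a logistic classifier whose logit is the score difference and whose label is always "the preferred response wins". The RM learns to assign higher scores to responses humans preferred without ever receiving a direct label of "good" or "bad"; it sees only relative comparisons.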
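A short derivation shows what this loss pushes on. Writing \(\Delta = r_\theta(x, y_w) - r_\theta(x, y_l)\) as shorthand for the score gap (notation introduced here, not in the text above), the gradient of the per-pair loss is:

\[\nabla_\theta\!\left[-\log \sigma(\Delta)\right] = -\left(1 - \sigma(\Delta)\right)\left(\nabla_\theta r_\theta(x, y_w) - \nabla_\theta r_\theta(x, y_l)\right)\]

The factor \(1 - \sigma(\Delta)\) is the probability the RM currently assigns to the wrong ordering, so pairs it already ranks correctly by a wide margin contribute almost no gradient, while misranked pairs dominate the update. Because only \(\Delta\) appears, adding the same constant to every score leaves the loss unchanged: the RM's absolute scale is not pinned down by the data, only its differences.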

Initialisation matters. The RM is typically initialised from the same pretrained (or SFT) checkpoint used for the policy. This is not just engineering convenience: a model that already understands language can use its representations to reason about quality, helpfulness, and factual consistency. Starting from a randomly initialised network would require far more comparison data to achieve the same quality.

From pairwise comparisons to a policy

Once the RM is trained, the RLHF pipeline uses it as the reward function for a policy gradient update. The policy \(\pi_\phi\) (the language model being aligned, written with parameters \(\phi\) to keep it distinct from the RM's \(\theta\)) generates a completion \(y\) for prompt \(x\), the RM scores it, and the score becomes the reward in a Markov decision process where each token is an action.

The objective is not to maximise \(r_\theta\) naively. Unconstrained maximisation tends to exploit flaws in the RM, producing unnatural text that scores high but satisfies nobody. The standard fix is a KL-regularised objective:

\[\mathcal{J}(\phi) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\phi(\cdot|x)}\!\left[r_\theta(x, y) - \beta \cdot \mathrm{KL}\!\left[\pi_\phi(\cdot|x) \,\|\, \pi_\mathrm{ref}(\cdot|x)\right]\right]\]

Here \(\pi_\mathrm{ref}\) is the frozen SFT model, and \(\beta\) is a hyperparameter controlling how far the policy can drift. Intuitively, the KL term is a leash: every nat of divergence from the reference distribution costs \(\beta\) units of reward, so the policy moves away from the reference only where the RM score rises enough to pay for it. This prevents the policy from collapsing into degenerate high-scoring outputs while still improving on the base model.
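The quantity inside the expectation can be estimated from a single sampled completion, because for \(y \sim \pi_\phi\) the log-ratio \(\log \pi_\phi(y|x) - \log \pi_\mathrm{ref}(y|x)\) is an unbiased sample of the KL term, and each sequence log-probability is a sum of per-token log-probabilities. The sketch below assumes those per-token values have already been computed by both models; the function name and all numbers are hypothetical.

```python
def kl_penalised_reward(rm_score, policy_logprobs, ref_logprobs, beta):
    """Single-sample estimate of r(x, y) - beta * KL(policy || reference)."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need one log-probability per generated token from each model")
    # Summing per-token log-ratios gives log pi_phi(y|x) - log pi_ref(y|x).
    log_ratio = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * log_ratio

# Hypothetical three-token completion that the policy likes far more
# than the reference model does.
policy_logprobs = [-0.2, -0.5, -0.1]
ref_logprobs = [-0.9, -0.6, -1.2]

shaped = kl_penalised_reward(1.8, policy_logprobs, ref_logprobs, beta=0.1)
print(round(shaped, 2))  # 1.8 - 0.1 * 1.9 = 1.61
```

With these numbers the completion has drifted 1.9 nats from the reference, which costs 0.19 of its 1.8 RM score. Raising \(\beta\) makes the same drift more expensive; some implementations apply the penalty token by token rather than once per sequence, which sums to the same total.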
