Reward Modelling in Practice

OpenAI's InstructGPT paper (Ouyang et al., 2022) reported that human raters preferred outputs from a 1.3B-parameter model trained with RLHF over outputs from the raw 175B GPT-3. The reward model sitting in the middle of that pipeline is a large part of the reason. Without a reliable reward signal, reinforcement learning from human feedback collapses to noise. Getting that signal right is the central engineering challenge of alignment tuning.

What a reward model actually is

A reward model (RM) is a language model with its final token-prediction head replaced by a scalar regression head. Given a prompt x and a completion y, it outputs a single number r(x, y) that estimates how much a human rater would prefer that completion.

Training uses pairwise comparisons. Human labellers see the same prompt with two different completions and pick the better one. If y_w is the winner and y_l the loser, the standard Bradley-Terry loss is:

L = -log σ( r(x, y_w) - r(x, y_l) )

The model is pushed to assign a higher score to the preferred completion. Because the loss only requires relative ranking, you never need labellers to agree on an absolute quality scale. That sidesteps a hard measurement problem.
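As a minimal sketch of the loss above, the snippet below evaluates it for a single comparison, taking the two scalar scores as plain floats. The function name and the example scores are illustrative assumptions; a real trainer would compute this over batches of tensors.

```python
import math

def bradley_terry_loss(r_w: float, r_l: float) -> float:
    """Return -log(sigmoid(r_w - r_l)) for one comparison pair."""
    d = r_w - r_l
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch on the sign of d
    # so that exp() is never called with a large positive argument.
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

# Illustrative scores only.
print(bradley_terry_loss(2.0, 0.5))   # winner scored higher: small loss
print(bradley_terry_loss(0.5, 2.0))   # winner scored lower: large loss
print(bradley_terry_loss(12.0, 10.5)) # same gap as the first pair
```

Adding a constant to both scores leaves the loss unchanged, which is the sense in which only relative ranking matters.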

In practice the RM is usually initialised from the supervised fine-tuned (SFT) model rather than the base pretrained model. The SFT model has already been trained on text that resembles the completions being scored; starting there gives the RM a better prior, so it needs fewer comparison pairs to reach useful accuracy.

Building the comparison dataset

The quality of the comparison data dominates RM quality. Several practical observations have hardened into consensus:

Dimension	Common failure	Better practice
Labeller agreement	Disagreement rate > 30% makes the signal noisy	Filter or bucket by agreement; down-weight low-agreement pairs or send them for re-annotation rather than treating them as confident labels
Prompt diversity	Clustering around popular request types	Active sampling to cover tail distributions
Completion source	Only SFT completions	Mix in adversarial, random, and early-policy completions
Margin signal	Binary choice only	Collect intensity ratings ("much better" vs. "slightly better") and use them to weight the loss
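
Two of the practices in the table, filtering by agreement and weighting the loss by intensity, can be sketched in a few lines. The record fields, the 0.7 agreement threshold (the complement of the 30% disagreement figure), and the weight values are all assumptions chosen for illustration, not values from any published pipeline.

```python
import math

# Hypothetical mapping from intensity label to loss weight.
INTENSITY_WEIGHT = {"slightly better": 0.5, "better": 0.75, "much better": 1.0}

def filter_by_agreement(pairs, min_agreement=0.7):
    """Keep only pairs whose labeller agreement meets the threshold."""
    return [p for p in pairs if p["agreement"] >= min_agreement]

def weighted_pair_loss(r_w, r_l, intensity):
    """Pairwise loss -log(sigmoid(r_w - r_l)), scaled by an intensity weight."""
    d = r_w - r_l
    base = math.log1p(math.exp(-d)) if d >= 0 else -d + math.log1p(math.exp(d))
    return INTENSITY_WEIGHT[intensity] * base

# Made-up comparison records.
pairs = [
    {"r_w": 1.2, "r_l": 0.3, "intensity": "much better", "agreement": 0.9},
    {"r_w": 0.4, "r_l": 0.5, "intensity": "slightly better", "agreement": 0.55},
]
kept = filter_by_agreement(pairs)
losses = [weighted_pair_loss(p["r_w"], p["r_l"], p["intensity"]) for p in kept]
print(len(kept), losses)
```

Weighting means a confidently labelled pair contributes more gradient than a marginal one, so the RM is not pushed as hard to separate completions that labellers found nearly equal.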

Llama 2's technical report describes using two separate reward models: one for helpfulness and one for safety. This decomposition is expensive but addresses a real problem. A single RM trained on combined feedback tends to learn a trade-off between helpfulness and safety that no single scalar can fully express, so optimising it hard eventually produces a model that is either uselessly cautious or subtly harmful.
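
With two reward models, something still has to turn two scores into the single scalar the RL loop consumes. One possible gating rule, loosely modelled on the idea of letting the safety score take over when a response looks risky, is sketched below. The function name, the `is_safety_prompt` flag, and the threshold value are assumptions for illustration.

```python
def combined_reward(r_help: float, r_safe: float,
                    is_safety_prompt: bool,
                    safety_threshold: float = 0.15) -> float:
    """Use the safety score when safety is at stake, else the helpfulness score.

    safety_threshold is a placeholder; a real system would tune it.
    """
    if is_safety_prompt or r_safe < safety_threshold:
        return r_safe
    return r_help

# Illustrative scores only.
print(combined_reward(0.8, 0.9, is_safety_prompt=False))   # helpfulness score used
print(combined_reward(0.8, 0.05, is_safety_prompt=False))  # low safety score takes over
print(combined_reward(0.8, 0.9, is_safety_prompt=True))    # safety-tagged prompt
```

A gate like this avoids averaging the two scores, so a very helpful but unsafe completion cannot buy back its safety deficit.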

From scores to policy: the RL training loop

Once the RM is trained, it is frozen and used as a reward function inside a PPO (or similar) training loop. The setup looks like this:

for each rollout batch:
    1. sample prompt x from dataset
    2. generate completion y ~ π_θ(· | x)    # current policy
    3. compute r = RM(x, y)                  # frozen reward model
    4. estimate KL(π_θ || π_ref) on the sampled tokens
    5. compute total reward: R_total = r - β · KL(π_θ || π_ref)
    6. update θ via PPO to maximise R_total

The KL penalty against the reference policy (usually the SFT model) is not optional. Without it, the policy quickly drifts to completions that exploit the RM's blind spots while becoming incoherent as natural language. The coefficient β controls how far the policy may deviate from the SFT baseline in exchange for higher reward. Typical values sit between 0.01 and 0.1, and β is often the highest-leverage hyperparameter in the whole pipeline.
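
The KL-penalised reward can be sketched for a single sampled completion as follows. This assumes per-token log-probabilities of the sampled tokens are available from both the policy and the reference model, and uses the sum of their differences as a single-sample estimate of the KL term. The function names and numbers are illustrative.

```python
def kl_estimate(logp_policy, logp_ref):
    """Single-sample estimate of KL(pi_theta || pi_ref) for one completion:
    the sum over sampled tokens of log pi_theta(token) - log pi_ref(token)."""
    return sum(p - q for p, q in zip(logp_policy, logp_ref))

def total_reward(rm_score, logp_policy, logp_ref, beta=0.05):
    """R_total = r - beta * KL, with beta picked from the 0.01 to 0.1 range."""
    return rm_score - beta * kl_estimate(logp_policy, logp_ref)

# Made-up log-probabilities for a three-token completion.
logp_policy = [-0.2, -0.5, -0.1]
logp_ref = [-0.4, -0.9, -0.3]
print(kl_estimate(logp_policy, logp_ref))
print(total_reward(1.5, logp_policy, logp_ref))
```

When the policy and reference assign identical probabilities the estimate is zero and the total reward is just the RM score; the penalty grows as the policy puts more mass than the reference on the tokens it samples.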

One subtlety: the RM is typically evaluated at the final token of the completion. This works when the RM was trained that way, but it makes the reward signal sparse: one scalar for a generation that may run to hundreds of tokens. Dense token-level reward models exist and can accelerate credit assignment, but they require more careful training to avoid reward hacking at individual tokens.
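
To make the sparsity concrete, the sketch below builds the per-token reward vector under one common arrangement: every token receives its share of the KL penalty, and the single RM score is added at the final token only. The function name and inputs are assumptions for illustration.

```python
def per_token_rewards(rm_score, logp_policy, logp_ref, beta=0.05):
    """Per-token rewards: a KL penalty at every position, RM score at the end."""
    rewards = [-beta * (p - q) for p, q in zip(logp_policy, logp_ref)]
    if rewards:
        rewards[-1] += rm_score  # the only position that sees the RM signal
    return rewards

# Made-up log-probabilities for a four-token completion.
rewards = per_token_rewards(
    rm_score=1.5,
    logp_policy=[-0.2, -0.5, -0.1, -0.3],
    logp_ref=[-0.4, -0.9, -0.3, -0.3],
)
print(rewards)
```

Every position except the last carries only the penalty term, so the advantage estimator has to propagate the terminal score back through the whole sequence.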

Scoring at scale: practical infrastructure

A reward model serving PPO rollouts needs to handle high throughput. During training, the policy generates thousands of completions per minute; the RM must score them without becoming the bottleneck.

Common approaches:

Separate RM server: Host the RM as a dedicated inference process. The PPO trainer sends batched completions over a socket. This decouples scaling the generator from scaling the scorer.
Smaller RM: The RM does not need to match policy size. A 7B RM scoring a 70B policy is common. Smaller is cheaper and easier to replicate for ensemble scoring.
Half-precision inference: The RM only outputs a scalar; it does not need high numerical precision. Running in bfloat16 or float16 halves memory and typically has negligible effect on ranking accuracy.
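
On the trainer side, the separate-server approach usually comes down to sending completions in fixed-size batches and reassembling the scores in order. The sketch below shows only that batching logic; `score_batch` stands in for whatever transport and model call a real deployment uses, and the length-based scorer is a placeholder so the example runs on its own.

```python
from typing import Callable, List, Sequence

def score_in_batches(completions: Sequence[str],
                     score_batch: Callable[[List[str]], List[float]],
                     batch_size: int = 32) -> List[float]:
    """Score completions in fixed-size batches, preserving input order."""
    scores: List[float] = []
    for start in range(0, len(completions), batch_size):
        batch = list(completions[start:start + batch_size])
        scores.extend(score_batch(batch))
    return scores

calls = []

def fake_rm(batch: List[str]) -> List[float]:
    """Placeholder scorer: records the batch size and returns string lengths."""
    calls.append(len(batch))
    return [float(len(text)) for text in batch]

scores = score_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_rm, batch_size=2)
print(scores, calls)
```

Keeping the scorer behind a plain callable means the same loop works whether the RM is in-process, behind a socket, or one member of an ensemble.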

For Constitutional AI (Bai et al., 2022), Anthropic showed that the human preference labels used to train the RM can be replaced in part by judgements from an AI model guided by a written set of principles (the "constitution"). This reduces dependence on human labelling at scale, at the cost of inheriting the generating model's own biases.

When it falls down

Reward hacking. The RM is a proxy, not the true objective. The policy will find completions that score well on the RM without being genuinely preferred by humans. Gao et al. (2022) measured this empirically, using a larger "gold" reward model as a stand-in for human preference: proxy reward keeps climbing while the gold reward flattens and then falls. The relationship follows predictable scaling laws. Standard mitigations include early stopping based on human eval checkpoints, KL regularisation, and periodic RM refreshes trained on policy-generated completions.
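
The shape of that curve can be sketched with the functional form Gao et al. fit for RL, R(d) = d(α − β log d), where d is the square root of the KL divergence from the initial policy. The coefficient values below are made up for illustration; only the functional form comes from the paper. Setting the derivative to zero gives a peak at d* = exp(α/β − 1).

```python
import math

def gold_reward_rl(d: float, alpha: float, beta: float) -> float:
    """Gold reward as a function of d = sqrt(KL), for d > 0."""
    return d * (alpha - beta * math.log(d))

def peak_distance(alpha: float, beta: float) -> float:
    """Solve dR/dd = alpha - beta*log(d) - beta = 0 for d."""
    return math.exp(alpha / beta - 1.0)

# Hypothetical coefficients, not fitted to any data.
alpha, beta = 1.0, 0.4
d_star = peak_distance(alpha, beta)
print(d_star, gold_reward_rl(d_star, alpha, beta))
print(gold_reward_rl(2.0 * d_star, alpha, beta))  # past the peak
```

Under this form, the gold reward at the peak equals β·d*, and pushing the policy further from its starting point only loses gold reward, which is the quantitative case for early stopping.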

Distribution shift. The RM was trained on SFT-era completions. As the policy drifts during RL training, the RM is asked to score completions increasingly unlike its training distribution. Scores become unreliable at the margins. One fix is iterative RM training: collect new comparisons on current-policy completions, fine-tune the RM, resume policy training.

Length bias. Human labellers often prefer longer responses, all else being equal. A naively trained RM absorbs this bias and the policy learns to pad. Correcting for length explicitly, for example with a length penalty or length-normalised rewards, is usually needed to prevent this.
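
One simple form of explicit length correction is a linear penalty on tokens beyond a target length, sketched below. The penalty rate `lam` and `target_len` are hypothetical knobs that would need tuning against held-out human preferences; this is one option among several, not a recommended setting.

```python
def length_penalised(score: float, n_tokens: int,
                     lam: float = 0.001, target_len: int = 200) -> float:
    """Subtract lam per token beyond target_len; shorter outputs are untouched."""
    excess = max(0, n_tokens - target_len)
    return score - lam * excess

# Illustrative scores: the padded answer scores slightly higher before correction.
concise = length_penalised(1.00, n_tokens=180)
padded = length_penalised(1.05, n_tokens=600)
print(concise, padded)
```

With these made-up numbers the padded completion loses its raw-score advantage once the penalty is applied, so the policy no longer gains by adding filler.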

Ambiguous labeller instructions. "Helpful, harmless, and honest" sounds clear until labellers face a request where these goals conflict. Without detailed, worked-example labelling guidelines, different labellers will make inconsistent choices and the RM learns a noisy mixture of different objectives. Investment in annotation guidelines pays compounding dividends.

Catastrophic forgetting in the RM. If the RM is fine-tuned from a model that was already instruction-tuned, RM training can degrade the instruction-following capability those weights started with. This rarely matters for the RM itself (which only outputs scalars), but it matters if you later want to reuse those weights elsewhere.

Further reading