Reward Models as Learned Rewards

A reward model is a neural network trained on human preference comparisons to output a scalar score, which stands in for the unobservable human utility function during RL fine-tuning of language models.

When OpenAI released InstructGPT in 2022, a 1.3-billion-parameter model outperformed the raw 175-billion-parameter GPT-3 on human preference evaluations. The size gap is a distraction; the real story is the reward signal. The smaller model was not fine-tuned solely on a fixed objective like next-token cross-entropy. It was also trained to maximise a score produced by a second neural network, one that had learned, from roughly 50,000 human comparisons, what "a good answer" looks like. That second network is the reward model, and understanding it is central to understanding modern LLM alignment.

What a reward model actually is

A reward model (RM) is a language model with its final unembedding head replaced by a single linear layer projecting to a scalar. Given a prompt \(x\) and a completion \(y\), it outputs \(r_\theta(x, y) \in \mathbb{R}\). Nothing about the architecture forces the output to mean anything useful; that meaning is instilled by training.
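To make the scalar head concrete, here is a minimal sketch in plain Python. It assumes the transformer body has already produced a final hidden state for each (prompt, completion) pair; the 4-dimensional vectors, the names `scalar_head` and `pairwise_loss`, and the random weights are all hypothetical stand-ins. The loss shown is one common way to train on preference comparisons (the negative log-sigmoid of the reward margin between the preferred and the rejected completion), offered as an illustration and not as the exact recipe of any particular system.

```python
import math
import random

def scalar_head(hidden, weight, bias=0.0):
    """Project a final hidden state (list of floats) to a single scalar r(x, y)."""
    return sum(h * w for h, w in zip(hidden, weight)) + bias

def pairwise_loss(r_chosen, r_rejected):
    """Negative log-sigmoid of the reward margin between two completions."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)), arranged to stay stable for large |m|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical hidden states standing in for the transformer's last-token
# representation of the preferred and the rejected completion.
rng = random.Random(0)
weight = [rng.uniform(-0.1, 0.1) for _ in range(4)]  # untrained head
h_chosen = [0.5, -1.0, 0.3, 2.0]
h_rejected = [0.1, 0.4, -0.7, 0.2]

r_chosen = scalar_head(h_chosen, weight)
r_rejected = scalar_head(h_rejected, weight)
loss = pairwise_loss(r_chosen, r_rejected)
print(f"r_chosen={r_chosen:.4f} r_rejected={r_rejected:.4f} loss={loss:.4f}")
```

With an untrained head the two scores are arbitrary, which is the point made above: the architecture alone gives the scalar no meaning. Minimising a loss of this kind over many comparisons is what pushes the preferred completion's score above the rejected one's.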
