RLHF as a Reinforcement-Learning Problem

A 1.3-billion-parameter InstructGPT model, trained with reinforcement learning from human feedback, was rated better than a raw 175-billion-parameter GPT-3 by human evaluators on open-ended generation tasks. The preferred model has more than 100 times fewer parameters. That result, reported by Ouyang et al. (2022), crystallised a shift in how the field thinks about post-training: scaling up next-token prediction does not by itself produce a model that does what people actually want. Bridging that gap requires a different objective, and RLHF provides one.

What makes this a reinforcement-learning problem

In standard supervised fine-tuning the training signal is token-level: given a prefix, predict the correct next token from a reference answer. That formulation cannot express preferences about whole-response quality: fluency, factual accuracy, harmlessness, helpfulness. Humans do not annotate token distributions; they compare two full responses and pick the better one.

RLHF maps the language generation problem onto the standard RL framework as follows:

RL concept	RLHF equivalent
Environment	The human (or human-preference simulator)
State	The prompt / dialogue context plus the tokens generated so far
Action	The next sampled token (discrete action space of roughly 50,000 tokens)
Episode	One full response completion
Reward	Scalar from the reward model, given only at end-of-episode
Policy	The language model being fine-tuned

Viewed at the level of whole responses, the policy \(\pi_\theta\) maps a prompt \(x\) to a distribution over responses \(y\), built up one token at a time. The reward model \(r_\phi(x, y)\) maps a prompt-response pair to a scalar. The objective is to find parameters \(\theta\) that maximise expected reward:

\[\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right]\]

Training the reward model itself is a separate supervised step: annotators compare pairs \((y_1, y_2)\) for the same prompt \(x\) and label which they prefer. The reward model is fit to those comparisons using the Bradley-Terry model, minimising the negative log-likelihood of the observed preferences:

\[\mathcal{L}(r_\phi) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) \right]\]

where \(y_w\) is the preferred ("won") response and \(y_l\) is the rejected one.
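The loss above can be sketched in a few lines of plain Python. This is only an illustration: the function name and the two input lists are hypothetical, and the lists are assumed to hold reward-model scores that have already been computed for each comparison.

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Mean negative log-likelihood of the observed preferences.

    reward_chosen[i] and reward_rejected[i] stand for r_phi(x, y_w) and
    r_phi(x, y_l) on the i-th comparison (hypothetical precomputed scores).
    """
    total = 0.0
    for r_w, r_l in zip(reward_chosen, reward_rejected):
        margin = r_w - r_l
        # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for stability
        if margin >= 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total / len(reward_chosen)
```

When the reward model scores both responses equally the loss is \(\log 2\); it shrinks as the preferred response is scored increasingly higher than the rejected one.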

The KL-regularised objective

Optimising \(r_\phi\) without constraint is immediately problematic. The policy can learn to produce responses that exploit weaknesses in the reward model, generating text that scores high on \(r_\phi\) but that no human would actually prefer. This is Goodhart's Law applied to neural networks: once a measure becomes a target it ceases to be a good measure.

The standard fix is a KL penalty that keeps the fine-tuned policy close to a frozen reference policy \(\pi_\text{ref}\) (usually the SFT-stage model before RL):

\[\max_\theta \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) \right] - \beta \; \text{KL}\!\left[\pi_\theta(\cdot|x) \,\|\, \pi_\text{ref}(\cdot|x)\right] \right]\]

The coefficient \(\beta\) controls the trade-off. Setting \(\beta = 0\) gives unconstrained optimisation (reward hacking territory). Setting \(\beta\) very high turns the problem back into imitation of \(\pi_\text{ref}\) (no alignment benefit). In practice \(\beta\) is tuned empirically; InstructGPT used a small fixed value and found the model responsive to the reward signal while staying coherent.
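The effect of \(\beta\) is easiest to see on a toy problem. For a fixed prompt and a finite set of candidate responses, the KL-regularised objective has a known closed-form maximiser, \(\pi^*(y|x) \propto \pi_\text{ref}(y|x)\exp(r_\phi(x, y)/\beta)\). The sketch below evaluates it for three made-up responses; the probabilities, rewards, and function name are illustrative assumptions, not values from any real model.

```python
import math

def kl_regularised_optimum(ref_probs, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta) over a finite set.

    Requires beta > 0; beta = 0 is the unconstrained case and has no such form.
    """
    # Subtract the max reward before exponentiating to avoid overflow for small beta.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

ref_probs = [0.7, 0.2, 0.1]   # toy pi_ref over three candidate responses
rewards = [0.0, 1.0, 3.0]     # toy r_phi scores for the same responses

for beta in (100.0, 1.0, 0.05):
    pi_star = kl_regularised_optimum(ref_probs, rewards, beta)
    print(beta, [round(p, 3) for p in pi_star])
```

With a large \(\beta\) the result stays close to the reference distribution; with a small \(\beta\) almost all probability mass moves to the highest-reward response, which is exactly the regime where flaws in the reward model get exploited.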

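In implementation the KL term is computed token by token: the log-ratio at each sampled token is scaled by \(\beta\) and subtracted from that token's reward before the PPO update. In the sketch below, log_pi_theta[t] and log_pi_ref[t] denote the log-probabilities the two models assign to the sampled token y[t]:

token_reward = [0.0] * len(y)        # no intermediate task reward
for t in range(len(y)):
    # per-token KL penalty on the sampled token y[t]
    token_reward[t] -= beta * (log_pi_theta[t] - log_pi_ref[t])
token_reward[-1] += r_phi(x, y)      # terminal reward at end-of-sequence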

PPO then treats token_reward as the reward signal and updates \(\pi_\theta\) with clipped surrogate objectives and a value function baseline.
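How the value function baseline turns per-token rewards into advantages is not spelled out here. One common choice, assumed in this sketch rather than taken from the text, is generalised advantage estimation (GAE). The function name, the defaults gamma=1.0 and lam=0.95, and the convention that the value after the final token is zero are all illustrative assumptions.

```python
def gae_advantages(token_rewards, values, gamma=1.0, lam=0.95):
    """Generalised advantage estimates for one response.

    token_rewards[t]: shaped reward for token t (KL penalty plus terminal score).
    values[t]: critic's value estimate for the state before token t is emitted.
    The value after the final token is taken to be 0 (the episode ends).
    """
    advantages = [0.0] * len(token_rewards)
    next_value = 0.0
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(token_rewards))):
        delta = token_rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With lam=1 and a critic that always predicts zero, each advantage reduces to the undiscounted reward-to-go, so the terminal reward is credited equally to every earlier token. A trained critic and lam below 1 trade some of that variance for bias.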

Why PPO is used (and why it is non-trivial)

Proximal Policy Optimisation (Schulman et al., 2017) clips the policy-ratio to prevent large updates in a single step:

\[\mathcal{L}^\text{PPO}(\theta) = \mathbb{E}_t \left[ \min\!\left( \rho_t A_t,\; \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) A_t \right) \right]\]

where \(\rho_t = \pi_\theta(a_t|s_t)/\pi_{\theta_\text{old}}(a_t|s_t)\) is the importance ratio and \(A_t\) is the advantage estimate.
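A minimal sketch of the clipped surrogate, written in plain Python over per-token log-probabilities, may make the clipping behaviour concrete. The function name and argument names are hypothetical, and epsilon=0.2 is only an illustrative default; real implementations compute this on batched tensors with automatic differentiation.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate over a batch of tokens (a quantity to be maximised)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                        # rho_t
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)  # clip(rho_t, 1-eps, 1+eps)
        # Taking the minimum removes any incentive to push the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For a positive advantage the objective stops growing once the ratio exceeds \(1+\epsilon\); for a negative advantage it stops improving once the ratio falls below \(1-\epsilon\), while moves in the harmful direction remain fully penalised.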

Applying PPO to a language model is substantially harder than in standard RL settings:

The action space is the full vocabulary (roughly 50,000 tokens), far larger than most RL benchmarks.
Episodes are long (hundreds of tokens), making credit assignment difficult.
The reward signal is sparse (one scalar per full response).
The policy is initialised from a large pretrained model; gradient steps must be small to avoid catastrophic forgetting.
A critic (value network) is trained at the same time as the policy and must generalise across the entire context space.

These challenges explain why RLHF implementations require careful batch sizing, gradient clipping, learning-rate warmup schedules, and the KL term above.

When it falls down

Reward model overfitting. The reward model is trained on a finite comparison dataset and is itself a neural network with its own inductive biases. Once \(\pi_\theta\) has been optimised hard against \(r_\phi\), it can find modes that \(r_\phi\) rewards highly but humans would not. Gao et al. (2022) showed empirically that ground-truth performance (measured by a held-out "gold" reward model) follows an inverted-U curve as a function of KL divergence from \(\pi_\text{ref}\): it improves initially, then degrades as over-optimisation sets in. The peak shifts with reward model capacity.

Annotation noise and inconsistency. Human preference labels are noisy. Annotators disagree; labelling instructions change over time; crowd workers optimise for speed. The reward model absorbs this noise. Any bias in the annotator pool (e.g., preference for long, confident-sounding answers) becomes a bias in the policy.

Sparse reward and credit assignment. Assigning the end-of-episode scalar reward to specific tokens is approximate. The value function must learn which tokens were causally responsible for a good or bad response, across sequences that can be several hundred tokens long.

Distribution shift. The policy shifts during RL training. The reward model was trained on completions from the SFT policy. As \(\pi_\theta\) moves away from that distribution, \(r_\phi\) is evaluated on out-of-distribution inputs and its outputs become less reliable. Iterative reward model retraining (collecting new comparisons from the updated policy) is the standard mitigation but is expensive.

Verbosity bias. Without careful design, RLHF policies learn to produce longer responses because annotators often rate longer, structured answers as more helpful regardless of actual content quality. This is a specific instance of reward hacking.

Mode collapse on style. PPO can collapse to a narrow style that scores well on the reward model (overly formal, hedged, or formulaic) rather than maintaining the diversity of the pretrained model.

