The Bandit Framing of RLHF

A 1.3-billion-parameter InstructGPT model outperforms the 175-billion-parameter GPT-3 on human preference evaluations. The difference is not more compute; it is tens of thousands of labelled comparisons used to train a reward model, which then steers the policy through RL. That result, from Ouyang et al. (2022), crystallised a question practitioners had been circling for years: what kind of reinforcement learning problem is fine-tuning a language model, exactly?

The answer most practitioners reach for is the contextual bandit framing, and understanding why it fits (and where it breaks) explains almost everything about how RLHF pipelines are designed.

From MDP to Bandit: What Gets Dropped

In a standard Markov decision process (MDP), an agent takes a sequence of actions, receives intermediate rewards at each step, and the state transitions matter: action at step \(t\) affects what is reachable at step \(t+1\). A full MDP treatment of language generation would assign a reward to every token, and the credit-assignment problem over thousands of tokens would be intractable without heavy approximations.

A contextual bandit collapses this entirely. The "context" is the prompt \(x\). The "arm" is the full response \(y\), treated as a single, indivisible action. The reward \(r(x, y)\) arrives once, at the end. There is no state transition to model; there is no credit-assignment across tokens. You play one arm, observe one scalar, and update.
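
The play-observe-update loop above can be sketched on a toy problem. This is a minimal illustration, not an RLHF implementation: the reward table, the tabular softmax policy, the learning rate, and the running-mean baseline are all assumptions made for the example. Each context stands in for a prompt and each arm for a complete response.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit(rewards, steps=5000, lr=0.5, seed=0):
    """REINFORCE on a tabular softmax policy; rewards[x][y] is a hypothetical table."""
    rng = random.Random(seed)
    n_ctx, n_arms = len(rewards), len(rewards[0])
    logits = [[0.0] * n_arms for _ in range(n_ctx)]
    baseline = [0.0] * n_ctx  # running mean reward per context (variance reduction)
    for _ in range(steps):
        x = rng.randrange(n_ctx)                          # context: the "prompt"
        probs = softmax(logits[x])
        y = rng.choices(range(n_arms), weights=probs)[0]  # one arm: the whole "response"
        r = rewards[x][y]                                 # one scalar reward, no transitions
        advantage = r - baseline[x]
        for a in range(n_arms):
            # Gradient of log pi(y|x) w.r.t. logit a is onehot(y)[a] - probs[a].
            grad_logp = (1.0 if a == y else 0.0) - probs[a]
            logits[x][a] += lr * advantage * grad_logp
        baseline[x] += 0.1 * (r - baseline[x])
    return [softmax(row) for row in logits]

# Hypothetical rewards: arm 0 is best in context 0, arm 1 is best in context 1.
REWARDS = [[1.0, 0.0, 0.2], [0.1, 0.9, 0.0]]
policy = train_bandit(REWARDS)
best_arms = [max(range(3), key=lambda a: p[a]) for p in policy]
print(best_arms)
```

Nothing in the update refers to a next state or a per-step reward, which is the whole point of the bandit view.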

Setting	State transitions	Per-step reward	Credit assignment
Full MDP	Yes	Yes	Hard; typically needs value estimation (e.g. temporal-difference methods)
Contextual bandit	No (prompt is fixed)	No (single scalar)	Trivial; entire response is one "arm"
RLHF in practice	Approximately no	No	Treated as a bandit; optimised with PPO

The approximation is deliberate. Treating generation as a bandit sidesteps the MDP machinery while still allowing a well-defined policy gradient update. PPO is used not because the problem is truly an MDP, but because it provides stable policy-gradient estimates and its clipped surrogate objective guards against catastrophic policy updates.

The KL-Regularised Objective

Pure reward maximisation over a fixed reward model is dangerous (more on this shortly). The standard objective used in RLHF adds an explicit penalty that keeps the fine-tuned policy close to the supervised reference model \(\pi_{\text{ref}}\):

\[\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} \left[ r(x, y) \right] \; - \; \beta \, \mathrm{KL}\!\left[\pi_\theta(\cdot|x) \;\|\; \pi_{\text{ref}}(\cdot|x)\right] \right]\]

The coefficient \(\beta\) is a hyperparameter controlling how far the policy is allowed to drift. Small \(\beta\) lets the policy chase high reward aggressively. Large \(\beta\) keeps outputs close to the supervised fine-tuned (SFT) baseline.

This objective has a closed-form optimum when the policy is free to be any distribution over responses. Rafailov et al. (2023), building on earlier results for KL-regularised RL, use the fact that the optimal policy is:

\[\pi^*(y|x) \propto \pi_{\text{ref}}(y|x) \exp\!\left(\frac{r(x,y)}{\beta}\right)\]
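
A small numerical check makes the closed form concrete. The sketch below assumes a single prompt with three candidate responses; the reference probabilities, rewards, and \(\beta\) are made-up values. It builds \(\pi^*\) by reweighting \(\pi_{\text{ref}}\) with \(\exp(r/\beta)\) and compares the KL-regularised objective against two alternatives: staying at the reference, and putting all mass on the highest-reward response.

```python
import math

def kl(p, q):
    """KL[p || q] for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(policy, ref, rewards, beta):
    """Expected reward minus beta times KL from the reference, for one prompt."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl(policy, ref)

def optimal_policy(ref, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    z = sum(weights)  # the normaliser hidden by the proportionality sign
    return [w / z for w in weights]

# Hypothetical values for illustration only.
ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star = optimal_policy(ref, rewards, beta)
greedy = [0.0, 0.0, 1.0]  # pure reward maximisation

print("pi*    :", objective(pi_star, ref, rewards, beta))
print("greedy :", objective(greedy, ref, rewards, beta))
print("ref    :", objective(ref, ref, rewards, beta))
```

The objective value at \(\pi^*\) equals \(\beta\) times the log of the normaliser, which is a convenient sanity check when experimenting with other values.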

That insight is what makes Direct Preference Optimisation (DPO) work: by rearranging this expression, you can write the reward in terms of the log-ratio of policies, which means you can train the policy directly on preference pairs without ever explicitly materialising a reward model.
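
Concretely, taking logs of the closed form gives \(r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)\), where \(Z(x)\) is the normaliser. Under a pairwise logistic (Bradley-Terry) preference model only reward differences matter, so \(Z(x)\) cancels. The sketch below computes the resulting per-pair loss from sequence log-probabilities; the function names and the default \(\beta\) are illustrative assumptions, and real implementations operate on batched tensors rather than floats.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are log pi_theta(y|x) and ref_logp_* are log pi_ref(y|x), summed
    over tokens, for the preferred (w) and dispreferred (l) responses.
    """
    implicit_reward_w = beta * (logp_w - ref_logp_w)
    implicit_reward_l = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(implicit_reward_w - implicit_reward_l)

# Policy identical to the reference: the implicit rewards tie.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the preferred response: lower loss.
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

When the policy equals the reference, the implicit reward margin is zero and the loss is \(\log 2\) regardless of the raw log-probabilities.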

The reward model itself is typically trained with a Bradley-Terry model of pairwise preferences. Given a preferred response \(y_w\) and a dispreferred response \(y_l\) for the same prompt, the reward model learns to assign higher scores to preferred completions by minimising:

\[\mathcal{L}_\text{RM} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\!\left(r(x, y_w) - r(x, y_l)\right)\right]\]

This is a cross-entropy loss on binary preferences; the reward model need never see absolute quality ratings.
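
To see that comparisons alone pin down relative scores, the following sketch fits one scalar reward per response by gradient descent on the loss above. The comparison counts are invented, and a free scalar per response stands in for a neural reward model, which is a deliberate simplification.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_rewards(n_items, comparisons, steps=2000, lr=1.0):
    """Minimise the mean of -log sigmoid(r_w - r_l) over (winner, loser) pairs."""
    r = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for w, l in comparisons:
            # d/dr_w of -log sigmoid(r_w - r_l) is -(1 - sigmoid(r_w - r_l));
            # the derivative w.r.t. r_l has the opposite sign.
            g = 1.0 - sigmoid(r[w] - r[l])
            grad[w] -= g
            grad[l] += g
        r = [ri - lr * gi / len(comparisons) for ri, gi in zip(r, grad)]
    return r

# Hypothetical labels: response 0 beats 1 in 8 of 10 comparisons,
# response 1 beats 2 in 7 of 10.
comparisons = [(0, 1)] * 8 + [(1, 0)] * 2 + [(1, 2)] * 7 + [(2, 1)] * 3
rewards = fit_rewards(3, comparisons)
print([round(x, 3) for x in rewards])
print(round(rewards[0] - rewards[1], 3), round(math.log(8 / 2), 3))
```

Only the differences between rewards are identified: adding a constant to every score leaves the loss unchanged, which is consistent with the reward model never seeing absolute ratings.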

Why "Bandit" and Not "MDP" Still Matters

Calling RLHF a bandit problem is not just a simplification for textbooks; it shapes real design decisions.

Reward sparsity. Because the reward is assigned to the full response, the policy gradient must spread credit across every token in a potentially long completion. In the simplest treatment, the response-level reward serves as a Monte Carlo return shared by all tokens, and each token's log-probability is treated as an action log-prob; PPO implementations usually add a learned value function to reduce the variance of this estimate. This is one reason RLHF runs are sensitive to response length: a single scalar has to account for more token-level decisions as sequences grow.
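
The sketch below shows how a single response-level score turns into per-token returns. It assumes one common implementation choice that the text does not spell out: the KL penalty is charged token by token as \(-\beta\) times the log-ratio, and the reward-model score is added at the final token. The function names and numbers are hypothetical.

```python
def token_level_rewards(rm_score, logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards: a KL penalty at every token, RM score on the last one."""
    rewards = [-beta * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards

def returns_to_go(rewards, gamma=1.0):
    """Monte Carlo return from each token position to the end of the response."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

# With no drift from the reference, every token sees the same return:
# the response-level score.
logprobs = [-1.0, -2.0, -0.5, -1.5]
same = returns_to_go(token_level_rewards(1.0, logprobs, logprobs))
print(same)

# With drift on the first token, earlier positions also pay the KL penalty.
ref_logprobs = [-2.0, -2.0, -0.5, -1.5]
print(returns_to_go(token_level_rewards(1.0, logprobs, ref_logprobs)))
```

In the undiscounted case with no KL term, every position receives an identical return, so nothing in the signal distinguishes a decisive token from filler.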

No environment model needed. The transition function is known and deterministic: the next state is the current prefix with the sampled token appended. There is no separate world model to learn; the policy generates its own next states by sampling tokens. This is why model-based RL approaches have not dominated RLHF: the dynamics are already known exactly, and the only uncertain component is the reward model.

Exploration is cheap but narrow. In a bandit, exploration means trying different arms. For an LLM, this means sampling different responses. Temperature and top-p sampling handle exploration naturally. But the action space (all possible token sequences) is astronomically large, and most of it is reachable only from certain prefixes, so coverage is effectively local.
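
The two sampling knobs mentioned above can be sketched for a single next-token distribution. The logits and thresholds are invented for illustration; production decoders apply the same ideas to vocabulary-sized tensors at every step.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Higher temperature flattens the distribution; lower sharpens it."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

def entropy(probs):
    return -sum(q * math.log(q) for q in probs if q > 0)

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token logits
cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
print(entropy(cold), entropy(hot))
print(top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.75))
```

Both knobs reshape the distribution the policy already has, so exploration stays close to what the current policy considers plausible.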

The KL penalty is doing double duty. It limits reward hacking (see below) and it preserves the linguistic fluency learned during pre-training and SFT. Without it, the policy tends to drift toward degenerate text, such as repetitive or unnatural strings, that happens to score well on the reward model.

When It Falls Down

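Reward over-optimisation (Goodhart's law in action). Gao, Schulman, and Hilton (2022) measured this empirically: as the KL divergence between the RLHF policy and the reference model grows, the proxy reward (from the reward model being optimised) keeps increasing, but the gold reward (from a larger RM that stands in for human preferences) eventually decreases. Writing \(d = \sqrt{d_\text{KL}}\), the fit they report for RL optimisation is approximately:

\[r_\text{gold}(d) \approx d \left(\alpha_\text{RL} - \beta_\text{RL} \log d\right)\]

where \(d_\text{KL}\) is the KL divergence from the reference policy, and \(\alpha_\text{RL}\) and \(\beta_\text{RL}\) are fitted coefficients (\(\beta_\text{RL}\) is unrelated to the KL penalty weight \(\beta\) in the training objective). The simpler form \(\alpha \sqrt{d_\text{KL}} - \beta \, d_\text{KL}\) is the one they fit for best-of-n sampling. In both cases proxy reward climbs while true quality peaks and then falls. This is a direct consequence of treating a noisy reward model as ground truth: the policy finds exploits the reward model did not anticipate.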
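
The peak of such a curve can be located analytically. The sketch below covers the two functional forms reported by Gao et al. in terms of \(d = \sqrt{d_\text{KL}}\): \(d(\alpha - \beta d)\), which expands to \(\alpha \sqrt{d_\text{KL}} - \beta \, d_\text{KL}\) and is their fit for best-of-n sampling, and \(d(\alpha - \beta \log d)\), their fit for RL. The coefficient values are placeholders, not fitted numbers from the paper, and \(\beta\) here is a fit coefficient rather than the KL penalty weight.

```python
import math

def gold_reward_bon(kl, alpha, beta):
    """Best-of-n form: d * (alpha - beta * d), with d = sqrt(KL)."""
    d = math.sqrt(kl)
    return d * (alpha - beta * d)

def gold_reward_rl(kl, alpha, beta):
    """RL form: d * (alpha - beta * log d), with d = sqrt(KL)."""
    d = math.sqrt(kl)
    return d * (alpha - beta * math.log(d)) if d > 0 else 0.0

def peak_kl_bon(alpha, beta):
    # Setting alpha - 2 * beta * d = 0 gives d = alpha / (2 * beta).
    return (alpha / (2.0 * beta)) ** 2

def peak_kl_rl(alpha, beta):
    # Setting alpha - beta * log(d) - beta = 0 gives d = exp(alpha / beta - 1).
    return math.exp(alpha / beta - 1.0) ** 2

ALPHA, BETA = 1.0, 0.5  # placeholder coefficients for illustration
for name, reward, peak in [
    ("best-of-n", gold_reward_bon, peak_kl_bon(ALPHA, BETA)),
    ("RL", gold_reward_rl, peak_kl_rl(ALPHA, BETA)),
]:
    print(name, "peak KL:", peak, "gold reward at peak:", reward(peak, ALPHA, BETA))
```

Past the peak, further optimisation against the proxy lowers the gold reward, which is why the KL budget is often treated as a quantity to monitor rather than simply minimise or maximise.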

Preference data is not IID across prompts. The bandit framing assumes prompt distribution \(\mathcal{D}\) is fixed and representative. In practice, annotators see a curated set of prompts, often skewed toward certain domains. The resulting reward model generalises poorly to out-of-distribution queries, and the KL penalty cannot compensate because it has no notion of which regions of prompt space are underrepresented.

Credit assignment breaks for long-horizon tasks. When the task requires multi-step reasoning (maths proofs, coding, multi-turn dialogue), assigning a single end-of-response scalar conflates many decisions. A correct final answer might follow from flawed intermediate steps; a wrong answer might follow from mostly sound reasoning. Process reward models (step-level rewards) try to address this, but they break the pure bandit framing and re-introduce the MDP.

Length bias in comparisons. Human annotators systematically prefer longer, more detailed responses, all else being equal. A reward model trained on these comparisons encodes length as a proxy for quality. The RLHF policy then learns to pad responses. Correcting for length bias requires either explicit normalisation in the comparison protocol or a length-conditional reward model.

Bandit feedback is one-shot. Human comparisons capture a snapshot preference that may not reflect what the user actually wanted after seeing the response in context. The bandit loop has no mechanism to incorporate follow-up feedback; every preference label is treated as an independent, authoritative signal.
