Process vs Outcome Rewards

A model can arrive at the right answer for the wrong reason. It might make errors that happen to cancel, guess correctly by luck, or produce a plausible-looking chain of symbols that does not demonstrate understanding. Outcome-based reward ignores all of this: if the answer string matches, the model earns full credit. That single bit of feedback has powered an enormous amount of progress, but it has a structural blind spot that process supervision was designed to address.

The core distinction

An outcome reward model (ORM) reads only the final answer and emits a scalar: correct or not, preferred or not. Every step between the question and the answer is invisible to it.

A process reward model (PRM) reads each intermediate reasoning step and scores it in the context of the steps before it. A chain of thought with five steps receives five signals. The training objective then sums (or otherwise aggregates) those per-step scores.

Concretely, if a model is solving a multi-step maths problem and makes an algebraic error at step 3 but happens to recover at step 5 with the right answer, an ORM rewards the whole trajectory. A PRM penalises step 3 even though the final answer was correct.
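The step-3 scenario above can be made concrete with a toy sketch. Everything here is hypothetical: the function names, the +1/-1 step scores, and the hand-written validity flags stand in for what a trained ORM or PRM would produce.

```python
from typing import List

def outcome_reward(final_answer: str, reference: str) -> float:
    """ORM-style signal: one scalar that depends only on the final answer."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

def process_rewards(step_is_valid: List[bool]) -> List[float]:
    """PRM-style signal: one score per step (+1 valid, -1 invalid)."""
    return [1.0 if ok else -1.0 for ok in step_is_valid]

# Hypothetical five-step trajectory: step 3 contains an algebra error,
# but the final answer still matches the reference.
step_is_valid = [True, True, False, True, True]

orm_signal = outcome_reward("42", "42")
prm_signal = process_rewards(step_is_valid)

print(orm_signal)  # 1.0
print(prm_signal)  # [1.0, 1.0, -1.0, 1.0, 1.0]
```

The outcome signal is indistinguishable from that of a flawless derivation, while the per-step signal isolates the faulty step.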

The distinction matters because RL optimises aggressively. Given only outcome signal, a policy quickly learns that sounding confident and ending with a plausible-looking number is a winning strategy, whether or not the reasoning is valid.

Why outcome rewards alone are fragile

Consider the KL-regularised RL objective used in standard RLHF:

J(θ) = E[r(x, y)] - β · KL[π_θ(y|x) || π_ref(y|x)]

Here r(x, y) is the reward assigned to a complete response y given input x, β sets the strength of the penalty, and the KL term discourages the policy π_θ from drifting too far from the reference policy π_ref (typically the supervised fine-tuned model). This formulation provides exactly one reward signal per full rollout.
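As a rough illustration, the objective can be estimated from a batch of sampled responses. This is a minimal sketch, assuming sequence-level log-probabilities are already available for both policies; the function name and all numbers are made up for the example.

```python
from typing import Sequence

def kl_regularised_objective(
    rewards: Sequence[float],
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float,
) -> float:
    """Monte Carlo estimate of J = E[r] - beta * KL(pi_theta || pi_ref).

    Each index is one full response y sampled from pi_theta for the same
    input x; logp_* are sequence-level log-probabilities of that response.
    """
    n = len(rewards)
    mean_reward = sum(rewards) / n
    # For y sampled from pi_theta, log pi_theta(y|x) - log pi_ref(y|x)
    # is an unbiased single-sample estimate of the KL divergence.
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_ref)) / n
    return mean_reward - beta * kl_estimate

# Illustrative values only: four rollouts, one scalar reward each.
rewards = [1.0, 0.0, 1.0, 1.0]
logp_policy = [-10.0, -12.0, -9.0, -11.0]
logp_ref = [-11.0, -12.5, -10.5, -11.0]

j = kl_regularised_objective(rewards, logp_policy, logp_ref, beta=0.1)
```

Note that each rollout contributes exactly one reward value, however many reasoning steps it contains.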

When the reward model is imperfect (and it always is), aggressive optimisation of r leads the policy to find "shortcuts" that score high on the learned reward but low on actual quality. This is reward over-optimisation, or Goodhart's Law applied to language: once a measure becomes a target, it ceases to be a good measure.

In reasoning tasks this shows up as:

Answer duplication: the model repeats the target format without working through the problem.
Step fabrication: steps are grammatically plausible but mathematically incoherent.
Lucky recovery: an incorrect intermediate derivation happens to cancel out, giving a correct answer that the model cannot reliably reproduce.

Outcome signal does not penalise any of these.

How process supervision addresses this

A PRM trained on per-step human labels provides dense, step-specific feedback. The training signal for a rollout becomes:

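J_PRM(θ) = E[ Σ_{t=1..T} r_step(x, s_1, ..., s_t) ] - β · KL[π_θ(y|x) || π_ref(y|x)]

where s_t is the t-th reasoning step, T is the number of steps in the rollout, and r_step is the step-level reward, which scores step t given the input and the steps before it. The KL term is the same as in the outcome objective. An error at an intermediate step receives negative signal at that step, regardless of whether the final answer is correct.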
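How the per-step scores are combined matters. The sketch below compares the sum used in the objective above with a product of per-step correctness probabilities, which is how Lightman et al. score a whole solution. It assumes the step scores are probabilities in [0, 1]; the values are invented for illustration.

```python
import math
from typing import List

def sum_score(step_scores: List[float]) -> float:
    """Additive aggregation: each step contributes independently."""
    return sum(step_scores)

def product_score(step_probs: List[float]) -> float:
    """Multiplicative aggregation: roughly 'probability every step is correct'."""
    return math.prod(step_probs)

# Hypothetical per-step correctness probabilities for two five-step chains.
clean = [0.95, 0.95, 0.95, 0.95, 0.95]
lucky = [0.95, 0.95, 0.05, 0.95, 0.95]  # step 3 is almost certainly wrong

sum_ratio = sum_score(lucky) / sum_score(clean)
product_ratio = product_score(lucky) / product_score(clean)

print(f"sum keeps {sum_ratio:.0%} of the clean score")
print(f"product keeps {product_ratio:.0%} of the clean score")
```

Under the sum, one bad step costs only a fraction of the total; under the product, it dominates the score. Which behaviour is preferable depends on whether the score is used as a training signal or as a ranking criterion.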

Lightman et al. (2023, "Let's Verify Step by Step") collected roughly 800,000 step-level human labels (PRM800K) and trained a PRM on them. Used to pick the best of many sampled solutions, that PRM solved 78% of problems from a representative subset of the MATH test set, outperforming both an outcome-supervised reward model and majority voting over the same samples. The PRM's advantage also grew as the number of sampled solutions increased, which suggests it is a more reliable selector rather than one that is occasionally lucky.

Uesato et al. (2022) had found an interesting asymmetry earlier: outcome supervision and process supervision achieve similar final-answer accuracy, but process supervision produces dramatically fewer reasoning errors among those correct answers. This implies the two approaches train qualitatively different capabilities.

Property	ORM	PRM
Label cost	Low (final answer only)	High (every step annotated)
Training signal density	1 signal per rollout	T signals per rollout (one per step)
Catches mid-chain errors	No	Yes
Vulnerable to lucky recovery	Yes	Partially
Credit assignment on long chains	Poor	Better

Process rewards in the GRPO / RLVR era

The 2024-2025 generation of reasoning models (DeepSeek-R1 and its contemporaries) made a different design choice: they used a verifiable outcome reward rather than a learned reward model. For problems whose answers can be checked automatically (maths, code, formal logic), the reward is simply 1 if a rule-based checker accepts the answer and 0 otherwise. No neural reward model is needed for the correctness signal.
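A verifiable reward for numeric answers can be as small as the sketch below. It is an illustration under narrow assumptions, not the checker used by any particular system: the function name is hypothetical, and it only handles answers that Python's `fractions.Fraction` can parse (integers, decimals, and simple `a/b` fractions).

```python
from fractions import Fraction

def verifiable_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if both strings denote the same rational number, else 0.0."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Unparseable or malformed answers earn no reward.
        return 0.0
    return 1.0 if same else 0.0

print(verifiable_reward("0.5", "1/2"))       # 1.0
print(verifiable_reward("2", "3"))           # 0.0
print(verifiable_reward("forty-two", "42"))  # 0.0
```

Comparing parsed values rather than raw strings avoids penalising equivalent forms such as `0.5` and `1/2`. Real checkers for code or proofs would instead run unit tests or a proof assistant.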

DeepSeek-R1 (Guo et al., 2025) demonstrated that this approach, combined with Group Relative Policy Optimisation (GRPO), could elicit long chain-of-thought reasoning including self-correction and backtracking, without a single step-level human label. The outcome reward is still a scalar per rollout, but because it is ground-truth rather than a learned approximation, the over-optimisation problem is structurally reduced: there is no imperfect proxy to Goodhart against.
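The "group relative" part of GRPO can be sketched in a few lines: several responses are sampled for the same prompt, and each response's advantage is its reward normalised by the group's mean and standard deviation. This is a simplified sketch of that normalisation only. The function name, the `eps` constant, and the choice of population standard deviation are assumptions, and the clipped policy-gradient update is omitted.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt, scored by a 0/1 verifiable reward.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If all rollouts succeed (or all fail), no rollout stands out.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

A group in which every rollout receives the same reward yields zero advantage everywhere, so prompts that are always solved or never solved contribute no learning signal.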

This creates an interesting fork in current practice:

Learned ORM: cheap labels, applicable to open-ended tasks, but prone to over-optimisation.
PRM: expensive labels, better credit assignment, better calibration for search.
Verifiable outcome reward: no per-response human labels, no learned proxy to game (though a weak checker can still be exploited), but only applicable where ground truth can be checked.

PRMs have found a second life not as the training signal itself but as a search heuristic during inference: a PRM scores candidate reasoning steps at each node of a beam search or tree search, selecting the most plausible continuation. This decouples the PRM's benefit from the RL training loop entirely.
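Best-of-N reranking is the simplest form of this idea; a beam or tree search would apply the same scoring at every step instead of once per finished chain. The sketch below assumes a hypothetical `score_step(previous_steps, step)` interface and uses a hand-written `toy_prm` in place of a trained model. Scoring a chain by its weakest step is one possible aggregation, chosen here for simplicity.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: score_step(previous_steps, step) -> score in [0, 1].
StepScorer = Callable[[Sequence[str], str], float]

def solution_score(steps: Sequence[str], score_step: StepScorer) -> float:
    """Score a whole chain by its weakest step (one possible aggregation)."""
    return min(score_step(steps[:i], step) for i, step in enumerate(steps))

def best_of_n(candidates: List[List[str]], score_step: StepScorer) -> List[str]:
    """Return the candidate chain with the highest PRM-based score."""
    return max(candidates, key=lambda steps: solution_score(steps, score_step))

def toy_prm(previous_steps: Sequence[str], step: str) -> float:
    """Stand-in for a trained PRM: flags one known-bad step, trusts the rest."""
    return 0.1 if "2 + 2 = 5" in step else 0.9

candidates = [
    ["2 + 2 = 5", "5 * 3 = 15", "answer: 15"],
    ["2 + 2 = 4", "4 * 3 = 12", "answer: 12"],
]

chosen = best_of_n(candidates, toy_prm)
print(chosen[-1])  # answer: 12
```

No gradient update is involved: the policy that generated the candidates is left unchanged, and the PRM only decides which candidate to return.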

When it falls down

PRM reward hacking. PRMs are themselves learned models, and they can be gamed. A policy trained against a PRM long enough will find step phrasings that score high without being mathematically sound. A PRM relocates the proxy problem from the final answer to the individual steps rather than eliminating it.

Annotation cost and scalability. Step-level labels are expensive: PRM800K required annotating roughly 800,000 individual steps. For long reasoning chains (hundreds of steps, as in modern chain-of-thought), this becomes prohibitive. Automated process supervision (OmegaPRM, Luo et al. 2024) attempts to replace human labels with step labels derived from Monte Carlo tree search rollouts, but introduces its own assumptions, notably that a step is good if continuations from it tend to reach the correct answer.
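The core idea behind automated step labels can be sketched as a plain Monte Carlo estimate: complete the solution many times from a given prefix and record how often the final answer is correct. This is a deliberately simplified illustration, not OmegaPRM's tree search. The function names are hypothetical, and `toy_rollout` stands in for sampling completions from a real policy.

```python
import random
from typing import Callable, List

def mc_step_value(
    prefix: List[str],
    rollout: Callable[[List[str], random.Random], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int,
    rng: random.Random,
) -> float:
    """Fraction of sampled completions from this prefix that end correctly."""
    hits = sum(is_correct(rollout(prefix, rng)) for _ in range(n_rollouts))
    return hits / n_rollouts

def toy_rollout(prefix: List[str], rng: random.Random) -> str:
    """Stand-in policy: never recovers from a bad step, otherwise succeeds 80% of the time."""
    if any("ERROR" in step for step in prefix):
        return "wrong"
    return "42" if rng.random() < 0.8 else "wrong"

rng = random.Random(0)
good_value = mc_step_value(["step 1", "step 2"], toy_rollout, lambda a: a == "42", 200, rng)
bad_value = mc_step_value(["step 1", "ERROR step"], toy_rollout, lambda a: a == "42", 200, rng)
```

The estimate inherits the assumption stated above: a prefix is labelled good because continuations tend to succeed, not because anyone checked the step itself. A policy capable of lucky recovery would therefore produce optimistic labels for flawed steps.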

Ambiguous step boundaries. Reasoning does not always decompose cleanly into discrete steps. For open-ended generation or coding tasks, deciding what constitutes a "step" that deserves individual labelling is a non-trivial design choice. Inconsistent step boundaries introduce noise into the PRM training signal.

Verifiable rewards have a coverage problem. Ground-truth outcome rewards only work for tasks where correctness is machine-checkable. Natural language tasks (summarisation, open-ended instruction following, factual accuracy in prose) generally lack a machine-checkable ground truth and cannot use this approach directly. Learned reward models remain necessary for many real-world applications.

Dense rewards can encourage verbosity. If the policy is rewarded for each correct step, it may learn to pad reasoning chains with trivially true filler steps to accumulate reward, inflating generation length without improving accuracy.
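The padding incentive is easy to see with invented numbers. In this sketch the step scores are hypothetical PRM outputs in [0, 1], and the filler steps are assumed to score highly because they are trivially true.

```python
from typing import List

# Hypothetical PRM scores for a concise three-step solution.
concise: List[float] = [0.9, 0.9, 0.9]

# Same solution with five trivially true filler steps appended.
padded: List[float] = concise + [0.99] * 5

# Summed reward grows with every extra step that scores above zero.
sum_gain = sum(padded) - sum(concise)

# A weakest-step score ignores filler that scores above the real steps.
min_concise = min(concise)
min_padded = min(padded)

print(f"sum gain from padding: {sum_gain:.2f}")
print(f"min score unchanged: {min_concise == min_padded}")
```

Under a summed reward, padding is strictly profitable whenever filler steps score above zero. A length-insensitive aggregation such as the minimum removes that particular incentive, at the cost of giving no credit for additional correct work.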

Further reading