Process Reward Models and Verifiable Rewards

Standard RLHF rewards the final answer. For multi-step reasoning that is a brutally sparse signal: the model gets a single scalar at the end of a 50-step chain, with no way to tell which step was the load-bearing mistake. Process reward models (PRMs) score each intermediate step. Reinforcement Learning with Verifiable Rewards (RLVR) sidesteps the reward model entirely and uses a programmatic checker. Both target the same problem - the credit-assignment crisis on long reasoning traces - and they are now the two dominant recipes behind the 2024-2025 reasoning wave.

ORM vs PRM

An Outcome Reward Model (ORM) takes a (prompt, full chain, final answer) tuple and returns a scalar - usually a probability that the final answer is correct. It is what you get if you train a reward model the standard RLHF way on (problem, solution, correct/wrong) data.

A Process Reward Model (PRM) takes a (prompt, partial chain up to step k) pair and returns a score for step k - the probability that the reasoning so far is on the right track. Training labels are step-level: a human (or a strong model) marks each step as correct, neutral, or incorrect.
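To make the two interfaces concrete, here is a minimal sketch of what one training example might look like for each model. The class names, field names, and the +1 / 0 / -1 label encoding are hypothetical choices for illustration, not a format taken from any particular dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ORMExample:
    prompt: str
    steps: List[str]        # the full reasoning chain
    final_correct: bool     # a single label for the whole solution


@dataclass
class PRMExample:
    prompt: str
    steps: List[str]
    step_labels: List[int]  # one label per step: +1 correct, 0 neutral, -1 incorrect


def prm_training_rows(ex: PRMExample) -> List[Tuple[str, List[str], int]]:
    """Expand one annotated solution into (prompt, prefix up to step k, label) rows."""
    if len(ex.steps) != len(ex.step_labels):
        raise ValueError("need exactly one label per step")
    return [
        (ex.prompt, ex.steps[: k + 1], label)
        for k, label in enumerate(ex.step_labels)
    ]


example = PRMExample(
    prompt="Solve 2x + 3 = 11.",
    steps=["2x = 11 - 3 = 8", "x = 8 / 2 = 5"],  # the second step has an arithmetic slip
    step_labels=[1, -1],
)
rows = prm_training_rows(example)
```

One annotated solution yields one ORM label but as many PRM rows as it has steps, which is where the denser training signal comes from.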

The difference at inference time:

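# ORM-guided best-of-N: sample N complete solutions, then score each one once.
# sample_full_solution, orm and argmax are placeholder helpers.
def orm_best_of_n(prompt, N):
    candidates = [sample_full_solution(prompt) for _ in range(N)]
    scores = [orm(prompt, c) for c in candidates]
    return candidates[argmax(scores)]

# PRM-guided beam search: score partial chains after every step.
# sample_step, prm, top_b and best are placeholder helpers.
def prm_beam_search(prompt, b, k, max_steps):
    beam = [empty_chain]
    for step in range(max_steps):
        # every surviving chain proposes k candidate next steps
        expansions = [c.extend(sample_step(prompt, c)) for c in beam for _ in range(k)]
        scores = [prm(prompt, e) for e in expansions]
        beam = top_b(expansions, scores, b)  # keep best-b partial chains
    return best(beam)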

ORMs prune after the fact. PRMs prune during search - cutting off doomed branches early and reallocating compute to live ones.

Let's Verify Step by Step

The canonical PRM paper is "Let's Verify Step by Step" (Lightman, Kosaraju, Burda et al., OpenAI, 2023; arXiv 2305.20050). Setup: competition-level maths problems from the MATH dataset. Two reward models trained on the same prompts:

An ORM on (solution, final-answer-correct) labels.
A PRM on PRM800K - 800k step-level human annotations of GPT-4-generated solutions.

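Result: best-of-N with the PRM solved 78% of a representative MATH test subset, materially beating the ORM at the same N. The gap widened with N. Note that the PRM is used here to rerank complete solutions by their step-level scores, not to prune a search: with more samples to choose from, it picks out the sound chains more reliably than an ORM that only judges the outcome.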
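For best-of-N, per-step scores have to be collapsed into one number per solution. The paper describes scoring a solution as the probability that every step is correct, implemented as the product of the per-step correctness probabilities. The sketch below follows that description; the function names and the example probabilities are made up for illustration.

```python
import math
from typing import List, Sequence


def solution_score(step_probs: Sequence[float]) -> float:
    """Product of per-step correctness probabilities for one solution."""
    return math.prod(step_probs)


def best_of_n(candidates: List[str], step_probs_per_candidate: List[Sequence[float]]) -> str:
    """Return the candidate whose aggregated PRM score is highest."""
    scores = [solution_score(p) for p in step_probs_per_candidate]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


candidates = ["solution A", "solution B"]
step_probs = [
    [0.9, 0.9, 0.2],  # A: two confident steps and one doubtful step, score about 0.162
    [0.8, 0.8, 0.8],  # B: uniformly decent steps, score about 0.512
]
chosen = best_of_n(candidates, step_probs)  # picks "solution B"
```

With a product, a single low-scoring step sinks the whole chain, so a solution that reaches the right answer through a flawed step is ranked below one that is sound throughout.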

A second finding from the same line of work: PRMs can be trained on automatically labelled data (Math-Shepherd, OmegaPRM). Instead of asking a human, you sample Monte Carlo rollouts that continue the solution from each step and label the step correct if enough of them reach the correct final answer. This is the bridge from "PRMs are great but labelling is impossible at scale" to "PRMs are great and we can synthesise the labels".
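The rollout-based labelling idea can be sketched as follows. This is a simplified illustration, not the exact procedure of either paper: `rollout` stands in for sampling a completion from a policy model, and the `n_rollouts` and `threshold` parameters are assumed values chosen for the example.

```python
from typing import Callable, List


def label_steps_by_rollout(
    steps: List[str],
    rollout: Callable[[List[str]], str],
    gold_answer: str,
    n_rollouts: int = 8,
    threshold: float = 0.5,
) -> List[int]:
    """Label step k as 1 if enough rollouts from the prefix ending at k reach gold_answer."""
    labels = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        hits = sum(rollout(prefix) == gold_answer for _ in range(n_rollouts))
        labels.append(1 if hits / n_rollouts >= threshold else 0)
    return labels


def toy_rollout(prefix: List[str]) -> str:
    # Deterministic stand-in for a sampled completion: once the chain
    # contains the wrong sum, every continuation ends at the wrong answer.
    return "5" if any("2 + 2 = 5" in s for s in prefix) else "4"


steps = ["We need 2 + 2.", "2 + 2 = 5.", "So the answer is 5."]
labels = label_steps_by_rollout(steps, toy_rollout, gold_answer="4")  # [1, 0, 0]
```

A real policy is stochastic, so the hit rate is a noisy estimate, and the number of rollouts per step is the main cost of this approach.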

RLVR: skip the reward model entirely

For domains where correctness is programmatically checkable - maths with a known answer, code with unit tests, formal proofs - you do not need a learned reward model at all. You can run RL against a deterministic verifier.
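As a minimal sketch of what such a verifier could look like for maths with a known answer, the function below extracts a final answer and compares it with the reference. The "Answer:" marker convention, the function names, and the binary 0/1 reward are assumptions made for this example; real setups use whatever output format and checker their task defines.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if there is none."""
    matches = re.findall(r"Answer:\s*([^\n]+)", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, gold: str) -> float:
    """Deterministic reward: 1.0 if the extracted answer equals gold, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer earns no reward
    try:
        # Compare numerically so that "0.5" and "1/2" count as the same answer.
        return 1.0 if Fraction(answer) == Fraction(gold) else 0.0
    except (ValueError, ZeroDivisionError):
        # Fall back to exact string match for non-numeric answers.
        return 1.0 if answer == gold.strip() else 0.0


reward_right = math_reward("2x = 8, so x = 4.\nAnswer: 4", "4")
reward_wrong = math_reward("2x = 8, so x = 5.\nAnswer: 5", "4")
```

The reward depends only on the final answer, so it is an outcome signal like an ORM's, but it comes from a checker that cannot be fooled by plausible-sounding reasoning.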
