Foundations
Best-of-N and Inference-Time Selection
Best-of-N sampling generates multiple completions from a language model and returns the one ranked highest by a reward model, trading inference compute for quality without updating any weights.
intermediate · 8 min read
Asking a model once and hoping it is right is not a policy; it is a gamble. The same 7B model that scores 15.9% on SWE-bench Lite when sampled once reaches 56% when sampled 250 times and the best solution is selected -- a figure that beat the single-sample state-of-the-art at the time (Brown et al., 2024). That gap is not magic. It is a consequence of coverage: the model already knows enough to solve many problems; it just does not surface the right answer deterministically. Best-of-N and related inference-time selection methods are the systematic attempt to exploit that latent capability.
What Best-of-N Actually Does
The mechanics are disarmingly simple. Given a prompt x, sample N independent completions y_1, ..., y_N from the policy pi, score each with a reward function r, and return the argmax:
y* = argmax_{i=1..N} r(y_i)
The reward function can be a trained reward model (an outcome reward model, or ORM), a process reward model (PRM) that scores intermediate reasoning steps, a verifier that checks a final answer against ground truth, or even majority voting ("self-consistency") where the score is how many other completions agree.
The expected reward of Best-of-N under a fixed policy is analytically tractable. If you model completions as i.i.d. draws from a score distribution F, the expected maximum of N draws is the N-th order statistic: it grows roughly as log N for light-tailed distributions. In practice, gains are real but sublinear; doubling your compute budget adds less than doubling your gains.
| N | Typical gain on math benchmarks (relative) |
|---|---|
| 1 | baseline |
| 4 | +10 to +20 pp |
| 16 | +15 to +30 pp |
| 64 | +20 to +40 pp |
| 256+ | diminishing; selection noise dominates |
These numbers are indicative, not universal. They depend heavily on task difficulty, model size, and -- critically -- how good the verifier is.
Why It Matters for RL Post-Training
Best-of-N sits at an interesting position in the RL-for-LLMs hierarchy. Unlike PPO or GRPO, it requires no gradient updates, no KL penalty tuning, and no instability from reward hacking during training. It is a pure inference-time strategy.
But it connects tightly to RL concepts. The quality ceiling of Best-of-N is bounded by the reward model: optimise hard enough against any fixed proxy and you will eventually diverge from true performance (Gao, Schulman, Hilton 2022). Even without weight updates, running Best-of-N with a miscalibrated reward model is still reward overoptimisation in miniature. The model that wins the argmax is not the best completion -- it is the completion the reward model thinks is best.
Best-of-N is also the simplest implementation of the broader idea of "inference-time compute scaling": spending more FLOPs at test time rather than increasing model size. Snell et al. (2024) showed that, for problems of the right difficulty band, a smaller model with optimal test-time compute allocation can outperform a 14x larger model running a single greedy decode.
The Role of the Verifier
Verifier quality is the load-bearing variable. Three verifier types appear in practice:
Outcome reward models (ORM). A classifier trained on human preferences or correctness labels that scores a completed sequence. Fast, general-purpose, but blind to intermediate reasoning. A model that arrives at the right answer by wrong reasoning gets the same score as one that reasons correctly.
Process reward models (PRM). Score individual steps in a chain-of-thought. Cobbe et al. (2021) introduced this framing for math: train a verifier on step-level correctness, then select the completion whose worst step is highest-scored. PRMs are more expensive to label and train but are harder to fool with answer-hacking.
Automatic verifiers. For tasks with ground-truth answers (maths, code execution, formal proofs), you can verify correctness exactly. This is the regime where Best-of-N works best. Brown et al. (2024) found that coverage (fraction of problems solved by any sample) scales predictably across four orders of magnitude when an exact verifier is available, but selection with a learned reward model plateaus much earlier.
The intuition: an exact verifier is never fooled by a confident-but-wrong answer. A learned reward model can be. The harder you push N, the more likely you are to surface a completion that scores high on the proxy while being wrong.
Compute-Optimal Selection
Not all prompts benefit equally from more samples. Easy problems are solved on sample 1; very hard problems remain unsolved even at N=1000. Spending the same budget on every prompt is wasteful.
Snell et al. (2024) showed that a compute-optimal strategy -- allocating more samples to harder prompts and fewer to easy ones -- yields more than a 4x efficiency improvement over uniform Best-of-N at the same total compute budget. In practice, difficulty estimation is done via a lightweight predictor trained on the prompt.
This also motivates adaptive methods that go beyond flat sampling: beam search guided by a PRM, Monte Carlo Tree Search over reasoning steps, or iterative self-refinement. All of these can be viewed as more structured ways of searching the completion space, where Best-of-N with an ORM is the simplest possible such search.
When It Falls Down
Verifier mismatch. The reward model is a proxy, not the truth. At high N, the returned completion is the one best adapted to the reward model's blind spots, not the genuinely best answer. This is Goodhart's law applied at inference time: once the metric becomes the target, it ceases to be a good measure.
Coverage ceiling. If the correct answer never appears in any of N samples -- because the model simply does not have the capability -- no selection strategy can help. Coverage saturates for problems that exceed the model's knowledge or reasoning depth. Sampling more does not invent capability the model was never trained for.
Temperature sensitivity. Best-of-N requires diversity to be useful. With temperature near zero, all N completions collapse toward the greedy decode and you get no coverage benefit. With temperature too high, individual completions become incoherent. Calibrating temperature (and other sampling hyperparameters like top-p) for the task is non-trivial.
Latency and cost. N=64 means 64x inference cost. For interactive applications this is often unacceptable. Serving infrastructure that can batch these completions efficiently (e.g., speculative decoding, tensor parallelism across N streams) is a real engineering constraint.
Selection noise at large N. As N grows, the reward model is being applied to increasingly marginal completions. Score differences become small relative to reward model noise. The argmax becomes nearly random among the top few completions, so further scaling gives diminishing benefit -- or even hurts if the reward model has systematic biases.
Domain mismatch for reward models. A reward model trained on general instruction-following may be miscalibrated for narrow technical tasks (formal proofs, low-level code, specialised scientific reasoning). Applying Best-of-N with such a model can actively harm selection quality relative to random choice.
Further Reading
- Brown et al. (2024). "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling." arxiv.org/abs/2407.21787
- Snell et al. (2024). "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." arxiv.org/abs/2408.03314
- Gao, Schulman, Hilton (2022). "Scaling Laws for Reward Model Overoptimization." arxiv.org/abs/2210.10760
- Cobbe et al. (2021). "Training Verifiers to Solve Math Word Problems." arxiv.org/abs/2110.14168