Test-Time Compute: How Reasoning Models Buy Intelligence by the Token

In September 2024, OpenAI shipped a model that was slower than its predecessor on purpose. Ask o1 a competition math problem and it would pause, sometimes for the better part of a minute, generating thousands of tokens of private reasoning before committing to a short answer. The reasoning was not shown to the user, but it was billed to them. For the first time, a frontier lab was selling thinking time as a product, and the benchmark gains were large enough that the field reorganized around the idea within a year.

The shift has a precise name: test-time compute scaling. Instead of making a model smarter by adding parameters and training tokens, you make a fixed model smarter by letting it spend more computation at the moment it answers. The same weights, given a larger inference budget, solve harder problems.

Why this matters: for a decade, the road to better models ran through pretraining, with more parameters, more data, and more GPUs burned before the model ever met a user. Test-time compute opens a second road. On problems within its reach, a small model with a generous thinking budget can match a much larger model answering directly, which changes what you build, what you serve, and where the next order of magnitude of compute actually goes.

TL;DR

Test-time compute scaling improves a fixed model's accuracy by spending more computation at inference, through longer chains of reasoning, many parallel samples, or verifier-guided search.
Snell et al. (2024) showed that a compute-optimal strategy can be more than 4x as efficient as a naive best-of-N baseline, reaching the same accuracy with a fraction of the inference budget, and that on some problems extra inference substitutes for a roughly 14x larger model.
The two axes are sequential (one long, self-correcting reasoning trace) and parallel (many independent samples, then select or vote). They have different failure modes and combine well.
Selection is where the gains live. Even a strong generator wastes its samples under a random selector; verifiers, especially process reward models that score each step, are the difference between pass@k potential and realized accuracy.
Reinforcement learning turned the trick from a prompting hack into a native capability. DeepSeek-R1 (2025) showed, through its R1-Zero variant, that pure RL on verifiable rewards makes reasoning, self-reflection, and backtracking emerge without any supervised reasoning traces.
Budget forcing, demonstrated by the s1 model with only 1,000 training examples, controls thinking length directly: appending the word "Wait" when the model tries to stop pushes it to double-check, and often fix, its own work.
More thinking is not monotonically better. Overthinking is real and measurable: past a problem-dependent point, longer reasoning makes models abandon correct answers and accuracy declines.

At a Glance

flowchart LR
    Q[Prompt] --> G[Model generates<br/>reasoning + candidates]
    G --> S{Selection<br/>strategy}
    S -->|sequential| R[Revise and<br/>extend one trace]
    S -->|parallel| V[Score N samples<br/>vote or verify]
    R --> A[Answer]
    V --> A
    B[Inference budget] -.controls.-> G
    B -.controls.-> S

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
    class Q blue
    class G,R,V purple
    class A teal
    class B amber

The system has three moving parts: a generator that produces reasoning and candidate answers, a selection strategy that decides how to spend the budget, and the budget itself as a tunable knob. Everything interesting in test-time compute is a choice about how to spend that budget well.

[IMAGE: Side-by-side accuracy-vs-compute curves for a small model with increasing test-time budget and a large model at fixed budget, with the crossover point annotated]

Before Reasoning Models

The idea that you could trade computation for correctness at inference predates the reasoning-model era, but it arrived in pieces. Chain-of-thought prompting (Wei et al., 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, arXiv:2201.11903) showed that simply asking a model to produce intermediate steps before its answer unlocked reasoning that direct prompting could not. The model had the capability; the format was the bottleneck.

Self-consistency (Wang et al., 2022, Self-Consistency Improves Chain of Thought Reasoning, arXiv:2203.11171) added the first crude form of parallel test-time compute: sample many reasoning paths, take the majority answer. It worked because independent errors are diverse while correct reasoning tends to converge. Tree of Thoughts (Yao et al., 2023, Tree of Thoughts, arXiv:2305.10601) generalized this into explicit search over reasoning states.

The missing piece was a good way to tell a right path from a wrong one. Process supervision supplied it. Let's Verify Step by Step (Lightman et al., 2023, arXiv:2305.20050) trained a reward model on 800,000 step-level human labels (the PRM800K dataset) and found that scoring each reasoning step, rather than only the final answer, produced a verifier strong enough to solve 78% of a representative MATH subset by reranking samples. Verification, it turned out, was easier than generation, and that asymmetry is the engine of the whole field.

timeline
    title From prompting trick to native capability
    2022 : Chain-of-thought prompting
         : Self-consistency voting
    2023 : Process reward models (PRM800K)
         : Tree of Thoughts search
    2024 : OpenAI o1 ships RL-trained reasoning
         : Compute-optimal scaling formalized
    2025 : DeepSeek-R1 (pure RL reasoning)
         : s1 budget forcing with 1K examples
         : Latent reasoning in continuous space

What changed between 2022 and 2025 was not the discovery that more inference helps. It was the realization that you could train a model to use that inference well, rather than coaxing it with prompts.

How Test-Time Compute Actually Works

Test-time compute is a family of strategies, not a single algorithm. They divide cleanly along one question: do you spend your budget making one reasoning trace longer and better, or making many traces and choosing among them?

Sequential scaling: one trace, refined

Sequential scaling extends a single reasoning trajectory. The model thinks, notices a problem, backtracks, tries again, and checks its work, all within one continuous generation. This is what reasoning-native models do when they emit a long internal monologue before answering. The compute cost is the number of reasoning tokens generated, and accuracy on hard problems tends to climb as that token budget grows.

The mechanism is powerful because later tokens can attend to earlier ones. A model that wrote a wrong intermediate result can, in principle, catch it and correct course, something a single forward pass cannot do. The cost is latency: sequential tokens cannot be parallelized, so a long trace is a slow answer.

[IMAGE: Two timelines of token generation, one long sequential chain versus N short parallel chains, annotated with wall-clock latency and total token cost]

Parallel scaling: many traces, selected

Parallel scaling samples N independent completions and then picks one. The simplest selector is majority vote (self-consistency); the strongest is usually a learned verifier. Because the samples are independent, they can be generated concurrently, so parallel scaling buys accuracy with extra total compute rather than extra wall-clock time, a trade sequential scaling cannot make.
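The simplest selector fits in a few lines. This is a minimal sketch of majority voting over already-extracted final answers; the function name `majority_vote` and the string answers are illustrative assumptions, not taken from any of the cited papers.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer among sampled completions.

    Ties go to the answer seen first, because Counter.most_common keeps
    first-encountered order for equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


# Five hypothetical samples for one problem: three agree, two scatter.
samples = ["42", "42", "41", "42", "7"]
print(majority_vote(samples))
```

Voting needs no trained verifier, which is its appeal, but it can only pick an answer that several samples happen to share.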

The ceiling of parallel scaling is pass@N, the probability that at least one of the N samples is correct. Realized accuracy sits below that ceiling and is determined entirely by the selector. With a random selector you get pass@1; with a perfect oracle verifier you reach pass@N. Everything practical lives in between.

\[\text{acc}_{\text{best-of-}N} = \mathbb{E}\big[\,\text{correct}(s_{i^*})\,\big], \qquad i^* = \arg\max_{i \in \{1,\dots,N\}} r(s_i)\]

Here $r$ is the verifier's score of sample $s_i$. The quality of $r$ is the whole game. A model can have a high pass@64 and a mediocre best-of-64 if its verifier cannot reliably rank the good sample first.
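To see how wide the gap between ceiling and floor can be, here is a small sketch under a strong simplifying assumption: every sample is independently correct with the same probability `p`. Real samples from one model are correlated, so treat the number as an optimistic intuition rather than a prediction. The helper names `pass_at_n` and `best_of_n` are hypothetical.

```python
def pass_at_n(p, n):
    """Chance that at least one of n independent samples is correct."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("p must be in [0, 1] and n must be >= 1")
    return 1.0 - (1.0 - p) ** n


def best_of_n(samples, score):
    """Pick the sample the verifier scores highest (first one wins ties)."""
    return max(samples, key=score)


p, n = 0.2, 8
print(f"random selector (pass@1): {p:.3f}")
print(f"oracle selector (pass@{n}): {pass_at_n(p, n):.3f}")

# Toy verifier for illustration only: it simply prefers longer strings.
print(best_of_n(["a", "bbb", "cc"], score=len))
```

Under these assumptions a model that is right one time in five has a ceiling above 80% at eight samples, and the selector decides how much of that gap is actually collected.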

Verifiers and search

Verifiers come in two shapes. An outcome reward model (ORM) scores only the final answer. A process reward model (PRM) scores each intermediate step, which lets you prune bad reasoning before it finishes and lets you run guided search rather than blind sampling.

A PRM turns generation into a search problem. Beam search keeps the top-k partial traces by step score and expands them; lookahead search simulates a few steps ahead before committing; Monte Carlo Tree Search balances exploring new branches against exploiting promising ones. Each spends compute to find a high-scoring path through the tree of possible reasoning steps.
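The beam-search variant can be sketched generically. This is a toy under stated assumptions: `expand` stands in for sampling candidate next steps from the model and `prm_score` stands in for a process reward model scoring a partial trace. Both are hypothetical callables supplied by the caller, not a real library API.

```python
def prm_beam_search(expand, prm_score, beam_width=2, max_steps=3):
    """Keep the top `beam_width` partial traces by PRM score at each step.

    expand(trace)    -> iterable of candidate next steps for that trace
    prm_score(trace) -> float, higher is better
    A trace is a tuple of steps; the search starts from the empty trace.
    """
    beams = [()]
    for _ in range(max_steps):
        candidates = [trace + (step,) for trace in beams for step in expand(trace)]
        if not candidates:
            break  # nothing left to expand
        candidates.sort(key=prm_score, reverse=True)
        beams = candidates[:beam_width]  # prune everything below the top-k
    return max(beams, key=prm_score)


# Toy problem: each "step" is a number and the stand-in PRM sums the steps.
best = prm_beam_search(
    expand=lambda trace: [1, 2, 3],
    prm_score=lambda trace: sum(trace),
)
print(best)
```

The pruning line is where the compute saving comes from: low-scoring partial traces are dropped before the model spends more tokens finishing them.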

[IMAGE: A reasoning tree with nodes colored by PRM step-score, pruned low-score branches greyed out, and the surviving high-score path highlighted from root to answer]

flowchart TD
    Start[Problem] --> S1[Step 1 candidates]
    S1 --> P1{PRM scores}
    P1 -->|high| K1[Keep top-k]
    P1 -->|low| X1[Prune]
    K1 --> S2[Step 2 candidates]
    S2 --> P2{PRM scores}
    P2 -->|high| K2[Keep top-k]
    P2 -->|low| X2[Prune]
    K2 --> Ans[Final answer]

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
    class Start blue
    class S1,S2,K1,K2 purple
    class Ans teal
    class X1,X2 rose

Compute-optimal allocation

The central result of the field is that how you spend the budget matters more than how much you spend. Snell et al. (2024, Scaling LLM Test-Time Compute Optimally, arXiv:2408.03314) showed that matching the strategy to the problem's difficulty improves efficiency by more than 4x over a uniform best-of-N baseline. Easy problems benefit from sequential revision, because the model is close and just needs to fix small mistakes. Hard problems benefit from parallel search, because the model needs to explore genuinely different approaches.

Their stronger claim reframes the pretraining-versus-inference tradeoff: in compute-matched comparisons on problems within a model's reach, allocating FLOPs to test-time search can outperform using those FLOPs to run a model with roughly 14x more parameters. The caveat is "within reach"; on problems the small model fundamentally cannot do, no amount of thinking helps, and the larger model wins.
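As a rough illustration of difficulty-dependent allocation, the sketch below assumes you already have some estimate of how often the base model solves a given problem. The thresholds and strategy labels are invented for illustration and are not values from Snell et al.

```python
def choose_strategy(estimated_pass_rate, easy_threshold=0.6, reach_threshold=0.02):
    """Map an estimated single-sample success rate to a test-time strategy.

    Thresholds are hypothetical; in practice they would be tuned on held-out data.
    """
    if not 0.0 <= estimated_pass_rate <= 1.0:
        raise ValueError("estimated_pass_rate must be in [0, 1]")
    if estimated_pass_rate >= easy_threshold:
        return "sequential_revision"  # model is close: fix local mistakes
    if estimated_pass_rate >= reach_threshold:
        return "parallel_search"      # model needs to explore other approaches
    return "larger_model"             # out of reach: more thinking will not help


for rate in (0.8, 0.2, 0.0):
    print(rate, choose_strategy(rate))
```

The third branch encodes the "within reach" caveat: below some success rate, the budget is better spent on a stronger model than on more samples.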

Training the model to think

Prompting and search treat the model as fixed. The reasoning-model era made the model itself the variable. OpenAI's o1 (2024) was trained with reinforcement learning to produce long, useful chains of thought, so that its test-time thinking improved with both more RL training and more inference.

DeepSeek-R1 (DeepSeek-AI, 2025, arXiv:2501.12948) made the recipe public and stark. Its R1-Zero variant was trained with pure reinforcement learning on verifiable rewards (correct math answers, passing code), with no supervised reasoning traces at all. Reasoning behaviors emerged on their own: the model learned to allocate more steps to harder problems, to verify its own intermediate results, and to backtrack when a path failed. The training signal was only "was the final answer right," and the structure of good reasoning fell out of optimizing it.
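A verifiable reward of this kind can be very small. The sketch below assumes a made-up output convention in which the completion ends with a line of the form `Answer: <value>`. That format and the function name `math_reward` are illustrative assumptions, not the format DeepSeek used.

```python
import re


def math_reward(completion, reference):
    """Return 1.0 if the final 'Answer: <value>' matches the reference, else 0.0.

    The reasoning text before the answer line is ignored entirely: the reward
    only checks the outcome, never the steps.
    """
    match = re.search(r"Answer:\s*(\S+)\s*$", completion.strip())
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if match.group(1) == reference else 0.0


print(math_reward("Count the complement...\nAnswer: 738", "738"))
print(math_reward("Off by one somewhere.\nAnswer: 736", "738"))
```

Nothing in this function rewards backtracking or self-checking directly, which is what makes their emergence under RL notable.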

Seeing It in Motion

A reasoning model answering one hard question runs a loop: generate, optionally check, decide whether to continue or stop. Budget forcing makes the stop condition explicit.

sequenceDiagram
    participant U as User
    participant M as Reasoning model
    participant C as Budget controller
    U->>M: Hard problem
    M->>M: Generate reasoning step
    M->>C: Propose to stop
    C-->>M: Append "Wait" (min budget not met)
    M->>M: Re-examine, fix error
    M->>C: Propose to stop
    C-->>M: Force end (max budget hit)
    M->>U: Final answer

The controller in the middle is the entire trick behind s1 (Muennighoff et al., 2025, arXiv:2501.19393). When the model tries to end its thinking before a minimum budget, the system suppresses the end-of-thinking token and appends "Wait," which reliably causes the model to second-guess and often correct itself. When the model exceeds a maximum budget, the system forces a conclusion. Two simple interventions give direct, monotone control over the thinking length, and the resulting accuracy-versus-budget curve is the cleanest demonstration of test-time scaling on a fully open model.
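The control loop can be sketched without a real model. Here `step` is a hypothetical callable standing in for one decoding step, `END` is a made-up end-of-thinking marker, and each list element counts as one token. None of this is the s1 codebase, only the shape of the idea.

```python
END = "<end_think>"  # hypothetical end-of-thinking marker


def budget_forced_think(step, min_tokens, max_tokens):
    """Generate a thinking trace with a floor and a ceiling on its length.

    step(trace) -> next token as a string, or END to propose stopping.
    """
    trace = []
    while len(trace) < max_tokens:      # ceiling: force the end at max budget
        token = step(trace)
        if token == END:
            if len(trace) >= min_tokens:
                break                   # floor satisfied: accept the stop
            token = "Wait"              # floor not met: suppress stop, nudge a re-check
        trace.append(token)
    return trace


# Scripted stand-in model: proposes to stop whenever the length is 2 mod 3.
def scripted(trace):
    return END if len(trace) % 3 == 2 else "t"


print(budget_forced_think(scripted, min_tokens=5, max_tokens=10))
```

With a real model, the forced-end branch would also append whatever delimiter tells the model to emit its final answer.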

A second, more radical direction moves reasoning out of token space entirely. Latent reasoning (Geiping et al., 2025, Scaling up Test-Time Compute with Latent Reasoning, arXiv:2502.05171) iterates a recurrent block in the model's continuous hidden state, so the model can "think" for more internal steps without emitting more text. This decouples reasoning depth from output length, which sidesteps the latency tax of long visible chains.

stateDiagram-v2
    [*] --> Thinking
    Thinking --> Checking: propose answer
    Checking --> Thinking: budget remains, doubt found
    Checking --> Answer: budget met, confident
    Thinking --> Answer: max budget forced
    Answer --> [*]

By the Numbers

Real figures from the primary literature, with sources. Treat any value labeled approximate as an order-of-magnitude guide rather than a precise benchmark.

Result	Figure	Source
Compute-optimal vs best-of-N efficiency	> 4x less compute for the same accuracy	Snell et al., 2024
Inference compute substituting for parameters	up to ~14x larger model (in-reach problems)	Snell et al., 2024
Process supervision on MATH subset	78% solved by PRM reranking	Lightman et al., 2023
Human step-label dataset size	800,000 labels (PRM800K)	Lightman et al., 2023
s1 training set	1,000 curated examples	Muennighoff et al., 2025
s1-32B vs o1-preview on competition math	up to +27% (MATH and AIME24)	Muennighoff et al., 2025
DeepSeek-R1 vs OpenAI o1-1217	comparable on reasoning benchmarks	DeepSeek-AI, 2025
Projected inference vs training compute	inference demand greatly exceeds training (approx.)	industry analyst projections

The cost model is simple to first order. A dense transformer spends roughly $2P$ FLOPs per generated token for $P$ parameters (written $P$ here to keep $N$ free for the sample count), so the inference cost of a reasoning trace scales with the number of tokens it generates:

\[\text{cost} \approx 2 P \cdot T_{\text{think}}\]

where $T_{\text{think}}$ is the reasoning token count. Parallel scaling multiplies this by the number of samples $N$; sequential scaling grows $T_{\text{think}}$ directly. Either way, a reasoning answer can cost ten to a hundred times a direct answer, which is why selection efficiency, getting the right answer with fewer tokens, is where the engineering effort concentrates.
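Plugging numbers into the first-order formula shows the scale. The model size and token counts below are illustrative assumptions, and the estimate ignores attention over the growing context, so it is a floor rather than a measurement.

```python
def reasoning_flops(params, think_tokens, samples=1):
    """First-order decode cost: 2 FLOPs per parameter per generated token."""
    return 2 * params * think_tokens * samples


PARAMS = 32_000_000_000  # a hypothetical 32B dense model

direct = reasoning_flops(PARAMS, think_tokens=100)          # short direct answer
long_trace = reasoning_flops(PARAMS, think_tokens=5_000)    # one long reasoning trace
best_of_8 = reasoning_flops(PARAMS, think_tokens=5_000, samples=8)

print(f"direct answer: {direct:.2e} FLOPs")
print(f"one long trace: {long_trace:.2e} FLOPs ({long_trace // direct}x direct)")
print(f"best-of-8: {best_of_8:.2e} FLOPs ({best_of_8 // direct}x direct)")
```

Under these assumed numbers a single 5,000-token trace costs 50 times the direct answer, and sampling eight of them costs 400 times.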

[IMAGE: Log-log plot of accuracy versus test-time token budget for an open reasoning model, showing the diminishing-returns curve and the overthinking downturn]

A Concrete Example

Take an AIME-style problem: "Find the number of ordered pairs of positive integers $(a,b)$ such that $a + b = 1000$ and neither $a$ nor $b$ has a zero digit."

Run a 32B reasoning model with parallel scaling at N = 8 and a majority vote. Here is how the samples might land:

Sample	Approach	Answer
1	Count, subtract zero-digit cases	738
2	Inclusion-exclusion on digits	738
3	Brute mental enumeration, arithmetic slip	740
4	Inclusion-exclusion	738
5	Complementary counting	738
6	Off-by-one on boundary	736
7	Inclusion-exclusion	738
8	Miscounts hundreds digit, no errors flagged	738

[IMAGE: Histogram of the eight sampled answers showing the 738 mode dominating the scattered wrong answers, illustrating why majority vote is robust to per-sample noise]

Six of eight samples converge on 738; the majority vote returns 738, discarding the two wrong answers (740 and 736). A verifier-based best-of-8 would reach the same answer by scoring sample 1's clean derivation highest, even if only four samples had agreed.

Now switch to sequential scaling with budget forcing on a single trace. The model computes the complementary count, reaches 736, and moves to end. The controller appends "Wait." The model re-reads its boundary condition, notices it excluded a valid pair at the edge, corrects 736 to 738, and stops. One trace, one self-correction, the right answer. The same fix that majority vote achieved by averaging out noise, sequential scaling achieved by catching the specific error. This is why compute-optimal allocation prefers sequential revision when the model is already close and parallel search when it is not.
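Because this problem is small enough to enumerate, the target answer can be checked by brute force, and the vote over the eight answers in the table can be tallied directly. The sketch below does both; the helper names are hypothetical.

```python
from collections import Counter


def has_zero_digit(n):
    """True if the decimal representation of n contains the digit 0."""
    return "0" in str(n)


def count_pairs(total=1000):
    """Count ordered pairs (a, b) of positive integers with a + b = total
    where neither a nor b contains a zero digit."""
    return sum(
        1
        for a in range(1, total)
        if not has_zero_digit(a) and not has_zero_digit(total - a)
    )


# Final answers of the eight samples, in table order.
sampled = [738, 738, 740, 738, 738, 736, 738, 738]
winner, votes = Counter(sampled).most_common(1)[0]

print(count_pairs())   # ground truth by enumeration
print(winner, votes)   # majority answer and its vote count
```

A model has no such oracle at inference time, which is why it falls back on voting, verification, or self-revision.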

Where It Breaks

More thinking is not free improvement, and the failure modes are specific.

Overthinking is the sharpest. Past a problem-dependent budget, additional reasoning can make accuracy fall: the model talks itself out of a correct answer, introduces an error during unnecessary revision, or drifts off the problem. The accuracy-versus-budget curve is not monotone; it rises, plateaus, and on many problems turns down. Budget forcing helps precisely because an unbounded thinker is often a worse thinker.
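One practical response is to measure the curve on held-out problems and cap the budget where it peaks. The sketch below assumes you already have such a curve; the numbers are made up to show the rise, plateau, and decline shape and are not results from any paper.

```python
def best_budget(curve):
    """Pick the thinking budget with the highest validation accuracy.

    curve maps budget (tokens) -> accuracy. Ties go to the smaller budget,
    since extra tokens that buy nothing are pure cost.
    """
    if not curve:
        raise ValueError("curve must contain at least one measurement")
    return min(curve, key=lambda budget: (-curve[budget], budget))


# Hypothetical measurements: gains, plateau, then an overthinking downturn.
curve = {512: 0.40, 1024: 0.52, 2048: 0.58, 4096: 0.58, 8192: 0.55}
print(best_budget(curve))
```

Because the turning point is problem-dependent, a single global cap is a blunt tool; per-difficulty caps follow the same logic on separate curves.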

[IMAGE: A single accuracy-versus-thinking-budget curve marked into three zones, gains, plateau, and overthinking decline, with the optimal stop point flagged]

Verifier quality caps everything. A best-of-N system inherits the blind spots of its reward model. If the PRM systematically scores a plausible-but-wrong reasoning style highly, more samples make the system more confidently wrong, because the bad pattern is exactly what gets selected. Reward models are also vulnerable to reward hacking under RL: the policy learns to produce traces the verifier likes rather than traces that are correct.

Latency and cost are the operational tax. A sequential trace of 10,000 tokens cannot be parallelized and may take tens of seconds, which is unacceptable for interactive use. Parallel scaling recovers latency but multiplies cost linearly in N. Neither is acceptable for high-volume, low-margin traffic, which is why production systems route only hard queries to the expensive path.
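The trade can be put in rough numbers. This sketch assumes a fixed decode speed per stream and perfectly concurrent parallel samples, both idealizations, and the 200 tokens-per-second figure is an arbitrary placeholder.

```python
def sequential_latency_s(think_tokens, tokens_per_second):
    """Wall-clock seconds for one trace decoded token by token."""
    return think_tokens / tokens_per_second


def parallel_profile(n_samples, tokens_per_sample, tokens_per_second):
    """Latency and total tokens for n samples decoded fully concurrently."""
    return {
        "latency_s": tokens_per_sample / tokens_per_second,
        "total_tokens": n_samples * tokens_per_sample,
    }


SPEED = 200  # assumed decode speed per stream, tokens per second

print(sequential_latency_s(10_000, SPEED))   # one long trace
print(parallel_profile(8, 2_000, SPEED))     # eight shorter traces
```

Under these assumptions the parallel run answers five times sooner but bills 60% more tokens, which is the shape of the trade the paragraph describes.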

Finally, test-time compute cannot manufacture capability the base model lacks. If a problem requires a fact the model never learned or a leap it cannot make in any sample, no budget recovers it. Thinking longer helps a model that could get there; it does not rescue one that could not.

Alternative Designs

The strategies are not exclusive; production systems blend them. The honest comparison is about where each wins.

Approach	Strengths	Weaknesses	Best when
Sequential (long CoT)	Self-correction, low sample count, high ceiling on near-miss problems	High latency, prone to overthinking	Model is close; errors are local
Parallel (best-of-N, voting)	Low latency via concurrency, robust to per-sample noise	Cost grows with N, capped by verifier	Many independent attempts help; diversity matters
PRM-guided search	Prunes bad paths early, sample-efficient	Needs a strong PRM, complex to build	Long multi-step problems with checkable steps
Bigger base model	No inference-time complexity, broad capability	Fixed cost on every query, expensive to train	Capability is the bottleneck, not effort
Latent reasoning	Depth without output length, lower latency	Less interpretable, newer and less proven	Reasoning depth needed but tokens are costly

The unifying lesson from Snell et al. is that the optimal mix is difficulty-dependent. A system that classifies query difficulty and routes accordingly beats any single fixed strategy, which is why "adaptive" allocation, not "maximal" allocation, is the design target.

How It Is Used in Practice

Reasoning models are now a product category. OpenAI's o-series, DeepSeek-R1 and its open derivatives, and the thinking modes of other frontier models all expose test-time compute as a user- or developer-controllable setting, often as discrete "reasoning effort" levels that map to token budgets. The pattern in deployment is tiered: a fast non-reasoning path handles easy, high-volume queries, and a slow reasoning path is reserved for problems where the accuracy gain justifies the latency and cost.
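A tiered deployment of this kind reduces to a small routing function. The effort names, token budgets, and difficulty threshold below are hypothetical placeholders, not any vendor's actual settings or API.

```python
# Hypothetical mapping from effort level to a thinking-token budget.
EFFORT_BUDGETS = {"low": 1_000, "medium": 4_000, "high": 16_000}


def route(difficulty, effort="medium", hard_threshold=0.5):
    """Send easy queries to the fast path and hard ones to the reasoning path.

    difficulty is an assumed score in [0, 1] from some upstream classifier.
    """
    if effort not in EFFORT_BUDGETS:
        raise ValueError(f"unknown effort level: {effort!r}")
    if difficulty < hard_threshold:
        return {"path": "fast", "think_budget": 0}
    return {"path": "reasoning", "think_budget": EFFORT_BUDGETS[effort]}


print(route(0.1))
print(route(0.9, effort="high"))
```

The classifier that produces `difficulty` is the hard part in practice; the router itself is deliberately trivial.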

The economic consequence is the one analysts have flagged most loudly. As reasoning models proliferate, inference is projected to become the dominant compute consumer, because every hard query can now spend thousands of reasoning tokens on a single answer. That would reverse the long-standing assumption that training is where the GPUs go, and it reshapes capacity planning around serving rather than building.

The open ecosystem matters here. DeepSeek-R1 and s1 demonstrated that the capability is reproducible without a frontier lab's budget: R1 with a public RL recipe, s1 with a thousand examples and a clever stop condition. That reproducibility is why test-time compute spread across the industry in months rather than years.

[IMAGE: System diagram of a production router sending easy queries to a fast model and hard queries to a reasoning model with a token-budget controller]

Insights Worth Remembering

Generation and verification are asymmetric: checking a reasoning step is easier than producing it, and that gap is the energy source for every selection-based method.
The budget is not the strategy. Doubling tokens with a bad allocation can lose to half the tokens spent well; compute-optimal beats compute-maximal.
Parallel scaling buys you pass@N as a ceiling, but you only collect it through a good selector. The verifier, not the generator, often sets realized accuracy.
Reinforcement learning on verifiable rewards turns reasoning from a prompting style into an emergent behavior, which is why pure-RL models rediscover backtracking and self-checking unprompted.
Control over thinking length is itself a capability. The word "Wait" is a one-token intervention that recovers a measurable fraction of accuracy.
Overthinking means the relationship between compute and accuracy is a hill, not a ramp. Knowing when to stop is part of reasoning well.
Test-time compute helps a model reach more of what it is already capable of; it does not extend that capability. New capability still comes from the base model.

Open Questions

The evidence is strong that test-time compute helps on verifiable problems (math, code, formal logic) where a reward is cheap to compute. It is far less settled how the same machinery transfers to open-ended domains, where "correct" is a judgment rather than a check. Whether learned verifiers can score open-ended reasoning without being gamed is an open problem, not a solved one.

The pretraining-versus-inference tradeoff is also unresolved in the limit. Snell et al. showed inference can substitute for parameters within a model's reach; it remains an empirical question how far that substitution extends as base models grow and as reasoning training improves. The likely answer is that the two scale together rather than one replacing the other, but the exact curve is still being measured.

Latent reasoning is the most speculative thread. Moving reasoning into continuous hidden state could decouple thinking depth from token cost entirely, but the interpretability cost is real: a chain of thought you can read is also a chain you can audit, and reasoning in latent space gives that up. Whether the efficiency is worth the opacity is a question the field has not answered.

Sources and Further Reading

Additional Resources

OpenAI, Learning to Reason with LLMs (o1 system documentation), 2024
The s1 project page and code: simplescaling.github.io

A note on figures: benchmark numbers above are quoted from the cited primary sources. Projections about inference exceeding training compute are analyst estimates and are labeled approximate; treat them as directional rather than precise.