Reasoning Models

Test-time Compute Scaling

Why "thinking longer" at inference can substitute for "training bigger", how the trade-off is operationalised as a token budget, and where the strategy stops paying.

For most of the deep-learning era the only knob that mattered was train-time compute. Bigger model, more tokens, more FLOPs, better loss. The 2024 reasoning wave introduced a second knob the field had under-used: inference-time compute. Hand the same model more thinking room per query - a longer internal scratchpad, more candidate solutions, an explicit search - and on a wide class of problems it gets meaningfully better without changing a single weight.

What "thinking longer" actually means

There is no new mechanism. The model still emits one token at a time. What changes is the budget:

  • Longer chains of thought. Let the decoder generate thousands of intermediate reasoning tokens before the final answer. Those tokens are hidden from the user but consume real compute (and real money).
  • Parallel sampling. Draw N independent completions and aggregate (majority vote, best-of-N against a verifier, self-consistency).
  • Search. Tree of Thoughts, MCTS-style rollouts, beam search over reasoning steps. The model becomes a node-expansion oracle and a value head; an outer loop drives the search.
  • Iterative refinement. Generate, critique, revise. Each pass is another inference call.

All four cost more wall-clock and more tokens. All four can move a fixed-weights model up the accuracy curve on hard problems.
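A minimal sketch of the two aggregation rules named under parallel sampling, assuming the N completions have already been reduced to their final-answer strings. The `score` callable stands in for whatever verifier is available (unit tests, an answer key, a reward model); the function names and the toy samples are illustrative, not from any particular library.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Self-consistency: return the most common final answer."""
    if not answers:
        raise ValueError("need at least one answer")
    # Counter.most_common breaks ties by first appearance.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate the verifier scores highest."""
    if not answers:
        raise ValueError("need at least one answer")
    return max(answers, key=score)


# Stand-in for five sampled completions, reduced to their final answers.
samples = ["42", "41", "42", "17", "42"]

print(majority_vote(samples))  # modal answer
# Hypothetical verifier that happens to know "17" is correct.
print(best_of_n(samples, score=lambda a: 1.0 if a == "17" else 0.0))
```

The two rules can disagree, as they do here: majority vote returns the modal answer, while best-of-N returns whatever the verifier prefers, even if only one sample produced it.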

The Snell et al. result

The clearest empirical statement is Snell, Lee, Xu and Kumar (UC Berkeley and Google DeepMind, August 2024): "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters". Headline findings:

  1. On MATH-difficulty problems, a small model with a compute-optimal test-time strategy can match the accuracy of a model ~14x larger evaluated greedily.
  2. The optimal strategy is question-difficulty-dependent. Easy problems: a single greedy decode is fine. Hard problems: sequential revision plus verifier-guided search wins. Naïve "always sample 64 and majority vote" leaves performance on the table.
  3. The trade-off only holds when base capability is non-trivial. If the model cannot solve the problem in any of N samples, sampling more is wasted compute.

This is the reason a 32B reasoning model with a long thinking budget can outscore a 400B generalist on AIME or MATH: the small model is paying for accuracy in inference FLOPs the large model would have paid in training FLOPs.
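Finding 2 amounts to a routing decision made before the expensive inference starts. The sketch below assumes some estimate of the model's per-sample success rate on the question is available (for example from a small pilot batch scored by a verifier). The thresholds and strategy names are illustrative assumptions, not values from the paper.

```python
def pick_strategy(estimated_pass_rate: float) -> str:
    """Map an estimated per-sample success rate to a test-time strategy.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= estimated_pass_rate <= 1.0:
        raise ValueError("pass rate must be in [0, 1]")
    if estimated_pass_rate >= 0.8:
        # Easy question: one greedy decode is usually enough.
        return "greedy"
    if estimated_pass_rate > 0.0:
        # Hard but solvable: spend budget on revision and guided search.
        return "revise_and_search"
    # No sample ever succeeds: more sampling is wasted compute (finding 3).
    return "skip_or_escalate"


for rate in (0.95, 0.30, 0.0):
    print(rate, pick_strategy(rate))
```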

When the trade is worth it

Workload                          | Test-time scaling? | Why
----------------------------------+--------------------+---------------------------------------------
Competition maths, theorem proving | Yes                | Verifiable answer, no latency contract
Coding agent that submits a PR    | Yes                | Minutes are acceptable, correctness compounds
Scientific data analysis          | Yes                | Throughput matters more than latency
Customer-support chat             | No                 | User waits, expects sub-second first token
Autocomplete                      | No                 | 100ms budget kills everything but greedy
Voice assistant turn              | No                 | Same
Bulk classification               | Maybe              | Depends on per-call $ vs accuracy gain

The clean rule: scale test-time compute when the value of one extra correct answer exceeds the cost of the extra tokens plus the latency penalty. Reasoning models exist because that inequality flipped for a lot of high-value workloads.
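One way to read the rule as a per-query check, assuming the accuracy gain from the extra budget can be estimated and the latency penalty can be expressed in dollars. All parameter names and numbers below are hypothetical.

```python
def worth_scaling(
    value_per_correct_usd: float,
    accuracy_gain: float,
    extra_tokens: int,
    usd_per_token: float,
    latency_penalty_usd: float = 0.0,
) -> bool:
    """Return True when the expected value of the extra accuracy
    exceeds the cost of the extra tokens plus the latency penalty."""
    expected_gain = value_per_correct_usd * accuracy_gain
    extra_cost = extra_tokens * usd_per_token + latency_penalty_usd
    return expected_gain > extra_cost


# Hypothetical coding agent: a correct PR is worth $50, +10 points of accuracy.
print(worth_scaling(50.0, 0.10, 30_000, 0.00006))
# Hypothetical support chat: low value per answer, real cost of making the user wait.
print(worth_scaling(0.50, 0.10, 30_000, 0.00006, latency_penalty_usd=0.25))
```

The first call comes out in favour of scaling and the second against it, matching the "Yes" and "No" rows in the table above.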

How the budget shows up in your bill

In the OpenAI o-series and Anthropic extended-thinking APIs, the model emits reasoning tokens that you are billed for but that are often hidden or only summarised in the response. A single hard query can burn 20-50k reasoning tokens before producing 200 visible tokens of answer. Per-call cost goes from cents to dollars. For agentic workflows running thousands of queries this is the dominant line item, and the reason the APIs expose a knob such as OpenAI's reasoning_effort or Anthropic's thinking budget_tokens - the caller has to choose where on the cost-accuracy curve to sit.
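The arithmetic behind "cents to dollars" is short enough to write down. This sketch assumes reasoning tokens are billed at the same rate as visible output tokens. The $60 per million output tokens is a placeholder, not a quoted price for any model.

```python
def call_cost_usd(
    reasoning_tokens: int,
    visible_tokens: int,
    usd_per_million_output: float,
) -> float:
    """Output-side cost of one call, counting reasoning tokens as output."""
    billed_tokens = reasoning_tokens + visible_tokens
    return billed_tokens * usd_per_million_output / 1_000_000


PRICE = 60.0  # hypothetical USD per million output tokens

low = call_cost_usd(20_000, 200, PRICE)
high = call_cost_usd(50_000, 200, PRICE)
print(f"{low:.2f} to {high:.2f} USD per hard query")  # 1.21 to 3.01 USD per hard query
```

At that placeholder price, the 200 visible tokens account for about one cent of each call. The rest is reasoning.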

Where it falls down

  • Interactive chat. Users will not wait 45 seconds for a reasoning trace to finish. Streaming the visible answer masks some of the delay, but the hidden reasoning still has to complete before the first answer token appears.
  • Low base-rate problems. If the model gets a class of problem right 0% of the time, sampling 1000 traces still gives you 0%. Compute helps the middle of the distribution, not the impossible tail.
  • Verifier-bottlenecked tasks. Best-of-N needs a way to pick the best. Without a checker (unit tests, a maths answer key, a process reward model) you fall back to self-consistency, which only works when the model's wrong answers are diverse and the right answer is the modal one.
  • Token-budget reward hacking. Models trained with a length-reward signal learn to use the budget regardless of need - padding chains-of-thought with restatements that look like reasoning. The mitigations are length-penalty rewards, adaptive stopping, and serving-side caps.
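The low base-rate point follows from a one-line formula. Assuming samples are independent and a perfect verifier picks out any correct one, the chance that at least one of N samples is right is 1 - (1 - p)^N, where p is the per-sample success rate. Both assumptions are idealisations, so treat the result as an upper bound.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct,
    given per-sample success rate p (assumes a perfect verifier)."""
    if not 0.0 <= p <= 1.0 or n < 0:
        raise ValueError("need 0 <= p <= 1 and n >= 0")
    return 1.0 - (1.0 - p) ** n


print(coverage(0.0, 1000))   # 0.0: no amount of sampling helps
print(coverage(0.05, 64))    # a 5% base rate climbs above 0.96
print(coverage(0.50, 4))     # 0.9375: mid-difficulty problems saturate fast
```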

What changed in 2024-2026

The Snell paper put numbers on what frontier labs had already been quietly exploiting. Between late 2024 and early 2025 every major lab shipped a "reasoning" SKU: OpenAI o1 then o3, Anthropic extended thinking, Google Gemini 2.0 Flash Thinking, DeepSeek-R1, Qwen QwQ. The "small-model + big-inference" trade became a real product axis, not a research demo. The corollary nobody anticipated: the cost of a hard query went from predictable to wildly variable, and inference infrastructure had to learn to budget per-call.

Further reading