← Concept library

Reasoning Models

Chain of Thought, Self-consistency, and Search at Inference

A tour of the inference-time reasoning toolkit - from zero-shot CoT prompts to MCTS-decoded reasoning trees, and when each pays for itself.

intermediate · 8 min read

Chain-of-thought (CoT) prompting taught the field that reasoning is partly decoding strategy, not just model capability. The four years since have built a small zoo of inference-time techniques on top of that observation: sample one chain, sample many and vote, expand them into a tree, search the tree with a value function. They form a ladder of increasing compute for increasing accuracy on hard problems, and the choice of rung is the first question to answer when you put a reasoning workload into production.

Zero-shot CoT vs trained CoT

Zero-shot CoT is a prompt trick: append "Let's think step by step" (Kojima et al, 2022) and the model emits a reasoning trace before its answer. Works on any instruction-tuned model. Free. Often gives most of the gain.

Few-shot CoT (Wei et al, 2022) shows 2-5 worked examples with explicit reasoning before the target question. Better than zero-shot when you can craft good demonstrations.

Trained CoT bakes the behaviour into the weights. The model is SFT-tuned (or RL-tuned) on long reasoning traces so it produces them spontaneously without a prompt. Every frontier model since GPT-4 does this implicitly. The o-series, R1, Gemini Thinking take it further: reasoning is the primary output mode, often hidden from the user.

The practical takeaway: on a modern model, "think step by step" rarely beats the model's default behaviour. The interesting knobs are above the prompt level.

Self-consistency

Wang et al, ICLR 2023, "Self-Consistency Improves Chain of Thought Reasoning" (arXiv 2203.11171). Sample N independent reasoning chains at temperature > 0, look at the final answer of each, return the majority vote.

def self_consistency(prompt, n=40, temperature=0.7):
    answers = []
    for _ in range(n):
        chain = sample(prompt, temperature=temperature)
        answers.append(extract_final_answer(chain))
    return mode(answers)

Reported gains on GSM8K: +17.9 points over greedy CoT. On MATH, AQuA, StrategyQA, ARC: similar story.

Why it works: greedy decoding commits early to one chain and propagates any errors in that chain forward. Sampled chains explore different reasoning paths; the correct answer tends to be reachable via several distinct paths while wrong answers are usually one-off mistakes. Marginalising over chains concentrates probability on the correct answer.

Where it breaks: tasks where wrong answers are systematically modal (the model is confidently wrong in a consistent way), tasks with free-form outputs that cannot be aggregated by mode (essays, code), tasks where N is bounded by latency budget below the level needed for the variance to average out.

Tree of Thoughts and Graph of Thoughts

Yao et al, NeurIPS 2023, "Tree of Thoughts" (arXiv 2305.10601) frames reasoning as deliberate search:

  • A thought is a coherent intermediate text chunk (a step, a sub-conclusion).
  • The model generates multiple candidate thoughts at each step.
  • A value function (either the model self-evaluating or an external checker) scores each candidate.
  • A search algorithm (BFS, DFS, beam) expands the most promising branches.

On Game of 24, ToT solved 74% of problems versus 4% for GPT-4 with chain-of-thought. The gap is what disciplined search buys you on combinatorial problems.

Graph of Thoughts (Besta et al, 2023) generalises further: thoughts form a DAG, allowing merging (aggregating two partial solutions) and refining (improving a thought based on others). More expressive, more expensive, mostly useful for tasks with natural composition structure.

MCTS-style decoding

For problems with a verifier and large branching factors, Monte Carlo Tree Search becomes attractive. The model is the node-expansion policy; the verifier (or a value model) is the reward at terminal nodes; UCB balances exploration and exploitation.

rStar / rStar-Math (Microsoft, 2024) used MCTS with a process reward model to teach a 7B Qwen to match o1-preview on MATH. LLaMA-Berry (2024) added pairwise preference search on solution pairs. Both are existence proofs that small base model + heavy inference search + good value function can beat much larger models without search on verifiable maths tasks.

Schematic of MCTS decoding:

def mcts_decode(prompt, n_iterations=400):
    root = Node(prompt)
    for _ in range(n_iterations):
        leaf = select(root)                   # UCB descent
        children = expand(leaf, model)        # sample k continuations
        value = evaluate(children, verifier)  # PRM or programmatic check
        backpropagate(leaf, value)
    return best_path(root)

When parallel sampling beats deeper single-trace

The Snell test-time-compute paper made this explicit: the optimal strategy depends on problem difficulty.

Problem difficulty Best strategy
Easy (model usually right first try) Greedy decode
Medium (model right with effort) Self-consistency, modest N
Hard (rare correct paths) Verifier-guided beam search / MCTS
Pathological (model essentially incapable) More base capability, not more search

Heuristic: if the model's pass@1 is decent but pass@N grows fast with N, parallel sampling (self-consistency, best-of-N) wins. If pass@1 is low and individual chains tend to dead-end early, search (ToT, MCTS with a PRM) wins because it can prune doomed branches before they burn compute. If pass@N plateaus near pass@1, your bottleneck is the model, not the decoding strategy.

Where it falls down

  • Aggregation requires comparable outputs. Majority-vote is trivial for "answer is a number" and hard for "answer is a paragraph". For open-ended generation, you fall back to LLM-as-judge ranking, which adds cost and bias.
  • Value functions are expensive. A real verifier (run the code, check the proof) is the cheapest. A trained PRM is next. Self-evaluation is the weakest and most prone to confident wrong scores.
  • Search depth multiplies cost fast. A branching factor of 4 and depth of 10 is 4^10 nodes if you do not prune. You always prune; the prune quality determines whether you find the right answer.
  • Latency budget. Self-consistency at N=64 is 64x the latency of greedy. For an interactive product this is unacceptable; for a batch agent it is fine.

What changed in 2024-2026

The frontier shifted from "is CoT useful?" (answered: yes, by 2022) to "how do you turn extra inference compute into extra accuracy efficiently?". By 2025 the answer was a pipeline rather than a single technique: trained CoT in the model, self-consistency or PRM-guided search at inference, optional verifier in the loop for agent tasks. Reasoning models like o1, R1 and Gemini Thinking are the integration of these moves into the model itself, not new mechanisms.

Further reading