Prompt Chaining and Task Decomposition

Ask a model to solve a compositional problem in one shot and it will often get the final answer wrong, even though it answers every sub-question correctly when asked separately. Press et al. named this the compositionality gap: the frequency with which a model can answer all the sub-problems but not compose them into the overall solution. They found that as GPT-3 scaled, single-hop accuracy improved faster than multi-hop accuracy, so the gap did not close with size. Scaling alone does not teach a model to chain its own reasoning reliably. The fix is structural rather than a matter of scale: break the task into a pipeline of simpler prompts, solve them in order, and feed each answer forward.

Three ways to decompose

The literature converged on three patterns, each solving a slightly different problem.

Least-to-most prompting (Zhou et al.) decomposes a hard problem into an ordered list of simpler subproblems, then solves them in sequence, appending each answer to the context for the next. The name is the mechanism: you order subproblems easiest-first so that solving one supplies exactly what the next one needs. The headline result was compositional generalisation. On the SCAN command-parsing benchmark, standard chain-of-thought on code-davinci-002 scored about 16%; least-to-most reached at least 99% with 14 examples. The win is easy-to-hard generalisation: the model solves problems harder than any single example it was shown, because no single step is harder than the examples.
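The two stages can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `llm` is a hypothetical callable that takes a prompt and returns a completion, the prompt wording is invented, and the sketch assumes the last subproblem in the plan is the original question.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def least_to_most(question: str, llm: LLM) -> str:
    """Decompose, then solve subproblems in order, feeding answers forward."""
    # Stage 1: ask for an ordered, easiest-first list of subproblems.
    plan = llm(
        "List the subproblems needed to answer, easiest first, one per line:\n"
        f"{question}"
    )
    subproblems: List[str] = [
        line.strip() for line in plan.splitlines() if line.strip()
    ]

    # Stage 2: solve each subproblem with every earlier Q/A pair in context.
    context = f"Problem: {question}\n"
    answer = ""
    for sub in subproblems:
        answer = llm(f"{context}Q: {sub}\nA:").strip()
        context += f"Q: {sub}\nA: {answer}\n"

    # Assumption: the final subproblem restates the original question.
    return answer
```

The growing `context` string is the whole trick: by the time the hardest subproblem is asked, the answers it depends on are already sitting in the prompt.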

Decomposed prompting (Khot et al.) makes the decomposition modular. A controller prompt breaks the task into sub-tasks and dispatches each to a dedicated handler drawn from a shared library of prompting-based LLMs, each specialised for one operation. A handler can itself be decomposed recursively, swapped for a stronger prompt, or replaced with a symbolic function or a retrieval call. This is the shift from "one clever prompt" to a small software system: sub-tasks become named, reusable, independently optimisable units.
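A toy version of the controller-and-handlers layout might look like the following. Everything here is an illustrative assumption: the handler names, the `{prev}` placeholder convention, and the fact that the controller returns its whole plan up front (a simplification; a controller can also emit sub-tasks one at a time). Plain Python functions stand in for what would normally be few-shot prompts.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str], str]


def run_decomposed(
    task: str,
    controller: Callable[[str], List[Tuple[str, str]]],
    handlers: Dict[str, Handler],
) -> str:
    """Run the controller's plan, dispatching each step to its named handler."""
    value = task
    for name, instruction in controller(task):
        if name not in handlers:
            raise KeyError(f"no handler registered for sub-task {name!r}")
        # "{prev}" in an instruction is replaced by the previous step's output.
        value = handlers[name](instruction.replace("{prev}", value))
    return value


# Hypothetical handlers: "split" could be an LLM prompt in practice, while
# "first_letters" is a symbolic function swapped in for one.
handlers: Dict[str, Handler] = {
    "split": lambda text: " ".join(text.split()),
    "first_letters": lambda text: "".join(w[0] for w in text.split()),
}


def toy_controller(task: str) -> List[Tuple[str, str]]:
    # Stand-in for the controller prompt's output.
    return [("split", "{prev}"), ("first_letters", "{prev}")]


print(run_decomposed("prompt  chaining works", toy_controller, handlers))  # prints: pcw
```

Because handlers are looked up by name, replacing one with a stronger prompt or a retrieval call is a one-line change to the registry and leaves the controller untouched.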

Self-ask (Press et al.) pushes the decomposition inside a single generation. Before answering the top-level question, the model explicitly writes follow-up questions and answers them, one at a time, then composes the final answer. Because the follow-ups are surfaced as explicit text, you can intercept each one and route it to a search engine instead of trusting the model's parametric recall, which is where a large part of self-ask's accuracy gain comes from on multi-hop factual questions.

Question: Who was US president when the Eiffel Tower opened?
Are follow-up questions needed here: Yes.
Follow-up: When did the Eiffel Tower open?
Intermediate answer: 1889.
Follow-up: Who was US president in 1889?
Intermediate answer: Benjamin Harrison.
So the final answer is: Benjamin Harrison.
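Intercepting the follow-ups in a transcript like the one above could be sketched as below. The names `llm` and `search` are hypothetical callables, and the sketch assumes the model is configured to stop generating before it writes its own "Intermediate answer:" line, so that the external tool supplies that text instead.

```python
import re
from typing import Callable

FOLLOW_UP = re.compile(r"Follow-up:\s*(.+)")
FINAL = re.compile(r"So the final answer is:\s*(.+)")


def self_ask(
    question: str,
    llm: Callable[[str], str],
    search: Callable[[str], str],
    max_hops: int = 5,
) -> str:
    """Let the model write follow-ups; answer each one with `search`."""
    transcript = f"Question: {question}\nAre follow-up questions needed here:"
    for _ in range(max_hops):
        # Assumption: `llm` stops before writing "Intermediate answer:".
        step = llm(transcript)
        transcript += step
        final = FINAL.search(step)
        if final:
            return final.group(1).strip()
        follow_up = FOLLOW_UP.search(step)
        if follow_up is None:
            break
        # Intercept: the external tool, not the model, supplies this answer.
        answer = search(follow_up.group(1).strip())
        transcript += f"\nIntermediate answer: {answer}\n"
    raise RuntimeError("no final answer within max_hops")
```

The `max_hops` cap is a design choice rather than part of the method: it keeps a model that never emits a final answer from looping indefinitely.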

Why a chain beats a megaprompt

Cramming the whole task into one prompt asks the model to plan, retrieve, compute, and format in a single generation, with no checkpoint in between. Decomposition buys four concrete things.

Error isolation. When the output is wrong you can see which step failed. A megaprompt gives you one opaque blob; a chain gives you a stack trace.
Per-step validation. Each intermediate output has a narrow, checkable contract (a number, a JSON object, a yes/no). You can assert, regex, parse, or schema-validate between steps and retry just the failing one.
Cost and model routing. Cheap, mechanical steps (classification, extraction, formatting) go to a small fast model; only the genuinely hard reasoning step pays for the frontier model. One megaprompt forces every token through the most expensive model you might need for any part.
Better compositional reasoning. This is the empirical result behind all three papers: keeping each step within the difficulty the model handles reliably, and feeding clean intermediate answers forward, narrows the gap that a single pass leaves open.
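Per-step validation and model routing fit together naturally in code. The sketch below is one possible shape under stated assumptions: `small_llm` and `large_llm` are hypothetical callables, the ticket-triage task and its label set are invented for illustration, and validators signal a broken contract by raising `ValueError`.

```python
import json
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def validated_step(
    prompt: str,
    llm: LLM,
    validate: Callable[[str], object],
    max_attempts: int = 3,
) -> object:
    """Call `llm`, check the output contract, and retry only this step."""
    last_error = None
    for _ in range(max_attempts):
        raw = llm(prompt)
        try:
            return validate(raw)
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"step failed after {max_attempts} attempts: {last_error}")


def parse_label(raw: str) -> str:
    label = raw.strip().lower()
    if label not in {"bug", "feature", "question"}:
        raise ValueError(f"unexpected label: {raw!r}")
    return label


def parse_plan(raw: str) -> dict:
    plan = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(plan, dict) or "steps" not in plan:
        raise ValueError("plan must be a JSON object with a 'steps' key")
    return plan


def triage(ticket: str, small_llm: LLM, large_llm: LLM) -> dict:
    # Cheap classification goes to the small model; planning to the large one.
    label = validated_step(
        f"Classify as bug/feature/question:\n{ticket}", small_llm, parse_label
    )
    plan = validated_step(
        f"Write a JSON plan for this {label}:\n{ticket}", large_llm, parse_plan
    )
    return {"label": label, "plan": plan}
```

A failed classification costs one extra call to the small model and never touches the large one, which is the error-isolation and cost argument in miniature.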

Static chain versus agentic loop

There are two shapes a decomposition can take, and conflating them is a common mistake.

A static chain is a fixed directed graph you design: step A always feeds step B feeds step C. You wrote the control flow; the model only fills in the content of each node. Anthropic's engineering write-up calls this the prompt chaining workflow and is blunt about when to use it: when the task decomposes cleanly into fixed subtasks known in advance. Static chains are predictable, cheap to evaluate, and easy to debug precisely because the path never changes.

An agentic loop hands the model the wheel: it decides the next step at runtime based on what it just observed, and decides when it is done. Same underlying idea (decompose, solve, compose) but the graph is discovered, not authored. You reach for this only when the sequence of steps genuinely cannot be known ahead of time. The trade is real: you gain flexibility and pay with unpredictability, harder evaluation, and a wider failure surface. Most tasks labelled "agent" are static chains in disguise and are more reliable built as one. See the agents domain for where the loop earns its cost.

The decision rule: if you can draw the DAG before running anything, build the static chain. Only when the branching depends on intermediate results you cannot anticipate does the agentic loop pay for itself.
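The difference between the two shapes is easiest to see side by side. In this minimal sketch, `choose` is a hypothetical stand-in for the model call that picks the next action, and the string tools are toy placeholders for real steps.

```python
from typing import Callable, Dict, List

Step = Callable[[str], str]


def static_chain(text: str, steps: List[Step]) -> str:
    """Authored control flow: the same steps in the same order, every run."""
    for step in steps:
        text = step(text)
    return text


def agentic_loop(
    goal: str,
    choose: Callable[[str], str],
    tools: Dict[str, Step],
    max_turns: int = 8,
) -> str:
    """Discovered control flow: `choose` picks the next tool from what it saw."""
    observation = goal
    for _ in range(max_turns):
        action = choose(observation)  # stands in for a model call
        if action == "done":
            return observation
        observation = tools[action](observation)
    raise RuntimeError("turn budget exhausted without finishing")
```

In `static_chain` the path is visible in the argument list before anything runs. In `agentic_loop` the path exists only after the fact, which is why it needs a turn budget and why it is harder to evaluate.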

When it falls down

Errors compound across steps. A chain's end-to-end reliability is the product of its per-step reliabilities. If steps fail independently, ten steps at 95% each is roughly 60% end-to-end. Long chains need validation gates between steps, not just longer prompts; without them, one early mistake poisons everything downstream.
Latency and cost multiply with length. Each step is a serial round-trip. A five-step chain is at least five sequential model calls, so time-to-answer and token spend scale with chain length. Parallelise independent branches; do not serialise steps that do not depend on each other.
Orchestration and state-passing complexity. You now own the plumbing: what each step outputs, how it is parsed, what happens on a malformed intermediate result, how state threads through. This is real software with real bugs, and it lives outside the model where prompt tweaks cannot reach it.
Some tasks should not be split. Decomposition assumes subproblems are separable. Tasks that need joint reasoning over the whole context (holistic tone judgements, tightly coupled constraint satisfaction, anything where the parts only make sense together) lose information at every cut. Forcing a decomposition there discards the cross-step context the task actually depends on, and a single well-constructed prompt beats the pipeline.
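Two of the points above lend themselves to a quick sketch: the compounding arithmetic, and running independent branches concurrently instead of serially. The reliability function assumes steps fail independently, and `fake_call` is a hypothetical stand-in for one model round-trip.

```python
import asyncio
import math
from typing import Awaitable, Callable, List


def chain_reliability(step_success: List[float]) -> float:
    """End-to-end success rate, assuming steps fail independently."""
    return math.prod(step_success)


async def run_parallel(branches: List[Callable[[], Awaitable[str]]]) -> List[str]:
    """Run independent branches concurrently; results keep the input order."""
    return list(await asyncio.gather(*(branch() for branch in branches)))


async def demo() -> List[str]:
    async def fake_call(name: str) -> str:
        await asyncio.sleep(0.01)  # stands in for one model round-trip
        return f"{name} done"

    return await run_parallel(
        [lambda: fake_call("summary"), lambda: fake_call("entities")]
    )


print(round(chain_reliability([0.95] * 10), 3))  # prints: 0.599
print(asyncio.run(demo()))  # prints: ['summary done', 'entities done']
```

With concurrent branches, wall-clock time is set by the slowest branch rather than the sum of all of them, although token spend is unchanged.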
