Multi-Agent Orchestration Patterns: When Coordination Beats One Agent, and When It Just Multiplies Cost

In June 2025, Anthropic published the internals of the system behind its Research feature and included a number that reframed the multi-agent debate: an orchestrator-with-subagents design (Claude Opus 4 as lead, Claude Sonnet 4 as subagents) outperformed a single Claude Opus 4 agent by 90.2% on an internal research eval, while multi-agent runs consumed roughly 15 times the tokens of an ordinary chat (Anthropic, 2025, How we built our multi-agent research system). Both halves of that sentence are the point. Multi-agent orchestration is not free intelligence. It is a trade: you spend tokens, latency, and coordination risk, and sometimes you get back accuracy or coverage that one agent could never reach, and sometimes you get back nothing but the bill.

Why this matters: Most teams reach for "let's add another agent" the way they once reached for "let's add another microservice," assuming decomposition is always virtuous. It is not. The patterns below decide whether a second agent is a force multiplier or a tax, and the difference is usually visible before you write a line of code.

TL;DR

Four orchestration patterns cover almost everything in production: sequential pipeline, supervisor-worker (orchestrator dispatches specialists), debate / ensemble (agents critique or vote), and reflexive loop (an agent grades and retries its own work).
Multi-agent systems consume roughly 15x the tokens of a single chat in Anthropic's measurements; a lone agent already consumes about 4x. Coordination has to earn that 4x-to-15x jump.
Multiple agents help most on breadth-first tasks (many independent subproblems, parallel search) and on verification (catching errors a single pass misses). They help least when the task is one tightly coupled chain of reasoning that cannot be cleanly split.
A Berkeley-led study found that across seven popular frameworks and 200+ tasks, most multi-agent failures are not model failures but coordination failures: bad specifications, agents talking past each other, and missing verification (Cemri et al., 2025, Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657).
Debate and sampling-with-voting reliably lift reasoning accuracy, but with sharply diminishing returns per added agent (Du et al., 2023, arXiv:2305.14325; Li et al., 2024, arXiv:2402.05120).
The strongest engineering principle to come out of 2025 is narrow: keep writes single-threaded and let extra agents add intelligence, not actions (Yan, 2025, Don't Build Multi-Agents, Cognition).

At a Glance

A single coordinator receives a task and chooses how much machinery to spend on it. The choice of pattern is really a choice about how the work decomposes.

flowchart LR
  T["User task"] --> R{"Decomposable?"}
  R -->|"One chain"| S["Single agent"]
  R -->|"Stages"| P["Sequential pipeline"]
  R -->|"Parallel parts"| H["Supervisor + workers"]
  R -->|"Needs checking"| D["Debate or reflexive loop"]
  S --> O["Answer"]
  P --> O
  H --> O
  D --> O
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  class T blue
  class R,P,H,D purple
  class S,O teal

The rest of this article walks each branch: where it came from, how it works, what it costs in real numbers, where it breaks, and how to tell which one your task actually wants.

[IMAGE: A 2x2 decision matrix, axes "task coupling" (loose to tight) and "value per query" (low to high), with the four patterns plotted as labeled regions and "single agent" filling the tight/low quadrant]

Before Agents Talked to Each Other

The idea that several specialized processes should pool partial results on a shared structure is older than the Transformer by four decades. The HEARSAY-II speech understanding system, built between 1971 and 1976, introduced the blackboard: a shared workspace where independent knowledge sources (acoustic, phonetic, syntactic) posted competing hypotheses and opportunistically refined each other's guesses (Erman et al., 1980, The Hearsay-II Speech-Understanding System, ACM Computing Surveys). Barbara Hayes-Roth formalized the control problem in 1985, separating the question of what knowledge to apply from when to apply it (Hayes-Roth, 1985, A Blackboard Architecture for Control, Artificial Intelligence).

The blackboard solved a real problem (fusing heterogeneous, uncertain evidence) and exposed the permanent one: as the hypothesis space grew, coordination cost exploded faster than the answers improved. Every multi-agent system since has lived inside that same tension.

LLMs revived the idea because they made each "knowledge source" a general reasoner instead of a hand-built module. The modern lineage is compressed into a few years.

timeline
  title From blackboards to orchestrators
  1976 : HEARSAY-II blackboard fuses speech hypotheses
  1985 : Hayes-Roth formalizes blackboard control
  2023 : Multiagent debate improves factuality (Du et al.)
  2023 : AutoGen, MetaGPT, ChatDev frameworks land
  2024 : Sampling-and-voting scales accuracy (Li et al.)
  2025 : Failure taxonomy and the single-writer pushback

[IMAGE: A labeled schematic of the classic blackboard architecture, showing independent knowledge sources posting and reading partial hypotheses on a shared central panel, with a control component scheduling which source acts next]

By mid-2023 the framework wave had arrived. AutoGen framed everything as conversations between configurable agents (Wu et al., 2023, arXiv:2308.08155). MetaGPT and ChatDev embedded a software company's roles (product manager, engineer, reviewer) directly into the agent graph, with MetaGPT reporting strong pass@1 on code-generation benchmarks (Hong et al., 2023, MetaGPT, arXiv:2308.00352; Qian et al., 2023, ChatDev, arXiv:2307.07924). The patterns crystallized; the skepticism followed two years later, once teams had run the bills.

How the Four Patterns Actually Work

Strip away the frameworks and almost every production system is one of four shapes, sometimes nested. The useful axis is not "how many agents" but "how does information flow between them."

Sequential pipeline

The simplest composition: agent A's output is agent B's input, in a fixed order. A drafting agent writes, an editing agent revises, a formatting agent finalizes. Each stage sees the previous stage's result and nothing more.

This exists because some tasks genuinely have stages with different objectives, and giving each stage its own prompt and (optionally) its own model keeps each focused. The tradeoff is brittleness: an error introduced at stage one propagates downstream with no mechanism to catch it, and total latency is the sum of every stage. A pipeline is cheap to reason about and expensive to recover when one link is wrong.
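A minimal sketch of the composition described above, assuming each stage is a plain function from text to text. The stage names `draft`, `edit`, and `fmt` are hypothetical stand-ins for model calls, not part of any framework.

```python
from typing import Callable, List

Stage = Callable[[str], str]

def run_pipeline(stages: List[Stage], task: str) -> str:
    """Feed each stage's output to the next; a stage sees nothing else."""
    result = task
    for stage in stages:
        result = stage(result)
    return result

# Hypothetical stages; real ones would each wrap a prompt and a model call.
def draft(text: str) -> str:
    return f"draft({text})"

def edit(text: str) -> str:
    return f"edit({text})"

def fmt(text: str) -> str:
    return f"format({text})"

print(run_pipeline([draft, edit, fmt], "topic"))

# Error propagation: a wrong stage-one output is carried through untouched,
# because no later stage has a way to reject it.
print(run_pipeline([lambda _: "WRONG", edit, fmt], "topic"))
```

The loop has no branch for "stage two rejects stage one's output", which is the brittleness in code form. Recovery requires adding a check between stages.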

Supervisor-worker (orchestrator dispatch)

A capable coordinator decomposes the task, spawns specialist workers, hands each a scoped subtask, and synthesizes their returns. This is the pattern behind Anthropic's Research system: a lead agent records a plan, then spins up subagents that each chase one independent thread inside their own context window, with a final pass that stitches and cites (Anthropic, 2025).

flowchart TD
  U["User query"] --> L["Lead agent plans"]
  L --> W1["Worker: source A"]
  L --> W2["Worker: source B"]
  L --> W3["Worker: source C"]
  W1 --> M["Synthesizer merges"]
  W2 --> M
  W3 --> M
  M --> C["Citation pass"]
  C --> A["Final answer"]
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  class U blue
  class L,W1,W2,W3,M,C purple
  class A teal

Why it works: the workers run in parallel, so wall-clock latency stays near the slowest single worker rather than the sum, and each worker gets a clean context dedicated to one thread, sidestepping the dilution that hurts a single agent juggling ten subgoals at once. The cost is the token multiplier (every worker is a full inference loop, plus the orchestrator's planning and synthesis tokens) and a hard coordination problem: the lead must write subtask instructions precise enough that workers do not overlap or leave gaps. Anthropic reported that vague delegation (an instruction as short as "research the semiconductor shortage") led subagents to duplicate each other's work; a scoped instruction, for example "find 2024 lead times for automotive-grade MCUs", leaves far less room for overlap.
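The fan-out and the latency claim can be sketched with `asyncio`. This is an illustration under stated assumptions, not Anthropic's implementation: `worker` and `lead_agent` are hypothetical names, and `asyncio.sleep` stands in for a worker's inference time.

```python
import asyncio
import time
from typing import List

WORKER_SECONDS = 0.2  # assumed stand-in for one worker's inference latency

async def worker(subtask: str) -> str:
    """Hypothetical worker; a real one runs its own LLM loop in a fresh context."""
    await asyncio.sleep(WORKER_SECONDS)
    return f"findings for: {subtask}"

async def lead_agent(query: str, subtasks: List[str]) -> str:
    # Fan out: all workers start together, each seeing only its scoped subtask.
    findings = await asyncio.gather(*(worker(s) for s in subtasks))
    # Synthesis is plain joining here; a real lead would spend another
    # model call merging and citing.
    return "\n".join([f"query: {query}", *findings])

subtasks = [
    "Provider A EU data-residency terms",
    "Provider B EU data-residency terms",
    "Provider C EU data-residency terms",
]
start = time.perf_counter()
answer = asyncio.run(lead_agent("Compare EU data residency", subtasks))
elapsed = time.perf_counter() - start

print(answer)
print(f"elapsed {elapsed:.2f}s; run one after another: about "
      f"{WORKER_SECONDS * len(subtasks):.2f}s")
```

`asyncio.gather` returns results in the order the workers were passed in, so the synthesizer can rely on position. What the sketch cannot show is the hard part: writing subtask strings that neither overlap nor leave gaps.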

[IMAGE: A before/after comparison of a single agent's context window juggling ten subgoals (crowded, diluted) versus three worker windows each holding one scoped subgoal (clean), annotated to show where attention is spent]

A common refinement uses an expensive model for the orchestrator and cheaper, specialized models for the workers, which can cut total cost substantially versus running the strong model everywhere.

Debate and ensemble

Instead of dividing the task, you replicate it. Several agents answer the same question, then either critique each other across rounds (debate) or vote (ensemble). Du et al. showed that multiple model instances proposing and then revising answers over a few rounds measurably improved math, strategic reasoning, and factual accuracy, because agents catch each other's mistakes (Du et al., 2023, arXiv:2305.14325). Li et al. showed the cheaper cousin: simply sampling many independent answers and majority-voting scales accuracy with the number of samples, to the point where a 13B model with enough agents matched a 70B model with few (Li et al., 2024, More Agents Is All You Need, arXiv:2402.05120).

sequenceDiagram
  participant Q as Question
  participant A1 as Agent 1
  participant A2 as Agent 2
  participant A3 as Agent 3
  participant J as Aggregator
  Q->>A1: Solve independently
  Q->>A2: Solve independently
  Q->>A3: Solve independently
  A1->>J: Answer + reasoning
  A2->>J: Answer + reasoning
  A3->>J: Answer + reasoning
  J->>A1: Here are the others. Revise.
  J->>A2: Here are the others. Revise.
  J->>A3: Here are the others. Revise.
  Note over A1,A3: Repeat 2 to 3 rounds
  A1->>J: Final
  A2->>J: Final
  A3->>J: Final
  J->>Q: Consensus answer

The mechanism is error decorrelation. If three agents fail independently and each is right more often than wrong, a wrong majority is rarer than a wrong answer from any single agent. The cost is multiplicative and unforgiving: a 3-agent, 3-round debate is on the order of nine full inferences plus aggregation, and the accuracy curve flattens fast, so the fourth and fifth agent buy far less than the second.
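The decorrelation argument can be made precise under an idealized assumption that is ours, not the cited papers': each of three agents errs independently with the same probability p. A majority is wrong when at least two agents are wrong:

```latex
\[
P(\text{majority wrong}) = 3p^{2}(1-p) + p^{3} = 3p^{2} - 2p^{3}
\]
\[
3p^{2} - 2p^{3} < p
\;\Longleftrightarrow\; 2p^{2} - 3p + 1 > 0
\;\Longleftrightarrow\; (2p - 1)(p - 1) > 0
\;\Longleftrightarrow\; p < \tfrac{1}{2}
\qquad (0 < p < 1)
\]
```

With p = 0.2, the majority errs with probability 3(0.04) - 2(0.008) = 0.104, roughly half the single-agent rate. The independence assumption is the fragile part: agents sampled from the same model share blind spots, so real gains tend to be smaller than this bound suggests.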

Reflexive loop

One agent (or a paired critic) evaluates its own output against a goal and retries until it passes a check or hits a budget. This is the verification pattern: generate, grade, refine.

stateDiagram-v2
  [*] --> Generate
  Generate --> Evaluate
  Evaluate --> Done: passes check
  Evaluate --> Refine: fails check
  Refine --> Generate
  Generate --> Abort: budget exhausted
  Done --> [*]
  Abort --> [*]

It shines when correctness is checkable: code that must compile and pass tests, JSON that must validate, a claim that can be looked up. The danger is the loop with no real oracle. If the evaluator is just the same model grading itself with no external signal, it can confidently approve a wrong answer, and you have paid for extra rounds that converge on confident nonsense.
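A minimal sketch of generate, grade, refine with a hard attempt budget and an external oracle. Here the oracle is Python's JSON parser plus a required-key check. `fake_generator` is a hypothetical stand-in for a model call, and the `provider` key is an invented schema used only for illustration.

```python
import json
from typing import Callable, Optional

def reflexive_loop(
    generate: Callable[[str, Optional[str]], str],
    check: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Retry until check() returns None (pass) or the budget runs out."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(task, feedback)
        feedback = check(candidate)
        if feedback is None:
            return candidate
    return None  # budget exhausted: abort rather than loop forever

def json_oracle(candidate: str) -> Optional[str]:
    """External check: the parser, not the model, decides validity."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict) or "provider" not in data:
        return "missing required key 'provider'"
    return None

def fake_generator(task: str, feedback: Optional[str]) -> str:
    # Hypothetical model: the first attempt is malformed, the retry is fixed.
    if feedback is None:
        return "{provider: 'A'}"
    return '{"provider": "A"}'

result = reflexive_loop(fake_generator, json_oracle, "emit a provider record")
print(result)
```

Swapping `json_oracle` for a second call to the same model would keep the loop's shape and remove its value, since nothing would then anchor the pass/fail signal outside the generator.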

[IMAGE: Side-by-side trace snippets rendered as a figure, left "pipeline" showing an error at stage 1 propagating to a wrong final output, right "reflexive loop" showing the same error caught at the evaluate step and corrected on retry]

By the Numbers

The economics are the part most architecture diagrams omit. Anthropic's published figures are the cleanest public anchor: single agents use roughly 4x the tokens of a chat interaction, and multi-agent systems roughly 15x (Anthropic, 2025). Those are Anthropic's internal measurements, not a universal constant, so treat them as an order-of-magnitude guide rather than a figure to budget against.

The table below summarizes the cost-shape of each pattern. Token and latency columns are relative multipliers against a single-pass agent and are approximate, meant to convey shape rather than guarantee a figure for your workload.

Pattern	Token cost (approx.)	Latency shape	Parallelizable	Best leverage
Single agent	1x	One pass	n/a	Tightly coupled reasoning
Sequential pipeline	sum of stages	Additive	No	Distinct staged objectives
Supervisor-worker	high (orchestrator + N workers)	Near slowest worker	Yes	Breadth-first, independent subtasks
Debate / ensemble	N agents x R rounds	R rounds in sequence; agents parallel within each	Yes (within a round)	Verifiable reasoning, error decorrelation
Reflexive loop	1x to Kx (K = retries)	Additive per retry	No	Checkable outputs (code, schemas)

The accuracy side has real anchors too. A March 2026 benchmark ran four orchestration architectures (sequential, parallel fan-out, hierarchical supervisor-worker, and reflexive self-correcting) over 10,000 SEC filings across 25 extraction field types. It reported that the reflexive architecture reached the highest field-level F1 at about 0.943 but at roughly 2.3x the cost of the sequential baseline, while the hierarchical supervisor-worker pattern sat at the most balanced cost-accuracy point (Benchmarking Multi-Agent LLM Architectures for Financial Document Processing, 2026, arXiv:2603.22651). Read those numbers as the paper's results on its corpus, not as a law: the ranking of patterns is task-dependent, and a different document type can reshuffle it.

Architecture (financial-doc benchmark)	Field-level F1	Relative cost	Position
Sequential pipeline	baseline	1.0x	Cheapest, lowest accuracy
Hierarchical supervisor-worker	high	moderate	Most balanced
Reflexive self-correcting	~0.943	~2.3x	Highest accuracy, highest cost

The debate and voting literature adds the diminishing-returns caveat: gains rise with more agents and more rounds, but the marginal lift per added agent shrinks, which is exactly why "more agents" is a knob with a sweet spot rather than a slider you push to the right (Du et al., 2023, arXiv:2305.14325; Li et al., 2024, arXiv:2402.05120).
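The shape of that curve can be reproduced with a toy model. Assume, purely for illustration, that each voter is independently correct with probability 0.6; this assumption is ours and is not a result from the cited papers. Majority-vote accuracy is then a binomial tail, and the lift from each additional pair of voters shrinks while cost grows linearly.

```python
from math import comb

def majority_accuracy(p_correct: float, n_agents: int) -> float:
    """P(a strict majority of n independent voters is correct); use odd n."""
    need = n_agents // 2 + 1
    return sum(
        comb(n_agents, k) * p_correct**k * (1 - p_correct) ** (n_agents - k)
        for k in range(need, n_agents + 1)
    )

P_SINGLE = 0.6  # assumed per-agent accuracy, chosen only for illustration

lifts = []
prev = majority_accuracy(P_SINGLE, 1)
for n in (3, 5, 7, 9, 11):
    acc = majority_accuracy(P_SINGLE, n)
    lifts.append(acc - prev)
    print(f"{n:>2} agents: accuracy {acc:.3f}, lift {acc - prev:+.3f}, cost {n}x")
    prev = acc
```

Real samples drawn from one model are correlated, so this is an optimistic ceiling. The concavity is the point, not the exact values.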

[IMAGE: A line plot of accuracy versus number of agents for sampling-and-voting, annotated to show the steep early rise and the flattening tail past roughly 5 to 10 agents, with a second axis showing linear cost growth]

A Concrete Example

Make it tangible with one query: "Compare the data-residency commitments of the three largest cloud providers for EU customers as of this quarter, with sources."

A single agent would interleave three searches, lose context as each result crowds the window, and tend to over-weight whichever provider it researched last. Suppose it spends about 8,000 tokens and returns a serviceable but uneven answer, strong on one provider, thin on the others.

Route the same query through a supervisor-worker system and the accounting changes:

Step	Agent	Tokens (approx.)	Output
Plan	Lead	1,500	Three scoped subtasks, one per provider
Research A	Worker 1	6,000	Provider A residency terms + 4 source URLs
Research B	Worker 2	6,000	Provider B residency terms + 3 source URLs
Research C	Worker 3	6,500	Provider C residency terms + 5 source URLs
Synthesize	Lead	4,000	Merged comparison table
Cite	Citation pass	2,000	Claims mapped to URLs
Total		~26,000	Balanced, sourced answer

The system spent roughly 3x the tokens of the single agent (about 26,000 against 8,000). Each worker held one provider in a clean context, so coverage is even and citations are attached to specific claims. The three research steps ran in parallel, so wall-clock time is close to one 6,500-token worker plus the planning, synthesis, and citation steps, not the sum of all six.

Was it worth 3x? For a one-off curiosity, no: the single agent's answer was good enough. For a compliance memo a customer will act on, the even coverage and per-claim sourcing are the whole value, and 26,000 tokens is a rounding error against an analyst's hour. The pattern did not become correct or incorrect in the abstract. The task's value per query moved it across the line. That is the real decision variable, and it lives outside the architecture.
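The accounting in this example is simple enough to script. The sketch below reuses the table's token counts and, as a stated simplification, treats tokens as a proxy for time, so the "critical path" is the serial steps plus the slowest parallel worker.

```python
# (step, approximate tokens, runs in parallel with its siblings?)
steps = [
    ("Plan", 1_500, False),
    ("Research A", 6_000, True),
    ("Research B", 6_000, True),
    ("Research C", 6_500, True),
    ("Synthesize", 4_000, False),
    ("Cite", 2_000, False),
]
SINGLE_AGENT_TOKENS = 8_000  # the single-agent estimate from the example

total = sum(tokens for _, tokens, _ in steps)
serial = sum(tokens for _, tokens, parallel in steps if not parallel)
slowest_worker = max(tokens for _, tokens, parallel in steps if parallel)
critical_path = serial + slowest_worker  # tokens on the wall-clock path

print(f"total tokens:        {total}")
print(f"multiple of single:  {total / SINGLE_AGENT_TOKENS:.2f}x")
print(f"critical-path tokens: {critical_path}")
```

The bill is driven by the total, while the wait is driven by the critical path. Keeping the two numbers separate is what makes the parallel design's trade visible.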

Where It Breaks

The failures are rarely about a weak model. A Berkeley-led team hand-analyzed 200+ traces from seven popular multi-agent frameworks and built MAST, a taxonomy of 14 failure modes in three families: specification problems, inter-agent misalignment, and task-verification gaps (Cemri et al., 2025, Why Do Multi-Agent LLM Systems Fail?, arXiv:2503.13657). The headline is uncomfortable: a large share of failures are coordination failures, and the paper found that on several benchmarks multi-agent systems did not clearly beat a well-built single agent. Adding agents added surface area for things to go wrong.

The recurring breakages:

Context fragmentation. When you split a task across agents, each one loses the others' intermediate decisions and assumptions. Two workers can make locally reasonable but mutually contradictory choices, and the synthesizer inherits the conflict. This is the core of Cognition's argument that breaking context across agents is fundamentally risky (Yan, 2025, Don't Build Multi-Agents).

Conflicting writes. If two agents can both take actions that mutate shared state (edit the same file, update the same record), they race and corrupt each other. Cognition's hard-won principle: keep the writer single-threaded. Extra agents can read, analyze, and advise, but one actor commits.
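One way to sketch the single-writer rule, assuming advisors communicate through a queue: advisory agents only enqueue proposals, and exactly one coroutine applies them to shared state. The names `advisor`, `writer`, and `config.yaml` are hypothetical, and this is an illustration of the principle rather than Cognition's implementation.

```python
import asyncio
from typing import Dict, List

async def advisor(name: str, proposals: asyncio.Queue) -> None:
    """Read-only agent: it proposes a change but never touches shared state."""
    await proposals.put((name, "config.yaml", f"suggestion from {name}"))

async def writer(
    proposals: asyncio.Queue, state: Dict[str, List[str]], expected: int
) -> None:
    """The only coroutine allowed to mutate state; commits one at a time."""
    for _ in range(expected):
        agent, path, change = await proposals.get()
        state.setdefault(path, []).append(f"{agent}: {change}")

async def main() -> Dict[str, List[str]]:
    proposals: asyncio.Queue = asyncio.Queue()
    state: Dict[str, List[str]] = {}
    agents = ["analyst", "reviewer", "tester"]
    await asyncio.gather(
        writer(proposals, state, expected=len(agents)),
        *(advisor(a, proposals) for a in agents),
    )
    return state

state = asyncio.run(main())
print(state)
```

Because every mutation passes through `writer`, conflicting proposals meet in one place where they can be ordered, merged, or rejected. A production writer would do that reconciliation instead of simply appending.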

Unbounded loops and cost blowups. A reflexive loop without a strict budget, or a debate without a stop rule, can spiral. Because cost is multiplicative, a misconfigured 5-agent, 5-round debate is 25 inferences for a question one pass would have answered.

Evaluator blindness. Self-grading with no external oracle approves wrong answers. Verification only adds value when the check is grounded in something the generator cannot fake: a compiler, a test suite, a retrieval lookup, a human.

Error propagation in pipelines. Fixed sequential stages have no recovery path; stage one's mistake is stage three's foundation.

flowchart TD
  F["Multi-agent failure"] --> S["Specification: vague subtasks, overlap, gaps"]
  F --> I["Misalignment: agents contradict each other"]
  F --> V["Verification: no real checker, errors slip through"]
  S --> R["Wasted or redundant work"]
  I --> R
  V --> R
  classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  class F rose
  class S,I,V amber
  class R rose

[IMAGE: An annotated heatmap of the 14 MAST failure modes grouped by the three families, shaded by reported frequency, with the most common modes highlighted]

Alternative Designs

The honest framing is that "single agent" is itself one of the competing designs, and often the right one. The strong single-agent camp argues that a capable model with good tools and careful context engineering beats a fragile committee on most tasks, and that the engineering effort is better spent on the one agent's context than on inter-agent protocols.

Approach	Strengths	Weaknesses	Best when
Single agent + tools	Simple, no coordination risk, cheapest	Limited parallelism, context dilution on broad tasks	Tightly coupled reasoning, single-thread of work
Sequential pipeline	Clear stages, easy to debug	Error propagation, additive latency	Genuinely staged tasks with distinct objectives
Supervisor-worker	Parallel breadth, clean per-worker context	Token multiplier, hard delegation	Breadth-first search, independent subtasks
Debate / ensemble	Higher accuracy via error decorrelation	Multiplicative cost, diminishing returns	Verifiable reasoning where being right matters a lot
Reflexive loop	Catches checkable errors	Useless without a real oracle, loop risk	Code, schemas, anything with an automatic check

The camps are not as opposed as the "agent architecture wars" framing suggests. Anthropic's case for multi-agent is explicitly scoped to breadth-first research where subproblems are independent and the information exceeds one context window. Cognition's case against is scoped to long, stateful tasks where actions must stay coherent. Read together, they agree: parallelize reading and reasoning, serialize writing and acting.

How It Is Used in Practice

Production deployments cluster around the patterns' natural strengths. Deep-research products (Anthropic's Research, and similar features elsewhere) use supervisor-worker because the task is breadth-first by nature, and the economics only close on high-value queries: legal due diligence, competitive intelligence, literature review, where an analyst-hour dwarfs the token bill. Coding agents lean on reflexive loops because compilation and tests are free, trustworthy oracles. Document-extraction pipelines, like the SEC-filing benchmark above, mix hierarchical dispatch with a reflexive check per field.

[IMAGE: A stacked bar chart of token spend per agent in a real research run, orchestrator plus several workers plus synthesis and citation passes, annotated with the cumulative multiple over a single chat]

The operational lessons are consistent across reports. Make subtask specifications explicit, because vague delegation is the top source of wasted work. Cap every loop and debate with a hard budget. Keep one writer. Instrument token spend per agent, since a multi-agent system fails quietly by costing 5x more than expected long before it fails loudly. And measure against a strong single-agent baseline before committing to the orchestration, because the Berkeley study's most actionable finding is that many multi-agent systems never actually clear that bar.
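Two of these lessons, the hard budget and per-agent instrumentation, fit in one small object. The sketch below uses a hypothetical `TokenLedger` class; the 30,000-token cap and the agent names are illustrative values, not recommendations.

```python
from collections import defaultdict
from typing import DefaultDict

class BudgetExceeded(RuntimeError):
    """Raised when a charge would push the run past its token cap."""

class TokenLedger:
    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.by_agent: DefaultDict[str, int] = defaultdict(int)

    @property
    def spent(self) -> int:
        return sum(self.by_agent.values())

    def charge(self, agent: str, tokens: int) -> None:
        # Refuse the charge before recording it, so the cap is never crossed.
        if self.spent + tokens > self.budget:
            raise BudgetExceeded(
                f"{agent} wants {tokens} tokens; {self.budget - self.spent} left"
            )
        self.by_agent[agent] += tokens

ledger = TokenLedger(budget=30_000)
ledger.charge("lead", 1_500)
ledger.charge("worker-1", 6_000)
ledger.charge("worker-2", 6_000)
print(dict(ledger.by_agent), ledger.spent)

try:
    ledger.charge("worker-3", 20_000)  # would exceed the cap
except BudgetExceeded as exc:
    print("stopped:", exc)
```

Checking the cap before each call turns a quiet overspend into a loud, attributable failure, and the per-agent breakdown shows which role to re-scope.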

[IMAGE: A schematic of a production research stack, showing the orchestrator, a pool of parallel workers each with isolated context windows, a shared read-only memory, and a single serialized write path to the final artifact]

Insights Worth Remembering

The real decision variable is not "how many agents" but "how does the work decompose and how much is a correct answer worth." Architecture follows from those two, not the other way around.
Parallelize reading and reasoning; serialize writing and acting. Almost every durable principle from 2025 reduces to this one line.
A multi-agent system's most likely failure is not a dumb model but a coordination gap: a vague subtask, two agents contradicting each other, or a missing checker.
Verification only adds value when the verifier knows something the generator cannot fake. A model grading its own homework with no external signal is theater you pay for.
Debate and voting work, but the accuracy curve is concave: the second agent earns its keep, the tenth rarely does.
"Add another agent" has the same seductive wrongness as "add another microservice." Decomposition is a cost you should be paid back for, not a virtue.
Token accounting belongs in the design review. A pattern that is 3x more expensive is fine for a compliance memo and absurd for a search box, and nothing in the diagram tells you which you are building.

Open Questions

How to verify without a ground-truth oracle is the live problem. Reflexive loops and debate both lean on checking, yet for open-ended generation there is often no compiler to call. Whether learned critics or cross-model verification can substitute reliably is unresolved; today the evidence supports verification only where a real check exists.

The right granularity of decomposition is also open. Too coarse and agents do a single agent's job with overhead; too fine and coordination cost swamps the work. Current practice tunes this by hand, and an open question is whether an orchestrator can learn task-appropriate granularity rather than having it prompted in.

Communication efficiency is an active research thread: sparser communication topologies between debating agents appear to preserve most of the accuracy gain at lower cost (Li et al., 2024, Improving Multi-Agent Debate with Sparse Communication Topology, arXiv:2406.11776), which hints that today's fully-connected debates overspend. How far that compresses is not yet settled.

Finally, the field still lacks a standard, contamination-resistant benchmark for orchestration patterns specifically. Most numbers come from task-specific studies whose rankings do not always transfer, so claims that one pattern dominates should be read as "on this corpus," and treated as a hypothesis about yours.