Dynamic Workflows: When the Agent Writes Its Own Orchestration
June 05, 2026 · 25 min read
There is a quiet line in Anthropic's December 2024 field guide that most readers skim past. A workflow, the authors write, is a system where "LLMs and tools are orchestrated through predefined code paths," while an agent is one where the model "dynamically directs its own processes" (Anthropic, 2024, Building Effective Agents). For about eighteen months that sentence drew a hard border. You either hand-drew the control flow in advance and accepted its rigidity, or you handed the whole task to a model loop and accepted its drift. Dynamic workflows erase the border by doing something neither camp tried: they let the model write the predefined code path, at runtime, for the task in front of it.
The harness is generated, not hand-written. That is the whole idea, and it is more consequential than it sounds.
Why this matters: Every multi-agent system you have seen is a tradeoff between control and flexibility. A static graph gives you reproducibility and loses adaptivity; a free-running agent gives you adaptivity and loses reproducibility. A generated harness keeps both, because the structure is fixed once the model commits it to code, but the model gets to choose the structure after it has read the problem.
TL;DR
- A "workflow" is the control flow around model calls: the loops, branches, fan-out, barriers, and verification steps. The interesting question is who writes that control flow and when.
- Static workflows (LangGraph-style graphs) fix the control flow before the task is known. Pure agents (ReAct loops) decide every step at runtime with no fixed structure. Dynamic workflows split the difference: the model emits an orchestration program once, then that program runs deterministically.
- This is CodeAct lifted from the action layer to the orchestration layer. Instead of writing code to call one tool, the model writes code to coordinate many sub-agents.
- The payoff is that cooperation, parallelism, and verification become ordinary programming constructs (`parallel`, `pipeline`, `for`), so a fan-out of fifteen verifiers is one line, not a hand-drawn graph.
- The costs are real: Anthropic measured multi-agent research using roughly 15x the tokens of a chat, and reports that token usage alone explains about 80% of performance variance on one browsing eval (Anthropic, 2025, Multi-Agent Research System).
- Generated harnesses inherit a new failure mode: bugs in the orchestration code itself, not just bad model outputs.
- Verification is the part that makes the pattern trustworthy. Adversarial sub-agents that try to refute a finding, run as code, turn "the model said so" into "three independent skeptics could not break it."
- This is not a coding-only technique. Deep research, evals generation, fact-checking, and data synthesis are all fan-out-then-verify shaped, which is exactly what generated harnesses are good at.
At a Glance
A dynamic workflow has two phases that are easy to conflate and important to separate: an authoring phase where a model writes the orchestration program, and an execution phase where that program runs and spawns sub-agents.
flowchart LR
T[Task] --> O[Orchestrator model]
O -->|writes| P["Orchestration program<br/>(loops, fan-out, verify)"]
P --> R{Runtime}
R -->|spawns| S1[Sub-agent]
R -->|spawns| S2[Sub-agent]
R -->|spawns| S3[Sub-agent]
S1 --> V[Verify and synthesize]
S2 --> V
S3 --> V
V --> Result[Structured result]
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
class T blue
class O,P,S1,S2,S3 purple
class R slate
class V,Result teal
The program is deterministic once written. The intelligence went into composing it.
[IMAGE: A two-panel side-by-side. Left panel labeled "Static workflow" shows a fixed graph drawn at design time with greyed-out unused branches. Right panel labeled "Dynamic workflow" shows a model emitting a small program that then expands into a fan-out at runtime. Annotate the moment of authoring in each.]
Before Generated Harnesses
The orchestration question is older than the current agent wave, but the answers kept oscillating between two poles.
The first pole was the pure reasoning loop. ReAct (Yao et al., 2022, ReAct: Synergizing Reasoning and Acting in Language Models, arXiv:2210.03629) interleaved a thought and an action at every step, letting the model decide what to do next based on what it had just observed. It worked: on the ALFWorld decision benchmark ReAct beat imitation and reinforcement learning baselines by 34 absolute points of success rate. But the structure lived entirely inside the model's head, regenerated token by token, which made it adaptive and unreproducible in equal measure. Reflexion (Shinn et al., 2023, Reflexion, arXiv:2303.11366) added a verbal self-critique loop on top, the first widely cited move toward building verification into the loop rather than bolting it on after.
Then came the autonomous-agent hype of early 2023 (AutoGPT and its imitators), which showed the failure mode of pure loops at full volume: they wandered, repeated themselves, and burned tokens with no convergence guarantee. The reaction was the second pole. Frameworks like LangGraph modeled agent systems as explicit state graphs, often acyclic, drawn by a human engineer before runtime. AutoGen modeled them as scripted conversations between named agents. Both bought back predictability by fixing the structure in advance.
Two research threads quietly undermined the dichotomy. Voyager (Wang et al., 2023, Voyager, arXiv:2305.16291) had an LLM write executable skills, store them in a growing library, and compose them later, unlocking Minecraft tech-tree milestones up to 15.3x faster than prior systems by treating behavior itself as generated code. CodeAct (Wang et al., 2024, arXiv:2402.01030, ICML 2024) showed that letting a model emit a single executable code action instead of a JSON tool call raised success rates by up to 20% across 17 models, because code can express loops and composition that a flat tool call cannot.
The synthesis arrived in 2025. Anthropic's multi-agent research system put an orchestrator model in charge of deciding how many sub-agents to spawn and what each should do, scaling structure to the task. And the November 2025 note on code execution with MCP made the underlying mechanism explicit: agents should write code to drive tools and other agents, keeping intermediate data out of the context window entirely (Anthropic, 2025, Code Execution with MCP).
[IMAGE: A horizontal "authorship locus" diagram tracking where the control flow is written across the four eras: inside the model's head (ReAct), in a human-drawn graph (LangGraph), in a generated skill (Voyager), and in a generated orchestration program (dynamic workflows). One marker sliding along an axis labeled "when is structure decided."]
timeline
title From fixed loops to generated harnesses
2022 : ReAct interleaves reason and act
2023 : Reflexion adds self-critique
: Voyager generates skill code
: AutoGPT exposes loop drift
2024 : LangGraph and AutoGen fix structure in advance
: CodeAct makes the action a program
2025 : Multi-agent orchestrator scales structure to task
: Code execution with MCP keeps data out of context
2026 : Generated harnesses become a working primitive
How Dynamic Workflows Actually Work
The control-flow spectrum
Start by separating two things that usually travel together: the model calls and the control flow connecting them. A retrieval-augmented question is one model call. A deep research report is hundreds of model calls wired together by loops, fan-outs, and merges. That wiring is the workflow, and there are exactly three places to put the authorship.
In a static workflow, a human writes the wiring before the task arrives. You get a graph that runs identically every time, which is wonderful when the task shape is known (a fixed extract-transform-summarize pipeline) and wasteful when it is not. You either over-provision the graph for the hardest case or under-provision it and watch it fail on the long tail.
In a pure agent, no wiring exists ahead of time; the model improvises each step. This is maximally adaptive and minimally inspectable. You cannot diff two runs, you cannot reason about cost in advance, and a single bad turn can derail everything downstream.
A dynamic workflow puts the authorship in the model but moves it out of the per-step loop. The model reads the task, writes an orchestration program, and then steps back while that program executes. Authorship happens once, with full view of the task; execution is ordinary JavaScript or Python with real loops and real concurrency, so the control flow is deterministic even though each sub-agent's output is still a model sample.
flowchart TD
Q[Who writes the control flow?]
Q --> A[Human, before the task]
Q --> B[Model, every step]
Q --> C[Model, once, as code]
A --> A1["Static workflow<br/>reproducible, rigid"]
B --> B1["Pure agent<br/>adaptive, unreproducible"]
C --> C1["Dynamic workflow<br/>adaptive then deterministic"]
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
class Q slate
class A,A1 blue
class B,B1 rose
class C,C1 emerald
Orchestration as code
The mechanism is CodeAct one level up. Where CodeAct has the model emit `results = [search(q) for q in queries]` to drive a tool, a dynamic workflow has the model emit `results = await parallel(queries.map(q => () => agent(researchPrompt(q))))` to drive a fleet of sub-agents. The action space is code; the things the code calls are other models.
Why code rather than a declarative graph spec? Because the constructs an orchestrator actually needs are the constructs of a programming language. Fan-out is `map`. A barrier is `await Promise.all`. A pipeline with no barrier is independent chains. "Keep going until two rounds find nothing new" is a `while` loop with a counter. Accumulating to a target is `while (found.length < 10)`. None of these compress cleanly into a static DAG, and forcing them to is where graph frameworks get baroque. Expressed as code, they are one line each, and the model already writes code fluently.
A minimal vocabulary covers most real harnesses:
- `agent(prompt, opts)` spawns one sub-agent and returns its output. With a schema, the return is validated structured data, not prose to be re-parsed.
- `parallel(thunks)` runs tasks concurrently and waits for all of them. It is a barrier: nothing past it runs until every branch finishes.
- `pipeline(items, ...stages)` runs each item through every stage independently, with no barrier between stages. Item A can be in stage three while item B is still in stage one.
- Plain `for`/`while` give loop-until-dry, loop-to-budget, and retry.
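To make the vocabulary concrete, here is a minimal sketch of the three primitives in Python with `asyncio`. It is an illustration under stated assumptions, not any runtime's actual API: `agent` is a stub that echoes its prompt instead of calling a model, the `opts` and schema validation are left out, and the query strings are made up.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable

async def agent(prompt: str) -> str:
    """Stub sub-agent (assumption): a real one would call a model."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"result for: {prompt}"

async def parallel(thunks: Iterable[Callable[[], Awaitable[Any]]]) -> list:
    """Barrier: start every thunk, return only when all have finished."""
    return await asyncio.gather(*(thunk() for thunk in thunks))

async def pipeline(items: Iterable[Any], *stages: Callable[[Any], Awaitable[Any]]) -> list:
    """No barrier between stages: each item runs its own chain of stages."""
    async def chain(item: Any) -> Any:
        for stage in stages:
            item = await stage(item)
        return item
    return await asyncio.gather(*(chain(item) for item in items))

async def main():
    queries = ["durable execution", "checkpointing"]  # hypothetical inputs

    # Fan-out with a barrier: one thunk per query.
    fanned = await parallel([lambda q=q: agent(q) for q in queries])

    async def find(q: str) -> str:
        return await agent(f"find {q}")

    async def verify(r: str) -> str:
        return await agent(f"verify {r}")

    # Per-item chains: each query flows find -> verify on its own.
    piped = await pipeline(queries, find, verify)
    return fanned, piped

fanned, piped = asyncio.run(main())
```

The only structural difference between the two helpers is where the `gather` sits: `parallel` gathers across tasks within one stage, while `pipeline` gathers across whole chains, which is why no item waits on another between stages.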
Barriers are the subtle part
The single most common orchestration mistake is inserting a barrier that the task does not need. A barrier (parallel between two stages) forces the slowest branch to gate every downstream branch. If five research sub-agents run and the slowest takes three times as long as the fastest, a barrier leaves the fastest agent's result sitting idle for two thirds of the stage's wall-clock.
A barrier is justified only when the next stage genuinely needs all prior results at once: deduplicating across the full result set, early-exiting when the total count is zero, or letting one branch reference "all the others." It is not justified by "I need to flatten the list first" (do that inside a pipeline stage) or "the stages feel conceptually separate." The default should be the pipeline; the barrier is the exception you reach for deliberately.
This is the kind of judgment that a generated harness encodes well, because the model can look at the data dependency and choose. A static graph drawn in advance tends to barrier everything, because the author did not know which stages were independent.
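The cost of an unnecessary barrier can be worked out on paper. The sketch below compares the two schedules for a made-up workload, assuming unlimited concurrency and no scheduling overhead; the function names and durations are illustrative, not drawn from any measured run.

```python
def barrier_wall_clock(durations):
    """durations[i][s] is the time item i spends in stage s.
    With a barrier after every stage, each stage lasts as long as its slowest item."""
    n_stages = len(durations[0])
    return sum(max(item[s] for item in durations) for s in range(n_stages))

def pipeline_wall_clock(durations):
    """Without barriers each item runs its own chain; the run ends with the slowest chain."""
    return max(sum(item) for item in durations)

def idle_fraction(stage_times):
    """Fraction of a barrier stage that the fastest branch spends waiting."""
    return (max(stage_times) - min(stage_times)) / max(stage_times)

# Hypothetical workload: rows are items, columns are stage durations in seconds.
durations = [
    [1, 3],
    [3, 1],
    [2, 2],
]

barrier_total = barrier_wall_clock(durations)    # 3 + 3
pipeline_total = pipeline_wall_clock(durations)  # max(4, 4, 4)
fastest_idle = idle_fraction([1, 2, 3])          # slowest is 3x the fastest
```

Here every chain takes 4 seconds end to end, yet the barrier schedule takes 6 because each stage pays for its own slowest item. The `idle_fraction` call reproduces the two-thirds figure for a branch that finishes in a third of the slowest branch's time.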
Verification as a first-class construct
The reason dynamic workflows earn trust is that verification stops being a hopeful afterthought and becomes a stage you can fan out. Reflexion showed self-critique helps; a generated harness goes further by spawning independent skeptics that never saw each other's reasoning. The canonical shape is an adversarial vote:
Three verifiers, each prompted to break the claim from a different angle, each blind to the others. A finding survives only on a majority. This is the structural answer to the chronic objection that "the model just made it sound plausible": plausibility does not survive three motivated refuters who default to rejection.
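A hedged sketch of that adversarial vote follows. The skeptic here is a stub driven by a `weak_on` test fixture rather than a model call, and the three angle names and sample claims are assumptions chosen for illustration; only the shape (independent concurrent skeptics, strict majority) comes from the description above.

```python
import asyncio

ANGLES = ("reproduce", "source", "recency")  # hypothetical skeptic roles

async def skeptic(claim: dict, angle: str) -> bool:
    """Stub verifier: True means the claim survived this angle.
    A real skeptic would be a sub-agent prompted to refute the claim."""
    await asyncio.sleep(0)
    return angle not in claim["weak_on"]  # fixture standing in for a model verdict

async def survives(claim: dict, angles=ANGLES) -> bool:
    """Run the skeptics concurrently, each blind to the others; keep on a strict majority."""
    votes = await asyncio.gather(*(skeptic(claim, angle) for angle in angles))
    return sum(votes) * 2 > len(votes)

async def main():
    claims = [  # made-up claims for illustration
        {"text": "A persists graph state", "weak_on": set()},
        {"text": "B added durable execution", "weak_on": {"reproduce", "recency"}},
        {"text": "C persists history", "weak_on": {"recency"}},
    ]
    verdicts = await asyncio.gather(*(survives(c) for c in claims))
    return [c["text"] for c, ok in zip(claims, verdicts) if ok]

kept = asyncio.run(main())
```

Because each skeptic receives only the claim and its own angle, no verifier can anchor on another's reasoning; the orchestrator sees nothing but the boolean votes.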
[IMAGE: A funnel diagram showing 23 candidate claims entering, narrowing through dedup to 14, then through a 3-skeptic gate to 9 survivors, with the rejected claims peeling off at each stage labeled by reason (duplicate, recency, unverifiable).]
Context isolation
The last mechanism is the least visible and arguably the most important at scale. Each sub-agent owns its own context window; only its structured return value crosses back to the orchestrator. The orchestrator never sees the 10,000 rows a sub-agent scanned, only the five it returned. This is the same insight as code execution with MCP, where keeping intermediate results in the execution environment rather than the model's context cut one example from 150,000 tokens to 2,000, a 98.7% reduction (Anthropic, 2025, Code Execution with MCP). A fleet of sub-agents is a context-management strategy as much as a parallelism strategy.
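The isolation boundary is easy to picture in code. In this sketch, which uses a synthetic data source and made-up field names, the sub-agent loads and scans ten thousand bulky rows inside its own scope and hands back five small records; nothing else reaches the orchestrator.

```python
def load_rows() -> list:
    """Hypothetical data source; the payload field makes each row bulky."""
    return [
        {"id": i, "score": (i % 1000) / 1000, "payload": "x" * 100}
        for i in range(10_000)
    ]

def scanning_agent(limit: int = 5) -> list:
    """Sub-agent stand-in: scans every row locally, returns a small structured result."""
    rows = load_rows()  # the 10,000 rows exist only inside this function
    matches = sorted(
        (r for r in rows if r["score"] > 0.9),
        key=lambda r: r["score"],
        reverse=True,
    )
    # Strip the payload: only id and score cross back to the caller.
    return [{"id": r["id"], "score": r["score"]} for r in matches[:limit]]

def orchestrator() -> dict:
    findings = scanning_agent()  # only this short list crosses the boundary
    return {"count": len(findings), "findings": findings}

report = orchestrator()
```

In a real system the function boundary would be a separate context window, but the contract is the same: the return type is the entire interface.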
Seeing It in Motion
The execution-phase interaction between an orchestrator and its sub-agents, with a verification fan-out, looks like this:
sequenceDiagram
participant O as Orchestrator
participant F as Finder agents
participant J as Verifier agents
O->>F: spawn N finders (parallel)
F-->>O: candidate findings
Note over O: dedup across all findings
O->>J: 3 skeptics per finding (parallel)
J-->>O: refute / survive votes
Note over O: keep majority-survivors
O->>O: synthesize report
Note the two barriers, and that both are earned: the dedup needs every finder's output at once, and synthesis needs every verdict. Between those barriers, however, verification can be a pipeline over findings, so a finding whose skeptics answer quickly gets its verdict without waiting on the slowest finding's skeptics.
The loop-until-dry control structure, used when you do not know how many findings exist, is a small state machine:
stateDiagram-v2
[*] --> Search
Search --> Dedup: candidates found
Dedup --> Verify: fresh items
Dedup --> DryCheck: nothing fresh
Verify --> Search: reset dry counter
DryCheck --> Search: dry < 2
DryCheck --> [*]: dry == 2
The agent keeps spawning rounds of finders until two consecutive rounds surface nothing new, deduplicating each round against everything seen so far. A naive "run five finders once" misses the long tail; the loop converges on it.
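The loop can be written in a dozen lines. This is a sketch under assumptions: the finder and verifier are stubs fed from a fixed table, and `max_rounds` is an added backstop that the description above does not mention.

```python
import asyncio

async def loop_until_dry(find_round, verify, dry_limit: int = 2, max_rounds: int = 20):
    """Repeat finder rounds until `dry_limit` consecutive rounds add nothing new."""
    seen, kept, dry, rounds = set(), [], 0, 0
    while dry < dry_limit and rounds < max_rounds:  # max_rounds guards against runaways
        candidates = await find_round(rounds)
        rounds += 1
        # dict.fromkeys drops in-round duplicates while preserving order.
        fresh = [c for c in dict.fromkeys(candidates) if c not in seen]
        seen.update(fresh)  # dedup against everything seen, not only what was kept
        if not fresh:
            dry += 1
            continue
        dry = 0  # any fresh item resets the dry counter
        verdicts = await asyncio.gather(*(verify(c) for c in fresh))
        kept.extend(c for c, ok in zip(fresh, verdicts) if ok)
    return kept, rounds

# Hypothetical finder output per round: the tail shows up in round two, then nothing.
ROUNDS = [["a", "b", "b"], ["b", "c"], [], []]

async def find_round(i: int):
    return ROUNDS[i] if i < len(ROUNDS) else []

async def verify(candidate: str) -> bool:
    return candidate != "b"  # stub: pretend the skeptics reject "b"

kept, rounds = asyncio.run(loop_until_dry(find_round, verify))
```

With this fixture the loop keeps `a` and `c` and runs four rounds: two productive, then two dry ones that trip the exit condition.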
[IMAGE: An annotated timeline-style Gantt chart contrasting a barrier-heavy run with a pipelined run on the same workload, showing the wasted idle bands under barriers and how the pipeline packs them. Label total wall-clock for each.]
Finally, the system view: where a dynamic workflow sits relative to the host runtime and the tools.
graph TD
User[User task] --> Host[Agent host / runtime]
Host --> Orch[Orchestrator model]
Orch --> Prog[Generated program]
Prog --> Sched[Concurrency scheduler]
Sched --> Pool[Sub-agent pool]
Pool --> Tools[Tools and data sources]
Pool --> Store[(Structured returns)]
Store --> Orch
Sched --> Dash[Monitoring dashboard]
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
class User blue
class Orch,Prog,Pool purple
class Host,Sched,Dash slate
class Tools,Store teal
The scheduler is the unglamorous load-bearing piece. It caps concurrency (a fleet of 100 sub-agents does not mean 100 simultaneous model calls), enforces a total-agent backstop against runaway loops, and feeds a monitoring view so a human can watch a long run without reading every transcript.
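A scheduler with those two limits can be sketched with an `asyncio.Semaphore` and a counter. The class name, its fields, and the limits used here are hypothetical; a production scheduler would also need queue fairness, cancellation, and real metrics export.

```python
import asyncio

class Scheduler:
    """Caps simultaneous sub-agents and the total number ever spawned."""

    def __init__(self, max_concurrent: int, max_total: int):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_total = max_total
        self.spawned = 0
        self.running = 0
        self.peak = 0  # exposed so a monitoring view can read it

    async def run(self, task):
        if self.spawned >= self._max_total:
            raise RuntimeError("total-agent backstop reached")
        self.spawned += 1
        async with self._slots:  # wait here until a concurrency slot frees up
            self.running += 1
            self.peak = max(self.peak, self.running)
            try:
                return await task()
            finally:
                self.running -= 1

async def main():
    sched = Scheduler(max_concurrent=3, max_total=10)

    async def sub_agent(i: int) -> int:
        await asyncio.sleep(0.01)  # stands in for a model call
        return i * i

    results = await asyncio.gather(
        *(sched.run(lambda i=i: sub_agent(i)) for i in range(10))
    )
    try:
        await sched.run(lambda: sub_agent(99))  # eleventh spawn exceeds the backstop
        blocked = False
    except RuntimeError:
        blocked = True
    return results, sched.peak, blocked

results, peak, blocked = asyncio.run(main())
```

Ten sub-agents are requested but at most three run at once, and the eleventh spawn is refused outright, which is the behavior that stops a buggy loop from spending without limit.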
By the Numbers
Generated harnesses spend tokens to buy parallelism and verification. The quantities below are the published anchors worth memorizing; the complexity rows are analytical.
| Quantity | Figure | Source |
|---|---|---|
| Agent token use vs chat | ~4x | Anthropic, 2025 |
| Multi-agent token use vs chat | ~15x | Anthropic, 2025 |
| Multi-agent lift over single-agent Opus 4 | +90.2% on internal research eval | Anthropic, 2025 |
| Variance explained by token usage alone (BrowseComp) | ~80% | Anthropic, 2025 |
| Research time reduction, complex queries | up to 90% | Anthropic, 2025 |
| Context reduction, code execution example | 150k to 2k tokens (98.7%) | Anthropic, 2025 |
| CodeAct success-rate lift over JSON/text actions | up to +20% | Wang et al., 2024 |
| ReAct lift on ALFWorld | +34 absolute points | Yao et al., 2022 |
The scaling behavior is where intuition should live. Let \(n\) be the number of independent sub-tasks, \(T\) the wall-clock of the slowest single sub-task chain, and \(T_{\max}\) the wall-clock of the slowest branch within one stage.
| Pattern | Wall-clock | Token cost | When to use |
|---|---|---|---|
| Sequential | \(O(n \cdot T)\) | \(O(n)\) | dependencies force order |
| Barrier fan-out | \(T_{\max}\) per stage, summed | \(O(n)\) | next stage needs all results |
| Pipeline | \(\approx\) slowest single chain | \(O(n)\) | stages are independent |
| Loop-until-dry | \(O(k \cdot T)\), \(k\) rounds | \(O(k \cdot n)\) | unknown count |
Token cost is roughly linear in the number of sub-agents regardless of pattern; what the patterns trade is wall-clock. The headline is uncomfortable but clear: the dominant lever on quality, per Anthropic's variance analysis, is simply spending more tokens, and parallel sub-agents are how you spend a lot of them in bounded wall-clock.
[IMAGE: A log-log plot of task quality versus total tokens spent, with three curves for single-agent, static multi-agent, and dynamic workflow, annotated with the ~15x token region and the point of diminishing returns.]
A Concrete Example
Take one of the use cases that motivates this pattern: a branching deep-research task with verification. The question is "Which open-source agent frameworks shipped a durable-execution feature in the last year, and what does each actually persist?" The numbers below are an illustrative trace, not measured data, but the shapes are realistic.
The orchestrator reads the question and writes a harness. It decides this is breadth-heavy (many frameworks, independent to investigate) and fact-sensitive (claims about what is persisted must be verified), so it composes: a fan-out of finders by source type, a dedup, a per-claim verification pipeline, and a synthesis.
Round 1, find. It spawns five finder sub-agents, one biased toward GitHub release notes, one toward documentation, one toward changelogs, one toward blog posts, one toward issue trackers. Each returns a small schema-validated list. Combined, they return 23 candidate claims of the form {framework, feature, what_it_persists, source_url}.
Dedup (barrier). Several finders found the same LangGraph checkpointing claim from different pages. Deduplicating by (framework, feature) collapses 23 candidates to 14 distinct claims. This stage needs all 23 at once, so the barrier is correct.
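The dedup step might look like the sketch below. The sample claims and URLs are placeholders, and two choices are assumptions rather than part of the trace: normalizing case and whitespace in the key, and merging the duplicates' source URLs instead of discarding them.

```python
def dedup_by_key(claims, key=("framework", "feature")):
    """Keep the first claim per (framework, feature); collect source URLs from duplicates."""
    merged = {}
    for claim in claims:
        # Assumption: normalize case and whitespace so trivial variants collide.
        k = tuple(claim[field].strip().lower() for field in key)
        if k in merged:
            merged[k]["sources"].append(claim["source_url"])
        else:
            merged[k] = {**claim, "sources": [claim["source_url"]]}
    return list(merged.values())

# Hypothetical finder output; the URLs are placeholders, not real pages.
candidates = [
    {"framework": "LangGraph", "feature": "checkpointing",
     "what_it_persists": "graph state", "source_url": "https://example.com/docs"},
    {"framework": "langgraph", "feature": "Checkpointing",
     "what_it_persists": "graph state", "source_url": "https://example.com/changelog"},
    {"framework": "AutoGen", "feature": "history",
     "what_it_persists": "conversation history", "source_url": "https://example.com/blog"},
]

distinct = dedup_by_key(candidates)
```

Keeping every source URL on the surviving claim gives the later "source" skeptic more than one page to check against.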
Verify (pipeline, per claim). Each of the 14 claims enters a three-skeptic fan-out. The "reproduce" skeptic tries to find the feature in the actual source tree; the "source" skeptic checks the cited URL says what the claim says; the "recency" skeptic checks the date is inside the window. A claim survives on a 2-of-3 majority.
Here is the intermediate state for a sample of the 14:
| Claim | Reproduce | Source | Recency | Verdict |
|---|---|---|---|---|
| LangGraph persists graph state to a checkpointer | survive | survive | survive | kept |
| Framework X "added durable execution" | refute | survive | refute | dropped |
| AutoGen persists conversation history | survive | survive | refute (older) | kept (2/3) |
| Framework Y persists full tool outputs | refute | refute | survive | dropped |
Of 14 claims, 9 survive. The five that drop split into two groups: two were real features that fell outside the one-year window and lost a second vote as well (a recency failure a single-pass answer would likely have gotten wrong), and three were plausible-sounding but unverifiable against the source, exactly the hallucinations a single-pass agent would have published with confidence.
Loop check. The orchestrator runs a second finder round to catch the tail. It surfaces three new candidates, two of which are duplicates of kept claims and one of which is fresh and survives verification. The third and fourth rounds surface nothing new; the dry counter reaches its threshold of two and the loop exits with 10 verified claims.
Synthesize. A final sub-agent receives only the 10 survivors plus their sources and writes the report. It never saw the 16 dropped or duplicate candidates (26 raised across the two productive rounds, minus the 10 kept), so it cannot accidentally reintroduce them.
The token bill for this run is perhaps 12 to 18 times a single-shot answer. What you bought is a report where every claim survived a two-of-three vote among independent skeptics, one of them a dedicated recency check, with the dropped claims auditable. For a fact-sensitive deliverable, that is the trade the user is choosing on purpose.
Where It Breaks
The honest failure modes are specific, and a practitioner should be able to name them.
Cost is the first wall. Fifteen times the tokens of a chat is not a rounding error. A dynamic workflow that fans out indiscriminately can turn a one-dollar query into a twenty-dollar one. The discipline is to scale structure to the task, the way Anthropic's prompts tell the lead agent that "simple fact-finding requires just 1 agent" while a broad comparison warrants several. A harness that always spawns its maximum fleet is a budget bug.
Barriers create bottlenecks even when correct. Synchronous coordination, where the orchestrator waits for each wave of sub-agents before proceeding, is simple to reason about and, in Anthropic's words, "creates bottlenecks." A single slow or stuck sub-agent stalls the whole wave. Timeouts and dropping a failed branch to null rather than failing the run are not niceties; they are what keep a 50-agent run from hanging on one.
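One way to keep a single branch from stalling a wave is to wrap every branch in a timeout and degrade failures to `None`. The sketch below uses `asyncio.wait_for`; the helper name, the 50 ms limit, and the three stub branches are illustrative assumptions.

```python
import asyncio

async def with_timeout(task, seconds: float):
    """Run one branch; return None instead of raising if it is slow or fails."""
    try:
        return await asyncio.wait_for(task(), timeout=seconds)
    except Exception:  # timeouts and branch errors both degrade to None
        return None

async def main():
    async def fast():
        return "ok"

    async def stuck():
        await asyncio.sleep(10)  # stands in for a hung sub-agent
        return "never"

    async def broken():
        raise ValueError("tool failed")

    raw = await asyncio.gather(*(with_timeout(t, 0.05) for t in (fast, stuck, broken)))
    usable = [r for r in raw if r is not None]  # downstream stages see only real results
    return usable, raw

usable, raw = asyncio.run(main())
```

The wave completes in roughly the timeout rather than the ten seconds the stuck branch would have taken, and the orchestrator can decide separately whether one usable result out of three is enough to proceed.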
Tightly coupled tasks resist the pattern. The clearest published caveat is that multi-agent fan-out suits breadth-heavy work and fits coding poorly, because "most coding tasks involve fewer truly parallelizable tasks than research" and sub-agents editing shared state collide. Parallelizing a refactor across files that import each other produces merge conflicts and incoherent edits, not speed. Fan-out wants independence; code wants consistency.
The harness itself can be buggy. This is the genuinely new failure mode. A static graph is reviewed by a human before it ever runs. A generated harness is written by a model under time pressure and may barrier where it should pipeline, dedup against the wrong key (a classic: deduplicating against confirmed findings instead of all seen findings, so rejected items reappear every round and the loop never converges), or set a concurrency cap that starves the pool. These bugs do not look like bad model outputs; they look like a workflow that is slow, expensive, or non-terminating for structural reasons.
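The dedup-key bug is easy to reproduce in a toy simulation. Everything here is a stub chosen for illustration: a finder that rediscovers the same two items every round, a verifier that always refutes one of them, and a `max_rounds` backstop that is the only thing stopping the buggy variant.

```python
def run_rounds(dedup_against_seen: bool, max_rounds: int = 10, dry_limit: int = 2) -> int:
    """Return how many finder rounds run before the loop exits."""
    found_each_round = ["a", "b"]  # hypothetical finder that keeps rediscovering both
    rejects = {"b"}                # stub verifier: "b" is always refuted
    seen, confirmed, dry, rounds = set(), set(), 0, 0
    while dry < dry_limit and rounds < max_rounds:
        rounds += 1
        # The bug lives in this one choice of which set to dedup against.
        known = seen if dedup_against_seen else confirmed
        fresh = [c for c in found_each_round if c not in known]
        seen.update(fresh)
        confirmed.update(c for c in fresh if c not in rejects)
        dry = 0 if fresh else dry + 1
    return rounds

correct_rounds = run_rounds(dedup_against_seen=True)   # one productive round, then two dry
buggy_rounds = run_rounds(dedup_against_seen=False)    # "b" looks fresh every round
```

In the buggy variant the rejected item is re-verified on every round, so each round costs tokens and the dry counter never advances. No single transcript shows anything wrong, which is why this class of failure has to be caught at the harness level.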
Error compounding survives verification. Verification reduces false positives; it does not eliminate correlated error. If every finder and every verifier shares the same blind spot (a framework's docs are simply wrong about what it persists), three skeptics drawing on the same wrong source will happily agree. Independence of reasoning does not guarantee independence of evidence.
Observability degrades with scale. A 100-sub-agent run produces 100 transcripts no human will read. Without a monitoring layer that surfaces structure (which stage, which findings survived, where tokens went), a long dynamic workflow becomes an opaque box that either succeeds or fails with little in between. This is why the dashboard is not a luxury; it is the only way to debug the harness.
[IMAGE: A mockup of a workflow monitoring dashboard showing a live tree of phases and sub-agents, a token-spend gauge against a budget ceiling, and a panel of structured intermediate findings, annotated to show what a human actually watches during a long run.]
Alternative Designs
The dynamic workflow is one point in a design space, and it is not always the right one.
| Approach | Strengths | Weaknesses | Best when |
|---|---|---|---|
| Static graph (LangGraph) | reproducible, inspectable, easy to test | rigid; over- or under-provisioned for variable tasks | task shape is known and stable |
| Conversational multi-agent (AutoGen) | natural for negotiation and role-play; flexible dialogue | hard to bound cost; emergent, non-deterministic | open-ended collaboration, prototyping |
| Pure ReAct agent | maximally adaptive; minimal scaffolding | unreproducible; can drift and loop | short tasks, tight tool loop |
| Declarative pipeline (DSPy) | optimizable; separates logic from prompts | structure still authored ahead; less ad hoc fan-out | repeated pipelines you will tune |
| Dynamic workflow | adaptive then deterministic; fan-out and verify are one line each | token-heavy; harness can be buggy; weak for coupled tasks | breadth-heavy, verification-sensitive, variable-shape tasks |
The frameworks are not strictly competitors. A common production shape uses a static graph for the stable top-level flow and lets a node inside it generate a dynamic sub-workflow for the one stage whose shape depends on input. Declarative optimization (DSPy) can tune the prompts that a dynamic harness then orchestrates. The axis that actually separates them is when the structure is decided, and dynamic workflows are simply the only option that decides it after seeing the task and still runs deterministically.
How It Is Used in Practice
The clearest production example is Anthropic's own Research feature, where a lead agent decides how many sub-agents to spawn and parcels out subtasks, and the system as a whole beat single-agent Claude Opus 4 by 90.2% on the internal research eval. The same engineering note is candid that the approach is for "valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools," which is a precise description of when the token premium pays off.
In coding agents, the pattern shows up as orchestrated review and migration rather than orchestrated editing. A review harness fans out across dimensions (correctness, security, performance), then verifies each finding adversarially before reporting, which keeps the false-positive rate that plagues single-pass review in check. A migration harness discovers the call sites first, then transforms each in an isolated worktree so parallel edits do not collide, respecting the coupling constraint by manufacturing independence.
Beyond engineering, the shape generalizes cleanly to the use cases practitioners report success with: branching and parallel deep research (fan-out finders, verify, synthesize), session mining across an agent's own history (a sweep with a completeness critic), triage and bug hunting (loop-until-dry over a candidate space), fact-checking (the adversarial-vote pattern directly), evals generation (synthesize candidate items, filter by a judge panel), and "LLM councils" where several models judge the same artifact through different lenses and a synthesis stage reconciles them. None of these are coding tasks, which is the tell that the primitive is general.
[IMAGE: A grid of six small workflow icons, one per use case (deep research, session mining, triage, fact-check, evals gen, LLM council), each showing its characteristic fan-out-and-verify shape so the family resemblance is visible at a glance.]
The operational reality across all of these is that a monitoring dashboard becomes part of the product. A long dynamic workflow is a background process with tasks, metrics, and intermediate reports, and the human's role shifts from writing the steps to watching the run and reading the synthesized result. That is a different working relationship with an agent than the chat turn, and it is the one the pattern is pushing toward.
Insights Worth Remembering
- The interesting variable in any agent system is when the control flow is decided. Static graphs decide before the task; pure agents decide every step; dynamic workflows decide once, after seeing the task, and then freeze it.
- Generated orchestration is CodeAct at a higher altitude. If you believe code is a better action space than JSON for tools, the same argument makes code a better orchestration space than a graph spec.
- The default should be the pipeline, not the barrier. Most "stages" are independent, and a reflexive barrier between them silently throws away the parallelism you paid for.
- Verification is the feature, not the garnish. An adversarial vote among independent skeptics is the structural reason to trust a fan-out output over a single confident generation.
- Sub-agents are a context strategy before they are a speed strategy. Keeping a sub-agent's 10,000 intermediate rows out of the orchestrator's window is often worth more than the parallelism.
- The new bug class is the harness, not the output. A workflow that is slow, expensive, or non-terminating is usually a control-flow bug the model wrote, and it will not show up in any single transcript.
- Token spend is the crude but dominant quality lever; parallel sub-agents are how you spend a lot of it without spending a lot of wall-clock.
- Fan-out wants independence. Any task whose parts depend on each other (most coding) fights the pattern, and forcing it produces conflicts rather than speed.
Open Questions
What is established: multi-agent fan-out measurably outperforms single agents on breadth-heavy research, at a large and quantified token premium; verification via independent critics reduces false positives; and code is a more expressive orchestration medium than static graphs. Those are supported by the cited work.
What remains open is more interesting. Can a model reliably self-assess when a task warrants fan-out versus a single pass, rather than relying on hand-written scaling heuristics in its prompt? The current systems lean on rules of thumb baked into the orchestrator prompt, which is a sign the meta-decision is not yet learned. A second open problem is correlated error: independent reasoning does not buy independent evidence, and we have no clean way to measure or enforce evidential independence among verifiers. Third is the economics of the harness bug: a generated orchestration program has no human reviewer in the loop, so how to test, lint, or formally bound a model-written workflow before it spends real money is unsolved. Likely near-term developments, stated as expectation rather than fact, include asynchronous orchestration replacing synchronous barriers to kill the bottleneck, reusable harness libraries (a Voyager-style skill library, but for orchestration patterns rather than behaviors), and learned cost models that let an orchestrator predict a fan-out's token bill before committing to it. Whether dynamic workflows become a stable primitive or a transitional step toward something more learned is genuinely undecided.
Sources and Further Reading
Foundational Papers
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y., 2022, ReAct: Synergizing Reasoning and Acting in Language Models, arXiv:2210.03629
- Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., Yao, S., 2023, Reflexion: Language Agents with Verbal Reinforcement Learning, arXiv:2303.11366 (NeurIPS 2023)
- Wang, G., Xie, Y., Jiang, Y., et al., 2023, Voyager: An Open-Ended Embodied Agent with Large Language Models, arXiv:2305.16291
- Wang, X., Chen, Y., Yuan, L., Zhang, Y., Li, Y., Peng, H., Ji, H., 2024, Executable Code Actions Elicit Better LLM Agents, arXiv:2402.01030 (ICML 2024)
Important Follow-up Work
- Anthropic, 2024, Building Effective Agents
- Anthropic, 2025, How We Built Our Multi-Agent Research System
- Anthropic, 2025, Code Execution with MCP: Building More Efficient AI Agents