Why 'agents' is the wrong frame for most workflows you actually want

Every LLM feature shipped in the last eighteen months has been pitched as an "agent." Most of them are not. They are pipelines with one or two LLM calls, deterministic branches, and explicit handoffs to code. The teams who shipped them on time made one decision early: they refused the agent frame and asked instead, "what is the minimum LLM-step graph that solves this?" The teams still debugging in production reached for an autonomous loop on day one and have been paying for it ever since.

This is not a sermon against agentic systems. Open-ended exploration, coding agents like Cursor, and Claude's Computer Use are cases where the agent frame earns its keep. The argument is narrower: the default disposition should be a workflow. You should be forced into an agent by the shape of the problem, not pulled into one by vocabulary.

The distinction the industry keeps blurring

Anthropic's Building effective agents post (Schluntz and Zhang, December 2024) draws the cleanest line I have read. Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents are systems where the LLM dynamically directs its own control flow and tool use. That is it. The difference is who decides what happens next: your code, or the model.
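
A minimal sketch of that distinction, assuming a hypothetical `fake_model` stand-in rather than any real SDK. In `workflow`, Python fixes the order of steps. In `agent`, the model's reply names the next tool and decides when to stop. The tool name and canned replies are invented for illustration.

```python
from typing import Callable

# Hypothetical stand-in for an LLM call; a real system would call a model API here.
def fake_model(prompt: str) -> str:
    if prompt.startswith("classify:"):
        return "question"
    if prompt.startswith("next_action:"):
        # The "model" picks the next tool based on what it has seen so far.
        return "finish" if "kb_result" in prompt else "search_kb"
    return "drafted reply"

# Hypothetical tool registry with a single canned tool.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_kb": lambda query: "kb_result: the reset link is under Settings",
}

def workflow(email: str) -> list[str]:
    """Workflow: code fixes the order of steps; the model only fills them in."""
    steps = ["classify"]
    label = fake_model(f"classify: {email}")
    if label == "question":  # the branch is chosen by code
        steps.append("draft_reply")
        fake_model(f"draft: {email}")
    return steps

def agent(email: str, max_iterations: int = 5) -> list[str]:
    """Agent: the model's output decides which tool runs next and when to stop."""
    steps: list[str] = []
    context = email
    for _ in range(max_iterations):
        action = fake_model(f"next_action: {context}")
        steps.append(action)
        if action == "finish":  # stopping is decided by the model
            break
        context += "\n" + TOOLS[action](context)
    return steps

print(workflow("How do I reset my password?"))
print(agent("How do I reset my password?"))
```

In the first function you can read the possible traces off the source. In the second, the trace exists only after the model has run.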

Simon Willison, after collecting 211 different definitions of "agent" and grouping them into thirteen categories, settled on a deliberately narrow one: "an LLM agent runs tools in a loop to achieve a goal." His point is mostly that jargon only works when people share definitions, and on this one they emphatically do not. Half the industry uses "agent" to mean "any feature that calls a model." The other half means a ReAct loop with arbitrary tool access and no upper bound on iterations. These are not the same engineering problem.

When a PM asks for "an agent that handles refunds," they usually mean a routing workflow with two LLM steps and a SQL query. If you build them a ReAct loop with tool access to your refund API, you have shipped a different product than they asked for - and a more dangerous one.

What a workflow actually buys you

A workflow is debuggable in the way a microservice is debuggable. Each LLM call has typed inputs, typed outputs, a known prompt, and a fixed position in the graph. When something goes wrong in production, you can look at the trace, find the step that misbehaved, and reproduce it deterministically. The blast radius of any single bad generation is one node.
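
One way to picture "typed inputs, typed outputs, a fixed position in the graph" is a step function that validates its output and appends a record to a trace. This is a sketch under assumptions: `fake_classifier`, the dataclass names, and the trace shape are all hypothetical, not taken from any particular framework.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ClassifyInput:
    email: str

@dataclass(frozen=True)
class ClassifyOutput:
    label: str

ALLOWED_LABELS = {"refund_request", "complaint", "question", "spam"}

def fake_classifier(prompt: str) -> str:
    # Hypothetical stand-in for a model call.
    return "refund_request" if "refund" in prompt.lower() else "question"

def classify_step(inp: ClassifyInput, trace: list[dict]) -> ClassifyOutput:
    """One node in the graph: typed in, typed out, and recorded in the trace."""
    raw = fake_classifier(f"Classify this email: {inp.email}")
    if raw not in ALLOWED_LABELS:
        # A bad generation is caught at this node instead of leaking downstream.
        raise ValueError(f"unexpected label: {raw!r}")
    out = ClassifyOutput(label=raw)
    trace.append({"step": "classify", "input": asdict(inp), "output": asdict(out)})
    return out

trace: list[dict] = []
result = classify_step(ClassifyInput(email="Please refund order 1234"), trace)
print(json.dumps(trace, indent=2))
```

Because the trace stores the exact input of each step, replaying an incident means feeding one recorded input back into one function.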

An agent loop trades that for autonomy. The model decides which tool to call, when to stop, and what to do when a tool returns garbage. This is genuinely useful when you do not know the steps in advance - for instance, when the user says "find the customer's last invoice and email me a summary" and the path through your APIs depends on what the invoice contains. But it is a tax in every other case. You give up:

Determinism. The same input can produce different traces, sometimes different outcomes.
Latency predictability. A loop that usually finishes in three iterations sometimes takes twelve.
Cost predictability. Token spend per request is bounded only by your max-iterations cap.
Observability. Spans are dynamic. Your dashboards can no longer answer "how often does step 4 fail" because there may not be a step 4.

These are not theoretical costs. They show up the first time a customer support manager asks why one refund took 14 seconds and another took 90.
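
The cost point can be made concrete with a little arithmetic. The sketch below assumes a loop that re-sends its growing context on every turn; the token counts are illustrative assumptions, not measurements from any real system.

```python
def worst_case_tokens(max_iterations: int, base_prompt: int,
                      per_step_growth: int, output_per_step: int) -> int:
    """Total tokens for a loop that re-sends a growing context each iteration."""
    total = 0
    for i in range(max_iterations):
        # Each turn carries the base prompt plus everything appended so far.
        input_tokens = base_prompt + i * per_step_growth
        total += input_tokens + output_per_step
    return total

# Illustrative numbers only: 1,000-token prompt, 500 tokens of tool output
# appended per turn, 200 tokens generated per turn.
typical = worst_case_tokens(3, base_prompt=1_000, per_step_growth=500, output_per_step=200)
capped = worst_case_tokens(12, base_prompt=1_000, per_step_growth=500, output_per_step=200)

print(typical)  # 5100
print(capped)   # 47400
print(round(capped / typical, 1))  # 9.3
```

Under these assumptions, four times the iterations costs roughly nine times the tokens, because the context grows with every turn. A fixed pipeline pays the same amount every time.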

The pattern that breaks

The failing setup is an agent loop with a long tool list, no strong stopping signal, and tool outputs that the model cannot distinguish from its own reasoning. The model proposes a tool call, the tool returns an ambiguous error, the model invents a plausible interpretation of the error, calls a different tool to "verify," gets another ambiguous response, and spirals. By iteration eight you are paying for hallucinated observations dressed up as a plan. The trace looks productive. The state of the world has not changed.

This failure mode is sometimes called plan drift, and it is part of why Reflexion (Shinn et al., 2023) layered verbal self-reflection on top of ReAct-style loops (Yao et al., 2022) to keep them on track. The fix is not "a better model." Frontier models drift too. The fix is to give the loop fewer tools, a sharper goal, and a hard upper bound, or to not run a loop at all.
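
A hard upper bound can be as simple as a wrapper that stops on whichever limit trips first. This is a minimal sketch, assuming a hypothetical `step` callable that returns whether the goal was reached and how many tokens the turn spent; the default limits are placeholders.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LoopResult:
    status: str       # "done", "iteration_cap", or "token_budget"
    iterations: int
    tokens_used: int

def run_bounded(step: Callable[[int], tuple[bool, int]],
                max_iterations: int = 5,
                token_budget: int = 10_000) -> LoopResult:
    """Run `step` until it reports done, or until a hard limit trips."""
    tokens = 0
    for i in range(max_iterations):
        done, spent = step(i)
        tokens += spent
        if done:
            return LoopResult("done", i + 1, tokens)
        if tokens >= token_budget:
            return LoopResult("token_budget", i + 1, tokens)
    return LoopResult("iteration_cap", max_iterations, tokens)

def spiralling_step(i: int) -> tuple[bool, int]:
    # Hypothetical loop body that never converges.
    return (False, 1_500)

def converging_step(i: int) -> tuple[bool, int]:
    # Hypothetical loop body that finishes on its third turn.
    return (i == 2, 1_500)

print(run_bounded(spiralling_step))
print(run_bounded(spiralling_step, token_budget=4_000))
print(run_bounded(converging_step))
```

The useful part is the `status` field: a loop that hit its cap is a different event from one that finished, and callers can escalate instead of trusting a half-finished plan.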

A decision table

When a feature lands on my desk, I run the candidate design through this grid before writing any prompts.

Dimension	Choose a workflow when...	Choose an agent when...
Determinism	The same input must produce the same trace (refunds, compliance, billing)	The path genuinely depends on intermediate findings
Latency budget	You have a strict p95 (sub-3s chat, sub-500ms API)	The user is watching a progress bar and expects "work"
Failure mode	A wrong action has lasting consequences (write to DB, send email, spend money)	Failures are cheap to retry and easy to detect
Debuggability	On-call needs to reproduce any incident from a single trace ID	You can afford to read 200 lines of agent scratchpad to debug
Tool surface	Fewer than ~5 well-typed operations, each with clear pre / post conditions	Dozens of tools, many overlapping, schema discovered at runtime
Stopping condition	"Return JSON matching schema X" or "send the email"	"Done when the user's goal is met" (squishy)
Step count	Known and bounded (1-4 LLM calls)	Unknown, may need 10-50 calls
Evaluation	Golden-set evals on input / output pairs work	You need end-to-end task success metrics like SWE-bench or GAIA

The last row is the tell. If your evaluation strategy looks like SWE-bench (run the system, see if the patch passes the hidden tests) you are honestly in agent territory. If it looks like "compare this JSON to that JSON," you are in workflow territory and the agent frame is costing you.
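
For the workflow side of that row, "compare this JSON to that JSON" can be sketched in a few lines. Everything here is hypothetical: `fake_extract` stands in for a real LLM step, and the three cases stand in for a real golden set.

```python
import json

def fake_extract(email: str) -> str:
    """Hypothetical workflow step that returns a JSON string."""
    intent = "refund_request" if "refund" in email.lower() else "question"
    return json.dumps({"intent": intent})

# Hypothetical golden set: recorded inputs paired with the expected parsed output.
GOLDEN_SET = [
    {"input": "I want a refund for order 1234", "expected": {"intent": "refund_request"}},
    {"input": "Where is my parcel?", "expected": {"intent": "question"}},
    {"input": "Money back please", "expected": {"intent": "refund_request"}},
]

def run_golden_set(step, cases):
    """Return every case whose parsed output differs from the expected JSON."""
    failures = []
    for case in cases:
        actual = json.loads(step(case["input"]))
        if actual != case["expected"]:
            failures.append({
                "input": case["input"],
                "expected": case["expected"],
                "actual": actual,
            })
    return failures

failures = run_golden_set(fake_extract, GOLDEN_SET)
print(f"{len(GOLDEN_SET) - len(failures)}/{len(GOLDEN_SET)} passed")
for failure in failures:
    print(failure)
```

The third case fails on purpose: the stand-in step misses a refund request that never uses the word "refund", which is the kind of regression a per-step golden set pins to one node.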

Agent patterns that are actually workflows in disguise

The most useful contribution of the Anthropic post is the taxonomy that follows the workflow / agent definition. Almost everything teams call an "agent architecture" turns out to be one of these patterns, all of which are workflows by Anthropic's own definition:

Prompt chaining. Decompose a task into sequential LLM calls, each operating on the prior step's output. Generate an outline, then expand each section. Translate, then summarise. The control flow is your code.
Routing. Classify the input with a small LLM call, then dispatch to one of several specialised handlers. Customer support triage. Multi-language assistants. The "router" is a one-shot classifier, not a thinking agent.
Parallelisation. Split a task into independent subtasks, run them concurrently, and aggregate. Either sectioning (different prompts, different aspects of the input) or voting (same prompt, n samples, majority wins). This is asyncio.gather, not autonomy.
Orchestrator-workers. A central LLM breaks a task into dynamically-sized subtasks and dispatches workers, then synthesises the results. This one shades into agent territory because the orchestrator decides the decomposition at runtime, but each worker still runs a fixed, predefined code path.
Evaluator-optimiser. One LLM produces a candidate output, a second LLM scores it against a rubric, and the loop continues until the score is above a threshold or budget runs out. Useful for translation polish, code generation against tests, copy editing.
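
To make the parallelisation point concrete, here is a minimal voting sketch built on `asyncio.gather`. The `fake_sample` coroutine is a hypothetical stand-in for a sampled model call, rigged so that one of five samples disagrees.

```python
import asyncio
from collections import Counter

async def fake_sample(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one sampled model call."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return "complaint" if seed == 1 else "refund_request"

async def vote(prompt: str, n: int = 5) -> str:
    """Voting: same prompt, n concurrent samples, majority label wins."""
    samples = await asyncio.gather(*(fake_sample(prompt, seed) for seed in range(n)))
    label, _count = Counter(samples).most_common(1)[0]
    return label

winner = asyncio.run(vote("Classify: I want my money back"))
print(winner)  # refund_request
```

The fan-out, the aggregation rule, and the number of calls are all fixed in code. The model never decides how many samples to draw or when to stop.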

If your "agent" is one of the above, call it what it is. The naming matters because the engineering disciplines that ship workflow systems on time (typed schemas, golden-set evals, deterministic replays, clear unit boundaries) are the ones teams skip when they think they are building an agent.

When the agent frame actually wins

I owe the strong cases their due. Three patterns where agentic loops are not just defensible but the only honest design:

Computer Use. Claude's Computer Use takes screenshots, decides where to click, types, and re-screenshots. The state space is the entire screen. There is no way to predefine the control flow because the next action depends on whatever pixel grid just rendered. This is the textbook case.

Coding agents. Cursor's composer mode, Devin, Claude Code, Aider. The task is "make this test pass" or "implement this feature," and the path involves reading files, editing, running, reading errors, editing again. Number of steps is unknowable. Tool surface is small but combinatorial. SWE-bench Verified resolved rates climbed from single digits in early 2024 to past 70% in 2025 precisely because the agent loop was the right primitive, not a workaround.

Open-ended research. "Find me everything we know about X and synthesise it." Deep research products. The browse tree is not predictable. The stopping condition is fuzzy ("when you have enough"). An agent loop with a hard iteration cap and a strong reflection prompt outperforms any chain you could hand-write.

Even here, the production lessons from Cognition's Don't build multi-agents post are worth reading carefully. Their argument, after eighteen months of shipping Devin, is that multi-agent systems are fragile because subagents act on conflicting assumptions without visibility into each other's work. Their fix is single-threaded agents with full context sharing. Even the people building the most ambitious agentic product on the market are pulling complexity out, not adding it.

A small worked example

Workflow vs agent, same task, under 25 lines each

Task: given a customer email, decide whether to refund, escalate, or auto-reply, and execute.

Workflow version:

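def handle_email(email: str) -> Action:
    # Step 1: a single LLM call with a closed label set.
    intent = llm.classify(
        email,
        labels=["refund_request", "complaint", "question", "spam"],
    )
    # Step 2: plain code picks the branch; the model never chooses an action.
    if intent == "refund_request":
        order = db.find_order(extract_order_id(email))
        # Refunds only run after a deterministic check against the order record.
        if not order or not order.refundable:
            return escalate(email, reason="not_refundable")
        return refund(order)
    if intent == "complaint":
        return escalate(email, reason="complaint")
    if intent == "spam":
        return drop()
    if intent == "question":
        # Step 3: the only other LLM call, reached on exactly one branch.
        reply = llm.draft_reply(email, knowledge_base=kb)
        return send(reply)
    # Any label outside the set fails closed to a human instead of auto-replying.
    return escalate(email, reason="unclassified")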

Agent version:

def handle_email(email: str) -> Action:
    return agent.run(
        goal=f"Handle this customer email correctly: {email}",
        tools=[
            find_order, refund_order, escalate_to_human,
            send_reply, search_kb, mark_spam,
        ],
        max_iterations=15,
    )

The agent version is shorter. It is also the one that will, occasionally, refund the wrong order. The workflow version is boring. Boring is what you want when money moves.

Build like this instead

When the next "build an agent" ticket lands, run this checklist before opening Anthropic's SDK.

Write the happy path as a sequence diagram. If you can draw it with fewer than five boxes and no loops, you have a workflow. Stop here.
List every tool the system might call. If it is fewer than five and each has a clear precondition, you have a workflow. Use a router or a chain.
Identify the stopping condition. If it is a schema match, a boolean check, or "after step N," workflow. If it is "when the goal is achieved," you are in agent territory - proceed with caution.
Estimate the cost of a wrong action. Sending a duplicate email is recoverable. Issuing a refund is not. The more irreversible the action, the more aggressively you should pull it out of any loop and put it behind a deterministic gate.
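Set a hard iteration cap and a token budget. Even if you genuinely need a loop, cap it. For a bounded business task, five iterations is usually plenty and ten is a sensible ceiling; past that you are mostly paying for plan drift. Open-ended coding or research loops may need more, but they still need an explicit cap.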
Make every tool call observable. Structured logs, span per call, replayable from trace ID. If you cannot reproduce a production incident in a notebook, you do not have an engineered system, you have a generator.
Evaluate the workflow with golden sets first, end-to-end task success second. The former tells you which step regressed when a model upgrade ships. The latter is the only honest metric for true agentic systems but it is too coarse to debug with.
Default to a workflow. Earn the agent. The burden of proof is on the autonomy, not on the structure.
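
The "deterministic gate" from the checklist can be sketched as a plain function that sits between whatever proposed the refund and the code that executes it. The data shapes, the `gate_refund` name, and the auto-approval limit are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    order_id: str
    amount_cents: int
    refundable: bool

@dataclass(frozen=True)
class ProposedRefund:
    order_id: str
    amount_cents: int

AUTO_REFUND_LIMIT_CENTS = 5_000  # assumed policy threshold, not from the article

def gate_refund(proposal: ProposedRefund, orders: dict[str, Order]) -> str:
    """Return "execute" or "escalate" using only code and stored records."""
    order = orders.get(proposal.order_id)
    if order is None or not order.refundable:
        return "escalate"  # unknown or non-refundable order
    if proposal.amount_cents != order.amount_cents:
        return "escalate"  # proposed amount does not match the order record
    if proposal.amount_cents > AUTO_REFUND_LIMIT_CENTS:
        return "escalate"  # above the auto-approval limit
    return "execute"

orders = {
    "A1": Order("A1", 2_500, refundable=True),
    "B2": Order("B2", 9_900, refundable=True),
}
print(gate_refund(ProposedRefund("A1", 2_500), orders))  # execute
print(gate_refund(ProposedRefund("A1", 3_000), orders))  # escalate
print(gate_refund(ProposedRefund("B2", 9_900), orders))  # escalate
print(gate_refund(ProposedRefund("ZZ", 100), orders))    # escalate
```

Whether the proposal came from a pipeline step or from a loop, the irreversible action only runs when every check passes, and none of the checks involve a model.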

The agent hype cycle conflates two engineering problems that look superficially similar and are not. One is "compose LLM calls into a reliable pipeline." The other is "give a model latitude to figure out a fuzzy goal." Both are interesting. Only one is what your refund flow needs.

If you find yourself defending an agent design, the question to ask out loud is: what does the loop earn me that a three-step pipeline does not? If the answer is "it sounds more impressive in the demo," you have your answer.