Agent Evaluation Harnesses

A chatbot is evaluated on the quality of one response. An agent runs a plan-act-observe loop over many steps, and any single step can derail the whole task. That difference breaks the usual evaluation playbook: you cannot grade an agent by comparing its final string to a reference answer, because success is a property of the entire trajectory and the state of the world it leaves behind. The originally reported benchmark numbers make the difficulty concrete. On SWE-bench, Claude 2 resolved 1.96% of real GitHub issues. On WebArena, the best GPT-4 agent completed 14.41% of web tasks against 78.24% for humans. On tau-bench, GPT-4o landed below 50%. Agents that demo beautifully scored in the single or low double digits on the first two harnesses and under half on the third.

Why agent evaluation is different

Three properties separate agent evaluation from standard model evaluation:

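Trajectory, not output. The unit of success is a multi-step path, and partial progress is real. A harness must decide whether to score only the end state or to give credit along the way, and both choices distort: end-state scoring hides partial progress, while step-level credit can reward trajectories that never finish the task.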
Environment state. Success often means the world changed correctly (the issue is fixed, the form is submitted, the order is placed), not that the model emitted the right words. Grading requires executing in an environment and inspecting its state.
Reliability, not just capability. An agent that succeeds once in eight tries is not production-ready even if its best run is perfect. Consistency is a first-class metric.
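
The scoring choice in the first two properties can be made concrete with a minimal sketch. It assumes the environment state and the goal are both plain dictionaries; the function names, the goal keys, and the idea of treating each goal key as one condition are illustrative assumptions, not part of any benchmark.

```python
def end_state_score(final_state: dict, goal: dict) -> float:
    """All-or-nothing: 1.0 only if every goal condition holds in the final state."""
    return 1.0 if all(final_state.get(k) == v for k, v in goal.items()) else 0.0


def partial_credit_score(final_state: dict, goal: dict) -> float:
    """Fraction of goal conditions that hold in the final state."""
    if not goal:
        return 0.0
    met = sum(1 for k, v in goal.items() if final_state.get(k) == v)
    return met / len(goal)


# Hypothetical task: update the shipping address, then place the order.
goal = {"address_updated": True, "order_placed": True}
# The agent updated the address but never placed the order.
final_state = {"address_updated": True, "order_placed": False}

strict = end_state_score(final_state, goal)        # 0.0
lenient = partial_credit_score(final_state, goal)  # 0.5
```

Both functions grade the state the agent left behind rather than the text it produced. The strict score treats a half-finished task the same as no attempt at all, while the lenient score gives half credit to a run that never placed the order.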

The benchmarks worth knowing

SWE-bench gives a model a real GitHub issue and the repository, and asks it to produce a patch. Success is execution-verified: the patch must make the tests tied to the issue pass without breaking tests that passed before. This is the gold standard for agent evaluation because the grader is the project's own test suite, not a judge model, so it is hard to game. The original headline (Claude 2 at 1.96%) showed how far early agents were from real software engineering; the frontier has since climbed sharply, and because the issues come from public repositories, contamination is a real concern when reading newer scores.
WebArena stands up realistic, reproducible websites (e-commerce, a forum, a code host, a CMS) and scores whether the agent accomplishes a functional goal in that live environment. It measures web navigation and tool use under genuine UI complexity.
tau-bench evaluates the tool-agent-user triangle: an agent must use domain tools, follow domain-specific policy rules, and interact with a user simulated by another model. Its key contribution is the pass^k metric, the probability of succeeding in all k independent attempts, which exposes reliability rather than peak capability. Agents that pass once routinely fail pass^8.
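
The pass^k definition above can be estimated from repeated runs. The sketch below assumes each task was attempted n times with c successes and uses the standard combinatorial estimator C(c, k) / C(n, k), averaged over tasks; the function names are illustrative, and whether a given benchmark's reference code computes it exactly this way is an assumption. The more familiar pass@k (at least one success in k attempts) is included only for contrast.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k independent attempts succeed) from c successes in n trials."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # comb(c, k) is 0 when c < k, so too few successes yields 0.0.
    return comb(c, k) / comb(n, k)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts succeeds); shown only for contrast."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_hat_k(results: dict, k: int) -> float:
    """Average pass^k over tasks; results maps task name -> list of bool outcomes."""
    per_task = [pass_hat_k(len(runs), sum(runs), k) for runs in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical results: 8 attempts per task.
results = {
    "refund_order": [True] * 8,             # always succeeds
    "change_flight": [True] * 7 + [False],  # one failure in eight
}
```

With these made-up results, `benchmark_pass_hat_k(results, 1)` is 0.9375 but `benchmark_pass_hat_k(results, 8)` is 0.5, because a single failure zeroes out pass^8 for that task. The same effect shows up analytically: an agent with an independent 90% per-attempt success rate clears eight attempts in a row with probability 0.9^8, roughly 0.43.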

Building your own harness

Public benchmarks measure general capability; they rarely match your task. A useful in-house harness needs three things that the benchmarks above demonstrate: an environment the agent acts in (sandboxed, resettable to a known state), a verifiable success condition (prefer programmatic checks and use an LLM judge only where you must; see custom-evals-llm-judge), and instrumentation that records the full trajectory so a failure can be traced to the step that caused it. Run every task multiple times and report a reliability metric such as pass^k, not a single-run pass rate.
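
A minimal sketch of those three pieces follows. Everything in it is hypothetical: `ToyEnv` is an in-memory stand-in for a real sandbox (a container or VM snapshot), the action format (`{"set": key, "value": v}` or `{"done": True}`) is invented for illustration, and the agent is a plain function rather than a model call.

```python
import copy
from dataclasses import dataclass
from typing import Callable


class ToyEnv:
    """Hypothetical in-memory environment that resets to a known state."""

    def __init__(self, initial_state: dict):
        self._initial = copy.deepcopy(initial_state)
        self.state = {}

    def reset(self) -> dict:
        self.state = copy.deepcopy(self._initial)
        return copy.deepcopy(self.state)

    def step(self, action: dict) -> dict:
        # Invented action format: {"set": key, "value": v} mutates the state.
        if "set" in action:
            self.state[action["set"]] = action.get("value")
        return copy.deepcopy(self.state)


@dataclass
class Task:
    name: str
    initial_state: dict
    check: Callable[[dict], bool]  # programmatic success condition on final state
    max_steps: int = 10


def run_episode(agent_fn, task: Task):
    """Run one attempt; return (success, trajectory) with every step recorded."""
    env = ToyEnv(task.initial_state)
    obs = env.reset()
    trajectory = []
    for i in range(task.max_steps):
        action = agent_fn(obs)
        if not action.get("done"):
            obs = env.step(action)
        trajectory.append({"step": i, "action": action, "observation": obs})
        if action.get("done"):
            break
    return task.check(env.state), trajectory


def run_task(agent_fn, task: Task, attempts: int = 8) -> dict:
    """Repeat the task from a fresh environment and report reliability."""
    episodes = [run_episode(agent_fn, task) for _ in range(attempts)]
    outcomes = [ok for ok, _ in episodes]
    return {
        "task": task.name,
        "outcomes": outcomes,
        "success_rate": sum(outcomes) / attempts,
        "all_passed": all(outcomes),
        "trajectories": [traj for _, traj in episodes],
    }


def scripted_agent(obs: dict) -> dict:
    """Stand-in for a real agent: place the order, then stop."""
    if obs.get("order_placed"):
        return {"done": True}
    return {"set": "order_placed", "value": True}


task = Task(
    name="place_order",
    initial_state={"order_placed": False},
    check=lambda state: state.get("order_placed") is True,
)
report = run_task(scripted_agent, task, attempts=3)
```

The success check reads the environment's final state, not the agent's output, and each attempt starts from a fresh reset so runs cannot contaminate each other. The recorded trajectories are what make a failed attempt debuggable: each entry pairs the action taken with the observation that followed it.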
