Agents & Tool Use

Agent Evaluation Harnesses

Single-output accuracy says nothing about an agent that takes thirty steps; evaluating agents means scoring trajectories, environment state, and reliability across runs.

advanced · 10 min read

A chatbot is evaluated on the quality of one response. An agent runs a plan, act, observe loop over many steps, and any single step can derail the whole task. That difference breaks the usual evaluation playbook: you cannot grade an agent by comparing its final string to a reference answer, because success is a property of the entire trajectory and of the state of the world it leaves behind.

The benchmark numbers make the difficulty concrete. On SWE-bench, Claude 2 resolved 1.96% of real GitHub issues. On WebArena, the best GPT-4 agent completed 14.41% of web tasks, against 78.24% for humans. On tau-bench, GPT-4o succeeded on fewer than 50% of tasks. Agents that demo beautifully score anywhere from the single digits to below half on honest harnesses.
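To make the point about trajectories and end state concrete, here is a minimal sketch of a harness in Python. It grades each run by checking the environment state the agent leaves behind, not the agent's final string. It also repeats each task several times, so it can report both the average success rate and the fraction of tasks solved on every run. All names here (`Task`, `Trajectory`, `run_task`, `summarize`, `toy_agent`) are hypothetical. The toy agent is a deterministic stand-in for a real model so the example is reproducible.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, str]


@dataclass
class Trajectory:
    """Everything one run produced: the actions taken and the end state."""
    steps: List[str] = field(default_factory=list)
    final_state: State = field(default_factory=dict)


@dataclass
class Task:
    name: str
    initial_state: State
    # Success is a predicate on the final environment state,
    # not a string comparison against a reference answer.
    check: Callable[[State], bool]


# Hypothetical agent interface: (task, seed) -> trajectory.
Agent = Callable[[Task, int], Trajectory]


def run_task(agent: Agent, task: Task, runs: int) -> List[bool]:
    """Run the same task several times and record pass/fail per run."""
    return [task.check(agent(task, seed).final_state) for seed in range(runs)]


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Aggregate per-task outcomes into two headline numbers."""
    n_tasks = len(results)
    # Average per-run success rate across tasks.
    mean_success = sum(sum(r) / len(r) for r in results.values()) / n_tasks
    # Fraction of tasks the agent solved on every single run.
    all_runs_success = sum(all(r) for r in results.values()) / n_tasks
    return {"mean_success": mean_success, "all_runs_success": all_runs_success}


def toy_agent(task: Task, seed: int) -> Trajectory:
    """Stand-in for a real agent: always solves 'stable', and solves
    'flaky' only on even seeds, to imitate run-to-run variance."""
    state = dict(task.initial_state)
    steps = ["open_ticket"]
    if task.name == "stable" or seed % 2 == 0:
        state["ticket"] = "closed"
        steps.append("close_ticket")
    return Trajectory(steps=steps, final_state=state)


def ticket_closed(state: State) -> bool:
    return state.get("ticket") == "closed"


tasks = [
    Task("stable", {"ticket": "open"}, ticket_closed),
    Task("flaky", {"ticket": "open"}, ticket_closed),
]
results = {t.name: run_task(toy_agent, t, runs=4) for t in tasks}
summary = summarize(results)
print(summary)
```

In this toy setup the mean success rate is 0.75, but only half of the tasks are solved on every run. A single averaged number hides that gap, which is why the sketch reports both.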

Why agent evaluation is different

Three properties separate agent evaluation from standard model evaluation:
