Agent Evaluation Harnesses
Single-output accuracy says nothing about an agent that takes thirty steps; evaluating agents means scoring trajectories, environment state, and reliability across runs.
A chatbot is evaluated on the quality of one response. An agent runs a plan-act-observe loop over many steps, and any single step can derail the whole task. That difference breaks the usual evaluation playbook: you cannot grade an agent by comparing its final string to a reference answer, because success is a property of the entire trajectory and of the state of the world it leaves behind. The benchmark numbers make the difficulty concrete. On SWE-bench, Claude 2 resolved 1.96% of real GitHub issues. On WebArena, the best GPT-4 agent completed 14.41% of web tasks, against 78.24% for humans. On tau-bench, GPT-4o succeeded on fewer than 50% of tasks. Agents that demo beautifully can score anywhere from the low single digits to under half of tasks on honest harnesses.
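To make the idea of grading the final world state across repeated runs concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: `ToyEnv`, `flaky_agent`, `goal_reached`, and `evaluate` do not come from SWE-bench, WebArena, or tau-bench, and the 30% failure chance is an arbitrary assumption. The point is only the shape of the harness: reset the environment, let the agent act, check the state it leaves behind, and repeat.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical environment: a tiny 'world' that the agent mutates."""
    state: dict = field(
        default_factory=lambda: {"ticket_status": "open", "refund_issued": False}
    )

    def step(self, action: str) -> str:
        # Apply the action to the world and return an observation string.
        if action == "issue_refund":
            self.state["refund_issued"] = True
        elif action == "close_ticket":
            self.state["ticket_status"] = "closed"
        return f"ok: {action}"


def flaky_agent(env: ToyEnv, rng: random.Random) -> list:
    """Stand-in for a real agent: it sometimes skips the refund step.

    The 0.3 failure chance is an arbitrary assumption for the demo.
    """
    plan = ["issue_refund", "close_ticket"]
    if rng.random() < 0.3:
        plan = ["close_ticket"]
    # Record (action, observation) pairs so the trajectory can be inspected.
    return [(action, env.step(action)) for action in plan]


def goal_reached(env: ToyEnv) -> bool:
    """Grade the final environment state, not the agent's final message."""
    return env.state == {"ticket_status": "closed", "refund_issued": True}


def evaluate(agent, n_runs: int = 10, seed: int = 0) -> dict:
    """Run the agent n_runs times on fresh environments and aggregate."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_runs):
        env = ToyEnv()  # fresh world for every run
        agent(env, rng)
        outcomes.append(goal_reached(env))
    return {
        "success_rate": sum(outcomes) / n_runs,
        "all_runs_succeeded": all(outcomes),  # a strict reliability check
        "outcomes": outcomes,
    }


if __name__ == "__main__":
    print(evaluate(flaky_agent, n_runs=20))
```

Reporting `all_runs_succeeded` alongside `success_rate` is a design choice in this sketch: an agent that succeeds on most runs can still fail the stricter check, which is the gap between a good demo and a reliable system.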
Why agent evaluation is different
Three properties separate agent evaluation from standard model evaluation: