You wired up a multi-agent system, gave it tools, and ran it with the Agent SDK—so how do you measure whether the agent is actually doing its job? For a single output, you can score it with AI evals. But an agent plans over many steps, calls tools, and acts with state. Even if the last sentence looks right, it might have crossed a dangerous bridge along the way. This is where agent evals take center stage.

This article draws on official sources to cover what agent evals are, how they differ from LLM evals, what to measure (5 dimensions), how to grade (3 graders), the pitfalls unique to agents, and a practical workflow plus benchmarks. The key points up front: ① Agent evals measure not just "the final output" but also "the trajectory" of actions. ② Anthropic recommends grading the outcome (final state) rather than the path, because rote step-checking is brittle. ③ Start with a small eval set of 20–50 tasks drawn from real failures, and grade it automatically.

[Figure: Agent evals measure both "the final answer" and "the road walked," because agents are multi-step, tool-using, and stateful. One agent run (trajectory): plan → search() → read_file() → db.write() → final state. Panel 1, Outcome: does the final state match the goal (a reservation exists in the DB, not just "I booked it")? Panel 2, Trajectory: did it call the right tools correctly, without wasted or dangerous moves? Caption: Anthropic recommends grading outcome over path, but the trajectory tells you why it failed. Use both, for the right job.]

1. What are agent evals?

Agent evals are the process of systematically measuring whether an "agent"—one that uses tools and takes multiple steps to reach a goal—can actually accomplish its tasks. They're an evolution of LLM evals, which judge a single prompt's quality; the target expands from "one output" to "a sequence of actions."

Why it matters: in its guide on agent evals, Anthropic notes that "evals get harder to build the longer you wait. Early on, product requirements naturally translate into test cases," and recommends that "20–50 simple tasks drawn from real failures is a great start." In other words, agent evals turn "it seems to be working" into reproducible numbers. This pairs with AI observability (observing the run)—the traces you observe become the material for evaluation.

2. Why they differ from LLM evals (output vs trajectory)

Traditional LLM evals essentially score "input → one output." But an agent plans, calls tools, looks at the results, decides the next move, and updates state. So looking only at the final output isn't enough. Google likewise states that "it's not enough to simply check the outputs; we need to understand the 'why' behind an agent's actions," and splits evaluation into two families: "final response" and "trajectory." Microsoft, too, says you must "evaluate not just the final output, but also the quality and efficiency of each step," dividing it into system (end-to-end) and process (step-by-step) evaluation.

💡 The core idea: a "correct final answer" can hide a broken process. The answer may be right but reached by luck or through a dangerous shortcut. So for agents, you look at both the "result" and the "process." For the basics of single-output evaluation and LLM-as-judge, see the AI evals article.

3. What to measure: 5 dimensions

Here are five commonly used lenses for agent evaluation.

1. Outcome (task success)

Did it reach the goal? Judge by the final state—whether a reservation exists in the DB—not by the utterance "I booked it."

2. Trajectory (process)

Did it take reasonable steps? Did it use the right tools, in the right order, the right number of times? Were there pointless detours or dangerous moves?

3. Tool-use correctness

Did it pick the right tool and pass the right arguments? Check function names, argument types, and values (and detect needless calls).
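A minimal sketch of such a check, assuming the agent's tool calls are logged as a name plus an argument dictionary. The `ToolCall` type, the `TOOL_SCHEMAS` table, and the tool names are hypothetical and stand in for whatever your framework records.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


# Hypothetical tool schemas: argument name -> expected Python type.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {
    "search_flights": {"origin": str, "destination": str, "date": str},
    "book_flight": {"flight_id": str, "passengers": int},
}


def check_tool_call(call: ToolCall, expected: ToolCall) -> list[str]:
    """Return a list of problems; an empty list means the call is correct."""
    if call.name != expected.name:
        return [f"wrong tool: {call.name!r} (expected {expected.name!r})"]

    problems: list[str] = []
    schema = TOOL_SCHEMAS.get(call.name, {})
    for arg, arg_type in schema.items():
        if arg not in call.args:
            problems.append(f"missing argument: {arg}")
        elif not isinstance(call.args[arg], arg_type):
            problems.append(f"bad type for {arg}: {type(call.args[arg]).__name__}")
        elif arg in expected.args and call.args[arg] != expected.args[arg]:
            problems.append(f"wrong value for {arg}: {call.args[arg]!r}")
    # Arguments the schema does not know about are flagged as well.
    for arg in call.args:
        if arg not in schema:
            problems.append(f"unexpected argument: {arg}")
    return problems


expected = ToolCall("book_flight", {"flight_id": "NH10", "passengers": 2})
actual = ToolCall("book_flight", {"flight_id": "NH10", "passengers": "2"})
print(check_tool_call(actual, expected))
```

Returning a list of problems rather than a single boolean keeps the result useful for diagnosis: the same grader can feed a pass/fail score and a failure report.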

4. Efficiency (steps, cost)

How many steps, tokens, dollars, and seconds did the run take? A correct answer is impractical if the cost balloons. Measuring this means pulling in the metrics your observability tooling already records.
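One way to turn these numbers into a gradeable signal is a per-run budget. The sketch below assumes the metrics have already been extracted from a trace; the `RunMetrics` and `Budget` types and the default limits are illustrative values, not recommendations from any vendor.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    steps: int
    tokens: int
    cost_usd: float
    latency_s: float


@dataclass
class Budget:
    # Illustrative limits only; tune them to your own tasks.
    max_steps: int = 20
    max_tokens: int = 50_000
    max_cost_usd: float = 0.50
    max_latency_s: float = 120.0


def over_budget(m: RunMetrics, b: Budget) -> list[str]:
    """Return the name of every budget the run exceeded."""
    checks = {
        "steps": m.steps > b.max_steps,
        "tokens": m.tokens > b.max_tokens,
        "cost_usd": m.cost_usd > b.max_cost_usd,
        "latency_s": m.latency_s > b.max_latency_s,
    }
    return [name for name, exceeded in checks.items() if exceeded]


# A run that looped: too many steps and too expensive.
print(over_budget(RunMetrics(steps=40, tokens=1_000, cost_usd=0.90, latency_s=10.0), Budget()))
```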

5. Final-response quality

Is the output relevant, accurate, and complete? Score open-ended content with LLM-as-judge or a rubric.

Note: dimension 4, efficiency (tokens, cost, latency), isn't formally codified as an "eval metric" by any one vendor; in practice these are often observability signals brought into evaluation. Even so, it's an essential dimension for stopping an agent that loops and runs away.

4. How to grade: 3 graders and "outcome vs path"

There are broadly three kinds of grader. Following Anthropic's framing, each has trade-offs.

| Grader | Strengths | Weaknesses |
| --- | --- | --- |
| Code (programmatic) | Fast, cheap, objective, reproducible | Brittle to valid variations / alternatives |
| LLM-as-judge (model) | Flexible, scalable, captures nuance | Non-deterministic, pricier, needs calibration with humans |
| Human | Gold standard for quality | Expensive, slow (avoid if possible) |

The standard play: grade what you can with code, hand only the subjective, open-ended parts to a different model as LLM-as-judge, and use humans for spot-checks at key points. The design of LLM-as-judge (detailed rubrics, discrete outputs, judge bias) is covered in depth in the LLM evals article.
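The "code first, judge only where needed" routing might look like the sketch below. It assumes each task carries an `expected_state` dictionary and an optional `rubric`, and that the judge is a callable you supply. All of these names are hypothetical, and `stub_judge` is a keyword stand-in for a real call to a separate model.

```python
from typing import Callable, Optional

# A judge takes (rubric, response) and returns a discrete verdict.
Judge = Callable[[str, str], bool]


def grade(task: dict, result: dict, judge: Optional[Judge] = None) -> dict:
    """Run the cheap code check first; call the judge only for open-ended parts."""
    # 1. Programmatic check on the final state.
    state_ok = all(
        result["state"].get(key) == value
        for key, value in task["expected_state"].items()
    )
    if not state_ok:
        return {"passed": False, "graded_by": "code"}

    # 2. Open-ended quality goes to the judge only when a rubric exists.
    rubric = task.get("rubric")
    if rubric is None:
        return {"passed": True, "graded_by": "code"}
    if judge is None:
        raise ValueError("task has a rubric but no judge was provided")
    return {"passed": judge(rubric, result["response"]), "graded_by": "code+judge"}


def stub_judge(rubric: str, response: str) -> bool:
    # Stand-in only: a real judge would send the rubric and response to a model.
    return "refund" in response.lower()


task = {"expected_state": {"ticket_status": "closed"}, "rubric": "Mentions the refund"}
result = {"state": {"ticket_status": "closed"}, "response": "Your refund is on its way."}
print(grade(task, result, stub_judge))
```

Short-circuiting on the code check means a run that fails on final state never incurs the cost or the non-determinism of a judge call.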

Rote "trajectory matching" is brittle

So how do you grade the trajectory? Here Anthropic takes a strong stance: "There is a common instinct to check that agents followed very specific steps like a sequence of tool calls in the right order. We've found this approach too rigid and results in overly brittle tests, as agents regularly find valid approaches that eval designers didn't anticipate. So as not to unnecessarily punish creativity, it's often better to grade what the agent produced, not the path it took." For a flight booking, for instance, you measure whether a reservation actually exists in the environment's SQL DB as the final state—not the utterance "I booked it."
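As a sketch of what grading the final state can look like, the example below queries an in-memory SQLite database in place of the environment's SQL DB. The `reservations` table, its columns, and the helper names are assumptions made for illustration.

```python
import sqlite3


def make_env() -> sqlite3.Connection:
    """Create a throwaway environment DB (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reservations (user_id TEXT, flight_id TEXT, status TEXT)"
    )
    return conn


def reservation_exists(conn: sqlite3.Connection, user_id: str, flight_id: str) -> bool:
    """Outcome check: is there exactly one confirmed reservation in the final state?"""
    row = conn.execute(
        "SELECT COUNT(*) FROM reservations "
        "WHERE user_id = ? AND flight_id = ? AND status = 'confirmed'",
        (user_id, flight_id),
    ).fetchone()
    # Requiring exactly one row also catches accidental duplicate bookings.
    return row[0] == 1


conn = make_env()
# The agent may have said "I booked it", but nothing was written yet.
print(reservation_exists(conn, "u1", "NH10"))  # False
conn.execute("INSERT INTO reservations VALUES ('u1', 'NH10', 'confirmed')")
print(reservation_exists(conn, "u1", "NH10"))  # True
```

The grader never inspects the agent's transcript, so any valid sequence of tool calls that produces the row passes.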

Meanwhile, Google and Microsoft offer trajectory-match degrees (exact / in-order / any-order, etc.) as formal metrics. The two aren't contradictory—trajectory evals are good at diagnosing "why it failed," and outcome evals avoid punishing valid creativity. In practice, the realistic middle ground is to avoid strict exact match and loosen to a key-action check: "were the critical tools called?"
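The match degrees can be sketched over plain lists of tool names. The definitions below are one plausible reading of exact / in-order / any-order matching plus the looser key-action check; they are not the exact metric implementations of any vendor.

```python
from collections import Counter


def exact_match(actual: list[str], expected: list[str]) -> bool:
    """Same tools, same order, nothing extra."""
    return actual == expected


def in_order_match(actual: list[str], expected: list[str]) -> bool:
    """Expected tools appear in order; extra calls in between are allowed."""
    remaining = iter(actual)
    # `step in remaining` advances the iterator, so order is enforced.
    return all(step in remaining for step in expected)


def any_order_match(actual: list[str], expected: list[str]) -> bool:
    """Every expected tool was called often enough, in any order."""
    have, need = Counter(actual), Counter(expected)
    return all(have[tool] >= count for tool, count in need.items())


def key_actions_called(actual: list[str], critical: set[str]) -> bool:
    """Loosest check: were the critical tools called at all?"""
    return critical.issubset(actual)


actual = ["search", "read_file", "search", "db.write"]
expected = ["search", "db.write"]
print(exact_match(actual, expected))               # False
print(in_order_match(actual, expected))            # True
print(any_order_match(actual, expected))           # True
print(key_actions_called(actual, {"db.write"}))    # True
```

The run above fails exact match only because of an extra `read_file` and a repeated `search`, which is the kind of valid variation that makes strict matching brittle.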

5. Problems unique to agent evals

Agent evals carry difficulties that single-output evaluation doesn't.

  • Non-determinism (the same input takes different paths): one success doesn't mean it reproduces. You need reliability metrics like whether it succeeds across all k runs (pass^k). The τ-bench paper reports that "models degrade considerably as k increases, revealing their unreliability" (the figures are point-in-time).
  • Compounding errors: if a single step succeeds with probability p, then t independent steps all succeed with probability roughly p^t. For example, p = 0.95 over 20 steps gives about 0.36. The longer the chain, the faster it collapses, which is why success drops sharply on long-horizon tasks.
  • Reward hacking / specification gaming: behavior that satisfies the letter of the grader without achieving the real goal. In DeepMind's example, a robot arm positioned itself between the camera and the object, fooling evaluators into thinking it had grasped the item when it hadn't. Catching "looks right but dangerous path" requires evaluating the trajectory and side effects.
  • Eval sets going stale / contamination: when a benchmark leaks into training data (contamination), scores stop reflecting real ability. You have to keep updating your regression evals as the agent matures.
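The reliability arithmetic behind the first two bullets can be sketched in a few lines. The pass^k estimator below assumes n recorded trials per task with c successes and computes the chance that k trials drawn from them all succeeded; treat it as an illustration of the idea rather than a reproduction of any benchmark's scoring code. The compounding function assumes steps succeed independently.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate pass^k for one task: C(c, k) / C(n, k)."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # comb(c, k) is 0 when k > c, so the estimate drops to 0 as expected.
    return comb(num_successes, k) / comb(num_trials, k)


def chain_success(p: float, t: int) -> float:
    """Probability that t independent steps all succeed, each with probability p."""
    return p ** t


# A task that passed 6 of 8 trials looks fine at k=1 and much worse at k=4.
print(pass_hat_k(8, 6, 1))   # 0.75
print(round(pass_hat_k(8, 6, 4), 3))   # 0.214
# 95% per-step reliability over a 20-step task.
print(round(chain_success(0.95, 20), 2))   # 0.36
```

To report a suite-level number, average the per-task estimate across tasks.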

The "dangerous path" problem is continuous with AI guardrails. An eval that looks only at the final answer walks right past these traps.

6. The workflow and benchmarks

Anchored on Anthropic's recommendations, the workflow is simple.

  1. Build small, from real failures: you don't need hundreds. Turn 20–50 failures that happened in production into test cases.
  2. Run automated grading: code first, LLM-as-judge only for the open-ended parts. Prefer many automatically graded tasks over a few carefully hand-graded ones.
  3. Separate two kinds: capability evals (what is it good at?) and regression evals (can it still do what it used to?).
  4. Put it on a lifecycle: ① pre-launch automated evals (built into CI) → ② production monitoring → ③ A/B testing → ④ user feedback and trace review, layered up.
  5. Write them early: evals get harder to build the longer you wait.
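The steps above can be sketched as a tiny harness that keeps the two suites separate and gates CI on regressions only. Everything here is hypothetical scaffolding: `EvalTask`, `run_suites`, `ci_gate`, and the `stub_agent` that stands in for a real agent run returning its final state.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    task_id: str
    prompt: str
    check: Callable[[dict], bool]  # code grader over the final state
    suite: str                     # "regression" or "capability"


def run_suites(tasks: list[EvalTask], agent: Callable[[str], dict]) -> dict[str, float]:
    """Run every task once and return the pass rate per suite."""
    passed: dict[str, int] = {}
    total: dict[str, int] = {}
    for task in tasks:
        final_state = agent(task.prompt)
        total[task.suite] = total.get(task.suite, 0) + 1
        passed[task.suite] = passed.get(task.suite, 0) + int(task.check(final_state))
    return {suite: passed[suite] / total[suite] for suite in total}


def ci_gate(rates: dict[str, float], regression_floor: float = 1.0) -> bool:
    """Block the build only when the regression suite falls below its floor."""
    return rates.get("regression", 1.0) >= regression_floor


def stub_agent(prompt: str) -> dict:
    # Stand-in for a real agent run; returns a fake final state.
    return {"booked": "book" in prompt}


tasks = [
    EvalTask("r1", "book flight NH10", lambda s: s["booked"], "regression"),
    EvalTask("c1", "rebook after cancellation", lambda s: s.get("rebooked", False), "capability"),
]
rates = run_suites(tasks, stub_agent)
print(rates, ci_gate(rates))
```

Gating only on the regression suite lets capability evals stay hard, with a low pass rate, without breaking the build. Given the non-determinism discussed earlier, a real harness would likely run each task several times rather than once.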

Well-known agent benchmarks are also useful references for building your own evals (the key is to read "what each measures"; scores move by model and version, so don't take them at face value).

| Benchmark | What it measures |
| --- | --- |
| SWE-bench / Verified | Resolve real GitHub issues with a patch, graded pass/fail by the test suite (execution-based) |
| τ-bench / τ²-bench | Multi-turn tool×user dialogue in retail, airline, etc. + policy following; graded on final DB state |
| WebArena | Autonomous web-operation task completion on realistic site replicas |
| GAIA | General-assistant tasks easy for humans, hard for AI (reasoning + tools + browsing) |
| OSWorld | Computer use operating a GUI on a real OS, evaluated execution-based |
| BFCL | Accuracy of function/tool calling (function names, argument structure, executability) |

As for tooling, LangSmith, Braintrust, DeepEval, and Arize Phoenix support trajectory and tool-call evaluation. Most build on traces, scoring at the step, run, and dataset levels. Note that Claude Managed Agents ships outcomes-based grading—where a separate grader evaluates against your rubric—built in.

Summary

Agent evals are the process of measuring whether a tool-using, multi-step agent can actually accomplish its tasks. Unlike LLM evals, which look at a single output, they examine both "the final answer (outcome)" and "the road walked (trajectory)." The dimensions are ① outcome ② trajectory ③ tool use ④ efficiency ⑤ final quality. Grade with code → LLM-as-judge → human, and Anthropic recommends "grade the outcome (final state), not the path" (rote step-checking is brittle).

The unique pitfalls are non-determinism (pass^k), compounding errors, reward hacking, and stale eval sets. In practice, the standard play is to start small with 20–50 cases from real failures, run automated grading in CI, separate capability and regression evals, and write them early. Related: AI evals, observability, how to build a multi-agent system, Managed Agents.

FAQ

Q. What are agent evals?
A. The process of systematically measuring whether a tool-using, multi-step agent can actually accomplish its tasks. They're an evolution of LLM evals, which score a single prompt; the target expands from "one output" to "a sequence of actions." The hallmark is looking not just at the final answer but at the trajectory that led there (which tools were called, how).

Q. How do they differ from ordinary LLM evals?
A. In whether you look at "an output" or "a chain of actions." Because an agent plans, calls tools, and updates state, the final output alone isn't enough. A correct answer can hide a broken process, and a right answer may have come via a dangerous shortcut. So you evaluate both the outcome (final state) and the trajectory (process).

Q. What should I measure?
A. The common five dimensions: ① outcome (task success = does the final state match the goal?) ② trajectory (reasonable steps?) ③ tool-use correctness (right tool, right arguments?) ④ efficiency (steps, tokens, cost, latency) ⑤ final-response quality (relevant, accurate, complete?). Dimension 4 brings in observability signals and is important for stopping runaways.

Q. Should I check the trajectory (steps) for exact match?
A. No. Strict exact match tends to be brittle. Anthropic's position, in paraphrase, is that checking whether tool calls followed a specific order is too rigid: agents regularly find valid approaches the eval designer didn't anticipate, so it's often better to grade what the agent produced, not the path it took. In practice, avoid exact match and loosen to a key-action check: "were the critical tools called?" That said, the trajectory is useful for diagnosing why a run failed, so use each where it fits.

Q. How do I get started?
A. Begin by turning 20–50 production failures into test cases. You don't need hundreds; as Anthropic puts it, "20–50 simple tasks drawn from real failures is a great start." Grade automatically, using code for what code can measure and a separate-model LLM-as-judge only for the open-ended parts, and put the evals in CI to catch regressions. Separate capability evals (what it's good at) from regression evals (keeping what worked), and write your evals early.