After you build an AI agent, you always hit the same wall: "OK, but is it actually working?" You changed the prompt, swapped the model, added a tool. Evals (evaluations) are the mechanism for deciding, with data instead of gut feel, whether that change made things better or worse.

An LLM can produce a different output every time for the same input (it's probabilistic). So shipping on "seems to work" leads to silent regressions and edge-case failures in production. This article covers what evals are, five ways to measure quality, agent-specific evaluation, and how to start small — written for practitioners.

The bottom line, in 30 seconds

If you only read one thing

What evals are
A scoring mechanism that measures AI output quality with numbers. Judge with data, not gut feel.
Why you need them
LLMs are probabilistic and vary. Unit tests don't fit well, and regressions slip through.
Where to start
Start with an eval set of about 20 items. Even a handful of cases makes "better/worse per change" visible.

1. Why you need evals

Ordinary software is deterministic: same input, same output. That's why a unit test that checks "does the output match the expected value" works. But an LLM is probabilistic: even the same question comes back worded or framed a little differently each time. To borrow the framing used when comparing AI agents with RPA, an LLM is not a deterministic "hand" but a probabilistic "brain," so exact-match tests don't work as-is.

Three failure modes tend to show up here.

😵 Gut-feel debugging

You try a couple of examples by hand and decide it "feels better." You never notice that another case broke.

🐛 Silent regression

You change a prompt or model and only one kind of input gets worse. You find out from a production complaint.

🎲 Non-reproducible bugs

"It sometimes returns something weird." Because it's probabilistic, one try won't reproduce it, so you can't trace the cause.

Evals prevent all three at once. Prepare an evaluation dataset, score the whole set on every change, and compare the scores — that alone turns "gut feel" into "data" and makes regressions visible. The more judgment you delegate to an agent, the more evals become the foundation of quality, right alongside guardrails.

2. What evals are

Evals (evaluations) = measuring whether an AI's output or an agent's behavior works correctly and stably, as expected. In human terms, it's grading. The building blocks are simple and break down into three parts.

① Dataset

The set of inputs you evaluate on. Gather real usage examples, past logs, and expected edge cases.

② Scorer

How you turn output into a score: exact match, rule checks, or grading by another AI.

③ Run & compare

Score the whole set and compare before vs. after a change to decide better or worse.

Evals aren't "build once and done" — the essence is running them as a regression test every time you change a prompt, model, or tool. Like test code, it's an asset you grow.
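As a rough illustration of the three parts, here is a minimal sketch in Python. The two classifier functions are hypothetical stand-ins for "before" and "after" versions of a prompt; in a real setup they would call your model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 when the normalized output equals the expected label."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(system: Callable[[str], str], dataset: list[EvalCase],
             scorer: Callable[[str, str], float]) -> float:
    """Run every case through the system and return the mean score."""
    scores = [scorer(system(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for two versions of the same classifier prompt.
def classifier_v1(text: str) -> str:
    return "refund" if "refund" in text.lower() else "other"


def classifier_v2(text: str) -> str:
    lowered = text.lower()
    return "refund" if "refund" in lowered or "money back" in lowered else "other"


# (1) Dataset
dataset = [
    EvalCase("I want a refund", "refund"),
    EvalCase("Can I get my money back?", "refund"),
    EvalCase("Where is my order?", "other"),
    EvalCase("Refund please!", "refund"),
]

# (2) Scorer + (3) run and compare
before = run_eval(classifier_v1, dataset, exact_match)
after = run_eval(classifier_v2, dataset, exact_match)
print(f"before={before:.2f} after={after:.2f}")  # before=0.75 after=1.00
```

The point of the shape is that the dataset and scorer stay fixed while the system under test changes, so the two numbers are comparable.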

3. Five ways to measure quality

There are five representative scoring approaches. The rule of thumb is to pick by the nature of the task and combine several.

① Ground-truth matching

Prepare the expected output (gold label) for each input and score by match rate. Best for tasks with a fixed answer: classification, extraction, yes/no.

② Rule-based checks

Mechanically check regex, exact match, JSON validity, presence of required keys. Strong for verifying "must always return this format" — fast and cheap.
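A rule-based check might look like the sketch below. The required keys and the order ID pattern are made-up examples of a format contract, not part of any real schema.

```python
import json
import re

REQUIRED_KEYS = {"intent", "order_id"}         # hypothetical schema
ORDER_ID_PATTERN = re.compile(r"ORD-\d{6}")    # hypothetical ID format


def check_format(raw_output: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    errors = []
    missing = REQUIRED_KEYS - set(data)
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    order_id = data.get("order_id")
    if order_id is not None and not ORDER_ID_PATTERN.fullmatch(str(order_id)):
        errors.append("order_id does not match expected pattern")
    return errors


print(check_format('{"intent": "refund", "order_id": "ORD-123456"}'))  # []
print(check_format('{"intent": "refund"}'))
```

Returning the list of violations rather than a bare pass/fail makes it easier to see which rule a change broke.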

③ LLM-as-judge

Have another LLM grade against a rubric. For tasks where the answer isn't unique: summary quality, tone, relevance.
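One way to wire this up is sketched below. It assumes a hypothetical `call_model` function that sends a prompt to whichever judge model you use and returns its text reply; a stub stands in for it here so the example runs offline. The rubric wording and the `SCORE:` reply format are illustrative choices.

```python
import re
from typing import Callable, Optional

# Illustrative rubric; a real one should be written for your task.
RUBRIC = """Score the summary from 1 to 5.
5: covers every key fact, nothing invented.
3: covers the main point but omits a key fact.
1: misses the main point or invents facts.
Reply with one line: SCORE: <number>"""


def build_judge_prompt(source: str, summary: str) -> str:
    return f"{RUBRIC}\n\nSource:\n{source}\n\nSummary:\n{summary}"


def parse_score(reply: str) -> Optional[int]:
    """Extract the 1-5 score; return None if the judge ignored the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(call_model: Callable[[str], str], source: str, summary: str) -> Optional[int]:
    return parse_score(call_model(build_judge_prompt(source, summary)))


# Stub in place of a real model call (assumption for this sketch).
def fake_judge_model(prompt: str) -> str:
    return "SCORE: 4"


print(judge(fake_judge_model, "Long source text", "Short summary"))  # 4
```

Treating an unparseable reply as `None` instead of guessing a score keeps judge failures visible in the results.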

④ Regression testing

Compare scores on the same dataset before and after a prompt/model change. Catches a "hidden regression" where the overall score rises but one subset of cases gets worse.
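A small sketch of how a hidden regression can be surfaced, assuming each result record carries a hypothetical `category` tag alongside its score. The numbers are invented for illustration.

```python
from collections import defaultdict


def by_category(results: list[dict]) -> dict[str, float]:
    """Mean score per category for a list of {'category', 'score'} records."""
    buckets = defaultdict(list)
    for record in results:
        buckets[record["category"]].append(record["score"])
    return {cat: sum(scores) / len(scores) for cat, scores in buckets.items()}


def find_regressions(before: list[dict], after: list[dict],
                     tolerance: float = 0.0) -> dict[str, tuple]:
    """Categories whose mean score dropped by more than `tolerance`."""
    b, a = by_category(before), by_category(after)
    return {cat: (b[cat], a[cat])
            for cat in b if cat in a and a[cat] < b[cat] - tolerance}


def overall(results: list[dict]) -> float:
    return sum(r["score"] for r in results) / len(results)


before = (
    [{"category": "billing", "score": s} for s in (0.0, 1.0, 0.0, 1.0)]
    + [{"category": "shipping", "score": s} for s in (1.0, 1.0)]
)
after = (
    [{"category": "billing", "score": s} for s in (1.0, 1.0, 1.0, 1.0)]
    + [{"category": "shipping", "score": s} for s in (1.0, 0.0)]
)

print(f"overall: {overall(before):.2f} -> {overall(after):.2f}")  # 0.67 -> 0.83
print(find_regressions(before, after))  # {'shipping': (1.0, 0.5)}
```

Here the overall score improves while the shipping cases get worse, which is exactly the situation a single aggregate number hides.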

⑤ Production monitoring

Continuously score and observe live logs. Watch failure rate, cost, latency, and input drift to catch degradation early.
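For failure rate specifically, a rolling window is one simple way to watch for degradation. The sketch below is a toy in-memory version; the window size and threshold are arbitrary, and a real deployment would more likely feed these signals into an existing monitoring stack.

```python
from collections import deque


class RollingFailureRate:
    """Track the failure rate over the most recent `window` requests."""

    def __init__(self, window: int, alert_threshold: float):
        self.outcomes = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    @property
    def rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noise from tiny samples.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.rate > self.alert_threshold


monitor = RollingFailureRate(window=4, alert_threshold=0.25)
for failed in (False, False, True, True):
    monitor.record(failed)
print(monitor.rate, monitor.should_alert())  # 0.5 True
```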

Method | Fits | Cost | Objectivity
① Ground-truth | Classification, extraction, decisions | Low | ◎ High
② Rule-based | Format / structure checks | Low | ◎ High
③ LLM-as-judge | Summary, generation, dialogue quality | Med | ○ Depends on rubric
④ Regression | Detecting regressions from changes | Med | ◎ Relative
⑤ Production monitoring | Detecting live degradation | Med–High | ○ Ongoing

The key is the layering: "measure mechanically what you can (① ②), use LLM-as-judge for quality you can't (③), and keep running it through regression and production (④ ⑤)." LLM-as-judge (③) is handy, but the judging LLM itself varies, so write out the rubric explicitly and, where possible, calibrate against human grades.
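Calibrating against human grades can be as simple as grading the same sample twice and measuring agreement. The sketch below uses raw agreement plus Cohen's kappa, which is one common choice for this (the article itself does not prescribe a metric). The labels are invented.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the judge gave the same label as the human."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    p_observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    p_expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both graders used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

print(f"agreement={agreement_rate(human, judge):.2f}")  # agreement=0.75
print(f"kappa={cohens_kappa(human, judge):.2f}")        # kappa=0.47
```

Looking at the individual disagreements is usually more useful than the number itself, since they show where the rubric is ambiguous.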

4. Agent-specific evaluation

For a single response (one input → one output), the five above suffice. But an AI agent takes multiple steps, calls tools itself, and makes decisions along the way. So you must evaluate not just the final output but the process.

🎯 Task success rate

Did it achieve the goal in the end (e.g., booked the right reservation)? The primary agent metric.

🛠️ Correct tool calls

Did it call the right tool, with the right arguments, in the right order? Catch wrong or redundant calls.
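A tool-call check can be sketched as a comparison between the calls you expected and the calls the agent actually made. The tool names and arguments below are hypothetical, and the sketch assumes argument values are hashable (strings, numbers).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # sorted (key, value) pairs, so calls are hashable and comparable


def call(name: str, **args) -> ToolCall:
    return ToolCall(name, tuple(sorted(args.items())))


def score_tool_calls(expected: list[ToolCall], actual: list[ToolCall]) -> dict:
    """Compare an agent's tool calls against the expected sequence."""
    return {
        "exact_sequence": actual == expected,   # right tools, args, and order
        "missing": [c for c in expected if c not in actual],
        "unexpected": [c for c in actual if c not in expected],
        "redundant": len(actual) - len(set(actual)),  # identical repeated calls
    }


expected = [call("search_flights", dest="NRT"), call("book_flight", flight_id="JL5")]
actual = [
    call("search_flights", dest="NRT"),
    call("search_flights", dest="NRT"),     # repeated call
    call("book_flight", flight_id="JL6"),   # wrong argument
]
print(score_tool_calls(expected, actual))
```

Exact sequence matching is strict; for tasks with several valid orders, the `missing` and `unexpected` lists may be the more useful signals.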

🧭 Trajectory

Is the path of steps and decisions reasonable? Evaluate detours, infinite loops, and needless retries.
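As a rough first pass, trajectory problems such as loops can be flagged with simple heuristics over the recorded steps. The sketch below assumes each step has been reduced to a string signature like `search(q=tokyo)`; the limits are arbitrary defaults.

```python
from collections import Counter


def trajectory_flags(steps: list[str], max_steps: int = 10,
                     max_repeats: int = 2) -> list[str]:
    """Flag suspicious patterns in a list of step signatures."""
    flags = []
    if len(steps) > max_steps:
        flags.append(f"too many steps: {len(steps)} > {max_steps}")
    for step, count in Counter(steps).items():
        if count > max_repeats:
            flags.append(f"possible loop: {step!r} repeated {count} times")
    return flags


steps = ["search(q=tokyo)", "search(q=tokyo)", "search(q=tokyo)", "answer"]
print(trajectory_flags(steps))
# ["possible loop: 'search(q=tokyo)' repeated 3 times"]
```

Heuristics like these only catch the obvious cases; judging whether a path was reasonable often still needs a rubric or a human look.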

💰 Cost and steps

For the same success rate, fewer tokens, fewer steps, and lower latency are better. These costs add up quickly in production.

Observing these requires tracing that records every step (input, thinking, tool call, result). Many frameworks and the tools below ship tracing and evaluation together. For a multi-agent setup, keep hierarchical traces so you can pinpoint which agent failed.
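To make the idea concrete, here is a minimal in-memory trace structure. The field names are assumptions for illustration and do not match any particular framework's trace format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    agent: str                    # which agent produced this step
    kind: str                     # "thinking", "tool_call", or "result"
    content: str
    tokens: int = 0
    latency_ms: float = 0.0
    error: Optional[str] = None


@dataclass
class Trace:
    task_input: str
    steps: list[Step] = field(default_factory=list)

    def summary(self) -> dict:
        """Aggregate cost metrics and list the agents that recorded an error."""
        return {
            "steps": len(self.steps),
            "tokens": sum(s.tokens for s in self.steps),
            "latency_ms": sum(s.latency_ms for s in self.steps),
            "failed_agents": sorted({s.agent for s in self.steps if s.error}),
        }


trace = Trace("Book a table for two at 7pm")
trace.steps.append(Step("planner", "thinking", "Need availability first", tokens=40, latency_ms=120.0))
trace.steps.append(Step("booker", "tool_call", "check_availability(time=19:00)", tokens=25, latency_ms=800.0))
trace.steps.append(Step("booker", "result", "", latency_ms=50.0, error="timeout"))
print(trace.summary())
```

Tagging every step with the agent that produced it is what lets you answer "which agent failed" in a multi-agent run.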

5. How to start — build small

You don't need a perfect eval platform from day one. Starting with a 20-item dataset is realistic.

  1. Collect failure examples: first, 10–20 "inputs that went wrong." Real logs and complaints are a gold mine — this is the core of the eval set.
  2. Write the expected behavior: attach a "correct answer" or "conditions to satisfy" to each input. Not everything needs a strict answer (measure quality with ③).
  3. Pick a scorer: format checks → ② rule-based; fixed answer → ① ground-truth; quality → ③ LLM-as-judge. One or two to start is fine.
  4. Run once and baseline: record the current score. That's your reference point.
  5. Run it on every change: after a prompt/model change, re-run and compare with ④ regression. If it drops, don't ship.
  6. Add observation in production: once live, keep watching failure rate and cost with ⑤ monitoring, and feed bad real examples back into the eval set.
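Steps 4 and 5 above can be sketched as a baseline file plus a gate. The file name and the JSON layout are arbitrary choices for this example, and the `tolerance` parameter is an assumption added to absorb small run-to-run noise.

```python
import json
import tempfile
from pathlib import Path


def save_baseline(path: Path, score: float) -> None:
    """Step 4: record the current score as the reference point."""
    path.write_text(json.dumps({"score": score}))


def gate(path: Path, new_score: float, tolerance: float = 0.0) -> bool:
    """Step 5: True when the new score is not worse than the baseline
    by more than `tolerance`, False when the change should not ship."""
    baseline = json.loads(path.read_text())["score"]
    return new_score >= baseline - tolerance


with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"
    save_baseline(baseline_path, 0.75)
    print(gate(baseline_path, 0.80))  # True: score went up
    print(gate(baseline_path, 0.70))  # False: score dropped
```

In practice the baseline file would live in the repository next to the eval set, so the reference point is versioned along with the prompts.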

💡 Tip: weight your eval set toward "failures you don't want to happen" rather than "common successes." Including edge cases, adversarial inputs, and vague requests lets you guard proactively against what breaks on change. A good rubric, like good prompt design, gets more reproducible the more concrete it is.

6. Common pitfalls

  • Dataset too small / too skewed: collecting only successes misses real-world failures. Deliberately mix in failures and edge cases.
  • Blindly trusting LLM-as-judge: the judging LLM also varies and has biases. Write the rubric explicitly and periodically calibrate against human grades. Beware self-dealing (the same model writes and praises its own output).
  • Looking only at final output: process is everything for agents. Without tool calls, trajectory, and cost, you'll bless a "got lucky" result.
  • Deciding on one run: since it's probabilistic, for important evals run several times and look at the variance.
  • Not updating the evals: specs and usage change. Keep adding new production failures to the eval set.
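For the "deciding on one run" pitfall, a small wrapper that repeats the eval and reports the spread is often enough. In the sketch below, `noisy_eval` is a hypothetical stand-in for one full pass over your eval set.

```python
import random
import statistics
from typing import Callable


def repeated_eval(run_once: Callable[[], float], n_runs: int = 5) -> dict:
    """Run the same eval several times and summarize the spread of scores."""
    scores = [run_once() for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


rng = random.Random(0)  # seeded so the sketch is repeatable


def noisy_eval() -> float:
    # Hypothetical stand-in: a true score of 0.8 with some run-to-run noise.
    return 0.8 + rng.uniform(-0.05, 0.05)


print(repeated_eval(noisy_eval))
```

If the difference between two prompt versions is smaller than the spread across runs, a single comparison cannot tell them apart.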

7. Key tools

You can start with your own scripts, but there's a growing set of dedicated tools that handle tracing and evaluation together. Representative examples are listed below.

Tool | What it does
Anthropic Console / Evals | Test and evaluate prompts for Claude in a UI. Also for comparing model choices.
OpenAI Evals | An OSS framework to define and run evals. The basic dataset + scorer shape.
LangSmith | Tracing + evaluation. Records each agent step, through regression and production monitoring.
Langfuse | OSS LLM observability. Tracing, evaluation, and cost monitoring together.
Ragas | Evaluation specialized for RAG (retrieval-augmented generation): relevance, faithfulness, and more.

Whichever you use, the essence is the same: a dataset + a scorer + the discipline of comparing. Tools just make that easier. The best start is one small eval set, even in a script on your machine.

Summary

  • What evals are: "grading" that measures AI output and behavior with numbers — deciding better/worse with data, not gut feel.
  • Why you need them: LLMs are probabilistic and vary, so unit tests don't fit and regressions and edge cases slip through.
  • Five methods: ① ground-truth ② rule-based ③ LLM-as-judge ④ regression ⑤ production monitoring. Measure mechanically what you can, judge quality with an LLM, and keep running it.
  • Agents need process evaluation too: task success rate, tool calls, trajectory, cost. Tracing is the prerequisite.
  • How to start: 20 failure examples. Baseline them, then run on every change.

Between "I built it" and "it's usable" sits a bridge called evals. If guardrails are the defense that stops runaway behavior, evals are the offense that measures quality and keeps raising it. A single small eval set turns agent development from "gut feel" into engineering.

FAQ

Q. How do evals differ from regular unit tests?

Unit tests check "does the output exactly match the expected value." But an LLM is probabilistic and produces different output each time, so exact match doesn't work as-is. Evals differ by combining measurement suited to probabilistic output — rule-based checks, grading by an LLM, and observing variance across several runs — on top of ground-truth matching.

Q. Can I trust LLM-as-judge (letting an AI grade)?

It's handy but not a silver bullet. The judging LLM can vary and be biased. What matters is writing a concrete rubric, calibrating against human grades periodically, and separating the roles/models for generation and grading to avoid self-dealing. Relative comparison (which of A or B is better) tends to be more stable than absolute scores.

Q. How many eval items do I need?

You can start well with 10–20. Even a few help with the relative comparison of "did the score go up or down after a change." Realistically, grow it by adding failures found in production. More important than count is properly including failures, exceptions, and edge cases.

Q. Do I really need to evaluate an agent's "trajectory"?

If you run it in production, yes. Even when the final output is correct, detours, unnecessary tool calls, and infinite loops hurt cost and reliability. Add tracing that records each step and look at the process alongside task success rate. The more a use case involves permissions and side effects, as in business automation or automated cloud operations, the more process evaluation pays off.