After you build an AI agent, you hit the same wall every time: "OK, but is it actually working?" You changed the prompt, swapped the model, or added a tool. Evals (evaluations) are the mechanism for deciding, with data instead of gut feel, whether that change made things better or worse.
An LLM is probabilistic: the same input can produce a different output on every run. Shipping on "seems to work" therefore leads to silent regressions and edge-case failures in production. This article, written for practitioners, covers what evals are, five ways to measure quality, agent-specific evaluation, and how to start small.
1. Why you need evals
Ordinary software is deterministic: same input, same output. That's why a unit test that checks "does the output match the expected value" works. An LLM is probabilistic: even the same question comes back worded or framed a little differently each time. In the terms of AI agents vs. RPA, it's a probabilistic "brain" rather than a deterministic "hand," so exact-match tests don't work as-is.
Three failure modes tend to show up here.
- Gut-feel judgment: you try a couple of examples by hand and decide it "feels better." You never notice that another case broke.
- Silent regressions: you change a prompt or model and only one kind of input gets worse. You find out from a production complaint.
- Unreproducible failures: "It sometimes returns something weird." Because the output is probabilistic, one try won't reproduce it, so you can't trace the cause.
Evals prevent all three at once. Prepare an evaluation dataset, score the whole set on every change, and compare the scores — that alone turns "gut feel" into "data" and makes regressions visible. The more judgment you delegate to an agent, the more evals become the foundation of quality, right alongside guardrails.
2. What evals are
Evals (evaluations) measure whether an AI's output or an agent's behavior is correct, stable, and as expected. In human terms, it's grading. The building blocks are simple and break down into three parts.
- Dataset: the set of inputs you evaluate on. Gather real usage examples, past logs, and expected edge cases.
- Scorer: how you turn output into a score: exact match, rule checks, or grading by another AI.
- Comparison: score the whole set and compare before vs. after a change to decide better or worse.
Evals aren't "build once and done." The essence is running them as a regression test every time you change a prompt, model, or tool. Like test code, they're an asset you grow.
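The three building blocks fit in a few lines of code. The following is a minimal sketch, not a reference implementation: `EvalCase`, `exact_match`, `run_eval`, and the two toy "agents" are hypothetical names invented for illustration, and the agents are plain functions standing in for a real model call before and after a prompt change.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 if the normalized output equals the expected label."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(system: Callable[[str], str],
             dataset: list[EvalCase],
             scorer: Callable[[str, str], float]) -> float:
    """Score every case and return the mean score over the dataset."""
    scores = [scorer(system(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for the agent before and after a prompt change.
def agent_v1(text: str) -> str:
    return "refund" if "money back" in text else "other"


def agent_v2(text: str) -> str:
    return "refund" if ("money back" in text or "refund" in text) else "other"


DATASET = [
    EvalCase("I want my money back", "refund"),
    EvalCase("Please refund my order", "refund"),
    EvalCase("Where is my parcel?", "other"),
    EvalCase("What are your opening hours?", "other"),
]

before = run_eval(agent_v1, DATASET, exact_match)
after = run_eval(agent_v2, DATASET, exact_match)
print(f"before={before:.2f} after={after:.2f}")  # before=0.75 after=1.00
```

Swapping in a real model only changes the `system` callable; the dataset, the scorer, and the before/after comparison stay the same.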
3. Five ways to measure quality
There are five representative scoring approaches. The rule of thumb is to pick by the nature of the task and combine several.
- ① Ground-truth: prepare the expected output (gold label) for each input and score by match rate. Best for tasks with a fixed answer: classification, extraction, yes/no.
- ② Rule-based: mechanically check regex patterns, exact matches, JSON validity, and the presence of required keys. Strong for verifying "must always return this format," and fast and cheap.
- ③ LLM-as-judge: have another LLM grade against a rubric. For tasks where the answer isn't unique: summary quality, tone, relevance.
- ④ Regression: compare scores on the same dataset before and after a prompt/model change. Catches a "hidden regression" where the overall score rises but a subset of cases gets worse.
- ⑤ Production monitoring: continuously score and observe live logs. Watch failure rate, cost, latency, and input drift to catch degradation early.
| Method | Fits | Cost | Objectivity |
|---|---|---|---|
| ① Ground-truth | Classification, extraction, decisions | Low | ◎ High |
| ② Rule-based | Format / structure checks | Low | ◎ High |
| ③ LLM-as-judge | Summary, generation, dialogue quality | Med | ○ Depends on rubric |
| ④ Regression | Detecting regressions from changes | Med | ◎ Relative |
| ⑤ Production monitoring | Detecting live degradation | Med–High | ○ Ongoing |
The key is the layering: "measure mechanically what you can (① ②), use LLM-as-judge for quality you can't (③), and keep running it through regression and production (④ ⑤)." LLM-as-judge (③) is handy, but the judging LLM itself varies, so write out the rubric explicitly and, where possible, calibrate against human grades.
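As an illustration of the mechanical layer (① and ②), here is a small sketch using only the Python standard library. The required keys and the `ORD-` ID pattern are assumptions made up for this example, not part of any real schema.

```python
import json
import re

REQUIRED_KEYS = {"intent", "order_id"}          # hypothetical schema
ORDER_ID_PATTERN = re.compile(r"^ORD-\d{6}$")   # hypothetical ID format


def rule_check(raw_output: str) -> list[str]:
    """Rule-based check: return violations; an empty list means the output passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top level is not an object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(data))]
    order_id = data.get("order_id")
    if order_id is not None and not ORDER_ID_PATTERN.match(str(order_id)):
        problems.append("order_id has the wrong format")
    return problems


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Ground-truth score: share of predictions that equal the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


print(rule_check('{"intent": "refund", "order_id": "ORD-123456"}'))  # []
print(rule_check('{"intent": "refund"}'))  # ['missing key: order_id']
print(accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # 0.75
```

Returning the list of violations rather than a bare pass/fail makes it easier to see which rule a change broke.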
4. Agent-specific evaluation
For a single response (one input → one output), the five above suffice. But an AI agent takes multiple steps, calls tools itself, and makes decisions along the way. So you must evaluate not just the final output but the process.
- Task success rate: did it achieve the goal in the end (e.g., booked the right reservation)? The primary agent metric.
- Tool calls: did it call the right tool, with the right arguments, in the right order? Catch wrong or redundant calls.
- Trajectory: is the path of steps and decisions reasonable? Evaluate detours, infinite loops, and needless retries.
- Cost: for the same success, fewer tokens, fewer steps, and lower latency are better. It matters in production.
Observing these requires tracing that records every step (input, thinking, tool call, result). Many frameworks and the tools below ship tracing and evaluation together. For a multi-agent setup, keep hierarchical traces so you can pinpoint which agent failed.
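To make the process metrics concrete, the sketch below scores one recorded trace. The `Step` and `Trace` shapes are hypothetical and much simpler than what real tracing tools record, and the tool names are invented; the checks map to task success, tool-call order, immediately repeated calls (a rough signal for loops and needless retries), and cost.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    args: dict
    tokens: int = 0


@dataclass
class Trace:
    steps: list[Step] = field(default_factory=list)
    succeeded: bool = False


def evaluate_trace(trace: Trace, expected_tools: list[str], max_steps: int = 10) -> dict:
    """Score one agent run on outcome, tool use, trajectory, and cost."""
    called = [step.tool for step in trace.steps]
    # An identical call (same tool, same args) repeated back to back
    # suggests a loop or a needless retry.
    repeats = sum(
        1 for prev, cur in zip(trace.steps, trace.steps[1:])
        if prev.tool == cur.tool and prev.args == cur.args
    )
    return {
        "task_success": trace.succeeded,
        "tool_sequence_ok": called == expected_tools,
        "repeated_calls": repeats,
        "within_step_budget": len(trace.steps) <= max_steps,
        "total_tokens": sum(step.tokens for step in trace.steps),
    }


# Hypothetical run: the booking succeeded, but the search was called twice.
trace = Trace(
    steps=[
        Step("search_flights", {"to": "NRT"}, tokens=300),
        Step("search_flights", {"to": "NRT"}, tokens=300),
        Step("book_flight", {"flight_id": "F1"}, tokens=200),
    ],
    succeeded=True,
)
report = evaluate_trace(trace, expected_tools=["search_flights", "book_flight"])
print(report)
```

Here the run succeeds, yet `tool_sequence_ok` is false and `repeated_calls` is 1, which is the kind of "got lucky" result that final-output checks alone would miss.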
5. How to start — build small
You don't need a perfect eval platform from day one. Starting with a 20-item dataset is realistic.
- Collect failure examples: first, 10–20 "inputs that went wrong." Real logs and complaints are a gold mine — this is the core of the eval set.
- Write the expected behavior: attach a "correct answer" or "conditions to satisfy" to each input. Not everything needs a strict answer (measure quality with ③).
- Pick a scorer: format checks → ② rule-based; fixed answer → ① ground-truth; quality → ③ LLM-as-judge. One or two to start is fine.
- Run once and baseline: record the current score. That's your reference point.
- Run it on every change: after a prompt/model change, re-run and compare with ④ regression. If it drops, don't ship.
- Add observation in production: once live, keep watching failure rate and cost with ⑤ monitoring, and feed bad real examples back into the eval set.
💡 Tip: weight your eval set toward "failures you don't want to happen" rather than "common successes." Including edge cases, adversarial inputs, and vague requests lets you guard proactively against what breaks on change. A good rubric, like good prompt design, gets more reproducible the more concrete it is.
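The baseline-then-rerun steps above could be sketched as a per-case comparison. The function name, the case IDs, and the `ok_to_ship` rule (no drop in the mean and no individual case getting worse) are assumptions chosen for illustration; a team might reasonably pick a looser gate.

```python
def compare_runs(baseline: dict[str, float],
                 candidate: dict[str, float],
                 tolerance: float = 0.0) -> dict:
    """Compare per-case scores from two runs over the same eval set."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no shared case ids to compare")
    # Cases whose score dropped by more than the tolerance.
    regressed = [cid for cid in shared if candidate[cid] < baseline[cid] - tolerance]
    base_mean = sum(baseline[cid] for cid in shared) / len(shared)
    cand_mean = sum(candidate[cid] for cid in shared) / len(shared)
    return {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "regressed_cases": regressed,
        "ok_to_ship": cand_mean >= base_mean and not regressed,
    }


# Hypothetical scores: the average improves, but one case that used to pass now fails.
baseline = {"refund-01": 1.0, "refund-02": 1.0, "edge-01": 0.0, "edge-02": 0.0}
candidate = {"refund-01": 1.0, "refund-02": 0.0, "edge-01": 1.0, "edge-02": 1.0}

result = compare_runs(baseline, candidate)
print(result)
# {'baseline_mean': 0.5, 'candidate_mean': 0.75,
#  'regressed_cases': ['refund-02'], 'ok_to_ship': False}
```

Comparing case by case, not only the mean, is what surfaces a hidden regression: the overall score rose from 0.5 to 0.75 while `refund-02` broke.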
6. Common pitfalls
- Dataset too small / too skewed: collecting only successes misses real-world failures. Deliberately mix in failures and edge cases.
- Blindly trusting LLM-as-judge: the judging LLM also varies and has biases. Write the rubric explicitly and periodically calibrate against human grades. Beware self-dealing (the same model writes and praises its own output).
- Looking only at final output: process is everything for agents. Without tool calls, trajectory, and cost, you'll bless a "got lucky" result.
- Deciding on one run: since it's probabilistic, for important evals run several times and look at the variance.
- Not updating the evals: specs and usage change. Keep adding new production failures to the eval set.
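For the "deciding on one run" pitfall, a sketch of repeating an eval and summarizing the spread might look like this. `simulated_eval_run` is a hypothetical stand-in that fakes a noisy eval with a seeded random generator; in practice it would be one full pass of your eval set against the model.

```python
import random
import statistics
from typing import Callable


def summarize_runs(run_once: Callable[[], float], n_runs: int = 5) -> dict:
    """Run the same eval several times and report the spread of scores."""
    if n_runs < 2:
        raise ValueError("need at least two runs to estimate variance")
    scores = [run_once() for _ in range(n_runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


_rng = random.Random(0)  # seeded so the simulation is repeatable


def simulated_eval_run() -> float:
    """Hypothetical eval pass: each of 20 cases passes with probability 0.8."""
    return sum(_rng.random() < 0.8 for _ in range(20)) / 20


summary = summarize_runs(simulated_eval_run, n_runs=5)
print(summary)
```

If the gap between two versions is smaller than the run-to-run spread, the difference is probably noise rather than a real improvement.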
7. Key tools
You can start with your own scripts, but a growing set of dedicated tools handles tracing and evaluation together. Representative examples:
| Tool | What it does |
|---|---|
| Anthropic Console / Evals | Test and evaluate prompts for Claude in a UI. Also for comparing model choices. |
| OpenAI Evals | An OSS framework to define and run evals. The basic dataset + scorer shape. |
| LangSmith | Tracing + evaluation. Records each agent step, through regression and production monitoring. |
| Langfuse | OSS LLM observability. Tracing, evaluation, and cost monitoring together. |
| Ragas | Evaluation specialized for RAG (retrieval-augmented generation): relevance, faithfulness, and more. |
Whichever you use, the essence is the same: a dataset + a scorer + the discipline of comparing. Tools just make that easier. The best start is one small eval set, even in a script on your machine.
Summary
- What evals are: "grading" that measures AI output and behavior with numbers — deciding better/worse with data, not gut feel.
- Why you need them: LLMs are probabilistic and vary, so unit tests don't fit and regressions and edge cases slip through.
- Five methods: ① ground-truth ② rule-based ③ LLM-as-judge ④ regression ⑤ production monitoring. Measure mechanically what you can, judge quality with an LLM, and keep running it.
- Agents need process evaluation too: task success rate, tool calls, trajectory, cost. Tracing is the prerequisite.
- How to start: 20 failure examples. Baseline them, then run on every change.
Between "I built it" and "it's usable" sits a bridge called evals. If guardrails are the defense that stops runaway behavior, evals are the offense that measures quality and keeps raising it. A single small eval set turns agent development from "gut feel" into engineering.
FAQ
Q. How do evals differ from regular unit tests?
Unit tests check "does the output exactly match the expected value." But an LLM is probabilistic and produces different output each time, so exact match doesn't work as-is. Evals keep ground-truth matching where it applies and add measurements suited to probabilistic output: rule-based checks, grading by an LLM, and observing variance across several runs.
Q. Can I trust LLM-as-judge (letting an AI grade)?
It's handy but not a silver bullet. The judging LLM can vary and be biased. What matters is writing a concrete rubric, calibrating against human grades periodically, and separating the roles/models for generation and grading to avoid self-dealing. Relative comparison (which of A or B is better) tends to be more stable than absolute scores.
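Calibrating a judge against human grades can be as simple as measuring how often the two agree on a small labeled sample. In this sketch the rubric, the samples, and `stub_judge` are all hypothetical; `stub_judge` is a keyword check standing in for a real LLM call so that the example runs offline.

```python
from typing import Callable

# Hypothetical rubric, written as concretely as possible.
RUBRIC = (
    "PASS if the summary mentions the refund amount and the deadline; "
    "otherwise FAIL."
)


def stub_judge(rubric: str, answer: str) -> str:
    """Stand-in for a real LLM judge; replace with a call to your model client."""
    text = answer.lower()
    return "PASS" if "refund" in text and "deadline" in text else "FAIL"


def calibrate(judge: Callable[[str, str], str],
              rubric: str,
              samples: list[tuple[str, str]]) -> dict:
    """samples holds (answer, human_label) pairs; report judge/human agreement."""
    verdicts = [judge(rubric, answer) for answer, _ in samples]
    mismatches = [
        i for i, (verdict, (_, human)) in enumerate(zip(verdicts, samples))
        if verdict != human
    ]
    return {
        "agreement": 1 - len(mismatches) / len(samples),
        "mismatches": mismatches,
    }


SAMPLES = [
    ("Refund of $20 is due; the deadline is Friday.", "PASS"),
    ("We will look into it.", "FAIL"),
    ("Refund approved.", "FAIL"),
    ("You get your money back by Friday.", "PASS"),
]

calibration = calibrate(stub_judge, RUBRIC, SAMPLES)
print(calibration)  # {'agreement': 0.75, 'mismatches': [3]}
```

The mismatched indices are the useful part: reading those cases shows whether the rubric, the judge, or the human label needs fixing.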
Q. How many eval items do I need?
You can start well with 10–20. Even a few help with the relative comparison of "did the score go up or down after a change." Realistically, grow it by adding failures found in production. More important than count is properly including failures, exceptions, and edge cases.
Q. Do I really need to evaluate an agent's "trajectory"?
If you run it in production, yes. Even when the final output is correct, detours, unnecessary tool calls, and infinite loops hurt cost and reliability. Add tracing that records each step and look at the process alongside task success rate. The more the use case involves permissions and side effects — like business automation use cases or automating cloud operations — the more process evaluation pays off.