Agent evals are the systematic measurement of whether an agent (a system that uses tools and takes multiple steps to reach a goal) can actually accomplish its tasks. They evolve LLM evals by widening the target from "one output" to "a sequence of actions." Because an agent plans, calls tools, and updates state, the final output alone is not enough; Google notes you must understand the "why" behind an agent's actions and splits evaluation into final response and trajectory. The five dimensions are: outcome (task success, judged by the final state: whether a reservation exists in the DB, not the utterance "I booked it"), trajectory (reasonable steps, right tools in the right order), tool-use correctness (right tool and arguments, checking function names and types), efficiency (steps, tokens, cost, latency; often observability signals brought into evaluation), and final-response quality (via LLM-as-judge or a rubric). The three grader types are code (fast, cheap, and reproducible but brittle), LLM-as-judge (flexible but non-deterministic and in need of calibration), and human (the gold standard but expensive, so avoid it where possible). Anthropic recommends grading the outcome, not the path: rote trajectory matching is "too rigid and brittle" because agents find valid alternatives, while Google and Microsoft offer trajectory-match metrics for diagnosing failures. The pitfalls specific to agents are non-determinism (measured with pass^k, the chance that all k repeated trials succeed), compounding errors (with a per-step success rate p, a t-step task succeeds with probability of roughly p^t), reward hacking (DeepMind's robot arm faking a grasp), and stale or contaminated eval sets. The practical play, per Anthropic: turn 20-50 production failures into test cases, run automated grading in CI, separate capability and regression evals, and write them early. Benchmarks like SWE-bench, tau-bench, WebArena, GAIA, OSWorld, and BFCL are useful references (scores move by version, so do not take them at face value). This summary is based on official information, with uncertainties flagged.
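The pass^k and p^t figures mentioned above are simple arithmetic, and a short sketch makes the scale of the effect concrete. The function names here are illustrative, the trials and steps are assumed to be independent, and the C(c, k) / C(n, k) estimator is one common way to estimate pass^k from recorded trials rather than a method the article itself specifies.

```python
from math import comb


def pass_all_k(p: float, k: int) -> float:
    """Chance that k independent trials all succeed, given per-trial success rate p."""
    return p ** k


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k from n recorded trials with c successes: C(c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def chain_success(p: float, t: int) -> float:
    """Chance a t-step task finishes cleanly when each step succeeds with rate p."""
    return p ** t


print(round(pass_all_k(0.9, 3), 3))       # 0.729: a 90% agent passes 3 runs in a row ~73% of the time
print(round(chain_success(0.95, 20), 3))  # 0.358: 95% per step over 20 steps
print(pass_hat_k(n=4, c=3, k=2))          # 0.5: 3 of 4 trials passed, estimating pass^2
```

Under these assumptions, an agent that looks reliable on a single run or a single step can still fail most long tasks, which is why repeated trials matter more for agents than for single-output evals.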

2026/06/20

Latest Articles


Claude AI Dev & Programming Beginners

What Are Agent Evals? Measuring Both Outcome and Trajectory

2026/06/20

Claude AI Dev & Programming Beginners

What Are Claude Code Hooks? Run Shell Commands Deterministically

Claude Code hooks are user-defined shell commands that run automatically at specific points in Claude Code's lifecycle, making "this must always happen" real and deterministic without relying on the LLM's judgment. There are nine classic events (SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, Notification, Stop, SubagentStop, SessionEnd, PreCompact), and some of them, PreToolUse among them, can block an action (for example, stopping protected-file edits or dangerous commands). You configure them in settings.json under the "hooks" key as event name -> matcher -> type + command. The I/O contract: a hook receives JSON on stdin (session_id, tool_input, etc.) and responds either with an exit code, 0 (success) or 2 (block, with stderr passed back to Claude), or with structured JSON (continue, decision:block, permissionDecision: deny/allow/ask). The key principle is "hooks can tighten but not loosen restrictions" (deny always wins and blocks even under bypassPermissions). Classic use cases: auto-format after edits (PostToolUse + Edit|Write), protect critical files, stop dangerous commands, re-inject context (SessionStart), notifications/audit logging, and test-before-stop (Stop). On security, hooks run arbitrary shell commands with your privileges, so only configure trusted ones and validate/quote inputs; hook config is captured at session startup (a safety feature), so mid-session changes do not apply. This summary is based on the official documentation, anchored on the nine classic events and the I/O contract.
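The exit-code half of that I/O contract can be sketched as a small script: read JSON from stdin, return 0 to allow, or return 2 with a message on stderr to block. This is a minimal sketch, not the documented reference implementation. The PROTECTED patterns are hypothetical, reading the target path from tool_input["file_path"] is an assumption about the payload shape for edit-style tools, and blocking on unparseable input is a design choice made here.

```python
import json
import sys

# Hypothetical deny-list; substitute the paths your project actually needs to protect.
PROTECTED = (".env", "secrets/")


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, stderr_message) for one hook invocation."""
    tool_input = payload.get("tool_input") or {}
    # Assumption: edit-style tools pass the target path as tool_input["file_path"].
    path = str(tool_input.get("file_path", ""))
    for marker in PROTECTED:
        if marker in path:  # plain substring match, deliberately crude
            return 2, f"Blocked: {path} matches protected pattern {marker!r}"
    return 0, ""


def run(stdin_text: str) -> tuple[int, str]:
    """Parse the JSON the hook receives on stdin and decide whether to block."""
    try:
        payload = json.loads(stdin_text)
    except json.JSONDecodeError as exc:
        # Design choice in this sketch: block when the input cannot be understood.
        return 2, f"Hook could not parse stdin as JSON: {exc}"
    if not isinstance(payload, dict):
        return 2, "Hook expected a JSON object on stdin"
    return decide(payload)


def main() -> int:
    """Entry point: exit code 2 plus a stderr message blocks, 0 allows."""
    code, message = run(sys.stdin.read())
    if message:
        print(message, file=sys.stderr)
    return code
```

To use a script like this as a hook command, end the file with `raise SystemExit(main())`. Keeping the decision in a pure function makes it possible to unit-test the rule without starting a session.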

2026/06/20

Claude AI Dev & Programming Beginners

What Are Claude Code Checkpointing and /rewind? Roll Back Changes

Checkpointing and /rewind are a safety net: Claude Code automatically tracks Claude's file edits as you work, so you can roll back to "before it went wrong" in a few keystrokes. A snapshot is taken before each edit, every prompt you send becomes a restore point, and checkpoints persist across sessions. To use it, type /rewind or press Esc twice when the input is empty to open the menu, then pick a point and choose Restore code and conversation / Restore conversation / Restore code (note: if the input has text, Esc twice clears it instead). The most important caveat: only changes made by Claude's edit tools (Write/Edit/NotebookEdit) are restored. File changes by bash commands (rm/mv/cp), changes outside the session or from other sessions, directory operations, remote files, and database state are NOT undone by rewinding. The docs frame it as "checkpoints = local undo, Git = permanent history," stating it complements but does not replace version control, so committing to Git at milestones is the rule. /rewind is also the recovery path for the 400 error tied to tool-use concurrency and thinking blocks (the product itself prompts you to run it), though versions before v2.1.156 may not clear the error, so run claude update first. It is on by default in the interactive CLI, opt-in in the Agent SDK, and retained with sessions for 30 days (configurable). This summary is based on the official documentation, with uncertainties flagged.

2026/06/20

Claude AI Dev & Programming Beginners

What Are Claude Managed Agents? Anthropic's Fully Managed Cloud

Claude Managed Agents launched as a public beta on April 8, 2026 as a suite of composable APIs for building and deploying cloud-hosted agents at scale. Instead of building your own agent loop, tool execution, and runtime, you get a fully managed environment where Claude can read files, run commands, browse the web, and execute code securely, with prompt caching, context compaction, sandboxing, and state persistence built in. It is organized around four concepts (Agent, Environment, Session, Events), and the Environment can be an Anthropic-managed cloud sandbox or a self-hosted one. The difference from the self-hosted Agent SDK (where you run the loop, tools, and infrastructure) is "you run it vs Anthropic runs it": the two are not competitors but a choice about how much of the operations you hold. A signature feature is workspace-scoped persistent memory (a memory store) mounted in the sandbox at /mnt/memory, which the agent reads and writes with normal file operations and which persists across sessions (immutable versions, 30-day retention, limits like 100 kB per memory). Dreaming is an async job that reads the existing memory and past transcripts to produce a reorganized memory store by merging duplicates, updating stale values, and surfacing new insights (it is a research preview requiring access; some call it "scheduled," but the docs describe an on-demand async job). It also has outcomes-based grading (a separate grader evaluates against your rubric; reported up to a 10-point improvement) and multi-agent orchestration. Pricing is tokens + $0.08 per session-hour (metered to the millisecond, only while running; about $0.705 for a 1-hour Opus 4.8 session). It is enabled by default for all API accounts, but because it is stateful it is not eligible for ZDR or a HIPAA BAA. This summary is based on official information, with uncertainties flagged.
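The pricing arithmetic above can be worked through in a few lines. This sketch takes only the two figures quoted in the summary ($0.08 per session-hour, and about $0.705 for a one-hour session) and assumes the runtime fee scales linearly with milliseconds spent running; the function names are illustrative, and token cost is passed in rather than computed because the summary gives no per-token prices.

```python
SESSION_HOUR_RATE_USD = 0.08  # per session-hour, as quoted in the summary above
MS_PER_HOUR = 3_600_000


def runtime_fee_usd(running_ms: int) -> float:
    """Runtime fee for the time a session spent running, metered in milliseconds."""
    if running_ms < 0:
        raise ValueError("running_ms must be non-negative")
    return SESSION_HOUR_RATE_USD * running_ms / MS_PER_HOUR


def session_cost_usd(running_ms: int, token_cost_usd: float) -> float:
    """Total cost = token charges (supplied by the caller) + runtime fee."""
    return token_cost_usd + runtime_fee_usd(running_ms)


# The quoted ~$0.705 for one hour implies roughly this much in token charges.
implied_token_cost = 0.705 - runtime_fee_usd(MS_PER_HOUR)

print(round(runtime_fee_usd(MS_PER_HOUR), 3))  # 0.08
print(round(implied_token_cost, 3))            # 0.625
print(round(runtime_fee_usd(90_000), 4))       # 0.002 for 90 seconds of runtime
```

Read this way, the runtime fee is the smaller part of the quoted one-hour example, with the remaining $0.625 or so attributable to tokens.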

2026/06/20

Claude AI Dev & Programming Beginners

What Are Claude Code Plugins and the Plugin Marketplace? A Complete Guide

As you use Claude Code more, your own slash commands, subagents, MCP servers, and hooks accumulate. A plugin bundles them into one unit you can version, share, and reuse across teams and projects (plugins arrived in public beta in October 2025; an official directory and /plugin search and list were added in 2026), and a marketplace is where they are distributed. This article covers seven topics. First, what a plugin is: a self-contained directory bundling skills, commands, subagents, hooks, and MCP servers. Second, its structure: only plugin.json goes inside .claude-plugin/, with commands/agents/skills/hooks at the root, and the plugin.json manifest carries name/description/version/author. Third, how to use it: the /plugin tabbed manager, /plugin marketplace add owner/repo followed by /plugin install name@market, /plugin enable|disable|uninstall, /plugin list --enabled, the search bar, and /reload-plugins. Fourth, what a marketplace is: a .claude-plugin/marketplace.json catalog, with the official claude-plugins-official available from first launch and browsable at claude.com/plugins, plus the community claude-plugins-community. Fifth, how to build and publish your own: place plugin.json and SKILL.md and test with claude --plugin-dir, then put marketplace.json at the git root; versioning resolves plugin.json, then the marketplace, then the commit SHA, and claude plugin validate checks the result. Sixth, distribution scope: user/project/local/managed, with team distribution via .claude/settings.json extraKnownMarketplaces and enabledPlugins. Seventh, safety: plugins can run arbitrary code with your privileges, Anthropic does not verify third-party plugins, and strictKnownMarketplaces restricts sources. All of this is based on the official documentation.
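The directory layout described above can be sketched by scaffolding it in a temporary folder. This is an illustration of the stated structure only: the plugin name and manifest values are placeholders, and writing author as an object with a name field is an assumption about the manifest shape, since the summary lists only the field names.

```python
import json
import tempfile
from pathlib import Path


def scaffold_plugin(root: Path, name: str) -> Path:
    """Create the layout described above: manifest in .claude-plugin/, components at the root."""
    plugin_dir = root / name
    manifest_dir = plugin_dir / ".claude-plugin"
    manifest_dir.mkdir(parents=True)
    # Component directories sit at the plugin root, not inside .claude-plugin/.
    for component in ("commands", "agents", "skills", "hooks"):
        (plugin_dir / component).mkdir()
    manifest = {
        "name": name,
        "description": "Example plugin (placeholder text)",
        "version": "0.1.0",
        "author": {"name": "Example Author"},  # assumed shape, not confirmed by the summary
    }
    (manifest_dir / "plugin.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return plugin_dir


with tempfile.TemporaryDirectory() as tmp:
    plugin = scaffold_plugin(Path(tmp), "demo-plugin")
    for path in sorted(plugin.rglob("*")):
        print(path.relative_to(plugin))
```

A directory produced this way is the kind of thing the summary says to test with claude --plugin-dir before publishing, after adding real skills or commands.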

2026/06/20

Claude AI Dev & Programming Beginners

Claude Code Subagents vs Agent Teams: The Difference and Which to Use

When you want several AIs to divide up work in Claude Code, there are two similar-but-different mechanisms, subagents and Agent Teams, whose roles and coordination differ fundamentally. This article sorts them out accurately. Subagents are a built-in feature: the main agent automatically delegates a specific task to a helper that has its own context window, system prompt, and tool permissions, then receives only a summary (hierarchical, ephemeral, the helper does not see your conversation history; managed via /agents, defined in .claude/agents/ YAML files, with built-ins like Explore and Plan, and nesting up to 5 levels deep). Agent Teams, by contrast, are an experimental opt-in feature that is disabled by default. They require CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 and let multiple independent sessions (a team lead and teammates) coordinate as peers and message each other directly through a shared task list and a mailbox (persistent; the v2.1.178 change only removed the TeamCreate setup step so that each session has one implicit team, and it did not enable teams by default). The article covers the decisive differences (hierarchical vs peer, summary vs direct messaging, built-in vs experimental flag, and that 5-level nesting is a subagents feature while Teams are flat), which to use (subagents when you want only the result or to avoid polluting context, Agent Teams for parallel work where workers must share and self-coordinate, and a single session for sequential work, the same files, or quick fixes), a usage cheat sheet, and Agent Teams caveats (high token cost, 3-5 teammates recommended, no worktree isolation so files conflict, /resume limits, and split panes needing tmux/iTerm2). All of this is based on the official documentation.

2026/06/20