Beginners

New to AI? Start here. Beginner-friendly guides on AI concepts, tool selection, and practical first steps.

115 articles

What Are Agent Evals? Measuring Both Outcome and Trajectory

Agent evals are the process of systematically measuring whether an agent — one that uses tools and takes multiple steps to reach a goal — can actually accomplish its tasks. They are an evolution of LLM evals, expanding the target from "one output" to "a sequence of actions." Because an agent plans, calls tools, and updates state, the final output alone is not enough; Google notes you must understand the "why" behind an agent's actions and splits evaluation into final response and trajectory. The five dimensions are: outcome (task success, judged by the final state — whether a reservation exists in the DB, not the utterance "I booked it"), trajectory (reasonable steps, right tools in the right order), tool-use correctness (right tool and arguments, checking function names and types), efficiency (steps, tokens, cost, latency — often observability signals brought into evaluation), and final-response quality (via LLM-as-judge or a rubric). Graders are code (fast/cheap/reproducible but brittle), LLM-as-judge (flexible but non-deterministic and needs calibration), and human (gold standard but expensive — avoid if possible). Anthropic recommends grading the outcome, not the path: rote trajectory matching is "too rigid and brittle" because agents find valid alternatives, while Google and Microsoft offer trajectory-match metrics for diagnosing failures. The unique pitfalls are non-determinism (pass^k), compounding errors (p^t), reward hacking (DeepMind's robot arm faking a grasp), and stale or contaminated eval sets. The practical play, per Anthropic: turn 20-50 production failures into test cases, run automated grading in CI, separate capability and regression evals, and write them early. Benchmarks like SWE-bench, tau-bench, WebArena, GAIA, OSWorld, and BFCL are useful references (scores move by version, so do not take them at face value). Based on official information, with uncertainties flagged.
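The two numeric pitfalls named above, compounding errors (p^t) and non-determinism (pass^k), can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, steps and trials are assumed to be independent, and the estimator C(c, k) / C(n, k) is one common way to estimate pass^k from n recorded trials with c successes, not something this summary prescribes.

```python
from math import comb


def chain_success(p_step: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds (p^t)."""
    return p_step ** steps


def pass_hat_k(p_task: float, k: int) -> float:
    """Chance that k independent attempts at the same task all succeed."""
    return p_task ** k


def pass_hat_k_estimate(outcomes: list[bool], k: int) -> float:
    """Estimate pass^k from recorded trials as C(c, k) / C(n, k)."""
    n = len(outcomes)
    c = sum(outcomes)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(c, k) / comb(n, k)
```

For instance, `chain_success(0.95, 20)` comes out at roughly 0.36, which is why an agent that is 95% reliable per step can still fail most 20-step tasks.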

What Are Claude Code Hooks? Run Shell Commands Deterministically

Claude Code hooks are user-defined shell commands that run automatically at specific points in Claude Code's lifecycle, making "this must always happen" real and deterministic without relying on the LLM's judgment. There are nine classic events (SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, Notification, Stop, SubagentStop, SessionEnd, PreCompact), and some of them, PreToolUse included, can block an action (stopping protected-file edits or dangerous commands). You configure them in settings.json under the "hooks" key as event name -> matcher -> type + command. The I/O contract: a hook receives JSON on stdin (session_id, tool_input, etc.) and responds either with an exit code, 0 (success) or 2 (block, with stderr passed back to Claude), or with structured JSON (continue, decision:block, permissionDecision: deny/allow/ask). The key principle is "hooks can tighten but not loosen restrictions" (deny always wins and blocks even under bypassPermissions). Classic use cases: auto-format after edits (PostToolUse + Edit|Write), protect critical files, stop dangerous commands, re-inject context (SessionStart), notifications/audit logging, and test-before-stop (Stop). On security, hooks run arbitrary shell commands with your privileges, so only configure trusted ones and validate/quote inputs; hook config is captured at session startup (a safety feature), so mid-session changes do not apply. Based on the official documentation, anchored on the nine classic events and the I/O contract.
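The exit-code half of the I/O contract can be sketched as a small script that a PreToolUse hook command might point at. This is a minimal illustration, not the documented implementation: the `file_path` key inside `tool_input` and the `PROTECTED_MARKERS` list are assumptions made for the example, and the function names are hypothetical.

```python
import json
import sys

# Hypothetical patterns for files the hook should protect.
PROTECTED_MARKERS = (".env", "secrets/")


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, stderr_message) for one hook invocation."""
    tool_input = payload.get("tool_input", {})
    # Assumption: the edited path arrives as tool_input["file_path"].
    path = str(tool_input.get("file_path", ""))
    for marker in PROTECTED_MARKERS:
        if marker in path:
            # Exit code 2 blocks the action; the message goes to stderr.
            return 2, f"Blocked: {path} matches protected pattern {marker!r}"
    return 0, ""  # Exit code 0 lets the action proceed.


def main() -> int:
    """Read the hook JSON from stdin and report the decision."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code


# When wired up as a hook command, the script would end with:
#     sys.exit(main())
```

Keeping the decision in a pure function (`decide`) makes the rule easy to unit-test without spawning Claude Code.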

What Are Claude Code Checkpointing and /rewind? Roll Back Changes

Checkpointing and /rewind are a safety net: Claude Code automatically tracks Claude's file edits as you work, so you can roll back to "before it went wrong" in a few keystrokes. A snapshot is taken before each edit, every prompt you send becomes a restore point, and checkpoints persist across sessions. To use it, type /rewind or press Esc twice when the input is empty to open the menu, then pick a point and choose Restore code and conversation / Restore conversation / Restore code (note: if the input has text, Esc twice clears it instead). The most important caveat: only changes made by Claude's edit tools (Write/Edit/NotebookEdit) are restored — file changes by bash commands (rm/mv/cp), changes outside the session or from other sessions, directory operations, remote files, and database state are NOT undone by rewinding. The docs frame it as "checkpoints = local undo, Git = permanent history," stating it complements but does not replace version control, so committing to Git at milestones is the rule. /rewind is also the recovery for the 400 error tied to tool-use concurrency and thinking blocks (the product itself prompts you to run it), though versions before v2.1.156 may not clear it so claude update comes first. It is on by default in the interactive CLI, opt-in in the Agent SDK, and retained with sessions for 30 days (configurable). Based on the official documentation, with uncertainties flagged.

What Are Claude Managed Agents? Anthropic's Fully Managed Cloud

Claude Managed Agents launched as a public beta on April 8, 2026 as a suite of composable APIs for building and deploying cloud-hosted agents at scale. Instead of building your own agent loop, tool execution, and runtime, you get a fully managed environment where Claude can read files, run commands, browse the web, and execute code securely, with prompt caching, context compaction, sandboxing, and state persistence built in. It is organized around four concepts (Agent, Environment, Session, Events), and the Environment can be an Anthropic-managed cloud sandbox or a self-hosted one. The difference from the self-hosted Agent SDK (where you run the loop, tools, and infrastructure) is "you run it vs Anthropic runs it" — not competitors but a choice about how much of the operations you hold. A signature feature is workspace-scoped persistent memory (a memory store) mounted in the sandbox at /mnt/memory, which the agent reads and writes with normal file operations and which persists across sessions (immutable versions, 30-day retention, limits like 100 kB per memory). Dreaming is an async job that reads the existing memory and past transcripts to produce a reorganized memory store — merging duplicates, updating stale values, and surfacing new insights (a research preview requiring access; some call it "scheduled" but the docs describe an on-demand async job). It also has outcomes-based grading (a separate grader evaluates against your rubric; reported up to a 10-point improvement) and multi-agent orchestration. Pricing is tokens + $0.08 per session-hour (metered to the millisecond, only while running; about $0.705 for a 1-hour Opus 4.8 session). Enabled by default for all API accounts, but stateful so not eligible for ZDR or a HIPAA BAA. Based on official information, with uncertainties flagged.
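The runtime part of that pricing is simple enough to work through. The sketch below assumes the fee is strictly linear in running milliseconds at $0.08 per session-hour; the function names are hypothetical, and token charges are passed in as a plain number because their rates are not given here.

```python
from decimal import Decimal

SESSION_HOUR_RATE_USD = Decimal("0.08")  # per session-hour, from the summary
MS_PER_HOUR = Decimal(3_600_000)


def runtime_fee(running_ms: int) -> Decimal:
    """Session-hour fee for the time a session was actually running."""
    return SESSION_HOUR_RATE_USD * Decimal(running_ms) / MS_PER_HOUR


def session_cost(running_ms: int, token_cost_usd: Decimal) -> Decimal:
    """Total cost: token charges plus the metered runtime fee."""
    return token_cost_usd + runtime_fee(running_ms)
```

Under that assumption, a full hour of runtime adds $0.08, so the quoted figure of about $0.705 for a one-hour session would leave roughly $0.625 attributable to tokens.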

What Are Claude Code Plugins and the Plugin Marketplace? A Complete Guide

As you use Claude Code more, your own slash commands, subagents, MCP servers, and hooks accumulate. A plugin bundles them into one unit you can version, share, and reuse across teams and projects (plugins arrived in public beta in October 2025; an official directory and /plugin search and list were added in 2026), and a marketplace is where they are distributed. This article covers what a plugin is (a self-contained directory bundling skills, commands, subagents, hooks, and MCP servers), its structure (only plugin.json inside .claude-plugin/, with commands/agents/skills/hooks at the root; the plugin.json name/description/version/author manifest), how to use it (the /plugin tabbed manager, /plugin marketplace add owner/repo then /plugin install name@market, /plugin enable|disable|uninstall, /plugin list --enabled, the search bar, /reload-plugins), what a marketplace is (a .claude-plugin/marketplace.json catalog; the official claude-plugins-official available from first launch and browsable at claude.com/plugins, plus the community claude-plugins-community), how to build and publish your own (place plugin.json and SKILL.md and test with claude --plugin-dir, then put marketplace.json at the git root; versioning resolves plugin.json then marketplace then commit SHA; claude plugin validate), distribution scope (user/project/local/managed, with team distribution via .claude/settings.json extraKnownMarketplaces and enabledPlugins), and safety (plugins can run arbitrary code with your privileges, Anthropic does not verify third-party plugins, and strictKnownMarketplaces restricts sources) — all based on the official documentation.

Claude Code Subagents vs Agent Teams: The Difference and Which to Use

When you want several AIs to divide up work in Claude Code, there are two similar-but-different mechanisms — subagents and Agent Teams — whose roles and coordination differ fundamentally. This article sorts them out accurately. Subagents are a built-in feature: the main agent automatically delegates a specific task to a helper that has its own context window, system prompt, and tool permissions, then receives only a summary (hierarchical, ephemeral, the helper does not see your conversation history; managed via /agents, defined in .claude/agents/ YAML files, with built-ins like Explore and Plan, and nesting up to 5 levels deep). Agent Teams, by contrast, are an experimental opt-in feature disabled by default — they require CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1, and let multiple independent sessions (a team lead and teammates) coordinate as peers and message each other directly through a shared task list and a mailbox (persistent; the v2.1.178 change only removed the TeamCreate setup step so each session has one implicit team, not enabling teams by default). The article covers the decisive differences (hierarchical vs peer, summary vs direct messaging, built-in vs experimental flag, and that 5-level nesting is a subagents feature while Teams are flat), which to use (subagents when you want only the result or to avoid polluting context, Agent Teams for parallel work where workers must share and self-coordinate, and a single session for sequential work, the same files, or quick fixes), a usage cheat sheet, and Agent Teams caveats (high token cost, 3-5 teammates recommended, no worktree isolation so files conflict, /resume limits, and split panes needing tmux/iTerm2) — all based on the official documentation.

What Are Claude Design and /design-sync? Bridging Design and Code

Claude Design is an Anthropic Labs design tool where you describe what you want in conversation and Claude generates UI designs, prototypes, slides, and one-pagers, which you refine via chat, inline comments, direct edits, and sliders (launched April 17, 2026 as a research preview, with over a million users in the first week). The June 17, 2026 major overhaul sharply shrank the designer-developer round-trip. This article covers what Claude Design is (it ran on Opus 4.7 at the April launch, but the June announcement does not restate the model so we do not assert it), the June overhaul (design-system imports from a GitHub repo / design files / raw uploads so Claude builds with your real components and checks its output, two-way sync with Claude Code via /design-sync, direct canvas editing with drag/resize/align and many stability fixes, a token-burning fix via shared usage limits with chat/Cowork/Claude Code, enterprise brand controls where an admin approves and locks a standard system, and expanded export connectors to Adobe/Canva/Miro/Vercel/Wix and PDF/PowerPoint), the two directions of /design-sync (Code: pull the design system into the repo and build with real components; Design: push your code back to the canvas to keep editing; and a Design-to-Code handoff that continues from existing work instead of a screenshot, plus /design from the terminal), availability (a Pro/Max/Team/Enterprise beta at no extra charge, off by default on Enterprise, the canvas web/desktop only and /design-sync in the CLI), and why it matters (closing the rebuild-from-a-screenshot gap and connecting designers and developers in one line) — all based on official information, with uncertainties flagged.

What Is Claude Code Artifacts? Turn a Session into a Live Shared Page

On June 18, 2026, Anthropic shipped Claude Code Artifacts (beta), a feature that turns a terminal coding session into a live web page your team can share. Instead of streaming endless git diff and logs as text, Claude Code can publish an annotated PR walkthrough, a self-updating dashboard, an incident timeline, a release checklist that checks itself off, or an architecture map as one page at a private claude.ai URL. This article explains what Artifacts is (built from the whole session and MCP-connector data, and the open page refreshes in place as work progresses), how it differs from 2024 claude.ai canvas Artifacts (session-sourced, self-updating, org-only and not publishable), what it is good for, how to use it (no /artifact command — you ask in plain language, Claude writes a .html and asks permission to publish, prints the URL, Ctrl+] reopens, each update is a new version at the same URL, Share grants org access, view-only), its limits (a capture of work not an app — no backend, a strict CSP blocks external requests, single page, only .html/.htm/.md, 16 MiB cap, more tokens), and availability (Team/Enterprise beta, must be signed in via /login so API keys cannot publish, Anthropic API only and not Bedrock/Vertex/Foundry, disabled under CMEK/HIPAA/ZDR, plus admin controls and an audit log) — all based on the official documentation.

Claude Code Auth & Login Errors (Invalid API key / Not logged in): Causes and Fixes

When Claude Code throws "Not logged in · Please run /login", "Invalid API key", "This organization has been disabled", or "OAuth token has expired", these are mostly 401/403 authentication (who-are-you) problems. This article covers the number-one true cause (an environment variable ANTHROPIC_API_KEY silently overriding your subscription Pro/Max login by precedence, which produces unexpected pay-as-you-go charges, organization disabled, and Invalid API key; the key often comes from .zshrc/.bashrc/.profile, direnv/dotenv, an IDE terminal .env, a leftover from a previous job, or CI), how to detect it (/status to see the active credential, env | grep ANTHROPIC) and fix it (unset ANTHROPIC_API_KEY plus removing it from your shell config), other causes (token revoked/expired, system clock skew, a locked macOS Keychain, a missing Console role causing 403, OAuth redirect failures over WSL/SSH/containers fixed by pasting the code, and server-side org policy that cannot be overridden locally), where credentials are stored (macOS Keychain, Linux ~/.claude/.credentials.json, Windows %USERPROFILE%), the diagnostic workflow (/status → env grep → unset → /logout → /login → clock/Keychain/Console role), and how to tell it apart from usage limit (quota), 429 (rate), 529/500 (server), and Credit balance (prepaid balance) — all based on official information.
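The detection step (`env | grep ANTHROPIC`) can also be scripted, which helps when the stray key comes from direnv, a .env file, or CI rather than from your shell profile. This is a rough sketch with hypothetical function names; it only inspects the environment it is given and masks values so a key is not echoed in full.

```python
import os
from typing import Mapping


def anthropic_vars(env: Mapping[str, str]) -> dict[str, str]:
    """List ANTHROPIC-related variables with their values masked."""
    return {
        name: (value[:4] + "..." if value else "")
        for name, value in env.items()
        if "ANTHROPIC" in name
    }


def api_key_overrides_login(env: Mapping[str, str]) -> bool:
    """True when a non-empty ANTHROPIC_API_KEY is set in the environment."""
    return bool(env.get("ANTHROPIC_API_KEY", "").strip())


def report(env: Mapping[str, str] = os.environ) -> str:
    """One-line summary suitable for printing in a terminal."""
    if api_key_overrides_login(env):
        return "ANTHROPIC_API_KEY is set and may override your subscription login"
    return "No ANTHROPIC_API_KEY in this environment"
```

Calling `report()` inside the same terminal where Claude Code fails shows whether that shell inherits the key.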

Claude Code "command not found: claude": Install and PATH Error Fixes

You installed Claude Code, but typing claude gives "zsh: command not found: claude", "bash: claude: command not found", or "is not recognized as an internal or external command" on Windows. In most cases the install itself succeeded and the install directory is simply not on your PATH. This article explains how the shell searches PATH folders, the install methods and locations (the native installer is recommended and lands in ~/.local/bin, or %USERPROFILE%\.local\bin on Windows; npm needs Node 18+ and installs the same native binary; Homebrew/WinGet; installing only the VS Code extension does not add claude to PATH), the main causes and fixes (add ~/.local/bin to PATH and restart the terminal; for an npm EACCES permission error, switch to the native installer rather than reaching for sudo; upgrade a Node version that is too old; check for multiple-install conflicts with which -a / where.exe and reduce them to one native install; and handle the native-binary-not-found case caused by skipping optional deps), Windows-specific traps (the wrong-shell mix-up such as running irm in CMD, forgetting to restart the terminal, the conflict with the old Claude Desktop WindowsApps Claude.exe, and CLAUDE_CODE_GIT_BASH_PATH for Git Bash), auto-update and updating (claude update, claude install, claude doctor for the update result, DISABLE_AUTOUPDATER / DISABLE_UPDATES), and the diagnostic workflow (claude doctor → which -a → PATH → removing extras → a native reinstall), all based on official information.
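The PATH lookup the shell performs can be reproduced in a few lines, which makes the "installed but not found" case easy to confirm. A minimal sketch, assuming the native install location ~/.local/bin mentioned above; `dir_on_path` and `diagnose` are hypothetical helper names.

```python
import os
import shutil


def dir_on_path(directory: str, path_value: str) -> bool:
    """Check whether `directory` is one of the entries in a PATH string."""
    def norm(p: str) -> str:
        return os.path.normcase(os.path.normpath(os.path.expanduser(p)))

    target = norm(directory)
    return any(norm(entry) == target
               for entry in path_value.split(os.pathsep) if entry)


def diagnose(install_dir: str = "~/.local/bin") -> str:
    """Explain why `claude` does or does not resolve in this environment."""
    found = shutil.which("claude")  # same search the shell performs
    if found:
        return f"claude resolves to {found}"
    if not dir_on_path(install_dir, os.environ.get("PATH", "")):
        return f"{install_dir} is not on PATH; add it and restart the terminal"
    return f"{install_dir} is on PATH but no claude executable was found there"
```

Running `print(diagnose())` from the affected terminal distinguishes a missing PATH entry from a missing binary.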

Claude Code Network, Proxy and TLS Certificate Errors (Unable to connect): Causes and Fixes

On a corporate machine or over VPN, Claude Code can fail to connect with "Unable to connect to API", "Unable to connect to API (ECONNREFUSED)", "SSL certificate verification failed", or "fetch failed". These are network errors in which the request never reached Anthropic's server (api.anthropic.com), which makes them different from auth (401/403), server overload (529/500), and rate limiting (429). This article covers the three enterprise blockers (an unconfigured proxy, a TLS inspection proxy replacing certificates, and a firewall blocking domains) plus DNS/VPN/Docker, proxy setup (HTTPS_PROXY/HTTP_PROXY/NO_PROXY, authenticating proxies, SOCKS unsupported, NTLM/Kerberos via an LLM gateway with ANTHROPIC_BASE_URL, and MCP servers needing the vars in their own env), TLS and corporate CA certs (recent Claude Code trusts both its bundled CA set and the OS trust store, so a corporate root in the OS store often works with no config; otherwise point NODE_EXTRA_CA_CERTS at the PEM, use CLAUDE_CODE_CLIENT_CERT/KEY for mTLS, and pass curl --cacert at install time), the critical security rule to never use NODE_TLS_REJECT_UNAUTHORIZED=0 (it exposes all traffic, including api.anthropic.com, to man-in-the-middle attacks), the firewall allowlist domains (api.anthropic.com, claude.ai, platform.claude.com, downloads.claude.ai, raw.githubusercontent.com, with statsig/sentry optional), the diagnostic workflow (curl -I https://api.anthropic.com for reachability, /doctor and a proxy check, cert/proxy settings, a direct connection to confirm, then DNS/VPN/Docker), and how to tell these errors apart from auth/server/rate errors by whether the request reached the server, all based on official information.
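Several of the settings above can be sanity-checked before touching the network at all. The sketch below is a hypothetical pre-flight helper, not part of Claude Code: it only reads the environment variables named in the summary and flags the three mistakes the summary calls out.

```python
import os
from typing import Mapping


def review_network_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable findings about proxy and TLS variables."""
    findings: list[str] = []
    if env.get("NODE_TLS_REJECT_UNAUTHORIZED") == "0":
        findings.append(
            "NODE_TLS_REJECT_UNAUTHORIZED=0 disables certificate checks; remove it")
    proxy = env.get("HTTPS_PROXY") or env.get("HTTP_PROXY") or ""
    if proxy.lower().startswith("socks"):
        findings.append("SOCKS proxies are unsupported; use an HTTP(S) proxy")
    ca_bundle = env.get("NODE_EXTRA_CA_CERTS")
    if ca_bundle and not os.path.isfile(ca_bundle):
        findings.append(
            f"NODE_EXTRA_CA_CERTS points at a missing file: {ca_bundle}")
    return findings
```

An empty list from `review_network_env(os.environ)` does not prove connectivity; it only rules out these configuration errors, after which the reachability check with curl is still the next step.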

Claude Code "529 Overloaded" and "500" Server Errors: Causes and Fixes

When Claude Code suddenly stops with "API Error: 529 {\"type\":\"overloaded_error\",\"message\":\"Overloaded\"}" or "500 Internal server error", these are transient server-side events — not a mistake in your request or settings, and not your usage running out. This article explains the meaning of 529 Overloaded (Anthropic API temporarily over capacity, congestion across all users) and 500 (an unexpected internal error, with related 504 timeout_error; 502/503 usually come from upstream infrastructure), the fact that neither consumes your usage quota, how Claude Code auto-retries up to 10 times with exponential backoff before showing anything (Retrying in Ns, attempt x/y; CLAUDE_CODE_MAX_RETRIES default 10, API_TIMEOUT_MS default 10 minutes), the user fixes (wait and retry, switch model with /model since capacity is per model so Sonnet often works when Opus is busy, check status.claude.com, /feedback with request_id if a 500 persists), how to tell it apart from confusable errors (529/500 = server-side and no quota used, 429 = your rate limit with a retry-after header and quota, usage limit = plan allowance, 400 = a bad request), developer guidance (typed SDK exceptions and auto-retry, exponential backoff plus jitter, retry-after only on 429, --fallback-model, Priority Tier/Batch), and how to tell a transient spike from a continuing incident — all based on official information.
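For developers calling the API directly, the guidance above (exponential backoff plus jitter, retry-after only on 429, no retry on a bad request) can be sketched as a delay policy. This is a simplified illustration with hypothetical names and default values; official SDKs ship their own retry logic, which this does not reproduce.

```python
import random
from typing import Callable, Optional

SERVER_SIDE = {500, 529}  # transient server errors named in the summary


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter backoff: a random delay up to min(cap, base * 2^attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def next_delay(status: int, attempt: int,
               retry_after: Optional[float] = None) -> Optional[float]:
    """Seconds to wait before retrying, or None when a retry will not help."""
    if status == 429:
        # Honor the server's retry-after header when it is present.
        return retry_after if retry_after is not None else backoff_delay(attempt)
    if status in SERVER_SIDE:
        return backoff_delay(attempt)
    return None  # e.g. 400: fix the request instead of retrying
```

The jitter spreads retries from many clients over time, so a congestion spike is not followed by a synchronized wave of retries.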