What Is AI Observability? Monitoring and Tracing LLMs and Agents, for Beginners
In "How to build a multi-agent system" we said to instrument every handoff before adding agents. The technology behind that instrumentation in production is AI observability: it makes visible what LLMs and agents actually do (which model ran with which prompt, which tools and searches were called, what came back, how long it took, and how much it cost) so that you can trace a bad outcome back to its cause.
The decisive difference from ordinary application monitoring is that an AI service can return 200 OK in 50 ms and still hallucinate with full confidence. Most AI failures are therefore quality failures (hallucination, weak retrieval, unsafe answers, incomplete tasks, poor tool use, regressions after a prompt change), not infrastructure failures.
Observability rests on three pillars: traces (one request represented as a tree of spans covering LLM calls, tool calls, retrieval, and reasoning chains; the centerpiece of AI observability), metrics (latency, cost, tokens, error rate, throughput), and logs (per-event detail). The OpenTelemetry GenAI semantic conventions, the industry standard, capture prompts, responses, token usage, and tool and agent calls in a vendor-neutral schema that can be fed into backends such as Datadog or Grafana.
The distinction people confuse most often is observability versus evaluation (evals). Observability shows what happened; it is easy to measure but cannot tell you whether the answer was correct. Evals measure whether the answer was good (accuracy, groundedness, safety) and need an explicit scoring step. Because cost and latency are easy to measure and answer quality is not, 2026 tools combine trace display with output scoring and degradation alerts.
Metrics accordingly split into two groups: operational (cost, latency, tokens, error rate) and quality (hallucination, groundedness or faithfulness, safety, task completion). Groundedness is the most critical quality metric for RAG. Hallucination is typically detected with LLM-as-a-judge, semantic similarity, and groundedness scores. Major tools: LangSmith (LangChain), Langfuse (open-source, self-hostable), Arize Phoenix (RAG debugging), MLflow (lifecycle), AgentOps (agents), and OpenTelemetry (the standard).
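
To make the "one request as a tree of spans" idea concrete, here is a minimal sketch in plain Python. It does not use the OpenTelemetry SDK: the Span class and the helper functions are hypothetical, written only for illustration. The gen_ai.* attribute keys and the span names follow the OpenTelemetry GenAI semantic conventions as we understand them, so check the current specification before relying on them. The model name, tool name, durations, token counts, and prices are invented.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Span:
    """One step of a request: an LLM call, a tool call, a retrieval, etc."""
    name: str
    duration_ms: float
    attributes: Dict[str, Any] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    def add(self, child: "Span") -> "Span":
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Span"]:
        """Yield this span and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def total_tokens(root: Span) -> int:
    """Operational metric: input plus output tokens across the whole trace."""
    return sum(
        s.attributes.get("gen_ai.usage.input_tokens", 0)
        + s.attributes.get("gen_ai.usage.output_tokens", 0)
        for s in root.walk()
    )


def estimate_cost(root: Span, input_per_1k: float, output_per_1k: float) -> float:
    """Operational metric: cost from token counts and (assumed) per-1K prices."""
    cost = 0.0
    for s in root.walk():
        cost += s.attributes.get("gen_ai.usage.input_tokens", 0) / 1000 * input_per_1k
        cost += s.attributes.get("gen_ai.usage.output_tokens", 0) / 1000 * output_per_1k
    return cost


def slowest_child(root: Span) -> Span:
    """Where did the time go? Return the slowest direct child of the root."""
    return max(root.children, key=lambda s: s.duration_ms)


def render(span: Span, depth: int = 0) -> List[str]:
    """Render the trace as an indented tree, one line per span."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Invented example trace: one agent request with retrieval, an LLM call, a tool call.
root = Span("agent.run", 1850)
root.add(Span("retrieval.search", 120, {"docs_returned": 4}))  # hypothetical key
root.add(Span("chat example-model", 1600, {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 900,
    "gen_ai.usage.output_tokens": 150,
}))
root.add(Span("execute_tool lookup_order", 90, {"gen_ai.tool.name": "lookup_order"}))

print("\n".join(render(root)))
print("tokens:", total_tokens(root))
print("slowest step:", slowest_child(root).name)
```

Everything this sketch reports is operational. Nothing in the tree says whether the answer was correct, which is the gap evals fill.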
Start by capturing OpenTelemetry-compliant traces, then visualize operational metrics, and connect evals before you ship. For multi-agent systems observability is essential, because failures hide inside multi-step chains and become visible only in a full-session trace. Observing and evaluating together is what makes an AI system production-grade. Figures and tool characteristics are drawn from public materials and should be read as directional.
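
The "connect evals before shipping" step can be sketched as a groundedness check that gates a release. This is a deliberately crude stand-in: it scores an answer by lexical overlap with the retrieved passages, not with LLM-as-a-judge or semantic similarity, which are the methods named above. The function names, both thresholds, and the sample texts are assumptions made up for illustration.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, contexts: List[str], min_overlap: float = 0.6) -> float:
    """Fraction of answer sentences whose words mostly appear in the contexts.

    Crude lexical proxy: a sentence counts as supported when at least
    `min_overlap` of its distinct words occur somewhere in the retrieved text.
    """
    context_tokens: Set[str] = set()
    for passage in contexts:
        context_tokens |= _tokens(passage)

    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0

    supported = 0
    for sentence in sentences:
        words = _tokens(sentence)
        if words and len(words & context_tokens) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)


def ready_to_ship(scores: List[float], threshold: float = 0.8) -> bool:
    """Release gate: mean groundedness over the eval set must reach the threshold."""
    return bool(scores) and sum(scores) / len(scores) >= threshold


# Invented retrieved passages and two candidate answers.
contexts = [
    "The refund window is 30 days from delivery.",
    "Refunds are issued to the original payment method.",
]
good = ("The refund window is 30 days from delivery. "
        "Refunds are issued to the original payment method.")
bad = ("The refund window is 30 days from delivery. "
       "You also get a free laptop with every return.")

scores = [groundedness(good, contexts), groundedness(bad, contexts)]
print(scores, ready_to_ship(scores))
```

Both answers would look identical on a latency or cost dashboard. Only the scoring step separates them, which is why the eval has to be connected before release and not treated as part of monitoring.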