AI Dev & Programming

Build smarter with AI-powered development. Code generation, app building, debugging, and test automation guides.

63 articles

Claude Code "usage limit reached": Causes and Fixes — 5-Hour and Weekly Limits, and the API Escape Hatch

Working in Claude Code, you suddenly see "Claude usage limit reached. Your limit will reset at 3pm" and stop cold. This is not an error or a bug: it is how the Pro/Max subscription usage limits work. This article explains the two-tier structure (a rolling 5-hour window that recovers ~5 hours after your first prompt, plus a weekly window that resets every 7 days, and on Max a separate weekly cap just for Opus), the fact that Claude Code and the Claude apps share the same plan allowance, the four biggest consumption drivers (model choice, where Opus burns far more than Sonnet; context size; long continuous sessions; and subagents/MCP), five ways to keep working when you hit the cap (drop to Sonnet with /model, trim context with /compact, wait out a 5-hour window, switch to the pay-as-you-go API, or buy credits / upgrade), how to see what is left (/usage, /status, and Settings > Usage for the weekly reset date), and the difference between subscription limits and API limits (429, retry-after, tiers). Because the exact numbers get revised over time, it avoids asserting current figures and recommends checking the live official view.
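
The API side of that comparison (429 plus retry-after) can be sketched in a few lines. This is a minimal illustration, not the article's code: `send` is a hypothetical callable standing in for whatever HTTP client you use, the headers are assumed to be a dict with lower-case keys, and retry-after is assumed to be a number of seconds. It applies only to pay-as-you-go API calls; a subscription usage limit does not clear on retry and has to be waited out.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical response shape for illustration: (status_code, headers, body).
Response = Tuple[int, Dict[str, str], str]


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    default_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(); on HTTP 429, wait for the retry-after value and try again."""
    response = send()
    for _ in range(max_attempts - 1):
        status, headers, _body = response
        if status != 429:
            break
        # Assumption: retry-after holds seconds; fall back if it is missing or malformed.
        try:
            wait = float(headers.get("retry-after", default_wait))
        except ValueError:
            wait = default_wait
        sleep(wait)
        response = send()
    return response
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.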

Claude Code "Prompt is too long": Causes and Fixes for the Context Window Error

The "Prompt is too long" error in Claude Code and the API (on the API: "prompt is too long: 233153 tokens > 200000 maximum") is not a usage limit — it means the input you tried to send (conversation history + attached/read files + tool definitions) exceeded the model context window. This article explains what fills the window (the dynamic factors of ever-growing conversation history, files you read, and tool results, plus the fixed factors of MCP tool definitions, CLAUDE.md, and the system prompt), how Claude Code avoids it by default with auto-compact, the window sizes (standard 200K vs 1M, where 1M is at standard pricing as of 2026 but subscriptions may need usage credits and the new tokenizer consumes roughly 30-35% more tokens), the fixes (/compact to summarize, /clear to restart, offloading big reads to a subagent that uses its own window, /context to see the breakdown and disable unused MCP or slim CLAUDE.md, and a 1M model only when truly needed), and how to tell the three confusable errors apart (Prompt is too long = input overflow, max_tokens = output cutoff, usage limit = plan quota, plus the 1M credits entitlement message) — based on official information.

Claude Code MCP Server Will Not Connect (failed / needs authentication): Causes and Fixes

You set up an MCP (Model Context Protocol) server in Claude Code, but /mcp shows it stuck at failed, needs authentication, or pending approval. This article shows how to first classify the cause by the /mcp (or claude mcp list) status into three families (✓ connected / ✗ failed = local launch failure / △ needs authentication = remote auth / ⏸ pending approval = project approval, plus the connected-but-0-tools state), the main causes and fixes for failed and config issues (relative path → absolute, server API keys belong in the per-server env not settings.json, MCP_TIMEOUT for startup timeouts, .mcp.json at the repo root with care for undefined ${VAR}, and never logging to stdout which corrupts the protocol), the very common Windows npx problem (spawn npx ENOENT → make command cmd and wrap with /c npx, or use WSL), remote OAuth (401/403 → authenticate from /mcp; some services like Microsoft 365 and Gmail connect via claude.ai connectors instead), the diagnostic workflow (/mcp status → claude --debug mcp for stderr → launch the server standalone → MCP Inspector → fully restart Desktop), and a prevention checklist — all based on official information.

Claude Code Prints "court" and Raw invoke Tags — Why Tool Calls Do Not Run, and How to Fix It

During long Claude Code sessions, a stray "court" or a raw <invoke name="Bash"> tag suddenly leaks to the screen and the tool call never executes. This is not a mistake in your environment or command: it is a known model-side glitch where Claude (especially the Opus 4.8/4.7 family) corrupts the control tags of a tool call as it generates them, with many issues filed in Anthropic official repository (#64108, #64150, #64690, #65705, #66153, #67295, #68354). This article explains how an agent generates tool calls as text, why the fail-closed harness rejects them so no wrong command ever runs, the two-layer cause (control-token corruption plus a self-poisoning chain where the broken block stays in history and the model imitates it), the trigger conditions (long multi-day sessions, heavy context, the post-/compact state, multiple tools at once, 3+ MCP servers, long tool arguments), three common misconceptions (it did not go rogue; court is meaningless but a useful marker; retry only fixes mild cases), the user fix (bail to a fresh session /clear after two misses; /compact is unreliable), the developer fix (check stop_reason, detect invoke leakage and retry, never keep broken history, shorten arguments), how to tell it apart from similar errors (thinking-block 400, max_tokens truncation, third-party Bedrock parsing), and the official status that no fix has shipped as of June 2026 — all grounded in official docs and the real issues.

What Is LoRA? Customizing AI With a Tiny Bit of Extra Training

Retraining a giant AI from scratch is too expensive, but you want to tweak it just for you; LoRA (Low-Rank Adaptation) grants that wish by freezing the original model and training only a tiny add-on part (an adapter), cutting trainable parameters by about 90%. LoRA makes fine-tuning dramatically cheaper and faster, and is hugely popular in image generation like Stable Diffusion as a small file that adds a character or style. This article explains it with a patch analogy. LoRA is the flagship of parameter-efficient fine-tuning (PEFT): leave the huge original weights frozen, insert a small add-on matrix into each layer, and train only that (W = W0 + BA, where W0 is frozen and BA is the small added part). It builds on the discovery that adapting an AI does not require big changes (a low rank is enough). Benefits: about 90% fewer trainable params (reportedly 10,000x fewer at GPT-3 scale), less GPU memory (about 3x less), faster and cheaper training, no inference latency once the adapter is merged, and lower overfitting risk. Its biggest strength is swappable adapters: keep one common base and swap small (few-MB) LoRA files per use case (support, company tone, a specific character) instantly. Many people first meet LoRA in image generation, where Stable Diffusion LoRAs that learned a character, style, or subject are shared widely (add a style, teach a character, light and shareable). QLoRA combines quantization, training LoRA on a 4-bit base for ~4x less memory than standard LoRA, enabling fine-tuning huge models on a consumer GPU (sometimes CPU) with minimal accuracy loss. Versus full fine-tuning (train all weights), LoRA differs in weights trained, cost, output, and best use; for most work LoRA is enough. Keep the base, season it small. Figures are quoted from public materials, directional.
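
The formula W = W0 + BA can be made concrete with a small sketch. The layer size (4096 x 4096) and rank (8) below are illustrative assumptions rather than figures from the article, and the helper names are hypothetical; the actual saving depends on the rank and on which layers get adapters.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix W is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for LoRA: B is d x r and A is r x k, W0 stays frozen."""
    return r * (d + k)


def matmul(x, y):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge(w0, b, a):
    """Return W = W0 + BA, the merged weight used at inference time."""
    ba = matmul(b, a)
    return [[w + delta for w, delta in zip(w_row, d_row)] for w_row, d_row in zip(w0, ba)]


# Illustrative sizes only: one 4096 x 4096 layer with rank r = 8.
d, k, r = 4096, 4096, 8
print(full_params(d, k))     # 16777216
print(lora_params(d, k, r))  # 65536

# Tiny 2 x 2 example with a rank-1 adapter.
w0 = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights
b = [[1.0], [2.0]]             # 2 x 1
a = [[0.5, 0.5]]               # 1 x 2
print(merge(w0, b, a))         # [[1.5, 0.5], [1.0, 2.0]]
```

Because the merged W has the same shape as W0, inference after merging costs the same as running the base model.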

What Is Quantization? Shrinking AI Models to Run Them on Your Own Machine

A huge 70B model running on a single home gaming PC instead of a rack of data-center GPUs is made possible by quantization, which lowers the numerical precision of a model's weights to dramatically shrink size and memory. Whereas model distillation moves knowledge into a separate smaller model, quantization makes the same model lighter. This article explains it with a photo-compression analogy. Quantization replaces weights stored as FP16/FP32 decimals with INT8 (8-bit) or INT4 (4-bit) integers, cutting bytes per weight (FP32=4, INT8=1, INT4=0.5); like compressing a RAW photo to JPEG, you sacrifice a little precision for a big reduction, and the surprise is how little you give up. On memory, 4-bit uses about a quarter of FP16: a 70B model drops from ~140GB to ~35GB, and an 8B model at 4-bit is ~4.5-5GB, fitting a midrange 8GB-VRAM GPU for local use (the democratization of LLMs). On accuracy, INT8 is nearly lossless and INT4 degrades under 4% on general Q&A/commonsense tasks, but loss is more noticeable for math, code generation, and hard reasoning (it shows as a small rise in perplexity), so pick the bit-width for the task. Main methods: GPTQ (pioneer of accurate 4-bit), AWQ (protects the ~1% most important weights, often 1-2% more accurate and faster), GGUF (llama.cpp/Ollama format, Q2_K-Q8_0, CPU+GPU hybrid, for local), and QLoRA (4-bit base plus LoRA for consumer-GPU fine-tuning). It differs from distillation (move to a separate small model) and fine-tuning (add task knowledge), and the three are usually combined (quantize a distilled model; fine-tune a quantized base). To start, run a GGUF model with Ollama in one command, choose Q4/Q8 by VRAM, and avoid INT4 for code or exact math. Most major models ship already quantized, so you just download and use them. Keep the smartness, drop only the weight. Figures are quoted from public materials, directional.
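
The memory figures follow from simple arithmetic on bytes per weight. The sketch below counts weights only, with a hypothetical helper name and 1 GB taken as 10^9 bytes. Real files tend to be somewhat larger than this raw estimate, presumably because of quantization metadata and layers kept at higher precision, which would be consistent with the ~4.5-5GB quoted for an 8B model at 4-bit.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for label, params in [("70B", 70e9), ("8B", 8e9)]:
    for bits in (16, 8, 4):
        print(f"{label} at {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

Activations and the KV cache need memory on top of this, so treat the result as a lower bound when sizing a GPU.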

What Is Model Distillation? Moving Knowledge From a Big AI to a Small One

A huge, high-performance AI is smart but heavy and expensive; model distillation (knowledge distillation) solves this by transferring a large teacher model's knowledge to a small student model, keeping 95%+ of the teacher's performance at roughly one-tenth the size, with faster and cheaper inference. This article explains it with a teacher-student analogy. The key is soft labels: ordinary training teaches only "the answer is cat" (hard label), while distillation passes the teacher's full probability distribution like "90% cat, 8% dog, 2% fox," whose degree of hesitation carries rich information; a temperature parameter softens the probabilities to reveal subtle relationships (real example: GPT-4o mini distilled from GPT-4o). Benefits: fast and cheap, ~10x more compact while keeping 95%+ performance, runs on the edge, strong for specialization. Two approaches: white-box (full access to weights and internal representations, deeper transfer; for your own or OSS models) and black-box (only outputs/API responses visible; using another company's API as teacher can violate terms). It differs from quantization (compress the same model's weight precision) and fine-tuning (further-train an existing model for a task): distillation moves knowledge into a separate small model, and the three can be combined. The legal/ToS reality was a big 2026 issue: the technique is legitimate, but OpenAI, Anthropic, Mistral, and xAI include clauses against competitive distillation that prohibit using outputs to build competing models, so distilling a competitor from a restricted API can violate terms. The OpenAI v. DeepSeek dispute (OpenAI alleged DeepSeek-linked accounts circumvented restrictions to obtain outputs for distillation, while DeepSeek's terms reportedly permit distilling its outputs) shows the assessment depends on whose API terms apply, and Claude Fable 5/Mythos 5 reportedly restrict responses on distillation-flagged work. 
Tips: use your own or licensed OSS models as teacher, check anti-distillation clauses before using a commercial API, and judge whether the use is "developing a competing model." Smartness from the big model, operation from the small — but who you pick as teacher changes the outcome technically and legally. Figures are quoted from public materials, directional.
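
The effect of the temperature parameter on soft labels can be seen in a short sketch. The logits below are made-up values for the classes cat, dog, and fox, and the function name is hypothetical; it shows only how a higher temperature flattens the teacher's distribution, not a full distillation training loop.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn logits into probabilities; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for the classes cat, dog, fox.
teacher_logits = [5.0, 2.5, 1.0]
hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)
print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

At the higher temperature the top class loses probability mass and the runners-up gain it, which is what exposes the "degree of hesitation" to the student.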

What Is AI Observability? Monitoring and Tracing LLMs and Agents, for Beginners

In "How to build a multi-agent system" we said to instrument every handoff before adding agents; the tech that powers that instrumentation in production is AI observability. It makes visible what LLMs and agents actually do in production (which model with what prompt, which tools and searches, what was returned, and how long and how much it cost) so you can trace back to the cause. The decisive difference from ordinary app monitoring: AI can return 200 OK in 50ms and still confidently hallucinate, so most AI failures are quality failures (hallucination, weak retrieval, unsafe answers, incomplete tasks, poor tool use, post-prompt-change regressions), not infrastructure failures. Observability rests on three pillars: traces (one request as a tree of spans showing LLM calls, tools, retrieval, reasoning chains; the star of AI observation), metrics (latency, cost, tokens, error rate, throughput), and logs (per-event detail). The industry standard OpenTelemetry GenAI conventions capture prompts, responses, token usage, and tool/agent calls in a vendor-neutral schema feedable into Datadog/Grafana. The most-confused distinction is observability vs evaluation (evals): observability shows what happened (easy to measure, but cannot tell if the answer is correct), while evals measure whether the answer is good (accuracy, groundedness, safety) and require explicit evaluation. Because cost and latency are easy to measure but answer quality is not, 2026 tools combine trace display with output scoring and degradation alerts. Metrics split into operational (cost, latency, tokens, error rate) and quality (hallucination, groundedness/faithfulness which is most critical for RAG, safety, task completion), with hallucination detection via LLM-as-a-judge, semantic similarity, and groundedness scores. Major tools: LangSmith (LangChain), Langfuse (open-source self-host), Arize Phoenix (RAG debugging), MLflow (lifecycle), AgentOps (agents), and OpenTelemetry (the standard). 
Start by capturing traces (OpenTelemetry-compliant), visualize operational metrics, then connect evals before shipping. For multi-agent systems observation is essential since failures hide in multi-step chains visible only in a full-session trace. Observe plus evaluate is what makes AI production-grade. Figures and traits are quoted from public materials, directional.

How to Build a Multi-Agent System: A Practical Guide to the Supervisor Pattern

After grasping the concept in "What is a multi-agent system?", this is the hands-on follow-up. Using the 2026 de facto standard supervisor pattern, it walks beginners through a 5-step build. The key principle: build a single agent first and add agents minimally only after hitting a limit (~80% of use cases are fine with one; using multiple agents for simple one-track work inflates cost 3-10x and, per Google research, reduces accuracy by 39-70% on sequential tasks). Three signs to go multi: specialization split, parallelism, decision separation. The supervisor pattern (the supervisor receives the overall task, decomposes it, delegates to specialist workers, and aggregates results) is where Claude Code subagents, LangGraph Supervisor, and OpenAI Agents SDK handoffs have all converged, because it has the widest framework support, a known failure mode (over-delegation, bounded by an iteration cap), and is easy to audit. The 5 steps: 1) decompose the task clearly up front; 2) define workers with one role + tools + output format (3-5 max); 3) design the supervisor, explicitly listing callable worker names (hard cap) and spending the most time here; 4) decide handoff and context sharing, passing only needed info (the standard is A2A); 5) instrument every handoff before adding agents, cap iterations/tokens/cost, and set up evals and guardrails. Framework-agnostic pseudo-code shows worker definitions, a hard-capped supervisor, and an iteration-bounded run loop. Common pitfalls and fixes: over-delegation (cap + limit callable workers), token bloat (need-only sharing + cache), instability (keep to 3-5 + fixed output), accuracy drop on sequential tasks (revert to single), and unknown failure point (observability). The shared lesson: prompts, tool design, and the eval harness decide success more than the framework. Build small, measure, add only when it pays off. Figures are quoted from public materials and research and are condition-dependent.
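
A rough idea of what a hard-capped supervisor with an iteration-bounded loop looks like is sketched below. This is not the article's pseudo-code: the worker functions, the `plan` stand-in (an LLM call in a real system), and all names are hypothetical, and no particular framework is implied.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical workers: each has one role and returns a fixed-format string.
def research_worker(task: str) -> str:
    return f"research notes for: {task}"


def writer_worker(task: str) -> str:
    return f"draft for: {task}"


# Hard cap: the supervisor may call only the workers named here.
WORKERS: Dict[str, Callable[[str], str]] = {
    "research": research_worker,
    "writer": writer_worker,
}


def plan(task: str) -> List[Tuple[str, str]]:
    """Stand-in for the supervisor's decomposition step (an LLM call in practice)."""
    return [("research", task), ("writer", task)]


def run_supervisor(task: str, max_iterations: int = 5) -> List[str]:
    """Delegate each planned step to an allowed worker, bounded by max_iterations."""
    results: List[str] = []
    for iteration, (worker_name, subtask) in enumerate(plan(task)):
        if iteration >= max_iterations:
            break  # iteration cap guards against over-delegation
        if worker_name not in WORKERS:
            raise ValueError(f"worker not allowed: {worker_name}")
        output = WORKERS[worker_name](subtask)
        # Instrument every handoff; a real system would emit a trace span here.
        print(f"handoff {iteration}: supervisor -> {worker_name}")
        results.append(output)
    return results
```

Calling `run_supervisor("write a release note")` runs both workers in order; lowering `max_iterations` to 1 stops after the first handoff.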

What Is a Multi-Agent System? Coordinating Multiple AI Agents, Explained for Beginners

"Split a complex job that one AI agent cannot handle across several agents" is the idea behind multi-agent systems. This beginner-friendly guide lays out the mechanics, main patterns, and major frameworks, and most importantly the real decision rule for when to use multiple agents and when one is enough, without hype. A multi-agent system has several role-specialized AIs work together on one large task; versus a single agent that does everything (fine for ~80% of use cases, cheap and easy to debug), it splits work by specialty for parallel execution and cross-checking, at higher coordination cost and token use. The four dominant orchestration patterns are: orchestrator-worker (a lead decomposes, dispatches workers in parallel, and synthesizes; most widely used, with an audit trail), sequential handoff (pass context to the next agent), group conversation (agents debate in one thread with a selector choosing who speaks; good for cross-verification), and graph state machine (agents as nodes, transitions as edges, explicit state; strong for branching and checkpoints). Frameworks consolidated in 2026 to LangGraph (largest production footprint), CrewAI (lowest learning curve, prototyping), AutoGen/AG2 (debate and verification, research), and OpenAI Swarm (lightweight handoffs). But it is not a cure-all: complex multi-domain tasks see up to +23% on reasoning benchmarks, yet on one-track sequential tasks Google research found a 39-70% drop versus a single agent, the same compute given to one agent often matches or wins, and 7 of 10 deployments reportedly added cost without ROI at ~15x token consumption (avg ROI 2.5-3.5x, top quartile 4-6x when aimed well). The recommended path: build a single agent first, identify a concrete ceiling (blurred roles, parallelizable work), then add a minimal 2-3 agent lead-pattern team with a cost cap and logging, and measure whether the accuracy gain justifies the increase. 
A2A (communication protocol) and MCP (tool connection) are foundational technologies that support multi-agent systems. Single for 80%, multi only for the hard parts. Figures are quoted from surveys and research; they are condition-dependent and directional.

What Is A2A (Agent2Agent)? How It Differs from MCP, Agent Cards, and How It Works

Now that AI agents are commonplace, the next challenge is how to make agents collaborate with each other. If MCP connects an agent to its tools, A2A (Agent2Agent) connects an agent to another agent — an open standard for AIs built on different vendors and frameworks to discover, communicate, and cooperate through a common convention. Google released it in April 2025, donated it to the Linux Foundation that June, and it reached v1.0 in 2026. This beginner guide covers what A2A is (the etiquette of a business partnership analogy), why it's needed (specialized agents relay work — a planning agent to a hotel-booking agent to a payment agent), how it differs from MCP (MCP is vertical, agent ↔ tools; A2A is horizontal, agent ↔ agent; stacking both is the standard two-layer setup), how it works (an Agent Card — a JSON "business card" at /.well-known/agent-card.json — is used to discover capabilities, then a Task carries the request through states like working, input-required, and completed, and an Artifact returns the result, all over HTTP, Server-Sent Events, and JSON-RPC 2.0, with agents keeping their internals hidden), and where it stands and implementation (as of April 2026, 150+ organizations in production, 22,000+ GitHub stars, SDKs in five languages — Python, JavaScript, Java, Go, .NET — with Microsoft, Salesforce, SAP, and ServiceNow involved). The mnemonic: connect to tools = MCP, connect to peers = A2A.

What Is Reranking? Two-Stage Retrieval That Boosts RAG Accuracy — A Beginner's Guide

You built RAG but the search quality is mediocre — that's exactly when reranking helps. Reranking re-scores the candidates roughly gathered by embedding (vector) search by their relevance to the query and reorders them, keeping only the top ones; this single step can dramatically change a RAG system's answer quality. This beginner guide covers what reranking is (a first-screening-and-final-interview analogy), why it's needed (embedding search vectorizes the query and documents separately, so it judges relevance only coarsely, and a bad ordering directly lowers answer quality — research reports about a 40% RAG accuracy gain from adding reranking, and layering it onto hybrid search is the 2026 standard), how two-stage retrieval works ("gather wide" with fast embedding search for recall, then "narrow smart" with the reranker for precision, then hand the top to the LLM), why a reranker is more accurate (a bi-encoder vectorizes query and document individually and is fast but approximate; a cross-encoder feeds them in together and outputs a 0–1 relevance score, accurate but heavy — so you gather with the fast bi-encoder and narrow with the accurate cross-encoder), and the models and implementation (API type like Cohere Rerank, Voyage, and Jina; open-source like BGE reranker, mixedbread, and FlashRank; and LLM-based scoring like RankLLM — just retrieve 50–100 and narrow to the top 5). The principle: gather wide, narrow smart, and tune the counts with AI evals.
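
The "gather wide, narrow smart" flow can be sketched with toy scoring functions. Both scorers below are hypothetical stand-ins: word overlap plays the role of fast embedding search, and `toy_pair_score` plays the role of a cross-encoder. A real system would call an embedding index and a reranker model instead, and would retrieve far more candidates than this example does.

```python
from typing import Callable, List


def first_stage(query: str, docs: List[str], top_n: int) -> List[str]:
    """Cheap recall step: rank by shared-word count (stand-in for embedding search)."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:top_n]


def rerank(query: str, candidates: List[str],
           score: Callable[[str, str], float], top_k: int) -> List[str]:
    """Precision step: score each (query, document) pair jointly and keep the best."""
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:top_k]


def toy_pair_score(query: str, doc: str) -> float:
    """Hypothetical cross-encoder stand-in: fraction of doc words found in the query."""
    q = set(query.lower().split())
    words = doc.lower().split()
    return sum(w in q for w in words) / len(words) if words else 0.0


query = "reset usage limit"
docs = [
    "how to reset the usage limit in settings and other unrelated long notes",
    "usage limit reset",
    "cooking pasta at home",
    "limit of a function in calculus",
]
candidates = first_stage(query, docs, top_n=3)            # gather wide
top = rerank(query, candidates, toy_pair_score, top_k=1)  # narrow smart
print(candidates)
print(top)
```

The first stage places the long, loosely related document first, and the second stage promotes the short exact match, which is the reordering effect the summary describes.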