"Prompt engineering is dead" — that refrain started circulating around 2025. What rose to take its place is the concept of "Harness Engineering." Coined by Anthropic researchers and the engineers building agents like Claude Code and Cursor, it has quickly become one of the central engineering disciplines of the AI agent era.
This article lays out what harness engineering actually is, how it differs from prompt engineering, the six components that make up a harness, a practical design checklist, and concrete examples from today's leading tools — the foundation you need if you're serious about using or building AI agents.
(Figure: A Harness = the 4 Layers Wrapping the LLM. Like a horse harness, it is the rig that channels a powerful animal toward your goal. With the same LLM, harness design alone can swing both quality and safety dramatically. That is the battleground of "Harness Engineering," a brand-new design discipline.)
1. What is Harness Engineering?
"Harness" originally refers to the gear and tack used on a horse — the rig that channels the animal's power in the direction you want. The term in AI works as exactly the same metaphor: the full set of equipment that puts a powerful but unruly LLM to productive work.
Concretely, that includes:
- Tools: file operations, web search, code execution — the means by which the LLM can take action.
- Context management: the strategy for what goes into the prompt and what gets compressed or discarded.
- Memory systems: persistent knowledge and user preferences that survive across sessions.
- Agent loop: the perceive → reason → act → observe cycle.
- Guardrails: permissions, sandbox, Hooks, approval flows.
- Output format: markdown, JSON, citations, streaming.
Designing all of that together is what we call harness engineering. Rather than training or improving the LLM itself, it's the craft of raising real-world utility by engineering everything around the LLM. Claude Code, Cursor, Devin, Codex CLI — they all run on roughly the same models, yet their behavior and performance diverge sharply because of the difference in their harnesses.
2. Harness Engineering vs Prompt Engineering
Prompt engineering hasn't gone away — but the scope is fundamentally different.
| Dimension | Prompt Engineering | Harness Engineering |
|---|---|---|
| Target | Single-turn input text | The whole system (tools, memory, loop) |
| Main work | Optimizing prompt wording, picking few-shot examples | Tool design, context strategy, loop design |
| Deliverable | Text templates | Code, configuration, system architecture |
| Skills required | Linguistic feel, intuition for LLM behavior | General software engineering |
| Scope of impact | Quality of one response | Completion rate, cost, and safety of long tasks |
| Example | "Think step by step" | Defining a calculator tool and letting the LLM call it |
If prompt engineering is the craft of "what to say to the LLM," harness engineering is the craft of "what to give the LLM and how to operate it." The two aren't competing — they're layered. The prompt is just one component within the harness.
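The table's last row deserves a concrete sketch. Below is a minimal, hypothetical illustration of "defining a calculator tool and letting the LLM call it": the schema follows the common JSON-Schema tool format, and the dispatcher is illustrative rather than tied to any particular SDK.

```python
# Hypothetical calculator tool definition in the common JSON-Schema style.
calculator_tool = {
    "name": "calculate",
    "description": "Evaluate a basic arithmetic expression, e.g. '2 * (3 + 4)'.",
    "input_schema": {
        "type": "object",
        "properties": {
            "expression": {"type": "string", "description": "Arithmetic expression"},
        },
        "required": ["expression"],
    },
}

def handle_tool_call(name: str, args: dict) -> str:
    """Execute a tool call requested by the model and return its result as text."""
    if name == "calculate":
        # eval() is unsafe on untrusted input; a real harness would use a proper
        # expression parser. This whitelist is exactly where guardrails apply.
        allowed = set("0123456789+-*/(). ")
        if not set(args["expression"]) <= allowed:
            return "error: only arithmetic characters are allowed"
        return str(eval(args["expression"]))
    return f"error: unknown tool '{name}'"
```

The point is the division of labor: the LLM decides *when* to compute; deterministic code decides *how*, which is precisely the "what to give the LLM" half of the craft.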
3. The 6 Components of a Harness
1. Tool Use
The LLM's means of acting on the world: reading and writing files, executing code, searching the web, calling APIs. Get the tool interface wrong — names, arguments, return values — and the LLM can't use it correctly. Concretely:
- Verb-based, unambiguous names (e.g. read_file).
- Required vs. optional arguments made explicit, with defaults.
- Structured error messages on failure (tell the model what to do next).
- Explicit warnings on side-effecting (destructive) operations.
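The points above can be sketched in one hypothetical tool. Names like `read_file` and the `ok`/`error`/`hint` fields are illustrative conventions, not any product's actual API; the idea is that a failure always tells the model what to try next.

```python
import os

def read_file(path: str, max_bytes: int = 65536) -> dict:
    """Hypothetical read_file tool with structured results: the model gets
    the same shape on success and failure, plus a hint for its next step."""
    if not os.path.exists(path):
        return {
            "ok": False,
            "error": "file_not_found",
            "hint": f"'{path}' does not exist; list the directory to find the right name.",
        }
    if os.path.getsize(path) > max_bytes:
        return {
            "ok": False,
            "error": "file_too_large",
            "hint": f"file exceeds {max_bytes} bytes; request a range or search within it.",
        }
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return {"ok": True, "content": f.read()}
```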
2. Context Management
The LLM's attention is finite — what you show it determines what it says. Concretely:
- Relevance filtering: pull only the parts relevant to the task, not whole files.
- Compaction: summarize long conversations to retain them.
- RAG integration: fetch what's needed via vector search.
- Caching: cut the cost of repeated system prompts with features like Anthropic's prompt caching.
Related: What is RAG?
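Compaction is the least obvious of these strategies, so here is a deliberately naive sketch. It keeps the most recent turns verbatim and folds older ones into a stub; a real harness would summarize the dropped turns with an LLM call, and would count tokens rather than characters.

```python
def compact_history(messages: list[dict], budget_chars: int = 8000) -> list[dict]:
    """Naive compaction: keep the newest messages that fit the budget and
    replace everything older with a single summary placeholder."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk newest-first
        size = len(msg["content"])
        if used + size > budget_chars:
            break
        kept.append(msg)
        used += size
    dropped = len(messages) - len(kept)
    kept.reverse()                          # restore chronological order
    if dropped:
        stub = {"role": "system", "content": f"[{dropped} earlier messages compacted]"}
        kept.insert(0, stub)
    return kept
```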
3. Memory System
Holding knowledge across sessions. Claude Code's CLAUDE.md, Cursor's .cursor/rules, and Codex's AGENTS.md are all examples of project memory. Beyond that:
- Short-term memory: recent conversation history.
- Long-term memory: user profile, past decisions.
- Factual knowledge: domain-specific knowledge bases.
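Project memory in particular is simple to sketch. Assuming the file names mentioned above, a harness might look for the first memory file present and prepend it to the system prompt; the precedence order here is an illustrative choice, not what any specific product does.

```python
from pathlib import Path

def load_project_memory(root: str) -> str:
    """Return the contents of the first project-memory file found, if any.
    File names follow the examples in the text; the order is an assumption."""
    for name in ("CLAUDE.md", "AGENTS.md", ".cursor/rules"):
        candidate = Path(root) / name
        if candidate.is_file():
            return candidate.read_text(encoding="utf-8")
    return ""  # no project memory configured

def build_system_prompt(root: str, base: str) -> str:
    """Fold project memory into the system prompt when it exists."""
    memory = load_project_memory(root)
    return f"{base}\n\n# Project memory\n{memory}" if memory else base
```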
4. Agent Loop
The core that makes an "AI agent" actually work. The base form is the perceive → reason → act → observe cycle:
- Receive the user's goal.
- Analyze the current state (gather information with tools if needed).
- Plan the next action.
- Act via a tool.
- Observe the result; check whether the goal is met.
- Loop if not, terminate if yes.
How smart your agent gets depends on whether you bake in replanning, self-critique, and subgoal decomposition.
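The six steps above reduce to a short skeleton. `model` is a stand-in for any chat call that returns either a final answer or a tool request, and `tools` maps names to Python callables; both interfaces are assumptions for illustration.

```python
def run_agent(goal: str, model, tools: dict, max_steps: int = 20) -> str:
    """Skeleton of the perceive -> reason -> act -> observe loop."""
    history = [{"role": "user", "content": goal}]   # receive the goal
    for _ in range(max_steps):                      # hard cap: a guardrail in itself
        decision = model(history)                   # reason: plan the next action
        if decision["type"] == "final":
            return decision["answer"]               # goal met: terminate
        tool = tools[decision["tool"]]              # act via a tool
        observation = tool(**decision["args"])
        history.append({"role": "tool", "content": str(observation)})  # observe
    return "stopped: step limit reached without completing the goal"
```

Replanning, self-critique, and subgoal decomposition all slot in at the "reason" step; the loop structure itself stays this simple.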
5. Guardrails
The mechanisms that prevent runaway behavior. As Why AI Ignores Your .md Rules covers, enforcing behavior through the environment is far more reliable than asking nicely in prose:
- Approval mode: dangerous operations require human confirmation (e.g. Claude Code's Plan mode).
- Sandbox: restrict filesystem and network access.
- Hooks: arbitrary checks before and after tool calls.
- Rate limiting: minimize damage if something goes off the rails.
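Here is a minimal sketch of the hook idea: a check that runs before every shell tool call, with a deny list and an approval list. The policy itself is hypothetical; the point is that enforcement happens in code, not in the prompt.

```python
import shlex

# Hypothetical policy: commands refused outright, and commands that
# require explicit human approval before they run.
BLOCKED = {"rm", "mkfs", "dd"}
NEEDS_APPROVAL = {"git", "curl"}

def pre_tool_hook(command: str, approve=input) -> bool:
    """Run before a shell tool call; return True only if it may proceed."""
    argv = shlex.split(command)
    if not argv:
        return False
    program = argv[0]
    if program in BLOCKED:
        return False  # physically impossible, not "please don't"
    if program in NEEDS_APPROVAL:
        return approve(f"Allow `{command}`? [y/N] ").strip().lower() == "y"
    return True
```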
6. Output UX
Presenting results in a form the user can understand and verify. Markdown rendering, source citations, syntax-highlighted code blocks, streaming output, visible reasoning (thinking), structured output (JSON), and so on. Producing the "right answer" isn't enough — it's the harness's job to deliver it in a form the user can trust and verify.
4. Why Harness Engineering, Why Now?
Three forces are driving the surge of interest in harness work.
1. The ceiling on raw LLM capability has come into view. With GPT-5-class models, Claude Opus 4.7, and Gemini 3.1 Pro out in the wild, benchmark gains have started to flatten. Real-world performance for a fixed model can swing 2x or more depending on the harness, which means we've entered an era where changing the harness pays off more than changing the model.
2. Problems prompts alone can't solve are stacking up. "Too many tools, the model picks the wrong one." "The context is so packed the important signal is buried." "On long-running tasks, the agent loses the thread halfway through." These aren't problems you fix with cleverer wording in a single turn — they're design problems.
3. The bottleneck for production AI agents has shifted to the harness. 2024 was the race to make LLMs smarter. 2025 through 2026 is the race to make harnesses smarter. Every major product — Anthropic's Claude Code, OpenAI's Codex, Cursor, Devin — is competing on harness engineering.
5. A Practical Harness Design Checklist
(Figure: 7 Checkpoints for a Good Harness)
6. Comparing the Major Harnesses
(Figure: Design Tendencies of the Major AI Agent Harnesses)
Each of these harnesses runs on more or less the same LLMs (Claude / GPT / Gemini), yet their strengths diverge sharply because of differing harness design philosophies. "Which harness?" matters more than "which LLM?" — that's the real battleground of the agent era.
7. Anti-Patterns
1. Adding too many tools
Once you cross roughly 20 tools, the LLM's chance of picking the wrong one shoots up. Be ruthless about keeping only the tools you actually need, and merge similar ones.
2. Stuffing everything into context
"Just show it everything, to be safe" is counterproductive. Run things through a relevance filter and include only what's necessary. Context is a device for surfacing the important signal — not a storage closet.
3. Implementing safety with prompts alone
"Please don't perform dangerous operations" will eventually be ignored in some situation. The right answer is to make those operations physically impossible at the environment level: sandboxes, Hooks, permission limits.
Summary
Harness engineering is the craft of designing the layer "outside" the LLM. Prompt engineering is just one component within the harness. Treating the six elements deliberately — tool definition, context management, memory, loop, guardrails, output UX — can transform real-world performance from the same underlying LLM.
As of 2026, the main battleground for production AI agents has clearly moved to the harness. Building "smart harnesses" — not just writing "smart prompts" — will be the differentiator for the next generation of engineers.
FAQ
Q1. So we don't need prompt engineering anymore?
Wrong. It's still essential — as one component within the harness. Tool descriptions, system prompts, error messages — all of those are prompt design surfaces. What's outdated is the mindset of "I'll fix this with a better prompt."
Q2. What's the first step to learn harness engineering?
Take Claude Code or Cursor and don't just use it — change its behavior by tweaking its config. Write a CLAUDE.md / .cursor/rules. Try out Hooks. Build a slash command. That's hands-on experience with what a harness actually is.
Q3. Are harnesses the same as frameworks like LangChain?
Close, but not the same. A framework is an implementation toolkit; a harness is a design discipline and mindset. LangChain, LlamaIndex, the Claude Agent SDK, and the like are tools for building harnesses.
Q4. Build my own harness vs. use an existing one?
For most cases, an existing harness (Claude Code, Cursor, etc.) plus customization is enough. Building one from scratch only makes sense for enterprise requirements, niche domains, or extreme cost optimization.
Q5. Will "harness engineer" become a real job title?
Signs are already there. Anthropic, OpenAI, Cursor, and other agent-building companies have started hiring for roles like "Agent Engineer," "Tool Designer," and "Context Engineer". By 2027 or 2028, it's likely to settle in as its own distinct job category.