"I was using ChatGPT Plus, then switched to Claude Code and my monthly bill went up 10x." — entering 2026, this kind of complaint has surged among engineers. AI tools are useful, but if you don't know how to use them, tens of thousands of dollars a month can quietly disappear.

The good news: by combining three levers (prompt caching, model routing, output budget), you can do the same work for 20-30% of the unoptimized cost. Drawing on Anthropic's official guidance, industry research, and real-world operational data, this article lays out how to cut AI tool spend without cutting corners.

3 levers · 2026: compress to 20-30% of unoptimized cost
(a realistic case: $30K/month down to $6-9K)

  • Lever 1, prompt caching (-60 to 90%): slashes input cost. Maximum impact on production workloads that reuse the same system prompt.
  • Lever 2, model choice (-50 to 80%): route Opus / Sonnet / Haiku per task. Eight out of ten jobs are fine on a cheaper model.
  • Lever 3, output budget (-30 to 60%): cap with max_tokens and tell it to "answer briefly." Output tokens cost 5-6x more than input.

The three levers multiply when applied together. "Cache only" or "model choice only" leaves money on the table; the core thesis of this article is to pull all three levers at once.

1. Why your AI bill quietly balloons

AI tools come in two billing tracks: personal plans (flat rate) and API billing (usage-based). The bill that explodes is mainly the latter.

  • Personal plans: ChatGPT Plus $20/mo, Claude Pro $20/mo, Max $100-200/mo. Fixed cost, so even heavy use has a ceiling (with rate limits).
  • API billing: per-token, usage-based. Cursor / Claude Code / your own AI apps, Lovable / Bolt.new and the like fall here. Use them carelessly and your monthly bill jumps by an order of magnitude.

Why does "suddenly $300" or "$50 burned in a single day" happen? Four mechanics compound: (1) output tokens cost 5-6x more than input; (2) the longer your context grows, the more is resent in full each turn; (3) sub-agents get invoked multiple times behind the scenes; (4) once a loop starts, it doesn't stop on its own. Once you understand the mechanics, every one of them is fixable.

2. Cost breakdown — input, output, cache, tools

Using Claude Opus 4.7 API pricing (as of May 2026) as an example, here's where the money goes.

| Item | Unit price | Description |
|---|---|---|
| Input tokens | $5 / 1M tokens | What you send: prompt + conversation history + files, etc. |
| Output tokens | $25 / 1M tokens | What the AI returns. 5x more expensive than input. |
| Cache write (5 min) | $6.25 / 1M tokens (1.25x) | Stored to cache with a 5-min TTL (only the first write costs more). |
| Cache write (1 h) | $10 / 1M tokens (2x) | Cached with a 1-hour TTL. Lasts longer, but the write costs more. |
| Cache read | $0.50 / 1M tokens (10%) | 10% of input price. This is the star of the savings show. |
| Tool calls | — (included) | Tool definitions are part of the context. The more tools, the fatter the input. |

In short, "content sitting in the cache reads for one-tenth the price." That's the single biggest savings lever in 2026.
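To make the table concrete, here is a rough per-request cost calculator using the example prices above (these are the article's illustrative Opus-class figures, per 1M tokens, not authoritative pricing):

```python
# Example prices from the table above, in USD per 1M tokens (illustrative).
PRICE = {
    "input": 5.00,        # uncached input
    "output": 25.00,      # output (5x input)
    "cache_write": 6.25,  # 5-min TTL cache write (1.25x input)
    "cache_read": 0.50,   # cache read (10% of input)
}

def request_cost(input_tok=0, output_tok=0, cache_write_tok=0, cache_read_tok=0):
    """Return the cost in USD of one request, given token counts per category."""
    tokens = {
        "input": input_tok,
        "output": output_tok,
        "cache_write": cache_write_tok,
        "cache_read": cache_read_tok,
    }
    return sum(PRICE[kind] * count / 1_000_000 for kind, count in tokens.items())

# A 50k-token context sent uncached vs. read from cache, same 1k-token answer:
uncached = request_cost(input_tok=50_000, output_tok=1_000)   # $0.275
cached = request_cost(cache_read_tok=50_000, output_tok=1_000)  # $0.05
```

Note how the output cost ($0.025) is identical in both cases; caching only attacks the input side, which is why it pairs with an output budget rather than replacing it.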

3. Plan choice and its savings impact

Once you can predict how you'll use these tools, the first move is switching to the right plan.

| Usage | Recommended plan | Monthly target | Caveats |
|---|---|---|---|
| Hobby, learning, a few times a week | Claude Free / ChatGPT Free | $0 | Rate-limited; not for work data. |
| Personal, a few hours daily | Claude Pro / ChatGPT Plus | $20 | Personal plan; not for work data. |
| Heavy personal use | Claude Max | $100-200 | Higher rate ceiling; recommended for Claude Code. |
| Team work | Claude Team / ChatGPT Team | $25-30/user | OK for work data; data not used for training. |
| Large organization | Enterprise | Sales quote | SSO, audit logs, SLA. |
| AI-embedded development | Direct API (Anthropic / OpenAI) | Usage-based | Use caching and batch. |

If you're going to use Claude Code "seriously, several hours a day," the Max plan ($100 or $200) is almost always the right answer. Cheaper than direct API and the rate limits are practically sufficient. Cursor offers tiers like Pro $20, Ultra $200.

4. Prompt caching — the strongest single lever

If you're hitting the API directly, prompt caching is a savings tool with "no reason not to use it." Anthropic itself describes it as "the most underused cost optimization tool of 2026."

How it works

When you reuse the same system prompt or same documents across multiple requests, the first call writes to cache (1.25x cost). Every subsequent call reads from cache at 10% of input price.

Break-even math

  • 5-min TTL (write 1.25x): the 0.25x write premium is paid back by a single cache read (each read saves 0.9x vs. uncached input)
  • 1-hour TTL (write 2x): the 1x write premium is paid back after two reads
  • Production rule of thumb: plan for 3+ reads on 5-min TTL or 5+ reads on 1-hour TTL, which leaves margin for cache misses
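The arithmetic is easy to verify yourself. A minimal sketch in units of "one uncached input", using the multipliers above; the pure break-even lands earlier than the production rule of thumb, which deliberately leaves margin for cache misses:

```python
def cumulative_cost(n_requests, write_multiplier, read_multiplier=0.1):
    """Cost of n identical requests, in units of one uncached input,
    where the first request writes the cache and the rest read it."""
    if n_requests == 0:
        return 0.0
    return write_multiplier + (n_requests - 1) * read_multiplier

def break_even(write_multiplier):
    """Smallest request count at which caching beats no caching
    (uncached cost for n requests is simply n x 1.0)."""
    n = 1
    while cumulative_cost(n, write_multiplier) >= n:
        n += 1
    return n

# 5-min TTL (1.25x write): caching is already cheaper at the 2nd request.
# 1-hour TTL (2x write): caching is cheaper from the 3rd request on.
```

These numbers assume a 100% hit rate on every subsequent request; real traffic misses sometimes, which is why the rule of thumb asks for more reads than the raw break-even.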

Important 2026 change

In early 2026, Anthropic shortened the default prompt-cache TTL from 60 minutes to 5 minutes. If you're running production without noticing, your effective cost has gone up 30-60%. Developers stuck with the "old intuition" are quietly losing money — that's the hidden problem of 2026.

Recommended pattern

For production apps:

  • system prompt + tool definitions: cache with 1-hour TTL (the parts that don't change)
  • front of conversation history: cache with 5-min TTL (the parts re-accessed within a short window)
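A sketch of what that request body can look like, assuming the Messages API's cache_control blocks (the model name is the article's example; per Anthropic's docs a breakpoint on the system block also covers tool definitions, which sit earlier in the cached prefix):

```python
def build_request(system_prompt, tools, history, new_message):
    """Assemble a Messages API request body with two cache breakpoints:
    a 1-hour one on the stable prefix and a default 5-min one on the
    front of the conversation history."""
    msgs = [dict(m) for m in history]
    if msgs:
        # Breakpoint at the end of prior history; default TTL is 5 minutes.
        msgs[-1]["content"] = [{
            "type": "text", "text": msgs[-1]["content"],
            "cache_control": {"type": "ephemeral"},
        }]
    msgs.append({"role": "user", "content": new_message})
    return {
        "model": "claude-opus-4-7",  # illustrative name from the article
        "max_tokens": 1024,
        "system": [{
            "type": "text", "text": system_prompt,
            # Stable parts: pay 2x once, then read at 0.1x for an hour.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }],
        "tools": tools,
        "messages": msgs,
    }
```

This builds only the payload dict; pass it to whatever client you use. The key point is structural: unchanging content first, breakpoints at the boundaries where "stable" ends.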

If your cache hit rate (cache_read / (cache_read + input)) is under 60%, there's room to optimize. In production, aim for 80%+.
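Computing that ratio from a response is a one-liner; the field names below follow the usage block returned on Messages API responses:

```python
def cache_hit_rate(usage):
    """Share of prompt tokens served from cache.
    `usage` is the response's usage block as a dict."""
    read = usage.get("cache_read_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)  # uncached input tokens
    total = read + fresh
    return read / total if total else 0.0

# 48k of a 50k-token prompt came from cache -> 96% hit rate.
rate = cache_hit_rate({"input_tokens": 2_000, "cache_read_input_tokens": 48_000})
```

Log this per request and alert when the rolling average dips; a falling hit rate usually means a prompt prefix changed or a TTL expired.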

5. Context management — /compact and splitting

Use Claude Code or Cursor for a while, and mid-way through a long conversation you'll find "I'm somehow sending 100k tokens every turn." It's not the output — it's the input (= past conversation) that keeps swelling.

Tactic 1: actively use /compact

Claude Code has a /compact command. It summarizes and compresses the conversation history, rebuilding the context from the summary; 200k tokens of history can shrink to around 5,000. Consider it once a session passes 30 minutes.

Tactic 2: split sessions per task

Don't do "implement Feature A," "fix Bug B," and "generate Doc C" in one long conversation — start fresh sessions. Close the session when each task wraps. If you need long-term memory, write it out to a memory file.

Tactic 3: trim noise with Hooks

Claude Agent SDK / Claude Code provide Hooks, which let you transform tool output before it reaches the AI. Example: compress a long npm install log down to just "success/failure" via a Hook. That alone can save thousands of tokens per turn.
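The exact Hook wiring varies by tool, but the compression step itself can be as simple as this sketch (the thresholds and the error-keyword heuristic are arbitrary choices, not part of any API):

```python
def compress_tool_output(output: str, keep_tail: int = 5, max_chars: int = 500) -> str:
    """Collapse a long tool log to a verdict plus its last few lines."""
    if len(output) <= max_chars:
        return output  # short output passes through untouched
    lines = output.splitlines()
    # Crude heuristic: treat any error/failed mention as a failure signal.
    failed = any("error" in ln.lower() or "failed" in ln.lower() for ln in lines)
    verdict = "FAILED" if failed else "succeeded"
    tail = "\n".join(lines[-keep_tail:])
    return f"[{len(lines)}-line tool log compressed: {verdict}]\n{tail}"
```

On failure you'd typically keep more of the log (the AI needs the stack trace to fix anything); the aggressive compression is for the happy path, which is most turns.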

6. Model selection — task-based routing

"Always Opus" is a millionaire's strategy. Most tasks get sufficient quality from Sonnet or Haiku. Anthropic's official price ratios are as follows (May 2026).

| Model | Input ($/1M) | Output ($/1M) | Best at |
|---|---|---|---|
| Claude Opus 4.7 | $5 | $25 | Complex design, reasoning, long autonomous tasks |
| Claude Sonnet 4.7 | $3 | $15 | Daily coding, analysis, summarization |
| Claude Haiku 4.5 | $0.80 | $4 | Classification, extraction, short conversion, real-time response |
| GPT-5.5 | $5 | $30 | Planning, execution, terminal control |
| GPT-5.5 mini | $0.60 | $2.40 | Lightweight tasks |

Opus to Haiku is roughly 6x cheaper. Routing per task alone yields enormous savings. Decision criteria:

  • Use Opus for: complex refactors, designs spanning many files, deep reasoning, exploring an unfamiliar domain
  • Use Sonnet for: daily coding, analysis, summarization, review, adding tests
  • Use Haiku for: classification, extraction, format conversion, real-time suggestions, generating commit messages
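These criteria are easy to encode as a small routing table; the task labels and model tiers below are illustrative names, not anything mandated by an API:

```python
# Task-type to model-tier routing, following the criteria above.
ROUTES = {
    "refactor": "opus", "design": "opus", "deep_reasoning": "opus",
    "coding": "sonnet", "review": "sonnet", "summarize": "sonnet",
    "add_tests": "sonnet",
    "classify": "haiku", "extract": "haiku", "convert": "haiku",
    "commit_message": "haiku",
}

def pick_model(task_type: str) -> str:
    # Unknown tasks fall back to the mid tier, not the most expensive one.
    return ROUTES.get(task_type, "sonnet")
```

The fallback choice is the design decision that matters: defaulting unknowns to Sonnet rather than Opus means new task types start cheap and get promoted only when quality proves insufficient.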

7. Managing your output budget

Output tokens cost 5-6x more than input. The savings here are large.

Three approaches

  • Set max_tokens explicitly: cap with max_tokens: 1000 or similar in the API call. Default-unlimited is dangerous.
  • Add "answer briefly" or "five bullets" to your prompt: the AI listens. Suppress redundant intros, summaries, and signoffs.
  • Structured output (JSON mode): JSON is shorter than prose. If your app consumes the result, this is the way.
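One useful property of an explicit max_tokens cap: it turns output spend into a hard upper bound you can compute in advance. A quick sketch using the article's example output price:

```python
def max_output_cost(max_tokens: int, output_price_per_mtok: float = 25.0) -> float:
    """Worst-case USD spend on output for one call, given a max_tokens cap.
    Default price is the article's illustrative $25 / 1M output tokens."""
    return max_tokens * output_price_per_mtok / 1_000_000

# Capping at 1,000 tokens bounds output spend at $0.025 per call;
# letting the model ramble to 16k tokens could cost $0.40 per call.
capped = max_output_cost(1_000)
uncapped_ish = max_output_cost(16_000)
```

Multiply by your daily call volume and the cap stops being a detail: 10,000 calls a day is a $250/day difference between those two numbers.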

For situations where you don't need a long, polished answer (classification, extraction, yes/no decisions), capping output aggressively is the more cost-efficient choice.

8. The multi-agent trap — 15x tokens

The 2026 trend, multi-agent setups (orchestrator + parallel sub-agents), is powerful, but Anthropic itself has stated publicly that "token consumption is roughly 15x compared to a single agent."

Decision criteria for savings

  • Clear, sequential tasks (single-file edit, summarization, code review) → single agent suffices
  • Parallelism that meaningfully reduces wall-clock time → multi-agent is justified
  • "Multi-agent by default" is economically wrong. Start with a single agent and split only the bottlenecks you can actually see.

Details: see What is a multi-agent?

9. Monitoring and billing alerts

To prevent the "suddenly $500" surprise, routine monitoring + alerts are mandatory.

API users

  • Check daily token consumption in the Anthropic Console / OpenAI Dashboard
  • Set a usage limit: auto-stop when you exceed $200/month, etc. No limit = danger.
  • Billing alerts: email at $50, Slack at $100 — staged thresholds.
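A staged-threshold check is a few lines; the dollar amounts and level names mirror the example above and should be tuned to your own baseline:

```python
# Staged alert thresholds: email early, escalate to Slack, hard-stop last.
THRESHOLDS = (
    {"level": "email", "usd": 50},
    {"level": "slack", "usd": 100},
    {"level": "stop", "usd": 200},
)

def check_spend(month_to_date_usd: float, thresholds=THRESHOLDS):
    """Return every alert level the current spend has crossed, in order."""
    return [t["level"] for t in thresholds if month_to_date_usd >= t["usd"]]
```

Run this from whatever job polls your provider's usage endpoint; the "stop" level is where you flip the API key or usage limit off, not just send another message.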

Claude Code users

  • Use /cost to check the current session's token consumption and estimated spend
  • Make checking /cost at the end of each day a habit

Org administrators

  • Per-user usage reports (Anthropic Team / Enterprise admin console)
  • Anomaly detection (flag people consuming 3x their normal)
  • Quarterly company-wide share-out of "wasteful patterns"

10. Seven common wasteful patterns

| Pattern | What's wrong | Fix |
|---|---|---|
| Re-attaching all files every turn | Cache doesn't kick in; input balloons | Send unchanging docs once and cache them |
| Asking the same question in both ChatGPT and Claude | Paying twice for the same input on separate plans | Pick one |
| Continuing a long conversation without /compact | Full history sent every turn | /compact after 30 minutes |
| Using Opus for simple classification or extraction | Paying 6x what Haiku costs for the same result | Match model to task |
| Repeating "more polished" / "a bit longer" | Output tokens stack up | State the desired length up front |
| Defining many unnecessary tools | Tool definitions ride in the context | Define only what you'll use |
| Reaching for multi-agent casually | 15x tokens vs. single agent | Only when you have a clear need |

Summary

  • The three levers of AI cost optimization: prompt caching, model routing, output budget. Combined, they compress to 20-30% of unoptimized cost.
  • Cache reads = 10% of input price. 60-90% savings on production workloads. Watch for the early-2026 TTL shortening (60 min → 5 min); ignore it and you're effectively up 30-60%.
  • Model choice: Opus to Haiku is roughly 6x cheaper. 80% of tasks are fine on Sonnet/Haiku.
  • Output budget: output tokens cost 5-6x more than input. Set max_tokens explicitly and ask for "brief."
  • Context management: /compact once past 30 minutes per session, split per task, compress output with Hooks.
  • Multi-agent trap: 15x tokens vs. single agent. Use only with a clear need.
  • Monitoring: usage limits, billing alerts, and a /cost check should all be habits.
  • Stay aware of the seven common wasteful patterns and avoid them.

FAQ

Q1. I use Claude Code daily — is Pro $20 or Max $200 the better deal?

If you use it 2+ hours a day, Max is almost certainly the better deal. Pro hits its rate ceiling fast, frustration mounts, and you end up bleeding into API billing anyway. Max lets you work for hours without worry. Even Anthropic's own messaging assumes Pro users will use Claude Code "lightly."

Q2. Do I need special configuration to use prompt caching?

On the API, you must explicitly mark cache_control blocks. It doesn't work by default. Integrated tools like Claude Code / Cursor often use it automatically internally, but if you're calling the API yourself, you must declare it. See Anthropic's official docs for details.

Q3. ChatGPT vs. Claude — which is more cost-efficient?

Depends on the use case. For long autonomous tasks and complex coding, Claude (especially with caching) often comes out cheaper. For short Q&A and terminal automation, GPT-5.5 mini is extremely cheap ($0.60 input). "Subscribe to both and pick the right tool" is also practical.

Q4. How do I judge "Haiku is enough"?

Run a three-step experiment. (1) Get it working on Opus. (2) Send the same prompt to Sonnet and compare quality. (3) If Sonnet looks comparable, try Haiku too. For many routine tasks, Haiku and Opus differ by an amount you don't notice. Reserve Opus for cases that genuinely need deep judgment or reasoning.

Q5. Should individual users hit the API directly?

It depends. For 2+ hours daily of interactive coding, the Max plan ($100/$200) is overwhelmingly easier. For embedding AI in your own app, batch processing, or automation, direct API is essential. Plenty of people do both.

Q6. What threshold should I set for billing alerts?

For an individual developer, a realistic setup is 1.5x your typical monthly spend for the first alert and 3x as the auto-stop. Example: if you usually spend $30/month, alert at $50 and stop at $100. Early on, run finer-grained alerts like $5/day to build intuition, then loosen.

Q7. We were told "the company AI budget has gotten too big." What should we do first?

Three things in order. (1) Look at per-user usage and check what % of total the top 5% consumes (often 50%+). (2) Interview the heavy users about their workflow and identify wasteful patterns. (3) Distribute an internal guide on "caching, model routing, output budget" company-wide and report monthly on progress. If you talk to your Anthropic / OpenAI Enterprise rep, you can also get a free optimization review.