Table of Contents
- 1. Opus 4.8 in three lines
- 2. Core specs and availability
- 3. Benchmarks head-to-head (4.8 vs 4.7)
- 4. Pricing and fast mode — 3x cheaper speed
- 5. New feature #1: the effort parameter and adaptive thinking
- 6. New feature #2: dynamic workflows (research preview)
- 7. New feature #3: system entries in the Messages API
- 8. The biggest leap is honesty — 10x less overconfidence
- 9. Caveats and regressions (told honestly)
- 10. Who should upgrade right now
- Summary
- FAQ
On May 28, 2026, Anthropic released Claude Opus 4.8 — barely two months after Opus 4.7. The upgrade cadence is clearly accelerating. But the headline this time is not a few percentage points on a benchmark. The first thing Anthropic itself highlighted was "sharper judgement, more honesty about its progress, and the ability to work independently for longer than its predecessors." A release that leads with "it became more honest" before "it got smarter" is unusual.
Here is the bottom line: coding is solidly improved (SWE-bench Pro 64.3% → 69.2%), math jumps dramatically (USAMO 2026 from 69.3% to 96.7%), and long-context tracking rises by about 1.7x (GraphWalks at 1M tokens 40.3% → 68.1%). On top of that, fast mode is roughly 2.5x faster than standard at effectively one-third the price of the previous fast mode, and three developer-facing features land at once: the effort parameter, dynamic workflows, and system entries in the Messages API. At the same time, not everything got better: prompt-injection robustness actually regressed. This article breaks down the numbers, the new features, and the caveats, based on Anthropic's official announcement and system card.
Claude Opus 4.8 at a glance: a flagship that leads with "honesty" over raw smarts
- SWE-bench Pro: 69.2% (4.7 was 64.3%)
- USAMO 2026: 96.7% (4.7 was 69.3%)
- Fast mode: $10 / $50 per Mtok
Standard pricing is held flat with 4.7 ($5 / $25 per Mtok), context stays at 1M tokens.
Model ID is claude-opus-4-8, available day one on Claude API, Bedrock, Vertex AI, and Microsoft Foundry.
* Figures in this article are based on Anthropic's official announcement, model page, and system card, plus reporting from multiple tech outlets (as of May 28, 2026). They may be updated as more verification comes in.
1. Opus 4.8 in three lines
For the busy reader, the essentials first.
- Performance: coding is steadily stronger; math (USAMO) and long-context tracking (GraphWalks) improve dramatically. On the other hand, GPQA Diamond slips slightly, and multilingual tasks trail Gemini 3.1 Pro / GPT-5.5.
- Pricing: standard is held flat with 4.7. The biggest economic impact is that fast mode is ~2.5x faster than standard and effectively one-third the price of the previous fast mode.
- Philosophy: "more honest" before "smarter." It is the first Claude to score 0% on uncritically reporting flawed results, and overconfidence is down 10x versus 4.7. New dynamic workflows and the effort parameter support longer autonomous work.
2. Core specs and availability
Let's start with the hard facts: Opus 4.8's specs and where you can use it.
| Item | Detail |
|---|---|
| Release date | May 28, 2026 (about 2 months after 4.7) |
| API model ID | claude-opus-4-8 |
| Context window | 1,000,000 tokens (same as 4.7) |
| Max output | 128,000 tokens per response |
| Standard pricing | $5 input / $25 output (per 1M tokens, same as 4.7) |
| Cost reductions | Up to 90% off with prompt caching, 50% off with batch processing |
| Fast mode pricing | $10 input / $50 output (per 1M tokens, ~2.5x faster) |
| Availability | Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry (day one) |
The key point is that price and context are held flat, and only the substance got stronger. If you are on 4.7, swapping the model ID to claude-opus-4-8 gets you the performance gains at no extra cost (migration caveats are in section 9). Just note that US-only inference carries a 1.1x pricing multiplier.
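The pricing rows above can be turned into a rough cost estimate. This is a minimal sketch, not an official calculator: the function and field names are hypothetical, and it assumes that cached input gets the full 90% discount and that the batch discount and the US-only multiplier stack multiplicatively, neither of which the article confirms.

```typescript
// Standard Opus 4.8 list prices quoted in the table above (USD per 1M tokens).
const STANDARD = { inputPerMTok: 5, outputPerMTok: 25 };

interface Usage {
  inputTokens: number;       // uncached input tokens
  cachedInputTokens: number; // input tokens served from the prompt cache
  outputTokens: number;
}

interface CostOptions {
  batch?: boolean;  // batch processing: 50% off
  usOnly?: boolean; // US-only inference: 1.1x multiplier
}

// Assumption: cached reads get the full "up to 90%" discount,
// and the batch discount and US-only multiplier simply multiply.
function estimateCostUsd(usage: Usage, opts: CostOptions = {}): number {
  const input = (usage.inputTokens / 1e6) * STANDARD.inputPerMTok;
  const cached = (usage.cachedInputTokens / 1e6) * STANDARD.inputPerMTok * 0.1;
  const output = (usage.outputTokens / 1e6) * STANDARD.outputPerMTok;
  let total = input + cached + output;
  if (opts.batch) total *= 0.5;
  if (opts.usOnly) total *= 1.1;
  return total;
}
```

For example, 1M uncached input tokens plus 1M output tokens comes to $30 at standard rates under these assumptions, and the same job run as a batch comes to $15.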
3. Benchmarks head-to-head (4.8 vs 4.7)
We saw the specs. So how much did the actual capability grow? Here are the main published benchmarks lined up against 4.7. Bold marks the biggest gains.
| Benchmark | Claude Opus 4.8 | Claude Opus 4.7 | Delta |
|---|---|---|---|
| SWE-bench Verified (real code fixes) | 88.6% | 87.6% | +1.0 |
| SWE-bench Pro (hard coding) | 69.2% | 64.3% | +4.9 |
| SWE-bench Multilingual | 84.4% | 80.5% | +3.9 |
| **USAMO 2026 (math olympiad)** | **96.7%** | 69.3% | **+27.4** |
| **GraphWalks (1M-token long context, F1)** | **68.1%** | 40.3% | **+27.8** |
| GPQA Diamond (graduate-level science) | 93.6% | 94.2% | −0.6 |
| Online-Mind2Web (browser use) | 84% | — | — |
A note on reading the table. The +4.9 points on SWE-bench Pro looks modest but matters: Pro collects more realistic, harder coding tasks, so a gain there translates directly into "fewer moments where you get stuck in real work." But what really stands out are the +27-point leaps on USAMO and GraphWalks.
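One way to put the deltas in perspective is to ask how much of the remaining headroom to 100% each gain closed. The sketch below recomputes the point deltas from the table and adds that ratio. The "headroom closed" metric is this example's own framing, not something Anthropic reports, and for GraphWalks (an F1 score) it is only a loose analogy.

```typescript
// Scores copied from the table above (percent).
const scores = [
  { name: "SWE-bench Pro", v48: 69.2, v47: 64.3 },
  { name: "USAMO 2026", v48: 96.7, v47: 69.3 },
  { name: "GraphWalks (1M tokens)", v48: 68.1, v47: 40.3 },
];

const summary = scores.map(({ name, v48, v47 }) => ({
  name,
  // Absolute gain in percentage points, rounded to one decimal.
  deltaPoints: Number((v48 - v47).toFixed(1)),
  // Share of the gap between the 4.7 score and 100% that 4.8 closed.
  headroomClosedPct: Math.round(((v48 - v47) / (100 - v47)) * 100),
}));

// summary[0] → SWE-bench Pro:  +4.9 points, 14% of headroom closed
// summary[1] → USAMO 2026:    +27.4 points, 89% of headroom closed
// summary[2] → GraphWalks:    +27.8 points, 47% of headroom closed
```

Read this way, the SWE-bench Pro gain removes roughly one in seven of the remaining failures, while USAMO closes most of the gap in a single release.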
Beyond those two leaps, the CursorBench score exceeds every prior Opus across all effort levels, Opus 4.8 became the only model to complete every case end-to-end on the Super-Agent benchmark, and the Legal Agent benchmark recorded the first score above 10% on the all-pass standard.
That said, not everything rose. GPQA Diamond slipped from 94.2% to 93.6%. You could call it within the margin of error, but the fact that 4.7 is marginally ahead on "pure science-knowledge quizzes" is worth keeping in mind. More in section 9.
4. Pricing and fast mode — 3x cheaper speed
We've dwelt on performance, but the thing that actually hits your wallet hardest this time is the fast mode price change. Standard pricing is completely held flat with 4.7, so let's line the two up.
Standard mode (held flat)
- Input: $5 / 1M tokens
- Output: $25 / 1M tokens
- Prompt caching: up to 90% off
- Batch processing: 50% off
→ Not a cent different from 4.7. Zero switching cost.
Fast mode (big change)
- Input: $10 / 1M tokens
- Output: $50 / 1M tokens
- Speed: about 2.5x standard
- One-third the price of previous fast mode
→ "Fast = expensive" no longer holds. Great for chat UIs and bulk processing.
This is bigger than it looks. The old dilemma, "I want speed, but fast mode is pricey," hit hardest in chat-UI responses, bulk code review in CI/CD, and many-step agent runs. Those are exactly the use cases where you can now have both speed and a reasonable price. Combined with the flat standard pricing, the economic takeaway this time is "the same budget, but faster and smarter." For the full pricing picture, see Claude Opus / Sonnet / Haiku pricing comparison.
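To make the comparison concrete, here is a small sketch that prices one example workload in standard mode, in the new fast mode, and at the previous fast-mode rate. The previous rate is not quoted in this article; it is derived here from the "one-third the price" claim, so treat it as an implied figure. The workload size is arbitrary.

```typescript
interface Price {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

const standard: Price = { inputPerMTok: 5, outputPerMTok: 25 };
const fast: Price = { inputPerMTok: 10, outputPerMTok: 50 };
// Assumption: implied by "one-third the price of previous fast mode".
const previousFastImplied: Price = {
  inputPerMTok: fast.inputPerMTok * 3,
  outputPerMTok: fast.outputPerMTok * 3,
};

function cost(p: Price, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * p.inputPerMTok + (outputTokens / 1e6) * p.outputPerMTok;
}

// Example workload: 2M input tokens and 0.5M output tokens.
const standardCost = cost(standard, 2e6, 0.5e6);                 // 22.5
const fastCost = cost(fast, 2e6, 0.5e6);                         // 45
const previousFastCost = cost(previousFastImplied, 2e6, 0.5e6);  // 135
```

Under these numbers, fast mode costs twice the standard rate for about 2.5x the speed, which is the trade-off the section describes.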
5. New feature #1: the effort parameter and adaptive thinking
After pricing, the features developers touch directly. First, the effort parameter. This is a knob that lets you explicitly specify "how deeply to think" across four levels.
The crux: the default level, HIGH, uses roughly the same token count as 4.7's default, with only the performance going up. In other words, even with no setting at all, you get better results at the same cost.
The counterpart to effort is adaptive thinking: the model automatically adjusts the compute it uses according to task complexity. Quick on simple questions, deeper of its own accord on hard ones. You set the ceiling and the policy with effort, and adaptive thinking optimizes the actual allocation — a two-tier design that delivers "no wasted thinking tokens, deep only where it counts."
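As a rough illustration of setting effort per task, the sketch below picks a level and builds a request body. Everything about the request shape is an assumption: the top-level `effort` field, its lowercase values, and the `pickEffort` policy are hypothetical, and only the two levels named in this article ("low" and the default "high") are modeled.

```typescript
// Only "low" and "high" appear in this article, so the other two levels are
// left out here rather than guessed. Lowercase spelling is an assumption.
type Effort = "low" | "high";

interface Task {
  description: string;
  latencySensitive: boolean; // e.g. a chat reply a user is waiting on
  multiStep: boolean;        // e.g. a refactor or a long agent run
}

// Hypothetical policy: drop to "low" only for quick, single-step work.
function pickEffort(task: Task): Effort {
  return task.latencySensitive && !task.multiStep ? "low" : "high";
}

// Hypothetical request body. The top-level `effort` field is an assumption
// for illustration; check the official API reference for the real shape.
function buildRequest(task: Task) {
  return {
    model: "claude-opus-4-8",
    max_tokens: 4096,
    effort: pickEffort(task),
    messages: [{ role: "user", content: task.description }],
  };
}
```

The split mirrors the two-tier idea: the caller sets the policy through effort, and adaptive thinking is left to decide how much of that budget a given request actually needs.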
6. New feature #2: dynamic workflows (research preview)
This is the most ambitious feature of the release. Dynamic workflows is a research-preview feature in Claude Code (CLI, Desktop, VS Code extension) for handing Claude a "big job" wholesale.
Concretely, Claude writes its own orchestration scripts and spawns tens to hundreds of parallel subagents to attack a problem concurrently. It even deploys adversarial verification agents to critically check the results, and iterates until convergence. It coordinates outside the main conversation thread, and its state is resumable, holding up across multi-day execution.
What it's good for
The intended use cases are codebase-wide bug hunts, large-scale migrations, security audits, and critical verification tasks — the kind of work that "would take a team of humans several days."
Availability: Max, Team, and Enterprise plans (admin-enabled), plus via the API, Bedrock, Vertex, and Foundry. For safety it requires explicit confirmation on first trigger. As a research preview, behavior may change.
In terms of positioning, this is a step toward having the model itself design and run, on the spot, the "parallel orchestration of many agents" that you previously had to build yourself with the Claude Agent SDK. For large refactors and cross-cutting investigations, it widens the range of work Claude can drive without step-by-step human direction.
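The pattern described in this section (fan out to parallel subagents, verify adversarially, repeat until results stop changing) can be sketched in a few lines. This is an illustration of the general pattern only, not how Claude Code implements dynamic workflows. The subagent and verifier are stubs that look for "TODO" markers, and the convergence rule is an assumption chosen for simplicity.

```typescript
interface Finding { file: string; issue: string }
type Sources = Record<string, string>;

// Stub worker: a real subagent would call a model. This one flags files that contain "TODO".
async function subagent(file: string, source: string): Promise<Finding[]> {
  return source.indexOf("TODO") !== -1 ? [{ file, issue: "unfinished TODO" }] : [];
}

// Stub adversarial verifier: independently re-checks a finding against the source
// and rejects anything it cannot reproduce.
async function verify(finding: Finding, sources: Sources): Promise<boolean> {
  const source = sources[finding.file] || "";
  return source.indexOf("TODO") !== -1;
}

async function runWorkflow(sources: Sources, maxRounds = 3): Promise<Finding[]> {
  let confirmed: Finding[] = [];
  for (let round = 0; round < maxRounds; round++) {
    // Fan out: one subagent per file, all running concurrently.
    const batches = await Promise.all(
      Object.keys(sources).map((file) => subagent(file, sources[file]))
    );
    const candidates = ([] as Finding[]).concat(...batches);
    // Check every candidate with the verifier, also concurrently.
    const verdicts = await Promise.all(candidates.map((c) => verify(c, sources)));
    const next = candidates.filter((_, i) => verdicts[i]);
    // Assumed convergence rule: stop once a round confirms as many findings as the last.
    const converged = round > 0 && next.length === confirmed.length;
    confirmed = next;
    if (converged) break;
  }
  return confirmed;
}
```

Calling `runWorkflow({ "a.ts": "// TODO: fix", "b.ts": "const x = 1;" })` resolves to a single confirmed finding for `a.ts`. A production version would also need persistence for resumable state, which this sketch leaves out.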
7. New feature #3: system entries in the Messages API
A subtle change, but a welcome one for developers: the Messages API now accepts system entries inside the messages array.
Previously, the system prompt (system instructions) was placed once at the start of the conversation. With this change, you can inject system instructions mid-conversation — and do so without breaking the prompt cache or requiring a user turn.
```typescript
// Example: updating "permissions, budget, environment" mid-workflow
const messages = [
  { role: "system", content: "You are a CI agent. No destructive operations." },
  { role: "user", content: "Update the dependencies" },
  { role: "assistant", content: "..." },
  // Update policy mid-run (without breaking the cache)
  { role: "system", content: "Token budget is low. Use effort=low, key points only." },
  { role: "user", content: "Continue" },
];
```
This pays off in long, multi-step agent runs. "Dynamically swapping policy" mid-execution — tightening permissions, signaling token budget, updating environment context (which branch you're on, etc.) — now works while preserving cache efficiency. It's a design that pairs well with long-haul autonomous runs like dynamic workflows.
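A small helper makes the "append, never rewrite" discipline explicit. This is a sketch with hypothetical names (`Message`, `withPolicyUpdate`). It assumes the cache benefit comes from leaving the earlier messages byte-for-byte unchanged, which is how prefix-based prompt caching generally works.

```typescript
type Role = "system" | "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

// Append a mid-conversation system entry without touching earlier messages.
// Assumption: the earlier prefix must stay identical for cached content to be reused.
function withPolicyUpdate(history: readonly Message[], policy: string): Message[] {
  return [...history, { role: "system", content: policy }];
}

const history: Message[] = [
  { role: "system", content: "You are a CI agent. No destructive operations." },
  { role: "user", content: "Update the dependencies" },
  { role: "assistant", content: "..." },
];

// The original array is left as-is; the update is a new array with one extra entry.
const updated = withPolicyUpdate(history, "Token budget is low. Key points only.");
```

Returning a new array rather than mutating `history` also makes it easy to log exactly which policy was in force at each step of a long run.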
8. The biggest leap is honesty — 10x less overconfidence
This is the part I most want to convey. Opus 4.8's true differentiator is not the benchmark numbers — it is "honesty about its own work." What Anthropic and testers stressed repeatedly is that this model proactively flags its own uncertainty and is less likely to make unsupported claims.
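Honesty in numbers: Opus 4.8 is the first Claude to score 0% on uncritically reporting flawed results, and overconfidence is down 10x versus 4.7. On top of that, the rate of letting flaws in its own code pass unremarked is about one-quarter of 4.7's.
It stopped "pretending it works," and that is decisive for agent operation.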
Why does this matter? The biggest risk in letting an AI agent run autonomously for a long time is "reporting a failure as a success, then stacking more work on top of that error." Saying "fixed" while tests are still failing; stating uncertain guesses in a confident tone — this kind of "overconfidence" undermines automation reliability at the root. That Opus 4.8 now flags its uncertainty on its own is, in practical terms, more valuable than a few benchmark points. Personally, I think this single point is the most praiseworthy thing about this update.
9. Caveats and regressions (told honestly)
We've looked at the gains. But since this is an article praising "honesty," I'll be honest too — here, undisguised, are the points that regressed or warrant caution in 4.8.
| Caveat | Detail | How to handle it |
|---|---|---|
| Lower prompt-injection robustness | In Gray Swan red-teaming, attack success rose from 6.0% (4.7) to 9.6% (4.8) | For agents that handle external input, harden input sanitization and privilege separation. Revisit your permission design |
| Slight GPQA Diamond dip | 94.2% → 93.6% (−0.6). On pure science-knowledge quizzes, 4.7 is marginally ahead | Within margin of error. A/B test on your real tasks if it matters |
| Not the leader on multilingual | Multilingual tasks trail Gemini 3.1 Pro / GPT-5.5 | If multilingual is your battleground, consider pairing with / comparing other models |
| Dynamic workflows is a research preview | Behavior may change. Fully depending on it for critical production work is premature | Validate on non-critical work before adopting |
The drop in prompt-injection robustness in particular cannot be overlooked. Attack success rising about 1.6x means that for agents that read external input (web pages, email, user posts) and act autonomously, simply moving to 4.8 can make them relatively weaker on security in some scenarios. Getting smarter does not mean beating 4.7 on every axis of safety — understand this asymmetry correctly.
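One common form of privilege separation is to withhold side-effecting tools whenever untrusted content is in the context. The sketch below shows that idea under stated assumptions: the tool names, the source labels, and the all-or-nothing policy are hypothetical choices for illustration, not a feature of the Claude API.

```typescript
type Tool = "read_file" | "search_web" | "send_email" | "run_shell";

// Hypothetical policy: tools with side effects are withheld once
// untrusted input has entered the context.
const SIDE_EFFECT_TOOLS: Tool[] = ["send_email", "run_shell"];

interface ContextItem {
  source: "operator" | "web" | "email" | "user_post";
  text: string;
}

// External input (web pages, email, user posts) is treated as untrusted.
function isUntrusted(item: ContextItem): boolean {
  return item.source !== "operator";
}

function allowedTools(context: ContextItem[], requested: Tool[]): Tool[] {
  const tainted = context.some(isUntrusted);
  return tainted
    ? requested.filter((tool) => SIDE_EFFECT_TOOLS.indexOf(tool) === -1)
    : requested.slice();
}
```

With this gate, an agent that has just read a web page can still search and read files, but cannot send email or run shell commands until a clean context is started. Finer-grained policies are possible; this is the bluntest version.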
10. Who should upgrade right now
So, should you switch to claude-opus-4-8 right now? Let's break it down by type.
✅ Upgrade now
- Coding / agent operation is your main use
- You want to delegate long autonomous tasks
- You use fast mode heavily (now 3x cheaper)
- You work with huge codebases / long contexts
- "Overconfident misreporting" would be fatal in your setting
⚠ Consider carefully
- Public agents handling external input (lower injection robustness)
- Multilingual processing is your battleground (others may lead)
- Pure scientific QA is central (slight GPQA dip)
- Putting dynamic workflows straight into critical production
Since the switching cost itself is near-zero (just change the model ID; standard pricing held flat), the sensible path is to first switch to claude-opus-4-8 in a non-critical environment and measure on your own tasks. The concrete migration steps follow the same approach as the Opus 4.7 migration guide. For comparison with GPT-5.5 and others, see GPT-5.5 vs Claude Opus comparison.
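"Measure on your own tasks" can be as simple as running the same task set against both models and comparing pass rates and regressions. The sketch below assumes you have already collected pass/fail results per task; the `TaskResult` shape and function names are hypothetical, and how a task is judged as passed is left to your own harness.

```typescript
interface TaskResult {
  taskId: string;
  passed: boolean;
}

function passRate(results: TaskResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

// Task IDs that passed on the baseline (4.7) but fail on the candidate (4.8).
function regressions(baseline: TaskResult[], candidate: TaskResult[]): string[] {
  const candidatePassed: Record<string, boolean> = {};
  for (const r of candidate) candidatePassed[r.taskId] = r.passed;
  return baseline
    .filter((r) => r.passed && candidatePassed[r.taskId] === false)
    .map((r) => r.taskId);
}

function compare(baseline: TaskResult[], candidate: TaskResult[]) {
  return {
    baselineRate: passRate(baseline),
    candidateRate: passRate(candidate),
    regressions: regressions(baseline, candidate),
  };
}
```

Looking at the regression list, not just the aggregate rate, matters here: two models can tie on pass rate while failing different tasks, and the tasks that newly fail are the ones to inspect before switching production traffic.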
Summary
Claude Opus 4.8 (released May 28, 2026, claude-opus-4-8) is a flagship that strengthened the substance while holding price and context flat. Coding improved steadily (SWE-bench Pro +4.9); math (USAMO 96.7%) and long-context tracking (GraphWalks 68.1%) improved dramatically. Fast mode became ~2.5x faster and effectively one-third the price, and the practical features — the effort parameter, dynamic workflows, and system entries in the Messages API — all arrived together.
But the essence is not the numbers. A 0% rate of passing flaws uncritically, overconfidence down more than 10x — this release, leading with "honesty" over "smarts," points in the right direction for an era of long-running autonomous AI. At the same time, prompt-injection robustness actually regressed; it does not beat the old model on every axis. Which is why — fittingly, in the spirit of this very model's virtue — the smartest way to engage is to not be overconfident, and to measure on your own tasks before deciding.
Related reading: Claude Opus 4.7 release breakdown, Opus 4.7 migration guide, Opus / Sonnet / Haiku pricing comparison, GPT-5.5 vs Claude Opus comparison, and What is the Claude Agent SDK.
FAQ
Q. Is migrating from Opus 4.7 to 4.8 hard?
A. It takes almost nothing. Just change the API model ID to claude-opus-4-8; standard pricing and the context window (1M tokens) are held flat. The default effort=HIGH uses roughly the same token count as 4.7's default with only performance going up, so you benefit with no config changes. Just watch the injection-robustness drop (below) for agents that handle external input.
Q. What does "3x cheaper" fast mode mean?
A. It means fast mode's price ($10 input / $50 output per 1M tokens) is effectively one-third that of the previous model's fast mode. Speed is about 2.5x standard. The "I want speed but fast mode is pricey" dilemma is greatly eased, making it easier to use for chat UIs and bulk batch processing.
Q. Can anyone use dynamic workflows?
A. It is in research preview, usable from Claude Code (CLI, Desktop, VS Code extension). Availability is on Max, Team, and Enterprise plans (admin-enabled) and via the API, Bedrock, Vertex, and Foundry. For safety, the first trigger requires explicit confirmation. Behavior may change, so it's safest to try it on non-critical work first.
Q. Is 4.8 better than 4.7 in every respect?
A. No. GPQA Diamond slipped slightly (94.2% → 93.6%), multilingual tasks trail Gemini 3.1 Pro / GPT-5.5, and prompt-injection robustness actually worsened (attack success 6.0% → 9.6%). It is clearly ahead on coding, math, long context, and honesty, but for some uses 4.7 or other models may fit better.
Q. What's the concrete benefit of higher "honesty"?
A. When running AI agents autonomously, the biggest risk is "misreporting a failure as success and stacking work on top of it." Because 4.8 dropped uncritical flawed-result reporting to 0% and cut overconfidence by more than 10x, it stops "pretending it works" and says it's uncertain when it is. For long-running automation, CI, and code review, reliability improves at a practical level.