On May 28, 2026, Anthropic released Claude Opus 4.8 — barely two months after Opus 4.7. The upgrade cadence is clearly accelerating. But the headline this time is not a few percentage points on a benchmark. The first thing Anthropic itself highlighted was "sharper judgement, more honesty about its progress, and the ability to work independently for longer than its predecessors." A release that leads with "it became more honest" before "it got smarter" is unusual.

Here is the bottom line: coding is solidly improved (SWE-bench Pro 64.3% → 69.2%), math jumps dramatically (USAMO 2026 from 69.3% to 96.7%), and long-context tracking nearly doubles (GraphWalks at 1M tokens 40.3% → 68.1%). On top of that, fast mode is roughly 2.5x faster and effectively one-third the price, and three developer-facing features land at once: the effort parameter, dynamic workflows, and system entries in the Messages API. At the same time, not everything got better — prompt-injection robustness actually regressed. This article breaks down the numbers, the new features, and the caveats, based on Anthropic's official announcement and system card.

ANTHROPIC · 2026-05-28 RELEASE

Claude Opus 4.8 at a glance

— a flagship that leads with "honesty" over raw smarts

CODING
69.2%
SWE-bench Pro
(4.7 was 64.3%)
MATH
96.7%
USAMO 2026
(4.7 was 69.3%)
FAST MODE
3x cheaper
~2.5x faster
$10 / $50 per Mtok
HONESTY
10x
less overconfidence
vs Opus 4.7

Standard pricing is held flat with 4.7 ($5 / $25 per Mtok), context stays at 1M tokens.
Model ID is claude-opus-4-8, available day one on Claude API, Bedrock, Vertex AI, and Microsoft Foundry.

* Figures in this article are based on Anthropic's official announcement, model page, and system card, plus reporting from multiple tech outlets (as of May 28, 2026). They may be updated as more verification comes in.

1. Opus 4.8 in three lines

For the busy reader, the essentials first.

  • Performance: coding is steadily stronger; math (USAMO) and long-context tracking (GraphWalks) improve dramatically. On the other hand, GPQA Diamond slips slightly, and multilingual tasks trail Gemini 3.1 Pro / GPT-5.5.
  • Pricing: standard is held flat with 4.7. The biggest economic impact is that fast mode is ~2.5x faster and effectively one-third the price.
  • Philosophy: "more honest" before "smarter." It is the first Claude to score 0% on uncritically reporting flawed results, and overconfidence is down 10x versus 4.7. New dynamic workflows and the effort parameter support longer autonomous work.

2. Core specs and availability

Let's start with the immovable facts: Opus 4.8's specs and where you can use it.

ItemDetail
Release dateMay 28, 2026 (about 2 months after 4.7)
API model IDclaude-opus-4-8
Context window1,000,000 tokens (same as 4.7)
Max output128,000 tokens per response
Standard pricing$5 input / $25 output (per 1M tokens, same as 4.7)
Cost reductionsUp to 90% off with prompt caching, 50% off with batch processing
Fast mode pricing$10 input / $50 output (per 1M tokens, ~2.5x faster)
AvailabilityClaude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry (day one)

The key point is that price and context are held flat, and only the substance got stronger. If you are on 4.7, swapping the model ID to claude-opus-4-8 gets you the performance gains at no extra cost (migration caveats are in section 9). Just note that US-only inference carries a 1.1x pricing multiplier.

3. Benchmarks head-to-head (4.8 vs 4.7)

We saw the specs. So how much did the actual capability grow? Here are the main published benchmarks lined up against 4.7. Bold marks the biggest gains.

BenchmarkClaude Opus 4.8Claude Opus 4.7Delta
SWE-bench Verified (real code fixes)88.6%87.6%+1.0
SWE-bench Pro (hard coding)69.2%64.3%+4.9
SWE-bench Multilingual84.4%80.5%+3.9
USAMO 2026 (math olympiad)96.7%69.3%+27.4
GraphWalks (1M-token long context, F1)68.1%40.3%+27.8
GPQA Diamond (graduate-level science)93.6%94.2%−0.6
Online-Mind2Web (browser use)84%

A note on reading the table. The +4.9 points on SWE-bench Pro looks modest but matters: Pro collects more realistic, harder coding tasks, so a gain there translates directly into "fewer moments where you get stuck in real work." But what really stands out are the +27-point leaps on USAMO and GraphWalks.

BIGGEST JUMPS

What the two leaps mean

USAMO 2026 · 69.3% → 96.7%
Near-perfect on US Math Olympiad problems — evidence of a big gain in carrying multi-step rigorous proofs all the way through without breaking. It pays off in complex algorithm design and formal reasoning.
GraphWalks 1M · 40.3% → 68.1%
The ability to correctly trace "what was written where" across a 1M-token context nearly doubles. That raises the reliability of feeding it an entire huge codebase or a long spec.

On top of that, CursorBench exceeds every prior Opus across all effort levels,
the Super-Agent benchmark saw it become the only model to complete every case end-to-end, and the Legal Agent benchmark recorded the first score above 10% on the all-pass standard.

That said, not everything rose. GPQA Diamond slipped from 94.2% to 93.6%. You could call it within the margin of error, but the fact that 4.7 is marginally ahead on "pure science-knowledge quizzes" is worth keeping in mind. More in section 9.

4. Pricing and fast mode — 3x cheaper speed

We've dwelt on performance, but the thing that actually hits your wallet hardest this time is the fast mode price change. Standard pricing is completely held flat with 4.7, so let's line the two up.

Standard mode (held flat)

  • Input: $5 / 1M tokens
  • Output: $25 / 1M tokens
  • Prompt caching: up to 90% off
  • Batch processing: 50% off

→ Not a cent different from 4.7. Zero switching cost.

Fast mode (big change)

  • Input: $10 / 1M tokens
  • Output: $50 / 1M tokens
  • Speed: about 2.5x standard
  • One-third the price of previous fast mode

→ "Fast = expensive" no longer holds. Great for chat UIs and bulk processing.

This is bigger than it looks. The dilemma of "I want speed, but fast mode is pricey" hit exactly the use cases — chat-UI responses, bulk code review in CI/CD, many-step agent runs — where you can now have both speed and price. Combined with the flat standard pricing, the economic takeaway this time is "the same budget, but faster and smarter." For the full pricing picture, see Claude Opus / Sonnet / Haiku pricing comparison.

5. New feature #1: the effort parameter and adaptive thinking

After pricing, the features developers touch directly. First, the effort parameter. This is a knob that lets you explicitly specify "how deeply to think" across four levels.

EFFORT PARAMETER

Choose thinking depth in four levels

LOW · speed first
Fastest responses and lower rate-limit consumption. For simple classification, extraction, short replies.
HIGH · default (recommended)
Anthropic's recommended balance. Roughly the same token count as 4.7's default, but higher performance. When in doubt, use this.
XHIGH · hard / async tasks
Recommended for difficult tasks and async workflows — when you want it to think things over.
MAX · quality first
Maximizes token depth. For quality-over-cost critical work.

The crux: the default HIGH uses roughly the same token count as 4.7's default, with only the performance going up.
In other words, even with no setting at all, you get better results at the same cost.

The counterpart to effort is adaptive thinking: the model automatically adjusts the compute it uses according to task complexity. Quick on simple questions, deeper of its own accord on hard ones. You set the ceiling and the policy with effort, and adaptive thinking optimizes the actual allocation — a two-tier design that delivers "no wasted thinking tokens, deep only where it counts."

6. New feature #2: dynamic workflows (research preview)

The most ambitious feature this time is this one. Dynamic workflows is a research-preview feature usable in Claude Code (CLI, Desktop, VS Code extension), a mechanism for handing Claude a "big job" wholesale.

Concretely, Claude writes its own orchestration scripts and spawns tens to hundreds of parallel subagents to attack a problem concurrently. It even deploys adversarial verification agents to critically check the results, and iterates until convergence. It coordinates outside the main conversation thread, and its state is resumable, holding up across multi-day execution.

What it's good for

The intended use cases are codebase-wide bug hunts, large-scale migrations, security audits, and critical verification tasks — the kind of work that "would take a team of humans several days."

Availability: Max, Team, and Enterprise plans (admin-enabled), plus via the API, Bedrock, Vertex, and Foundry. For safety it requires explicit confirmation on first trigger. As a research preview, behavior may change.

In positioning, it is a step toward having the model itself design and run, on the spot, the "parallel orchestration of many agents" you previously had to build yourself with the Claude Agent SDK. For large refactors and cross-cutting investigations, the range it can drive without step-by-step human direction expands.

7. New feature #3: system entries in the Messages API

A subtle change, but a welcome one for developers: the Messages API now accepts system entries inside the messages array.

Previously, the system prompt (system instructions) was placed once at the start of the conversation. With this change, you can inject system instructions mid-conversation — and do so without breaking the prompt cache or requiring a user turn.

// Example: updating "permissions, budget, environment" mid-workflow
messages: [
  { role: "system",    content: "You are a CI agent. No destructive operations." },
  { role: "user",      content: "Update the dependencies" },
  { role: "assistant", content: "..." },
  // Update policy mid-run (without breaking the cache)
  { role: "system",    content: "Token budget is low. Use effort=low, key points only." },
  { role: "user",      content: "Continue" }
]

This pays off in long, multi-step agent runs. "Dynamically swapping policy" mid-execution — tightening permissions, signaling token budget, updating environment context (which branch you're on, etc.) — now works while preserving cache efficiency. It's a design that pairs well with long-haul autonomous runs like dynamic workflows.

8. The biggest leap is honesty — 10x less overconfidence

This is the part I most want to convey. Opus 4.8's true differentiator is not the benchmark numbers — it is "honesty about its own work." What Anthropic and testers stressed repeatedly is that this model proactively flags its own uncertainty and is less likely to make unsupported claims.

HONESTY METRICS

Honesty in numbers

0%
uncritical flawed-result reporting
Reporting a wrong result as "done." First Claude to score perfect.
3.7%
misses on important events
How often it fails to raise events it should report. Sharply lower.
10x+
drop in overconfidence
Unfounded overconfidence is more than 10x lower vs 4.7.

On top of that, the rate of letting flaws in its own code pass unremarked is about one-quarter of 4.7's.
It stopped "pretending it works" — and that is decisive for agent operation.

Why does this matter? The biggest risk in letting an AI agent run autonomously for a long time is "reporting a failure as a success, then stacking more work on top of that error." Saying "fixed" while tests are still failing; stating uncertain guesses in a confident tone — this kind of "overconfidence" undermines automation reliability at the root. That Opus 4.8 now flags its uncertainty on its own is, in practical terms, more valuable than a few benchmark points. Personally, I think this single point is the most praiseworthy thing about this update.

9. Caveats and regressions (told honestly)

We've looked at the gains. But since this is an article praising "honesty," I'll be honest too — here, undisguised, are the points that regressed or warrant caution in 4.8.

CaveatDetailHow to handle it
Lower prompt-injection robustnessIn Gray Swan red-teaming, attack success rose from 6.0% (4.7) to 9.6% (4.8)For agents that handle external input, harden input sanitization and privilege separation. Revisit your permission design
Slight GPQA Diamond dip94.2% → 93.6% (−0.6). On pure science-knowledge quizzes, 4.7 is marginally aheadWithin margin of error. A/B test on your real tasks if it matters
Not the leader on multilingualMultilingual tasks trail Gemini 3.1 Pro / GPT-5.5If multilingual is your battleground, consider pairing with / comparing other models
Dynamic workflows is a research previewBehavior may change. Fully depending on it for critical production work is prematureValidate on non-critical work before adopting

The drop in prompt-injection robustness in particular cannot be overlooked. Attack success rising about 1.6x means that for agents that read external input (web pages, email, user posts) and act autonomously, simply moving to 4.8 can make them relatively weaker on security in some scenarios. Getting smarter does not mean beating 4.7 on every axis of safety — understand this asymmetry correctly.

10. Who should upgrade right now

So, should you switch to claude-opus-4-8 right now? Let's break it down by type.

✅ Upgrade now

  • Coding / agent operation is your main use
  • You want to delegate long autonomous tasks
  • You use fast mode heavily (now 3x cheaper)
  • You work with huge codebases / long contexts
  • "Overconfident misreporting" would be fatal in your setting

⚠ Consider carefully

  • Public agents handling external input (lower injection robustness)
  • Multilingual processing is your battleground (others may lead)
  • Pure scientific QA is central (slight GPQA dip)
  • Putting dynamic workflows straight into critical production

Since the switching cost itself is near-zero (just change the model ID; standard pricing held flat), the royal road is to first switch to claude-opus-4-8 in a non-critical environment and measure on your own tasks. The concrete migration steps from 4.7 carry over directly from the thinking in the Opus 4.7 migration guide. For comparison with GPT-5.5 and others, see GPT-5.5 vs Claude Opus comparison.

Summary

Claude Opus 4.8 (released May 28, 2026, claude-opus-4-8) is a flagship that strengthened the substance while holding price and context flat. Coding improved steadily (SWE-bench Pro +4.9); math (USAMO 96.7%) and long-context tracking (GraphWalks 68.1%) improved dramatically. Fast mode became ~2.5x faster and effectively one-third the price, and the practical features — the effort parameter, dynamic workflows, and system entries in the Messages API — all arrived together.

But the essence is not the numbers. A 0% rate of passing flaws uncritically, overconfidence down more than 10x — this release, leading with "honesty" over "smarts," points in the right direction for an era of long-running autonomous AI. At the same time, prompt-injection robustness actually regressed; it does not beat the old model on every axis. Which is why — fittingly, in the spirit of this very model's virtue — the smartest way to engage is to not be overconfident, and to measure on your own tasks before deciding.

Related reading: Claude Opus 4.7 release breakdown, Opus 4.7 migration guide, Opus / Sonnet / Haiku pricing comparison, GPT-5.5 vs Claude Opus comparison, and What is the Claude Agent SDK.

FAQ

Q. Is migrating from Opus 4.7 to 4.8 hard?
A. It takes almost nothing. Just change the API model ID to claude-opus-4-8; standard pricing and the context window (1M tokens) are held flat. The default effort=HIGH uses roughly the same token count as 4.7's default with only performance going up, so you benefit with no config changes. Just watch the injection-robustness drop (below) for agents that handle external input.

Q. What does "3x cheaper" fast mode mean?
A. It means fast mode's price ($10 input / $50 output per 1M tokens) is effectively one-third that of the previous model's fast mode. Speed is about 2.5x standard. The "I want speed but fast mode is pricey" dilemma is greatly eased, making it easier to use for chat UIs and bulk batch processing.

Q. Can anyone use dynamic workflows?
A. It is in research preview, usable from Claude Code (CLI, Desktop, VS Code extension). Availability is on Max, Team, and Enterprise plans (admin-enabled) and via the API, Bedrock, Vertex, and Foundry. For safety, the first trigger requires explicit confirmation. Behavior may change, so it's safest to try it on non-critical work first.

Q. Is 4.8 better than 4.7 in every respect?
A. No. GPQA Diamond slipped slightly (94.2% → 93.6%), multilingual tasks trail Gemini 3.1 Pro / GPT-5.5, and prompt-injection robustness actually worsened (attack success 6.0% → 9.6%). It is clearly ahead on coding, math, long context, and honesty, but for some uses 4.7 or other models may fit better.

Q. What's the concrete benefit of higher "honesty"?
A. When running AI agents autonomously, the biggest risk is "misreporting a failure as success and stacking work on top of it." Because 4.8 dropped uncritical flawed-result reporting to 0% and cut overconfidence by more than 10x, it stops "pretending it works" and says it's uncertain when it is. For long-running automation, CI, and code review, reliability improves at a practical level.