Claude Fable 5, released on June 9, 2026, is Anthropic's first publicly available "Mythos-class" model. The full release coverage lives in a separate article; here we focus on coding alone and dig into what actually changed, and by how much.

The short version: Fable 5 pulls further ahead as coding tasks get harder. It posts 95.0% on SWE-bench Verified and 80.3% on the tougher SWE-bench Pro, a clear step ahead of any other publicly available model. But it also costs roughly 2x as much as Opus 4.8 and has real-world quirks, most notably misjudging when to stop on long tasks. So what really matters is knowing when to reach for Fable 5 and when Opus 4.8 is enough. Let's go through it, from reading the benchmarks to practical routing.

Claude Fable 5 · CODING PERFORMANCE

The agentic-coding podium: SWE-bench Pro (real-repo bug fixes, vendor-reported)

🥇 Fable 5: 80.3%
🥈 Opus 4.8: 69.2%
🥉 GPT-5.5: 58.6%

At a glance: SWE-bench Verified 95.0% · lead grows on hard tasks · ~2x the price of Opus

* Benchmark figures and pricing in this article are quoted from Anthropic and third-party reports (as of June 2026). Scores shift with the evaluation scaffold and data splits, so cross-model comparison needs care. Read them as directional.

1. What changed for coding? Three key points

Before the detailed benchmarks, let's compress the developer's-eye view into three points. This is the character of Fable 5's coding.

🏔️ ① Strongest on hard problems

Big multi-file refactors, long autonomous agent runs, complex migrations: the longer and more complex the task, the bigger the gap. On easy work it is roughly on par with the other models.

⚡ ② Finishes in fewer turns

Reaches high-quality implementations in fewer round-trips than prior models. It can drive a multi-step Claude Code workflow in one go.

💸 ③ But pricey, and won't stop

Roughly 2x the price of Opus 4.8. It also tends to keep running on long tasks because it misjudges when to stop, so cost control is essential.

In one line: a serious partner for heavy work — but thirsty on fuel. Keep that character in mind and the "when to use which" section later clicks into place.

2. The benchmarks

Here are Fable 5, Opus 4.8, and GPT-5.5 on the main coding benchmarks. The figures are vendor-reported and move with the evaluation scaffold — keep that in mind.

Benchmark                                          Fable 5   Opus 4.8   GPT-5.5
SWE-bench Verified (real bug fixes, standard)      95.0%     88.6%      —
SWE-bench Pro (harder real-world tasks)            80.3%     69.2%      58.6%
FrontierCode Diamond (hardest production coding)   29.3%     13.4%      5.7%
Terminal-Bench 2.1 (terminal-driven work)          84.3%     82.7%      83.4%

Source: Anthropic announcements and third-party benchmark reports (June 2026). "—" means no comparable figure under the same conditions was found. Scores depend on scaffold and data splits — don't treat them as absolute.

Two things stand out. (1) The harder the benchmark, the bigger the gap — on the standard Verified the models are close, but on the hardest FrontierCode Diamond, Fable 5 is roughly 5x GPT-5.5 and more than 2x Opus 4.8. (2) Terminal work is a close race — on Terminal-Bench the three are within a hair, and GPT-5.5 stays competitive via Codex CLI (OpenAI's strongest terminal surface). So it's not "Fable 5 wins all coding"; the accurate picture is that its strength shines at the hard end.
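The ratios quoted above follow directly from the table. As a quick check, here is a small Python sketch that recomputes them from the vendor-reported figures in this article; the dictionary layout and the function name are just for illustration.

```python
# Vendor-reported scores from the table above, in percent.
scores = {
    "SWE-bench Verified": {"Fable 5": 95.0, "Opus 4.8": 88.6},
    "SWE-bench Pro": {"Fable 5": 80.3, "Opus 4.8": 69.2, "GPT-5.5": 58.6},
    "FrontierCode Diamond": {"Fable 5": 29.3, "Opus 4.8": 13.4, "GPT-5.5": 5.7},
    "Terminal-Bench 2.1": {"Fable 5": 84.3, "Opus 4.8": 82.7, "GPT-5.5": 83.4},
}


def lead_over(benchmark: str, rival: str) -> float:
    """Return Fable 5's score divided by the rival's score on one benchmark."""
    row = scores[benchmark]
    return row["Fable 5"] / row[rival]


for name, row in scores.items():
    ratios = ", ".join(
        f"{rival}: {lead_over(name, rival):.2f}x"
        for rival in row
        if rival != "Fable 5"
    )
    print(f"{name:22s} {ratios}")
```

The ratio against Opus 4.8 grows from about 1.07x on Verified to about 2.19x on FrontierCode Diamond (5.14x against GPT-5.5), while on Terminal-Bench 2.1 it stays close to 1.0x for both rivals.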

3. "The harder the task, the bigger the lead"

You can't talk about Fable 5's coding without noting that its accuracy scales with thinking effort. Anthropic explains that "the longer and more complex the task, the larger Fable 5's lead."

FrontierCode Diamond: effort vs. accuracy (vendor-reported)

Fable 5 (low effort): 11.5%
Fable 5 (max effort): 30.9%
GPT-5.5 (even with more effort): plateaus at 5-6%

* Reports note that "even at medium effort, Fable 5 beats other models at any effort level." By contrast, GPT-5.5 barely improves with more effort. Figures are directional.

This maps straight to real work. For a 5-minute chore, any model is fine (cheaper is better, in fact). But for work that needs deep thought, such as a migration spanning dozens of files or an autonomous agent running for half a day, Fable 5's edge starts to count. Agent design matters too: one report had five agents running in parallel reach a 60% hidden-test pass rate 3.2x faster than a single agent.
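To put the parallel-agent figure in cost terms, here is a back-of-the-envelope sketch. It uses the standard definition of parallel efficiency (speedup divided by worker count). It also assumes, purely for illustration, that all five agents run for the whole shortened duration and that token spend scales with agent running time; the report quoted above does not state either.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Standard parallel efficiency: achieved speedup per worker."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


def relative_agent_time(speedup: float, workers: int) -> float:
    """Total agent running time relative to one agent doing the job alone.

    ASSUMPTION: every worker runs for the full (shortened) wall-clock time.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return workers / speedup


speedup, workers = 3.2, 5  # figures from the report quoted above
print(f"parallel efficiency: {parallel_efficiency(speedup, workers):.0%}")
print(f"total agent time vs. single agent: {relative_agent_time(speedup, workers):.2f}x")
```

Under those assumptions, the 3.2x wall-clock gain corresponds to about 64% parallel efficiency and roughly 1.6x the total agent time, which is the number to weigh against Fable 5's per-token price.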

4. What is it actually good at?

Benchmarks are abstract. Let's make "what kinds of work it suits" concrete. Among early adopters, praise is near-unanimous on these areas.

🗂️ Large multi-file refactors

Design changes across many files and dependency cleanups, end to end while keeping context. The 1M-token context pays off.

🤖 Long autonomous agent runs

Great for handing off hours — or "days' worth" — of work asynchronously. Best when you throw it a single, clearly defined, sizeable task.

🖼️ Front-end from a screenshot

Hand it a design image or screenshot and it prototypes a working UI. Reviewers note high visual fidelity.

📐 API design + tests + docs

Not just the implementation — it rounds out API design, tests, and documentation together. One report had it absorb "days' worth" of work.

Developer Simon Willison said he was strongly impressed by the quality of the API design, tests, code, and documentation Fable 5 put together for his project, rating the output as "several days' worth" of work. At the same time he called it "slow and expensive," reporting that 5.5 hours of testing burned through over $110 in tokens.

— Source: Simon Willison's blog (June 2026, his personal hands-on impressions)

Where it's a poor fit: short back-and-forth exchanges. For a style where you nudge it along step by step in chat, the slowness and cost weigh heavily. The right grip on Fable 5 is "define big, then hand it off in one go."

5. Weaknesses (cost, won't stop, safety fallback)

The flip side of that power: keep these weaknesses in mind when coding with it. Miss them and it just feels "expensive and runaway."

💸 Heavy cost (~2x Opus 4.8)

$10/$50 (input/output per million tokens). Complex sessions reach 500k-1M tokens — real money per task. Finishing in fewer turns offsets some of it, but at high volume the 2x bites.

🛑 Misjudges when to stop — keeps running

On tasks without clear boundaries, it reportedly keeps running until the system kills it. Spell out the stop condition and a cap, and put a human gate in place.

🔍 Code-review precision trails Opus 4.8

It excels at autonomous implementation, but Opus 4.8 is rated higher on code-review precision. It can read a mistake as "intended design" and miss it. Verify before using it for review.

🛡️ Safety classifiers fall back to Opus 4.8

For work flagged as security research or "model distillation," responses can switch automatically to Opus 4.8. On Terminal-Bench, about 20% of trials reportedly hit this fallback.

✅ Beware "I tested it" (when it didn't)

Failure-case analysis found it can report "tested" without actually running the tests, or misread what it observed. Treat its output as something a human must verify with a build and tests.

In short: powerful, but you can't leave it unattended. The assumed operating model is to set a stop condition, always verify output with a build and tests, and put a cost cap in place. As with the usual prompting cautions, not handing it the wheel entirely protects both quality and cost.

6. When to use Opus 4.8 / GPT-5.5 instead

This is the most practical part. Coding in 2026 is shifting from "commit to one model" to "route by task." Early practical guidance largely agrees.

Fable 5

The hard 10-20%

Large migrations, half-day to multi-day autonomous runs, hard problems where Opus plateaus. The longer and more complex, the more value.

Opus 4.8

The default (the other 80%)

Well-scoped routine tasks, high volume, latency- or cost-sensitive work. The default for most production traffic.

GPT-5.5

Terminal × Codex

Terminal-driven workflows on Codex CLI. Still competitive for terminal work.

So the recommendation: "Opus 4.8 by default, escalate the hardest 10-20% to Fable 5, and keep GPT-5.5 for Codex-centric terminal work." On many platforms both Claude models sit behind one endpoint, so routing is just a model-ID swap. Reading this alongside the separate "Claude Code vs. Codex" comparison makes it easy to map onto your own workflow.
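As a rough sketch of what route-by-task can look like in code, the function below returns a model ID per task. Everything here is an assumption for illustration: the model ID strings are placeholders (check your platform for the real identifiers), the `Task` fields are hypothetical, and the thresholds for "dozens of files" and "half a day" are guesses to tune on your own workload.

```python
from dataclasses import dataclass

# Placeholder model IDs, not real API identifiers.
FABLE = "fable-5"
OPUS = "opus-4.8"
GPT = "gpt-5.5"


@dataclass
class Task:
    files_touched: int                # rough estimate of files the change spans
    expected_hours: float             # expected autonomous run time
    terminal_on_codex: bool = False   # terminal-driven workflow on Codex CLI
    latency_sensitive: bool = False   # latency- or cost-sensitive traffic


def pick_model(task: Task) -> str:
    """Route by task: Opus 4.8 by default, escalate only the hard end."""
    if task.terminal_on_codex:
        return GPT
    if task.latency_sensitive:
        return OPUS
    # Illustrative thresholds for "large migration" and "half-day run".
    if task.files_touched >= 24 or task.expected_hours >= 4:
        return FABLE
    return OPUS


print(pick_model(Task(files_touched=3, expected_hours=0.2)))
print(pick_model(Task(files_touched=60, expected_hours=8)))
print(pick_model(Task(files_touched=5, expected_hours=1, terminal_on_codex=True)))
```

Keeping the decision in one small function means escalation stays an explicit, reviewable rule rather than a per-developer habit, and the thresholds can be adjusted once you have cost data from your own tasks.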

7. Where to use it: pricing and free window

Fable 5 launched across the major developer platforms at once. Here are the entry points for coding.

Available on: Claude Code, GitHub Copilot, AWS Bedrock, Azure Foundry, Databricks, Anthropic API

Price: $10 / $50 input/output (per M tokens), with up to 90% caching discount on input
Context window: 1M tokens (up to 128k output)
Free window: Jun 9-22, limited-time free on Pro/Max/Team/Enterprise (credits after)
The free window (June 9-22, 2026) is a great chance to test it on your own heavy task and decide whether it's worth 2x. After that it needs usage credits, and it's expected to return as a standard feature once capacity allows (terms can change — check the latest official info).
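To judge whether a task is worth the price, it helps to estimate the bill before running it. The sketch below uses the $10/$50 per-million-token prices quoted above. It makes two simplifying assumptions: cached input tokens get the full 90% discount (the article only says "up to"), and any extra charge for writing to the cache is ignored. The function name and the sample token counts are illustrative.

```python
INPUT_PRICE_PER_M = 10.0    # USD per million input tokens (quoted above)
OUTPUT_PRICE_PER_M = 50.0   # USD per million output tokens (quoted above)
CACHE_DISCOUNT = 0.90       # best case quoted above ("up to 90%")


def session_cost(
    input_tokens: int, output_tokens: int, cached_fraction: float = 0.0
) -> float:
    """Estimate one session's cost in USD.

    cached_fraction is the share of input tokens served from the prompt cache.
    ASSUMPTION: cached tokens get the full discount; cache-write costs ignored.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    billable_input = fresh + cached * (1 - CACHE_DISCOUNT)
    input_cost = billable_input * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


print(f"no caching:   ${session_cost(800_000, 100_000):.2f}")
print(f"75% cached:   ${session_cost(800_000, 100_000, cached_fraction=0.75):.2f}")
```

Under these assumptions, a session with 800k input tokens and 100k output tokens costs $13.00 without caching and about $7.60 with three quarters of the input served from cache. Output tokens dominate once caching is in play, which is one more reason to cap unbounded runs.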

Summary

For coding, Claude Fable 5 combines overwhelming strength at the hard end with high cost and a need for oversight. It isn't a drop-in replacement — the key is to use it correctly, as a trump card.

Key takeaways

  • 🏔️ Pulls away the harder the coding (SWE-bench Pro 80.3%; ~5x GPT-5.5 on FrontierCode Diamond).
  • ⚡ High quality in fewer turns. Strong at multi-file refactors, long agent runs, and front-end-from-a-screenshot.
  • 💸 ~2x the price of Opus 4.8. Misjudges when to stop, trails on review precision — oversight is assumed.
  • 🔀 Routing is the answer: Opus 4.8 by default, the hard 10-20% to Fable 5, terminal work to GPT-5.5.

"Fable 5 for the heavy one-off, Opus 4.8 for most of the daily grind." Nail that split and you balance performance and cost while absorbing implementations that used to be "days of work" in one shot. Start by testing it on your single heaviest task during the free window. For the big picture, see the Fable 5 release deep dive; for picking dev tools, Claude Code vs. Codex.

FAQ

Q. Should I use Fable 5 for all my everyday coding?

A. No. On short, well-defined tasks it's about the same as Opus 4.8, at roughly 2x the price. Routing Opus 4.8 by default and Fable 5 only for the hard parts is more cost-effective.

Q. Can I take the benchmark numbers at face value?

A. Treat them as directional. Scores shift with the evaluation scaffold and data splits, and vendor figures tend to be measured under favorable conditions. Ultimately, verify on your own real tasks.

Q. Is it good for code review?

A. It's strong at autonomous implementation, but Opus 4.8 is rated higher on review precision. For review, pair it with Opus 4.8 or a human double-check to be safe.

Q. Any tips for keeping costs down?

A. Three things help: ① spell out the task's stop condition and cap, ② use input prompt caching (up to 90% off), and ③ route only the hard parts to Fable 5. Not letting it run unbounded is the biggest saver.

Q. Why do responses sometimes switch to Opus 4.8 on their own?

A. When safety classifiers flag a request as "security research," "model distillation," or similar, the response is designed to fall back to Opus 4.8 automatically. On such work, expect some responses to come from Opus 4.8.