Table of Contents
- 1. Five 1M-Token Models in One Year — But Only One Actually Reads the Whole Thing
- 2. What Is Context? — Separate the Container from Its Contents
- 3. Major Models in May 2026 — Container Sizes
- 4. Three Reasons "Bigger Is Better" Doesn't Hold
- 5. The Cost Trap — OpenAI Doubles Above 272K, Anthropic Stays Flat
- 6. Five Saving Tactics — Ranked by Real Impact for Solo Devs
- Summary
- FAQ
In 2023, a 32K-token context window felt "spacious." By May 2026, 1 million tokens (1M) has become the industry default. Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.5, Gemini 3.1 Pro, DeepSeek V4-Pro — all major frontier models support 1M. Gemini 3.1 Ultra has reached 2M.
"One million tokens" translates to roughly 8–10 paperback books in English, or tens of thousands of lines of source code. We can now keep that much "in view" within a single session. But here's the catch: only one of these models actually uses that container all the way to the end. Independent benchmarks (multi-needle NIAH, detailed below) show that only Gemini 3 Deep Think mode holds accuracy across the full 1M. The others start losing precision somewhere between 200K and 400K — that's the honest field reality of 2026.
Let me get my take out front: the era of choosing a model purely on container size is over. What matters now is the trio of "effective context × cost × strategy," and Anthropic's move to flat-rate 1M pricing is the most interesting wrinkle of the year. This article walks through what context actually is, the May 2026 model lineup, why bigger isn't enough on its own, the cost-structure differences, and five practical context-saving tactics solo developers and small teams can apply today — backed by independent benchmark numbers.
[Infographic: The Container Grew 250x in Three Years — a timeline of how 1M went from luxury to baseline. Caption: "supports" and "actually reads to the end" are different things. Only Gemini 3 Deep Think holds accuracy across the full 1M in multi-needle NIAH benchmarks; the others start degrading at 200K–400K (Digital Applied, Zylos 2026).]
1. Five 1M-Token Models in One Year — But Only One Actually Reads the Whole Thing
When OpenAI announced GPT-5.5 in April 2026, the web cheered: "OpenAI finally hits 1M." That same month, Google released Gemini 3.1 Ultra with 2M. Anthropic had introduced flat-rate 1M pricing on Claude Opus 4.6 the year before and reinforced it with 4.7. DeepSeek's V4-Pro is also 1M. Five frontier models, from four vendors, can now legitimately write "1M+ tokens" on the spec sheet.
This should have been a major event. Just three years ago, 32K felt impressive. We've seen a 30x+ jump in window size since then. The container-size race looked won.
Then independent evaluators Digital Applied and Zylos Research ran a multi-needle Needle-in-a-Haystack (NIAH) test in 2026 — embedding multiple facts in long documents and asking models to retrieve all of them correctly. Here's what they found:
- Gemini 3 Deep Think: holds advertised accuracy across the full 1M
- Claude Opus 4.7 / GPT-5.5 / DeepSeek V4-Pro: precision starts dropping around 200K–400K
So even though "1M support" is universal, only one model actually uses that 1M to the end under production-equivalent conditions. With other frontier models, asking them to integrate multiple facts starts showing strain at 200K–400K. This is the 2026 reality.
Don't read this as "Claude or GPT is bad." Use cases that genuinely need the full 1M are rare. If you can read 300K (≈ 2–3 paperbacks) reliably, almost every coding, research, or summarization task gets done. The trap is choosing a model on the "1M support" headline alone.
2. What Is Context? — Separate the Container from Its Contents
Quick terminology. Three words get mixed up in this space.
Token, Window, Context

| Term | What It Is |
|---|---|
| Token | The unit: the smallest chunk of text a model processes. Rule of thumb: ~0.75 English words per token. |
| Window | The container: the maximum number of tokens a model can handle in one exchange. |
| Context | The contents: everything actually filling that window, i.e. instructions, conversation history, pasted documents. |

In short: "window = container size," "context = contents," "token = unit."
A big container with messy contents still gives you messy answers.
Also: don't confuse "context" with "memory." Context lives inside the session — close the chat and it's gone. Features like ChatGPT Memory or Claude Memory, on the other hand, are a separate cross-session retention mechanism. Memory contents do eventually get injected into the context window, but from the user's perspective it's persistent storage vs. ephemeral workspace.
3. Major Models in May 2026 — Container Sizes
With definitions clear, here are the container sizes the major vendors publish today. All numbers are from official specifications as of May 2026.
| Model | Input Limit | Output Limit | Notes |
|---|---|---|---|
| Claude Opus 4.7 | 1,000,000 | 128,000 | Flat 1M at standard pricing, no beta header needed |
| Claude Sonnet 4.6 | 1,000,000 | 64,000 | Same flat pricing |
| Claude Haiku 4.5 | 200,000 | 64,000 | Lightweight model, no 1M tier |
| GPT-5.5 | 922,000 | 128,000 | API total ~1M; input price 2x above 272K |
| GPT-5.4 | 1,000,000 | 128,000 | Same long-context surcharge |
| Gemini 3.1 Pro | 1,000,000 | 65,535 | Available via Vertex AI / AI Studio |
| Gemini 3.1 Ultra | 2,000,000 | 65,535 | 2M tier — currently the only commercial 2M model |
| Grok 4 | 256,000 | 32,000 | xAI official spec; the most conservative window among frontier models |
| DeepSeek V4-Pro | 1,000,000 | 96,000 | Largest in the open-weight tier |
Read just the table and you'd conclude "Gemini Ultra wins, end of story." But there's one fact worth bolding: Anthropic offers 1M as a flat rate on Opus 4.6/4.7 and Sonnet 4.6, while OpenAI doubles the input price on GPT-5.5 above 272K tokens. That's not just a pricing knob — it's a strategic stance on how long-context workloads should be handled. We'll dig into the cost math in a later section.
Personally, I keep Claude Opus 4.7 as my workhorse for long-form work. Three reasons: flat pricing, stable accuracy through the 200K band, and Anthropic's documentation quality. For documents that genuinely exceed 300K, I switch to Gemini 3 Deep Think. Mixing models by use case is the right move in 2026.
4. Three Reasons "Bigger Is Better" Doesn't Hold
The previous table just listed physical container sizes. The harder question is whether models actually use what they advertise. The short answer: outside Gemini 3 Deep Think, it's grim. Three reasons.
Reason ①: Lost in the Middle
First documented by Stanford in 2023 and reproduced in every model generation since. AI weights the start and end of input strongly while downplaying the middle (the 30–70% positional zone). Information placed near the center of a 100K context gets retrieved at 5–15 percentage points lower accuracy than the same information at the start or end.
The everyday symptom: "paste a long PDF, ask 'what's the figure for X?', and the model misreports the number that lives right in the middle." That's Lost in the Middle. Three years after Stanford's original paper, even frontier models haven't fully closed the gap.
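If you want to see the effect yourself, the measurement is easy to reproduce. Here's a minimal sketch of the standard haystack construction, written only from the description above (build_haystack is illustrative, not a benchmark library): sweep the depth parameter across the 0.3–0.7 band and watch retrieval accuracy dip.

```python
def build_haystack(filler: str, needles: list[str], depths: list[float],
                   total_chars: int = 400_000) -> str:
    """Standard NIAH construction: repeat filler text to a target length,
    then splice each needle in at a fractional depth (0.0 = start, 1.0 = end).
    Sweeping depths across 0.3-0.7 is where Lost in the Middle shows up."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    # Insert the deepest needle first so earlier splices don't shift later offsets.
    for needle, depth in sorted(zip(needles, depths), key=lambda p: -p[1]):
        pos = int(total_chars * depth)
        body = body[:pos] + "\n" + needle + "\n" + body[pos:]
    return body

# Example: one fact at the high-attention start, one buried mid-document.
haystack = build_haystack(
    "The sky was a uniform gray that morning. ",
    ["The access code is 7421.", "The meeting moved to Thursday."],
    [0.05, 0.5],
)
```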
Reason ②: Context Rot
The longer a conversation runs, the more your initial instructions fade. You said "answer in formal English" at the start; twenty turns later, the model has drifted back to casual phrasing — that's Context Rot.
Two causes. ① Early instructions sit deep in the history and get weighted more lightly than recent turns. ② As the history grows, attention disperses and specific tokens become harder to pin down. In 2026, Anthropic began framing the countermeasures as "context engineering": a deliberate skill for managing these effects.
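A cheap countermeasure follows directly from cause ①: re-attach the standing instruction to the newest turn, where positional attention is strongest. A minimal sketch, assuming a generic chat-completions message format (the helper name is mine, not a library API):

```python
def with_restated_rules(history: list[dict], user_msg: str, rules: str) -> list[dict]:
    """Counter context rot by restating the standing instructions in the
    newest turn, at the high-attention end of the window, instead of
    letting them fade at the top of a long history."""
    reminder = f"{user_msg}\n\n(Standing instructions, restated: {rules})"
    return history + [{"role": "user", "content": reminder}]

# Turn 20 of a long session, where the original rule has started to drift:
# messages = with_restated_rules(messages, "Review this diff.", "Answer in formal English.")
```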
Reason ③: Advertised Context ≠ Effective Context
Here's what 2026's latest benchmarks (multi-needle NIAH, production-equivalent conditions) actually look like.
Effective Context (Multi-Fact Integration)

| Model | Advertised Window | Effective Range (multi-needle NIAH) |
|---|---|---|
| Gemini 3 Deep Think | 1M | Full 1M at advertised accuracy |
| Claude Opus 4.7 | 1M | Degrades from ~200K–400K |
| GPT-5.5 | ~1M | Degrades from ~200K–400K |
| DeepSeek V4-Pro | 1M | Degrades from ~200K–400K |

Sources: Digital Applied "Long-Context Retrieval 2026" / Zylos Research "LLM Context Window Management 2026."
On single-needle NIAH (one fact to retrieve) every model passes 1M, but multi-fact integration tells a different story.
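The scoring difference is the crux. A minimal sketch of a multi-needle grader (substring matching for brevity; the cited benchmarks use stricter graders):

```python
def multi_needle_score(model_answer: str, expected_facts: list[str]) -> float:
    """Multi-needle NIAH scoring: instead of the single-needle pass/fail,
    credit is the fraction of embedded facts the model actually surfaced."""
    hits = sum(fact.lower() in model_answer.lower() for fact in expected_facts)
    return hits / len(expected_facts)
```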
To repeat: this is not "Claude Opus 4.7 is broken." 200K–400K still equals 2–3 paperback novels of capacity. Most real-world tasks (code review, long-form writing, meeting summaries, research synthesis) finish well within that band. The problem is the assumption that "since it's 1M, just dump 1M in" — that strategy only works on Gemini Deep Think.
5. The Cost Trap — OpenAI Doubles Above 272K, Anthropic Stays Flat
We just established "effective is 200K–400K." Stack on that the second trap: long-context inputs make the bill jump. Anthropic and OpenAI have taken opposite strategies here.
| Model | Standard Input Price | Long-Context Surcharge |
|---|---|---|
| Claude Opus 4.7 | $5.00 / 1M tokens | Flat across 1M, no surcharge |
| Claude Sonnet 4.6 | $3.00 / 1M tokens | Same — no surcharge |
| GPT-5.5 | $5.00 / 1M tokens | Above 272K: 2x input, 1.5x output |
| GPT-5.4 | Comparable | Same long-context surcharge |
Concrete math. 500K-token input + 50K-token output, one round-trip — the canonical case of summarizing a large codebase or annual report in a single pass.
- Claude Opus 4.7: $5.00 × 0.5 + $25.00 × 0.05 = $3.75
- GPT-5.5 (with the 272K-overage surcharge): $10.00 × 0.5 + $45.00 × 0.05 = $7.25
That's a $3.50 gap per call. Run it 100 times a day and the bills diverge by $10,500 per month. For teams running long-lived agents, the gap easily reaches mid five figures monthly. Same structural pattern we covered in AI token and session cost-saving.
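For budgeting, the same arithmetic generalizes to a few lines of Python. A sketch with two caveats: the base output rates ($25 and $30 per 1M) are back-solved from the figures above rather than quoted from a price sheet, and the surcharge is assumed to reprice the whole request once input crosses the threshold, which is what the worked example implies.

```python
def call_cost(in_tok: int, out_tok: int, in_rate: float, out_rate: float,
              surcharge_at: int | None = None,
              in_mult: float = 2.0, out_mult: float = 1.5) -> float:
    """USD cost of one round-trip. Rates are $ per 1M tokens. If input
    exceeds `surcharge_at`, the whole request is billed at multiplied rates
    (the pricing model assumed in the worked example above)."""
    if surcharge_at is not None and in_tok > surcharge_at:
        in_rate, out_rate = in_rate * in_mult, out_rate * out_mult
    return in_tok / 1e6 * in_rate + out_tok / 1e6 * out_rate

claude = call_cost(500_000, 50_000, 5.00, 25.00)                        # -> 3.75
gpt    = call_cost(500_000, 50_000, 5.00, 30.00, surcharge_at=272_000)  # -> 7.25
```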
6. Five Saving Tactics — Ranked by Real Impact for Solo Devs
"The container is 1M but effective is ~300K, and using it long gets expensive." We've covered that. So what can you actually do in the field? Here are five tactics I use day-to-day, ranked by what gives the biggest payoff.
Context Saving — Priority Order

① Cut the Session: at every topic change, /compact or start a new session instead of pushing one long thread.
② Send Excerpts: paste only the relevant section or function, not the whole file.
③ Restate at the End: repeat the key instruction near the end of the prompt, where attention is strongest.
④ Cache: for API developers, mark stable prefixes for prompt caching (see the sketch below); UIs (claude.ai / ChatGPT) handle caching automatically.
⑤ Address Explicitly: point the model at the exact section, heading, or identifier instead of making it search.

Of the five, tactic ① "Cut the Session" gives the biggest visible gain. Just cutting the chat noticeably reduces hallucinations.
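For tactic ④, here's a minimal sketch against the Anthropic SDK's prompt-caching mechanism. The cache_control block is the SDK's real knob; the model ID and file name are stand-ins for whatever you actually deploy.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Mark the large, stable prefix (style guide, codebase digest, etc.) as
# cacheable. Repeat calls that reuse the identical prefix read it from
# cache at a reduced rate instead of paying full input price every time.
response = client.messages.create(
    model="claude-opus-4-7",  # placeholder ID for the model discussed here
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": open("project_digest.md").read(),  # the long, repeated prefix
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Review the auth module for race conditions."}],
)
print(response.content[0].text)
```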
My personal best practice: just doing ① and ② consistently shifts perceived accuracy noticeably. Even with Claude Code, instead of pushing one long session, hitting /compact or starting a fresh session at every topic change keeps final output quality stable.
Summary
Recap:
- Context window = the maximum tokens an AI can handle in one exchange. The container size.
- As of May 2026, Claude Opus 4.7 / Sonnet 4.6 / GPT-5.5 / Gemini 3.1 Pro / DeepSeek V4-Pro all support 1M; Gemini 3.1 Ultra hits 2M.
- Independent benchmarks (multi-needle NIAH) show only Gemini 3 Deep Think holds accuracy across the full 1M; the others start fading at 200K–400K.
- On cost, Anthropic stays flat while OpenAI applies a surcharge above 272K. Clear strategic divergence.
- The five tactics — cut the session, send excerpts, restate at the end, cache, address explicitly — and tactics ① and ② carry the most weight.
Even with bigger containers, the actual work is still deciding what to send and what to leave out. The 2026 AI skill isn't "stuffing everything in." It's the judgment to send only what's needed, accurately — that's what stays useful long-term. After watching five models cross the "1M" line this year, that's my conclusion.
FAQ
Q. How can I count tokens before sending a prompt?
OpenAI offers the tiktoken library; Anthropic exposes a countTokens()-equivalent API in the official SDK. Rule of thumb: ~0.75 English words per token, ~1–1.5 tokens per CJK character. Code varies by tokenizer, so measure before sending long inputs.
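A minimal sketch with tiktoken. The encoding name matches recent OpenAI models, but model-specific encodings differ, so verify it against the model you actually call:

```python
import tiktoken

# o200k_base is the tokenizer behind recent OpenAI models; pick the
# encoding that matches your deployment before trusting the count.
enc = tiktoken.get_encoding("o200k_base")

prompt = open("long_report.txt").read()
n = len(enc.encode(prompt))
print(f"{n} tokens")  # check against the window size and the 272K surcharge line
```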
Q. What's the difference between context and memory?
Context lives only inside the session — close the chat and it's gone. Memory (ChatGPT Memory / Claude Memory) is a separate cross-session retention mechanism. Memory contents end up injected into the context window, but from the user's perspective it's persistent vs. ephemeral.
Q. With a 1M window, do I still need RAG?
RAG is the pattern of "dynamically fetch only the necessary information into context." Even with a 1M window, dumping everything makes it slow, heavy, and expensive, so retrieval-then-load (RAG) remains the mainstream approach. See What Is RAG for more.
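A minimal sketch of the retrieval-then-load shape, assuming an embed function you supply (any embedding API that returns unit-normalized vectors works; the helper is illustrative):

```python
import numpy as np

def rag_context(query: str, chunks: list[str], embed, top_k: int = 5) -> str:
    """Fetch-then-load: rank chunks by similarity to the query and put only
    the winners into the prompt, instead of dumping the whole corpus.
    `embed` is an assumed callable returning a unit-normalized vector."""
    q = embed(query)
    scores = [float(np.dot(q, embed(chunk))) for chunk in chunks]
    top = sorted(range(len(chunks)), key=scores.__getitem__, reverse=True)[:top_k]
    return "\n\n".join(chunks[i] for i in top)
```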
Q. Why does accuracy degrade before the advertised limit?
Several factors stack up: the mismatch between training-time and inference-time sequence lengths, the positional-encoding limits of the attention mechanism, and the compute cost of integrating multiple facts. "Supported" and "accuracy maintained across the full range" are different problems.
Q. Does MCP help save context?
Yes. MCP is a fetch-on-demand mechanism via tools, so you don't need to load everything into context up front. Switch the mental model from "paste the whole file" to "let it go read the file."