Table of Contents
- 1. The answer, up front
- 2. There are two questions — separate "environment" from "quality"
- 3. Claude Code vs Codex — the differences that matter for translation
- 4. Which tool fits translation tasks
- 5. Recommended models — choosing by translation quality
- 6. Choosing by language and use case
- 7. In practice: building a translation pipeline
- 8. Caveats (told honestly)
- Summary
- FAQ
"I want to translate my docs into 10 languages. Which is better, Claude Code or Codex?" This question hides a trap: many people conflate "which tool is better" with "which translates better." The fact is, neither Claude Code nor Codex is a "translation engine." Both are agentic CLI work environments; the thing that actually produces the translated text is the language model running underneath.
The question therefore splits in two: "In which environment is the translation work most efficient?" (tool choice) and "Which model do I trust with the quality of the output?" (model choice). The answer up front: for translating many files in a repository in bulk while preserving structure, Claude Code fits better, thanks to direct local file access, a 1M-token long context, and strong consistent editing across files. Translation quality itself depends on the model and the language pair. This article covers both the tool side and the model side, based on official data and multiple sources.
The quick verdict for multilingual translation: "which tool" and "which model" are separate questions
The shortest guide: if you need to translate files in your repo accurately, structure and all, use Claude Code.
Then, pick a model that is strong in your target language for the final quality.
* Tool specs here are from each vendor's official sources and several tech outlets (as of May 2026); multilingual performance is from Anthropic's official multilingual support material (MMLU-based scores relative to English). Model versions and figures may change, so always make the final call by testing your own language pairs.
1. The answer, up front
For the busy reader, just the essentials.
- As a work environment, Claude Code fits translation better. Why: (1) it reads and writes many local files directly; (2) its 1M-token context can hold "article body + glossary + existing translations" all at once; (3) it is strong at consistent editing of terms and tone across many files.
- Codex fits "async, cloud, hands-off batches." It shines for runs that execute safely in a sandbox and open PRs automatically, or for embedding the open-source CLI into your own pipeline. Its context window, however, is smaller (up to about 400K tokens versus about 1M).
- Translation quality is decided by the "model," not the "tool." Multiple sources agree on a pattern: long-document tone consistency leans Claude; natural phrasing and idioms in major European and East Asian languages lean GPT; breadth across low-resource languages and dialects leans Gemini. The best choice changes per language pair.
2. There are two questions — separate "environment" from "quality"
Let's restate the key point from the intro a little more carefully. Claude Code and Codex are agentic CLI (command-line) work environments. They read files, edit them, run tests, and open PRs; in effect, they are "workers that carry out the hands-on work autonomously." The "language ability" of that worker, meanwhile, is supplied by the model underneath (Claude Opus/Sonnet, GPT-5.5, Gemini 3.1 Pro, etc.).
In other words, "is it good at translating?" is basically a model question, while "can it run the work of translating efficiently, accurately, and at scale?" is a tool question. If you mix the two axes and ask "which is stronger at translation?" as a single question, you cannot get a useful answer. This article covers the tool in sections 3-4 and the model in sections 5-6, then puts both into practice in section 7.
3. Claude Code vs Codex — the differences that matter for translation
First, the tool axis. The two are alike as "agentic CLI coders," and their general coding performance is roughly at parity as of May 2026. Narrowed to the differences that matter for translation work, however, they diverge clearly.
| Aspect | Claude Code | Codex |
|---|---|---|
| Where it runs | Real-time collaboration on your local machine | Async execution in a cloud sandbox |
| File access | Reads/writes all local files directly | Sandbox-based; file/PC operations are relatively limited |
| Context window (approx.) | Up to ~1M tokens (Opus line) | Up to ~400K tokens |
| Multi-file consistent editing | Strong (easy to align terms/tone across files) | Possible, but mass simultaneous edits feel the context limit |
| Parallel execution | Easy to spawn parallel subagents | Strong at async tasks and hands-off runs |
| Nature of the CLI | Anthropic-provided (deep IDE integration) | Open source (Apache-2.0), easy to embed in your own pipeline |
| Price range | Individuals $20-$200/month (similar) | Individuals $20-$200/month (similar) |
Recall the reality of translation work. What you translate is not just "raw prose." There are HTML/Markdown tags, code blocks, glossaries, existing translations, file-naming conventions — and you must process them across dozens of files, consistently, without breaking anything. This is where (1) direct access to all local files, (2) a large context window, and (3) reliable multi-file consistent editing pay off. Even in general comparisons, Claude Code is rated highly for "quality on hard multi-file refactors," while Codex is valued for "async PR automation, per-task cost, and sandbox safety." For a full overall comparison, see Claude Code vs Codex: a thorough comparison.
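To make point (2) concrete, here is a minimal Python sketch for checking whether "article body + glossary + existing translations" would plausibly fit in one context window. The 4-characters-per-token ratio is only a rough assumption for English prose (real tokenizers differ, and CJK text usually costs more tokens per character), and the function names are made up for illustration. The window sizes are the approximate figures from the table above.

```python
# Assumption: ~4 characters per token, a rough rule of thumb for English.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Very rough token estimate derived from the character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(texts: list[str], window_tokens: int,
                    reserve_ratio: float = 0.5) -> bool:
    """True if the inputs fit while leaving `reserve_ratio` of the window
    free for instructions and the translated output."""
    budget = int(window_tokens * (1 - reserve_ratio))
    return sum(estimate_tokens(t) for t in texts) <= budget


article = "word " * 200_000   # about 1M characters of source text
glossary = "term " * 2_000    # about 10K characters

print(fits_in_context([article, glossary], 1_000_000))  # True
print(fits_in_context([article, glossary], 400_000))    # False
```

Reserving half the window is a conservative guess, since the translated output also has to fit; tune `reserve_ratio` after measuring real runs.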
4. Which tool fits translation tasks
Mapping the differences above onto "three typical translation scenarios" makes the fit clear.
The fitting tool, by scenario
When unsure: if the main goal is "translate the files on hand consistently, without breaking structure," use Claude Code.
If you want it "run automatically as a CI / overnight batch," Codex's async operation hits the spot.
In addition, for translating large multilingual sites or documentation (dozens to hundreds of files, where term unification is mandatory), Claude Code is easier to handle because it can edit local files directly and has a large context window. Its strength is that it works like a "senior partner" when you want to guarantee quality while checking as you go. On the other hand, if you want to embed translation into a fully automated scheduled job, Codex comes into play: it is easy to pipeline as an open-source CLI and can run asynchronously, hands-off.
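Because term unification is the hard requirement in this scenario, it helps to verify it mechanically rather than by eye. The following is a minimal sketch, assuming a glossary stored as a plain "source term → required target term" mapping per language. The function name and the sample terms ("Acme Cloud", "workspace") are hypothetical.

```python
def glossary_violations(source: str, translation: str,
                        glossary: dict[str, str]) -> list[str]:
    """Return source terms that appear in `source` but whose required
    target rendering is missing from `translation`."""
    missing = []
    src_lower = source.lower()
    out_lower = translation.lower()
    for src_term, tgt_term in glossary.items():
        if src_term.lower() in src_lower and tgt_term.lower() not in out_lower:
            missing.append(src_term)
    return missing


# Hypothetical French glossary: brand name kept as-is, UI term fixed.
glossary_fr = {"Acme Cloud": "Acme Cloud", "workspace": "espace de travail"}
source = "Open your workspace in Acme Cloud."
translation = "Ouvrez votre espace dans Acme Cloud."

print(glossary_violations(source, translation, glossary_fr))  # ['workspace']
```

Plain substring matching will raise false alarms for inflected or split forms, so treat the result as a list of spots for a human or a second model to look at, not as a verdict.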
5. Recommended models — choosing by translation quality
Now the model axis. Since output quality is decided by the model, not the tool, this is the heart of the matter. An important premise: a high coding-benchmark score does not mean "good at translation." Translation tests different abilities: tone, idioms, cultural context, and coverage of low-resource languages.
Let's start with the most reliable primary data. Anthropic officially publishes per-language performance relative to English (relative scores on MMLU translated into each language by professional translators). Here is an excerpt for the languages this site handles (figures are for the Claude Opus line with extended thinking; English = 100%).
| Language | Score vs English (Claude) | Tier |
|---|---|---|
| Spanish | 98.1% | Top tier |
| French | 97.9% | Top tier |
| Portuguese (Brazil) | 97.8% | Top tier |
| German | 97.7% | Top tier |
| Arabic | 97.1% | High |
| Chinese (Simplified) | 97.1% | High |
| Japanese | 96.9% | High |
| Hindi | 96.8% | High |
What we can read off: Claude holds a very high 96-98% relative-to-English level across major languages. Sources broadly agree that it is especially well-regarded for languages where tone and register consistency matter, such as German, Japanese, and Korean (note: this score is an MMLU-reasoning proxy, not translation quality as such). Meanwhile, each model has its own pattern of strengths and weaknesses. Here are the tendencies repeated across multiple sources.
Each model's strengths in translation
These are "tendencies" repeatedly reported across multiple outlets, not a fixed ranking.
Model versions update frequently, so always make the final call by testing your own language pairs.
The key thing is that with either Claude Code or Codex, you can choose and switch the model you call. So a realistic combination is "tool = Claude Code, but also run quality checks through a different model." In the Opus 4.8 generation, "honesty" improved substantially, making the model more likely to flag uncertain passages itself — which helps the efficiency of translation review, too.
6. Choosing by language and use case
Let's turn the tendencies above into practical decisions.
| Situation | Lean toward | Why |
|---|---|---|
| Long documents in a unified tone | Claude (Opus/Sonnet) | Whole text at once in a large context; consistent register and terms |
| Naturalness in major European/East Asian languages | GPT-5.5 line / Claude | Smooth idioms and turns of phrase |
| Breadth into low-resource languages / dialects | Gemini 3.1 Pro | Wide language coverage |
| Large-volume, low-cost batch translation | Gemini Flash / each vendor's light, fast models | Balance of speed and cost |
| Specialized docs (legal, medical, etc.) | Top model + mandatory human review | Domains where mistranslation is unacceptable |
The realistic best practice is "division of labor," not "one model for everything." For example, generate a rough draft fast and cheap with a lightweight model, then polish only the languages that need quality with a top model. Or combine a main translation with a cross-check by a different model. Agentic environments like Claude Code / Codex are well-suited to automatically running this kind of multi-model pipeline.
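As a rough illustration of this division of labor, the routing decision can be written down as a small function. This is a sketch under stated assumptions: `LIGHT` and `TOP` are placeholder labels, not real API model identifiers, and the rule set simply mirrors the table above.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder labels (assumptions), not real model identifiers.
LIGHT = "light-fast-model"
TOP = "top-model"

# Genres where a mistranslation is unacceptable (from the table above).
HIGH_STAKES = {"legal", "medical"}


@dataclass(frozen=True)
class Plan:
    draft_model: str
    polish_model: Optional[str]  # None means: ship the draft after lint
    human_review: bool


def plan_for(lang: str, genre: str, priority_langs: set[str]) -> Plan:
    """Pick draft/polish models for one language and genre."""
    if genre in HIGH_STAKES:
        # Top model from the start, and a human always signs off.
        return Plan(TOP, None, human_review=True)
    if lang in priority_langs:
        # Cheap draft, then polish with the top model.
        return Plan(LIGHT, TOP, human_review=False)
    return Plan(LIGHT, None, human_review=False)


print(plan_for("de", "docs", {"de", "ja"}))
print(plan_for("de", "legal", {"de", "ja"}))
```

Keeping the rules in one function makes it easy to re-test and re-route when a new model generation arrives.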
7. In practice: building a translation pipeline
Once you've decided on the tool and model, build a "template" that stabilizes quality. Here are practical points for running multilingual translation with an agentic CLI.
5 iron rules of agentic translation
- Fix one source language, such as English (or Japanese), as the single basis. Translating all languages from one base keeps quality aligned.
- Hand over a glossary. Build a dictionary of the approved translations of brand names, proper nouns, and UI strings, and unify them across all languages.
- State explicitly "preserve structure, tags, and code; translate only prose." Don't let it touch HTML attribute values or code.
- Run languages in parallel. Running 8 languages at once is fast (watch API rate limits).
- Run a mechanical quality check at the end. Auto-detect leftover untranslated text, swapped punctuation, character-count overflows, etc.
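Rule 3 in the list above ("preserve structure, tags, and code") does not have to rely on the prompt alone. One common approach is to mask protected spans with placeholders before translation and restore them afterwards. The sketch below is a minimal version of that idea; the `[[KEEP_n]]` placeholder format and the function names are assumptions, and the regular expression is deliberately simple (it would also mask a stray `<` ... `>` pair in prose).

```python
import re

FENCE = "`" * 3  # built programmatically to keep this example readable

# Fenced code blocks, inline code spans, and HTML tags (attributes included).
PROTECTED = re.compile(
    re.escape(FENCE) + r".*?" + re.escape(FENCE) + r"|`[^`\n]+`|<[^>]+>",
    re.DOTALL,
)


def mask(text: str) -> tuple[str, list[str]]:
    """Replace protected spans with numbered placeholders."""
    saved: list[str] = []

    def stash(match: re.Match) -> str:
        saved.append(match.group(0))
        return f"[[KEEP_{len(saved) - 1}]]"

    return PROTECTED.sub(stash, text), saved


def unmask(text: str, saved: list[str]) -> str:
    """Restore the original spans; fail loudly if a placeholder was lost."""
    for i, original in enumerate(saved):
        token = f"[[KEEP_{i}]]"
        if token not in text:
            raise ValueError(f"placeholder {token} missing after translation")
        text = text.replace(token, original)
    return text


source = 'Click <a href="/docs">here</a> and run `npm install`.'
masked, saved = mask(source)
print(masked)  # Click [[KEEP_0]]here[[KEEP_1]] and run [[KEEP_2]].
```

Only the masked text goes to the model. If `unmask` raises, the model dropped or altered a placeholder, and that file should be retried or flagged for review.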
Once this template is in place, the flow of "draft → automated lint → human checks only the key spots" can be dramatically faster while holding quality. Understanding prompt design and how agents work raises pipeline precision further. And when translating text pulled in from outside, don't forget permission design and prompt-injection countermeasures.
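The "automated lint" step in that flow can start very small. Here is a minimal sketch of three checks in the spirit of rule 5: code blocks unchanged, prose lines that look untranslated, and length overflow. The thresholds (four words, a 2x length ratio) and the function name are assumptions to tune, and the word-count test presumes a space-delimited source language such as English.

```python
import re

FENCE = "`" * 3
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def lint_translation(source: str, translated: str,
                     max_ratio: float = 2.0) -> list[str]:
    """Return a list of human-readable problems (empty list = clean)."""
    problems = []

    # 1. Code blocks must survive character for character.
    if CODE_BLOCK.findall(source) != CODE_BLOCK.findall(translated):
        problems.append("code blocks changed")

    # 2. A longer prose line copied verbatim was probably not translated.
    out_lines = set(CODE_BLOCK.sub("", translated).splitlines())
    for line in CODE_BLOCK.sub("", source).splitlines():
        if len(line.split()) >= 4 and line in out_lines:
            problems.append(f"possibly untranslated: {line.strip()[:40]}")

    # 3. A translation far longer than the source may overflow the layout.
    if len(translated) > max_ratio * len(source):
        problems.append("translation much longer than source")

    return problems
```

Running `lint_translation(source, translated)` on every file and sending only the files with a non-empty result to a human keeps the review effort focused on the key spots.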
8. Caveats (told honestly)
Finally, here are the caveats, stated plainly, so you don't misjudge.
- Benchmark ≠ real translation quality. The relative-to-English scores here are an MMLU-reasoning proxy and don't fully match the naturalness/accuracy of the output. Always test on your own language pair and genre.
- Model versions change frequently. "X is the best" goes stale in a few months. An operating model of "division of labor + real testing" outlives a fixed conclusion.
- Specialized, legal, and medical translation require human review. Where the cost of a mistranslation is high, keep AI to the draft and let humans bear final responsibility.
- Design cost around "quality × volume." Translating everything with a top model is expensive. Draft with a cheap model, polish only the key parts with a top model — that's economical.
- Codex's sandbox constraints. For directly editing many local files, a cloud sandbox can become a limitation in some cases.
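The "quality × volume" caveat above comes down to simple arithmetic. The sketch below compares "everything with a top model" against "draft cheaply, polish a few languages." All prices are made-up placeholders per million tokens, not real vendor pricing, and a single blended price per model is a simplification (real pricing usually separates input and output tokens).

```python
def pipeline_cost(tokens_per_lang: int, n_langs: int, polish_langs: int,
                  cheap_price: float, top_price: float) -> dict[str, float]:
    """Compare two strategies. Prices are per million tokens processed
    and are placeholders, not real vendor figures."""
    total_mtok = tokens_per_lang * n_langs / 1_000_000
    polished_mtok = tokens_per_lang * polish_langs / 1_000_000
    return {
        "all_top": total_mtok * top_price,
        "draft_then_polish": total_mtok * cheap_price + polished_mtok * top_price,
    }


# 2M tokens per language, 10 languages, polish only 3 of them.
print(pipeline_cost(2_000_000, 10, 3, cheap_price=1.0, top_price=20.0))
# {'all_top': 400.0, 'draft_then_polish': 140.0}
```

With these placeholder numbers the split pipeline costs roughly a third as much; plug in your own volumes and current prices before drawing conclusions.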
Summary
The answer to "which fits multilingual translation, Claude Code or Codex?" starts with splitting the question in two. As a work environment, to translate many files in a repo consistently while preserving structure, Claude Code fits (direct local edits, 1M context, multi-file consistency). For async, cloud, hands-off batches / PR automation, Codex hits the spot.
And translation quality is decided by the model, not the tool. The tendencies are Claude for long-document tone consistency, the GPT line for naturalness in major languages, and the Gemini line for breadth across low-resource languages and dialects. Given these, the realistic 2026 answer is to pick the best per language pair and divide labor between drafting and finishing. One last emphasis: rather than hunting for a fixed "best model," test on your own tasks and keep a pipeline that mixes multiple models. That is the smartest way to avoid being thrown off course by each new model generation.
Related reading: Claude Code vs Codex: a thorough comparison, Claude Opus 4.8 deep-dive, GPT-5.5 vs Claude Opus comparison, ChatGPT / Claude / Gemini free-tier comparison, and What is the Claude Agent SDK.
FAQ
Q. So which model translates best?
A. "It depends on the language pair and use case" is the honest answer. Long-document tone consistency leans Claude; natural output and idioms in major languages lean the GPT line; breadth across low-resource languages and dialects leans the Gemini line. There is no fixed "best," and versions update fast, so testing in your target language is the sure path.
Q. Does translation quality differ between Claude Code and Codex?
A. The tools themselves do not produce the translation. Quality is decided by the model running underneath. Since you can choose the model in either tool, think of it as "quality = model choice, efficiency = tool choice." Where the tools differ is in how fast, how accurately, and how easily they handle the work at scale.
Q. Which tool for translating a multilingual site of dozens of files?
A. Claude Code is easier to handle. It reads and writes all local files directly, can reference body text, glossary, and existing translations together in a 1M-token context, and is strong at unifying terms and tone across many files. Running languages in parallel makes large-volume translation feasible in realistic time.
Q. Any tips for keeping costs down?
A. Division of labor. Translating everything with a top model gets expensive. Draft fast and cheap with a lightweight model (e.g. Gemini Flash), then polish only the languages/spots that need quality with a top model. If prompt caching or batch processing is available, use them to cut large-volume translation costs significantly.
Q. Is AI translation OK for specialized docs (contracts, medical)?
A. Keep it to the draft, and have a domain expert do the final check. In domains where the cost of a mistranslation is high, relying on AI alone is risky even with a top model. Speed things up with AI, but leave the final check, and the responsibility for it, to humans. That is the safe line to hold.