You built your app on OpenAI's API. Now you want to try Claude and compare Gemini. But every provider has a different SDK, request shape, and error behavior. Each switch means rewriting code, transforming responses, and maintaining separate retry logic per vendor; before long, vendor-specific plumbing has soaked into every corner of your app. And while you're pinned to one provider, any outage, price increase, or model shutdown at that company hits your app directly.
The thing that takes over all that plumbing is an LLM gateway (AI gateway), also called an LLM proxy. It's a relay that sits between your app and the providers, exposing one API (usually OpenAI-compatible) to reach every model, and handling the cross-cutting chores — fallback, cost tracking, caching, rate limiting. This guide covers what a gateway does for you, the difference between the self-hosted, hosted, and SDK types, how to choose among LiteLLM, OpenRouter, and the Vercel AI SDK, and the limits you need to know so you don't get burned.
The 30-second answer
If you only read one box
Note: a gateway is no free lunch. It costs you a hop of latency, fees, and some feature loss (§8).
1. Why you need an LLM gateway
If you only call a single provider through a single SDK, you don't need a gateway. You need one the moment you want to use more than one model. Look at the three classic pains.
- Each provider has different SDKs, parameter names, response structures, and error codes. Every switch means rewriting your app.
- Depend fully on one company and its outage or price change becomes your downtime. You want an escape hatch (fallback).
- The best model differs by task. You want to use a cheap model to draft and a smart one to polish, but the plumbing gets in the way.
What these three share is that the SDK's constraints end up dictating what should be a strategic choice: which model to use. A gateway carves that plumbing out of your app. Your app only needs to know one endpoint; which model to call behind it, which to fail over to, and how much you've spent are the gateway's job. Because building an AI agent or an agent framework almost always assumes multiple models, the demand only grows.
2. What an LLM gateway is
An LLM gateway is a proxy that sits between your app and one or more LLM providers. Most expose a single API shaped like OpenAI's chat-completions endpoint and consolidate in one place the cross-cutting work that would otherwise be scattered through your code — routing, retries and fallback, caching, rate limiting, cost tracking, and access control.
[Figure: your app → LLM gateway (OpenAI-compatible; cost / cache / control) → providers (Google / local…)]
The point is a single point of entry. Your app code just passes a string as model. Write anthropic/claude-opus-4.8 and you get Claude; write openai/gpt-5.5 and you get GPT. Nothing else in the app changes. Decisions like "fail over to another model when this one is down" or "serve this repeated question from cache" are all settled on the gateway side. Mixing in a local LLM, so that sensitive data stays local and everything else goes to the cloud, is written the same way.
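To make the "one model string decides everything" idea concrete, here is a minimal sketch of how a gateway might resolve a provider/model string to an upstream. The routing table, the placeholder hosts, and the function name `resolve` are all hypothetical illustrations, not any specific gateway's implementation.

```python
# Hypothetical routing table: provider prefix -> upstream base URL.
# The hosts are illustrative placeholders, not real provider endpoints.
UPSTREAMS = {
    "anthropic": "https://anthropic.example/v1",
    "openai": "https://openai.example/v1",
}


def resolve(model: str) -> tuple[str, str]:
    """Split 'provider/model-name' and look up the provider's upstream URL."""
    provider, sep, name = model.partition("/")
    if not sep or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    if provider not in UPSTREAMS:
        raise KeyError(f"no upstream configured for provider {provider!r}")
    return UPSTREAMS[provider], name


# The app changes only this string; the lookup happens on the gateway side.
print(resolve("anthropic/claude-opus-4.8"))
print(resolve("openai/gpt-5.5"))
```

A real gateway would also translate the request and response shapes per provider; this sketch covers only the lookup step.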
3. What it handles for you
The cross-cutting work a gateway takes on falls into roughly these six buckets. Tools differ in what they're good at, but the direction is shared.
- Unified API: Call every provider in one format (usually OpenAI-compatible). Erasing vendor differences from the app is the key feature.
- Fallback: When the primary model errors, overloads, or times out, automatically switch to another. The heart of business continuity.
- Cost tracking: See spend per user, team, or project. Hand out scoped virtual keys that hide the real ones.
- Caching: Remember and instantly return identical or similar requests. Cuts both API bills and latency.
- Rate limiting: Per-key token and request limits, plus load balancing across multiple keys and instances.
- Observability: Measure logs, latency, and success rate across all requests. Some tools also let you insert input/output guardrails.
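As an illustration of the caching bucket, the sketch below shows exact-match caching only: identical requests (same model, same messages) are answered from memory instead of calling the provider again. The class name `ExactMatchCache` and the `call_upstream` callback are hypothetical; real gateways typically add expiry, a shared store such as Redis, and sometimes similarity matching, none of which is modeled here.

```python
import hashlib
import json


class ExactMatchCache:
    """In-memory cache keyed by a hash of the full request (model + messages)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(model, messages):
        # sort_keys makes the key independent of dict ordering in the request.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call_upstream):
        key = self._key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = call_upstream(model, messages)  # only runs on a cache miss
        self._store[key] = result
        return result


# Usage with a stand-in for the provider call (hypothetical, no network).
upstream_calls = []

def fake_upstream(model, messages):
    upstream_calls.append(model)
    return f"reply from {model}"

cache = ExactMatchCache()
question = [{"role": "user", "content": "Hello"}]
cache.get_or_call("openai/gpt-5.5", question, fake_upstream)
cache.get_or_call("openai/gpt-5.5", question, fake_upstream)
print(len(upstream_calls), cache.hits)
```

Keying on the whole request keeps the cache safe by construction: any change to the model or the messages produces a different key and therefore a fresh upstream call.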
💡 "Fallback" doesn't equal "safe." The model you fail over to has different output quirks, token counts, and supported features. Fallback doesn't become safe the instant you configure it — it works only once you've actually fired it and tested it. Always verify beforehand that your prompt doesn't break after the switch.
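The fallback behavior described above can be sketched as a loop over an ordered list of models. This is a minimal illustration under stated assumptions: `complete_with_fallback`, `AllModelsFailed`, and the `call_model` callback are hypothetical names, and the set of retryable errors is reduced to timeouts and connection errors. It returns which model actually answered, so a test can check that your prompt still behaves after the switch.

```python
from typing import Callable, Sequence, Tuple

# Errors treated as "try the next model". Anything else propagates unchanged.
RETRYABLE = (TimeoutError, ConnectionError)


class AllModelsFailed(Exception):
    """Raised when every model in the chain hit a retryable error."""


def complete_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, text) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except RETRYABLE as exc:
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))


# Fire the fallback on purpose with a stand-in that simulates a primary outage.
def flaky_call(model: str, prompt: str) -> str:
    if model == "primary-model":
        raise TimeoutError("simulated timeout")
    return f"[{model}] reply to: {prompt}"


used, text = complete_with_fallback("Hello", ["primary-model", "backup-model"], flaky_call)
print(used, "->", text)
```

Forcing the failure in a test, as `flaky_call` does here, is the cheap way to see what the backup model actually returns for your prompt before an outage does it for you.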
4. Three types: self-hosted, hosted, SDK
"LLM gateway" gets used as one label, but where it runs splits it into three fairly distinct characters. Mistake this and you'll pick wrong.
| Type | Where it runs | Examples | Who it suits |
|---|---|---|---|
| ① Self-hosted proxy | Your servers (separate process) | LiteLLM / Portkey (OSS) | Keep data in-house and governed |
| ② Hosted (SaaS) | The gateway vendor's cloud | OpenRouter / Cloudflare AI Gateway | Use it instantly, zero ops |
| ③ SDK / library | Inside your app code | Vercel AI SDK | Abstract quickly in TS/JS |
① Self-hosted is an independent process (a proxy server) you stand up on your own infrastructure. Because prompts don't pass through an external SaaS, it's strong on governance and audit, but you operate it yourself. ② Hosted means the gateway vendor runs the proxy, so it's the fastest to adopt, but requests pass through a third party. ③ SDK stands up no separate process; it absorbs provider differences inside your app code. It is an abstraction layer rather than a network relay, and it can be combined with ① or ②.
5. Comparing the main tools
Here are the three headliners in recommended order, plus two more worth knowing. Figures are based on each vendor's official pages as of July 2026 (offerings change, so always confirm the latest against the primary source).
LiteLLM — the standard self-hosted proxy
LiteLLM (by BerriAI) is an open-source Python library and self-hosted gateway. It lets you call 100+ providers and 2,500+ models through a single OpenAI-compatible API (per the official repo). Stand it up as a proxy and you get cost tracking, virtual keys, rate limiting, fallback, load balancing, Redis caching, and observability (Langfuse/Prometheus/Datadog integrations). It's the first pick for organizations that want to keep prompts in-house.
OpenRouter — multi-provider with one key, instantly
OpenRouter is a hosted gateway with no ops. With a single OpenAI-compatible API and one API key, it gives access to 400+ models per the official site. Its pricing design stands out: the official site states "we do not mark up inference tokens (catalog prices equal each provider's published prices)", while charging a 5.5% platform fee on credit purchases (per openrouter.ai/pricing). It's by far the quickest route when the goal is "just get it running" or "try every vendor with one key."
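To see what the fee means in practice, here is a small worked calculation. It assumes the 5.5% figure quoted above is applied as a flat percentage on each credit purchase, with no minimum charge or other fees; the function name `total_charge` is hypothetical, and the live numbers should be checked at openrouter.ai/pricing.

```python
from decimal import Decimal

# 5.5% platform fee on credit purchases, as quoted in the article.
# Assumption: a flat percentage, with no minimum fee or payment surcharge.
PLATFORM_FEE_RATE = Decimal("0.055")


def total_charge(credits: Decimal) -> Decimal:
    """Amount paid to obtain `credits` worth of inference, rounded to cents."""
    return (credits * (1 + PLATFORM_FEE_RATE)).quantize(Decimal("0.01"))


print(total_charge(Decimal("100")))  # 105.50
```

Under that assumption, every dollar of model usage costs 1.055 dollars in total, which is where the "model price plus a few percent" rule of thumb comes from.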
Vercel AI SDK — abstract from code in TypeScript
Vercel AI SDK (just "AI SDK" in 2026) is an open-source TypeScript toolkit. Rather than a separate proxy process, it's an abstraction layer that absorbs provider differences inside your app code. What the docs call the "architectural core" is provider abstraction: switching from OpenAI to Anthropic means changing one import and one model string — your generation, streaming, and tool-calling code stays fully intact. Pair it with the hosted Vercel AI Gateway and you reach 100+ models. For the implementation details and code, see our complete Vercel AI SDK guide.
Two more to know
- Cloudflare AI Gateway: A managed, edge-run option. Just route your existing provider calls through it and you get caching, rate limiting, analytics, logging, and fallback with minimal code change (per the docs). A great fit if you already run on Cloudflare.
- Portkey: A control plane that adds production-grade governance, guardrails, and prompt management to a gateway. The official site says it connects 1,600+ LLMs through one API. The OSS version can be self-hosted too.
| Tool | Type | Interface | Focus | Pricing model |
|---|---|---|---|---|
| LiteLLM | ① self-host | OpenAI-compatible API | Governance, virtual keys, observability | OSS free + your ops cost |
| OpenRouter | ② hosted | OpenAI-compatible API | Instant, 400+ models with one key | No inference markup; 5.5% on purchases |
| Vercel AI SDK | ③ SDK | TS functions | Switch from code, type-safe | SDK free + each vendor's billing |
| Cloudflare AI Gateway | ② hosted (edge) | Pass-through | Caching, observability | Cloudflare pricing |
| Portkey | ① / ② both | Unified API | Governance, guardrails | OSS + SaaS plans |
6. Minimal setup (code)
It looks daunting, but the crux of switching is one single place — swap the endpoint (or the model string). Here's the minimal example for each of the three types.
② Hosted: OpenRouter (just swap the endpoint)
Keep your usual OpenAI SDK; change only base_url and the key to reach 400+ models.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # this is the only swap
    api_key="sk-or-...",                      # your OpenRouter key
)

resp = client.chat.completions.create(
    model="anthropic/claude-opus-4.8",  # change to "openai/gpt-5.5" and you've switched
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```
① Self-hosted: LiteLLM (stand up your own proxy)
List your models in a config file, and one command stands up an OpenAI-compatible gateway on localhost:4000. Your app just points there.
```yaml
# config.yaml
model_list:
  - model_name: claude
    litellm_params:
      model: anthropic/claude-opus-4-8
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt
    litellm_params:
      model: openai/gpt-5.5
      api_key: os.environ/OPENAI_API_KEY
```

```shell
# start (serves an OpenAI-compatible API at http://localhost:4000)
litellm --config config.yaml
```
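On the app side, "pointing there" can look like the sketch below. It uses only the standard library to build an OpenAI-style chat-completions request against the proxy address from the config above, addressing the model by its alias (claude or gpt). The helper names, the placeholder key, and the assumption that the proxy accepts a bearer token on /v1/chat/completions are illustrative; whether a key is required depends on how the proxy is configured.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:4000"  # where the proxy from the config listens


def build_chat_request(model_alias, prompt, api_key="sk-placeholder"):
    """Build an OpenAI-style chat request addressed to the local proxy.

    model_alias is a model_name from config.yaml ("claude" or "gpt").
    api_key is a placeholder; substitute whatever key your proxy expects.
    """
    body = json.dumps({
        "model": model_alias,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{PROXY_URL}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the reply text (needs the proxy running)."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Switching models is a one-word change in the alias; nothing else moves.
request = build_chat_request("claude", "Hello")
print(request.full_url)
```

With the proxy running, `send(request)` returns the reply text. In practice you would more likely keep your existing OpenAI SDK and change only its base URL, as in the OpenRouter example.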
③ SDK: Vercel AI SDK (change the model string in code)
Keep the import and function; swap just the model string to switch.
```typescript
import { generateText } from 'ai';

const { text } = await generateText({
  model: 'anthropic/claude-opus-4.8', // change to 'openai/gpt-5.5'
  prompt: 'Hello',
});
console.log(text);
```
In every case you haven't touched a single line of app logic. That's the effect of a gateway/abstraction. Fallback and caching are added on top of this via configuration (each vendor's docs are the quickest path to the exact syntax).
7. How to choose
Choose not by "which is best" but by which fits your constraints. Apply them in this order and you'll rarely get stuck.
Just get it running / solo, PoC, small team → OpenRouter. One key, zero ops, try every vendor's models. Treat the 5.5% fee as the price of not running it yourself.
Building in TypeScript / Next.js → Vercel AI SDK. Type-safe abstraction from code, plus a full streaming UI kit. For the implementation, head to the complete guide.
Don't want data leaving / need org-wide governance → self-host LiteLLM (or Portkey OSS). Hand virtual keys to teams and hold cost and logs in one place.
Already built on Cloudflare → Cloudflare AI Gateway: route your existing calls through it and add caching and observability.
Combinations are normal in practice. For example, you can write the app with the Vercel AI SDK but point it at a LiteLLM proxy on the backend to centralize company-wide cost and keys. This two-tier setup works precisely because the SDK and proxy types are separate layers. As insurance against dependency risk, slotting in a local LLM as one fallback target is becoming standard too.
8. Caveats and limits — not free
A gateway is convenient, but since it adds a layer, there's always a cost. Factor these four in before adopting one.
- Latency: With a relay in the middle, latency rises slightly. Hosted types feel geographic distance especially. Caching often offsets it, but for ultra-low-latency use, measure.
- The gateway's own failure point: You get resilient to provider outages, but if the gateway itself goes down, everything does. Build in redundancy, health checks, and a direct-call escape route.
- Fees: Hosted types add a fee (OpenRouter is 5.5% of purchases); self-hosted adds server ops cost. The break-even shifts with scale.
- Feature loss: Converging on the OpenAI-compatible common denominator means each vendor's unique features (extended thinking, special tool formats) may not pass through or arrive late.
One more thing often overlooked: privacy. Routing through a hosted gateway means your prompts and responses pass through a third party's infrastructure. If you handle sensitive data, check the intermediary's data-handling policy, or keep prompts in-house from the start with a self-hosted type (like LiteLLM). For production use in an organization, the safe approach is to apply least privilege and isolation to the gateway's own keys and logs as well.
Summary
- An LLM gateway is a relay between your app and the providers. It lets you reach every model through one API.
- It takes on six chores: unified API, fallback, cost tracking, caching, rate limiting, observability.
- There are three types — ① self-hosted (LiteLLM) / ② hosted (OpenRouter) / ③ SDK (Vercel AI SDK). Choose by constraint.
- How to choose: instant = OpenRouter / TS build = Vercel AI SDK / governance = LiteLLM. Combinations are normal.
- Don't forget the costs: a hop of latency, the gateway's own failure point, fees, feature loss, privacy.
- Fallback doesn't work just because it's configured — fire it for real and verify your prompt doesn't break.
If you're working with multiple models, a gateway is becoming not a "nice to have" but basic equipment for collecting the plumbing in one place. Start by swapping base_url with OpenRouter or changing one model string with the Vercel AI SDK — that small step dissolves the lock-in to a single vendor and makes both comparison and fallback suddenly realistic. For exact, current specs, confirm each vendor's primary source (LiteLLM / OpenRouter / AI SDK).
FAQ
Q. Are an LLM gateway and an LLM proxy different things?
A. They're used almost interchangeably. Both refer to a relay standing between your app and the providers. If anything, "proxy" leans toward the mechanism (relaying traffic), while "gateway" leans toward the role (including cost management and governance).
Q. If OpenRouter has "no markup," why can it end up pricier?
A. The per-token inference rate is each provider's published price (no markup), but per the official site there's a 5.5% platform fee on credit purchases. The smaller your top-up, the more that share bites, so estimate the effective cost as "model price + a few percent." Confirm the latest at openrouter.ai/pricing.
Q. Vercel AI SDK or LiteLLM — which should I use?
A. They're separate layers, so they don't compete. The Vercel AI SDK is in-code abstraction (for TS/JS); LiteLLM is a separate-process proxy (language-agnostic, governance-oriented). Build a TS app fast with the former; hold company-wide cost, keys, and logs in one place with the latter. Stacking both is common.
Q. Does adding a gateway make things slower?
A. Adding one relay does add a little latency. But where caching kicks in, it's often faster instead. If ultra-low latency is a requirement, place a self-hosted type nearby, lean on caching, and keep a direct-call escape for critical paths to contain the impact.
Q. Do I need a gateway even if I use only one provider?
A. It isn't required. But cost visibility, access control via virtual keys, caching, and observability are often worth having on their own. If you might add models or share it across a team later, slotting one in early makes the migration easier.