You built your app on OpenAI's API. Now you want to try Claude too, and compare Gemini. But every provider has a different SDK, request shape, and error behavior. Each switch means rewriting code, transforming responses, and maintaining separate retry logic per vendor. Before long, "vendor-specific plumbing" has soaked into every corner of your app. And while you're pinned to one provider, the moment that company has an outage, raises prices, or shuts a model down, your app takes the hit with it.

The thing that takes over all that plumbing is an LLM gateway (AI gateway), also called an LLM proxy. It's a relay that sits between your app and the providers, exposing one API (usually OpenAI-compatible) to reach every model, and handling the cross-cutting chores — fallback, cost tracking, caching, rate limiting. This guide covers what a gateway does for you, the difference between the self-hosted, hosted, and SDK types, how to choose among LiteLLM, OpenRouter, and the Vercel AI SDK, and the limits you need to know so you don't get burned.

The 30-second answer

If you only read one box

What it is: A relay between your app and the providers. Reach every model through one API.
Why it helps: Switch, compare, and fall back freely. Manage cost and rate limits in one place.
Which to pick first: Self-host = LiteLLM / instant hosted = OpenRouter / TS app = Vercel AI SDK.

Note: a gateway is no free lunch. It costs you a hop of latency, fees, and some feature loss (§8).

1. Why you need an LLM gateway

If you only call a single provider through a single SDK, you don't need a gateway. You need one the moment you want to use more than one model. Look at the three classic pains.

🔗 Vendor lock-in and scattered code

Each provider has different SDKs, parameter names, response structures, and error codes. Every switch means rewriting your app.

⚡ Outages, price hikes, shutdowns

Depend fully on one company and its outage or price change becomes your downtime. You want an escape hatch (fallback).

🔀 Compare, switch, mix and match

The best model differs by task. You want to use a cheap model to draft and a smart one to polish — but the plumbing gets in the way.

What the three share is that a strategic choice, which model to use, ends up dictated by SDK constraints. A gateway carves that plumbing out of your app. Your app only needs to know one endpoint; which provider gets called behind it, where to fail over, and how much you've spent are the gateway's job. And because AI agents and agent frameworks almost always assume multiple models, that demand only grows.

2. What an LLM gateway is

An LLM gateway is a proxy that sits between your app and one or more LLM providers. Most expose a single API shaped like OpenAI's chat-completions endpoint and consolidate in one place the cross-cutting work that would otherwise be scattered through your code — routing, retries and fallback, caching, rate limiting, cost tracking, and access control.

[Diagram: Your app (knows one API only, OpenAI-compatible) → LLM gateway (routing / fallback, cost / cache / control) → The providers (OpenAI / Anthropic, Google / local…)]
Your app sees one window — the gateway. Who it calls switches behind the scenes.

The point is to have a single window. Your app code just passes a string as the model parameter. Write anthropic/claude-opus-4.8 and you get Claude; write openai/gpt-5.5 and you get GPT, with nothing else in the app changing. Decisions like "fail over to another model when this one is down" or "answer this identical question from cache" are all made on the gateway side. Mixing in a local LLM so that "sensitive data stays local, everything else goes to the cloud" works the same way.
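As a rough illustration of "only the model string changes", here is a minimal sketch of the app-side choice. The local alias `local/llama` and the `contains_sensitive_data` flag are hypothetical names for illustration; how sensitivity is detected and what your gateway calls its local model are assumptions you would replace.

```python
# Hypothetical model identifiers; use whatever names your gateway exposes.
CLOUD_MODEL = "anthropic/claude-opus-4.8"
LOCAL_MODEL = "local/llama"  # assumed alias for a locally hosted model


def pick_model(contains_sensitive_data: bool) -> str:
    """Return the model string the app hands to the gateway."""
    return LOCAL_MODEL if contains_sensitive_data else CLOUD_MODEL


def build_request(prompt: str, contains_sensitive_data: bool) -> dict:
    """Build an OpenAI-style chat payload; only the `model` field varies."""
    return {
        "model": pick_model(contains_sensitive_data),
        "messages": [{"role": "user", "content": prompt}],
    }
```

Everything except the `model` value is identical in both cases, which is the property the single window is meant to give you.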

3. What it handles for you

The cross-cutting work a gateway takes on falls into roughly these six buckets. Tools differ in what they're good at, but the direction is shared.

🔌 Unified API

Call every provider in one format (usually OpenAI-compatible). Erasing vendor differences from the app is the key feature.

🔁 Fallback and retry

When the primary model errors, overloads, or times out, automatically switch to another. The heart of business continuity.

💰 Cost tracking and virtual keys

See spend per user, team, or project. Hand out scoped virtual keys that hide the real ones.

⚡ Caching

Remember and instantly return identical or similar requests. Cuts both API bills and latency.

🚦 Rate limiting and load balancing

Per-key token and request limits, plus load balancing across multiple keys and instances.

📊 Observability and guardrails

Measure logs, latency, and success rate across all requests. Some tools also let you insert input/output guardrails.
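To make the caching bucket concrete, here is a minimal sketch of exact-match caching, assuming the cache key is a hash of the canonicalized request. It is not how any particular gateway implements it, and "similar request" (semantic) caching is a different technique not shown here. `call_provider` is a hypothetical stand-in for the real upstream call.

```python
import hashlib
import json


def cache_key(request: dict) -> str:
    """Hash a request so that key order in the dict does not matter."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory cache: identical requests hit the provider only once."""

    def __init__(self):
        self._store = {}

    def get_or_call(self, request: dict, call_provider):
        key = cache_key(request)
        if key not in self._store:
            # Cache miss: pay for one upstream call and remember the result.
            self._store[key] = call_provider(request)
        return self._store[key]
```

A production cache would also need expiry and a shared store (the article mentions Redis for LiteLLM); both are left out to keep the sketch short.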

💡 "Fallback" doesn't equal "safe." The model you fail over to has different output quirks, token counts, and supported features. Fallback doesn't become safe the instant you configure it — it works only once you've actually fired it and tested it. Always verify beforehand that your prompt doesn't break after the switch.
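The fallback behavior described above can be sketched as a simple ordered loop. This is an illustration under stated assumptions, not any gateway's actual logic: `call_model` is a hypothetical function you supply, and the sketch treats every exception as a reason to move on, whereas a real gateway would distinguish retryable errors (timeouts, overload) from permanent ones.

```python
from typing import Callable, Sequence


class AllModelsFailed(Exception):
    """Raised when every model in the fallback chain has errored."""


def complete_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try models in order; return (model_used, text) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # sketch only: narrow this in real code
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))
```

Returning the model name alongside the text matters for the warning above: you can log which model actually answered and check that your prompt still behaves after a switch.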

4. Three types: self-hosted, hosted, SDK

"LLM gateway" gets used as one label, but where it runs splits it into three fairly distinct characters. Mistake this and you'll pick wrong.

Type | Where it runs | Examples | Who it suits
① Self-hosted proxy | Your servers (separate process) | LiteLLM / Portkey (OSS) | Keep data in-house and governed
② Hosted (SaaS) | The provider's cloud | OpenRouter / Cloudflare | Use it instantly, zero ops
③ SDK / library | Inside your app code | Vercel AI SDK | Abstract quickly in TS/JS

① Self-hosted is an independent process (a proxy server) you stand up on your own infrastructure. Because prompts don't pass through an external SaaS, it's strong on governance and audit — but you run it yourself. ② Hosted has the provider run the proxy, so it's the fastest to adopt, but requests pass through a third party. ③ SDK stands up no separate process; it absorbs provider differences inside your app code — not a network relay but an "abstraction layer," and it can be combined with ① or ②.

5. Comparing the main tools

Here are the three headliners in recommended order, plus two more worth knowing. Figures are based on each vendor's official pages as of July 2026 (offerings change, so always confirm the latest against the primary source).

LiteLLM — the standard self-hosted proxy

LiteLLM (by BerriAI) is an open-source Python library and self-hosted gateway. It lets you call 100+ providers and 2,500+ models through a single OpenAI-compatible API (per the official repo). Stand it up as a proxy and you get cost tracking, virtual keys, rate limiting, fallback, load balancing, Redis caching, and observability (Langfuse/Prometheus/Datadog integrations). It's the first pick for organizations that want to keep prompts in-house.
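To show what "cost tracking with virtual keys" amounts to, here is a minimal sketch of the bookkeeping involved. It is not LiteLLM's implementation. The prices in `PRICES` are made-up placeholder numbers, and the key names are hypothetical.

```python
from collections import defaultdict

# Placeholder prices in USD per 1M tokens as (input, output).
# These are invented for the example; real prices come from each provider.
PRICES = {"gpt": (1.0, 2.0), "claude": (3.0, 6.0)}


class SpendLedger:
    """Accumulate spend per virtual key so teams can be billed or capped."""

    def __init__(self):
        self._spend = defaultdict(float)

    def record(self, virtual_key: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        """Price one request, add it to the key's total, return its cost."""
        in_price, out_price = PRICES[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self._spend[virtual_key] += cost
        return cost

    def total(self, virtual_key: str) -> float:
        """Total spend so far for one virtual key (0.0 if never used)."""
        return self._spend.get(virtual_key, 0.0)
```

Because the app only ever holds the virtual key, the real provider keys stay inside the proxy while spend is still attributed per team or project.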

OpenRouter — multi-provider with one key, instantly

OpenRouter is a hosted gateway with no ops. With a single OpenAI-compatible API and one API key, it gives access to 400+ models per the official site. Its pricing design stands out: the official site states "we do not mark up inference tokens (catalog prices equal each provider's published prices)", while charging a 5.5% platform fee on credit purchases (per openrouter.ai/pricing). It's overwhelmingly fast for "just get it running" and "try every vendor with one key."

Vercel AI SDK — abstract from code in TypeScript

Vercel AI SDK (just "AI SDK" in 2026) is an open-source TypeScript toolkit. Rather than a separate proxy process, it's an abstraction layer that absorbs provider differences inside your app code. What the docs call the "architectural core" is provider abstraction: switching from OpenAI to Anthropic means changing one import and one model string — your generation, streaming, and tool-calling code stays fully intact. Pair it with the hosted Vercel AI Gateway and you reach 100+ models. For the implementation details and code, see our complete Vercel AI SDK guide.

Two more to know

☁️ Cloudflare AI Gateway

A managed, edge-run option. Just route your existing provider calls through it and you get caching, rate limiting, analytics, logging, and fallback with minimal code change (per the docs). A great fit if you already run on Cloudflare.

🛡️ Portkey

A control plane that adds production-grade governance, guardrails, and prompt management to a gateway. The official site says it connects 1,600+ LLMs through one API. The OSS version can be self-hosted too.

Tool | Type | Window | Focus | Pricing idea
LiteLLM | ① self-host | OpenAI-compatible API | Governance, virtual keys, observability | OSS free + your ops cost
OpenRouter | ② hosted | OpenAI-compatible API | Instant, 400+ models with one key | No inference markup; 5.5% on purchases
Vercel AI SDK | ③ SDK | TS functions | Switch from code, type-safe | SDK free + each vendor's billing
Cloudflare AI Gateway | ② hosted (edge) | Pass-through | Caching, observability | Cloudflare pricing
Portkey | ① / ② both | Unified API | Governance, guardrails | OSS + SaaS plans
Figures and pricing per each vendor's official pages as of July 2026. They change — re-confirm the primary source at adoption.

6. Minimal setup (code)

It looks daunting, but the crux of switching is one single place — swap the endpoint (or the model string). Here's the minimal example for each of the three types.

② Hosted: OpenRouter (just swap the endpoint)

Keep your usual OpenAI SDK; change only base_url and the key to reach 400+ models.

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # this is the only swap
    api_key="sk-or-...",                       # your OpenRouter key
)

resp = client.chat.completions.create(
    model="anthropic/claude-opus-4.8",  # change to "openai/gpt-5.5" and you've switched
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

① Self-hosted: LiteLLM (stand up your own proxy)

List your models in a config file, and one command stands up an OpenAI-compatible gateway on localhost:4000. Your app just points there.

# config.yaml
model_list:
  - model_name: claude
    litellm_params:
      model: anthropic/claude-opus-4-8
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt
    litellm_params:
      model: openai/gpt-5.5
      api_key: os.environ/OPENAI_API_KEY

# shell: start the proxy (serves an OpenAI-compatible API at http://localhost:4000)
litellm --config config.yaml
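On the app side, "just points there" could look like the following sketch. It uses only the standard library so the request shape is visible; in practice you would more likely reuse the OpenAI SDK with `base_url` set to the proxy. The `/v1/chat/completions` path and the Bearer key are assumptions about a default proxy setup, and the key value is a placeholder for whatever key your proxy is configured to accept.

```python
import json
import urllib.request

# Assumed address of the local proxy started from the config above.
PROXY_URL = "http://localhost:4000/v1/chat/completions"


def build_chat_request(model_alias: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-style chat request aimed at the proxy."""
    body = json.dumps({
        "model": model_alias,  # "claude" or "gpt": the model_name aliases in config.yaml
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        PROXY_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the first choice's text (needs a running proxy)."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Calling `send(build_chat_request("claude", "Hello", "sk-placeholder"))` would go through the proxy; switching to the other model is a matter of passing `"gpt"` instead.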

③ SDK: Vercel AI SDK (change the model string in code)

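Keep the import and function; swap just the model string to switch. (A bare model string like this is resolved through the Vercel AI Gateway in recent AI SDK versions; if you use a direct provider package instead, you also change the provider import.)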

import { generateText } from 'ai';

const { text } = await generateText({
  model: 'anthropic/claude-opus-4.8',  // change to 'openai/gpt-5.5'
  prompt: 'Hello',
});
console.log(text);

In every case you haven't touched a single line of app logic. That's the effect of a gateway/abstraction. Fallback and caching are added on top of this via configuration (each vendor's docs are the quickest path to the exact syntax).

7. How to choose

Choose not by "which is best" but by which fits your constraints. Apply them in this order and you'll rarely get stuck.

Just get it running / solo, PoC, small team → OpenRouter. One key, zero ops, try every vendor's models. Treat the 5.5% fee as the price of not running it yourself.

Building in TypeScript / Next.js → Vercel AI SDK. Type-safe abstraction from code, plus a full streaming UI kit. For the implementation, head to the complete guide.

Don't want data leaving / need org-wide governance → self-host LiteLLM (or Portkey OSS). Hand virtual keys to teams and hold cost and logs in one place.

Already built on Cloudflare → Cloudflare AI Gateway. Route your existing calls through it and add caching and observability.

Combinations are normal in practice. For example, "write the app with the Vercel AI SDK, but point its back door at a LiteLLM proxy to centralize company-wide cost and keys" is a two-tier setup that works precisely because the SDK and proxy types are separate layers. As insurance against dependency risk, slotting a local LLM in as one fallback target is becoming standard too.

8. Caveats and limits — not free

A gateway is convenient, but since it adds a layer, there's always a cost. Factor these four in before adopting one.

⏱️ One hop of latency

With a relay in the middle, latency rises slightly. Hosted gateways are especially exposed to geographic distance between you, the gateway, and the provider. Caching often offsets it, but for ultra-low-latency use cases, measure.

🎯 A new single point of failure

You get resilient to provider outages, but if the gateway itself goes down, everything does. Build in redundancy, health checks, and a direct-call escape route.

💸 Fees and ops cost

Hosted types add a fee (OpenRouter is 5.5% of purchases); self-hosted adds server ops cost. The break-even shifts with scale.

🧩 Feature loss

Converging on the OpenAI-compatible common denominator means each vendor's unique features (extended thinking, special tool formats) may not pass through or arrive late.
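Since the latency advice above comes down to "measure", here is a minimal sketch of one way to do it: time the same call through the gateway and directly, then compare medians. `via_gateway` and `direct` are hypothetical zero-argument callables you would wrap around your real requests, and a handful of runs gives only a rough picture.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(call: Callable[[], object], runs: int = 5) -> float:
    """Run `call` several times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def gateway_overhead_ms(via_gateway: Callable[[], object],
                        direct: Callable[[], object],
                        runs: int = 5) -> float:
    """Median latency through the gateway minus median latency when calling directly."""
    return median_latency_ms(via_gateway, runs) - median_latency_ms(direct, runs)
```

The median is used rather than the mean so that a single slow outlier does not dominate. Model response times vary a lot from call to call, so keep the prompt and model fixed across both paths.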

One more thing often overlooked: privacy. Routing through a hosted gateway means your prompts and responses pass through a third party's infrastructure. If you handle sensitive data, check the intermediary's data-handling policy, or keep prompts in-house with a self-hosted type (like LiteLLM) in the first place. For production in an organization, treat the gateway's own keys and logs as subjects of least privilege and isolation too — that's the safe side.

Summary

  • An LLM gateway is a relay between your app and the providers. It lets you reach every model through one API.
  • It takes on six chores: unified API, fallback, cost tracking, caching, rate limiting, observability.
  • There are three types — ① self-hosted (LiteLLM) / ② hosted (OpenRouter) / ③ SDK (Vercel AI SDK). Choose by constraint.
  • How to choose: instant = OpenRouter / TS build = Vercel AI SDK / governance = LiteLLM. Combinations are normal.
  • Don't forget the costs: a hop of latency, the gateway's own failure point, fees, feature loss, privacy.
  • Fallback doesn't work just because it's configured — fire it for real and verify your prompt doesn't break.

If you're working with multiple models, a gateway is becoming not a "nice to have" but basic equipment for collecting the plumbing in one place. Start by swapping base_url with OpenRouter or changing one model string with the Vercel AI SDK — that small step dissolves the lock-in to a single vendor and makes both comparison and fallback suddenly realistic. For exact, current specs, confirm each vendor's primary source (LiteLLM / OpenRouter / AI SDK).

FAQ

Q. Are an LLM gateway and an LLM proxy different things?

A. They're used almost interchangeably. Both refer to a relay standing between your app and the providers. If anything, "proxy" leans toward the mechanism (relaying traffic), while "gateway" leans toward the role (including cost management and governance).

Q. If OpenRouter has "no markup," why can it end up pricier?

A. The per-token inference rate is each provider's published price (no markup), but per the official site there's a 5.5% platform fee on credit purchases. The smaller your top-up, the more that share bites, so estimate the effective cost as "model price + a few percent." Confirm the latest at openrouter.ai/pricing.
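The "model price + a few percent" estimate can be sketched as simple arithmetic. This assumes the 5.5% fee is added on top of the credit amount you buy, and it models only that flat percentage; any minimum or fixed charge, which would weigh more on small top-ups, is not modeled here and should be checked on the pricing page.

```python
PLATFORM_FEE_RATE = 0.055  # 5.5% on credit purchases, the figure quoted in this article


def purchase_total(credits_usd: float, fee_rate: float = PLATFORM_FEE_RATE) -> float:
    """Amount paid to obtain `credits_usd` of credits (assumes the fee is added on top)."""
    return credits_usd * (1 + fee_rate)


def effective_cost(inference_cost_usd: float, fee_rate: float = PLATFORM_FEE_RATE) -> float:
    """Inference spend at catalog prices, grossed up by the purchase fee."""
    return inference_cost_usd * (1 + fee_rate)
```

Under these assumptions, 100 USD of credits costs about 105.50 USD, so a workload that burns 100 USD at catalog prices effectively costs that much.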

Q. Vercel AI SDK or LiteLLM — which should I use?

A. They're separate layers, so they don't compete. The Vercel AI SDK is in-code abstraction (for TS/JS); LiteLLM is a separate-process proxy (language-agnostic, governance-oriented). Build a TS app fast with the former; hold company-wide cost, keys, and logs in one place with the latter. Stacking both is common.

Q. Does adding a gateway make things slower?

A. Adding one relay does add a little latency. But where caching kicks in, it's often faster instead. If ultra-low latency is a requirement, place a self-hosted type nearby, lean on caching, and keep a direct-call escape for critical paths to contain the impact.

Q. Do I need a gateway even if I use only one provider?

A. Not required. But there's often value even from cost visibility, access control via virtual keys, caching, and observability alone. If you might add models or use it across a team later, slotting one in early makes the migration easier.