Resource · Reference architecture

What is an AI gateway?

A proxy between your application and your AI model providers. Every model call routes through a single endpoint where credentials are managed centrally, traffic is routed based on cost and availability, and every request is logged.

By Sagar Batchu, Co-founder & CEO, Speakeasy
Definition

AI gateway

A proxy between applications and AI model providers (sometimes called an LLM gateway). Every model call routes through a single endpoint where provider credentials are stored centrally, traffic is routed based on cost and availability, and every request is logged with token counts and latency.



An AI gateway is a reverse proxy for model API calls. The term “LLM gateway” is used interchangeably, particularly when the scope is limited to text-based language models. Application code, or an AI agent, sends every request to one gateway endpoint. The gateway decides which provider will handle the request, injects the stored credentials, applies rate limits, and forwards the request to the correct model API. When the provider responds, the gateway returns the response to the caller. The application never holds a provider API key and never needs to know which model provider is running behind the endpoint.

The pattern is identical to an API gateway or service mesh proxy, applied to the model-call layer. It centralizes the concerns that would otherwise be scattered across every service that talks to a model: credential management, failover, rate limiting, cost attribution, and content inspection.

Two use cases drive adoption: workflow reliability (running AI at production scale without outages, runaway loops, or capacity surprises) and governance (controlling cost, enforcing access, and producing the audit trail that makes AI usage explainable to security, finance, and compliance). Most teams arrive through one of these doors and discover they need the other.

The problem it solves

Without an AI gateway, every application or agent that calls a model manages its own provider credentials. API keys live in environment variables, hardcoded in application code, or scattered across developer config files. When a provider goes down, the application fails unless it has its own retry logic. Costs are tracked in separate dashboards per vendor with no unified view across the organization.

Three problems compound as the number of applications and agents grows:

Credential scatter. Each application holds its own copy of every provider API key it uses. When a key needs to be rotated or revoked, every application has to be updated. When a developer leaves, their personal API keys may remain active in services they configured because there is no central inventory of which keys exist or what they can access.

Cost blindness. Without a gateway, the only view of model costs is the per-vendor billing dashboard. There is no way to attribute spend to a specific team, service, or agent. A runaway agentic loop that generates thousands of model calls in minutes may not surface until the end-of-month invoice.

Reliability gaps. When a provider has an outage, every application that lacks its own failover logic goes down with it. Building per-provider retry logic in each application is duplicated effort that belongs in a shared layer.

How it works

An AI gateway is a reverse proxy. Application code sends every model request to a single gateway endpoint rather than directly to a provider API such as OpenAI, Anthropic, Google, or Mistral. The gateway selects a provider based on routing rules, injects the stored credential, and forwards the request. The response flows back through the gateway, where it is logged with token counts, latency, and cost before being returned to the caller.

Reference architecture · Model call lifecycle

AI gateway

Every model call from the application routes through the gateway, where it is authenticated, routed, rate-limited, and logged before reaching the provider.
[Diagram: Application / AI agent (app backend · AI agent · analytics · custom code) sends a model request (model: gpt-4o · messages · max_tokens: 1024) → AI gateway: authenticate (per-service API keys, no provider keys in app code) · route (provider selection rules, e.g. summarize → gpt-4o-mini, reasoning → claude-sonnet, default → gpt-4o, with failover configured) · rate limit (token and cost controls per team) · log (per-request telemetry: tokens, latency, cost, attributed by service) → Model providers: OpenAI (gpt-4o, gpt-4o-mini) · Anthropic (claude-sonnet, claude-haiku) · Google (gemini-flash, gemini-pro) · Mistral (mistral-large, mixtral)]

The lifecycle of a model call through an AI gateway:

  • The application sends a request to the gateway with a model name and prompt. The gateway endpoint is the only URL the application needs to know.
  • The gateway authenticates the calling service against a registry of per-service credentials. No provider keys ever reach application code.
  • The gateway evaluates routing rules to select a provider: cost targets, task type, latency requirements, or data residency constraints.
  • The gateway applies rate limits scoped by team, service, or individual agent. If the request exceeds a limit, it returns a 429 with a retry-after header.
  • If the selected provider is unavailable, the gateway retries automatically against the configured fallback provider. The application sees a successful response either way.
  • The response flows back through the gateway. Every call is logged: model used, token counts, latency, cost, and the service identity that made the call.

The gateway does not change the model API interface. Applications still use the standard OpenAI-compatible client format. What changes is the endpoint they point to.
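
To make that concrete, here is a minimal sketch of the application side, assuming the gateway exposes an OpenAI-compatible endpoint; the gateway URL and the service-scoped token are placeholders, not real values.

```python
# Minimal sketch: the application talks to the gateway exactly as it would
# talk to a provider's OpenAI-compatible API. Only the base URL and the
# credential change; both values below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://ai-gateway.internal/v1",  # gateway endpoint, not api.openai.com
    api_key="svc-app-backend-token",            # service-scoped gateway credential, never a provider key
)

response = client.chat.completions.create(
    model="gpt-4o",  # the gateway may remap this according to its routing rules
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```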

Two use cases: reliability and governance

Teams adopt AI gateways for two distinct reasons, and the feature set splits cleanly between them.

Workflow reliability is the engineering case. The job is to make AI calls work consistently at production volume, surviving provider outages, absorbing traffic spikes, and keeping latency predictable. This is the door engineering teams usually walk through first.

Governance is the security, finance, and compliance case. The job is to make AI usage explainable: who called what model with what input, what it cost, who paid, and what risks the traffic carries (leaked secrets, prompt injection, regulated data flowing to a third-party API). This is the door IT, security, and finance leaders walk through.

The same gateway serves both. A few features land on both sides of the line. Rate limiting protects reliability and enforces quotas; caching reduces latency and reduces cost. It’s clearer, though, to introduce them by primary use case.

Reliability features

Multi-provider routing. Routing rules direct calls to different providers based on cost, latency targets, task type, or data residency requirements. A summarization task routes to a cheaper model. A reasoning task routes to a more capable one. All of this happens transparently to the calling application.
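
A routing table like the one the diagram sketches can be expressed very simply; the rule shape and task labels below are illustrative, not any particular gateway's configuration format.

```python
# Illustrative routing rules: map a task label to a provider/model pair.
# A real gateway would also weigh cost ceilings, latency targets, and
# data-residency constraints when picking a route.
ROUTING_RULES = {
    "summarize": {"provider": "openai",    "model": "gpt-4o-mini"},
    "reasoning": {"provider": "anthropic", "model": "claude-sonnet"},
    "default":   {"provider": "openai",    "model": "gpt-4o"},
}

def select_route(task_type: str) -> dict:
    """Return the provider/model pair for a request, falling back to the default."""
    return ROUTING_RULES.get(task_type, ROUTING_RULES["default"])
```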

Failover. When a provider returns an error or is unreachable, the gateway retries against a configured fallback. Provider outages become invisible to the application. Without a gateway, every application needs its own retry logic for each provider it uses.
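
Gateway-side failover reduces to a loop over an ordered provider list. A rough sketch, with call_provider standing in for however the gateway actually forwards a request:

```python
# Failover sketch: try providers in configured order, return the first success.
FALLBACK_ORDER = ["openai", "anthropic"]

def call_provider(provider: str, request: dict) -> dict:
    """Stand-in for the gateway's forwarding logic; a real implementation
    would make the HTTP call to the provider's API here."""
    raise ConnectionError(f"{provider} unreachable")  # placeholder behavior

def forward_with_failover(request: dict) -> dict:
    last_error = None
    for provider in FALLBACK_ORDER:
        try:
            return call_provider(provider, request)
        except (ConnectionError, TimeoutError) as exc:  # provider error or outage
            last_error = exc
    raise RuntimeError("all configured providers failed") from last_error
```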

Caching. Identical or semantically similar prompts can return cached responses, reducing latency and cost for repeated queries. Semantic caching is harder than traditional API caching, because the variability of natural language prompts makes a cache hit non-trivial to define. It pays off for agents that run the same classification or summarization tasks repeatedly across large volumes.
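
An exact-match cache is the simplest version of this and is enough to illustrate the mechanism; semantic caching would replace the hash lookup with an embedding-similarity check. A sketch:

```python
# Exact-match prompt cache: identical (model, messages) pairs return the
# stored response instead of triggering a new provider call. Illustrative only.
import hashlib
import json

_cache: dict[str, dict] = {}

def _key(model: str, messages: list[dict]) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def get_cached(model: str, messages: list[dict]) -> dict | None:
    return _cache.get(_key(model, messages))

def store_response(model: str, messages: list[dict], response: dict) -> None:
    _cache[_key(model, messages)] = response
```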

Rate limiting. Limits are set per service, team, or agent. A runaway agentic loop cannot generate unbounded cost because it is rate-limited at the gateway before reaching the provider. Limits are defined once and enforced consistently across all callers.
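
A per-service token bucket is one common way to implement this; the service names and requests-per-minute budgets below are illustrative.

```python
# Per-service token-bucket rate limiter sketch. When allow() returns False,
# the gateway responds with 429 and a retry-after header instead of forwarding.
import time

LIMITS_RPM = {"app-backend": 2000, "analytics-svc": 500, "agent-pipeline": 200}

class Bucket:
    def __init__(self, rpm: int):
        self.capacity = rpm
        self.tokens = float(rpm)
        self.refill_per_sec = rpm / 60.0
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {service: Bucket(rpm) for service, rpm in LIMITS_RPM.items()}
```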

Governance features

Centralized credential management. All provider credentials are stored in the gateway. Application code authenticates to the gateway with a service-scoped credential issued through standard protocols (OAuth 2.x, OIDC). No provider API key ever lives in application code or a developer’s environment. When a key is rotated or a service is decommissioned, revocation happens in one place.
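
The credential boundary looks roughly like this: the caller presents a service-scoped token, and the provider key is injected server-side. Names and in-memory storage are placeholders; a real deployment would use OAuth/OIDC-issued identities and a secrets manager.

```python
# Credential-boundary sketch: provider keys live only inside the gateway.
SERVICE_REGISTRY = {         # gateway credential -> service identity (placeholders)
    "svc-app-backend-token": "app-backend",
    "svc-analytics-token": "analytics-svc",
}
PROVIDER_KEYS = {            # stored centrally; never returned to callers
    "openai": "sk-stored-in-gateway",
    "anthropic": "sk-ant-stored-in-gateway",
}

def authenticate(service_token: str) -> str:
    """Resolve the calling service, or reject the request."""
    service = SERVICE_REGISTRY.get(service_token)
    if service is None:
        raise PermissionError("unknown service credential")
    return service

def provider_auth_header(provider: str) -> dict:
    """Inject the stored provider key when forwarding; the caller never sees it."""
    return {"Authorization": f"Bearer {PROVIDER_KEYS[provider]}"}
```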

Cost tracking and attribution. Every model call is logged with the model used, token counts, latency, and cost. This gives a single view across all providers rather than separate vendor billing dashboards. Usage is attributed to the service, team, or agent that generated it, which is what makes chargeback and budget enforcement possible.
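
Attribution is just arithmetic once every call is logged with its token counts and calling service. A sketch, with placeholder prices rather than real provider rates:

```python
# Cost-attribution sketch. Per-model prices are placeholders; a production
# gateway would load current pricing from configuration.
PRICE_PER_1K_TOKENS = {
    "gpt-4o":      {"input": 0.0025, "output": 0.0100},  # illustrative numbers only
    "gpt-4o-mini": {"input": 0.0002, "output": 0.0006},
}

usage_ledger: dict[str, float] = {}  # service identity -> accumulated spend

def record_cost(service: str, model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICE_PER_1K_TOKENS[model]
    cost = (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
    usage_ledger[service] = usage_ledger.get(service, 0.0) + cost  # attribute to the caller
    return cost
```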

Content inspection and guardrails. Inbound prompts can be scanned for prompt injection patterns, secrets, and policy violations. Outbound completions can be scanned for PII, sensitive data exfiltration, and disallowed content. The gateway is the single enforcement point where these checks run, so the same rules apply regardless of which application made the call or which model handled it.
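
A minimal inbound check might look like the following; the regex patterns and the block/redact/log decision are illustrative stand-ins for a real guardrail engine.

```python
# Inbound prompt inspection sketch: flag obvious secret shapes and email-like PII.
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),   # OpenAI-style API key shape
    re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key ID shape
]
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def inspect_prompt(text: str) -> list[str]:
    findings = []
    if any(p.search(text) for p in SECRET_PATTERNS):
        findings.append("possible credential in prompt")
    if EMAIL_PATTERN.search(text):
        findings.append("email address (possible PII) in prompt")
    return findings  # gateway policy then decides: block, redact, or log and forward
```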

Audit logging. Every request and response is recorded with the calling identity, model used, token counts, latency, and policy decisions. The log is what makes incident response and compliance audits tractable: one trail to look at, not one per vendor or one per team.
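
The record itself is small. Field names below are illustrative, not a specific product's log schema:

```python
# Shape of a per-request audit record, serialized for a log pipeline or SIEM.
import json
import time

def audit_record(service: str, provider: str, model: str,
                 input_tokens: int, output_tokens: int,
                 latency_ms: float, policy_decision: str) -> str:
    return json.dumps({
        "timestamp": time.time(),
        "service": service,                  # calling identity
        "provider": provider,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "policy_decision": policy_decision,  # e.g. "forwarded", "rate_limited", "blocked"
    })
```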

What it doesn’t govern

Shadow AI, the growing problem of untracked AI tool usage within organizations, has one piece that an AI gateway addresses well: by centralizing provider credentials and surfacing model costs, it closes the visibility gap at the model-call layer.

The model call is only one slice of what an AI agent does. When an agent uses a tool to read customer records, query a database, or execute commands through an MCP server, those actions happen after inference and outside the AI gateway’s view. The gateway sees the prompt and the completion. It has no record of which tools the model called, what data those tools returned, or what the agent did next. A personally configured MCP server, an agent with unchecked access to production APIs, a tool call that exfiltrates internal data: none of these are model calls, so none are visible here.

The more dangerous surface of Shadow AI sits at the tool-calling layer. If ungoverned agent behavior is the concern that brought you to this page, what you need is an MCP gateway, which governs every tool call an agent makes. The AI gateway and MCP gateway are complementary today, and the architecture that puts both on a shared identity foundation is the AI control plane.

Why now?

Two forces are making AI gateways visible as infrastructure rather than optional tooling.

Agentic workflows changed the volume calculus. A single user-facing request that triggers an agentic workflow can generate dozens of model calls: planning, tool selection, execution, evaluation, and retry. At that volume, credential scatter and cost blindness are no longer manageable problems. A runaway loop, a misconfigured rate limit, an unrotated key that outlives the service that created it: each becomes a real incident rather than a theoretical risk.

Multi-provider environments are the norm, not the exception. Most organizations are not running exclusively on one model provider. Different teams have different preferences. Different tasks have different cost and capability requirements. Different compliance regimes have different data residency requirements. Managing credentials and routing rules separately in each application does not scale.

AI gateway vs MCP gateway (and where they’re heading)

These two layers are often confused because both sit in front of AI infrastructure and both are described as “gateways.” Today they govern different parts of the interaction and run at different points in the request lifecycle.

AI gateway
  • What it governs: prompts, completions, model routing
  • When it runs: before and after inference

MCP gateway
  • What it governs: tool calls, MCP server access, data returned
  • When it runs: after the model decides to use a tool
  • Example vendors: Speakeasy, Runlayer, MintMCP

An AI gateway sees the prompt and the completion. It does not see what happens after the model decides to call a tool. When Claude Code or Cursor uses an MCP server to read a file or query a database, that action happens after the model call, and outside the AI gateway’s view.

An MCP gateway closes that gap. It sits between the AI agent and the MCP servers it calls, enforcing access policy and logging every tool call with its arguments and result.

The two layers are complementary today, but the category is consolidating. Gartner’s 2025 Market Guide for AI Gateways predicts that MCP gateways (and A2A gateways for agent-to-agent traffic) will fold into broader AI gateway platforms as the market matures: “expect to see consolidation into a single offering that supports multiple AI use cases for gateways.” The consolidated layer needs a name. We call it the AI control plane: a single governing layer that handles every AI-to-system interaction, with the AI gateway and MCP gateway functions sitting on a shared identity, policy, and observability foundation.

Which layer do you need?

Start with the lightest layer that solves the problem in front of you.

Problem → layer to start with:

  • Provider API keys in application code → AI gateway
  • No unified view of model spend across teams → AI gateway
  • Provider outages taking down services → AI gateway
  • Agents making unbounded model calls → AI gateway
  • No visibility into which tools agents are calling → MCP gateway
  • Agents accessing data they shouldn't → MCP gateway
  • PII flowing through MCP tools without redaction → MCP gateway
  • Data exfiltration via tool calls → MCP gateway
  • Multiple teams, agents, and tool servers with no shared identity layer → AI control plane
  • Need both model-call and tool-call governance on a common policy foundation → AI control plane

Vendor landscape

The AI gateway category spans API management platforms with AI extensions (Kong, Apigee, Cloudflare), purpose-built open-source proxies (LiteLLM, Portkey, OpenRouter, Helicone), and hosted observability platforms with routing features. The table below covers the purpose-built proxies most commonly evaluated for production deployments. For a detailed comparison, see Choosing an LLM gateway.

Reference · Vendor comparison

AI gateway features

Illustrative, not exhaustive.
Dimensions compared per product: multi-provider routing, failover, rate limiting, cost tracking, caching, and content inspection.

  • LiteLLM · open source · self-hostable
  • Portkey · hosted · self-hostable
  • OpenRouter · hosted · routing-focused
  • Helicone · hosted · observability-focused

A note on Speakeasy

Speakeasy builds the AI control plane, the governing layer between every AI agent in the organization and every system those agents are allowed to reach. Our MCP gateway works in tandem with every major LLM gateway.

The MCP gateway, which governs tool calls and third-party data access, is the production-ready foundation of the Speakeasy control plane today. The AI gateway lane is expanding from there.

If the problem is that model calls are ungoverned and tool calls are invisible, the architecture starts with the MCP gateway and the AI gateway as complementary layers on a shared identity foundation. The term to have in mind while working through the design is AI control plane. It names the full layer both gateways are part of, and the reference architecture is here.

Frequently asked questions

What is an AI gateway?

An AI gateway (also called an LLM gateway) is a proxy between your application and your AI model providers. Instead of application code calling OpenAI, Anthropic, or Mistral directly with hardcoded credentials, all model calls route through one endpoint. The gateway manages provider API keys, applies routing rules to select a provider, handles failover if a provider is unavailable, enforces rate limits, and logs every request with token counts, latency, and cost.

What is the difference between an AI gateway and an LLM gateway?

The terms are used interchangeably. "LLM gateway" emphasizes the model-call layer for text-based language models, while "AI gateway" (the term Gartner uses) is the broader category that can also cover image, speech, and other non-LLM model traffic. The underlying pattern is the same: a proxy between applications and AI model providers that centralizes credentials, routing, and observability.

Do I need an AI gateway if I only use one model provider?

For a single provider, the main benefits are centralized API key management (credentials stored in the gateway rather than scattered across application code) and a unified log of every model call. Rate limiting and failover become more relevant as call volume grows. For teams running agentic workflows, where a single task can trigger dozens of model calls, visibility into token usage and cost attribution becomes important regardless of how many providers are in use.

How is an AI gateway different from an MCP gateway?

An AI gateway sits in front of the language model and governs what goes into and out of the model: prompts, completions, provider routing, and cost. An MCP gateway sits after the model has decided to use a tool, governing which tools the agent can reach, with what arguments, and under what conditions. The two layers are complementary today: an AI gateway sees every model call, an MCP gateway sees every tool call, and neither can see what the other sees. Gartner expects the categories to consolidate over time, which is the trend the AI control plane represents.

How is an AI gateway different from an AI control plane?

An AI gateway handles the model-call layer only. An AI control plane puts the AI gateway and the MCP gateway on a shared identity foundation, adds content-level inspection (PII detection, prompt injection blocking, data exfiltration detection), and produces a single correlated log across all model calls and tool calls. The control plane is what makes incident response and compliance audits tractable, because both logs share a common identity and can be reconstructed into a single interaction trace.

What is multi-provider routing and why does it matter?

Multi-provider routing means the gateway can direct model calls to different providers based on rules you define: cost, latency targets, task type, or data residency requirements. A single application endpoint sends requests to the gateway; the gateway decides which provider handles each one. This matters because no single provider is optimal for every task. Routing summarization tasks to a cheaper model and reasoning tasks to a more capable one can reduce cost significantly without changing application code.

How does failover work in an AI gateway?

When a provider returns an error or is unavailable, the gateway retries the request against a configured fallback provider. The application receives a successful response either way. Without a gateway, the application either fails or needs its own per-provider retry logic. With a gateway, the failover configuration is defined once and applies to all applications routing through it. LiteLLM, for example, lets you define a primary model and a fallback in a single YAML file.

Does Speakeasy offer an AI gateway?

The Speakeasy AI control plane includes an AI gateway component, the Agents API, currently in early beta. It accepts a unified request format, routes to OpenAI, Anthropic, Google, and Mistral models, and supports multi-turn conversations and sub-agents. The control plane also includes the MCP gateway, which governs tool calls, and the Insights observability layer. The full control plane is what connects the model-call layer and the tool-call layer under a shared identity, so both can be audited together.
