LLM Gateway
The CRAFT platform routes agent LLM traffic through a shared LiteLLM gateway — a single cluster-level service that holds provider credentials centrally. This page describes the target architecture from the Agent Security ADR and the operator contract for configuring the gateway.This page is for platform operators and solution-team leads. Solution developers looking for how to call an LLM from their solution code should see Access LLMs instead.Target vs. current deployment. The shared gateway is the recommended model. Current em-talk2data Helm charts do not bundle a managed gateway — agents in those deployments invoke the LiteLLM SDK in-process against Vertex/Bedrock directly. Operators adopting the gateway pattern must deploy it separately and set
OPENAI_API_BASE / LITELLM_API_BASE on agent deployments.Architecture
Shared gateway model (Agent Security ADR)
Per the Agent Security ADR, agents are untrusted execution environments. The gateway enforces this through two architectural constraints:- No credentials in agent pods. Provider API keys (OpenAI, Anthropic, Google) live only in the LLM Gateway’s environment. Agents never receive or handle these secrets.
- All model access is brokered. Every completion request passes through the gateway’s allowlist check, budget enforcement, and audit pipeline before reaching a provider.
Identity sidecar vs. LLM gateway. The Agent Security ADR also describes a per-agent identity sidecar — a lightweight container that holds the workload’s Kubernetes ServiceAccount JWT and authenticates the agent to platform gateways (including the LLM Gateway) using short-lived scoped tokens. This is separate from the LLM gateway itself. The identity sidecar is a per-pod security boundary; the LLM gateway is a shared cluster service.
How requests flow
Credential isolation in practice: the LLM Gateway pod holdsVERTEXAI_PROJECT, ANTHROPIC_API_KEY, etc. The agent pod’s only gateway-related secrets are LLM_GATEWAY_URL (a ClusterIP service address) and LLM_GATEWAY_API_KEY (a scoped gateway key with no provider privilege). Neither reveals which provider is in use, and neither can be used to call a provider directly.
em-runtime-mcp as agent tool gateway
em-runtime-mcp is the tool-call gateway — the single endpoint through which every agent invokes a platform tool. The LLM gateway and em-runtime-mcp are complementary:
| Gateway | Handles | Enforces |
|---|---|---|
| LLM Gateway (LiteLLM) | LLM completions | Model allowlist, rate limits, cost attribution |
em-runtime-mcp | Platform tool calls | Per-agent tool allowlist, audit events |
Model registry
Allowlist configuration
The gateway’sconfig.yaml defines the models any agent in the cluster is permitted to call. Requests for unlisted models return 400 Model not allowed.
litellm.acompletion(model="openai/gemini-3.5-flash", api_base=LLM_GATEWAY_URL, ...) using the model_name alias. Switching providers is a gateway config change — no agent code changes required.
Environment-variable conventions. The gateway reads Vertex credentials from
VERTEXAI_PROJECT and VERTEXAI_LOCATION (LiteLLM’s documented convention,
matching the em-talk2data Helm chart values). LiteLLM also accepts Google Cloud SDK
standards GOOGLE_CLOUD_PROJECT + GOOGLE_CLOUD_LOCATION. Authenticate the gateway
pod to Vertex AI via Workload Identity (production) or GOOGLE_APPLICATION_CREDENTIALS
pointing to an ADC JSON (local dev).Provider routing
Prefix in litellm_params.model | Provider |
|---|---|
vertex_ai/ | Google Vertex AI |
openai/ | OpenAI |
anthropic/ | Anthropic (direct) |
gemini/ | Google Gemini (direct, non-Vertex) |
ollama/ | Local Ollama |
vllm/ | vLLM endpoint |
azure/ | Azure OpenAI |
model_name:
Authentication
Gateway API key
The gateway enforces amaster_key. Each project’s agents receive a distinct scoped key provisioned by the platform secrets pipeline. The key grants access to the gateway — it carries no provider-level privilege and cannot be used to call a model provider directly.
Agents receive the key via their Helm values:
OPENAI_API_KEY, ANTHROPIC_API_KEY, VERTEXAI_PROJECT, etc.) are mounted into the gateway Deployment only and are never visible to agent pods.
Budget enforcement
Rate limits
429 Too Many Requests; clients back off exponentially.
Cost ceilings
Overage behavior
- Gateway returns
429with aRetry-Afterheader. - Agent catches
429and applies exponential back-off with jitter. - If the alias has a fallback in
model_list, the router tries it automatically. - If no fallback and budget exhausted,
429propagates to the caller.
Observability
Langfuse traces
LiteLLM auto-emits traces to Langfuse when these env vars are set on the gateway Deployment:metadata the agent passes (project_id, solution, trace_id) — the basis for per-project cost attribution dashboards.
Prometheus and OpenTelemetry metrics
LiteLLM exposes/metrics (Prometheus format) on port 4000. The platform OTEL Collector scrapes it and forwards to your observability backend.
| Metric | Description |
|---|---|
litellm_requests_total | Total completion requests by model and status |
litellm_tokens_total | Total tokens consumed by model |
litellm_request_duration_seconds | Latency histogram |
litellm_spend_usd | Cumulative spend by model alias |
Provider configuration
Google Vertex AI (default)
Google Vertex AI (default)
GOOGLE_APPLICATION_CREDENTIALS). gemini-3.5-flash is the recommended fast tier.Anthropic on Vertex AI (secondary)
Anthropic on Vertex AI (secondary)
OpenAI (direct)
OpenAI (direct)
OPENAI_API_KEY from a Kubernetes Secret into the gateway Deployment only.Anthropic (direct)
Anthropic (direct)
ANTHROPIC_API_KEY from a Kubernetes Secret into the gateway Deployment only.Google Gemini (direct, non-Vertex)
Google Gemini (direct, non-Vertex)
Self-hosted — Ollama
Self-hosted — Ollama
Self-hosted — vLLM
Self-hosted — vLLM
openai/ prefix.Deploying the gateway
The LLM Gateway is a Kubernetes Deployment behind a ClusterIP Service. It is not a sidecar in each agent pod — it is a shared platform service maintained by the platform team.http://llm-gateway.craft.svc.cluster.local/v1.
Disaster recovery — gateway unavailable
Because the gateway is shared, a gateway outage affects all agents cluster-wide, not a single pod. Symptoms: agents receiveConnection refused or 503 Service Unavailable on their gateway URL.
Response:
- Check gateway pod health:
kubectl get pods -n craft -l app=llm-gateway - Check gateway logs:
kubectl logs -n craft deployment/llm-gateway --tail=50 - Check provider upstream status (Vertex, OpenAI, Anthropic status pages)
- If gateway pods are crash-looping:
kubectl rollout restart deployment/llm-gateway -n craft - If a model alias is hitting rate limits, the fallback entry in
model_listactivates automatically
Next steps
Access LLMs (solution dev)
How to call the gateway from solution code using litellm.
LLM Observability
Platform-side observability: Langfuse traces, cost dashboards, model comparison.
Manage Secrets
How provider keys and gateway keys flow through the secrets pipeline.
Platform Overview
How the gateway fits into the overall platform architecture.
MCP Server
Connect Claude Code, Cursor, Goose, or an external agent to CRAFT’s tool gateway over MCP.

