Skip to main content

LLM Gateway

The CRAFT platform routes agent LLM traffic through a shared LiteLLM gateway — a single cluster-level service that holds provider credentials centrally. This page describes the target architecture from the Agent Security ADR and the operator contract for configuring the gateway.
This page is for platform operators and solution-team leads. Solution developers looking for how to call an LLM from their solution code should see Access LLMs instead.Target vs. current deployment. The shared gateway is the recommended model. Current em-talk2data Helm charts do not bundle a managed gateway — agents in those deployments invoke the LiteLLM SDK in-process against Vertex/Bedrock directly. Operators adopting the gateway pattern must deploy it separately and set OPENAI_API_BASE / LITELLM_API_BASE on agent deployments.

Architecture

Shared gateway model (Agent Security ADR)

Per the Agent Security ADR, agents are untrusted execution environments. The gateway enforces this through two architectural constraints:
  1. No credentials in agent pods. Provider API keys (OpenAI, Anthropic, Google) live only in the LLM Gateway’s environment. Agents never receive or handle these secrets.
  2. All model access is brokered. Every completion request passes through the gateway’s allowlist check, budget enforcement, and audit pipeline before reaching a provider.
The LiteLLM gateway is deployed as a shared Kubernetes Deployment — one set of gateway replicas serves all agents in the cluster. Agents call it via its internal ClusterIP service address.
Identity sidecar vs. LLM gateway. The Agent Security ADR also describes a per-agent identity sidecar — a lightweight container that holds the workload’s Kubernetes ServiceAccount JWT and authenticates the agent to platform gateways (including the LLM Gateway) using short-lived scoped tokens. This is separate from the LLM gateway itself. The identity sidecar is a per-pod security boundary; the LLM gateway is a shared cluster service.

How requests flow

Credential isolation in practice: the LLM Gateway pod holds VERTEXAI_PROJECT, ANTHROPIC_API_KEY, etc. The agent pod’s only gateway-related secrets are LLM_GATEWAY_URL (a ClusterIP service address) and LLM_GATEWAY_API_KEY (a scoped gateway key with no provider privilege). Neither reveals which provider is in use, and neither can be used to call a provider directly.

em-runtime-mcp as agent tool gateway

em-runtime-mcp is the tool-call gateway — the single endpoint through which every agent invokes a platform tool. The LLM gateway and em-runtime-mcp are complementary:
GatewayHandlesEnforces
LLM Gateway (LiteLLM)LLM completionsModel allowlist, rate limits, cost attribution
em-runtime-mcpPlatform tool callsPer-agent tool allowlist, audit events

Model registry

Allowlist configuration

The gateway’s config.yaml defines the models any agent in the cluster is permitted to call. Requests for unlisted models return 400 Model not allowed.
# LLM Gateway config.yaml (mounted as a ConfigMap on the gateway Deployment).
# CRAFT defaults route to Vertex AI; the gateway is provider-agnostic, so other
# providers (OpenAI, Anthropic direct, self-hosted) can be added side-by-side.
model_list:
  # DEFAULT for most solutions and agents.
  # GA on Vertex AI Model Garden (since Google I/O 2026-05-19). Outperforms
  # gemini-3.1-pro on 6 of 8 major benchmarks — including agentic (MCP Atlas)
  # and coding (Terminal-Bench) — at 3.6× the speed and a fraction of the cost.
  - model_name: gemini-3.5-flash
    litellm_params:
      model: vertex_ai/gemini-3.5-flash
      vertex_project: os.environ/VERTEXAI_PROJECT
      vertex_location: global

  # Specialized variant. Use ONLY when custom tool calling is a core
  # requirement (the model prioritizes registered custom functions over bash
  # fallbacks). Per Google's own guidance: if >50% of requests don't involve
  # tool calling, stay on gemini-3.5-flash. Preview SLA; global endpoint only.
  - model_name: gemini-3.1-pro-preview-customtools
    litellm_params:
      model: vertex_ai/gemini-3.1-pro-preview-customtools
      vertex_project: os.environ/VERTEXAI_PROJECT
      vertex_location: global

  # Secondary / cross-provider option — demonstrates model agnosticity
  - model_name: claude-opus-4-8
    litellm_params:
      model: vertex_ai/claude-opus-4-8
      vertex_project: os.environ/VERTEXAI_PROJECT
      vertex_location: global

  # Self-hosted (no key required)
  - model_name: ollama/llama3.1
    litellm_params:
      model: ollama/llama3.1
      api_base: http://ollama.ollama.svc.cluster.local:11434

litellm_settings:
  drop_params: true       # tolerate extra params from different SDKs
  request_timeout: 120    # seconds

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
Agents call litellm.acompletion(model="openai/gemini-3.5-flash", api_base=LLM_GATEWAY_URL, ...) using the model_name alias. Switching providers is a gateway config change — no agent code changes required.
Use gemini-3.1-pro-preview-customtools only when extensive custom tool calling is core to the agent. It’s a fine-tuned variant that prioritizes registered custom functions over bash fallbacks. Google’s own guidance: if
50% of requests don’t involve tool calling, stay on gemini-3.5-flash — quality on non-tool workloads is lower on the customtools variant.
Three operational constraints: (1) global-endpoint onlyvertex_location must be global. (2) Lower quota — configure a fallback chain to gemini-3.5-flash or claude-opus-4-8. (3) Preview SLA — may be deprecated or renamed without GA notice.
Environment-variable conventions. The gateway reads Vertex credentials from VERTEXAI_PROJECT and VERTEXAI_LOCATION (LiteLLM’s documented convention, matching the em-talk2data Helm chart values). LiteLLM also accepts Google Cloud SDK standards GOOGLE_CLOUD_PROJECT + GOOGLE_CLOUD_LOCATION. Authenticate the gateway pod to Vertex AI via Workload Identity (production) or GOOGLE_APPLICATION_CREDENTIALS pointing to an ADC JSON (local dev).

Provider routing

Prefix in litellm_params.modelProvider
vertex_ai/Google Vertex AI
openai/OpenAI
anthropic/Anthropic (direct)
gemini/Google Gemini (direct, non-Vertex)
ollama/Local Ollama
vllm/vLLM endpoint
azure/Azure OpenAI
To add fallbacks, list multiple entries sharing the same model_name:
  - model_name: gemini-3.5-flash
    litellm_params:
      model: vertex_ai/gemini-3.5-flash
      vertex_project: os.environ/VERTEXAI_PROJECT
      vertex_location: global
  - model_name: gemini-3.5-flash    # fallback for resilience
    litellm_params:
      model: vertex_ai/claude-opus-4-8
      vertex_project: os.environ/VERTEXAI_PROJECT
      vertex_location: global

Authentication

Gateway API key

The gateway enforces a master_key. Each project’s agents receive a distinct scoped key provisioned by the platform secrets pipeline. The key grants access to the gateway — it carries no provider-level privilege and cannot be used to call a model provider directly. Agents receive the key via their Helm values:
# charts/your-agent/values.yaml
env:
  - name: LLM_GATEWAY_URL
    value: "http://llm-gateway.craft.svc.cluster.local/v1"
  - name: LLM_GATEWAY_API_KEY
    valueFrom:
      secretKeyRef:
        name: llm-gateway-keys
        key: project-key
The gateway’s own provider secrets (OPENAI_API_KEY, ANTHROPIC_API_KEY, VERTEXAI_PROJECT, etc.) are mounted into the gateway Deployment only and are never visible to agent pods.

Budget enforcement

Rate limits

router_settings:
  num_retries: 3
  timeout: 120
  retry_after: 5             # seconds between retries

model_list:
  - model_name: gemini-3.5-flash
    litellm_params:
      model: vertex_ai/gemini-3.5-flash
      vertex_project: os.environ/VERTEXAI_PROJECT
      vertex_location: global
      tpm: 100000             # tokens per minute (cluster-wide)
      rpm: 500                # requests per minute (cluster-wide)
Limits are enforced cluster-wide at the shared gateway, not per-agent. When a limit is exceeded, the gateway returns 429 Too Many Requests; clients back off exponentially.

Cost ceilings

litellm_settings:
  max_budget: 50.00          # USD, per model alias
  budget_duration: monthly

Overage behavior

  1. Gateway returns 429 with a Retry-After header.
  2. Agent catches 429 and applies exponential back-off with jitter.
  3. If the alias has a fallback in model_list, the router tries it automatically.
  4. If no fallback and budget exhausted, 429 propagates to the caller.
Do not remove the fallbacks list from the router config without first confirming the upstream has headroom. A saturated primary with no fallback causes a complete LLM outage for all agents in the cluster — not just one pod.

Observability

Langfuse traces

LiteLLM auto-emits traces to Langfuse when these env vars are set on the gateway Deployment:
LANGFUSE_HOST=https://langfuse.your-cluster.example.com
LANGFUSE_PUBLIC_KEY=<project-public-key>
LANGFUSE_SECRET_KEY=<project-secret-key>
Enable in the gateway config:
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
Every completion call generates a Langfuse trace tagged with the metadata the agent passes (project_id, solution, trace_id) — the basis for per-project cost attribution dashboards.

Prometheus and OpenTelemetry metrics

LiteLLM exposes /metrics (Prometheus format) on port 4000. The platform OTEL Collector scrapes it and forwards to your observability backend.
MetricDescription
litellm_requests_totalTotal completion requests by model and status
litellm_tokens_totalTotal tokens consumed by model
litellm_request_duration_secondsLatency histogram
litellm_spend_usdCumulative spend by model alias
OTLP export from the gateway:
litellm_settings:
  service_callback: ["otel"]

environment_variables:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://otel-collector:4317"
  OTEL_SERVICE_NAME: "llm-gateway"

Provider configuration

- model_name: gemini-3.5-flash
  litellm_params:
    model: vertex_ai/gemini-3.5-flash
    vertex_project: os.environ/VERTEXAI_PROJECT
    vertex_location: global

- model_name: gemini-3.1-pro-preview-customtools
  litellm_params:
    model: vertex_ai/gemini-3.1-pro-preview-customtools
    vertex_project: os.environ/VERTEXAI_PROJECT
    vertex_location: global    # preview model — global endpoint only
Authenticate the gateway with Workload Identity (GKE) or a mounted ADC JSON (GOOGLE_APPLICATION_CREDENTIALS). gemini-3.5-flash is the recommended fast tier.
- model_name: claude-opus-4-8
  litellm_params:
    model: vertex_ai/claude-opus-4-8
    vertex_project: os.environ/VERTEXAI_PROJECT
    vertex_location: global
Anthropic models served by Vertex AI Model Garden. Stays within the Vertex tenancy boundary — no separate Anthropic key required.
- model_name: gpt-5.5
  litellm_params:
    model: openai/gpt-5.5
    api_key: os.environ/OPENAI_API_KEY
Mount OPENAI_API_KEY from a Kubernetes Secret into the gateway Deployment only.
- model_name: claude-opus-4-8
  litellm_params:
    model: anthropic/claude-opus-4-8
    api_key: os.environ/ANTHROPIC_API_KEY
Mount ANTHROPIC_API_KEY from a Kubernetes Secret into the gateway Deployment only.
- model_name: gemini-3.5-flash
  litellm_params:
    model: gemini/gemini-3.5-flash
    api_key: os.environ/GEMINI_API_KEY
Prefer the Vertex AI entry above for production — it inherits the tenant’s IAM boundary rather than a long-lived API key.
- model_name: llama3.1
  litellm_params:
    model: ollama/llama3.1
    api_base: http://ollama.ollama.svc.cluster.local:11434
No API key required. The Ollama service must be reachable from the gateway namespace.
- model_name: llama3.1-70b
  litellm_params:
    model: openai/llama3.1-70b
    api_base: http://vllm.vllm.svc.cluster.local:8000/v1
    api_key: os.environ/VLLM_API_KEY
vLLM exposes an OpenAI-compatible API — use the openai/ prefix.

Deploying the gateway

The LLM Gateway is a Kubernetes Deployment behind a ClusterIP Service. It is not a sidecar in each agent pod — it is a shared platform service maintained by the platform team.
# Minimal production Deployment (add HPA + PodDisruptionBudget for HA)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-gateway
  namespace: craft
spec:
  replicas: 2          # scale horizontally for throughput; add Redis for shared state
  selector:
    matchLabels:
      app: llm-gateway
  template:
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          ports:
            - containerPort: 4000
          env:
            - name: LITELLM_MASTER_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-gateway-secrets
                  key: master-key
            - name: VERTEXAI_PROJECT
              valueFrom:
                secretKeyRef:
                  name: llm-gateway-secrets
                  key: vertex-project
          volumeMounts:
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
      volumes:
        - name: config
          configMap:
            name: llm-gateway-config
---
apiVersion: v1
kind: Service
metadata:
  name: llm-gateway
  namespace: craft
spec:
  selector:
    app: llm-gateway
  ports:
    - port: 80
      targetPort: 4000
  type: ClusterIP
Agents in any namespace reference the gateway as http://llm-gateway.craft.svc.cluster.local/v1.

Disaster recovery — gateway unavailable

Because the gateway is shared, a gateway outage affects all agents cluster-wide, not a single pod. Symptoms: agents receive Connection refused or 503 Service Unavailable on their gateway URL. Response:
  1. Check gateway pod health: kubectl get pods -n craft -l app=llm-gateway
  2. Check gateway logs: kubectl logs -n craft deployment/llm-gateway --tail=50
  3. Check provider upstream status (Vertex, OpenAI, Anthropic status pages)
  4. If gateway pods are crash-looping: kubectl rollout restart deployment/llm-gateway -n craft
  5. If a model alias is hitting rate limits, the fallback entry in model_list activates automatically
To verify the gateway is healthy:
curl http://llm-gateway.craft.svc.cluster.local/health | jq .
# Expected: {"status": "healthy"}

Next steps

Access LLMs (solution dev)

How to call the gateway from solution code using litellm.

LLM Observability

Platform-side observability: Langfuse traces, cost dashboards, model comparison.

Manage Secrets

How provider keys and gateway keys flow through the secrets pipeline.

Platform Overview

How the gateway fits into the overall platform architecture.

MCP Server

Connect Claude Code, Cursor, Goose, or an external agent to CRAFT’s tool gateway over MCP.