Access LLMs

Solutions never call OpenAI, Anthropic, or Gemini directly. Instead, point an OpenAI-compatible SDK at the platform’s LiteLLM gateway. You get provider-agnostic access, central API-key management, per-project cost attribution, rate-limit enforcement, and Langfuse traces — for free. The canonical reference implementation is in Data Insights › Text-to-SQL; this page extracts the patterns every solution needs.

Why a gateway

Without a gateway | With LiteLLM gateway
Each solution holds its own API keys | One key per provider, held centrally
Switching from OpenAI → Anthropic = code change | Switching is an env var change
Per-solution cost attribution requires custom telemetry | Free via gateway metadata + Langfuse
Rate limits enforced per-key (poorly) | Enforced centrally, per project
Adding observability requires per-call wrapping | Auto-instrumented via Langfuse hook
Read the diagram: your solution calls one OpenAI-compatible endpoint; the gateway routes by provider/model to the chosen upstream and auto-traces every call into Langfuse. The metadata={"project_id": ..., "solution": ...} you pass on every call is what makes per-project cost attribution possible.

Configure your service

Three env vars (plus Langfuse credentials when you want tracing):
charts/<solution>/values.yaml
api:
  env:
    LLM_GATEWAY_URL: "https://litellm.example.com/v1"   # platform operator provides this
    LLM_DEFAULT_MODEL: "openai/gemini-3.5-flash"
  envVars:
    - name: LLM_GATEWAY_API_KEY
      valueFrom: { secretKeyRef: { name: <solution>-secrets, key: llm-gateway-api-key } }

    # Observability (Langfuse) — see /guides/langfuse-setup
    - name: LANGFUSE_HOST
      valueFrom: { configMapKeyRef: { name: <solution>-langfuse, key: host } }
    - name: LANGFUSE_PUBLIC_KEY
      valueFrom: { secretKeyRef: { name: <solution>-secrets, key: langfuse-public-key } }
    - name: LANGFUSE_SECRET_KEY
      valueFrom: { secretKeyRef: { name: <solution>-secrets, key: langfuse-secret-key } }
The gateway URL and API key live in your secrets pipeline (see Manage Secrets).
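On the service side, the three variables can be read once at startup so that a missing value fails fast instead of surfacing on the first LLM call. The following is a minimal sketch, not part of the reference implementation: LLMSettings and load_llm_settings are hypothetical names, and the default model simply mirrors the value shown in values.yaml above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    """Hypothetical container for the gateway configuration."""
    gateway_url: str
    api_key: str
    default_model: str


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    """Read the gateway variables, raising when a required one is missing."""
    env = os.environ if env is None else env
    required = ("LLM_GATEWAY_URL", "LLM_GATEWAY_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    return LLMSettings(
        gateway_url=env["LLM_GATEWAY_URL"],
        api_key=env["LLM_GATEWAY_API_KEY"],
        # Assumption: fall back to the same default as the chart values
        default_model=env.get("LLM_DEFAULT_MODEL", "openai/gemini-3.5-flash"),
    )
```

Passing a plain dict as env keeps the function easy to unit-test without touching the process environment.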

Code patterns

Minimal call (litellm)

litellm speaks the OpenAI completions wire format and works against any compatible endpoint. Install with uv add litellm.
packages/api/src/api/llm.py
import os
import litellm

litellm.api_base = os.environ["LLM_GATEWAY_URL"]
litellm.api_key  = os.environ["LLM_GATEWAY_API_KEY"]
DEFAULT_MODEL = os.environ.get("LLM_DEFAULT_MODEL", "openai/gemini-3.5-flash")

async def complete(prompt: str, *, model: str | None = None, project_id: str, solution: str) -> str:
    response = await litellm.acompletion(
        model=model or DEFAULT_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,   # set generously — Gemini 3.5-flash uses thinking tokens before text
        # metadata flows to Langfuse + the gateway's cost-attribution layer
        metadata={
            "project_id": project_id,
            "solution":   solution,
            "trace_id":   "auto",
        },
    )
    return response.choices[0].message.content
Use the openai/ prefix when pointing litellm at a custom proxy (the CRAFT gateway). Without it, litellm infers the provider from the model name and routes directly to Anthropic/Vertex instead of through the proxy. openai/gemini-3.5-flash tells litellm “use the OpenAI wire format, route to whatever api_base says”; the gateway then forwards to the correct upstream.

Set max_tokens ≥ 200 for Gemini 3.5-flash at the global Vertex endpoint. The model uses adaptive thinking by default, consuming thinking tokens before text tokens. With max_tokens=30, all tokens go to thinking and the response is empty. The token budget needs room for both.
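An empty response caused by an exhausted token budget is easy to mistake for a model failure, so it can help to detect it explicitly. The sketch below is illustrative only: extract_text and EmptyCompletionError are hypothetical names, and it assumes the gateway reports finish_reason="length" (the OpenAI wire-format value for a hit token limit) when thinking tokens consume the whole budget, which you should confirm against your gateway.

```python
class EmptyCompletionError(RuntimeError):
    """Hypothetical error raised when the model returned no text."""


def extract_text(response) -> str:
    """Return the first choice's text, or raise with a hint when it is empty."""
    choice = response.choices[0]
    text = choice.message.content or ""
    if text.strip():
        return text
    reason = getattr(choice, "finish_reason", None)
    if reason == "length":
        # Assumption: "length" means the budget ran out before any text was emitted
        raise EmptyCompletionError(
            "empty completion with finish_reason='length': max_tokens was probably "
            "consumed before any text was produced; raise max_tokens"
        )
    raise EmptyCompletionError(f"empty completion (finish_reason={reason!r})")
```

In complete() above, the final line could then become return extract_text(response), which turns a silent empty string into an actionable error.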

Per-call model override

The default model is configurable at deploy time, but any call site can override it:
# Default for most calls — fastest + highest quality on agentic / coding benchmarks
fast_answer = await complete("…", model="openai/gemini-3.5-flash",                  project_id=p, solution=s)

# Only when extensive custom tool calling is core to this code path
# (e.g. the model otherwise calls `cat`/`grep`/`sed` instead of your registered tools)
tool_heavy  = await complete("…", model="openai/gemini-3.1-pro-preview-customtools", project_id=p, solution=s)

# Secondary provider — demonstrates model agnosticity
secondary   = await complete("…", model="openai/claude-opus-4-8",                   project_id=p, solution=s)

# Local dev with no external dependency
local_dev   = await complete("…", model="ollama/llama3.1",                   project_id=p, solution=s)
The gateway routes by the model name you pass. Use the openai/ prefix for models served over the OpenAI-compatible wire format (the common case for all gateway-hosted models); provider-native prefixes such as ollama/ apply only for local sidecars not routed through the gateway. Ask the platform operator for the current model list.
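To keep call sites from accidentally bypassing the gateway, the prefix rule can be applied in one place. This is a sketch under stated assumptions: gateway_model and LOCAL_PREFIXES are hypothetical names, and the only local prefix listed is the ollama/ one mentioned above.

```python
# Assumption: prefixes that address a local sidecar rather than the gateway
LOCAL_PREFIXES = ("ollama/",)


def gateway_model(name: str) -> str:
    """Return a model name that litellm will send to api_base over the OpenAI wire format."""
    if name.startswith("openai/") or name.startswith(LOCAL_PREFIXES):
        return name  # already explicit: leave untouched
    # Any other name gets the openai/ prefix so litellm does not infer a provider
    # and route directly; whether the gateway serves that name is up to its model list
    return f"openai/{name}"
```

For example, gateway_model("gemini-3.5-flash") returns "openai/gemini-3.5-flash", while "ollama/llama3.1" passes through unchanged.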
Use gemini-3.1-pro-preview-customtools only when extensive custom tool calling is a core requirement of your agent. It’s a fine-tuned variant of gemini-3.1-pro that prioritizes your registered custom functions over bash fallbacks: same intelligence as base 3.1 Pro, narrowed for one job. Google’s own guidance: if >50% of requests don’t involve tool calling, stay on gemini-3.5-flash; quality on non-tool workloads is measurably lower on the customtools variant (Apiyi: When to Switch to Customtools).

gemini-3.5-flash already beats gemini-3.1-pro on agentic benchmarks (MCP Atlas +5.4%) and coding (Terminal-Bench +6%) at 3.6× the speed (Gemini 3.5 launch data), so reach for the customtools variant only when you’ve measured that bash fallbacks are overriding your registered tools (Apiyi suggests: when bash usage exceeds 30% of actions that could have been handled by registered tools).

Operational notes when you do use it: preview SLA (may be deprecated or renamed without the GA notice period), lower quota than GA models (will 429 sooner under sustained load, so wrap it in a fallback chain as described under Failure modes below), global-endpoint only (vertex_location must be global), and Provisioned Throughput (PT) is not supported.

Streaming (SSE)

async def stream_completion(prompt: str, *, project_id: str, solution: str):
    response = await litellm.acompletion(
        model=DEFAULT_MODEL,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        metadata={"project_id": project_id, "solution": solution},
    )
    async for chunk in response:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
Wrap in a FastAPI StreamingResponse to expose to your UI as Server-Sent Events.
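Server-Sent Events expect each message as one or more data: lines followed by a blank line, so the raw deltas need framing before they reach the browser. The helper below is a minimal sketch; to_sse is a hypothetical name and the framing covers only the data field, with no event names or ids.

```python
from typing import AsyncIterator


async def to_sse(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Frame each text delta as one Server-Sent Events message."""
    async for chunk in chunks:
        # A delta containing newlines becomes several data: lines; the client
        # rejoins them with "\n" when it dispatches the event
        payload = "".join(f"data: {line}\n" for line in chunk.split("\n"))
        yield payload + "\n"  # the blank line terminates the event
```

In a FastAPI route, this could be returned as StreamingResponse(to_sse(stream_completion(prompt, project_id=project_id, solution=solution)), media_type="text/event-stream").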

Observability — Langfuse

When LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY are set and the Langfuse callback is registered, litellm emits a trace for every call. Register the callback once at startup with litellm.success_callback = ["langfuse"] (the gateway’s reference implementation does this in commons/llm/__init__.py of em-talk2data). Each call shows up as a trace carrying the metadata you passed (project, solution) alongside latency, token counts, model, and prompt plus completion. See Guides › Langfuse Setup and Deployment › Observability › Langfuse. For LLM-specific observability patterns, see Deployment › Observability › LLM Observability.
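One way to register the callback is a small startup hook that stays a no-op when the credentials are absent, for example in local development. This is a sketch rather than the reference implementation: enable_langfuse_tracing and LANGFUSE_VARS are hypothetical names, and the litellm module is passed in as an argument so the function can be tested with a stand-in object.

```python
import os
from typing import Mapping, Optional

LANGFUSE_VARS = ("LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def enable_langfuse_tracing(litellm_module, env: Optional[Mapping[str, str]] = None) -> bool:
    """Register the langfuse success callback only when all credentials are present."""
    env = os.environ if env is None else env
    if not all(env.get(name) for name in LANGFUSE_VARS):
        return False  # tracing stays off
    callbacks = list(getattr(litellm_module, "success_callback", None) or [])
    if "langfuse" not in callbacks:
        callbacks.append("langfuse")  # idempotent: safe to call more than once
    litellm_module.success_callback = callbacks
    return True
```

At startup this would be called as enable_langfuse_tracing(litellm) after import litellm.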

Cost attribution

The metadata={"project_id": ..., "solution": ...} you pass on each call is the only thing that gives the platform per-project cost attribution. Always pass it. Without it, the call rolls up to a generic bucket and the cost dashboard cannot tell you which project caused the spike. If you wrap litellm.acompletion in a service helper (recommended), make project_id a required parameter so it’s impossible to call without it.
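The "required parameter" idea can be sketched as a thin wrapper whose signature makes project_id keyword-only with no default. The names here are hypothetical (make_attributed_completion, CompletionFn), and the completion coroutine is injected so the wrapper does not depend on any particular client.

```python
from typing import Any, Awaitable, Callable, Dict, List

CompletionFn = Callable[..., Awaitable[Any]]


def make_attributed_completion(acompletion: CompletionFn, *, solution: str) -> CompletionFn:
    """Wrap a completion coroutine so every call carries project_id and solution."""

    async def complete(
        *, project_id: str, model: str, messages: List[Dict[str, Any]], **kwargs: Any
    ) -> Any:
        if not project_id:
            raise ValueError("project_id must be a non-empty string")
        # Merge rather than overwrite so callers can add their own metadata keys
        metadata = {**kwargs.pop("metadata", {}), "project_id": project_id, "solution": solution}
        return await acompletion(model=model, messages=messages, metadata=metadata, **kwargs)

    return complete
```

With litellm this would be built once, for example complete = make_attributed_completion(litellm.acompletion, solution="<solution>"); omitting project_id then raises a TypeError at the call site instead of producing an unattributed call.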

Failure modes

Rate limited: The gateway enforces per-project rate limits. Back off exponentially (tenacity with wait_random_exponential) and surface a friendly message to the user. Do not retry forever; let the user see the rate limit.
Gateway or upstream error: Either the gateway is down (rare) or the upstream provider rejected the request. Retry once with backoff, then fall back to a different model, cross-provider first for resilience. fallbacks=["openai/claude-opus-4-8", "openai/gemini-3.1-pro-preview"] is a litellm parameter; both entries route through the gateway’s model_list to Vertex AI. Don’t list gemini-3.1-pro-preview-customtools as a general fallback: it’s specialized for tool-heavy agents and may degrade quality for non-tool workloads.
Stale API key: LLM_GATEWAY_API_KEY was rotated and your pod hasn’t restarted. If you’ve configured Stakater Reloader on the secret (default for em-service), the deployment should already be rolling. If not, run kubectl rollout restart on the deployment manually.
Model not found: The model you requested isn’t routed by the gateway. Check the gateway’s model list with curl $LLM_GATEWAY_URL/models -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" | jq '.data[].id'.
Unattributed cost spike: A spike with no project attribution means a code path is calling without metadata. Grep for litellm.acompletion( and confirm every call passes metadata={"project_id": ..., "solution": ...}.
See Troubleshooting › LLM for more.
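The bounded backoff recommended for rate limits can be illustrated without any dependency. This is a standard-library sketch in the spirit of tenacity's wait_random_exponential, not a replacement for it: RateLimited is a hypothetical stand-in for whatever rate-limit exception your client raises, and the attempt count and delays are placeholder values.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable


class RateLimited(Exception):
    """Hypothetical stand-in for the client's rate-limit error."""


async def call_with_backoff(
    call: Callable[[], Awaitable[Any]],
    *,
    attempts: int = 4,
    base: float = 0.5,
    cap: float = 8.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Retry call on RateLimited with capped, jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await call()
        except RateLimited:
            if attempt == attempts - 1:
                raise  # bounded: surface the rate limit instead of retrying forever
            # Random wait up to min(cap, base * 2**attempt) to avoid synchronized retries
            await sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise AssertionError("unreachable")
```

The caller catches the final RateLimited and turns it into the friendly message for the user. The sleep parameter exists only so tests can run without real delays.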

Verification

# Confirm the gateway is reachable from the pod
kubectl -n em-<solution> exec deployment/<solution>-api -- \
  sh -c 'curl -s -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" $LLM_GATEWAY_URL/models | head -c 200; echo'

# Confirm a call attributes correctly in Langfuse
curl -s -H "Authorization: Bearer $TOKEN" -H "X-Project-ID: $PROJECT_ID" \
  http://localhost:8000/llm/test  # your route that issues a complete() call

# Then look in Langfuse for the trace tagged project_id=$PROJECT_ID
For gateway internals from an operator perspective (model allowlist, rate limits, provider routing), see LLM Gateway.

Next steps

Langfuse setup

Stand up Langfuse and wire it to your service.

LLM observability

The platform-side observability stack.

LLM observability deep-dive

Trace inspection, cost dashboards, model comparison.

Data Insights › Text-to-SQL

Reference implementation that uses this pattern in production.