Access LLMs

Solutions never call OpenAI, Anthropic, or Gemini directly. Instead, point an OpenAI-compatible SDK at the platform’s LiteLLM gateway. The gateway gives you provider-agnostic access, central API-key management, per-project cost attribution, rate-limit enforcement, and Langfuse traces, with no extra code in your solution beyond the metadata you pass on each call. The canonical reference implementation is in Data Insights › Text-to-SQL; this page extracts the patterns every solution needs.

Why a gateway

Without a gateway	With LiteLLM gateway
Each solution holds its own API keys	One key per provider, held centrally
Switching from OpenAI → Anthropic = code change	Switching is an env var change
Per-solution cost attribution requires custom telemetry	Free via gateway metadata + Langfuse
Rate limits enforced per-key (poorly)	Enforced centrally, per project
Adding observability requires per-call wrapping	Auto-instrumented via Langfuse hook

Read the diagram: your solution calls one OpenAI-compatible endpoint; the gateway routes by provider/model to the chosen upstream and auto-traces every call into Langfuse. The metadata={"project_id": ..., "solution": ...} you pass on every call is what makes per-project cost attribution possible.

Configure your service

Three env vars (plus Langfuse credentials when you want tracing):

charts/<solution>/values.yaml

api:
  env:
    LLM_GATEWAY_URL: "https://litellm.example.com/v1"   # platform operator provides this
    LLM_DEFAULT_MODEL: "gpt-4o-mini"
  envVars:
    - name: LLM_GATEWAY_API_KEY
      valueFrom: { secretKeyRef: { name: <solution>-secrets, key: llm-gateway-api-key } }

    # Observability (Langfuse) — see /guides/langfuse-setup
    - name: LANGFUSE_HOST
      valueFrom: { configMapKeyRef: { name: <solution>-langfuse, key: host } }
    - name: LANGFUSE_PUBLIC_KEY
      valueFrom: { secretKeyRef: { name: <solution>-secrets, key: langfuse-public-key } }
    - name: LANGFUSE_SECRET_KEY
      valueFrom: { secretKeyRef: { name: <solution>-secrets, key: langfuse-secret-key } }

In the example above, the API key comes from your secrets pipeline (see Manage Secrets), while the gateway URL is set as a plain value that the platform operator provides.
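A service can fail fast at startup when one of these variables is missing, instead of failing on the first LLM call. The following is a minimal sketch of that idea, not the platform's implementation: GatewaySettings and from_env are hypothetical names introduced for illustration, and only the environment variable names and the gpt-4o-mini default come from this page.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class GatewaySettings:
    """Hypothetical container for the three gateway variables."""

    url: str
    api_key: str
    default_model: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "GatewaySettings":
        # Collect every missing required variable so one error reports them all.
        required = ("LLM_GATEWAY_URL", "LLM_GATEWAY_API_KEY")
        missing = [name for name in required if not env.get(name)]
        if missing:
            raise RuntimeError(f"missing required env vars: {', '.join(missing)}")
        return cls(
            url=env["LLM_GATEWAY_URL"].rstrip("/"),
            api_key=env["LLM_GATEWAY_API_KEY"],
            # The model is optional; fall back to the default used on this page.
            default_model=env.get("LLM_DEFAULT_MODEL", "gpt-4o-mini"),
        )
```

Calling GatewaySettings.from_env() once during startup surfaces a misconfigured chart as a clear error in the pod logs.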

Code patterns

Minimal call (litellm)

litellm speaks the OpenAI chat completions wire format and works against any compatible endpoint. Install it with uv add litellm.

packages/api/src/api/llm.py

import os
import litellm

litellm.api_base = os.environ["LLM_GATEWAY_URL"]
litellm.api_key  = os.environ["LLM_GATEWAY_API_KEY"]
DEFAULT_MODEL = os.environ.get("LLM_DEFAULT_MODEL", "gpt-4o-mini")

async def complete(prompt: str, *, model: str | None = None, project_id: str, solution: str) -> str:
    response = await litellm.acompletion(
        model=model or DEFAULT_MODEL,
        messages=[{"role": "user", "content": prompt}],
        # metadata flows to Langfuse + the gateway's cost-attribution layer
        metadata={
            "project_id": project_id,
            "solution":   solution,
            "trace_id":   "auto",
        },
    )
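    # content can be None (for example, on a tool-call response); keep the declared str return type
    return response.choices[0].message.content or ""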

Per-call model override

The default model is configurable at deploy time, but any call site can override it:

fast_answer = await complete("…", model="gpt-4o-mini",   project_id=p, solution=s)
careful_one = await complete("…", model="claude-3-5-sonnet-20241022", project_id=p, solution=s)
local_dev   = await complete("…", model="ollama/llama3.1", project_id=p, solution=s)

The gateway routes on the model string, including any provider/ prefix such as ollama/. Any model the gateway is configured to serve is callable; ask the platform operator for the current list.

Streaming (SSE)

async def stream_completion(prompt: str, *, project_id: str, solution: str):
    response = await litellm.acompletion(
        model=DEFAULT_MODEL,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        metadata={"project_id": project_id, "solution": solution},
    )
    async for chunk in response:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

Wrap in a FastAPI StreamingResponse to expose to your UI as Server-Sent Events.

Observability — Langfuse

litellm emits traces once the Langfuse callback is registered and LANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY are set. Register the callback by setting litellm.success_callback = ["langfuse"] once at startup (the gateway’s reference implementation does this in commons/llm/__init__.py of em-talk2data). Each call then shows up as a trace carrying the metadata you passed (project, solution) along with latency, token counts, model, and the prompt and completion. See Guides › Langfuse Setup and Deployment › Observability › Langfuse. For LLM-specific observability patterns, see Deployment › Observability › LLM Observability.
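Tracing is optional in local development, so it can help to register the callback only when all three Langfuse variables are present. This is a small sketch under that assumption; langfuse_callbacks is a hypothetical helper, not part of litellm or the reference implementation.

```python
import os
from typing import Mapping

LANGFUSE_VARS = ("LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def langfuse_callbacks(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return ["langfuse"] only when all three Langfuse variables are set."""
    if all(env.get(name) for name in LANGFUSE_VARS):
        return ["langfuse"]
    # Partial configuration is treated as "tracing off" rather than an error.
    return []


# At service startup (requires litellm to be installed):
#   import litellm
#   litellm.success_callback = langfuse_callbacks()
```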

Cost attribution

The metadata={"project_id": ..., "solution": ...} you pass on each call is the only thing that gives the platform per-project cost attribution. Always pass it. Without it, the call rolls up to a generic bucket and the cost dashboard cannot tell you which project caused the spike. If you wrap litellm.acompletion in a service helper (recommended), make project_id a required parameter so it’s impossible to call without it.
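One way to make the attribution fields hard to forget is to build the metadata dict in a single place. The sketch below assumes a hypothetical helper named attribution_metadata. Treating blank values as errors is a design choice of this example, not documented gateway behavior.

```python
def attribution_metadata(*, project_id: str, solution: str) -> dict[str, str]:
    """Build the metadata dict; keyword-only arguments without defaults make both mandatory."""
    for name, value in (("project_id", project_id), ("solution", solution)):
        # Reject empty or whitespace-only values: they give a dashboard nothing to group by.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} must be a non-empty string")
    return {"project_id": project_id, "solution": solution}
```

A wrapper around litellm.acompletion can then pass metadata=attribution_metadata(project_id=project_id, solution=solution), so omitting either argument raises a TypeError before the gateway is called.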

Failure modes

429 Too Many Requests

The gateway enforces per-project rate limits. Back off exponentially (tenacity with wait_random_exponential) and surface a friendly message to the user. Do not retry forever — let the user see the rate limit.
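The bounded-retry behavior described here can be sketched with the standard library alone. This is a stand-in for tenacity's wait_random_exponential, not a copy of it: retry_with_backoff, its parameters, and the default limits are illustrative assumptions.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    call: Callable[[], Awaitable[T]],
    *,
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 4,
    base: float = 1.0,
    cap: float = 20.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry `call` with capped, randomized exponential backoff, then give up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except Exception as exc:
            # Non-retryable errors propagate at once; retryable ones after the last attempt.
            if not is_retryable(exc) or attempt == max_attempts:
                raise
            # Full jitter: wait a random time in [0, min(cap, base * 2**(attempt - 1))].
            await sleep(random.uniform(0, min(cap, base * 2 ** (attempt - 1))))
    raise AssertionError("unreachable")
```

With litellm installed, is_retryable=lambda e: isinstance(e, litellm.RateLimitError) limits retries to 429 responses. When the final attempt fails, the exception reaches the caller, which is where the friendly rate-limit message belongs.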

503 Service Unavailable

Either the gateway is down (rare) or the upstream provider rejected the request. Retry once with backoff, then fall back to a different model (fallbacks=["claude-3-5-sonnet-20241022", "gemini/gemini-1.5-pro"] is a litellm parameter).
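If you prefer to control the fallback order in your own code rather than through the fallbacks parameter, the idea reduces to trying models in sequence. The following is a hedged sketch: first_available and is_unavailable are hypothetical names, and the callable you pass in is assumed to wrap your own completion helper.

```python
from typing import Awaitable, Callable, Sequence


async def first_available(
    call: Callable[[str], Awaitable[str]],
    models: Sequence[str],
    *,
    is_unavailable: Callable[[Exception], bool],
) -> str:
    """Try each model in order; move on only when the failure looks like 'unavailable'."""
    if not models:
        raise ValueError("models must not be empty")
    last_error: Exception | None = None
    for model in models:
        try:
            return await call(model)
        except Exception as exc:
            if not is_unavailable(exc):
                raise  # e.g. 401 or a bad request: another model will not help
            last_error = exc
    # Every model was unavailable; surface the most recent failure.
    assert last_error is not None
    raise last_error
```

With litellm installed, is_unavailable=lambda e: isinstance(e, litellm.ServiceUnavailableError) restricts the fallback to 503 responses.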

401 Unauthorized

The usual cause is that LLM_GATEWAY_API_KEY was rotated and your pod hasn’t restarted. If you’ve configured Stakater Reloader on the secret (default for em-service), a rollout should already be in progress. If not, run kubectl rollout restart on the deployment manually.

Model unavailable

The model you requested isn’t routed by the gateway. Check the gateway’s model list with curl $LLM_GATEWAY_URL/models -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" | jq '.data[].id'.
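The same check can run from Python, for example in a startup probe. This sketch assumes the gateway returns the OpenAI-style list shape that the jq filter above implies ({"data": [{"id": ...}]}); model_ids and fetch_model_ids are hypothetical helper names.

```python
import json
import urllib.request


def model_ids(payload: dict) -> set[str]:
    """Extract ids from an OpenAI-style model list: {"data": [{"id": ...}, ...]}."""
    return {item["id"] for item in payload.get("data", []) if "id" in item}


def fetch_model_ids(gateway_url: str, api_key: str, timeout: float = 10.0) -> set[str]:
    """GET <gateway_url>/models with a bearer token and return the served model ids."""
    request = urllib.request.Request(
        f"{gateway_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return model_ids(json.load(response))
```

Checking DEFAULT_MODEL in fetch_model_ids(url, key) at startup turns a routing mistake into an early, explicit error.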

Cost spike — no attribution

A spike with no project attribution means a code path is calling without metadata. Grep for litellm.acompletion( and confirm every call passes metadata={"project_id": ..., "solution": ...}.

See Troubleshooting › LLM for more.

Verification

# Confirm the gateway is reachable from the pod
kubectl -n em-<solution> exec deployment/<solution>-api -- \
  sh -c 'curl -s -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" $LLM_GATEWAY_URL/models | head -c 200; echo'

# Confirm a call attributes correctly in Langfuse
curl -s -H "Authorization: Bearer $TOKEN" -H "X-Project-ID: $PROJECT_ID" \
  http://localhost:8000/llm/test  # your route that issues a complete() call

# Then look in Langfuse for the trace tagged project_id=$PROJECT_ID

For gateway internals from an operator perspective (model allowlist, rate limits, provider routing), see LLM Gateway.

Next steps

Langfuse setup

Stand up Langfuse and wire it to your service.

LLM observability

The platform-side observability stack.

LLM observability deep-dive

Trace inspection, cost dashboards, model comparison.

Data Insights › Text-to-SQL

Reference implementation that uses this pattern in production.
