Access LLMs
Solutions never call OpenAI, Anthropic, or Gemini directly. Instead, point an OpenAI-compatible SDK at the platform’s LiteLLM gateway. You get provider-agnostic access, central API-key management, per-project cost attribution, rate-limit enforcement, and Langfuse traces — for free. The canonical reference implementation is in Data Insights › Text-to-SQL; this page extracts the patterns every solution needs.Why a gateway
| Without a gateway | With LiteLLM gateway |
|---|---|
| Each solution holds its own API keys | One key per provider, held centrally |
| Switching from OpenAI → Anthropic = code change | Switching is an env var change |
| Per-solution cost attribution requires custom telemetry | Free via gateway metadata + Langfuse |
| Rate limits enforced per-key (poorly) | Enforced centrally, per project |
| Adding observability requires per-call wrapping | Auto-instrumented via Langfuse hook |
provider/model to the chosen upstream and auto-traces every call into Langfuse. The metadata={"project_id": ..., "solution": ...} you pass on every call is what makes per-project cost attribution possible.
Configure your service
Three env vars (plus Langfuse credentials when you want tracing):charts/<solution>/values.yaml
Code patterns
Minimal call (litellm)
litellm speaks the OpenAI completions wire format and works against any compatible endpoint. Install with uv add litellm.
packages/api/src/api/llm.py
Use the
openai/ prefix when pointing litellm at a custom proxy (the CRAFT gateway). Without it, litellm infers the provider from the model name and routes directly to Anthropic/Vertex instead of through the proxy. openai/gemini-3.5-flash tells litellm “use the OpenAI wire format, route to whatever api_base says” — the gateway then forwards to the correct upstream.Set max_tokens ≥ 200 for Gemini 3.5-flash at the global Vertex endpoint. The model uses adaptive thinking by default, consuming thinking tokens before text tokens. With max_tokens=30, all tokens go to thinking and the response is empty. The token budget needs room for both.Per-call model override
The default model is configurable at deploy time, but any call site can override it:openai/ prefix for models served over the OpenAI-compatible wire format (the common case for all gateway-hosted models); provider-native prefixes such as ollama/ apply only for local sidecars not routed through the gateway. Ask the platform operator for the current model list.
Streaming (SSE)
StreamingResponse to expose to your UI as Server-Sent Events.
Observability — Langfuse
WhenLANGFUSE_HOST, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY are set, litellm auto-emits traces. Each call shows up as a trace with the metadata you passed (project, solution, latency, token counts, model, prompt+completion). See Guides › Langfuse Setup and Deployment › Observability › Langfuse.
Add litellm.success_callback = ["langfuse"] once at startup (the gateway’s reference implementation does this in commons/llm/__init__.py of em-talk2data).
For LLM-specific observability patterns, see Deployment › Observability › LLM Observability.
Cost attribution
Themetadata={"project_id": ..., "solution": ...} you pass on each call is the only thing that gives the platform per-project cost attribution. Always pass it. Without it, the call rolls up to a generic bucket and the cost dashboard cannot tell you which project caused the spike.
If you wrap litellm.acompletion in a service helper (recommended), make project_id a required parameter so it’s impossible to call without it.
Failure modes
429 Too Many Requests
429 Too Many Requests
The gateway enforces per-project rate limits. Back off exponentially (
tenacity with wait_random_exponential) and surface a friendly message to the user. Do not retry forever — let the user see the rate limit.503 Service Unavailable
503 Service Unavailable
401 Unauthorized
401 Unauthorized
Model unavailable
Model unavailable
Cost spike — no attribution
Cost spike — no attribution
A spike with no project attribution means a code path is calling without
metadata. Grep for litellm.acompletion( and confirm every call passes metadata={"project_id": ..., "solution": ...}.Verification
Next steps
Langfuse setup
Stand up Langfuse and wire it to your service.
LLM observability
The platform-side observability stack.
LLM observability deep-dive
Trace inspection, cost dashboards, model comparison.
Data Insights › Text-to-SQL
Reference implementation that uses this pattern in production.

