Troubleshooting

Symptom-first reference. Find the closest match below and follow the diagnostic steps. If your symptom isn’t here, file an issue with label solution-dev-guide and the symptom will land in the next revision.

401 / 403 from your service

401 Unauthorized — Missing or invalid bearer token

Diagnose:

curl -v -H "Authorization: Bearer $TOKEN" -H "X-Project-ID: $PROJECT_ID" \
  "http://localhost:8000/echo?msg=hi" 2>&1 | head -30

Common causes:

Token missing → request didn’t include Authorization header
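Token expired → check exp claim with jq -R 'split(".") | .[1] | gsub("-"; "+") | gsub("_"; "/") | @base64d | fromjson' <<< "$TOKEN" (the gsub calls map the JWT’s base64url alphabet to the standard one that @base64d expects)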
Wrong audience → your service’s KEYCLOAK_AUDIENCE env doesn’t match the aud claim
Wrong issuer → KEYCLOAK_ISSUER_URL doesn’t match the realm in the iss claim
JWKS cache stale after Keycloak key rotation → restart your pod (or implement a TTL cache)
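
The expiry, audience, and issuer causes above can all be checked from the token payload in one pass. The following is a minimal sketch using only the standard library; the function names decode_claims and diagnose are illustrative and not part of the platform, and the payload is decoded without verifying the signature, so use it for diagnosis only.

```python
import base64
import json
import time
from typing import List, Optional


def decode_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    payload = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def diagnose(claims: dict, audience: str, issuer: str,
             now: Optional[float] = None) -> List[str]:
    """List the mismatches likely to make a validator reject the token."""
    now = time.time() if now is None else now
    problems = []
    # A missing exp claim is treated as expired in this sketch.
    if claims.get("exp", 0) <= now:
        problems.append("expired")
    aud = claims.get("aud", [])
    # The aud claim may be a single string or a list of strings.
    if audience not in ([aud] if isinstance(aud, str) else aud):
        problems.append("wrong audience")
    if claims.get("iss") != issuer:
        problems.append("wrong issuer")
    return problems
```

Pass the values of KEYCLOAK_AUDIENCE and KEYCLOAK_ISSUER_URL from your service’s environment as audience and issuer; an empty list means none of these three causes applies.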

403 Forbidden — Permission denied

Diagnose:

# Confirm Governance sees the user with the expected roles
curl -s -H "Authorization: Bearer $TOKEN" -H "X-Project-ID: $PROJECT_ID" \
  http://em-runtime-governance.em-runtime:8000/governance/whoami | jq .

Common causes:

User lacks the OpenFGA tuple for the action — see RBAC Configuration
X-Project-ID points to a project the user has no access to
You’re calling the permission check with the wrong resource_uri shape

400 Bad Request — Missing X-Project-ID

Every authenticated route on the platform expects X-Project-ID. Add it to your test client / SDK calls. In FastAPI, use the project_id dependency from Authenticate Users.

Pod crashes or “secret not found”

CrashLoopBackOff with 'secret <name> not found'

Diagnose:

kubectl -n em-<solution> describe pod <pod-name> | grep -A2 "Failed"
kubectl -n em-<solution> get secret <solution>-secrets   # exists?
kubectl -n em-<solution> get externalsecret               # if ESO; check sync status

Common causes:

ESO/Infisical hasn’t synced yet → wait one sync interval, or trigger manually
Upstream secret doesn’t exist → create it in GCP Secret Manager / Infisical / Vault
SecretStore points to wrong project/path → check ClusterSecretStore config
K8s ServiceAccount missing Workload Identity annotation (GCP) → see Secrets deployment

Pod runs but reads empty value from env var

Diagnose:

kubectl -n em-<solution> exec <pod> -- printenv DATABASE_URL
kubectl -n em-<solution> get secret <solution>-secrets -o jsonpath='{.data.database-url}' | base64 -d

Common causes:

Key name mismatch between secretKeyRef.key and the K8s Secret data key
The env: map in values.yaml overrides envVars with the same name (the em-service env model deduplicates with env winning) → rename one of them
Secret value is literally empty in upstream (you set it to "" somewhere)
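
To tell a key name mismatch apart from an empty upstream value, it can help to classify every key your pod expects in one go. This is a small sketch, assuming you feed it the output of kubectl get secret ... -o json; the function name audit_secret is hypothetical.

```python
import base64
import json


def audit_secret(secret_json: str, expected_keys: list) -> dict:
    """Classify each expected key as 'ok', 'missing', or 'empty' in a K8s Secret."""
    # A Secret with no entries may omit "data" or set it to null.
    data = json.loads(secret_json).get("data") or {}
    report = {}
    for key in expected_keys:
        if key not in data:
            report[key] = "missing"   # likely a secretKeyRef.key mismatch
        elif base64.b64decode(data[key]) == b"":
            report[key] = "empty"     # the value is empty upstream
        else:
            report[key] = "ok"
    return report
```

A "missing" result points at the key name mismatch cause, while "empty" points at the upstream value.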

Secret rotated but pod still has old value

Stakater Reloader needs an annotation to know which Secret to watch. For env-injected secrets via envFrom, em-service adds it by default. For custom volume-mounted secrets, you must add secret.reloader.stakater.com/reload: "<secret-name>" to podAnnotations — see Manage Secrets › Rotation.

Image pull errors

ImagePullBackOff or ErrImagePull

Diagnose:

kubectl -n em-<solution> describe pod <pod-name> | grep -A3 "Failed to pull"

Common causes:

image.repository or image.tag is wrong
Registry is private and imagePullSecrets not configured
Pull-secret expired (registry tokens often have short lifetimes)
Cluster cannot reach the registry (network policy / air-gapped environment)
Wrong CPU architecture (e.g., arm64 image on amd64 cluster) — check with docker manifest inspect
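
For the architecture case, the JSON printed by docker manifest inspect lists one entry per platform for multi-arch images. The sketch below extracts those platforms; the function names are illustrative, and it assumes the manifest-list shape (a top-level manifests array with a platform object per entry). Single-platform images print a plain image manifest without that array, in which case the list comes back empty.

```python
import json


def image_platforms(manifest_json: str) -> list:
    """Return 'os/architecture' strings from a manifest list, or [] if there is none."""
    doc = json.loads(manifest_json)
    platforms = []
    for entry in doc.get("manifests", []):
        platform = entry.get("platform", {})
        platforms.append(
            f"{platform.get('os', '?')}/{platform.get('architecture', '?')}"
        )
    return platforms


def supports(manifest_json: str, arch: str, os_name: str = "linux") -> bool:
    """True if the manifest list contains an entry for the given OS and architecture."""
    return f"{os_name}/{arch}" in image_platforms(manifest_json)
```

Compare the result with the node architecture reported by kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'.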

403 from registry but credentials look correct

Some registries (GHCR, ACR) enforce IP allowlists. Check the registry’s audit log for the cluster’s egress IP and add it to the allowlist. For GHCR specifically, the personal access token used must have read:packages scope.

Helm upgrade fails

Immutable selector field after first install

Kubernetes Deployments have an immutable .spec.selector. Once your release exists, you cannot change app.kubernetes.io/name, alias names, or any label that feeds the selector. Workaround: run helm uninstall followed by helm install (pods are deleted and re-created, so pod identity is lost for ~1 cycle), or re-create the namespace.

Alias mismatch — values block ignored

If you rename an em-service alias from api to webapp, every reference in values.yaml (and any --set) must update too. The platform won’t warn — it’ll just deploy with default values. Check rendered output:

helm template <release> ./charts/<solution> -f values.yaml | grep -A5 "kind: Deployment"

Chart version skew

Helm dependency cache is per-chart. After bumping the em-service version in Chart.yaml, run helm dependency update ./charts/<solution> before helm upgrade. Stale Chart.lock causes the old version to be deployed silently.

Resource conflicts on first install

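If a previous release left orphan resources (PVCs, PDBs), helm install fails with “exists and cannot be imported into the current release.” Delete the orphans manually, or let the new release adopt them by setting the label app.kubernetes.io/managed-by=Helm and the annotations meta.helm.sh/release-name and meta.helm.sh/release-namespace on each one. helm install --replace does not help here: it only reuses the name of a deleted release that is still in the release history.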

CRD not found

no matches for kind "..." in version "..."

A CRD your chart references isn’t installed. Most commonly:

Gateway API (gateway.networking.k8s.io/v1) → install via kubectl apply -k 'github.com/kubernetes-sigs/gateway-api/config/crd?ref=v1.1.0'
ExternalSecret (external-secrets.io/v1beta1) → install ESO: run helm repo add external-secrets https://charts.external-secrets.io, then helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace
The platform’s own CRDs (if you depend on em-core) → install em-core first; see em-core Chart

em-service version requires a newer CRD than the cluster has

Recent em-service chart versions target Gateway API v1 (gateway.networking.k8s.io/v1), while some older clusters only have v1beta1. Either upgrade the cluster’s Gateway API CRDs (see above), or pin em-service to a chart version that still targets v1beta1. Check the em-service Chart reference for the Gateway API version each release uses.

LLM gateway errors

429 Too Many Requests

Per-project rate limit hit. Back off (tenacity with wait_random_exponential), then surface a friendly message. Don’t retry forever.
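
The back-off advice can be sketched without any dependency as follows. This hand-rolled loop approximates the strategy that tenacity’s wait_random_exponential combined with stop_after_attempt provides; it is not tenacity itself. RateLimited is a hypothetical stand-in for whatever exception your client raises on a 429, and the default limits are illustrative.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


async def call_with_backoff(call, max_attempts=5, base=1.0, cap=30.0,
                            sleep=asyncio.sleep):
    """Retry an async callable on RateLimited with capped, jittered exponential waits."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except RateLimited:
            if attempt == max_attempts - 1:
                # Out of attempts: re-raise so the caller can show a friendly message.
                raise
            # Full jitter: wait a random time in [0, min(cap, base * 2**attempt)].
            await sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The bounded max_attempts is what keeps the loop from retrying forever; the sleep parameter exists only so tests can substitute a fake.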

503 Service Unavailable / model unavailable

The upstream provider rejected the request, or the gateway lost its route to the model. Configure fallbacks in the call:

import litellm

# Call this from inside an async function.
# Fall back to a different provider first (claude-opus-4-8 via Vertex)
# rather than to a preview model. The customtools variant is specialised
# for tool-heavy agents and isn't appropriate as a general fallback.
response = await litellm.acompletion(
    model="gemini-3.5-flash",
    messages=[...],
    fallbacks=["claude-opus-4-8", "vertex_ai/gemini-3.1-pro-preview"],
)

401 Unauthorized

LLM_GATEWAY_API_KEY was rotated but the pod hasn’t restarted, so it still sends the old key. Check the Reloader annotation; if needed, run kubectl rollout restart on the workload manually.

Cost dashboard shows un-attributed spend

A code path called litellm.acompletion(...) without metadata={"project_id": ..., "solution": ...}. Grep for the call sites and confirm every one passes metadata.
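
Beyond grepping, one way to keep un-attributed calls from reappearing is to route every call through a thin guard that fails fast. The wrapper below is a sketch of that idea, not a platform feature; require_attribution and REQUIRED_METADATA are hypothetical names, and the two required keys are taken from the metadata shape shown above.

```python
import asyncio

REQUIRED_METADATA = ("project_id", "solution")


def require_attribution(completion):
    """Wrap an async completion callable so calls lacking attribution metadata raise."""
    async def wrapper(*args, metadata=None, **kwargs):
        # Treat absent and empty values the same way: both break cost attribution.
        missing = [k for k in REQUIRED_METADATA if not (metadata or {}).get(k)]
        if missing:
            raise ValueError(f"missing metadata keys: {', '.join(missing)}")
        return await completion(*args, metadata=metadata, **kwargs)
    return wrapper
```

In application code this would be applied once, for example acompletion = require_attribution(litellm.acompletion), with every call site importing the guarded version.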

Streaming hangs

The HTTP client’s read timeout is shorter than the model’s first-token latency. Bump it: httpx.AsyncClient(timeout=httpx.Timeout(connect=5.0, read=120.0, write=5.0, pool=5.0)).

Postgres errors

Connection refused / could not connect to server

The pod is connecting to Postgres through the wrong hostname. Use the K8s service DNS name:

postgresql+asyncpg://<user>:<pwd>@<solution>-postgresql.em-<solution>.svc.cluster.local:5432/<db>

Not localhost, not the pod IP.
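
Building the URL in code avoids typos in the service name. The helper below is a sketch that follows the naming pattern shown above; postgres_url is a hypothetical name. Percent-encoding the credentials is an addition not covered by this guide: an unescaped @, : or / in a password can shift where the host part of the URL begins, which then also surfaces as a connection failure.

```python
from urllib.parse import quote


def postgres_url(user: str, password: str, solution: str, database: str,
                 port: int = 5432) -> str:
    """Build the in-cluster SQLAlchemy/asyncpg URL for a solution's Postgres service."""
    # In-cluster service DNS: <service>.<namespace>.svc.cluster.local
    host = f"{solution}-postgresql.em-{solution}.svc.cluster.local"
    # safe='' makes quote() escape every reserved character, including '/'.
    return (
        f"postgresql+asyncpg://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )
```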

Init container ordering: app starts before DB is ready

Add a waitForPostgres init container, or use a startup probe that retries until the DB accepts connections.
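
The retry logic behind either option can be as small as the sketch below, which an init container or an application startup hook could run. The name wait_for_port and its defaults are illustrative. Note that an open TCP port only shows that something is listening; Postgres’s own pg_isready is a stricter readiness check.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The connection is closed again as soon as it is established.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An entrypoint would exit non-zero when this returns False, so Kubernetes restarts the container and tries again.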

Migrations conflict on parallel deploys

Run migrations as a Helm hook (helm.sh/hook: pre-install,pre-upgrade) so only one pod ever runs them at a time.

Local dev gotchas

Kind: services on the cluster cannot reach localhost

Use host.docker.internal (macOS/Windows) or your host LAN IP (Linux). Or run dependencies inside Kind via helm install.

MinIO: SignatureDoesNotMatch

Set OBSTORE_PATH_STYLE=true. MinIO expects path-style addressing unless it has been configured with MINIO_DOMAIN for virtual-host-style requests.

uvicorn --reload not picking up changes

Pass --reload-dir packages/api/src if your code lives outside the cwd. If your editor saves with “atomic writes” (write to a temp file, then rename), uvicorn’s file watcher may miss the change → switch the editor’s save mode.

docker-compose containers not healthy

docker compose ps shows unhealthy? Check docker compose logs <service>. The compose file in Local Development sets healthchecks; failures usually mean port conflicts or insufficient memory.

Still stuck?

Report via the 👎 thumbs at the bottom of the affected page — include:

The exact symptom (with command + output)
What you’ve already tried
Your solution’s name and the cluster context
