Troubleshooting

Symptom-first reference. Find the closest match below and follow the diagnostic steps. If your symptom isn’t here, file an issue with the label solution-dev-guide and the symptom will land in the next revision.

401 / 403 from your service

Diagnose a 401 (token rejected):
# Quote the URL so the shell does not treat "?" as a glob character (zsh fails otherwise)
curl -v -H "Authorization: Bearer $TOKEN" -H "X-Project-ID: $PROJECT_ID" \
  "http://localhost:8000/echo?msg=hi" 2>&1 | head -30
Common causes:
  • Token missing → request didn’t include Authorization header
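  • Token expired → check exp claim with jq -R 'split(".") | .[1] | gsub("-"; "+") | gsub("_"; "/") | @base64d | fromjson' <<< "$TOKEN" (the gsub calls map the JWT’s base64url alphabet to the standard base64 that @base64d expects)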
  • Wrong audience → your service’s KEYCLOAK_AUDIENCE env doesn’t match the aud claim
  • Wrong issuer → KEYCLOAK_ISSUER_URL doesn’t match the realm in the iss claim
  • JWKS cache stale after Keycloak key rotation → restart your pod (or implement a TTL cache)
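The claim checks in the list above can also be scripted. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, it does not verify the signature, and it assumes the configured issuer must equal the iss claim exactly, which may not match how your service compares them.

```python
import base64
import json
import time
from typing import Optional


def decode_jwt_claims(token: str) -> dict:
    """Return the (unverified) payload claims of a JWT."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated JWT segments")
    payload = parts[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def diagnose_claims(claims: dict, audience: str, issuer: str,
                    now: Optional[float] = None) -> list:
    """List mismatches between the token claims and the expected config."""
    now = time.time() if now is None else now
    problems = []
    exp = claims.get("exp")
    if exp is None:
        problems.append("no exp claim")
    elif exp <= now:
        problems.append("token expired")
    aud = claims.get("aud")
    # The aud claim may be a single string or a list of strings.
    audiences = aud if isinstance(aud, list) else [aud]
    if audience not in audiences:
        problems.append("audience mismatch")
    if claims.get("iss") != issuer:
        problems.append("issuer mismatch")
    return problems
```

A typical call would pass the values of KEYCLOAK_AUDIENCE and KEYCLOAK_ISSUER_URL from the service's environment as audience and issuer; an empty list means none of the three checks failed.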
Diagnose a 403 (permission denied):
# Confirm Governance sees the user with the expected roles
curl -s -H "Authorization: Bearer $TOKEN" -H "X-Project-ID: $PROJECT_ID" \
  http://em-runtime-governance.em-runtime:8001/governance/whoami | jq .
Common causes:
  • User lacks the OpenFGA tuple for the action — see RBAC Configuration
  • X-Project-ID points to a project the user has no access to
  • You’re calling the permission check with the wrong resource_uri shape
Missing X-Project-ID header: every authenticated route on the platform expects X-Project-ID. Add it to your test client and SDK calls. In FastAPI, use the project_id dependency from Authenticate Users.

Pod crashes or “secret not found”

Diagnose a missing secret:
kubectl -n em-<solution> describe pod <pod-name> | grep -A2 "Failed"
kubectl -n em-<solution> get secret <solution>-secrets   # exists?
kubectl -n em-<solution> get externalsecret               # if ESO; check sync status
Common causes:
  • ESO/Infisical hasn’t synced yet → wait one sync interval, or trigger manually
  • Upstream secret doesn’t exist → create it in GCP Secret Manager / Infisical / Vault
  • SecretStore points to wrong project/path → check ClusterSecretStore config
  • K8s ServiceAccount missing Workload Identity annotation (GCP) → see Secrets deployment
Diagnose an empty or wrong environment variable:
kubectl -n em-<solution> exec <pod> -- printenv DATABASE_URL
kubectl -n em-<solution> get secret <solution>-secrets -o jsonpath='{.data.database-url}' | base64 -d
Common causes:
  • Key name mismatch between secretKeyRef.key and the K8s Secret data key
  • The env: map in values.yaml overrides envVars with the same name (the em-service env model deduplicates with env winning) → rename one of them
  • Secret value is literally empty in upstream (you set it to "" somewhere)
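The key-mismatch and empty-value causes above can be told apart without printing the secret itself. This is a small sketch, assuming the Secret was dumped with kubectl get secret <solution>-secrets -o json; inspect_secret is a hypothetical helper name.

```python
import base64
import json


def inspect_secret(secret_json: str, expected_key: str) -> str:
    """Classify one key of a Secret dumped with `kubectl get secret ... -o json`."""
    data = json.loads(secret_json).get("data") or {}
    if expected_key not in data:
        # Listing the available keys makes a secretKeyRef.key typo obvious.
        return "missing key; available: " + ", ".join(sorted(data))
    value = base64.b64decode(data[expected_key])
    if not value:
        return "empty value"
    return "ok"
```

Pass the same key name that your secretKeyRef.key uses; a "missing key" result points at a name mismatch, while "empty value" points at the upstream secret.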
If pods keep using the old value after a secret is rotated: Stakater Reloader needs an annotation to know which Secret to watch. For secrets injected as environment variables via envFrom, em-service adds it by default. For custom volume-mounted secrets, you must add secret.reloader.stakater.com/reload: "<secret-name>" to podAnnotations (see Manage Secrets › Rotation).

Image pull errors

Diagnose:
kubectl -n em-<solution> describe pod <pod-name> | grep -A3 "Failed to pull"
Common causes:
  • image.repository or image.tag is wrong
  • Registry is private and imagePullSecrets not configured
  • Pull-secret expired (registry tokens often have short lifetimes)
  • Cluster cannot reach the registry (network policy / air-gapped environment)
  • Wrong CPU architecture (e.g., arm64 image on amd64 cluster) — check with docker manifest inspect
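For the architecture check in the last bullet, the JSON printed by docker manifest inspect can be parsed rather than read by eye. The sketch below assumes the image is published as a multi-arch manifest list, which carries a manifests array with a platform entry per variant; a single-arch image has no such array and yields an empty result. The function names are hypothetical.

```python
import json


def image_platforms(manifest_json: str) -> list:
    """Return the os/architecture pairs listed in a multi-arch manifest list."""
    doc = json.loads(manifest_json)
    platforms = []
    for entry in doc.get("manifests", []):
        platform = entry.get("platform", {})
        platforms.append(
            f"{platform.get('os', '?')}/{platform.get('architecture', '?')}")
    return platforms


def supports(manifest_json: str, node_platform: str) -> bool:
    """True if the manifest list contains a variant for node_platform."""
    return node_platform in image_platforms(manifest_json)
```

Feed it the output of docker manifest inspect <image> and compare against the node platform, for example linux/amd64.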
Some registries (GHCR, ACR) enforce IP allowlists. Check the registry’s audit log for the cluster’s egress IP and add it to the allowlist. For GHCR specifically, the personal access token used must have read:packages scope.

Helm upgrade fails

Kubernetes Deployments have an immutable .spec.selector, so an upgrade that changes it is rejected with a “field is immutable” error. Once your release exists, you cannot change app.kubernetes.io/name, alias names, or labels that affect the selector. Workaround: helm uninstall followed by helm install (pods are replaced, so expect roughly one rollout cycle of disruption), or re-create the namespace.
If you rename an em-service alias from api to webapp, every reference in values.yaml (and any --set flag) must be updated too. The platform won’t warn; values left under the old alias are ignored and the service deploys with defaults. Check the rendered output:
helm template <release> ./charts/<solution> -f values.yaml | grep -A5 "kind: Deployment"
Helm dependency cache is per-chart. After bumping the em-service version in Chart.yaml, run helm dependency update ./charts/<solution> before helm upgrade. Stale Chart.lock causes the old version to be deployed silently.
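If a previous release left orphan resources (PVCs, PDBs), helm install fails with “exists and cannot be imported into the current release.” Delete the orphans manually, or let the new release adopt them by setting the label app.kubernetes.io/managed-by=Helm and the annotations meta.helm.sh/release-name and meta.helm.sh/release-namespace on each one. Note that helm install --replace only reuses the name of a deleted release; it does not adopt existing resources.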

CRD not found

A CRD your chart references isn’t installed. Most commonly:
  • Gateway API (gateway.networking.k8s.io/v1) → install via kubectl apply -k 'github.com/kubernetes-sigs/gateway-api/config/crd?ref=v1.1.0'
  • ExternalSecret (external-secrets.io/v1beta1) → install ESO via helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace
  • The platform’s own CRDs (if you depend on em-core) → install em-core first; see em-core Chart
em-service v0.0.16+ uses Gateway API v1 (not v1beta1). Some older clusters still have v1beta1 only. Either upgrade the cluster CRDs or pin em-service to a v0.0.15 line that supports v1beta1.

LLM gateway errors

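Rate-limit errors: the per-project rate limit was hit. Back off with jittered exponential delays (for example, tenacity with wait_random_exponential), then surface a friendly message. Cap the number of attempts; don’t retry forever.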
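The backoff advice can be illustrated without any dependency. This is a standard-library sketch of capped exponential backoff with full jitter, not tenacity's API; RateLimited is a hypothetical stand-in for whatever exception your gateway client raises on a rate-limit response.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the gateway client's rate-limit error."""


def call_with_backoff(fn, max_attempts=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Call fn, retrying on RateLimited with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts:
                raise  # give up; the caller surfaces a friendly message
            # Full jitter: wait a random time up to the capped exponential bound.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The sleep parameter is injectable so the retry logic can be tested without real delays. The bounded max_attempts is what keeps the loop from retrying forever.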
Upstream provider rejected or gateway lost a route. Configure fallbacks in the call:
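import litellm

# Call from inside an async function; the fallback models are tried if the primary fails.
response = await litellm.acompletion(
    model="gpt-4o-mini",
    messages=[...],
    fallbacks=["claude-3-5-sonnet-20241022", "gemini/gemini-1.5-pro"],
)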
LLM_GATEWAY_API_KEY was rotated but the pod hasn’t restarted. Check the Reloader annotation; if needed, restart manually with kubectl rollout restart.
A code path called litellm.acompletion(...) without metadata={"project_id": ..., "solution": ...}. Grep for the call sites and confirm every one passes metadata.
The HTTP client’s read timeout is shorter than the model’s first-token latency. Bump it: httpx.AsyncClient(timeout=httpx.Timeout(connect=5.0, read=120.0, write=5.0, pool=5.0)).

Postgres errors

The pod is connecting to Postgres with the wrong host name. Use the K8s service DNS name:
postgresql+asyncpg://<user>:<pwd>@<solution>-postgresql.em-<solution>.svc.cluster.local:5432/<db>
Not localhost, not the pod IP.
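If you assemble this URL in code, a sketch like the following builds it from its parts, following the naming pattern shown above. The helper name is hypothetical. Percent-encoding the credentials is an addition beyond the original text: it keeps characters such as @, / or : in a password from being parsed as URL delimiters.

```python
from urllib.parse import quote


def build_database_url(user: str, password: str, solution: str, db: str,
                       port: int = 5432) -> str:
    """Build the in-cluster SQLAlchemy URL for the solution's Postgres service."""
    host = f"{solution}-postgresql.em-{solution}.svc.cluster.local"
    # safe="" forces every reserved character in the credentials to be encoded.
    return (f"postgresql+asyncpg://{quote(user, safe='')}:"
            f"{quote(password, safe='')}@{host}:{port}/{db}")
```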
If the app starts before Postgres is ready and crashes on its first connection: add a waitForPostgres init container, or use a startup probe that retries until the DB accepts connections.
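The wait-until-ready idea can be sketched as a small polling loop, for example as the command of an init container. This standard-library version only checks that the TCP port accepts connections, which is a weaker signal than Postgres being ready for queries; a tool such as pg_isready gives a stronger check. The function name and defaults are illustrative.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, DNS failure, or timeout: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Exit non-zero when it returns False so that Kubernetes restarts the init container instead of starting the app against a missing database.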
Run migrations as a Helm hook (helm.sh/hook: pre-install,pre-upgrade) so only one pod ever runs them at a time.

Local dev gotchas

To reach a service running on your host from inside a container, use host.docker.internal (macOS/Windows) or your host LAN IP (Linux). Alternatively, run the dependencies inside Kind via helm install.
Set OBSTORE_PATH_STYLE=true. MinIO requires path-style addressing.
If hot reload doesn’t pick up your changes: pass --reload-dir packages/api/src if your code lives outside the cwd. An editor that saves via “atomic write” may not trigger watchdog; if so, switch the editor’s save mode.
docker compose ps shows unhealthy? Check docker compose logs <service>. The compose file in Local Development sets healthchecks; failures usually mean port conflicts or insufficient memory.

Still stuck?

Report via the 👎 thumbs at the bottom of the affected page, and include:
  1. The exact symptom (with command + output)
  2. What you’ve already tried
  3. Your solution’s name and the cluster context