Troubleshooting

Common ScaiGrid issues and how to diagnose them. For production monitoring, see Health and Monitoring.

Authentication issues#

"Invalid or expired authentication token"#

json
{"error": {"code": "AUTH_TOKEN_INVALID"}}

Check:

Is the token an API key (starts with sgk_) or a valid JWT? Don't send random strings.
For JWTs: has it expired? Decode it (e.g., at jwt.io) and check the exp claim.
For JWTs: refresh via /v1/auth/refresh before the token expires.
Is the Authorization header formatted as Bearer <token>, not just <token>?
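
If you would rather not paste a token into a website, the expiry check can be done locally. The following is a minimal sketch that reads the exp claim from a JWT payload; it does not verify the signature, so treat it as a diagnostic only. The function names are illustrative and not part of any ScaiGrid SDK.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        # API keys and random strings end up here.
        raise ValueError("not a JWT: expected three dot-separated segments")
    payload = parts[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Return the seconds left before exp (negative if already expired)."""
    now = time.time() if now is None else now
    return jwt_claims(token)["exp"] - now
```

A negative result means the token has already expired; a small positive one means it is time to call /v1/auth/refresh.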

"User lacks required permission"#

json
{"error": {"code": "AUTHZ_PERMISSION_DENIED"}}

Check:

Hit GET /v1/me with the same token. What permissions are listed?
Cross-reference with the endpoint's required permission (documented on the endpoint's reference page).
If missing, ask your tenant admin to grant the role or direct permission.

"No account found for this email"#

json
{"error": {"code": "AUTH_IDENTITY_NOT_FOUND"}}

The email isn't provisioned in ScaiKey, or isn't in any tenant the caller's ScaiKey account can see. Coordinate with your identity admin.

Model / inference issues#

"Model does not exist"#

json
{"error": {"code": "MODEL_NOT_FOUND"}}

Check:

List available models: GET /v1/models. Is your slug in there?
Are you using the right slug format? (scailabs/poolnoodle-omni, not poolnoodle-omni)
If the model is tenant-scoped, are you authenticated as the right tenant?
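
The first two checks can be automated once you have the list of slugs. This sketch assumes you have already extracted the slugs from the GET /v1/models response into a plain list (the response shape is not shown on this page, so that step is left to you); diagnose_slug is a hypothetical helper name.

```python
from typing import List


def diagnose_slug(slug: str, available: List[str]) -> str:
    """Explain why a slug may be failing, given the slugs the tenant can see."""
    if slug in available:
        return "ok: slug is listed for this tenant"
    if "/" not in slug:
        # Bare model name: look for a namespaced slug that ends with it.
        candidates = [s for s in available if s.split("/", 1)[-1] == slug]
        if candidates:
            return "missing namespace: try " + ", ".join(candidates)
        return "missing namespace and no listed model has this name"
    return "not listed: check the spelling and which tenant the token belongs to"
```

For example, passing poolnoodle-omni against a list containing scailabs/poolnoodle-omni reports the missing namespace and suggests the full slug.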

"Model not enabled for tenant"#

json
{"error": {"code": "MODEL_ACCESS_DENIED"}}

The model exists but is blocked for your tenant via /v1/model-access. Your tenant admin can review and change this.

"All backends for model are unhealthy"#

json
{"error": {"code": "MODEL_UNAVAILABLE"}}

Every backend in the model's routing policy is circuit-broken or failing health checks. Check:

GET /v1/backends/{backend_id}/health for each backend the model routes to.
Upstream provider status pages.
Admin UI "Frontend Models" page — shows per-model health.

Usually recovers on its own when the upstream provider recovers. Operators can probe a backend's health to close its circuit breaker faster:

bash
# BACKEND_ID is the ID of a backend the model routes to
curl "https://scaigrid.scailabs.ai/v1/backends/$BACKEND_ID/health" \
  -H "Authorization: Bearer $ADMIN_TOKEN"
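
When a model routes to several backends, probing them one by one gets tedious. The sketch below repeats the same request for a list of backend IDs using only the Python standard library. The host and path come from the curl example; the backend IDs you pass in, and the idea that the HTTP status alone is a useful summary, are assumptions.

```python
import urllib.error
import urllib.request
from typing import Dict, List

BASE_URL = "https://scaigrid.scailabs.ai"  # host taken from the curl example


def health_request(backend_id: str, admin_token: str) -> urllib.request.Request:
    """Build the GET request for one backend's health endpoint."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/backends/{backend_id}/health",
        headers={"Authorization": f"Bearer {admin_token}"},
    )


def probe_all(backend_ids: List[str], admin_token: str) -> Dict[str, int]:
    """Probe each backend and return its HTTP status code.

    Connection-level failures (urllib.error.URLError) are not caught here
    and propagate to the caller.
    """
    results = {}
    for backend_id in backend_ids:
        request = health_request(backend_id, admin_token)
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                results[backend_id] = response.status
        except urllib.error.HTTPError as exc:
            # Non-2xx answers still tell you something; record the code.
            results[backend_id] = exc.code
    return results
```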

Streaming hangs or times out mid-response#

Symptom: SSE stream starts, chunks arrive, then connection hangs for minutes before the client sees a timeout.

Check:

Nginx proxy_read_timeout: must be greater than ScaiGrid's DISPATCH_STREAM_TIMEOUT_S. The defaults are 660s and 600s, respectively. If your nginx is set to 60s (the stock nginx default) or 300s, raise it above DISPATCH_STREAM_TIMEOUT_S.
Your client's read timeout: must also exceed the stream duration.
Check whether [DONE] arrives at all. If it does and the connection stays open, the problem is connection teardown: verify that Connection: close is set on the SSE response (ScaiGrid sets it by default).
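
The first two checks boil down to an ordering between three timeouts. The sketch below makes that ordering explicit so it can sit in a deployment check. It treats DISPATCH_STREAM_TIMEOUT_S as the upper bound on stream duration, which is an assumption; the function name is illustrative.

```python
from typing import List


def check_timeout_chain(
    dispatch_stream_timeout_s: float,
    nginx_proxy_read_timeout_s: float,
    client_read_timeout_s: float,
) -> List[str]:
    """Return a list of problems; an empty list means the chain is consistent."""
    problems = []
    if nginx_proxy_read_timeout_s <= dispatch_stream_timeout_s:
        problems.append(
            "nginx proxy_read_timeout must be greater than DISPATCH_STREAM_TIMEOUT_S"
        )
    if client_read_timeout_s <= dispatch_stream_timeout_s:
        # Assumes the longest possible stream runs for the full dispatch timeout.
        problems.append("client read timeout must exceed the longest stream")
    return problems
```

With the documented values (600s dispatch, 660s nginx) and a client timeout above 600s the list comes back empty; an nginx value of 60s or 300s is flagged.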

Empty content in a successful response#

Response has status: "ok" but choices[0].message.content is empty.

Check:

finish_reason — if "length", the model ran out of tokens. Bump max_tokens.
usage.completion_tokens — if > 0 and content is still empty, the model emitted reasoning/thinking tokens that aren't surfaced as content. This is a model capability quirk, not a ScaiGrid bug.
Is the model a "reasoning" model (o1, o3, DeepSeek R1)? Such models sometimes consume the whole token budget on internal reasoning before producing visible output. Raise max_tokens or switch models.
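
The checks above can be applied to a parsed response body in a fixed order. This is a sketch, assuming finish_reason sits on choices[0] next to message (the usual chat-completion layout, not confirmed on this page); the returned strings are illustrative.

```python
def diagnose_empty_content(response: dict) -> str:
    """Classify why choices[0].message.content is empty."""
    choice = response["choices"][0]
    content = choice["message"].get("content") or ""
    if content:
        return "content present"
    if choice.get("finish_reason") == "length":
        # The model hit the token limit before producing visible output.
        return "token budget exhausted: raise max_tokens"
    if response.get("usage", {}).get("completion_tokens", 0) > 0:
        # Tokens were generated but none were surfaced as content.
        return "reasoning tokens not surfaced as content"
    # Not covered by the checks above; gather more evidence.
    return "no tokens generated: capture the request ID"
```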

"Unexpected stream error" with no detail#

Generic BACKEND_ERROR mid-stream.

Check:

Pull the X-Scaigrid-Request-Id from the response and contact support with that ID; we can see the full upstream response.
Check whether the failure is consistent (same model, same prompt) or flaky. A consistent failure points to a ScaiGrid shape-mismatch bug; a flaky one points to an upstream issue.

Rate limiting#

429 RATE_LIMITED on every request#

Check:

X-Scaigrid-Ratelimit-Limit and X-Scaigrid-Ratelimit-Reset headers on the 429. Which limit level is being hit?
Are you using a single API key for a high-throughput service? Create per-service keys to spread load.
If even one request per minute fails, check whether your tenant's rate limit is configured too low. A tenant admin can raise it.
See Rate Limiting.

429 BUDGET_EXCEEDED#

Different problem — you've hit a cost budget, not a request rate limit.

Check:

GET /v1/accounting/budgets to see active budgets and current usage.
Raise the budget (if authorized) or wait for the period to roll over.
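
Both conditions arrive as HTTP 429, so client code has to branch on the error code rather than the status. A minimal sketch of that branch follows. The header values are passed through as raw strings because their format is not documented on this page, and the retry flag is a suggested policy, not ScaiGrid behavior.

```python
from typing import Mapping


def handle_429(body: dict, headers: Mapping[str, str]) -> dict:
    """Tell a request rate limit apart from an exhausted cost budget."""
    code = body.get("error", {}).get("code")
    # HTTP header names are case-insensitive; normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if code == "RATE_LIMITED":
        return {
            "kind": "rate_limit",
            "retry": True,  # assumption: retrying after the reset is reasonable
            "limit": lowered.get("x-scaigrid-ratelimit-limit"),
            "reset": lowered.get("x-scaigrid-ratelimit-reset"),
        }
    if code == "BUDGET_EXCEEDED":
        # Retrying will not help until the budget is raised or the period rolls over.
        return {"kind": "budget", "retry": False}
    return {"kind": "unknown", "retry": False}
```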

Module issues#

"Module not enabled for tenant"#

json
{"error": {"code": "MODULE_NOT_ENABLED"}}

Check:

GET /v1/modules — what's the status of the module you're calling?
If available: ask a tenant admin to POST /v1/modules/{id}/enable.
If error: the module failed to initialize. GET /v1/modules/{id} shows the last error.

Module stuck in `error` state#

Module tried to initialize and failed. The error message on /v1/modules/{id} is the place to start.

Common causes:

Missing dependency. Module depends on another module that isn't enabled.
Database migration failure. New module version, old schema. Check migration logs.
Config issue. Module config references something that doesn't exist (a non-existent model, an unreachable webhook URL).

Fix the root cause and restart the ScaiGrid process — modules re-initialize on boot.

Performance issues#

Slow p99 latency, healthy p50#

Usually a tail-latency issue in one specific backend or upstream provider.

Check:

Per-backend latency in metrics: scaigrid_request_duration_seconds_bucket{backend_id=...}
Is one backend significantly slower? Reduce its weight in the routing policy or mark it for retirement.
Check upstream provider status pages — sometimes one provider's region has a bad hour.

Accounting flush lag#

scaigrid_accounting_flush_lag_seconds > 60 and climbing.

Check:

The Redis-to-MariaDB flush worker is falling behind. Check the worker logs.
MariaDB write capacity: the flush is a batch INSERT, so slow writes make the lag grow.
Usage records table size: if the table is very large, partitioning it or archiving stale partitions helps.

High Redis memory#

Check:

Rate-limit counters — should be small (~KB per tenant) unless you have millions of active keys.
Session cache — bound by active session count.
Event bus stream retention — each stream has a max length; bigger retention = more memory.
Module-specific Redis usage — some modules (ScaiQueue, ScaiBunker) use Redis heavily.

Webhook issues#

My webhook isn't receiving events#

Check:

/v1/webhooks/{webhook_id} — status: "active"? Might have been auto-disabled.
/v1/webhooks/{webhook_id}/deliveries — any failed deliveries? Check status code and error message.
Is your endpoint responding 2xx within 10 seconds?
Did you subscribe to the right event type? Event names are case-sensitive.

Webhook receives duplicates#

Expected. Webhooks are at-least-once. Check Webhooks Deep Dive for idempotency patterns.
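
One common way to tolerate at-least-once delivery is to remember which event IDs have already been processed and skip repeats. The sketch below shows that idea with a bounded in-memory record. It assumes each event carries a unique "id" field (a hypothetical name; check the actual payload), and a real deployment with several workers would need a shared store instead of process memory.

```python
from collections import OrderedDict
from typing import Callable


class SeenEvents:
    """Bounded in-memory record of event IDs that were already processed."""

    def __init__(self, max_ids: int = 10_000) -> None:
        self._ids: "OrderedDict[str, None]" = OrderedDict()
        self._max_ids = max_ids

    def __contains__(self, event_id: str) -> bool:
        return event_id in self._ids

    def add(self, event_id: str) -> None:
        self._ids[event_id] = None
        if len(self._ids) > self._max_ids:
            self._ids.popitem(last=False)  # evict the oldest ID


def handle_delivery(
    event: dict, seen: SeenEvents, process: Callable[[dict], None]
) -> bool:
    """Process an event once; return False if it was a duplicate."""
    event_id = event["id"]  # assumed field name
    if event_id in seen:
        return False
    process(event)
    # Record the ID only after success so a failed attempt can be redelivered.
    seen.add(event_id)
    return True
```

Either way, the endpoint should still answer 2xx for a duplicate so the delivery is not retried again.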

Database / infrastructure#

"Service unavailable" (503 SERVICE_UNAVAILABLE)#

Something ScaiGrid depends on is down.

Check:

/health/ready: which dependency is reported as failing?
If Redis: check connectivity.
If MariaDB: check Galera cluster health; a majority of nodes must be up.
If S3: can ScaiGrid reach your storage endpoint?

Migration failures on deployment#

scaigrid-migrate container exits non-zero.

Check:

Which migration failed? Logs show the Alembic revision number.
Is the database compatible with the version you're migrating to?
Are there DDL locks or long-running transactions blocking the migration?
For module migrations: was a required module disabled and removed between versions? Re-enable and let it clean up its tables first.

Getting help#

Every ScaiGrid response has an X-Scaigrid-Request-Id header. When contacting support:

Capture the request ID from the failing request.
Note the exact time (UTC is easiest).
Include both — we can pull full logs from those two data points in seconds.

Without a request ID, debugging relies on luck and pattern matching. With one, root cause is usually minutes away.
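
Capturing both data points is easy to build into client-side error handling. The following sketch pulls the request ID from the response headers and pairs it with a UTC timestamp. The function name and the output keys are illustrative, and the request ID in the usage note is made up.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional


def support_report(
    headers: Mapping[str, str], when: Optional[datetime] = None
) -> dict:
    """Collect the two data points support needs from a failing response.

    `when` must be timezone-aware; it defaults to the current time in UTC.
    """
    # HTTP header names are case-insensitive; normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    when = when or datetime.now(timezone.utc)
    return {
        "request_id": lowered.get("x-scaigrid-request-id"),
        "time_utc": when.astimezone(timezone.utc).isoformat(timespec="seconds"),
    }
```

Logging this dict on every failed call (for example with a request ID such as req_123) means the information is already on hand when you open a ticket.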
