---
title: Troubleshooting
path: operations/troubleshooting
status: published
---

# Troubleshooting

Common ScaiGrid issues and how to diagnose them. For production monitoring, see [Health and Monitoring](./02-health-and-monitoring.md).

## Authentication issues

### "Invalid or expired authentication token"

```json
{"error": {"code": "AUTH_TOKEN_INVALID"}}
```

**Check:**
- Is the token an API key (prefixed `sgk_`) or a valid JWT? Anything else is rejected.
- For JWTs: has it expired? Decode (e.g., at jwt.io) to see `exp`.
- For JWTs: refresh via `/v1/auth/refresh` before the token expires.
- Is the `Authorization` header formatted as `Bearer <token>`? Not just `<token>`.
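
If you would rather not paste a token into a website, the `exp` claim can be read locally. The sketch below uses only the Python standard library and decodes the payload segment without verifying the signature, so it is a diagnostic aid, not a validation step. The function names are illustrative.

```python
import base64
import json
from datetime import datetime, timezone
from typing import Optional


def jwt_expiry(token: str) -> datetime:
    """Return the `exp` claim of a JWT as a UTC datetime (signature NOT verified)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a JWT: expected three dot-separated segments")
    payload_b64 = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    if "exp" not in claims:
        raise ValueError("token has no exp claim")
    return datetime.fromtimestamp(claims["exp"], tz=timezone.utc)


def is_expired(token: str, now: Optional[datetime] = None) -> bool:
    """True if the token's `exp` is at or before `now` (defaults to the current UTC time)."""
    now = now or datetime.now(timezone.utc)
    return jwt_expiry(token) <= now
```

An API key (`sgk_...`) has no dot-separated segments, so `jwt_expiry` raises `ValueError` for it; that is expected.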

### "User lacks required permission"

```json
{"error": {"code": "AUTHZ_PERMISSION_DENIED"}}
```

**Check:**
- Hit `GET /v1/me` with the same token. What permissions are listed?
- Cross-reference with the endpoint's required permission (documented on the endpoint's reference page).
- If missing, ask your tenant admin to grant the role or direct permission.
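
The cross-reference step can be scripted once you have the `/v1/me` response in hand. This is a minimal sketch that assumes the response is a JSON object with a `permissions` array of strings; that field name and the permission strings in the usage note are assumptions, so adjust them to what your deployment actually returns.

```python
from typing import Iterable, List


def missing_permissions(me: dict, required: Iterable[str]) -> List[str]:
    """Return the required permissions that are absent from a /v1/me response.

    ASSUMPTION: `me["permissions"]` is a list of permission strings.
    """
    granted = set(me.get("permissions", []))
    return [perm for perm in required if perm not in granted]
```

For example, `missing_permissions(me, ["models:read", "models:write"])` (hypothetical permission names) returns the subset to request from your tenant admin.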

### "No account found for this email"

```json
{"error": {"code": "AUTH_IDENTITY_NOT_FOUND"}}
```

The email isn't provisioned in ScaiKey, or isn't in any tenant the caller's ScaiKey account can see. Coordinate with your identity admin.

## Model / inference issues

### "Model does not exist"

```json
{"error": {"code": "MODEL_NOT_FOUND"}}
```

**Check:**
- List available models: `GET /v1/models`. Is your slug in there?
- Are you using the right slug format? (`scailabs/poolnoodle-omni`, not `poolnoodle-omni`)
- If the model is tenant-scoped, are you authenticated as the right tenant?
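
A common cause is a slug sent without its namespace. The sketch below takes the list of slugs you extracted from `GET /v1/models` (how you extract them depends on the response shape, which is not assumed here) and suggests namespaced candidates for a bare name. The function name is illustrative.

```python
from typing import List


def slug_hints(requested: str, available: List[str]) -> List[str]:
    """Return available slugs that match `requested` exactly or by bare model name."""
    if requested in available:
        return [requested]
    # Compare only the part after the last "/" so "poolnoodle-omni"
    # matches "scailabs/poolnoodle-omni".
    bare = requested.rsplit("/", 1)[-1]
    return [slug for slug in available if slug.rsplit("/", 1)[-1] == bare]
```

An empty result means the model is not visible to the current token at all, which points at the tenant-scoping check rather than the slug format.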

### "Model not enabled for tenant"

```json
{"error": {"code": "MODEL_ACCESS_DENIED"}}
```

The model exists but is blocked for your tenant via `/v1/model-access`. Your tenant admin can review and change this.

### "All backends for model are unhealthy"

```json
{"error": {"code": "MODEL_UNAVAILABLE"}}
```

Every backend in the model's routing policy is circuit-broken or failing health checks. Check:

- `GET /v1/backends/{backend_id}/health` for each backend the model routes to.
- Upstream provider status pages.
- Admin UI "Frontend Models" page — shows per-model health.

Usually recovers on its own when the upstream provider recovers. Operators can probe a backend's health to close its circuit breaker faster:

```bash
# Set BACKEND_ID to the ID of a backend the model routes to.
curl "https://scaigrid.scailabs.ai/v1/backends/$BACKEND_ID/health" \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

### Streaming hangs or times out mid-response

Symptom: SSE stream starts, chunks arrive, then connection hangs for minutes before the client sees a timeout.

**Check:**
- Nginx `proxy_read_timeout`: must be greater than ScaiGrid's `DISPATCH_STREAM_TIMEOUT_S`. The defaults are 660s and 600s respectively. If your nginx is set to 60s or 300s, raise it.
- Your client's read timeout — must also exceed the stream duration.
- Check whether `[DONE]` arrives at all. If it does and the connection stays open, the connection is not being closed; confirm `Connection: close` is set on the SSE response (ScaiGrid sets it by default).
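
The timeout relationships above can be checked mechanically. This sketch treats `DISPATCH_STREAM_TIMEOUT_S` as the upper bound on stream duration, which is an assumption about how the client-side rule should be applied; the function and parameter names are illustrative.

```python
from typing import List


def check_stream_timeouts(
    dispatch_stream_timeout_s: float,
    proxy_read_timeout_s: float,
    client_read_timeout_s: float,
) -> List[str]:
    """Return a list of timeout misconfigurations; an empty list means the values are consistent."""
    problems = []
    if proxy_read_timeout_s <= dispatch_stream_timeout_s:
        problems.append(
            f"nginx proxy_read_timeout ({proxy_read_timeout_s}s) must exceed "
            f"DISPATCH_STREAM_TIMEOUT_S ({dispatch_stream_timeout_s}s)"
        )
    # ASSUMPTION: the longest possible stream lasts DISPATCH_STREAM_TIMEOUT_S.
    if client_read_timeout_s <= dispatch_stream_timeout_s:
        problems.append(
            f"client read timeout ({client_read_timeout_s}s) must exceed "
            f"DISPATCH_STREAM_TIMEOUT_S ({dispatch_stream_timeout_s}s)"
        )
    return problems
```

With the defaults quoted above, `check_stream_timeouts(600, 660, 700)` reports nothing, while an nginx value of 60 is flagged.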

### Empty content in a successful response

Response has `status: "ok"` but `choices[0].message.content` is empty.

**Check:**
- `finish_reason` — if `"length"`, the model ran out of tokens. Bump `max_tokens`.
- `usage.completion_tokens` — if > 0 and content is still empty, the model emitted reasoning/thinking tokens that aren't surfaced as content. This is a model capability quirk, not a ScaiGrid bug.
- Is the model a "reasoning" model (o1, o3, DeepSeek R1)? They sometimes consume the whole token budget on internal reasoning before producing visible output. Raise `max_tokens` or switch models.
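
The checks above can be folded into one triage helper. This is a sketch, assuming `finish_reason` sits at `choices[0].finish_reason` alongside the `message` object (an OpenAI-style layout; verify against your actual responses). The returned strings are illustrative.

```python
def diagnose_empty_content(response: dict) -> str:
    """Classify why a successful response has empty choices[0].message.content."""
    choice = response["choices"][0]
    content = (choice.get("message") or {}).get("content") or ""
    if content:
        return "content present"
    # ASSUMPTION: finish_reason is reported per choice.
    if choice.get("finish_reason") == "length":
        return "token budget exhausted: raise max_tokens"
    if (response.get("usage") or {}).get("completion_tokens", 0) > 0:
        return "tokens generated but not surfaced: likely reasoning tokens"
    return "no tokens generated: inspect the request and the model"
```

The `length` check runs first because a reasoning model that exhausts its budget will usually satisfy both conditions, and raising `max_tokens` is the more direct fix.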

### "Unexpected stream error" with no detail

Generic `BACKEND_ERROR` mid-stream.

**Check:**
- Pull the `X-Scaigrid-Request-Id` from the response. Contact support with that ID — we can see the full upstream response.
- Check whether the error is consistent (same model, same prompt fails every time) or flaky. A consistent failure usually points to a ScaiGrid shape-mismatch bug; a flaky one usually points to an upstream issue.

## Rate limiting

### 429 RATE_LIMITED on every request

**Check:**
- `X-Scaigrid-Ratelimit-Limit` and `X-Scaigrid-Ratelimit-Reset` headers on the 429. Which limit level is being hit?
- Are you using a single API key for a high-throughput service? Create per-service keys to spread load.
- If even one request per minute fails: check whether your tenant's rate limit is configured too low. A tenant admin can raise it.
- See [Rate Limiting](../07-advanced/05-rate-limiting.md).
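
When scripting around 429s, it helps to pull both headers out of the response in one place. The sketch below looks the headers up case-insensitively. It also computes a wait time under the assumption that `X-Scaigrid-Ratelimit-Reset` is a Unix timestamp in seconds; the source does not state the format, so confirm it in the Rate Limiting page before relying on it.

```python
def ratelimit_info(headers: dict) -> dict:
    """Extract the rate-limit headers from a 429 response, ignoring header-name case."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "limit": lowered.get("x-scaigrid-ratelimit-limit"),
        "reset": lowered.get("x-scaigrid-ratelimit-reset"),
    }


def seconds_until_reset(reset_value: str, now_epoch: float) -> float:
    """Seconds to wait before retrying.

    ASSUMPTION: reset_value is a Unix timestamp in seconds, not a delta.
    """
    return max(0.0, float(reset_value) - now_epoch)
```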

### 429 BUDGET_EXCEEDED

Different problem — you've hit a cost budget, not a request rate limit.

**Check:**
- `GET /v1/accounting/budgets` to see active budgets and current usage.
- Raise the budget (if authorized) or wait for the period to roll over.

## Module issues

### "Module not enabled for tenant"

```json
{"error": {"code": "MODULE_NOT_ENABLED"}}
```

**Check:**
- `GET /v1/modules` — what's the status of the module you're calling?
- If `available`: ask a tenant admin to `POST /v1/modules/{id}/enable`.
- If `error`: the module failed to initialize. `GET /v1/modules/{id}` shows the last error.

### Module stuck in `error` state

Module tried to initialize and failed. The error message on `/v1/modules/{id}` is the place to start.

Common causes:

- **Missing dependency.** Module depends on another module that isn't enabled.
- **Database migration failure.** New module version, old schema. Check migration logs.
- **Config issue.** Module config references something that doesn't exist (a non-existent model, an unreachable webhook URL).

Fix the root cause and restart the ScaiGrid process — modules re-initialize on boot.

## Performance issues

### Slow p99 latency, healthy p50

Usually a tail-latency issue in one specific backend or upstream provider.

**Check:**
- Per-backend latency in metrics: `scaigrid_request_duration_seconds_bucket{backend_id=...}`
- Is one backend significantly slower? Reduce its weight in the routing policy or mark it for retirement.
- Check upstream provider status pages — sometimes one provider's region has a bad hour.
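
To compare backends without a dashboard, you can estimate p99 directly from the cumulative `_bucket` counts of each `backend_id`. The sketch below uses the same linear interpolation within a bucket that PromQL's `histogram_quantile` applies; the sample numbers in the usage note are made up for illustration.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (le, cumulative_count) pairs, including (math.inf, total)."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_le
            # Linear interpolation inside the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le
```

For instance, with buckets `[(0.5, 90), (1.0, 92), (5.0, 100), (math.inf, 100)]`, p50 lands inside the first bucket while p99 lands in the 1s to 5s bucket: the "healthy p50, slow p99" pattern described above.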

### Accounting flush lag

`scaigrid_accounting_flush_lag_seconds > 60` and climbing.

**Check:**
- Redis-to-MariaDB flush worker is falling behind. Check worker logs.
- MariaDB write capacity — flush is batch INSERT; if writes are slow, lag grows.
- Usage records table size — if it's monstrous, partitioning or archiving stale partitions helps.

### High Redis memory

**Check:**
- Rate-limit counters — should be small (~KB per tenant) unless you have millions of active keys.
- Session cache — bound by active session count.
- Event bus stream retention — each stream has a max length; bigger retention = more memory.
- Module-specific Redis usage — some modules (ScaiQueue, ScaiBunker) use Redis heavily.

## Webhook issues

### My webhook isn't receiving events

**Check:**
- `/v1/webhooks/{webhook_id}` — `status: "active"`? Might have been auto-disabled.
- `/v1/webhooks/{webhook_id}/deliveries` — any failed deliveries? Check status code and error message.
- Is your endpoint responding 2xx within 10 seconds?
- Did you subscribe to the right event type? Event names are case-sensitive.

### Webhook receives duplicates

**Expected.** Webhooks are at-least-once. Check [Webhooks Deep Dive](../07-advanced/03-webhooks-deep-dive.md) for idempotency patterns.
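
A minimal sketch of receiver-side deduplication, assuming each delivery carries a stable unique identifier (which field that is depends on the webhook payload and is not assumed here). It keeps state in process memory, so a real deployment with several workers would need a shared store instead.

```python
import time
from typing import Dict, Optional


class DeliveryDeduplicator:
    """Remember recently seen event IDs so repeated deliveries can be skipped."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self._ttl = ttl_seconds
        self._seen: Dict[str, float] = {}  # event ID -> time first seen

    def first_time(self, event_id: str, now: Optional[float] = None) -> bool:
        """Return True the first time an ID is seen within the TTL, False for duplicates."""
        now = time.time() if now is None else now
        # Forget IDs older than the TTL so memory stays bounded.
        self._seen = {eid: t for eid, t in self._seen.items() if now - t < self._ttl}
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        return True
```

In a handler, process the event only when `first_time(event_id)` is true, and still return 2xx for duplicates so the sender stops retrying.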

## Database / infrastructure

### "Service unavailable" (503 SERVICE_UNAVAILABLE)

Something ScaiGrid depends on is down.

**Check:**
- `/health/ready` — which dependency?
- If Redis: check connectivity.
- If MariaDB: Galera cluster health — majority of nodes required.
- If S3: can ScaiGrid reach your storage endpoint?

### Migration failures on deployment

`scaigrid-migrate` container exits non-zero.

**Check:**
- Which migration failed? Logs show the Alembic revision number.
- Is the database compatible with the version you're migrating to?
- Is there locked DDL or long-running transactions blocking the migration?
- For module migrations: was a required module disabled and removed between versions? Re-enable and let it clean up its tables first.

## Getting help

Every ScaiGrid response has an `X-Scaigrid-Request-Id` header. When contacting support:

1. Capture the request ID from the failing request.
2. Note the exact time (UTC is easiest).
3. Include both — we can pull full logs from those two data points in seconds.

Without a request ID, debugging relies on luck and pattern matching. With one, root cause is usually minutes away.
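
Capturing both data points can be automated in client code. This sketch reads the `X-Scaigrid-Request-Id` header case-insensitively and pairs it with a UTC timestamp; the function name and the output keys are illustrative.

```python
from datetime import datetime, timezone
from typing import Optional


def support_context(headers: dict, when: Optional[datetime] = None) -> dict:
    """Build the two data points support needs: the request ID and the time in UTC."""
    lowered = {name.lower(): value for name, value in headers.items()}
    request_id = lowered.get("x-scaigrid-request-id")
    if request_id is None:
        raise KeyError("response has no X-Scaigrid-Request-Id header")
    when = when or datetime.now(timezone.utc)
    return {
        "request_id": request_id,
        "time_utc": when.astimezone(timezone.utc).isoformat(),
    }
```

Logging this dictionary on every failed request means the information is already on hand when you open a ticket.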

## Related

- [Health and Monitoring](./02-health-and-monitoring.md)
- [Errors](../03-core-concepts/07-errors.md)
- [Error Codes Reference](../06-reference/11-error-codes.md)
