---
title: Troubleshooting
path: operations/troubleshooting
status: published
---

# Troubleshooting

Common failures and how to diagnose them. Reach for the audit log, structured logs, and `/v1/health/detailed` before anything else.

## "Every request is 401"

**Check first:** is there an `Authorization: Bearer ...` header? A missing header returns `401 authentication_required`. If present:

- **`token_expired`** — refresh via client-credentials or OAuth refresh flow.
- **`token_invalid`** — signature didn't validate. Common causes:
  - Token is from a different ScaiKey environment (prod token against staging ScaiVault).
  - ScaiKey's signing keys rotated and ScaiVault's JWKS cache is stale. Restart or wait ~10 minutes; ScaiVault auto-refreshes.
- **`token_revoked`** — ScaiKey explicitly revoked. Issue a new token.

```bash
curl -H "Authorization: Bearer $TOKEN" https://scaivault.scailabs.ai/v1/auth/whoami
```

`whoami` shows exactly who ScaiVault sees. If it's not what you expect, the token's claims don't match the tenant/identity you think they do.
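
When `whoami` itself returns 401, it can help to look at the token's claims locally. The sketch below assumes the bearer token is a JWT (the JWKS-based validation described above suggests this, but it is an assumption) and decodes the payload without verifying the signature. The helper names `decode_claims` and `is_expired` are hypothetical. `exp` and `iss` are standard JWT claims; any tenant or environment claim names are specific to ScaiKey and not shown here.

```python
import base64
import json
import time
from typing import Optional


def decode_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature.

    For local debugging only; never use unverified claims for authorization.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a three-part JWT")
    payload = parts[1]
    # base64url payloads omit padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def is_expired(claims: dict, now: Optional[float] = None) -> bool:
    """True when the standard `exp` claim is present and in the past."""
    now = time.time() if now is None else now
    return "exp" in claims and claims["exp"] <= now
```

Comparing `iss` against the issuer you expect for the environment is a quick way to spot a prod token being sent to staging.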

## "Every request is 403 access_denied"

Auth is fine; policy says no. Use the policy test endpoint:

```bash
curl -X POST https://scaivault.scailabs.ai/v1/policies/test \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"identity_id": "sa:your-service", "path": "app/db/password", "permission": "read"}'
```

The response tells you which policies were evaluated and why none matched. Common causes:

- The identity is bound via a group that doesn't exist yet in ScaiVault's identity cache. Trigger `POST /v1/identity/sync?tenant_id=tnt_xyz`.
- A `conditions` block blocks the request — `ip_not_allowed`, `mfa_required`, `time_window_violation`. The error's `failed_condition` names which.
- Path pattern doesn't match. Check: does the policy have `**` (not just `*`) if the secret is nested? Path matching is anchored.
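
To make the `*` versus `**` distinction concrete, here is a minimal sketch of anchored glob matching. It assumes `*` matches within a single path segment and `**` may span several segments; this mirrors the behavior described above but is not ScaiVault's actual matcher, and the function names are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a policy-style glob into a regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")      # may cross "/" boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")   # stays inside one segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def matches(pattern: str, path: str) -> bool:
    """Anchored match: the pattern must cover the whole path."""
    return pattern_to_regex(pattern).fullmatch(path) is not None


print(matches("app/*", "app/db/password"))   # False: nested one level too deep
print(matches("app/**", "app/db/password"))  # True
print(matches("app/db", "app/db/password"))  # False: anchored, no implicit prefix
```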

## "Reads succeed but writes don't"

Scopes are separate: `secrets:read`, `secrets:write`, `secrets:delete`, and `secrets:rotate`. The token may carry `secrets:read` but not `secrets:write`. `whoami` lists the token's scopes.

Also: is the target path partner-scoped (`/v1/partner/secrets/*`)? Only partner admins can write there.

## "`rotation.due` event fires but my app doesn't do anything"

- **Is the webhook URL reachable from ScaiVault?** Check `GET /v1/webhooks/{id}/deliveries` — failed deliveries are logged there with HTTP status.
- **Signature verification failing on your side?** Look at `X-ScaiVault-Signature` and re-check [Webhook Signatures](../advanced/webhook-signatures). Common mistake: verifying parsed-then-reserialized body instead of raw bytes.
- **Event filters too narrow?** `GET /v1/webhooks/{id}` shows current filters. `path_prefix` is prefix-based — make sure the secret path starts with what you configured.

Test manually: `POST /v1/webhooks/{id}/test` sends a synthetic event. If it arrives and verifies, production events should too.
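
The raw-bytes mistake is easy to reproduce. The sketch below assumes an HMAC-SHA256 hex digest over the request body with a shared secret; the real scheme and header format are defined in the Webhook Signatures page, so treat the algorithm, the secret, and the function names here as illustrative assumptions.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 of the body (assumed scheme, for illustration)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify(secret: bytes, raw_body: bytes, signature: str) -> bool:
    """Constant-time comparison against the signature from the header."""
    return hmac.compare_digest(sign(secret, raw_body), signature)


secret = b"example-webhook-secret"  # hypothetical shared secret
# Bytes exactly as they arrived on the wire (compact JSON, no spaces).
raw = b'{"event":"rotation.due","path":"app/db/password"}'
header = sign(secret, raw)  # stands in for the X-ScaiVault-Signature value

# Parsing and re-serializing changes whitespace, so the bytes differ.
reserialized = json.dumps(json.loads(raw)).encode()

print(verify(secret, raw, header))           # True
print(verify(secret, reserialized, header))  # False
```

In a web framework, this means reading the unparsed request body for verification and only then handing it to the JSON parser.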

## "Certificates aren't renewing automatically"

ACME auto-renew failure sequence:

1. **Is `auto_renew: true` on the managed cert?** `GET /v1/pki/certificates/{id}` shows.
2. **Did the renewal actually fire?** Filter audit: `action=pki_acme_renew` in the hours before `not_after - 30d`. If nothing fires, renewal isn't scheduled — probably a restart wiped the scheduler state (it shouldn't, but check).
3. **Did the ACME provider refuse?** Audit shows `success: false` with the error. Common: rate limit (`backend_rate_limited`), challenge failure (`acme_challenge_failed`).
4. **For `dns-01` challenges:** the DNS provider credentials may have rotated. `GET /v1/pki/dns-providers/{id}` and check it can authenticate.
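
For step 2, the audit window can be derived from the certificate's `not_after`. This is a small sketch assuming renewal is due 30 days before expiry, as above. The 6-hour lookback is an arbitrary choice for "the hours before", and `renewal_audit_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def renewal_audit_window(
    not_after: datetime,
    lead: timedelta = timedelta(days=30),
    lookback: timedelta = timedelta(hours=6),  # assumed; widen if needed
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the period in which a renewal attempt is expected."""
    due = not_after - lead
    return due - lookback, due


start, end = renewal_audit_window(datetime(2025, 7, 31, 12, 0, tzinfo=timezone.utc))
print(start.isoformat(), end.isoformat())
```

Use the resulting timestamps as the time range when filtering the audit log for `action=pki_acme_renew`.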

## "Dynamic credentials fail on first use"

AWS specifically: IAM credentials take 5–15 seconds to propagate globally. Your code might see `InvalidAccessKeyId` on immediate use. Either add a retry loop or pre-warm by sleeping ~10 seconds after generation.
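
A retry loop for this case might look like the sketch below. It is SDK-agnostic: `CredentialsNotReady` is a hypothetical exception your code would raise when the AWS SDK reports `InvalidAccessKeyId`, and the attempt count and delays are assumptions chosen so that the total wait comfortably covers the propagation window.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CredentialsNotReady(Exception):
    """Hypothetical: raise this when the SDK reports InvalidAccessKeyId."""


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with capped exponential backoff while credentials propagate."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except CredentialsNotReady:
            if attempt == attempts:
                raise  # out of attempts; surface the original error
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")
```

With the defaults, the waits are 1, 2, 4, 8, and 8 seconds. Retrying only on the specific "not ready" error keeps genuine permission failures from being masked.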

For databases: the user genuinely exists immediately. If login fails:

- **Check revocation statements haven't pre-fired.** Some statements have typos like `DROP ROLE` without `IF EXISTS`, causing revocation to leak state.
- **Check connection pooling at the client.** A pool that caches DNS may route to an unreachable instance.
- **Verify the root credentials.** `GET /v1/dynamic/engines/{name}` shows `connection_status`. If it is `unhealthy`, the engine can't reach the target; rotate the root credentials and test again.

## "Slow requests"

`/v1/health/detailed` shows dependency latencies. Typical causes:

- **KMS latency.** The first request after an idle period pays a warm-up cost; subsequent requests are fast. If latency stays high, check KMS quotas and instance health.
- **DB pool exhaustion.** `pool_available` close to zero in `/v1/health/detailed`. Either scale up or tune pool size (`DB_POOL_SIZE` env).
- **Redis degraded.** Audit-log writes and rate-limit checks go through Redis. If it's slow, everything slows.
- **Federated backend proxy mode.** Every read waits for the upstream. Switch the subtree to `sync` mode if freshness isn't critical.

Distributed traces are the fastest way to pinpoint this. If OTEL is configured, pull the slow request by its `request_id` and see which span dominates.

## "Too many 429s"

The identity is hitting its rate-limit bucket.

- **Which category?** Error body includes `category` in `details`.
- **Is this one caller sharing a token across many processes?** Give each process its own token (each gets its own bucket).
- **Cache hot reads.** A service that reads the same secret 1000x/min should cache in-process with a short TTL, invalidated by webhook on rotation.
- **Batch where possible.** 50 individual reads in a second count against the read limit; one `POST /v1/secrets/batch` counts once against the batch-read limit.
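
The caching advice above can be sketched as a small in-process TTL cache. Everything here is illustrative: `SecretCache` is a hypothetical class, `fetch` stands in for whatever client call reads a secret from ScaiVault, and the 30-second TTL is an assumption.

```python
import time
from typing import Callable, Dict, Tuple


class SecretCache:
    """In-process cache with a short TTL and explicit invalidation."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # path -> (expires_at, value)

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no request to ScaiVault
        value = self._fetch(path)
        self._entries[path] = (now + self._ttl, value)
        return value

    def invalidate(self, path: str) -> None:
        """Call from the webhook handler when a rotation event names this path."""
        self._entries.pop(path, None)
```

The TTL bounds staleness if a webhook is missed, while `invalidate` makes rotations visible right away in the normal case. This sketch is not thread-safe; add a lock if several threads share one cache.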

## "Health endpoints say 'ready' but real requests fail"

Readiness only checks reachability of DB, Redis, KMS — not end-to-end functionality. Run a known-good request:

```bash
curl -H "Authorization: Bearer $BOOTSTRAP_TOKEN" \
     https://scaivault.scailabs.ai/v1/auth/whoami
```

If this works, the API is actually healthy. Something specific to the failing request is broken — inspect its audit log and structured logs by `request_id`.

## "Secret reads return an old value"

Two possibilities:

- **Local client cache.** Most clients cache in-process for some TTL. Invalidate and re-read.
- **Federated backend in sync mode.** The backend has a newer value but the next sync hasn't run. `POST /v1/federation/backends/{id}/sync` forces it. Long-term fix: shorter `sync_interval`, or switch to `proxy` mode, or teach the remote to emit change events you can subscribe to.

Actual stale reads *from* ScaiVault's own storage shouldn't happen — writes are visible to subsequent reads immediately. If they appear to, it's a client cache.

## Database migration refuses to run

ScaiVault checks the DB schema version against what the binary expects. Possible states:

- **Binary newer than DB.** Default behavior: migrate forward. If `MIGRATE_ON_START=false`, run `scaivault migrate` manually.
- **Binary older than DB.** This usually means the binary was rolled back after the schema had already moved forward. ScaiVault refuses to start to avoid corruption. Either upgrade the binary or (only if you've confirmed safety) downgrade the schema with `scaivault migrate --target N`.

Never run `scaivault migrate --target N` in production without verifying the target schema is compatible with application behavior.
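
The states above reduce to a small decision table. The sketch below only restates that logic for clarity; it assumes schema versions are comparable integers, and the function name and return labels are hypothetical rather than anything ScaiVault exposes.

```python
def startup_action(binary_schema: int, db_schema: int, migrate_on_start: bool = True) -> str:
    """Describe what happens at startup for a given pair of schema versions."""
    if binary_schema == db_schema:
        return "start"
    if binary_schema > db_schema:
        # Forward migration: automatic unless MIGRATE_ON_START=false.
        return "migrate-forward" if migrate_on_start else "manual-migrate-required"
    # Binary older than DB: refusing to start protects the data.
    return "refuse-to-start"


print(startup_action(binary_schema=12, db_schema=11, migrate_on_start=False))
# manual-migrate-required
```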

## Bootstrap super-admin flow is broken

After `BOOTSTRAP_SUPER_ADMIN=user:admin@acme.example`, the user must sign in via ScaiKey at least once for the promotion to happen. If they never logged in, no promotion. Fix: have them sign in. If they did and it didn't take, check the audit log: was `identity_sync` succeeding before their login? If the cache didn't have their user yet, the promotion skipped.

Re-trigger:

```bash
curl -X POST "https://scaivault.scailabs.ai/v1/identity/sync?tenant_id=..." \
  -H "Authorization: Bearer $SOMEONES_TOKEN"
```

Then have the user sign out and back in.

## Collecting logs for support

Include:

- `request_id` from the failing response's `error.request_id` or `X-Request-ID` header.
- Time range (UTC) of the issue.
- Output of `GET /v1/health/detailed` at roughly that time.
- Any relevant audit log entries (`GET /v1/audit/logs?request_id=req_abc`).
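
If you capture failures programmatically, a helper along these lines can pull the `request_id` from either location named above. It is a sketch: the function name is hypothetical, and the only assumptions about the response are the `error.request_id` field and the `X-Request-ID` header mentioned in this section.

```python
import json
from typing import Mapping, Optional


def extract_request_id(headers: Mapping[str, str], body: bytes) -> Optional[str]:
    """Prefer error.request_id from the JSON body, then fall back to X-Request-ID."""
    try:
        request_id = json.loads(body).get("error", {}).get("request_id")
        if isinstance(request_id, str) and request_id:
            return request_id
    except (ValueError, AttributeError):
        pass  # body was empty, not JSON, or not shaped like an error object
    # HTTP header names are case-insensitive.
    for name, value in headers.items():
        if name.lower() == "x-request-id":
            return value
    return None
```

Logging this value next to a UTC timestamp on every failed call gives support both of the first two items in the list without extra work later.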

## What's next

- [Health and Monitoring](./health-and-monitoring)
- [Errors](../core-concepts/errors) — code taxonomy
- [Error Codes Reference](../reference/error-codes)
