Troubleshooting

Common failures and how to diagnose them. Reach for the audit log, structured logs, and /v1/health/detailed before anything else.

"Every request is 401"

Check first: is there an Authorization: Bearer ... header? A missing header returns 401 authentication_required. If present:

- token_expired: refresh via the client-credentials or OAuth refresh flow.
- token_invalid: the signature didn't validate. Common causes:
  - The token is from a different ScaiKey environment (for example, a prod token against staging ScaiVault).
  - ScaiKey's signing keys rotated and ScaiVault's JWKS cache is stale. Restart or wait ~10 minutes; ScaiVault auto-refreshes.
- token_revoked: ScaiKey explicitly revoked the token. Issue a new one.

```bash
curl -H "Authorization: Bearer $TOKEN" https://scaivault.scailabs.ai/v1/auth/whoami
```

whoami shows exactly who ScaiVault sees. If it's not what you expect, the token's claims don't match the tenant/identity you think they do.
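
If whoami itself returns 401, inspecting the token's claims locally can narrow things down. The following is a minimal sketch, assuming the token is a standard three-part JWT (the JWKS and signature references above suggest this, but it is not confirmed here). `decode_jwt_claims` is a hypothetical helper, not part of any ScaiVault client, and it performs no signature verification, so it is a debugging aid only.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Return the UNVERIFIED claims of a JWT. For inspection only, never for auth decisions."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    # JWT segments are base64url without padding; restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


# Example use (TOKEN read from the environment by the caller):
#   claims = decode_jwt_claims(os.environ["TOKEN"])
#   print(claims.get("iss"), claims.get("exp"))
```

Comparing the standard `iss` and `exp` claims against the environment you are calling usually shows whether the token came from the wrong issuer or has already expired.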

"Every request is 403 access_denied"

Auth is fine; policy says no. Use the policy test endpoint:

```bash
curl -X POST https://scaivault.scailabs.ai/v1/policies/test \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"identity_id": "sa:your-service", "path": "app/db/password", "permission": "read"}'
```

The response tells you which policies were evaluated and why none matched. Common causes:

- The identity is bound via a group that doesn't exist yet in ScaiVault's identity cache. Trigger POST /v1/identity/sync?tenant_id=tnt_xyz.
- A conditions block rejects the request (ip_not_allowed, mfa_required, time_window_violation). The error's failed_condition names which one.
- The path pattern doesn't match. If the secret is nested, check that the policy uses ** and not just *. Path matching is anchored.
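
To see why a single * can miss a nested secret, here is a small sketch of anchored glob matching. The semantics are an assumption for illustration: * stays within one path segment, ** crosses segments, and the pattern is anchored at both ends. ScaiVault's actual matcher may differ, and `pattern_to_regex` and `matches` are hypothetical names.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a path pattern into a regex.

    Assumed semantics (not confirmed by this page):
    '*' matches within one path segment, '**' matches across segments.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")       # may cross '/' boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")    # stops at the next '/'
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def matches(pattern: str, path: str) -> bool:
    # fullmatch anchors the pattern at both ends of the path.
    return pattern_to_regex(pattern).fullmatch(path) is not None


nested = "app/db/password"
single_star = matches("app/*", nested)    # '*' cannot cross the second '/'
double_star = matches("app/**", nested)   # '**' covers the nested segments
unanchored = matches("db/*", nested)      # anchoring rules out a mid-path match
```

Under these assumptions `single_star` and `unanchored` are false and `double_star` is true, which is the pattern of results to look for in the policy test response.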

"Reads succeed but writes don't"

Scopes are separate: secrets:read, secrets:write, secrets:delete, and secrets:rotate. The token may carry secrets:read but not secrets:write. whoami lists the token's scopes.

Also: is the target path partner-scoped (/v1/partner/secrets/*)? Only partner admins can write there.

"`rotation.due` event fires but my app doesn't do anything"

- Is the webhook URL reachable from ScaiVault? Check GET /v1/webhooks/{id}/deliveries; failed deliveries are logged there with their HTTP status.
- Is signature verification failing on your side? Look at X-ScaiVault-Signature and re-check Webhook Signatures. Common mistake: verifying a parsed-then-reserialized body instead of the raw bytes.
- Are the event filters too narrow? GET /v1/webhooks/{id} shows the current filters. path_prefix is prefix-based, so make sure the secret path starts with what you configured.

Test manually: POST /v1/webhooks/{id}/test sends a synthetic event. If it arrives and verifies, production events should too.
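
The raw-bytes mistake mentioned above is easy to reproduce. The sketch below assumes an HMAC-SHA256 hex signature over the request body with a shared secret. That scheme is an assumption for illustration; the authoritative format is whatever Webhook Signatures specifies. `verify_signature`, the secret, and the payload are all hypothetical.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 hex signature against the exact bytes received."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected, signature_hex)


secret = b"example-secret"  # hypothetical shared secret
# Note the double space after the comma: senders are free to format JSON as they like.
raw_body = b'{"event": "rotation.due",  "path": "app/db/password"}'
signature = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()

# Correct: verify the bytes exactly as they arrived on the wire.
ok_raw = verify_signature(secret, raw_body, signature)

# Common mistake: parse, then re-serialize. Whitespace changes, so the bytes
# (and therefore the HMAC) no longer match the signature.
reserialized = json.dumps(json.loads(raw_body)).encode()
ok_reserialized = verify_signature(secret, reserialized, signature)
```

Here `ok_raw` is true and `ok_reserialized` is false, even though both bodies parse to the same JSON object. In a web framework, read the raw request body before any JSON middleware touches it.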

"Certificates aren't renewing automatically"

ACME auto-renew failure sequence:

- Is auto_renew: true on the managed cert? GET /v1/pki/certificates/{id} shows the current value.
- Did the renewal actually fire? Filter the audit log for action=pki_acme_renew in the hours before not_after - 30d. If nothing fired, the renewal isn't scheduled; a restart may have wiped the scheduler state (it shouldn't, but check).
- Did the ACME provider refuse? The audit entry shows success: false with the error. Common errors: rate limiting (backend_rate_limited) and challenge failure (acme_challenge_failed).
- For dns-01 challenges: the DNS provider credentials may have rotated. GET /v1/pki/dns-providers/{id} and check that it can authenticate.
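
To pick the audit time range for the renewal check above, compute not_after - 30d from the certificate. A small sketch: the 30-day lead comes from the text above, while the six-hour search window and the example not_after value are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def expected_renewal_time(not_after: datetime,
                          lead: timedelta = timedelta(days=30)) -> datetime:
    """Return the moment the renewal is expected to fire (not_after minus the lead)."""
    return not_after - lead


# Example certificate expiry (hypothetical value), always in UTC.
not_after = datetime(2025, 9, 1, 12, 0, tzinfo=timezone.utc)
renew_at = expected_renewal_time(not_after)

# Assumed search window: the six hours leading up to the expected renewal.
window_start = renew_at - timedelta(hours=6)
window_end = renew_at

print(window_start.isoformat(), window_end.isoformat())
```

Use the two timestamps as the time bounds when filtering audit entries for action=pki_acme_renew.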

"Dynamic credentials fail on first use"

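AWS specifically: IAM credentials take 5–15 seconds to propagate globally, so your code might see InvalidAccessKeyId on immediate use. Either retry on that error or sleep after generation before first use. A fixed ~10 second sleep does not cover the upper end of that window, so a retry loop is the more reliable option.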
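
A retry loop for the propagation delay might look like the sketch below. It is deliberately SDK-agnostic: `CredentialsNotReady` is a hypothetical exception that your own wrapper would raise when AWS answers InvalidAccessKeyId, and the attempt count and delays are assumptions chosen so the total wait exceeds the 15-second upper bound.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CredentialsNotReady(Exception):
    """Hypothetical error raised by the caller's wrapper on InvalidAccessKeyId."""


def retry_until_propagated(
    call: Callable[[], T],
    attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` with capped exponential backoff while credentials propagate."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return call()
        except CredentialsNotReady:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Delays with the defaults: 1, 2, 4, 8, 8 seconds (23 seconds in total).
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise AssertionError("unreachable")
```

Only the propagation error is retried; any other exception propagates immediately, so genuine permission or configuration problems are not masked.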

For databases: the user genuinely exists immediately. If login fails:

- Check that revocation statements haven't fired early. Some statements contain mistakes such as DROP ROLE without IF EXISTS, which cause revocation to leak state.
- Check connection pooling at the client. A pool that caches DNS may route to an unreachable instance.
- Verify the root credentials. GET /v1/dynamic/engines/{name} shows connection_status. If it is unhealthy, the engine can't reach the target; rotate the root and test.

"Slow requests"

/v1/health/detailed shows dependency latencies. Typical causes:

- KMS latency. The first request after an idle period is slow while it warms up; subsequent requests are fast. If the latency is sustained, check KMS quotas and instance health.
- DB pool exhaustion. pool_available is close to zero in /v1/health/detailed. Either scale up or tune the pool size (DB_POOL_SIZE env).
- Redis degraded. Audit-log writes and rate-limit checks go through Redis. If it's slow, everything slows.
- Federated backend in proxy mode. Every read waits for the upstream. Switch the subtree to sync mode if freshness isn't critical.

Distributed traces are the fastest way to pinpoint this. If OTEL is configured, pull the slow request by its request_id and see which span dominates.

"Too many 429s"

The identity is hitting its rate-limit bucket.

- Which category? The error body includes category in details.
- Is one caller sharing a token across many processes? Give each process its own token (each gets its own bucket).
- Cache hot reads. A service that reads the same secret 1000x/min should cache in-process with a short TTL, invalidated by webhook on rotation.
- Batch where possible. 50 individual reads in a second count against the read limit; one POST /v1/secrets/batch counts once against the batch-read limit.
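
The caching advice above can be sketched as a small in-process cache with a TTL and an explicit invalidation hook. `SecretCache` and its `fetch` callback are hypothetical: `fetch` stands in for whatever client call reads the secret from ScaiVault, and the 30-second default TTL is an arbitrary example.

```python
import time
from typing import Callable, Dict, Tuple


class SecretCache:
    """In-process cache with a short TTL and explicit invalidation."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # caller-supplied function that reads one secret
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, str]] = {}  # path -> (expiry, value)

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now < entry[0]:
            return entry[1]      # still fresh: no API call, no rate-limit cost
        value = self._fetch(path)
        self._entries[path] = (now + self._ttl, value)
        return value

    def invalidate(self, path: str) -> None:
        """Call from the webhook handler when a rotation event names this path."""
        self._entries.pop(path, None)
```

With a 30-second TTL, a service reading one secret 1000x/min makes about two upstream reads per minute instead of 1000. This sketch is not thread-safe; wrap `get` and `invalidate` in a lock if several threads share one cache.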

"Health endpoints say 'ready' but real requests fail"

Readiness only checks reachability of DB, Redis, KMS — not end-to-end functionality. Run a known-good request:

```bash
curl -H "Authorization: Bearer $BOOTSTRAP_TOKEN" \
     https://scaivault.scailabs.ai/v1/auth/whoami
```

If this works, the API is actually healthy. Something specific to the failing request is broken — inspect its audit log and structured logs by request_id.

"Secret reads return an old value"

Two possibilities:

- Local client cache. Most clients cache in-process for some TTL. Invalidate and re-read.
- Federated backend in sync mode. The backend has a newer value but the next sync hasn't run. POST /v1/federation/backends/{id}/sync forces it. Long-term fix: shorter sync_interval, or switch to proxy mode, or teach the remote to emit change events you can subscribe to.

Actual stale reads from ScaiVault's own storage shouldn't happen — writes are visible to subsequent reads immediately. If they appear to, it's a client cache.

Database migration refuses to run

ScaiVault checks the DB schema version against what the binary expects. Possible states:

- Binary newer than DB. Default behavior: migrate forward. If MIGRATE_ON_START=false, run scaivault migrate manually.
- Binary older than DB. This happens when the binary was rolled back but the schema was not. ScaiVault refuses to start to avoid corruption. Either upgrade the binary or (only if you've confirmed safety) downgrade the schema with scaivault migrate --target N.

Never run scaivault migrate --target N in production without verifying the target schema is compatible with application behavior.

Bootstrap super-admin flow is broken

After setting BOOTSTRAP_SUPER_ADMIN=user:admin@acme.example, the user must sign in via ScaiKey at least once for the promotion to happen. If they have never signed in, there is no promotion; have them sign in. If they did sign in and the promotion didn't take, check the audit log: was identity_sync succeeding before their login? If the identity cache didn't have the user yet, the promotion was skipped.

Re-trigger:

```bash
# Quote the URL so the shell does not treat "?" as a glob character.
curl -X POST "https://scaivault.scailabs.ai/v1/identity/sync?tenant_id=..." \
  -H "Authorization: Bearer $SOMEONES_TOKEN"
```

Then have the user sign out and back in.

Collecting logs for support

Include:

- request_id from the failing response's error.request_id or X-Request-ID header.
- Time range (UTC) of the issue.
- Output of GET /v1/health/detailed at roughly that time.
- Any relevant audit log entries (GET /v1/audit/logs?request_id=req_abc).