Troubleshooting

Common failure modes and what to check. Grouped by symptom, not by subsystem.

Authentication fails

Symptom: every request returns `AUTH_TOKEN_INVALID`

Check: The ScaiKey URL configured in ScaiDrive (SCAIDRIVE_SCAIKEY_URL) must match the issuer in the JWT (iss claim).

Common cause: deploying across environments. A token issued by scaikey-staging.example.com carries that host in its iss claim, so it won't validate against a ScaiDrive instance configured for scaikey.example.com.

Fix: align issuer and URL, or set SCAIDRIVE_JWT_ISSUER explicitly.
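
To compare the two values quickly, the token payload can be decoded locally. The following is a minimal sketch, assuming a standard three-segment JWT. The helper names `jwt_issuer` and `issuer_matches` are illustrative and not part of ScaiDrive, the signature is deliberately not verified, and the exact string comparison is an assumption about how the issuer is checked.

```python
import base64
import json


def jwt_issuer(token: str) -> str:
    """Return the iss claim of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims.get("iss", "")


def issuer_matches(token: str, configured_url: str) -> bool:
    """Exact comparison (assumed): a trailing slash or scheme difference counts as a mismatch."""
    return jwt_issuer(token) == configured_url
```

Pass the value of SCAIDRIVE_SCAIKEY_URL (or SCAIDRIVE_JWT_ISSUER, if set) as `configured_url`. Use this for diagnosis only; it is not a substitute for real token validation.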

Symptom: token is valid but requests return `AUTHZ_USER_SUSPENDED`

Check: SELECT status, is_active FROM users WHERE id = ?. If is_active = false, the user was deactivated — via ScaiKey webhook, via admin API, or manually.

Fix: re-activate via POST /api/v1/admin/users/{user_id}/activate or fix the underlying ScaiKey state.

Symptom: valid token randomly fails

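Check: JWKS cache. ScaiDrive caches ScaiKey's JWKS for 5 minutes. If ScaiKey rotated a signing key recently, the cached key set may not include the new key yet, so tokens signed with it can be rejected until the next JWKS fetch. Tokens signed with a key that is still in the cache keep validating, which is why the failures look random.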

Fix: transient; resolves within 5 minutes. For immediate resolution, restart the API pod (flushes the cache).

Uploads fail

Symptom: `PAYLOAD_TOO_LARGE`

Check: Tenant settings and reverse proxy config.

Tenant: GET /api/v1/admin/tenant → settings.max_file_size_bytes. Proxy: nginx's client_max_body_size, cloud LB's request-body limits.

A mismatch is common: tenant allows 5 GB but nginx rejects at 1 GB.

Fix: raise the proxy limit to at least the tenant limit. For files >1 GB, use resumable uploads — they don't stream the whole body through the proxy in one request.
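
The tenant-versus-proxy comparison can be scripted. This is a rough sketch, assuming you have already read settings.max_file_size_bytes from the tenant endpoint and the client_max_body_size value from the nginx config. The function names are hypothetical. The size parsing follows nginx's k/m/g suffix convention, and a value of 0 disables nginx's body-size check.

```python
import re

# Multipliers for nginx size suffixes (case-insensitive).
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def nginx_size_to_bytes(value: str) -> int:
    """Convert an nginx size such as '1g' or '512m' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognized nginx size: {value!r}")
    number, suffix = match.groups()
    return int(number) * _UNITS[suffix.lower()]


def proxy_blocks_uploads(tenant_max_bytes: int, client_max_body_size: str) -> bool:
    """True when the proxy would reject a body that the tenant setting allows."""
    proxy_limit = nginx_size_to_bytes(client_max_body_size)
    # nginx treats 0 as "do not check the body size".
    return proxy_limit != 0 and proxy_limit < tenant_max_bytes
```

For the mismatch described above, `proxy_blocks_uploads(5 * 1024**3, "1g")` evaluates to True.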

Symptom: chunks upload successfully, finalize fails with `CHECKSUM_MISMATCH`

Check: Whether the client supplied checksum_sha256 on session creation, and whether the concatenation of all chunks actually matches.

Common cause: client computed hash over the file, but wrote one chunk with a trailing \n that wasn't in the original.

Fix: re-compute the full-file hash client-side from the same bytes being uploaded. If the checksums still mismatch, the bytes being read differ from the bytes being hashed — usually a file-open mode issue (text vs binary on Windows).
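
One way to guarantee that the hashed bytes and the uploaded bytes are identical is to derive both from a single binary read path. This is a minimal sketch. `read_chunks` and `sha256_of_chunks` are illustrative names, and the chunk size is arbitrary.

```python
import hashlib
from typing import Iterable, Iterator


def read_chunks(path: str, chunk_size: int) -> Iterator[bytes]:
    """Yield the file's raw bytes in fixed-size chunks."""
    # "rb" disables newline translation, so Windows and Unix read the same bytes.
    with open(path, "rb") as handle:
        while True:
            block = handle.read(chunk_size)
            if not block:
                return
            yield block


def sha256_of_chunks(chunks: Iterable[bytes]) -> str:
    """Hex SHA-256 over the concatenation of the given chunks."""
    digest = hashlib.sha256()
    for block in chunks:
        digest.update(block)
    return digest.hexdigest()
```

Feeding the same `read_chunks` output to both the hash and the upload calls rules out text-mode translation and stray trailing bytes as causes.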

Symptom: upload session not found immediately after creation

Check: Redis session cache and DB. Sessions are written to both.

Common cause: the client is talking to one ScaiDrive instance for the POST and a different instance for the PUT, and Redis is not shared.

Fix: ensure all API replicas share a single Redis.

Sync doesn't catch up

Symptom: client's cursor hasn't moved in hours

Check: GET /api/v1/sync/conflicts?device_id=<id>&include_resolved=false. An unresolved MANUAL conflict blocks sync on that resource, but not usually the whole stream.

More common cause: the client isn't calling POST /api/v1/sync/cursor after consuming changes. The server never advances cursors on its own.

Fix: fix client logic. As a server-side workaround for a stuck device, manually advance its cursor via the cursor endpoint.
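
The client-side pattern is pull, apply, then acknowledge. The sketch below illustrates the ordering only. The request and response shapes of the sync endpoints are not described here, so the three callables are hypothetical stand-ins: `pull` is assumed to return a batch of changes plus the next cursor, and `ack` is assumed to wrap POST /api/v1/sync/cursor.

```python
from typing import Callable, List, Tuple


def drain_changes(
    pull: Callable[[int], Tuple[List[dict], int]],
    apply: Callable[[dict], None],
    ack: Callable[[int], None],
    cursor: int,
) -> int:
    """Pull, apply, and acknowledge batches until none remain; return the final cursor."""
    while True:
        changes, next_cursor = pull(cursor)
        if not changes:
            return cursor
        for change in changes:
            apply(change)
        # Acknowledge only after the whole batch is applied, so a crash
        # mid-batch replays the batch instead of skipping it.
        ack(next_cursor)
        cursor = next_cursor
```

A client that omits the `ack` step keeps receiving the same changes and its server-side cursor never moves, which matches the symptom above.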

Symptom: `SYNC_CURSOR_INVALID` on every pull

Check: How old is the cursor? The change log has a per-tenant retention (default 90 days). Cursors older than that are rejected.

Fix: start the client from cursor=0, which triggers a full resync. For tenants with massive change volumes, where a full resync is expensive, consider lengthening changelog retention or increasing client reconnection frequency so cursors don't age out.

Symptom: WebSocket connects then disconnects immediately

Check: Token in the WebSocket URL. Most WebSocket libraries don't expose the response body on handshake failure — the close code is your only signal.

Close codes:

4401 — auth failure (invalid or expired token)
4403 — authorization (user suspended, no share access)
4429 — connection limit hit
1006 — network issue (not ScaiDrive)
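
Since the close code is the only signal, it helps to map it to an action in the client. This is a small sketch built on the codes listed above. The retry classification is an assumption about sensible client behavior, not documented ScaiDrive policy.

```python
# Close codes as listed above, with a short hint for each.
CLOSE_CODE_HINTS = {
    4401: "auth failure: token invalid or expired",
    4403: "authorization: user suspended or no share access",
    4429: "connection limit hit",
    1006: "network issue (not ScaiDrive)",
}

# Assumption: only these codes are worth retrying without changing anything.
RETRY_UNCHANGED = {4429, 1006}


def describe_close(code: int) -> str:
    """Return a human-readable hint for a WebSocket close code."""
    return CLOSE_CODE_HINTS.get(code, f"unrecognized close code {code}")


def should_retry_unchanged(code: int) -> bool:
    """True when reconnecting with the same token might succeed."""
    return code in RETRY_UNCHANGED
```

Logging `describe_close(code)` on every disconnect makes the 4401 and 4403 cases visible without needing the handshake response body.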

Search returns nothing

Symptom: keyword search misses obvious matches

Check: Is the content actually indexed? GET /api/v1/search/index-status/{file_id}.

Cause: the file may have no extractable text (image, scanned PDF without OCR, encrypted ZIP). ScaiDrive doesn't OCR by default.

Fix: enable OCR at the vectorization-provider level, or fall back to filename-only search for those files.

Symptom: semantic search returns `SERVICE_UNAVAILABLE`

Check: GET /api/v1/search/health. If weaviate_connected: false or embedding_service_available: false, the subsystem is down.

Fixes:

Weaviate unreachable: check network, Weaviate pod, SCAIDRIVE_WEAVIATE_URL.
Embedding provider down: if using OpenAI/Cohere/etc., that provider is probably having issues. Swap to a secondary provider via policy.

Symptom: search results are stale

Check: Vectorization queue depth: GET /api/v1/search/queue.

Cause: workers are behind. Uploaded files take minutes to index — normal, not a bug. If the queue is >1000 and growing, workers are underscaled.

Fix: scale worker replicas. Check scaidrive_queue_depth{queue="high"} metric.
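
The "above 1000 and growing" rule can be expressed as a check over a few consecutive queue-depth samples. This is a sketch only. How the samples are collected (polling the queue endpoint or the metric) is left out, and `workers_underscaled` is an illustrative name.

```python
from typing import Sequence


def workers_underscaled(depth_samples: Sequence[int], threshold: int = 1000) -> bool:
    """True when the queue depth rose on every sample and ends above the threshold."""
    if len(depth_samples) < 2:
        return False  # a single sample cannot show growth
    growing = all(
        later > earlier
        for earlier, later in zip(depth_samples, depth_samples[1:])
    )
    return growing and depth_samples[-1] > threshold
```

A deep queue that is shrinking returns False here: the workers are catching up and scaling is not yet needed.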

Performance

Symptom: API latency spikes

Check: Database slow query log. Large GET /children calls on massive folders (>50k items) are the most common cause.

Fix: paginate client-side. The endpoint supports pagination; some client integrations don't use it.
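
A paginating client usually looks like the loop below. The pagination parameters of the children endpoint are not listed here, so this sketch assumes a token-based scheme and takes a hypothetical `fetch_page` callable that returns one page of items plus the token for the next page (or None on the last page).

```python
from typing import Callable, Iterator, List, Optional, Tuple


def iter_children(
    fetch_page: Callable[[Optional[str]], Tuple[List[dict], Optional[str]]],
) -> Iterator[dict]:
    """Yield folder children one page at a time instead of in one large response."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token)
        yield from items
        if token is None:
            return
```

Because this is a generator, the caller can stop early and the remaining pages are never requested.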

Symptom: high memory on API pods

Check: Active WebSocket connections (scaidrive_sync_websocket_connections). Each connection holds ~8 KB plus a small per-share subscription set.

Cause: too many connections on too few pods.

Fix: scale out. With 5k connections per pod, a typical 3-pod deployment comfortably handles 15k — but 50k needs 10 pods.
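
The sizing arithmetic, as a quick sketch. The 5k-per-pod and 8 KB-per-connection figures come from this section. The function names are illustrative, and the memory estimate ignores the per-share subscription sets.

```python
import math


def pods_needed(connections: int, per_pod: int = 5_000) -> int:
    """Smallest pod count that keeps every pod at or below per_pod connections."""
    return max(1, math.ceil(connections / per_pod))


def websocket_memory_bytes(connections: int, per_connection_bytes: int = 8 * 1024) -> int:
    """Baseline memory held by WebSocket connections, excluding subscription sets."""
    return connections * per_connection_bytes
```

`pods_needed(15_000)` gives 3 and `pods_needed(50_000)` gives 10, matching the figures above. At 5k connections the baseline is about 40 MB per pod, so a much larger footprint points elsewhere.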

Symptom: slow upload throughput

Check:

Client parallelism — are chunks uploading serially?
Network path — upload from a test client inside the same network as ScaiDrive; if that's fast, the bottleneck is the network path between the original client and ScaiDrive, not the service itself.
S3 throughput — aws s3 cp a test file to the bucket directly and measure.
Dedup hit rate — scaidrive_chunks_deduplicated_total. A low hit rate means most uploads are going to S3.

Common fix: increase parallelism on the client to 4–8 chunks.
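
Bounded client-side parallelism can be sketched with a thread pool. `upload_one` is a hypothetical callable standing in for whatever performs a single chunk PUT. The default of 4 workers is the low end of the range suggested above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def upload_chunks(
    chunks: Sequence[bytes],
    upload_one: Callable[[int, bytes], None],
    parallelism: int = 4,
) -> None:
    """Upload chunks with at most `parallelism` requests in flight."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [
            pool.submit(upload_one, index, chunk)
            for index, chunk in enumerate(chunks)
        ]
        for future in futures:
            # result() re-raises any exception from the worker thread.
            future.result()
```

This version holds every chunk in memory, which is fine for a test. A production client would read chunks lazily and cap the number of pending futures.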

Connectors

Symptom: SMB connector never completes initial sync

Check: Job logs: GET /api/v1/smb-connectors/{id}/jobs. Look at the most recent job's error_count and stderr.

Common causes:

Locked files (Office temp files starting with ~$). Add to exclude_patterns.
Case-sensitive vs case-insensitive filesystems producing false conflicts. Check conflict_resolution.
SMB timeouts on very large directories. Increase per-directory timeout via connector settings.
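
Before adding an exclusion, it can be worth checking which paths a pattern would catch. The sketch below assumes exclude_patterns uses shell-style globs matched against the file name, which this section does not confirm. `is_excluded` is an illustrative helper, not connector code.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Sequence

# Office lock files start with "~$" (assumed glob form of the pattern).
EXCLUDE_PATTERNS = ["~$*"]


def is_excluded(path: str, patterns: Sequence[str] = EXCLUDE_PATTERNS) -> bool:
    """True when the file name (last path segment) matches any pattern."""
    # Normalize SMB-style backslashes so the last segment is found reliably.
    name = PurePosixPath(path.replace("\\", "/")).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```

`fnmatchcase` is case-sensitive on every platform, so the test behaves the same wherever it runs.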

Symptom: SharePoint connector auth suddenly fails

Check: Azure app secret expiry. Azure client secrets have a max 2-year lifetime.

Fix: rotate the secret in Azure, update via PATCH /api/v1/sharepoint-connectors/{id} with the new azure_client_secret.

Quotas

Symptom: user hitting `QUOTA_EXCEEDED` but dashboard shows they're under

Check: Which quota is failing. The details field of the error response names it. Common case: a group quota or share quota is tighter than the user quota, so the user looks fine on the dashboard while another quota is exhausted.

Fix: identify the binding quota, adjust or exempt.
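
When several quotas apply to one upload, the binding one is the quota with the least headroom. This is a minimal sketch, assuming you have collected used and limit values for each applicable quota. The dictionary shape and the name `binding_quota` are illustrative.

```python
from typing import Dict, Tuple


def binding_quota(quotas: Dict[str, Tuple[int, int]]) -> str:
    """Return the name of the quota with the least remaining bytes.

    quotas maps a quota name to (used_bytes, limit_bytes).
    """
    return min(quotas, key=lambda name: quotas[name][1] - quotas[name][0])
```

For example, a user at 40 of 100 and a group at 95 of 100 yields "group": the user quota has room, but the group quota is what rejects the upload.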

Symptom: tenant `used_bytes` doesn't match reality

Check: Last usage recalculation: GET /api/v1/quotas/usage/tenant. calculated_at shows when it was last fully recomputed.

Fix: force a recalculation: GET /api/v1/quotas/usage/tenant?recalculate=true. Expensive on large tenants; expect minutes.

Getting help

When filing a support ticket, include:

The request ID (X-Request-Id header or meta.request_id in the response body).
The tenant ID and user ID of the caller.
A timestamp in UTC.
The exact request (method, path, body) and response.

Support tickets with request IDs get triaged in under five minutes. Tickets without them can take days.