Troubleshooting
Common failure modes and what to check. Grouped by symptom, not by subsystem.
Authentication fails
Symptom: every request returns AUTH_TOKEN_INVALID
Check: The ScaiKey URL configured in ScaiDrive (SCAIDRIVE_SCAIKEY_URL) must match the issuer in the JWT (iss claim).
Common cause: deploying across environments. A token issued by scaikey-staging.example.com won't validate on a ScaiDrive instance configured for scaikey.example.com, because its iss claim names the staging issuer.
Fix: align issuer and URL, or set SCAIDRIVE_JWT_ISSUER explicitly.
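To confirm a mismatch quickly, the iss claim can be read straight from the token payload. The following is a minimal Python sketch for diagnosis only: it decodes the payload without verifying the signature, the helper names are illustrative, and the exact string comparison is an assumption about how strictly ScaiDrive matches the two values.

```python
import base64
import json


def jwt_issuer(token: str) -> str:
    """Return the iss claim of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["iss"]


def issuer_matches(token: str, configured_url: str) -> bool:
    """True if the token's issuer equals the configured ScaiKey URL (exact match assumed)."""
    return jwt_issuer(token) == configured_url
```

Pass the value of SCAIDRIVE_SCAIKEY_URL (or SCAIDRIVE_JWT_ISSUER, if set) as configured_url.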
Symptom: token is valid but AUTHZ_USER_SUSPENDED
Check: SELECT status, is_active FROM users WHERE id = ?. If is_active = false, the user was deactivated — via ScaiKey webhook, via admin API, or manually.
Fix: re-activate via POST /api/v1/admin/users/{user_id}/activate or fix the underlying ScaiKey state.
Symptom: valid token randomly fails
Check: JWKS cache. ScaiDrive caches ScaiKey's JWKS for 5 minutes. If ScaiKey rotated a signing key recently, in-flight tokens signed with the new key may be rejected until the next JWKS fetch.
Fix: transient; resolves within 5 minutes. For immediate resolution, restart the API pod (flushes the cache).
Uploads fail
Symptom: PAYLOAD_TOO_LARGE
Check: Tenant settings and reverse proxy config.
Tenant: GET /api/v1/admin/tenant → settings.max_file_size_bytes.
Proxy: nginx's client_max_body_size, cloud LB's request-body limits.
A mismatch is common: tenant allows 5 GB but nginx rejects at 1 GB.
Fix: raise the proxy limit to at least the tenant limit. For files >1 GB, use resumable uploads — they don't stream the whole body through the proxy in one request.
Symptom: chunks upload successfully, finalize fails with CHECKSUM_MISMATCH
Check: Whether the client supplied checksum_sha256 on session creation, and whether the concatenation of all chunks actually matches.
Common cause: client computed hash over the file, but wrote one chunk with a trailing \n that wasn't in the original.
Fix: re-compute the full-file hash client-side from the same bytes being uploaded. If the checksums still mismatch, the bytes being read differ from the bytes being hashed — usually a file-open mode issue (text vs binary on Windows).
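One way to guarantee that the hashed bytes and the uploaded bytes are identical is to derive both from the same binary-mode read. The sketch below assumes Python on the client; the function names and the 8 MiB chunk size are illustrative, not ScaiDrive requirements.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # illustrative chunk size, not a ScaiDrive requirement


def read_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield the file's bytes in order, opened in binary mode."""
    with open(path, "rb") as f:  # "rb" avoids newline translation on Windows
        while chunk := f.read(chunk_size):
            yield chunk


def sha256_of_chunks(chunks) -> str:
    """Hash the exact byte sequences that will be uploaded."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()
```

Calling sha256_of_chunks(read_chunks(path)) yields a value suitable for checksum_sha256, provided the upload then sends the same chunks unmodified.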
Symptom: upload session not found immediately after creation
Check: Redis session cache and DB. Sessions are written to both.
Common cause: the client is talking to one ScaiDrive instance for the POST and a different instance for the PUT, and Redis is not shared.
Fix: ensure all API replicas share a single Redis.
Sync doesn't catch up
Symptom: client's cursor hasn't moved in hours
Check: GET /api/v1/sync/conflicts?device_id=<id>&include_resolved=false. An unresolved MANUAL conflict blocks sync on that resource, but not usually the whole stream.
More common cause: the client isn't calling POST /api/v1/sync/cursor after consuming changes. The server never advances cursors on its own.
Fix: correct the client so it posts its cursor after consuming each batch of changes. As a server-side workaround for a stuck device, manually advance its cursor via the cursor endpoint.
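A client loop that avoids this problem acknowledges the cursor after every applied batch. The sketch below is an assumption-heavy illustration: pull_changes, apply_change, and ack_cursor are hypothetical callables supplied by the client, the (changes, next_cursor) return shape is assumed, and ack_cursor stands in for a call to POST /api/v1/sync/cursor.

```python
from typing import Callable, Iterable, Tuple


def sync_once(
    cursor: int,
    pull_changes: Callable[[int], Tuple[Iterable[dict], int]],
    apply_change: Callable[[dict], None],
    ack_cursor: Callable[[int], None],
) -> int:
    """Pull one batch, apply it, then acknowledge the new cursor.

    pull_changes(cursor) is assumed to return (changes, next_cursor).
    ack_cursor is a stand-in for POST /api/v1/sync/cursor.
    """
    changes, next_cursor = pull_changes(cursor)
    for change in changes:
        apply_change(change)
    # Acknowledge only after every change is applied; skipping this call
    # is what leaves the server-side cursor stuck.
    if next_cursor != cursor:
        ack_cursor(next_cursor)
    return next_cursor
```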
Symptom: SYNC_CURSOR_INVALID on every pull
Check: How old is the cursor? The change log has a per-tenant retention (default 90 days). Cursors older than that are rejected.
Fix: start the client from cursor=0 (a full resync). For tenants with massive change volumes, consider lengthening changelog retention or increasing client reconnection frequency.
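A client can anticipate this error by tracking when it last synced. The following is a minimal sketch, assuming the client records the time of its last successful sync (last_sync_at is a hypothetical client-side value) and that the tenant uses the default 90-day retention.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION = timedelta(days=90)  # default per-tenant retention


def needs_full_resync(last_sync_at, now=None, retention=DEFAULT_RETENTION):
    """True if the stored cursor is older than the changelog retention window."""
    now = now or datetime.now(timezone.utc)
    return now - last_sync_at > retention
```

When this returns True, the client can skip the failing pull and restart from cursor=0 directly.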
Symptom: WebSocket connects then disconnects immediately
Check: Token in the WebSocket URL. Most WebSocket libraries don't expose the response body on handshake failure — the close code is your only signal.
Close codes:
- 4401: auth failure (invalid or expired token)
- 4403: authorization (user suspended, no share access)
- 4429: connection limit hit
- 1006: network issue (not ScaiDrive)
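Because the close code is the only signal, it helps to translate it in the client's close handler. A minimal Python sketch of such a lookup follows; the function name and message wording are illustrative, and only the four codes listed above are covered.

```python
CLOSE_CODES = {
    4401: "auth failure: token invalid or expired",
    4403: "authorization: user suspended or no share access",
    4429: "connection limit hit",
    1006: "network issue, not ScaiDrive",
}


def explain_close(code: int) -> str:
    """Map a WebSocket close code to the diagnosis listed above."""
    return CLOSE_CODES.get(code, f"unrecognized close code {code}")
```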
Search returns nothing
Symptom: keyword search misses obvious matches
Check: Is the content actually indexed? GET /api/v1/search/index-status/{file_id}.
Cause: the file may have no extractable text (image, scanned PDF without OCR, encrypted ZIP). ScaiDrive doesn't OCR by default.
Fix: enable OCR at the vectorization-provider level, or search by filename only.
Symptom: semantic search returns SERVICE_UNAVAILABLE
Check: GET /api/v1/search/health. If weaviate_connected: false or embedding_service_available: false, the subsystem is down.
Fixes:
- Weaviate unreachable: check network, Weaviate pod, SCAIDRIVE_WEAVIATE_URL.
- Embedding provider down: if using OpenAI/Cohere/etc., that provider is probably having issues. Swap to a secondary provider via policy.
Symptom: search results are stale
Check: Vectorization queue depth: GET /api/v1/search/queue.
Cause: workers are behind. Uploaded files take minutes to index — normal, not a bug. If the queue is >1000 and growing, workers are underscaled.
Fix: scale worker replicas. Check scaidrive_queue_depth{queue="high"} metric.
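The "above 1000 and growing" rule can be applied to a series of queue-depth readings taken a few minutes apart. The sketch below takes plain numbers because the response shape of GET /api/v1/search/queue is not specified here; the function name and the first-versus-last comparison are illustrative choices.

```python
def workers_underscaled(depth_samples, threshold=1000):
    """True if the latest depth exceeds the threshold and the queue grew over the samples."""
    if len(depth_samples) < 2:
        return False  # one reading cannot show a trend
    growing = depth_samples[-1] > depth_samples[0]
    return depth_samples[-1] > threshold and growing
```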
Performance
Symptom: API latency spikes
Check: Database slow query log. Large GET /children calls on massive folders (>50k items) are the most common cause.
Fix: paginate client-side. The endpoint supports pagination; some client integrations don't use it.
Symptom: high memory on API pods
Check: Active WebSocket connections (scaidrive_sync_websocket_connections). Each connection holds ~8 KB plus a small per-share subscription set.
Cause: too many connections on too few pods.
Fix: scale out. At 5k connections per pod, a typical 3-pod deployment handles up to 15k, but 50k needs 10 pods.
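The sizing arithmetic generalizes to any connection count. A small sketch, taking the 5k-per-pod figure from the text as the default (the function name is illustrative):

```python
import math

CONNECTIONS_PER_POD = 5_000  # sizing figure quoted in the text


def pods_needed(connections, per_pod=CONNECTIONS_PER_POD):
    """Smallest pod count that keeps every pod at or below per_pod connections."""
    return max(1, math.ceil(connections / per_pod))
```

For example, pods_needed(15_000) gives 3 and pods_needed(50_000) gives 10, matching the figures above. Read the current connection count from scaidrive_sync_websocket_connections.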
Symptom: slow upload throughput
Check:
- Client parallelism: are chunks uploading serially?
- Network path: upload from a test client inside the same network; if that's fast, the bottleneck is upstream.
- S3 throughput: aws s3 cp a test file to the bucket directly and measure.
- Dedup hit rate: scaidrive_chunks_deduplicated_total. A low hit rate means most uploads are going to S3.
Common fix: increase parallelism on the client to 4–8 chunks.
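Client-side parallelism can be added with a bounded thread pool. The sketch below assumes that chunks may arrive at the server in any order and that upload_chunk is a hypothetical client function that sends one chunk (by index) to an existing upload session; neither detail is confirmed by this guide.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def upload_chunks_parallel(
    chunks: Sequence[bytes],
    upload_chunk: Callable[[int, bytes], None],
    max_workers: int = 4,
) -> None:
    """Upload chunks concurrently; re-raises the first upload error encountered."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_chunk, index, chunk)
            for index, chunk in enumerate(chunks)
        ]
        for future in futures:
            future.result()  # propagate any exception raised in a worker
```

A max_workers value between 4 and 8 corresponds to the range suggested above.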
Connectors
Symptom: SMB connector never completes initial sync
Check: Job logs: GET /api/v1/smb-connectors/{id}/jobs. Look at the most recent job's error_count and stderr.
Common causes:
- Locked files (Office temp files starting with ~$). Add to exclude_patterns.
- Case-sensitive vs case-insensitive filesystems producing false conflicts. Check conflict_resolution.
- SMB timeouts on very large directories. Increase per-directory timeout via connector settings.
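Before changing the connector, a candidate pattern can be tested locally against file names from the share. This sketch assumes exclude_patterns uses shell-style globs matched against the file name; the actual pattern syntax is not specified in this guide, so treat the result as indicative only.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

EXCLUDE_PATTERNS = ["~$*"]  # Office lock files such as "~$report.docx"


def is_excluded(path, patterns=EXCLUDE_PATTERNS):
    """True if the file name matches any glob pattern (glob semantics assumed)."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```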
Symptom: SharePoint connector auth suddenly fails
Check: Azure app secret expiry. Azure client secrets have a max 2-year lifetime.
Fix: rotate the secret in Azure, update via PATCH /api/v1/sharepoint-connectors/{id} with the new azure_client_secret.
Quotas
Symptom: user hitting QUOTA_EXCEEDED but dashboard shows they're under
Check: Which quota is failing. The details field of the error response names it. Common case: a group quota or share quota is tighter than the user quota.
Fix: identify the binding quota, adjust or exempt.
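When several quotas apply to one upload, the binding one is the scope with the least remaining headroom. The sketch below illustrates that comparison with made-up numbers; the scope names and the (used, limit) structure are hypothetical and do not reflect the quota API's response format.

```python
def binding_quota(quotas):
    """Return the scope with the least headroom.

    quotas maps a scope name to a (used_bytes, limit_bytes) pair.
    """
    return min(quotas, key=lambda scope: quotas[scope][1] - quotas[scope][0])


GB = 10 ** 9
example = {
    "user": (40 * GB, 100 * GB),   # well under: this is what the dashboard shows
    "group": (98 * GB, 100 * GB),  # nearly full: this is what rejects the upload
    "share": (10 * GB, 50 * GB),
}
```

Here binding_quota(example) returns "group", even though the user quota has 60 GB free.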
Symptom: tenant used_bytes doesn't match reality
Check: Last usage recalculation: GET /api/v1/quotas/usage/tenant. calculated_at shows when it was last fully recomputed.
Fix: force a recalculation: GET /api/v1/quotas/usage/tenant?recalculate=true. Expensive on large tenants; expect minutes.
Getting help
When filing a support ticket, include:
- The request ID (X-Request-Id header or meta.request_id in the response body).
- The tenant ID and user ID of the caller.
- A timestamp in UTC.
- The exact request (method, path, body) and response.
Support tickets with request IDs get triaged in under five minutes. Tickets without them can take days.