
Troubleshooting

Common failure modes and what to check. Grouped by symptom, not by subsystem.

Authentication fails#

Symptom: every request returns AUTH_TOKEN_INVALID#

Check: The ScaiKey URL configured in ScaiDrive (SCAIDRIVE_SCAIKEY_URL) must match the issuer in the JWT (iss claim).

Common cause: deploying across environments. A token issued by scaikey-staging.example.com carries that issuer in its iss claim, so it won't validate on a ScaiDrive instance configured for scaikey.example.com.

Fix: align issuer and URL, or set SCAIDRIVE_JWT_ISSUER explicitly.
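To compare the two values quickly, the iss claim can be read straight out of a token. This is a minimal debugging sketch: it does not verify the signature, the function names are illustrative, and ignoring a trailing slash is an assumption about how the comparison is made, not documented ScaiDrive behavior.

```python
import base64
import json


def jwt_issuer(token: str) -> str:
    """Return the iss claim of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are unpadded base64url; restore padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["iss"]


def issuer_matches(token: str, configured_url: str) -> bool:
    """Compare the token issuer with the configured URL (trailing slash ignored)."""
    return jwt_issuer(token).rstrip("/") == configured_url.rstrip("/")
```

Pass the value of SCAIDRIVE_SCAIKEY_URL (or SCAIDRIVE_JWT_ISSUER, if set) as configured_url. A False result points at the environment mismatch described above.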

Symptom: token is valid but AUTHZ_USER_SUSPENDED#

Check: SELECT status, is_active FROM users WHERE id = ?. If is_active = false, the user was deactivated — via ScaiKey webhook, via admin API, or manually.

Fix: re-activate via POST /api/v1/admin/users/{user_id}/activate or fix the underlying ScaiKey state.

Symptom: valid token randomly fails#

Check: JWKS cache. ScaiDrive caches ScaiKey's JWKS for 5 minutes. If ScaiKey rotated a signing key recently, tokens signed with the new key may be rejected until the next JWKS fetch.

Fix: transient; resolves within 5 minutes. For immediate resolution, restart the API pod (flushes the cache).

Uploads fail#

Symptom: PAYLOAD_TOO_LARGE#

Check: Tenant settings and reverse proxy config.

Tenant: settings.max_file_size_bytes from GET /api/v1/admin/tenant. Proxy: nginx's client_max_body_size, or the cloud load balancer's request-body limit.

A mismatch is common: tenant allows 5 GB but nginx rejects at 1 GB.

Fix: raise the proxy limit to at least the tenant limit. For files >1 GB, use resumable uploads — they don't stream the whole body through the proxy in one request.
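The comparison between the two limits can be scripted. The sketch below converts an nginx size value (nginx accepts k, m, and g suffixes for client_max_body_size, and treats 0 as "no limit") to bytes and checks it against the tenant limit. Function names are illustrative, and fetching the two values is left to the caller.

```python
def nginx_size_to_bytes(value: str) -> int:
    """Convert an nginx size such as '1g', '512m', or '1024' to bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def proxy_blocks_tenant_limit(client_max_body_size: str, tenant_max_bytes: int) -> bool:
    """True if the proxy would reject a body that the tenant setting allows."""
    proxy_limit = nginx_size_to_bytes(client_max_body_size)
    # nginx treats 0 as "do not check the request body size".
    return proxy_limit != 0 and proxy_limit < tenant_max_bytes
```

With the example above, proxy_blocks_tenant_limit("1g", 5 * 1024 ** 3) is True: nginx rejects the upload before ScaiDrive sees it.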

Symptom: chunks upload successfully, finalize fails with CHECKSUM_MISMATCH#

Check: Whether the client supplied checksum_sha256 on session creation, and whether the concatenation of all chunks actually matches.

Common cause: the client computed the hash over the original file but uploaded one chunk with a trailing \n that wasn't in the original.

Fix: re-compute the full-file hash client-side from the same bytes being uploaded. If the checksums still mismatch, the bytes being read differ from the bytes being hashed — usually a file-open mode issue (text vs binary on Windows).
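A minimal sketch of a client-side hash that avoids the text-versus-binary trap: the file is opened in binary mode and hashed block by block. The function name and block size are illustrative; the resulting hex digest is what a client would supply as checksum_sha256.

```python
import hashlib


def sha256_of_file(path: str, block_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size blocks, reading raw bytes."""
    digest = hashlib.sha256()
    # "rb" matters: text mode on Windows translates line endings,
    # so the hashed bytes would differ from the bytes on disk.
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

The chunks sent to the server should come from the same binary read path, so the bytes hashed and the bytes uploaded cannot diverge.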

Symptom: upload session not found immediately after creation#

Check: Redis session cache and DB. Sessions are written to both.

Common cause: the client is talking to one ScaiDrive instance for the POST and a different instance for the PUT, and Redis is not shared.

Fix: ensure all API replicas share a single Redis.

Sync doesn't catch up#

Symptom: client's cursor hasn't moved in hours#

Check: GET /api/v1/sync/conflicts?device_id=<id>&include_resolved=false. An unresolved MANUAL conflict blocks sync on that resource, but usually not the whole stream.

More common cause: the client isn't calling POST /api/v1/sync/cursor after consuming changes. The server never advances cursors on its own.

Fix: correct the client so it acknowledges its cursor after consuming each batch of changes. As a server-side workaround for a stuck device, manually advance its cursor via the cursor endpoint.
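The shape of a correct client loop, as a hedged sketch. The three callables are hypothetical stand-ins: pull fetches changes after a cursor, apply_change applies one change locally, and ack wraps POST /api/v1/sync/cursor. Request and response shapes are not shown here, and an integer cursor is an assumption.

```python
from typing import Callable, Iterable, Tuple

Change = dict
PullFn = Callable[[int], Tuple[Iterable[Change], int]]


def sync_once(cursor: int,
              pull: PullFn,
              apply_change: Callable[[Change], None],
              ack: Callable[[int], None]) -> int:
    """Pull changes after `cursor`, apply them, then acknowledge the new cursor."""
    changes, next_cursor = pull(cursor)
    for change in changes:
        apply_change(change)
    if next_cursor != cursor:
        # Without this call the server-side cursor stays where it was.
        ack(next_cursor)
    return next_cursor
```

Acknowledging only after every change has been applied means a crash mid-batch replays the batch instead of skipping it.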

Symptom: SYNC_CURSOR_INVALID on every pull#

Check: How old is the cursor? The change log has a per-tenant retention (default 90 days). Cursors older than that are rejected.

Fix: start the client from cursor=0, which triggers a full resync. For tenants with massive change volumes, where a full resync is expensive, consider lengthening changelog retention or increasing client reconnection frequency so cursors don't age out.

Symptom: WebSocket connects then disconnects immediately#

Check: Token in the WebSocket URL. Most WebSocket libraries don't expose the response body on handshake failure — the close code is your only signal.

Close codes:

  • 4401 — auth failure (invalid or expired token)
  • 4403 — authorization (user suspended, no share access)
  • 4429 — connection limit hit
  • 1006 — network issue (not ScaiDrive)
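Since the close code is the only signal, clients may want to log a readable hint alongside it. A small sketch using the codes listed above; the function name and the fallback message are illustrative.

```python
CLOSE_CODE_HINTS = {
    4401: "auth failure: invalid or expired token",
    4403: "authorization: user suspended or no share access",
    4429: "connection limit hit",
    1006: "network issue, not ScaiDrive",
}


def explain_close(code: int) -> str:
    """Map a WebSocket close code to the hint from the list above."""
    return CLOSE_CODE_HINTS.get(code, f"unlisted close code {code}")
```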

Search returns nothing#

Symptom: keyword search misses obvious matches#

Check: Is the content actually indexed? GET /api/v1/search/index-status/{file_id}.

Cause: the file may not have text extractable (image, scanned PDF without OCR, encrypted ZIP). ScaiDrive doesn't OCR by default.

Fix: enable OCR at the vectorization-provider level, or fall back to filename-only search.

Symptom: semantic search returns SERVICE_UNAVAILABLE#

Check: GET /api/v1/search/health. If weaviate_connected: false or embedding_service_available: false, the subsystem is down.

Fixes:

  • Weaviate unreachable: check network, Weaviate pod, SCAIDRIVE_WEAVIATE_URL.
  • Embedding provider down: if using OpenAI/Cohere/etc., that provider is probably having issues. Swap to a secondary provider via policy.

Symptom: search results are stale#

Check: Vectorization queue depth: GET /api/v1/search/queue.

Cause: workers are behind. Uploaded files take minutes to index — normal, not a bug. If the queue is >1000 and growing, workers are underscaled.

Fix: scale worker replicas. Track the backlog with the scaidrive_queue_depth{queue="high"} metric.

Performance#

Symptom: API latency spikes#

Check: Database slow query log. Large GET /children calls on massive folders (>50k items) are the most common cause.

Fix: paginate client-side. The endpoint supports pagination; some client integrations don't use it.
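A generic pagination loop, as a sketch only. This section does not specify the pagination parameters for GET /children, so the example assumes a token-based scheme and hides the HTTP call behind a hypothetical fetch_page callable that returns a page of items plus the token for the next page (None when done).

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]


def iter_children(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield every child item, requesting one page at a time."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token)
        yield from items
        if token is None:
            return
```

Because this is a generator, a caller can stop early without fetching the remaining pages of a 50k-item folder.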

Symptom: high memory on API pods#

Check: Active WebSocket connections (scaidrive_sync_websocket_connections). Each connection holds ~8 KB plus a small per-share subscription set.

Cause: too many connections on too few pods.

Fix: scale out. With 5k connections per pod, a typical 3-pod deployment comfortably handles 15k — but 50k needs 10 pods.
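The sizing arithmetic above, as a minimal sketch. The 5,000-connections-per-pod budget is the figure used in this section; the function name is illustrative.

```python
import math


def pods_needed(connections: int, per_pod: int = 5000) -> int:
    """Smallest pod count that keeps every pod at or under the per-pod budget."""
    return max(1, math.ceil(connections / per_pod))
```

pods_needed(15000) gives 3 and pods_needed(50000) gives 10, matching the figures above. One connection over a multiple of the budget rounds up to an extra pod.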

Symptom: slow upload throughput#

Check:

  • Client parallelism — are chunks uploading serially?
  • Network path — upload from a test client inside the same network; if that's fast, the bottleneck is upstream.
  • S3 throughput — aws s3 cp a test file to the bucket directly and measure.
  • Dedup hit rate — scaidrive_chunks_deduplicated_total. A low hit rate means most uploads are going to S3.

Common fix: increase parallelism on the client to 4–8 chunks.
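One way to add client parallelism is a bounded thread pool. In this sketch upload_chunk is a hypothetical callable that sends one chunk (by index) to the server; the actual chunk endpoint and any retry policy are outside the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def upload_chunks(chunks: Sequence[bytes],
                  upload_chunk: Callable[[int, bytes], None],
                  parallelism: int = 4) -> None:
    """Upload chunks concurrently, at most `parallelism` at a time."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(upload_chunk, index, data)
                   for index, data in enumerate(chunks)]
        for future in futures:
            # result() re-raises any exception from the worker thread.
            future.result()
```

This version holds every chunk in memory at once. For very large files, a real client would read chunks lazily instead.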

Connectors#

Symptom: SMB connector never completes initial sync#

Check: Job logs: GET /api/v1/smb-connectors/{id}/jobs. Look at the most recent job's error_count and stderr.

Common causes:

  • Locked files (Office temp files starting with ~$). Add to exclude_patterns.
  • Case-sensitive vs case-insensitive filesystems producing false conflicts. Check conflict_resolution.
  • SMB timeouts on very large directories. Increase per-directory timeout via connector settings.
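Before adding a pattern to exclude_patterns, it can help to test it against sample paths locally. The sketch below assumes shell-style glob patterns matched against the file name. That is an assumption for illustration; the connector's actual pattern syntax may differ.

```python
from fnmatch import fnmatch
from pathlib import PureWindowsPath
from typing import Sequence


def is_excluded(path: str, exclude_patterns: Sequence[str]) -> bool:
    """True if the file name matches any glob pattern (assumed syntax)."""
    # PureWindowsPath accepts both "/" and "\" as separators.
    name = PureWindowsPath(path).name
    return any(fnmatch(name, pattern) for pattern in exclude_patterns)
```

For example, is_excluded("docs/~$report.docx", ["~$*"]) is True, while the real docs/report.docx is not matched.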

Symptom: SharePoint connector auth suddenly fails#

Check: Azure app secret expiry. Azure client secrets have a max 2-year lifetime.

Fix: rotate the secret in Azure, update via PATCH /api/v1/sharepoint-connectors/{id} with the new azure_client_secret.

Quotas#

Symptom: user hitting QUOTA_EXCEEDED but dashboard shows they're under#

Check: Which quota is failing. The details field of the error response names it. Common case: a group quota or share quota is tighter than the user quota.

Fix: identify the binding quota, adjust or exempt.
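When several quotas apply to one user, the binding one is whichever has the least headroom. A sketch with illustrative scope names and a hypothetical input shape: a mapping from scope to (used_bytes, limit_bytes), gathered by the caller.

```python
from typing import Dict, Tuple


def binding_quota(quotas: Dict[str, Tuple[int, int]]) -> str:
    """Return the scope with the least remaining headroom (limit minus used)."""
    return min(quotas, key=lambda scope: quotas[scope][1] - quotas[scope][0])
```

A user at 40 of 100 bytes can still be blocked if their group sits at 95 of 100. binding_quota returns "group" in that case, which explains a dashboard that shows the user under quota.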

Symptom: tenant used_bytes doesn't match reality#

Check: Last usage recalculation: GET /api/v1/quotas/usage/tenant. calculated_at shows when it was last fully recomputed.

Fix: force a recalculation: GET /api/v1/quotas/usage/tenant?recalculate=true. Expensive on large tenants; expect minutes.

Getting help#

When filing a support ticket, include:

  • The request ID (X-Request-Id header or meta.request_id in the response body).
  • The tenant ID and user ID of the caller.
  • A timestamp in UTC.
  • The exact request (method, path, body) and response.

Support tickets with request IDs get triaged in under five minutes. Tickets without them can take days.

Updated 2026-05-18 15:04:23, rev 2