
Troubleshooting

Common failure modes and what to check. Grouped by symptom, not by subsystem.

Authentication fails#

Symptom: every request returns AUTH_TOKEN_INVALID#

Check: The ScaiKey URL configured in ScaiDrive (SCAIDRIVE_SCAIKEY_URL) must match the issuer in the JWT (iss claim).

Common cause: deploying across environments. A token issued by scaikey-staging.example.com carries that host in its iss claim, so a ScaiDrive instance configured for scaikey.example.com rejects it.

Fix: align issuer and URL, or set SCAIDRIVE_JWT_ISSUER explicitly.
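To see which issuer a failing token actually carries, you can decode its payload locally and compare it with the configured value. The following is a minimal sketch using only the Python standard library. The helper names are illustrative and not part of ScaiDrive, and it assumes the comparison is an exact string match, so even a trailing slash counts as a mismatch.

```python
import base64
import json


def jwt_issuer(token: str) -> str:
    """Return the iss claim of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))["iss"]


def issuer_matches(token: str, configured_issuer: str) -> bool:
    """Exact comparison (assumed); pass the value ScaiDrive is configured with."""
    return jwt_issuer(token) == configured_issuer
```

This is a debugging aid only: it skips signature verification, so never use it to make an authentication decision.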

Symptom: token is valid but AUTHZ_USER_SUSPENDED#

Check: SELECT status, is_active FROM users WHERE id = ?. If is_active = false, the user was deactivated — via ScaiKey webhook, via admin API, or manually.

Fix: re-activate via POST /api/v1/admin/users/{user_id}/activate or fix the underlying ScaiKey state.

Symptom: valid token randomly fails#

Check: JWKS cache. ScaiDrive caches ScaiKey's JWKS for 5 minutes. If ScaiKey rotated a signing key recently, tokens signed with the new key may be rejected until the next JWKS fetch, while tokens signed with the previous key still validate. That mix is what makes the failures look random.

Fix: transient; resolves within 5 minutes. For immediate resolution, restart the API pod (flushes the cache).
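To confirm that a rejected token was signed with a key the cache may not know yet, compare the key ID in the token header with the key IDs in a JWKS document. This is a hedged sketch: it assumes ScaiKey tokens carry a `kid` header and that the JWKS follows the standard `{"keys": [...]}` layout. The function names are illustrative.

```python
import base64
import json
from typing import Optional


def token_kid(token: str) -> Optional[str]:
    """Return the kid header of a JWT, or None if the header has no kid."""
    header_b64 = token.split(".")[0]
    header_b64 += "=" * (-len(header_b64) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(header_b64)).get("kid")


def kid_in_jwks(token: str, jwks: dict) -> bool:
    """True if the token's kid appears in the given JWKS document."""
    kid = token_kid(token)
    return any(key.get("kid") == kid for key in jwks.get("keys", []))
```

If the `kid` is present in a freshly fetched JWKS but the token is still rejected, a stale cache is the likely explanation.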

Uploads fail#

Symptom: PAYLOAD_TOO_LARGE#

Check: Tenant settings and reverse proxy config.

Tenant: GET /api/v1/admin/tenant → settings.max_file_size_bytes. Proxy: nginx's client_max_body_size, cloud LB's request-body limits.

A mismatch is common: tenant allows 5 GB but nginx rejects at 1 GB.

Fix: raise the proxy limit to at least the tenant limit. For files >1 GB, use resumable uploads — they don't stream the whole body through the proxy in one request.
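The comparison between the two limits can be sketched as follows. The parser handles the `k`, `m`, and `g` suffixes that nginx accepts for `client_max_body_size` and treats `0` as "no limit", which is how nginx interprets it. The function names are illustrative, and the tenant limit is assumed to be available as a plain byte count.

```python
def nginx_size_to_bytes(value: str) -> int:
    """Convert an nginx size such as '1g', '512m' or '1048576' to bytes."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    value = value.strip().rstrip(";").lower()
    if value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def proxy_blocks_tenant_limit(tenant_max_bytes: int, client_max_body_size: str) -> bool:
    """True if nginx would reject a body that the tenant setting still allows."""
    proxy_limit = nginx_size_to_bytes(client_max_body_size)
    # nginx treats 0 as "do not check the request body size".
    return proxy_limit != 0 and proxy_limit < tenant_max_bytes
```

For example, `proxy_blocks_tenant_limit(5 * 1024**3, "1g")` reports a mismatch, while the same tenant limit against `"5g"` or `"0"` does not.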

Symptom: chunks upload successfully, finalize fails with CHECKSUM_MISMATCH#

Check: Whether the client supplied checksum_sha256 on session creation, and whether the concatenation of all chunks actually matches.

Common cause: client computed hash over the file, but wrote one chunk with a trailing \n that wasn't in the original.

Fix: re-compute the full-file hash client-side from the same bytes being uploaded. If the checksums still mismatch, the bytes being read differ from the bytes being hashed — usually a file-open mode issue (text vs binary on Windows).
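A minimal sketch of hashing a file the safe way: open it in binary mode and feed the digest the same byte chunks that would be uploaded. The function name and the 8 MiB read size are illustrative choices, not ScaiDrive requirements.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Hex SHA-256 of a file, read in binary mode chunk by chunk."""
    digest = hashlib.sha256()
    # "rb" matters: text mode can translate line endings on Windows,
    # so the bytes hashed would differ from the bytes on disk.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Updating the digest chunk by chunk gives the same result as hashing the whole file at once, so the client can hash and upload from a single read loop.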

Symptom: upload session not found immediately after creation#

Check: Redis session cache and DB. Sessions are written to both.

Common cause: the client is talking to one ScaiDrive instance for the POST and a different instance for the PUT, and Redis is not shared.

Fix: ensure all API replicas share a single Redis.

Sync doesn't catch up#

Symptom: client's cursor hasn't moved in hours#

Check: GET /api/v1/sync/conflicts?device_id=<id>&include_resolved=false. An unresolved MANUAL conflict blocks sync on that resource, but usually not the whole stream.

More common cause: the client isn't calling POST /api/v1/sync/cursor after consuming changes. The server never advances cursors on its own.

Fix: correct the client logic so it acknowledges the cursor after consuming each batch of changes. As a server-side workaround for a stuck device, manually advance its cursor via the cursor endpoint.
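The pull, apply, acknowledge order the client needs can be sketched like this. The three callables are hypothetical stand-ins for the client's own HTTP and storage code (`ack_cursor` would wrap POST /api/v1/sync/cursor), and the integer cursor and the `(changes, next_cursor)` return shape are assumptions made for illustration.

```python
from typing import Any, Callable, Iterable, Tuple


def sync_once(
    cursor: int,
    pull_changes: Callable[[int], Tuple[Iterable[Any], int]],
    apply_change: Callable[[Any], None],
    ack_cursor: Callable[[int], None],
) -> int:
    """Pull one batch, apply it, then acknowledge the new cursor."""
    changes, next_cursor = pull_changes(cursor)
    for change in changes:
        apply_change(change)
    # Acknowledge only after every change was applied; if applying fails,
    # the exception propagates and the server-side cursor stays put.
    if next_cursor != cursor:
        ack_cursor(next_cursor)
    return next_cursor
```

Acknowledging after applying, rather than before, means a crash mid-batch causes a replay instead of silently skipped changes.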

Symptom: SYNC_CURSOR_INVALID on every pull#

Check: How old is the cursor? The change log has a per-tenant retention (default 90 days). Cursors older than that are rejected.

Fix: start the client from cursor=0, which triggers a full resync. For tenants with massive change volumes, where a full resync is expensive, consider lengthening changelog retention or increasing client reconnection frequency so that cursors do not expire.

Symptom: WebSocket connects then disconnects immediately#

Check: Token in the WebSocket URL. Most WebSocket libraries don't expose the response body on handshake failure — the close code is your only signal.

Close codes:

  • 4401 — auth failure (invalid or expired token)
  • 4403 — authorization (user suspended, no share access)
  • 4429 — connection limit hit
  • 1006 — network issue (not ScaiDrive)

Search returns nothing#

Symptom: keyword search misses obvious matches#

Check: Is the content actually indexed? GET /api/v1/search/index-status/{file_id}.

Cause: the file may not have text extractable (image, scanned PDF without OCR, encrypted ZIP). ScaiDrive doesn't OCR by default.

Fix: enable OCR at the vectorization-provider level, or fall back to filename-only search.

Symptom: semantic search returns SERVICE_UNAVAILABLE#

Check: GET /api/v1/search/health. If weaviate_connected: false or embedding_service_available: false, the subsystem is down.

Fixes:

  • Weaviate unreachable: check network, Weaviate pod, SCAIDRIVE_WEAVIATE_URL.
  • Embedding provider down: if using OpenAI/Cohere/etc., that provider is probably having issues. Swap to a secondary provider via policy.

Symptom: search results are stale#

Check: Vectorization queue depth: GET /api/v1/search/queue.

Cause: workers are behind. Uploaded files take minutes to index — normal, not a bug. If the queue is >1000 and growing, workers are underscaled.

Fix: scale worker replicas. Check scaidrive_queue_depth{queue="high"} metric.

Performance#

Symptom: API latency spikes#

Check: Database slow query log. Large GET /children calls on massive folders (>50k items) are the most common cause.

Fix: paginate client-side. The endpoint supports pagination; some client integrations don't use it.
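A client-side pagination loop might look like the sketch below. The endpoint's actual pagination parameters are not described here, so `fetch_page(offset, limit)` is a hypothetical callable that wraps the real request, and the offset/limit style and the page size of 500 are assumptions.

```python
from typing import Any, Callable, Iterator, List


def iter_children(
    fetch_page: Callable[[int, int], List[Any]],
    page_size: int = 500,
) -> Iterator[Any]:
    """Yield folder children one page at a time instead of in one huge call."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        # A short (or empty) page means the listing is exhausted.
        if len(page) < page_size:
            return
        offset += page_size
```

Because it is a generator, the caller can stop early without fetching the remaining pages.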

Symptom: high memory on API pods#

Check: Active WebSocket connections (scaidrive_sync_websocket_connections). Each connection holds ~8 KB plus a small per-share subscription set.

Cause: too many connections on too few pods.

Fix: scale out. With 5k connections per pod, a typical 3-pod deployment comfortably handles 15k — but 50k needs 10 pods.
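The sizing arithmetic above can be written out as a small helper. It uses the figures from this section (5k connections per pod, about 8 KB per connection), treats 8 KB as 8 KiB, and ignores the per-share subscription set, so the memory figure is a lower bound. The function names are illustrative.

```python
import math

CONNECTIONS_PER_POD = 5_000       # sizing target from this section
BYTES_PER_CONNECTION = 8 * 1024   # ~8 KB, excluding per-share subscriptions


def pods_needed(connections: int) -> int:
    """Number of API pods needed to stay at or below the per-pod target."""
    return math.ceil(connections / CONNECTIONS_PER_POD)


def connection_memory_mib(connections: int) -> float:
    """Baseline memory held by WebSocket connections, in MiB."""
    return connections * BYTES_PER_CONNECTION / 2**20
```

With these numbers, `pods_needed(15_000)` is 3, `pods_needed(50_000)` is 10, and a pod at 5,000 connections holds roughly 39 MiB of baseline connection state.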

Symptom: slow upload throughput#

Check:

  • Client parallelism — are chunks uploading serially?
  • Network path — upload from a test client inside the same network; if that's fast, the bottleneck is upstream.
  • S3 throughput — aws s3 cp a test file to the bucket directly and measure.
  • Dedup hit rate — scaidrive_chunks_deduplicated_total. A low hit rate means most uploads are going to S3.

Common fix: increase parallelism on the client to 4–8 chunks.
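One way to upload chunks in parallel from a Python client is a bounded thread pool. In this sketch `upload_chunk(index, data)` is a hypothetical callable that performs the real chunk request, and the chunks are assumed to fit in memory as a list. A production client would more likely read them lazily from disk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def upload_chunks_parallel(
    chunks: List[bytes],
    upload_chunk: Callable[[int, bytes], None],
    parallelism: int = 4,
) -> None:
    """Upload chunks with at most `parallelism` requests in flight."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [
            pool.submit(upload_chunk, index, data)
            for index, data in enumerate(chunks)
        ]
        # result() re-raises the first upload error instead of hiding it.
        for future in futures:
            future.result()
```

Threads suit this workload because each upload spends most of its time waiting on the network.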

Connectors#

Symptom: SMB connector never completes initial sync#

Check: Job logs: GET /api/v1/smb-connectors/{id}/jobs. Look at the most recent job's error_count and stderr.

Common causes:

  • Locked files (Office temp files starting with ~$). Add to exclude_patterns.
  • Case-sensitive vs case-insensitive filesystems producing false conflicts. Check conflict_resolution.
  • SMB timeouts on very large directories. Increase per-directory timeout via connector settings.
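Before changing a connector, it can help to test a candidate exclude pattern against real paths from the job log. This sketch assumes `exclude_patterns` uses shell-style globs matched against the file name. The connector's actual matching rules may differ, so treat it as a local sanity check only.

```python
import fnmatch
from pathlib import PureWindowsPath
from typing import List


def is_excluded(path: str, exclude_patterns: List[str]) -> bool:
    """True if the file name matches any glob in exclude_patterns."""
    # PureWindowsPath accepts both "\\" and "/" separators, as SMB paths may use either.
    name = PureWindowsPath(path).name
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in exclude_patterns)
```

With the pattern `~$*`, an Office lock file such as `finance\~$budget.xlsx` is excluded while `finance\budget.xlsx` is not.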

Symptom: SharePoint connector auth suddenly fails#

Check: Azure app secret expiry. Azure client secrets have a max 2-year lifetime.

Fix: rotate the secret in Azure, update via PATCH /api/v1/sharepoint-connectors/{id} with the new azure_client_secret.

Quotas#

Symptom: user hitting QUOTA_EXCEEDED but dashboard shows they're under#

Check: Which quota is failing. The details field of the error response names it. Common case: a group quota or share quota is tighter than the user quota.

Fix: identify the binding quota, adjust or exempt.
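When several quotas apply to the same user, the binding one is the quota with the least remaining headroom. A minimal sketch, assuming you have already collected `(used_bytes, limit_bytes)` for each scope. The scope names and the input shape are illustrative, not an API response format.

```python
from typing import Dict, Tuple


def binding_quota(quotas: Dict[str, Tuple[int, int]]) -> str:
    """Return the scope whose quota has the least remaining headroom.

    quotas maps a scope name to (used_bytes, limit_bytes).
    """
    return min(quotas, key=lambda scope: quotas[scope][1] - quotas[scope][0])
```

For example, a user at 40 of 100 units whose group is at 95 of 100 is bound by the group quota, even though the user's own usage looks fine.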

Symptom: tenant used_bytes doesn't match reality#

Check: Last usage recalculation: GET /api/v1/quotas/usage/tenant. calculated_at shows when it was last fully recomputed.

Fix: force a recalculation: GET /api/v1/quotas/usage/tenant?recalculate=true. Expensive on large tenants; expect minutes.

Getting help#

When filing a support ticket, include:

  • The request ID (X-Request-Id header or meta.request_id in the response body).
  • The tenant ID and user ID of the caller.
  • A timestamp in UTC.
  • The exact request (method, path, body) and response.

Support tickets with request IDs get triaged in under five minutes. Tickets without them can take days.
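Clients can capture the request ID automatically on every failed call. The sketch below checks the X-Request-Id header first and falls back to meta.request_id in the parsed JSON body. The function name and the plain-mapping inputs are assumptions, so adapt them to your HTTP library.

```python
from typing import Any, Mapping, Optional


def extract_request_id(
    headers: Mapping[str, str],
    body: Mapping[str, Any],
) -> Optional[str]:
    """Return the request ID from the header, else from meta.request_id."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "x-request-id":
            return value
    meta = body.get("meta") or {}
    return meta.get("request_id")
```

Logging this value next to the UTC timestamp of each error gives you most of the ticket contents for free.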

Updated 2026-05-18 15:04:23 (rev 2)