---
title: Troubleshooting
path: operations/troubleshooting
status: published
---

Common failure modes and what to check. Grouped by symptom, not by subsystem.

## Authentication fails

### Symptom: every request returns `AUTH_TOKEN_INVALID`

**Check:** The ScaiKey URL configured in ScaiDrive (`SCAIDRIVE_SCAIKEY_URL`) must match the issuer in the JWT (`iss` claim).

Common cause: deploying across environments. A token issued by `scaikey-staging.example.com` carries the staging issuer in its `iss` claim, so a ScaiDrive instance configured for `scaikey.example.com` rejects it.

Fix: align issuer and URL, or set `SCAIDRIVE_JWT_ISSUER` explicitly.
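To confirm an issuer mismatch quickly, you can read the `iss` claim straight out of a failing token. The sketch below is a diagnostic aid only: it decodes the payload without verifying the signature, and the helper names (`unverified_claims`, `issuer_matches`) are illustrative, not part of ScaiDrive. It assumes the comparison is an exact string match, so a stray trailing slash counts as a mismatch.

```python
import base64
import json


def unverified_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (diagnostics only)."""
    payload_b64 = token.split(".")[1]
    # JWT segments are unpadded base64url; restore the padding before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def issuer_matches(token: str, expected_issuer: str) -> bool:
    """Return True if the token's iss claim equals the expected issuer exactly."""
    return unverified_claims(token).get("iss") == expected_issuer
```

Pass the value you have configured as the expected issuer; if this returns `False`, compare the two strings character by character before changing any configuration.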

### Symptom: token is valid but `AUTHZ_USER_SUSPENDED`

**Check:** `SELECT status, is_active FROM users WHERE id = ?`. If `is_active = false`, the user was deactivated — via ScaiKey webhook, via admin API, or manually.

Fix: re-activate via `POST /api/v1/admin/users/{user_id}/activate` or fix the underlying ScaiKey state.

### Symptom: valid token randomly fails

**Check:** JWKS cache. ScaiDrive caches ScaiKey's JWKS for 5 minutes. If ScaiKey rotated a signing key recently, tokens signed with the new key may be rejected until the cached key set expires and the JWKS is fetched again.

Fix: transient; resolves within 5 minutes. For immediate resolution, restart the API pod (flushes the cache).
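One way to tell a stale cache from a genuinely bad token is to check whether the token's `kid` header appears in the key set ScaiKey currently publishes. The sketch below assumes you have already fetched and parsed that JWKS document yourself (its URL is not covered here) and that it follows the standard `{"keys": [...]}` layout; the function names are illustrative.

```python
import base64
import json
from typing import Optional


def token_kid(token: str) -> Optional[str]:
    """Read the key ID from a JWT header without verifying the token."""
    header_b64 = token.split(".")[0]
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded)).get("kid")


def kid_in_jwks(token: str, jwks: dict) -> bool:
    """Return True if the token's kid is present in the given JWKS document."""
    kid = token_kid(token)
    if kid is None:
        return False
    return any(key.get("kid") == kid for key in jwks.get("keys", []))
```

If the `kid` is present in the freshly fetched JWKS but requests still fail, a stale cache is the likely explanation and waiting or restarting should clear it.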

## Uploads fail

### Symptom: `PAYLOAD_TOO_LARGE`

**Check:** Tenant settings and reverse proxy config.

- Tenant: `GET /api/v1/admin/tenant` → `settings.max_file_size_bytes`.
- Proxy: nginx's `client_max_body_size`, cloud LB's request-body limits.

A mismatch is common: tenant allows 5 GB but nginx rejects at 1 GB.

Fix: raise the proxy limit to at least the tenant limit. For files >1 GB, use resumable uploads — they don't stream the whole body through the proxy in one request.
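Comparing the two limits is easier once both are in bytes. The following is a minimal sketch that converts an nginx-style size string and checks it against `settings.max_file_size_bytes`. It assumes the `k`, `m`, and `g` suffixes are binary multiples and that a value of `0` disables the nginx check; the function names are illustrative.

```python
def nginx_size_to_bytes(value: str) -> int:
    """Convert an nginx size such as '512k', '100m' or '1g' to bytes."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    text = value.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)  # no suffix: plain bytes


def proxy_blocks_tenant_limit(client_max_body_size: str, max_file_size_bytes: int) -> bool:
    """Return True if the proxy rejects bodies that the tenant setting would allow."""
    proxy_limit = nginx_size_to_bytes(client_max_body_size)
    # 0 means nginx does not check the request body size at all.
    return proxy_limit != 0 and proxy_limit < max_file_size_bytes
```

For the mismatch described above, `proxy_blocks_tenant_limit("1g", 5 * 1024**3)` returns `True`.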

### Symptom: chunks upload successfully, finalize fails with `CHECKSUM_MISMATCH`

**Check:** Whether the client supplied `checksum_sha256` on session creation, and whether the concatenation of all chunks actually matches.

Common cause: client computed hash over the file, but wrote one chunk with a trailing `\n` that wasn't in the original.

Fix: re-compute the full-file hash client-side from the same bytes being uploaded. If the checksums still mismatch, the bytes being read differ from the bytes being hashed — usually a file-open mode issue (text vs binary on Windows).
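A minimal sketch of hashing a file the safe way: open it in binary mode and feed it to SHA-256 in blocks, so no newline translation can change the bytes. The function name and block size are illustrative choices, not ScaiDrive requirements.

```python
import hashlib


def sha256_of_file(path: str, block_size: int = 8 * 1024 * 1024) -> str:
    """Hash a file's exact bytes; 'rb' prevents newline translation on Windows."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

The resulting hex digest is what you would compare against the `checksum_sha256` supplied at session creation, provided the chunks are read from the same file in the same binary mode.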

### Symptom: upload session not found immediately after creation

**Check:** Redis session cache and DB. Sessions are written to both.

Common cause: the client is talking to one ScaiDrive instance for the POST and a different instance for the PUT, and Redis is not shared.

Fix: ensure all API replicas share a single Redis.

## Sync doesn't catch up

### Symptom: client's cursor hasn't moved in hours

**Check:** `GET /api/v1/sync/conflicts?device_id=<id>&include_resolved=false`. An unresolved `MANUAL` conflict blocks sync on that resource, but usually not the whole stream.

More common cause: the client isn't calling `POST /api/v1/sync/cursor` after consuming changes. The server never advances cursors on its own.

Fix: correct the client so it calls `POST /api/v1/sync/cursor` after consuming each batch of changes. As a server-side workaround for a stuck device, manually advance its cursor via the cursor endpoint.

### Symptom: `SYNC_CURSOR_INVALID` on every pull

**Check:** How old is the cursor? The change log has a per-tenant retention (default 90 days). Cursors older than that are rejected.

Fix: start the client from `cursor=0`, which triggers a full resync. For tenants with massive change volumes, where a full resync is expensive, consider lengthening changelog retention or increasing client reconnection frequency so cursors don't age out.

### Symptom: WebSocket connects then disconnects immediately

**Check:** Token in the WebSocket URL. Most WebSocket libraries don't expose the response body on handshake failure — the close code is your only signal.

Close codes:

- `4401` — auth failure (invalid or expired token)
- `4403` — authorization (user suspended, no share access)
- `4429` — connection limit hit
- `1006` — network issue (not ScaiDrive)
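Because the close code is the only signal, it can help to translate it into a hint in client logs. The sketch below simply encodes the table above; the dictionary and function names are illustrative.

```python
# Close codes and their meanings, taken from the list above.
CLOSE_CODE_HINTS = {
    4401: "auth failure: token is invalid or expired",
    4403: "authorization: user suspended or no share access",
    4429: "connection limit hit",
    1006: "network issue, not ScaiDrive",
}


def explain_close_code(code: int) -> str:
    """Return a troubleshooting hint for a WebSocket close code."""
    return CLOSE_CODE_HINTS.get(code, f"unrecognized close code {code}")
```

How you obtain the close code depends on your WebSocket library; most expose it on the close event or on the exception raised when the connection drops.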

## Search returns nothing

### Symptom: keyword search misses obvious matches

**Check:** Is the content actually indexed? `GET /api/v1/search/index-status/{file_id}`.

Cause: the file may not have text extractable (image, scanned PDF without OCR, encrypted ZIP). ScaiDrive doesn't OCR by default.

Fix: enable OCR at the vectorization-provider level, or fall back to searching on the filename only.

### Symptom: semantic search returns `SERVICE_UNAVAILABLE`

**Check:** `GET /api/v1/search/health`. If `weaviate_connected: false` or `embedding_service_available: false`, the subsystem is down.

Fixes:

- Weaviate unreachable: check network, Weaviate pod, `SCAIDRIVE_WEAVIATE_URL`.
- Embedding provider down: if using OpenAI/Cohere/etc., that provider is probably having issues. Swap to a secondary provider via policy.

### Symptom: search results are stale

**Check:** Vectorization queue depth: `GET /api/v1/search/queue`.

Cause: workers are behind. Uploaded files take minutes to index — normal, not a bug. If the queue is >1000 and growing, workers are underscaled.

Fix: scale worker replicas. Check `scaidrive_queue_depth{queue="high"}` metric.
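The "above 1000 and growing" rule can be applied to a short series of queue-depth samples. This is a rough sketch: it treats "growing" as the newest sample exceeding the oldest, which is an assumption, and how you collect the samples (polling the queue endpoint or the metric) is left to you.

```python
from typing import Sequence


def workers_underscaled(depth_samples: Sequence[int], threshold: int = 1000) -> bool:
    """Return True if the queue is above the threshold and deeper than at the first sample."""
    if len(depth_samples) < 2:
        return False  # a single sample cannot show a trend
    latest, earliest = depth_samples[-1], depth_samples[0]
    return latest > threshold and latest > earliest
```

A deep but draining queue, such as `[3000, 2000, 1500]`, returns `False`: the workers are catching up on their own.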

## Performance

### Symptom: API latency spikes

**Check:** Database slow query log. Large `GET /children` calls on massive folders (>50k items) are the most common cause.

Fix: paginate client-side. The endpoint supports pagination; some client integrations don't use it.

### Symptom: high memory on API pods

**Check:** Active WebSocket connections (`scaidrive_sync_websocket_connections`). Each connection holds ~8 KB plus a small per-share subscription set.

Cause: too many connections on too few pods.

Fix: scale out. With 5k connections per pod, a typical 3-pod deployment comfortably handles 15k — but 50k needs 10 pods.
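The sizing arithmetic above can be written down as a small helper. It uses the 5k-connections-per-pod and ~8 KB-per-connection figures from this section as defaults; the memory estimate ignores the per-share subscription sets, so treat it as a lower bound.

```python
import math


def pods_needed(connections: int, per_pod: int = 5_000) -> int:
    """Smallest pod count that keeps each pod at or below per_pod connections."""
    return max(1, math.ceil(connections / per_pod))


def baseline_connection_memory_bytes(connections: int, per_connection_kb: int = 8) -> int:
    """Approximate memory held by connections alone, excluding subscription sets."""
    return connections * per_connection_kb * 1024
```

For example, `pods_needed(50_000)` returns `10`, matching the figure above.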

### Symptom: slow upload throughput

**Check:**

- Client parallelism — are chunks uploading serially?
- Network path — upload from a test client inside the same network; if that's fast, the bottleneck is upstream.
- S3 throughput — `aws s3 cp` a test file to the bucket directly and measure.
- Dedup hit rate — `scaidrive_chunks_deduplicated_total`. A low hit rate means most uploads are going to S3.

Common fix: increase parallelism on the client to 4–8 chunks.
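A minimal sketch of parallel chunk upload using a thread pool. The `upload_chunk` callable is a placeholder you supply (for instance, a function that PUTs one chunk to the upload session); its signature is an assumption made for this example, and the sketch holds all chunks in memory for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def upload_chunks_parallel(
    chunks: Sequence[bytes],
    upload_chunk: Callable[[int, bytes], None],
    max_workers: int = 4,
) -> None:
    """Upload chunks concurrently; raises the first error met, in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_chunk, index, data)
            for index, data in enumerate(chunks)
        ]
        for future in futures:
            future.result()  # re-raises any exception from the worker thread
```

Start with `max_workers=4` and raise it toward 8 while watching throughput; beyond that the network or storage backend is usually the limit rather than the client.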

## Connectors

### Symptom: SMB connector never completes initial sync

**Check:** Job logs: `GET /api/v1/smb-connectors/{id}/jobs`. Look at the most recent job's `error_count` and stderr.

Common causes:

- Locked files (Office temp files starting with `~$`). Add to `exclude_patterns`.
- Case-sensitive vs case-insensitive filesystems producing false conflicts. Check `conflict_resolution`.
- SMB timeouts on very large directories. Increase per-directory timeout via connector settings.
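Before adding an exclusion for Office lock files, you can check which filenames a pattern would catch. The sketch below assumes `exclude_patterns` uses shell-style globs, which this page does not confirm; verify the connector's actual pattern syntax before relying on it.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def is_excluded(filename: str, exclude_patterns: Sequence[str]) -> bool:
    """Return True if the filename matches any glob pattern (case-sensitive)."""
    return any(fnmatchcase(filename, pattern) for pattern in exclude_patterns)
```

With the pattern `~$*`, `is_excluded("~$Budget.xlsx", ["~$*"])` is `True` while the real `Budget.xlsx` is left alone.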

### Symptom: SharePoint connector auth suddenly fails

**Check:** Azure app secret expiry. Azure client secrets have a max 2-year lifetime.

Fix: rotate the secret in Azure, update via `PATCH /api/v1/sharepoint-connectors/{id}` with the new `azure_client_secret`.

## Quotas

### Symptom: user hitting `QUOTA_EXCEEDED` but dashboard shows they're under

**Check:** Which quota is failing — the error response's `details` names it. Common case: group quota or share quota is tighter than user quota.

Fix: identify the binding quota, adjust or exempt.

### Symptom: tenant `used_bytes` doesn't match reality

**Check:** Last usage recalculation: `GET /api/v1/quotas/usage/tenant`. `calculated_at` shows when it was last fully recomputed.

Fix: force a recalculation: `GET /api/v1/quotas/usage/tenant?recalculate=true`. Expensive on large tenants; expect minutes.

## Getting help

When filing a support ticket, include:

- The **request ID** (`X-Request-Id` header or `meta.request_id` in the response body).
- The tenant ID and user ID of the caller.
- A timestamp in UTC.
- The exact request (method, path, body) and response.

Support tickets with request IDs get triaged in under five minutes. Tickets without them can take days.

## What's next

- [Errors](/docs/scaidrive/core-concepts/errors)
- [Error Codes Reference](/docs/scaidrive/reference/error-codes)
- [Health and Monitoring](/docs/scaidrive/operations/health-and-monitoring)