Troubleshooting
A short list of things that go wrong and how to fix them. If none of these match, check the request id in the response envelope and grep the ScaiGrid logs.
Document upload returns 201 but never reaches indexed
Check GET /collections/{id}/documents/{doc_id}. The status field tells you where it's stuck.
- Stuck at pending: the arq worker isn't running, or its queue is backed up. Check the worker process and the arq:queue Redis length.
- processing -> failed with an extractor error: the file is probably password-protected, corrupted, or an unsupported variant. Check error_message.
- Stuck at embedding: the configured embedding model is returning errors or its provider is rate-limiting. Look in the inference logs for the request id.
- Stuck at graph_extracting: the chat model behind graph_extraction_model is failing. Disable graph extraction on the collection while you investigate (the doc itself stays indexed), then POST /graph/re-extract once it's fixed.
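The status checks above can be condensed into a small lookup. This is a minimal sketch, assuming the document resource comes back as a JSON object with the status and error_message fields described above. The function name diagnose and the hint wording are illustrative and not part of the ScaiGrid API.

```python
from typing import Optional

# Hints keyed by the stuck states named above. The wording is illustrative.
STUCK_HINTS = {
    "pending": "arq worker not running or queue backed up; check the worker and the arq:queue length in Redis",
    "embedding": "embedding model erroring or rate-limited; search the inference logs for the request id",
    "graph_extracting": "chat model behind graph_extraction_model is failing; disable graph extraction, then re-extract",
}


def diagnose(doc: dict) -> Optional[str]:
    """Return a troubleshooting hint for a document record, or None once it is indexed."""
    status = doc.get("status")
    if status == "indexed":
        return None
    if status == "failed":
        # The extractor error text is the most useful detail for a failed document.
        detail = doc.get("error_message") or "no error_message recorded"
        return f"extraction failed: {detail}"
    return STUCK_HINTS.get(status, f"unrecognised status {status!r}; check the logs by request id")
```

Feed it the parsed body of GET /collections/{id}/documents/{doc_id} from whatever HTTP client you already use.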
Search returns nothing
In order of likelihood:
- No documents indexed yet. GET /collections/{id}/documents and confirm at least one is indexed.
- Wrong embedding model on the query path. The collection's embedding_model is used for both index and query; if you changed it after indexing, the new query embeddings live in a different geometry. Fork the collection (POST /collections/{id}/fork with the new model) and re-ingest.
- ACL denied every result. Run GET /permissions/collection/{id}/effective as the same user and confirm can.read: true. If it's false, you're getting an empty page from the chokepoint by design.
- min_score too high. Drop it to 0.0 first to see whether anything matches at all, then tighten.
- Vector backend unavailable. The route handler degrades to no-results when Weaviate is unreachable. Check /health/detailed.
A search hit references a document the user "shouldn't see"
That should never happen — every search response is filtered by the chokepoint. If you can reproduce it:
- Re-run the search and grab the request id.
- Call GET /permissions/document/{doc_id}/effective for the same user.
- If can.read: true, the ACL is permitting the read. Check for an unexpected allow ACE, an inherited collection-level allow, or owner status.
- If can.read: false, file a bug. That's the chokepoint failing and we want to know.
"I'm seeing 49 results but total says 67"
That's the gap between unfiltered total and ACL-filtered visible_count on list endpoints. Documents you can't read are dropped silently. Use visible_count for the "real" count.
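The arithmetic behind that gap is a single subtraction. This is a minimal sketch, assuming total and visible_count sit at the top level of the list response. That placement, and the helper name hidden_count, are assumptions.

```python
def hidden_count(page: dict) -> int:
    """Number of documents the ACL filter dropped from a list response."""
    return page["total"] - page["visible_count"]


# The numbers from the heading above: 67 in total, 49 visible to this user.
page = {"total": 67, "visible_count": 49}
print(hidden_count(page))  # 18
```

A non-zero result means some documents are invisible to the caller. It does not indicate a paging bug.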
Crawl finishes with 0 pages crawled
- robots.txt blocked it. ScaiMatrix respects robots; check the seed URL's robots file.
- Seed URL is broken or redirects off-domain with follow_external: false. Try the URL in a browser.
- All pages were over max_total_bytes. Tighten the per-page budget or raise the cap.
- Single-page apps. The crawler doesn't run JavaScript. If your docs site needs JS to render content, server-side render or expose a sitemap of pre-rendered pages.
"Crawl already running" 409
A crawl job for the same collection is in flight. Either wait for it (GET /collections/{id}/crawl/{job_id}), cancel it (DELETE on the same path), or accept that only one crawl per collection at a time is supported.
Webhook-triggered crawl returns 401
- Missing X-Crawl-Signature or X-Crawl-Timestamp: add both headers.
- Body changed between signing and POST: sign the exact bytes you POST. Common culprit: framework JSON serialisation differing from what you signed.
- Clock skew: the verifier rejects timestamps too far from the server's now. Sync your CI's clock.
- Wrong secret: if you regenerated the webhook config without saving the new webhook_secret, the only fix is to re-create the config and copy the secret again.
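The "sign the exact bytes you POST" rule is easiest to follow if the body is serialised once and the same bytes are used for both the signature and the request. The sketch below illustrates that pattern only. The signing scheme shown (hex HMAC-SHA256 over the timestamp, a dot, then the body) is an assumption, not the documented verifier behaviour, so check the webhook reference for the real construction. The function name sign_crawl_request is hypothetical.

```python
import hashlib
import hmac
import json
import time
from typing import Dict, Optional, Tuple


def sign_crawl_request(secret: str, payload: dict, now: Optional[float] = None) -> Tuple[Dict[str, str], bytes]:
    """Serialise the payload once and sign those exact bytes.

    ASSUMPTION: the signature is hex HMAC-SHA256 over "<timestamp>.<body>".
    The real scheme may differ; only the serialise-once pattern is the point.
    """
    timestamp = str(int(time.time() if now is None else now))
    # Serialise exactly once. These bytes are both signed and sent.
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    message = timestamp.encode("ascii") + b"." + body
    signature = hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()
    headers = {
        "X-Crawl-Signature": signature,
        "X-Crawl-Timestamp": timestamp,
        "Content-Type": "application/json",
    }
    return headers, body
```

Send the returned body as raw bytes rather than handing the original dict to your HTTP client's JSON helper, which would serialise it a second time and may produce different bytes.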
Graph extraction is slow / expensive
Extraction calls the chat model once per document with a structured prompt. Big PDFs and verbose models add up.
- Drop the model to a smaller variant via graph_extraction_model on the collection. The graph quality stays acceptable for most use cases.
- Pre-filter what you ingest. Extracting from boilerplate (terms of service, navigation pages) wastes tokens.
- If you're crashing into graph_max_nodes / graph_max_edges, raise them or accept that some documents stop contributing once the cap is hit.
"GRAPH_NOT_ENABLED" on a graph endpoint
The collection has graph_enabled: false. Either enable it (PUT /collections/{id} with graph_enabled: true) and wait for the auto-queued re-extract, or stop calling graph endpoints on that collection.
Re-extract / re-chunk is stuck at running
Both jobs expose a monotonic processed counter, which should rise over time. If it hasn't moved in an hour:
- Check the arq worker is alive.
- Inspect error on the status response. Re-extract / re-chunk catch and record exceptions; a recurrent error against the same document blocks the job.
- Last-resort recovery: the operator can clear the status by setting rechunk_status / graph_reextract_status back to idle in the database, then re-running. Don't do this without checking the logs first.
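The "hasn't moved in an hour" test can be automated from periodic readings of the status response. This is a minimal sketch, assuming you collect (unix_time, processed) pairs yourself. The helper name is_stalled and the sampling approach are illustrative and not part of the product.

```python
from typing import List, Tuple


def is_stalled(samples: List[Tuple[float, int]], window_seconds: float = 3600.0) -> bool:
    """True if `processed` has not increased over the last `window_seconds`.

    `samples` is a chronological list of (unix_time, processed) readings.
    Returns False when there is not yet a full window of history.
    """
    if not samples:
        return False
    latest_time, latest_processed = samples[-1]
    # Walk backwards to the most recent reading that is at least one window old.
    for sample_time, processed in reversed(samples):
        if latest_time - sample_time >= window_seconds:
            # The counter is monotonic, so no increase means no progress.
            return processed >= latest_processed
    return False
```

A True result is a cue to work through the checks above. It is not a reason to reset the status field straight away.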
"STORAGE_QUOTA_EXCEEDED" on upload
The collection's storage_quota_bytes is set and the upload would push past it. Raise the quota via PUT /collections/{id} or remove documents you don't need. total_size_bytes on the collection is the running counter.
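A client can run the same comparison before uploading to avoid a wasted request. This is a minimal sketch, assuming an unset quota is represented as None and that the server rejects only uploads that go strictly past the quota. Both are assumptions, and the helper name would_exceed_quota is hypothetical.

```python
from typing import Optional


def would_exceed_quota(total_size_bytes: int, upload_bytes: int,
                       storage_quota_bytes: Optional[int]) -> bool:
    """True if adding upload_bytes would push the collection past its quota."""
    if storage_quota_bytes is None:
        # ASSUMPTION: no quota configured means uploads are never rejected on size.
        return False
    return total_size_bytes + upload_bytes > storage_quota_bytes
```

The server remains the authority, since total_size_bytes can change between your check and the upload.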
"ACL_NOT_FOUND" when reading my own permissions
The resolver returns 404-shaped errors to avoid leaking existence. Confirm:
- the resource id is correct,
- it's in your tenant,
- you have READ_PERMISSIONS on it (otherwise the ACL is invisible to you even when the resource isn't).
GET /permissions/{type}/{id}/effective works without READ_PERMISSIONS and tells you what you can do — start there.
SSE event streams disconnect after ~30 seconds
Some proxies drop idle connections. The server sends :keep-alive comment lines every 15 seconds; if your client is behind a proxy that's stripping comments, set the proxy's idle timeout above 60 seconds or reconnect from the EventSource API (browsers do this automatically).
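If you consume the stream with a hand-rolled client instead of EventSource, the keep-alive lines have to be treated as comments and not as events. In the server-sent events format, any line starting with a colon is a comment. The sketch below is a minimal parser under that rule. It handles only the data field, and the function name iter_sse_data is illustrative.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event, skipping comment lines."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # comment line, e.g. the :keep-alive heartbeat
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
```

A client built this way still needs its own reconnect logic. The arrival time of each comment line is a convenient liveness signal for deciding when to reconnect.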
Forking a collection didn't copy documents
By design — fork copies metadata, config, and ACLs (with copy_acls: true). Documents stay in the source so it keeps serving traffic. Re-ingest into the fork separately.
Two-document near-duplicate citations
The same source was indexed twice, which is easy to do on re-uploads. Either delete one of the duplicates or use metadata filters in your search to scope to the canonical version.