Troubleshooting

A short list of things that go wrong and how to fix them. If none of these match, take the request id from the response envelope and grep the ScaiGrid logs for it.

Document upload returns 201 but never reaches `indexed`

Check GET /collections/{id}/documents/{doc_id}. The status field tells you where it's stuck.

Stuck at pending: the arq worker isn't running, or its queue is backed up. Check the worker process and the arq:queue Redis length.
Moved from processing to failed with an extractor error: the file is probably password-protected, corrupted, or an unsupported variant. Check error_message.
Stuck at embedding: the configured embedding model is returning errors or its provider is rate-limiting. Look in the inference logs for the request id.
Stuck at graph_extracting: the chat model behind graph_extraction_model is failing. Disable graph extraction on the collection while you investigate (the document itself stays indexed), then POST /graph/re-extract once it's fixed.
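The triage above can be sketched as a small lookup. This is only an illustration: it assumes the document endpoint returns a JSON object with `status` and `error_message` keys, and `triage` is a hypothetical helper, not part of any client library.

```python
# Hints paraphrased from the list above, keyed by the status values it names.
STUCK_HINTS = {
    "pending": "arq worker not running or queue backed up; check the worker and the arq:queue length",
    "embedding": "embedding model erroring or rate-limited; search the inference logs for the request id",
    "graph_extracting": "graph_extraction_model failing; disable graph extraction, fix, then re-extract",
}


def triage(doc: dict) -> str:
    """Return a next-step hint for a document status payload (assumed shape)."""
    status = doc.get("status")
    if status == "indexed":
        return "indexed: nothing to fix"
    if status == "failed":
        # error_message carries the extractor's explanation, when one was recorded.
        detail = doc.get("error_message") or "no error_message recorded"
        return f"failed: {detail}"
    return STUCK_HINTS.get(status, f"unrecognised status {status!r}: look up the request id in the logs")
```

Feed it the parsed body of GET /collections/{id}/documents/{doc_id}; anything it does not recognise falls back to the request-id route.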

Search returns nothing

In order of likelihood:

No documents indexed yet. GET /collections/{id}/documents and confirm at least one is indexed.
Wrong embedding model on the query path. The collection's embedding_model is used for both indexing and querying. If you changed it after indexing, new query embeddings come from a different vector space than the stored ones and will not match them. Fork the collection (POST /collections/{id}/fork with the new model) and re-ingest.
ACL denied every result. Run GET /permissions/collection/{id}/effective as the same user and confirm can.read: true. If it's false, you're getting an empty page from the chokepoint by design.
min_score too high. Drop it to 0.0 first to see whether anything matches at all, then tighten.
Vector backend unavailable. The route handler degrades to no-results when Weaviate is unreachable. Check /health/detailed.
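For the min_score step, one way to pick a threshold is to run the query once at 0.0, keep the returned scores, and count how many hits each candidate threshold would let through. The sketch below assumes higher scores are better and that the cutoff is inclusive; neither is stated above, and the scores are made up.

```python
def count_survivors(scores, thresholds):
    """Map each candidate min_score to the number of hits that would pass it."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Made-up scores, as if returned by a query with min_score set to 0.0.
scores = [0.82, 0.61, 0.44, 0.19]
print(count_survivors(scores, [0.0, 0.5, 0.9]))
# prints {0.0: 4, 0.5: 2, 0.9: 0}
```

A threshold that maps to zero survivors explains an empty result page on its own, before any of the other causes come into play.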

A search hit references a document the user "shouldn't see"

That should never happen — every search response is filtered by the chokepoint. If you can reproduce it:

Re-run the search and grab the request id.
Call GET /permissions/document/{doc_id}/effective for the same user.
If can.read: true, the ACL is permitting the read — check for an unexpected allow ACE, an inherited collection-level allow, or owner status.
If can.read: false, file a bug — that's the chokepoint failing and we want to know.

"I'm seeing 49 results but `total` says 67"#

That's the gap between unfiltered total and ACL-filtered visible_count on list endpoints. Documents you can't read are dropped silently. Use visible_count for the "real" count.
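As a quick arithmetic check, the number of documents hidden by the ACL filter is the difference between the two fields. The sketch assumes both appear as top-level keys of the list response; `hidden_count` is a hypothetical helper.

```python
def hidden_count(page: dict) -> int:
    """Documents that exist but were dropped by ACL filtering (assumed field layout)."""
    return page["total"] - page["visible_count"]


# The numbers from the heading: 67 total, 49 visible.
print(hidden_count({"total": 67, "visible_count": 49}))  # prints 18
```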

Crawl finishes with 0 pages crawled

robots.txt blocked it. ScaiMatrix respects robots.txt; check the robots.txt file on the seed URL's host.
Seed URL is broken or redirects off-domain with follow_external: false. Try the URL in a browser.
All pages were over max_total_bytes. Tighten the per-page budget or raise the cap.
Single-page apps. The crawler doesn't run JavaScript. If your docs site needs JS to render content, server-side render or expose a sitemap of pre-rendered pages.
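The robots.txt cause can be checked locally with Python's standard `urllib.robotparser`. This is a sketch only: the user agent string the crawler sends is not given here, so the wildcard agent is an assumption, and `allowed_by_robots` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example robots.txt that blocks everything under /docs/ for all agents.
robots = "User-agent: *\nDisallow: /docs/\n"
print(allowed_by_robots(robots, "https://example.com/docs/intro"))  # prints False
print(allowed_by_robots(robots, "https://example.com/blog/"))       # prints True
```

Paste the real robots.txt of the seed host in place of the example text; a False for the seed URL is enough to explain a crawl that fetched nothing.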

"Crawl already running" 409#

A crawl job for the same collection is in flight. Either wait for it (GET /collections/{id}/crawl/{job_id}), cancel it (DELETE on the same path), or accept that only one crawl per collection at a time is supported.

Webhook-triggered crawl returns 401

Missing X-Crawl-Signature or X-Crawl-Timestamp — add both headers.
Body changed between signing and POST — sign the exact bytes you POST. Common culprit: framework JSON serialisation differing from what you signed.
Clock skew — the verifier rejects timestamps too far from the server's now. Sync your CI's clock.
Wrong secret — if you regenerated the webhook config without saving the new webhook_secret, the only fix is to re-create the config and copy the secret again.
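The "sign the exact bytes you POST" rule is easiest to follow if the body is serialised once and the same bytes are both signed and sent. The sketch below illustrates that pattern only. The signing scheme is an assumption (HMAC-SHA256 over the timestamp, a dot, and the body, hex-encoded, with a Unix-seconds timestamp); the real scheme is not specified here, so check it before relying on this.

```python
import hashlib
import hmac
import json
import time


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    # ASSUMPTION: HMAC-SHA256 over "<timestamp>.<body>", hex-encoded.
    # The actual scheme behind X-Crawl-Signature may differ.
    message = timestamp.encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def build_request(secret: bytes, payload: dict):
    """Serialise once, then sign and send the very same bytes."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    timestamp = str(int(time.time()))  # assumed format: Unix seconds
    headers = {
        "Content-Type": "application/json",
        "X-Crawl-Timestamp": timestamp,
        "X-Crawl-Signature": sign(secret, timestamp, body),
    }
    return headers, body
```

Pass `body` to your HTTP client as raw bytes rather than handing it the dict again; letting the client re-serialise the payload can change whitespace or key order and invalidate the signature.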

Graph extraction is slow / expensive

Extraction calls the chat model once per document with a structured prompt. Big PDFs and verbose models add up.

Drop the model to a smaller variant via graph_extraction_model on the collection. The graph quality stays acceptable for most use cases.
Pre-filter what you ingest — extracting from boilerplate (terms of service, navigation pages) wastes tokens.
If you're hitting graph_max_nodes/graph_max_edges, raise them or accept that some documents stop contributing once the cap is hit.

"GRAPH_NOT_ENABLED" on a graph endpoint#

The collection has graph_enabled: false. Either enable it (PUT /collections/{id} with graph_enabled: true) and wait for the auto-queued re-extract, or stop calling graph endpoints on that collection.

Re-extract / re-chunk is stuck at `running`

Both jobs report a monotonic processed counter, so processed should rise over time. If it hasn't moved in an hour:

Check the arq worker is alive.
Inspect the error field on the status response. Re-extract and re-chunk catch and record exceptions; a recurring error against the same document blocks the job.
Last-resort recovery: the operator can clear the status by setting rechunk_status / graph_reextract_status back to idle in the database, then re-running. Don't do this without checking the logs first.
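If you poll the status endpoint on a schedule, the "hasn't moved in an hour" test can be automated. A minimal sketch, assuming you record (unix_time, processed) pairs yourself; `is_stalled` is a hypothetical helper and the one-hour default simply mirrors the rule of thumb above.

```python
def is_stalled(samples, max_idle_seconds=3600.0):
    """samples: list of (unix_time, processed) pairs, oldest first.

    Returns True when processed has held its latest value for at least
    max_idle_seconds, measured between the first and last sample showing it.
    """
    if not samples:
        return False
    latest_time, latest_processed = samples[-1]
    last_progress = latest_time
    # Walk backwards to the earliest sample that already showed the latest value.
    for t, processed in reversed(samples):
        if processed != latest_processed:
            break
        last_progress = t
    return latest_time - last_progress >= max_idle_seconds


print(is_stalled([(0, 10), (1800, 12), (3600, 12), (5400, 12)]))  # prints True
print(is_stalled([(0, 10), (1800, 12), (3600, 13)]))              # prints False
```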

"STORAGE_QUOTA_EXCEEDED" on upload#

The collection's storage_quota_bytes is set and the upload would push past it. Raise the quota via PUT /collections/{id} or remove documents you don't need. total_size_bytes on the collection is the running counter.
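A client can run the same check before uploading. The sketch assumes an unset quota shows up as a missing or null `storage_quota_bytes`, and that "push past" means strictly greater than the quota; both are guesses about server behaviour, and `would_exceed_quota` is a hypothetical helper.

```python
def would_exceed_quota(collection: dict, upload_bytes: int) -> bool:
    """Predict STORAGE_QUOTA_EXCEEDED from the collection's own counters."""
    quota = collection.get("storage_quota_bytes")
    if quota is None:
        # ASSUMPTION: no quota configured means no limit is enforced.
        return False
    return collection["total_size_bytes"] + upload_bytes > quota


collection = {"storage_quota_bytes": 1_000_000, "total_size_bytes": 900_000}
print(would_exceed_quota(collection, 200_000))  # prints True
print(would_exceed_quota(collection, 50_000))   # prints False
```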

"ACL_NOT_FOUND" when reading my own permissions#

The resolver returns 404-shaped errors to avoid leaking existence. Confirm:

the resource id is correct,
it's in your tenant,
you have READ_PERMISSIONS on it (otherwise the ACL is invisible to you even when the resource isn't).

GET /permissions/{type}/{id}/effective works without READ_PERMISSIONS and tells you what you can do — start there.

SSE event streams disconnect after ~30 seconds

Some proxies drop idle connections. The server sends :keep-alive comment lines every 15 seconds to keep the stream active. If your client sits behind a proxy that strips those comments, the connection looks idle to the proxy: set the proxy's idle timeout above 60 seconds, or reconnect when the stream drops (the browser EventSource API does this automatically).
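For non-browser clients it can help to count keep-alive comments separately from real events, so you can tell whether they are reaching you at all. In the SSE wire format, a line starting with a colon is a comment and never surfaces as an event. The sketch below is a deliberately small parser: `parse_sse_lines` is a hypothetical helper that handles only `data:` fields and ignores `event:`, `id:`, and `retry:`.

```python
def parse_sse_lines(lines):
    """Split raw SSE lines into event payloads and a count of comment lines."""
    events, data, heartbeats = [], [], 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            heartbeats += 1  # comment line, e.g. ":keep-alive"
        elif line == "":
            # A blank line dispatches the event collected so far, if any.
            if data:
                events.append("\n".join(data))
                data = []
        elif line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is not part of the value.
            data.append(value[1:] if value.startswith(" ") else value)
    return events, heartbeats


stream = [":keep-alive", "", "data: hello", "data: world", "", ":keep-alive", ""]
print(parse_sse_lines(stream))  # prints (['hello\nworld'], 2)
```

A heartbeat count that stays at zero while events still arrive suggests something between you and the server is stripping comments.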

Forking a collection didn't copy documents

By design — fork copies metadata, config, and ACLs (with copy_acls: true). Documents stay in the source so it keeps serving traffic. Re-ingest into the fork separately.

Two-document near-duplicate citations

The same source was probably indexed twice, which is easy to do on re-uploads. Either delete one of the duplicates or use metadata filters in your search to scope to the canonical version.