---
summary: Common symptoms and what they usually mean.
title: Troubleshooting
path: troubleshooting
status: published
---

# Troubleshooting

A short list of things that go wrong and how to fix them. If none of these match, take the request id from the response envelope and grep the ScaiGrid logs for it.

## Document upload returns 201 but never reaches `indexed`

Check `GET /collections/{id}/documents/{doc_id}`. The `status` field tells you where it's stuck.

- **Stuck at `pending`** — the arq worker isn't running, or its queue is backed up. Check the worker process and the `arq:queue` Redis length.
- **`processing` -> `failed` with an extractor error** — the file is probably password-protected, corrupted, or an unsupported variant. Check `error_message`.
- **Stuck at `embedding`** — the configured embedding model is returning errors or its provider is rate-limiting. Look in the inference logs for the request id.
- **Stuck at `graph_extracting`** — the chat model behind `graph_extraction_model` is failing. Disable graph extraction on the collection while you investigate (the doc itself stays indexed), then `POST /graph/re-extract` once it's fixed.
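The status-to-cause mapping above can be sketched as a small lookup. This is a minimal illustration, not the service's own logic: it assumes the document payload has already been unwrapped from the response envelope into a flat dict with `status` and `error_message` keys, and the `diagnose` helper is hypothetical.

```python
# Hints keyed by document status, paraphrased from the list above.
STUCK_HINTS = {
    "pending": "arq worker not running or queue backed up; check the worker and the arq:queue length",
    "embedding": "embedding model erroring or provider rate-limiting; check the inference logs",
    "graph_extracting": "graph_extraction_model failing; disable graph extraction while you investigate",
}


def diagnose(doc: dict) -> str:
    """Return a one-line hint for a document payload (ASSUMED shape: flat dict)."""
    status = doc.get("status")
    if status == "indexed":
        return "ok: document is indexed"
    if status == "failed":
        # error_message carries the extractor error, when one was recorded.
        detail = doc.get("error_message") or "no error_message recorded"
        return f"failed: {detail}"
    return STUCK_HINTS.get(status, f"unrecognised status {status!r}: grep the logs for the request id")


print(diagnose({"status": "failed", "error_message": "encrypted PDF"}))
```

A document that sits in `processing` without failing is not covered by the list above, so the sketch falls through to the generic log hint for it.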

## Search returns nothing

In order of likelihood:

- **No documents indexed yet.** `GET /collections/{id}/documents` and confirm at least one is `indexed`.
- **Wrong embedding model on the query path.** The collection's `embedding_model` is used for both index and query. If you changed it after indexing, the new query embeddings live in a different vector space from the stored ones, so nothing scores as similar. Fork the collection (`POST /collections/{id}/fork` with the new model) and re-ingest.
- **ACL denied every result.** Run `GET /permissions/collection/{id}/effective` as the same user and confirm `can.read: true`. If it's false, you're getting an empty page from the chokepoint by design.
- **`min_score` too high.** Drop it to 0.0 first to see whether anything matches at all, then tighten.
- **Vector backend unavailable.** The route handler degrades to no-results when Weaviate is unreachable. Check `/health/detailed`.
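For the `min_score` check, the "drop to 0.0, then tighten" routine can be sketched as a sweep. The `search` argument is a hypothetical callable you supply (it wraps whatever client you use and returns a list of hits), and the threshold values are illustrative, not defaults from the API.

```python
def highest_passing_min_score(search, query, thresholds=(0.0, 0.25, 0.5, 0.75)):
    """Return the highest threshold that still yields hits, or None if even the lowest yields none.

    `search(query, min_score)` is a caller-supplied function returning a list of hits.
    """
    best = None
    for threshold in sorted(thresholds):
        hits = search(query, threshold)
        if not hits:
            break  # tightening further will not bring results back
        best = threshold
    return best


# Stand-in for a real search call: three hits with fixed scores.
_scores = [0.62, 0.41, 0.18]


def fake_search(query, min_score):
    return [s for s in _scores if s >= min_score]


print(highest_passing_min_score(fake_search, "reset password"))
```

If the sweep returns `None`, `min_score` is not the problem; work through the other items in the list.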

## A search hit references a document the user "shouldn't see"

That should never happen — every search response is filtered by the chokepoint. If you can reproduce it:

1. Re-run the search and grab the request id.
2. Call `GET /permissions/document/{doc_id}/effective` for the same user.
3. If `can.read: true`, the ACL is permitting the read — check for an unexpected allow ACE, an inherited collection-level allow, or owner status.
4. If `can.read: false`, file a bug — that's the chokepoint failing and we want to know.
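Steps 3 and 4 reduce to a branch on `can.read`. A minimal sketch, assuming the effective-permissions payload is a dict shaped like `{"can": {"read": true}}`; the helper name and return labels are hypothetical.

```python
def classify_unexpected_hit(effective: dict) -> str:
    """Decide the next step from an effective-permissions payload (ASSUMED shape)."""
    can_read = effective.get("can", {}).get("read")
    if can_read is True:
        # The ACL permits the read: look for an unexpected allow ACE,
        # an inherited collection-level allow, or owner status.
        return "acl-permits-read"
    if can_read is False:
        # The chokepoint let through something the ACL denies.
        return "file-a-bug"
    return "unknown"  # payload did not carry can.read; re-check the request


print(classify_unexpected_hit({"can": {"read": False}}))
```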

## "I'm seeing 49 results but `total` says 67"

That's the gap between unfiltered `total` and ACL-filtered `visible_count` on list endpoints. Documents you can't read are dropped silently. Use `visible_count` for the "real" count.
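The number of documents hidden from you is just the difference between the two fields. A tiny sketch, assuming both appear on the same list payload; `hidden_count` is a hypothetical helper.

```python
def hidden_count(page: dict) -> int:
    """Documents that exist in the listing but were dropped by the ACL filter."""
    return page["total"] - page["visible_count"]


# The example from the heading: 67 in total, 49 visible to this user.
print(hidden_count({"total": 67, "visible_count": 49}))  # 18
```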

## Crawl finishes with 0 pages crawled

- **`robots.txt` blocked it.** ScaiMatrix respects robots; check the seed URL's robots file.
- **Seed URL is broken or redirects off-domain with `follow_external: false`.** Try the URL in a browser.
- **All pages were over `max_total_bytes`.** Tighten the per-page budget or raise the cap.
- **Single-page apps.** The crawler doesn't run JavaScript. If your docs site needs JS to render content, server-side render or expose a sitemap of pre-rendered pages.

## "Crawl already running" 409

A crawl job for the same collection is already in flight, and only one crawl per collection is supported at a time. Either wait for it (`GET /collections/{id}/crawl/{job_id}`) or cancel it (`DELETE` on the same path).

## Webhook-triggered crawl returns 401

- **Missing `X-Crawl-Signature` or `X-Crawl-Timestamp`** — add both headers.
- **Body changed between signing and POST** — sign the exact bytes you POST. Common culprit: framework JSON serialisation differing from what you signed.
- **Clock skew** — the verifier rejects timestamps too far from the server's `now`. Sync your CI's clock.
- **Wrong secret** — if you regenerated the webhook config without saving the new `webhook_secret`, the only fix is to re-create the config and copy the secret again.
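The "sign the exact bytes you POST" rule is easiest to follow if you serialise once and reuse the result. The sketch below is an illustration only. It assumes HMAC-SHA256 over `<timestamp>.<body>`, hex-encoded, with a Unix-seconds timestamp; none of that is confirmed here, so check the webhook reference for the actual scheme. The payload and secret are made up.

```python
import hashlib
import hmac
import json
import time


def sign_crawl_webhook(secret: str, body: bytes, timestamp: int) -> dict:
    """Build the two signature headers for a webhook POST.

    ASSUMPTION: the signed message is b"<timestamp>." + body and the digest
    is hex-encoded HMAC-SHA256. Adjust to the real scheme.
    """
    message = str(timestamp).encode("ascii") + b"." + body
    digest = hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()
    return {
        "X-Crawl-Signature": digest,
        "X-Crawl-Timestamp": str(timestamp),
    }


# Serialise once, then sign and send these exact bytes.
payload = {"reason": "docs-deploy"}  # hypothetical payload
body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
headers = sign_crawl_webhook("example-secret", body, int(time.time()))
```

When you send the request, pass `body` as the raw request body rather than handing `payload` to your HTTP client's JSON helper. The helper may serialise with different spacing or key order, which changes the bytes and invalidates the signature.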

## Graph extraction is slow / expensive

Extraction calls the chat model once per document with a structured prompt. Big PDFs and verbose models add up.

- Drop the model to a smaller variant via `graph_extraction_model` on the collection. The graph quality stays acceptable for most use cases.
- Pre-filter what you ingest — extracting from boilerplate (terms of service, navigation pages) wastes tokens.
- If you're hitting `graph_max_nodes`/`graph_max_edges`, raise them or accept that some documents stop contributing once the cap is hit.

## "GRAPH_NOT_ENABLED" on a graph endpoint

The collection has `graph_enabled: false`. Either enable it (`PUT /collections/{id}` with `graph_enabled: true`) and wait for the auto-queued re-extract, or stop calling graph endpoints on that collection.

## Re-extract / re-chunk is stuck at `running`

Both jobs report a monotonic `processed` counter that should rise over time. If it hasn't moved in an hour:

- Check the arq worker is alive.
- Inspect `error` on the status response. Re-extract and re-chunk catch and record exceptions; a recurring error against the same document blocks the job.
- Last-resort recovery: the operator can clear the status by setting `rechunk_status` / `graph_reextract_status` back to `idle` in the database, then re-running. Don't do this without checking the logs first.
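The "hasn't moved in an hour" test can be automated if you sample the status endpoint periodically. A minimal sketch, assuming you collect `(unix_time, processed)` pairs yourself; `is_stalled` is a hypothetical helper, not part of the API.

```python
def is_stalled(samples, window_seconds=3600):
    """True if `processed` has held its latest value for at least `window_seconds`.

    `samples` is a list of (unix_time, processed) tuples, oldest first.
    """
    if not samples:
        return False
    latest_time, latest_processed = samples[-1]
    unchanged_since = latest_time
    # Walk backwards to find when the counter first reached its current value.
    for sample_time, processed in reversed(samples):
        if processed != latest_processed:
            break
        unchanged_since = sample_time
    return latest_time - unchanged_since >= window_seconds


print(is_stalled([(0, 10), (1800, 10), (3600, 10)]))  # True
print(is_stalled([(0, 10), (1800, 11), (3600, 11)]))  # False
```

A `True` here is the cue to start the checks above, not to reset the status.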

## "STORAGE_QUOTA_EXCEEDED" on upload

The collection's `storage_quota_bytes` is set and the upload would push past it. Raise the quota via `PUT /collections/{id}` or remove documents you don't need. `total_size_bytes` on the collection is the running counter.
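You can run the same arithmetic client-side before uploading. A minimal sketch, assuming the collection payload is a dict carrying both fields and that an unset quota comes back as `null` or is absent; whether an upload landing exactly on the quota is accepted is also an assumption.

```python
def would_exceed_quota(collection: dict, upload_bytes: int) -> bool:
    """Predict whether an upload would push the collection past its quota."""
    quota = collection.get("storage_quota_bytes")
    if quota is None:
        return False  # no quota configured, nothing to exceed
    used = collection.get("total_size_bytes", 0)
    # ASSUMPTION: landing exactly on the quota is allowed; only going past it fails.
    return used + upload_bytes > quota


collection = {"storage_quota_bytes": 1_000_000, "total_size_bytes": 900_000}
print(would_exceed_quota(collection, 150_000))  # True
```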

## "ACL_NOT_FOUND" when reading my own permissions

The resolver returns 404-shaped errors to avoid leaking existence. Confirm:

- the resource id is correct,
- it's in your tenant,
- you have `READ_PERMISSIONS` on it (otherwise the ACL is invisible to you even when the resource isn't).

`GET /permissions/{type}/{id}/effective` works without `READ_PERMISSIONS` and tells you what you *can* do — start there.

## SSE event streams disconnect after ~30 seconds

Some proxies drop idle connections. The server sends `:keep-alive` comment lines every 15 seconds, which normally keeps the connection active. If your client is behind a proxy that strips comments, either raise the proxy's idle timeout above 60 seconds or reconnect when the stream drops (the browser EventSource API does this automatically).

## Forking a collection didn't copy documents

This is by design: a fork copies metadata and config, plus ACLs when `copy_acls: true` is set. Documents stay in the source so it keeps serving traffic. Re-ingest into the fork separately.

## Two-document near-duplicate citations

You have most likely indexed the same source twice, which is easy to do on re-uploads. Either delete one of the duplicates or use `metadata` filters in your search to scope to the canonical version.
