---
summary: How collections, the ingestion pipeline, the vector and graph stores, and
  the ACL chokepoint fit together.
title: Architecture
path: concepts/architecture
status: published
---

# Architecture

ScaiMatrix is a ScaiGrid module: it runs in the same FastAPI process, behind the same auth, against the same MariaDB. The only external services it adds are a vector store (Weaviate) and a graph store (Neo4j). Both are optional in the sense that when either is unavailable, the module degrades gracefully instead of returning 5xx errors.

## Components

```mermaid
flowchart LR
    App["App<br/>Caller"]
    subgraph SG ["ScaiGrid"]
        Routes["/v1/modules/scaimatrix<br/>/collections/..."]
        Core["Routes<br/>Services<br/>AclResolver<br/>SearchService<br/>GraphService"]
        Maria["MariaDB<br/>collections,<br/>documents, ACLs"]
        Worker["arq worker pool<br/>ingest, crawl,<br/>reextract, rechunk"]
    end
    SD["ScaiDrive<br/>S3"]
    Wv["Weaviate<br/>vectors"]
    Neo["Neo4j<br/>optional"]
    Inf["Inference<br/>embedding model"]
    App -- "upload doc" --> Routes
    App -- search --> Routes
    Routes --> Core
    Core --> Maria
    Routes <-- blob --> SD
    Core <-- vectors --> Wv
    Core <-- graph --> Neo
    Worker <-- embed --> Inf
    Core -. "ACL-gated response" .-> App
```

There's no separate ScaiMatrix deployment. Routes mount under the module registry; the arq worker pool runs ingestion as background tasks.

## Request flow: search

1. **HTTP** -> `POST /collections/{id}/search` -> auth + module permission check.
2. **Collection load** -> `CollectionService.get_for_tenant` (tenant scoped).
3. **Access check** -> `CollectionAccessService.require_access(user, collection, "read")`. Fails closed.
4. **Embed query** -> `InferenceService.embed` with the collection's `embedding_model`. The call is metered to the caller.
5. **Vector store query** -> Weaviate `near_vector` against the collection's class, scoped by tenant.
6. **ACL chokepoint** -> `filter_results_by_acl(session, user, candidates)`. Every result with `document_id` is evaluated by `AclResolver.can(user, ref_for_document(doc), Permission.READ)`. Denied rows are dropped before serialization.
7. **Response** assembled with `success(...)` — no chunk leaks into counts or metadata if its document was denied.
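
Steps 6 and 7 can be illustrated with a minimal, self-contained sketch. It is not the module's code: `Candidate`, `filter_candidates`, and `build_response` are hypothetical names, `can_read` stands in for the `AclResolver.can(...)` call, and the response shape is invented for illustration. Passing rows without a `document_id` through unfiltered, and caching one decision per document, are assumptions as well.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical shape of one vector-store hit."""
    chunk_id: str
    document_id: Optional[str]
    score: float


def filter_candidates(
    candidates: list[Candidate],
    can_read: Callable[[str], bool],
) -> list[Candidate]:
    """Drop every candidate whose parent document the caller cannot read.

    `can_read(document_id)` is a stand-in for the resolver check.
    """
    decisions: dict[str, bool] = {}  # one resolver call per document
    kept: list[Candidate] = []
    for cand in candidates:
        if cand.document_id is None:
            # Assumption: rows without a document_id are not document-scoped.
            kept.append(cand)
            continue
        if cand.document_id not in decisions:
            decisions[cand.document_id] = can_read(cand.document_id)
        if decisions[cand.document_id]:
            kept.append(cand)
    return kept


def build_response(
    candidates: list[Candidate],
    can_read: Callable[[str], bool],
) -> dict:
    """Filter first, then derive counts, so denied chunks never affect totals."""
    visible = filter_candidates(candidates, can_read)
    return {"results": [c.chunk_id for c in visible], "total": len(visible)}
```

The point of the ordering is that `total` is computed from the filtered list, so a denied document cannot be inferred from a count mismatch.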

## Request flow: ingestion

1. **HTTP** -> `POST /collections/{id}/documents` (multipart). Tenant + collection write check.
2. **Blob write** -> file goes to S3 via the document-store client; a row is inserted into `mod_scaimatrix_documents` with `status: pending`.
3. **Enqueue** -> `ingest_document` job pushed to arq.
4. **Worker** picks up the job:
   - `processing` — extract text per content type (PDF, DOCX, HTML, Markdown, plain text, source code).
   - `chunking` — split per collection's `chunking_strategy` (`fixed`, `paragraph`, `semantic`, `markdown`, `code`) at `chunk_size` with `chunk_overlap`.
   - `embedding` — call the embedding model in batches; write vectors to Weaviate.
   - `graph_extracting` — if `graph_enabled`, prompt `graph_extraction_model` to emit nodes + edges, dedupe against existing graph, write to Neo4j.
   - `indexed` on success; `failed` with `error_message` otherwise.
5. **Counters** on the collection (`document_count`, `chunk_count`, `total_size_bytes`, `node_count`, `edge_count`) are maintained as the worker progresses.
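
The worker's status progression can be sketched as follows. This is a simplified illustration, not the worker itself: `Document`, `chunk_fixed`, and `run_ingest` are hypothetical names, text extraction and graph extraction are reduced to placeholders, and measuring `chunk_size` / `chunk_overlap` in characters is an assumption (the real unit may be tokens).

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


def chunk_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


@dataclass
class Document:
    text: str
    status: str = "pending"
    error_message: Optional[str] = None
    history: list[str] = field(default_factory=list)


def run_ingest(
    doc: Document,
    *,
    chunk_size: int,
    chunk_overlap: int,
    graph_enabled: bool,
    embed: Callable[[list[str]], list[list[float]]],
) -> int:
    """Walk one document through the stages; return the number of chunks."""
    def enter(status: str) -> None:
        doc.status = status
        doc.history.append(status)

    try:
        enter("processing")
        text = doc.text.strip()  # placeholder for per-content-type extraction
        enter("chunking")
        chunks = chunk_fixed(text, chunk_size, chunk_overlap)
        enter("embedding")
        embed(chunks)  # placeholder for batched embedding + vector write
        if graph_enabled:
            enter("graph_extracting")  # placeholder for node/edge extraction
        enter("indexed")
        return len(chunks)
    except Exception as exc:
        doc.error_message = str(exc)
        enter("failed")
        return 0
```

Any exception in a stage moves the document to `failed` and records the message, which matches the `error_message` behaviour described above.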

## Request flow: crawl

1. **Trigger** — ad-hoc `POST /collections/{id}/crawl`, manual `POST /crawls/{id}/run`, webhook `POST /crawls/{id}/trigger` (HMAC-verified), or scheduled by the worker.
2. **Job row** in `mod_scaimatrix_crawl_jobs` with `status: pending`, limits (`max_depth`, `max_pages`, `max_total_bytes`, `follow_external`).
3. **Worker** fetches the seed, respects `robots.txt`, walks links breadth-first within limits, and posts each fetched page back through the document ingestion path.
4. **Live progress** via `GET /collections/{id}/crawl/{job_id}/stream` (SSE), driven by polling the job row every two seconds.
5. **Terminal** statuses are `completed`, `failed`, `cancelled`. The job row stays around for history.
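
The breadth-first walk in step 3 might look roughly like the sketch below. It covers only `max_depth` and `max_pages`; `robots.txt`, `max_total_bytes`, and `follow_external` are left out. `crawl_order` and `get_links` are hypothetical names, and treating the seed as depth 0 is an assumption about how `max_depth` is counted.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_order(
    seed: str,
    get_links: Callable[[str], Iterable[str]],
    *,
    max_depth: int,
    max_pages: int,
) -> list[str]:
    """Return URLs in the order a breadth-first crawl would fetch them."""
    visited: list[str] = []
    seen = {seed}  # URLs already queued, so each is fetched at most once
    queue = deque([(seed, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)  # the real worker would fetch and ingest here
        if depth >= max_depth:
            continue  # at the depth limit: fetch the page, do not follow links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

With an in-memory link map, `crawl_order("a", lambda u: links.get(u, []), max_depth=2, max_pages=10)` visits pages level by level and stops following links once the depth budget is spent.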

## State

- **Collections, documents, ACLs, ACEs, crawl configs, crawl jobs, graph views** — MariaDB.
- **Chunks + embeddings** — Weaviate, one class per collection-slug, tenant tag in every object.
- **Graph nodes + edges** — Neo4j, labelled with tenant and collection ids.
- **Document blobs** — S3 via the document-store client.
- **Re-chunk / re-extract state** — denormalised onto the collection row (`rechunk_status`, `graph_reextract_status` plus counters).

## The ACL chokepoint

The v2 correctness invariant is "search and retrieval never return data the calling user lacks `READ` on." That's enforced by a single function — `filter_results_by_acl` — that every search, list, and graph result passes through before serialization. Any new surface that returns documents or chunks must route results through that chokepoint or the property tests fail.

`AclResolver.can(user, ref, Permission.X)` is the underlying primitive. It evaluates entries in a fixed order: explicit deny on the resource -> explicit allow -> inherited deny -> inherited allow, with super-admin / tenant-admin / owner bypasses. Group expansion is transitive: nested groups are mirrored from ScaiKey into the `mod_scaimatrix_scaikey_nested_groups` table and reconciled by a cron job every 10 minutes.
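
The precedence order and the transitive group expansion can be sketched in a few lines. This is an illustration under stated assumptions, not the resolver's actual code: `Ace`, `Effect`, `expand_groups`, and `can` are hypothetical, the bypass is collapsed into a single flag checked first, and returning `False` when no entry matches is an assumption (chosen to fail closed).

```python
from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    ALLOW = "allow"
    DENY = "deny"


@dataclass(frozen=True)
class Ace:
    principal: str   # user id or group id
    effect: Effect
    inherited: bool  # True when the entry comes from a parent resource


def expand_groups(direct: set[str], parents: dict[str, set[str]]) -> set[str]:
    """Transitive closure: every group reachable from the direct memberships."""
    result: set[str] = set()
    stack = list(direct)
    while stack:
        group = stack.pop()
        if group in result:
            continue  # also protects against membership cycles
        result.add(group)
        stack.extend(parents.get(group, ()))
    return result


# Evaluation order from the text: the first tier with a matching entry decides.
PRECEDENCE = [
    (False, Effect.DENY),   # explicit deny on the resource
    (False, Effect.ALLOW),  # explicit allow
    (True, Effect.DENY),    # inherited deny
    (True, Effect.ALLOW),   # inherited allow
]


def can(user_id: str, groups: set[str], aces: list[Ace], *, bypass: bool = False) -> bool:
    if bypass:  # stand-in for the super-admin / tenant-admin / owner bypasses
        return True
    principals = {user_id} | groups
    matching = [a for a in aces if a.principal in principals]
    for inherited, effect in PRECEDENCE:
        if any(a.inherited == inherited and a.effect == effect for a in matching):
            return effect is Effect.ALLOW
    return False  # assumption: no matching entry means no access
```

One consequence of this ordering is that an explicit allow on the resource overrides an inherited deny, while an explicit deny always wins.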

## Trust boundary

The HTTP layer is the only boundary that matters. Inside the process:

- The vector store query is **not** ACL-aware — it returns whatever matches, and the chokepoint filters.
- The graph store query is the same — Neo4j returns whatever Cypher asks for, and `filter_graph_results_by_acl` gates it.
- Running the resolver in two places (the chokepoint and the per-document fetch in `GET /documents/{id}`) is intentional defense in depth; the cost is negligible because the resolver only evaluates documents that were actually hit.

That layering exists because index-side filtering would push tenant + group + ACE state into Weaviate and Neo4j, which is operationally expensive and easy to let drift out of sync with MariaDB. One chokepoint, exhaustively tested, beats filter logic duplicated across two indexes.

## Tenant isolation

Every ScaiMatrix row carries a `tenant_id`. Every Weaviate object is written with a tenant tag; every Cypher query against Neo4j filters by `tenant_id` in the `MATCH` clause. Cross-tenant reads are excluded by construction rather than filtered after the fact: the queries that route handlers issue never widen scope past the caller's tenant, so another tenant's rows are never fetched in the first place. Super-admin operations are the one exception, and they take an explicit tenant id parameter.

The same is true of the audit log: every ScaiMatrix-emitted entry tags `tenant_id` so a tenant admin querying `/v1/audit/events?module=scaimatrix` sees only their tenant's history.
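
As a rough illustration of tenant scoping in the `MATCH` clause, the sketch below builds a parameterized Cypher query. The `Entity` label, the property names other than `tenant_id`, and the helper `tenant_scoped_neighbors_query` are all hypothetical; only the idea of filtering by `tenant_id` inside `MATCH` comes from the text above.

```python
def tenant_scoped_neighbors_query(
    tenant_id: str,
    collection_id: str,
    node_id: str,
    limit: int = 25,
) -> tuple[str, dict]:
    """Build a Cypher query whose MATCH pins both endpoints to one tenant.

    Values are passed as parameters, never interpolated into the query text.
    """
    if not tenant_id:
        # Refuse to build an unscoped query at all.
        raise ValueError("tenant_id is required")
    query = (
        "MATCH (n:Entity {tenant_id: $tenant_id, "
        "collection_id: $collection_id, id: $node_id})"
        "-[r]-"
        "(m:Entity {tenant_id: $tenant_id, collection_id: $collection_id}) "
        "RETURN n, r, m LIMIT $limit"
    )
    params = {
        "tenant_id": tenant_id,
        "collection_id": collection_id,
        "node_id": node_id,
        "limit": limit,
    }
    return query, params
```

Constraining both `n` and `m` means a stray relationship to another tenant's node would not be returned even if one existed.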

## Background workers

Four arq jobs back the slow paths:

- **`ingest_document`** — extract / chunk / embed / (optional) graph-extract for one document.
- **`crawl_website`** — crawl a seed URL under depth + page + byte budgets, posting each fetched page through the ingestion path.
- **`rechunk_collection`** — drop + recreate every document's chunks under the collection's current chunking parameters.
- **`reextract_collection_graph`** — wipe and re-run graph extraction over every indexed document.

Each job updates counters on the collection row as it progresses so dashboards stay live without polling the workers themselves. `rechunk_status` / `graph_reextract_status` go through `idle -> queued -> running -> completed | failed`, with `total`, `processed`, `failed` numbers maintained throughout.
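
The status lifecycle and counters can be modelled as a small state machine. The sketch is hypothetical throughout: `JobProgress` and `run_rechunk` are illustrative names, per-document errors are assumed to increment `failed` without failing the whole job, and `processed` is assumed to count successes only.

```python
from dataclasses import dataclass
from typing import Callable

# Allowed transitions, taken from the lifecycle described above.
TRANSITIONS = {
    "idle": {"queued"},
    "queued": {"running"},
    "running": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


@dataclass
class JobProgress:
    status: str = "idle"
    total: int = 0
    processed: int = 0
    failed: int = 0

    def transition(self, new_status: str) -> None:
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status


def run_rechunk(
    progress: JobProgress,
    document_ids: list[str],
    rechunk_one: Callable[[str], None],
) -> JobProgress:
    """Process every document, keeping the counters current after each one."""
    progress.transition("queued")
    progress.total = len(document_ids)
    progress.transition("running")
    for doc_id in document_ids:
        try:
            rechunk_one(doc_id)
            progress.processed += 1
        except Exception:
            # Assumption: one bad document is counted, not fatal to the job.
            progress.failed += 1
    progress.transition("completed")
    return progress
```

Because the counters live on the same row as the status, a dashboard can render a progress bar from `processed + failed` over `total` with a single read.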

## Graceful degradation

ScaiMatrix is designed to keep the rest of the API alive when an external dependency is down:

- **Weaviate down**: search endpoints return zero results and log a warning; ingestion stays queued until Weaviate is back.
- **Neo4j down**: graph endpoints return zero-shaped responses with `graph_available: false` instead of 5xx; the rest of the module is unaffected.
- **Embedding model unavailable**: documents being ingested stop at the `embedding` stage and surface the upstream error on `error_message`; the route layer still serves reads.

Health is reflected at `/health/detailed` so operators can see which backends are degraded before users report symptoms.
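
The Neo4j case can be sketched as a thin wrapper that converts a backend outage into a zero-shaped response. `GraphUnavailable`, `get_graph`, and the `nodes` / `edges` keys are assumptions made for illustration; only the `graph_available: false` flag is taken from the description above.

```python
import logging
from typing import Callable

logger = logging.getLogger("scaimatrix.sketch")


class GraphUnavailable(Exception):
    """Hypothetical error raised when the graph store cannot be reached."""


def get_graph(fetch: Callable[[], dict]) -> dict:
    """Return graph data, or an empty but well-formed response on outage."""
    try:
        data = fetch()
    except GraphUnavailable as exc:
        logger.warning("graph store unavailable: %s", exc)
        # Same keys as the success path, so clients need no special casing.
        return {"nodes": [], "edges": [], "graph_available": False}
    return {
        "nodes": data["nodes"],
        "edges": data["edges"],
        "graph_available": True,
    }
```

Only the outage exception is caught here, so genuine bugs in the fetch path still surface as errors instead of being masked as an empty graph.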
