---
summary: "How documents become grounded answers \u2014 managed knowledge, linked ScaiMatrix\
  \ collections, indexing, retrieval, citations."
title: Knowledge and RAG
path: concepts/knowledge-and-rag
status: published
---

# Knowledge and RAG

A bot can answer from documents you upload or from a shared ScaiMatrix collection you point it at. This is the retrieval-augmented generation (RAG) loop, and it is built in.

## Two knowledge modes

You choose per bot:

**Managed** — the bot owns its documents. Each upload creates chunks, embeddings, and an index slice scoped to *that bot*. Other bots can't see the documents. Good for: bot-specific content (a single product's FAQ, a single team's handbook).

**Linked** — the bot reads from a ScaiMatrix collection you've already set up. Many bots can share the same collection. Good for: corporate knowledge that powers multiple bots, content managed by a separate team using ScaiMatrix directly.

Set `knowledge_mode` on the bot:

```json
{ "knowledge_mode": "managed" }
```

```json
{ "knowledge_mode": "linked", "knowledge_collection_id": "col_shared_docs" }
```
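
As a rough sketch, the two payloads above can be built and sanity-checked client-side before they are sent. The helper name `knowledge_config` is hypothetical, and the rule that linked mode requires a collection id is inferred from the example above rather than from a documented validation rule.

```python
import json
from typing import Optional


def knowledge_config(mode: str, collection_id: Optional[str] = None) -> dict:
    """Build the knowledge fields of a bot payload (hypothetical helper)."""
    if mode == "managed":
        # Managed bots own their documents, so no collection id is sent.
        return {"knowledge_mode": "managed"}
    if mode == "linked":
        # Assumption: a linked bot is meaningless without a collection to read.
        if not collection_id:
            raise ValueError("linked mode needs a knowledge_collection_id")
        return {"knowledge_mode": "linked", "knowledge_collection_id": collection_id}
    raise ValueError(f"unknown knowledge_mode: {mode!r}")


print(json.dumps(knowledge_config("managed")))
print(json.dumps(knowledge_config("linked", "col_shared_docs")))
```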

## What "indexed" means

When you `POST /bots/{id}/documents`:

1. The file lands in object storage (ScaiDrive under the hood).
2. ScaiBot fans out a background task that:
   - Extracts text (PDF / DOCX / HTML / Markdown / plain).
   - Chunks at semantic boundaries (typically 400-600 tokens per chunk with 50-token overlap).
   - Embeds each chunk with ScaiGrid's default embedding model.
   - Writes chunks to ScaiMatrix tagged with the bot's collection id.
3. The document's `status` flips: `uploaded` → `extracting` → `indexing` → `indexed` (or `failed`).
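
To make the chunk size and overlap figures concrete, here is a simplified sketch of overlapping windows. It is not ScaiBot's chunker: it splits on a fixed token count rather than semantic boundaries, and it treats whitespace-separated words as tokens. The function name `chunk_tokens` and its defaults are illustrative only.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts `step` tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


tokens = [str(i) for i in range(1200)]  # stand-in for a tokenised document
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])           # prints [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # prints True: 50 tokens are shared
```

The overlap means a sentence that straddles a window edge still appears whole in at least one chunk, at the cost of indexing some text twice.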

For most documents under a few hundred pages, the whole pipeline completes in under a minute.
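
Because indexing runs in the background, a client usually polls the document's `status` until it settles. The sketch below assumes you supply a `fetch_status` callable that returns the current status string; how you read it (which endpoint, which field) is not covered here, so that part is left as a hypothetical stand-in. The 600-second default mirrors the 10-minute indexing timeout listed under Limits.

```python
import time
from typing import Callable

TERMINAL = {"indexed", "failed"}


def wait_until_indexed(fetch_status: Callable[[], str],
                       timeout_s: float = 600.0,
                       interval_s: float = 2.0) -> str:
    """Poll until the document reaches a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s}s")
        time.sleep(interval_s)


# Stand-in for a real API call: replays the documented status sequence.
replay = iter(["uploaded", "extracting", "indexing", "indexed"])
print(wait_until_indexed(lambda: next(replay), interval_s=0.0))  # prints indexed
```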

## What's retrieved at chat time

When the visitor sends a message:

1. The message is embedded with the same model used for chunks.
2. ScaiMatrix returns the top-K (default 5) chunks by hybrid score (BM25 keyword + cosine semantic).
3. Chunks below the relevance threshold are dropped.
4. The remaining chunks are stitched into the system prompt as labelled context.
5. The model is told to cite the chunk number it used for each statement.
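
Steps 4 and 5 can be pictured with a small sketch. The exact prompt layout ScaiBot uses is not documented here, so the label format, the instruction wording, and the `text` and `document_name` keys below are assumptions chosen for illustration.

```python
from typing import List


def build_system_prompt(base_prompt: str, chunks: List[dict]) -> str:
    """Append numbered context blocks and a citation instruction."""
    if not chunks:
        return base_prompt  # nothing survived retrieval, so add no context
    blocks = [
        f"[{i}] ({chunk['document_name']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return (
        f"{base_prompt}\n\n"
        "Context:\n" + "\n\n".join(blocks) + "\n\n"
        "Cite the number of the context block that supports each statement."
    )


print(build_system_prompt(
    "You are a helpful support bot.",
    [{"document_name": "Refund Policy.pdf",
      "text": "Refunds shall be remitted within fourteen (14) business days."}],
))
```

Numbering the blocks is what lets the model's answer point back at a specific chunk, which in turn becomes a citation.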

Tune retrieval via the bot's `knowledge_settings`:

```json
{
  "top_k": 5,
  "score_threshold": 0.3,
  "max_chunks_per_doc": 2,
  "deduplicate": true
}
```

`max_chunks_per_doc` prevents one document from monopolising retrieval when it has many near-identical sections (typical for FAQs).
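
The interaction between these four settings is easier to see in code. This is a sketch of one plausible post-filter, not the service's implementation: the order in which the per-document cap and deduplication are applied, and the idea that `deduplicate` means "drop chunks with identical normalised text", are assumptions.

```python
from typing import List


def select_chunks(candidates: List[dict], settings: dict) -> List[dict]:
    """Filter scored candidates the way the settings above suggest."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    picked, per_doc, seen_text = [], {}, set()
    for chunk in ranked:
        if chunk["score"] < settings["score_threshold"]:
            break  # ranked by score, so everything after is lower too
        doc = chunk["document_id"]
        if per_doc.get(doc, 0) >= settings["max_chunks_per_doc"]:
            continue  # this document already has its share of slots
        # Assumption: duplicates are chunks whose normalised text matches.
        key = " ".join(chunk["text"].lower().split())
        if settings["deduplicate"] and key in seen_text:
            continue
        seen_text.add(key)
        per_doc[doc] = per_doc.get(doc, 0) + 1
        picked.append(chunk)
        if len(picked) == settings["top_k"]:
            break
    return picked


settings = {"top_k": 5, "score_threshold": 0.3,
            "max_chunks_per_doc": 2, "deduplicate": True}
candidates = [
    {"document_id": "faq", "text": "Refunds take 14 days.", "score": 0.90},
    {"document_id": "faq", "text": "refunds take  14 days.", "score": 0.85},
    {"document_id": "faq", "text": "Refunds go to the original card.", "score": 0.80},
    {"document_id": "faq", "text": "Contact support for refunds.", "score": 0.70},
    {"document_id": "handbook", "text": "Office hours are 9 to 5.", "score": 0.20},
]
print([c["score"] for c in select_chunks(candidates, settings)])  # prints [0.9, 0.8]
```

In the example, the second FAQ chunk is dropped as a duplicate, the fourth hits the per-document cap, and the handbook chunk falls below the threshold.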

## Citations

Every assistant message comes back with `citations`:

```json
{
  "role": "assistant",
  "content": "Refunds are processed within 14 business days [^1].",
  "citations": [
    {
      "marker": "1",
      "document_id": "doc_abc",
      "document_name": "Refund Policy.pdf",
      "chunk_id": "chk_xyz",
      "snippet": "Refunds shall be remitted to the original payment instrument within fourteen (14) business days...",
      "score": 0.84
    }
  ]
}
```

The widget renders these as superscripts that expand to show the snippet.
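
If you render messages yourself instead of using the widget, the same payload can be turned into plain-text footnotes. The sketch below assumes that each `[^n]` marker in `content` corresponds to the citation whose `marker` is `n`, as in the example above; the function name `render_plain_text` is hypothetical.

```python
import re


def render_plain_text(message: dict) -> str:
    """Turn [^n] markers into [n] and append one footnote line per citation."""
    by_marker = {c["marker"]: c for c in message.get("citations", [])}
    body = re.sub(r"\[\^(\w+)\]", r"[\1]", message["content"])
    notes = [
        f"[{marker}] {c['document_name']}: {c['snippet']}"
        for marker, c in by_marker.items()
    ]
    return body if not notes else body + "\n\n" + "\n".join(notes)


message = {
    "role": "assistant",
    "content": "Refunds are processed within 14 business days [^1].",
    "citations": [{
        "marker": "1",
        "document_id": "doc_abc",
        "document_name": "Refund Policy.pdf",
        "chunk_id": "chk_xyz",
        "snippet": "Refunds shall be remitted to the original payment instrument...",
        "score": 0.84,
    }],
}
print(render_plain_text(message))
```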

## Updating and removing documents

```bash
# Replace a document — same name, new file
curl -X PUT "$SCAIGRID_HOST/v1/modules/scaibot/bots/$BOT_ID/documents/$DOC_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -F "file=@updated-handbook.pdf"
```

```bash
# Remove a document — also drops its chunks from the index
curl -X DELETE "$SCAIGRID_HOST/v1/modules/scaibot/bots/$BOT_ID/documents/$DOC_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

Removals are immediate at the index level — the chunks vanish from retrieval. Object-storage cleanup happens asynchronously.

## When to switch to linked mode

Managed mode is the simplest path and works well for the first few hundred MB of content per bot.

Switch to linked mode when:

- The same content powers multiple bots (one tenant, multiple deployments: for example, an internal Slack bot and a public help bot answering from the same handbook).
- A non-bot team owns the knowledge (the legal team manages a ScaiMatrix collection of contracts; bots only read it).
- The corpus is too large for per-bot management (thousands of documents, terabytes of source material).
- You need fine-grained access control on chunks (ScaiMatrix supports per-document ACLs; managed mode treats the whole bot uniformly).

## Supported document types

| Type | Notes |
|---|---|
| PDF | Most common. Tables and footnotes preserved; column flow detected. |
| DOCX | Headings preserved as semantic boundaries. |
| Markdown | H1/H2/H3 used as boundaries. |
| HTML | Stripped of nav/script/style; `<main>` preferred when present. |
| Plain text | Chunked by paragraph and sentence. |
| JSON / YAML | Treated as plain text — structured retrieval is not yet supported. |

Scanned PDFs (page images with no text layer) are run through OCR automatically; extraction quality varies with scan quality.

## Limits

- Single-document max size: 50 MB.
- Per-bot managed-knowledge cap: 5,000 documents (raise the limit through your account team for larger corpora).
- Retrieval: maximum 50 chunks per query at chat time (you should aim much lower).
- Indexing timeout: 10 minutes per document. Large documents are accepted but may hit the timeout and need a retry.
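
A client can catch two of these limits before making a request. The sketch below is a hedged illustration: the helper names are hypothetical, and it assumes "50 MB" means 50 × 1024 × 1024 bytes, which this page does not specify.

```python
MAX_DOCUMENT_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" read as 50 MiB
MAX_TOP_K = 50                         # retrieval ceiling listed above


def check_upload_size(size_bytes: int) -> None:
    """Raise before uploading a file the service would reject as too large."""
    if size_bytes > MAX_DOCUMENT_BYTES:
        raise ValueError(
            f"document is {size_bytes} bytes; the limit is {MAX_DOCUMENT_BYTES}"
        )


def clamp_top_k(requested: int) -> int:
    """Keep a requested top_k within 1..MAX_TOP_K."""
    return max(1, min(requested, MAX_TOP_K))


check_upload_size(12 * 1024 * 1024)  # a 12 MiB file passes silently
print(clamp_top_k(200))              # prints 50
```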
