---
summary: How the persona, the request enricher, ScaiMatrix, ScaiDrive, and the inference
  pipeline fit together.
title: Architecture
path: concepts/architecture
status: published
---

ScaiPersona is a thin product layer over ScaiGrid's existing primitives — the model catalogue, the inference pipeline, ScaiMatrix retrieval, and ScaiDrive shares. There is no separate "persona engine"; a persona is a configuration plus a request enricher that runs inside the inference pipeline.

## Components

```mermaid
flowchart LR
    C[Caller]
    subgraph SG[ScaiGrid]
        R[Routing<br/>FrontendModel<br/>persona's published model]
        E[Persona enricher<br/>- reads persona<br/>- runs RAG<br/>- injects context]
        D[Inference dispatch<br/>to backend]
        R --> E
        E --> D
    end
    SM[ScaiMatrix]
    SD[ScaiDrive]
    C -- /v1/inference/chat --> R
    E --> SM
    E --> SD
    D -- stream / json --> C
```

There is no separate ScaiPersona deployment. ScaiPersona is a ScaiGrid module — it runs in the same FastAPI process, behind the same auth, accounted against the same budgets. The CRUD routes live at `/v1/modules/scaipersona/`; the actual hot path runs inside the inference enricher.

## Request flow for one persona invocation

1. **Caller** calls `POST /v1/inference/chat` with `model = "tenant/{tenant_slug}/{persona_slug}"`.
2. **Routing** resolves the slug to the persona's published `FrontendModel`. The frontend model's `metadata.persona_id` ties it back to the persona row.
3. **Enricher** (`persona_enricher`, registered on `app.state.persona_enricher` at module init) is invoked by `InferenceService`. It reads the persona, and if `rag_enabled` is true, runs the configured RAG strategy.
4. **RAG retrieval.** For each active `PersonaSource`:
   - `collection` sources call ScaiMatrix's vector search after checking the caller's collection access.
   - `scaidrive` sources call ScaiDrive's `/api/v1/search/context` after exchanging the caller's bearer token for a ScaiDrive-scoped one.
   Results across sources are weighted by `source.weight`, deduplicated, and capped at `rag_top_k`.
5. **Context injection.** Top chunks are formatted with the persona's `rag_context_template` (default: a labelled list) and appended to the system message.
6. **Inference.** The enriched request is dispatched to the underlying frontend model's backend. The persona's `system_prompt` was already baked into the published frontend model at publish time.
7. **Accounting.** Tokens, latency, and dispatch metadata are recorded against the persona's frontend model — the persona ends up in usage reports as its own model.

## State

- **Personas and sources** — in MariaDB tables `mod_scaipersona_personas` and `mod_scaipersona_persona_sources`.
- **Published frontend models** — in the standard `frontend_models` table, with a `metadata.persona_id` pointer back. Backend links are copied from the underlying model at publish time.
- **Avatars** — in S3 under `scaipersona/{tenant_id}/{persona_id}/avatar`. Served back through `GET /personas/{id}/avatar` (public, no auth).
- **No persona-level conversation log.** Persona invocations land in the standard inference history; filter by model slug to find them.

## Publishing semantics

Publishing materialises a persona as a `FrontendModel` row. The new row:

- Has slug `tenant/{tenant_slug}/{persona_slug}` — globally unique and tenant-namespaced.
- Inherits `modality`, `capabilities`, `context_window`, and pricing from the underlying model.
- Uses the persona's `system_prompt` as its `system_prompt_template`.
- Carries the persona's `avatar_url` and `default_params`.
- Has its backend links copied from the underlying frontend model so dispatch resolves identically.

Editing a persona while it's published auto-syncs the frontend model on the next `PUT /personas/{id}` or `POST /personas/{id}/avatar`. There is no separate draft / live split — every update is live on the next inference call.

Unpublishing deletes the frontend model and removes it from any model groups it had joined. The persona itself stays put; you can re-publish later.

## How it differs from calling a vendor model directly

A direct ScaiGrid chat-completion call gives you tokens-out. ScaiPersona adds:

| Concern | Direct call | ScaiPersona |
|---|---|---|
| System prompt | Caller provides per-request | Baked into the published model |
| RAG retrieval | Caller orchestrates | Built-in; ScaiMatrix + ScaiDrive |
| Multi-source weighting | Caller merges results | Built-in via `source.weight` |
| Multi-step retrieval | Caller implements | `rag_strategy: multi_step` / `agentic` |
| Catalogue listing | Vendor model only | Persona appears as its own frontend model |
| Routing / model groups | Vendor model only | Persona slugs are first-class members |

For a one-off completion, call the inference endpoint with a vendor model. For a named, RAG-grounded assistant you want callers to target by slug, publish a persona.

## Where the trust boundary is

The persona itself is tenant-scoped — `PersonaService.get_for_tenant` rejects any cross-tenant read. RAG access is enforced **at retrieval time**, not at attach time: every collection search calls `collection_access_fn` with the invoking user and drops chunks the caller can't see. ScaiDrive shares are gated by the caller's bearer token, exchanged into a ScaiDrive-scoped token per call — so a persona never "borrows" access the caller doesn't already have.
