RAG strategies

A persona with rag_enabled: true runs retrieval before every inference call. How it retrieves is controlled by rag_strategy. The three strategies trade latency for answer quality on hard questions; the default single_step is fine for most workloads.

The shared knobs#

Every strategy shares the same retrieval primitives.

rag_top_k — how many chunks survive the merge across sources. Default 5; range 1-50.
rag_min_score — chunks below this similarity score are dropped. Optional; if unset, every retrieved chunk is kept up to rag_top_k.
source.weight — each PersonaSource has a weight, applied as a multiplier on raw similarity scores before the top-k cut. A weight of 2.0 doubles the source's effective rank; 0.5 halves it. Default 1.0.
rag_context_template — controls how chunks are stitched into the system prompt. The placeholder $context is replaced with a labelled list 1. [source name]: chunk\n\n2. [source name]: chunk\n\n.... Default template asks the model to "use it to inform your response".

Source results are merged across all active sources, sorted by weighted score, deduplicated by content prefix, and truncated to rag_top_k. Inactive sources (status: disabled) are skipped.

`single_step` (default)#

One retrieval pass against all sources, results merged, context injected, model called.

json
{
  "rag_strategy": "single_step",
  "rag_top_k": 5,
  "rag_min_score": 0.5
}

Use when:

Queries are answerable from a single semantic neighbourhood ("What's our refund policy?").
Latency matters more than recall on edge cases.
You haven't measured retrieval quality yet — start here, measure, then upgrade if the answers are wrong.

Cost: one retrieval per source, one inference call. No LLM hops in the retrieval path.

`multi_step`#

Two retrieval passes. First pass uses the user's question verbatim. The persona's own model is then asked to write a refined search query based on the initial chunks; second pass uses that refined query. The two result sets are merged, deduplicated, and capped at rag_top_k.

json
{
  "rag_strategy": "multi_step",
  "rag_top_k": 5
}

Use when:

Users ask vague questions ("How does this work?") that need disambiguation against the corpus.
The user's vocabulary diverges from the document vocabulary — multi-step lets the model bridge the gap.
You have a clear corpus topic and the first-pass results give a useful hint about what to look for next.

Cost: two retrievals per source, plus one short LLM call to refine. If the refined query equals the original, the second pass is skipped — multi-step degrades gracefully to single-step in that case.

`agentic`#

The model drives retrieval. First, it's asked whether retrieval is needed at all (conversational pleasantries skip retrieval entirely). If yes, it generates a search query, retrieves, then is asked whether the context is sufficient. If not, it generates a refined query and retrieves again. Iterates up to three times.

json
{
  "rag_strategy": "agentic",
  "rag_top_k": 5
}

Use when:

Questions span multiple topics in one turn ("Compare our refund policy with the SLA terms").
The persona handles a mix of factual lookup and conversational chit-chat — agentic skips retrieval when the question doesn't need it.
Retrieval quality matters more than latency.

Cost: up to three retrievals per source plus up to four short LLM calls (one decide, three "enough?" checks). Worst case: a long retrieval chain before the final inference call. Best case (chit-chat detected): one LLM call, no retrieval.

If the underlying chat function is unavailable, agentic falls back to single-step.

Picking source weights#

Sources are merged by weighted score. A few patterns work well.

Equal weighting. Default 1.0 across sources gives every source the same shot at the top-k. Use when sources are equally authoritative.

Authoritative + supporting. Weight 2.0 on the authoritative source (policy doc, official handbook); 1.0 on supplementary sources (forum threads, FAQs). The authoritative source dominates the top-k but supporting sources still fill in gaps.

Trial source. Weight 0.5 on a new source you're testing. It influences answers but doesn't dominate while you measure quality.

Weights only affect ranking — chunks below rag_min_score are still dropped regardless of weight. Weight is multiplicative, not additive; a weight of 0 effectively disables the source for ranking, but setting status: disabled is cleaner.

When retrieval returns nothing#

If no chunks survive (all sources empty, all below min-score, or the caller has no access), the enricher returns the request unchanged — the persona answers from the system prompt alone. This is intentional: a persona without retrievable context behaves like a plain system-prompted assistant rather than refusing to answer. If you want hard refusal, encode it in the system_prompt ("If no context is provided, say you don't know.").

Status messages#

While retrieval runs, the enricher emits status strings (searching…, found N results, refining query…). Streaming inference includes these in the response stream so chat UIs can show progress. Control them with status_messages_mode:

standard (default) — built-in English messages.
custom — provide your own via rag_status_messages (a map from message key to template).
none — emit no status messages.

Available keys: searching, search_complete (supports {count}), refining, tool_call (supports {source_name}), context_ready. Unknown keys fall back to the standard defaults.

RAG strategies

The shared knobs#

single_step (default)#

multi_step#

agentic#

Picking source weights#

When retrieval returns nothing#

Status messages#

`single_step` (default)#

`multi_step`#

`agentic`#