API Reference
ScaiVoice exposes a thin REST surface for session lifecycle plus one WebSocket that drives the entire conversation.
Choosing your integration shape (Phase 3a)
ScaiVoice is intentionally agnostic about how you manage conversation memory and tool execution. There are two integration patterns, selected at session creation via history_mode:
| Mode | What ScaiVoice owns | What the consumer owns |
|---|---|---|
| history_mode: 'server' (default) | conversation history (in-memory, lost on reconnect); LLM model selection | bot personality (system prompt sent once at session-open; baked into the running history) |
| history_mode: 'client' | speech I/O glue only: STT, TTS, state machine, cancellation | full conversation history sent on every text frame; tool definitions; tool execution; RAG context; persona |
The 'client' mode is the "bring your own LLM ecosystem" path. Your existing agent code in ScaiBot / ScaiWave / wherever continues to own the cognition; ScaiVoice wraps mic/STT on the way in and TTS/audio on the way out. Every text frame carries the full messages array (i.e. the conversation the agent is currently driving) plus the tools definition list. When the LLM emits tool calls, ScaiVoice forwards them as agent_tool_call events; you execute the tools in your environment and reply with tool_result frames; ScaiVoice continues the agent loop.
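As a rough illustration of the client-mode exchange described above, the sketch below builds a text frame that carries the full messages array and answers an agent_tool_call event with a tool_result frame. The helper names, the role/content message shape, and the choice to JSON-encode the tool result into content are assumptions made for this example; they are not confirmed by this reference.

```python
import json
from typing import Any, Callable, Dict, List, Optional


def build_text_frame(delta: str,
                     messages: List[Dict[str, Any]],
                     tools: Optional[List[Dict[str, Any]]] = None) -> str:
    """Serialise a text frame for a session opened with history_mode 'client'."""
    frame: Dict[str, Any] = {"type": "text", "delta": delta, "messages": messages}
    if tools:
        frame["tools"] = tools  # optional, per-turn tool definitions
    return json.dumps(frame)


def answer_tool_call(event: Dict[str, Any],
                     registry: Dict[str, Callable[..., Any]]) -> str:
    """Execute the tool named in an agent_tool_call event and build the reply frame."""
    arguments = event["arguments"]
    if isinstance(arguments, str):
        # Assumption: arguments may arrive as a JSON-encoded string.
        arguments = json.loads(arguments)
    result = registry[event["name"]](**arguments)
    return json.dumps({
        "type": "tool_result",
        "tool_call_id": event["tool_call_id"],
        # Assumption: content is sent as a JSON-encoded string.
        "content": json.dumps(result),
    })
```

The consumer keeps the messages list itself and appends each user and assistant turn before sending the next frame, since ScaiVoice holds no history in this mode.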
Phase 3b adds a second toggle, cognition_mode. With cognition_mode: 'delegated' and a cognition_callback_url set at session-create, ScaiVoice doesn't talk to any LLM — it POSTs each user utterance to your URL with {session_id, tenant_id, user_id, user_input, turn_index, request_id} and pipes the streaming response body straight into TTS. Optional cognition_callback_auth_token is forwarded as Authorization: Bearer … so your endpoint can verify the call. Tools, history, RAG, persona, error-handling — all on your side.
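The core of a delegated-cognition endpoint could look roughly like the sketch below: one helper verifies the forwarded bearer token and another validates the posted payload and yields the reply in chunks. The function names, the plain-text chunk format, and the echo "cognition" are placeholders assumed for illustration; wiring these into a web framework's streaming response is left out.

```python
import hmac
from typing import Any, Dict, Iterator, Optional

# Field names taken from the callback payload described above.
REQUIRED_FIELDS = ("session_id", "tenant_id", "user_id",
                   "user_input", "turn_index", "request_id")


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Compare the 'Authorization: Bearer <token>' header in constant time."""
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return False
    presented = authorization_header[len(prefix):]
    return hmac.compare_digest(presented.encode(), expected_token.encode())


def stream_reply(payload: Dict[str, Any]) -> Iterator[str]:
    """Validate the payload, then yield the reply text piece by piece."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Placeholder cognition: a real endpoint would call its own agent here.
    reply = f"You said: {payload['user_input']}"
    for word in reply.split(" "):
        yield word + " "
```

Yielding small chunks as soon as they are available lets the streamed body reach TTS before the whole reply has been generated.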
Most ScaiBot / ScaiWave integrations will pair history_mode: 'client' with cognition_mode: 'delegated' and own the entire agent layer in their own infrastructure. Simpler demos can stick with the all-server defaults.
Sessions REST
All endpoints live under /v1/modules/scaivoice/sessions. Required permission: scaivoice:use.
POST /sessions
Create a voice session. Body:
| Field | Required | Notes |
|---|---|---|
| voice_id | yes | A voice the caller can see in the ScaiSpeak library. Validated up-front; cross-scope returns 404. |
| llm_model | yes | An LLM model slug the caller can use. |
| language_hint | no | 2-letter ISO code. Used by STT and by the TTS text-normaliser. |
| wake_word_enabled | no | When true, the client is responsible for emitting {"type":"wake"} frames. Default false. |
| vad_enabled | no | When true, the client emits {"type":"vad"} frames; Phase 1 uses them for auto barge-in. Default false. |
| speaker_recognition | no | When true, the server attaches speaker_id to transcript frames. Phase 2; tenant opt-in required. Default false. |
| diarize | no | When true, anonymous speaker labels (speaker_0/1/...) flow through STT segments. Default false. |
| tools_enabled | no | When true, the LLM gets tool definitions (Phase 3). Default false. |
| normalize_text | no | Toggle for the ScaiSpeak text-prep pipeline. true / false / omit for tenant default. |
| instructions | no | Free-text style / emotion / delivery guidance prepended to every TTS call in this session. Example: "cheerful and energetic". Overrides the voice's default_instructions when set. |
| speed | no | Speaking speed, 0.5--2.0. Overrides the voice's default_speed when set. |
| cfg_value | no | Cloning-fidelity tradeoff, 0.5--5.0. Higher values stay closer to the reference voice. Overrides the voice's default_cfg_value. Meaningful for cloned voices only. |
| warmup_trim_ms | no | Milliseconds to trim from the start of generated audio. Overrides the voice's default_warmup_trim_ms. 0 to disable. Meaningful for cloned voices only. |
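A request body for this endpoint might be assembled as in the sketch below, which checks the two numeric ranges documented in the table before anything is sent. The helper name and the example values are hypothetical, and the server remains the authority on validation.

```python
from typing import Any, Dict, Tuple

SPEED_RANGE = (0.5, 2.0)      # documented range for speed
CFG_VALUE_RANGE = (0.5, 5.0)  # documented range for cfg_value


def _check_range(name: str, value: float, bounds: Tuple[float, float]) -> None:
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name} must be within {low}..{high}, got {value}")


def build_session_body(voice_id: str, llm_model: str, **overrides: Any) -> Dict[str, Any]:
    """Build a POST /sessions body; fields passed as None are left out."""
    body: Dict[str, Any] = {"voice_id": voice_id, "llm_model": llm_model}
    for key, value in overrides.items():
        if value is None:
            continue  # omitted fields fall back to voice or tenant defaults
        if key == "speed":
            _check_range(key, value, SPEED_RANGE)
        elif key == "cfg_value":
            _check_range(key, value, CFG_VALUE_RANGE)
        body[key] = value
    return body
```

For example, build_session_body("voice-123", "model-x", speed=1.25, language_hint="en") produces a body with only those four fields, where "voice-123" and "model-x" are made-up identifiers.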
Voice defaults merge chain
TTS parameters resolve through a three-level precedence chain:
- Engine default -- built-in values (speed 1.0, cfg ~2.0, no instructions, no trim).
- Voice default -- default_instructions, default_speed, default_cfg_value, default_warmup_trim_ms on the voice row, set via PATCH /voices/{id} in ScaiSpeak.
- Session override -- instructions, speed, cfg_value, warmup_trim_ms on POST /sessions.
Each level overrides the one before it. A session that omits a field inherits the voice default; a voice that omits a default inherits the engine default. This lets voice owners bake in per-voice tuning while still allowing session-level control when needed.
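The precedence chain can be sketched as a small resolver. This is an illustration of the documented behaviour, not the server's actual code: the engine default for cfg_value is only given as roughly 2.0, "no trim" is modelled as 0, and a value of None is treated the same as an omitted field.

```python
from typing import Any, Dict, Mapping

# Engine defaults as described above; cfg_value is approximate (~2.0).
ENGINE_DEFAULTS: Dict[str, Any] = {
    "instructions": None,
    "speed": 1.0,
    "cfg_value": 2.0,
    "warmup_trim_ms": 0,
}


def resolve_tts_params(voice_row: Mapping[str, Any],
                       session_body: Mapping[str, Any]) -> Dict[str, Any]:
    """Apply engine default, then voice default, then session override."""
    resolved = dict(ENGINE_DEFAULTS)
    for field in ENGINE_DEFAULTS:
        voice_default = voice_row.get(f"default_{field}")
        if voice_default is not None:
            resolved[field] = voice_default
        session_value = session_body.get(field)
        if session_value is not None:
            resolved[field] = session_value
    return resolved
```

Note that an explicit 0 for warmup_trim_ms at session level is a real override (it disables trimming), which is why the check is against None rather than truthiness.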
Returns 201 Created.
GET /sessions/{session_id}
Returns the full session row including state, timestamps, turn count, and char counters. 404 on cross-tenant lookups (info-leak prevention).
DELETE /sessions/{session_id}
Marks the session terminated. Doesn't disconnect any in-flight WS — that's the WS handler's responsibility on the next state check. Idempotent.
Session WebSocket
Authentication: bearer token via ?token= query param. Browsers can't set headers on the WS upgrade, so query is the only browser-direct option. Query params matching token are redacted in access logs.
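A client could attach the token to whatever WebSocket URL it has been given as shown below. The second helper illustrates the kind of redaction the access logs are described as applying; it is a guess at the idea, not the server's implementation. Both function names and the example host are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_token(ws_url: str, token: str) -> str:
    """Append ?token=... to a WebSocket URL, keeping any existing query params."""
    parts = urlsplit(ws_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("token", token))
    return urlunsplit(parts._replace(query=urlencode(query)))


def redact_token(url: str) -> str:
    """Replace the value of any 'token' query param before logging a URL."""
    parts = urlsplit(url)
    query = [(key, "REDACTED" if key == "token" else value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Because the token travels in the URL, client-side logging and error reporting should apply a similar redaction before recording the connection string.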
Open handshake
The first client frame must be an {"type":"open"} frame.
The server responds with a {"type":"ready", session_id, voice_id} frame, then transitions the state machine to listening and emits a state event.
Client → Server frames
| Frame | Purpose | Phase 0 behaviour |
|---|---|---|
| {"type":"open"} | First frame; opens the session | Validated; transitions to listening |
| binary | Mic frames (16 kHz PCM16 mono) | Forwarded to ScaiEcho STT |
| {"type":"text","delta":"...","messages?","tools?"} | Typed-input override. Phase 3a: messages (full conversation incl. the user's latest turn) is required when the session was opened with history_mode:'client', forbidden when 'server'. Optional tools array (per-turn). | |
| {"type":"tool_result","tool_call_id","content"} | Phase 3a: response to a server-emitted agent_tool_call. Routed to the active turn's queue; stale results are dropped. | |
| {"type":"interrupt"} | Stop in-flight LLM + TTS | Cancels current turn within ~100 ms |
| {"type":"vad","speaking":true/false} | Client VAD signal | Phase 1: speaking:true during thinking/speaking auto-cancels the current turn within ~100 ms. Other states + speaking:false are no-ops. See the client VAD tutorial. |
| {"type":"wake","confidence":0.93} | Wake word detected | Phase 2: when the session was opened with wake_word_enabled:true, arms the session for the next utterance. Server emits {"type":"wake_state","armed":true}. Idempotent. No-op when wake gating is off. See the client wake-word tutorial. |
| {"type":"close"} | End the session | Clean close, code 1000 |
Server → Client frames
| Frame | When |
|---|---|
| {"type":"ready", session_id, voice_id} | After open is validated |
| {"type":"state", state, reason?} | Every state transition |
| {"type":"transcript", text, is_final, speaker_id?} | STT segment from ScaiEcho |
| {"type":"agent_text", delta} | LLM token stream |
| {"type":"agent_tool_call", tool_call_id, name, arguments} | Phase 3a: LLM emitted a tool call. Consumer executes + sends {"type":"tool_result"} back. |
| binary | TTS audio frames (WAV) |
| {"type":"agent_done", stats:{chars, interrupted?, reason?}} | Turn complete |
| {"type":"wake_state", armed, wake_word_enabled} | Phase 2: when wake_word_enabled:true, fires on every armed/disarmed transition (and once at open with armed:false) |
| {"type":"info", code, message} | Informational (non-error) status. Phase 2 emits SCAIVOICE_WAKE_REQUIRED when a text frame is dropped because the session isn't armed |
| {"type":"error", code, message} | Anything failed |
| {"type":"expression_hint", ...} | Reserved for forward compatibility; not emitted in v1 |
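On the receiving side, a client has to tell binary TTS audio apart from JSON events and then branch on type. The sketch below shows one way to do that; the route_frame helper and the "audio" handler key are names invented for this example, and frame types without a registered handler are silently ignored so that reserved types such as expression_hint do not break the client.

```python
import json
from typing import Any, Callable, Dict, Union


def route_frame(frame: Union[bytes, bytearray, str],
                handlers: Dict[str, Callable[[Any], None]]) -> str:
    """Dispatch one incoming WebSocket frame; return the handler key it mapped to."""
    if isinstance(frame, (bytes, bytearray)):
        key, payload = "audio", bytes(frame)      # binary frames carry TTS audio
    else:
        payload = json.loads(frame)               # text frames carry JSON events
        key = payload.get("type", "unknown")
    handler = handlers.get(key)
    if handler is not None:
        handler(payload)
    return key
```

A typical handler table would map "agent_text" to a function that appends the delta to the on-screen reply, "audio" to the playback queue, and "agent_tool_call" to the tool executor.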
Close codes
| Code | Meaning |
|---|---|
| 4401 | Unauthorized: missing or invalid token |
| 4403 | Forbidden: missing scaivoice:use or no tenant context |
| 4404 | Session not found |
| 4400 | Bad request: bad first frame, malformed JSON, session already ended |
| 4502 | Backend unavailable (downstream STT/LLM/TTS node not reachable) |
| 4500 | Server error |
| 1000 | Normal close: caller terminated, idle timeout, or session done |
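A client might keep the close codes in a lookup table and use it to decide whether reconnecting is worthwhile. The meanings below come from the table above; the retry policy (retry only on 4500 and 4502) is an assumption made for this sketch, not documented behaviour.

```python
from typing import Dict

CLOSE_CODES: Dict[int, str] = {
    1000: "normal close",
    4400: "bad request",
    4401: "unauthorized",
    4403: "forbidden",
    4404: "session not found",
    4500: "server error",
    4502: "backend unavailable",
}

# Assumption: only server-side failures are worth retrying; auth and
# request errors will fail the same way until the caller changes something.
RETRYABLE_CODES = frozenset({4500, 4502})


def describe_close(code: int) -> str:
    """Return a short human-readable label for a WebSocket close code."""
    return CLOSE_CODES.get(code, f"unknown close code {code}")


def should_retry(code: int) -> bool:
    """Decide whether a reconnect attempt makes sense for this close code."""
    return code in RETRYABLE_CODES
```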
Timeouts
- listening with no client frames for 30 s --> close with code 1000 (idle_timeout). The idle timeout only fires in the listening state -- sessions in thinking or speaking are actively working (LLM inference, delegated callback, TTS streaming) and are not subject to the idle timer.
- thinking for >60 s --> error close (LLM stuck).
- speaking for >120 s --> error close (TTS stuck).
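The three timers can be summarised as a single decision function, which is handy for client-side watchdogs or tests. This is a sketch of the rules as documented, not the server's code; the function name and the returned labels are made up for the example.

```python
from typing import Optional

IDLE_LIMIT_S = 30.0       # listening: no client frames for 30 s
THINKING_LIMIT_S = 60.0   # thinking: more than 60 s
SPEAKING_LIMIT_S = 120.0  # speaking: more than 120 s


def timeout_outcome(state: str, elapsed_s: float) -> Optional[str]:
    """Return the expected outcome once a timer has expired, else None.

    elapsed_s is the time since the last client frame for 'listening',
    and the time spent in the state for 'thinking' and 'speaking'.
    """
    if state == "listening":
        return "close_1000_idle_timeout" if elapsed_s >= IDLE_LIMIT_S else None
    if state == "thinking":
        return "error_close_llm_stuck" if elapsed_s > THINKING_LIMIT_S else None
    if state == "speaking":
        return "error_close_tts_stuck" if elapsed_s > SPEAKING_LIMIT_S else None
    return None  # other states have no documented timer
```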
State machine
Five states. Each transition is reported in a {"type":"state"} event.
State events carry a reason field for non-default transitions. Common values: opened, utterance_end, agent_first_frame, agent_done, interrupted_by_user, interrupted_by_error, idle_timeout, caller_terminated, protocol_close, error_<code>.
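Since most reason values are fixed strings and only the error_<code> family carries a variable part, a client can normalise them with a few lines. The helper name, the "default" label for a missing reason, and the example code string in the usage note are hypothetical.

```python
from typing import Optional, Tuple

ERROR_PREFIX = "error_"


def parse_reason(reason: Optional[str]) -> Tuple[str, Optional[str]]:
    """Split a state-event reason into (kind, error_code)."""
    if reason is None:
        return ("default", None)              # default transition, no reason sent
    if reason.startswith(ERROR_PREFIX):
        return ("error", reason[len(ERROR_PREFIX):])
    return (reason, None)                     # fixed values pass through unchanged
```

For instance, parse_reason("error_SOME_CODE") yields ("error", "SOME_CODE"), while parse_reason("interrupted_by_error") is returned unchanged because it does not use the error_ prefix.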