API Reference

ScaiVoice exposes a thin REST surface for session lifecycle plus one WebSocket that drives the entire conversation.

Choosing your integration shape (Phase 3a)#

ScaiVoice is intentionally agnostic about how you manage conversation memory and tool execution. Two integration patterns, picked at session-create via history_mode:

Mode	What ScaiVoice owns	What the consumer owns
`history_mode: 'server'` (default)	conversation history (in-memory, lost on reconnect); LLM model selection	bot personality (system prompt sent once at session-open; baked into the running history)
`history_mode: 'client'`	speech I/O glue only — STT, TTS, state machine, cancellation	full conversation history sent on every text frame; tool definitions; tool execution; RAG context; persona

The 'client' mode is the "bring your own LLM ecosystem" path. Your existing agent code in ScaiBot / ScaiWave / wherever continues to own the cognition; ScaiVoice wraps mic/STT on the way in and TTS/audio on the way out. Every text frame carries the full messages array (i.e. the conversation the agent is currently driving) plus the tools definition list. When the LLM emits tool calls, ScaiVoice forwards them as agent_tool_call events; you execute the tools in your environment and reply with tool_result frames; ScaiVoice continues the agent loop.

Phase 3b adds a second toggle, cognition_mode. With cognition_mode: 'delegated' and a cognition_callback_url set at session-create, ScaiVoice doesn't talk to any LLM — it POSTs each user utterance to your URL with {session_id, tenant_id, user_id, user_input, turn_index, request_id} and pipes the streaming response body straight into TTS. Optional cognition_callback_auth_token is forwarded as Authorization: Bearer … so your endpoint can verify the call. Tools, history, RAG, persona, error-handling — all on your side.

Most ScaiBot / ScaiWave integrations will pair history_mode: 'client' with cognition_mode: 'delegated' and own the entire agent layer in their own infrastructure. Simpler demos can stick with the all-server defaults.

Sessions REST#

All endpoints under /v1/modules/scaivoice/sessions. Permission: scaivoice:use.

`POST /sessions`#

Create a voice session. Body:

Field	Required	Notes
`voice_id`	yes	A voice the caller can see in the ScaiSpeak library. Validated up-front — cross-scope returns 404.
`llm_model`	yes	An LLM model slug the caller can use.
`language_hint`	no	2-letter ISO code. Used by STT and by the TTS text-normaliser.
`wake_word_enabled`	no	When true, the client is responsible for emitting `{"type":"wake"}` frames. Default false.
`vad_enabled`	no	When true, the client emits `{"type":"vad"}` frames; Phase 1 uses them for auto barge-in. Default false.
`speaker_recognition`	no	When true, the server attaches `speaker_id` to transcript frames. Phase 2; tenant opt-in required. Default false.
`diarize`	no	When true, anonymous speaker labels (speaker_0/1/...) flow through STT segments. Default false.
`tools_enabled`	no	When true, the LLM gets tool definitions (Phase 3). Default false.
`normalize_text`	no	Toggle for the ScaiSpeak text-prep pipeline. `true` / `false` / omit for tenant default.
`instructions`	no	Free-text style / emotion / delivery guidance prepended to every TTS call in this session. Example: `"cheerful and energetic"`. Overrides the voice's `default_instructions` when set.
`speed`	no	Speaking speed, 0.5--2.0. Overrides the voice's `default_speed` when set.
`cfg_value`	no	Cloning-fidelity tradeoff, 0.5--5.0. Higher values stay closer to the reference voice. Overrides the voice's `default_cfg_value`. Meaningful for cloned voices only.
`warmup_trim_ms`	no	Milliseconds to trim from the start of generated audio. Overrides the voice's `default_warmup_trim_ms`. 0 to disable. Meaningful for cloned voices only.

Voice defaults merge chain#

TTS parameters resolve through a three-level precedence chain:

Engine default -- built-in values (speed 1.0, cfg ~2.0, no instructions, no trim).
Voice default -- default_instructions, default_speed, default_cfg_value, default_warmup_trim_ms on the voice row, set via PATCH /voices/{id} in ScaiSpeak.
Session override -- instructions, speed, cfg_value, warmup_trim_ms on POST /sessions.

Each level overrides the one before it. A session that omits a field inherits the voice default; a voice that omits a default inherits the engine default. This lets voice owners bake in per-voice tuning while still allowing session-level control when needed.

Returns 201 Created:

json
{
  "session_id": "ses_abc123",
  "ws_url": "/v1/modules/scaivoice/sessions/ses_abc123/stream",
  "state": "idle"
}

`GET /sessions/{session_id}`#

Returns the full session row including state, timestamps, turn count, and char counters. 404 on cross-tenant lookups (info-leak prevention).

`DELETE /sessions/{session_id}`#

Marks the session terminated. Doesn't disconnect any in-flight WS — that's the WS handler's responsibility on the next state check. Idempotent.

Session WebSocket#

scdoc

1	`WS /v1/modules/scaivoice/sessions/{session_id}/stream?token=<jwt>`

Authentication: bearer token via ?token= query param. Browsers can't set headers on the WS upgrade, so query is the only browser-direct option. Query params matching token are redacted in access logs.

Open handshake#

First client frame must be:

json
{"type": "open"}

Server responds with:

json
{"type": "ready", "session_id": "ses_abc123", "voice_id": "vc_..."}

then transitions the state machine to listening and emits a state event.

Client → Server frames#

Frame	Purpose	Phase 0 behaviour
`{"type":"open"}`	First frame; opens the session	Validated; transitions to listening
binary	Mic frames (16 kHz PCM16 mono)	Forwarded to ScaiEcho STT
`{"type":"text","delta":"...","messages?","tools?"}`	Typed-input override. Phase 3a: `messages` (full conversation incl. the user's latest turn) is required when the session was opened with `history_mode:'client'`, forbidden when `'server'`. Optional `tools` array (per-turn).
`{"type":"tool_result","tool_call_id","content"}`	Phase 3a: response to a server-emitted `agent_tool_call`. Routed to the active turn's queue; stale results are dropped.
`{"type":"interrupt"}`	Stop in-flight LLM + TTS	Cancels current turn within ~100 ms
`{"type":"vad","speaking":true/false}`	Client VAD signal	Phase 1: `speaking:true` during `thinking`/`speaking` auto-cancels the current turn within ~100 ms. Other states + `speaking:false` are no-ops. See the client VAD tutorial.
`{"type":"wake","confidence":0.93}`	Wake word detected	Phase 2: when the session was opened with `wake_word_enabled:true`, arms the session for the next utterance. Server emits `{"type":"wake_state","armed":true}`. Idempotent. No-op when wake gating is off. See the client wake-word tutorial.
`{"type":"close"}`	End the session	Clean close, code 1000

Server → Client frames#

Frame	When
`{"type":"ready", session_id, voice_id}`	After open is validated
`{"type":"state", state, reason?}`	Every state transition
`{"type":"transcript", text, is_final, speaker_id?}`	STT segment from ScaiEcho
`{"type":"agent_text", delta}`	LLM token stream
`{"type":"agent_tool_call", tool_call_id, name, arguments}`	Phase 3a: LLM emitted a tool call. Consumer executes + sends `{"type":"tool_result"}` back.
binary	TTS audio frames (WAV)
`{"type":"agent_done", stats:{chars, interrupted?, reason?}}`	Turn complete
`{"type":"wake_state", armed, wake_word_enabled}`	Phase 2 — when `wake_word_enabled:true`, fires on every armed/disarmed transition (and once at open with `armed:false`)
`{"type":"info", code, message}`	Informational (non-error) status. Phase 2 emits `SCAIVOICE_WAKE_REQUIRED` when a text frame is dropped because the session isn't armed
`{"type":"error", code, message}`	Anything failed
`{"type":"expression_hint", ...}`	Reserved for forward compatibility; not emitted in v1

Close codes#

Code	Meaning
`4401`	Unauthorized — missing or invalid token
`4403`	Forbidden — missing `scaivoice:use` or no tenant context
`4404`	Session not found
`4400`	Bad request — bad first frame, malformed JSON, session already ended
`4502`	Backend unavailable (downstream STT/LLM/TTS node not reachable)
`4500`	Server error
`1000`	Normal close — caller terminated, idle timeout, or session done

Timeouts#

listening with no client frames for 30 s --> close with code 1000 (idle_timeout). The idle timeout only fires in the listening state -- sessions in thinking or speaking are actively working (LLM inference, delegated callback, TTS streaming) and are not subject to the idle timer.
thinking for >60 s --> error close (LLM stuck).
speaking for >120 s --> error close (TTS stuck).

State machine#

Five states. Transitions you'll see in {"type":"state"} events: