Platform
ScaiWave ScaiGrid ScaiCore ScaiBot ScaiDrive ScaiKey Models Tools & Services
Solutions
Organisations Developers Internet Service Providers Managed Service Providers AI-in-a-Box
Resources
Support Documentation Blog Downloads
Company
About Research Careers Investment Opportunities Contact
Log in

API Reference

ScaiVoice exposes a thin REST surface for session lifecycle plus one WebSocket that drives the entire conversation.

Choosing your integration shape (Phase 3a)#

ScaiVoice is intentionally agnostic about how you manage conversation memory and tool execution. Two integration patterns, picked at session-create via history_mode:

Mode What ScaiVoice owns What the consumer owns
history_mode: 'server' (default) conversation history (in-memory, lost on reconnect); LLM model selection bot personality (system prompt sent once at session-open; baked into the running history)
history_mode: 'client' speech I/O glue only — STT, TTS, state machine, cancellation full conversation history sent on every text frame; tool definitions; tool execution; RAG context; persona

The 'client' mode is the "bring your own LLM ecosystem" path. Your existing agent code in ScaiBot / ScaiWave / wherever continues to own the cognition; ScaiVoice wraps mic/STT on the way in and TTS/audio on the way out. Every text frame carries the full messages array (i.e. the conversation the agent is currently driving) plus the tools definition list. When the LLM emits tool calls, ScaiVoice forwards them as agent_tool_call events; you execute the tools in your environment and reply with tool_result frames; ScaiVoice continues the agent loop.

Phase 3b adds a second toggle, cognition_mode. With cognition_mode: 'delegated' and a cognition_callback_url set at session-create, ScaiVoice doesn't talk to any LLM — it POSTs each user utterance to your URL with {session_id, tenant_id, user_id, user_input, turn_index, request_id} and pipes the streaming response body straight into TTS. Optional cognition_callback_auth_token is forwarded as Authorization: Bearer … so your endpoint can verify the call. Tools, history, RAG, persona, error-handling — all on your side.

Most ScaiBot / ScaiWave integrations will pair history_mode: 'client' with cognition_mode: 'delegated' and own the entire agent layer in their own infrastructure. Simpler demos can stick with the all-server defaults.

Sessions REST#

All endpoints under /v1/modules/scaivoice/sessions. Permission: scaivoice:use.

POST /sessions#

Create a voice session. Body:

Field Required Notes
voice_id yes A voice the caller can see in the ScaiSpeak library. Validated up-front — cross-scope returns 404.
llm_model yes An LLM model slug the caller can use.
language_hint no 2-letter ISO code. Used by STT and by the TTS text-normaliser.
wake_word_enabled no When true, the client is responsible for emitting {"type":"wake"} frames. Default false.
vad_enabled no When true, the client emits {"type":"vad"} frames; Phase 1 uses them for auto barge-in. Default false.
speaker_recognition no When true, the server attaches speaker_id to transcript frames. Phase 2; tenant opt-in required. Default false.
diarize no When true, anonymous speaker labels (speaker_0/1/...) flow through STT segments. Default false.
tools_enabled no When true, the LLM gets tool definitions (Phase 3). Default false.
normalize_text no Toggle for the ScaiSpeak text-prep pipeline. true / false / omit for tenant default.
instructions no Free-text style / emotion / delivery guidance prepended to every TTS call in this session. Example: "cheerful and energetic". Overrides the voice's default_instructions when set.
speed no Speaking speed, 0.5--2.0. Overrides the voice's default_speed when set.
cfg_value no Cloning-fidelity tradeoff, 0.5--5.0. Higher values stay closer to the reference voice. Overrides the voice's default_cfg_value. Meaningful for cloned voices only.
warmup_trim_ms no Milliseconds to trim from the start of generated audio. Overrides the voice's default_warmup_trim_ms. 0 to disable. Meaningful for cloned voices only.

Voice defaults merge chain#

TTS parameters resolve through a three-level precedence chain:

  1. Engine default -- built-in values (speed 1.0, cfg ~2.0, no instructions, no trim).
  2. Voice default -- default_instructions, default_speed, default_cfg_value, default_warmup_trim_ms on the voice row, set via PATCH /voices/{id} in ScaiSpeak.
  3. Session override -- instructions, speed, cfg_value, warmup_trim_ms on POST /sessions.

Each level overrides the one before it. A session that omits a field inherits the voice default; a voice that omits a default inherits the engine default. This lets voice owners bake in per-voice tuning while still allowing session-level control when needed.

Returns 201 Created:

json
1
2
3
4
5
{
  "session_id": "ses_abc123",
  "ws_url": "/v1/modules/scaivoice/sessions/ses_abc123/stream",
  "state": "idle"
}

GET /sessions/{session_id}#

Returns the full session row including state, timestamps, turn count, and char counters. 404 on cross-tenant lookups (info-leak prevention).

DELETE /sessions/{session_id}#

Marks the session terminated. Doesn't disconnect any in-flight WS — that's the WS handler's responsibility on the next state check. Idempotent.

Session WebSocket#

scdoc
1
WS /v1/modules/scaivoice/sessions/{session_id}/stream?token=<jwt>

Authentication: bearer token via ?token= query param. Browsers can't set headers on the WS upgrade, so query is the only browser-direct option. Query params matching token are redacted in access logs.

Open handshake#

First client frame must be:

json
1
{"type": "open"}

Server responds with:

json
1
{"type": "ready", "session_id": "ses_abc123", "voice_id": "vc_..."}

then transitions the state machine to listening and emits a state event.

Client → Server frames#

Frame Purpose Phase 0 behaviour
{"type":"open"} First frame; opens the session Validated; transitions to listening
binary Mic frames (16 kHz PCM16 mono) Forwarded to ScaiEcho STT
{"type":"text","delta":"...","messages?","tools?"} Typed-input override. Phase 3a: messages (full conversation incl. the user's latest turn) is required when the session was opened with history_mode:'client', forbidden when 'server'. Optional tools array (per-turn).
{"type":"tool_result","tool_call_id","content"} Phase 3a: response to a server-emitted agent_tool_call. Routed to the active turn's queue; stale results are dropped.
{"type":"interrupt"} Stop in-flight LLM + TTS Cancels current turn within ~100 ms
{"type":"vad","speaking":true/false} Client VAD signal Phase 1: speaking:true during thinking/speaking auto-cancels the current turn within ~100 ms. Other states + speaking:false are no-ops. See the client VAD tutorial.
{"type":"wake","confidence":0.93} Wake word detected Phase 2: when the session was opened with wake_word_enabled:true, arms the session for the next utterance. Server emits {"type":"wake_state","armed":true}. Idempotent. No-op when wake gating is off. See the client wake-word tutorial.
{"type":"close"} End the session Clean close, code 1000

Server → Client frames#

Frame When
{"type":"ready", session_id, voice_id} After open is validated
{"type":"state", state, reason?} Every state transition
{"type":"transcript", text, is_final, speaker_id?} STT segment from ScaiEcho
{"type":"agent_text", delta} LLM token stream
{"type":"agent_tool_call", tool_call_id, name, arguments} Phase 3a: LLM emitted a tool call. Consumer executes + sends {"type":"tool_result"} back.
binary TTS audio frames (WAV)
{"type":"agent_done", stats:{chars, interrupted?, reason?}} Turn complete
{"type":"wake_state", armed, wake_word_enabled} Phase 2 — when wake_word_enabled:true, fires on every armed/disarmed transition (and once at open with armed:false)
{"type":"info", code, message} Informational (non-error) status. Phase 2 emits SCAIVOICE_WAKE_REQUIRED when a text frame is dropped because the session isn't armed
{"type":"error", code, message} Anything failed
{"type":"expression_hint", ...} Reserved for forward compatibility; not emitted in v1

Close codes#

Code Meaning
4401 Unauthorized — missing or invalid token
4403 Forbidden — missing scaivoice:use or no tenant context
4404 Session not found
4400 Bad request — bad first frame, malformed JSON, session already ended
4502 Backend unavailable (downstream STT/LLM/TTS node not reachable)
4500 Server error
1000 Normal close — caller terminated, idle timeout, or session done

Timeouts#

  • listening with no client frames for 30 s --> close with code 1000 (idle_timeout). The idle timeout only fires in the listening state -- sessions in thinking or speaking are actively working (LLM inference, delegated callback, TTS streaming) and are not subject to the idle timer.
  • thinking for >60 s --> error close (LLM stuck).
  • speaking for >120 s --> error close (TTS stuck).

State machine#

Five states. Transitions you'll see in {"type":"state"} events:

scdoc
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
   ┌─────────┐  open    ┌───────────┐  user utterance end  ┌──────────┐
   │  idle   │ ───────► │ listening │ ──────────────────►  │ thinking │
   └─────────┘          └───────────┘                      └────┬─────┘
                              ▲                                  │
                              │ ready_for_next                   │ first TTS frame
                              │                                  ▼
                        ┌─────┴───────┐                    ┌──────────┐
                        │ interrupted │ ◄───────interrupt──│ speaking │
                        └─────────────┘                    └──────────┘
                              │                                  │ agent_done
                              ▼                                  ▼
                        ┌───────────┐                      ┌───────────┐
                        │ listening │ ◄────────────────────│ listening │
                        └───────────┘                      └───────────┘

State events carry a reason field for non-default transitions. Common values: opened, utterance_end, agent_first_frame, agent_done, interrupted_by_user, interrupted_by_error, idle_timeout, caller_terminated, protocol_close, error_<code>.

Updated 2026-05-25 23:46:26 View source (.md) rev 3