---
summary: REST endpoints to create / poll / terminate voice sessions, and the WebSocket
  protocol for the live conversation pipeline.
title: API Reference
path: reference/api
status: published
---

ScaiVoice exposes a thin REST surface for session lifecycle plus one WebSocket that drives the entire conversation.

## Choosing your integration shape (Phase 3a)

ScaiVoice is intentionally agnostic about how you manage conversation memory and tool execution. Two integration patterns, picked at session-create via `history_mode`:

| Mode | What ScaiVoice owns | What the consumer owns |
|---|---|---|
| `history_mode: 'server'` (default) | conversation history (in-memory, lost on reconnect); LLM model selection | bot personality (system prompt sent once at session-open; baked into the running history) |
| `history_mode: 'client'` | speech I/O glue only — STT, TTS, state machine, cancellation | full conversation history sent on every text frame; tool definitions; tool execution; RAG context; persona |

The `'client'` mode is the "bring your own LLM ecosystem" path. Your existing agent code in ScaiBot / ScaiWave / wherever continues to own the cognition; ScaiVoice wraps mic/STT on the way in and TTS/audio on the way out. Every text frame carries the full `messages` array (i.e. the conversation the agent is currently driving) plus the `tools` definition list. When the LLM emits tool calls, ScaiVoice forwards them as `agent_tool_call` events; you execute the tools in your environment and reply with `tool_result` frames; ScaiVoice continues the agent loop.

Phase 3b adds a second toggle, `cognition_mode`. With `cognition_mode: 'delegated'` and a `cognition_callback_url` set at session-create, ScaiVoice doesn't talk to any LLM — it POSTs each user utterance to your URL with `{session_id, tenant_id, user_id, user_input, turn_index, request_id}` and pipes the streaming response body straight into TTS. Optional `cognition_callback_auth_token` is forwarded as `Authorization: Bearer …` so your endpoint can verify the call. Tools, history, RAG, persona, error-handling — all on your side.

Most ScaiBot / ScaiWave integrations will pair `history_mode: 'client'` with `cognition_mode: 'delegated'` and own the entire agent layer in their own infrastructure. Simpler demos can stick with the all-server defaults.

## Sessions REST

All endpoints under `/v1/modules/scaivoice/sessions`. Permission: `scaivoice:use`.

### `POST /sessions`

Create a voice session. Body:

| Field | Required | Notes |
|---|---|---|
| `voice_id` | yes | A voice the caller can see in the ScaiSpeak library. Validated up-front — cross-scope returns 404. |
| `llm_model` | yes | An LLM model slug the caller can use. |
| `language_hint` | no | 2-letter ISO code. Used by STT and by the TTS text-normaliser. |
| `wake_word_enabled` | no | When true, the client is responsible for emitting `{"type":"wake"}` frames. Default false. |
| `vad_enabled` | no | When true, the client emits `{"type":"vad"}` frames; Phase 1 uses them for auto barge-in. Default false. |
| `speaker_recognition` | no | When true, the server attaches `speaker_id` to transcript frames. Phase 2; tenant opt-in required. Default false. |
| `diarize` | no | When true, anonymous speaker labels (speaker_0/1/...) flow through STT segments. Default false. |
| `tools_enabled` | no | When true, the LLM gets tool definitions (Phase 3). Default false. |
| `normalize_text` | no | Toggle for the ScaiSpeak text-prep pipeline. `true` / `false` / omit for tenant default. |
| `instructions` | no | Free-text style / emotion / delivery guidance prepended to every TTS call in this session. Example: `"cheerful and energetic"`. Overrides the voice's `default_instructions` when set. |
| `speed` | no | Speaking speed, 0.5--2.0. Overrides the voice's `default_speed` when set. |
| `cfg_value` | no | Cloning-fidelity tradeoff, 0.5--5.0. Higher values stay closer to the reference voice. Overrides the voice's `default_cfg_value`. Meaningful for cloned voices only. |
| `warmup_trim_ms` | no | Milliseconds to trim from the start of generated audio. Overrides the voice's `default_warmup_trim_ms`. 0 to disable. Meaningful for cloned voices only. |

### Voice defaults merge chain

TTS parameters resolve through a three-level precedence chain:

1. **Engine default** -- built-in values (speed 1.0, cfg ~2.0, no instructions, no trim).
2. **Voice default** -- `default_instructions`, `default_speed`, `default_cfg_value`, `default_warmup_trim_ms` on the voice row, set via `PATCH /voices/{id}` in ScaiSpeak.
3. **Session override** -- `instructions`, `speed`, `cfg_value`, `warmup_trim_ms` on `POST /sessions`.

Each level overrides the one before it. A session that omits a field inherits the voice default; a voice that omits a default inherits the engine default. This lets voice owners bake in per-voice tuning while still allowing session-level control when needed.

Returns `201 Created`:

```json
{
  "session_id": "ses_abc123",
  "ws_url": "/v1/modules/scaivoice/sessions/ses_abc123/stream",
  "state": "idle"
}
```

### `GET /sessions/{session_id}`

Returns the full session row including state, timestamps, turn count, and char counters. 404 on cross-tenant lookups (info-leak prevention).

### `DELETE /sessions/{session_id}`

Marks the session terminated. Doesn't disconnect any in-flight WS — that's the WS handler's responsibility on the next state check. Idempotent.

## Session WebSocket

```
WS /v1/modules/scaivoice/sessions/{session_id}/stream?token=<jwt>
```

Authentication: bearer token via `?token=` query param. Browsers can't set headers on the WS upgrade, so query is the only browser-direct option. Query params matching `token` are redacted in access logs.

### Open handshake

First client frame must be:

```json
{"type": "open"}
```

Server responds with:

```json
{"type": "ready", "session_id": "ses_abc123", "voice_id": "vc_..."}
```

then transitions the state machine to `listening` and emits a `state` event.

### Client → Server frames

| Frame | Purpose | Phase 0 behaviour |
|---|---|---|
| `{"type":"open"}` | First frame; opens the session | Validated; transitions to listening |
| binary | Mic frames (16 kHz PCM16 mono) | Forwarded to ScaiEcho STT |
| `{"type":"text","delta":"...","messages?","tools?"}` | Typed-input override. **Phase 3a**: `messages` (full conversation incl. the user's latest turn) is required when the session was opened with `history_mode:'client'`, forbidden when `'server'`. Optional `tools` array (per-turn). |
| `{"type":"tool_result","tool_call_id","content"}` | **Phase 3a**: response to a server-emitted `agent_tool_call`. Routed to the active turn's queue; stale results are dropped. |
| `{"type":"interrupt"}` | Stop in-flight LLM + TTS | Cancels current turn within ~100 ms |
| `{"type":"vad","speaking":true/false}` | Client VAD signal | **Phase 1**: `speaking:true` during `thinking`/`speaking` auto-cancels the current turn within ~100 ms. Other states + `speaking:false` are no-ops. See the [client VAD tutorial](../tutorials/client-vad-integration). |
| `{"type":"wake","confidence":0.93}` | Wake word detected | **Phase 2**: when the session was opened with `wake_word_enabled:true`, arms the session for the next utterance. Server emits `{"type":"wake_state","armed":true}`. Idempotent. No-op when wake gating is off. See the [client wake-word tutorial](../tutorials/client-wake-word). |
| `{"type":"close"}` | End the session | Clean close, code 1000 |

### Server → Client frames

| Frame | When |
|---|---|
| `{"type":"ready", session_id, voice_id}` | After open is validated |
| `{"type":"state", state, reason?}` | Every state transition |
| `{"type":"transcript", text, is_final, speaker_id?}` | STT segment from ScaiEcho |
| `{"type":"agent_text", delta}` | LLM token stream |
| `{"type":"agent_tool_call", tool_call_id, name, arguments}` | **Phase 3a**: LLM emitted a tool call. Consumer executes + sends `{"type":"tool_result"}` back. |
| binary | TTS audio frames (WAV) |
| `{"type":"agent_done", stats:{chars, interrupted?, reason?}}` | Turn complete |
| `{"type":"wake_state", armed, wake_word_enabled}` | Phase 2 — when `wake_word_enabled:true`, fires on every armed/disarmed transition (and once at open with `armed:false`) |
| `{"type":"info", code, message}` | Informational (non-error) status. Phase 2 emits `SCAIVOICE_WAKE_REQUIRED` when a text frame is dropped because the session isn't armed |
| `{"type":"error", code, message}` | Anything failed |
| `{"type":"expression_hint", ...}` | **Reserved** for forward compatibility; not emitted in v1 |

### Close codes

| Code | Meaning |
|---|---|
| `4401` | Unauthorized — missing or invalid token |
| `4403` | Forbidden — missing `scaivoice:use` or no tenant context |
| `4404` | Session not found |
| `4400` | Bad request — bad first frame, malformed JSON, session already ended |
| `4502` | Backend unavailable (downstream STT/LLM/TTS node not reachable) |
| `4500` | Server error |
| `1000` | Normal close — caller terminated, idle timeout, or session done |

### Timeouts

- `listening` with no client frames for 30 s --> close with code `1000` (`idle_timeout`). The idle timeout only fires in the `listening` state -- sessions in `thinking` or `speaking` are actively working (LLM inference, delegated callback, TTS streaming) and are not subject to the idle timer.
- `thinking` for >60 s --> error close (LLM stuck).
- `speaking` for >120 s --> error close (TTS stuck).

## State machine

Five states. Transitions you'll see in `{"type":"state"}` events:

```
   ┌─────────┐  open    ┌───────────┐  user utterance end  ┌──────────┐
   │  idle   │ ───────► │ listening │ ──────────────────►  │ thinking │
   └─────────┘          └───────────┘                      └────┬─────┘
                              ▲                                  │
                              │ ready_for_next                   │ first TTS frame
                              │                                  ▼
                        ┌─────┴───────┐                    ┌──────────┐
                        │ interrupted │ ◄───────interrupt──│ speaking │
                        └─────────────┘                    └──────────┘
                              │                                  │ agent_done
                              ▼                                  ▼
                        ┌───────────┐                      ┌───────────┐
                        │ listening │ ◄────────────────────│ listening │
                        └───────────┘                      └───────────┘
```

State events carry a `reason` field for non-default transitions. Common values: `opened`, `utterance_end`, `agent_first_frame`, `agent_done`, `interrupted_by_user`, `interrupted_by_error`, `idle_timeout`, `caller_terminated`, `protocol_close`, `error_<code>`.