---
summary: User-visible changes to ScaiVoice.
title: Changelog
path: changelog
status: published
---

## v0.6.0 — Voice control + stabilisation (2026-05-25)

Per-session voice tuning and a round of infrastructure hardening.

**Voice control:**

- **Per-session TTS parameters.** `POST /sessions` now accepts `instructions`, `speed`, `cfg_value`, and `warmup_trim_ms`. These override the voice's defaults (set via ScaiSpeak `PATCH /voices/{id}`) for the lifetime of the session. Omitted fields inherit the voice default; voices without defaults inherit the engine default (speed 1.0, cfg ~2.0, no instructions, no trim).
- **Three-level merge chain.** Engine default < voice default < session override. Documented in the [API reference](./reference/api).
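
The merge chain above can be sketched as a layered dictionary merge. This is a minimal illustration, not the ScaiVoice implementation: the function name `resolve_tts_params`, the treatment of `None` as "omitted", and the use of `0` for "no trim" are assumptions. The engine `cfg_value` is only stated as approximately 2.0.

```python
from typing import Any, Optional

# Engine defaults as described in the release note. cfg_value is given there
# as "~2.0", and 0 standing for "no trim" is an assumption of this sketch.
ENGINE_DEFAULTS: dict[str, Any] = {
    "instructions": None,
    "speed": 1.0,
    "cfg_value": 2.0,
    "warmup_trim_ms": 0,
}


def resolve_tts_params(
    voice_defaults: Optional[dict[str, Any]],
    session_overrides: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Apply engine default < voice default < session override."""
    resolved = dict(ENGINE_DEFAULTS)
    # Later layers win. A missing key or a None value means "inherit".
    for layer in (voice_defaults or {}, session_overrides or {}):
        for key, value in layer.items():
            if key in ENGINE_DEFAULTS and value is not None:
                resolved[key] = value
    return resolved
```

For example, a voice that defaults to `speed: 0.9` combined with a session that sets `speed: 1.2` resolves to `1.2`, while any field neither layer sets falls through to the engine default.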

**Stabilisation fixes:**

- **Idle timeout scoped to listening state.** The 30-second idle timeout now fires only when the session is in the `listening` state. Previously it could trigger during `thinking` or `speaking`, killing sessions mid-reply when a delegated callback or TTS stream took longer than 30 seconds.
- **Streaming STT uses raw PCM.** The STT path now sends `audio/L16;rate=16000;channels=1` (raw PCM) via the `buffer_realtime_audio` path for real-time partial deltas, replacing the previous framing. This lowers latency for interim transcripts.
- **No fixed gRPC deadlines on streaming paths.** Streaming STT and TTS gRPC calls no longer carry a fixed deadline. Streams live as long as activity flows; the session-level idle timeout handles cleanup.
- **Three-tier node health.** Node selection now distinguishes `healthy`, `warning`, and `stale` states with configurable margins. The router prefers healthy nodes, falls back to warning nodes, and never routes to stale nodes. Replaces the binary online/offline check.
- **Fresh DB reads for node resolution.** The audio-node resolver now opens a fresh database session for every lookup instead of reusing the WebSocket handler's long-lived session. Fixes intermittent `TTS_BACKEND_UNAVAILABLE` errors caused by stale REPEATABLE READ snapshots that couldn't see heartbeats committed after the WebSocket connected.
- **Heartbeat handler hardened.** The heartbeat acknowledgement path now retries up to 3 times on transient failures. A failed heartbeat no longer crashes the handler or leaves the node in an ambiguous state.
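
The three-tier selection rule could look roughly like the sketch below. It assumes health is derived from heartbeat age and that the "configurable margins" are two age thresholds. The `Node` type, the threshold values, and the tie-break on freshest heartbeat are all hypothetical and do not describe the actual router.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    node_id: str
    heartbeat_age_s: float  # seconds since the last heartbeat (assumed metric)


def classify(node: Node, warning_after_s: float = 15.0, stale_after_s: float = 45.0) -> str:
    """Map heartbeat age onto healthy / warning / stale (thresholds are placeholders)."""
    if node.heartbeat_age_s >= stale_after_s:
        return "stale"
    if node.heartbeat_age_s >= warning_after_s:
        return "warning"
    return "healthy"


def pick_node(nodes: Sequence[Node]) -> Optional[Node]:
    """Prefer healthy nodes, fall back to warning nodes, never return a stale node."""
    for tier in ("healthy", "warning"):
        candidates = [n for n in nodes if classify(n) == tier]
        if candidates:
            # Tie-break on the freshest heartbeat (an assumption of this sketch).
            return min(candidates, key=lambda n: n.heartbeat_age_s)
    return None
```

Returning `None` when only stale nodes remain leaves it to the caller to surface an error rather than routing audio to a node that may be gone.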

## v0.5.0 — Phase 3b delegated cognition (2026-05-24)

Consumers can now run their LLM call entirely outside ScaiGrid. ScaiVoice becomes pure speech I/O — it never touches an LLM in this mode.

**Shipped:**

- **`cognition_mode` session config.** `'inference_service'` (default — ScaiGrid's LLM gateway, current behaviour) or `'delegated'` (consumer-owned callback URL).
- **Delegated turn path.** When `cognition_mode='delegated'`, the orchestrator POSTs the user input to `cognition_callback_url` with body `{session_id, tenant_id, user_id, user_input, turn_index, request_id}` and streams the response body (any content type — read via `aiter_text`) as `agent_text` deltas. End-of-stream → TTS. Auth via optional `cognition_callback_auth_token` forwarded as `Authorization: Bearer …`.
- **Cancellation propagates through the HTTP stream.** Barge-in / VAD interrupts an in-flight callback turn within the same ~100 ms budget as the InferenceService path — the async-for over `aiter_text` exits early and the response closes via context-manager teardown.
- **Clean errors.** `SCAIVOICE_DELEGATED_CALLBACK_HTTP_ERROR` (non-200, with truncated body for debugging), `SCAIVOICE_DELEGATED_CALLBACK_TIMEOUT` (60s default), `SCAIVOICE_DELEGATED_CALLBACK_ERROR` (connection failures). Validated at session-create — delegated mode without a URL is rejected with `SCAIVOICE_BAD_CONFIG`.
- **Shared TTS pipe.** Extracted `_stream_to_tts` so the InferenceService path and the delegated path share the voxcpm mode resolution + dispatcher logic — single source of truth.
- **Token write-only.** `cognition_callback_auth_token` is persisted but never appears in `GET /sessions/{id}` responses; the read schema deliberately omits it.
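
As a rough illustration of the delegated turn request, the sketch below builds the headers and JSON body described above. The helper name `build_callback_request` and the `Content-Type: application/json` header are assumptions. Only the body fields and the `Authorization: Bearer …` forwarding come from the notes.

```python
import json
from typing import Optional


def build_callback_request(
    *,
    session_id: str,
    tenant_id: str,
    user_id: str,
    user_input: str,
    turn_index: int,
    request_id: str,
    auth_token: Optional[str] = None,
) -> tuple[dict[str, str], bytes]:
    """Return (headers, body) for one POST to cognition_callback_url."""
    headers = {"Content-Type": "application/json"}  # assumed, not stated in the notes
    if auth_token:
        # The optional token is forwarded as a bearer credential.
        headers["Authorization"] = f"Bearer {auth_token}"
    body = json.dumps(
        {
            "session_id": session_id,
            "tenant_id": tenant_id,
            "user_id": user_id,
            "user_input": user_input,
            "turn_index": turn_index,
            "request_id": request_id,
        }
    ).encode("utf-8")
    return headers, body
```

A consumer callback would parse this body, run its own agent loop, and stream plain text back. Each chunk of that stream is what surfaces as an `agent_text` delta.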

**The integrator now has three knobs at session-create:**

| Setting | Server-owned | Client-owned |
|---|---|---|
| Conversation history | `history_mode: 'server'` | `history_mode: 'client'` |
| Tool execution | n/a in server-history mode (the server-owned LLM has no tools); caller-supplied `tools` array in client-history mode | consumer's callback executes tools internally |
| LLM call | `cognition_mode: 'inference_service'` | `cognition_mode: 'delegated'` |

Each is independent. Most ScaiBot/ScaiWave integrations will pick `history_mode='client'` + `cognition_mode='delegated'` — full control of cognition; ScaiVoice handles only speech I/O. Smaller demos can stick with the all-server defaults.

## v0.4.0 — Phase 3a bring-your-own agent (2026-05-24)

ScaiVoice steps out of the agent-runtime business. Consumer products (ScaiBot, ScaiWave, the phone-taking sibling) bring their own LLM ecosystem — history, tools, RAG, persona — and ScaiVoice handles only the speech I/O glue.

**Shipped:**

- **`history_mode` session config.** `'server'` (default, current behaviour — ScaiVoice accumulates history) or `'client'` (consumer sends the full `messages` array on every text frame). Persisted on the session row.
- **Caller-owned `messages` per turn.** In client mode, the `text` frame must include a `messages` array. Validated via Pydantic; bad shapes surface `SCAIVOICE_BAD_MESSAGES`. Missing in client mode → `SCAIVOICE_HISTORY_REQUIRED`. Present in server mode → `SCAIVOICE_HISTORY_OWNED_BY_SERVER`.
- **Caller-supplied `tools` per turn.** Optional `tools` array on the text frame (client mode only). Passes verbatim to `InferenceService.chat_stream`. Per-turn — every utterance can carry a different tool set.
- **Multi-step agent loop.** When the LLM emits `finish_reason: tool_calls`, the orchestrator yields one `{"type":"agent_tool_call","tool_call_id","name","arguments"}` event per call, waits for matching `{"type":"tool_result","tool_call_id","content"}` frames from the consumer, appends the tool messages to the working conversation, and re-invokes the LLM. Loops until the LLM produces final text → pipes to TTS. Runaway-loop guard at 8 iterations.
- **Tool-call delta accumulator.** Coalesces streamed tool_call fragments (engines emit id+name on the first chunk, then argument fragments) into complete `ToolCall` objects for the consumer.
- **Cancellation through the tool wait.** Barge-in / VAD interrupts a turn even while waiting for a tool result — the queue race is `tool_result` vs `cancel`.
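
The tool-call delta accumulator can be pictured as follows. This is a sketch under assumptions: the flat fragment shape (`index`, `id`, `name`, `arguments`) and the class names are hypothetical, and the real `ToolCall` object may carry different fields.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool_call_id: str = ""
    name: str = ""
    arguments: str = ""  # JSON text, concatenated from streamed fragments


class ToolCallAccumulator:
    """Coalesce streamed tool_call fragments into complete calls."""

    def __init__(self) -> None:
        self._calls: dict[int, ToolCall] = {}

    def feed(self, fragment: dict) -> None:
        # The first chunk for an index carries id and name; later chunks
        # carry only argument text (the fragment shape is assumed here).
        call = self._calls.setdefault(fragment["index"], ToolCall())
        if fragment.get("id"):
            call.tool_call_id = fragment["id"]
        if fragment.get("name"):
            call.name = fragment["name"]
        call.arguments += fragment.get("arguments") or ""

    def finish(self) -> list[ToolCall]:
        # Emit calls in index order once the stream reports tool_calls.
        return [self._calls[i] for i in sorted(self._calls)]
```

Each completed call would then be surfaced to the consumer as one `agent_tool_call` event, and the orchestrator waits for the matching `tool_result` frame before re-invoking the LLM.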

**Why:** Voice-bot framework concerns end at "speech in, speech out". Conversation memory, tool execution policy, RAG context, and persona prompting belong in the consumer's agent layer (where they already exist). ScaiVoice's job is to be invisible plumbing for whatever LLM ecosystem the consumer brings.

**Phase 3b (deferred):** delegated cognition — consumer registers an HTTP callback or proxied WS as the "LLM" endpoint, ScaiVoice forwards user transcripts there and pipes the reply into TTS. Lets consumers run their agent loop entirely outside ScaiGrid. Will ship as a `cognition_mode` session-level toggle.

## v0.3.0 — Phase 2 wake-word + speaker-identify shape (2026-05-24)

Opt-in wake-word gating fully wired controller-side. Speaker-identification endpoint contract shipped with a clean stub pending ScaiInfer-side RPC.

**Shipped:**

- **Wake-word gating.** When `wake_word_enabled: true` is set at session-create, the WS handler drops text/utterance frames until a `{"type":"wake"}` frame arrives. After each turn completes, the session re-arms (one wake = one turn). Server emits `{"type":"wake_state","armed":<bool>}` on every transition so the client UI can render the right prompt.
- **`SCAIVOICE_WAKE_REQUIRED` info event.** Pre-wake text frames are dropped with this informational event so clients can show "say the wake word" guidance without ambiguity.
- **`POST /v1/modules/scaiecho/speakers/identify` endpoint shape.** Stable contract for one-shot speaker identification against tenant-enrolled speakers. Returns 503 `SCAIECHO_IDENTIFY_NOT_WIRED` today (ScaiInfer engine RPC pending — see [`REQUEST-SPEAKER-IDENTIFY-2026-05-24.md`](../../../integrations/scaiinfer/REQUEST-SPEAKER-IDENTIFY-2026-05-24.md)). Counts the tenant's enrolled-speaker pool in the error details so callers get useful telemetry even today.
- **`scaiecho:identify` module permission** added.
- **Client-side wake-word tutorial.** `tutorials/client-wake-word.md` covers openwakeword integration, Picovoice as an alternative, and the wake+VAD interaction pattern.
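
The "one wake = one turn" gating can be modelled with a small state holder. This is only a sketch: the class `WakeGate`, its method names, and the string return values are hypothetical, and it leaves out the `wake_state` notifications the server emits on each transition.

```python
class WakeGate:
    """Drop text frames until a wake frame arrives; close again after each turn."""

    def __init__(self, wake_word_enabled: bool) -> None:
        self.wake_word_enabled = wake_word_enabled
        # With gating disabled the gate is permanently open.
        self.awake = not wake_word_enabled

    def on_frame(self, frame: dict) -> str:
        kind = frame.get("type")
        if kind == "wake":
            self.awake = True
            return "woken"
        if kind == "text" and not self.awake:
            # Pre-wake text is dropped with an informational event.
            return "SCAIVOICE_WAKE_REQUIRED"
        return "accept"

    def on_turn_complete(self) -> None:
        # One wake buys exactly one turn when gating is enabled.
        if self.wake_word_enabled:
            self.awake = False
```

A session created without `wake_word_enabled` behaves as before, because the gate starts open and never closes.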

**Reserved (no-op until engine wiring lands):** `speaker_recognition` session flag still doesn't populate `speaker_id` on `transcript` frames — the orchestrator passes the field through but ScaiInfer's streaming-STT doesn't emit it yet. Lands transparently once the `IdentifySpeaker` RPC ships.

## v0.2.0 — Phase 1 barge-in (2026-05-24)

Concurrent pump loop + VAD-driven cancellation.

**Shipped:**

- **Concurrent pump loop.** The WS handler now reads client frames concurrently with the per-turn task. Control frames (interrupt, vad, close) propagate cancellation mid-turn — previously they were queued behind the in-flight turn iterator and only took effect after the turn naturally finished.
- **VAD-driven barge-in.** `{"type":"vad","speaking":true}` arriving during `thinking` or `speaking` cancels the current turn, transitions back to `listening` with `reason: "interrupted_by_user"`, and is ready for the next utterance within ~100 ms (asserted in unit tests). `speaking:false` is informational.
- **Supersede semantics.** A `{"type":"text"}` frame arriving mid-turn cancels the in-flight turn before kicking off the new one. No turn-task pileup on a session.
- **State machine update.** `thinking → listening` is now a legal transition (covers cancel-during-thinking and natural-end-without-TTS-audio cases). Other transitions unchanged.
- **Client-side VAD integration reference.** Added `tutorials/client-vad-integration.md` with the silero-vad-in-browser recipe and the emit-pattern recommendation.
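
The concurrent pump idea can be illustrated with plain `asyncio`: the turn runs as its own task while the handler keeps reading frames, so a control frame can cancel it mid-flight. The functions `run_turn`, `pump`, and `demo` are hypothetical stand-ins, with a long sleep in place of the real LLM and TTS work. The actual handler is not shown in these notes.

```python
import asyncio


async def run_turn(log: list[str]) -> None:
    """Stand-in for one LLM + TTS turn."""
    try:
        log.append("thinking")
        await asyncio.sleep(10)  # placeholder for the real streaming work
        log.append("done")
    except asyncio.CancelledError:
        log.append("cancelled")
        raise


async def pump(frames: asyncio.Queue, log: list[str]) -> str:
    """Read frames while the turn runs; return why the session is listening again."""
    turn = asyncio.create_task(run_turn(log))
    while True:
        get_frame = asyncio.create_task(frames.get())
        done, _ = await asyncio.wait({turn, get_frame}, return_when=asyncio.FIRST_COMPLETED)
        if turn in done:
            get_frame.cancel()
            return "turn_finished"
        frame = get_frame.result()
        is_barge_in = frame.get("type") == "vad" and frame.get("speaking") is True
        if frame.get("type") == "interrupt" or is_barge_in:
            turn.cancel()
            # Wait for the turn to unwind before reporting the new state.
            await asyncio.gather(turn, return_exceptions=True)
            return "interrupted_by_user"
        # Other frames (for example speaking:false) are informational here.


async def demo() -> tuple[str, list[str]]:
    frames: asyncio.Queue = asyncio.Queue()
    log: list[str] = []
    await frames.put({"type": "vad", "speaking": False})
    await frames.put({"type": "vad", "speaking": True})
    return await pump(frames, log), log
```

Running `asyncio.run(demo())` cancels the turn on the second frame. The first frame, with `speaking` set to false, is read and ignored without disturbing the turn.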

**Reserved (no-op until later phases):** wake-word, live speaker recognition, tools, expression_hint server frame, mic-piped end-of-utterance detection.

## v0.1.0 — Phase 0 framework foundation (2026-05-24)

Initial release. Framework, not a usable bot — consumer products build their own bot personality on top.

**Shipped:**

- New module `scaivoice` with sidebar entry (SuperAdmin telemetry only).
- ORM `mod_scaivoice_session` (21 columns, 3 indexes) capturing session config + state + audit counters.
- REST surface: `POST /sessions`, `GET /sessions/{id}`, `DELETE /sessions/{id}`.
- WebSocket: `WS /sessions/{id}/stream` with the full protocol (open/state/transcript/agent_text/agent_audio/agent_done/error frames + close codes 4401/4403/4404/4400/4502/4500/1000).
- State machine: idle / listening / thinking / speaking / interrupted / ended with explicit transition validation + idle/thinking/speaking timeouts.
- Cancellation primitive: `{"type":"interrupt"}` frame cancels in-flight LLM + TTS within ~100 ms.
- Bare-bones pipeline wired: ScaiEcho streaming STT, InferenceService.chat_stream, ScaiSpeak streaming TTS, all in-process.
- Module permissions: `scaivoice:use` (open sessions), `scaivoice:admin` (telemetry).
- Voxcpm mode resolver extracted from SpeakService to `modules.scaispeak.services.voxcpm_mode` so the orchestrator can reuse it.

**Reserved on the protocol (no-op in v0.1.0):**

- `{"type":"vad"}` frames — Phase 1 will wire them to auto barge-in.
- `{"type":"wake"}` frames — Phase 2 will wire them to state transitions.
- `speaker_recognition` flag on session config — Phase 2 will populate `speaker_id` on transcript frames.
- `tools_enabled` flag — Phase 3 will pipe tool calls through.
- `expression_hint` server frame — reserved slot for future avatar / expression metadata.

**Out of scope:**

- Avatar / lipsync — separate solution.
- Hosted bot personalities — consumer products own them.
