Changelog

v0.6.0 — Voice control + stabilisation (2026-05-25)

Per-session voice tuning and a round of infrastructure hardening.

Voice control:

Per-session TTS parameters. POST /sessions now accepts instructions, speed, cfg_value, and warmup_trim_ms. These override the voice's defaults (set via ScaiSpeak PATCH /voices/{id}) for the lifetime of the session. Omitted fields inherit the voice default; voices without defaults inherit the engine default (speed 1.0, cfg ~2.0, no instructions, no trim).
Three-level merge chain. Engine default < voice default < session override. Documented in the API reference.
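The merge chain can be pictured as three dictionaries layered in order. The following is a minimal sketch, not the shipped implementation: the function name `resolve_tts_params` is hypothetical, `cfg_value` 2.0 approximates the "~2.0" quoted above, and representing "no trim" as `0` and "omitted" as `None` are both assumptions.

```python
from typing import Any, Optional

# Engine defaults as quoted in the entry above. cfg is described as "~2.0",
# so 2.0 is an approximation; warmup_trim_ms=0 stands in for "no trim".
ENGINE_DEFAULTS: dict[str, Any] = {
    "instructions": None,
    "speed": 1.0,
    "cfg_value": 2.0,
    "warmup_trim_ms": 0,
}


def resolve_tts_params(
    voice_defaults: Optional[dict[str, Any]] = None,
    session_overrides: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Merge engine default < voice default < session override."""
    resolved = dict(ENGINE_DEFAULTS)
    # Later layers win, so the session override is applied last.
    for layer in (voice_defaults or {}, session_overrides or {}):
        for key, value in layer.items():
            # Assumption: a field set to None counts as omitted and inherits.
            if key in ENGINE_DEFAULTS and value is not None:
                resolved[key] = value
    return resolved


params = resolve_tts_params(
    voice_defaults={"speed": 0.9, "instructions": "calm and friendly"},
    session_overrides={"speed": 1.2},
)
print(params)
```

Here the session's `speed` wins over the voice's, `instructions` comes from the voice, and the remaining fields fall through to the engine defaults.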

Stabilisation fixes:

Idle timeout scoped to listening state. The 30-second idle timeout now fires only when the session is in the listening state. Previously it could trigger during thinking or speaking, killing sessions mid-reply when a delegated callback or TTS stream took longer than 30 seconds.
Streaming STT uses raw PCM. The STT path now sends audio/L16;rate=16000;channels=1 (raw PCM) via the buffer_realtime_audio path for real-time partial deltas, replacing the previous framing. Lower latency for interim transcripts.
No fixed gRPC deadlines on streaming paths. Streaming STT and TTS gRPC calls no longer carry a fixed deadline. Streams live as long as activity flows; the session-level idle timeout handles cleanup.
Three-tier node health. Node selection now distinguishes healthy, warning, and stale states with configurable margins. The router prefers healthy nodes, falls back to warning nodes, and never routes to stale nodes. Replaces the binary online/offline check.
Fresh DB reads for node resolution. The audio-node resolver now opens a fresh database session for every lookup instead of reusing the WebSocket handler's long-lived session. Fixes intermittent TTS_BACKEND_UNAVAILABLE errors caused by stale REPEATABLE READ snapshots that couldn't see heartbeats committed after the WebSocket connected.
Heartbeat handler hardened. The heartbeat acknowledgement path now retries up to 3 times on transient failures. A failed heartbeat no longer crashes the handler or leaves the node in an ambiguous state.
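The three-tier node health rule above can be sketched as follows. This assumes health is derived from heartbeat age, which the entry does not state explicitly; the margin values, the `Node` type, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical margins; the entry only says they are configurable.
WARNING_AFTER_S = 15.0
STALE_AFTER_S = 45.0


@dataclass
class Node:
    name: str
    last_heartbeat: float  # seconds, on the same clock as `now`


def classify(node: Node, now: float) -> str:
    """Map heartbeat age to one of the three health tiers."""
    age = now - node.last_heartbeat
    if age <= WARNING_AFTER_S:
        return "healthy"
    if age <= STALE_AFTER_S:
        return "warning"
    return "stale"


def pick_node(nodes: list[Node], now: float) -> Optional[Node]:
    """Prefer healthy nodes, fall back to warning, never return stale."""
    for tier in ("healthy", "warning"):
        candidates = [n for n in nodes if classify(n, now) == tier]
        if candidates:
            # Freshest heartbeat first within a tier (an arbitrary tie-break).
            return max(candidates, key=lambda n: n.last_heartbeat)
    return None


nodes = [Node("a", last_heartbeat=40.0), Node("b", last_heartbeat=80.0)]
print(pick_node(nodes, now=100.0))
```

Returning `None` when only stale nodes remain leaves the caller to decide how to report the missing backend rather than routing to a node that may be gone.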

v0.5.0 — Phase 3b delegated cognition (2026-05-24)

Consumers can now run their LLM call entirely outside ScaiGrid. ScaiVoice becomes pure speech I/O — it never touches an LLM in this mode.

Shipped:

cognition_mode session config. 'inference_service' (default — ScaiGrid's LLM gateway, current behaviour) or 'delegated' (consumer-owned callback URL).
Delegated turn path. When cognition_mode='delegated', the orchestrator POSTs the user input to cognition_callback_url with body {session_id, tenant_id, user_id, user_input, turn_index, request_id} and streams the response body (any content type — read via aiter_text) as agent_text deltas. End-of-stream → TTS. Auth via optional cognition_callback_auth_token forwarded as Authorization: Bearer ….
Cancellation propagates through the HTTP stream. Barge-in / VAD interrupts an in-flight callback turn within the same ~100 ms budget as the InferenceService path — the async-for over aiter_text exits early and the response closes via context-manager teardown.
Clean errors. SCAIVOICE_DELEGATED_CALLBACK_HTTP_ERROR (non-200, with truncated body for debugging), SCAIVOICE_DELEGATED_CALLBACK_TIMEOUT (60s default), SCAIVOICE_DELEGATED_CALLBACK_ERROR (connection failures). Validated at session-create — delegated mode without a URL is rejected with SCAIVOICE_BAD_CONFIG.
Shared TTS pipe. Extracted _stream_to_tts so the InferenceService path and the delegated path share the voxcpm mode resolution + dispatcher logic — single source of truth.
Token write-only. cognition_callback_auth_token is persisted but never appears in GET /sessions/{id} responses; the read schema deliberately omits it.
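From the consumer's side, a delegated callback has to check the bearer token, read the posted body, and stream text back. The sketch below is framework-free and hedged: `handle_turn` and `generate_reply` are hypothetical names, the 401/400 status choices are assumptions, and a real service would wrap this in whatever HTTP framework it already uses. Only the body fields and the `Authorization: Bearer` header come from the entries above.

```python
import json
from typing import Iterator, Optional

# Body fields listed in the delegated turn path entry above.
REQUIRED_FIELDS = (
    "session_id", "tenant_id", "user_id",
    "user_input", "turn_index", "request_id",
)


def generate_reply(user_input: str) -> Iterator[str]:
    """Placeholder for the consumer's own agent; yields text chunks."""
    for word in f"You said: {user_input}".split(" "):
        yield word + " "


def handle_turn(
    headers: dict[str, str],
    raw_body: bytes,
    expected_token: Optional[str] = None,
) -> tuple[int, Iterator[str]]:
    """Return (HTTP status, text chunks) for one delegated turn."""
    if expected_token is not None:
        if headers.get("Authorization") != f"Bearer {expected_token}":
            return 401, iter(["unauthorized"])
    try:
        body = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, iter(["invalid JSON"])
    if not isinstance(body, dict):
        return 400, iter(["expected a JSON object"])
    missing = [field for field in REQUIRED_FIELDS if field not in body]
    if missing:
        return 400, iter([f"missing fields: {', '.join(missing)}"])
    # Each yielded chunk would be written to the response as it is produced.
    return 200, generate_reply(body["user_input"])
```

Per the error list above, any non-200 status returned here would surface to the voice client as SCAIVOICE_DELEGATED_CALLBACK_HTTP_ERROR, so short, readable error bodies help with debugging.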

The integrator now has three knobs at session-create:

| Setting | Server-owned | Client-owned |
| --- | --- | --- |
| Conversation history | `history_mode: 'server'` | `history_mode: 'client'` |
| Tool execution | n/a (server-owned LLM has no tools) OR caller `tools` array in client-history mode | consumer's callback executes tools internally |
| LLM call | `cognition_mode: 'inference_service'` | `cognition_mode: 'delegated'` |

Each is independent. Most ScaiBot/ScaiWave integrations will pick history_mode='client' + cognition_mode='delegated' — full control of cognition; ScaiVoice handles only speech I/O. Smaller demos can stick with the all-server defaults.
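As an illustration of the client-history plus delegated combination described above, a session-create body might look like the sketch below. The field names come from the entries in this changelog; the URL and token are placeholders, and `preflight` is a hypothetical client-side helper that mirrors the documented server-side rule rather than replacing it.

```python
import json

session_config = {
    "history_mode": "client",
    "cognition_mode": "delegated",
    # Placeholder values; substitute the consumer's own endpoint and secret.
    "cognition_callback_url": "https://agent.example.com/voice-turn",
    "cognition_callback_auth_token": "REPLACE_ME",
}


def preflight(config: dict) -> None:
    """Mirror the session-create rule: delegated mode requires a callback URL."""
    mode = config.get("cognition_mode", "inference_service")
    if mode == "delegated" and not config.get("cognition_callback_url"):
        raise ValueError(
            "SCAIVOICE_BAD_CONFIG: delegated mode needs cognition_callback_url"
        )


preflight(session_config)
print(json.dumps(session_config, indent=2))  # body for POST /sessions
```

Omitting both mode fields would fall back to the all-server defaults ('server' history and 'inference_service' cognition).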

v0.4.0 — Phase 3a bring-your-own agent (2026-05-24)

ScaiVoice steps out of the agent-runtime business. Consumer products (ScaiBot, ScaiWave, the phone-taking sibling) bring their own LLM ecosystem — history, tools, RAG, persona — and ScaiVoice handles only the speech I/O glue.

Shipped:

history_mode session config. 'server' (default, current behaviour — ScaiVoice accumulates history) or 'client' (consumer sends the full messages array on every text frame). Persisted on the session row.
Caller-owned messages per turn. In client mode, the text frame must include a messages array. Validated via Pydantic; bad shapes surface SCAIVOICE_BAD_MESSAGES. Missing in client mode → SCAIVOICE_HISTORY_REQUIRED. Present in server mode → SCAIVOICE_HISTORY_OWNED_BY_SERVER.
Caller-supplied tools per turn. Optional tools array on the text frame (client mode only). Passes verbatim to InferenceService.chat_stream. Per-turn — every utterance can carry a different tool set.
Multi-step agent loop. When the LLM emits finish_reason: tool_calls, the orchestrator yields one {"type":"agent_tool_call","tool_call_id","name","arguments"} event per call, waits for matching {"type":"tool_result","tool_call_id","content"} frames from the consumer, appends the tool messages to the working conversation, and re-invokes the LLM. Loops until the LLM produces final text → pipes to TTS. Runaway-loop guard at 8 iterations.
Tool-call delta accumulator. Coalesces streamed tool_call fragments (engines emit id+name on the first chunk, then argument fragments) into complete ToolCall objects for the consumer.
Cancellation through the tool wait. Barge-in / VAD interrupts a turn even while it is waiting for a tool result: the orchestrator races the tool_result queue against the cancel signal and acts on whichever arrives first.
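On the consumer side, the agent loop above boils down to answering each agent_tool_call event with a tool_result frame carrying the same tool_call_id. The sketch below makes several assumptions: the tool registry and the `get_weather` tool are hypothetical, `arguments` is accepted either as a JSON string or as a decoded object, and `content` is sent as a JSON-encoded string.

```python
import json
from typing import Any, Callable

# Hypothetical local tool registry; names and signatures are illustrative.
TOOLS: dict[str, Callable[..., Any]] = {
    "get_weather": lambda city: {"city": city, "forecast": "sunny"},
}


def answer_tool_call(event: dict[str, Any]) -> dict[str, Any]:
    """Turn one agent_tool_call event into the matching tool_result frame."""
    if event.get("type") != "agent_tool_call":
        raise ValueError("not an agent_tool_call event")
    arguments = event["arguments"]
    # Assumption: arguments may arrive as a JSON string or a decoded object.
    if isinstance(arguments, str):
        arguments = json.loads(arguments or "{}")
    tool = TOOLS.get(event["name"])
    if tool is None:
        content = json.dumps({"error": f"unknown tool {event['name']}"})
    else:
        try:
            content = json.dumps(tool(**arguments))
        except Exception as exc:
            # Report the failure to the LLM instead of leaving the turn waiting.
            content = json.dumps({"error": str(exc)})
    return {
        "type": "tool_result",
        "tool_call_id": event["tool_call_id"],
        "content": content,
    }


frame = answer_tool_call({
    "type": "agent_tool_call",
    "tool_call_id": "call_1",
    "name": "get_weather",
    "arguments": '{"city": "Oslo"}',
})
print(json.dumps(frame))
```

Always replying, even for unknown tools or failures, matters because the orchestrator waits for a matching frame before it re-invokes the LLM.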

Why: voice-bot framework concerns end at "speech in, speech out". Conversation memory, tool execution policy, RAG context, persona prompting — those belong in the consumer's agent layer (where they already exist). ScaiVoice's job is to be invisible plumbing for whatever LLM ecosystem the consumer brings.

Phase 3b (deferred): delegated cognition — consumer registers an HTTP callback or proxied WS as the "LLM" endpoint, ScaiVoice forwards user transcripts there and pipes the reply into TTS. Lets consumers run their agent loop entirely outside ScaiGrid. Will ship as a cognition_mode session-level toggle.

v0.3.0 — Phase 2 wake-word + speaker-identify shape (2026-05-24)

Opt-in wake-word gating fully wired controller-side. Speaker-identification endpoint contract shipped with a clean stub pending ScaiInfer-side RPC.

Shipped:

Wake-word gating. When wake_word_enabled: true is set at session-create, the WS handler drops text/utterance frames until a {"type":"wake"} frame arrives. After each turn completes, the session re-arms (one wake = one turn). Server emits {"type":"wake_state","armed":<bool>} on every transition so the client UI can render the right prompt.
SCAIVOICE_WAKE_REQUIRED info event. Pre-wake text frames are dropped with this informational event so clients can show "say the wake word" guidance without ambiguity.
POST /v1/modules/scaiecho/speakers/identify endpoint shape. Stable contract for one-shot speaker identification against tenant-enrolled speakers. Returns 503 SCAIECHO_IDENTIFY_NOT_WIRED today (ScaiInfer engine RPC pending — see REQUEST-SPEAKER-IDENTIFY-2026-05-24.md). Counts the tenant's enrolled-speaker pool in the error details so callers get useful telemetry even today.
scaiecho:identify module permission added.
Client-side wake-word tutorial. tutorials/client-wake-word.md covers openwakeword integration, Picovoice as an alternative, and the wake+VAD interaction pattern.
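The gating behaviour above (one wake, one turn, re-arm afterwards) can be modelled as a small state holder. This is a sketch under stated assumptions: the class and method names are hypothetical, `armed: true` is read as "waiting for a wake frame", and the shape of the SCAIVOICE_WAKE_REQUIRED event is a guess, since the entry names only the code.

```python
from typing import Any


class WakeGate:
    """Model of wake-word gating: one wake frame admits one turn."""

    def __init__(self, wake_word_enabled: bool) -> None:
        self.enabled = wake_word_enabled
        # Assumption: armed=True means "waiting for a wake frame".
        self.armed = wake_word_enabled

    def on_client_frame(
        self, frame: dict[str, Any]
    ) -> tuple[bool, list[dict[str, Any]]]:
        """Return (accept_frame, events_to_emit) for one client frame."""
        kind = frame.get("type")
        if kind == "wake":
            if self.enabled and self.armed:
                self.armed = False
                return False, [{"type": "wake_state", "armed": False}]
            return False, []
        if kind == "text" and self.enabled and self.armed:
            # Hypothetical event shape; only the code appears in the changelog.
            return False, [{"type": "info", "code": "SCAIVOICE_WAKE_REQUIRED"}]
        return True, []

    def on_turn_complete(self) -> list[dict[str, Any]]:
        """Re-arm after a turn so the next utterance needs a new wake."""
        if self.enabled and not self.armed:
            self.armed = True
            return [{"type": "wake_state", "armed": True}]
        return []


gate = WakeGate(wake_word_enabled=True)
print(gate.on_client_frame({"type": "text", "text": "hello"}))  # dropped
print(gate.on_client_frame({"type": "wake"}))                   # disarms the gate
print(gate.on_client_frame({"type": "text", "text": "hello"}))  # accepted
print(gate.on_turn_complete())                                  # re-armed
```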

Reserved (no-op until engine wiring lands): speaker_recognition session flag still doesn't populate speaker_id on transcript frames — the orchestrator passes the field through but ScaiInfer's streaming-STT doesn't emit it yet. Lands transparently once the IdentifySpeaker RPC ships.

v0.2.0 — Phase 1 barge-in (2026-05-24)

Concurrent pump loop + VAD-driven cancellation.

Shipped:

Concurrent pump loop. The WS handler now reads client frames concurrently with the per-turn task. Control frames (interrupt, vad, close) propagate cancellation mid-turn — previously they were queued behind the in-flight turn iterator and only took effect after the turn naturally finished.
VAD-driven barge-in. {"type":"vad","speaking":true} arriving during thinking or speaking cancels the current turn, transitions back to listening with reason: "interrupted_by_user", and is ready for the next utterance within ~100 ms (asserted in unit tests). speaking:false is informational.
Supersede semantics. A {"type":"text"} frame arriving mid-turn cancels the in-flight turn before kicking off the new one. No turn-task pileup on a session.
State machine update. thinking → listening is now a legal transition (covers cancel-during-thinking and natural-end-without-TTS-audio cases). Other transitions unchanged.
Client-side VAD integration reference. Added tutorials/client-vad-integration.md with the silero-vad-in-browser recipe and the emit-pattern recommendation.
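The concurrent pump loop and VAD-driven barge-in can be illustrated with plain asyncio. This is a simplified sketch, not the handler itself: `fake_turn` stands in for the LLM and TTS streaming, an `asyncio.Queue` stands in for the WebSocket, and all function names are hypothetical. The frame shapes and the `interrupted_by_user` reason are taken from the entries above.

```python
import asyncio


async def fake_turn(log: list[str]) -> None:
    """Stand-in for LLM + TTS streaming; runs until done or cancelled."""
    try:
        for i in range(100):
            log.append(f"chunk {i}")
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        log.append("turn cancelled")
        raise


async def pump(frames: asyncio.Queue, log: list[str]) -> str:
    """Read client frames while the turn task runs; cancel it on barge-in."""
    turn = asyncio.create_task(fake_turn(log))
    while True:
        get_frame = asyncio.create_task(frames.get())
        done, _ = await asyncio.wait(
            {turn, get_frame}, return_when=asyncio.FIRST_COMPLETED
        )
        if get_frame in done:
            frame = get_frame.result()
            barge_in = frame.get("type") == "interrupt" or (
                frame.get("type") == "vad" and frame.get("speaking") is True
            )
            if barge_in:
                turn.cancel()
                # Wait for the turn to unwind before reporting the transition.
                await asyncio.gather(turn, return_exceptions=True)
                return "interrupted_by_user"
            # speaking:false and other frames are informational here.
        else:
            get_frame.cancel()
        if turn in done:
            return "completed"


async def main() -> tuple[str, list[str]]:
    frames: asyncio.Queue = asyncio.Queue()
    log: list[str] = []

    async def client() -> None:
        await asyncio.sleep(0.03)
        await frames.put({"type": "vad", "speaking": False})  # informational
        await asyncio.sleep(0.03)
        await frames.put({"type": "vad", "speaking": True})   # barge-in

    client_task = asyncio.create_task(client())
    reason = await pump(frames, log)
    await client_task
    return reason, log


reason, log = asyncio.run(main())
print(reason, log[-1])
```

The key point is that the frame reader and the turn are awaited together, so a control frame is seen as soon as it arrives instead of after the turn finishes.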

Reserved (no-op until later phases): wake-word, live speaker recognition, tools, expression_hint server frame, mic-piped end-of-utterance detection.

v0.1.0 — Phase 0 framework foundation (2026-05-24)

Initial release. Framework, not a usable bot — consumer products build their own bot personality on top.

Shipped:

New module scaivoice with sidebar entry (SuperAdmin telemetry only).
ORM mod_scaivoice_session (21 columns, 3 indexes) capturing session config + state + audit counters.
REST surface: POST /sessions, GET /sessions/{id}, DELETE /sessions/{id}.
WebSocket: WS /sessions/{id}/stream with the full protocol (open/state/transcript/agent_text/agent_audio/agent_done/error frames + close codes 4401/4403/4404/4400/4502/4500/1000).
State machine: idle / listening / thinking / speaking / interrupted / ended with explicit transition validation + idle/thinking/speaking timeouts.
Cancellation primitive: {"type":"interrupt"} frame cancels in-flight LLM + TTS within ~100 ms.
Bare-bones pipeline wired: ScaiEcho streaming STT, InferenceService.chat_stream, ScaiSpeak streaming TTS, all in-process.
Module permissions: scaivoice:use (open sessions), scaivoice:admin (telemetry).
Voxcpm mode resolver extracted from SpeakService to modules.scaispeak.services.voxcpm_mode so the orchestrator can reuse it.
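A client consuming the WebSocket protocol above typically dispatches on the frame's `type`. The sketch below is only illustrative: the frame type names come from the list above, but the payload field names (`state`, `text`) and the idea that every frame arrives as JSON text are assumptions, and `TurnCollector` is a hypothetical helper.

```python
import json
from typing import Any, Optional


class TurnCollector:
    """Accumulate server frames for one turn, keyed on the frame type."""

    def __init__(self) -> None:
        self.state: Optional[str] = None
        self.transcripts: list[str] = []
        self.agent_text: list[str] = []
        self.audio_frames = 0
        self.done = False
        self.errors: list[dict[str, Any]] = []

    def feed(self, raw: str) -> None:
        frame = json.loads(raw)
        kind = frame.get("type")
        if kind == "state":
            self.state = frame.get("state")  # hypothetical field name
        elif kind == "transcript":
            self.transcripts.append(frame.get("text", ""))  # hypothetical field
        elif kind == "agent_text":
            self.agent_text.append(frame.get("text", ""))  # hypothetical field
        elif kind == "agent_audio":
            self.audio_frames += 1  # audio payload handling is out of scope
        elif kind == "agent_done":
            self.done = True
        elif kind == "error":
            self.errors.append(frame)
        # "open" and unknown types are ignored so reserved frames stay harmless.


collector = TurnCollector()
for raw in (
    '{"type": "open"}',
    '{"type": "state", "state": "thinking"}',
    '{"type": "agent_text", "text": "Hello, "}',
    '{"type": "agent_text", "text": "world."}',
    '{"type": "agent_done"}',
):
    collector.feed(raw)
print("".join(collector.agent_text), collector.done)
```

Ignoring unknown frame types keeps a client forward-compatible with the reserved frames listed for later phases.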

Reserved on the protocol (no-op in v0.1.0):

{"type":"vad"} frames — Phase 1 will wire them to auto barge-in.
{"type":"wake"} frames — Phase 2 will wire them to state transitions.
speaker_recognition flag on session config — Phase 2 will populate speaker_id on transcript frames.
tools_enabled flag — Phase 3 will pipe tool calls through.
expression_hint server frame — reserved slot for future avatar / expression metadata.

Out of scope:

Avatar / lipsync — separate solution.
Hosted bot personalities — consumer products own them.