Changelog
v0.6.0 - Voice control + stabilisation (2026-05-25)
Per-session voice tuning and a round of infrastructure hardening.
Voice control:
- Per-session TTS parameters. `POST /sessions` now accepts `instructions`, `speed`, `cfg_value`, and `warmup_trim_ms`. These override the voice's defaults (set via ScaiSpeak `PATCH /voices/{id}`) for the lifetime of the session. Omitted fields inherit the voice default; voices without defaults inherit the engine default (speed 1.0, cfg ~2.0, no instructions, no trim).
- Three-level merge chain. Engine default < voice default < session override. Documented in the API reference.
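The merge chain can be pictured as a layered dictionary merge. The sketch below is illustrative only: `resolve_tts_params` and `ENGINE_DEFAULTS` are hypothetical names, `cfg_value` is pinned to 2.0 although the changelog only says "~2.0", and treating `None` as "omitted" is an assumption.

```python
from typing import Any, Mapping, Optional

# Engine defaults as described in the changelog. "no trim" is modelled as 0 ms
# and "~2.0" as exactly 2.0; both are assumptions made for this sketch.
ENGINE_DEFAULTS: dict[str, Any] = {
    "instructions": None,
    "speed": 1.0,
    "cfg_value": 2.0,
    "warmup_trim_ms": 0,
}


def resolve_tts_params(
    voice_defaults: Optional[Mapping[str, Any]] = None,
    session_overrides: Optional[Mapping[str, Any]] = None,
) -> dict[str, Any]:
    """Merge engine default < voice default < session override."""
    resolved = dict(ENGINE_DEFAULTS)
    # Later layers win, so the session override is applied last.
    for layer in (voice_defaults or {}, session_overrides or {}):
        for key, value in layer.items():
            # Assumption: None means "field omitted", so the lower layer stays.
            if key in ENGINE_DEFAULTS and value is not None:
                resolved[key] = value
    return resolved
```

For example, a voice default of `speed=0.9` combined with a session override of `speed=1.2` resolves to `1.2`, while fields set by neither layer fall through to the engine default.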
Stabilisation fixes:
- Idle timeout scoped to listening state. The 30-second idle timeout now fires only when the session is in the `listening` state. Previously it could trigger during `thinking` or `speaking`, killing sessions mid-reply when a delegated callback or TTS stream took longer than 30 seconds.
- Streaming STT uses raw PCM. The STT path now sends `audio/L16;rate=16000;channels=1` (raw PCM) via the `buffer_realtime_audio` path for real-time partial deltas, replacing the previous framing. Lower latency for interim transcripts.
- No fixed gRPC deadlines on streaming paths. Streaming STT and TTS gRPC calls no longer carry a fixed deadline. Streams live as long as activity flows; the session-level idle timeout handles cleanup.
- Three-tier node health. Node selection now distinguishes `healthy`, `warning`, and `stale` states with configurable margins. The router prefers healthy nodes, falls back to warning nodes, and never routes to stale nodes. Replaces the binary online/offline check.
- Fresh DB reads for node resolution. The audio-node resolver now opens a fresh database session for every lookup instead of reusing the WebSocket handler's long-lived session. Fixes intermittent `TTS_BACKEND_UNAVAILABLE` errors caused by stale REPEATABLE READ snapshots that couldn't see heartbeats committed after the WebSocket connected.
- Heartbeat handler hardened. The heartbeat acknowledgement path now retries up to 3 times on transient failures. A failed heartbeat no longer crashes the handler or leaves the node in an ambiguous state.
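As a rough illustration of the three-tier selection rule, the sketch below classifies nodes by heartbeat age and prefers healthy over warning, never returning a stale node. Using heartbeat age as the health signal, the threshold values, and the names `Node`, `classify`, and `pick_node` are all assumptions; the changelog only states that the margins are configurable.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    node_id: str
    heartbeat_age_s: float  # seconds since the last heartbeat (assumed signal)


def classify(node: Node, warning_after_s: float, stale_after_s: float) -> str:
    """Map a heartbeat age onto healthy / warning / stale."""
    if node.heartbeat_age_s >= stale_after_s:
        return "stale"
    if node.heartbeat_age_s >= warning_after_s:
        return "warning"
    return "healthy"


def pick_node(
    nodes: Sequence[Node],
    warning_after_s: float = 15.0,  # hypothetical margin
    stale_after_s: float = 45.0,    # hypothetical margin
) -> Optional[Node]:
    """Prefer healthy nodes, fall back to warning nodes, never pick stale."""
    for tier in ("healthy", "warning"):
        candidates = [
            n for n in nodes
            if classify(n, warning_after_s, stale_after_s) == tier
        ]
        if candidates:
            # Tie-break on the freshest heartbeat (an assumption of this sketch).
            return min(candidates, key=lambda n: n.heartbeat_age_s)
    return None  # only stale nodes (or none at all): refuse to route
```

Returning `None` rather than the least-stale node mirrors the "never routes to stale nodes" rule; the caller decides how to surface the failure.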
v0.5.0 - Phase 3b delegated cognition (2026-05-24)
Consumers can now run their LLM call entirely outside ScaiGrid. ScaiVoice becomes pure speech I/O — it never touches an LLM in this mode.
Shipped:
- `cognition_mode` session config. `'inference_service'` (default: ScaiGrid's LLM gateway, current behaviour) or `'delegated'` (consumer-owned callback URL).
- Delegated turn path. When `cognition_mode='delegated'`, the orchestrator POSTs the user input to `cognition_callback_url` with body `{session_id, tenant_id, user_id, user_input, turn_index, request_id}` and streams the response body (any content type, read via `aiter_text`) as `agent_text` deltas. End-of-stream → TTS. Auth via optional `cognition_callback_auth_token` forwarded as `Authorization: Bearer …`.
- Cancellation propagates through the HTTP stream. Barge-in / VAD interrupts an in-flight callback turn within the same ~100 ms budget as the InferenceService path: the async-for over `aiter_text` exits early and the response closes via context-manager teardown.
- Clean errors. `SCAIVOICE_DELEGATED_CALLBACK_HTTP_ERROR` (non-200, with truncated body for debugging), `SCAIVOICE_DELEGATED_CALLBACK_TIMEOUT` (60s default), `SCAIVOICE_DELEGATED_CALLBACK_ERROR` (connection failures). Validated at session-create: delegated mode without a URL is rejected with `SCAIVOICE_BAD_CONFIG`.
- Shared TTS pipe. Extracted `_stream_to_tts` so the InferenceService path and the delegated path share the voxcpm mode resolution + dispatcher logic (single source of truth).
- Token write-only. `cognition_callback_auth_token` is persisted but never appears in `GET /sessions/{id}` responses; the read schema deliberately omits it.
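To make the delegated request shape concrete, here is a minimal sketch of how the callback URL, body, and headers might be assembled for one turn. The body keys and the `Authorization: Bearer` forwarding come from the notes above; the function name `build_callback_request` and the dict-shaped `session` argument are hypothetical, and the actual HTTP transport and streaming are deliberately left out.

```python
from typing import Any


def build_callback_request(
    session: dict[str, Any],
    user_input: str,
    turn_index: int,
    request_id: str,
) -> tuple[str, dict[str, Any], dict[str, str]]:
    """Return (url, json_body, headers) for one delegated turn."""
    body = {
        "session_id": session["session_id"],
        "tenant_id": session["tenant_id"],
        "user_id": session["user_id"],
        "user_input": user_input,
        "turn_index": turn_index,
        "request_id": request_id,
    }
    headers: dict[str, str] = {}
    # The token is optional; only forward it when one was configured.
    token = session.get("cognition_callback_auth_token")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return session["cognition_callback_url"], body, headers
```

A consumer callback only has to accept this JSON body and stream text back; any content type works because the response is read as text chunks.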
The integrator now has three knobs at session-create:
| Setting | Server-owned | Client-owned |
|---|---|---|
| Conversation history | `history_mode: 'server'` | `history_mode: 'client'` |
| Tool execution | n/a (server-owned LLM has no tools) OR caller `tools` array in client-history mode | consumer's callback executes tools internally |
| LLM call | `cognition_mode: 'inference_service'` | `cognition_mode: 'delegated'` |
Each is independent. Most ScaiBot/ScaiWave integrations will pick history_mode='client' + cognition_mode='delegated' — full control of cognition; ScaiVoice handles only speech I/O. Smaller demos can stick with the all-server defaults.
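A session-create config carrying these knobs could be validated along the following lines. This is a sketch under assumptions: `SessionConfig` and `BadConfig` are hypothetical names, and rejecting unknown mode values is an assumption; the only rule taken from the release notes is that delegated mode without a callback URL is rejected with `SCAIVOICE_BAD_CONFIG`.

```python
from dataclasses import dataclass
from typing import Optional


class BadConfig(ValueError):
    """Hypothetical exception carrying the documented error code."""
    code = "SCAIVOICE_BAD_CONFIG"


@dataclass
class SessionConfig:
    history_mode: str = "server"                # or "client"
    cognition_mode: str = "inference_service"   # or "delegated"
    cognition_callback_url: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: unknown values are rejected with the same error code.
        if self.history_mode not in ("server", "client"):
            raise BadConfig(f"unknown history_mode: {self.history_mode!r}")
        if self.cognition_mode not in ("inference_service", "delegated"):
            raise BadConfig(f"unknown cognition_mode: {self.cognition_mode!r}")
        # Documented rule: delegated mode needs a callback URL.
        if self.cognition_mode == "delegated" and not self.cognition_callback_url:
            raise BadConfig("delegated mode requires cognition_callback_url")
```

The two mode fields are checked independently, matching the note that each knob can be chosen on its own.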
v0.4.0 - Phase 3a bring-your-own agent (2026-05-24)
ScaiVoice steps out of the agent-runtime business. Consumer products (ScaiBot, ScaiWave, the phone-taking sibling) bring their own LLM ecosystem — history, tools, RAG, persona — and ScaiVoice handles only the speech I/O glue.
Shipped:
- `history_mode` session config. `'server'` (default, current behaviour: ScaiVoice accumulates history) or `'client'` (consumer sends the full `messages` array on every text frame). Persisted on the session row.
- Caller-owned `messages` per turn. In client mode, the `text` frame must include a `messages` array. Validated via Pydantic; bad shapes surface `SCAIVOICE_BAD_MESSAGES`. Missing in client mode → `SCAIVOICE_HISTORY_REQUIRED`. Present in server mode → `SCAIVOICE_HISTORY_OWNED_BY_SERVER`.
- Caller-supplied `tools` per turn. Optional `tools` array on the text frame (client mode only). Passes verbatim to `InferenceService.chat_stream`. Per-turn: every utterance can carry a different tool set.
- Multi-step agent loop. When the LLM emits `finish_reason: tool_calls`, the orchestrator yields one `{"type":"agent_tool_call","tool_call_id","name","arguments"}` event per call, waits for matching `{"type":"tool_result","tool_call_id","content"}` frames from the consumer, appends the tool messages to the working conversation, and re-invokes the LLM. Loops until the LLM produces final text → pipes to TTS. Runaway-loop guard at 8 iterations.
- Tool-call delta accumulator. Coalesces streamed tool_call fragments (engines emit id+name on the first chunk, then argument fragments) into complete `ToolCall` objects for the consumer.
- Cancellation through the tool wait. Barge-in / VAD interrupts a turn even while waiting for a tool result; the queue race is `tool_result` vs `cancel`.
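The tool-call delta accumulator can be sketched as follows. The fragment shape used here (a dict with `index`, optional `id` and `name`, and an `arguments` string piece) is an assumption modelled on common streaming chat APIs, and the `ToolCall` fields are illustrative rather than the module's actual class.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    id: str
    name: str
    arguments: str  # raw JSON text, concatenated from the streamed pieces


class ToolCallAccumulator:
    """Coalesce streamed tool_call fragments into complete ToolCall objects."""

    def __init__(self) -> None:
        self._partial: dict[int, dict[str, str]] = {}

    def add(self, fragment: dict) -> None:
        slot = self._partial.setdefault(
            fragment["index"], {"id": "", "name": "", "arguments": ""}
        )
        # id and name typically arrive once, on the first chunk of a call.
        if fragment.get("id"):
            slot["id"] = fragment["id"]
        if fragment.get("name"):
            slot["name"] = fragment["name"]
        # Argument text arrives in pieces and is appended in order.
        slot["arguments"] += fragment.get("arguments") or ""

    def finish(self) -> list[ToolCall]:
        return [ToolCall(**self._partial[i]) for i in sorted(self._partial)]


acc = ToolCallAccumulator()
acc.add({"index": 0, "id": "call_1", "name": "get_weather", "arguments": ""})
acc.add({"index": 0, "arguments": '{"city": '})
acc.add({"index": 0, "arguments": '"Oslo"}'})
calls = acc.finish()
parsed_args = json.loads(calls[0].arguments)  # only parse once complete
```

Keeping `arguments` as raw text until the stream ends avoids trying to parse partial JSON mid-stream.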
Why: voice-bot framework concerns end at "speech in, speech out". Conversation memory, tool execution policy, RAG context, persona prompting — those belong in the consumer's agent layer (where they already exist). ScaiVoice's job is to be invisible plumbing for whatever LLM ecosystem the consumer brings.
Phase 3b (deferred): delegated cognition — consumer registers an HTTP callback or proxied WS as the "LLM" endpoint, ScaiVoice forwards user transcripts there and pipes the reply into TTS. Lets consumers run their agent loop entirely outside ScaiGrid. Will ship as a cognition_mode session-level toggle.
v0.3.0 - Phase 2 wake-word + speaker-identify shape (2026-05-24)
Opt-in wake-word gating fully wired controller-side. Speaker-identification endpoint contract shipped with a clean stub pending ScaiInfer-side RPC.
Shipped:
- Wake-word gating. When `wake_word_enabled: true` is set at session-create, the WS handler drops text/utterance frames until a `{"type":"wake"}` frame arrives. After each turn completes, the session re-arms (one wake = one turn). Server emits `{"type":"wake_state","armed":<bool>}` on every transition so the client UI can render the right prompt.
- `SCAIVOICE_WAKE_REQUIRED` info event. Pre-wake text frames are dropped with this informational event so clients can show "say the wake word" guidance without ambiguity.
- `POST /v1/modules/scaiecho/speakers/identify` endpoint shape. Stable contract for one-shot speaker identification against tenant-enrolled speakers. Returns 503 `SCAIECHO_IDENTIFY_NOT_WIRED` today (ScaiInfer engine RPC pending; see `REQUEST-SPEAKER-IDENTIFY-2026-05-24.md`). Counts the tenant's enrolled-speaker pool in the error details so callers get useful telemetry even today.
- `scaiecho:identify` module permission added.
- Client-side wake-word tutorial. `tutorials/client-wake-word.md` covers openwakeword integration, Picovoice as an alternative, and the wake+VAD interaction pattern.
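The "one wake = one turn" gating can be modelled as a small state holder. In this sketch `armed: true` is read as "waiting for the wake word", which is an interpretation of the release notes rather than a documented definition; the class name `WakeGate` and the shape of the info event are likewise hypothetical.

```python
class WakeGate:
    """Drop text frames until a wake frame arrives; re-arm after each turn."""

    def __init__(self, wake_word_enabled: bool) -> None:
        self.enabled = wake_word_enabled
        # Assumption: a gated session starts armed (waiting for the wake word).
        self.armed = wake_word_enabled
        self.events: list[dict] = []  # frames the server would send

    def _set_armed(self, armed: bool) -> None:
        if self.enabled and armed != self.armed:
            self.armed = armed
            self.events.append({"type": "wake_state", "armed": armed})

    def on_wake(self) -> None:
        self._set_armed(False)

    def allow_text(self) -> bool:
        if self.armed:
            # Hypothetical frame shape for the informational event.
            self.events.append({"type": "info", "code": "SCAIVOICE_WAKE_REQUIRED"})
            return False
        return True

    def on_turn_complete(self) -> None:
        self._set_armed(True)
```

With gating disabled the gate never arms, so `allow_text()` always returns `True` and no `wake_state` frames are produced.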
Reserved (no-op until engine wiring lands): speaker_recognition session flag still doesn't populate speaker_id on transcript frames — the orchestrator passes the field through but ScaiInfer's streaming-STT doesn't emit it yet. Lands transparently once the IdentifySpeaker RPC ships.
v0.2.0 - Phase 1 barge-in (2026-05-24)
Concurrent pump loop + VAD-driven cancellation.
Shipped:
- Concurrent pump loop. The WS handler now reads client frames concurrently with the per-turn task. Control frames (interrupt, vad, close) propagate cancellation mid-turn — previously they were queued behind the in-flight turn iterator and only took effect after the turn naturally finished.
- VAD-driven barge-in. `{"type":"vad","speaking":true}` arriving during `thinking` or `speaking` cancels the current turn, transitions back to `listening` with `reason: "interrupted_by_user"`, and is ready for the next utterance within ~100 ms (asserted in unit tests). `speaking:false` is informational.
- Supersede semantics. A `{"type":"text"}` frame arriving mid-turn cancels the in-flight turn before kicking off the new one. No turn-task pileup on a session.
- State machine update. `thinking → listening` is now a legal transition (covers cancel-during-thinking and natural-end-without-TTS-audio cases). Other transitions unchanged.
- Client-side VAD integration reference. Added `tutorials/client-vad-integration.md` with the silero-vad-in-browser recipe and the emit-pattern recommendation.
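The barge-in and supersede behaviour can be approximated with plain `asyncio` task cancellation. The sketch below is not the module's handler: the `Session` class, the `"superseded"` reason string, and the sleeping stand-in for the LLM + TTS pipeline are all hypothetical; only the frame shapes and the `interrupted_by_user` reason come from the notes above.

```python
import asyncio
from typing import Optional


class Session:
    def __init__(self) -> None:
        self.state = "listening"
        self.reason: Optional[str] = None
        self._turn: Optional[asyncio.Task] = None

    async def _run_turn(self, text: str) -> None:
        self.state = "thinking"
        await asyncio.sleep(10)  # stand-in for the LLM + TTS pipeline
        self.state = "listening"

    async def _cancel_turn(self, reason: str) -> None:
        if self._turn is not None and not self._turn.done():
            self._turn.cancel()
            try:
                await self._turn  # wait for the cancellation to land
            except asyncio.CancelledError:
                pass
            self.state, self.reason = "listening", reason

    async def handle_frame(self, frame: dict) -> None:
        if frame["type"] == "vad" and frame.get("speaking"):
            await self._cancel_turn("interrupted_by_user")
        elif frame["type"] == "text":
            # Supersede: cancel any in-flight turn before starting a new one.
            await self._cancel_turn("superseded")  # hypothetical reason value
            self._turn = asyncio.create_task(self._run_turn(frame["text"]))


async def demo() -> tuple:
    s = Session()
    await s.handle_frame({"type": "text", "text": "hello"})
    await asyncio.sleep(0)  # let the first turn start
    first = s._turn
    await s.handle_frame({"type": "text", "text": "actually, never mind"})
    await asyncio.sleep(0)  # let the second turn start
    await s.handle_frame({"type": "vad", "speaking": True})
    return first.cancelled(), s._turn.cancelled(), s.state, s.reason
```

Because frames are handled outside the turn task, a control frame can cancel the turn immediately instead of waiting behind it, which is the point of the concurrent pump loop.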
Reserved (no-op until later phases): wake-word, live speaker recognition, tools, expression_hint server frame, mic-piped end-of-utterance detection.
v0.1.0 - Phase 0 framework foundation (2026-05-24)
Initial release. Framework, not a usable bot — consumer products build their own bot personality on top.
Shipped:
- New module `scaivoice` with sidebar entry (SuperAdmin telemetry only).
- ORM `mod_scaivoice_session` (21 columns, 3 indexes) capturing session config + state + audit counters.
- REST surface: `POST /sessions`, `GET /sessions/{id}`, `DELETE /sessions/{id}`.
- WebSocket: `WS /sessions/{id}/stream` with the full protocol (open/state/transcript/agent_text/agent_audio/agent_done/error frames + close codes 4401/4403/4404/4400/4502/4500/1000).
- State machine: idle / listening / thinking / speaking / interrupted / ended with explicit transition validation + idle/thinking/speaking timeouts.
- Cancellation primitive: `{"type":"interrupt"}` frame cancels in-flight LLM + TTS within ~100 ms.
- Bare-bones pipeline wired: ScaiEcho streaming STT, InferenceService.chat_stream, ScaiSpeak streaming TTS, all in-process.
- Module permissions: `scaivoice:use` (open sessions), `scaivoice:admin` (telemetry).
- Voxcpm mode resolver extracted from SpeakService to `modules.scaispeak.services.voxcpm_mode` so the orchestrator can reuse it.
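Explicit transition validation might look like the sketch below. The six state names come from the list above, but the set of allowed edges is illustrative only: the release notes name the states without listing the transitions, so `ALLOWED`, `InvalidTransition`, and `SessionStateMachine` are assumptions made for this example.

```python
# Illustrative edge table; the real transition set is not given in the notes.
# Note that thinking -> listening is absent here because it only became
# legal in v0.2.0.
ALLOWED: dict[str, set[str]] = {
    "idle": {"listening", "ended"},
    "listening": {"thinking", "ended"},
    "thinking": {"speaking", "interrupted", "ended"},
    "speaking": {"listening", "interrupted", "ended"},
    "interrupted": {"listening", "ended"},
    "ended": set(),
}


class InvalidTransition(Exception):
    pass


class SessionStateMachine:
    def __init__(self) -> None:
        self.state = "idle"

    def transition(self, new_state: str) -> None:
        """Move to new_state, rejecting edges that are not in the table."""
        if new_state not in ALLOWED[self.state]:
            raise InvalidTransition(f"{self.state} -> {new_state}")
        self.state = new_state
```

Centralising the edge table means a later release can legalise a new transition by adding one entry rather than touching handler code.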
Reserved on the protocol (no-op in v0.1.0):
- `{"type":"vad"}` frames: Phase 1 will wire them to auto barge-in.
- `{"type":"wake"}` frames: Phase 2 will wire them to state transitions.
- `speaker_recognition` flag on session config: Phase 2 will populate `speaker_id` on transcript frames.
- `tools_enabled` flag: Phase 3 will pipe tool calls through.
- `expression_hint` server frame: reserved slot for future avatar / expression metadata.
Out of scope:
- Avatar / lipsync — separate solution.
- Hosted bot personalities — consumer products own them.