ScaiVoice

ScaiVoice is a backend framework for building voice bots. You open a session over a single WebSocket, pipe audio in, and get coordinated state events, transcripts, agent text, and synthesized audio back. STT, LLM cognition, and TTS run on the same ScaiGrid infrastructure that powers ScaiEcho, our chat completions, and ScaiSpeak — no separate wiring required.

ScaiVoice is a framework, not a product. There is no end-user UI shipped with it. Consumer applications (ScaiBot's voice mode, a telephony bot, your in-app personal assistant) build their own personality, UX, and business logic on top of the protocol it exposes.

What you get out of the box

Capability	Default	Opt-in flag
Mic → STT → LLM → TTS pipeline	always	—
Conversation state machine (`idle / listening / thinking / speaking / interrupted`)	always	—
Streaming user transcripts (interim + final)	always	—
Streaming agent text + audio	always	—
Pick any voice from the ScaiSpeak voice library	always	`voice_id` per session
Per-session voice control (instructions, speed, cloning fidelity, warmup trim)	voice defaults	`instructions`, `speed`, `cfg_value`, `warmup_trim_ms` per session
Text normalisation (dates, times, currency, pronunciations)	tenant default	`normalize_text` per session
Anonymous speaker diarization	off	`diarize` per session
Barge-in (explicit interrupt frame)	always	—
Auto barge-in via VAD	off	`vad_enabled` (Phase 1)
Wake-word triggering	off	`wake_word_enabled` (Phase 2)
Live speaker identification	off	`speaker_recognition` (Phase 2; tenant opt-in)
Tool / skill execution	off	`tools_enabled` (Phase 3)
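
A client will usually mirror the conversation state locally so its UI can react to it. The following is a minimal sketch of that idea, not the documented wire format: only the five state names come from the table above. The event shape (`{"type": "state", "state": ...}`), the `StateTracker` class, and the initial `idle` state are assumptions made for illustration.

```python
import json

# The five states listed in the capability table.
STATES = ("idle", "listening", "thinking", "speaking", "interrupted")


class StateTracker:
    """Mirror the server-reported conversation state on the client."""

    def __init__(self):
        self.current = "idle"  # assumed initial state
        self.history = []      # previously observed states, oldest first

    def apply(self, raw_event: str) -> str:
        # Assumed event shape: {"type": "state", "state": "<name>"}.
        event = json.loads(raw_event)
        state = event.get("state")
        if state not in STATES:
            raise ValueError(f"unknown state: {state!r}")
        self.history.append(self.current)
        self.current = state
        return state


tracker = StateTracker()
tracker.apply('{"type": "state", "state": "listening"}')
tracker.apply('{"type": "state", "state": "thinking"}')
print(tracker.current)
```

The tracker deliberately does not enforce which transitions are legal, since the table names the states but not the allowed transitions between them.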

The protocol is stable from Phase 0 — later phases light up opt-in flags without breaking integrations.
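
Because every opt-in is a named per-session flag, it can help to validate option names on the client before opening a session. The sketch below assumes the options are collected into a flat dictionary; the helper name `build_session_options`, the example values, and their types are hypothetical, and how the options are actually transmitted is not shown here. Only the flag names are taken from the table.

```python
import json

# Per-session option names copied from the "Opt-in flag" column above.
SESSION_OPTIONS = frozenset({
    "voice_id", "instructions", "speed", "cfg_value", "warmup_trim_ms",
    "normalize_text", "diarize", "vad_enabled", "wake_word_enabled",
    "speaker_recognition", "tools_enabled",
})


def build_session_options(**options):
    """Return the options unchanged, rejecting names not in the table."""
    unknown = sorted(set(options) - SESSION_OPTIONS)
    if unknown:
        raise ValueError(f"unknown session option(s): {', '.join(unknown)}")
    return options


# Value types below (string, float, bool) are assumptions for illustration.
options = build_session_options(voice_id="example-voice", speed=1.0, diarize=True)
print(json.dumps(options, sort_keys=True))
```

Rejecting unknown names early catches typos such as `diarise` before they are silently ignored or refused at session start.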

What you do on your side

Audio capture + playback. Browser: AudioContext + AudioWorklet to produce 16 kHz PCM16 mono for upload, and MediaSource or Web Audio for playback. Native: the platform equivalents.
VAD (optional). Run it client-side via silero-vad or webrtcvad, and emit `{"type":"vad","speaking":true}` or `{"type":"vad","speaking":false}` frames when you want auto barge-in.
Wake word (optional). Run it client-side via openwakeword, and emit a `wake` frame carrying the detection confidence (`{"type":"wake","confidence":<score>}`) when triggered.
Bot personality, UI, business logic. All yours.
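
The client-side audio and signalling duties above can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, little-endian byte order for PCM16 is an assumption, and resampling to 16 kHz is left to the capture layer. The `vad` and `wake` frame fields are the ones named in the list above.

```python
import json
import struct


def float_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to PCM16 bytes.

    Little-endian byte order is assumed; out-of-range samples are clipped.
    """
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def vad_frame(speaking: bool) -> str:
    """Build the JSON text frame that reports client-side VAD state."""
    return json.dumps({"type": "vad", "speaking": bool(speaking)})


def wake_frame(confidence: float) -> str:
    """Build the JSON text frame that reports a wake-word detection."""
    return json.dumps({"type": "wake", "confidence": float(confidence)})


pcm = float_to_pcm16([0.0, 0.5, -0.5])
print(len(pcm))          # 2 bytes per mono sample
print(vad_frame(True))
print(wake_frame(0.9))
```

Clipping before scaling avoids integer overflow when a capture pipeline produces samples slightly outside the nominal range.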

Out of scope (deliberately)

Avatar / lipsync. Handled by a separate solution; ScaiVoice reserves an `expression_hint` field in the WebSocket protocol for forward compatibility but emits nothing in v1.
Hosted bot personalities. Consumer products own their personality + business logic.
Hard real-time guarantees. Streaming first-frame latency is typically 100–300 ms; ScaiVoice is suitable for chat-style and IVR-style bots, not for ultra-low-latency call-routing.

Permissions

Permission	Who needs it
`scaivoice:use`	Any caller opening a session. Granted via direct module permission or via a custom role that bundles it.
`scaivoice:admin`	Tenant admins viewing session telemetry.

Status

v0.6.0 ships per-session voice control and a round of infrastructure hardening. See the changelog for the full history.