ScaiVoice
ScaiVoice is a backend framework for building voice bots. You open a session over a single WebSocket, pipe audio in, and get coordinated state events, transcripts, agent text, and synthesized audio back. STT, LLM cognition, and TTS run on the same ScaiGrid infrastructure that powers ScaiEcho, our chat completions, and ScaiSpeak — no separate wiring required.
ScaiVoice is a framework, not a product. There is no end-user UI shipped with it. Consumer applications (ScaiBot's voice mode, a telephony bot, your in-app personal assistant) build their own personality, UX, and business logic on top of the protocol it exposes.
What you get out of the box
| Capability | Default | Opt-in flag |
|---|---|---|
| Mic → STT → LLM → TTS pipeline | always | — |
| Conversation state machine (idle / listening / thinking / speaking / interrupted) | always | — |
| Streaming user transcripts (interim + final) | always | — |
| Streaming agent text + audio | always | — |
| Pick any voice from the ScaiSpeak voice library | always | voice_id per session |
| Per-session voice control (instructions, speed, cloning fidelity, warmup trim) | voice defaults | instructions, speed, cfg_value, warmup_trim_ms per session |
| Text normalisation (dates, times, currency, pronunciations) | tenant default | normalize_text per session |
| Anonymous speaker diarization | off | diarize per session |
| Barge-in (explicit interrupt frame) | always | — |
| Auto barge-in via VAD | off | vad_enabled (Phase 1) |
| Wake-word triggering | off | wake_word_enabled (Phase 2) |
| Live speaker identification | off | speaker_recognition (Phase 2; tenant opt-in) |
| Tool / skill execution | off | tools_enabled (Phase 3) |
The protocol is stable from Phase 0 — later phases light up opt-in flags without breaking integrations.
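As a rough illustration of how the per-session flags in the table might be gathered on the client, the sketch below collects them into one object and serialises only the ones that were set, so that the documented defaults apply to everything else. The field names come from the table; the `SessionOptions` class, the value types, the `"example-voice"` id, and the JSON shape are assumptions for illustration, not the documented ScaiVoice wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SessionOptions:
    """Hypothetical container for the per-session flags listed in the table.

    Field names match the table; the types and the JSON layout are assumed.
    """

    voice_id: Optional[str] = None
    instructions: Optional[str] = None
    speed: Optional[float] = None
    cfg_value: Optional[float] = None
    warmup_trim_ms: Optional[int] = None
    normalize_text: Optional[bool] = None
    diarize: Optional[bool] = None
    vad_enabled: Optional[bool] = None  # Phase 1 opt-in

    def to_payload(self) -> dict:
        # Leave out unset fields so the server-side defaults stay in effect.
        return {k: v for k, v in asdict(self).items() if v is not None}


opts = SessionOptions(voice_id="example-voice", speed=1.1, diarize=True)
print(json.dumps(opts.to_payload(), sort_keys=True))
```

Omitting unset fields, rather than sending explicit nulls, keeps a client written against an early phase from accidentally overriding flags that a later phase introduces.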
What you do on your side
- Audio capture + playback. Browser: `AudioContext` + `AudioWorklet` for 16 kHz PCM16 mono out, MediaSource or Web Audio for playback. Native: equivalent.
- VAD (optional). Client-side via silero-vad / webrtcvad, emit `{"type":"vad", speaking:true/false}` frames when you want auto barge-in.
- Wake word (optional). Client-side via openwakeword, emit `{"type":"wake", confidence}` when triggered.
- Bot personality, UI, business logic. All yours.
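The client-side duties above can be sketched in a few lines for a native (non-browser) client. The example converts float samples to PCM16 and builds the `vad` and `wake` control frames with the keys shown in the list. The helper names are hypothetical, and little-endian byte order and JSON text frames are assumptions; the source only specifies 16 kHz PCM16 mono and the frame fields.

```python
import json
import struct
from typing import Iterable


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Convert mono float samples in [-1.0, 1.0] to PCM16 bytes.

    Little-endian byte order is an assumption, not stated in the docs.
    """
    out = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overshoot
        out += struct.pack("<h", int(round(clipped * 32767)))
    return bytes(out)


def vad_frame(speaking: bool) -> str:
    """Build a VAD frame to send when auto barge-in is wanted."""
    return json.dumps({"type": "vad", "speaking": speaking})


def wake_frame(confidence: float) -> str:
    """Build a wake-word frame carrying the detector's confidence."""
    return json.dumps({"type": "wake", "confidence": confidence})


# At 16 kHz, a 20 ms chunk is 320 samples, i.e. 640 bytes of PCM16.
chunk = float_to_pcm16([0.0] * 320)
print(len(chunk), vad_frame(True), wake_frame(0.92))
```

The speaking flag and confidence value would come from whichever detector you run locally (silero-vad, webrtcvad, or openwakeword); those libraries are not called here.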
Out of scope (deliberately)
- Avatar / lipsync. Separate solution; ScaiVoice reserves an `expression_hint` field on the WS protocol for forward compatibility but emits nothing in v1.
- Hosted bot personalities. Consumer products own their personality + business logic.
- Hard real-time guarantees. Streaming first-frame latency is typically 100–300 ms; ScaiVoice is suitable for chat-style and IVR-style bots, not for ultra-low-latency call-routing.
Permissions
| Permission | Who needs it |
|---|---|
| `scaivoice:use` | Any caller opening a session. Granted via direct module permission or via a custom role that bundles it. |
| `scaivoice:admin` | Tenant admins viewing session telemetry. |
Status
v0.6.0 ships per-session voice control and a round of infrastructure hardening. See the changelog for the full history.