ScaiVoice

ScaiVoice is a backend framework for building voice bots. You open a session over a single WebSocket, pipe audio in, and get coordinated state events, transcripts, agent text, and synthesized audio back. STT, LLM cognition, and TTS run on the same ScaiGrid infrastructure that powers ScaiEcho, our chat completions, and ScaiSpeak — no separate wiring required.
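As a rough illustration of the client side of that exchange, the sketch below folds incoming frames into a local view of the session. The event names (`state`, `transcript`, `agent_text`), their field names, and the idea that synthesized audio arrives as binary frames are assumptions made for this example, not the documented wire format; only the five state names come from this page.

```python
import json
from dataclasses import dataclass, field

# State names are from the capability table; the rest of the wire format
# used here is hypothetical.
KNOWN_STATES = {"idle", "listening", "thinking", "speaking", "interrupted"}


@dataclass
class SessionView:
    """Client-side picture of one voice session, built from incoming frames."""
    state: str = "idle"
    user_transcript: str = ""
    agent_text: str = ""
    audio: bytearray = field(default_factory=bytearray)

    def handle(self, message):
        # Assumption: synthesized audio arrives as binary WebSocket frames.
        if isinstance(message, (bytes, bytearray)):
            self.audio.extend(message)
            return
        event = json.loads(message)
        kind = event.get("type")
        if kind == "state" and event.get("state") in KNOWN_STATES:
            self.state = event["state"]
        elif kind == "transcript":
            # Each interim result replaces the previous one.
            self.user_transcript = event.get("text", "")
        elif kind == "agent_text":
            # Agent text is streamed in chunks, so append.
            self.agent_text += event.get("text", "")
        # Unknown event types are ignored so that a newer server
        # does not break an older client.


view = SessionView()
view.handle('{"type": "state", "state": "thinking"}')
view.handle('{"type": "agent_text", "text": "Hello"}')
view.handle(b"\x00\x00")
```

A real client would call `handle` from its WebSocket receive loop and route `view.audio` to a playback buffer.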

ScaiVoice is a framework, not a product. There is no end-user UI shipped with it. Consumer applications (ScaiBot's voice mode, a telephony bot, your in-app personal assistant) build their own personality, UX, and business logic on top of the protocol it exposes.

What you get out of the box

| Capability | Default | Opt-in flag |
|---|---|---|
| Mic → STT → LLM → TTS pipeline | always | |
| Conversation state machine (idle / listening / thinking / speaking / interrupted) | always | |
| Streaming user transcripts (interim + final) | always | |
| Streaming agent text + audio | always | |
| Pick any voice from the ScaiSpeak voice library | always | voice_id per session |
| Per-session voice control (instructions, speed, cloning fidelity, warmup trim) | voice defaults | instructions, speed, cfg_value, warmup_trim_ms per session |
| Text normalisation (dates, times, currency, pronunciations) | tenant default | normalize_text per session |
| Anonymous speaker diarization | off | diarize per session |
| Barge-in (explicit interrupt frame) | always | |
| Auto barge-in via VAD | off | vad_enabled (Phase 1) |
| Wake-word triggering | off | wake_word_enabled (Phase 2) |
| Live speaker identification | off | speaker_recognition (Phase 2; tenant opt-in) |
| Tool / skill execution | off | tools_enabled (Phase 3) |

The protocol is stable from Phase 0 — later phases light up opt-in flags without breaking integrations.
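One way a client might assemble the per-session flags is sketched below. The flag names are the ones listed in the capability table; the value types, the `session.start` frame name, the `config` envelope, and the example voice ID are assumptions for illustration only. Sending only the flags that were explicitly set leaves everything else at the voice, tenant, or platform default.

```python
import json

# Flag names come from the capability table. The Python types assigned to
# them here are guesses, not documented behaviour.
OPT_IN_FLAGS = {
    "voice_id": str,
    "instructions": str,
    "speed": float,
    "cfg_value": float,
    "warmup_trim_ms": int,
    "normalize_text": bool,
    "diarize": bool,
    "vad_enabled": bool,
    "wake_word_enabled": bool,
    "speaker_recognition": bool,
    "tools_enabled": bool,
}


def build_session_start(**flags):
    """Return a JSON frame containing only the flags the caller set."""
    config = {}
    for name, value in flags.items():
        expected = OPT_IN_FLAGS.get(name)
        if expected is None:
            raise ValueError(f"unknown flag: {name}")
        # bool is a subclass of int, so check it first to keep
        # True/False out of the numeric flags.
        if isinstance(value, bool) != (expected is bool):
            raise TypeError(f"{name} expects {expected.__name__}")
        if expected is float and isinstance(value, int):
            value = float(value)
        if not isinstance(value, expected):
            raise TypeError(f"{name} expects {expected.__name__}")
        config[name] = value
    # "session.start" and "config" are hypothetical names.
    return json.dumps({"type": "session.start", "config": config}, sort_keys=True)


frame = build_session_start(voice_id="example-voice", speed=1, diarize=True)
```

Rejecting unknown flag names on the client catches typos early; whether the server itself rejects or ignores unknown flags is not stated here.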

What you do on your side

  • Audio capture + playback. In the browser, use AudioContext with an AudioWorklet to produce 16 kHz PCM16 mono for upload, and MediaSource or Web Audio for playback. Native clients use the platform equivalents.
  • VAD (optional). Run it client-side with silero-vad or webrtcvad, and emit {"type":"vad","speaking":true} or {"type":"vad","speaking":false} frames when you want auto barge-in.
  • Wake word (optional). Run it client-side with openwakeword, and emit a {"type":"wake","confidence":<score>} frame when triggered.
  • Bot personality, UI, business logic. All yours.
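The client-side pieces above can be sketched in a few lines of Python. The `vad` and `wake` frame shapes follow the list above; little-endian byte order for PCM16, the choice to send a `vad` frame only when the speaking decision changes, and the helper names are assumptions. Resampling to 16 kHz and the VAD and wake-word models themselves are left out.

```python
import json
import struct


def float_to_pcm16(samples):
    """Pack mono float samples in [-1.0, 1.0] as 16-bit PCM.

    Little-endian byte order is assumed; out-of-range samples are clipped.
    """
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


class VadEdgeEmitter:
    """Turn per-chunk VAD decisions into frames, one per state change."""

    def __init__(self):
        self.speaking = False

    def update(self, speaking):
        # Nothing to send while the decision is unchanged.
        if speaking == self.speaking:
            return None
        self.speaking = speaking
        return json.dumps({"type": "vad", "speaking": speaking})


def wake_frame(confidence):
    """Build the frame sent when the wake-word detector fires."""
    return json.dumps({"type": "wake", "confidence": confidence})


emitter = VadEdgeEmitter()
frames = [emitter.update(d) for d in (False, True, True, False)]
```

Here `frames` holds `None` for the chunks where nothing changed and a JSON string for the two transitions, which keeps control traffic low compared with sending a frame for every audio chunk.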

Out of scope (deliberately)

  • Avatar / lipsync. Separate solution; ScaiVoice reserves an expression_hint field on the WS protocol for forward compatibility but emits nothing in v1.
  • Hosted bot personalities. Consumer products own their personality + business logic.
  • Hard real-time guarantees. Streaming first-frame latency is typically 100–300 ms; ScaiVoice is suitable for chat-style and IVR-style bots, not for ultra-low-latency call-routing.

Permissions

| Permission | Who needs it |
|---|---|
| scaivoice:use | Any caller opening a session. Granted via direct module permission or via a custom role that bundles it. |
| scaivoice:admin | Tenant admins viewing session telemetry. |
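The two grant paths for scaivoice:use (a direct module permission, or a custom role that bundles it) amount to a simple membership check. The sketch below is illustrative only: the data shapes and the role name are hypothetical, and the platform's actual permission check is not described on this page.

```python
def has_permission(permission, direct_permissions, roles):
    """Return True if the permission is granted directly or via any role.

    direct_permissions: set of permission strings granted to the caller.
    roles: mapping of role name -> set of permissions the role bundles.
    """
    if permission in direct_permissions:
        return True
    return any(permission in bundled for bundled in roles.values())


# Hypothetical caller with no direct grant but a custom role.
caller_roles = {"voice-operator": {"scaivoice:use"}}
can_open_session = has_permission("scaivoice:use", set(), caller_roles)
can_view_telemetry = has_permission("scaivoice:admin", set(), caller_roles)
```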

Status

v0.6.0 ships per-session voice control and a round of infrastructure hardening. See the changelog for the full history.

Updated 2026-05-25 23:46:26 (rev 3)