---
summary: "Voice-bot framework \u2014 coordinated speech-to-text, LLM cognition, and\
  \ text-to-speech behind a single WebSocket session. Build voice bots without wiring\
  \ STT + LLM + TTS yourself."
title: ScaiVoice
path: overview
status: published
---

ScaiVoice is a backend framework for building voice bots. You open a session over a single WebSocket, pipe audio in, and get coordinated state events, transcripts, agent text, and synthesized audio back. STT, LLM cognition, and TTS run on the same ScaiGrid infrastructure that powers ScaiEcho, our chat completions, and ScaiSpeak — no separate wiring required.

ScaiVoice is a **framework**, not a product. There is no end-user UI shipped with it. Consumer applications (ScaiBot's voice mode, a telephony bot, your in-app personal assistant) build their own personality, UX, and business logic on top of the protocol it exposes.

## What you get out of the box

| Capability | Default | Opt-in flag |
|---|---|---|
| Mic → STT → LLM → TTS pipeline | always | — |
| Conversation state machine (`idle / listening / thinking / speaking / interrupted`) | always | — |
| Streaming user transcripts (interim + final) | always | — |
| Streaming agent text + audio | always | — |
| Pick any voice from the ScaiSpeak voice library | always | `voice_id` per session |
| Per-session voice control (instructions, speed, cloning fidelity, warmup trim) | voice defaults | `instructions`, `speed`, `cfg_value`, `warmup_trim_ms` per session |
| Text normalisation (dates, times, currency, pronunciations) | tenant default | `normalize_text` per session |
| Anonymous speaker diarization | off | `diarize` per session |
| Barge-in (explicit interrupt frame) | always | — |
| Auto barge-in via VAD | off | `vad_enabled` (Phase 1) |
| Wake-word triggering | off | `wake_word_enabled` (Phase 2) |
| Live speaker identification | off | `speaker_recognition` (Phase 2; tenant opt-in) |
| Tool / skill execution | off | `tools_enabled` (Phase 3) |

The protocol is stable from Phase 0 — later phases light up opt-in flags without breaking integrations.
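
As a rough illustration of the table above, the opt-in flags and the state names can be modelled as types on the client. This is only a sketch: the field names and state names come from the table, but the value types, the `SessionOptions` and `buildSessionOptions` names, and the idea that options travel as a single JSON object are assumptions, not the documented wire format.

```typescript
// State names as listed in the capability table.
type ConversationState =
  | "idle"
  | "listening"
  | "thinking"
  | "speaking"
  | "interrupted";

const CONVERSATION_STATES: readonly string[] = [
  "idle",
  "listening",
  "thinking",
  "speaking",
  "interrupted",
];

// Narrow an untrusted value (for example, a field of a parsed event) to a known state.
function isConversationState(value: unknown): value is ConversationState {
  return typeof value === "string" && CONVERSATION_STATES.indexOf(value) !== -1;
}

// Hypothetical per-session options. Flag names are from the table;
// the types (string / number / boolean) are assumptions.
interface SessionOptions {
  voice_id?: string;
  instructions?: string;
  speed?: number;
  cfg_value?: number;
  warmup_trim_ms?: number;
  normalize_text?: boolean;
  diarize?: boolean;
  vad_enabled?: boolean; // Phase 1
  wake_word_enabled?: boolean; // Phase 2
  speaker_recognition?: boolean; // Phase 2; tenant opt-in
  tools_enabled?: boolean; // Phase 3
}

// Serialize only the flags the caller set; omitted flags are left out of
// the JSON so the server-side defaults from the table can apply.
function buildSessionOptions(options: SessionOptions): string {
  return JSON.stringify(options);
}

const payload = buildSessionOptions({ voice_id: "example-voice", diarize: true });
```

Leaving unset flags out of the payload, rather than sending explicit `false` values, is one way to stay compatible as later phases add flags.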

## What you do on your side

- **Audio capture + playback.** In the browser, use `AudioContext` + `AudioWorklet` to produce 16 kHz PCM16 mono for upload, and MediaSource or Web Audio for playback. Native clients use the platform equivalents.
- **VAD (optional).** Run it client-side via silero-vad or webrtcvad, and emit `{"type":"vad","speaking":true}` (or `false`) frames when you want auto barge-in.
- **Wake word (optional).** Run it client-side via openwakeword, and emit a `{"type":"wake","confidence":<value>}` frame when triggered.
- **Bot personality, UI, business logic.** All yours.
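
The client-side duties above can be sketched as a few small helpers. This is a minimal sketch under stated assumptions: the function names are hypothetical, the input is assumed to be already resampled to 16 kHz mono, `confidence` is assumed to be a number, and how the PCM bytes are framed on the WebSocket is not covered here.

```typescript
// Convert Web Audio float samples (nominally -1..1) to signed 16-bit PCM.
// Assumes the samples are already 16 kHz mono; resampling is not shown.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping.
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative values scale to -32768, positive values to 32767.
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

// Build the VAD control frame described above as a JSON string.
function vadFrame(speaking: boolean): string {
  return JSON.stringify({ type: "vad", speaking });
}

// Build the wake-word control frame described above as a JSON string.
// Treating `confidence` as a plain number is an assumption.
function wakeFrame(confidence: number): string {
  return JSON.stringify({ type: "wake", confidence });
}
```

In a browser, `floatToPcm16` would typically run on the buffers an `AudioWorklet` processor hands back, and the frame helpers would feed `WebSocket.send` on the open session.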

## Out of scope (deliberately)

- **Avatar / lipsync.** Separate solution; ScaiVoice reserves an `expression_hint` field on the WS protocol for forward compatibility but emits nothing in v1.
- **Hosted bot personalities.** Consumer products own their personality + business logic.
- **Hard real-time guarantees.** Streaming first-frame latency is typically 100–300 ms; ScaiVoice is suitable for chat-style and IVR-style bots, not for ultra-low-latency call-routing.

## Permissions

| Permission | Who needs it |
|---|---|
| `scaivoice:use` | Any caller opening a session. Granted via direct module permission or via a custom role that bundles it. |
| `scaivoice:admin` | Tenant admins viewing session telemetry. |

## Status

v0.6.0 ships per-session voice control and a round of infrastructure hardening. See the [changelog](./changelog) for the full history.