Client-side VAD integration
ScaiVoice's auto barge-in (cancel the bot mid-reply when the user starts talking) needs the client to detect "user is now speaking" and tell the server. VAD lives on the client side — sending continuous mic frames purely to detect silence-vs-speech server-side would be wasteful, and the latency from a round-trip would defeat the purpose.
This page covers the recommended browser-side integration. Native clients use the same emit pattern; only the VAD library differs.
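What ScaiVoice expects
The client can emit two frames at any time: one carrying `speaking: true` when the user starts talking, and one carrying `speaking: false` when they stop.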
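A minimal sketch of how the two frames could be built, assuming JSON text frames with a hypothetical `type: "vad"` discriminator. Only the `speaking` field is taken from this page, so check the remaining field names against the protocol reference before relying on them.

```typescript
// Hypothetical wire shape: only `speaking` comes from this page;
// the "vad" type tag is an assumption made for illustration.
interface VadFrame {
  type: "vad";
  speaking: boolean;
}

// Serialise a VAD signal as a JSON text frame.
function vadFrame(speaking: boolean): string {
  const frame: VadFrame = { type: "vad", speaking };
  return JSON.stringify(frame);
}
```

Calling `vadFrame(true)` yields the "user started talking" frame and `vadFrame(false)` the "user stopped talking" frame.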
Behaviour by state:
| Client emits | Session state | Server does |
|---|---|---|
| `speaking: true` | `thinking` or `speaking` | Cancels the current turn (LLM + TTS) within ~100 ms. State → `listening` with `reason: "interrupted_by_user"`. |
| `speaking: true` | `listening` | No-op: user talking during listening is the expected state. |
| `speaking: true` | `idle` or `interrupted` | No-op: nothing to cancel. |
| `speaking: false` | any | Informational. Doesn't drive state. |
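The table above can be mirrored as a small pure function, which may help when a client wants to predict the outcome locally (for example, to stop audio playback right away). This is only a client-side model under assumed names (`SessionState`, `ServerReaction`, `expectedReaction` are all hypothetical); the server remains the source of truth.

```typescript
// State names are taken from the table; the type itself is hypothetical.
type SessionState = "idle" | "listening" | "thinking" | "speaking" | "interrupted";

interface ServerReaction {
  cancelsTurn: boolean;
  nextState: SessionState; // session state after the frame is handled
  reason?: string;
}

// Client-side model of the behaviour table, not the server implementation.
function expectedReaction(speaking: boolean, state: SessionState): ServerReaction {
  const bargeIn = speaking && (state === "thinking" || state === "speaking");
  if (bargeIn) {
    return { cancelsTurn: true, nextState: "listening", reason: "interrupted_by_user" };
  }
  // Every other combination is a no-op or informational: state is unchanged.
  return { cancelsTurn: false, nextState: state };
}
```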
There is no required minimum interval between frames, but emit only when the VAD's speaking state actually changes; don't send a frame for every VAD inference.
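One way to emit on transitions only is to remember the last value sent and drop repeats. A minimal sketch, assuming the caller supplies a `send` function and that the frame carries a hypothetical `type: "vad"` tag (`TransitionEmitter` is an illustrative name, not part of ScaiVoice):

```typescript
// Sends a frame only when the speaking flag changes.
class TransitionEmitter {
  private last = false; // sessions start in silence, so initial silence is not emitted
  private readonly send: (frame: string) => void;

  constructor(send: (frame: string) => void) {
    this.send = send;
  }

  // Call once per VAD decision; returns true if a frame was sent.
  update(speaking: boolean): boolean {
    if (speaking === this.last) return false; // no transition, nothing to send
    this.last = speaking;
    // Frame shape is an assumption: only `speaking` is documented here.
    this.send(JSON.stringify({ type: "vad", speaking }));
    return true;
  }
}
```

Feeding it one decision per VAD frame is safe: a run of identical decisions produces at most one frame.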
Recommended browser library
silero-vad is the strong default: small (4 MB), fast (about a millisecond per inference), MIT-licensed, and shipped as an ONNX model that runs in onnxruntime-web. The trade-off versus webrtcvad: silero is more accurate on low-SNR audio, at the cost of needing the WASM runtime.
Reference integration
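A minimal sketch of the emit pattern, assuming the VAD library accepts `onSpeechStart` and `onSpeechEnd` callbacks (`onSpeechEnd` and every type and function name below are assumptions) and that VAD signals travel as JSON text frames with a hypothetical `type: "vad"` tag. The socket is typed structurally so the sketch does not depend on a particular VAD package or on DOM typings.

```typescript
// Minimal structural types. A browser WebSocket satisfies FrameSocket.
interface FrameSocket {
  readonly readyState: number;
  send(data: string): void;
}

interface VadCallbacks {
  onSpeechStart: () => void;
  onSpeechEnd: () => void;
}

const SOCKET_OPEN = 1; // same numeric value as WebSocket.OPEN

// Builds the callbacks to hand to the VAD library when constructing it.
function createBargeInCallbacks(socket: FrameSocket): VadCallbacks {
  const emit = (speaking: boolean): void => {
    if (socket.readyState !== SOCKET_OPEN) return; // drop signals while disconnected
    // Frame shape is assumed; only `speaking` is documented on this page.
    socket.send(JSON.stringify({ type: "vad", speaking }));
  };
  return {
    onSpeechStart: () => emit(true),
    onSpeechEnd: () => emit(false),
  };
}
```

The returned object can be merged into the VAD library's options next to its threshold and frame-count settings. Because the library fires these callbacks once per utterance boundary, no extra de-duplication is needed on this path.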
Mic audio: separate from the VAD signal
Phase 1 doesn't pipe mic frames into ScaiVoice's STT yet; in the demo path, turns are driven by `{"type":"text"}` frames. When Phase 2 wires up mic-piped STT, a single AudioWorklet downsampling stage will produce both the bytes sent on the binary path and the bytes the VAD library sees. One mic source, two consumers.
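The "one source, two consumers" idea can be sketched as a downsample step followed by a fan-out. Everything below is an assumption for illustration: the 3:1 ratio (for example 48 kHz to 16 kHz), 16-bit PCM on the binary path, and all function names. The averaging filter is deliberately crude; a production worklet would use a proper low-pass filter.

```typescript
type FrameConsumer = (frame: Float32Array) => void;

// Naive decimation: average each group of `factor` input samples.
function downsample(input: Float32Array, factor: number): Float32Array {
  const out = new Float32Array(Math.floor(input.length / factor));
  for (let i = 0; i < out.length; i++) {
    let sum = 0;
    for (let j = 0; j < factor; j++) sum += input[i * factor + j];
    out[i] = sum / factor;
  }
  return out;
}

// Convert [-1, 1] floats to 16-bit PCM for a binary audio path (assumed format).
function toPcm16(frame: Float32Array): Int16Array {
  const pcm = new Int16Array(frame.length);
  for (let i = 0; i < frame.length; i++) {
    const clamped = Math.max(-1, Math.min(1, frame[i]));
    pcm[i] = Math.round(clamped * 32767);
  }
  return pcm;
}

// Deliver every downsampled frame to all consumers (e.g. VAD and uplink).
function fanOut(consumers: FrameConsumer[]): FrameConsumer {
  return (frame) => {
    for (const consume of consumers) consume(frame);
  };
}
```

In this arrangement the VAD consumer receives the `Float32Array` directly, while the uplink consumer would call `toPcm16` before sending, so both see exactly the same samples.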
Tuning advice
- False positives during TTS playback. If your TTS output bleeds into the mic, the VAD will trigger on the bot's own voice. Mitigation: use a headset, or apply acoustic echo cancellation client-side. `getUserMedia({audio: {echoCancellation: true}})` is the cheap option and works well for most browser scenarios.
- Speech threshold too low or too high. Below 0.85 you get false triggers on background noise; above 0.95 the bot can't be interrupted by a quiet "actually, wait". Start at 0.85 and tune from there.
- Minimum speech frames. `minSpeechFrames: 3` means ~96 ms of confirmed speech before `onSpeechStart` fires. Lower it for snappier barge-in; raise it to absorb tongue clicks and breath sounds. The trade-off is barge-in latency versus false-positive rate.
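To make the threshold and frame-count trade-off concrete, here is a simplified gate that confirms speech only after a run of consecutive frames at or above the threshold. It is a sketch, not the VAD library's actual logic: real implementations typically add a separate lower threshold and a grace period before ending speech. The 32 ms frame length is an assumption inferred from "3 frames ≈ 96 ms", and `SpeechStartGate` and `confirmLatencyMs` are hypothetical names.

```typescript
class SpeechStartGate {
  private run = 0; // consecutive frames at or above the threshold
  private active = false; // true once speech has been confirmed
  private readonly threshold: number;
  private readonly minSpeechFrames: number;

  constructor(threshold = 0.85, minSpeechFrames = 3) {
    this.threshold = threshold;
    this.minSpeechFrames = minSpeechFrames;
  }

  // Feed one speech probability per VAD frame.
  // Returns true only on the frame where speech becomes confirmed.
  push(probability: number): boolean {
    if (probability >= this.threshold) {
      this.run += 1;
      if (!this.active && this.run >= this.minSpeechFrames) {
        this.active = true;
        return true;
      }
    } else {
      // Simplification: a single quiet frame resets the gate.
      this.run = 0;
      this.active = false;
    }
    return false;
  }
}

// Added barge-in latency, assuming 32 ms frames (512 samples at 16 kHz).
function confirmLatencyMs(minSpeechFrames: number, frameMs = 32): number {
  return minSpeechFrames * frameMs;
}
```

With the defaults, a two-frame click followed by silence never fires, while three loud frames in a row fire on the third, which is the ~96 ms of latency bought in exchange for fewer false positives.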
Without VAD
Skip everything above and the bot still works: barge-in is opt-in. Without VAD the user has two options:
- Click an "interrupt" button in the UI that sends `{"type":"interrupt"}`.
- Wait for the bot to finish.
Most chat UIs ship with the button as a fallback even when VAD is enabled, so a user can interrupt before VAD picks up their first word.
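A minimal sketch of the button path. The `{"type":"interrupt"}` frame is the one named above; `TextFrameSink` and `sendInterrupt` are hypothetical names, and the sink is anything with a `send(string)` method, such as an open browser WebSocket.

```typescript
interface TextFrameSink {
  send(data: string): void;
}

// Sends the manual interrupt frame described in this section.
function sendInterrupt(sink: TextFrameSink): void {
  sink.send(JSON.stringify({ type: "interrupt" }));
}
```

In a browser this would typically be wired to a click handler on the interrupt button, with the open socket passed in as the sink.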
Server-side VAD (if you really need it)
ScaiVoice doesn't run VAD server-side today. If you have a use case (server-recorded audio with no client to run VAD, batch scenarios), file an integration request — wiring silero-vad into the streaming-STT path is a small change but it's not in current scope.