---
summary: How to wire voice-activity detection on the client and emit the right WebSocket frames so ScaiVoice can drive automatic barge-in.
title: Client-side VAD integration
path: tutorials/client-vad-integration
status: published
---

ScaiVoice's auto barge-in (cancelling the bot mid-reply when the user starts talking) needs the client to detect that the user has started speaking and tell the server. VAD lives on the client: streaming continuous mic frames to the server purely to distinguish silence from speech would be wasteful, and the round-trip latency would defeat the purpose.

This page covers the recommended browser-side integration. Native clients use the same emit pattern; only the VAD library differs.

## What ScaiVoice expects

Two frames, which you can emit at any time:

```json
{"type": "vad", "speaking": true}
{"type": "vad", "speaking": false}
```

Behaviour by state:

| Client emits | Session state | Server does |
|---|---|---|
| `speaking: true` | `thinking` or `speaking` | Cancels the current turn (LLM + TTS) within ~100 ms. State → `listening` with `reason: "interrupted_by_user"`. |
| `speaking: true` | `listening` | No-op: the user talking while the session is listening is the expected state. |
| `speaking: true` | `idle` or `interrupted` | No-op: there is nothing to cancel. |
| `speaking: false` | any | Informational. Doesn't drive state. |

There is no enforced minimum interval between frames, but emit only when the speaking state actually changes; don't send a frame at the VAD's frame rate.

## Recommended browser library

[silero-vad](https://github.com/snakers4/silero-vad) is the strong default: small (4 MB), fast (single-millisecond inference), MIT-licensed, and it ships an ONNX model that runs in `onnxruntime-web`. Trade-off versus webrtcvad: silero is more accurate on low-SNR audio, at the cost of needing the WASM runtime.

## Reference integration

```html
```

## Mic audio: separate from the VAD signal

Phase 1 doesn't pipe mic frames into ScaiVoice's STT yet; turns are driven by `{"type":"text"}` frames in the demo path.
When Phase 2 wires up mic-piped STT, the audio path will be:

```
Microphone → AudioWorklet → 16 kHz PCM16 mono frames → WS binary
                          ↘ VAD inference → onSpeechStart/End → JSON frames
```

The same AudioWorklet downsampling stage produces both the bytes sent on the binary path and the samples the VAD library sees. One mic source, two consumers.

## Tuning advice

- **False positives during TTS playback.** If your TTS output bleeds into the mic, the VAD will trigger on the bot's own voice. Mitigation: use a headset, or apply acoustic echo cancellation client-side. `getUserMedia({audio: {echoCancellation: true}})` is the cheap option and works well for most browser scenarios.
- **Speech threshold too low or too high.** Below 0.85 you get false triggers on background noise; above 0.95 the bot can't be interrupted by a quiet "actually, wait". Start at 0.85 and tune from there.
- **Minimum speech frames.** `minSpeechFrames: 3` means ~96 ms of confirmed speech (three frames of roughly 32 ms each) before `onSpeechStart` fires. Lower it for snappier barge-in; raise it to absorb tongue clicks and breath sounds. The trade-off is barge-in latency versus false-positive rate.

## Without VAD

Skip everything above and the bot still works: barge-in is opt-in. Without VAD the user has two options:

- Click an "interrupt" button in the UI that sends `{"type":"interrupt"}`.
- Wait for the bot to finish.

Most chat UIs ship the button as a fallback even when VAD is enabled, so a user can interrupt before VAD picks up their first word.

## Server-side VAD (if you really need it)

ScaiVoice doesn't run VAD server-side today. If you have a use case (server-recorded audio with no client to run VAD, batch scenarios), file an integration request. Wiring silero-vad into the streaming-STT path is a small change, but it's not in current scope.
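
Whichever side runs the VAD, the rule from "What ScaiVoice expects" still applies: send a `vad` frame only when the speaking state changes. A minimal sketch of that edge-triggered emit pattern follows. `VadFrameEmitter` and `SendFn` are hypothetical names made up for illustration; they are not part of ScaiVoice or of any VAD library, and the only thing taken from this page is the frame shape.

```typescript
// Hypothetical helper (not a ScaiVoice or VAD-library API): forwards VAD
// callbacks as {"type": "vad", ...} frames, but only on state transitions.
type SendFn = (payload: string) => void;

class VadFrameEmitter {
  private speaking = false; // last state reported to the server
  private readonly send: SendFn;

  constructor(send: SendFn) {
    this.send = send;
  }

  // Call from the VAD library's speech-start callback.
  onSpeechStart(): void {
    this.update(true);
  }

  // Call from the VAD library's speech-end callback.
  onSpeechEnd(): void {
    this.update(false);
  }

  private update(next: boolean): void {
    if (next === this.speaking) return; // no transition, so send nothing
    this.speaking = next;
    this.send(JSON.stringify({ type: "vad", speaking: next }));
  }
}
```

In a browser this could be wired up as `new VadFrameEmitter((p) => ws.send(p))`, assuming `ws` is an already-open `WebSocket` to the session. Keeping the deduplication in one place means a VAD library that fires its callbacks repeatedly still results in at most one frame per transition.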