Client-side VAD integration

ScaiVoice's auto barge-in (cancelling the bot mid-reply when the user starts talking) needs the client to detect that the user is now speaking and tell the server. Voice activity detection (VAD) therefore runs on the client: streaming continuous mic frames to the server purely to tell silence from speech would be wasteful, and the round-trip latency would defeat the purpose.

This page covers the recommended browser-side integration. Native clients use the same emit pattern; only the VAD library differs.

What ScaiVoice expects

Two frames you can emit any time:

```json
{"type": "vad", "speaking": true}
{"type": "vad", "speaking": false}
```

Behaviour by state:

| Client emits | Session state | Server does |
| --- | --- | --- |
| `speaking: true` | `thinking` or `speaking` | Cancels the current turn (LLM + TTS) within ~100 ms. State → `listening` with `reason: "interrupted_by_user"`. |
| `speaking: true` | `listening` | No-op: user talking during listening is the expected state. |
| `speaking: true` | `idle` or `interrupted` | No-op: nothing to cancel. |
| `speaking: false` | any | Informational. Doesn't drive state. |
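
The table can be read as a small decision function. The sketch below is only an illustrative model of the documented behaviour, useful for client-side tests or for deciding whether to stop local playback early; it is not ScaiVoice's server code, and the function name and return labels are made up for this example.

```javascript
// Illustrative model of the behaviour table above (hypothetical names).
// Returns a label describing what the server is documented to do.
function expectedServerAction(frame, sessionState) {
  if (frame.type !== "vad") return "not-a-vad-frame";
  if (!frame.speaking) return "informational"; // speaking: false never drives state
  switch (sessionState) {
    case "thinking":
    case "speaking":
      // Turn is cancelled; state becomes "listening",
      // reason "interrupted_by_user".
      return "cancel-turn";
    default:
      // "listening", "idle", "interrupted": nothing to cancel.
      return "no-op";
  }
}
```

For example, `expectedServerAction({type: "vad", speaking: true}, "speaking")` yields `"cancel-turn"`, while the same frame in the `listening` state yields `"no-op"`.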

There is no enforced minimum interval between frames, but emit only when the speaking state actually changes. Don't send a frame for every VAD inference.
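
One way to guarantee that frames go out only on transitions is to wrap the send call in a small edge-triggered helper. This is a minimal sketch; `createVadEmitter` and its `send` argument are hypothetical names, and it assumes the session starts with the user silent.

```javascript
// Wraps a send function so a vad frame is sent only when the
// speaking state changes (hypothetical helper, not a ScaiVoice API).
function createVadEmitter(send) {
  let lastSpeaking = false; // assumption: user is silent at start
  return function emit(speaking) {
    if (speaking === lastSpeaking) return false; // no transition, nothing sent
    lastSpeaking = speaking;
    send(JSON.stringify({ type: "vad", speaking }));
    return true;
  };
}
```

In a browser this might be wired up as `const emit = createVadEmitter((msg) => ws.send(msg));` and called from the VAD callbacks with `emit(true)` and `emit(false)`.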

Recommended browser library

silero-vad is the strong default: small (4 MB), fast (single-millisecond inference), MIT-licensed, and it ships an ONNX model that runs in onnxruntime-web. The reference integration on this page uses it through the `@ricky0123/vad-web` wrapper. Trade-off versus webrtcvad: silero is more accurate on low-SNR audio at the cost of needing the WASM runtime.

Reference integration

```html
<script type="module">
// Bare specifier: resolve it with a bundler or an import map.
import { MicVAD } from "@ricky0123/vad-web";

// WS_URL and TOKEN are placeholders supplied by your application.
const ws = new WebSocket(`wss://scaigrid.scailabs.ai${WS_URL}?token=${TOKEN}`);
ws.binaryType = "arraybuffer";

// Last state recorded for the server; frames go out only when it changes.
let lastSpeaking = false;

function sendVad(speaking) {
  if (speaking === lastSpeaking) return;
  lastSpeaking = speaking;
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({type: "vad", speaking}));
  }
}

const vad = await MicVAD.new({
  // Tune these for your room conditions; the defaults are sane.
  positiveSpeechThreshold: 0.85,
  negativeSpeechThreshold: 0.5,
  minSpeechFrames: 3,

  onSpeechStart: () => sendVad(true),
  onSpeechEnd: () => sendVad(false),
  // A segment shorter than minSpeechFrames ends here instead of in
  // onSpeechEnd; reset the state or the next start would be skipped.
  onVADMisfire: () => sendVad(false),
});

vad.start();

// Don't forget to vad.pause() when the user leaves the voice UI.
</script>
```
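
The reminder about `vad.pause()` can be automated. The sketch below treats "the user leaves the voice UI" as "the page becomes hidden", which is an assumption; a single-page app might hook a route change or a modal close instead. `bindVadToVisibility` is a hypothetical helper, and it only relies on the `start()` and `pause()` methods already used above.

```javascript
// Pause the VAD while the page is hidden; resume when it is visible again.
// `doc` is normally the global `document`; it is a parameter so the
// helper can be exercised with a stand-in object.
function bindVadToVisibility(vad, doc) {
  const onChange = () => {
    if (doc.hidden) {
      vad.pause();
    } else {
      vad.start();
    }
  };
  doc.addEventListener("visibilitychange", onChange);
  // Returns a function that removes the listener again.
  return () => doc.removeEventListener("visibilitychange", onChange);
}
```

In the integration above this would be called once after `vad.start()`, for example `const unbind = bindVadToVisibility(vad, document);`.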

Mic audio: separate from the VAD signal

Phase 1 doesn't pipe mic frames into ScaiVoice's STT yet; turns are driven by `{"type":"text"}` frames in the demo path. Once Phase 2 wires in mic-piped STT, the audio path is:

```text
Microphone → AudioWorklet → 16 kHz PCM16 mono frames → WS binary
                                                    ↘
                                              VAD inference → onSpeechStart/End → JSON frames
```

The same AudioWorklet downsampling step produces both the bytes sent on the binary path and the samples the VAD library sees. One mic source, two consumers.
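
The conversion step in that path can be sketched as two pure functions: one that downsamples mono Float32 samples to 16 kHz, and one that quantises them to PCM16. This is an illustration under assumptions, not the Phase 2 implementation: the function names are hypothetical, and averaging each source window is only a crude low-pass filter, so a production worklet may want a proper resampler.

```javascript
// Downsample mono Float32 samples to 16 kHz by averaging each source window.
function downsampleTo16k(input, inputRate) {
  const targetRate = 16000;
  if (inputRate === targetRate) return Float32Array.from(input);
  if (inputRate < targetRate) {
    throw new RangeError("input rate must be at least 16000 Hz");
  }
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    out[i] = sum / (end - start);
  }
  return out;
}

// Convert Float32 samples in [-1, 1] to signed 16-bit PCM.
function floatToPcm16(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range input
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```

The Float32 output of `downsampleTo16k` would feed the VAD, while `floatToPcm16(...).buffer` is what would go out as a binary WebSocket frame.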

Tuning advice

- False positives during TTS playback. If your TTS output bleeds into the mic, the VAD will trigger on the bot's own voice. Mitigation: use a headset, or apply acoustic echo cancellation client-side. `getUserMedia({audio: {echoCancellation: true}})` is the cheap option and works well for most browser scenarios.
- Speech threshold (`positiveSpeechThreshold`). Below 0.85 you get false triggers on background noise; above 0.95 the bot can't be interrupted by a quiet "actually, wait". Start at 0.85 and tune from there.
- Minimum speech frames. `minSpeechFrames: 3` means ~96 ms of confirmed speech (three 32 ms frames, assuming 512-sample frames at 16 kHz) before `onSpeechStart` fires. Lower it for snappier barge-in; raise it to absorb tongue clicks and breath sounds. The trade-off is barge-in latency versus false-positive rate. Verify this against your vad-web version: some releases fire `onSpeechStart` on the first positive frame and use `minSpeechFrames` only to choose between `onSpeechEnd` and `onVADMisfire`.
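
The arithmetic behind the frame count is easy to check when tuning. The sketch below assumes 512-sample frames at 16 kHz (32 ms each); builds that use a different frame size give different numbers, so pass your own values. `framesToMs` and `micConstraints` are hypothetical names.

```javascript
// Duration covered by `frames` consecutive VAD frames, in milliseconds.
// Defaults assume 512-sample frames at 16 kHz; adjust for your VAD build.
function framesToMs(frames, frameSamples = 512, sampleRate = 16000) {
  return (frames * frameSamples * 1000) / sampleRate;
}

// Microphone constraints with the browser's echo cancellation enabled.
const micConstraints = { audio: { echoCancellation: true } };

console.log(framesToMs(3));       // 96
console.log(framesToMs(3, 1536)); // 288
```

In a browser, `micConstraints` would be passed to `navigator.mediaDevices.getUserMedia(micConstraints)`.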

Without VAD

Skip everything above and the bot still works — barge-in is opt-in. Without VAD the user has two options:

- Click an "interrupt" button in the UI that sends `{"type":"interrupt"}`.
- Wait for the bot to finish.

Most chat UIs ship with the button as a fallback even when VAD is enabled, so a user can interrupt before VAD picks up their first word.
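
A manual interrupt button needs very little code. This is a minimal sketch; `sendInterrupt` is a hypothetical helper, and it only assumes an object with the standard WebSocket `readyState` and `send` members.

```javascript
// Sends the manual interrupt frame if the socket is open.
// Returns true when the frame was sent, false otherwise.
function sendInterrupt(ws) {
  const OPEN = 1; // numeric value of WebSocket.OPEN
  if (ws.readyState !== OPEN) return false;
  ws.send(JSON.stringify({ type: "interrupt" }));
  return true;
}
```

A click handler could then be as short as `button.addEventListener("click", () => sendInterrupt(ws));`, where `button` is whatever element your UI uses.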

Server-side VAD (if you really need it)

ScaiVoice doesn't run VAD server-side today. If you have a use case (server-recorded audio with no client to run VAD, batch scenarios), file an integration request — wiring silero-vad into the streaming-STT path is a small change but it's not in current scope.