Client-side VAD integration

ScaiVoice's auto barge-in (cancel the bot mid-reply when the user starts talking) needs the client to detect "user is now speaking" and tell the server. VAD lives on the client side — sending continuous mic frames purely to detect silence-vs-speech server-side would be wasteful, and the latency from a round-trip would defeat the purpose.

This page covers the recommended browser-side integration. Native clients use the same emit pattern; only the VAD library differs.

What ScaiVoice expects

Two frames you can emit any time:

```json
{"type": "vad", "speaking": true}
{"type": "vad", "speaking": false}
```

Behaviour by state:

| Client emits | Session state | Server does |
| --- | --- | --- |
| speaking: true | thinking or speaking | Cancels the current turn (LLM + TTS) within ~100 ms. State → listening with reason: "interrupted_by_user". |
| speaking: true | listening | No-op: user talking during listening is the expected state. |
| speaking: true | idle or interrupted | No-op: nothing to cancel. |
| speaking: false | any | Informational. Doesn't drive state. |
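The behaviour table can be restated as a small pure function, which is handy when unit-testing client logic against the expected server reaction. This is only a sketch: the function name and the return labels ("cancel_turn", "noop", "informational") are illustrative and are not protocol values.

```javascript
// Client-side model of the behaviour table. The labels returned here are
// hypothetical names for this sketch, not values the server sends.
function expectedServerAction(sessionState, speaking) {
  // speaking:false is informational and never drives state.
  if (!speaking) return "informational";
  // Only an in-flight turn (thinking or speaking) can be cancelled; the
  // session then moves to "listening" with reason "interrupted_by_user".
  if (sessionState === "thinking" || sessionState === "speaking") {
    return "cancel_turn";
  }
  // listening, idle, interrupted: nothing to cancel.
  return "noop";
}
```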

There is no specific minimum interval — only emit when state actually transitions (don't spam at the VAD's frame rate).
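One way to guarantee that frames go out only on transitions is an edge-triggered wrapper around whatever the VAD reports. The sketch below assumes a `send` callback that takes a string (for example a bound `ws.send`); `createVadEmitter` is a hypothetical helper name, not part of any ScaiVoice SDK.

```javascript
// Edge-triggered emitter: forwards a vad frame only when the speaking
// state differs from the last value that was emitted.
function createVadEmitter(send) {
  let last = false; // assume the user is silent when the session starts
  return function update(speaking) {
    if (speaking === last) return false; // no transition, nothing to send
    last = speaking;
    send(JSON.stringify({ type: "vad", speaking }));
    return true;
  };
}

// Usage: six per-frame VAD decisions collapse into two emitted frames.
const sent = [];
const update = createVadEmitter((msg) => sent.push(msg));
[false, true, true, true, false, false].forEach((s) => update(s));
// sent now holds speaking:true followed by speaking:false
```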

silero-vad is the strong default: small (4 MB), fast (single-millisecond inference), MIT-licensed, and it ships an ONNX model that runs in onnxruntime-web. Trade-off versus webrtcvad: silero is more accurate on low-SNR audio at the cost of needing the WASM runtime.

Reference integration

```html
<script type="module">
// Bare specifier: needs a bundler or an import map to resolve in the browser.
import { MicVAD } from "@ricky0123/vad-web";

// WS_URL and TOKEN are placeholders for your session's WebSocket path and auth token.
const ws = new WebSocket(`wss://scaigrid.scailabs.ai${WS_URL}?token=${TOKEN}`);
ws.binaryType = "arraybuffer";

// Last state reported to the server, so only transitions are emitted.
let lastSpeaking = false;

function emitVad(speaking) {
  if (speaking === lastSpeaking) return;
  lastSpeaking = speaking;
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({type: "vad", speaking}));
  }
}

const vad = await MicVAD.new({
  // Tune these for your room conditions; the defaults are sane.
  positiveSpeechThreshold: 0.85,
  negativeSpeechThreshold: 0.5,
  minSpeechFrames: 3,

  onSpeechStart: () => emitVad(true),
  onSpeechEnd: () => emitVad(false),
  // A segment shorter than minSpeechFrames ends with onVADMisfire instead of
  // onSpeechEnd; reset here so the next onSpeechStart is not swallowed.
  onVADMisfire: () => emitVad(false),
});

vad.start();

// Don't forget to vad.pause() when the user leaves the voice UI.
</script>
```
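The reminder to pause the VAD when the user leaves the voice UI can be wired to page visibility. This is a minimal sketch under the assumption that "leaving" means the tab being hidden; `bindVadToVisibility` is a hypothetical helper, and it relies only on the `pause()` and `start()` methods already used in the reference integration plus the standard `visibilitychange` event.

```javascript
// Pause the VAD while the page is hidden and resume when it is visible again.
// `doc` is passed in (normally `document`) so the helper can be tested
// without a browser.
function bindVadToVisibility(vad, doc) {
  const onChange = () => {
    if (doc.hidden) {
      vad.pause();
    } else {
      vad.start();
    }
  };
  doc.addEventListener("visibilitychange", onChange);
  // Return an unbind function for when the voice UI is torn down.
  return () => doc.removeEventListener("visibilitychange", onChange);
}

// In the browser:
//   const unbind = bindVadToVisibility(vad, document);
//   ...later, when the voice UI closes: unbind(); vad.pause();
```

If your app has an explicit "close voice panel" action, calling `vad.pause()` from that handler is the more direct trigger; the visibility hook only covers tab switches.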

Mic audio: separate from the VAD signal

Phase 1 doesn't pipe mic frames into ScaiVoice's STT yet — turns are driven by {"type":"text"} frames in the demo path. When Phase 2 wires mic-piped STT, the audio path is:

```text
Microphone → AudioWorklet → 16 kHz PCM16 mono frames → WS binary
                                      ↓
                               VAD inference → onSpeechStart/End → JSON frames
```

The same AudioWorklet downsample step produces both the bytes that go to the binary path and the bytes the VAD library sees. One mic source, two consumers.
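The "one source, two consumers" split could look like the sketch below, which takes frames that have already been downsampled to 16 kHz mono. Everything named here is an assumption for illustration: `floatToPcm16`, `createFrameFanOut`, and the `feedVad` / `sendBinary` callbacks are hypothetical, and the byte order is whatever the platform's `Int16Array` uses (little-endian on common hardware), which the Phase 2 wire format may or may not match.

```javascript
// Convert Float32 samples in [-1, 1] to signed 16-bit PCM.
function floatToPcm16(samples) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

// Hand each downsampled frame to both consumers: the VAD gets the float
// samples, the WebSocket binary path gets PCM16 bytes.
function createFrameFanOut({ feedVad, sendBinary }) {
  return function onFrame(frame) {
    feedVad(frame);
    sendBinary(floatToPcm16(frame).buffer);
  };
}
```

In a real page `onFrame` would be called from the AudioWorklet's message handler, with `sendBinary` bound to `ws.send`.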

Tuning advice

  • False positives during TTS playback. If your TTS output bleeds into the mic, the VAD will trigger on the bot's own voice. Mitigation: use a headset, or apply acoustic echo cancellation client-side. getUserMedia({audio: {echoCancellation: true}}) is the cheap option; it works well for most browser scenarios.
  • Setting the speech threshold. Below 0.85 you get false triggers on background noise; above 0.95 the bot can't be interrupted by a quiet "actually, wait". Start at 0.85 and tune from there.
  • Minimum speech frames. minSpeechFrames: 3 means ~96 ms of confirmed speech (three 32 ms frames, assuming 512-sample frames at 16 kHz) before onSpeechStart fires. Some vad-web versions fire onSpeechStart on the first positive frame and apply this gate to onSpeechRealStart instead, so check which callback your version confirms on. Lower for snappier barge-in; higher to absorb tongue clicks and breath sounds. The trade-off is barge-in latency versus false-positive rate.
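The ~96 ms figure follows directly from frame count, frame size, and sample rate. The helper below is a hypothetical name for this sketch, and the 512-sample default is an assumption that matches the 96 ms figure; larger frames scale the delay proportionally.

```javascript
// Speech-confirmation delay implied by minSpeechFrames, in milliseconds.
// frameSamples = 512 at 16 kHz (32 ms per frame) is an assumed default.
function confirmationDelayMs(minSpeechFrames, frameSamples = 512, sampleRate = 16000) {
  return (minSpeechFrames * frameSamples * 1000) / sampleRate;
}

confirmationDelayMs(3);       // 96
confirmationDelayMs(3, 1536); // 288, if your VAD build uses 1536-sample frames
```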

Without VAD

Skip everything above and the bot still works — barge-in is opt-in. Without VAD the user has two options:

  • Click an "interrupt" button in the UI that sends {"type":"interrupt"}.
  • Wait for the bot to finish.

Most chat UIs ship with the button as a fallback even when VAD is enabled, so a user can interrupt before VAD picks up their first word.
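A button handler for the manual fallback can be as small as the sketch below. `sendInterrupt` is a hypothetical helper name; the only protocol detail it relies on is the {"type":"interrupt"} frame described above.

```javascript
// Send a manual interrupt if the socket is open. Returns whether it was sent,
// so the UI can show feedback when the connection is down.
function sendInterrupt(ws) {
  const OPEN = 1; // numeric value of WebSocket.OPEN
  if (ws.readyState !== OPEN) return false;
  ws.send(JSON.stringify({ type: "interrupt" }));
  return true;
}

// In the browser:
//   interruptButton.addEventListener("click", () => sendInterrupt(ws));
```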

Server-side VAD (if you really need it)

ScaiVoice doesn't run VAD server-side today. If you have a use case (server-recorded audio with no client to run VAD, batch scenarios), file an integration request — wiring silero-vad into the streaming-STT path is a small change but it's not in current scope.

Updated 2026-05-25 23:46:26 (rev 3)