---
summary: How to wire voice-activity detection on the client and emit the right WebSocket
  frames so ScaiVoice can drive automatic barge-in.
title: Client-side VAD integration
path: tutorials/client-vad-integration
status: published
---

ScaiVoice's auto barge-in (cancel the bot mid-reply when the user starts talking) needs the client to detect "user is now speaking" and tell the server. VAD lives on the client side — sending continuous mic frames purely to detect silence-vs-speech server-side would be wasteful, and the latency from a round-trip would defeat the purpose.

This page covers the recommended browser-side integration. Native clients use the same emit pattern; only the VAD library differs.

## What ScaiVoice expects

Two frames you can emit any time:

```json
{"type": "vad", "speaking": true}
{"type": "vad", "speaking": false}
```

Behaviour by state:

| Client emits | Session state | Server does |
|---|---|---|
| `speaking: true` | `thinking` or `speaking` | Cancels the current turn (LLM + TTS) within ~100 ms. State → `listening` with `reason: "interrupted_by_user"`. |
| `speaking: true` | `listening` | No-op — user talking during listening is the expected state. |
| `speaking: true` | `idle` or `interrupted` | No-op — nothing to cancel. |
| `speaking: false` | any | Informational. Doesn't drive state. |
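
When writing client tests or a mock server, it can help to encode the table as a small lookup. The sketch below is a hypothetical client-side model of the documented rows, not ScaiVoice server code; the function name and the `action` labels are illustrative assumptions.

```javascript
// Hypothetical model of the behaviour table; names and return shape are
// illustrative assumptions, not part of the ScaiVoice API.
const CANCELLABLE_STATES = new Set(["thinking", "speaking"]);

function expectedServerAction(frame, sessionState) {
  if (frame.type !== "vad") return { action: "not-a-vad-frame" };
  // speaking:false is informational in every state.
  if (!frame.speaking) return { action: "none", nextState: sessionState };
  if (CANCELLABLE_STATES.has(sessionState)) {
    return {
      action: "cancel-turn",
      nextState: "listening",
      reason: "interrupted_by_user",
    };
  }
  // listening, idle, interrupted: nothing to cancel.
  return { action: "none", nextState: sessionState };
}
```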

There is no minimum interval between frames, but emit only when the speaking state actually changes. Don't send a frame for every VAD inference.
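
One way to guarantee transition-only emission is to put a small gate between the VAD callbacks and the socket. This is a minimal sketch: `createVadEmitter` and the `send` callback are hypothetical names, and the transport is stubbed with an array so the snippet runs outside a browser.

```javascript
// Hypothetical helper: forwards a VAD frame only when the value changes.
function createVadEmitter(send) {
  let lastSpeaking = false;
  return function emit(speaking) {
    const next = Boolean(speaking);
    if (next === lastSpeaking) return false; // no transition, nothing sent
    lastSpeaking = next;
    send(JSON.stringify({ type: "vad", speaking: next }));
    return true;
  };
}

// Usage with a stand-in transport that records frames.
const sent = [];
const emit = createVadEmitter((frame) => sent.push(frame));
[true, true, true, false, false, true].forEach((v) => emit(v));
console.log(sent.length); // 3
```

In a real client, `send` would wrap `ws.send` behind a `readyState` check.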

## Recommended browser library

[silero-vad](https://github.com/snakers4/silero-vad) is the strong default: small (4 MB), fast (inference on the order of a millisecond per frame), MIT-licensed, and shipped as an ONNX model that runs in `onnxruntime-web`. The reference integration below uses `@ricky0123/vad-web`, which packages that model for the browser. Trade-off versus webrtcvad: silero is more accurate on low-SNR audio, at the cost of needing the WASM runtime.

## Reference integration

```html
<script type="module">
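// Bare specifier: resolve it with a bundler or an import map.
import { MicVAD } from "@ricky0123/vad-web";

// WS_URL and TOKEN are placeholders; define them from your session setup.
const ws = new WebSocket(`wss://scaigrid.scailabs.ai${WS_URL}?token=${TOKEN}`);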
ws.binaryType = "arraybuffer";

let lastSpeaking = false;

// Send a VAD frame only when the speaking state actually changes.
function emitVad(speaking) {
  if (speaking === lastSpeaking) return;
  lastSpeaking = speaking;
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({type: "vad", speaking}));
  }
}

const vad = await MicVAD.new({
  // Tune these for your room conditions; the defaults are sane.
  positiveSpeechThreshold: 0.85,
  negativeSpeechThreshold: 0.5,
  minSpeechFrames: 3,

  onSpeechStart: () => emitVad(true),
  onSpeechEnd: () => emitVad(false),
  // Segments shorter than minSpeechFrames end in onVADMisfire instead of
  // onSpeechEnd; reset here too, or the next onSpeechStart would be dropped.
  onVADMisfire: () => emitVad(false),
});

vad.start();

// Don't forget to vad.pause() when the user leaves the voice UI.
</script>
```

## Mic audio: separate from the VAD signal

Phase 1 doesn't pipe mic frames into ScaiVoice's STT yet — turns are driven by `{"type":"text"}` frames in the demo path. When Phase 2 wires mic-piped STT, the audio path is:

```
Microphone → AudioWorklet → 16 kHz PCM16 mono frames → WS binary
                                                    ↘
                                              VAD inference → onSpeechStart/End → JSON frames
```

A single AudioWorklet downsampling stage produces the frames for both consumers: the bytes sent on the binary WebSocket path and the samples the VAD library sees. One mic source, two consumers.
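
The conversion step in that pipeline can be sketched with two pure functions. This is a rough illustration only: the function names are hypothetical, little-endian byte order is an assumption (the text only says PCM16), and the block-averaging downsampler is a simple stand-in for a proper low-pass resampler.

```javascript
// Convert Float32 samples in [-1, 1] to PCM16 bytes.
// Assumption: little-endian byte order.
function floatToPcm16(samples) {
  const view = new DataView(new ArrayBuffer(samples.length * 2));
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to valid range
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return view.buffer;
}

// Naive downsampler: averages each block of input samples that maps to one
// output sample. A stand-in for a real anti-aliasing resampler.
function downsample(input, inputRate, outputRate = 16000) {
  if (outputRate > inputRate) throw new RangeError("upsampling not supported");
  const ratio = inputRate / outputRate;
  const length = Math.floor(input.length / ratio);
  const output = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    output[i] = sum / (end - start);
  }
  return output;
}

// One second of 48 kHz silence becomes 16000 samples, i.e. 32000 bytes.
const pcm = floatToPcm16(downsample(new Float32Array(48000), 48000));
console.log(pcm.byteLength); // 32000
```

Inside an AudioWorklet, the same `Float32Array` output of `downsample` could be handed to the VAD while the PCM16 buffer goes to `ws.send`.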

## Tuning advice

- **False positives during TTS playback.** If your TTS output bleeds into the mic, the VAD will trigger on the bot's own voice. Mitigation: use a headset, or apply acoustic echo cancellation client-side. `navigator.mediaDevices.getUserMedia({audio: {echoCancellation: true}})` is the cheap option and works well for most browser scenarios.
- **Speech threshold too low or too high.** With `positiveSpeechThreshold` below 0.85 you get false triggers on background noise; above 0.95 a quiet "actually, wait" may fail to interrupt the bot. Start at 0.85 and tune from there.
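- **Minimum speech frames.** `minSpeechFrames: 3` means ~96 ms of confirmed speech (three 32 ms frames) before `onSpeechStart` fires. Lower for snappier barge-in; higher to absorb tongue clicks and breath sounds. The trade-off is barge-in latency versus false-positive rate. Verify how your VAD library applies this setting: some releases of `@ricky0123/vad-web` fire `onSpeechStart` on the first frame above the threshold and use `minSpeechFrames` only to choose between `onSpeechEnd` and `onVADMisfire`.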
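
The ~96 ms figure is frame count times frame duration. The helper below makes that arithmetic explicit; the function name is hypothetical, and the default of 512 samples per frame at 16 kHz is an assumption inferred from the ~96 ms figure, so check the frame size your VAD library actually uses.

```javascript
// Milliseconds of speech covered by a given number of VAD frames.
// Assumed defaults: 512 samples per frame at 16 kHz (32 ms per frame).
function speechFramesToMs(minSpeechFrames, frameSamples = 512, sampleRate = 16000) {
  return (minSpeechFrames * frameSamples * 1000) / sampleRate;
}

console.log(speechFramesToMs(3)); // 96
console.log(speechFramesToMs(5)); // 160
console.log(speechFramesToMs(3, 1536)); // 288, if the library uses 96 ms frames
```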

## Without VAD

Skip everything above and the bot still works — barge-in is opt-in. Without VAD the user has two options:

- Click an "interrupt" button in the UI that sends `{"type":"interrupt"}`.
- Wait for the bot to finish.

Most chat UIs ship with the button as a fallback even when VAD is enabled, so a user can interrupt before VAD picks up their first word.
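
A minimal sketch of the button path follows. `sendInterrupt` is a hypothetical helper, the element id in the comment is made up, and the socket is stubbed so the snippet runs outside a browser.

```javascript
const WS_OPEN = 1; // numeric value of WebSocket.OPEN

// Send the manual interrupt frame if the socket is open.
// Returns true when the frame was handed to the socket.
function sendInterrupt(socket) {
  if (socket.readyState !== WS_OPEN) return false;
  socket.send(JSON.stringify({ type: "interrupt" }));
  return true;
}

// Browser wiring (element id is hypothetical):
// document.querySelector("#interrupt")
//   .addEventListener("click", () => sendInterrupt(ws));

// Stand-in socket for a quick check outside the browser.
const frames = [];
const fakeSocket = { readyState: WS_OPEN, send: (f) => frames.push(f) };
console.log(sendInterrupt(fakeSocket), frames[0]); // true {"type":"interrupt"}
```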

## Server-side VAD (if you really need it)

ScaiVoice doesn't run VAD server-side today. If you have a use case (server-recorded audio with no client to run VAD, batch scenarios), file an integration request — wiring silero-vad into the streaming-STT path is a small change but it's not in current scope.
