Client-side wake word

ScaiVoice's wake-word gating lets you build always-on personal-assistant flows: the bot is dormant by default and only "wakes" for the next utterance after the user says a trigger phrase. The wake detector runs on the client; the server just gates input frames behind a wake_armed flag.

When to use wake-word gating

Turn it on when:

The mic is always-open in your UI (kitchen assistant, hands-free car app).
You want to avoid processing background conversation as user input.
You need a clear "the bot is now listening to you" signal for UX.

Leave it off for click-to-talk UIs — there the user's button press is already the activation gesture.

What ScaiVoice expects

Two pieces:

At session create, set wake_word_enabled: true:

json
POST /v1/modules/scaivoice/sessions
{"voice_id": "vc_…", "llm_model": "…", "wake_word_enabled": true}
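
As a minimal sketch, the request above could be assembled like this in browser or Node 18+ JavaScript. The base URL (reusing the scaigrid.scailabs.ai host from the WebSocket example in this guide), the Bearer auth header, and the function names are assumptions rather than confirmed API details. The voice and model identifiers are passed in as parameters because this guide elides them.

```javascript
// Assumed base URL: same host as the WebSocket example in this guide.
const BASE_URL = "https://scaigrid.scailabs.ai";

// Build the session-create request without sending it (easy to unit test).
function buildSessionRequest(voiceId, llmModel, apiToken) {
  return {
    url: `${BASE_URL}/v1/modules/scaivoice/sessions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Assumption: bearer-token auth; check your deployment's auth scheme.
        Authorization: `Bearer ${apiToken}`,
      },
      body: JSON.stringify({
        voice_id: voiceId,
        llm_model: llmModel,
        wake_word_enabled: true, // turns on server-side wake gating
      }),
    },
  };
}

// Send the request. The shape of the JSON response is not specified here.
async function createWakeSession(voiceId, llmModel, apiToken) {
  const { url, init } = buildSessionRequest(voiceId, llmModel, apiToken);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Session create failed: HTTP ${res.status}`);
  return res.json();
}
```

Splitting the request builder from the network call keeps the payload testable without a live server.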

At runtime, emit a frame when your client-side detector fires:
json
{"type": "wake", "confidence": 0.93}

Behaviour:

Client emits	Session state	Server does
`wake`	armed=false (initial / post-turn)	Sets `wake_armed=true`, emits `{"type":"wake_state","armed":true}`
`wake`	armed=true	No-op (idempotent — fine to re-emit).
`text` or utterance	armed=false	Drops the input, emits `{"type":"info","code":"SCAIVOICE_WAKE_REQUIRED"}`
`text` or utterance	armed=true	Processes normally. After the turn completes, the server disarms (sets armed=false).
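
The table can be read as a small state machine. The sketch below models it in plain JavaScript, which may be handy as a mock server in client tests. It illustrates the documented behaviour only and is not ScaiVoice's server code; `createWakeGate`, `handle`, and `completeTurn` are hypothetical names.

```javascript
// Illustrative model of the gating table; not the real server implementation.
function createWakeGate() {
  let armed = false; // initial state: dormant

  return {
    isArmed: () => armed,

    // Returns the frames the server would emit, plus whether the frame
    // was passed on as user input.
    handle(frame) {
      if (frame.type === "wake") {
        if (armed) return { processed: false, emitted: [] }; // idempotent no-op
        armed = true;
        return { processed: false, emitted: [{ type: "wake_state", armed: true }] };
      }
      // Anything else is treated as user input (text or an utterance).
      if (!armed) {
        return {
          processed: false,
          emitted: [{ type: "info", code: "SCAIVOICE_WAKE_REQUIRED" }],
        };
      }
      return { processed: true, emitted: [] }; // handled as a normal turn
    },

    // Called when the turn finishes: the gate closes until the next wake.
    completeTurn() {
      armed = false;
      return [{ type: "wake_state", armed: false }];
    },
  };
}
```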

The server emits {"type":"wake_state","armed":<bool>,"wake_word_enabled":true} on:

Initial connect (so the client knows it's armed=false).
Every wake-state transition (after the wake frame and after each turn).

Render UI off these events — typically a "Say 'hey assistant'" prompt when not armed, and a "Listening…" indicator when armed.
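
One way to drive the indicator is a pure function from server frames to UI text, sketched below. The function name and the returned strings are illustrative; only the `wake_state` and `info` frame shapes come from this guide.

```javascript
// Map a parsed server frame to indicator text. Returns null for frames
// that should leave the indicator unchanged.
function indicatorTextFor(msg) {
  if (msg.type === "wake_state") {
    return msg.armed ? "Listening…" : "Say 'hey assistant'";
  }
  if (msg.type === "info" && msg.code === "SCAIVOICE_WAKE_REQUIRED") {
    // Input arrived while dormant; remind the user of the trigger phrase.
    return "Say 'hey assistant' first";
  }
  return null;
}
```

In a WebSocket message handler, you would call `indicatorTextFor(JSON.parse(event.data))` and update the DOM only when the result is not null.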

Recommended browser library

openwakeword is the recommended default: it is Apache-2.0 licensed, ships pre-trained models for common phrases such as hey jarvis and alexa, supports custom training, and runs in the browser via ONNX/WASM. The models are small (~5 MB) and load fast.

For production-grade accuracy or proprietary wake phrases, Picovoice Porcupine is a commercial alternative with better detection rates at the cost of a per-device license. The integration pattern is identical — both libraries expose a callback-on-detection API.
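
Because both libraries report detections through a callback, the ScaiVoice side can stay library-agnostic. The following is a minimal sketch, assuming you adapt each library's callback to call one shared handler. How each library registers callbacks and what it passes to them is not shown here, and `makeWakeHandler` is a hypothetical helper.

```javascript
// Returns a callback to invoke whenever a wake detector fires.
// `socket` needs `readyState` and `send`; `isArmed` reports local armed state.
function makeWakeHandler(socket, isArmed) {
  const OPEN = 1; // numeric value of WebSocket.OPEN
  return function onWake(confidence) {
    // Nothing to do if the socket is not open or the session is already armed.
    if (socket.readyState !== OPEN || isArmed()) return false;
    const frame = { type: "wake" };
    // Assumption: confidence may be omitted when a detector does not report one.
    if (typeof confidence === "number") frame.confidence = confidence;
    socket.send(JSON.stringify(frame));
    return true; // a wake frame was sent
  };
}
```

Swapping detectors then only changes the few lines that wire the library's own detection callback to `onWake`.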

Reference integration

html
<script type="module">
import { OpenWakeWord } from "@openwakeword/web";  // hypothetical wrapper

// WS_URL, TOKEN, and renderArmedIndicator() are placeholders; define them in your app.
const ws = new WebSocket(`wss://scaigrid.scailabs.ai${WS_URL}?token=${TOKEN}`);
ws.binaryType = "arraybuffer";

// Track local state so we don't re-emit wake while already armed.
let armedLocal = false;

ws.addEventListener("message", (event) => {
  if (typeof event.data !== "string") return;
  const msg = JSON.parse(event.data);
  if (msg.type === "wake_state") {
    armedLocal = !!msg.armed;
    renderArmedIndicator(armedLocal);
  }
});

const wakeDetector = await OpenWakeWord.load({
  model: "hey_jarvis",  // or a custom-trained ONNX bundle
  threshold: 0.5,
});

wakeDetector.on("trigger", (event) => {
  // Skip the send when already armed; the server would treat it as a no-op anyway.
  if (ws.readyState === WebSocket.OPEN && !armedLocal) {
    ws.send(JSON.stringify({
      type: "wake",
      confidence: event.confidence,
    }));
  }
});

wakeDetector.start();
</script>

Combining with VAD

Wake-word + VAD play together naturally:

Wake word fires once → bot becomes armed.
VAD then drives the actual mic frames + barge-in inside the now-armed turn.

Order of events for a typical "hey assistant, what's the weather?" interaction:

User: "hey assistant" → wake-word detector fires → emit {"type":"wake"}.
Server: {"type":"wake_state","armed":true}.
User: pauses briefly, then says "what's the weather?" → VAD speaking:true → mic frames flow → STT segments → end-of-utterance.
Server runs the turn → emits agent_text + audio frames.
Turn done → server disarms ({"type":"wake_state","armed":false}).
Back to waiting for the next wake.
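
The sequence above can be traced with a small client-side reducer, sketched below. The `wake_state` and `agent_text` frame types come from this guide; the `{"type":"vad","speaking":…}` event shape and the phase names are assumptions made for illustration.

```javascript
// Phases: "dormant" -> "armed" -> "user_speaking" -> "agent_replying" -> "dormant"
function nextPhase(phase, event) {
  switch (event.type) {
    case "wake_state":
      // Authoritative armed/dormant signal from the server.
      if (event.armed) return phase === "dormant" ? "armed" : phase;
      return "dormant";
    case "vad": // assumed shape for a local VAD event
      // Speech starts a user turn, or barges in on an agent reply.
      if (event.speaking && (phase === "armed" || phase === "agent_replying")) {
        return "user_speaking";
      }
      return phase;
    case "agent_text":
      return phase === "user_speaking" ? "agent_replying" : phase;
    default:
      return phase;
  }
}

// Replay the "hey assistant, what's the weather?" interaction.
const trace = [
  { type: "wake_state", armed: true },  // server confirms the wake frame
  { type: "vad", speaking: true },      // user starts the question
  { type: "vad", speaking: false },     // end of utterance
  { type: "agent_text", text: "..." },  // server reply begins
  { type: "wake_state", armed: false }, // turn done
];
const phases = trace.reduce(
  (acc, ev) => [...acc, nextPhase(acc[acc.length - 1], ev)],
  ["dormant"]
);
```

Keeping the server's `wake_state` frames as the only source of the armed/dormant transition avoids the client and server drifting apart.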

If the user interrupts mid-reply ("never mind"), the VAD speaking:true event triggers cancellation, as documented in Client-side VAD integration.

Without wake word

Leave wake_word_enabled off and the session is always listening: every utterance is processed immediately. This is the default, and the simplest setup for push-to-talk flows.

What about server-side wake-word?

ScaiVoice doesn't run wake-word detection server-side. Sending the full mic stream to the server just to detect "hey assistant" would be wasteful (continuous bandwidth + STT cycles) and would add round-trip latency to the most latency-critical signal. Client-side is the right tier; we're unlikely to change this.