Client-side wake word

ScaiVoice's wake-word gating lets you build always-on personal-assistant flows: the bot is dormant by default and only "wakes" for the next utterance after the user says a trigger phrase. The wake detector runs on the client; the server just gates input frames behind a wake_armed flag.

When to use wake-word gating

Turn it on when:

  • The mic is always-open in your UI (kitchen assistant, hands-free car app).
  • You want to avoid processing background conversation as user input.
  • You need a clear "the bot is now listening to you" signal for UX.

Leave it off for click-to-talk UIs — there the user's button press is already the activation gesture.

What ScaiVoice expects

Two pieces:

  1. At session create, set wake_word_enabled: true:

    json
    POST /v1/modules/scaivoice/sessions
    {"voice_id": "vc_…", "llm_model": "…", "wake_word_enabled": true}
    
  2. At runtime, emit a frame when your client-side detector fires:

    json
    {"type": "wake", "confidence": 0.93}
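
The two pieces above can be sketched as a pair of small builders. This is a minimal sketch, not a ScaiVoice client library: the function names `buildSessionRequest` and `buildWakeFrame`, the `baseUrl` parameter, and the `Authorization: Bearer` header are assumptions made for illustration. Only the endpoint path, the `wake_word_enabled` field, and the wake frame shape come from this page.

```javascript
// Hypothetical helpers: names and the auth scheme are assumptions, not ScaiVoice API.

// Builds the arguments for fetch() to create a wake-word-gated session.
function buildSessionRequest(baseUrl, apiKey, { voiceId, llmModel }) {
  return {
    url: `${baseUrl}/v1/modules/scaivoice/sessions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed auth scheme
      },
      body: JSON.stringify({
        voice_id: voiceId,
        llm_model: llmModel,
        wake_word_enabled: true,
      }),
    },
  };
}

// Builds the runtime frame to send when the client-side detector fires.
function buildWakeFrame(confidence) {
  return JSON.stringify({ type: "wake", confidence });
}
```

A caller would pass the result to `fetch(req.url, req.init)` and later send `buildWakeFrame(score)` over the session WebSocket.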
    

Behaviour:

| Client emits | Session state | Server does |
|---|---|---|
| wake | armed=false (initial / post-turn) | Sets wake_armed=true, emits {"type":"wake_state","armed":true} |
| wake | armed=true | No-op (idempotent; fine to re-emit). |
| text or utterance | armed=false | Drops the input, emits {"type":"info","code":"SCAIVOICE_WAKE_REQUIRED"} |
| text or utterance | armed=true | Processes normally. After the turn completes, the server resets the gate (sets armed=false), so the next turn needs a fresh wake. |
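
The behaviour above can be read as a small state machine. The sketch below models it on the client side, which can be handy for unit-testing UI logic without a live session. It is an illustrative model of the documented behaviour only, not ScaiVoice's server implementation, and the names `gateFrame`, `completeTurn`, and `wakeState` are hypothetical.

```javascript
// Illustrative model of the documented gating behaviour.
// NOT ScaiVoice's server code; all function names are hypothetical.

function wakeState(armed) {
  return { type: "wake_state", armed, wake_word_enabled: true };
}

// Returns the next armed flag, whether the input is processed,
// and the event (if any) the server would emit.
function gateFrame(armed, frame) {
  if (frame.type === "wake") {
    // The first wake arms the session; repeats are a no-op.
    return { armed: true, process: false, emit: armed ? null : wakeState(true) };
  }
  if (frame.type === "text" || frame.type === "utterance") {
    if (!armed) {
      return {
        armed: false,
        process: false,
        emit: { type: "info", code: "SCAIVOICE_WAKE_REQUIRED" },
      };
    }
    return { armed: true, process: true, emit: null };
  }
  // Frame types outside the documented cases are not modelled; state is unchanged.
  return { armed, process: false, emit: null };
}

// After a processed turn completes, the session returns to armed=false.
function completeTurn() {
  return { armed: false, emit: wakeState(false) };
}
```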

The server emits {"type":"wake_state","armed":<bool>,"wake_word_enabled":true} on:

  • Initial connect (so the client knows it starts with armed=false).
  • Every wake-state transition (after the wake frame, after each turn).

Render UI off these events — typically a "Say 'hey assistant'" prompt when not armed, and a "Listening…" indicator when armed.
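
One way to drive that UI is a pure function from server message to indicator state, which keeps rendering logic testable. This is a sketch under assumptions: `indicatorFor` is a hypothetical helper, the label strings are examples, and reacting to `SCAIVOICE_WAKE_REQUIRED` with a nudge is a design choice rather than something ScaiVoice requires.

```javascript
// Hypothetical helper: maps a parsed server message to an indicator state.
// The label strings are examples; adapt them to your own wake phrase.
function indicatorFor(message, current) {
  if (message.type === "wake_state") {
    return message.armed
      ? { armed: true, label: "Listening…" }
      : { armed: false, label: "Say 'hey assistant'" };
  }
  if (message.type === "info" && message.code === "SCAIVOICE_WAKE_REQUIRED") {
    // Input was dropped because the session was not armed; nudge the user.
    return { armed: false, label: "Say 'hey assistant' first" };
  }
  return current; // unrelated message: keep the indicator as it is
}
```

In a WebSocket message handler this would be called as `state = indicatorFor(JSON.parse(event.data), state)` before updating the DOM.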

openwakeword is the recommended default: it is Apache-2.0 licensed, ships pre-trained models for common phrases (hey jarvis, alexa, hey google), supports custom training, and runs in the browser via ONNX/WASM. Its small models (~5 MB) load fast.

For production-grade accuracy or proprietary wake phrases, Picovoice Porcupine is a commercial alternative with better detection rates at the cost of a per-device license. The integration pattern is identical — both libraries expose a callback-on-detection API.
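
Because both libraries report detections through a callback, the ScaiVoice side of the integration can be kept independent of the detector. The sketch below assumes nothing about either library's real API: `createWakeHandler` is a hypothetical helper, and `send`, `isArmed`, and `minConfidence` are parameters invented for this example. You call the returned function from whatever detection callback your library provides.

```javascript
// Hypothetical glue code; it does not call any real openwakeword or Porcupine API.
function createWakeHandler({ send, isArmed, minConfidence = 0.5 }) {
  // Call the returned function from your detector's detection callback.
  // Returns true if a wake frame was sent.
  return function onDetection(confidence) {
    if (confidence < minConfidence) return false; // too weak, ignore
    if (isArmed()) return false; // already armed, skip the redundant frame
    send(JSON.stringify({ type: "wake", confidence }));
    return true;
  };
}
```

Swapping detectors then only changes the few lines that register the callback, not the code that talks to ScaiVoice.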

Reference integration

html
<script type="module">
import { OpenWakeWord } from "@openwakeword/web";  // hypothetical wrapper

// WS_URL and TOKEN are placeholders for your session's WebSocket path and auth token.
const ws = new WebSocket(`wss://scaigrid.scailabs.ai${WS_URL}?token=${TOKEN}`);
ws.binaryType = "arraybuffer";

// Track local state so we don't re-emit wake while already armed.
let armedLocal = false;

ws.addEventListener("message", (event) => {
  if (typeof event.data !== "string") return;
  const msg = JSON.parse(event.data);
  if (msg.type === "wake_state") {
    armedLocal = !!msg.armed;
    renderArmedIndicator(armedLocal);
  }
});

const wakeDetector = await OpenWakeWord.load({
  model: "hey_jarvis",  // or a custom-trained ONNX bundle
  threshold: 0.5,
});

wakeDetector.on("trigger", (event) => {
  // The server no-ops a repeated wake, so the armedLocal check only saves a redundant frame.
  if (ws.readyState === WebSocket.OPEN && !armedLocal) {
    ws.send(JSON.stringify({
      type: "wake",
      confidence: event.confidence,
    }));
  }
});

wakeDetector.start();
</script>

Combining with VAD

Wake word and VAD work together naturally:

  • Wake word fires once → bot becomes armed.
  • VAD then drives the actual mic frames + barge-in inside the now-armed turn.

Order of events for a typical "hey assistant, what's the weather?" interaction:

  1. User: "hey assistant" → wake-word detector fires → emit {"type":"wake"}.
  2. Server: {"type":"wake_state","armed":true}.
  3. User: pauses briefly, then "what's the weather?" → VAD speaking:true → mic frames flow → STT segments → end-of-utterance.
  4. Server runs the turn → emits agent_text + audio frames.
  5. Turn done → server resets the gate ({"type":"wake_state","armed":false}).
  6. Back to waiting for the next wake.

If the user interrupts mid-reply ("never mind"), the VAD speaking:true signal triggers cancellation, as documented in Client-side VAD integration.
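
The interplay above can be sketched as a small client-side gate that decides whether a mic frame should be forwarded. This is one possible policy, not part of ScaiVoice: the `MicGate` class and its method names are hypothetical, and holding frames back while unarmed is an assumption made to save bandwidth (the server drops unarmed input regardless).

```javascript
// Hypothetical client-side gate; one possible policy, not a ScaiVoice API.
class MicGate {
  constructor() {
    this.armed = false; // mirrors the latest wake_state from the server
    this.speaking = false; // mirrors the latest local VAD result
  }

  // Feed every parsed JSON message from the server through here.
  onServerMessage(msg) {
    if (msg.type === "wake_state") this.armed = Boolean(msg.armed);
  }

  // Feed every VAD speaking:true / speaking:false change through here.
  onVad(speaking) {
    this.speaking = speaking;
  }

  // Forward mic frames only inside an armed turn while the user is speaking.
  shouldForwardFrame() {
    return this.armed && this.speaking;
  }
}
```

Since the gate only opens once the server confirms `armed: true`, speech that starts before that confirmation arrives would be held back under this policy; the brief pause in step 3 is what makes that acceptable.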

Without wake word

Skip everything above and the session is always-listening — every utterance is processed immediately. This is the default and simplest UX for push-to-talk style flows.

What about server-side wake-word?

ScaiVoice doesn't run wake-word detection server-side. Sending the full mic stream to the server just to detect "hey assistant" would be wasteful (continuous bandwidth + STT cycles) and would add round-trip latency to the most latency-critical signal. Client-side is the right tier; we're unlikely to change this.

Updated 2026-05-25 23:46:26, rev 3