Client-side wake word
ScaiVoice's wake-word gating lets you build always-on personal-assistant flows: the bot is dormant by default and only "wakes" for the next utterance after the user says a trigger phrase. The wake detector runs on the client; the server just gates input frames behind a wake_armed flag.
When to use wake-word gating
Turn it on when:
- The mic is always-open in your UI (kitchen assistant, hands-free car app).
- You want to avoid processing background conversation as user input.
- You need a clear "the bot is now listening to you" signal for UX.
Leave it off for click-to-talk UIs — there the user's button press is already the activation gesture.
What ScaiVoice expects
Two pieces:
- At session create, set wake_word_enabled: true:

  ```json
  POST /v1/modules/scaivoice/sessions
  {"voice_id": "vc_…", "llm_model": "…", "wake_word_enabled": true}
  ```

- At runtime, emit a frame when your client-side detector fires:

  ```json
  {"type": "wake", "confidence": 0.93}
  ```
Behaviour:
| Client emits | Session state | Server does |
|---|---|---|
| wake | armed=false (initial / post-turn) | Sets wake_armed=true, emits {"type":"wake_state","armed":true} |
| wake | armed=true | No-op (idempotent, so re-emitting is fine). |
| text or utterance | armed=false | Drops the input, emits {"type":"info","code":"SCAIVOICE_WAKE_REQUIRED"} |
| text or utterance | armed=true | Processes normally. After the turn completes, the server disarms (sets armed=false). |
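The gating rules in the table can be read as a two-state machine. The sketch below models them in TypeScript for illustration only: `WakeGate`, `ClientFrame`, `ServerEvent`, and the `processed` placeholder event are hypothetical names, not part of ScaiVoice, and the real server may behave differently in details the table does not cover.

```typescript
// Hypothetical model of the gating table; not ScaiVoice's actual server code.
type ClientFrame =
  | { type: "wake"; confidence?: number }
  | { type: "text" | "utterance" };

type ServerEvent =
  | { type: "wake_state"; armed: boolean }
  | { type: "info"; code: "SCAIVOICE_WAKE_REQUIRED" }
  | { type: "processed" }; // placeholder for the normal turn output

class WakeGate {
  private armed = false; // sessions start disarmed

  handle(frame: ClientFrame): ServerEvent[] {
    if (frame.type === "wake") {
      if (this.armed) return []; // already armed: idempotent no-op
      this.armed = true;
      return [{ type: "wake_state", armed: true }];
    }
    if (!this.armed) {
      // Input arrived without a preceding wake: drop it and tell the client.
      return [{ type: "info", code: "SCAIVOICE_WAKE_REQUIRED" }];
    }
    // Armed: run the turn, then disarm so the next turn needs a new wake.
    this.armed = false;
    return [{ type: "processed" }, { type: "wake_state", armed: false }];
  }
}
```

A model like this can be handy in client unit tests, where it stands in for the server so UI logic can be exercised without a live session.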
The server emits {"type":"wake_state","armed":<bool>,"wake_word_enabled":true} on:
- Initial connect (so the client knows it's armed=false).
- Every wake-state transition (after the wake frame, after each turn).
Render UI off these events — typically a "Say 'hey assistant'" prompt when not armed, and a "Listening…" indicator when armed.
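As a rough illustration of driving the UI from these events, the sketch below maps an incoming server message to an indicator label. The function and constant names are hypothetical, and it assumes messages arrive as JSON text; anything that is not a wake_state event leaves the current label unchanged.

```typescript
// Hypothetical helper: derive the indicator text from a raw server message.
const IDLE_LABEL = "Say 'hey assistant'";
const ARMED_LABEL = "Listening…";

type WakeStateEvent = {
  type: "wake_state";
  armed: boolean;
  wake_word_enabled?: boolean;
};

function isWakeState(msg: unknown): msg is WakeStateEvent {
  if (typeof msg !== "object" || msg === null) return false;
  const m = msg as Record<string, unknown>;
  return m.type === "wake_state" && typeof m.armed === "boolean";
}

function indicatorFor(rawMessage: string, currentLabel: string): string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawMessage);
  } catch {
    return currentLabel; // not JSON (e.g. malformed input): keep the label
  }
  if (!isWakeState(parsed)) return currentLabel; // other event types
  return parsed.armed ? ARMED_LABEL : IDLE_LABEL;
}
```

Keeping this a pure function of the message and the current label makes it easy to call from whatever message handler the client already has.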
Recommended browser library
openwakeword is the recommended default: Apache-2.0 licensed, with pre-trained models for common phrases (hey jarvis, alexa, hey google) and support for custom training. It runs in the browser via ONNX/WASM, and the small models (~5 MB) load fast.
For production-grade accuracy or proprietary wake phrases, Picovoice Porcupine is a commercial alternative with better detection rates at the cost of a per-device license. The integration pattern is identical — both libraries expose a callback-on-detection API.
Reference integration
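A minimal sketch of the client-side glue, under stated assumptions: the detector library calls a function with a confidence score when it fires, and the session socket exposes `send(string)` (a browser `WebSocket` does). `createWakeReporter`, `FrameSink`, the 0.5 threshold, and the one-second cooldown are all hypothetical choices for illustration, not values or APIs defined by ScaiVoice or by any detector library.

```typescript
// Hypothetical glue between a wake-word detector callback and the session socket.
interface FrameSink {
  send(data: string): void; // a browser WebSocket satisfies this shape
}

interface WakeReporterOptions {
  minConfidence: number; // ignore detections below this score (assumed value)
  cooldownMs: number;    // suppress repeat frames inside this window
  now?: () => number;    // injectable clock, mainly for tests
}

function createWakeReporter(
  sink: FrameSink,
  opts: WakeReporterOptions
): (confidence: number) => boolean {
  const now = opts.now ?? (() => Date.now());
  let lastSentAt = Number.NEGATIVE_INFINITY;

  // Call the returned function from the detector's detection callback.
  // It returns true when a wake frame was actually sent.
  return (confidence: number): boolean => {
    if (confidence < opts.minConfidence) return false;
    const t = now();
    if (t - lastSentAt < opts.cooldownMs) return false;
    lastSentAt = t;
    sink.send(JSON.stringify({ type: "wake", confidence }));
    return true;
  };
}

// Example wiring (socket creation and detector setup omitted):
// const report = createWakeReporter(socket, { minConfidence: 0.5, cooldownMs: 1000 });
// ...then call report(score) whenever the detector fires.
```

Because the server treats a repeated wake frame as a no-op, the cooldown is not needed for correctness; it only avoids sending redundant frames when a detector fires several times on one phrase.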
Combining with VAD
Wake-word detection and VAD complement each other:
- Wake word fires once → bot becomes armed.
- VAD then drives the actual mic frames + barge-in inside the now-armed turn.
Order of events for a typical "hey assistant, what's the weather?" interaction:
- User: "hey assistant" → wake-word detector fires → emit {"type":"wake"}.
- Server: {"type":"wake_state","armed":true}.
- User: pause briefly, then "what's the weather?" → VAD speaking:true → mic frames flow → STT segments → end-of-utterance.
- Server runs the turn → emits agent_text + audio frames.
- Turn done → server disarms ({"type":"wake_state","armed":false}).
- Back to waiting for the next wake.
If the user interrupts mid-reply ("never mind"), the VAD speaking:true event triggers cancellation, as documented in Client-side VAD integration.
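The sequence above suggests a small piece of client state that combines the two signals. The sketch below is one possible shape, with hypothetical names (`MicGate`, `onServerMessage`, `onVad`, `shouldForward`): it forwards mic frames only while the server reports armed and the VAD reports speech. It does not model barge-in, which is covered in Client-side VAD integration.

```typescript
// Hypothetical client-side gate combining server wake state with local VAD.
class MicGate {
  private armed = false;    // mirrors the latest wake_state from the server
  private speaking = false; // mirrors the latest VAD result

  onServerMessage(msg: { type: string; armed?: boolean }): void {
    if (msg.type === "wake_state" && typeof msg.armed === "boolean") {
      this.armed = msg.armed;
    }
  }

  onVad(speaking: boolean): void {
    this.speaking = speaking;
  }

  // Frames sent while disarmed would be dropped by the server anyway,
  // so skipping them client-side only saves bandwidth.
  shouldForward(): boolean {
    return this.armed && this.speaking;
  }
}
```

Tracking `armed` from server events rather than from the local detector keeps the client consistent with what the server will actually accept.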
Without wake word
Skip everything above and the session is always listening: every utterance is processed immediately. This is the default, and the simplest setup for push-to-talk style flows.
What about server-side wake-word?
ScaiVoice doesn't run wake-word detection server-side. Sending the full mic stream to the server just to detect "hey assistant" would be wasteful (continuous bandwidth + STT cycles) and would add round-trip latency to the most latency-critical signal. Client-side is the right tier; we're unlikely to change this.