Stream TTS over WebSocket
The WebSocket streaming path is what you use when the text isn't finished yet — chat assistants, narration that follows a generated stream, dialogue systems that need to interrupt the speaker. Audio frames arrive as soon as enough text has been buffered for the first sentence.
You need:
- A `voice_id` for a voice in `embedding_status: ready` state.
- An API key (or JWT) with `scaispeak:synthesize`.
- A WebSocket client that handles both text (JSON) frames and binary frames.
The wire protocol
Client sends JSON control frames:
| type | Fields | Meaning |
|---|---|---|
| `open` | `voice_id`, optional `language_hint`, `speed`, `output.codec`, `backend_preference` | First frame. Opens the session. |
| `text` | `delta` | Append text to the buffer. |
| `flush` | (none) | Force the current buffer to start synthesising even if it's mid-sentence. |
| `interrupt` | (none) | Barge-in: drop buffered audio, stop generating. |
| `close` | (none) | End of stream. |
Server sends JSON control frames and binary audio frames:
| type | Fields | Meaning |
|---|---|---|
| `ready` | `voice_id`, `backend_used` | Sent after `open`: synth path resolved, audio frames will follow. |
| `interrupted` | (none) | Acknowledgement of an `interrupt`. |
| `closed` | `stats.chars`, `stats.backend_used` | Sent after `close` or when the session tears down. |
| `error` | `code`, `message` | Something went wrong; the session is over. |
Binary frames carry the audio in whatever codec was negotiated (Opus by default; PCM as an option).
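The two tables above can be turned into a small encoder and decoder. This is a rough sketch rather than an official client: `encode_client_frame` and `decode_server_message` are hypothetical helper names, and the required-field list only covers what the tables mark as non-optional.

```python
import json

# Required fields per client frame type, taken from the client table above.
# Optional fields (language_hint, speed, ...) are passed through unchecked.
REQUIRED_FIELDS = {
    "open": ("voice_id",),
    "text": ("delta",),
    "flush": (),
    "interrupt": (),
    "close": (),
}


def encode_client_frame(frame_type, **fields):
    """Serialise one client control frame to JSON text."""
    if frame_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown client frame type: {frame_type!r}")
    missing = [name for name in REQUIRED_FIELDS[frame_type] if name not in fields]
    if missing:
        raise ValueError(f"{frame_type} frame is missing: {', '.join(missing)}")
    return json.dumps({"type": frame_type, **fields})


def decode_server_message(message):
    """Return ("audio", bytes) for binary frames, else (type, parsed_json)."""
    if isinstance(message, (bytes, bytearray)):
        return "audio", bytes(message)
    frame = json.loads(message)
    return frame.get("type"), frame
```

Checking the frame type before sending catches the most common `4400` cause (a malformed or out-of-order frame) on the client side.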
1. Open the session
Connect to WS /stream/speak, send a single open frame with the voice id and output format, and wait for the server's ready reply. The ready frame carries backend_used so you can log which backend handled the session.
The token can be passed as a query parameter or as a normal Authorization: Bearer ... header. Both work because WebSocket clients vary on which they support.
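A minimal sketch of the open/ready handshake, assuming `ws` is an already-connected and authenticated client object with async `send()` and `recv()` methods (the third-party `websockets` package provides one). The helper names `build_open_frame` and `open_session` are made up for illustration, and nesting the codec as `{"output": {"codec": ...}}` is a guess at how the `output.codec` field is encoded on the wire.

```python
import asyncio
import json


def build_open_frame(voice_id, codec="opus", **optional):
    """Build the JSON text of the first (open) frame.

    Assumption: output.codec is sent as a nested object. Extra keyword
    arguments (language_hint, speed, ...) are copied in as given.
    """
    frame = {"type": "open", "voice_id": voice_id, "output": {"codec": codec}}
    frame.update(optional)
    return json.dumps(frame)


async def open_session(ws, voice_id, **options):
    """Send open, wait for ready, and return the backend that was chosen."""
    await ws.send(build_open_frame(voice_id, **options))
    reply = json.loads(await ws.recv())
    if reply.get("type") == "error":
        raise RuntimeError(f"{reply.get('code')}: {reply.get('message')}")
    if reply.get("type") != "ready":
        raise RuntimeError(f"unexpected first frame: {reply!r}")
    return reply.get("backend_used")


# Usage, given a connected client `ws`:
#     backend = await open_session(ws, "my-voice-id", speed=1.0)
```

Returning `backend_used` from the helper makes it easy to log which backend handled each session.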
2. Push text
Send {"type":"text","delta":"..."} for each chunk of text. The server buffers up to a sentence boundary, then starts synthesising — you'll see binary frames arrive while you're still sending more text.
flush is the signal that no more text is coming for the current passage — without it the server keeps the last partial buffer around in case more text follows.
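One way to send the text and the closing flush, sketched under the assumption that `ws` is any connected client with an async `send()` method and that `chunks` is an ordinary iterable of strings (for example, tokens collected from a generator). `push_text` is a hypothetical helper name.

```python
import asyncio
import json


async def push_text(ws, chunks, flush=True):
    """Send each non-empty chunk as a text frame, then optionally flush.

    Returns the number of text frames sent.
    """
    sent = 0
    for chunk in chunks:
        if not chunk:
            continue  # nothing to append, skip empty deltas
        await ws.send(json.dumps({"type": "text", "delta": chunk}))
        sent += 1
    if flush:
        # No more text for this passage: ask the server to synthesise
        # whatever partial sentence is still sitting in its buffer.
        await ws.send(json.dumps({"type": "flush"}))
    return sent


# Usage, given a connected client `ws`:
#     await push_text(ws, ["Hello, ", "world. ", "Still typing"])
```

Pass `flush=False` when more text for the same passage is still on its way, and flush once at the end.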
3. Drain audio in parallel
Audio frames arrive on the same WebSocket as binary messages. Run a parallel coroutine to receive them.
The binary frames are already in the codec you asked for (opus by default). Concatenate them as you receive them and the file is playable as-is.
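The parallel receive loop might look like the sketch below. It assumes `ws.recv()` returns `bytes` for binary frames and `str` for JSON frames (as the third-party `websockets` package does), and that `sender` is whatever coroutine function you use to push text and eventually send `close`. `drain_audio` and `speak` are hypothetical names.

```python
import asyncio
import json


async def drain_audio(ws, on_audio):
    """Pass binary frames to on_audio until a terminal control frame arrives.

    Returns the closed (or interrupted) frame; raises on an error frame.
    """
    while True:
        message = await ws.recv()
        if isinstance(message, (bytes, bytearray)):
            on_audio(bytes(message))  # audio in the negotiated codec
            continue
        frame = json.loads(message)
        kind = frame.get("type")
        if kind == "error":
            raise RuntimeError(f"{frame.get('code')}: {frame.get('message')}")
        if kind in ("closed", "interrupted"):
            return frame
        # Any other control frame is ignored in this sketch.


async def speak(ws, sender, on_audio):
    """Run the sender and the drain loop concurrently on one socket."""
    _, final = await asyncio.gather(sender(ws), drain_audio(ws, on_audio))
    return final
```

To write the audio to disk, pass the `write` method of a file opened in binary mode as `on_audio`. Only one coroutine should call `recv()` at a time, which is why all receiving lives in `drain_audio`.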
4. Barge-in
When the user starts talking over the bot (or your assistant decides to retract what it was saying), send interrupt:
The current Phase 4 surface tears down the session on interrupt — open a new one to keep talking. Later phases will keep the session live with a reset chunker; the wire contract supports it.
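A sketch of barge-in for a client that has no other receive loop running, assuming `ws` has async `send()` and `recv()` methods. That audio frames already in flight can still arrive before the `interrupted` acknowledgement is an assumption, not something the protocol description guarantees; `barge_in` is a hypothetical helper name.

```python
import asyncio
import json


async def barge_in(ws):
    """Send interrupt, then discard frames until the server acknowledges.

    Returns how many stale audio frames were dropped while waiting.
    """
    await ws.send(json.dumps({"type": "interrupt"}))
    dropped = 0
    while True:
        message = await ws.recv()
        if isinstance(message, (bytes, bytearray)):
            dropped += 1  # audio generated before the interrupt landed
            continue
        frame = json.loads(message)
        if frame.get("type") == "interrupted":
            return dropped
        if frame.get("type") == "error":
            raise RuntimeError(f"{frame.get('code')}: {frame.get('message')}")


# Usage, given a connected client `ws`:
#     dropped = await barge_in(ws)
```

If a separate coroutine is already reading from the socket, send only the `interrupt` frame and let that reader handle the acknowledgement. Remember to stop local playback too, since audio you have already received is not recalled by the server.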
5. Errors and recovery
Common close codes:
- `4401`: bearer token missing or invalid.
- `4403`: permission denied (you don't have `scaispeak:synthesize`, or no tenant context).
- `4400`: bad frame (first frame wasn't `open`, missing `voice_id`).
- `4502`: backend unavailable (tenant's allowed backends are all down or unreachable).
- `4500`: server-side error.
When you see 4502, check /v1/modules/scaispeak/admin/policy — your tenant might be locked to a backend that's offline.
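For logging and retry decisions, the close codes above can be folded into a lookup. This is a sketch: `describe_close` is a hypothetical helper, and treating only `4500` and `4502` as retryable is an assumption about sensible client behaviour, not documented server policy.

```python
# Close codes listed above, with a short reason suitable for logs.
CLOSE_REASONS = {
    4400: "bad frame",
    4401: "bearer token missing or invalid",
    4403: "permission denied",
    4500: "server-side error",
    4502: "backend unavailable",
}

# Assumption: only server-side conditions are worth retrying. Auth and
# framing problems need a fix on the client before reconnecting.
RETRYABLE = {4500, 4502}


def describe_close(code):
    """Return (reason, retryable) for a WebSocket close code."""
    reason = CLOSE_REASONS.get(code, f"unrecognised close code {code}")
    return reason, code in RETRYABLE
```

A reconnect loop could call `describe_close` on the code reported by your WebSocket client and back off before retrying when the second value is true.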
Done
You have a streaming session that synthesises while you're still feeding text and can be interrupted at any point. From here, wire it into your chat loop, your narration pipeline, or whatever client wanted token-by-token audio.
For browser clients, the WebRTC path (POST /stream/speak/webrtc/sessions) is the production-grade option — the audio rides RTP/SRTP instead of WebSocket binary frames, with adaptive bitrate and jitter handling built in.