Stream TTS over WebSocket

The WebSocket streaming path is what you use when the text isn't finished yet — chat assistants, narration that follows a generated stream, dialogue systems that need to interrupt the speaker. Audio frames arrive as soon as enough text has been buffered for the first sentence.

You need:

  • A voice_id for a voice whose embedding_status is ready.
  • An API key (or JWT) with scaispeak:synthesize.
  • A WebSocket client that handles both text (JSON) frames and binary frames.

The wire protocol

Client sends JSON control frames:

| type | Fields | Meaning |
| --- | --- | --- |
| open | voice_id, optional language_hint, speed, output.codec, backend_preference | First frame. Opens the session. |
| text | delta | Append text to the buffer. |
| flush | | Force the current buffer to start synthesising even if it's mid-sentence. |
| interrupt | | Barge-in: drop buffered audio, stop generating. |
| close | | End of stream. |
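
The client frames above can be sketched as small builder functions. This is a minimal illustration, not part of any SDK: the helper names (open_frame, text_frame, control_frame) are hypothetical, and it assumes every field other than voice_id may be omitted from the open frame.

```python
import json

def open_frame(voice_id, codec=None, language_hint=None, speed=None,
               backend_preference=None):
    """Build the first frame of a session; unset optional fields are left out."""
    frame = {"type": "open", "voice_id": voice_id}
    if language_hint is not None:
        frame["language_hint"] = language_hint
    if speed is not None:
        frame["speed"] = speed
    if codec is not None:
        # The codec lives under a nested "output" object, as in the examples.
        frame["output"] = {"codec": codec}
    if backend_preference is not None:
        frame["backend_preference"] = backend_preference
    return json.dumps(frame)

def text_frame(delta):
    """Append a chunk of text to the server-side buffer."""
    return json.dumps({"type": "text", "delta": delta})

def control_frame(kind):
    """Build one of the field-less frames: flush, interrupt or close."""
    if kind not in {"flush", "interrupt", "close"}:
        raise ValueError(f"not a field-less control frame: {kind!r}")
    return json.dumps({"type": kind})
```

Each function returns a JSON string ready to send as a WebSocket text frame.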

Server sends JSON control frames and binary audio frames:

| type | Fields | Meaning |
| --- | --- | --- |
| ready | voice_id, backend_used | After open: synth path resolved, audio frames will follow. |
| interrupted | | Acknowledgement of an interrupt. |
| closed | stats.chars, stats.backend_used | After close or when the session tears down. |
| error | code, message | Something went wrong; the session is over. |

Binary frames carry the audio in whatever codec was negotiated (Opus by default; PCM as an option).
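
Because audio and control traffic share one socket, a receiver has to branch on the frame kind first. The sketch below shows one way to do that; the function name classify is hypothetical, and it assumes your client library delivers binary frames as bytes and text frames as str (as the websockets library does).

```python
import json

SERVER_FRAME_TYPES = {"ready", "interrupted", "closed", "error"}

def classify(message):
    """Return ("audio", bytes) for a binary frame, or (type, payload) for JSON."""
    if isinstance(message, (bytes, bytearray)):
        return "audio", bytes(message)
    ctrl = json.loads(message)
    kind = ctrl.get("type")
    if kind not in SERVER_FRAME_TYPES:
        # Fail loudly instead of silently ignoring a frame we don't understand.
        raise ValueError(f"unexpected control frame type: {kind!r}")
    return kind, ctrl
```

A receive loop can then switch on the first element of the returned tuple.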

1. Open the session

Connect to WS /stream/speak (the examples use the full path /v1/modules/scaispeak/stream/speak), send a single open frame with the voice id and output format, and wait for the server's ready reply. The ready frame carries backend_used so you can log which backend handled the session.

python
import asyncio, json, os, websockets

URL = (
    f"wss://{os.environ['SCAIGRID_HOST'].removeprefix('https://')}"
    f"/v1/modules/scaispeak/stream/speak"
    f"?token={os.environ['SCAIGRID_API_KEY']}"
)

async def stream():
    async with websockets.connect(URL) as ws:
        await ws.send(json.dumps({
            "type": "open",
            "voice_id": os.environ["VOICE_ID"],
            "language_hint": "en",
            "speed": 1.0,
            "output": {"codec": "opus"},
            "backend_preference": "any",
        }))
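        # The first server frame should be "ready"; anything else is a failure.
        ready = json.loads(await ws.recv())
        if ready["type"] != "ready":
            raise RuntimeError(ready)
        print("backend:", ready["backend_used"])

asyncio.run(stream())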

javascript
const WebSocket = require("ws");

const url = `wss://${process.env.SCAIGRID_HOST.replace(/^https?:\/\//, "")}`
  + `/v1/modules/scaispeak/stream/speak?token=${process.env.SCAIGRID_API_KEY}`;

const ws = new WebSocket(url);
ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "open",
    voice_id: process.env.VOICE_ID,
    output: { codec: "opus" },
    backend_preference: "any",
  }));
});

bash
# websocat — useful for smoke-testing
websocat "wss://scaigrid.scailabs.ai/v1/modules/scaispeak/stream/speak?token=$SCAIGRID_API_KEY" <<EOF
{"type":"open","voice_id":"$VOICE_ID","output":{"codec":"opus"}}
EOF

The token can be passed as a query parameter (shown above) or in a standard Authorization: Bearer ... header. Both are accepted because WebSocket clients differ in which one they can set.
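
To illustrate the two options, here is a small sketch that produces either a bare URL plus an Authorization header, or a URL carrying the token as a query parameter. The function name connection_params is hypothetical. How you hand the headers to your client is library-specific: recent releases of the websockets package take them via additional_headers, older ones via extra_headers, so check the version you have installed.

```python
from urllib.parse import quote

PATH = "/v1/modules/scaispeak/stream/speak"

def connection_params(host, api_key, use_header=True):
    """Return (url, headers) for either authentication style."""
    # Accept a host given with or without an http(s) scheme.
    host = host.removeprefix("https://").removeprefix("http://")
    url = f"wss://{host}{PATH}"
    if use_header:
        return url, {"Authorization": f"Bearer {api_key}"}
    # Percent-encode the token so special characters survive in the query string.
    return f"{url}?token={quote(api_key, safe='')}", {}
```

The header form keeps the token out of URLs, which tend to end up in access logs; the query form is the fallback for clients that cannot set request headers.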

2. Push text

Send {"type":"text","delta":"..."} for each chunk of text. The server buffers up to a sentence boundary, then starts synthesising — you'll see binary frames arrive while you're still sending more text.

python
async def feed(ws):
    for delta in ["Welcome to the Acme handbook. ", "In this chapter we cover account setup, ",
                  "billing, and the most common support questions."]:
        await ws.send(json.dumps({"type": "text", "delta": delta}))
        await asyncio.sleep(0.1)
    await ws.send(json.dumps({"type": "flush"}))
    await ws.send(json.dumps({"type": "close"}))

flush is the signal that no more text is coming for the current passage — without it the server keeps the last partial buffer around in case more text follows.
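
In a chat assistant the text usually comes from an asynchronous token stream rather than a fixed list. The following sketch forwards such a stream and finishes with flush and close. The name feed_tokens is hypothetical, and tokens stands for any async iterator of strings, such as a wrapper around your model's streaming output.

```python
import json

async def feed_tokens(ws, tokens):
    """Forward each non-empty chunk as a text frame, then flush and close."""
    async for delta in tokens:
        if delta:  # skip empty chunks, which some token streams emit
            await ws.send(json.dumps({"type": "text", "delta": delta}))
    # No more text is coming: release the trailing partial sentence.
    await ws.send(json.dumps({"type": "flush"}))
    await ws.send(json.dumps({"type": "close"}))
```

It only needs an object with an async send method, so it can be exercised against a stub before pointing it at a live socket.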

3. Drain audio in parallel

Audio frames arrive on the same WebSocket as binary messages. Run a parallel coroutine to receive them.

python
async def drain(ws, out_path):
    with open(out_path, "wb") as f:
        async for msg in ws:
            if isinstance(msg, bytes):
                # Binary frame: encoded audio in the negotiated codec.
                f.write(msg)
                continue
            # Text frame: a JSON control message from the server.
            ctrl = json.loads(msg)
            if ctrl["type"] == "closed":
                print("done:", ctrl["stats"])
                break
            if ctrl["type"] == "error":
                raise RuntimeError(ctrl)

async def main():
    async with websockets.connect(URL) as ws:
        await ws.send(json.dumps({"type": "open", "voice_id": os.environ["VOICE_ID"],
                                  "output": {"codec": "opus"}}))
        await ws.recv()  # the "ready" frame
        await asyncio.gather(feed(ws), drain(ws, "chapter.opus"))

asyncio.run(main())

javascript
const fs = require("fs");

ws.on("message", (msg, isBinary) => {
  // Binary frames are audio: append them to the output file.
  if (isBinary) { fs.appendFileSync("chapter.opus", msg); return; }
  const ctrl = JSON.parse(msg.toString());
  if (ctrl.type === "ready") {
    ws.send(JSON.stringify({ type: "text", delta: "Welcome." }));
    ws.send(JSON.stringify({ type: "flush" }));
    ws.send(JSON.stringify({ type: "close" }));
  } else if (ctrl.type === "closed") {
    console.log("stats:", ctrl.stats);
  } else if (ctrl.type === "error") {
    console.error("error:", ctrl.code, ctrl.message);
  }
});

The binary frames are already in the codec you asked for (opus by default). Concatenate them as you receive them and the file is playable as-is.

4. Barge-in

When the user starts talking over the bot (or your assistant decides to retract what it was saying), send interrupt:

python
await ws.send(json.dumps({"type": "interrupt"}))
# server sends back {"type":"interrupted"} and tears down the session

The current Phase 4 surface tears down the session on interrupt — open a new one to keep talking. Later phases will keep the session live with a reset chunker; the wire contract supports it.
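
A client typically wants to stop playback immediately and then wait for the acknowledgement before opening a fresh session. The sketch below does that. The name barge_in is hypothetical, and it assumes that audio frames already in flight may still arrive after the interrupt is sent, so it discards them. If the server closes the socket before the acknowledgement arrives, recv will raise your library's connection-closed exception, which callers should treat as the end of the session.

```python
import json

async def barge_in(ws):
    """Send interrupt, drop late audio, return once the server acknowledges.

    Returns the number of audio frames discarded while waiting.
    """
    await ws.send(json.dumps({"type": "interrupt"}))
    dropped = 0
    while True:
        msg = await ws.recv()
        if isinstance(msg, bytes):
            dropped += 1  # stale audio for speech we no longer want
            continue
        ctrl = json.loads(msg)
        if ctrl["type"] == "interrupted":
            return dropped
        if ctrl["type"] == "error":
            raise RuntimeError(ctrl)
```

After it returns, open a new connection and send a new open frame to continue speaking.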

5. Errors and recovery

Common close codes:

  • 4401 — bearer token missing or invalid.
  • 4403 — permission denied (you don't have scaispeak:synthesize, or no tenant context).
  • 4400 — bad frame (first frame wasn't open, missing voice_id).
  • 4502 — backend unavailable (tenant's allowed backends are all down or unreachable).
  • 4500 — server-side error.

When you see 4502, check /v1/modules/scaispeak/admin/policy — your tenant might be locked to a backend that's offline.
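
One way to act on these codes is a lookup that pairs each code with its meaning and a retry hint. The codes and descriptions come from the list above. The retry policy is an assumption of this sketch (retry only the server-side conditions 4500 and 4502, since the others need a fix on the client side), and the name describe_close is hypothetical.

```python
CLOSE_CODES = {
    4400: "bad frame (first frame was not open, or voice_id missing)",
    4401: "bearer token missing or invalid",
    4403: "permission denied (no scaispeak:synthesize, or no tenant context)",
    4500: "server-side error",
    4502: "backend unavailable",
}

# Assumption: only server-side failures are worth retrying with backoff.
RETRYABLE = {4500, 4502}

def describe_close(code):
    """Return (description, should_retry) for a WebSocket close code."""
    return CLOSE_CODES.get(code, "unrecognised close code"), code in RETRYABLE
```

With the websockets library, the close code is available on the connection-closed exception, so the result can drive a reconnect loop.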

Done

You have a streaming session that synthesises while you're still feeding text and can be interrupted at any point. From here, wire it into your chat loop, your narration pipeline, or whatever client wanted token-by-token audio.

For browser clients, the WebRTC path (POST /stream/speak/webrtc/sessions) is the production-grade option — the audio rides RTP/SRTP instead of WebSocket binary frames, with adaptive bitrate and jitter handling built in.

Updated 2026-05-22 14:27:32, rev 13