Stream TTS over WebSocket

The WebSocket streaming path is what you use when the text isn't finished yet — chat assistants, narration that follows a generated stream, dialogue systems that need to interrupt the speaker. Audio frames arrive as soon as enough text has been buffered for the first sentence.

You need:

A voice_id for a voice whose embedding_status is ready.
An API key (or JWT) with scaispeak:synthesize.
A WebSocket client that handles both text (JSON) frames and binary frames.

The wire protocol

Client sends JSON control frames:

`type`	Fields	Meaning
`open`	`voice_id`, optional `language_hint`, `speed`, `output.codec`, `backend_preference`	First frame. Opens the session.
`text`	`delta`	Append text to the buffer.
`flush`	—	Force the current buffer to start synthesising even if it's mid-sentence.
`interrupt`	—	Barge-in: drop buffered audio, stop generating.
`close`	—	End of stream.

Server sends JSON control frames and binary audio frames:

`type`	Fields	Meaning
`ready`	`voice_id`, `backend_used`	After `open` — synth path resolved, audio frames will follow.
`interrupted`	—	Acknowledgement of an `interrupt`.
`closed`	`stats.chars`, `stats.backend_used`	After `close` or when the session tears down.
`error`	`code`, `message`	Something went wrong; the session is over.

Binary frames carry the audio in whatever codec was negotiated (Opus by default; PCM as an option).
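The two tables above can be sketched as a handful of small helpers. This is a minimal illustration, not part of any official SDK: the function names (open_frame, text_frame, control_frame, parse_server_message) are hypothetical, and only the frame types and fields listed in the tables are used.

```python
import json

def open_frame(voice_id, codec=None, language_hint=None, speed=None,
               backend_preference=None):
    """Build the first frame of a session; optional fields are omitted when unset."""
    frame = {"type": "open", "voice_id": voice_id}
    if codec is not None:
        frame["output"] = {"codec": codec}
    optional = (("language_hint", language_hint), ("speed", speed),
                ("backend_preference", backend_preference))
    for key, value in optional:
        if value is not None:
            frame[key] = value
    return json.dumps(frame)

def text_frame(delta):
    """Build a frame that appends text to the server-side buffer."""
    return json.dumps({"type": "text", "delta": delta})

def control_frame(kind):
    """Build one of the field-less client frames: flush, interrupt, or close."""
    if kind not in {"flush", "interrupt", "close"}:
        raise ValueError(f"unknown control frame: {kind!r}")
    return json.dumps({"type": kind})

def parse_server_message(msg):
    """Return ("audio", bytes) for binary frames, or (type, dict) for JSON frames."""
    if isinstance(msg, (bytes, bytearray)):
        return "audio", bytes(msg)
    ctrl = json.loads(msg)
    return ctrl["type"], ctrl
```

Keeping frame construction in one place makes it harder to send a misspelled type or a stray field, which the server would presumably reject as a bad frame.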

1. Open the session

Connect to WS /stream/speak, send a single open frame with the voice id and output format, and wait for the server's ready reply. The ready frame carries backend_used so you can log which backend handled the session.

python
import asyncio, json, os, websockets

URL = (
    f"wss://{os.environ['SCAIGRID_HOST'].removeprefix('https://')}"
    f"/v1/modules/scaispeak/stream/speak"
    f"?token={os.environ['SCAIGRID_API_KEY']}"
)

async def stream():
    async with websockets.connect(URL) as ws:
        await ws.send(json.dumps({
            "type": "open",
            "voice_id": os.environ["VOICE_ID"],
            "language_hint": "en",
            "speed": 1.0,
            "output": {"codec": "opus"},
            "backend_preference": "any",
        }))
        ready = json.loads(await ws.recv())
        if ready["type"] != "ready":
            raise RuntimeError(f"open failed: {ready}")
        print("backend:", ready["backend_used"])

javascript
const WebSocket = require("ws");

const url = `wss://${process.env.SCAIGRID_HOST.replace(/^https?:\/\//, "")}`
  + `/v1/modules/scaispeak/stream/speak?token=${process.env.SCAIGRID_API_KEY}`;

const ws = new WebSocket(url);
ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "open",
    voice_id: process.env.VOICE_ID,
    output: { codec: "opus" },
    backend_preference: "any",
  }));
});

bash
# websocat — useful for smoke-testing
websocat "wss://scaigrid.scailabs.ai/v1/modules/scaispeak/stream/speak?token=$SCAIGRID_API_KEY" <<EOF
{"type":"open","voice_id":"$VOICE_ID","output":{"codec":"opus"}}
EOF

The token can be passed as a query parameter (shown above) or in a standard Authorization: Bearer ... header. Both are accepted because WebSocket clients differ in which they support; the browser WebSocket API, for example, cannot set custom headers.
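The two authentication styles might be prepared like this. It is a sketch under assumptions: stream_url and bearer_headers are hypothetical helper names, and URL-encoding the token is a precaution added here rather than something the service is known to require.

```python
from urllib.parse import quote

STREAM_PATH = "/v1/modules/scaispeak/stream/speak"

def stream_url(host, token=None):
    """Return the wss:// URL; the token is appended as a query parameter only if given."""
    bare = host.removeprefix("https://").removeprefix("http://").rstrip("/")
    url = f"wss://{bare}{STREAM_PATH}"
    if token is not None:
        # Percent-encode so characters such as "+" or "/" survive the query string.
        url += f"?token={quote(token, safe='')}"
    return url

def bearer_headers(token):
    """Return the header dict for clients that can set request headers."""
    return {"Authorization": f"Bearer {token}"}
```

With the Python websockets library, the header dict is passed to connect() through a keyword argument whose name depends on the installed version (additional_headers in recent releases, extra_headers in older ones), so check the version you have before relying on either.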

2. Push text

Send {"type":"text","delta":"..."} for each chunk of text. The server buffers up to a sentence boundary, then starts synthesising — you'll see binary frames arrive while you're still sending more text.

python
async def feed(ws):
    for delta in ["Welcome to the Acme handbook. ", "In this chapter we cover account setup, ",
                  "billing, and the most common support questions."]:
        await ws.send(json.dumps({"type": "text", "delta": delta}))
        await asyncio.sleep(0.1)
    await ws.send(json.dumps({"type": "flush"}))
    await ws.send(json.dumps({"type": "close"}))

flush is the signal that no more text is coming for the current passage — without it the server keeps the last partial buffer around in case more text follows.
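In a chat assistant the deltas usually come from an upstream generator rather than a fixed list. A minimal sketch of that adaptation follows; feed_stream is a hypothetical name, ws is assumed to be any object with an async send() method, and deltas is assumed to be an async iterator of strings.

```python
import json

async def feed_stream(ws, deltas, close=True):
    """Forward each non-empty delta as a text frame, then flush and optionally close.

    Returns the number of characters sent.
    """
    sent = 0
    async for delta in deltas:
        if not delta:
            continue  # skip empty chunks; they carry nothing to synthesise
        await ws.send(json.dumps({"type": "text", "delta": delta}))
        sent += len(delta)
    # No more text is coming for this passage, so release the partial buffer.
    await ws.send(json.dumps({"type": "flush"}))
    if close:
        await ws.send(json.dumps({"type": "close"}))
    return sent
```

Passing close=False leaves the session open after the flush, which may be useful when another passage will follow on the same connection.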

3. Drain audio in parallel

Audio frames arrive on the same WebSocket as binary messages. Run a parallel coroutine to receive them.

python
async def drain(ws, out_path):
    with open(out_path, "wb") as f:
        async for msg in ws:
            if isinstance(msg, bytes):
                f.write(msg)
            else:
                ctrl = json.loads(msg)
                if ctrl["type"] == "closed":
                    print("done:", ctrl["stats"])
                    break
                if ctrl["type"] == "error":
                    raise RuntimeError(ctrl)

async def main():
    async with websockets.connect(URL) as ws:
        await ws.send(json.dumps({"type": "open", "voice_id": os.environ["VOICE_ID"],
                                  "output": {"codec": "opus"}}))
        await ws.recv()  # the "ready" frame
        await asyncio.gather(feed(ws), drain(ws, "chapter.opus"))

asyncio.run(main())

javascript
const fs = require("fs");

ws.on("message", (msg, isBinary) => {
  if (isBinary) { fs.appendFileSync("chapter.opus", msg); return; }
  const ctrl = JSON.parse(msg.toString());
  if (ctrl.type === "ready") {
    ws.send(JSON.stringify({ type: "text", delta: "Welcome." }));
    ws.send(JSON.stringify({ type: "flush" }));
    ws.send(JSON.stringify({ type: "close" }));
  } else if (ctrl.type === "closed") { console.log("stats:", ctrl.stats); }
});

The binary frames are already in the codec you requested (Opus by default). Append them to a file in arrival order and the result is playable as-is.
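When audio goes to a player rather than a file, it can help to hand each frame to a callback and record how long the first one took to arrive. The sketch below assumes ws is an async-iterable connection like the one used above; drain_with_stats, on_audio, and the keys of the returned dict are hypothetical names chosen for this example.

```python
import json
import time

async def drain_with_stats(ws, on_audio):
    """Pass each binary frame to on_audio and return a summary when the session ends."""
    started = time.monotonic()
    first_audio_s = None
    audio_bytes = 0
    server_stats = None
    async for msg in ws:
        if isinstance(msg, bytes):
            if first_audio_s is None:
                # Time from starting to drain until the first audio frame.
                first_audio_s = time.monotonic() - started
            audio_bytes += len(msg)
            on_audio(msg)
            continue
        ctrl = json.loads(msg)
        if ctrl["type"] == "error":
            raise RuntimeError(f'{ctrl.get("code")}: {ctrl.get("message")}')
        if ctrl["type"] == "closed":
            server_stats = ctrl.get("stats")
            break
    return {"first_audio_s": first_audio_s,
            "audio_bytes": audio_bytes,
            "server_stats": server_stats}
```

The on_audio callback could write to a file, push into a playback queue, or forward frames to another socket; the drain loop stays the same in each case.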

4. Barge-in

When the user starts talking over the bot (or your assistant decides to retract what it was saying), send interrupt:

python
await ws.send(json.dumps({"type": "interrupt"}))
# server sends back {"type":"interrupted"} and tears down the session

The current Phase 4 surface tears down the session on interrupt — open a new one to keep talking. Later phases will keep the session live with a reset chunker; the wire contract supports it.
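A barge-in helper might look like the sketch below. It assumes no other coroutine is reading from ws at the same time, so cancel any running drain task first; barge_in is a hypothetical name, and ws is assumed to offer an async send() and async iteration as in the earlier examples.

```python
import json

async def barge_in(ws):
    """Send interrupt, discard in-flight frames, and return True once acknowledged."""
    await ws.send(json.dumps({"type": "interrupt"}))
    async for msg in ws:
        if isinstance(msg, bytes):
            continue  # audio produced before the interrupt landed; drop it
        if json.loads(msg)["type"] == "interrupted":
            return True
    # The connection ended without an explicit acknowledgement.
    return False
```

Because the session is torn down after the interrupt, the caller would then open a fresh connection and send a new open frame before pushing the next passage.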

5. Errors and recovery

Common close codes:

4401 — bearer token missing or invalid.
4403 — permission denied (you don't have scaispeak:synthesize, or no tenant context).
4400 — bad frame (first frame wasn't open, missing voice_id).
4502 — backend unavailable (tenant's allowed backends are all down or unreachable).
4500 — server-side error.

When you see 4502, check /v1/modules/scaispeak/admin/policy — your tenant might be locked to a backend that's offline.
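One way to act on these codes is a small lookup plus a retry rule. The codes and descriptions come from the list above; the retry policy (retry only 4500 and 4502, with capped exponential backoff) is an assumption made for this sketch, and describe_close and retry_delay are hypothetical names.

```python
CLOSE_CODES = {
    4400: "bad frame",
    4401: "bearer token missing or invalid",
    4403: "permission denied",
    4500: "server-side error",
    4502: "backend unavailable",
}

# Assumption: only server-side and backend failures can succeed on a retry.
# Auth, permission, and bad-frame errors need a code or configuration fix first.
RETRYABLE = {4500, 4502}

def describe_close(code):
    """Return a human-readable reason for a close code."""
    return CLOSE_CODES.get(code, f"unrecognised close code {code}")

def retry_delay(code, attempt, base=0.5, cap=30.0):
    """Return seconds to wait before reconnecting, or None if retrying cannot help.

    attempt is zero-based: 0 for the first retry, 1 for the second, and so on.
    """
    if code not in RETRYABLE:
        return None
    return min(cap, base * 2 ** attempt)
```

A reconnect loop would read the close code from the client library's connection-closed exception, call retry_delay, and give up when it returns None.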

Done

You have a streaming session that synthesises while you're still feeding text and can be interrupted at any point. From here, wire it into your chat loop, your narration pipeline, or any other client that needs token-by-token audio.

For browser clients, the WebRTC path (POST /stream/speak/webrtc/sessions) is the production-grade option — the audio rides RTP/SRTP instead of WebSocket binary frames, with adaptive bitrate and jitter handling built in.