---
summary: Open a streaming session, push text in chunks, drain audio frames in real
  time, and barge-in with interrupt.
title: Stream TTS over WebSocket
path: tutorials/stream-with-websocket
status: published
---

The WebSocket streaming path is what you use when the text isn't finished yet — chat assistants, narration that follows a generated stream, dialogue systems that need to interrupt the speaker. Audio frames arrive as soon as enough text has been buffered for the first sentence.

You need:

- A `voice_id` for a voice in `embedding_status: ready` state.
- An API key (or JWT) with `scaispeak:synthesize`.
- A WebSocket client that handles both text (JSON) frames and binary frames.

## The wire protocol

Client sends JSON control frames:

| `type` | Fields | Meaning |
|---|---|---|
| `open` | `voice_id`, optional `language_hint`, `speed`, `output.codec`, `backend_preference` | First frame. Opens the session. |
| `text` | `delta` | Append text to the buffer. |
| `flush` | — | Force the current buffer to start synthesising even if it's mid-sentence. |
| `interrupt` | — | Barge-in: drop buffered audio, stop generating. |
| `close` | — | End of stream. |

Server sends JSON control frames and binary audio frames:

| `type` | Fields | Meaning |
|---|---|---|
| `ready` | `voice_id`, `backend_used` | After `open` — synth path resolved, audio frames will follow. |
| `interrupted` | — | Acknowledgement of an `interrupt`. |
| `closed` | `stats.chars`, `stats.backend_used` | After `close` or when the session tears down. |
| `error` | `code`, `message` | Something went wrong; the session is over. |

Binary frames carry the audio in whatever codec was negotiated (Opus by default; PCM as an option).
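
The frame tables above can be condensed into a few small helpers. This is a minimal sketch rather than an official client: the function names (`open_frame`, `text_frame`, `control_frame`, `classify`) are hypothetical, and only the frame types and fields listed in the tables are used.

```python
import json

# Client control frames that carry no fields besides "type" (per the table above).
BARE_CLIENT_FRAMES = {"flush", "interrupt", "close"}


def open_frame(voice_id, **options):
    """Build the first frame of a session.

    `options` may carry the optional fields from the table, for example
    language_hint, speed, output, or backend_preference.
    """
    return json.dumps({"type": "open", "voice_id": voice_id, **options})


def text_frame(delta):
    """Build a frame that appends `delta` to the server-side text buffer."""
    return json.dumps({"type": "text", "delta": delta})


def control_frame(kind):
    """Build a field-less control frame: flush, interrupt, or close."""
    if kind not in BARE_CLIENT_FRAMES:
        raise ValueError(f"unknown control frame: {kind!r}")
    return json.dumps({"type": kind})


def classify(message):
    """Split an incoming message into ("audio", bytes) or (type, parsed_json)."""
    if isinstance(message, (bytes, bytearray)):
        return "audio", bytes(message)
    frame = json.loads(message)
    return frame["type"], frame
```

Keeping frame construction in one place makes it harder to send a misspelled `type`, and `classify` gives the receive loop a single branch point for audio versus control traffic.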

## 1. Open the session

Connect to `WS /stream/speak`, send a single `open` frame with the voice id and output format, and wait for the server's `ready` reply. The `ready` frame carries `backend_used` so you can log which backend handled the session.

```python
import asyncio
import json
import os

import websockets

URL = (
    f"wss://{os.environ['SCAIGRID_HOST'].removeprefix('https://')}"
    f"/v1/modules/scaispeak/stream/speak"
    f"?token={os.environ['SCAIGRID_API_KEY']}"
)

async def stream():
    async with websockets.connect(URL) as ws:
        await ws.send(json.dumps({
            "type": "open",
            "voice_id": os.environ["VOICE_ID"],
            "language_hint": "en",
            "speed": 1.0,
            "output": {"codec": "opus"},
            "backend_preference": "any",
        }))
        ready = json.loads(await ws.recv())
        assert ready["type"] == "ready"
        print("backend:", ready["backend_used"])
```

```javascript
const WebSocket = require("ws");

const url = `wss://${process.env.SCAIGRID_HOST.replace(/^https?:\/\//, "")}`
  + `/v1/modules/scaispeak/stream/speak?token=${process.env.SCAIGRID_API_KEY}`;

const ws = new WebSocket(url);
ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "open",
    voice_id: process.env.VOICE_ID,
    output: { codec: "opus" },
    backend_preference: "any",
  }));
});
```

```bash
# websocat — useful for smoke-testing
websocat "wss://scaigrid.scailabs.ai/v1/modules/scaispeak/stream/speak?token=$SCAIGRID_API_KEY" <<EOF
{"type":"open","voice_id":"$VOICE_ID","output":{"codec":"opus"}}
EOF
```

The token can be sent as a query parameter (shown above) or as a standard `Authorization: Bearer ...` header. Both are accepted because WebSocket clients differ in which they support; the browser `WebSocket` API, for example, cannot set custom headers.
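
To illustrate the two authentication styles side by side, here is a small sketch that builds the URL and headers for either one. The helper name `connection_params` is hypothetical; the path and the `token` parameter are taken from the examples above.

```python
from urllib.parse import urlencode


def connection_params(host, api_key, use_header=True):
    """Return (url, headers) for the streaming endpoint.

    With use_header=True the key goes into an Authorization header;
    otherwise it is appended as the ?token= query parameter.
    """
    # Strip any scheme so "https://host" and "host" are both accepted.
    bare_host = host.split("://", 1)[-1].rstrip("/")
    url = f"wss://{bare_host}/v1/modules/scaispeak/stream/speak"
    if use_header:
        return url, {"Authorization": f"Bearer {api_key}"}
    # urlencode escapes characters that are not safe in a query string.
    return f"{url}?{urlencode({'token': api_key})}", {}
```

Pass the returned headers to your client's connect call. In the Python `websockets` library the keyword argument is `additional_headers` in recent releases and `extra_headers` in older ones, so check which version you have installed.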

## 2. Push text

Send `{"type":"text","delta":"..."}` for each chunk of text. The server buffers up to a sentence boundary, then starts synthesising — you'll see binary frames arrive while you're still sending more text.

```python
async def feed(ws):
    for delta in ["Welcome to the Acme handbook. ", "In this chapter we cover account setup, ",
                  "billing, and the most common support questions."]:
        await ws.send(json.dumps({"type": "text", "delta": delta}))
        await asyncio.sleep(0.1)
    await ws.send(json.dumps({"type": "flush"}))
    await ws.send(json.dumps({"type": "close"}))
```

`flush` is the signal that no more text is coming for the current passage — without it the server keeps the last partial buffer around in case more text follows.
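
When text arrives as several passages, the `text`/`flush`/`close` sequence can be generated rather than written by hand. The sketch below assumes that `flush` may be sent more than once per session (once per passage), which the wording above suggests but does not state outright; the name `frames_for` is hypothetical.

```python
import json


def frames_for(passages):
    """Yield JSON frames for a sequence of passages.

    Each passage is an iterable of text chunks. Every non-empty chunk becomes
    a `text` frame, each non-empty passage ends with `flush`, and the stream
    ends with a single `close`.
    """
    for chunks in passages:
        sent_any = False
        for delta in chunks:
            if delta:  # skip empty chunks, which would add nothing to the buffer
                yield json.dumps({"type": "text", "delta": delta})
                sent_any = True
        if sent_any:
            # End of passage: ask the server to synthesise the partial buffer.
            yield json.dumps({"type": "flush"})
    yield json.dumps({"type": "close"})
```

Inside a coroutine like `feed`, this would be driven with `for frame in frames_for(passages): await ws.send(frame)`.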

## 3. Drain audio in parallel

Audio frames arrive on the same WebSocket as binary messages. Run a parallel coroutine to receive them.

```python
async def drain(ws, out_path):
    with open(out_path, "wb") as f:
        async for msg in ws:
            if isinstance(msg, bytes):
                # Binary frame: audio in the negotiated codec.
                f.write(msg)
                continue
            # Text frame: JSON control message.
            ctrl = json.loads(msg)
            if ctrl["type"] == "closed":
                print("done:", ctrl["stats"])
                break
            if ctrl["type"] == "error":
                raise RuntimeError(ctrl)

async def main():
    async with websockets.connect(URL) as ws:
        await ws.send(json.dumps({"type": "open", "voice_id": os.environ["VOICE_ID"],
                                  "output": {"codec": "opus"}}))
        await ws.recv()  # the "ready" frame
        await asyncio.gather(feed(ws), drain(ws, "chapter.opus"))

asyncio.run(main())
```

```javascript
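const fs = require("fs");

// Continues from the `ws` opened in step 1.
ws.on("message", (msg, isBinary) => {
  // Binary frames are audio: append them to the output file.
  if (isBinary) { fs.appendFileSync("chapter.opus", msg); return; }
  const ctrl = JSON.parse(msg.toString());
  if (ctrl.type === "ready") {
    ws.send(JSON.stringify({ type: "text", delta: "Welcome." }));
    ws.send(JSON.stringify({ type: "flush" }));
    ws.send(JSON.stringify({ type: "close" }));
  } else if (ctrl.type === "closed") {
    console.log("stats:", ctrl.stats);
  } else if (ctrl.type === "error") {
    console.error("stream error:", ctrl.code, ctrl.message);
  }
});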
```

The binary frames are already in the codec you asked for (`opus` by default). Concatenate them as you receive them and the file is playable as-is.

## 4. Barge-in

When the user starts talking over the bot (or your assistant decides to retract what it was saying), send `interrupt`:

```python
await ws.send(json.dumps({"type": "interrupt"}))
# server sends back {"type":"interrupted"} and tears down the session
```

The current Phase 4 surface tears down the session on interrupt — open a new one to keep talking. Later phases will keep the session live with a reset chunker; the wire contract supports it.
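
A barge-in helper might look like the sketch below. It is an illustration under assumptions, not a reference implementation: the name `barge_in` and the two-second timeout are hypothetical, `ws` is any object with async `send` and `recv` methods, and no other coroutine may be reading from the socket at the same time (cancel a running `drain` task first).

```python
import asyncio
import json


async def barge_in(ws, timeout=2.0):
    """Send `interrupt` and wait for the acknowledgement.

    Returns True if the server answered with `interrupted`, and False if the
    session ended another way (`closed` or `error`) or the wait timed out.
    Audio frames that arrive in the meantime are discarded.
    """
    await ws.send(json.dumps({"type": "interrupt"}))

    async def wait_for_ack():
        while True:
            msg = await ws.recv()
            if isinstance(msg, (bytes, bytearray)):
                continue  # audio generated before the interrupt: drop it
            kind = json.loads(msg)["type"]
            if kind == "interrupted":
                return True
            if kind in ("closed", "error"):
                return False

    try:
        return await asyncio.wait_for(wait_for_ack(), timeout)
    except asyncio.TimeoutError:
        # Assumption: treat a missing acknowledgement as a dead session.
        return False
```

Because the session is torn down after an interrupt, the caller should open a fresh connection before sending more text, whatever this helper returns. A real client would also need to handle the exception its WebSocket library raises if the connection closes before the acknowledgement arrives.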

## 5. Errors and recovery

Common close codes:

- `4401` — bearer token missing or invalid.
- `4403` — permission denied (you don't have `scaispeak:synthesize`, or no tenant context).
- `4400` — bad frame (first frame wasn't `open`, missing `voice_id`).
- `4502` — backend unavailable (tenant's allowed backends are all down or unreachable).
- `4500` — server-side error.

When you see `4502`, check `/v1/modules/scaispeak/admin/policy` — your tenant might be locked to a backend that's offline.
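
One way to act on these codes is to separate the ones worth retrying from the ones that need a fix on the client side. The sketch below is a suggestion, not documented server behaviour: treating only `4500` and `4502` as retryable is an assumption, and the backoff parameters and the name `retry_delay` are hypothetical.

```python
import random

# Close codes from the list above.
CLOSE_CODES = {
    4400: "bad frame",
    4401: "bearer token missing or invalid",
    4403: "permission denied",
    4500: "server-side error",
    4502: "backend unavailable",
}

# Assumption: only server-side conditions are worth retrying automatically.
# Auth and framing errors will fail the same way until the client is fixed.
RETRYABLE = {4500, 4502}


def retry_delay(code, attempt, base=0.5, cap=30.0):
    """Return seconds to wait before reconnecting, or None to give up.

    `attempt` counts from 0. Uses capped exponential backoff with full
    jitter, so simultaneous clients do not all reconnect at the same moment.
    """
    if code not in RETRYABLE:
        return None
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Feed it the close code your WebSocket client reports: a `None` result means the error should be surfaced to the caller, while a number is how long to sleep before opening a new session.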

## Done

You have a streaming session that synthesises while you're still feeding text and can be interrupted at any point. From here, wire it into your chat loop, your narration pipeline, or any other client that needs token-by-token audio.

For browser clients, the WebRTC path (`POST /stream/speak/webrtc/sessions`) is the production-grade option — the audio rides RTP/SRTP instead of WebSocket binary frames, with adaptive bitrate and jitter handling built in.
