---
summary: "Picking between WebSocket and WebRTC for live transcription \u2014 control\
  \ frames, audio plane, auth, and the trade-offs."
title: Streaming transports
path: concepts/streaming-transports
status: published
---

ScaiEcho exposes two real-time transports. They share the same backends, the same backend-selection policy, and the same `TranscriptDelta` shape on the way out. They differ in how audio gets in: WebSocket carries raw binary frames you control end-to-end; WebRTC negotiates an RTP audio track that the browser handles natively.

## WebSocket — bring your own audio bytes

A single WebSocket at `/v1/modules/scaiecho/stream/transcribe` carries both control (JSON text frames) and audio (binary frames). Authentication uses a bearer token from the `token=` query parameter or the `Authorization` header — FastAPI dependencies don't fire before the WS accept, so the route does its own check.
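As a rough illustration of the token lookup described above, the following sketch pulls a bearer token out of a raw query string and a header mapping. The function name `extract_bearer_token` and the choice to check the query parameter first are assumptions made for this example; the real route may order or validate the two sources differently.

```python
from typing import Mapping, Optional
from urllib.parse import parse_qs


def extract_bearer_token(query_string: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return a bearer token from `token=` or the Authorization header, else None.

    Assumption: the query parameter wins when both are present.
    """
    # parse_qs drops blank values by default, so "token=" counts as absent.
    values = parse_qs(query_string).get("token")
    if values:
        return values[0]

    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    scheme, _, credential = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and credential.strip():
        return credential.strip()
    return None
```

A handler following this pattern would run the lookup before accepting the socket and reject the connection when it returns `None`.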

Frame shapes:

- Client to server: `{"type": "open", "language_hint": "en", "media_type": "audio/wav", "backend_preference": "any", "chunk_seconds": 5.0, "sample_rate": 16000, "diarize": false}` once, then binary audio frames, then `{"type": "close"}`.
- Server to client: `{"type": "ready", "backend_used": "A|B"}`, then repeating `{"type": "delta", "text": "...", "is_final": false, "start": 0.0, "end": 4.8, "confidence": 0.0}`, then `{"type": "closed", "audio_bytes": N}`. Errors come back as `{"type": "error", "code": "...", "message": "..."}`.
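The client side of this exchange can be sketched as a generator that yields the frames in order: one `open` text frame, the audio as binary frames, then one `close` text frame. The helper names (`open_frame`, `client_frames`) and the 32,000-byte frame size are illustrative assumptions, and the sketch deliberately stops short of the network layer so it works with any WebSocket client library.

```python
import json
from typing import Iterator, Union

Frame = Union[str, bytes]  # str is a JSON text frame, bytes is a binary audio frame


def open_frame(**overrides) -> str:
    """Build the `open` control frame, using the field values shown above as defaults."""
    fields = {
        "type": "open",
        "language_hint": "en",
        "media_type": "audio/wav",
        "backend_preference": "any",
        "chunk_seconds": 5.0,
        "sample_rate": 16000,
        "diarize": False,
    }
    fields.update(overrides)
    return json.dumps(fields)


def client_frames(audio: bytes, frame_bytes: int = 32_000, **overrides) -> Iterator[Frame]:
    """Yield every frame a client sends for one session, in protocol order."""
    yield open_frame(**overrides)
    # Slice the recording into binary frames; the frame size is an arbitrary choice.
    for offset in range(0, len(audio), frame_bytes):
        yield audio[offset:offset + frame_bytes]
    yield json.dumps({"type": "close"})
```

A sender loop would pass each `str` to the library's text-send call and each `bytes` to its binary-send call, while a separate reader consumes `ready`, `delta`, `closed`, and `error` frames.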

The server does not decode audio in the route — it just pumps bytes into `AudioInputQueue`, and the dispatcher handles the codec. Backend A speaks gRPC bidi to the ScaiInfer node; Backend B accumulates `chunk_seconds` of audio and relays each chunk to the managed STT API.
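To make the Backend B behaviour concrete, here is a minimal sketch of a buffer that releases audio in `chunk_seconds`-sized pieces. `ChunkAccumulator` is a hypothetical class, not the dispatcher's real implementation, and it assumes uncompressed 16-bit mono PCM so that seconds convert to bytes with simple arithmetic; the real dispatcher handles the codec itself.

```python
from typing import List


class ChunkAccumulator:
    """Buffer incoming bytes and release them in fixed-duration chunks.

    Assumption: raw PCM at `bytes_per_sample` bytes per sample, one channel.
    """

    def __init__(self, chunk_seconds: float = 5.0, sample_rate: int = 16000,
                 bytes_per_sample: int = 2) -> None:
        self.chunk_bytes = int(chunk_seconds * sample_rate * bytes_per_sample)
        if self.chunk_bytes <= 0:
            raise ValueError("chunk size must be positive")
        self._buffer = bytearray()

    def feed(self, data: bytes) -> List[bytes]:
        """Add bytes and return every complete chunk now available."""
        self._buffer.extend(data)
        ready = []
        while len(self._buffer) >= self.chunk_bytes:
            ready.append(bytes(self._buffer[:self.chunk_bytes]))
            del self._buffer[:self.chunk_bytes]
        return ready

    def flush(self) -> bytes:
        """Return the partial tail, for example when the client sends `close`."""
        tail = bytes(self._buffer)
        self._buffer.clear()
        return tail
```

Each chunk returned by `feed` corresponds to one relay call to the managed STT API, which is why a smaller `chunk_seconds` produces more calls for the same recording.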

When to use it:

- You control both ends and are happy sending raw audio bytes.
- You have a backend service that already has a recording in memory.
- You're prototyping live captioning from a CLI tool or a small custom client.
- You want the simplest possible client implementation — three frame types and a bytes channel.

## WebRTC — let the browser handle audio

> **Audio plane not yet wired.** Signaling and lifecycle ship end-to-end (sessions create, SDP exchanges, ICE trickles, the control WebSocket attaches), but the audio decode path from `av.AudioFrame` to the backend dispatcher is still in progress. A peer connection negotiates cleanly; transcript deltas don't yet flow back. Use the WebSocket transport for production transcription today and treat the WebRTC routes as prototyping-only until this caveat is removed.

The WebRTC variant separates signaling, media, and control:

- `POST /stream/transcribe/webrtc/sessions` creates a session and returns ICE config plus the URL of the control WebSocket.
- `POST /stream/transcribe/webrtc/sessions/{id}/offer` exchanges SDP — client sends its offer, server returns its answer.
- `POST /stream/transcribe/webrtc/sessions/{id}/ice-candidates` trickles ICE candidates as they arrive at the client.
- Audio flows over the negotiated RTP/SRTP track. The browser handles capture, encoding (Opus by default), and packetization.
- A control WebSocket at `/sessions/{id}/control` emits `delta` JSON frames out and accepts a `close` JSON frame in.
- `DELETE /stream/transcribe/webrtc/sessions/{id}` tears the peer down.

The same bearer-auth pattern protects every endpoint. Each session id is bound to the user who created it, so no other user can interact with that session.
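The order of the signaling calls can be summarised in a small helper that lists each HTTP method and path for a given session. This is only a sketch: the `BASE` prefix is an assumption borrowed from the WebSocket route, `signaling_calls` is a hypothetical name, and request bodies are omitted because their shapes are not specified here. The control WebSocket is left out on purpose, since its URL comes back from session create.

```python
from typing import List, Tuple
from urllib.parse import quote

# Assumption: the WebRTC routes hang off the same prefix as the WebSocket route.
BASE = "/v1/modules/scaiecho"


def signaling_calls(session_id: str, base: str = BASE) -> List[Tuple[str, str]]:
    """List the HTTP calls for one WebRTC session, in the order a client makes them."""
    root = f"{base}/stream/transcribe/webrtc/sessions"
    sid = quote(session_id, safe="")  # keep odd ids from breaking the path
    return [
        ("POST", root),                            # create; yields the session id
        ("POST", f"{root}/{sid}/offer"),           # SDP offer in, SDP answer out
        ("POST", f"{root}/{sid}/ice-candidates"),  # repeated once per candidate
        ("DELETE", f"{root}/{sid}"),               # tear the peer down
    ]
```

In a real client the session id only exists after the first call returns, so the remaining paths would be built from the create response rather than up front.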

When to use it:

- Live captioning from a browser microphone where you want the platform to handle device access, echo cancellation, noise suppression, and codec selection.
- A mobile app where you've already integrated a WebRTC client SDK.
- Latency-sensitive use cases where Opus over UDP is meaningfully better than chunked WAV over TCP.
- Multi-party audio bridges (the gateway side; ScaiEcho transcribes one track at a time).

## What they share

Both transports:

- Resolve the backend the same way — tenant policy plus the optional `backend_preference` hint on `open` (WS) or session create (WebRTC).
- Emit the same `delta` payload: `text`, `is_final`, `start`, `end`, `confidence`. Diarized streams add a `speaker_label` field when the dispatcher attaches one.
- Require `scaiecho:transcribe`. Diarization additionally requires `scaiecho:diarize`.
- Get torn down cleanly on disconnect — `sess.close()` flushes the input queue, the drain task finishes, the WS sends its `closed` frame.
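Because the `delta` payload is identical on both transports, a client can parse it once and reuse the result. The sketch below models the fields listed above as a dataclass; `Delta` and `parse_delta` are illustrative names, and `speaker_label` is treated as optional because it only appears on diarized streams.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Delta:
    text: str
    is_final: bool
    start: float
    end: float
    confidence: float
    speaker_label: Optional[str] = None  # present only when the dispatcher attaches one


def parse_delta(raw: str) -> Optional[Delta]:
    """Parse a server text frame; return None for anything that is not a `delta`."""
    msg = json.loads(raw)
    if msg.get("type") != "delta":
        return None
    return Delta(
        text=msg["text"],
        is_final=bool(msg["is_final"]),
        start=float(msg["start"]),
        end=float(msg["end"]),
        confidence=float(msg["confidence"]),
        speaker_label=msg.get("speaker_label"),
    )
```

Returning `None` for other frame types lets the same read loop route `ready`, `closed`, and `error` frames to separate handlers.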

## What they don't share

- Audio plane. WebSocket carries arbitrary bytes; WebRTC carries an Opus track negotiated by aiortc.
- Deployment requirement. WebRTC requires `aiortc` and `av` to be installed in the ScaiGrid deployment. Without them, session create still succeeds (the row is logged for audit), but `POST /sessions/{id}/offer` returns `SCAIECHO_WEBRTC_UNAVAILABLE` (501).
- Resumability after restart. A WebSocket disconnect always means a fresh session. A WebRTC session whose in-process state is lost (operator restart) returns `SCAIECHO_WEBRTC_SESSION_STATE_LOST` (410) on the next signaling call, prompting the client to make a new session.
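One way a client might react to the two WebRTC-specific error codes is sketched below. The `Recovery` enum and `recovery_for` function are hypothetical, and the fallback to WebSocket on `SCAIECHO_WEBRTC_UNAVAILABLE` is a suggested policy rather than documented behaviour; only the code names and their meanings come from the list above.

```python
from enum import Enum


class Recovery(Enum):
    FALL_BACK_TO_WEBSOCKET = "fall_back_to_websocket"
    CREATE_NEW_SESSION = "create_new_session"
    SURFACE_ERROR = "surface_error"


def recovery_for(code: str) -> Recovery:
    """Map a ScaiEcho error code to a client-side recovery action."""
    if code == "SCAIECHO_WEBRTC_UNAVAILABLE":
        # The deployment lacks aiortc/av, so retrying WebRTC cannot succeed.
        return Recovery.FALL_BACK_TO_WEBSOCKET
    if code == "SCAIECHO_WEBRTC_SESSION_STATE_LOST":
        # In-process state is gone (for example after a restart); start over.
        return Recovery.CREATE_NEW_SESSION
    # Anything else is left for the caller to report.
    return Recovery.SURFACE_ERROR
```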

## Diarized streaming

Set `diarize: true` in the WS `open` frame (or in the WebRTC `POST /sessions` body) to request speaker attribution. Permission `scaiecho:diarize` is checked before the session opens; missing the permission closes the WS with `4403`.

Diarization runs in parallel on an `audio.analyze.pyannote` ScaiInfer node when one is online. Backend B has no pyannote relay, so requesting `diarize` against a B-pinned stream is silently a no-op. The speaker label on each delta is one of the enrolled profiles visible to the tenant; segments from unknown speakers get a labelled cluster id (`spk_0`, `spk_1`, …) for that session.
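Since a diarization request on Backend B fails silently, a client may want to check for that case itself. The following sketch shows two small helpers under stated assumptions: `diarization_possible` reads the `backend_used` value from the `ready` frame, and `is_session_cluster` assumes cluster ids always match the `spk_<number>` pattern shown above and that no enrolled profile uses that naming. Both function names are hypothetical.

```python
import re
from typing import Optional

# Close code sent when `scaiecho:diarize` is missing, per the text above.
DIARIZE_FORBIDDEN_CLOSE_CODE = 4403

_CLUSTER_ID = re.compile(r"spk_\d+")


def diarization_possible(requested: bool, backend_used: str) -> bool:
    """Return False when speaker labels cannot arrive on this session.

    True only means labels may arrive: a pyannote node still has to be online.
    """
    return requested and backend_used != "B"


def is_session_cluster(label: Optional[str]) -> bool:
    """True for per-session cluster ids such as `spk_0`, False for other labels."""
    return bool(label) and _CLUSTER_ID.fullmatch(label) is not None
```

A client could call `diarization_possible` as soon as `ready` arrives and warn the user instead of waiting for labels that will never come.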

## Latency characteristics

WebSocket sessions on Backend A use the dispatcher's gRPC bidi stream — the ScaiInfer node returns transcript deltas roughly as fast as it can decode each chunk. End-to-end latency is dominated by the inference step, not the network. On Backend B the dispatcher accumulates `chunk_seconds` of audio before relaying each chunk to the managed STT HTTP API; latency is therefore at least `chunk_seconds` plus one HTTP round-trip per chunk. Tuning `chunk_seconds` is a deliberate trade-off — smaller is snappier but multiplies API calls.
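The Backend B trade-off can be put into numbers with a short calculation. This sketch assumes one API call per chunk and a fixed HTTP round-trip time; `backend_b_estimate` is a hypothetical helper and the figures in the usage note are illustrative, not measurements.

```python
import math
from typing import Tuple


def backend_b_estimate(audio_seconds: float, chunk_seconds: float,
                       http_round_trip_seconds: float) -> Tuple[int, float]:
    """Return (number of API calls, lower bound on per-chunk latency in seconds)."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    # One relay call per chunk, including a final partial chunk.
    api_calls = math.ceil(audio_seconds / chunk_seconds)
    # Audio must accumulate for a full chunk before the HTTP call can start.
    min_latency = chunk_seconds + http_round_trip_seconds
    return api_calls, min_latency
```

For a 60-second recording with an assumed 0.3-second round trip, `chunk_seconds=5.0` gives 12 calls with a floor of about 5.3 seconds, while `chunk_seconds=1.0` gives 60 calls with a floor of about 1.3 seconds.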

WebRTC sessions add the cost of Opus encoding on the client, RTP packet transit, and the aiortc decode path on the server. In return you get a jitter buffer, packet-loss recovery, and native browser microphone handling for free. Once the audio plane is wired, net latency should be comparable to WebSocket on the same backend.

## What you cannot do over streaming

- Async jobs are batch-only. A streaming session that runs longer than the audio threshold doesn't get queued — it just keeps streaming. If you need a finished transcript file for a long recording, use `POST /transcribe` with `force_async=true`.
- You can't change `backend_preference` mid-session. The pick happens once on `open` (or session create); to switch you tear down and reopen.
- You can't change `diarize` mid-session for the same reason. Diarization runs (or doesn't) for the entire session.
- You can't resume after disconnect. A new WebSocket or a new WebRTC session is a new dispatcher session — there is no transcript-position cursor to fast-forward.

For everything that needs in-session reconfiguration, drop back to the batch endpoint and reissue with the new parameters.
