Streaming transports

ScaiEcho exposes two real-time transports. They share a backend, a backend-selection policy, and the same TranscriptDelta shape on the way out. They differ in how audio gets in: WebSocket carries raw binary frames you control end-to-end; WebRTC negotiates an RTP audio track that the browser handles natively.

WebSocket: bring your own audio bytes

A single WebSocket at /v1/modules/scaiecho/stream/transcribe carries both control (JSON text frames) and audio (binary frames). Authentication uses a bearer token supplied in either the token= query parameter or the Authorization header. Because FastAPI dependencies don't fire before the WebSocket accept, the route performs this check itself.
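As a minimal sketch of the two ways to present the token, the helpers below build the WebSocket URL and the alternative header. Only the route path and the token= parameter come from the description above; the host name, the helper names, and the "Bearer " prefix in the header value are assumptions for illustration.

```python
from typing import Dict, Optional
from urllib.parse import urlencode, urlsplit, urlunsplit

# Route path as documented above.
STREAM_PATH = "/v1/modules/scaiecho/stream/transcribe"


def stream_url(base_url: str, token: Optional[str] = None) -> str:
    """Build the ws(s):// URL for the streaming route.

    Any path on `base_url` is replaced by STREAM_PATH. When `token` is
    given it is appended as the token= query parameter.
    """
    parts = urlsplit(base_url)
    # Map http(s) deployments onto the matching WebSocket scheme.
    scheme = {"https": "wss", "http": "ws"}.get(parts.scheme, parts.scheme)
    query = urlencode({"token": token}) if token else ""
    return urlunsplit((scheme, parts.netloc, STREAM_PATH, query, ""))


def auth_headers(token: str) -> Dict[str, str]:
    """Header alternative. The 'Bearer ' prefix is an assumed convention."""
    return {"Authorization": f"Bearer {token}"}


if __name__ == "__main__":
    # Hypothetical host; substitute your own deployment.
    print(stream_url("https://scaigrid.example.com", token="YOUR_TOKEN"))
```

The query-parameter form is mainly useful for clients that cannot set custom headers on a WebSocket handshake, such as the browser WebSocket API. Where headers are available, the header form keeps the token out of URLs.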

Frame shapes:

Client to server: {"type": "open", "language_hint": "en", "media_type": "audio/wav", "backend_preference": "any", "chunk_seconds": 5.0, "sample_rate": 16000, "diarize": false} once, then binary audio frames, then {"type": "close"}.
Server to client: {"type": "ready", "backend_used": "A|B"}, then repeating {"type": "delta", "text": "...", "is_final": false, "start": 0.0, "end": 4.8, "confidence": 0.0}, then {"type": "closed", "audio_bytes": N}. Errors come back as {"type": "error", "code": "...", "message": "..."}.
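The frame shapes above map naturally onto a few small types. The following is a sketch rather than an official client: the class and function names are hypothetical, and the values in `open_frame` are the example values from the frame listing, not confirmed server defaults. The optional `speaker_label` field is included because diarized streams may carry it.

```python
import json
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Ready:
    backend_used: str


@dataclass
class Delta:
    text: str
    is_final: bool
    start: float
    end: float
    confidence: float
    speaker_label: Optional[str] = None  # only present on diarized streams


@dataclass
class Closed:
    audio_bytes: int


class StreamError(Exception):
    """Raised when the server sends an error frame."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def open_frame(**overrides: object) -> str:
    """Serialize the open frame; keyword arguments override the example values."""
    params = {
        "language_hint": "en",
        "media_type": "audio/wav",
        "backend_preference": "any",
        "chunk_seconds": 5.0,
        "sample_rate": 16000,
        "diarize": False,
    }
    params.update(overrides)
    return json.dumps({"type": "open", **params})


def parse_server_frame(raw: str) -> Union[Ready, Delta, Closed]:
    """Turn one JSON text frame from the server into a typed value."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "ready":
        return Ready(backend_used=msg["backend_used"])
    if kind == "delta":
        return Delta(
            text=msg["text"],
            is_final=msg["is_final"],
            start=msg["start"],
            end=msg["end"],
            confidence=msg["confidence"],
            speaker_label=msg.get("speaker_label"),
        )
    if kind == "closed":
        return Closed(audio_bytes=msg["audio_bytes"])
    if kind == "error":
        raise StreamError(msg.get("code", ""), msg.get("message", ""))
    raise ValueError(f"unexpected frame type: {kind!r}")
```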

The server does not decode audio in the route — it just pumps bytes into AudioInputQueue, and the dispatcher handles the codec. Backend A speaks gRPC bidi to the ScaiInfer node; Backend B accumulates chunk_seconds of audio and relays each chunk to the managed STT API.

When to use it:

You control both ends and are happy sending raw audio bytes.
You have a backend service that already has a recording in memory.
You're prototyping live captioning from a CLI tool or a small custom client.
You want the simplest possible client implementation — three frame types and a bytes channel.
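For the "recording already in memory" case, the whole client reduces to one sequence: send open, wait for ready, send the binary frames, send close, then read until closed. The sketch below is written against two hypothetical async callables, `send` and `recv`, so that any WebSocket library can be plugged in; it is an illustration of the frame order, not a reference client.

```python
import asyncio
import json
from typing import Awaitable, Callable, Dict, Iterable, List, Union

Frame = Union[str, bytes]


async def transcribe_recording(
    send: Callable[[Frame], Awaitable[None]],
    recv: Callable[[], Awaitable[str]],
    audio_chunks: Iterable[bytes],
    open_params: Dict[str, object],
) -> List[dict]:
    """Run one session over an injected transport and return the delta frames."""
    await send(json.dumps({"type": "open", **open_params}))

    ready = json.loads(await recv())
    if ready.get("type") != "ready":
        raise RuntimeError(f"expected ready frame, got: {ready}")

    # Audio goes out as binary frames; control stays JSON text.
    for chunk in audio_chunks:
        await send(chunk)
    await send(json.dumps({"type": "close"}))

    # Drain server frames until the session reports closed.
    deltas: List[dict] = []
    while True:
        msg = json.loads(await recv())
        kind = msg.get("type")
        if kind == "delta":
            deltas.append(msg)
        elif kind == "closed":
            return deltas
        elif kind == "error":
            raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")


if __name__ == "__main__":
    # In-memory stand-in for a server, used only to exercise the sequence.
    scripted = [
        json.dumps({"type": "ready", "backend_used": "A"}),
        json.dumps({"type": "delta", "text": "hello", "is_final": True,
                    "start": 0.0, "end": 1.0, "confidence": 0.9}),
        json.dumps({"type": "closed", "audio_bytes": 4}),
    ]

    async def fake_send(frame: Frame) -> None:
        pass

    async def fake_recv() -> str:
        return scripted.pop(0)

    result = asyncio.run(
        transcribe_recording(fake_send, fake_recv, [b"ab", b"cd"], {"language_hint": "en"})
    )
    print([d["text"] for d in result])
```

This version sends all audio before it reads any deltas, which suits a finished recording. A live-captioning client would instead run the send loop and the receive loop concurrently so that deltas are displayed as they arrive.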

WebRTC: let the browser handle audio

Audio plane not yet wired. Signaling and lifecycle ship end-to-end (sessions are created, SDP is exchanged, ICE candidates trickle, and the control WebSocket attaches), but the audio decode path from av.AudioFrame to the backend dispatcher is still in progress. A peer connection negotiates cleanly, but transcript deltas don't yet flow back. Use the WebSocket transport for production transcription today and treat the WebRTC routes as prototyping-only until this caveat is removed.

The WebRTC variant separates signaling, media, and control:

POST /stream/transcribe/webrtc/sessions creates a session and returns ICE config plus the URL of the control WebSocket.
POST /stream/transcribe/webrtc/sessions/{id}/offer exchanges SDP — client sends its offer, server returns its answer.
POST /stream/transcribe/webrtc/sessions/{id}/ice-candidates trickles ICE candidates as they arrive at the client.
Audio flows over the negotiated RTP/SRTP track. The browser handles capture, encoding (Opus by default), and packetization.
A control WebSocket at /sessions/{id}/control emits delta JSON frames out and accepts a close JSON frame in.
DELETE /stream/transcribe/webrtc/sessions/{id} tears the peer down.

The same bearer-auth pattern protects every endpoint. The session id is self-bound — only the creating user can interact with it.
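The HTTP side of the signaling flow above can be summarized as an ordered route table. This is only a sketch: the `/v1/modules/scaiecho` prefix is an assumption carried over from the WebSocket route, the function name is hypothetical, and request and response bodies are deliberately left out because their shapes are not specified here. The control WebSocket is also omitted, since its URL is returned by the session-create call.

```python
from typing import List, Tuple

# Assumed prefix: the WebRTC routes are listed relative to the module root,
# taken here to match the prefix of the WebSocket route.
MODULE_PREFIX = "/v1/modules/scaiecho"
SESSIONS = MODULE_PREFIX + "/stream/transcribe/webrtc/sessions"


def signaling_steps(session_id: str) -> List[Tuple[str, str, str]]:
    """Return (purpose, HTTP method, path) for each signaling call, in order."""
    return [
        ("create session", "POST", SESSIONS),
        ("exchange SDP", "POST", f"{SESSIONS}/{session_id}/offer"),
        ("trickle ICE", "POST", f"{SESSIONS}/{session_id}/ice-candidates"),
        ("tear down", "DELETE", f"{SESSIONS}/{session_id}"),
    ]


if __name__ == "__main__":
    # "SESSION_ID" is a placeholder for the id returned by session create.
    for purpose, method, path in signaling_steps("SESSION_ID"):
        print(f"{method:6} {path}  ({purpose})")
```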

When to use it:

Live captioning from a browser microphone where you want the platform to handle device access, echo cancellation, noise suppression, and codec selection.
A mobile app where you've already integrated a WebRTC client SDK.
Latency-sensitive use cases where Opus over UDP is meaningfully better than chunked WAV over TCP.
Multi-party audio bridges (the gateway side; ScaiEcho transcribes one track at a time).

Both transports:

Resolve the backend the same way — tenant policy plus the optional backend_preference hint on open (WS) or session create (WebRTC).
Emit the same delta payload: text, is_final, start, end, confidence. Diarized streams add a speaker_label field when the dispatcher attaches one.
Require scaiecho:transcribe. Diarization additionally requires scaiecho:diarize.
Get torn down cleanly on disconnect — sess.close() flushes the input queue, the drain task finishes, the WS sends its closed frame.

The transports differ in:

Audio plane. WebSocket carries arbitrary bytes; WebRTC carries an Opus track negotiated by aiortc.
Deployment requirement. WebRTC requires aiortc and av to be installed in the ScaiGrid deployment. Without them, session create still succeeds (the row is logged for audit), but POST /sessions/{id}/offer returns SCAIECHO_WEBRTC_UNAVAILABLE (501).
Resumability after restart. A WebSocket disconnect always means a fresh session. A WebRTC session whose in-process state is lost (operator restart) returns SCAIECHO_WEBRTC_SESSION_STATE_LOST (410) on the next signaling call, prompting the client to make a new session.
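A client can turn the two WebRTC-specific error codes into explicit next steps. The sketch below assumes the caller has already extracted the HTTP status and the error code string from the response, since the response body layout is not specified here. The `Action` names, and the choice to fall back to the WebSocket transport on 501, are illustrative assumptions.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    NEW_SESSION = "create a new WebRTC session"
    USE_WEBSOCKET = "fall back to the WebSocket transport"
    FAIL = "surface the error to the caller"


def action_for(status: int, code: str = "") -> Action:
    """Map a signaling response to the client's next step."""
    if 200 <= status < 300:
        return Action.PROCEED
    if status == 501 and code == "SCAIECHO_WEBRTC_UNAVAILABLE":
        # aiortc/av are not installed in this deployment.
        return Action.USE_WEBSOCKET
    if status == 410 and code == "SCAIECHO_WEBRTC_SESSION_STATE_LOST":
        # In-process session state was lost, for example after a restart.
        return Action.NEW_SESSION
    return Action.FAIL
```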

Diarized streaming

Set diarize: true in the WS open frame (or in the WebRTC POST /sessions body) to request speaker attribution. Permission scaiecho:diarize is checked before the session opens; missing the permission closes the WS with 4403.

Diarization runs in parallel on an audio.analyze.pyannote ScaiInfer node when one is online. Backend B has no pyannote relay, so requesting diarize on a B-pinned stream is a silent no-op. The speaker label on each delta is one of the enrolled profiles visible to the tenant; segments from unknown speakers get a cluster id (spk_0, spk_1, …) scoped to that session.
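On the client, diarized deltas can be grouped by speaker, with session-local cluster ids told apart from enrolled profiles. The sketch below assumes that cluster ids always match the spk_N pattern shown above and that enrolled profile labels never do. The function names and the "unattributed" bucket (for deltas without a speaker_label, as on a B-pinned stream) are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed format of session-local cluster ids: spk_0, spk_1, ...
CLUSTER_ID = re.compile(r"spk_\d+")


def is_cluster_id(label: str) -> bool:
    """True for session-local cluster ids, False for anything else."""
    return CLUSTER_ID.fullmatch(label) is not None


def text_by_speaker(deltas: Iterable[dict]) -> Dict[str, List[str]]:
    """Group delta texts by speaker_label, preserving arrival order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for delta in deltas:
        label = delta.get("speaker_label") or "unattributed"
        grouped[label].append(delta["text"])
    return dict(grouped)
```

Because cluster ids are valid only within one session, they should not be used as keys when merging transcripts from different sessions.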

Latency characteristics

WebSocket sessions on Backend A use the dispatcher's gRPC bidi stream — the ScaiInfer node returns transcript deltas roughly as fast as it can decode each chunk. End-to-end latency is dominated by the inference step, not the network. On Backend B the dispatcher accumulates chunk_seconds of audio before relaying each chunk to the managed STT HTTP API; latency is therefore at least chunk_seconds plus one HTTP round-trip per chunk. Tuning chunk_seconds is a deliberate trade-off — smaller is snappier but multiplies API calls.
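The Backend B trade-off can be put into numbers. The helper below is a back-of-the-envelope sketch using only the relationship stated above (latency is at least chunk_seconds plus one HTTP round-trip, and each chunk costs one API call). The function name and the round-trip figure are assumptions, and the managed STT API's own processing time is not modeled.

```python
import math
from typing import Tuple


def backend_b_estimate(
    audio_seconds: float, chunk_seconds: float, http_rtt_seconds: float
) -> Tuple[int, float]:
    """Return (number of API calls, lower bound on per-chunk latency in seconds)."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    # One relay call per accumulated chunk; the last chunk may be partial.
    calls = math.ceil(audio_seconds / chunk_seconds)
    # The first word of a chunk waits for the chunk to fill, then one round-trip.
    latency_floor = chunk_seconds + http_rtt_seconds
    return calls, latency_floor


if __name__ == "__main__":
    # Compare two chunk sizes for a 10-minute stream, assuming a 300 ms round-trip.
    for chunk in (5.0, 1.0):
        calls, floor = backend_b_estimate(600.0, chunk, 0.3)
        print(f"chunk_seconds={chunk}: {calls} calls, latency >= {floor:.1f}s")
```

Under these assumptions, shrinking chunk_seconds from 5.0 to 1.0 multiplies the call count by five in exchange for a four-second reduction in the latency floor.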

WebRTC sessions add the cost of Opus encoding on the client, RTP packet transit, and the aiortc decode path on the server. In return you get a jitter buffer, packet-loss recovery, and native browser microphone handling for free. Net latency in practice is comparable to WebSocket on the same backend.

What you cannot do over streaming

Async jobs are batch-only. A streaming session that runs longer than the audio threshold doesn't get queued — it just keeps streaming. If you need a finished transcript file for a long recording, use POST /transcribe with force_async=true.
You can't change backend_preference mid-session. The pick happens once on open (or session create); to switch you tear down and reopen.
You can't change diarize mid-session for the same reason. Diarization runs (or doesn't) for the entire session.
You can't resume after disconnect. A new WebSocket or a new WebRTC session is a new dispatcher session — there is no transcript-position cursor to fast-forward.

For everything that needs in-session reconfiguration, drop back to the batch endpoint and reissue with the new parameters.