ScaiEcho

ScaiEcho is the speech-to-text product on top of ScaiGrid. You send audio in — a file, a WebSocket binary stream, or a WebRTC track — and get text back. Real-time streams deliver partial and final transcript deltas as they arrive. Multi-speaker audio can be attributed by speaker if you enroll reference profiles.

It is built on top of ScaiGrid's inference, accounting, and identity layers, so every transcript is metered, budgeted, and audited the same way any other inference call is. ScaiEcho is a peer of ScaiSpeak, not a dependency — the two products share nothing at the API level.

When to use it

You have audio (calls, meetings, voice memos, support recordings) and need text.
You need real-time captioning or live transcription in a browser, mobile app, or backend pipeline.
You want to attribute who said what in a multi-speaker recording.
You want a single API that works against both your own self-hosted STT nodes and a managed STT relay, with per-tenant routing policy.

If you only need one-off transcription with no streaming and no tenant policy, you can call ScaiGrid's /oai/v1/audio/transcriptions directly. ScaiEcho adds the streaming transports, speaker library, and dual-backend routing.
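As a rough illustration of the one-off path, the sketch below builds (but does not send) a multipart request for /oai/v1/audio/transcriptions using only the Python standard library. Several details are assumptions, not taken from this page: the host name, bearer-token auth on this endpoint, the `file` and `model` form field names (guessed from the OpenAI-style path), and the model name.

```python
import urllib.request
import uuid


def build_transcription_request(base_url: str, token: str, audio: bytes,
                                filename: str = "audio.wav",
                                model: str = "example-stt-model") -> urllib.request.Request:
    """Build a multipart POST for the direct transcription endpoint.

    Assumptions: bearer auth, and form fields named "file" and "model".
    """
    boundary = uuid.uuid4().hex
    crlf = b"\r\n"
    delimiter = b"--" + boundary.encode()
    body = b"".join([
        # Text field carrying the (hypothetical) model name.
        delimiter + crlf,
        b'Content-Disposition: form-data; name="model"' + crlf + crlf,
        model.encode() + crlf,
        # File field carrying the raw audio bytes.
        delimiter + crlf,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode() + crlf,
        b"Content-Type: application/octet-stream" + crlf + crlf,
        audio + crlf,
        # Closing delimiter.
        delimiter + b"--" + crlf,
    ])
    return urllib.request.Request(
        base_url.rstrip("/") + "/oai/v1/audio/transcriptions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; check the API reference for the actual field names and response shape before relying on this.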

What you get

Batch transcribe. Multipart upload in, transcript out. Short audio is transcribed inline; long audio (over 5 MiB by default) is queued as a job that you poll for the result.
WebSocket streaming. Push audio frames, receive transcript-delta JSON frames back. The bearer token is accepted in either the query string or a header.
WebRTC streaming. Browser publishes audio over RTP; transcript deltas arrive on a control WebSocket.
Speaker library. Enroll reference audio along with the speaker's consent, then request speaker-attributed transcripts when the dispatcher supports it.
Dual backend. Self-host (STT engine on ScaiInfer GPUs) or relay (managed STT API). Per-tenant policy selects which one runs each request.
Async jobs. Long-form transcripts run on the arq worker pool and are retrievable later by job_id.
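The inline-versus-queued split and the poll-by-job_id flow described in the list above can be sketched on the client side roughly as follows. This is a minimal sketch, not the service's implementation: the job dictionary shape, the status values ("completed", "failed"), and the `fetch_job` callable are hypothetical placeholders for whatever the API reference defines. Only the 5 MiB default comes from this page.

```python
import time
from typing import Any, Callable, Dict

INLINE_LIMIT_BYTES = 5 * 1024 * 1024  # 5 MiB, the default threshold mentioned above
TERMINAL_STATUSES = ("completed", "failed")  # hypothetical status names


def run_mode(size_bytes: int, limit: int = INLINE_LIMIT_BYTES) -> str:
    """Return "queued" for audio over the limit, otherwise "inline"."""
    return "queued" if size_bytes > limit else "inline"


def wait_for_job(job_id: str,
                 fetch_job: Callable[[str], Dict[str, Any]],
                 interval_s: float = 2.0,
                 timeout_s: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll fetch_job(job_id) until the job reaches a terminal status.

    fetch_job is supplied by the caller (for example, a function that does
    an HTTP GET and returns the decoded JSON body).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']!r} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `fetch_job` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to exercise without a network.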

Two-minute mental model

You manage two nouns and one verb:

A transcription job is one piece of audio going to text. It can be sync (short) or queued (long).
A speaker profile is an enrolled identity used to label segments during diarization.
And the verb: a caller transcribes audio — by file upload, WebSocket, WebRTC, or MCP tool.

Tenant policy decides which backend (A: self-host, B: managed STT relay) actually runs each call. Callers can express a preference (prefer_self_hosted, prefer_relay, or any); the policy resolver has the final say. The same policy applies whether the audio came in via a file upload, a WebSocket frame, or a WebRTC track — the routing decision happens at the service layer, not the transport.
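One way to picture "the caller hints, the policy decides" is the toy resolver below. It is a sketch under stated assumptions: the `TenantPolicy` fields (`allowed`, `default`) and the backend labels are invented for illustration and are not ScaiEcho's actual policy schema. Only the three preference values (prefer_self_hosted, prefer_relay, any) come from the text above.

```python
from dataclasses import dataclass
from typing import FrozenSet

SELF_HOST = "self_host"  # backend A (label is illustrative)
RELAY = "relay"          # backend B (label is illustrative)

# Maps each caller hint to the backend it asks for; "any" asks for nothing specific.
_PREFERENCE_TO_BACKEND = {
    "prefer_self_hosted": SELF_HOST,
    "prefer_relay": RELAY,
    "any": None,
}


@dataclass(frozen=True)
class TenantPolicy:
    """Hypothetical per-tenant policy row."""
    allowed: FrozenSet[str]  # backends this tenant may use
    default: str             # backend used when the hint cannot be honored


def resolve_backend(policy: TenantPolicy, preference: str = "any") -> str:
    """Honor the caller's hint only if the tenant policy allows it."""
    if preference not in _PREFERENCE_TO_BACKEND:
        raise ValueError(f"unknown backend preference: {preference!r}")
    wanted = _PREFERENCE_TO_BACKEND[preference]
    if wanted is not None and wanted in policy.allowed:
        return wanted
    return policy.default
```

Because the resolver takes no transport argument, a file upload, a WebSocket frame, and a WebRTC track with the same tenant and hint would all resolve identically, which mirrors the service-layer routing described above.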

Where ScaiEcho sits relative to ScaiSpeak

ScaiSpeak is text-to-speech; ScaiEcho is speech-to-text. They are deliberate peers, not a single product: the API surfaces don't share types, the dispatchers are wired separately, and the tenant policy rows are independent. Any speech-to-speech orchestration belongs in a higher-level call gateway, not in either module.

The shared pattern between them is the two-backend split. Both products consult a per-tenant policy row that decides which dispatcher runs the request, both expose the same backend_preference hint, and both surface the same BACKEND_UNAVAILABLE error when the picked backend is offline. If you've integrated ScaiSpeak, ScaiEcho will feel familiar.
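A caller might react to BACKEND_UNAVAILABLE along these lines. This is only a sketch: the `BackendUnavailable` exception, the `transcribe` callable, and its `backend_preference` keyword are hypothetical stand-ins for however your client surfaces the error, and whether relaxing the hint to `any` actually changes the outcome depends on the tenant policy, which still has the final say.

```python
from typing import Any, Callable


class BackendUnavailable(Exception):
    """Hypothetical client-side wrapper for the BACKEND_UNAVAILABLE error."""
    code = "BACKEND_UNAVAILABLE"


def transcribe_with_relaxed_hint(transcribe: Callable[..., Any],
                                 audio: bytes,
                                 preference: str = "prefer_self_hosted") -> Any:
    """Try the preferred backend first; on BACKEND_UNAVAILABLE retry once with "any".

    transcribe is supplied by the caller and is expected to raise
    BackendUnavailable when the picked backend is offline.
    """
    try:
        return transcribe(audio, backend_preference=preference)
    except BackendUnavailable:
        if preference == "any":
            raise  # nothing left to relax; let the caller handle it
        return transcribe(audio, backend_preference="any")
```

Since both products expose the same hint and error, the same wrapper shape would presumably apply to a ScaiSpeak call as well.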

Where to go next

Quickstart — transcribe a file and run one streaming session in five minutes.
Architecture — how the route, services, dispatchers, and backends fit together.
Streaming transports — WebSocket vs WebRTC, when to pick which.
API reference — every endpoint, request, response.
Enroll a speaker for diarization — full walkthrough.

ScaiEcho's module ID inside ScaiGrid is scaiecho; its API is mounted at /v1/modules/scaiecho/.