ScaiSpeak

ScaiSpeak is the text-to-speech product on top of ScaiGrid. You pick (or clone) a voice, send text, and get audio — either as a single file, as a low-latency stream over WebSocket, or over a WebRTC peer connection for browser-grade real-time playback.

Like every other ScaiGrid module, it runs inside the same FastAPI process, behind the same auth, and is accounted against the same budgets. There's no separate ScaiSpeak deployment.

When to use it

You need TTS in a product — narration, voice assistants, accessibility readouts, IVR prompts, audio chapters.
You want a stable voice across many calls, or a custom voice cloned from a 5-25 second reference clip with recorded consent.
You want low-latency streaming — token-by-token audio that starts before the text is finished.
You want the same voice to work across a managed TTS relay and a self-hosted TTS engine on your GPUs, with the backend picked by tenant policy.

If you only need a one-shot audio file from a fixed voice and don't care about cloning, accounting, or routing, you can call any TTS vendor directly. ScaiSpeak's value is the library, the consent / licensing trail, the routing, and the streaming surfaces.

For speech-to-text (transcription, diarization), see ScaiEcho — STT lives there.

What you get

Batch synthesis. POST /speak returns audio inline for short text and falls back to an async job for long text.
Streaming TTS. WebSocket /stream/speak for server-side clients, WebRTC /stream/speak/webrtc/sessions for browsers.
Voice library. Platform-curated global voices (licensed), tenant-shared voices, and per-user private voices in one ranked list.
Voice cloning. Upload a reference clip + a consent recording, or live-record both over WebSocket; preflight checks audio quality before intake.
Global voices. SuperAdmin-managed, licensed voices that every tenant sees — no consent flow, license is the audit trail.
save_to ScaiDrive. Synth output can land directly in a caller-owned ScaiDrive share with no second round-trip.
Backend policy. Tenant admins choose which backends (self-hosted A, relay B) are allowed and which is default.
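
As a rough illustration of the backend policy item above, here is a minimal sketch of how an allowed-set-plus-default policy could be resolved. The names BackendPolicy and choose_backend, the backend labels "A" and "B" as plain strings, and the idea that a caller may request a specific backend are all assumptions made for this example; they are not ScaiSpeak's actual data model or API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class BackendPolicy:
    """Hypothetical tenant policy: which backends are allowed, and the default."""

    allowed: FrozenSet[str]
    default: str

    def __post_init__(self) -> None:
        # A default that is not itself allowed would make the policy unusable.
        if self.default not in self.allowed:
            raise ValueError("default backend must be one of the allowed backends")


def choose_backend(policy: BackendPolicy, requested: Optional[str] = None) -> str:
    """Return the backend to use for one request under the tenant policy.

    Assumption: callers may optionally ask for a backend; with no request,
    the tenant default applies.
    """
    if requested is None:
        return policy.default
    if requested not in policy.allowed:
        raise PermissionError(f"backend {requested!r} is not allowed for this tenant")
    return requested


# Example: a tenant that allows both backends and defaults to self-hosted "A".
policy = BackendPolicy(allowed=frozenset({"A", "B"}), default="A")
backend_for_plain_call = choose_backend(policy)
backend_for_relay_call = choose_backend(policy, "B")
```

The point of the sketch is only that the policy is data owned by the tenant admin, so the same voice id can be rendered on either backend without the caller changing anything else.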

Two-minute mental model

You manage three nouns and one verb:

A Voice is a record in the library — global, tenant, or user-scoped, with an embedding status (pending, processing, ready, failed, evicted).
A Consent (or a License for global voices) is the audit trail that authorises cloning and use.
A Synth job is one async render, created when the text is too long for an inline response.
And the verb: a caller speaks, which means sending text + a voice id and getting audio back.

The streaming endpoints are the same verb with the audio delivered in frames instead of one blob.
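
The nouns and the verb above can be sketched as plain Python types. This is only a mental-model aid: the class and field names (Voice, SpeakRequest, voice_id, authorisation_kind) are hypothetical, and only the scope names and the five embedding statuses come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class VoiceScope(Enum):
    GLOBAL = "global"
    TENANT = "tenant"
    USER = "user"


class EmbeddingStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"
    EVICTED = "evicted"


@dataclass(frozen=True)
class Voice:
    """A record in the library (field names are illustrative)."""

    voice_id: str
    scope: VoiceScope
    embedding_status: EmbeddingStatus


@dataclass(frozen=True)
class SpeakRequest:
    """The verb: text plus a voice id goes in, audio comes back."""

    text: str
    voice_id: str


def authorisation_kind(voice: Voice) -> str:
    """Global voices are covered by a License; other scopes by a Consent."""
    return "license" if voice.scope is VoiceScope.GLOBAL else "consent"


narrator = Voice("narrator-en", VoiceScope.GLOBAL, EmbeddingStatus.READY)
my_clone = Voice("my-clone", VoiceScope.USER, EmbeddingStatus.PENDING)
request = SpeakRequest(text="Hello from ScaiSpeak.", voice_id=narrator.voice_id)
```

A Synth job is left out of the sketch on purpose, since this page does not describe its fields.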

Where to go next

Quickstart — list a voice and render audio in five minutes.
Architecture — backends, dispatcher, voice warming, save_to flow.
Voice library and consent — scopes, cloning, consent vs license, lifecycle.
Synthesise in a custom voice — full clone-and-speak walkthrough.
Real-time streaming — WebSocket streaming with barge-in.
API reference — every endpoint, request, response.

ScaiSpeak's module ID inside ScaiGrid is scaispeak; its API is mounted at /v1/modules/scaispeak/.
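
To make the mount point concrete, here is a minimal sketch that composes endpoint URLs under /v1/modules/scaispeak/ and builds (without sending) a POST /speak request using only the standard library. The host grid.example.com, the bearer-token header, and the JSON field names text and voice_id are assumptions for illustration; check the API reference for the real request schema and auth scheme.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit

# Mount point stated above; everything else in this sketch is assumed.
MODULE_PREFIX = "/v1/modules/scaispeak/"


def endpoint_url(base_url: str, path: str) -> str:
    """Join a ScaiGrid base URL, the module prefix, and an endpoint path."""
    return base_url.rstrip("/") + MODULE_PREFIX + path.lstrip("/")


def stream_url(base_url: str) -> str:
    """WebSocket URL for /stream/speak, mapping https to wss and http to ws."""
    parts = urlsplit(endpoint_url(base_url, "stream/speak"))
    ws_scheme = "wss" if parts.scheme == "https" else "ws"
    return urlunsplit(parts._replace(scheme=ws_scheme))


def build_speak_request(
    base_url: str, token: str, text: str, voice_id: str
) -> urllib.request.Request:
    """Build a POST /speak request object; the body field names are hypothetical."""
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url(base_url, "speak"),
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


req = build_speak_request(
    "https://grid.example.com", "YOUR_TOKEN", "Hello.", "some-voice-id"
)
```

Passing req to urllib.request.urlopen would actually send it; the sketch stops short of that because the response shape (inline audio versus an async job) is documented in the API reference, not here.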