---
summary: "Speech synthesis with voice cloning \u2014 batch TTS, real-time streaming,\
  \ a multi-tier voice library, and consent-managed custom voices."
title: ScaiSpeak
path: overview
status: published
---

ScaiSpeak is the text-to-speech product on top of ScaiGrid. You pick (or clone) a voice, send text, and get audio — either as a single file, as a low-latency stream over WebSocket, or over a WebRTC peer connection for browser-grade real-time playback.

Like every other ScaiGrid module, it runs inside the same FastAPI process, behind the same auth, and is accounted against the same budgets. There's no separate ScaiSpeak deployment.

## When to use it

- You need TTS in a product — narration, voice assistants, accessibility readouts, IVR prompts, audio chapters.
- You want a stable voice across many calls, or a custom voice cloned from a 5-25 second reference clip with recorded consent.
- You want low-latency streaming — token-by-token audio that starts before the text is finished.
- You want the same voice to work across a managed TTS relay and a self-hosted TTS engine on your GPUs, picked by tenant policy.

If you only need a one-shot audio file from a fixed voice and don't care about cloning, accounting, or routing, you can call any TTS vendor directly. ScaiSpeak's value is the library, the consent / licensing trail, the routing, and the streaming surfaces.

For speech-to-text (transcription, diarization), see [ScaiEcho](../scaiecho/overview) — STT lives there.

## What you get

- **Batch synthesis.** `POST /speak` returns audio inline for short text and falls through to an async job for long text.
- **Streaming TTS.** WebSocket `/stream/speak` for server-side clients, WebRTC `/stream/speak/webrtc/sessions` for browsers.
- **Voice library.** Platform-curated global voices (licensed), tenant-shared voices, and per-user private voices in one ranked list.
- **Voice cloning.** Upload a reference clip and a consent recording, or live-record both over WebSocket; a preflight check validates audio quality before intake.
- **Global voices.** SuperAdmin-managed, licensed voices that every tenant sees. They have no consent flow; the license is the audit trail.
- **save_to ScaiDrive.** Synth output can land directly in a caller-owned ScaiDrive share with no second round-trip.
- **Backend policy.** Tenant admins choose which backends (self-hosted A, relay B) are allowed and which is default.
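
As a rough illustration of the batch surface, the sketch below builds (but does not send) a `POST /speak` request using only the Python standard library. The host name, the bearer-token header, and the JSON field names `text` and `voice_id` are assumptions made for the example; the [API reference](./reference/api) is the authority on the real request shape. The path assumes `/speak` sits under the module mount point `/v1/modules/scaispeak/`.

```python
import json
import urllib.request

# Mount path as documented on this page; everything else below is illustrative.
MODULE_PREFIX = "/v1/modules/scaispeak"


def build_speak_request(base_url: str, token: str, text: str, voice_id: str) -> urllib.request.Request:
    """Build a batch-synthesis request without sending it.

    The payload keys ("text", "voice_id") and the bearer auth scheme are
    assumptions for this sketch, not confirmed ScaiSpeak field names.
    """
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + MODULE_PREFIX + "/speak",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical host, token, and voice id.
req = build_speak_request("https://scaigrid.example.com", "TOKEN", "Hello", "voice-123")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. Since short text returns audio inline and long text becomes an async job, a real client would need to handle both kinds of response.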

## Two-minute mental model

You manage three nouns and one verb:

- A **Voice** is a record in the library — global, tenant, or user-scoped, with an embedding status (`pending`, `processing`, `ready`, `failed`, `evicted`).
- A **Consent** (or a **License** for global voices) is the audit trail that authorises cloning and use.
- A **Synth job** is one async render, created when the text is too long for an inline response.
- And the verb: a caller **speaks**, which means sending text plus a voice ID and getting audio back.

The streaming endpoints are the same verb with the audio delivered in frames instead of one blob.
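
One way to picture the nouns is as plain data types. The sketch below is a hypothetical client-side model, not ScaiSpeak's actual schema: the class and field names are invented for illustration. The three scopes and five embedding statuses are taken from the list above, and the rule that only a `ready` voice can be spoken with is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    GLOBAL = "global"
    TENANT = "tenant"
    USER = "user"


class EmbeddingStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"
    EVICTED = "evicted"


@dataclass(frozen=True)
class Voice:
    # Field names are hypothetical; only the scope and status values come from the docs.
    voice_id: str
    scope: Scope
    embedding_status: EmbeddingStatus


def audit_trail_kind(voice: Voice) -> str:
    """Global voices are covered by a license; tenant and user voices by a consent."""
    return "license" if voice.scope is Scope.GLOBAL else "consent"


def is_speakable(voice: Voice) -> bool:
    """Assumption for this sketch: only a voice whose embedding is ready can be used."""
    return voice.embedding_status is EmbeddingStatus.READY


narrator = Voice("voice-123", Scope.GLOBAL, EmbeddingStatus.READY)
my_clone = Voice("voice-456", Scope.USER, EmbeddingStatus.PROCESSING)
print(audit_trail_kind(narrator), is_speakable(narrator))
print(audit_trail_kind(my_clone), is_speakable(my_clone))
```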

## Where to go next

- [Quickstart](./quickstart) — list a voice and render audio in five minutes.
- [Architecture](./concepts/architecture) — backends, dispatcher, voice warming, save_to flow.
- [Voice library and consent](./concepts/voice-library) — scopes, cloning, consent vs license, lifecycle.
- [Synthesise in a custom voice](./tutorials/clone-and-synthesise) — full clone-and-speak walkthrough.
- [Real-time streaming](./tutorials/stream-with-websocket) — WebSocket streaming with barge-in.
- [API reference](./reference/api) — every endpoint, request, response.

ScaiSpeak's module ID inside ScaiGrid is `scaispeak`; its API is mounted at `/v1/modules/scaispeak/`.
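
For orientation, here is a small sketch that composes full URLs for the three synthesis surfaces from that mount path. It assumes the endpoint paths listed under "What you get" are relative to the mount point and that the WebSocket surface is reached over `ws`/`wss` on the same host. The host name is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit

MODULE_PREFIX = "/v1/modules/scaispeak"

# Endpoint paths as named on this page; treating them as relative to the mount is an assumption.
ENDPOINTS = {
    "batch": "/speak",
    "websocket": "/stream/speak",
    "webrtc": "/stream/speak/webrtc/sessions",
}


def endpoint_url(base_url: str, surface: str) -> str:
    """Return the full URL for one surface, switching to ws/wss for WebSocket."""
    parts = urlsplit(base_url)
    scheme = parts.scheme
    if surface == "websocket":
        scheme = "wss" if parts.scheme == "https" else "ws"
    path = parts.path.rstrip("/") + MODULE_PREFIX + ENDPOINTS[surface]
    return urlunsplit((scheme, parts.netloc, path, "", ""))


for name in ENDPOINTS:
    print(name, endpoint_url("https://scaigrid.example.com", name))
```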
