Architecture
ScaiSpeak is a product layer on top of ScaiGrid's existing primitives — dispatch, accounting, identity, and ScaiInfer's audio engines. There is no separate "speech service"; a synth call is a routing decision plus a dispatcher call.
Components
At a glance: the API surface, the two backends, the voice-warm registry, and the optional ScaiDrive write path. Synchronous and async calls share the same dispatch picker; streaming sessions reuse it through the orchestrator.
There is no separate ScaiSpeak deployment. ScaiSpeak is a ScaiGrid module — it runs in the same FastAPI process, behind the same auth, accounted against the same budgets.
Two backends, one API
Every synth call ends up on one of two backends:
- Backend A: self-hosted. A ScaiInfer node carrying the configured TTS engine on your GPUs. Lowest latency, most control, mandatory for some compliance postures. Picked at dispatch time by resolve_audio_node.
- Backend B: relay. A managed third-party TTS relay. No infrastructure required, pay-per-call, useful as a fallback when A isn't deployed or is saturated.
The caller never picks the backend directly. The caller's backend_preference is advisory: the actual decision is made by BackendPolicyResolver against the tenant's allowed_backends and default_backend. A tenant can lock to A only (sovereign-only postures), to B only (zero-infra postures), or allow both.
If the caller sends backend_preference: "prefer_self_hosted" and no A node has the engine loaded, the request falls through to B if B is in the allowed set, or returns 502 SCAISPEAK_BACKEND_UNAVAILABLE if it isn't.
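The fall-through rule above can be sketched as a small pure function. This is a minimal illustration, not the real BackendPolicyResolver: the TenantPolicy shape, the pick_backend name, and the exception class are hypothetical, and it assumes that with no preference the tenant's default_backend is tried first and that backend B is always reachable when allowed.

```python
from dataclasses import dataclass
from typing import Optional


class BackendUnavailable(Exception):
    """Hypothetical stand-in for the 502 SCAISPEAK_BACKEND_UNAVAILABLE response."""

    status_code = 502
    code = "SCAISPEAK_BACKEND_UNAVAILABLE"


@dataclass(frozen=True)
class TenantPolicy:
    # Field names follow the text; the container itself is an assumption.
    allowed_backends: frozenset  # subset of {"A", "B"}
    default_backend: str         # "A" or "B"


def pick_backend(policy: TenantPolicy, a_node_ready: bool,
                 preference: Optional[str] = None) -> str:
    """Return "A" or "B", or raise if no allowed backend can serve the call."""
    if preference == "prefer_self_hosted":
        order = ["A", "B"]
    else:
        # Assumption: without a preference, try the tenant default first.
        other = "B" if policy.default_backend == "A" else "A"
        order = [policy.default_backend, other]

    for backend in order:
        if backend not in policy.allowed_backends:
            continue  # tenant policy always wins over caller preference
        if backend == "A" and not a_node_ready:
            continue  # no self-hosted node has the engine loaded
        return backend
    raise BackendUnavailable("no allowed backend is available")
```

The preference only reorders candidates; it never widens the allowed set, which is what keeps a sovereign-only tenant off the relay.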
Request flow for one synth call
1. Route handler parses the body and pulls AuthUser + permission gate.
2. Backend policy is resolved for the caller's tenant (get_or_provision; first call seeds a default).
3. Voice visibility check: the voice id must be global, the caller's tenant, or the caller's user.
4. Backend pick. If A is in the allowed set AND a node has the engine loaded, A is preferred (or chosen per the preference flag). Otherwise B.
5. Dispatch. Backend A goes via ScaiInferDispatcher (gRPC); on the self-hosted path, the reference clip is fetched from object storage, normalised to 16 kHz mono int16, and passed inline with the synth request (zero-shot cloning). Backend B goes via the managed-relay dispatcher (HTTP) and uses its own preset speaker set.
6. Optional save_to. If a save_to block was sent, the synth output is uploaded into the caller's ScaiDrive share via a token-exchanged JWT; the response carries the resulting file_id.
7. Accounting. Backend used, character count, and dispatch latency are recorded.
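The reference-clip normalisation in the dispatch step (16 kHz mono int16) can be illustrated with the standard library alone. This is a sketch under stated assumptions: the function names are hypothetical, the input is assumed to be interleaved int16 PCM that has already been decoded, and the resampler is plain linear interpolation with no anti-alias filter, so a production path would more likely use a dedicated resampling library.

```python
from array import array

TARGET_RATE = 16_000  # target sample rate named in the text


def to_mono(samples: array, channels: int) -> array:
    """Average interleaved int16 channels into a single channel."""
    if channels == 1:
        return array("h", samples)
    frames = len(samples) // channels
    return array(
        "h",
        (sum(samples[i * channels:(i + 1) * channels]) // channels
         for i in range(frames)),
    )


def resample_linear(samples: array, src_rate: int,
                    dst_rate: int = TARGET_RATE) -> array:
    """Resample mono int16 audio by linear interpolation (illustrative only)."""
    if src_rate == dst_rate or len(samples) == 0:
        return array("h", samples)
    out_len = max(1, len(samples) * dst_rate // src_rate)
    step = src_rate / dst_rate
    last = len(samples) - 1
    out = array("h")
    for n in range(out_len):
        pos = n * step
        i = min(int(pos), last)
        j = min(i + 1, last)
        frac = pos - i
        value = samples[i] * (1.0 - frac) + samples[j] * frac
        # Clamp to the int16 range before storing.
        out.append(max(-32768, min(32767, round(value))))
    return out


def normalise_reference(samples: array, src_rate: int, channels: int) -> array:
    """Downmix to mono, then resample to 16 kHz."""
    return resample_linear(to_mono(samples, channels), src_rate)
```

Downmixing before resampling keeps the interpolation loop single-channel, which halves the work for stereo input.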
For long-form text (default >500 chars or force_async: true), step 4 enqueues an arq job (process_synth_job) instead of dispatching inline. The async worker runs steps 4-6 in the background; the caller polls GET /speak/jobs/{id}.
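The inline-versus-async decision and the client-side polling loop can be sketched as follows. The 500-character default and force_async come from the text; the function names, the job dictionary shape, and the "done" and "failed" status values are assumptions made for illustration, and fetch_job stands in for whatever HTTP client issues GET /speak/jobs/{id}.

```python
import time
from typing import Callable, Dict

ASYNC_CHAR_THRESHOLD = 500  # default named in the text


def should_enqueue(text: str, force_async: bool = False,
                   threshold: int = ASYNC_CHAR_THRESHOLD) -> bool:
    """True when the synth call should become a background job."""
    return force_async or len(text) > threshold


def poll_job(job_id: str,
             fetch_job: Callable[[str], Dict],
             interval_s: float = 1.0,
             max_attempts: int = 60,
             sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll until the job reaches a terminal status (status names are hypothetical)."""
    for _ in range(max_attempts):
        job = fetch_job(job_id)
        if job.get("status") in ("done", "failed"):
            return job
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

Passing fetch_job and sleep as parameters keeps the loop testable without a network or real delays.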
Streaming transports
Three streaming surfaces share one orchestrator (StreamService):
- WebSocket at /stream/speak. Best for server-side clients. Bidirectional JSON control, binary audio frames.
- WebRTC at /stream/speak/webrtc/sessions/*. Best for browsers. Signalling via REST, control via WebSocket, audio over the RTP/SRTP path negotiated by aiortc. Caveat: signalling and lifecycle ship end-to-end today; the audio-track decode path through aiortc's MediaStreamTrack raises NotImplementedError on first recv. Use WebSocket for production streaming until the audio plane lands.
- gRPC bidi (spec'd, not yet exposed on this module's router). Best for native applications wanting protobuf framing.
All three speak the same control vocabulary: text, flush, interrupt, close. The interrupt verb is barge-in — drop buffered audio and stop generation immediately.
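The four control verbs can be pictured as a small per-session state machine. This sketch is not StreamService: the class name, the JSON message shape ({"type": ..., "text": ...}), and the reading of flush as "synthesise whatever text is buffered" are assumptions. Discarding pending text on interrupt is also an assumption; the text only states that buffered audio is dropped and generation stops.

```python
import json
from collections import deque
from typing import Optional


class StreamSession:
    """Illustrative session state for the text/flush/interrupt/close verbs."""

    def __init__(self) -> None:
        self.pending_text = []      # text received but not yet synthesised
        self.audio_queue = deque()  # synthesised frames not yet sent
        self.generating = False
        self.closed = False

    def handle(self, raw: str) -> Optional[str]:
        """Apply one control message; return text to synthesise, if any."""
        if self.closed:
            raise RuntimeError("session is closed")
        msg = json.loads(raw)
        verb = msg.get("type")
        if verb == "text":
            self.pending_text.append(msg["text"])
        elif verb == "flush":
            chunk = "".join(self.pending_text)
            self.pending_text.clear()
            if chunk:
                self.generating = True
                return chunk
        elif verb == "interrupt":
            # Barge-in: drop buffered audio and stop generation immediately.
            self.audio_queue.clear()
            self.pending_text.clear()  # assumption, see lead-in
            self.generating = False
        elif verb == "close":
            self.closed = True
        else:
            raise ValueError(f"unknown control verb: {verb!r}")
        return None
```

Keeping the verb handling transport-agnostic is what would let WebSocket, WebRTC, and gRPC front ends share one orchestrator.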
Voice warming
Backend A's current zero-shot path doesn't require a separate warm step — the reference clip is shipped with each synth request and the engine clones in one round-trip. The VoiceWarmService and the POST /voices/{id}/warm endpoint remain in place from the previous-generation cloning architecture and are retained for compatibility; on the current self-hosted engine they're no-ops.
Operators who care about first-synth latency can pre-fetch the reference audio into a local cache (Phase 2 controller-side optimisation), which removes the per-call S3 round-trip without involving the engine.
State
- Voices, consents, licenses, provenance, audit rows: in ScaiGrid's MariaDB.
- Reference audio + consent recordings + license documents: in object storage (Garage S3 under the hood), keyed by scaispeak/voices/{voice_id}/.... The reference clip is the cloning input and is preserved permanently with the voice row.
- Voice-warm registry: Redis sorted set per voice; retained from the previous-generation cloning architecture and unused on the current zero-shot engine.
- Synth jobs: partitioned table; output blobs live in S3 referenced by audio_uri.
Where the trust boundary is
The synth API authenticates the caller, not the voice. ACL is by visibility: the caller sees scope='global' voices, plus their tenant's voices, plus their own. Promoting a private voice to tenant scope (POST /voices/{id}/share) needs scaispeak:voice.share in addition to the standard write permission. Sharing is a separate capability, so you can grant cloning without also granting promotion.
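A minimal sketch of the visibility rule and the separate share capability follows. Only scope='global' and the scaispeak:voice.share permission are named in the text; the "tenant" and "user" scope literals, the scaispeak:voice.write permission name, the dataclass shapes, and the rule that only the owner may share are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

WRITE_PERMISSION = "scaispeak:voice.write"  # hypothetical name
SHARE_PERMISSION = "scaispeak:voice.share"  # named in the text


@dataclass(frozen=True)
class Voice:
    voice_id: str
    scope: str  # "global", "tenant", or "user" (last two are assumed literals)
    tenant_id: Optional[str]
    owner_user_id: Optional[str]


@dataclass(frozen=True)
class Caller:
    user_id: str
    tenant_id: str
    permissions: frozenset


def can_see(voice: Voice, caller: Caller) -> bool:
    """Global voices, the caller's tenant voices, or the caller's own voices."""
    if voice.scope == "global":
        return True
    if voice.scope == "tenant":
        return voice.tenant_id == caller.tenant_id
    return voice.owner_user_id == caller.user_id


def can_share(voice: Voice, caller: Caller) -> bool:
    """Promotion needs the share capability on top of write (owner-only is assumed)."""
    return (
        voice.owner_user_id == caller.user_id
        and WRITE_PERMISSION in caller.permissions
        and SHARE_PERMISSION in caller.permissions
    )
```

Because can_share checks both permissions, a role that holds only the write permission can clone voices but cannot promote them to tenant scope.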
save_to writes go further: the synth path exchanges the caller's JWT for a ScaiDrive-audience token at the moment of dispatch, then uploads as the caller. This works only for JWT auth (sgk_ API keys can't perform ScaiKey token exchange). Tenants who want save_to from key-authenticated workers issue per-worker JWTs through their own auth path.
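The JWT-only constraint on save_to can be illustrated with a credential check followed by a token-exchange request body. This is a sketch under assumptions: the source does not say which exchange protocol ScaiKey implements, so the form fields below follow the OAuth 2.0 Token Exchange parameter names (RFC 8693) purely as an example, and the "scaidrive" audience value and both function names are hypothetical.

```python
from urllib.parse import urlencode


def credential_kind(credential: str) -> str:
    """Classify a bearer credential as an API key, a JWT, or unknown."""
    if credential.startswith("sgk_"):
        return "api_key"
    # A compact-serialised signed JWT has three dot-separated segments.
    if credential.count(".") == 2:
        return "jwt"
    return "unknown"


def build_exchange_form(caller_jwt: str, audience: str = "scaidrive") -> str:
    """Build an RFC 8693-style form body (assumed protocol, hypothetical audience)."""
    if credential_kind(caller_jwt) != "jwt":
        raise ValueError("save_to requires JWT auth; API keys cannot be exchanged")
    return urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": caller_jwt,
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "audience": audience,
    })
```

Rejecting sgk_ keys before any network call lets a key-authenticated worker fail fast with a clear message instead of an opaque exchange error.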