Architecture

ScaiSpeak is a product layer on top of ScaiGrid's existing primitives — dispatch, accounting, identity, and ScaiInfer's audio engines. There is no separate "speech service"; a synth call is a routing decision plus a dispatcher call.

Components

At a glance: the API surface, the two backends, the voice-warm registry, and the optional ScaiDrive write path. Synchronous and async calls share the same dispatch picker; streaming sessions reuse it through the orchestrator.

flowchart LR
    Caller[Caller]
    subgraph ScaiGrid["ScaiGrid /v1/modules/scaispeak/..."]
        Module[Voice lib Backend policy Speak svc Stream svc Voice warm]
        Accounting[Accounting Audit / GDPR]
        Module --> Accounting
    end
    BackendA[Backend A ScaiInfer self-hosted TTS engine]
    BackendB[Backend B managed TTS relay]
    ScaiDrive[ScaiDrive save_to]
    Caller -- "POST /speak" --> Module
    Module -- "audio bytes / job_id" --> Caller
    Caller <-- "WS / WebRTC audio frames" --> Module
    Module -- gRPC --> BackendA
    Module -- HTTP --> BackendB
    Module -- HTTP --> ScaiDrive

There is no separate ScaiSpeak deployment. ScaiSpeak is a ScaiGrid module — it runs in the same FastAPI process, behind the same auth, accounted against the same budgets.

Two backends, one API

Every synth call ends up on one of two backends:

Backend A — self-hosted. A ScaiInfer node carrying the configured TTS engine on your GPUs. Lowest latency, most control, mandatory for some compliance postures. Picked at dispatch time by resolve_audio_node.
Backend B — relay. A managed third-party TTS relay. No infrastructure required, pay-per-call, useful as a fallback when A isn't deployed or is saturated.

The caller never picks the backend directly. The caller's backend_preference is advisory: the actual decision is made by BackendPolicyResolver against the tenant's allowed_backends and default_backend. A tenant can lock to A only (sovereign-only postures), to B only (zero-infra postures), or allow both.

If the caller sends backend_preference: "prefer_self_hosted" and no A node has the engine loaded, the request falls through to B if B is in the allowed set, or returns 502 SCAISPEAK_BACKEND_UNAVAILABLE if it isn't.
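The fall-through rule can be sketched as a small pure function. This is an illustration only, not the actual BackendPolicyResolver: the TenantPolicy shape, the pick_backend name, the "A"/"B" labels, the assumption that B is always reachable, and the choice to start from default_backend when no preference is sent are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


class BackendUnavailable(Exception):
    """Stands in for the 502 SCAISPEAK_BACKEND_UNAVAILABLE response."""


@dataclass(frozen=True)
class TenantPolicy:
    # Hypothetical shape; the real policy record may differ.
    allowed_backends: frozenset  # subset of {"A", "B"}
    default_backend: str         # "A" or "B"


def pick_backend(policy: TenantPolicy,
                 preference: Optional[str],
                 a_node_ready: bool) -> str:
    """Return "A" or "B", treating the caller's preference as advisory."""
    # Assumption: without a preference, start from the tenant default.
    first = "A" if preference == "prefer_self_hosted" else policy.default_backend
    second = "B" if first == "A" else "A"
    for backend in (first, second):
        if backend not in policy.allowed_backends:
            continue  # tenant policy wins over caller preference
        if backend == "A" and not a_node_ready:
            continue  # no self-hosted node has the engine loaded
        return backend
    raise BackendUnavailable("SCAISPEAK_BACKEND_UNAVAILABLE")
```

With both backends allowed and no A node ready, pick_backend returns "B"; with a sovereign-only policy (A only) the same call raises instead of silently leaving the tenant's allowed set.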

Request flow for one synth call

1. Route handler parses the body, resolves AuthUser, and applies the permission gate.
2. Backend policy is resolved for the caller's tenant (get_or_provision; the first call seeds a default).
3. Voice visibility check: the voice must be global, belong to the caller's tenant, or belong to the caller.
4. Backend pick. If A is in the allowed set AND a node has the engine loaded, A is preferred (or chosen per the preference flag). Otherwise B.
5. Dispatch. Backend A goes via ScaiInferDispatcher (gRPC): the reference clip is fetched from object storage, normalised to 16 kHz mono int16, and passed inline with the synth request (zero-shot cloning). Backend B goes via the managed-relay dispatcher (HTTP) and uses its own preset speaker set.
6. Optional save_to. If a save_to block was sent, the synth output is uploaded into the caller's ScaiDrive share via a token-exchanged JWT; the response carries the resulting file_id.
7. Accounting. Backend used, character count, and dispatch latency are recorded.
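To make the "normalised to 16 kHz mono int16" part of the dispatch step concrete, here is a minimal standard-library sketch. It is not the module's actual code: normalise_clip is a hypothetical name, the input is assumed to be interleaved int16 PCM that has already been decoded, and the resampler is plain linear interpolation with no anti-aliasing filter, which a production path would likely replace with a proper resampler.

```python
from array import array

TARGET_RATE = 16_000  # Hz, the rate named in the dispatch step


def normalise_clip(samples, channels: int, src_rate: int) -> array:
    """Convert interleaved int16 PCM to 16 kHz mono int16 samples."""
    if channels < 1 or src_rate <= 0:
        raise ValueError("channels and src_rate must be positive")
    frames = len(samples) // channels
    if frames == 0:
        return array("h")
    # Downmix: average the channels of each frame.
    mono = [sum(samples[i * channels:(i + 1) * channels]) / channels
            for i in range(frames)]
    out_len = max(1, round(frames * TARGET_RATE / src_rate))
    out = array("h")
    for n in range(out_len):
        pos = n * src_rate / TARGET_RATE  # fractional index in the source
        i = int(pos)
        if i >= frames - 1:
            value = mono[-1]  # past the last full interval: hold last sample
        else:
            frac = pos - i
            value = mono[i] * (1 - frac) + mono[i + 1] * frac
        # Clamp to the int16 range before storing.
        out.append(max(-32768, min(32767, round(value))))
    return out
```

A clip that is already 16 kHz mono passes through unchanged, so the function is safe to call unconditionally before the clip is attached to the synth request.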

For long-form text (by default, more than 500 characters) or when force_async: true is set, step 4 enqueues an arq job (process_synth_job) instead of dispatching inline. The async worker runs steps 4-6 in the background; the caller polls GET /speak/jobs/{id}.
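The sync/async split and the polling loop can be sketched as follows. The helper names, the job payload shape, and the "completed" / "failed" status values are assumptions made for illustration; only the 500-character default, force_async, and the polling endpoint come from the description above. The HTTP call is injected as a callable so the sketch stays independent of any client library.

```python
import time
from typing import Callable

ASYNC_CHAR_THRESHOLD = 500  # documented default: more than 500 characters


def should_run_async(text: str, force_async: bool = False,
                     threshold: int = ASYNC_CHAR_THRESHOLD) -> bool:
    """True when the request should be enqueued instead of dispatched inline."""
    return force_async or len(text) > threshold


def poll_job(fetch_status: Callable[[], dict],
             interval_s: float = 1.0,
             max_attempts: int = 60,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status (e.g. a GET on /speak/jobs/{id}) until a terminal state.

    The terminal status names below are hypothetical.
    """
    for _ in range(max_attempts):
        job = fetch_status()
        if job.get("status") in {"completed", "failed"}:
            return job
        sleep(interval_s)
    raise TimeoutError("synth job did not reach a terminal state in time")
```

In a real client, fetch_status would wrap an authenticated GET on the job URL; bounding the number of attempts keeps a stuck job from blocking the caller indefinitely.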

Streaming transports

Three streaming surfaces share one orchestrator (StreamService):

WebSocket at /stream/speak. Best for server-side clients. Bidirectional JSON control, binary audio frames.
WebRTC at /stream/speak/webrtc/sessions/*. Best for browsers. Signalling via REST, control via WebSocket, audio over the RTP/SRTP path negotiated by aiortc. Caveat: signalling and lifecycle ship end-to-end today; the audio-track decode path through aiortc's MediaStreamTrack raises NotImplementedError on first recv. Use WebSocket for production streaming until the audio plane lands.
gRPC bidi (spec'd, not yet exposed on this module's router). Best for native applications wanting protobuf framing.

All three speak the same control vocabulary: text, flush, interrupt, close. The interrupt verb is barge-in — drop buffered audio and stop generation immediately.
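The four verbs can be modelled as a tiny state machine. This is a toy model, not StreamService: the {"type": ...} message shape is an assumed wire format, and the queued_audio list holds text chunks as stand-ins for synthesised audio frames.

```python
from dataclasses import dataclass, field


@dataclass
class StreamSessionModel:
    """Toy model of one streaming session's buffers (hypothetical)."""
    pending_text: list = field(default_factory=list)   # text not yet synthesised
    queued_audio: list = field(default_factory=list)   # stand-in for audio frames
    closed: bool = False

    def handle(self, message: dict) -> None:
        if self.closed:
            raise RuntimeError("session is closed")
        kind = message.get("type")
        if kind == "text":
            # Accumulate text until the client asks for synthesis.
            self.pending_text.append(message["text"])
        elif kind == "flush":
            # Synthesise everything buffered so far as one chunk.
            if self.pending_text:
                self.queued_audio.append("".join(self.pending_text))
                self.pending_text.clear()
        elif kind == "interrupt":
            # Barge-in: drop buffered audio and stop pending generation.
            self.pending_text.clear()
            self.queued_audio.clear()
        elif kind == "close":
            self.closed = True
        else:
            raise ValueError(f"unknown control verb: {kind!r}")
```

The point of the model is the asymmetry between flush and interrupt: flush turns pending text into audio, while interrupt discards both pending text and audio that has not yet been played.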

Voice warming

Backend A's current zero-shot path doesn't require a separate warm step — the reference clip is shipped with each synth request and the engine clones in one round-trip. The VoiceWarmService and the POST /voices/{id}/warm endpoint remain in place from the previous-generation cloning architecture and are retained for compatibility; on the current self-hosted engine they're no-ops.

Operators who care about first-synth latency can pre-fetch the reference audio into a local cache (Phase 2 controller-side optimisation), which removes the per-call S3 round-trip without involving the engine.

State

Voices, consents, licenses, provenance, audit rows — in ScaiGrid's MariaDB.
Reference audio + consent recordings + license documents — in object storage (Garage S3 under the hood), keyed by scaispeak/voices/{voice_id}/.... The reference clip is the cloning input — preserved permanently with the voice row.
Voice-warm registry — Redis sorted set per voice; retained from the previous-generation cloning architecture and unused on the current zero-shot engine.
Synth jobs — partitioned table; output blobs live in S3 referenced by audio_uri.

Where the trust boundary is

The synth API authenticates the caller, not the voice. ACL is by visibility: the caller sees scope='global' voices plus their tenant's plus their own. Promoting a private voice to tenant scope (POST /voices/{id}/share) requires scaispeak:voice.share in addition to the standard write permission. Sharing is a separate capability, so you can grant cloning without also granting the right to promote voices to tenant scope.
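The visibility rule and the separate share capability can be expressed as two predicates. The Voice and Caller shapes are hypothetical, as are the "tenant" and "private" scope values and the scaispeak:voice.write permission name used for the standard write permission; only scope='global' and scaispeak:voice.share appear in the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Voice:
    voice_id: str
    scope: str                        # "global", "tenant", or "private" (assumed values)
    tenant_id: Optional[str] = None
    owner_user_id: Optional[str] = None


@dataclass(frozen=True)
class Caller:
    user_id: str
    tenant_id: str
    permissions: frozenset = frozenset()


def can_see(caller: Caller, voice: Voice) -> bool:
    """Global voices, the caller's tenant's voices, and the caller's own."""
    if voice.scope == "global":
        return True
    if voice.scope == "tenant":
        return voice.tenant_id == caller.tenant_id
    return voice.owner_user_id == caller.user_id


def can_share(caller: Caller, voice: Voice) -> bool:
    """Promotion needs the share capability on top of write (names assumed)."""
    required = {"scaispeak:voice.write", "scaispeak:voice.share"}
    return voice.owner_user_id == caller.user_id and required <= caller.permissions
```

A caller holding only the write permission can see and use their own private voice but fails can_share, which is the separation the paragraph above describes.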

save_to writes go further: the synth path exchanges the caller's JWT for a ScaiDrive-audience token at the moment of dispatch, then uploads as the caller. This works only for JWT auth (sgk_ API keys can't perform ScaiKey token exchange). Tenants who want save_to from key-authenticated workers issue per-worker JWTs through their own auth path.
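A request handler could reject save_to early for credentials that cannot be exchanged. The sketch below is an assumption-laden illustration: the function names are hypothetical, the JWT check only looks at the three-segment compact form without validating anything, and the sgk_ prefix is the only detail taken from the text above.

```python
def credential_kind(credential: str) -> str:
    """Classify a bearer credential by shape only (no signature checks)."""
    if credential.startswith("sgk_"):
        return "api_key"
    parts = credential.split(".")
    # Compact JWTs are three non-empty dot-separated segments.
    if len(parts) == 3 and all(parts):
        return "jwt"
    return "unknown"


def save_to_allowed(credential: str) -> bool:
    """save_to depends on token exchange, which only JWT callers can perform."""
    return credential_kind(credential) == "jwt"
```

Failing fast here gives key-authenticated workers a clear error before any synthesis work is done, rather than a failed upload after the audio has already been generated.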