
Architecture

ScaiSpeak is a product layer on top of ScaiGrid's existing primitives — dispatch, accounting, identity, and ScaiInfer's audio engines. There is no separate "speech service"; a synth call is a routing decision plus a dispatcher call.

Components

At a glance: the API surface, the two backends, the voice-warm registry, and the optional ScaiDrive write path. Synchronous and async calls share the same dispatch picker; streaming sessions reuse it through the orchestrator.

flowchart LR
    Caller[Caller]
    subgraph ScaiGrid["ScaiGrid /v1/modules/scaispeak/..."]
        Module[Voice lib<br/>Backend policy<br/>Speak svc<br/>Stream svc<br/>Voice warm]
        Accounting[Accounting<br/>Audit / GDPR]
        Module --> Accounting
    end
    BackendA[Backend A<br/>ScaiInfer<br/>self-hosted TTS engine]
    BackendB[Backend B<br/>managed TTS relay]
    ScaiDrive[ScaiDrive<br/>save_to]
    Caller -- "POST /speak" --> Module
    Module -- "audio bytes / job_id" --> Caller
    Caller <-- "WS / WebRTC<br/>audio frames" --> Module
    Module -- gRPC --> BackendA
    Module -- HTTP --> BackendB
    Module -- HTTP --> ScaiDrive

There is no separate ScaiSpeak deployment. ScaiSpeak is a ScaiGrid module — it runs in the same FastAPI process, behind the same auth, accounted against the same budgets.

Two backends, one API

Every synth call ends up on one of two backends:

  • Backend A — self-hosted. A ScaiInfer node carrying the configured TTS engine on your GPUs. Lowest latency, most control, mandatory for some compliance postures. Picked at dispatch time by resolve_audio_node.
  • Backend B — relay. A managed third-party TTS relay. No infrastructure required, pay-per-call, useful as a fallback when A isn't deployed or is saturated.

The caller never picks the backend directly. The caller's backend_preference is advisory only: the actual decision is made by BackendPolicyResolver against the tenant's allowed_backends and default_backend. A tenant can lock to A only (sovereign-only postures), lock to B only (zero-infra postures), or allow both.

If the caller sends backend_preference: "prefer_self_hosted" and no A node has the engine loaded, the request falls through to B if B is in the allowed set, or returns 502 SCAISPEAK_BACKEND_UNAVAILABLE if it isn't.
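The fall-through rule above can be sketched as a small pure function. This is an illustrative sketch, not the BackendPolicyResolver implementation: the TenantPolicy shape, the pick_backend name, the a_node_ready flag, and the way default_backend orders candidates when no preference is sent are all assumptions. Only backend_preference: "prefer_self_hosted", allowed_backends, default_backend, and the 502 SCAISPEAK_BACKEND_UNAVAILABLE code come from the text.

```python
from dataclasses import dataclass
from typing import Optional


class BackendUnavailable(Exception):
    """Stands in for the 502 SCAISPEAK_BACKEND_UNAVAILABLE response."""

    status_code = 502
    code = "SCAISPEAK_BACKEND_UNAVAILABLE"


@dataclass(frozen=True)
class TenantPolicy:
    # Hypothetical shape: the text names the two fields but not their types.
    allowed_backends: frozenset  # subset of {"A", "B"}
    default_backend: str         # "A" or "B"


def pick_backend(policy: TenantPolicy, preference: Optional[str], a_node_ready: bool) -> str:
    """Return "A" or "B", or raise BackendUnavailable.

    a_node_ready models "some ScaiInfer node has the engine loaded".
    Backend B is assumed to be reachable whenever it is allowed.
    """
    if preference == "prefer_self_hosted":
        order = ["A", "B"]
    else:
        # Assumption: with no recognised preference, try the tenant default first.
        other = "B" if policy.default_backend == "A" else "A"
        order = [policy.default_backend, other]

    for backend in order:
        if backend not in policy.allowed_backends:
            continue  # the tenant policy always wins over the caller's preference
        if backend == "A" and not a_node_ready:
            continue  # no node carries the engine, so try the next candidate
        return backend

    raise BackendUnavailable("no allowed backend can serve this request")
```

Under these assumptions, a tenant locked to A only with no ready node raises BackendUnavailable, while a tenant that allows both backends falls through to B.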

Request flow for one synth call

  1. Route handler parses the body, resolves AuthUser, and applies the permission gate.
  2. Backend policy is resolved for the caller's tenant (get_or_provision — first call seeds a default).
  3. Voice visibility check: the voice must be global, belong to the caller's tenant, or belong to the caller.
  4. Backend pick. If A is in the allowed set AND a node has the engine loaded, A is preferred (or chosen per the preference flag). Otherwise B.
  5. Dispatch. Backend A goes via ScaiInferDispatcher (gRPC); on the self-hosted path, the reference clip is fetched from object storage, normalised to 16 kHz mono int16, and passed inline with the synth request (zero-shot cloning). Backend B goes via the managed-relay dispatcher (HTTP) and uses its own preset speaker set.
  6. Optional save_to. If a save_to block was sent, the synth output is uploaded into the caller's ScaiDrive share via a token-exchanged JWT; the response carries the resulting file_id.
  7. Accounting. Backend used, character count, and dispatch latency are recorded.
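To make the normalisation in step 5 concrete, here is a minimal standard-library sketch of turning interleaved int16 PCM into 16 kHz mono int16. It is not the ScaiSpeak code path: the function name is hypothetical, the input is assumed to be already decoded to int16, and linear interpolation is used only to keep the example short. A production resampler would low-pass filter before downsampling to avoid aliasing.

```python
from array import array


def normalise_clip(samples: array, channels: int, src_rate: int, dst_rate: int = 16_000) -> array:
    """Downmix interleaved int16 PCM to mono and resample to dst_rate."""
    if samples.typecode != "h":
        raise ValueError("expected int16 samples (array typecode 'h')")
    if channels < 1 or src_rate <= 0 or dst_rate <= 0:
        raise ValueError("channels and sample rates must be positive")

    frames = len(samples) // channels
    if frames == 0:
        return array("h")

    # Downmix: average the channels of each frame.
    mono = [
        sum(samples[i * channels:(i + 1) * channels]) / channels
        for i in range(frames)
    ]

    if src_rate == dst_rate:
        resampled = mono
    else:
        out_len = max(1, round(frames * dst_rate / src_rate))
        resampled = []
        for n in range(out_len):
            pos = n * src_rate / dst_rate        # fractional index into the source
            i = int(pos)
            frac = pos - i
            a = mono[min(i, frames - 1)]
            b = mono[min(i + 1, frames - 1)]     # hold the last sample at the edge
            resampled.append(a + (b - a) * frac)

    # Round and clamp back into the int16 range.
    return array("h", (max(-32768, min(32767, round(v))) for v in resampled))
```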

For long-form text (more than 500 characters by default, or any request with force_async: true), step 4 enqueues an arq job (process_synth_job) rather than continuing inline. The async worker then runs steps 4-6 (backend pick, dispatch, optional save_to) in the background, and the caller polls GET /speak/jobs/{id} for the result.
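The sync-versus-async split and the client-side polling loop might look like the sketch below. The 500-character default and force_async come from the text; everything else is assumed for illustration, including the function names, counting characters with len(), the "status" field, and the terminal status values. The job response schema is not described here, so fetch_job is injected rather than tied to a specific HTTP client.

```python
import time
from typing import Callable

ASYNC_CHAR_THRESHOLD = 500  # default from the text; assumed to be configurable

# Hypothetical terminal states; the real job schema is not specified in this page.
TERMINAL_STATUSES = frozenset({"done", "failed"})


def should_enqueue(text: str, force_async: bool = False,
                   threshold: int = ASYNC_CHAR_THRESHOLD) -> bool:
    """True when the request should go to the background worker."""
    return force_async or len(text) > threshold


def poll_job(fetch_job: Callable[[str], dict], job_id: str,
             interval_s: float = 1.0, timeout_s: float = 60.0,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_job (e.g. a wrapper around GET /speak/jobs/{id}) until it finishes."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")
        sleep(interval_s)
```

Injecting fetch_job and sleep keeps the loop testable without a network connection or real delays.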

Streaming transports

Three streaming surfaces share one orchestrator (StreamService):

  • WebSocket at /stream/speak. Best for server-side clients. Bidirectional JSON control, binary audio frames.
  • WebRTC at /stream/speak/webrtc/sessions/*. Best for browsers. Signalling via REST, control via WebSocket, audio over the RTP/SRTP path negotiated by aiortc. Caveat: signalling and lifecycle ship end-to-end today; the audio-track decode path through aiortc's MediaStreamTrack raises NotImplementedError on first recv. Use WebSocket for production streaming until the audio plane lands.
  • gRPC bidi (spec'd, not yet exposed on this module's router). Best for native applications wanting protobuf framing.

All three speak the same control vocabulary: text, flush, interrupt, close. The interrupt verb is barge-in — drop buffered audio and stop generation immediately.
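A toy model of the four control verbs helps show what barge-in means for session state. This is a sketch only, not StreamService: the message shape ({"type": ..., "text": ...}), the class and field names, and the placeholder audio frames are hypothetical. Discarding unsynthesised text on interrupt is also an assumption; the text only states that buffered audio is dropped and generation stops.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamSession:
    pending_text: list = field(default_factory=list)   # text not yet synthesised
    audio_out: deque = field(default_factory=deque)    # frames waiting to be sent
    generating: bool = False
    closed: bool = False

    def handle(self, message: dict) -> None:
        if self.closed:
            raise RuntimeError("session is closed")
        verb = message.get("type")
        if verb == "text":
            self.pending_text.append(message["text"])
        elif verb == "flush":
            self._synthesise()
        elif verb == "interrupt":
            # Barge-in: drop buffered audio and stop generation immediately.
            self.audio_out.clear()
            self.pending_text.clear()  # assumption: queued text is discarded too
            self.generating = False
        elif verb == "close":
            self.closed = True
        else:
            raise ValueError(f"unknown control verb: {verb!r}")

    def _synthesise(self) -> None:
        text = "".join(self.pending_text)
        self.pending_text.clear()
        if text:
            self.generating = True
            # Placeholder frame; a real session would stream engine output here.
            self.audio_out.append(f"<audio for {text!r}>".encode())
```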

Voice warming

Backend A's current zero-shot path doesn't require a separate warm step — the reference clip is shipped with each synth request and the engine clones in one round-trip. The VoiceWarmService and the POST /voices/{id}/warm endpoint remain in place from the previous-generation cloning architecture and are retained for compatibility; on the current self-hosted engine they're no-ops.

Operators who care about first-synth latency can pre-fetch the reference audio into a local cache (Phase 2 controller-side optimisation), which removes the per-call S3 round-trip without involving the engine.

State

  • Voices, consents, licenses, provenance, audit rows — in ScaiGrid's MariaDB.
  • Reference audio + consent recordings + license documents — in object storage (Garage S3 under the hood), keyed by scaispeak/voices/{voice_id}/.... The reference clip is the cloning input — preserved permanently with the voice row.
  • Voice-warm registry — Redis sorted set per voice; retained from the previous-generation cloning architecture and unused on the current zero-shot engine.
  • Synth jobs — partitioned table; output blobs live in S3 referenced by audio_uri.

Where the trust boundary is

The synth API authenticates the caller, not the voice. ACL is by visibility: the caller sees scope='global' voices, plus their tenant's voices, plus their own. Promoting a private voice to tenant scope (POST /voices/{id}/share) needs scaispeak:voice.share in addition to the standard write permission. Sharing is a separate capability so that you can grant cloning rights without also granting the right to promote voices to tenant scope.
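The visibility rule and the separate share capability can be sketched as two predicates. Only scope='global' and scaispeak:voice.share are named in the text. The "tenant" and "user" scope values, the scaispeak:voice.write permission string, and the Voice and Caller shapes are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional

WRITE_PERMISSION = "scaispeak:voice.write"  # hypothetical name for the standard write permission
SHARE_PERMISSION = "scaispeak:voice.share"  # named in the text


@dataclass(frozen=True)
class Voice:
    id: str
    scope: str                          # "global", or assumed "tenant" / "user"
    tenant_id: Optional[str] = None
    owner_user_id: Optional[str] = None


@dataclass(frozen=True)
class Caller:
    user_id: str
    tenant_id: str
    permissions: frozenset = frozenset()


def can_see(caller: Caller, voice: Voice) -> bool:
    """Global voices, the caller's tenant's voices, and the caller's own voices."""
    if voice.scope == "global":
        return True
    if voice.scope == "tenant":
        return voice.tenant_id == caller.tenant_id
    return voice.owner_user_id == caller.user_id


def can_share(caller: Caller, voice: Voice) -> bool:
    """Promotion to tenant scope needs the share capability on top of write."""
    return (
        voice.owner_user_id == caller.user_id
        and WRITE_PERMISSION in caller.permissions
        and SHARE_PERMISSION in caller.permissions
    )
```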

save_to writes go further: the synth path exchanges the caller's JWT for a ScaiDrive-audience token at the moment of dispatch, then uploads as the caller. This works only for JWT auth (sgk_ API keys can't perform ScaiKey token exchange). Tenants who want save_to from key-authenticated workers issue per-worker JWTs through their own auth path.
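The sketch below shows why API keys are excluded from save_to and what a token-exchange request could carry. It assumes that ScaiKey's exchange follows the OAuth 2.0 Token Exchange parameter names from RFC 8693, which this page does not confirm. The "scaidrive" audience value and the function and exception names are hypothetical; the sgk_ prefix comes from the text.

```python
from urllib.parse import urlencode

API_KEY_PREFIX = "sgk_"

# Standard URNs from RFC 8693; whether ScaiKey uses them is an assumption.
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
JWT_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:jwt"


class SaveToNotSupported(Exception):
    """Raised when the caller's credential cannot be exchanged."""


def exchange_params(credential: str, audience: str = "scaidrive") -> dict:
    """Build token-exchange parameters for a JWT; reject sgk_ API keys."""
    if credential.startswith(API_KEY_PREFIX):
        raise SaveToNotSupported("save_to requires JWT auth; API keys cannot be exchanged")
    return {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": credential,
        "subject_token_type": JWT_TOKEN_TYPE,
        "audience": audience,  # hypothetical audience identifier for ScaiDrive
    }


def exchange_body(credential: str, audience: str = "scaidrive") -> str:
    """Form-encode the parameters as they would be posted to a token endpoint."""
    return urlencode(exchange_params(credential, audience))
```

Rejecting the key before dispatch, rather than after synthesis, means a key-authenticated worker learns that save_to is unavailable without paying for a synth call whose output cannot be stored.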

Updated 2026-05-22 14:27:32, rev 13