---
summary: How the speak endpoints, the two backends, the voice-warm registry, and the
  streaming transports fit together.
title: Architecture
path: concepts/architecture
status: published
---

ScaiSpeak is a product layer on top of ScaiGrid's existing primitives — dispatch, accounting, identity, and ScaiInfer's audio engines. There is no separate "speech service"; a synth call is a routing decision plus a dispatcher call.

## Components

At a glance: the API surface, the two backends, the voice-warm registry, and the optional ScaiDrive write path. Synchronous and async calls share the same dispatch picker; streaming sessions reuse it through the orchestrator.

```mermaid
flowchart LR
    Caller[Caller]

    subgraph ScaiGrid["ScaiGrid /v1/modules/scaispeak/..."]
        Module[Voice lib<br/>Backend policy<br/>Speak svc<br/>Stream svc<br/>Voice warm]
        Accounting[Accounting<br/>Audit / GDPR]
        Module --> Accounting
    end

    BackendA[Backend A<br/>ScaiInfer<br/>self-hosted TTS engine]
    BackendB[Backend B<br/>managed TTS relay]
    ScaiDrive[ScaiDrive<br/>save_to]

    Caller -- "POST /speak" --> Module
    Module -- "audio bytes / job_id" --> Caller
    Caller <-- "WS / WebRTC<br/>audio frames" --> Module

    Module -- gRPC --> BackendA
    Module -- HTTP --> BackendB
    Module -- HTTP --> ScaiDrive
```

There is no separate ScaiSpeak deployment. ScaiSpeak is a ScaiGrid module — it runs in the same FastAPI process, behind the same auth, accounted against the same budgets.

## Two backends, one API

Every synth call ends up on one of two backends:

- **Backend A — self-hosted.** A ScaiInfer node carrying the configured TTS engine on your GPUs. Lowest latency, most control, mandatory for some compliance postures. Picked at dispatch time by `resolve_audio_node`.
- **Backend B — relay.** A managed third-party TTS relay. No infrastructure required, pay-per-call, useful as a fallback when A isn't deployed or is saturated.

The caller never picks the backend directly. The caller's `backend_preference` is *advisory*: the actual decision is made by `BackendPolicyResolver` against the tenant's `allowed_backends` and `default_backend`. A tenant can lock to A only (sovereign-only postures), to B only (zero-infra postures), or allow both.

If the caller sends `backend_preference: "prefer_self_hosted"` and no A node has the engine loaded, the request falls through to B if B is in the allowed set, or returns 502 `SCAISPEAK_BACKEND_UNAVAILABLE` if it isn't.
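The fall-through rule above can be sketched as a small pure function. This is a minimal illustration, not the real `BackendPolicyResolver`: the `Backend` enum, the `pick_backend` signature, the `a_node_ready` flag, and the `BackendUnavailable` exception are all hypothetical names. It also assumes that B is always reachable when allowed, and that the tenant's default is tried first when no preference is sent.

```python
from enum import Enum
from typing import Optional, Set


class Backend(str, Enum):
    SELF_HOSTED = "A"
    RELAY = "B"


class BackendUnavailable(Exception):
    """Stands in for the 502 SCAISPEAK_BACKEND_UNAVAILABLE response."""


def pick_backend(
    allowed: Set[Backend],
    default: Backend,
    preference: Optional[str],
    a_node_ready: bool,
) -> Backend:
    # The preference is advisory: it only decides which candidate is tried first.
    # Assumption: with no preference, the tenant's default backend goes first.
    first = Backend.SELF_HOSTED if preference == "prefer_self_hosted" else default
    second = Backend.RELAY if first is Backend.SELF_HOSTED else Backend.SELF_HOSTED

    for candidate in (first, second):
        if candidate not in allowed:
            continue  # tenant policy always wins over the caller's wish
        if candidate is Backend.SELF_HOSTED and not a_node_ready:
            continue  # no A node has the engine loaded
        return candidate

    raise BackendUnavailable("SCAISPEAK_BACKEND_UNAVAILABLE")
```

With both backends allowed and no A node ready, `prefer_self_hosted` resolves to B; with A as the only allowed backend, the same call raises.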

## Request flow for one synth call

1. **Route handler** parses the body, resolves the `AuthUser`, and applies the permission gate.
2. **Backend policy** is resolved for the caller's tenant (`get_or_provision` — first call seeds a default).
3. **Voice visibility** check: the voice id must be `global`, the caller's tenant, or the caller's user.
4. **Backend pick.** If A is in the allowed set AND a node has the engine loaded, A is preferred (or chosen per the preference flag). Otherwise B.
5. **Dispatch.** Backend A goes via `ScaiInferDispatcher` (gRPC); on the self-hosted path, the reference clip is fetched from object storage, normalised to 16 kHz mono int16, and passed inline with the synth request (zero-shot cloning). Backend B goes via the managed-relay dispatcher (HTTP) and uses its own preset speaker set.
6. **Optional save_to.** If a `save_to` block was sent, the synth output is uploaded into the caller's ScaiDrive share via a token-exchanged JWT; the response carries the resulting `file_id`.
7. **Accounting.** Backend used, character count, and dispatch latency are recorded.

For long-form text (by default more than 500 characters, or whenever `force_async: true` is set), the handler does not continue inline: in place of step 4 it enqueues an arq job (`process_synth_job`). The async worker then runs steps 4-6 in the background, and the caller polls `GET /speak/jobs/{id}`.
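The inline-versus-async decision reduces to one predicate. The sketch below is illustrative only: `should_enqueue` and `ASYNC_CHAR_THRESHOLD` are hypothetical names, and it assumes "chars" means Python string length (code points) and that the 500 default is configurable.

```python
ASYNC_CHAR_THRESHOLD = 500  # default from the text above; assumed to be configurable


def should_enqueue(
    text: str,
    force_async: bool = False,
    threshold: int = ASYNC_CHAR_THRESHOLD,
) -> bool:
    # Enqueue when the caller forces it, or when the text is strictly
    # longer than the threshold (">500 chars" in the default case).
    return force_async or len(text) > threshold
```

Under these assumptions a text of exactly 500 characters is still served inline, while 501 characters or `force_async=True` goes to the job queue.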

## Streaming transports

Three streaming surfaces share one orchestrator (`StreamService`):

- **WebSocket** at `/stream/speak`. Best for server-side clients. Bidirectional JSON control, binary audio frames.
- **WebRTC** at `/stream/speak/webrtc/sessions/*`. Best for browsers. Signalling via REST, control via WebSocket, audio over the RTP/SRTP path negotiated by aiortc. **Caveat:** signalling and lifecycle ship end-to-end today; the audio-track decode path through aiortc's `MediaStreamTrack` raises `NotImplementedError` on first `recv`. Use WebSocket for production streaming until the audio plane lands.
- **gRPC bidi** (spec'd, not yet exposed on this module's router). Best for native applications wanting protobuf framing.

All three speak the same control vocabulary: `text`, `flush`, `interrupt`, `close`. The interrupt verb is barge-in — drop buffered audio and stop generation immediately.
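To make the four verbs concrete, here is a toy session object that reacts to each of them. It is a sketch under stated assumptions, not `StreamService`: the `{"type": ..., "text": ...}` message shape, the `StreamSession` class, and the fake synthesis step (which just encodes text to bytes) are all hypothetical, and the exact semantics of `close` are assumed to be "stop accepting messages".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StreamSession:
    pending_text: List[str] = field(default_factory=list)   # text not yet synthesised
    audio_queue: List[bytes] = field(default_factory=list)  # frames not yet sent
    closed: bool = False

    def handle(self, message: dict) -> None:
        if self.closed:
            raise RuntimeError("session is closed")
        kind = message.get("type")
        if kind == "text":
            # Buffer text until the caller asks for it to be spoken.
            self.pending_text.append(message["text"])
        elif kind == "flush":
            self._synthesise_pending()
        elif kind == "interrupt":
            # Barge-in: drop buffered audio and abandon unsynthesised text.
            self.pending_text.clear()
            self.audio_queue.clear()
        elif kind == "close":
            self.closed = True
        else:
            raise ValueError(f"unknown control message: {kind!r}")

    def _synthesise_pending(self) -> None:
        # Placeholder for a real dispatcher call: one fake "frame" per flush.
        if self.pending_text:
            self.audio_queue.append("".join(self.pending_text).encode("utf-8"))
            self.pending_text.clear()
```

The point of the sketch is the `interrupt` branch: it clears both the unsent audio and the text that has not been synthesised yet, which is what makes barge-in feel immediate to the listener.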

## Voice warming

Backend A's current zero-shot path doesn't require a separate warm step — the reference clip is shipped with each synth request and the engine clones in one round-trip. The `VoiceWarmService` and the `POST /voices/{id}/warm` endpoint remain in place from the previous-generation cloning architecture and are retained for compatibility; on the current self-hosted engine they're no-ops.

Operators who care about first-synth latency can pre-fetch the reference audio into a local cache (Phase 2 controller-side optimisation), which removes the per-call S3 round-trip without involving the engine.

## State

- **Voices, consents, licenses, provenance, audit rows** — in ScaiGrid's MariaDB.
- **Reference audio + consent recordings + license documents** — in object storage (Garage S3 under the hood), keyed by `scaispeak/voices/{voice_id}/...`. The reference clip is the cloning input — preserved permanently with the voice row.
- **Voice-warm registry** — Redis sorted set per voice; retained from the previous-generation cloning architecture and unused on the current zero-shot engine.
- **Synth jobs** — partitioned table; output blobs live in S3 referenced by `audio_uri`.

## Where the trust boundary is

The synth API authenticates the *caller*, not the voice. ACL is by visibility: the caller sees `scope='global'` voices, their tenant's voices, and their own. Promoting a private voice to tenant scope (`POST /voices/{id}/share`) requires `scaispeak:voice.share` in addition to the standard write permission. Sharing is a separate capability so that you can grant cloning without also granting promotion.

`save_to` writes go further: the synth path exchanges the caller's JWT for a ScaiDrive-audience token at the moment of dispatch, then uploads as the caller. This works only for JWT auth (`sgk_` API keys can't perform ScaiKey token exchange). Tenants who want `save_to` from key-authenticated workers issue per-worker JWTs through their own auth path.
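The visibility rule described earlier in this section can be sketched as a single predicate. The `Voice` and `Caller` dataclasses and the `can_see` function are hypothetical, and only the `'global'` scope value is confirmed by the text; the `"tenant"` and `"user"` values are assumed labels for tenant-shared and private voices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Voice:
    voice_id: str
    scope: str  # "global" (from the docs); "tenant" / "user" are assumed values
    tenant_id: Optional[str] = None
    owner_user_id: Optional[str] = None


@dataclass(frozen=True)
class Caller:
    user_id: str
    tenant_id: str


def can_see(caller: Caller, voice: Voice) -> bool:
    if voice.scope == "global":
        return True  # visible to every authenticated caller
    if voice.scope == "tenant":
        return voice.tenant_id == caller.tenant_id
    if voice.scope == "user":
        return voice.owner_user_id == caller.user_id
    return False  # unknown scope: fail closed
```

Failing closed on an unrecognised scope is a design choice of this sketch: a new scope value should never become visible by accident.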