---
summary: How the routes, transcribe and stream services, dispatchers, and the two
  backends fit together.
title: Architecture
path: concepts/architecture
status: published
---

ScaiEcho is a thin product layer over ScaiGrid's existing dispatch and inference primitives. There is no separate "echo engine" — the module's services orchestrate calls into two interchangeable dispatchers and persist the audit row. Streaming variants reuse the batch pipeline's backend selection.

## Components

```mermaid
flowchart LR
    C[Caller]
    RT["routes/transcribe.py"]
    TS[TranscribeService]
    BPR[Backend policy resolver]
    RS["routes/stream.py<br/>routes/webrtc.py"]
    DB["TenantPolicy<br/>TranscJob<br/>Speaker<br/>(MariaDB)"]
    BA["Backend A: ScaiInfer<br/>self-hosted STT engine"]
    BB["Backend B: managed STT relay"]

    C -- POST /transcribe --> RT
    RT --> TS
    TS --> BPR
    BPR --> RS
    C -- WS audio in --> RS
    RS -- transcript deltas --> C
    TS -- "gRPC bidi" --> BA
    RS -- "HTTP REST" --> BB
    BPR --- DB

    subgraph SG [ScaiGrid]
        RT
        TS
        BPR
        RS
        DB
    end

    subgraph BE [Backends]
        BA
        BB
    end
```

There's no separate ScaiEcho deployment. ScaiEcho is a ScaiGrid module — it runs in the same FastAPI process, behind the same auth, accounted against the same budgets.

## The two backends

ScaiEcho dispatches every request to one of two backends. Tenant policy decides which is allowed and which is the default.

- **Backend A — self-host.** `ScaiInferDispatcher.transcribe()` over gRPC to a ScaiInfer node that has the STT engine loaded. Available only when at least one such node is online; resolved per request via `resolve_audio_node(engine_kind=ENGINE_STT)`.
- **Backend B — relay.** Over HTTPS to a managed STT relay. Always available when the relay credentials are configured at the platform level.

Per-tenant policy lives in `mod_scaiecho_tenant_policy` with two fields: `allowed_backends` (`A`, `B`, `AB`, or `BA`) and `default_backend` (`A` or `B`). The policy resolver lazy-provisions a row from tier defaults the first time a tenant transcribes anything.

Callers can hint via `backend_preference` (`prefer_self_hosted`, `prefer_relay`, or `any`), but the resolver enforces the tenant's allowed set. A tenant pinned to A with no online STT node gets a `BACKEND_UNAVAILABLE` error; a tenant on `AB` falls through to B.
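The selection rules above can be sketched roughly as follows. This is an illustration, not the module's actual `BackendPolicyResolver.pick` implementation: the `TenantPolicy` dataclass, the `BackendUnavailable` exception, the `stt_node_online` and `relay_configured` parameters, and the fallback order (hinted or default backend first, then the rest of the allowed set) are all assumptions made for the example.

```python
from dataclasses import dataclass


class BackendUnavailable(Exception):
    """Hypothetical stand-in for the BACKEND_UNAVAILABLE error."""


@dataclass(frozen=True)
class TenantPolicy:
    # Hypothetical in-memory view of a mod_scaiecho_tenant_policy row.
    allowed_backends: str  # "A", "B", "AB", or "BA"
    default_backend: str   # "A" or "B"


_HINT_TO_BACKEND = {"prefer_self_hosted": "A", "prefer_relay": "B"}


def pick(policy, preference="any", *, stt_node_online, relay_configured=True):
    """Return "A" or "B", never stepping outside the tenant's allowed set."""
    available = {"A": stt_node_online, "B": relay_configured}
    # A caller hint only changes which backend is tried first.
    first = _HINT_TO_BACKEND.get(preference, policy.default_backend)
    # Assumed order: hinted/default backend, then the remaining allowed ones.
    candidates = [first] + [b for b in policy.allowed_backends if b != first]
    for backend in candidates:
        if backend in policy.allowed_backends and available[backend]:
            return backend
    raise BackendUnavailable("BACKEND_UNAVAILABLE")
```

Under these assumptions, a tenant on `AB` with no online STT node resolves to `B`, a tenant pinned to `A` raises, and a `prefer_self_hosted` hint from a tenant restricted to `B` is ignored rather than honoured.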

## Request flow — batch transcribe

1. **Route** (`routes/transcribe.py`) accepts the multipart upload and reads the bytes.
2. **Threshold check.** If `len(audio) > scaiecho_async_audio_threshold_bytes` (default 5 MiB) or `force_async=true`, the route stages the audio to S3, inserts a `TranscriptionJob` row at `status='queued'`, enqueues `process_transcribe_job` on arq, and returns `202`.
3. **Sync path.** Build `TranscribeService` with whichever dispatcher factories are wired (B always, A only when an STT node is online).
4. **Policy resolve.** `BackendPolicyResolver.get_or_provision(tenant_id)` returns the tenant's allowed set and default.
5. **Pick.** `BackendPolicyResolver.pick(policy, preference=...)` returns `A` or `B`.
6. **Dispatch.** Call the chosen dispatcher's `transcribe()` with the audio bytes and the language hint.
7. **Persist.** Write a `TranscriptionJob` row capturing audio sha256, bytes, duration, the transcript, the backend used, and detected language.
8. **Respond.** Wrap in the standard `success()` envelope.
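The threshold check in step 2 comes down to a single predicate. The sketch below is a minimal illustration; the function name `should_go_async` and the module-level constant are hypothetical, and only the 5 MiB default and the strict `>` comparison come from the description above.

```python
# Stand-in for the scaiecho_async_audio_threshold_bytes setting (default 5 MiB).
ASYNC_AUDIO_THRESHOLD_BYTES = 5 * 1024 * 1024


def should_go_async(audio: bytes, force_async: bool = False,
                    threshold: int = ASYNC_AUDIO_THRESHOLD_BYTES) -> bool:
    """True when the request should be staged and queued instead of run inline."""
    # Strictly greater than: a payload of exactly `threshold` bytes stays sync.
    return force_async or len(audio) > threshold
```

Note that a payload exactly at the threshold stays on the sync path, because the comparison is strict.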

## Request flow — streaming

Streaming routes wrap the same backend selection in a session-oriented pump.

1. **Auth.** The WebSocket route checks the bearer token from the query string or the header. WebRTC routes accept the same bearer on the control WS and on every signaling REST call.
2. **Open frame.** Client sends `{"type": "open", ...}`. The orchestrator builds a `StreamTranscribeService`, picks a backend, opens a dispatcher session.
3. **Ready frame.** Server sends `{"type": "ready", "backend_used": "A|B"}`.
4. **Concurrent loops.** A pump task receives binary audio frames and pushes them into `AudioInputQueue`. A drain task pulls `TranscriptDelta` records from the dispatcher and forwards each as `{"type": "delta", ...}`.
5. **Close.** Client sends `{"type": "close"}` or disconnects. The orchestrator flushes the queue, drains remaining deltas, sends `{"type": "closed"}`.

WebRTC routes carry audio over the RTP/SRTP path negotiated by aiortc instead of binary WebSocket frames; transcript deltas come back on a control WebSocket bound to the session id.
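The pump and drain loops can be pictured with a small `asyncio` simulation. Everything here is a sketch under stated assumptions: client frames are a plain list rather than a live WebSocket, `fake_dispatcher` is a hypothetical stand-in for a dispatcher session, and an `asyncio.Queue` plays the role of `AudioInputQueue`. Only the frame shapes (`ready`, `delta`, `close`, `closed`) come from the flow above.

```python
import asyncio

_EOF = object()  # sentinel marking the end of the audio stream


async def pump(frames, audio_q):
    """Push binary audio frames into the queue until a close frame arrives."""
    for frame in frames:
        if isinstance(frame, dict) and frame.get("type") == "close":
            break
        if isinstance(frame, (bytes, bytearray)):
            await audio_q.put(bytes(frame))
    await audio_q.put(_EOF)  # lets the dispatcher side finish cleanly


async def fake_dispatcher(audio_q):
    """Hypothetical dispatcher session: yields one delta per audio chunk."""
    while (chunk := await audio_q.get()) is not _EOF:
        yield {"text": f"<{len(chunk)} bytes>"}


async def drain(deltas, outbox):
    """Forward each transcript delta to the client as a delta frame."""
    async for delta in deltas:
        outbox.append({"type": "delta", **delta})


async def run_session(frames, backend="B"):
    outbox = [{"type": "ready", "backend_used": backend}]
    audio_q = asyncio.Queue(maxsize=8)  # bounded, so a slow backend applies backpressure
    # Both loops run concurrently; gather returns once the queue is fully drained.
    await asyncio.gather(pump(frames, audio_q),
                         drain(fake_dispatcher(audio_q), outbox))
    outbox.append({"type": "closed"})
    return outbox
```

Running `asyncio.run(run_session([b"abc", b"de", {"type": "close"}]))` yields a `ready` frame, one `delta` per audio chunk, then `closed`. The bounded queue is a design choice for the sketch, not something the source specifies.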

## State

- **Tenant policy** — `mod_scaiecho_tenant_policy` (MariaDB), one row per tenant.
- **Transcription jobs** — `mod_scaiecho_transcription_job` (MariaDB), every batch transcribe writes a row.
- **Speaker profiles** — `mod_scaiecho_speaker_profile` plus `mod_scaiecho_speaker_consent` (consent capture, biometric data).
- **WebRTC sessions** — `mod_scaiecho_webrtc_session` (audit only; the live peer connection lives in process memory).
- **Erasure audit** — `mod_scaiecho_erasure_audit`, immutable record of speaker deletions for Art. 17 fan-out.
- **Audio blobs** — S3 (Garage in self-hosted deployments), holding both async-job uploads and speaker reference recordings.

## How it differs from `/oai/v1/audio/transcriptions`

The OpenAI-compat endpoint is a one-shot transcribe with no tenant policy, no streaming, no speaker library. ScaiEcho adds:

| Concern | OAI compat | ScaiEcho |
|---|---|---|
| Backend selection | First wired dispatcher | Per-tenant policy with caller hint |
| Async jobs | No | Yes — over the threshold or on demand |
| WebSocket streaming | No | Yes — `/stream/transcribe` |
| WebRTC streaming | No | Yes — `/stream/transcribe/webrtc/*` |
| Speaker diarization | No | Yes — enrolled-profile attribution |
| Audit trail | Standard inference accounting | Per-job audit row plus accounting |
| MCP tool | No | Yes — `scaiecho.transcribe` |

For one-off transcription from a Whisper-style client, the compat endpoint is the easier integration. For everything else, use ScaiEcho.

## Async jobs and the worker pool

When the route layer decides to go async, the audio is staged to S3 at `scaiecho/transcribe_jobs/{job_id}.{ext}` and a `TranscriptionJob` row is inserted at `status='queued'`. The `process_transcribe_job` arq job is enqueued with the job id, the backend preference, and the temperature. The worker resolves the backend again at dispatch time, so the policy lookup happens twice: once at enqueue for the `backend_used` hint, and once at dispatch for the real decision. This matters when policy or node availability changes between enqueue and run; the worker always honours the policy in effect when it actually transcribes.

The worker writes the transcript, `backend_used`, `language_detected`, `audio_duration_ms`, and `completed_at` back to the same row. Failures move the row to `status='failed'` with `status_reason` populated. Cross-tenant or cross-user polls return 404 deliberately, to avoid leaking job existence.
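The staging key and the deliberate 404 on foreign polls can be sketched as below. The `JobRow` dataclass and both helper names are hypothetical, and returning `200` for a visible job is an assumption; the key layout and the 404 behaviour are taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobRow:
    # Hypothetical subset of a mod_scaiecho_transcription_job row.
    job_id: str
    tenant_id: str
    user_id: str
    status: str  # e.g. 'queued' or 'failed'


def staging_key(job_id: str, ext: str) -> str:
    """S3 key under which async audio is staged."""
    return f"scaiecho/transcribe_jobs/{job_id}.{ext}"


def poll_status_code(job: Optional[JobRow], tenant_id: str, user_id: str) -> int:
    """Return 404 for missing AND foreign jobs so existence is never leaked."""
    if job is None or job.tenant_id != tenant_id or job.user_id != user_id:
        return 404
    return 200  # assumed success code for a visible job
```

Collapsing "not found" and "not yours" into one response means a caller cannot probe job ids belonging to another tenant or user.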

## Speaker enrollment fan-out

Speaker profiles are biometric data. Enrollment uploads reference audio plus a consent recording, runs a quality preflight, persists the consent record alongside the profile, then fans the reference embedding out to every online `audio.analyze.pyannote` ScaiInfer node. The fan-out is best-effort — partial success is tolerated. If at least one node accepts, the speaker flips to `enrollment_status='ready'`; if none accept (or none are online), the speaker stays at `pending` and the admin UI shows the actionable state.
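A best-effort fan-out of this kind is commonly written with `asyncio.gather(..., return_exceptions=True)`. The sketch below assumes each node is represented by a hypothetical async callable that either accepts the embedding or raises; the function name `fan_out` and the node mapping are illustrative, while the `ready` and `pending` outcomes follow the rule described above.

```python
import asyncio


async def fan_out(embedding: bytes, nodes: dict):
    """Send the embedding to every node; tolerate partial failure.

    `nodes` maps node id -> async callable taking the embedding (hypothetical).
    Returns (enrollment_status, accepted_node_ids).
    """
    results = await asyncio.gather(
        *(send(embedding) for send in nodes.values()),
        return_exceptions=True,  # one failing node must not abort the others
    )
    accepted = [node_id for node_id, result in zip(nodes, results)
                if not isinstance(result, BaseException)]
    # At least one acceptance is enough to flip the speaker to 'ready'.
    status = "ready" if accepted else "pending"
    return status, accepted
```

With no online nodes the mapping is empty, `gather` returns an empty list, and the status stays `pending`, matching the "none are online" case.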

The warm registry in Redis tracks which nodes hold which speakers. `GET /speakers/{id}/warm` exposes three sets — `warm_node_ids`, `candidate_node_ids`, `stale_node_ids` — so operators can spot drift between the registry and the live cluster. `POST /speakers/{id}/warm` is the proactive re-fan-out path: stream the reference audio from S3 once, forward to every target node, register success.
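One plausible reading of the three sets is plain set arithmetic between the registry and the live cluster. The definitions below are assumptions, since the source does not define them: warm means registered and online, candidate means online but not yet holding the speaker, and stale means registered but no longer online. The function name `warm_view` is hypothetical.

```python
def warm_view(registry_node_ids, online_node_ids):
    """Derive the three node sets from registry and live-cluster membership."""
    registry, online = set(registry_node_ids), set(online_node_ids)
    return {
        "warm_node_ids": sorted(registry & online),       # assumed: registered and online
        "candidate_node_ids": sorted(online - registry),  # assumed: online, not yet warm
        "stale_node_ids": sorted(registry - online),      # assumed: registered, now offline
    }
```

Under this reading, a non-empty `stale_node_ids` indicates registry drift, and a non-empty `candidate_node_ids` lists the nodes a re-fan-out would target.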

Deletion is the GDPR Art. 17 path: blobs are deleted from S3, an immutable `ErasureAudit` row records the action, the speaker row is tombstoned, and every replica that holds the embedding gets an `EvictSpeaker` call. Existing transcripts that already attributed segments to the speaker keep the labels they had; transcripts are not retroactively edited.
