Architecture

ScaiEcho is a thin product layer over ScaiGrid's existing dispatch and inference primitives. There is no separate "echo engine" — the module's services orchestrate calls into two interchangeable dispatchers and persist the audit row. Streaming variants reuse the batch pipeline's backend selection.

Components

flowchart LR
    C[Caller]
    RT["routes/transcribe.py"]
    TS[TranscribeService]
    BPR[Backend policy resolver]
    RS["routes/stream.py routes/webrtc.py"]
    DB["TenantPolicy TranscJob Speaker (MariaDB)"]
    BA["Backend A: ScaiInfer self-hosted STT engine"]
    BB["Backend B: managed STT relay"]
    C -- POST /transcribe --> RT
    RT --> TS
    TS --> BPR
    BPR --> RS
    C -- WS audio in --> RS
    RS -- transcript deltas --> C
    TS -- "gRPC bidi" --> BA
    RS -- "HTTP REST" --> BB
    BPR --- DB
    subgraph SG [ScaiGrid]
        RT
        TS
        BPR
        RS
        DB
    end
    subgraph BE [Backends]
        BA
        BB
    end

There's no separate ScaiEcho deployment. ScaiEcho is a ScaiGrid module — it runs in the same FastAPI process, behind the same auth, accounted against the same budgets.

The two backends

ScaiEcho dispatches every request to one of two backends. Tenant policy decides which is allowed and which is the default.

Backend A: self-hosted. ScaiInferDispatcher.transcribe() calls, over gRPC, a ScaiInfer node that has the STT engine loaded. Available only when at least one such node is online; the node is resolved per request via resolve_audio_node(engine_kind=ENGINE_STT).
Backend B: relay. Requests go over HTTPS to a managed STT relay. Available whenever the relay credentials are configured at the platform level.

Per-tenant policy lives in mod_scaiecho_tenant_policy with two fields: allowed_backends (A, B, AB, or BA) and default_backend (A or B). The policy resolver lazily provisions a row from tier defaults the first time a tenant transcribes anything.

Callers can hint via backend_preference (prefer_self_hosted, prefer_relay, or any), but the resolver enforces the tenant's allowed set. A tenant pinned to A with no online STT node gets a BACKEND_UNAVAILABLE error; a tenant on AB falls through to B.
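The selection rule above can be sketched in a few lines. This is a minimal illustration, not the module's actual resolver: the `TenantPolicy` dataclass, the `BackendUnavailable` exception, the `stt_node_online` and `relay_configured` flags, and the candidate ordering (caller hint, then tenant default, then the rest of the allowed set) are all assumptions made for the example.

```python
from dataclasses import dataclass


class BackendUnavailable(Exception):
    """Hypothetical stand-in for the BACKEND_UNAVAILABLE error."""


@dataclass(frozen=True)
class TenantPolicy:
    allowed_backends: str  # "A", "B", "AB", or "BA"
    default_backend: str   # "A" or "B"


# Maps the caller hint to a backend; "any" has no entry and defers to policy.
HINTS = {"prefer_self_hosted": "A", "prefer_relay": "B"}


def pick(policy, preference="any", stt_node_online=True, relay_configured=True):
    """Return "A" or "B", or raise if no allowed backend is available."""
    allowed = set(policy.allowed_backends)
    available = {"A": stt_node_online, "B": relay_configured}
    # Assumed ordering: caller hint, then tenant default, then the allowed set.
    candidates = [HINTS.get(preference), policy.default_backend, *policy.allowed_backends]
    for backend in candidates:
        # The hint is advisory: anything outside the allowed set is skipped.
        if backend in allowed and available[backend]:
            return backend
    raise BackendUnavailable("no allowed backend is currently available")
```

Under these assumptions, a tenant on AB with no online STT node resolves to B, while a tenant pinned to A in the same situation raises `BackendUnavailable`. A `prefer_relay` hint from a tenant pinned to A is ignored rather than rejected.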

Request flow: batch transcribe

1. Route. routes/transcribe.py accepts the multipart upload and reads the bytes.
2. Threshold check. If len(audio) > scaiecho_async_audio_threshold_bytes (default 5 MiB) or force_async=true, the route stages audio to S3, inserts a TranscriptionJob row at status='queued', enqueues process_transcribe_job on arq, and returns 202.
3. Sync path. Build TranscribeService with whichever dispatcher factories are wired (B always, A only when an STT node is online).
4. Policy resolve. BackendPolicyResolver.get_or_provision(tenant_id) returns the tenant's allowed set and default.
5. Pick. BackendPolicyResolver.pick(policy, preference=...) returns A or B.
6. Dispatch. Call the chosen dispatcher's transcribe() with the audio bytes and the language hint.
7. Persist. Write a TranscriptionJob row capturing audio sha256, bytes, duration, the transcript, the backend used, and detected language.
8. Respond. Wrap in the standard success() envelope.
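The threshold check and the persisted audit fields from the steps above might look roughly like this. It is a sketch under stated assumptions: the function names and the `audio_sha256` and `audio_bytes` keys are hypothetical, and only the 5 MiB default, the `force_async` flag, and the captured fields come from the description.

```python
import hashlib

# Documented default for scaiecho_async_audio_threshold_bytes: 5 MiB.
DEFAULT_ASYNC_THRESHOLD_BYTES = 5 * 1024 * 1024


def should_go_async(audio, force_async=False,
                    threshold_bytes=DEFAULT_ASYNC_THRESHOLD_BYTES):
    """True when the request should be staged to S3 and queued (202 path)."""
    # Strictly greater than: audio exactly at the threshold stays synchronous.
    return force_async or len(audio) > threshold_bytes


def build_audit_row(audio, transcript, backend_used, language_detected,
                    audio_duration_ms):
    """Collect the fields the Persist step records (key names are assumed)."""
    return {
        "audio_sha256": hashlib.sha256(audio).hexdigest(),
        "audio_bytes": len(audio),
        "audio_duration_ms": audio_duration_ms,
        "transcript": transcript,
        "backend_used": backend_used,
        "language_detected": language_detected,
    }
```

Hashing the raw bytes gives the audit row a stable identity for the audio without the row having to hold the audio itself.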

Request flow: streaming

Streaming routes wrap the same backend selection in a session-oriented pump.

1. Auth. WebSocket bearer-from-query-or-header check. WebRTC routes accept the same bearer on the control WS and on every signaling REST call.
2. Open frame. Client sends {"type": "open", ...}. The orchestrator builds a StreamTranscribeService, picks a backend, opens a dispatcher session.
3. Ready frame. Server sends {"type": "ready", "backend_used": "A|B"}.
4. Concurrent loops. A pump task receives binary audio frames and pushes them into AudioInputQueue. A drain task pulls TranscriptDelta records from the dispatcher and forwards each as {"type": "delta", ...}.
5. Close. Client sends {"type": "close"} or disconnects. The orchestrator flushes the queue, drains remaining deltas, sends {"type": "closed"}.

WebRTC routes carry audio over the RTP/SRTP path negotiated by aiortc instead of binary WebSocket frames; transcript deltas come back on a control WebSocket bound to the session id.
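The WebSocket session lifecycle (open, ready, concurrent pump and drain, close) can be illustrated with plain asyncio. This is a simplified sketch, not the orchestrator's code: `run_session`, the `fake_backend` task standing in for a dispatcher session, the two `asyncio.Queue` objects, and the `audio_bytes` delta field are all hypothetical.

```python
import asyncio


async def run_session(frames, send, backend_used="B"):
    """frames: async iterator of client frames (dict = control, bytes = audio).
    send: async callable that delivers one frame back to the client."""
    it = frames.__aiter__()
    opening = await it.__anext__()
    if not isinstance(opening, dict) or opening.get("type") != "open":
        raise ValueError("expected an open frame first")
    await send({"type": "ready", "backend_used": backend_used})

    audio_queue = asyncio.Queue()  # stands in for AudioInputQueue
    delta_queue = asyncio.Queue()  # deltas produced by the backend session

    async def pump():
        # Push binary audio into the queue until a close frame or disconnect.
        async for frame in it:
            if isinstance(frame, bytes):
                await audio_queue.put(frame)
            elif frame.get("type") == "close":
                break
        await audio_queue.put(None)  # end-of-audio marker

    async def fake_backend():
        # Hypothetical dispatcher session: emits one delta per audio chunk.
        while (chunk := await audio_queue.get()) is not None:
            await delta_queue.put({"type": "delta", "audio_bytes": len(chunk)})
        await delta_queue.put(None)  # no more deltas

    async def drain():
        # Forward every remaining delta before the session is closed.
        while (delta := await delta_queue.get()) is not None:
            await send(delta)

    await asyncio.gather(pump(), fake_backend(), drain())
    await send({"type": "closed"})
```

Running the pump and drain as separate tasks lets deltas flow back while audio is still arriving. The sentinel values make sure queued audio is flushed and remaining deltas are drained before the closed frame is sent.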

State

Tenant policy — mod_scaiecho_tenant_policy (MariaDB), one row per tenant.
Transcription jobs — mod_scaiecho_transcription_job (MariaDB), every batch transcribe writes a row.
Speaker profiles — mod_scaiecho_speaker_profile plus mod_scaiecho_speaker_consent (consent capture, biometric data).
WebRTC sessions — mod_scaiecho_webrtc_session (audit only; the live peer connection lives in process memory).
Erasure audit — mod_scaiecho_erasure_audit, immutable record of speaker deletions for Art. 17 fan-out.
Audio blobs — S3 (Garage in self-hosted deployments). Both async-job uploads and speaker reference recordings.

How it differs from `/oai/v1/audio/transcriptions`

The OpenAI-compat endpoint is a one-shot transcribe with no tenant policy, no streaming, no speaker library. ScaiEcho adds:

| Concern | OAI compat | ScaiEcho |
| --- | --- | --- |
| Backend selection | First wired dispatcher | Per-tenant policy with caller hint |
| Async jobs | No | Yes, over the threshold or on demand |
| WebSocket streaming | No | Yes, `/stream/transcribe` |
| WebRTC streaming | No | Yes, `/stream/transcribe/webrtc/*` |
| Speaker diarization | No | Yes, enrolled-profile attribution |
| Audit trail | Standard inference accounting | Per-job audit row plus accounting |
| MCP tool | No | Yes, `scaiecho.transcribe` |

For one-off transcription from a Whisper-style client, the compat endpoint is the easier integration. For everything else, use ScaiEcho.

Async jobs and the worker pool

When the route layer decides to go async, the audio is staged to S3 at scaiecho/transcribe_jobs/{job_id}.{ext} and a TranscriptionJob row is inserted at status='queued'. The process_transcribe_job arq job is enqueued with the job id, the backend preference, and the temperature. The worker resolves the backend again at dispatch time, so the policy lookup happens twice: once at enqueue for the backend_used hint, and once at dispatch for the real decision. This matters when policy or node availability changes between enqueue and run; the worker always honours the policy in effect when it actually transcribes.

The worker writes the transcript, backend_used, language_detected, audio_duration_ms, and completed_at back to the same row. Failures move the row to status='failed' with status_reason populated. Cross-tenant or cross-user polls return 404 deliberately, to avoid leaking job existence.
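The row lifecycle and the deliberate 404 on foreign polls can be sketched as follows. The `Job` dataclass, `run_job`, and `poll` are hypothetical names, and the `'completed'` status value is an assumption; the text only names `'queued'` and `'failed'`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Job:
    id: str
    tenant_id: str
    user_id: str
    status: str = "queued"
    transcript: Optional[str] = None
    status_reason: Optional[str] = None
    completed_at: Optional[datetime] = None


def run_job(job, transcribe):
    """Run the transcription callable and write the outcome back to the row."""
    try:
        job.transcript = transcribe()
        job.status = "completed"  # assumed name for the success state
        job.completed_at = datetime.now(timezone.utc)
    except Exception as exc:
        job.status = "failed"
        job.status_reason = str(exc)


def poll(jobs, job_id, tenant_id, user_id):
    """Return (http_status, job). Foreign and missing jobs look identical."""
    job = jobs.get(job_id)
    if job is None or (job.tenant_id, job.user_id) != (tenant_id, user_id):
        return 404, None  # never 403, so job existence is not revealed
    return 200, job
```

Returning the same 404 for a missing job and for a job owned by someone else means a caller cannot probe job ids to learn which ones exist.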

Speaker enrollment fan-out

Speaker profiles are biometric data. Enrollment uploads reference audio plus a consent recording, runs a quality preflight, persists the consent record alongside the profile, then fans the reference embedding out to every online audio.analyze.pyannote ScaiInfer node. The fan-out is best-effort — partial success is tolerated. If at least one node accepts, the speaker flips to enrollment_status='ready'; if none accept (or none are online), the speaker stays at pending and the admin UI shows the actionable state.
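The best-effort rule (ready if at least one node accepts, pending otherwise) can be sketched with `asyncio.gather`. This is an illustration only: `fan_out` and the mapping of node ids to async push callables are hypothetical, and the real transport to ScaiInfer nodes is not shown.

```python
import asyncio


async def fan_out(embedding, nodes):
    """nodes: mapping of node id -> async callable that accepts the embedding.
    Returns (enrollment_status, accepted_node_ids)."""
    if not nodes:
        return "pending", []  # no node online: nothing to fan out to
    # return_exceptions=True keeps one failing node from aborting the rest.
    results = await asyncio.gather(
        *(push(embedding) for push in nodes.values()),
        return_exceptions=True,
    )
    accepted = [
        node_id
        for node_id, result in zip(nodes, results)
        if not isinstance(result, BaseException)
    ]
    return ("ready" if accepted else "pending"), accepted
```

Collecting exceptions instead of raising them is what makes partial success tolerable here: one rejected push is recorded as a miss for that node and does not affect the others.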

The warm registry in Redis tracks which nodes hold which speakers. GET /speakers/{id}/warm exposes three sets — warm_node_ids, candidate_node_ids, stale_node_ids — so operators can spot drift between the registry and the live cluster. POST /speakers/{id}/warm is the proactive re-fan-out path: stream the reference audio from S3 once, forward to every target node, register success.
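One plausible reading of the three sets is plain set arithmetic between the registry and the live cluster. The definitions below are assumptions, since the text does not spell them out: warm means registered and online, candidate means online but not registered, and stale means registered but no longer online. The `warm_view` function is hypothetical.

```python
def warm_view(registry_node_ids, online_node_ids):
    """Compare the nodes the registry lists for a speaker with the live nodes."""
    registry = set(registry_node_ids)
    online = set(online_node_ids)
    return {
        # Assumed: registered and currently online.
        "warm_node_ids": sorted(registry & online),
        # Assumed: online, but not yet holding the speaker.
        "candidate_node_ids": sorted(online - registry),
        # Assumed: registered, but the node is no longer online.
        "stale_node_ids": sorted(registry - online),
    }
```

Under this reading, a non-empty stale set signals registry drift, and a non-empty candidate set lists the nodes a re-fan-out would target.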

Deletion is the GDPR Art. 17 path: blobs are deleted from S3, an immutable ErasureAudit row records the action, the speaker row is tombstoned, and every replica that holds the embedding gets an EvictSpeaker call. Existing transcripts that already attributed segments to the speaker keep the labels they had; transcripts are not retroactively edited.