Architecture
ScaiEcho is a thin product layer over ScaiGrid's existing dispatch and inference primitives. There is no separate "echo engine" — the module's services orchestrate calls into two interchangeable dispatchers and persist the audit row. Streaming variants reuse the batch pipeline's backend selection.
Components
There's no separate ScaiEcho deployment. ScaiEcho is a ScaiGrid module — it runs in the same FastAPI process, behind the same auth, accounted against the same budgets.
The two backends
ScaiEcho dispatches every request to one of two backends. Tenant policy decides which is allowed and which is the default.
- Backend A: self-host. ScaiInferDispatcher.transcribe() over gRPC to a ScaiInfer node that has the STT engine loaded. Available only when at least one such node is online; resolved per request via resolve_audio_node(engine_kind=ENGINE_STT).
- Backend B: relay. Over HTTPS to a managed STT relay. Always available when the relay credentials are configured at the platform level.
Per-tenant policy lives in mod_scaiecho_tenant_policy, which stores allowed_backends (A, B, AB, or BA) and default_backend (A or B). The policy resolver lazily provisions a row from tier defaults the first time a tenant transcribes anything.
Callers can hint via backend_preference (prefer_self_hosted, prefer_relay, or any), but the resolver enforces the tenant's allowed set. A tenant pinned to A with no online STT node gets a BACKEND_UNAVAILABLE error; a tenant on AB falls through to B.
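The selection rule above can be sketched in a few lines. This is a minimal illustration, not the module's actual resolver: the TenantPolicy dataclass, the BackendUnavailable exception, and the a_online / b_configured flags are hypothetical names, and the fallback order (caller hint first, then the remaining backends in the order they appear in allowed_backends) is an assumption about how AB and BA differ.

```python
from dataclasses import dataclass


class BackendUnavailable(Exception):
    """Hypothetical stand-in for the BACKEND_UNAVAILABLE error."""


@dataclass(frozen=True)
class TenantPolicy:
    allowed_backends: str  # "A", "B", "AB", or "BA"
    default_backend: str   # "A" or "B"


def pick(policy: TenantPolicy, preference: str = "any", *,
         a_online: bool, b_configured: bool = True) -> str:
    # Assumption: the caller hint only decides which backend is tried first;
    # "any" (or an unknown hint) starts from the tenant default.
    hinted = {"prefer_self_hosted": "A", "prefer_relay": "B"}
    first = hinted.get(preference, policy.default_backend)
    # Assumption: the rest are tried in the order listed in allowed_backends.
    order = [first] + [b for b in policy.allowed_backends if b != first]
    available = {"A": a_online, "B": b_configured}
    for backend in order:
        # The allowed set is enforced even when the hint asks for something else.
        if backend in policy.allowed_backends and available[backend]:
            return backend
    raise BackendUnavailable("no allowed backend is currently available")
```

With this sketch, a tenant on AB with no STT node online resolves to B, while a tenant pinned to A raises the error instead of silently falling through to the relay.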
Request flow: batch transcribe
- Route (routes/transcribe.py) accepts the multipart upload and reads the bytes.
- Threshold check. If len(audio) > scaiecho_async_audio_threshold_bytes (default 5 MiB) or force_async=true, the route stages audio to S3, inserts a TranscriptionJob row at status='queued', enqueues process_transcribe_job on arq, and returns 202.
- Sync path. Build TranscribeService with whichever dispatcher factories are wired (B always, A only when an STT node is online).
- Policy resolve. BackendPolicyResolver.get_or_provision(tenant_id) returns the tenant's allowed set and default.
- Pick. BackendPolicyResolver.pick(policy, preference=...) returns A or B.
- Dispatch. Call the chosen dispatcher's transcribe() with the audio bytes and the language hint.
- Persist. Write a TranscriptionJob row capturing audio sha256, bytes, duration, the transcript, the backend used, and detected language.
- Respond. Wrap in the standard success() envelope.
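The threshold check and the audit fields captured at the persist step can be sketched as follows. This is an illustration under stated assumptions: the function names and the dictionary keys are hypothetical, and only the 5 MiB default, the strict greater-than comparison, and the list of captured values come from the flow above.

```python
import hashlib

# Default for scaiecho_async_audio_threshold_bytes (5 MiB).
ASYNC_THRESHOLD_BYTES = 5 * 1024 * 1024


def goes_async(audio: bytes, force_async: bool = False,
               threshold: int = ASYNC_THRESHOLD_BYTES) -> bool:
    # Strictly greater than: an upload of exactly the threshold stays on the sync path.
    return force_async or len(audio) > threshold


def audit_fields(audio: bytes, transcript: str, backend_used: str,
                 language_detected: str, audio_duration_ms: int) -> dict:
    # Hypothetical column names for the values the persist step records.
    return {
        "audio_sha256": hashlib.sha256(audio).hexdigest(),
        "audio_bytes": len(audio),
        "audio_duration_ms": audio_duration_ms,
        "transcript": transcript,
        "backend_used": backend_used,
        "language_detected": language_detected,
    }
```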
Request flow: streaming
Streaming routes wrap the same backend selection in a session-oriented pump.
- Auth. WebSocket bearer-from-query-or-header check. WebRTC routes accept the same bearer on the control WS and on every signaling REST call.
- Open frame. Client sends {"type": "open", ...}. The orchestrator builds a StreamTranscribeService, picks a backend, and opens a dispatcher session.
- Ready frame. Server sends {"type": "ready", "backend_used": "A|B"}.
- Concurrent loops. A pump task receives binary audio frames and pushes them into AudioInputQueue. A drain task pulls TranscriptDelta records from the dispatcher and forwards each as {"type": "delta", ...}.
- Close. Client sends {"type": "close"} or disconnects. The orchestrator flushes the queue, drains remaining deltas, and sends {"type": "closed"}.
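The frame sequence and the two concurrent loops can be modelled with plain asyncio queues. The sketch below is not the orchestrator itself: run_session, the in-memory frame list, and the fake dispatcher that emits one delta per audio frame are all hypothetical stand-ins, used only to show how pump and drain cooperate and how close flushes the remaining deltas.

```python
import asyncio
import json

_END = object()  # sentinel marking the end of a stream


async def run_session(client_frames, backend="B"):
    """Drive one session over an in-memory list of client frames.

    str frames are JSON control messages, bytes frames are audio.
    Returns the JSON frames the server would send, in order.
    """
    sent = []
    frames = iter(client_frames)
    if json.loads(next(frames)).get("type") != "open":
        raise ValueError("first frame must be an open frame")
    sent.append({"type": "ready", "backend_used": backend})

    audio_queue = asyncio.Queue()  # plays the role of AudioInputQueue
    delta_queue = asyncio.Queue()  # deltas produced by the fake dispatcher

    async def pump():
        for frame in frames:
            if isinstance(frame, bytes):
                await audio_queue.put(frame)
            elif json.loads(frame).get("type") == "close":
                break
        # Reached on an explicit close and on disconnect (frames exhausted).
        await audio_queue.put(_END)

    async def fake_dispatcher():
        # Hypothetical backend: one delta per audio frame.
        while (chunk := await audio_queue.get()) is not _END:
            await delta_queue.put({"text": f"<{len(chunk)} bytes>"})
        await delta_queue.put(_END)

    async def drain():
        while (delta := await delta_queue.get()) is not _END:
            sent.append({"type": "delta", **delta})

    await asyncio.gather(pump(), fake_dispatcher(), drain())
    sent.append({"type": "closed"})
    return sent
```

Running it with an open frame, two audio frames, and a close frame yields a ready frame, two delta frames, and a closed frame, in that order.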
WebRTC routes carry audio over the RTP/SRTP path negotiated by aiortc instead of binary WebSocket frames; transcript deltas come back on a control WebSocket bound to the session id.
State
- Tenant policy: mod_scaiecho_tenant_policy (MariaDB), one row per tenant.
- Transcription jobs: mod_scaiecho_transcription_job (MariaDB), every batch transcribe writes a row.
- Speaker profiles: mod_scaiecho_speaker_profile plus mod_scaiecho_speaker_consent (consent capture, biometric data).
- WebRTC sessions: mod_scaiecho_webrtc_session (audit only; the live peer connection lives in process memory).
- Erasure audit: mod_scaiecho_erasure_audit, immutable record of speaker deletions for Art. 17 fan-out.
- Audio blobs: S3 (Garage in self-hosted deployments), for both async-job uploads and speaker reference recordings.
How it differs from /oai/v1/audio/transcriptions
The OpenAI-compat endpoint is a one-shot transcribe with no tenant policy, no streaming, no speaker library. ScaiEcho adds:
| Concern | OAI compat | ScaiEcho |
|---|---|---|
| Backend selection | First wired dispatcher | Per-tenant policy with caller hint |
| Async jobs | No | Yes — over the threshold or on demand |
| WebSocket streaming | No | Yes — /stream/transcribe |
| WebRTC streaming | No | Yes — /stream/transcribe/webrtc/* |
| Speaker diarization | No | Yes — enrolled-profile attribution |
| Audit trail | Standard inference accounting | Per-job audit row plus accounting |
| MCP tool | No | Yes — scaiecho.transcribe |
For one-off transcription from a Whisper-style client, the compat endpoint is the easier integration. For everything else, use ScaiEcho.
Async jobs and the worker pool
When the route layer decides to go async, the audio is staged to S3 at scaiecho/transcribe_jobs/{job_id}.{ext} and a TranscriptionJob row is inserted at status='queued'. The process_transcribe_job arq job is enqueued with the job id, the backend preference, and the temperature. The worker resolves the backend at dispatch time — that is, the policy lookup happens twice (once at enqueue for the backend_used hint, once at dispatch for the real decision). This matters when policy or node availability changes between enqueue and run; the worker always honours the policy in effect when it actually transcribes.
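The double lookup can be illustrated with a small sketch. Everything here beyond the S3 key layout, the queued status, and the three job arguments is an assumption: the function names, the row and argument dictionaries, the injected resolve_backend and transcribe callables, and the 'completed' status value are hypothetical.

```python
def staged_audio_key(job_id: str, ext: str) -> str:
    # Key layout from the text: scaiecho/transcribe_jobs/{job_id}.{ext}
    return f"scaiecho/transcribe_jobs/{job_id}.{ext}"


def enqueue(job_id, ext, backend_preference, temperature, resolve_backend):
    row = {
        "id": job_id,
        "status": "queued",
        "s3_key": staged_audio_key(job_id, ext),
        # First lookup: recorded as a hint only.
        "backend_used": resolve_backend(backend_preference),
    }
    job_args = {
        "job_id": job_id,
        "backend_preference": backend_preference,
        "temperature": temperature,
    }
    return row, job_args


def process(row, job_args, resolve_backend, transcribe):
    # Second lookup: the decision that actually applies.
    backend = resolve_backend(job_args["backend_preference"])
    transcript = transcribe(backend, row["s3_key"], job_args["temperature"])
    # "completed" is an assumed status name.
    return {**row, "status": "completed",
            "backend_used": backend, "transcript": transcript}
```

If an STT node goes offline between enqueue and run, the row's backend_used hint may say A while the finished row records B, which is the behaviour the paragraph above describes.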
The worker writes the transcript, backend_used, language_detected, audio_duration_ms, and completed_at back to the same row. Failures move the row to status='failed' with status_reason populated. Cross-tenant or cross-user polls return 404 deliberately, to avoid leaking job existence.
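The polling rule can be sketched as a single lookup that treats "does not exist" and "belongs to someone else" identically. The Job dataclass, the NotFound exception, and the in-memory jobs dictionary are hypothetical; the real route reads from mod_scaiecho_transcription_job.

```python
from dataclasses import dataclass


class NotFound(Exception):
    status_code = 404


@dataclass
class Job:
    id: str
    tenant_id: str
    user_id: str
    status: str


def get_job_for_caller(jobs: dict, job_id: str, tenant_id: str, user_id: str) -> Job:
    job = jobs.get(job_id)
    # Same error for a missing job and for a job owned by another tenant or
    # user, so the response never reveals whether the id exists.
    if job is None or job.tenant_id != tenant_id or job.user_id != user_id:
        raise NotFound("job not found")
    return job
```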
Speaker enrollment fan-out
Speaker profiles are biometric data. Enrollment uploads reference audio plus a consent recording, runs a quality preflight, persists the consent record alongside the profile, then fans the reference embedding out to every online audio.analyze.pyannote ScaiInfer node. The fan-out is best-effort — partial success is tolerated. If at least one node accepts, the speaker flips to enrollment_status='ready'; if none accept (or none are online), the speaker stays at pending and the admin UI shows the actionable state.
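The best-effort rule reduces to "ready if at least one node accepted, otherwise pending". A minimal sketch, assuming a hypothetical push callable that raises when a node rejects the embedding or cannot be reached:

```python
def fan_out(node_ids, push):
    """Push the reference embedding to every node; tolerate partial failure."""
    accepted = []
    for node_id in node_ids:
        try:
            push(node_id)  # hypothetical per-node call
        except Exception:
            continue  # best-effort: one failing node does not abort the rest
        accepted.append(node_id)
    # No online nodes and no accepting nodes both leave the speaker pending.
    status = "ready" if accepted else "pending"
    return status, accepted
```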
The warm registry in Redis tracks which nodes hold which speakers. GET /speakers/{id}/warm exposes three sets — warm_node_ids, candidate_node_ids, stale_node_ids — so operators can spot drift between the registry and the live cluster. POST /speakers/{id}/warm is the proactive re-fan-out path: stream the reference audio from S3 once, forward to every target node, register success.
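One plausible reading of the three sets is plain set arithmetic between the registry and the live cluster. The definitions below are an assumption, not taken from the source: warm means registered and online, candidate means online but not registered, and stale means registered but no longer online. The warm_view helper is hypothetical.

```python
def warm_view(registry_node_ids, online_node_ids):
    registry, online = set(registry_node_ids), set(online_node_ids)
    return {
        # Assumed: nodes the registry lists that are still online.
        "warm_node_ids": sorted(registry & online),
        # Assumed: online nodes that do not hold the speaker yet.
        "candidate_node_ids": sorted(online - registry),
        # Assumed: registry entries pointing at nodes that are gone.
        "stale_node_ids": sorted(registry - online),
    }
```

Under this reading, a non-empty stale_node_ids set is the drift signal, and candidate_node_ids is the natural target list for POST /speakers/{id}/warm.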
Deletion is the GDPR Art. 17 path: blobs go from S3, an immutable ErasureAudit row records the action, the speaker row is tombstoned, and every replica that holds the embedding gets an EvictSpeaker call. Existing transcripts that already attributed segments to the speaker keep the labels they had — transcripts are not retroactively edited.