Changelog

User-visible changes only. Internal refactors and infrastructure work omitted.

v0.x — Phase rollout

ScaiSpeak ships in phases. Each phase adds endpoints and capabilities; the module ID and URL prefix have been stable since Phase 0.

Phase 1 — Voice library. List / get / clone / update / delete voices. Preflight checks on intake. Consent capture. ScaiDrive references for the reference and consent audio. Permissions split into synthesize, voice.read, voice.write, voice.share, admin.
Phase 2 — Batch synth. POST /speak with Backend B (managed TTS relay) wired. Tenant backend policy at /admin/policy. Voice preview endpoint.
Phase 2B — Self-host backend. Backend A (ScaiInfer-hosted TTS engine) added behind the same /speak path. Backend policy picks per-tenant.
Phase 3 — Voice warming. voice_prefix_tokens from the previous-generation cloning pipeline. Warm / evict / repromote endpoints. Redis-backed warm registry. Superseded 2026-05-22 by the zero-shot cloning engine; the endpoints remain for compatibility but are no-ops on the new engine.
Phase 4 — WebSocket streaming. WS /stream/speak with the text/flush/interrupt/close vocabulary. Opus + PCM output codecs.
Phase 5 — WebRTC. Session lifecycle at /stream/speak/webrtc/sessions/* plus control WebSocket. Requires aiortc + av in the deployment.
Phase 6 — Async long-form. POST /speak returns 202 + job_id for text over the threshold. GET /speak/jobs/{id} for polling. Caller can force the path with force_async.
Phase 7 — GDPR + safety. Erasure pipeline with audit rows. Blocklist endpoints. Lifecycle hooks (install / upgrade / uninstall / tenant enable / disable) wired into the erasure worker.
2026-05-13 — save_to ScaiDrive. POST /speak accepts a save_to block; sync + async paths upload to the caller's ScaiDrive share via token exchange. Synth admin page at /admin/scaispeak/synthesise ships with the ScaiDrive folder picker and localStorage presets. Global voices: POST /admin/voices/global + DELETE /admin/voices/global/{id}, SuperAdmin-only, licensed rather than consent-based.
2026-05-22 — Zero-shot cloning engine. Self-hosted cloning is now zero-shot: the reference clip is consumed at synth time directly, no separate training step. New voices land at embedding_status: ready immediately after intake clears preflight. Three new optional fields on POST /speak (instructions, cfg_value, warmup_trim_ms) let callers tune per-call delivery for cloned voices. Output sample rate is now 48 kHz on the self-hosted path, up from 24 kHz. The warm / repromote endpoints stay in place as no-ops for compatibility.
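
Example (illustrative only). The sketch below shows one way a caller might combine the Phase 6 async path with the optional per-call fields added on 2026-05-22. The field names instructions, cfg_value, warmup_trim_ms, force_async and job_id come from the entries above. Everything else is an assumption made for the example and is not confirmed by this changelog: the text and voice_id body fields, the status values "queued" and "running", and the helper names build_speak_payload and wait_for_job. The HTTP call is passed in as a function so the sketch stays independent of any particular client library.

```python
import time
from typing import Any, Callable, Dict, Optional


def build_speak_payload(
    text: str,
    voice_id: str,
    *,
    instructions: Optional[str] = None,
    cfg_value: Optional[float] = None,
    warmup_trim_ms: Optional[int] = None,
    force_async: bool = False,
) -> Dict[str, Any]:
    """Build a POST /speak body, leaving out optional fields the caller did not set."""
    # "text" and "voice_id" are assumed field names, not taken from the changelog.
    payload: Dict[str, Any] = {"text": text, "voice_id": voice_id}
    optional = {
        "instructions": instructions,
        "cfg_value": cfg_value,
        "warmup_trim_ms": warmup_trim_ms,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    if force_async:
        payload["force_async"] = True
    return payload


def wait_for_job(
    get_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    *,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll GET /speak/jobs/{id} via get_job until the job leaves a pending state."""
    pending = {"queued", "running"}  # assumed status values
    waited = 0.0
    while True:
        job = get_job(job_id)
        if job.get("status") not in pending:
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still pending after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

A caller would send the built payload to POST /speak, and on a 202 response pass the returned job_id to wait_for_job together with a get_job function that wraps GET /speak/jobs/{id}. Omitting unset optional fields keeps the request valid against deployments that predate the 2026-05-22 release.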