Changelog
User-visible changes only. Internal refactors and infrastructure work omitted.
v0.x — Phase rollout
ScaiSpeak ships in phases. Each phase adds endpoints and capabilities; the module ID and URL prefix have been stable since Phase 0.
- Phase 1 — Voice library. List / get / clone / update / delete voices. Preflight checks on intake. Consent capture. ScaiDrive references for reference + consent audio. Permissions split into `synthesize`, `voice.read`, `voice.write`, `voice.share`, `admin`.
- Phase 2 — Batch synth. `POST /speak` with Backend B (managed TTS relay) wired. Tenant backend policy at `/admin/policy`. Voice preview endpoint.
- Phase 2B — Self-host backend. Backend A (ScaiInfer-hosted TTS engine) added behind the same `/speak` path. Backend policy picks per-tenant.
- Phase 3 — Voice warming. `voice_prefix_tokens` from the previous-generation cloning pipeline. Warm / evict / repromote endpoints. Redis-backed warm registry. Superseded 2026-05-22 by the zero-shot cloning engine; the endpoints remain for compatibility but are no-ops on the new engine.
- Phase 4 — WebSocket streaming. `WS /stream/speak` with the text/flush/interrupt/close vocabulary. Opus + PCM output codecs.
- Phase 5 — WebRTC. Session lifecycle at `/stream/speak/webrtc/sessions/*` plus control WebSocket. Requires `aiortc` + `av` in the deployment.
- Phase 6 — Async long-form. `POST /speak` returns `202` + `job_id` for text over the threshold. `GET /speak/jobs/{id}` for polling. Caller can force the path with `force_async`.
- Phase 7 — GDPR + safety. Erasure pipeline with audit rows. Blocklist endpoints. Lifecycle hooks (install / upgrade / uninstall / tenant enable / disable) wired into the erasure worker.
- 2026-05-13 — `save_to` ScaiDrive. `POST /speak` accepts a `save_to` block; sync + async paths upload to the caller's ScaiDrive share via token exchange. Synth admin page at `/admin/scaispeak/synthesise` ships with the ScaiDrive folder picker and localStorage presets. Global voices: `POST /admin/voices/global` + `DELETE /admin/voices/global/{id}`, SuperAdmin-only, licensed-not-consent-based.
- 2026-05-22 — Zero-shot cloning engine. Self-hosted cloning is now zero-shot: the reference clip is consumed at synth time directly, no separate training step. New voices land at `embedding_status: ready` immediately after intake clears preflight. Three new optional fields on `POST /speak` (`instructions`, `cfg_value`, `warmup_trim_ms`) let callers tune per-call delivery for cloned voices. Output sample rate is now 48 kHz on the self-hosted path, up from 24 kHz. The warm / repromote endpoints stay in place as no-ops for compatibility.
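
For callers adapting to the entries above, the sync and async behaviour of `POST /speak` can be handled in one client path. The following is a minimal sketch, not the module's client. The endpoint paths, the `202` + `job_id` response, `force_async`, `save_to`, and the three tuning fields come from this changelog. Everything else is an assumption made for illustration: the `text` and `voice_id` field names, the `status` field and its `done` / `failed` values on the job resource, and the injected `transport` callable, which stands in for an HTTP client that adds the URL prefix and auth.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical transport: (method, path, json_body) -> (status_code, parsed_body).
# A real caller would wrap an HTTP library here and add the URL prefix and auth.
Transport = Callable[[str, str, Optional[Dict[str, Any]]], Tuple[int, Any]]


def build_speak_payload(
    text: str,
    voice_id: str,
    *,
    force_async: bool = False,
    instructions: Optional[str] = None,
    cfg_value: Optional[float] = None,
    warmup_trim_ms: Optional[int] = None,
    save_to: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a POST /speak body, leaving out optional fields that were not set."""
    # "text" and "voice_id" are assumed field names, not taken from the changelog.
    payload: Dict[str, Any] = {"text": text, "voice_id": voice_id}
    if force_async:
        payload["force_async"] = True
    optional = {
        "instructions": instructions,
        "cfg_value": cfg_value,
        "warmup_trim_ms": warmup_trim_ms,
        "save_to": save_to,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


def speak(
    transport: Transport,
    payload: Dict[str, Any],
    *,
    poll_interval_s: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Send POST /speak; on 202, poll GET /speak/jobs/{id} until the job settles."""
    status, body = transport("POST", "/speak", payload)
    if status >= 400:
        raise RuntimeError(f"POST /speak failed with status {status}")
    if status != 202:
        return body  # sync path: the response body is the result
    job_id = body["job_id"]
    for _ in range(max_polls):
        _, job = transport("GET", f"/speak/jobs/{job_id}", None)
        # The "status" field and its values are assumptions for this sketch.
        if job.get("status") in ("done", "failed"):
            return job
        sleep(poll_interval_s)
    raise TimeoutError(f"job {job_id} did not settle after {max_polls} polls")
```

Passing the transport and `sleep` in as parameters keeps the sketch independent of any particular HTTP library and lets the polling loop be exercised without a network.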