
Voice library and consent

A voice in ScaiSpeak is a row in the library with a reference clip, a consent or license trail, and an embedding state. When you call /speak, you pick a voice_id from a list filtered by what's visible to you.

Three scopes, one list

Every list call (GET /voices) merges three pools:

  • Global voices are platform-managed. SuperAdmins import them through POST /admin/voices/global against a commercial license — the licensor name, license type, and (for time- or usage-bound licenses) the expiry or character cap live on the voice row. No consent recording. Every tenant sees every global voice.
  • Tenant voices are shared inside one tenant. They're created by a user, then promoted via POST /voices/{id}/share (which needs scaispeak:voice.share). All users in the tenant see them.
  • User voices are private to the user who cloned them. Other users in the same tenant don't see them. This is the default scope for any cloned voice.

The visibility ACL is enforced server-side in VoiceService.list_visible and get_visible. You can't bypass it by guessing a voice_id — cross-scope reads return 404, not 403, so the existence of voices outside your scope isn't disclosed.
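
The scope rules above can be sketched as a small in-memory model. This is an illustration only, not the actual VoiceService code: the Voice dataclass, its field names, and the VoiceNotFound exception are assumptions made for the example. It shows the three pools merging into one list and a cross-scope read failing the same way as a missing voice.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Voice:
    # Hypothetical shape; the real voice row has more columns.
    voice_id: str
    scope: str                       # "global", "tenant", or "user"
    tenant_id: Optional[str] = None  # None for global voices
    owner_id: Optional[str] = None   # the user who cloned the voice


class VoiceNotFound(Exception):
    """Stands in for an HTTP 404 in this sketch."""


def is_visible(voice: Voice, tenant_id: str, user_id: str) -> bool:
    if voice.scope == "global":
        return True                                  # every tenant sees every global voice
    if voice.scope == "tenant":
        return voice.tenant_id == tenant_id          # shared inside one tenant
    if voice.scope == "user":
        return voice.tenant_id == tenant_id and voice.owner_id == user_id
    return False                                     # unknown scope: hide by default


def list_visible(voices: Iterable[Voice], tenant_id: str, user_id: str) -> List[Voice]:
    # One merged list covering the global, tenant, and user pools.
    return [v for v in voices if is_visible(v, tenant_id, user_id)]


def get_visible(voices: Iterable[Voice], voice_id: str, tenant_id: str, user_id: str) -> Voice:
    for v in voices:
        if v.voice_id == voice_id and is_visible(v, tenant_id, user_id):
            return v
    # Same error whether the voice is missing or merely out of scope,
    # so a caller cannot probe for voices they are not allowed to see.
    raise VoiceNotFound(voice_id)
```

Raising one exception for both "does not exist" and "exists but not yours" is what keeps the 404 from disclosing anything.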

Cloning a voice

Voice cloning takes two audio inputs:

  • A reference clip: 5 to 30 seconds of the speaker's voice in a representative speaking style. This becomes the voice's identity at synthesis time.
  • A consent clip: the same speaker reading a verbatim scripted statement (the consent_text field) that records what they're agreeing to. It authenticates that the reference voice belongs to the person consenting.

Both arrive through POST /voices as either multipart file uploads or ScaiDrive references (one of {file_id, mcp_uri, share_url}). The two files can use different sources, for example an inline reference clip and a consent clip from ScaiDrive, but each file takes exactly one source. Supplying both an inline upload and a ScaiDrive reference for the same audio fails fast with SCAISPEAK_AMBIGUOUS_SOURCE.
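
A minimal sketch of the per-file source rule, assuming each audio input arrives as an optional inline upload plus an optional ScaiDrive reference dict. The resolve_source function, the SourceError class, and the two HYPOTHETICAL_* codes are invented for illustration; only SCAISPEAK_AMBIGUOUS_SOURCE and the three reference keys come from the text above.

```python
from typing import Optional, Tuple

SCAIDRIVE_KEYS = ("file_id", "mcp_uri", "share_url")


class SourceError(ValueError):
    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code


def resolve_source(field: str, upload: Optional[bytes],
                   drive_ref: Optional[dict]) -> Tuple[str, object]:
    """Pick the single source for one audio input (reference or consent)."""
    has_upload = upload is not None
    has_ref = bool(drive_ref)

    if has_upload and has_ref:
        # Inline plus ScaiDrive for the same file is rejected up front.
        raise SourceError("SCAISPEAK_AMBIGUOUS_SOURCE",
                          f"{field}: both an inline upload and a ScaiDrive reference were given")
    if has_upload:
        return ("inline", upload)
    if has_ref:
        keys = [k for k in SCAIDRIVE_KEYS if drive_ref.get(k)]
        if len(keys) != 1:
            # Assumption: a reference must name exactly one of the three keys.
            raise SourceError("HYPOTHETICAL_INVALID_REFERENCE",
                              f"{field}: expected exactly one of {SCAIDRIVE_KEYS}")
        return ("scaidrive", {keys[0]: drive_ref[keys[0]]})
    # Assumption: a missing source is also an error; the real code is not documented here.
    raise SourceError("HYPOTHETICAL_MISSING_SOURCE", f"{field}: no audio source provided")
```

Because the function is called once per file, mixing sources across the reference and consent inputs works naturally while the same-file conflict is caught before any audio is read.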

There's also a WebSocket alternative at WS /voices/record for live recording in the browser. It follows the same two-phase flow (reference, then consent) with the same validation, but involves no file handling.

Cloning is zero-shot: the reference clip is consumed directly at synthesis time, with no separate training step. A voice becomes usable as soon as intake clears preflight and the consent record is committed.

Preflight

Before any audio is stored, the reference clip runs through a cheap preflight (run_preflight):

  • Duration in milliseconds (must be in spec range).
  • Sample rate and channel count.
  • Peak dBFS (clipping check).
  • Estimated SNR (signal-to-noise sanity check).
  • Voice-activity ratio (rejects clips that are mostly silence).

When the preflight fails, the response carries the structured preflight block so the operator can see which threshold tripped without having to re-upload. Non-blocking findings appear in warnings; blocking findings appear in fail_reasons, and the request returns 400 with code SCAISPEAK_VOICE_PREFLIGHT_FAILED.
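
To make the checks concrete, here is a rough sketch of a preflight over mono float samples in the range -1.0 to 1.0. It is not the real run_preflight: the function name, every threshold except the 5 to 30 second duration range, the finding tags, and the choice of which findings block versus warn are all assumptions. SNR estimation is left out to keep the example short.

```python
import math
from typing import Dict, List, Sequence


def _dbfs(value: float) -> float:
    # Level relative to full scale (1.0); silence maps to -inf.
    return 20.0 * math.log10(value) if value > 0 else float("-inf")


def run_preflight_sketch(samples: Sequence[float], sample_rate: int, channels: int = 1,
                         min_ms: int = 5_000, max_ms: int = 30_000,
                         clip_dbfs: float = -0.1, silence_dbfs: float = -40.0,
                         min_voice_ratio: float = 0.5, frame_ms: int = 20) -> Dict:
    duration_ms = len(samples) * 1000 // sample_rate
    peak_dbfs = _dbfs(max((abs(s) for s in samples), default=0.0))

    # Crude voice-activity estimate: share of short frames whose RMS level
    # is above a silence floor. A production check would use a real VAD.
    frame_len = max(1, sample_rate * frame_ms // 1000)
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = sum(
        1 for f in frames
        if _dbfs(math.sqrt(sum(s * s for s in f) / len(f))) > silence_dbfs
    )
    voice_ratio = active / len(frames) if frames else 0.0

    fail_reasons: List[str] = []   # blocking findings (hypothetical tags)
    warnings: List[str] = []       # non-blocking findings (hypothetical tags)
    if not min_ms <= duration_ms <= max_ms:
        fail_reasons.append("duration_out_of_range")
    if voice_ratio < min_voice_ratio:
        fail_reasons.append("mostly_silence")
    if peak_dbfs >= clip_dbfs:
        warnings.append("possible_clipping")
    if channels != 1:
        warnings.append("not_mono")

    return {
        "ok": not fail_reasons,
        "preflight": {
            "duration_ms": duration_ms,
            "sample_rate": sample_rate,
            "channels": channels,
            "peak_dbfs": peak_dbfs,
            "voice_activity_ratio": voice_ratio,
        },
        "warnings": warnings,
        "fail_reasons": fail_reasons,
    }
```

Returning the measured values alongside the findings mirrors the idea that an operator can see which threshold tripped without uploading the clip again.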

Voice lifecycle

A new voice goes through these states (column embedding_status):

| State | Meaning |
| --- | --- |
| pending | Created, intake started, blocked on preflight or consent recording. Brief: exists only while the upload completes. |
| processing | Legacy state from the pre-zero-shot era; not used by the current pipeline. Stuck rows here indicate the voice was created before the engine migration and hasn't been re-promoted. |
| ready | The voice is usable. Backend A (self-hosted) reuses the reference clip at synth time for zero-shot cloning; Backend B (managed relay) uses its own enrollment if applicable. |
| failed | Intake failed. embedding_status_reason carries a short tag (reference_too_short, reference_unavailable, etc.). |
| evicted | Soft-deleted by erasure. The row is kept for audit; cached artefacts are cleared; the reference audio is gone from object storage. |

For voices stuck in processing (legacy), POST /voices/{id}/repromote re-runs intake processing. The call is idempotent: if the voice is already ready, it's a no-op.
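
The states and the idempotent repromote behaviour can be sketched as follows. The voice is modelled as a plain dict and run_intake is a hypothetical callback that returns None on success or a short reason tag on failure. Rejecting states other than processing and ready is an assumption, since the text above does not say what happens in those cases.

```python
from enum import Enum
from typing import Callable, Optional


class EmbeddingStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"   # legacy, pre-zero-shot
    READY = "ready"
    FAILED = "failed"
    EVICTED = "evicted"


def repromote(voice: dict, run_intake: Callable[[dict], Optional[str]]) -> dict:
    status = EmbeddingStatus(voice["embedding_status"])

    if status is EmbeddingStatus.READY:
        return voice            # already usable: no-op, intake is not re-run

    if status is not EmbeddingStatus.PROCESSING:
        # Assumption: only legacy "processing" rows are eligible.
        raise ValueError(f"cannot repromote a voice in state {status.value!r}")

    reason = run_intake(voice)  # None on success, otherwise a short tag
    if reason is None:
        voice["embedding_status"] = EmbeddingStatus.READY.value
        voice["embedding_status_reason"] = None
    else:
        voice["embedding_status"] = EmbeddingStatus.FAILED.value
        voice["embedding_status_reason"] = reason
    return voice
```

Calling repromote twice on the same legacy voice runs intake once; the second call sees ready and returns immediately.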

The audit trail differs by scope:

  • User / tenant voices carry a voice_consent row. It pins the speaker's full name, the stated purpose, the verbatim consent text, and a hash of the consent audio against that text. This is the GDPR-grade record that the human in the clip authorised use.
  • Global voices carry a voice_platform_license row instead. It pins the licensor, license type (perpetual, time_bound, usage_bound), and (for non-perpetual) the bounds. Licensed acquisition is the audit equivalent for platform-wide voices — no individual end-user consent exists because the license is between ScaiLabs and the voice talent.

Both rows are immutable once written. Editing the consent or license means deleting the voice and re-creating it.
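
As an illustration of a write-once consent record, the sketch below uses a frozen dataclass and a SHA-256 digest that binds the consent audio to the consent text. The field names, the choice of SHA-256, and the exact digest construction are assumptions; the source only says the row pins a hash of the consent audio against the text.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceConsent:
    # Hypothetical field names for the voice_consent row.
    speaker_full_name: str
    purpose: str
    consent_text: str
    consent_hash: str


def consent_digest(consent_audio: bytes, consent_text: str) -> str:
    h = hashlib.sha256()
    # Length prefix keeps the audio/text boundary unambiguous.
    h.update(len(consent_audio).to_bytes(8, "big"))
    h.update(consent_audio)
    h.update(consent_text.encode("utf-8"))
    return h.hexdigest()


def record_consent(speaker_full_name: str, purpose: str,
                   consent_text: str, consent_audio: bytes) -> VoiceConsent:
    return VoiceConsent(
        speaker_full_name=speaker_full_name,
        purpose=purpose,
        consent_text=consent_text,
        consent_hash=consent_digest(consent_audio, consent_text),
    )
```

Because the dataclass is frozen, any attempt to assign to a field raises dataclasses.FrozenInstanceError, which matches the delete-and-recreate rule for changing a consent.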

Erasure (right to be forgotten)

DELETE /voices/{id} is the user-facing erasure path. It's not a simple row delete — it fans out:

  1. Tombstone the voice row (deleted_at set, embedding_status='evicted').
  2. Send EvictVoice to every ScaiInfer node currently warm on this voice.
  3. Clear the voice-warm Redis registry.
  4. Delete the reference audio + consent audio blobs from object storage.
  5. Write an immutable erasure_audit row capturing the trigger user, source, warm-replicas-evicted count, blob-bytes-deleted count, and any partial-failure error summary.

The response carries the audit_id so GDPR tooling can cross-reference. The tombstoned row gets hard-deleted later by the background tombstone worker; until then, listing endpoints filter it out.
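
The five-step fan-out can be sketched with in-memory stand-ins: a dict for the voice-warm registry, a dict of callables in place of EvictVoice calls to ScaiInfer nodes, and a dict for object storage. All names and the audit row's field names here are hypothetical. The point of the sketch is the ordering and the fact that partial failures are collected into the audit row rather than aborting the erasure.

```python
import uuid
from datetime import datetime, timezone
from types import MappingProxyType


def erase_voice(voice: dict, warm_registry: dict, evict_fns: dict,
                blob_store: dict, audit_log: list,
                trigger_user: str, source: str) -> str:
    errors = []

    # 1. Tombstone the voice row.
    voice["deleted_at"] = datetime.now(timezone.utc).isoformat()
    voice["embedding_status"] = "evicted"

    # 2. Ask every node currently warm on this voice to evict it.
    evicted = 0
    for node in warm_registry.get(voice["id"], ()):
        try:
            evict_fns[node](voice["id"])
            evicted += 1
        except Exception as exc:            # keep going on partial failure
            errors.append(f"evict {node}: {exc}")

    # 3. Clear the warm registry entry.
    warm_registry.pop(voice["id"], None)

    # 4. Delete the reference and consent audio blobs.
    bytes_deleted = 0
    for key in (voice["reference_key"], voice["consent_key"]):
        blob = blob_store.pop(key, None)
        if blob is None:
            errors.append(f"blob not found: {key}")
        else:
            bytes_deleted += len(blob)

    # 5. Write a read-only audit row and return its id.
    audit = MappingProxyType({
        "audit_id": str(uuid.uuid4()),
        "voice_id": voice["id"],
        "trigger_user": trigger_user,
        "source": source,
        "warm_replicas_evicted": evicted,
        "blob_bytes_deleted": bytes_deleted,
        "error_summary": "; ".join(errors) or None,
    })
    audit_log.append(audit)
    return audit["audit_id"]
```

In this sketch an unreachable node lowers the evicted count and shows up in error_summary, but the blobs are still deleted and the audit row is still written.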

Global voices have a parallel path at DELETE /admin/voices/global/{id} (SuperAdmin-only) with a required trigger: license_revoked, license_expired, or platform_decision. It runs the same erasure pipeline, and the license row's status is updated to match the trigger.

Backend portability

Voices are not tied to a single backend. The speaker identity is the reference audio itself, stored in object storage with the voice row. A voice created on Backend A (self-hosted, supports zero-shot cloning) keeps working if the deployment falls back to Backend B (managed relay, preset speakers only), with one caveat: the relay can't reproduce a cloned identity, so it serves the nearest preset match to the reference clip, if any.

POST /voices/{id}/repromote re-runs the intake pipeline for voices that were created before the current engine and are still in processing. It is idempotent on voices that are already ready.

Updated 2026-05-22 14:27:32, rev 13