---
summary: How voices are scoped (global, tenant, user), how cloning works, what consent
  vs license means, and how erasure flows through the system.
title: Voice library and consent
path: concepts/voice-library
status: published
---

A voice in ScaiSpeak is a row in the library with a reference clip, a consent or license trail, and an embedding state. When you call `/speak`, you pick a `voice_id` from a list filtered by what's visible to you.

## Three scopes, one list

Every list call (`GET /voices`) merges three pools:

- **Global** voices are platform-managed. SuperAdmins import them through `POST /admin/voices/global` against a commercial license — the licensor name, license type, and (for time- or usage-bound licenses) the expiry or character cap live on the voice row. No consent recording. Every tenant sees every global voice.
- **Tenant** voices are shared inside one tenant. They're created by a user, then promoted via `POST /voices/{id}/share` (which needs `scaispeak:voice.share`). All users in the tenant see them.
- **User** voices are private to the user who cloned them. Other users in the same tenant don't see them. This is the default scope for any cloned voice.

The visibility ACL is enforced server-side in `VoiceService.list_visible` and `get_visible`. You can't bypass it by guessing a `voice_id` — cross-scope reads return 404, not 403, so the existence of voices outside your scope isn't disclosed.
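The scope rules above can be sketched as a single visibility predicate. This is an illustration only, not the actual `VoiceService` code: the `Voice` dataclass, its field names, and the `VoiceNotFound` exception are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Voice:
    # Hypothetical shape; the real voice row has more columns.
    voice_id: str
    scope: str                       # "global", "tenant", or "user"
    tenant_id: Optional[str] = None  # None for global voices
    owner_id: Optional[str] = None   # the user who cloned the voice


class VoiceNotFound(Exception):
    """Hypothetical error that an HTTP layer would map to a 404."""


def is_visible(voice: Voice, tenant_id: str, user_id: str) -> bool:
    if voice.scope == "global":
        return True                  # every tenant sees every global voice
    if voice.tenant_id != tenant_id:
        return False                 # never cross tenant boundaries
    if voice.scope == "tenant":
        return True                  # shared with all users in the tenant
    return voice.scope == "user" and voice.owner_id == user_id


def list_visible(voices: List[Voice], tenant_id: str, user_id: str) -> List[Voice]:
    return [v for v in voices if is_visible(v, tenant_id, user_id)]


def get_visible(voices: List[Voice], voice_id: str, tenant_id: str, user_id: str) -> Voice:
    for v in voices:
        if v.voice_id == voice_id and is_visible(v, tenant_id, user_id):
            return v
    # Same error for "does not exist" and "exists but out of scope",
    # so a caller cannot probe for voices they are not allowed to see.
    raise VoiceNotFound(voice_id)
```

Raising one error for both the missing and the out-of-scope case is what keeps a guessed `voice_id` from confirming that a voice exists.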

## Cloning a voice

Voice cloning takes two audio inputs:

- A **reference** clip: 5 to 30 seconds of the speaker's voice in a representative speaking style. This becomes the voice's identity at synthesis time.
- A **consent** clip: the same speaker reading a verbatim scripted statement (the `consent_text` field) that records what they're agreeing to. It authenticates that the reference voice belongs to the person consenting.

Both arrive through `POST /voices` as either multipart file uploads or ScaiDrive references (one of `{file_id, mcp_uri, share_url}`). The two files can use different sources (for example, reference inline and consent from ScaiDrive), but each file must come from exactly one source. Supplying both an inline upload and a ScaiDrive reference for the same audio fails fast with `SCAISPEAK_AMBIGUOUS_SOURCE`.
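A minimal sketch of the per-file source check, assuming the server resolves each audio input independently. The function name, the exception class, and the behaviour for a missing or over-specified ScaiDrive reference are assumptions; only the `SCAISPEAK_AMBIGUOUS_SOURCE` code and the three reference keys come from the text above.

```python
from typing import Optional

SCAIDRIVE_KEYS = ("file_id", "mcp_uri", "share_url")


class AmbiguousSource(ValueError):
    # Error code named in the docs; the exception class itself is hypothetical.
    code = "SCAISPEAK_AMBIGUOUS_SOURCE"


def resolve_source(name: str, upload: Optional[bytes], drive_ref: Optional[dict]) -> str:
    """Return "inline" or "scaidrive" for one audio input (reference or consent)."""
    has_upload = upload is not None
    has_ref = bool(drive_ref)
    if has_upload and has_ref:
        # Inline plus ScaiDrive for the same file is rejected up front.
        raise AmbiguousSource(f"{name}: inline upload and ScaiDrive reference both given")
    if has_upload:
        return "inline"
    if has_ref:
        given = [k for k in SCAIDRIVE_KEYS if k in drive_ref]
        if len(given) != 1:
            # Assumption: exactly one reference key is expected per file.
            raise ValueError(f"{name}: expected exactly one of {SCAIDRIVE_KEYS}")
        return "scaidrive"
    raise ValueError(f"{name}: no audio source provided")
```

Calling it once per file allows the mixed case, for instance `resolve_source("reference", audio_bytes, None)` followed by `resolve_source("consent", None, {"file_id": "..."})`.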

There's also a WebSocket alternative at `WS /voices/record` for live-recording in the browser. Same two-phase flow (reference, then consent), same validation, no file handling.

Cloning is **zero-shot**: the reference clip is consumed directly at synthesis time, and there's no separate training step. A voice becomes usable as soon as intake clears preflight and the consent record is committed.

## Preflight

Before any audio is stored, the reference clip runs through a cheap preflight (`run_preflight`):

- Duration in milliseconds (must be in spec range).
- Sample rate and channel count.
- Peak dBFS (clipping check).
- Estimated SNR (signal-to-noise sanity check).
- Voice-activity ratio (rejects clips that are mostly silence).

When the preflight fails, the response carries the structured `preflight` block so the operator can see which threshold tripped without having to re-upload. Warn-not-block findings show up in `warnings`; blocking findings show up in `fail_reasons` and the request returns 400 with code `SCAISPEAK_VOICE_PREFLIGHT_FAILED`.
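The split between warnings and blocking findings might look like the sketch below. It is not the real `run_preflight`: the function names, the reason tags, and every threshold except the 5 to 30 second duration range are assumptions chosen for illustration, as is the choice of which checks warn and which block.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PreflightResult:
    passed: bool
    warnings: List[str] = field(default_factory=list)
    fail_reasons: List[str] = field(default_factory=list)


def evaluate_preflight(duration_ms: int, sample_rate_hz: int, channels: int,
                       peak_dbfs: float, snr_db: float,
                       voice_activity_ratio: float) -> PreflightResult:
    warnings, fail_reasons = [], []
    if not 5_000 <= duration_ms <= 30_000:   # 5 to 30 s, per the cloning spec
        fail_reasons.append("duration_out_of_range")
    if sample_rate_hz < 16_000:              # assumed minimum
        fail_reasons.append("sample_rate_too_low")
    if channels != 1:                        # assumed: warn, do not block
        warnings.append("not_mono")
    if peak_dbfs >= -0.1:                    # assumed clipping threshold
        warnings.append("possible_clipping")
    if snr_db < 15.0:                        # assumed SNR floor
        fail_reasons.append("snr_too_low")
    if voice_activity_ratio < 0.5:           # assumed: at least half speech
        fail_reasons.append("mostly_silence")
    return PreflightResult(not fail_reasons, warnings, fail_reasons)


def error_response(result: PreflightResult) -> Tuple[int, dict]:
    """Build the 400 body for a failed preflight (body shape is hypothetical)."""
    return 400, {
        "code": "SCAISPEAK_VOICE_PREFLIGHT_FAILED",
        "preflight": {
            "warnings": result.warnings,
            "fail_reasons": result.fail_reasons,
        },
    }
```

Collecting every finding before returning, rather than stopping at the first failure, is what lets the operator see all tripped thresholds in one response.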

## Voice lifecycle

A new voice goes through these states (column `embedding_status`):

| State | Meaning |
|---|---|
| `pending` | Created, intake started, blocked on preflight or consent recording. Brief — exists only while the upload completes. |
| `processing` | Legacy state from the pre-zero-shot era; not used by the current pipeline. A row stuck here means the voice was created before the engine migration and hasn't been re-promoted. |
| `ready` | The voice is usable. Backend A (self-hosted) reuses the reference clip at synth time for zero-shot cloning; Backend B (managed relay) uses its own enrollment if applicable. |
| `failed` | Intake failed. `embedding_status_reason` carries a short tag (`reference_too_short`, `reference_unavailable`, etc.). |
| `evicted` | Soft-deleted by erasure. The row is kept for audit; cached artefacts are cleared; the reference audio is gone from object storage. |

For voices stuck in `processing` (legacy), `POST /voices/{id}/repromote` re-runs intake processing. Idempotent — if the voice is already ready, it's a no-op.
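The idempotency of re-promotion can be sketched as a small state transition. The dict-based voice row, the `run_intake` callback, and the decision to reject states other than `processing` and `ready` are assumptions; the state names and the `embedding_status_reason` column come from the table above.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class EmbeddingStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"
    EVICTED = "evicted"


def repromote(voice: dict,
              run_intake: Callable[[dict], Tuple[bool, Optional[str]]]) -> dict:
    """Re-run intake for a legacy voice; a no-op if the voice is already ready."""
    status = EmbeddingStatus(voice["embedding_status"])
    if status is EmbeddingStatus.READY:
        return voice  # idempotent: nothing to do, intake is not invoked
    if status is not EmbeddingStatus.PROCESSING:
        # Assumption: other states are not eligible for re-promotion.
        raise ValueError(f"cannot repromote a voice in state {status.value!r}")
    ok, reason = run_intake(voice)
    updated = dict(voice)
    updated["embedding_status"] = (
        EmbeddingStatus.READY if ok else EmbeddingStatus.FAILED
    ).value
    updated["embedding_status_reason"] = None if ok else reason
    return updated
```

Calling `repromote` twice on the same voice runs intake at most once, since the second call sees `ready` and returns immediately.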

## Consent vs license

The audit trail differs by scope:

- **User / tenant voices** carry a `voice_consent` row. It pins the speaker's full name, the stated purpose, the verbatim consent text, and a hash of the consent audio against that text. This is the GDPR-grade record that the human in the clip authorised use.
- **Global voices** carry a `voice_platform_license` row instead. It pins the licensor, license type (`perpetual`, `time_bound`, `usage_bound`), and (for non-perpetual) the bounds. Licensed acquisition is the audit equivalent for platform-wide voices — no individual end-user consent exists because the license is between ScaiLabs and the voice talent.

Both rows are immutable once written. Editing the consent or license means deleting the voice and re-creating it.
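One way to picture the two audit records is as frozen dataclasses, which reject edits after construction. The field names are assumptions based on the descriptions above, and the consent hash is shown as a plain SHA-256 of the audio bytes, which may not match how the real `voice_consent` row computes it.

```python
import hashlib
from dataclasses import dataclass, FrozenInstanceError
from datetime import date
from typing import Optional

LICENSE_TYPES = ("perpetual", "time_bound", "usage_bound")


@dataclass(frozen=True)
class VoiceConsent:
    # Audit record for user and tenant voices (hypothetical field names).
    speaker_full_name: str
    purpose: str
    consent_text: str
    consent_audio_sha256: str  # assumption: SHA-256 hex digest of the consent clip


@dataclass(frozen=True)
class VoicePlatformLicense:
    # Audit record for global voices (hypothetical field names).
    licensor: str
    license_type: str
    expires_on: Optional[date] = None     # required for time_bound
    character_cap: Optional[int] = None   # required for usage_bound

    def __post_init__(self):
        if self.license_type not in LICENSE_TYPES:
            raise ValueError(f"unknown license type: {self.license_type}")
        if self.license_type == "time_bound" and self.expires_on is None:
            raise ValueError("time_bound license needs an expiry")
        if self.license_type == "usage_bound" and self.character_cap is None:
            raise ValueError("usage_bound license needs a character cap")


consent = VoiceConsent(
    speaker_full_name="Ada Example",
    purpose="product narration",
    consent_text="I consent to the use of my voice for product narration.",
    consent_audio_sha256=hashlib.sha256(b"consent-audio-bytes").hexdigest(),
)
```

Attempting `consent.purpose = "other"` raises `FrozenInstanceError`, mirroring the rule that a change means deleting and re-creating the voice.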

## Erasure (right to be forgotten)

`DELETE /voices/{id}` is the user-facing erasure path. It's not a simple row delete — it fans out:

1. Tombstone the voice row (`deleted_at` set, `embedding_status='evicted'`).
2. Send `EvictVoice` to every ScaiInfer node currently warm on this voice.
3. Clear the voice-warm Redis registry.
4. Delete the reference audio + consent audio blobs from object storage.
5. Write an immutable `erasure_audit` row capturing the trigger user, source, warm-replicas-evicted count, blob-bytes-deleted count, and any partial-failure error summary.

The response carries the `audit_id` so GDPR tooling can cross-reference. The tombstoned row gets hard-deleted later by the background tombstone worker; until then, listing endpoints filter it out.
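The five-step fan-out can be sketched with in-memory stand-ins for the warm registry and object storage. Everything here is illustrative: the function signature, the dict-based stores, the blob key names, and the `evict` callback are assumptions, and the real pipeline talks to ScaiInfer, Redis, and object storage instead.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ErasureAudit:
    audit_id: str
    voice_id: str
    trigger_user: str
    source: str
    warm_replicas_evicted: int
    blob_bytes_deleted: int
    error_summary: Optional[str]


def erase_voice(voice: dict, warm_registry: Dict[str, List[str]],
                blobs: Dict[str, bytes], evict: Callable[[str, str], None],
                trigger_user: str, source: str) -> ErasureAudit:
    errors: List[str] = []
    # 1. Tombstone the row first so listings stop returning it.
    voice["deleted_at"] = datetime.now(timezone.utc)
    voice["embedding_status"] = "evicted"
    # 2. Ask every warm node to drop the voice; keep going on failure.
    evicted = 0
    for node in warm_registry.get(voice["id"], []):
        try:
            evict(node, voice["id"])
            evicted += 1
        except Exception as exc:
            errors.append(f"evict {node}: {exc}")
    # 3. Clear the warm registry entry.
    warm_registry.pop(voice["id"], None)
    # 4. Delete reference and consent audio, counting the bytes removed.
    bytes_deleted = 0
    for key in (voice["reference_key"], voice["consent_key"]):
        data = blobs.pop(key, None)
        if data is None:
            errors.append(f"blob missing: {key}")
        else:
            bytes_deleted += len(data)
    # 5. Record the outcome, including any partial failures.
    return ErasureAudit(
        audit_id=str(uuid.uuid4()),
        voice_id=voice["id"],
        trigger_user=trigger_user,
        source=source,
        warm_replicas_evicted=evicted,
        blob_bytes_deleted=bytes_deleted,
        error_summary="; ".join(errors) or None,
    )
```

Note the design choice in this sketch: a failed eviction is recorded in the audit rather than aborting the run, so the remaining steps still execute and the partial failure stays visible.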

Global voices have a parallel path at `DELETE /admin/voices/global/{id}` (SuperAdmin-only) with a required `trigger`: `license_revoked`, `license_expired`, or `platform_decision`. Same erasure pipeline; the license row's `status` is updated to match the trigger.

## Backend portability

Voices are not tied to a single backend. The speaker identity is the reference audio itself, stored in object storage with the voice row. A voice created on Backend A (self-hosted, supports zero-shot cloning) still resolves if the deployment falls back to Backend B (managed relay, preset speakers only), but the relay can't reproduce a cloned identity: it serves the nearest preset match to the reference clip, if one exists.

Voices that were created before the current engine and are still in `processing` can be moved forward with `POST /voices/{id}/repromote`, which re-runs the intake pipeline. The call is idempotent on voices that are already ready.
