Clone a voice and synthesise

You're going from a reference recording to a working custom voice that runs in your tenant. Roughly two minutes end-to-end — cloning is zero-shot, so there's no training step to wait for.

You need:

A 5-30 second reference clip of the target speaker (WAV, mono recommended; a broadband recording from a wired headset or studio mic produces the best clones).
A consent recording of the same speaker reading a scripted statement that names the platform, names the speaker, and describes the intended use.
An API key with scaispeak:voice.write (tenant admins have it; otherwise grant it explicitly).

1. Write the consent statement

Before you record anything, decide what the speaker is going to say. The statement is stored verbatim and the audio is hashed against this exact text, so a typo means the audio you upload no longer matches the consent.

A working template:

scdoc

My name is {full_name}. On {date}, I am recording this statement to grant
{tenant_name} permission to clone my voice using ScaiSpeak. The cloned
voice will be used for {stated_purpose}. I understand the recording is
stored under GDPR-compliant terms and can be erased on request.

Keep the statement short: 10-20 seconds is plenty. Reading it in a monotone is fine; expressiveness belongs in the reference clip, not the consent.
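One way to avoid a text mismatch is to render the statement once and reuse the same string both as the script the speaker reads and as the consent_text field. The following is a minimal sketch: render_consent_text is a hypothetical helper, not part of ScaiSpeak, and joining the template onto a single line is an assumption, since this page does not say whether whitespace or line breaks are normalised.

```python
from datetime import date

# Template copied from the statement above, joined onto one line (assumption).
CONSENT_TEMPLATE = (
    "My name is {full_name}. On {date}, I am recording this statement to grant "
    "{tenant_name} permission to clone my voice using ScaiSpeak. The cloned "
    "voice will be used for {stated_purpose}. I understand the recording is "
    "stored under GDPR-compliant terms and can be erased on request."
)


def render_consent_text(full_name, tenant_name, stated_purpose, on=None):
    """Fill the template once so the script and the upload share one string."""
    fields = {
        "full_name": full_name.strip(),
        "tenant_name": tenant_name.strip(),
        "stated_purpose": stated_purpose.strip(),
        "date": (on or date.today()).isoformat(),
    }
    missing = [name for name, value in fields.items() if not value]
    if missing:
        raise ValueError(f"empty consent fields: {missing}")
    return CONSENT_TEMPLATE.format(**fields)


consent_text = render_consent_text(
    "Avery Johnson",
    "Acme",
    "narration for the Acme handbook audiobook",
    on=date(2026, 5, 17),
)
print(consent_text)  # hand this exact text to the speaker to read
```

Storing the rendered string alongside the recordings means the value sent at upload time is the one that was actually read aloud.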

2. Record the reference clip

The reference is what the cloned voice sounds like. Aim for:

Clean recording — no music, no overlapping voice, low background noise.
5-25 seconds of varied speech (one sentence with both rising and falling intonation works).
The speaking style you want the clone to inherit — the model will mirror tempo, formality, and prosody from this clip.

Save the reference clip as reference.wav and the consent recording as consent.wav. The preflight rejects clips that are mostly silence, mostly clipping, or outside the duration range, so listen back before uploading.
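A rough local check can catch obvious problems before the upload. The sketch below uses only the Python standard library and handles 16-bit PCM WAV files. The function name and every threshold (silence_level, clip_level, and the 5-30 second window) are illustrative assumptions; the server-side preflight limits are not documented on this page and may differ.

```python
import array
import sys
import wave


def inspect_wav(path, min_seconds=5.0, max_seconds=30.0,
                silence_level=0.01, clip_level=0.999):
    """Return basic duration, silence and clipping stats for a 16-bit WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames = wav.getnframes()
        samples = array.array("h", wav.readframes(frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    duration = frames / rate
    total = max(len(samples), 1)
    # Count near-zero samples and samples at (or next to) full scale.
    quiet = sum(1 for s in samples if abs(s) < silence_level * 32768)
    clipped = sum(1 for s in samples if abs(s) >= clip_level * 32767)
    return {
        "duration_s": duration,
        "channels": channels,
        "duration_ok": min_seconds <= duration <= max_seconds,
        "silence_ratio": quiet / total,
        "clipping_ratio": clipped / total,
    }


# Example usage (assumes reference.wav exists in the working directory):
# print(inspect_wav("reference.wav"))
```

A high silence_ratio or a non-zero clipping_ratio is a hint to re-record; it does not guarantee that the real preflight will accept or reject the clip.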

3. Create the voice

Upload both clips together with the consent metadata. The reference and consent files arrive as multipart parts; the rest is form fields. The endpoint runs preflight synchronously and rejects the request inline if the reference clip fails the quality check — you'll see the structured preflight block in the error response.

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/voices" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -F "reference=@reference.wav" \
  -F "consent=@consent.wav" \
  -F "display_name=Acme Narrator Avery" \
  -F "language_primary=en" \
  -F "language_supported_json=[\"en\",\"en-GB\"]" \
  -F "gender_hint=female" \
  -F "age_hint=adult" \
  -F "consent_user_full_name=Avery Johnson" \
  -F "consent_stated_purpose=narration for the Acme handbook audiobook" \
  -F "consent_text=My name is Avery Johnson. On 2026-05-17, I grant Acme..."

python
import httpx, os

with open("reference.wav", "rb") as ref, open("consent.wav", "rb") as cnt:
    r = httpx.post(
        f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaispeak/voices",
        headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
        files={
            "reference": ("reference.wav", ref, "audio/wav"),
            "consent": ("consent.wav", cnt, "audio/wav"),
        },
        data={
            "display_name": "Acme Narrator Avery",
            "language_primary": "en",
            "language_supported_json": '["en","en-GB"]',
            "gender_hint": "female",
            "age_hint": "adult",
            "consent_user_full_name": "Avery Johnson",
            "consent_stated_purpose": "narration for the Acme handbook audiobook",
            "consent_text": (
                "My name is Avery Johnson. On 2026-05-17, I grant Acme..."
            ),
        },
    )
r.raise_for_status()
voice = r.json()["data"]
print(voice["voice_id"], voice["embedding_status"], voice["preflight"])

javascript
import { readFile } from "node:fs/promises";

// Native fetch/FormData expect Blob parts, not Node read streams.
const wavBlob = async (path) =>
  new Blob([await readFile(path)], { type: "audio/wav" });

const form = new FormData();
form.append("reference", await wavBlob("reference.wav"), "reference.wav");
form.append("consent", await wavBlob("consent.wav"), "consent.wav");
form.append("display_name", "Acme Narrator Avery");
form.append("language_primary", "en");
form.append("language_supported_json", '["en","en-GB"]');
form.append("gender_hint", "female");
form.append("age_hint", "adult");
form.append("consent_user_full_name", "Avery Johnson");
form.append("consent_stated_purpose", "narration for the Acme handbook audiobook");
form.append("consent_text", "My name is Avery Johnson. On 2026-05-17, I grant Acme...");

const res = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaispeak/voices`, {
  method: "POST",
  headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
  body: form,
});
const body = await res.json();
if (!res.ok) {
  // A 400 carries the preflight block describing which threshold failed.
  throw new Error(`Create voice failed (${res.status}): ${JSON.stringify(body)}`);
}
const { data: voice } = body;
console.log(voice.voice_id, voice.embedding_status);

If preflight rejected the audio, you'll get a 400 with the preflight block explaining which threshold failed. Otherwise the voice lands at embedding_status: ready and is immediately usable — there's no separate training step to wait for.
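To surface the preflight details instead of a bare HTTP error, a client can inspect the body before raising. The sketch below is an assumption-heavy illustration: the exact location of the preflight block inside the 400 body is not shown on this page, so the hypothetical helper searches the JSON for the first key named preflight rather than relying on a fixed path.

```python
import json


def find_key(obj, key):
    """Depth-first search for the first value stored under `key`."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def describe_create_result(status_code, body_text):
    """Turn a create-voice response into a one-line summary."""
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return f"HTTP {status_code}: non-JSON body"
    if status_code == 400:
        # Assumption: the block lives somewhere under a "preflight" key.
        preflight = find_key(body, "preflight")
        if preflight is not None:
            return "preflight rejected: " + json.dumps(preflight, sort_keys=True)
        return "HTTP 400 without a preflight block"
    if 200 <= status_code < 300:
        voice = body.get("data", {}) if isinstance(body, dict) else {}
        return f"created {voice.get('voice_id')} ({voice.get('embedding_status')})"
    return f"HTTP {status_code}"
```

With the httpx example above, calling describe_create_result(r.status_code, r.text) before r.raise_for_status() would print the failed threshold rather than only the status line.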

4. Synthesise

Now use the voice the same way you'd use any voice in the library:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/speak" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_id": "'$VOICE_ID'",
    "text": "Welcome to the Acme handbook. In this chapter we cover...",
    "response_format": "mp3"
  }' \
  | python -c "import sys,json,base64;\
b=json.load(sys.stdin)['data']['audio_base64'];\
open('chapter.mp3','wb').write(base64.b64decode(b))"

Play chapter.mp3. The voice should match the reference clip.
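The same call can be sketched in Python. This version uses only the standard library so it runs without extra dependencies; the request fields and the data.audio_base64 response path mirror the curl example above, while the helper names and the 60-second timeout are assumptions made for illustration.

```python
import base64
import json
import os
import urllib.request


def build_speak_request(host, api_key, voice_id, text, response_format="mp3"):
    """Build the POST request for /speak without sending it."""
    payload = {"voice_id": voice_id, "text": text, "response_format": response_format}
    return urllib.request.Request(
        f"{host}/v1/modules/scaispeak/speak",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_audio(response_body):
    """Decode data.audio_base64 from the JSON response into raw bytes."""
    return base64.b64decode(json.loads(response_body)["data"]["audio_base64"])


def synthesise_to_file(voice_id, text, out_path, response_format="mp3"):
    """Send the request and write the decoded audio to out_path."""
    request = build_speak_request(
        os.environ["SCAIGRID_HOST"], os.environ["SCAIGRID_API_KEY"],
        voice_id, text, response_format,
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = extract_audio(response.read())
    with open(out_path, "wb") as out:
        out.write(audio)
    return len(audio)


# Example usage (needs SCAIGRID_HOST, SCAIGRID_API_KEY and a real voice id):
# synthesise_to_file(os.environ["VOICE_ID"], "Welcome to the Acme handbook.", "chapter.mp3")
```

Splitting request building from sending keeps the payload easy to inspect or log before any audio is generated.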

5. Tune the delivery (optional)

Three extra fields on /speak let you steer the cloned-voice output per call:

instructions — free-text style guidance. The engine interprets it: "cheerful and energetic", "slowly and carefully", "whispered", "like reading a bedtime story" all work.
cfg_value — cloning fidelity vs naturalness tradeoff, 0.5 to 5.0. Higher stays closer to the reference at the cost of natural prosody; lower sounds more natural but drifts further from the source. Engine default ~2.0.
warmup_trim_ms — strips the first N ms to absorb the engine's warm-up artefact at the start of cloned output. 150 ms is the recommended setting; 0 to disable.

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/speak" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_id": "'$VOICE_ID'",
    "text": "Welcome to the Acme handbook.",
    "response_format": "wav",
    "instructions": "warm and reassuring, slower than normal",
    "cfg_value": 2.5,
    "warmup_trim_ms": 150
  }'

These fields are no-ops for preset speakers and for the managed-relay backend.
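A small client-side check can catch out-of-range tuning values before a request is sent. The sketch below encodes only the limits stated above (cfg_value between 0.5 and 5.0, warmup_trim_ms of 0 or more); tuning_fields is a hypothetical helper, the voice id is a placeholder, and how the server itself responds to invalid values is not covered here.

```python
def tuning_fields(instructions=None, cfg_value=None, warmup_trim_ms=None):
    """Return only the tuning fields that were set, after range checks."""
    fields = {}
    if instructions is not None:
        if not instructions.strip():
            raise ValueError("instructions must not be blank")
        fields["instructions"] = instructions.strip()
    if cfg_value is not None:
        if not 0.5 <= cfg_value <= 5.0:
            raise ValueError("cfg_value must be between 0.5 and 5.0")
        fields["cfg_value"] = float(cfg_value)
    if warmup_trim_ms is not None:
        if warmup_trim_ms < 0:
            raise ValueError("warmup_trim_ms must be 0 or greater")
        fields["warmup_trim_ms"] = int(warmup_trim_ms)
    return fields


# Merge the validated fields into a /speak payload.
payload = {
    "voice_id": "VOICE_ID_PLACEHOLDER",  # placeholder, not a real id
    "text": "Welcome to the Acme handbook.",
    "response_format": "wav",
    **tuning_fields("warm and reassuring, slower than normal", 2.5, 150),
}
```

Leaving a field unset omits it from the payload, so the engine default applies.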

6. Share with your tenant

Other users in your tenant can't see this voice yet because its scope is user. Promote it to tenant scope:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/voices/$VOICE_ID/share" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Sharing needs scaispeak:voice.share in addition to the standard write permission. After this, every user in your tenant sees the voice in GET /voices.

7. When you're done with it

Right-to-erasure is built in. DELETE /voices/{id} clears the reference and consent blobs from storage, evicts any cached state across the deployment, and writes an immutable audit row.

bash
curl -X DELETE "$SCAIGRID_HOST/v1/modules/scaispeak/voices/$VOICE_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

The response carries the audit_id. Keep it — it's the proof the deletion happened.
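One way to keep that proof is to append each audit_id to a local log as soon as the DELETE returns. This is a sketch under assumptions: the page does not show where audit_id sits in the response, so the hypothetical helper checks under data (as in the other responses) and at the top level, and the log file name is arbitrary.

```python
import json
from datetime import datetime, timezone


def record_deletion(voice_id, response_body, log_path="voice_deletions.jsonl"):
    """Append the audit_id from a DELETE response to a JSON Lines log."""
    body = json.loads(response_body)
    # Assumption: audit_id is under "data" or at the top level of the body.
    data = body.get("data")
    audit_id = (data.get("audit_id") if isinstance(data, dict) else None) \
        or body.get("audit_id")
    if not audit_id:
        raise ValueError("no audit_id in delete response")
    entry = {
        "voice_id": voice_id,
        "audit_id": audit_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

An append-only file keeps earlier entries intact if the same script is used for several deletions.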

Done

You have a custom voice running through your tenant, with a recorded consent trail and a working erasure path. Iterate the same way: re-record, re-create, delete the old one.