---
summary: "Upload a reference clip plus a consent recording, then synthesise in your\
  \ custom voice \u2014 zero-shot, no training wait."
title: Clone a voice and synthesise
path: tutorials/clone-and-synthesise
status: published
---

You're going from a reference recording to a working custom voice that runs in your tenant. Roughly two minutes end-to-end — cloning is zero-shot, so there's no training step to wait for.

You need:

- A 5-30 second reference clip of the target speaker (WAV, mono recommended; a broadband recording from a wired headset or studio mic produces the best clones).
- A consent recording of the same speaker reading a scripted statement that names the platform, names the speaker, and describes the intended use.
- An API key with `scaispeak:voice.write` (tenant admins have it; otherwise grant it explicitly).

## 1. Settle the consent script

Before you record anything, decide what the speaker is going to say. The statement is stored verbatim and the audio is hashed against this exact text — a typo means the audio you upload no longer matches the consent.

A working template:

```
My name is {full_name}. On {date}, I am recording this statement to grant
{tenant_name} permission to clone my voice using ScaiSpeak. The cloned
voice will be used for {stated_purpose}. I understand the recording is
stored under GDPR-compliant terms and can be erased on request.
```

Keep the statement short: 10-20 seconds is plenty. Reading it in a monotone is fine; expressiveness belongs in the reference clip, not the consent recording.
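Because the stored text has to match what the speaker actually reads, it can help to render the statement once and reuse that exact string for both the printed script and the `consent_text` field. A minimal sketch, assuming the template above and an ISO-formatted date; `render_consent` is a hypothetical helper, not part of any ScaiSpeak SDK:

```python
from datetime import date

# Template text copied from the tutorial; the placeholder names match it.
CONSENT_TEMPLATE = (
    "My name is {full_name}. On {date}, I am recording this statement to grant "
    "{tenant_name} permission to clone my voice using ScaiSpeak. The cloned "
    "voice will be used for {stated_purpose}. I understand the recording is "
    "stored under GDPR-compliant terms and can be erased on request."
)


def render_consent(full_name: str, tenant_name: str, stated_purpose: str,
                   on: date) -> str:
    """Fill the template once so the script and the upload share one string."""
    return CONSENT_TEMPLATE.format(
        full_name=full_name,
        date=on.isoformat(),  # assumption: ISO dates, as in the example request
        tenant_name=tenant_name,
        stated_purpose=stated_purpose,
    )


consent_text = render_consent(
    "Avery Johnson",
    "Acme",
    "narration for the Acme handbook audiobook",
    date(2026, 5, 17),
)
print(consent_text)
```

Hand the printed text to the speaker and pass the same `consent_text` value to the create call, so a retyped copy never drifts from the recording.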

## 2. Record the reference clip

The reference is what the cloned voice *sounds* like. Aim for:

- Clean recording — no music, no overlapping voice, low background noise.
- 5-25 seconds of varied speech (one sentence with both rising and falling intonation works).
- The speaking style you want the clone to inherit — the model will mirror tempo, formality, and prosody from this clip.

Save the two recordings as `reference.wav` and `consent.wav`. The preflight rejects clips that are mostly silence, mostly clipped, or outside the duration range, so listen back before uploading.
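A rough local check can catch the obvious failures before you upload. The sketch below uses only the Python standard library and handles 16-bit PCM WAV files. The thresholds (`MIN_SECONDS`, `MAX_SECONDS`, `SILENCE_LEVEL`, `CLIP_LEVEL`, and the "more than half" cut-offs) are assumptions chosen for illustration; the service's real preflight limits are not documented here and may differ.

```python
import array
import sys
import wave

# Assumed limits for illustration only; the server-side preflight may differ.
MIN_SECONDS, MAX_SECONDS = 5.0, 30.0
SILENCE_LEVEL = 0.01   # fraction of full scale treated as "silent"
CLIP_LEVEL = 0.999     # fraction of full scale treated as "clipped"
FULL_SCALE = 32767     # largest positive 16-bit sample value


def inspect_wav(path: str) -> dict:
    """Return duration, channel count, and silent/clipped sample ratios."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        frames = wav.getnframes()
        duration = frames / wav.getframerate()
        samples = array.array("h", wav.readframes(frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    total = max(len(samples), 1)
    silent = sum(1 for s in samples if abs(s) <= SILENCE_LEVEL * FULL_SCALE)
    clipped = sum(1 for s in samples if abs(s) >= CLIP_LEVEL * FULL_SCALE)
    return {
        "duration_s": duration,
        "channels": channels,
        "silent_ratio": silent / total,
        "clipped_ratio": clipped / total,
    }


def local_problems(report: dict) -> list[str]:
    """List likely preflight failures under the assumed thresholds."""
    problems = []
    if not MIN_SECONDS <= report["duration_s"] <= MAX_SECONDS:
        problems.append(
            f"duration {report['duration_s']:.1f}s is outside "
            f"{MIN_SECONDS:.0f}-{MAX_SECONDS:.0f}s"
        )
    if report["silent_ratio"] > 0.5:
        problems.append("more than half the samples are near silence")
    if report["clipped_ratio"] > 0.5:
        problems.append("more than half the samples are at full scale")
    return problems
```

Calling `local_problems(inspect_wav("reference.wav"))` returns an empty list when nothing looks wrong under these assumed limits. It is a sanity check, not a replacement for listening back or for the server-side preflight.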

## 3. Create the voice

Upload both clips together with the consent metadata. The reference and consent files arrive as multipart parts; the rest is form fields. The endpoint runs preflight synchronously and rejects the request inline if the reference clip fails the quality check — you'll see the structured `preflight` block in the error response.

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/voices" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -F "reference=@reference.wav" \
  -F "consent=@consent.wav" \
  -F "display_name=Acme Narrator Avery" \
  -F "language_primary=en" \
  -F "language_supported_json=[\"en\",\"en-GB\"]" \
  -F "gender_hint=female" \
  -F "age_hint=adult" \
  -F "consent_user_full_name=Avery Johnson" \
  -F "consent_stated_purpose=narration for the Acme handbook audiobook" \
  -F "consent_text=My name is Avery Johnson. On 2026-05-17, I grant Acme..."
```

```python
import httpx, os

with open("reference.wav", "rb") as ref, open("consent.wav", "rb") as cnt:
    r = httpx.post(
        f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaispeak/voices",
        headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
        files={
            "reference": ("reference.wav", ref, "audio/wav"),
            "consent": ("consent.wav", cnt, "audio/wav"),
        },
        data={
            "display_name": "Acme Narrator Avery",
            "language_primary": "en",
            "language_supported_json": '["en","en-GB"]',
            "gender_hint": "female",
            "age_hint": "adult",
            "consent_user_full_name": "Avery Johnson",
            "consent_stated_purpose": "narration for the Acme handbook audiobook",
            "consent_text": (
                "My name is Avery Johnson. On 2026-05-17, I grant Acme..."
            ),
        },
    )
r.raise_for_status()
voice = r.json()["data"]
print(voice["voice_id"], voice["embedding_status"], voice["preflight"])
```

```javascript
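import { readFile } from "node:fs/promises";

// Node's built-in fetch/FormData (Node 18+) expect Blob parts, not read streams.
const wavBlob = async (path) =>
  new Blob([await readFile(path)], { type: "audio/wav" });

const form = new FormData();
form.append("reference", await wavBlob("reference.wav"), "reference.wav");
form.append("consent", await wavBlob("consent.wav"), "consent.wav");
form.append("display_name", "Acme Narrator Avery");
form.append("language_primary", "en");
form.append("language_supported_json", '["en","en-GB"]');
form.append("gender_hint", "female");
form.append("age_hint", "adult");
form.append("consent_user_full_name", "Avery Johnson");
form.append("consent_stated_purpose", "narration for the Acme handbook audiobook");
form.append("consent_text", "My name is Avery Johnson. On 2026-05-17, I grant Acme...");

// Do not set Content-Type yourself; fetch adds the multipart boundary.
const res = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaispeak/voices`, {
  method: "POST",
  headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
  body: form,
});
if (!res.ok) {
  // A preflight rejection arrives here; the body carries the details.
  throw new Error(`create voice failed: ${res.status} ${await res.text()}`);
}
const { data: voice } = await res.json();
console.log(voice.voice_id, voice.embedding_status);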
```

If preflight rejects the audio, you get a 400 whose `preflight` block explains which threshold failed. Otherwise the voice lands at `embedding_status: ready` and is immediately usable.
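If you want to surface a preflight rejection rather than let a generic HTTP error hide it, inspect the status code before raising. The sketch below is hedged: this page does not show the exact shape of the error body, so `find_preflight` simply looks for a `preflight` key at the top level, under `error`, or under `data`. Those locations, and both helper names, are assumptions made for illustration.

```python
def find_preflight(body: dict):
    """Look for a preflight block in a few plausible places (assumed layout)."""
    for candidate in (body, body.get("error"), body.get("data")):
        if isinstance(candidate, dict) and "preflight" in candidate:
            return candidate["preflight"]
    return None


def describe_create_result(status_code: int, body: dict) -> str:
    """Turn a create-voice response into a one-line outcome."""
    if status_code == 400:
        preflight = find_preflight(body)
        if preflight is None:
            return "rejected (400) without a preflight block"
        return f"rejected by preflight: {preflight!r}"
    if 200 <= status_code < 300:
        # The success envelope matches the tutorial's own example: body["data"].
        voice = body["data"]
        return f"voice {voice['voice_id']} is {voice['embedding_status']}"
    return f"unexpected status {status_code}"
```

With `httpx`, that would look like `describe_create_result(r.status_code, r.json())`, called before `r.raise_for_status()`.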

## 4. Synthesise

Now use the voice the same way you'd use any voice in the library:

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/speak" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_id": "'"$VOICE_ID"'",
    "text": "Welcome to the Acme handbook. In this chapter we cover...",
    "response_format": "mp3"
  }' \
  | python -c "import sys,json,base64;\
b=json.load(sys.stdin)['data']['audio_base64'];\
open('chapter.mp3','wb').write(base64.b64decode(b))"
```

Play `chapter.mp3`. The voice should match the reference clip.
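The same call can be made from Python without the shell pipeline. This is a minimal sketch using only the standard library. The endpoint, the request fields, and the `data.audio_base64` response field are taken from the curl example above; the function names are hypothetical.

```python
import base64
import json
import os
import urllib.request


def build_speak_request(host: str, api_key: str, voice_id: str, text: str,
                        response_format: str = "mp3") -> urllib.request.Request:
    """Build the POST /speak request with the same JSON body as the curl call."""
    body = json.dumps({
        "voice_id": voice_id,
        "text": text,
        "response_format": response_format,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{host}/v1/modules/scaispeak/speak",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_audio(payload: dict, path: str) -> int:
    """Decode data.audio_base64 and write it to disk; returns bytes written."""
    audio = base64.b64decode(payload["data"]["audio_base64"])
    with open(path, "wb") as out:
        return out.write(audio)


def synthesise_to_file(voice_id: str, text: str, path: str) -> int:
    """Send the request and save the decoded audio (performs a network call)."""
    request = build_speak_request(
        os.environ["SCAIGRID_HOST"], os.environ["SCAIGRID_API_KEY"],
        voice_id, text,
    )
    with urllib.request.urlopen(request) as response:
        return save_audio(json.load(response), path)
```

For example, `synthesise_to_file(voice_id, "Welcome to the Acme handbook.", "chapter.mp3")` writes the same file as the curl pipeline.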

## 5. Tune the delivery (optional)

Three extra fields on `/speak` let you steer the cloned-voice output per call:

- `instructions` — free-text style guidance. The engine interprets it: `"cheerful and energetic"`, `"slowly and carefully"`, `"whispered"`, `"like reading a bedtime story"` all work.
- `cfg_value` — cloning fidelity vs naturalness tradeoff, 0.5 to 5.0. Higher stays closer to the reference at the cost of natural prosody; lower sounds more natural but drifts further from the source. Engine default ~2.0.
- `warmup_trim_ms` — strips the first N ms to absorb the engine's warm-up artefact at the start of cloned output. 150 ms is the recommended setting; 0 to disable.

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/speak" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice_id": "'"$VOICE_ID"'",
    "text": "Welcome to the Acme handbook.",
    "response_format": "wav",
    "instructions": "warm and reassuring, slower than normal",
    "cfg_value": 2.5,
    "warmup_trim_ms": 150
  }'
```

These fields are no-ops for preset speakers and for the managed-relay backend.
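If you build these requests in code, a small helper can catch out-of-range values before they reach the API. The sketch below enforces the 0.5 to 5.0 range stated above for `cfg_value` and rejects a negative `warmup_trim_ms`. The helper name is hypothetical, and how the server itself treats invalid values is not covered here.

```python
from typing import Optional


def tuned_speak_payload(voice_id: str, text: str, *,
                        instructions: Optional[str] = None,
                        cfg_value: Optional[float] = None,
                        warmup_trim_ms: Optional[int] = None,
                        response_format: str = "wav") -> dict:
    """Build a /speak JSON body, including only the tuning fields that are set."""
    payload = {
        "voice_id": voice_id,
        "text": text,
        "response_format": response_format,
    }
    if instructions:
        payload["instructions"] = instructions
    if cfg_value is not None:
        # Range taken from the field description above.
        if not 0.5 <= cfg_value <= 5.0:
            raise ValueError("cfg_value must be between 0.5 and 5.0")
        payload["cfg_value"] = cfg_value
    if warmup_trim_ms is not None:
        if warmup_trim_ms < 0:
            raise ValueError("warmup_trim_ms must be 0 or greater")
        payload["warmup_trim_ms"] = warmup_trim_ms
    return payload


body = tuned_speak_payload(
    "voice-id-goes-here",  # placeholder, not a real voice ID
    "Welcome to the Acme handbook.",
    instructions="warm and reassuring, slower than normal",
    cfg_value=2.5,
    warmup_trim_ms=150,
)
print(body)
```

Leaving a field unset omits it from the body, so the engine default applies.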

## 6. Share with your tenant (optional)

Other users in your tenant can't see this voice yet — it's scope `user`. Promote it to tenant scope:

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaispeak/voices/$VOICE_ID/share" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

Sharing needs `scaispeak:voice.share` in addition to the standard write permission. After this, every user in your tenant sees the voice in `GET /voices`.

## 7. When you're done with it

Right-to-erasure is built in. `DELETE /voices/{id}` clears the reference and consent blobs from storage, evicts any cached state across the deployment, and writes an immutable audit row.

```bash
curl -X DELETE "$SCAIGRID_HOST/v1/modules/scaispeak/voices/$VOICE_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

The response carries the `audit_id`. Keep it — it's the proof the deletion happened.
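One way to keep that proof is to append each `audit_id` to a local log as soon as the delete call returns. A minimal sketch: where `audit_id` sits in the response is not shown on this page, so `find_audit_id` checks the `data` envelope first and then the top level. Both locations, the helper names, and the `erasures.jsonl` file name are assumptions.

```python
import json
from datetime import datetime, timezone


def find_audit_id(payload: dict) -> str:
    """Return audit_id from the data envelope or the top level (assumed layout)."""
    data = payload.get("data")
    if isinstance(data, dict) and "audit_id" in data:
        return data["audit_id"]
    return payload["audit_id"]  # raises KeyError if the ID is missing entirely


def record_erasure(voice_id: str, payload: dict,
                   log_path: str = "erasures.jsonl") -> dict:
    """Append one JSON line per deletion so the audit trail survives locally."""
    entry = {
        "voice_id": voice_id,
        "audit_id": find_audit_id(payload),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

Pass the parsed JSON body of the `DELETE` response as `payload`. An append-only file is the simplest option; a database row serves the same purpose.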

## Done

You have a custom voice running through your tenant, with a recorded consent trail and a working erasure path. To iterate, repeat the cycle: re-record, create a new voice, and delete the old one.
