---
title: Audio
path: api-guides/audio
status: published
---

# Audio

Two audio capabilities, two endpoints: **transcription** (audio → text) and **synthesis** (text → audio).

## Transcription

Convert speech or recorded audio into text. Supports MP3, WAV, OGG, WebM, M4A, and a handful of less common formats.

**Endpoint:** `POST /v1/inference/audio/transcribe`

### Basic request

Transcription requests use `multipart/form-data` rather than JSON, because the audio file is uploaded as binary data.

```bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/audio/transcribe \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -F "model=openai/whisper-1" \
  -F "file=@recording.mp3" \
  -F "language=en"
```

```python
import httpx, os

with open("recording.mp3", "rb") as f:
    resp = httpx.post(
        "https://scaigrid.scailabs.ai/v1/inference/audio/transcribe",
        headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
        files={"file": ("recording.mp3", f, "audio/mpeg")},
        data={"model": "openai/whisper-1", "language": "en"},
        timeout=120,
    )
resp.raise_for_status()  # fail on HTTP errors before parsing the body
print(resp.json()["data"]["text"])
```

```typescript
import { readFile } from "node:fs/promises";

const form = new FormData();
form.append("model", "openai/whisper-1");
form.append("language", "en");
// Wrap the file bytes in a Blob so FormData sends them as a file part
form.append("file", new Blob([await readFile("recording.mp3")], { type: "audio/mpeg" }), "recording.mp3");

const resp = await fetch("https://scaigrid.scailabs.ai/v1/inference/audio/transcribe", {
  method: "POST",
  // Do not set Content-Type here: fetch adds the multipart boundary itself
  headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
  body: form,
});
if (!resp.ok) throw new Error(`Transcription failed: HTTP ${resp.status}`);
const { data } = await resp.json();
console.log(data.text);
```

### Parameters

| Field | Type | Notes |
|-------|------|-------|
| `model` | string (required) | Transcription model slug |
| `file` | binary (required) | Audio file upload |
| `language` | string | ISO-639-1 code — speeds up and improves accuracy when known |
| `temperature` | float | 0.0–1.0; higher values make decoding more random, so keep it low for faithful transcripts (rarely useful to raise) |
| `prompt` | string | Initial context to bias transcription (e.g. "The following is technical documentation about databases.") |
| `response_format` | string | `"text"` (default), `"json"`, `"verbose_json"`, `"srt"`, `"vtt"` |
| `timestamp_granularities` | array | Any of `"segment"`, `"word"`; only applies when `response_format` is `"verbose_json"` |
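
The optional fields travel as ordinary form fields next to the file upload. The sketch below shows one way to assemble them with basic validation against the ranges in the table. `build_transcription_fields` is a hypothetical helper, not part of any ScaiGrid SDK, and it leaves out `timestamp_granularities` because this page does not specify how the array is encoded in a multipart body.

```python
ALLOWED_FORMATS = {"text", "json", "verbose_json", "srt", "vtt"}


def build_transcription_fields(model, *, language=None, prompt=None,
                               temperature=None, response_format=None):
    """Return the non-file form fields for a transcription request.

    Hypothetical helper: validates against the ranges documented above
    and drops any option that was left unset.
    """
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    if response_format is not None and response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")

    fields = {"model": model}
    optional = {
        "language": language,
        "prompt": prompt,
        "temperature": temperature,
        "response_format": response_format,
    }
    # Form fields are strings on the wire; skip anything not provided
    fields.update({k: str(v) for k, v in optional.items() if v is not None})
    return fields
```

The returned dict can be passed as `data=` to `httpx.post(...)` alongside the `files=` argument from the basic request.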

### Response

```json
{
  "status": "ok",
  "data": {
    "text": "The meeting starts at 9 AM tomorrow.",
    "duration": 3.2,
    "language": "en"
  }
}
```

When `response_format` is `verbose_json`, the response also includes timestamps at the granularities requested in `timestamp_granularities` (per segment, per word, or both).
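
As a rough illustration of consuming segment timestamps, the sketch below renders each segment as a timestamped line. The exact `verbose_json` shape is not documented on this page: the `segments` key and its `start`, `end`, and `text` fields are assumptions modeled on OpenAI-style responses, so inspect a real response before relying on them.

```python
def format_segments(data):
    """Render segments as '[start -> end] text' lines.

    ASSUMPTION: `data` holds a "segments" list whose items carry
    "start" and "end" (seconds) and "text". Verify against a real response.
    """
    lines = []
    for seg in data.get("segments", []):
        lines.append(
            f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text'].strip()}"
        )
    return lines
```

With the JSON envelope shown above, this would be called as `format_segments(resp.json()["data"])`.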

### Best practices

- **Keep files under 25 MB** — the typical upload limit. Split longer recordings into chunks.
- **Pass `language` if you know it** — significantly faster and more accurate.
- **Use `prompt` for context** — helps with domain-specific terminology, uncommon names, technical vocabulary.
- **Consider SRT/VTT output** for video subtitles — ScaiGrid returns ready-to-use subtitle files.
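
For uncompressed WAV input, the "split into chunks" advice can be sketched with Python's standard `wave` module. This is a minimal illustration, not a recommended pipeline: `split_wav` and its `header_margin` parameter are hypothetical, the 25 MB default simply mirrors the limit mentioned above, and compressed formats such as MP3 need a proper audio tool instead.

```python
import os
import wave


def split_wav(path, max_bytes=25 * 1024 * 1024, header_margin=1024):
    """Split a WAV file into numbered parts that each stay under max_bytes.

    Hypothetical helper. header_margin reserves room for the WAV header.
    """
    base, _ = os.path.splitext(path)
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frame_size = params.nchannels * params.sampwidth  # bytes per frame
        frames_per_part = (max_bytes - header_margin) // frame_size
        if frames_per_part <= 0:
            raise ValueError("max_bytes is too small for even one frame")

        index = 0
        while True:
            frames = src.readframes(frames_per_part)
            if not frames:
                break
            out_path = f"{base}.part{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # same channels, width, and rate
                dst.writeframes(frames)  # header frame count is fixed up on write
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Cuts fall on frame boundaries rather than pauses, so a word can be split across two parts; cutting at silences or overlapping the chunks slightly would give cleaner transcripts.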

## Synthesis (text-to-speech)

Convert text into spoken audio. Multiple voices, multiple formats.

**Endpoint:** `POST /v1/inference/audio/synthesize`

### Basic request

```bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/audio/synthesize \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/tts-1",
    "input": "Welcome to ScaiGrid. Your request has been received.",
    "voice": "alloy",
    "response_format": "mp3",
    "speed": 1.0
  }' \
  --output welcome.mp3
```

```python
import httpx, os

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/audio/synthesize",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    json={
        "model": "openai/tts-1",
        "input": "Welcome to ScaiGrid. Your request has been received.",
        "voice": "alloy",
        "response_format": "mp3",
        "speed": 1.0,
    },
    timeout=60,
)
resp.raise_for_status()  # otherwise an error body would be saved as "audio"
# Response is raw audio bytes, not JSON
with open("welcome.mp3", "wb") as f:
    f.write(resp.content)
```

```typescript
import { writeFileSync } from "node:fs";

const resp = await fetch("https://scaigrid.scailabs.ai/v1/inference/audio/synthesize", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/tts-1",
    input: "Welcome to ScaiGrid. Your request has been received.",
    voice: "alloy",
    response_format: "mp3",
    speed: 1.0,
  }),
});
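// Check the status first so an error body is never written out as audio
if (!resp.ok) throw new Error(`Synthesis failed: HTTP ${resp.status}`);
writeFileSync("welcome.mp3", Buffer.from(await resp.arrayBuffer()));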
```

### Parameters

| Field | Type | Notes |
|-------|------|-------|
| `model` | string (required) | Synthesis model slug |
| `input` | string (required) | Text to synthesize |
| `voice` | string (required) | `"alloy"`, `"echo"`, `"fable"`, `"onyx"`, `"nova"`, `"shimmer"` for OpenAI TTS |
| `response_format` | string | `"mp3"` (default), `"opus"`, `"aac"`, `"flac"`, `"wav"`, `"pcm"` |
| `speed` | float | 0.25–4.0 — `1.0` is normal |

Unlike other endpoints, synthesis returns **raw audio bytes directly** — not a JSON envelope. The `Content-Type` response header tells you the format.
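
Because the body is raw bytes, the `Content-Type` header is the only signal of what came back. The sketch below picks a file extension from that header and refuses anything that is not audio. The MIME-type table is an assumption (this page does not list the exact header values ScaiGrid returns), and `audio_extension` is a hypothetical helper.

```python
# ASSUMED mapping from Content-Type to file extension; the header values
# your deployment actually returns may differ.
EXTENSION_BY_CONTENT_TYPE = {
    "audio/mpeg": ".mp3",
    "audio/opus": ".opus",
    "audio/ogg": ".opus",
    "audio/aac": ".aac",
    "audio/flac": ".flac",
    "audio/wav": ".wav",
    "audio/x-wav": ".wav",
    "audio/pcm": ".pcm",
}


def audio_extension(content_type):
    """Return a file extension for an audio Content-Type header value.

    Raises ValueError for anything not in the table, for example
    application/json, which probably signals an error body rather than audio.
    """
    # Drop parameters such as "; charset=..." and normalize case
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in EXTENSION_BY_CONTENT_TYPE:
        raise ValueError(f"unexpected Content-Type: {content_type!r}")
    return EXTENSION_BY_CONTENT_TYPE[media_type]
```

With the Python request above, `audio_extension(resp.headers["content-type"])` would give the suffix to use when writing `resp.content` to disk.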

### Voice options

For `openai/tts-1` and `openai/tts-1-hd`:

- `alloy` — neutral, balanced
- `echo` — warm, male-leaning
- `fable` — expressive, storytelling
- `onyx` — deep, authoritative
- `nova` — bright, upbeat
- `shimmer` — soft, gentle

Other providers (ElevenLabs, etc.) have their own voice vocabularies. Check your tenant's model list.

### Format selection

- `mp3` — best general-purpose, small size, wide compatibility
- `opus` — best for streaming/realtime, small size
- `flac` / `wav` — lossless, larger files, for editing pipelines
- `pcm` — raw 16-bit, 24kHz, little-endian — useful for ingesting into audio libraries directly
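
Raw `pcm` output has no header, so most players will not open it directly. A minimal sketch of wrapping it in a WAV container with the standard `wave` module follows. The 16-bit width and 24 kHz rate come from the list above; mono output is an assumption, and `pcm_to_wav` is a hypothetical helper.

```python
import wave


def pcm_to_wav(pcm_bytes, wav_path, sample_rate=24000, channels=1):
    """Wrap raw 16-bit little-endian PCM samples in a WAV container.

    ASSUMPTION: the audio is mono (channels=1); adjust if your model
    returns something else.
    """
    with wave.open(wav_path, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(2)            # 16-bit samples are 2 bytes each
        out.setframerate(sample_rate)  # 24 kHz per the format notes
        # wave expects native byte order, which matches on little-endian hosts
        out.writeframes(pcm_bytes)
```

For a synthesis request made with `"response_format": "pcm"`, calling `pcm_to_wav(resp.content, "welcome.wav")` would produce a playable file.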

## Multimodal chat with audio

The chat completions endpoint accepts audio input for models that support it (GPT-4o audio, Gemini multimodal). See [Chat Completions — Multimodal](./01-chat-completions.md#multimodal-content).

## What's next

- [Chat Completions](./01-chat-completions.md) — multimodal messages including audio input.
- [OpenAI Compatibility](./07-openai-compatibility.md) — `/oai/v1/audio/transcriptions` and `/oai/v1/audio/speech` work identically.
