Audio

Two audio capabilities, two endpoints: transcription (audio → text) and synthesis (text → audio).

Transcription

Convert speech or recorded audio into text. Supports MP3, WAV, OGG, WebM, M4A, and a handful of less common formats.

Endpoint: POST /v1/inference/audio/transcribe

Basic request

Transcription uses multipart/form-data (not JSON), because audio files are binary.

bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/audio/transcribe \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -F "model=openai/whisper-1" \
  -F "file=@recording.mp3" \
  -F "language=en"

python
import httpx, os

with open("recording.mp3", "rb") as f:
    resp = httpx.post(
        "https://scaigrid.scailabs.ai/v1/inference/audio/transcribe",
        headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
        files={"file": ("recording.mp3", f, "audio/mpeg")},
        data={"model": "openai/whisper-1", "language": "en"},
        timeout=120,
    )
resp.raise_for_status()  # fail on HTTP errors before parsing the body
print(resp.json()["data"]["text"])

typescript
import { readFile } from "node:fs/promises";

// Build the multipart body; fetch sets the Content-Type boundary itself.
const form = new FormData();
form.append("model", "openai/whisper-1");
form.append("language", "en");
form.append("file", new Blob([await readFile("recording.mp3")], { type: "audio/mpeg" }), "recording.mp3");

const resp = await fetch("https://scaigrid.scailabs.ai/v1/inference/audio/transcribe", {
  method: "POST",
  headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
  body: form,
});
if (!resp.ok) throw new Error(`Transcription failed: HTTP ${resp.status}`);
const { data } = await resp.json();
console.log(data.text);

Parameters

Field	Type	Notes
`model`	string (required)	Transcription model slug
`file`	binary (required)	Audio file upload
`language`	string	ISO-639-1 code — speeds up and improves accuracy when known
`temperature`	float	0.0–1.0; higher values make the output more random (rarely useful for transcription)
`prompt`	string	Initial context to bias transcription (e.g. "The following is technical documentation about databases.")
`response_format`	string	`"text"` (default), `"json"`, `"verbose_json"`, `"srt"`, `"vtt"`
`timestamp_granularities`	array	`["segment", "word"]` for verbose_json

Response

json
{
  "status": "ok",
  "data": {
    "text": "The meeting starts at 9 AM tomorrow.",
    "duration": 3.2,
    "language": "en"
  }
}

When response_format is "verbose_json", the response also includes per-segment and per-word timestamps, as selected by timestamp_granularities.
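A minimal sketch of working with those timestamps. It assumes the verbose_json payload exposes a `segments` list inside `data`, where each item carries `start` and `end` (in seconds) and `text`. That shape matches Whisper-style verbose output, but this page does not confirm it, so check a real response first. The helper names `format_timestamp` and `segment_lines` are illustrative only.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def segment_lines(data: dict) -> list[str]:
    """Turn the (assumed) data["segments"] list into printable lines."""
    lines = []
    for seg in data.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['text'].strip()}")
    return lines


# Hand-written sample in the assumed shape, not a captured API response.
sample = {
    "text": "The meeting starts at 9 AM tomorrow.",
    "segments": [
        {"start": 0.0, "end": 3.2, "text": " The meeting starts at 9 AM tomorrow."},
    ],
}

for line in segment_lines(sample):
    print(line)
```

In a real call, `sample` would be `resp.json()["data"]` from a request sent with `response_format` set to `"verbose_json"`.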

Best practices

Keep files under 25 MB, the typical upload limit. Split longer recordings into chunks.
Pass language if you know it. Transcription is faster and more accurate when the language is given.
Use prompt for context. It helps with domain-specific terminology, uncommon names, and technical vocabulary.
Consider SRT or VTT output for video subtitles. ScaiGrid returns ready-to-use subtitle files.
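The chunking advice above can be sketched with Python's standard `wave` module. This is one possible approach, not a ScaiGrid utility: it handles uncompressed WAV input only, and the names `split_wav`, `make_silent_wav`, and `WAV_HEADER_ALLOWANCE` are made up for this example. Each chunk is a complete WAV file that fits under the given size limit.

```python
import io
import wave

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # the 25 MB limit mentioned above
WAV_HEADER_ALLOWANCE = 1024          # generous headroom for the WAV header


def split_wav(src, max_bytes: int = MAX_UPLOAD_BYTES) -> list[bytes]:
    """Split a WAV file (path or binary file object) into standalone WAV chunks.

    Each returned chunk is a complete WAV file no larger than max_bytes.
    """
    chunks: list[bytes] = []
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frame_size = params.nchannels * params.sampwidth  # bytes per frame
        frames_per_chunk = (max_bytes - WAV_HEADER_ALLOWANCE) // frame_size
        if frames_per_chunk <= 0:
            raise ValueError("max_bytes is too small to hold any audio")
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as writer:
                writer.setparams(params)    # same channels, sample width, rate
                writer.writeframes(frames)  # header is patched to the real length
            chunks.append(buf.getvalue())
    return chunks


def make_silent_wav(seconds: float, rate: int = 8000) -> bytes:
    """Build a mono 16-bit silent WAV in memory, used here only as demo input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


# Demo with a tiny limit so the split is visible without a large file.
demo = make_silent_wav(1.0)
parts = split_wav(io.BytesIO(demo), max_bytes=5000)
print(len(parts), [len(p) for p in parts])
```

Fixed-size cuts can land in the middle of a word, so a production splitter would look for silence near each boundary. Compressed formats such as MP3 or M4A need an external tool to split.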

Synthesis (text-to-speech)

Convert text into spoken audio. Multiple voices, multiple formats.

Endpoint: POST /v1/inference/audio/synthesize

Basic request

bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/audio/synthesize \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/tts-1",
    "input": "Welcome to ScaiGrid. Your request has been received.",
    "voice": "alloy",
    "response_format": "mp3",
    "speed": 1.0
  }' \
  --output welcome.mp3

python
import httpx, os

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/audio/synthesize",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    json={
        "model": "openai/tts-1",
        "input": "Welcome to ScaiGrid. Your request has been received.",
        "voice": "alloy",
        "response_format": "mp3",
        "speed": 1.0,
    },
    timeout=60,
)
resp.raise_for_status()  # otherwise an error body would be saved as audio
# Response is raw audio bytes, not JSON
with open("welcome.mp3", "wb") as f:
    f.write(resp.content)

typescript
import { writeFileSync } from "node:fs";

const resp = await fetch("https://scaigrid.scailabs.ai/v1/inference/audio/synthesize", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/tts-1",
    input: "Welcome to ScaiGrid. Your request has been received.",
    voice: "alloy",
    response_format: "mp3",
    speed: 1.0,
  }),
});
if (!resp.ok) throw new Error(`Synthesis failed: HTTP ${resp.status}`);
writeFileSync("welcome.mp3", Buffer.from(await resp.arrayBuffer()));

Parameters

Field	Type	Notes
`model`	string (required)	Synthesis model slug
`input`	string (required)	Text to synthesize
`voice`	string (required)	`"alloy"`, `"echo"`, `"fable"`, `"onyx"`, `"nova"`, `"shimmer"` for OpenAI TTS
`response_format`	string	`"mp3"` (default), `"opus"`, `"aac"`, `"flac"`, `"wav"`, `"pcm"`
`speed`	float	0.25–4.0 — `1.0` is normal
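As a sketch, the constraints in this table can be checked on the client before a request is sent. The helper `build_synthesis_payload` is hypothetical. It validates only what the table states (format names and the speed range), plus one assumption: that an empty `input` should be rejected. Voices are not validated because they differ per provider.

```python
ALLOWED_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def build_synthesis_payload(
    model: str,
    text: str,
    voice: str,
    response_format: str = "mp3",
    speed: float = 1.0,
) -> dict:
    """Return a JSON-ready request body, raising ValueError on bad values."""
    if not text:
        # Assumption: the API would reject empty input anyway.
        raise ValueError("input must not be empty")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }


payload = build_synthesis_payload("openai/tts-1", "Welcome to ScaiGrid.", "alloy")
print(payload)
```

The returned dict can be passed as the `json=` argument of the `httpx.post` call shown earlier.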

Unlike other endpoints, synthesis returns raw audio bytes directly — not a JSON envelope. The Content-Type response header tells you the format.
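One way to use that header is to choose a file extension from it. The sketch below is illustrative: the MIME types in the table are common values for these formats, not a confirmed list of what ScaiGrid returns, and `extension_for` is a made-up helper name.

```python
# Assumed MIME types; verify against the headers your deployment returns.
EXTENSION_BY_MIME = {
    "audio/mpeg": ".mp3",
    "audio/opus": ".opus",
    "audio/ogg": ".opus",
    "audio/aac": ".aac",
    "audio/flac": ".flac",
    "audio/wav": ".wav",
    "audio/x-wav": ".wav",
    "audio/pcm": ".pcm",
}


def extension_for(content_type: str, default: str = ".bin") -> str:
    """Map a Content-Type header value to a file extension."""
    # Drop parameters such as "; charset=..." and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return EXTENSION_BY_MIME.get(mime, default)


print(extension_for("audio/mpeg"))
print(extension_for("Audio/WAV; rate=24000"))
print(extension_for("application/json"))
```

With httpx, the header value is available as `resp.headers["content-type"]`. A non-audio type most likely means the body is an error message rather than audio, so it is worth checking before the file is written.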

Voice options

For openai/tts-1 and openai/tts-1-hd:

alloy — neutral, balanced
echo — warm, male-leaning
fable — expressive, storytelling
onyx — deep, authoritative
nova — bright, upbeat
shimmer — soft, gentle

Other providers (ElevenLabs, etc.) have their own voice vocabularies. Check your tenant's model list.

Format selection

mp3 — best general-purpose, small size, wide compatibility
opus — best for streaming/realtime, small size
flac / wav — lossless, larger files, for editing pipelines
pcm — raw 16-bit, 24 kHz, little-endian samples; useful for feeding directly into audio libraries
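Because pcm output has no header, most players cannot open it as is. A minimal sketch that wraps the raw samples in a WAV container using the standard `wave` module follows. The 16-bit width and 24 kHz rate come from the list above. Mono is an assumption, since the channel count is not stated here. `pcm_to_wav` is an illustrative name.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24_000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(channels)  # assumption: mono unless told otherwise
        writer.setsampwidth(2)         # 2 bytes per sample = 16-bit
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)        # WAV stores 16-bit samples little-endian too
    return buf.getvalue()


# One second of silence stands in for resp.content from a pcm request.
pcm_bytes = b"\x00\x00" * 24_000
wav_bytes = pcm_to_wav(pcm_bytes)

with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
    print(reader.getnchannels(), reader.getsampwidth(), reader.getframerate(), reader.getnframes())
```

No resampling or byte swapping is needed, because WAV uses the same sample layout. Only the header is added.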

Multimodal chat with audio

The chat completions endpoint accepts audio input for models that support it (GPT-4o audio, Gemini multimodal). See Chat Completions — Multimodal.

What's next

Chat Completions — multimodal messages including audio input.
OpenAI Compatibility — /oai/v1/audio/transcriptions and /oai/v1/audio/speech work identically.