Audio
Two audio capabilities, two endpoints: transcription (audio → text) and synthesis (text → audio).
Transcription#
Convert speech or recorded audio into text. Supports MP3, WAV, OGG, WebM, M4A, and a handful of less common formats.
Endpoint: POST /v1/inference/audio/transcribe
Basic request#
Transcription uses multipart/form-data (not JSON), because audio files are binary.
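A minimal sketch of such a multipart upload, using only the Python standard library. The base URL, the bearer-token `Authorization` header, and the helper names `encode_multipart` and `build_transcription_request` are assumptions for illustration; only the endpoint path and the `model` and `file` field names come from this page.

```python
import io
import urllib.request
import uuid

# Placeholders (assumptions): substitute your tenant's base URL and key.
BASE_URL = "https://scaigrid.example.com"
API_KEY = "YOUR_API_KEY"


def encode_multipart(fields: dict[str, str], filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one binary part named "file" as multipart/form-data."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(f"{value}\r\n".encode())
    # The audio goes in as raw bytes, which is why JSON is not used here.
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode())
    buf.write(b"Content-Type: application/octet-stream\r\n\r\n")
    buf.write(audio)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(audio: bytes, filename: str, model: str,
                                **extra: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the transcription endpoint."""
    body, content_type = encode_multipart({"model": model, **extra}, filename, audio)
    return urllib.request.Request(
        f"{BASE_URL}/v1/inference/audio/transcribe",
        data=body,
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": content_type},
        method="POST",
    )


# Sending performs a real network call, so it is left commented out:
# with open("meeting.mp3", "rb") as fh:
#     req = build_transcription_request(fh.read(), "meeting.mp3", "your-model-slug")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Building the request separately from sending it makes the encoded body easy to inspect before anything goes over the wire.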
Parameters#
| Field | Type | Notes |
|---|---|---|
| `model` | string (required) | Transcription model slug |
| `file` | binary (required) | Audio file upload |
| `language` | string | ISO-639-1 code; speeds up transcription and improves accuracy when known |
| `temperature` | float | 0.0–1.0; higher = more creative transcription (rarely useful) |
| `prompt` | string | Initial context to bias transcription (e.g. "The following is technical documentation about databases.") |
| `response_format` | string | "text" (default), "json", "verbose_json", "srt", "vtt" |
| `timestamp_granularities` | array | ["segment", "word"] for verbose_json |
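The constraints in the table above can be checked on the client before uploading. The sketch below is one way to do that; `transcription_fields` is a hypothetical helper, not part of any SDK. It leaves out `timestamp_granularities` because this page does not say how an array is encoded in the form.

```python
# Allowed values copied from the parameter table.
RESPONSE_FORMATS = {"text", "json", "verbose_json", "srt", "vtt"}


def transcription_fields(model, language=None, temperature=None,
                         prompt=None, response_format="text"):
    """Validate options and return the text form fields as strings."""
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    if language is not None and (len(language) != 2 or not language.isalpha()):
        raise ValueError("language must be a two-letter ISO-639-1 code")

    # Form fields are always sent as text, so numbers are stringified.
    fields = {"model": model, "response_format": response_format}
    if language is not None:
        fields["language"] = language.lower()
    if temperature is not None:
        fields["temperature"] = str(temperature)
    if prompt is not None:
        fields["prompt"] = prompt
    return fields


fields = transcription_fields("your-model-slug", language="EN", response_format="srt")
print(fields)
```

Optional fields are omitted rather than sent empty, so the server-side defaults stay in effect.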
Response#
With response_format set to "verbose_json", the response additionally carries per-segment and per-word timestamps.
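The exact response schema is not reproduced on this page. As an illustration only, the sketch below assumes the verbose_json payload mirrors the OpenAI shape (top-level `text`, `segments`, and `words` keys); the sample values are made up, and `words_between` is a hypothetical helper. Verify the field names against a real response before relying on them.

```python
import json

# Hypothetical payload: the shape is an assumption and the values are invented.
SAMPLE = """
{
  "text": "hello world",
  "segments": [{"start": 0.0, "end": 1.2, "text": "hello world"}],
  "words": [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.6, "end": 1.2}
  ]
}
"""


def words_between(payload: dict, start: float, end: float) -> list[str]:
    """Return the words whose time span lies entirely inside [start, end] seconds."""
    return [
        w["word"]
        for w in payload.get("words", [])
        if w["start"] >= start and w["end"] <= end
    ]


payload = json.loads(SAMPLE)
print(words_between(payload, 0.0, 0.5))  # ['hello']
```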
Best practices#
- Keep files under 25 MB, the typical upload limit. Split longer recordings into chunks.
- Pass `language` if you know it: transcription is significantly faster and more accurate.
- Use `prompt` for context: it helps with domain-specific terminology, uncommon names, and technical vocabulary.
- Consider SRT/VTT output for video subtitles: ScaiGrid returns ready-to-use subtitle files.
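Splitting a recording at arbitrary byte offsets corrupts compressed formats, so chunking has to respect the container. The sketch below handles the simplest case, uncompressed WAV, with the standard-library `wave` module. `split_wav` is a hypothetical helper, and the default reads 25 MB as 25,000,000 bytes, which is the conservative interpretation.

```python
import io
import wave

WAV_HEADER_BYTES = 44  # size of the PCM WAV header written by the wave module


def split_wav(data: bytes, max_bytes: int = 25_000_000) -> list[bytes]:
    """Split WAV bytes into standalone WAV files of at most max_bytes each."""
    chunks = []
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        frame_size = params.sampwidth * params.nchannels
        frames_per_chunk = (max_bytes - WAV_HEADER_BYTES) // frame_size
        if frames_per_chunk <= 0:
            raise ValueError("max_bytes is too small to hold a single frame")
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out = io.BytesIO()
            with wave.open(out, "wb") as dst:
                # Same format as the source; the frame count in the header
                # is corrected automatically when the writer closes.
                dst.setparams(params)
                dst.writeframes(frames)
            chunks.append(out.getvalue())
    return chunks
```

Each chunk is a complete WAV file and can be uploaded on its own. Cuts land on frame boundaries rather than pauses, so a word may be split across two chunks; for MP3, M4A, and other compressed formats, use an audio tool that can cut on frame or silence boundaries instead.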
Synthesis (text-to-speech)#
Convert text into spoken audio. Multiple voices, multiple formats.
Endpoint: POST /v1/inference/audio/synthesize
Basic request#
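A minimal sketch of a synthesis request with the Python standard library. It assumes the endpoint takes a JSON body (this page only states that transcription is the multipart exception) and bearer-token authentication; the base URL, API key, and `build_synthesis_request` helper are placeholders. The field names, the `openai/tts-1` slug, and the speed range are taken from this page.

```python
import json
import urllib.request

# Placeholders (assumptions): substitute your tenant's base URL and key.
BASE_URL = "https://scaigrid.example.com"
API_KEY = "YOUR_API_KEY"


def build_synthesis_request(text: str, voice: str = "alloy", model: str = "openai/tts-1",
                            response_format: str = "mp3",
                            speed: float = 1.0) -> urllib.request.Request:
    """Build (but do not send) a POST request for the synthesis endpoint."""
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/inference/audio/synthesize",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        method="POST",
    )


# Sending performs a real network call, so it is left commented out:
# req = build_synthesis_request("Hello from ScaiGrid.", voice="nova")
# with urllib.request.urlopen(req) as resp:
#     audio_bytes = resp.read()
```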
Parameters#
| Field | Type | Notes |
|---|---|---|
| `model` | string (required) | Synthesis model slug |
| `input` | string (required) | Text to synthesize |
| `voice` | string (required) | "alloy", "echo", "fable", "onyx", "nova", "shimmer" for OpenAI TTS |
| `response_format` | string | "mp3" (default), "opus", "aac", "flac", "wav", "pcm" |
| `speed` | float | 0.25–4.0; 1.0 is normal |
Unlike other endpoints, synthesis returns raw audio bytes directly — not a JSON envelope. The Content-Type response header tells you the format.
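Because the body is raw audio, a client typically writes it straight to disk and picks the file extension from the Content-Type header. The sketch below shows one way to do that. The MIME types in the mapping are common conventions, not values confirmed for this service, and `extension_for` and `save_audio` are hypothetical helpers.

```python
# Assumed mapping from Content-Type to file extension; check the headers
# your deployment actually returns and adjust as needed.
EXTENSIONS = {
    "audio/mpeg": ".mp3",
    "audio/opus": ".opus",
    "audio/ogg": ".opus",
    "audio/aac": ".aac",
    "audio/flac": ".flac",
    "audio/wav": ".wav",
    "audio/x-wav": ".wav",
    "audio/pcm": ".pcm",
}


def extension_for(content_type: str, default: str = ".bin") -> str:
    """Map a Content-Type header value to a file extension, ignoring parameters."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return EXTENSIONS.get(media_type, default)


def save_audio(audio: bytes, content_type: str, stem: str) -> str:
    """Write the response bytes unchanged and return the path that was used."""
    path = stem + extension_for(content_type)
    with open(path, "wb") as fh:  # binary mode: the body is not text
        fh.write(audio)
    return path
```

An unrecognised Content-Type falls back to `.bin` instead of guessing, which makes an unexpected response, such as a JSON error body, easy to spot.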
Voice options#
For openai/tts-1 and openai/tts-1-hd:
- `alloy`: neutral, balanced
- `echo`: warm, male-leaning
- `fable`: expressive, storytelling
- `onyx`: deep, authoritative
- `nova`: bright, upbeat
- `shimmer`: soft, gentle
Other providers (ElevenLabs, etc.) have their own voice vocabularies. Check your tenant's model list.
Format selection#
- `mp3`: best general-purpose choice, small size, wide compatibility
- `opus`: best for streaming/realtime, small size
- `flac` / `wav`: lossless, larger files, for editing pipelines
- `pcm`: raw 16-bit, 24 kHz, little-endian; useful for ingesting into audio libraries directly
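Raw PCM has no header, so a player cannot tell its sample rate or bit depth. The sketch below wraps PCM output in a WAV container with the standard-library `wave` module, using the 16-bit, 24 kHz, little-endian layout stated above. The channel count is not given on this page, so mono is an assumption, and `pcm_to_wav` is a hypothetical helper.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM samples in a WAV container."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(channels)   # assumption: mono output
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)         # WAV is little-endian too, so bytes are copied as-is
    return out.getvalue()


# One second of silence at 24 kHz, as a stand-in for real synthesis output.
wav_bytes = pcm_to_wav(b"\x00\x00" * 24000)
print(len(wav_bytes))  # 48044: 48000 sample bytes plus a 44-byte header
```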
Multimodal chat with audio#
The chat completions endpoint accepts audio input for models that support it (GPT-4o audio, Gemini multimodal). See Chat Completions — Multimodal.
What's next#
- Chat Completions: multimodal messages including audio input.
- OpenAI Compatibility: `/oai/v1/audio/transcriptions` and `/oai/v1/audio/speech` work identically.