Audio

Two audio capabilities, two endpoints: transcription (audio → text) and synthesis (text → audio).

Transcription

Convert speech or recorded audio into text. Supports MP3, WAV, OGG, WebM, M4A, and a handful of less common formats.

Endpoint: POST /v1/inference/audio/transcribe

Basic request

Transcription uses multipart/form-data (not JSON), because audio files are binary.

bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/audio/transcribe \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -F "model=openai/whisper-1" \
  -F "file=@recording.mp3" \
  -F "language=en"
python
import httpx, os

with open("recording.mp3", "rb") as f:
    resp = httpx.post(
        "https://scaigrid.scailabs.ai/v1/inference/audio/transcribe",
        headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
        files={"file": ("recording.mp3", f, "audio/mpeg")},
        data={"model": "openai/whisper-1", "language": "en"},
        timeout=120,
    )
print(resp.json()["data"]["text"])
typescript
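import { readFile } from "node:fs/promises";

// Build the multipart body; fetch sets the Content-Type boundary itself.
const form = new FormData();
form.append("model", "openai/whisper-1");
form.append("language", "en");
// Read the file with Node's fs API so the example does not depend on Bun.
form.append("file", new Blob([await readFile("recording.mp3")], { type: "audio/mpeg" }), "recording.mp3");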

const resp = await fetch("https://scaigrid.scailabs.ai/v1/inference/audio/transcribe", {
  method: "POST",
  headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
  body: form,
});
const { data } = await resp.json();
console.log(data.text);

Parameters

| Field | Type | Notes |
|---|---|---|
| model | string (required) | Transcription model slug |
| file | binary (required) | Audio file upload |
| language | string | ISO 639-1 code; speeds up transcription and improves accuracy when known |
| temperature | float | 0.0–1.0; higher = more creative transcription (rarely useful) |
| prompt | string | Initial context to bias transcription (e.g. "The following is technical documentation about databases.") |
| response_format | string | "text" (default), "json", "verbose_json", "srt", "vtt" |
| timestamp_granularities | array | ["segment", "word"] for verbose_json |

Response

json
{
  "status": "ok",
  "data": {
    "text": "The meeting starts at 9 AM tomorrow.",
    "duration": 3.2,
    "language": "en"
  }
}

When response_format is "verbose_json", the response also includes timestamps, per segment and per word.
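
The envelope above can be unpacked defensively before using the transcript. The following is a minimal sketch, assuming the {"status", "data"} shape shown in the example; the helper name extract_transcript is hypothetical, and how failures are reported is not documented here, so treating any status other than "ok" as an error is an assumption.

```python
from typing import Any


def extract_transcript(envelope: dict[str, Any]) -> str:
    """Return data.text from a response envelope shaped like the example above.

    Assumption: any status other than "ok" means the request failed.
    """
    status = envelope.get("status")
    if status != "ok":
        raise ValueError(f"transcription failed: status={status!r}")
    data = envelope.get("data") or {}
    if "text" not in data:
        raise ValueError("envelope has no data.text field")
    return data["text"]


sample = {
    "status": "ok",
    "data": {
        "text": "The meeting starts at 9 AM tomorrow.",
        "duration": 3.2,
        "language": "en",
    },
}
print(extract_transcript(sample))
```

In the Python request example, this would be called as extract_transcript(resp.json()).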

Best practices

  • Keep files under 25 MB — the typical upload limit. Split longer recordings into chunks.
  • Pass language if you know it — significantly faster and more accurate.
  • Use prompt for context — helps with domain-specific terminology, uncommon names, technical vocabulary.
  • Consider SRT/VTT output for video subtitles — ScaiGrid returns ready-to-use subtitle files.
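
Splitting a long recording into chunks might look like the sketch below. It handles uncompressed WAV only, using Python's standard wave module; compressed formats such as MP3 cannot be cut at arbitrary byte offsets and need an audio tool instead. The split_wav helper and the 24 MB default (chosen to leave headroom under the 25 MB guideline) are assumptions for illustration.

```python
import io
import wave

WAV_HEADER_BYTES = 44  # size of the header the wave module writes for PCM data


def split_wav(src: str, max_bytes: int = 24 * 1024 * 1024) -> list[bytes]:
    """Split a WAV file into standalone WAV chunks no larger than max_bytes."""
    chunks: list[bytes] = []
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frame_size = params.sampwidth * params.nchannels
        # Whole frames only, leaving room for each chunk's own header.
        frames_per_chunk = max(1, (max_bytes - WAV_HEADER_BYTES) // frame_size)
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as writer:
                writer.setparams(params)
                # The header's frame count is corrected when the writer closes.
                writer.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

Each returned chunk is a complete WAV file that can be uploaded on its own. Cutting at fixed sizes can split a word across two chunks, so splitting at silences gives cleaner transcripts where that matters.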

Synthesis (text-to-speech)

Convert text into spoken audio. Multiple voices, multiple formats.

Endpoint: POST /v1/inference/audio/synthesize

Basic request

bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/audio/synthesize \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/tts-1",
    "input": "Welcome to ScaiGrid. Your request has been received.",
    "voice": "alloy",
    "response_format": "mp3",
    "speed": 1.0
  }' \
  --output welcome.mp3
python
import httpx, os

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/audio/synthesize",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    json={
        "model": "openai/tts-1",
        "input": "Welcome to ScaiGrid. Your request has been received.",
        "voice": "alloy",
        "response_format": "mp3",
        "speed": 1.0,
    },
    timeout=60,
)
resp.raise_for_status()  # fail early instead of writing an error body to disk
# Response is raw audio bytes, not JSON
with open("welcome.mp3", "wb") as f:
    f.write(resp.content)
typescript
import { writeFileSync } from "node:fs";

const resp = await fetch("https://scaigrid.scailabs.ai/v1/inference/audio/synthesize", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/tts-1",
    input: "Welcome to ScaiGrid. Your request has been received.",
    voice: "alloy",
    response_format: "mp3",
    speed: 1.0,
  }),
});
writeFileSync("welcome.mp3", Buffer.from(await resp.arrayBuffer()));

Parameters

| Field | Type | Notes |
|---|---|---|
| model | string (required) | Synthesis model slug |
| input | string (required) | Text to synthesize |
| voice | string (required) | "alloy", "echo", "fable", "onyx", "nova", "shimmer" for OpenAI TTS |
| response_format | string | "mp3" (default), "opus", "aac", "flac", "wav", "pcm" |
| speed | float | 0.25–4.0; 1.0 is normal |

Unlike other endpoints, synthesis returns raw audio bytes directly — not a JSON envelope. The Content-Type response header tells you the format.
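
Because the format is carried in the Content-Type header, a client can pick the file extension from the response instead of hard-coding it. The sketch below is illustrative only: the MIME types in the table are assumptions (this page does not list the exact values ScaiGrid returns), and extension_for is a hypothetical helper.

```python
# Assumed MIME types; verify them against real responses from your tenant.
EXTENSIONS = {
    "audio/mpeg": ".mp3",
    "audio/opus": ".opus",
    "audio/ogg": ".opus",
    "audio/aac": ".aac",
    "audio/flac": ".flac",
    "audio/wav": ".wav",
    "audio/x-wav": ".wav",
    "audio/pcm": ".pcm",
}


def extension_for(content_type: str, default: str = ".bin") -> str:
    """Map a Content-Type header value to a file extension."""
    # Drop parameters such as "; rate=24000" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return EXTENSIONS.get(media_type, default)


print(extension_for("audio/mpeg"))             # .mp3
print(extension_for("Audio/WAV; rate=24000"))  # .wav
```

With the Python request example, the output path could then be built as "welcome" + extension_for(resp.headers["content-type"]).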

Voice options

For openai/tts-1 and openai/tts-1-hd:

  • alloy — neutral, balanced
  • echo — warm, male-leaning
  • fable — expressive, storytelling
  • onyx — deep, authoritative
  • nova — bright, upbeat
  • shimmer — soft, gentle

Other providers (ElevenLabs, etc.) have their own voice vocabularies. Check your tenant's model list.
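
Checking voice and speed before sending a request catches typos locally. This is a rough sketch using the documented voice list and speed range; build_synthesis_payload is a hypothetical helper, and applying the voice check only to model slugs starting with "openai/tts-1" is an assumption, since other providers define their own voices.

```python
OPENAI_TTS_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}


def build_synthesis_payload(
    text: str,
    voice: str = "alloy",
    speed: float = 1.0,
    model: str = "openai/tts-1",
    response_format: str = "mp3",
) -> dict:
    """Build a synthesis request body, rejecting out-of-range values."""
    if not text:
        raise ValueError("input text must not be empty")
    # Only the OpenAI TTS voices are listed here; skip the check for other models.
    if model.startswith("openai/tts-1") and voice not in OPENAI_TTS_VOICES:
        raise ValueError(f"unknown voice {voice!r} for {model}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }


payload = build_synthesis_payload("Welcome to ScaiGrid.", voice="nova", speed=1.25)
```

The resulting dictionary can be passed as the json argument in the Python request example.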

Format selection

  • mp3 — best general-purpose, small size, wide compatibility
  • opus — best for streaming/realtime, small size
  • flac / wav — lossless, larger files, for editing pipelines
  • pcm — raw 16-bit, 24kHz, little-endian — useful for ingesting into audio libraries directly
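
Raw pcm output has no header, so most players will not open it directly. One option is to wrap it in a WAV container, as in the sketch below. The sample width, sample rate, and byte order follow the description above; the channel count is not stated on this page, so mono is an assumption, and pcm_to_wav is a hypothetical helper.

```python
import struct


def pcm_to_wav(pcm: bytes, sample_rate: int = 24_000, channels: int = 1) -> bytes:
    """Prefix raw 16-bit little-endian PCM with a 44-byte WAV header."""
    bytes_per_frame = 2 * channels  # 16-bit samples
    if len(pcm) % bytes_per_frame:
        raise ValueError("PCM length is not a whole number of frames")
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(pcm), b"WAVE",
        b"fmt ", 16,                    # size of the fmt chunk
        1,                              # audio format 1 = uncompressed PCM
        channels,
        sample_rate,
        sample_rate * bytes_per_frame,  # byte rate
        bytes_per_frame,                # block align
        16,                             # bits per sample
        b"data", len(pcm),
    )
    return header + pcm
```

The result can be written to a .wav file or handed to any library that reads WAV from memory.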

Multimodal chat with audio

The chat completions endpoint accepts audio input for models that support it (GPT-4o audio, Gemini multimodal). See Chat Completions — Multimodal.

Updated 2026-05-18 15:01:28 (rev 17)