Quickstart
In five minutes you'll have a transcript of a recorded file and a live captioning session running over a WebSocket.
You need:
- A ScaiGrid API key with the scaiecho:transcribe permission (any tenant admin has this).
- A short audio file (.wav, .mp3, .flac, .ogg, or .m4a), under 5 MiB to keep this synchronous.
- A terminal that can speak WebSocket if you want to follow step 4 (websocat works, so does Python's websockets library).
| export SCAIGRID_HOST="https://scaigrid.scailabs.ai"
export SCAIGRID_API_KEY="sgk_..."
|
1. Check your tenant policy
Tenant policy decides whether your transcription runs on a self-hosted STT node (Backend A) or a managed STT relay (Backend B). Most tenants allow both backends, with B as the default.
| curl "$SCAIGRID_HOST/v1/modules/scaiecho/tenant-policy" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
A 403 here means your key lacks scaiecho:admin, which is required to read the policy. You can still transcribe; the policy applies automatically.
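The same check can be scripted so that a 403 is treated as "policy not readable" rather than a failure. The following is a minimal sketch using only the Python standard library; the helper names `build_policy_request` and `fetch_policy` are illustrative, and the shape of the policy body is not assumed beyond it being JSON.

```python
import json
import os
import urllib.error
import urllib.request

POLICY_PATH = "/v1/modules/scaiecho/tenant-policy"


def build_policy_request(host: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for the tenant-policy endpoint (hypothetical helper)."""
    return urllib.request.Request(
        host.rstrip("/") + POLICY_PATH,
        headers={"Authorization": f"Bearer {api_key}"},
    )


def fetch_policy(host: str, api_key: str):
    """Return the decoded JSON body, or None when the key cannot read policy (403)."""
    try:
        with urllib.request.urlopen(build_policy_request(host, api_key), timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 403:
            # The key lacks scaiecho:admin; transcription still works under the policy.
            return None
        raise  # Any other HTTP error is unexpected and should surface.


# Only touch the network when the quickstart environment variables are set.
if __name__ == "__main__" and "SCAIGRID_HOST" in os.environ:
    policy = fetch_policy(os.environ["SCAIGRID_HOST"], os.environ["SCAIGRID_API_KEY"])
    print(policy if policy is not None else "Policy not readable with this key; continuing.")
```

Returning None for the 403 case lets a script carry on to step 2 without special handling.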
2. Transcribe a file
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaiecho/transcribe" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-F "file=@meeting.wav" \
-F "language_hint=en" \
-F "backend_preference=any"
|
| import httpx, os

# Upload the recording as multipart form data; the long timeout covers slow uploads.
with open("meeting.wav", "rb") as f:
    resp = httpx.post(
        f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaiecho/transcribe",
        headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
        files={"file": ("meeting.wav", f, "audio/wav")},
        data={"language_hint": "en", "backend_preference": "any"},
        timeout=120.0,
    )
resp.raise_for_status()  # surface auth or validation errors before reading the body
print(resp.json()["data"]["transcript"])
|
| import fs from "node:fs";

// Build the multipart form; fetch, FormData and Blob are globals in Node 18+.
const form = new FormData();
form.append("file", new Blob([fs.readFileSync("meeting.wav")]), "meeting.wav");
form.append("language_hint", "en");
form.append("backend_preference", "any");

const res = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaiecho/transcribe`, {
  method: "POST",
  headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
  body: form,
});
const { data } = await res.json();
console.log(data.transcript);
|
For audio under the inline threshold (default 5 MiB) you get the transcript back in the same response. Larger files return 202 Accepted with a job_id.
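Because the same endpoint can answer either way, client code has to branch on the status code. The sketch below shows one way to do that. It assumes the inline case returns HTTP 200 and that `job_id` sits under `data` next to where the transcript lives; neither detail is confirmed above, and `interpret_transcribe_response` is a hypothetical helper name.

```python
from typing import Any


def interpret_transcribe_response(status_code: int, body: dict[str, Any]) -> tuple[str, str]:
    """Return ("transcript", text) for inline results or ("job", job_id) for async ones."""
    data = body.get("data", {})
    if status_code == 200:
        # Assumption: inline results come back with a plain 200.
        return ("transcript", data["transcript"])
    if status_code == 202:
        # Assumption: job_id is nested under "data", like the transcript.
        return ("job", data["job_id"])
    raise RuntimeError(f"unexpected status {status_code}: {body!r}")
```

With httpx this would be called as `interpret_transcribe_response(resp.status_code, resp.json())`; a `"job"` result leads into step 3.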
3. Poll an async job (if you went over the threshold)
| curl "$SCAIGRID_HOST/v1/modules/scaiecho/transcribe/jobs/$JOB_ID" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
Status moves through queued → running → completed. The transcript is on the response when status is completed. No S3 fetch needed — transcripts are text.
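The polling loop can be sketched as follows. The HTTP call is injected as `fetch_job` so the loop stays independent of any client library; it is assumed to return the job object as a dict with `status` and, once completed, `transcript` keys. Treating any status other than queued, running, or completed as an error is also an assumption, since no failure state is listed above.

```python
import time
from typing import Any, Callable


def wait_for_transcript(
    fetch_job: Callable[[str], dict[str, Any]],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it completes and return its transcript."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job["transcript"]
        if status not in ("queued", "running"):
            # Assumption: anything outside the documented states is terminal.
            raise RuntimeError(f"job {job_id} ended in unexpected state {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout} seconds")
        sleep(interval)
```

In practice `fetch_job` would wrap the GET request shown above and return the decoded job object.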
4. Stream live audio over WebSocket
Open a WebSocket to /v1/modules/scaiecho/stream/transcribe, send an open JSON control frame, push binary audio chunks, and receive delta JSON frames.
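One detail worth isolating is how the WebSocket URL is derived from SCAIGRID_HOST. A plain string replace of "https" with "wss" also rewrites any later occurrence of "https" in the hostname, so the sketch below parses the URL and swaps only the scheme. The `token` query parameter matches the client examples in this step; `stream_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

STREAM_PATH = "/v1/modules/scaiecho/stream/transcribe"


def stream_url(host: str, api_key: str) -> str:
    """Turn an http(s) base URL into the ws(s) streaming URL with the token attached."""
    parts = urlsplit(host)
    # Map only the scheme; the hostname is left untouched.
    scheme = {"https": "wss", "http": "ws"}.get(parts.scheme, parts.scheme)
    path = parts.path.rstrip("/") + STREAM_PATH
    # urlencode percent-escapes the key in case it contains reserved characters.
    return urlunsplit((scheme, parts.netloc, path, urlencode({"token": api_key}), ""))
```

For example, `stream_url("https://scaigrid.scailabs.ai", "sgk_abc")` gives `wss://scaigrid.scailabs.ai/v1/modules/scaiecho/stream/transcribe?token=sgk_abc`.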
| import asyncio, json, os, websockets

async def main():
    # Swap only the scheme prefix, so a hostname containing "https" is left alone.
    url = (
        os.environ["SCAIGRID_HOST"].replace("https://", "wss://", 1)
        + "/v1/modules/scaiecho/stream/transcribe"
        + f"?token={os.environ['SCAIGRID_API_KEY']}"
    )
    async with websockets.connect(url) as ws:
        # Control frames are JSON text; audio is sent as binary frames.
        await ws.send(json.dumps({
            "type": "open",
            "language_hint": "en",
            "media_type": "audio/wav",
            "chunk_seconds": 5.0,
        }))
        print(await ws.recv())  # {"type": "ready", ...}
        with open("meeting.wav", "rb") as f:
            while chunk := f.read(16000):
                await ws.send(chunk)
        await ws.send(json.dumps({"type": "close"}))
        # Drain the remaining server frames until the server closes the socket.
        async for msg in ws:
            print(msg)

asyncio.run(main())
|
| import WebSocket from "ws";
import fs from "node:fs";

// Swap only the scheme prefix to get the WebSocket URL.
const url = `${process.env.SCAIGRID_HOST.replace(/^https/, "wss")}`
  + `/v1/modules/scaiecho/stream/transcribe?token=${process.env.SCAIGRID_API_KEY}`;
const ws = new WebSocket(url);

ws.on("message", (data) => console.log(data.toString()));
ws.on("open", () => {
  // Send the open control frame first, then stream the audio as binary frames.
  ws.send(JSON.stringify({ type: "open", language_hint: "en", media_type: "audio/wav" }));
  const stream = fs.createReadStream("meeting.wav", { highWaterMark: 16000 });
  stream.on("data", (chunk) => ws.send(chunk));
  stream.on("end", () => ws.send(JSON.stringify({ type: "close" })));
});
|
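| # websocat: push raw bytes after an open control frame
# SCAIGRID_HOST_WS is not among the exports above, so derive it from SCAIGRID_HOST.
SCAIGRID_HOST_WS="$(printf '%s' "$SCAIGRID_HOST" | sed 's|^https|wss|')"
{
  echo '{"type":"open","language_hint":"en","media_type":"audio/wav"}'
  cat meeting.wav
  echo '{"type":"close"}'
} | websocat -b "$SCAIGRID_HOST_WS/v1/modules/scaiecho/stream/transcribe?token=$SCAIGRID_API_KEY"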
|
Server frames you'll see: ready (with the selected backend), repeated delta (with text, is_final, start, end), and finally closed.
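A client usually wants one transcript rather than a stream of frames. The sketch below folds the frames described above into a final text. It assumes that deltas with `is_final` set to false are interim results that a later final delta supersedes, and that the `ready` frame carries the backend under a `backend` key; both are assumptions, and `assemble_transcript` is a hypothetical helper name.

```python
import json
from typing import Iterable, Optional


def assemble_transcript(raw_frames: Iterable[str]) -> tuple[Optional[str], str]:
    """Return (backend, transcript) built from the server's JSON frames."""
    backend: Optional[str] = None
    finals: list[tuple[float, str]] = []
    for raw in raw_frames:
        frame = json.loads(raw)
        kind = frame.get("type")
        if kind == "ready":
            backend = frame.get("backend")  # assumed key name
        elif kind == "delta" and frame.get("is_final"):
            # Keep only final segments; interim deltas are assumed to be superseded.
            finals.append((frame["start"], frame["text"]))
        elif kind == "closed":
            break
    finals.sort(key=lambda item: item[0])  # order segments by start time
    return backend, " ".join(text for _, text in finals)
```

Feeding it the messages collected from the WebSocket yields the selected backend and the stitched transcript.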
5. (Optional) enroll a speaker for diarization
Skip this unless you need speaker-attributed transcripts. Enrollment is biometric — it requires consent capture and the scaiecho:enroll permission, separate from scaiecho:transcribe.
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaiecho/speakers" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-F "display_name=Alice" \
-F "language_primary=en" \
-F "consent_user_full_name=Alice Example" \
-F "consent_stated_purpose=Meeting transcription diarization" \
-F "consent_text=I consent to enrollment for diarization." \
-F "reference=@alice-reference.wav" \
-F "consent=@alice-consent.wav"
|
See "Enroll a speaker for diarization" for the full pipeline, including how diarized streaming requests pick up enrolled profiles.
What just happened
- Step 2 ran through TranscribeService, which consulted your tenant policy and picked Backend A or B. Short audio went sync; long audio enqueued a job on the arq worker pool.
- Step 4 opened a StreamTranscribeService session. Audio frames went to the dispatcher's chunk-relay path; transcript records came back over the WebSocket as JSON deltas.
- Every call was metered by ScaiGrid's accounting pipeline against your tenant's budget, just like a chat completion.
Next