Batch Inference
Batch inference runs thousands of requests asynchronously at reduced cost. Submit a file of prompts, walk away, come back for the results.
Use batches when:
- You have > 1000 requests to run.
- Latency doesn't matter (willing to wait minutes to hours).
- You want the lower per-token cost — typically 50% of live inference pricing.
Endpoints:
- POST /v1/inference/batch: submit
- GET /v1/inference/batch: list jobs
- GET /v1/inference/batch/{batch_id}: check status
- POST /v1/inference/batch/{batch_id}/cancel: cancel
Preparing the input file
Batches accept a JSONL file: one JSON request per line. Each line has a custom_id (your own correlation key), the HTTP method (POST), the url (endpoint path), and the request body.
{"custom_id": "req-1", "method": "POST", "url": "/v1/inference/chat", "body": {"model": "scailabs/poolnoodle-omni", "messages": [{"role": "user", "content": "Summarize: ..."}], "max_tokens": 100}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/inference/chat", "body": {"model": "scailabs/poolnoodle-omni", "messages": [{"role": "user", "content": "Summarize: ..."}], "max_tokens": 100}}
{"custom_id": "req-3", "method": "POST", "url": "/v1/inference/embed", "body": {"model": "openai/text-embedding-3-small", "input": ["First text", "Second text"]}}
You can mix endpoint types in a single batch — chat, embeddings, and other inference calls can share the same job.
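As a minimal sketch of producing such a file, the helpers below build chat lines in the shape of the samples above and serialize them to JSONL. The function names are hypothetical, and rejecting duplicate custom_id values is an assumption made here because results are keyed by custom_id; the source does not say how the service treats duplicates.

```python
import json


def chat_request(custom_id, prompt, model="scailabs/poolnoodle-omni", max_tokens=100):
    """Build one batch line for the chat endpoint, mirroring the sample lines."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/inference/chat",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
    }


def to_jsonl(requests):
    """Serialize requests to JSONL, one JSON object per line."""
    seen = set()
    lines = []
    for req in requests:
        cid = req["custom_id"]
        # Assumption: custom_id should be unique so results can be matched back.
        if cid in seen:
            raise ValueError(f"duplicate custom_id: {cid}")
        seen.add(cid)
        lines.append(json.dumps(req, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)
```

For example, `to_jsonl(chat_request(f"req-{i}", f"Summarize: {doc}") for i, doc in enumerate(docs, 1))` yields one line per document in `docs`.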
Submitting a batch
Submit the job with POST /v1/inference/batch, passing a reference to the input file and, optionally, a completion window.
Input file options
- input_file_url: a pre-signed S3/GCS URL, a scaidrive:// URL, or scaigrid://file/{file_id} for files already stored in ScaiGrid.
- input_file_base64: for small batches (< 10 MB), you can inline the JSONL as base64.
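The inline option can be sketched as follows. The helper name is hypothetical, and reading "10 MB" as 10 MiB of raw JSONL measured before encoding is an assumption; the source does not say whether the limit applies before or after base64 encoding.

```python
import base64

# Assumption: the 10 MB threshold is applied to the raw JSONL bytes.
MAX_INLINE_BYTES = 10 * 1024 * 1024


def inline_input_field(jsonl_text):
    """Return the input_file_base64 field for a small JSONL payload."""
    raw = jsonl_text.encode("utf-8")
    if len(raw) >= MAX_INLINE_BYTES:
        raise ValueError("input too large to inline; upload it and use input_file_url")
    return {"input_file_base64": base64.b64encode(raw).decode("ascii")}
```

Larger inputs would go through input_file_url instead.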
Completion window
endpoint_completion_window tells ScaiGrid how long you'll wait. Options:
- 24h (default): discounted rate, up to 24-hour processing
- 4h: mid-tier discount, up to 4 hours
- 1h: small discount, up to 1 hour
Jobs that can't complete in the window fail partially — you get whatever finished plus error records for the rest.
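A submit call with an explicit window might be assembled like this. It is a sketch under stated assumptions: the host is a placeholder, the Bearer authorization header is an assumed auth scheme, and the request body is assumed to be flat JSON carrying input_file_url and endpoint_completion_window. Only the endpoint path, the two field names, and the window values come from this page.

```python
import json
import urllib.request

BASE_URL = "https://scaigrid.example.com"  # placeholder host, replace with yours
WINDOWS = {"24h", "4h", "1h"}  # window values listed above


def build_submit_request(input_file_url, api_key, window="24h"):
    """Build (but do not send) a POST /v1/inference/batch request."""
    if window not in WINDOWS:
        raise ValueError(f"unsupported completion window: {window}")
    body = {
        "input_file_url": input_file_url,
        "endpoint_completion_window": window,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/inference/batch",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; building the request separately keeps the payload easy to inspect and test.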
Monitoring progress
Check a job with GET /v1/inference/batch/{batch_id}. The response reports the job's status, plus a failed_count for requests that did not succeed.
Status values:
- pending: queued, not started yet
- processing: running
- completed: all requests finished (check failed_count for partial failures)
- failed: fatal error (usually input file parsing)
- cancelled: explicitly cancelled
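The status values suggest a simple polling loop: keep checking until the job reaches completed, failed, or cancelled. The sketch below assumes a caller-supplied `fetch_status(batch_id)` function (hypothetical) that performs the status request and returns the parsed response as a dict with a "status" key. The interval and timeout defaults are illustrative, not service recommendations.

```python
import time

# Statuses after which the job will not change again.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_batch(fetch_status, batch_id, interval=30.0, timeout=24 * 3600, sleep=time.sleep):
    """Poll fetch_status(batch_id) until the job reaches a terminal status."""
    waited = 0.0
    while True:
        job = fetch_status(batch_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if waited >= timeout:
            raise TimeoutError(f"batch {batch_id} still {job['status']} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Because a completed job can still contain failed requests, callers would typically inspect failed_count on the returned dict before treating the batch as fully successful.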
Getting results
When status becomes completed, fetch the job again with GET /v1/inference/batch/{batch_id}. The response now includes results_url.
Download the results file — JSONL, one result per line, keyed by custom_id:
{"custom_id": "req-1", "status": "ok", "response": {"choices": [...], "usage": {...}}}
{"custom_id": "req-2", "status": "ok", "response": {"choices": [...], "usage": {...}}}
{"custom_id": "req-3", "status": "error", "error": {"code": "...", "message": "..."}}
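Once the file is downloaded, the records can be split into successes and failures by custom_id. This is a minimal sketch that assumes the record shape shown in the sample lines ("status" of "ok" with a "response", anything else with an "error"); the function name is hypothetical.

```python
import json


def split_results(jsonl_text):
    """Return (ok, errors): two dicts keyed by custom_id."""
    ok, errors = {}, {}
    for line in jsonl_text.split("\n"):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        if record["status"] == "ok":
            ok[record["custom_id"]] = record["response"]
        else:
            errors[record["custom_id"]] = record.get("error")
    return ok, errors
```

The keys of `errors` are the custom_id values to retry or investigate, which is the main reason to choose meaningful correlation keys when preparing the input.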
Webhook notifications
Rather than polling, subscribe to batch events. ScaiGrid POSTs to your URL when a batch finishes. See Webhooks.
Cancelling
POST /v1/inference/batch/{batch_id}/cancel stops further processing. Already-completed requests stay completed, so you can still fetch their results.
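A cancel call could be built along these lines. The host is a placeholder, the Bearer header is an assumed auth scheme, and sending an empty body is an assumption; only the path comes from the endpoint list.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://scaigrid.example.com"  # placeholder host, replace with yours


def build_cancel_request(batch_id, api_key):
    """Build (but do not send) a POST to the cancel endpoint for one batch."""
    # Percent-encode the id so unexpected characters cannot alter the path.
    url = f"{BASE_URL}/v1/inference/batch/{quote(batch_id, safe='')}/cancel"
    return urllib.request.Request(
        url,
        data=b"",  # assumption: no request body is needed
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        method="POST",
    )
```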
Pricing
Batch pricing is a per-model discount off live pricing, applied at result time. The discount is visible on each model's metadata.
Limits
- Max requests per batch — 50,000 by default, raisable for high-volume tenants.
- Max input file size — 500 MB.
- Concurrent batches per tenant — 10 by default.
Exceeding any of these limits returns QUOTA_EXCEEDED.
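One way to stay under the per-batch request limit is to split a large workload into several batches before submitting. The sketch below uses the default limit of 50,000 quoted above; the helper name is hypothetical, and tenants with a raised limit would pass their own value.

```python
MAX_REQUESTS_PER_BATCH = 50_000  # default limit quoted above


def chunk_requests(requests, limit=MAX_REQUESTS_PER_BATCH):
    """Split a list of requests into consecutive chunks of at most `limit` items."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return [requests[i:i + limit] for i in range(0, len(requests), limit)]
```

Each chunk becomes its own input file and job. With the default cap of 10 concurrent batches per tenant, workloads that produce more than 10 chunks would need to be submitted in waves.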
What's next
- Webhooks — be notified when batches complete.
- Embeddings — prime use case for batches (thousands of documents).
- OpenAI Compatibility: /oai/v1/batches works similarly.