Batch Inference

Batch inference runs thousands of requests asynchronously at reduced cost. Submit a file of prompts, walk away, come back for the results.

Use batches when:

You have more than 1,000 requests to run.
Latency doesn't matter (willing to wait minutes to hours).
You want the lower per-token cost — typically 50% of live inference pricing.

Endpoints:

POST /v1/inference/batch — submit
GET /v1/inference/batch — list jobs
GET /v1/inference/batch/{batch_id} — check status
POST /v1/inference/batch/{batch_id}/cancel — cancel

Preparing the input file

Batches accept a JSONL file: one JSON request per line. Each line has a custom_id (your own correlation key), the HTTP method (POST), the url (endpoint path), and the request body.

jsonl

{"custom_id": "req-1", "method": "POST", "url": "/v1/inference/chat", "body": {"model": "scailabs/poolnoodle-omni", "messages": [{"role": "user", "content": "Summarize: ..."}], "max_tokens": 100}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/inference/chat", "body": {"model": "scailabs/poolnoodle-omni", "messages": [{"role": "user", "content": "Summarize: ..."}], "max_tokens": 100}}
{"custom_id": "req-3", "method": "POST", "url": "/v1/inference/embed", "body": {"model": "openai/text-embedding-3-small", "input": ["First text", "Second text"]}}

You can mix endpoint types in a single batch — chat, embeddings, and other inference calls can share the same job.
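A minimal sketch of generating such a file programmatically. The helper name build_batch_jsonl and the req-N numbering scheme are illustrative assumptions, not part of the ScaiGrid API; only the line layout (custom_id, method, url, body) comes from the example above.

```python
import json


def build_batch_jsonl(prompts, model="scailabs/poolnoodle-omni", max_tokens=100):
    """Return JSONL text with one chat request per prompt (hypothetical helper)."""
    lines = []
    for i, prompt in enumerate(prompts, start=1):
        request = {
            # Correlation key; kept unique here so results can be matched back.
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/inference/chat",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        # json.dumps emits a single line, which is what JSONL requires.
        lines.append(json.dumps(request, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file, for example with pathlib.Path("batch-input.jsonl").write_text(...), gives an input file in the format shown above.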

Submitting a batch

bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/batch \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_url": "s3://my-bucket/batch-input.jsonl",
    "endpoint_completion_window": "24h",
    "metadata": {"project": "summarization-2026-q2"}
  }'

python
import os

import httpx

API_KEY = os.environ["SCAIGRID_API_KEY"]

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/batch",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input_file_url": "s3://my-bucket/batch-input.jsonl",
        "endpoint_completion_window": "24h",
        "metadata": {"project": "summarization-2026-q2"},
    },
)
resp.raise_for_status()  # surface HTTP errors before reading the body
batch = resp.json()["data"]
print(batch["id"])  # e.g. batch_abc123

Input file options

input_file_url — a pre-signed S3/GCS URL, or scaidrive:// URL, or scaigrid://file/{file_id} for files already stored in ScaiGrid.
input_file_base64 — for small batches (< 10 MB), you can inline the JSONL as base64.
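For the inline option, a sketch of building the request body might look like the following. The helper name inline_batch_payload is hypothetical, and treating the 10 MB limit as 10 * 1024 * 1024 bytes measured before encoding is an assumption; the source only says "< 10 MB".

```python
import base64

# Assumption: the "< 10 MB" guidance is checked against the raw JSONL size.
MAX_INLINE_BYTES = 10 * 1024 * 1024


def inline_batch_payload(jsonl_text, completion_window="24h"):
    """Build a submit body that inlines the JSONL as base64 (hypothetical helper)."""
    raw = jsonl_text.encode("utf-8")
    if len(raw) >= MAX_INLINE_BYTES:
        raise ValueError("input too large to inline; pass input_file_url instead")
    return {
        "input_file_base64": base64.b64encode(raw).decode("ascii"),
        "endpoint_completion_window": completion_window,
    }
```

The returned dictionary can be passed as the json= argument of the submit call in place of the input_file_url variant.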

Completion window

endpoint_completion_window tells ScaiGrid how long you'll wait. Options:

24h (default) — discounted rate, up to 24-hour processing
4h — mid-tier discount, up to 4 hours
1h — small discount, up to 1 hour

Jobs that can't complete in the window fail partially — you get whatever finished plus error records for the rest.

Monitoring progress

bash
curl https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id} \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Response:

json
{
  "status": "ok",
  "data": {
    "id": "batch_abc123",
    "status": "processing",
    "request_count": 10000,
    "completed_count": 6421,
    "failed_count": 12,
    "created_at": "2026-04-22T09:00:00Z",
    "started_at": "2026-04-22T09:00:14Z",
    "completed_at": null,
    "webhook_url": null,
    "metadata": {"project": "summarization-2026-q2"}
  }
}

Status values:

pending — queued, not started yet
processing — running
completed — all requests finished (check failed_count for partial failures)
failed — fatal error (usually input file parsing)
cancelled — explicitly cancelled
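If you poll rather than use webhooks, a loop along these lines waits for a terminal status. This is a sketch: wait_for_batch and its parameters are illustrative names, and fetch_batch is assumed to be any callable that returns the "data" object from the status endpoint.

```python
import time

# Terminal states taken from the status list above.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_batch(fetch_batch, poll_seconds=30.0, timeout_seconds=24 * 3600,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_batch() until the batch reaches a terminal status."""
    deadline = clock() + timeout_seconds
    while True:
        batch = fetch_batch()
        # Any status outside the terminal set is treated as "still running".
        if batch["status"] in TERMINAL_STATUSES:
            return batch
        if clock() >= deadline:
            raise TimeoutError(f"batch still {batch['status']} after {timeout_seconds}s")
        sleep(poll_seconds)
```

In practice fetch_batch could wrap an httpx.get call against /v1/inference/batch/{batch_id} and return .json()["data"]. Injecting sleep and clock keeps the loop easy to test without waiting.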

Getting results

When status becomes completed, fetch the batch again:

bash
curl https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id} \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

The response now includes results_url:

json
{
  "status": "ok",
  "data": {
    "id": "batch_abc123",
    "status": "completed",
    "results_url": "https://scaigrid.scailabs.ai/v1/media/tok_results_xyz",
    "error_file_url": "https://scaigrid.scailabs.ai/v1/media/tok_errors_xyz",
    ...
  }
}

Download the results file — JSONL, one result per line, keyed by custom_id:

jsonl

{"custom_id": "req-1", "status": "ok", "response": {"choices": [...], "usage": {...}}}
{"custom_id": "req-2", "status": "ok", "response": {"choices": [...], "usage": {...}}}
{"custom_id": "req-3", "status": "error", "error": {"code": "...", "message": "..."}}

python
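import json
import os

import httpx

API_KEY = os.environ["SCAIGRID_API_KEY"]
batch_id = "batch_abc123"  # ID returned when the batch was submitted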

batch = httpx.get(
    f"https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id}",
    headers={"Authorization": f"Bearer {API_KEY}"},
).json()["data"]

if batch["status"] != "completed":
    raise RuntimeError(f"Batch not done: {batch['status']}")

results = httpx.get(batch["results_url"]).text
for line in results.splitlines():
    item = json.loads(line)
    if item["status"] == "ok":
        print(item["custom_id"], item["response"]["choices"][0]["message"]["content"])
    else:
        print(item["custom_id"], "FAILED:", item["error"]["code"])
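Because results are keyed by custom_id, failed requests can be collected into a new input file for resubmission. The sketch below assumes failures appear in the results file with a non-"ok" status, as in the sample lines above; the function names are hypothetical.

```python
import json


def failed_ids(results_text):
    """Return the custom_ids of all result lines whose status is not "ok"."""
    failed = set()
    for line in results_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        item = json.loads(line)
        if item["status"] != "ok":
            failed.add(item["custom_id"])
    return failed


def build_retry_jsonl(input_text, results_text):
    """Keep only the original request lines whose custom_id failed."""
    retry = failed_ids(results_text)
    kept = [
        line
        for line in input_text.splitlines()
        if line.strip() and json.loads(line)["custom_id"] in retry
    ]
    return "\n".join(kept) + ("\n" if kept else "")
```

The output is itself a valid batch input file, so it can be submitted as a follow-up batch.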

Webhook notifications

Rather than polling, subscribe to batch events:

bash
curl -X POST https://scaigrid.scailabs.ai/v1/webhooks \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-service.example/batch-webhook",
    "events": ["batch.completed", "batch.failed"]
  }'

ScaiGrid POSTs to your URL when a batch finishes. See Webhooks.
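On the receiving side, the handler needs to dispatch on the event type. The payload shape is not specified on this page, so the sketch below assumes a body like {"event": "batch.completed", "data": {"id": "..."}}; both field names and the function route_batch_event are hypothetical, and the Webhooks page is the authority on the real schema.

```python
def route_batch_event(payload, on_completed, on_failed):
    """Dispatch a decoded webhook body to a callback; return True if handled.

    Assumed (not documented here) payload fields: "event" and "data.id".
    """
    event = payload.get("event")
    batch_id = payload.get("data", {}).get("id")
    if event == "batch.completed":
        on_completed(batch_id)
        return True
    if event == "batch.failed":
        on_failed(batch_id)
        return True
    # Unknown event types are ignored rather than treated as errors.
    return False
```

Keeping the routing separate from the HTTP framework makes it easy to plug into whichever web server receives the POST.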

Cancelling

bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id}/cancel \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Stops further processing. Already-completed requests stay completed — you can still fetch their results.

Pricing

Batch pricing is a per-model discount off live pricing, applied at result time. The discount is visible on each model's metadata:

bash
curl https://scaigrid.scailabs.ai/v1/models/scailabs/poolnoodle-omni \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Limits

Max requests per batch — 50,000 by default, raisable for high-volume tenants.
Max input file size — 500 MB.
Concurrent batches per tenant — 10 by default.

Hitting these returns QUOTA_EXCEEDED.
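To stay under the per-batch limits, a large job can be split client-side before submission. This is a sketch under stated assumptions: split_into_batches is a hypothetical helper, the defaults mirror the limits listed above, and reading "500 MB" as 500,000,000 bytes is a conservative guess.

```python
MAX_REQUESTS_PER_BATCH = 50_000
MAX_FILE_BYTES = 500 * 1000 * 1000  # assumption: decimal megabytes


def split_into_batches(lines, max_requests=MAX_REQUESTS_PER_BATCH,
                       max_bytes=MAX_FILE_BYTES):
    """Group JSONL lines into chunks that respect both the count and size limits."""
    batches, current, current_bytes = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        if size > max_bytes:
            raise ValueError("a single request line exceeds the file size limit")
        # Start a new chunk when adding this line would break either limit.
        if current and (len(current) >= max_requests or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each chunk can then be written out and submitted as its own batch, keeping the number of batches in flight within the concurrent-batch limit.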

What's next

Webhooks — be notified when batches complete.
Embeddings — prime use case for batches (thousands of documents).
OpenAI Compatibility — /oai/v1/batches works similarly.