Batch Inference

Batch inference runs thousands of requests asynchronously at reduced cost. Submit a file of prompts, walk away, come back for the results.

Use batches when:

  • You have more than 1,000 requests to run.
  • Latency doesn't matter (you can wait minutes to hours for results).
  • You want the lower per-token cost — typically 50% of live inference pricing.

Endpoints:

  • POST /v1/inference/batch — submit
  • GET /v1/inference/batch — list jobs
  • GET /v1/inference/batch/{batch_id} — check status
  • POST /v1/inference/batch/{batch_id}/cancel — cancel

Preparing the input file

Batches accept a JSONL file: one JSON request per line. Each line has a custom_id (your own correlation key), the HTTP method (POST), the url (endpoint path), and the request body.

jsonl
{"custom_id": "req-1", "method": "POST", "url": "/v1/inference/chat", "body": {"model": "scailabs/poolnoodle-omni", "messages": [{"role": "user", "content": "Summarize: ..."}], "max_tokens": 100}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/inference/chat", "body": {"model": "scailabs/poolnoodle-omni", "messages": [{"role": "user", "content": "Summarize: ..."}], "max_tokens": 100}}
{"custom_id": "req-3", "method": "POST", "url": "/v1/inference/embed", "body": {"model": "openai/text-embedding-3-small", "input": ["First text", "Second text"]}}

You can mix endpoint types in a single batch — chat, embeddings, and other inference calls can share the same job.
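A minimal sketch of generating such a file from a list of prompts. The helper names `build_batch_lines` and `write_batch_file` and the `req-N` id scheme are illustrative choices, not part of the ScaiGrid API; the request shape simply mirrors the chat lines shown above.

```python
import json


def build_batch_lines(prompts, model="scailabs/poolnoodle-omni", max_tokens=100):
    """Return one JSONL line per prompt, each with its own custom_id."""
    lines = []
    for i, prompt in enumerate(prompts, start=1):
        request = {
            # Illustrative id scheme; use whatever key lets you match results later
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/inference/chat",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        # json.dumps escapes newlines inside prompts, so each request stays on one line
        lines.append(json.dumps(request))
    return lines


def write_batch_file(path, prompts):
    """Write the requests to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in build_batch_lines(prompts):
            f.write(line + "\n")
```

Serializing each request with `json.dumps` rather than string formatting keeps quotes and line breaks in the prompt text from corrupting the one-request-per-line layout.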

Submitting a batch

bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/batch \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_url": "s3://my-bucket/batch-input.jsonl",
    "endpoint_completion_window": "24h",
    "metadata": {"project": "summarization-2026-q2"}
  }'
python
import os

import httpx

# Read the API key from the environment, matching the curl example
API_KEY = os.environ["SCAIGRID_API_KEY"]

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/batch",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input_file_url": "s3://my-bucket/batch-input.jsonl",
        "endpoint_completion_window": "24h",
        "metadata": {"project": "summarization-2026-q2"},
    },
)
resp.raise_for_status()  # surface HTTP errors before reading the body
batch = resp.json()["data"]
print(batch["id"])  # batch_abc123

Input file options

  • input_file_url — a pre-signed S3/GCS URL, or scaidrive:// URL, or scaigrid://file/{file_id} for files already stored in ScaiGrid.
  • input_file_base64 — for small batches (< 10 MB), you can inline the JSONL as base64.
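For the inline option, the request body could be assembled roughly as follows. This is a sketch under two assumptions that the text above does not settle: that the 10 MB limit applies to the raw JSONL before encoding, and that "10 MB" means 10 × 1024 × 1024 bytes. `inline_batch_payload` is a hypothetical helper name.

```python
import base64

# ASSUMPTION: the limit is measured on the raw JSONL, and 10 MB = 10 MiB
MAX_INLINE_BYTES = 10 * 1024 * 1024


def inline_batch_payload(jsonl_text, window="24h"):
    """Build a submit body that carries the JSONL inline as base64."""
    raw = jsonl_text.encode("utf-8")
    if len(raw) >= MAX_INLINE_BYTES:
        raise ValueError("input too large to inline; upload it and use input_file_url")
    return {
        "input_file_base64": base64.b64encode(raw).decode("ascii"),
        "endpoint_completion_window": window,
    }
```

The returned dictionary can be passed as the `json=` argument of the submit request in place of the `input_file_url` variant.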

Completion window

endpoint_completion_window tells ScaiGrid how long you'll wait. Options:

  • 24h (default) — discounted rate, up to 24-hour processing
  • 4h — mid-tier discount, up to 4 hours
  • 1h — small discount, up to 1 hour

Jobs that can't complete in the window fail partially — you get whatever finished plus error records for the rest.
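Since longer windows carry the larger discount, one reasonable policy is to pick the longest window that still fits your deadline. The sketch below encodes only the three options listed above; `pick_window` is a hypothetical helper, not an API call.

```python
# Longest window first, because longer windows carry the larger discount
WINDOWS = [("24h", 24), ("4h", 4), ("1h", 1)]


def pick_window(hours_available):
    """Return the longest completion window that fits within the deadline."""
    for name, hours in WINDOWS:
        if hours_available >= hours:
            return name
    raise ValueError("deadline is under 1 hour; consider live inference instead")
```

For example, a job that must finish within six hours would be submitted with the 4h window, since the 24h window could overrun the deadline.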

Monitoring progress

bash
curl https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id} \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Response:

json
{
  "status": "ok",
  "data": {
    "id": "batch_abc123",
    "status": "processing",
    "request_count": 10000,
    "completed_count": 6421,
    "failed_count": 12,
    "created_at": "2026-04-22T09:00:00Z",
    "started_at": "2026-04-22T09:00:14Z",
    "completed_at": null,
    "webhook_url": null,
    "metadata": {"project": "summarization-2026-q2"}
  }
}

Status values:

  • pending — queued, not started yet
  • processing — running
  • completed — all requests finished (check failed_count for partial failures)
  • failed — fatal error (usually input file parsing)
  • cancelled — explicitly cancelled
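A polling loop only needs to know which of these statuses are terminal. The sketch below takes the status lookup as a callable, so it makes no assumption about the HTTP client; `wait_for_batch`, its parameters, and the 30-second default interval are illustrative choices rather than ScaiGrid recommendations.

```python
import time

# Statuses after which the batch will not change any further
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_batch(fetch_batch, poll_seconds=30.0, timeout_seconds=None, sleep=time.sleep):
    """Call fetch_batch() until the batch reaches a terminal status.

    fetch_batch must return the batch object (the "data" part of the response).
    """
    waited = 0.0
    while True:
        batch = fetch_batch()
        if batch["status"] in TERMINAL_STATUSES:
            return batch
        if timeout_seconds is not None and waited >= timeout_seconds:
            raise TimeoutError(f"batch still {batch['status']} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In practice `fetch_batch` might wrap the GET request shown above and return `response.json()["data"]`. A batch that comes back as completed can still contain failed requests, so check failed_count before treating the job as fully successful.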

Getting results

When status becomes completed, fetch results:

bash
curl https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id} \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

The response now includes results_url:

json
{
  "status": "ok",
  "data": {
    "id": "batch_abc123",
    "status": "completed",
    "results_url": "https://scaigrid.scailabs.ai/v1/media/tok_results_xyz",
    "error_file_url": "https://scaigrid.scailabs.ai/v1/media/tok_errors_xyz",
    ...
  }
}

Download the results file — JSONL, one result per line, keyed by custom_id:

jsonl
{"custom_id": "req-1", "status": "ok", "response": {"choices": [...], "usage": {...}}}
{"custom_id": "req-2", "status": "ok", "response": {"choices": [...], "usage": {...}}}
{"custom_id": "req-3", "status": "error", "error": {"code": "...", "message": "..."}}
python
import json
import os

import httpx

API_KEY = os.environ["SCAIGRID_API_KEY"]
batch_id = "batch_abc123"  # the id returned when the batch was submitted

batch = httpx.get(
    f"https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id}",
    headers={"Authorization": f"Bearer {API_KEY}"},
).json()["data"]

if batch["status"] != "completed":
    raise RuntimeError(f"Batch not done: {batch['status']}")

# Download the JSONL results and handle each line according to its status
results = httpx.get(batch["results_url"]).text
for line in results.splitlines():
    if not line.strip():
        continue  # skip blank lines
    item = json.loads(line)
    if item["status"] == "ok":
        print(item["custom_id"], item["response"]["choices"][0]["message"]["content"])
    else:
        print(item["custom_id"], "FAILED:", item["error"]["code"])
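Because results are keyed by custom_id, failed requests can be matched back to the original input and resubmitted as a new batch. The following is a sketch of that bookkeeping, operating on the text of the two files; `split_results` and `retry_lines` are hypothetical helpers, and whether a given error is worth retrying depends on its code.

```python
import json


def split_results(results_text):
    """Split a results file into {custom_id: response} and {custom_id: error}."""
    ok, failed = {}, {}
    for line in results_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        item = json.loads(line)
        if item["status"] == "ok":
            ok[item["custom_id"]] = item["response"]
        else:
            failed[item["custom_id"]] = item["error"]
    return ok, failed


def retry_lines(input_text, failed_ids):
    """Return the original request lines whose custom_id is in failed_ids."""
    return [
        line
        for line in input_text.splitlines()
        if line.strip() and json.loads(line)["custom_id"] in failed_ids
    ]
```

The lines returned by `retry_lines` can be written to a fresh JSONL file and submitted as a follow-up batch.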

Webhook notifications

Rather than polling, subscribe to batch events:

bash
curl -X POST https://scaigrid.scailabs.ai/v1/webhooks \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-service.example/batch-webhook",
    "events": ["batch.completed", "batch.failed"]
  }'

ScaiGrid POSTs to your URL when a batch finishes. See Webhooks.
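On the receiving side, the handler mostly needs to route on the event name. The payload shape is not described on this page, so the sketch below assumes a JSON body with an `event` field and a `data` object containing the batch `id`; treat both field names as placeholders and check the Webhooks reference for the real schema. `handle_batch_event` is a hypothetical function you would call from your web framework's request handler.

```python
def handle_batch_event(payload):
    """Decide what to do with a batch webhook delivery.

    ASSUMPTION: payload looks like {"event": "...", "data": {"id": "..."}}.
    Returns an (action, batch_id) tuple for the caller to act on.
    """
    event = payload.get("event")
    batch_id = payload.get("data", {}).get("id")
    if event == "batch.completed":
        return ("fetch_results", batch_id)
    if event == "batch.failed":
        return ("alert", batch_id)
    # Unknown or unsubscribed events are ignored rather than treated as errors
    return ("ignore", None)
```

Keeping the routing logic in a plain function like this makes it easy to unit test without standing up an HTTP server.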

Cancelling

bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id}/cancel \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Stops further processing. Already-completed requests stay completed — you can still fetch their results.

Pricing

Batch pricing is a per-model discount off live pricing, applied at result time. The discount is visible on each model's metadata:

bash
curl https://scaigrid.scailabs.ai/v1/models/scailabs/poolnoodle-omni \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
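As a rough worked example of the arithmetic, the function below applies a fractional discount to a live price. The price figures are invented for illustration, the discount is passed in explicitly because the name of the metadata field is not given here, and the 0.5 default reflects only the "typically 50%" figure mentioned at the top of this page.

```python
def batch_cost(tokens, live_price_per_million, batch_discount=0.5):
    """Estimate batch cost from a token count and the live per-million-token price.

    batch_discount is the fraction taken off live pricing (0.5 means half price).
    """
    live_cost = tokens / 1_000_000 * live_price_per_million
    return live_cost * (1.0 - batch_discount)


# Illustrative numbers only: 2M tokens at a live price of 3.00 per million
estimate = batch_cost(2_000_000, 3.0)
print(f"{estimate:.2f}")
```

With these made-up inputs the live cost would be 6.00 and the batch estimate half of that.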

Limits

  • Max requests per batch — 50,000 by default, raisable for high-volume tenants.
  • Max input file size — 500 MB.
  • Concurrent batches per tenant — 10 by default.

Exceeding any of these limits returns a QUOTA_EXCEEDED error.
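A workload larger than one batch allows can be split client-side before submission. The sketch below chunks a list of JSONL lines by the default request and file-size limits; it assumes "500 MB" means 500 × 1024 × 1024 bytes, and `chunk_requests` is a hypothetical helper.

```python
def chunk_requests(lines, max_requests=50_000, max_bytes=500 * 1024 * 1024):
    """Split JSONL lines into chunks that respect both per-batch limits."""
    chunks, current, size = [], [], 0
    for line in lines:
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        # Start a new chunk if adding this line would break either limit
        if current and (len(current) >= max_requests or size + line_bytes > max_bytes):
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += line_bytes
    if current:
        chunks.append(current)
    return chunks
```

Remember that the number of chunks submitted at once is itself bounded by the concurrent-batch limit, so large workloads may need to be submitted in waves.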

Updated 2026-05-18 15:01:29 (rev 17)