---
title: Batch Inference
path: api-guides/batch-inference
status: published
---

# Batch Inference

Batch inference runs thousands of requests asynchronously at reduced cost. Submit a file of prompts, walk away, come back for the results.

Use batches when:

- You have more than 1,000 requests to run.
- Latency doesn't matter and you can wait minutes to hours for results.
- You want the lower per-token cost, typically 50% of live inference pricing.

**Endpoints:**
- `POST /v1/inference/batch` — submit
- `GET /v1/inference/batch` — list jobs
- `GET /v1/inference/batch/{batch_id}` — check status
- `POST /v1/inference/batch/{batch_id}/cancel` — cancel

## Preparing the input file

Batches accept a JSONL file: one JSON request per line. Each line has a `custom_id` (your own correlation key), the HTTP `method` (`POST`), the `url` (the path of the inference endpoint to call), and the request `body` for that endpoint.

```jsonl
{"custom_id": "req-1", "method": "POST", "url": "/v1/inference/chat", "body": {"model": "scailabs/poolnoodle-omni", "messages": [{"role": "user", "content": "Summarize: ..."}], "max_tokens": 100}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/inference/chat", "body": {"model": "scailabs/poolnoodle-omni", "messages": [{"role": "user", "content": "Summarize: ..."}], "max_tokens": 100}}
{"custom_id": "req-3", "method": "POST", "url": "/v1/inference/embed", "body": {"model": "openai/text-embedding-3-small", "input": ["First text", "Second text"]}}
```

You can mix endpoint types in a single batch — chat, embeddings, and other inference calls can share the same job.
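As a minimal sketch, an input file in the shape shown above could be generated from a list of prompts like this. The helper names `build_chat_lines` and `write_batch_file` and the `req-N` ID scheme are illustrative assumptions, not part of the API; the request fields are copied from the example above.

```python
import json


def build_chat_lines(prompts, model="scailabs/poolnoodle-omni", max_tokens=100):
    """Return one JSONL line per prompt, in the request shape shown above."""
    lines = []
    for i, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"req-{i}",  # illustrative ID scheme (assumption)
            "method": "POST",
            "url": "/v1/inference/chat",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        # json.dumps escapes newlines inside strings, so each request stays on one line.
        lines.append(json.dumps(request, ensure_ascii=False))
    return lines


def write_batch_file(path, lines):
    """Write the lines as a JSONL file, one request per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

Building each line with `json.dumps` rather than string formatting avoids broken lines when a prompt contains quotes or line breaks.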

## Submitting a batch

```bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/batch \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_url": "s3://my-bucket/batch-input.jsonl",
    "endpoint_completion_window": "24h",
    "metadata": {"project": "summarization-2026-q2"}
  }'
```

```python
import os

import httpx

API_KEY = os.environ["SCAIGRID_API_KEY"]

resp = httpx.post(
    "https://scaigrid.scailabs.ai/v1/inference/batch",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input_file_url": "s3://my-bucket/batch-input.jsonl",
        "endpoint_completion_window": "24h",
        "metadata": {"project": "summarization-2026-q2"},
    },
)
batch = resp.json()["data"]
print(batch["id"])  # batch_abc123
```

### Input file options

- **`input_file_url`** — a pre-signed S3/GCS URL, or `scaidrive://` URL, or `scaigrid://file/{file_id}` for files already stored in ScaiGrid.
- **`input_file_base64`** — for small batches (< 10 MB), you can inline the JSONL as base64.
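For the inline option, a request body could be assembled roughly as follows. This is a sketch under assumptions: it supposes `input_file_base64` expects standard base64 of the file's UTF-8 bytes, and the helper name `inline_batch_payload` is hypothetical. This guide does not say whether the 10 MB limit applies before or after encoding, so the sketch does not enforce it.

```python
import base64


def inline_batch_payload(jsonl_text, completion_window="24h"):
    """Build a submit body that inlines the JSONL input as base64.

    Assumption: the API expects standard base64 of the UTF-8 encoded file.
    """
    encoded = base64.b64encode(jsonl_text.encode("utf-8")).decode("ascii")
    return {
        "input_file_base64": encoded,
        "endpoint_completion_window": completion_window,
    }
```

The returned dictionary can be passed as the `json=` argument of the `httpx.post` call shown earlier, in place of the body that uses `input_file_url`.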

### Completion window

`endpoint_completion_window` tells ScaiGrid how long you'll wait. Options:

- `24h` (default) — discounted rate, up to 24-hour processing
- `4h` — mid-tier discount, up to 4 hours
- `1h` — small discount, up to 1 hour

Jobs that can't complete within the window fail partially: you get results for whatever finished, plus error records for the rest.

## Monitoring progress

```bash
curl https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id} \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

Response:

```json
{
  "status": "ok",
  "data": {
    "id": "batch_abc123",
    "status": "processing",
    "request_count": 10000,
    "completed_count": 6421,
    "failed_count": 12,
    "created_at": "2026-04-22T09:00:00Z",
    "started_at": "2026-04-22T09:00:14Z",
    "completed_at": null,
    "webhook_url": null,
    "metadata": {"project": "summarization-2026-q2"}
  }
}
```

Status values:

- `pending` — queued, not started yet
- `processing` — running
- `completed` — all requests finished (check `failed_count` for partial failures)
- `failed` — fatal error (usually input file parsing)
- `cancelled` — explicitly cancelled
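A polling loop over these statuses might look like the sketch below. The function name `wait_for_batch`, the 30-second interval, and the timeout are illustrative choices, not documented recommendations. The HTTP call is passed in as `fetch_batch` so the loop itself stays independent of the client library.

```python
import time

# Statuses from the list above after which the batch no longer changes.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_batch(fetch_batch, poll_seconds=30.0, timeout_seconds=24 * 3600,
                   sleep=time.sleep):
    """Poll until the batch reaches a terminal status, then return it.

    fetch_batch: zero-argument callable returning the `data` object from
    GET /v1/inference/batch/{batch_id}.
    """
    waited = 0.0
    while True:
        batch = fetch_batch()
        if batch["status"] in TERMINAL_STATUSES:
            return batch
        if waited >= timeout_seconds:
            raise TimeoutError(f"Batch still {batch['status']} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

With `httpx`, `fetch_batch` could be a small lambda that calls the status endpoint and returns `.json()["data"]`. Remember that a `completed` batch can still contain failures, so check `failed_count` on the returned object.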

## Getting results

When `status` becomes `completed`, fetch the batch again from the same status endpoint:

```bash
curl https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id} \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

The response now includes `results_url` and `error_file_url`:

```json
{
  "status": "ok",
  "data": {
    "id": "batch_abc123",
    "status": "completed",
    "results_url": "https://scaigrid.scailabs.ai/v1/media/tok_results_xyz",
    "error_file_url": "https://scaigrid.scailabs.ai/v1/media/tok_errors_xyz",
    ...
  }
}
```

Download the results file — JSONL, one result per line, keyed by `custom_id`:

```jsonl
{"custom_id": "req-1", "status": "ok", "response": {"choices": [...], "usage": {...}}}
{"custom_id": "req-2", "status": "ok", "response": {"choices": [...], "usage": {...}}}
{"custom_id": "req-3", "status": "error", "error": {"code": "...", "message": "..."}}
```

```python
import json

import httpx

batch = httpx.get(
    f"https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id}",
    headers={"Authorization": f"Bearer {API_KEY}"},
).json()["data"]

if batch["status"] != "completed":
    raise RuntimeError(f"Batch not done: {batch['status']}")

results = httpx.get(batch["results_url"]).text
for line in results.splitlines():
    item = json.loads(line)
    if item["status"] == "ok":
        print(item["custom_id"], item["response"]["choices"][0]["message"]["content"])
    else:
        print(item["custom_id"], "FAILED:", item["error"]["code"])
```
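Because a batch can finish with some requests failed, it can help to separate the two groups and keep the original input lines for the failed ones. The sketch below assumes the results format shown above; `split_results` and `retry_lines` are hypothetical helpers, and whether resubmitting a failed request makes sense depends on the error it returned.

```python
import json


def split_results(results_text):
    """Return (succeeded, failed) dicts keyed by custom_id."""
    succeeded, failed = {}, {}
    for line in results_text.splitlines():
        if not line.strip():
            continue  # tolerate blank or trailing lines
        item = json.loads(line)
        if item["status"] == "ok":
            succeeded[item["custom_id"]] = item["response"]
        else:
            failed[item["custom_id"]] = item["error"]
    return succeeded, failed


def retry_lines(input_text, failed_ids):
    """Return the original input lines whose custom_id is in failed_ids."""
    return [
        line
        for line in input_text.splitlines()
        if line.strip() and json.loads(line)["custom_id"] in failed_ids
    ]
```

The lines returned by `retry_lines` can be written to a new JSONL file and submitted as a fresh batch.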

## Webhook notifications

Rather than polling, subscribe to batch events:

```bash
curl -X POST https://scaigrid.scailabs.ai/v1/webhooks \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-service.example/batch-webhook",
    "events": ["batch.completed", "batch.failed"]
  }'
```

ScaiGrid POSTs to your URL when a batch finishes. See [Webhooks](../06-reference/08-webhooks.md).

## Cancelling

```bash
curl -X POST https://scaigrid.scailabs.ai/v1/inference/batch/{batch_id}/cancel \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

Stops further processing. Already-completed requests stay completed — you can still fetch their results.

## Pricing

Batch pricing is a per-model discount off live pricing, applied at result time. The discount is visible on each model's metadata:

```bash
curl https://scaigrid.scailabs.ai/v1/models/scailabs/poolnoodle-omni \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

## Limits

- **Max requests per batch** — 50,000 by default, raisable for high-volume tenants.
- **Max input file size** — 500 MB.
- **Concurrent batches per tenant** — 10 by default.

Exceeding any of these limits returns a `QUOTA_EXCEEDED` error.
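If a workload exceeds the per-batch request limit, one option is to split the input lines into several files before submitting. The sketch below uses the default limit of 50,000 listed above; `chunk_requests` is a hypothetical helper, and it checks neither the 500 MB file size limit nor the concurrent batch limit.

```python
def chunk_requests(lines, max_per_batch=50_000):
    """Split a list of JSONL lines into groups of at most max_per_batch."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    return [
        lines[start:start + max_per_batch]
        for start in range(0, len(lines), max_per_batch)
    ]
```

Each group can then be written to its own input file and submitted as a separate batch, keeping the number of batches in flight within the concurrent limit.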

## What's next

- [Webhooks](../06-reference/08-webhooks.md) — be notified when batches complete.
- [Embeddings](./02-embeddings.md) — prime use case for batches (thousands of documents).
- [OpenAI Compatibility](./07-openai-compatibility.md) — `/oai/v1/batches` works similarly.