---
title: Inference Reference
path: reference/inference
status: published
---

# Inference Reference

This page lists every inference endpoint. For task-oriented guides, see [API Guides](../04-api-guides/).

**Base path:** `/v1/inference/`

**Required permission:** `models:use`

## POST /v1/inference/chat

Chat completion. See [Chat Completions](../04-api-guides/01-chat-completions.md) for the full walk-through.

**Request:**

| Field | Type | Required | Default |
|-------|------|----------|---------|
| `model` | string | Yes | |
| `messages` | array | Yes | |
| `max_tokens` | integer | No | |
| `temperature` | float | No | Provider-specific, usually 1.0 |
| `top_p` | float | No | |
| `stop` | string or array | No | |
| `seed` | integer | No | |
| `stream` | boolean | No | `false` |
| `tools` | array | No | |
| `tool_choice` | string or object | No | |
| `metadata` | object | No | |

**Response (non-streaming):** see [Chat Completions](../04-api-guides/01-chat-completions.md#response-shape).

**Response (streaming):** SSE stream. `data: {...}` chunks with `choices[0].delta.content`; ends with `data: [DONE]`. Errors arrive as `event: error\ndata: {...}`.
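
The streaming format above can be consumed with a small line-oriented parser. The following is a minimal sketch, not the official client: `iter_chat_deltas` and `StreamError` are hypothetical names, and the sketch assumes the lines have already been decoded to text and that each `data:` payload fits on one line.

```python
import json
from typing import Iterable, Iterator


class StreamError(Exception):
    """Raised when the stream delivers an `event: error` frame (hypothetical name)."""


def iter_chat_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield `choices[0].delta.content` fragments from decoded SSE lines."""
    event = "message"  # default SSE event type
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event; reset the event type.
            event = "message"
            continue
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
            continue
        if not line.startswith("data:"):
            continue  # comments and other SSE fields are ignored in this sketch
        data = line[len("data:"):].strip()
        if event == "error":
            raise StreamError(data)
        if data == "[DONE]":
            return
        chunk = json.loads(data)
        choices = chunk.get("choices") or [{}]
        content = (choices[0].get("delta") or {}).get("content")
        if content:
            yield content
```

Joining the yielded fragments reconstructs the full completion text; an `event: error` frame surfaces as an exception carrying the raw error payload.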

## POST /v1/inference/generate

Text generation (completion). Simpler than chat: the request carries a single prompt instead of role-tagged messages.

```json
{
  "prompt": "Once upon a time",
  "model": "scailabs/poolnoodle-omni",
  "max_tokens": 200,
  "temperature": 0.8,
  "stop": ["THE END"],
  "seed": 42
}
```

Returns a text completion. Most modern models are chat-trained; use `/v1/inference/chat` unless you specifically need raw text generation.
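
As an illustration, the request above could be assembled with the Python standard library as follows. This is a sketch under stated assumptions: `BASE_URL` is a placeholder host, `build_generate_request` is a hypothetical helper, and the `Content-Type: application/json` header is assumed from the JSON body rather than documented here.

```python
import json
import urllib.request

BASE_URL = "https://scaigrid.example.com"  # hypothetical host; replace with your deployment


def build_generate_request(token: str, prompt: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/inference/generate."""
    payload = {"model": "scailabs/poolnoodle-omni", "prompt": prompt, **options}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/inference/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed for JSON bodies
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; optional fields such as `max_tokens`, `temperature`, `stop`, and `seed` go in as keyword arguments.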

## POST /v1/inference/embed

Generate embeddings. See [Embeddings](../04-api-guides/02-embeddings.md).

```json
{
  "model": "openai/text-embedding-3-small",
  "input": ["first text", "second text"],
  "dimensions": 1536
}
```

Returns a list of vectors.

## POST /v1/inference/images/generate

Generate images. See [Images](../04-api-guides/03-images.md).

```json
{
  "model": "openai/dall-e-3",
  "prompt": "A landscape painting",
  "n": 1,
  "size": "1024x1024",
  "quality": "standard",
  "style": "vivid",
  "response_format": "url"
}
```

## POST /v1/inference/audio/transcribe

Speech-to-text. See [Audio](../04-api-guides/04-audio.md#transcription).

**Content-Type:** `multipart/form-data`

Form fields:

| Field | Type | Notes |
|-------|------|-------|
| `file` | binary | Audio file |
| `model` | string | Required |
| `language` | string | ISO 639-1 code |
| `temperature` | float | 0.0–1.0 |
| `prompt` | string | Context |
| `response_format` | string | `text` / `json` / `verbose_json` / `srt` / `vtt` |
| `timestamp_granularities` | array | `segment`, `word` |
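
Because this endpoint takes `multipart/form-data` rather than JSON, the body has to be encoded differently from the other endpoints. The sketch below shows one way to build such a body by hand with the standard library. `encode_multipart` is a hypothetical helper, and it covers only scalar text fields plus the `file` part; how an array field such as `timestamp_granularities` should be encoded is not specified here, so it is left out.

```python
import uuid
from typing import Mapping, Tuple


def encode_multipart(
    fields: Mapping[str, str],
    filename: str,
    audio: bytes,
    file_content_type: str = "application/octet-stream",
) -> Tuple[str, bytes]:
    """Return (Content-Type header value, encoded body) for a transcribe request."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        # One part per scalar text field.
        text_part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text_part.encode("utf-8"))
    # The audio goes in the `file` part as raw bytes.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {file_content_type}\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + audio + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)
```

The first return value goes in the request's `Content-Type` header and the second is the request body. Most HTTP client libraries can produce this encoding for you; the manual version only makes the wire format visible.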

## POST /v1/inference/audio/synthesize

Text-to-speech. See [Audio](../04-api-guides/04-audio.md#synthesis-text-to-speech).

```json
{
  "model": "openai/tts-1",
  "input": "Hello world",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
```

**Response:** raw audio bytes, not wrapped in the JSON envelope. The `Content-Type` header reflects the audio format.
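
Since the body is binary, a client should write it to disk directly instead of parsing it as JSON. The sketch below picks a file extension from the `Content-Type` header. `save_audio` is a hypothetical helper, and the mapping uses common MIME types as an assumption; the exact values this service returns are not listed on this page.

```python
from pathlib import Path

# Assumed mapping from common audio MIME types to extensions; extend as needed.
EXTENSIONS = {
    "audio/mpeg": ".mp3",
    "audio/wav": ".wav",
    "audio/ogg": ".ogg",
    "audio/flac": ".flac",
}


def save_audio(body: bytes, content_type: str, stem: Path) -> Path:
    """Write raw audio bytes to `stem` plus an extension derived from Content-Type."""
    # Drop any parameters such as "; charset=..." before the lookup.
    media_type = content_type.split(";", 1)[0].strip().lower()
    path = stem.with_suffix(EXTENSIONS.get(media_type, ".bin"))
    path.write_bytes(body)
    return path
```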

## Batch inference

See [Batch Inference](../04-api-guides/06-batch-inference.md) for the complete workflow.

### POST /v1/inference/batch

Submit a batch.

```json
{
  "input_file_url": "s3://...",
  "endpoint_completion_window": "24h",
  "metadata": {...}
}
```

### GET /v1/inference/batch

List batches. Query params: `status`, `limit`, `cursor`.
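
Cursor-based listing is usually consumed in a loop. The sketch below shows the general pattern only: the response field names `items` and `next_cursor` are assumptions, not documented on this page, and `fetch_page` is a hypothetical callable that performs the GET with the given query params and returns the envelope's `data` object.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_batches(
    fetch_page: Callable[[Dict[str, str]], dict],
    status: Optional[str] = None,
    limit: int = 50,
) -> Iterator[dict]:
    """Yield batches across pages until no further cursor is returned."""
    cursor: Optional[str] = None
    while True:
        params = {"limit": str(limit)}
        if status:
            params["status"] = status
        if cursor:
            params["cursor"] = cursor
        data = fetch_page(params)
        yield from data.get("items", [])       # assumed field name
        cursor = data.get("next_cursor")       # assumed field name
        if not cursor:
            return
```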

### GET /v1/inference/batch/{batch_id}

Get batch status and result URLs.

### POST /v1/inference/batch/{batch_id}/cancel

Cancel a batch. Requests that completed before the cancellation are retained.

## Response envelope

All successful `/v1/inference/*` responses (except audio synthesis, which returns raw bytes) follow the standard envelope:

```json
{
  "status": "ok",
  "data": {...},
  "meta": {"request_id": "req_..."}
}
```

Errors use the same envelope with `status: "error"` and an `error` object. See [Errors](../03-core-concepts/07-errors.md).
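
A client can unwrap the envelope in one place so that calling code only ever sees `data`. The following is a minimal sketch: `unwrap` and `InferenceError` are hypothetical names, and the contents of the `error` object are treated as opaque because they are described on the Errors page rather than here.

```python
from typing import Any, Optional


class InferenceError(Exception):
    """Carries the envelope's `error` object and request ID (hypothetical name)."""

    def __init__(self, error: Any, request_id: Optional[str]):
        super().__init__(f"{error!r} (request_id={request_id})")
        self.error = error
        self.request_id = request_id


def unwrap(envelope: dict) -> Any:
    """Return `data` for a successful envelope; raise InferenceError otherwise."""
    request_id = (envelope.get("meta") or {}).get("request_id")
    if envelope.get("status") == "ok":
        return envelope.get("data")
    raise InferenceError(envelope.get("error"), request_id)
```

Keeping the request ID on the exception makes it easy to include in logs and support requests.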

## Headers

**Request:**

- `Authorization: Bearer <token>` — required
- `X-Request-ID: <id>` — optional, propagates through tracing

**Response:**

- `X-Scaigrid-Request-Id: <id>` — always present. Include in support requests.
- `X-Scaigrid-Model: <slug>` — the frontend model that served the request
- `X-Scaigrid-Backend: <id>` — the backend that was actually called
- `Retry-After: <seconds>` — present on 429 responses
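
The response headers above can be collected into one structure for logging and retry decisions. This is a sketch with hypothetical names (`ResponseInfo`, `read_response_headers`). It assumes the headers arrive as a plain name-to-value mapping and reads `Retry-After` as a number of seconds, as described above.

```python
from typing import Mapping, NamedTuple, Optional


class ResponseInfo(NamedTuple):
    request_id: Optional[str]
    model: Optional[str]
    backend: Optional[str]
    retry_after: Optional[float]  # seconds; only set for 429 responses


def read_response_headers(status_code: int, headers: Mapping[str, str]) -> ResponseInfo:
    """Extract the X-Scaigrid-* headers and, on 429, the Retry-After delay."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    retry_after: Optional[float] = None
    if status_code == 429:
        try:
            retry_after = max(0.0, float(lowered["retry-after"]))
        except (KeyError, ValueError):
            retry_after = None  # missing or non-numeric; caller picks a fallback
    return ResponseInfo(
        request_id=lowered.get("x-scaigrid-request-id"),
        model=lowered.get("x-scaigrid-model"),
        backend=lowered.get("x-scaigrid-backend"),
        retry_after=retry_after,
    )
```

A caller would typically sleep for `retry_after` seconds before retrying a 429, and log `request_id` alongside any failure.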

## Related

- [Chat Completions](../04-api-guides/01-chat-completions.md)
- [Embeddings](../04-api-guides/02-embeddings.md)
- [Images](../04-api-guides/03-images.md)
- [Audio](../04-api-guides/04-audio.md)
- [Batch Inference](../04-api-guides/06-batch-inference.md)
- [Models and Routing](./04-models-and-routing.md)