---
summary: "Every ScaiMind endpoint \u2014 jobs, monitoring, cluster, nodes, queue,\
  \ evaluations, data."
title: API reference
path: reference/api
status: published
---

All endpoints are mounted at `/v1/modules/scaimind/` and authenticate with the standard ScaiGrid bearer token. Successful responses use ScaiGrid's standard envelope (`{ "data": ... }`). Request and response bodies are translated from MindCoordinator's protobuf messages, so responses may include fields not listed here as the coordinator evolves.
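
The conventions above can be sketched in Python with only the standard library. This is a minimal illustration, not an official client: the helper names `build_request` and `unwrap` are hypothetical, and the host and key are whatever your deployment uses.

```python
import json
import urllib.request

BASE_PATH = "/v1/modules/scaimind"

def build_request(host, api_key, path, method="GET", body=None):
    """Build (but do not send) an authenticated ScaiMind request."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if body is not None:
        # JSON bodies are an assumption based on the examples in this page.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = host.rstrip("/") + BASE_PATH + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)

def unwrap(raw):
    """Return the `data` member of a success envelope."""
    envelope = json.loads(raw)
    if "data" not in envelope:
        raise ValueError("response is not a success envelope")
    return envelope["data"]

# Usage (performs a real HTTP call, so it is left commented out):
# with urllib.request.urlopen(build_request(host, key, "/jobs")) as resp:
#     jobs = unwrap(resp.read())["jobs"]
```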

## Jobs

### `GET /jobs`

List jobs in the caller's tenant. Side effect: best-effort sync of returned jobs into the local cache (`mod_scaimind_jobs`).

```bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

Query parameters:

| Parameter | Notes |
|---|---|
| `status` | Repeatable. One of `PENDING`, `QUEUED`, `SCHEDULING`, `PREPARING`, `TRAINING`, `CHECKPOINTING`, `PAUSED`, `EVALUATING`, `EXPORTING`, `COMPLETED`, `FAILED`, `CANCELLED`, `PREEMPTED`. |
| `page_size` | 1-100, default 20. |
| `page_token` | Opaque, from previous response's `next_page_token`. |
| `order_by` | Coordinator-defined ordering string. |

Response:

```json
{
  "data": {
    "jobs": [ { "job_id": "...", "name": "...", "status": "TRAINING", "...": "..." } ],
    "next_page_token": "...",
    "total_count": 12
  }
}
```
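
To walk every page, a client can follow `next_page_token` until it comes back empty. The sketch below assumes that an empty or absent token marks the last page, which this reference does not state explicitly. `fetch_page` is a hypothetical caller-supplied function that performs `GET /jobs` with the given query parameters and returns the unwrapped `data` object.

```python
def iter_jobs(fetch_page, page_size=20):
    """Yield every job across pages, following next_page_token."""
    token = ""
    while True:
        params = {"page_size": page_size}
        if token:
            params["page_token"] = token
        page = fetch_page(params)
        yield from page.get("jobs", [])
        # Assumption: an empty or missing token means there are no more pages.
        token = page.get("next_page_token", "")
        if not token:
            break
```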

### `POST /jobs`

Submit a new training job. Requires `scaimind:manage`. Returns `201` on success.

Body fields (see [Training jobs](../concepts/training-jobs) for the full schema):

| Field | Required | Notes |
|---|---|---|
| `name` | yes | Human-readable label. |
| `training_config` | yes | See `TrainingConfigSchema`. |
| `data_config` | yes | See `DataConfigSchema`. |
| `resource_config` | yes | See `ResourceConfigSchema`. |
| `output_config` | no | See `OutputConfigSchema`. |
| `scheduling_config` | no | See `SchedulingConfigSchema`. |
| `labels` | no | Free-form `dict[str,str]`. |

Token exchange: this endpoint mints downstream ScaiDrive and ScaiAtlas tokens and forwards them as gRPC metadata so the coordinator can fetch datasets and model artefacts.

Response: `SubmitJobResponse` — `{ "job_id", "status", "message", "estimated_wait_seconds" }`.
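
Before posting, a client may want to check a body against the field table above. The following is a rough client-side sketch only; the server remains the authority on validation, and the function name `check_submit_body` is hypothetical. The nested config schemas are not inspected here.

```python
REQUIRED_FIELDS = ("name", "training_config", "data_config", "resource_config")

def check_submit_body(body):
    """Return a list of problems in a POST /jobs body (empty if none found)."""
    problems = [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in body]
    labels = body.get("labels", {})
    # `labels` is documented as a free-form dict[str, str].
    if not isinstance(labels, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in labels.items()
    ):
        problems.append("labels must map strings to strings")
    return problems
```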

### `GET /jobs/{job_id}`

Fetch one job's full state from the coordinator.

### `POST /jobs/{job_id}/cancel`

Cancel a non-terminal job. Body (optional): `{ "reason": "..." }`.

### `POST /jobs/{job_id}/pause`

Pause a running job. Body (optional): `{ "save_checkpoint": true }`.

### `POST /jobs/{job_id}/resume`

Resume a paused job. Body (optional): `{ "checkpoint_id": "" }`. An empty value resumes from the most recent checkpoint.

### `POST /jobs/{job_id}/retry`

Retry a failed or preempted job. Body (optional):

```json
{
  "checkpoint_id": "",
  "modify_resources": false,
  "new_resource_config": null
}
```

When `modify_resources` is `true`, `new_resource_config` must be supplied and must follow `ResourceConfigSchema`. A retry creates a child job whose `parent_job_id` points at the original.
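
One way to keep `modify_resources` and `new_resource_config` consistent is to derive the flag from the config. This is a small sketch with a hypothetical helper name; the contents of the resource config in the usage comment are placeholders, not fields confirmed by `ResourceConfigSchema`.

```python
def retry_body(checkpoint_id="", new_resource_config=None):
    """Build a POST /jobs/{job_id}/retry body.

    modify_resources is set only when a new resource config is given,
    so the two fields cannot disagree.
    """
    return {
        "checkpoint_id": checkpoint_id,
        "modify_resources": new_resource_config is not None,
        "new_resource_config": new_resource_config,
    }

# retry_body() reproduces the default body shown above.
# retry_body(new_resource_config={...}) sets modify_resources to True.
```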

## Monitoring

### `GET /jobs/{job_id}/metrics`

Point-in-time read of training metrics.

Query parameters: `max_points` (integer; default `0`, which defers to the coordinator's own default).

### `GET /jobs/{job_id}/logs`

Server-Sent Event stream of log entries.

Query parameters:

| Parameter | Notes |
|---|---|
| `follow` | Boolean. Default `true`. If `false`, returns the tail and closes. |
| `level` | `DEBUG`, `INFO`, `WARNING`. Default `INFO`. |
| `tail` | Initial lines to send. Default 100. |

Events:

| Event | Payload |
|---|---|
| `log` | `{ "level": "...", "message": "...", "timestamp": "...", ... }` (coordinator-defined). |
| `error` | `{ "error": "stream ended" }` if the upstream stream terminates with an error. |
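
A client consuming this stream needs to split it into events. The sketch below assumes standard Server-Sent Events framing (`event:` and `data:` fields, a blank line between events) with a JSON payload in `data:`; the function name `parse_sse` is hypothetical. It works on any iterable of text lines, such as a decoded HTTP response body.

```python
import json

def parse_sse(lines):
    """Yield (event_name, payload) tuples from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)
```

A caller would typically stop reading when it sees an `error` event, since that signals the upstream stream has ended.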

### `GET /jobs/{job_id}/metrics/stream`

Server-Sent Event stream of metric snapshots.

Query parameters: `interval` (seconds, 1-60, default 5).

Events:

| Event | Payload |
|---|---|
| `metrics` | Proto-derived metrics snapshot for the most recent step. |
| `error` | `{ "error": "stream ended" }` on failure. |

## Cluster

### `GET /cluster`

Cluster-wide status.

```json
{
  "data": {
    "total_nodes": 8,
    "online_nodes": 7,
    "draining_nodes": 1,
    "offline_nodes": 0,
    "total_gpus": 64,
    "allocated_gpus": 32,
    "available_gpus": 32,
    "active_jobs": 4,
    "queued_jobs": 2,
    "cluster_utilization": 0.5
  }
}
```

### `GET /cluster/nodes`

List nodes.

Query parameters: `status_filter` (coordinator-defined string), `page_size` (1-100, default 20), `page_token`.

```json
{
  "data": {
    "nodes": [ { "node_id": "...", "status": "ONLINE", "...": "..." } ],
    "next_page_token": "...",
    "total_count": 8
  }
}
```

### `GET /cluster/nodes/{node_id}`

Fetch one node's full status.

### `POST /cluster/nodes/{node_id}/drain`

Stop scheduling new jobs onto a node. Requires `scaimind:cluster_admin`.

Body (optional):

```json
{ "force": false, "reason": "scheduled maintenance" }
```

`force: true` aborts in-flight jobs on that node; `false` lets them finish.

### `POST /cluster/nodes/{node_id}/enable`

Re-enable a drained or disabled node. Requires `scaimind:cluster_admin`.

## Queue

### `GET /queue`

Queue depth and estimated waits.

Query parameters: `queue_name` (string; defaults to empty, which returns all queues the tenant can see).

```json
{
  "data": {
    "queued_jobs": [
      {
        "job_id": "...", "tenant_id": "...", "name": "...",
        "priority": 6, "gpu_count": 4, "gpu_type": "A100-80GB",
        "queued_at": "...", "position": 2, "estimated_wait_seconds": 1200
      }
    ],
    "total_queued": 2,
    "estimated_wait_seconds": 1200
  }
}
```
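
For example, a client that wants the position and estimated wait of one job can scan `queued_jobs` in the unwrapped `data` object. The helper name below is hypothetical.

```python
def find_queued(queue_data, job_id):
    """Return (position, estimated_wait_seconds) for a job, or None if not queued."""
    for entry in queue_data.get("queued_jobs", []):
        if entry.get("job_id") == job_id:
            return entry.get("position"), entry.get("estimated_wait_seconds")
    return None
```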

## Evaluations

### `GET /evaluations`

List evaluation runs. Server-side this calls `ListJobs` with `label_filter={"type":"evaluation"}`, so the response shape matches `/jobs`.

Query parameters: `page_size`, `page_token`.

### `POST /evaluations`

Submit an evaluation. Returns `201` on success. Body:

| Field | Notes |
|---|---|
| `job_id` | The training job whose model is being evaluated. |
| `model_uri` | Where the model artefact lives. |
| `checkpoint_id` | Optional; evaluate a specific checkpoint instead of the final. |
| `benchmarks` | List of `{ "name", "dataset", "num_samples", "parameters" }`. |
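
A sketch of assembling this body in Python follows. The helper names are hypothetical, and the benchmark values in the usage comment are placeholders rather than names known to the coordinator. `checkpoint_id` is omitted unless supplied, on the assumption that leaving it out selects the final model.

```python
def benchmark(name, dataset, num_samples, parameters=None):
    """Build one entry of the `benchmarks` list."""
    return {
        "name": name,
        "dataset": dataset,
        "num_samples": num_samples,
        "parameters": parameters or {},
    }

def evaluation_body(job_id, model_uri, benchmarks, checkpoint_id=None):
    """Build a POST /evaluations body from the documented fields."""
    body = {"job_id": job_id, "model_uri": model_uri, "benchmarks": list(benchmarks)}
    if checkpoint_id:
        body["checkpoint_id"] = checkpoint_id
    return body

# evaluation_body("<job id>", "<model uri>",
#                 [benchmark("<benchmark name>", "<dataset path>", 100)])
```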

### `GET /evaluations/{evaluation_id}`

Fetch one evaluation's full record, including per-benchmark results once the evaluation is `COMPLETED`.

## Data

### `POST /data/validate`

Check that the coordinator can reach and parse one or more data sources before queueing a job. Forwards ScaiDrive and ScaiAtlas tokens.

Body:

```json
{
  "data_sources": [
    { "path": "scaidrive://.../foo.jsonl", "format": "jsonl" }
  ]
}
```

Response:

```json
{
  "data": {
    "valid": true,
    "validations": [
      {
        "path": "...", "accessible": true,
        "size_bytes": 12345, "record_count": 2000,
        "format": "jsonl", "error_message": ""
      }
    ]
  }
}
```
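
When `valid` is `false`, a client usually wants to know which sources failed. The sketch below treats a source as failed if it is not `accessible` or carries a non-empty `error_message`; that rule is an assumption, not documented coordinator behaviour, and the function name is hypothetical.

```python
def failed_sources(validation_data):
    """Return (path, error_message) pairs for sources that look unusable."""
    failures = []
    for v in validation_data.get("validations", []):
        message = v.get("error_message", "")
        # Assumed failure rule: inaccessible, or an error message is present.
        if not v.get("accessible", False) or message:
            failures.append((v.get("path", ""), message))
    return failures
```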

### `GET /data/cache`

Inspect the coordinator-side dataset cache. Forwards ScaiDrive and ScaiAtlas tokens.

Query parameters: `node_id` (optional; empty for cluster-wide).

```json
{
  "data": {
    "total_cache_size_bytes": 0,
    "used_cache_size_bytes": 0,
    "datasets": [
      {
        "path": "...", "size_bytes": 0,
        "cached_at": "...", "last_accessed": "...",
        "reference_count": 0
      }
    ]
  }
}
```
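
As a small illustration, the two size fields can be combined into a fill ratio. This assumes `total_cache_size_bytes` is the cache capacity and `used_cache_size_bytes` the portion in use, which the field names suggest but this reference does not define. The zero guard matters because both values can be `0`, as in the sample above.

```python
def cache_fill_ratio(cache_data):
    """Return used/total as a float in [0, 1], or None when total is zero."""
    total = cache_data.get("total_cache_size_bytes", 0)
    used = cache_data.get("used_cache_size_bytes", 0)
    if total <= 0:
        return None  # avoid dividing by zero on an empty or unreported cache
    return used / total
```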

## Errors

All endpoints return ScaiGrid's standard error envelope. Underlying gRPC status codes map onto HTTP as follows:

| gRPC code | HTTP | Detail handling |
|---|---|---|
| `NOT_FOUND` | 404 | Coordinator detail passed through. |
| `INVALID_ARGUMENT` | 400 | Coordinator detail passed through. |
| `PERMISSION_DENIED` | 403 | Coordinator detail passed through. |
| `UNAUTHENTICATED` | 401 | Coordinator detail passed through. |
| `ALREADY_EXISTS` | 409 | Coordinator detail passed through. |
| `FAILED_PRECONDITION` | 409 | Coordinator detail passed through. |
| `RESOURCE_EXHAUSTED` | 429 | Coordinator detail passed through. |
| `DEADLINE_EXCEEDED` | 504 | Friendly message; original logged server-side. |
| `UNAVAILABLE` | 503 | Friendly message; original logged server-side. |
| `INTERNAL` | 500 | Sanitised; raw detail captured in `scaimind_coordinator_error` log. |
| `UNKNOWN` | 500 | Sanitised; raw detail captured in log. |
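
The table can be expressed as a lookup, which is handy in client tests or mocks. This is an illustrative sketch, not the module's actual code; falling back to `500` for codes the table does not list is an assumption.

```python
GRPC_TO_HTTP = {
    "NOT_FOUND": 404,
    "INVALID_ARGUMENT": 400,
    "PERMISSION_DENIED": 403,
    "UNAUTHENTICATED": 401,
    "ALREADY_EXISTS": 409,
    "FAILED_PRECONDITION": 409,
    "RESOURCE_EXHAUSTED": 429,
    "DEADLINE_EXCEEDED": 504,
    "UNAVAILABLE": 503,
    "INTERNAL": 500,
    "UNKNOWN": 500,
}

def http_status_for(grpc_code):
    """Map a gRPC status code name to its HTTP status."""
    # Assumption: codes missing from the table are reported as 500.
    return GRPC_TO_HTTP.get(grpc_code, 500)
```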

The sanitisation logic in `errors.py` looks for traceback markers in the coordinator's `details()` string and, when it finds one, replaces the body with a generic message, so callers never receive leaked SQL, Python stack traces, or other infrastructure noise.
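
The general shape of such a check might look like the sketch below. It is not the contents of `errors.py`: the marker strings, the generic message, and the function name are all hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical values; the real markers and message are not documented here.
TRACEBACK_MARKERS = ("Traceback (most recent call last)", 'File "')
GENERIC_MESSAGE = "An internal error occurred."

def sanitise_detail(detail):
    """Return (safe_detail, was_sanitised) for a coordinator detail string."""
    if any(marker in detail for marker in TRACEBACK_MARKERS):
        # The raw detail would be logged server-side before being replaced.
        return GENERIC_MESSAGE, True
    return detail, False
```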
