API reference

All endpoints are mounted at `/v1/modules/scaimind/` and authenticate with the standard ScaiGrid bearer token. Successful responses use ScaiGrid's standard envelope (`{ "data": ... }`). Request and response bodies are translations of MindCoordinator's protobuf messages; fields not listed here may appear in responses as the coordinator evolves.

Jobs#

`GET /jobs`#

List jobs in the caller's tenant. Side effect: best-effort sync of returned jobs into the local cache (mod_scaimind_jobs).

bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Query parameters:

Parameter	Notes
`status`	Repeatable. One of `PENDING`, `QUEUED`, `SCHEDULING`, `PREPARING`, `TRAINING`, `CHECKPOINTING`, `PAUSED`, `EVALUATING`, `EXPORTING`, `COMPLETED`, `FAILED`, `CANCELLED`, `PREEMPTED`.
`page_size`	1-100, default 20.
`page_token`	Opaque, from previous response's `next_page_token`.
`order_by`	Coordinator-defined ordering string.

Response:

json
{
  "data": {
    "jobs": [ { "job_id": "...", "name": "...", "status": "TRAINING", "...": "..." } ],
    "next_page_token": "...",
    "total_count": 12
  }
}
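Following `next_page_token` until it runs out could be sketched as below. `build_jobs_url` and `iter_jobs` are illustrative names, not part of ScaiGrid, and `fetch_page` stands for any callable that performs the authenticated GET and returns the decoded JSON envelope. Treating an empty or missing token as the last page is an assumption.

```python
from urllib.parse import urlencode


def build_jobs_url(host, statuses=(), page_size=20, page_token=""):
    """Build a GET /jobs URL; `status` is repeated once per value."""
    if not 1 <= page_size <= 100:
        raise ValueError("page_size must be between 1 and 100")
    params = [("status", s) for s in statuses]
    params.append(("page_size", str(page_size)))
    if page_token:
        params.append(("page_token", page_token))
    return f"{host}/v1/modules/scaimind/jobs?{urlencode(params)}"


def iter_jobs(fetch_page, host, statuses=(), page_size=20):
    """Yield every job, requesting pages until no next_page_token is returned."""
    token = ""
    while True:
        envelope = fetch_page(build_jobs_url(host, statuses, page_size, token))
        data = envelope["data"]
        yield from data.get("jobs", [])
        token = data.get("next_page_token", "")
        if not token:  # assumption: empty or missing token marks the last page
            break
```

Passing the fetcher in as a parameter keeps the loop independent of any particular HTTP client.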

`POST /jobs`#

Submit a new training job. Requires `scaimind:manage`. Returns status code 201 on success.

Body fields (see Training jobs for the full schema):

Field	Required	Notes
`name`	yes	Human-readable label.
`training_config`	yes	See `TrainingConfigSchema`.
`data_config`	yes	See `DataConfigSchema`.
`resource_config`	yes	See `ResourceConfigSchema`.
`output_config`	no	See `OutputConfigSchema`.
`scheduling_config`	no	See `SchedulingConfigSchema`.
`labels`	no	Free-form `dict[str,str]`.

Token exchange: this endpoint mints downstream ScaiDrive and ScaiAtlas tokens and forwards them as gRPC metadata so the coordinator can fetch datasets and model artefacts.

Response: `SubmitJobResponse`, with fields `job_id`, `status`, `message`, and `estimated_wait_seconds`.
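A client might check the required fields from the table above before sending the request. This is a minimal sketch: `build_submit_body` is a hypothetical helper, and the nested config objects are passed through untouched because their schemas are defined elsewhere.

```python
import json

REQUIRED_FIELDS = ("name", "training_config", "data_config", "resource_config")


def build_submit_body(**fields):
    """Return the JSON body for POST /jobs, checking the documented basics."""
    missing = [f for f in REQUIRED_FIELDS if f not in fields]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    labels = fields.get("labels", {})
    # labels is documented as dict[str, str]; reject anything else early.
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in labels.items()):
        raise ValueError("labels must map strings to strings")
    return json.dumps(fields)
```

The check only catches omissions client-side; the coordinator remains the authority on whether the nested configs are valid.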

`GET /jobs/{job_id}`#

Fetch one job's full state from the coordinator.

`POST /jobs/{job_id}/cancel`#

Cancel a non-terminal job. Body (optional): { "reason": "..." }.

`POST /jobs/{job_id}/pause`#

Pause a running job. Body (optional): { "save_checkpoint": true }.

`POST /jobs/{job_id}/resume`#

Resume a paused job. Body (optional): { "checkpoint_id": "" }. An empty value resumes from the most recent checkpoint.

`POST /jobs/{job_id}/retry`#

Retry a failed or preempted job. Body (optional):

json
{
  "checkpoint_id": "",
  "modify_resources": false,
  "new_resource_config": null
}

When `modify_resources` is `true`, `new_resource_config` must be supplied and must follow `ResourceConfigSchema`. A retry creates a child job whose `parent_job_id` points at the original.
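One way to keep `modify_resources` and `new_resource_config` consistent is to derive the flag from whether a config was passed. `build_retry_body` is an illustrative helper, not part of the API.

```python
def build_retry_body(checkpoint_id="", new_resource_config=None):
    """Build the POST /jobs/{job_id}/retry body.

    modify_resources is derived from new_resource_config, so the two
    fields cannot disagree.
    """
    return {
        "checkpoint_id": checkpoint_id,
        "modify_resources": new_resource_config is not None,
        "new_resource_config": new_resource_config,
    }
```

Called with no arguments, the helper reproduces the default body shown above.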

Monitoring#

`GET /jobs/{job_id}/metrics`#

Point-in-time read of training metrics.

Query parameters: `max_points` (integer; default 0, which defers to the coordinator's default).

`GET /jobs/{job_id}/logs`#

Server-Sent Event stream of log entries.

Query parameters:

Parameter	Notes
`follow`	Boolean. Default `true`. If `false`, returns the tail and closes.
`level`	`DEBUG`, `INFO`, `WARNING`. Default `INFO`.
`tail`	Initial lines to send. Default 100.

Events:

Event	Payload
`log`	`{ "level": "...", "message": "...", "timestamp": "...", ... }` (coordinator-defined).
`error`	`{ "error": "stream ended" }` if the upstream stream terminates with an error.
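A client can turn the stream into `(event, payload)` pairs with a small parser. The sketch below assumes standard Server-Sent Events framing (`event:` and `data:` lines, a blank line ending each event) and JSON payloads on the `data:` lines; `parse_sse` is an illustrative name.

```python
import json


def parse_sse(lines):
    """Yield (event, payload) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line dispatches the buffered event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):  # comment line, often used as a keep-alive
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
```

A caller would typically stop reading once an `error` event arrives, since it signals that the upstream stream has ended.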

`GET /jobs/{job_id}/metrics/stream`#

Server-Sent Event stream of metric snapshots.

Query parameters: interval (seconds, 1-60, default 5).

Events:

Event	Payload
`metrics`	Proto-derived metrics snapshot for the most recent step.
`error`	`{ "error": "stream ended" }` on failure.

Cluster#

`GET /cluster`#

Cluster-wide status.

json
{
  "data": {
    "total_nodes": 8,
    "online_nodes": 7,
    "draining_nodes": 1,
    "offline_nodes": 0,
    "total_gpus": 64,
    "allocated_gpus": 32,
    "available_gpus": 32,
    "active_jobs": 4,
    "queued_jobs": 2,
    "cluster_utilization": 0.5
  }
}

`GET /cluster/nodes`#

List nodes.

Query parameters: status_filter (coordinator-defined string), page_size (1-100, default 20), page_token.

json
{
  "data": {
    "nodes": [ { "node_id": "...", "status": "ONLINE", "...": "..." } ],
    "next_page_token": "...",
    "total_count": 8
  }
}

`GET /cluster/nodes/{node_id}`#

Fetch one node's full status.

`POST /cluster/nodes/{node_id}/drain`#

Stop scheduling new jobs onto a node. Requires scaimind:cluster_admin.

Body (optional):

json
{ "force": false, "reason": "scheduled maintenance" }

`force: true` aborts in-flight jobs on that node; `force: false` lets them finish.

`POST /cluster/nodes/{node_id}/enable`#

Re-enable a drained or disabled node. Requires scaimind:cluster_admin.

Queue#

`GET /queue`#

Queue depth and estimated waits.

Query parameters: `queue_name` (string; defaults to empty, which covers all queues the tenant can see).

json
{
  "data": {
    "queued_jobs": [
      {
        "job_id": "...", "tenant_id": "...", "name": "...",
        "priority": 6, "gpu_count": 4, "gpu_type": "A100-80GB",
        "queued_at": "...", "position": 2, "estimated_wait_seconds": 1200
      }
    ],
    "total_queued": 2,
    "estimated_wait_seconds": 1200
  }
}
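Looking up a single job in this response is a short scan over `queued_jobs`. `find_queue_entry` is a hypothetical helper that takes the decoded envelope.

```python
def find_queue_entry(envelope, job_id):
    """Return (position, estimated_wait_seconds) for job_id, or None if absent."""
    for entry in envelope["data"].get("queued_jobs", []):
        if entry["job_id"] == job_id:
            return entry["position"], entry["estimated_wait_seconds"]
    return None
```

A `None` result only means the job is not in the returned queue; it says nothing about whether the job has started, finished, or failed.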

Evaluations#

`GET /evaluations`#

List evaluation runs. Server-side this calls ListJobs with label_filter={"type":"evaluation"}, so the response shape matches /jobs.

Query parameters: page_size, page_token.

`POST /evaluations`#

Submit an evaluation. Returns status code 201 on success. Body:

Field	Notes
`job_id`	The training job whose model is being evaluated.
`model_uri`	Where the model artefact lives.
`checkpoint_id`	Optional; evaluate a specific checkpoint instead of the final.
`benchmarks`	List of `{ "name", "dataset", "num_samples", "parameters" }`.
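The body can be assembled from the fields in the table, as in this sketch. `build_evaluation_body` is an illustrative helper; rejecting undocumented benchmark keys is a client-side choice to catch typos, not documented server behaviour.

```python
BENCHMARK_KEYS = {"name", "dataset", "num_samples", "parameters"}


def build_evaluation_body(job_id, model_uri, benchmarks, checkpoint_id=None):
    """Build the POST /evaluations body from the documented fields."""
    for bench in benchmarks:
        extra = set(bench) - BENCHMARK_KEYS
        if extra:
            raise ValueError(f"unexpected benchmark keys: {sorted(extra)}")
    body = {
        "job_id": job_id,
        "model_uri": model_uri,
        "benchmarks": list(benchmarks),
    }
    if checkpoint_id is not None:  # optional; left out to evaluate the final model
        body["checkpoint_id"] = checkpoint_id
    return body
```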

`GET /evaluations/{evaluation_id}`#

Fetch one evaluation's full record, including per-benchmark results once COMPLETED.

Data#

`POST /data/validate`#

Check that the coordinator can reach and parse one or more data sources before queueing a job. Forwards ScaiDrive and ScaiAtlas tokens.

Body:

json
{
  "data_sources": [
    { "path": "scaidrive://.../foo.jsonl", "format": "jsonl" }
  ]
}

Response:

json
{
  "data": {
    "valid": true,
    "validations": [
      {
        "path": "...", "accessible": true,
        "size_bytes": 12345, "record_count": 2000,
        "format": "jsonl", "error_message": ""
      }
    ]
  }
}
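Before submitting a job, a caller might collect the sources that failed validation. In this sketch `failed_sources` is a hypothetical helper, and treating a non-empty `error_message` as a failure even when `accessible` is true is an assumption.

```python
def failed_sources(envelope):
    """Return (path, reason) for every data source that did not validate."""
    failures = []
    for item in envelope["data"]["validations"]:
        if not item["accessible"] or item["error_message"]:
            failures.append((item["path"], item["error_message"] or "not accessible"))
    return failures
```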

`GET /data/cache`#

Inspect the coordinator-side dataset cache. Forwards ScaiDrive and ScaiAtlas tokens.

Query parameters: node_id (optional; empty for cluster-wide).

json
{
  "data": {
    "total_cache_size_bytes": 0,
    "used_cache_size_bytes": 0,
    "datasets": [
      {
        "path": "...", "size_bytes": 0,
        "cached_at": "...", "last_accessed": "...",
        "reference_count": 0
      }
    ]
  }
}
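The response lends itself to a quick utilisation summary. The sketch below is illustrative only: `cache_usage` is a hypothetical helper, and reading `reference_count == 0` as "no job currently uses this dataset" is an assumption about the coordinator's semantics.

```python
def cache_usage(envelope):
    """Return (used_fraction, unreferenced_paths) from a GET /data/cache response."""
    data = envelope["data"]
    total = data["total_cache_size_bytes"]
    # Guard against an empty cache so the division cannot fail.
    used_fraction = data["used_cache_size_bytes"] / total if total else 0.0
    unreferenced = [d["path"] for d in data["datasets"] if d["reference_count"] == 0]
    return used_fraction, unreferenced
```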

Errors#

All endpoints return ScaiGrid's standard error envelope. Underlying gRPC status codes map onto HTTP as follows:

gRPC code	HTTP	Detail handling
`NOT_FOUND`	404	Coordinator detail passed through.
`INVALID_ARGUMENT`	400	Coordinator detail passed through.
`PERMISSION_DENIED`	403	Coordinator detail passed through.
`UNAUTHENTICATED`	401	Coordinator detail passed through.
`ALREADY_EXISTS`	409	Coordinator detail passed through.
`FAILED_PRECONDITION`	409	Coordinator detail passed through.
`RESOURCE_EXHAUSTED`	429	Coordinator detail passed through.
`DEADLINE_EXCEEDED`	504	Friendly message; original logged server-side.
`UNAVAILABLE`	503	Friendly message; original logged server-side.
`INTERNAL`	500	Sanitised; raw detail captured in `scaimind_coordinator_error` log.
`UNKNOWN`	500	Sanitised; raw detail captured in log.

The sanitisation logic in `errors.py` looks for traceback markers in the coordinator's `details()` string and replaces the body with a generic message when one is detected, so callers never receive leaked SQL, Python stack traces, or other infrastructure noise.
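The mapping and sanitisation rule can be sketched as follows. This is not the contents of `errors.py`: the marker strings, the replacement messages, the function name `translate`, and the choice to pass marker-free `INTERNAL`/`UNKNOWN` details through are all assumptions made for illustration.

```python
GRPC_TO_HTTP = {
    "NOT_FOUND": 404, "INVALID_ARGUMENT": 400, "PERMISSION_DENIED": 403,
    "UNAUTHENTICATED": 401, "ALREADY_EXISTS": 409, "FAILED_PRECONDITION": 409,
    "RESOURCE_EXHAUSTED": 429, "DEADLINE_EXCEEDED": 504, "UNAVAILABLE": 503,
    "INTERNAL": 500, "UNKNOWN": 500,
}
FRIENDLY = {"DEADLINE_EXCEEDED", "UNAVAILABLE"}
SANITISED = {"INTERNAL", "UNKNOWN"}
# Hypothetical markers; the real list lives in errors.py.
TRACEBACK_MARKERS = ("Traceback (most recent call last)", 'File "')


def translate(code, detail):
    """Map a gRPC status name and detail string to (http_status, message)."""
    status = GRPC_TO_HTTP.get(code, 500)  # assumption: unlisted codes become 500
    if code in FRIENDLY:
        return status, "The coordinator is temporarily unavailable."
    if code in SANITISED or code not in GRPC_TO_HTTP:
        if any(marker in detail for marker in TRACEBACK_MARKERS):
            return status, "Internal error."
        return status, detail
    return status, detail  # remaining codes pass the coordinator detail through
```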
