API reference
All endpoints are mounted at /v1/modules/scaimind/ and authenticate with the standard ScaiGrid bearer token. Responses use ScaiGrid's standard envelope ({ "data": ... } for success). Bodies and responses are translations of MindCoordinator's protobuf messages — fields not listed here may appear in responses as the coordinator evolves.
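As a minimal sketch of the conventions above, the following standard-library helpers build an authenticated request and unwrap a success envelope. The host `scaigrid.example.com`, the helper names, and the `Authorization: Bearer ...` header format are assumptions for illustration; the reference only states that the standard ScaiGrid bearer token is used.

```python
import json
import urllib.request

# Hypothetical host; only the /v1/modules/scaimind prefix comes from the reference.
BASE_URL = "https://scaigrid.example.com/v1/modules/scaimind"


def build_request(path, token, method="GET", body=None):
    """Build (but do not send) a request against the ScaiMind module."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")  # assumed header format
    req.add_header("Accept", "application/json")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def unwrap(envelope):
    """Return the payload of a success envelope of the form {"data": ...}."""
    if "data" not in envelope:
        raise ValueError("not a success envelope")
    return envelope["data"]
```

Because fields not listed in this reference may appear in responses, callers are probably best served by reading the keys they need from the unwrapped payload and ignoring the rest.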
Jobs
GET /jobs
List jobs in the caller's tenant. Side effect: best-effort sync of returned jobs into the local cache (mod_scaimind_jobs).
Query parameters:
| Parameter | Notes |
|---|---|
| status | Repeatable. One of PENDING, QUEUED, SCHEDULING, PREPARING, TRAINING, CHECKPOINTING, PAUSED, EVALUATING, EXPORTING, COMPLETED, FAILED, CANCELLED, PREEMPTED. |
| page_size | 1-100, default 20. |
| page_token | Opaque, from previous response's next_page_token. |
| order_by | Coordinator-defined ordering string. |
Response: the matching jobs, plus a next_page_token to pass as page_token when requesting the next page.
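The query parameters and token-based pagination for this endpoint can be sketched as follows. The function names are hypothetical, `fetch_page` stands in for whatever performs the HTTP call and returns the unwrapped payload, and two details are assumptions rather than documented behaviour: that the list lives under a `jobs` key, and that a missing or empty `next_page_token` marks the last page.

```python
from urllib.parse import urlencode


def jobs_query(statuses=(), page_size=20, page_token="", order_by=""):
    """Build the GET /jobs query string; status is repeated once per value."""
    if not 1 <= page_size <= 100:
        raise ValueError("page_size must be between 1 and 100")
    params = [("status", s) for s in statuses]
    params.append(("page_size", str(page_size)))
    if page_token:
        params.append(("page_token", page_token))
    if order_by:
        params.append(("order_by", order_by))
    return urlencode(params)


def iter_jobs(fetch_page, statuses=(), page_size=20):
    """Yield jobs across pages; fetch_page(query) returns one unwrapped page."""
    token = ""
    while True:
        page = fetch_page(jobs_query(statuses, page_size, token))
        yield from page.get("jobs", [])  # "jobs" key is an assumption
        token = page.get("next_page_token", "")
        if not token:  # assumed end-of-list signal
            break
```

Treating `page_token` as opaque, as the table requires, means the loop only ever echoes back the value it was handed.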
POST /jobs
Submit a new training job. Requires scaimind:manage. Status code 201.
Body fields (see Training jobs for the full schema):
| Field | Required | Notes |
|---|---|---|
| name | yes | Human-readable label. |
| training_config | yes | See TrainingConfigSchema. |
| data_config | yes | See DataConfigSchema. |
| resource_config | yes | See ResourceConfigSchema. |
| output_config | no | See OutputConfigSchema. |
| scheduling_config | no | See SchedulingConfigSchema. |
| labels | no | Free-form dict[str,str]. |
Token exchange: this endpoint mints downstream ScaiDrive and ScaiAtlas tokens and forwards them as gRPC metadata so the coordinator can fetch datasets and model artefacts.
Response: SubmitJobResponse — { "job_id", "status", "message", "estimated_wait_seconds" }.
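A client-side check of the top-level body fields might look like the sketch below. `build_job_body` is a hypothetical helper, and the nested config objects are passed through untouched because their schemas are defined elsewhere; the coordinator remains the authority on what is valid.

```python
REQUIRED_FIELDS = ("name", "training_config", "data_config", "resource_config")


def build_job_body(**fields):
    """Assemble a POST /jobs body, checking only the top-level required fields."""
    missing = [key for key in REQUIRED_FIELDS if key not in fields]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    labels = fields.get("labels", {})
    # labels is documented as a free-form dict[str, str].
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in labels.items()):
        raise TypeError("labels must map str to str")
    return dict(fields)
```

For example, `build_job_body(name="demo", training_config={}, data_config={}, resource_config={})` returns the body unchanged; the empty dicts are placeholders, not valid configs.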
GET /jobs/{job_id}
Fetch one job's full state from the coordinator.
POST /jobs/{job_id}/cancel
Cancel a non-terminal job. Body (optional): { "reason": "..." }.
POST /jobs/{job_id}/pause
Pause a running job. Body (optional): { "save_checkpoint": true }.
POST /jobs/{job_id}/resume
Resume a paused job. Body (optional): { "checkpoint_id": "" }. Empty value resumes from the most recent checkpoint.
POST /jobs/{job_id}/retry
Retry a failed or preempted job. The body is optional and may carry modify_resources and new_resource_config. When modify_resources is true, new_resource_config must be supplied and must follow ResourceConfigSchema. The retry creates a child job whose parent_job_id points at the original.
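The four lifecycle actions share a shape: a POST to `/jobs/{job_id}/<action>` with an optional body. The sketch below builds the path and body for each and enforces the one documented constraint on retry. `lifecycle_call` is a hypothetical helper, and the allowed field sets contain only the names mentioned in this section; the coordinator may accept others.

```python
# Body fields named in this reference for each action (possibly incomplete).
ACTION_FIELDS = {
    "cancel": {"reason"},
    "pause": {"save_checkpoint"},
    "resume": {"checkpoint_id"},
    "retry": {"modify_resources", "new_resource_config"},
}


def lifecycle_call(job_id, action, **options):
    """Return (path, body) for a job lifecycle action; the body may be empty."""
    if action not in ACTION_FIELDS:
        raise ValueError(f"unknown action: {action}")
    extra = set(options) - ACTION_FIELDS[action]
    if extra:
        raise ValueError(f"unexpected fields for {action}: {sorted(extra)}")
    if options.get("modify_resources") and "new_resource_config" not in options:
        raise ValueError("new_resource_config is required when modify_resources is true")
    return f"/jobs/{job_id}/{action}", dict(options)
```

Calling `lifecycle_call("j1", "resume")` yields an empty body, which per the resume description corresponds to resuming from the most recent checkpoint.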
Monitoring
GET /jobs/{job_id}/metrics
Point-in-time read of training metrics.
Query parameters: max_points (integer; default 0 = coordinator default).
GET /jobs/{job_id}/logs
Server-Sent Event stream of log entries.
Query parameters:
| Parameter | Notes |
|---|---|
| follow | Boolean. Default true. If false, returns the tail and closes. |
| level | DEBUG, INFO, WARNING. Default INFO. |
| tail | Initial lines to send. Default 100. |
Events:
| Event | Payload |
|---|---|
| log | { "level": "...", "message": "...", "timestamp": "...", ... } (coordinator-defined). |
| error | { "error": "stream ended" } if the upstream stream terminates with an error. |
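A consumer of this stream needs to split it into named events. The sketch below is a minimal Server-Sent Events parser over an iterable of text lines; it assumes each event's data is a single JSON document, which matches the payloads in the table but is not stated outright. `parse_sse` is a hypothetical name, and the sample messages in the usage note are invented.

```python
import json


def parse_sse(lines):
    """Yield (event_name, payload) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
```

A caller would typically print `log` payloads as they arrive and stop (or reconnect) on an `error` event, for example `for name, payload in parse_sse(response): ...` where `response` iterates over decoded lines.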
GET /jobs/{job_id}/metrics/stream
Server-Sent Event stream of metric snapshots.
Query parameters: interval (seconds, 1-60, default 5).
Events:
| Event | Payload |
|---|---|
| metrics | Proto-derived metrics snapshot for the most recent step. |
| error | { "error": "stream ended" } on failure. |
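The documented ranges and defaults for the two streaming endpoints can be checked before a connection is opened. This is a sketch with hypothetical helper names; encoding the boolean as the lowercase strings `true` and `false` is an assumption.

```python
from urllib.parse import urlencode

LOG_LEVELS = ("DEBUG", "INFO", "WARNING")


def logs_url(job_id, follow=True, level="INFO", tail=100):
    """Build the relative URL for the log stream of one job."""
    if level not in LOG_LEVELS:
        raise ValueError(f"level must be one of {LOG_LEVELS}")
    if tail < 0:
        raise ValueError("tail must not be negative")
    query = urlencode([
        ("follow", "true" if follow else "false"),  # assumed boolean encoding
        ("level", level),
        ("tail", str(tail)),
    ])
    return f"/jobs/{job_id}/logs?{query}"


def metrics_stream_url(job_id, interval=5):
    """Build the relative URL for the metrics stream; interval is in seconds."""
    if not 1 <= interval <= 60:
        raise ValueError("interval must be between 1 and 60 seconds")
    return f"/jobs/{job_id}/metrics/stream?interval={interval}"
```

With `follow=False` the log endpoint returns the tail and closes, so the same URL builder serves both one-shot and live reads.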
Cluster
GET /cluster
Cluster-wide status.
GET /cluster/nodes
List nodes.
Query parameters: status_filter (coordinator-defined string), page_size (1-100, default 20), page_token.
GET /cluster/nodes/{node_id}
Fetch one node's full status.
POST /cluster/nodes/{node_id}/drain
Stop scheduling new jobs onto a node. Requires scaimind:cluster_admin.
Body (optional): a JSON object with a boolean force field. force: true aborts in-flight jobs on that node; false lets them finish.
POST /cluster/nodes/{node_id}/enable
Re-enable a drained or disabled node. Requires scaimind:cluster_admin.
Queue
GET /queue
Queue depth and estimated waits.
Query parameters: queue_name (string, default empty for all queues the tenant can see).
Evaluations
GET /evaluations
List evaluation runs. Server-side this calls ListJobs with label_filter={"type":"evaluation"}, so the response shape matches /jobs.
Query parameters: page_size, page_token.
POST /evaluations
Submit an evaluation. Status code 201. Body:
| Field | Notes |
|---|---|
| job_id | The training job whose model is being evaluated. |
| model_uri | Where the model artefact lives. |
| checkpoint_id | Optional; evaluate a specific checkpoint instead of the final. |
| benchmarks | List of { "name", "dataset", "num_samples", "parameters" }. |
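An evaluation body built from the fields above might be assembled as in this sketch. `build_evaluation_body` is a hypothetical helper; requiring `name` on every benchmark is an assumption, since the reference does not say which benchmark keys are mandatory.

```python
BENCHMARK_KEYS = {"name", "dataset", "num_samples", "parameters"}


def build_evaluation_body(job_id, model_uri, benchmarks, checkpoint_id=None):
    """Assemble a POST /evaluations body from the documented fields."""
    for index, bench in enumerate(benchmarks):
        if "name" not in bench:  # assumed to be required
            raise ValueError(f"benchmark {index} has no name")
        unknown = set(bench) - BENCHMARK_KEYS
        if unknown:
            raise ValueError(f"benchmark {index} has unknown keys: {sorted(unknown)}")
    body = {
        "job_id": job_id,
        "model_uri": model_uri,
        "benchmarks": [dict(bench) for bench in benchmarks],
    }
    if checkpoint_id:
        # Omitted by default so the final model is evaluated.
        body["checkpoint_id"] = checkpoint_id
    return body
```

Leaving `checkpoint_id` out entirely, rather than sending an empty string, keeps the request aligned with the "optional" wording in the table.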
GET /evaluations/{evaluation_id}
Fetch one evaluation's full record, including per-benchmark results once COMPLETED.
Data
POST /data/validate
Check that the coordinator can reach and parse one or more data sources before queueing a job. Forwards ScaiDrive and ScaiAtlas tokens.
The request body names the data sources to check, and the response reports the outcome of the validation.
GET /data/cache
Inspect the coordinator-side dataset cache. Forwards ScaiDrive and ScaiAtlas tokens.
Query parameters: node_id (optional; empty for cluster-wide).
Errors
All endpoints return ScaiGrid's standard error envelope. Underlying gRPC status codes map onto HTTP as follows:
| gRPC code | HTTP | Detail handling |
|---|---|---|
| NOT_FOUND | 404 | Coordinator detail passed through. |
| INVALID_ARGUMENT | 400 | Coordinator detail passed through. |
| PERMISSION_DENIED | 403 | Coordinator detail passed through. |
| UNAUTHENTICATED | 401 | Coordinator detail passed through. |
| ALREADY_EXISTS | 409 | Coordinator detail passed through. |
| FAILED_PRECONDITION | 409 | Coordinator detail passed through. |
| RESOURCE_EXHAUSTED | 429 | Coordinator detail passed through. |
| DEADLINE_EXCEEDED | 504 | Friendly message; original logged server-side. |
| UNAVAILABLE | 503 | Friendly message; original logged server-side. |
| INTERNAL | 500 | Sanitised; raw detail captured in scaimind_coordinator_error log. |
| UNKNOWN | 500 | Sanitised; raw detail captured in log. |
The sanitisation logic in errors.py looks for traceback markers in the coordinator's details() string and replaces the body with a generic message when one is detected — so callers never receive leaked SQL, Python stack traces, or other infrastructure noise.
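The mapping and sanitisation rules can be illustrated with the sketch below. It is not the contents of errors.py: the marker strings, the friendly and generic messages, and the function name are all assumptions. It also assumes that INTERNAL and UNKNOWN details are replaced only when a traceback marker is found, and that unmapped codes fall back to a generic 500.

```python
GRPC_TO_HTTP = {
    "NOT_FOUND": 404, "INVALID_ARGUMENT": 400, "PERMISSION_DENIED": 403,
    "UNAUTHENTICATED": 401, "ALREADY_EXISTS": 409, "FAILED_PRECONDITION": 409,
    "RESOURCE_EXHAUSTED": 429, "DEADLINE_EXCEEDED": 504, "UNAVAILABLE": 503,
    "INTERNAL": 500, "UNKNOWN": 500,
}
PASS_THROUGH = {
    "NOT_FOUND", "INVALID_ARGUMENT", "PERMISSION_DENIED", "UNAUTHENTICATED",
    "ALREADY_EXISTS", "FAILED_PRECONDITION", "RESOURCE_EXHAUSTED",
}
# Illustrative wording; the real messages are not given in the reference.
FRIENDLY = {
    "DEADLINE_EXCEEDED": "The coordinator took too long to respond.",
    "UNAVAILABLE": "The coordinator is temporarily unavailable.",
}
GENERIC_MESSAGE = "Internal coordinator error."
# Assumed markers of a Python traceback in the coordinator's detail string.
TRACEBACK_MARKERS = ("Traceback (most recent call last)", 'File "')


def to_http_error(grpc_code, detail):
    """Map a gRPC status name and detail string to (http_status, message)."""
    status = GRPC_TO_HTTP.get(grpc_code, 500)
    if grpc_code in FRIENDLY:
        return status, FRIENDLY[grpc_code]
    if grpc_code in PASS_THROUGH:
        return status, detail
    leaked = any(marker in detail for marker in TRACEBACK_MARKERS)
    if grpc_code not in GRPC_TO_HTTP or leaked:
        return status, GENERIC_MESSAGE
    return status, detail
```

Note that ALREADY_EXISTS and FAILED_PRECONDITION both surface as 409, so a client that needs to tell them apart has to rely on the passed-through detail.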