
API reference

All endpoints are mounted at /v1/modules/scaimind/ and authenticate with the standard ScaiGrid bearer token. Responses use ScaiGrid's standard envelope ({ "data": ... } for success). Request bodies and responses are translations of MindCoordinator's protobuf messages, so fields not listed here may appear in responses as the coordinator evolves.
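
A minimal sketch of the conventions above: build an endpoint URL under the module prefix, attach the bearer header, and unwrap the success envelope. The helper names are illustrative and not part of ScaiGrid. The error envelope is not parsed here because its shape is not documented on this page.

```python
import json

MODULE_PREFIX = "/v1/modules/scaimind"  # mount point stated on this page


def endpoint_url(host: str, path: str) -> str:
    """Join the ScaiGrid host, the module prefix, and an endpoint path."""
    return host.rstrip("/") + MODULE_PREFIX + "/" + path.lstrip("/")


def auth_headers(api_key: str) -> dict:
    """Standard bearer-token header used by every endpoint."""
    return {"Authorization": f"Bearer {api_key}"}


def unwrap(body: str) -> dict:
    """Return the payload under "data"; raise if the key is missing.

    Unknown fields inside the payload are kept untouched, since the
    coordinator may add fields over time.
    """
    envelope = json.loads(body)
    if "data" not in envelope:
        raise ValueError("no 'data' key; treat the response as an error envelope")
    return envelope["data"]
```

Keeping unknown fields rather than validating against a fixed schema means a client written this way should keep working as the coordinator adds fields.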

Jobs

GET /jobs

List jobs in the caller's tenant. Side effect: best-effort sync of returned jobs into the local cache (mod_scaimind_jobs).

```bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

Query parameters:

| Parameter | Notes |
| --- | --- |
| status | Repeatable. One of PENDING, QUEUED, SCHEDULING, PREPARING, TRAINING, CHECKPOINTING, PAUSED, EVALUATING, EXPORTING, COMPLETED, FAILED, CANCELLED, PREEMPTED. |
| page_size | 1-100, default 20. |
| page_token | Opaque, from previous response's next_page_token. |
| order_by | Coordinator-defined ordering string. |
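
A sketch of building the query string for GET /jobs. It assumes "repeatable" means the status key is repeated (status=A&status=B); that encoding and the function name are assumptions, not confirmed by this page.

```python
from urllib.parse import urlencode

JOB_STATUSES = frozenset({
    "PENDING", "QUEUED", "SCHEDULING", "PREPARING", "TRAINING",
    "CHECKPOINTING", "PAUSED", "EVALUATING", "EXPORTING",
    "COMPLETED", "FAILED", "CANCELLED", "PREEMPTED",
})


def jobs_query(statuses=(), page_size=20, page_token="", order_by=""):
    """Build the query string for GET /jobs, checking documented limits."""
    unknown = [s for s in statuses if s not in JOB_STATUSES]
    if unknown:
        raise ValueError(f"unknown status values: {unknown}")
    if not 1 <= page_size <= 100:
        raise ValueError("page_size must be between 1 and 100")
    # Assumption: one status=... pair per requested status.
    params = [("status", s) for s in statuses]
    params.append(("page_size", str(page_size)))
    if page_token:
        params.append(("page_token", page_token))
    if order_by:
        params.append(("order_by", order_by))
    return urlencode(params)
```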

Response:

```json
{
  "data": {
    "jobs": [ { "job_id": "...", "name": "...", "status": "TRAINING", "...": "..." } ],
    "next_page_token": "...",
    "total_count": 12
  }
}
```
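
Paging through the list can be sketched as below. The fetch_page callable is hypothetical: it stands in for whatever HTTP client performs GET /jobs with a given page_token and returns the unwrapped data object. The loop assumes an empty or missing next_page_token marks the last page.

```python
from typing import Callable, Iterator


def iter_jobs(fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every job across pages, following next_page_token."""
    token = ""  # first request carries no page_token
    while True:
        data = fetch_page(token)
        yield from data.get("jobs", [])
        token = data.get("next_page_token", "")
        if not token:  # assumption: empty or absent token means no more pages
            break
```

Injecting the fetch function keeps the paging logic independent of the HTTP library and easy to exercise with canned pages.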

POST /jobs

Submit a new training job. Requires scaimind:manage. Returns HTTP 201 on success.

Body fields (see Training jobs for the full schema):

| Field | Required | Notes |
| --- | --- | --- |
| name | yes | Human-readable label. |
| training_config | yes | See TrainingConfigSchema. |
| data_config | yes | See DataConfigSchema. |
| resource_config | yes | See ResourceConfigSchema. |
| output_config | no | See OutputConfigSchema. |
| scheduling_config | no | See SchedulingConfigSchema. |
| labels | no | Free-form dict[str,str]. |

Token exchange: this endpoint mints downstream ScaiDrive and ScaiAtlas tokens and forwards them as gRPC metadata so the coordinator can fetch datasets and model artefacts.

Response: SubmitJobResponse{ "job_id", "status", "message", "estimated_wait_seconds" }.
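
A sketch of assembling a POST /jobs body that enforces the required and optional fields listed above. The function name is illustrative, and the contents of each config object are left to the schemas referenced in the table.

```python
REQUIRED_FIELDS = ("name", "training_config", "data_config", "resource_config")
OPTIONAL_FIELDS = ("output_config", "scheduling_config", "labels")


def build_submit_body(**fields) -> dict:
    """Return a job submission body, rejecting missing or unknown fields."""
    missing = [k for k in REQUIRED_FIELDS if k not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    unknown = [k for k in fields if k not in REQUIRED_FIELDS + OPTIONAL_FIELDS]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    labels = fields.get("labels", {})
    # labels is documented as dict[str, str]
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in labels.items()):
        raise TypeError("labels must map strings to strings")
    return dict(fields)
```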

GET /jobs/{job_id}

Fetch one job's full state from the coordinator.

POST /jobs/{job_id}/cancel

Cancel a non-terminal job. Body (optional): { "reason": "..." }.

POST /jobs/{job_id}/pause

Pause a running job. Body (optional): { "save_checkpoint": true }.

POST /jobs/{job_id}/resume

Resume a paused job. Body (optional): { "checkpoint_id": "" }. An empty checkpoint_id resumes from the most recent checkpoint.

POST /jobs/{job_id}/retry

Retry a failed or preempted job. Body (optional):

```json
{
  "checkpoint_id": "",
  "modify_resources": false,
  "new_resource_config": null
}
```

When modify_resources is true, new_resource_config must be supplied and must follow ResourceConfigSchema. A retry creates a child job whose parent_job_id points at the original.
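
The constraint between modify_resources and new_resource_config can be checked client-side before sending. A minimal sketch, with an illustrative function name:

```python
from typing import Optional


def build_retry_body(checkpoint_id: str = "",
                     modify_resources: bool = False,
                     new_resource_config: Optional[dict] = None) -> dict:
    """Build a retry body, enforcing the documented resource-config rule."""
    if modify_resources and new_resource_config is None:
        raise ValueError("new_resource_config is required when modify_resources is true")
    return {
        "checkpoint_id": checkpoint_id,
        "modify_resources": modify_resources,
        "new_resource_config": new_resource_config,
    }
```

Called with no arguments, this produces the same body as the JSON example above.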

Monitoring

GET /jobs/{job_id}/metrics

Point-in-time read of training metrics.

Query parameters: max_points (integer, default 0; 0 defers to the coordinator's default).

GET /jobs/{job_id}/logs

Server-Sent Event stream of log entries.

Query parameters:

| Parameter | Notes |
| --- | --- |
| follow | Boolean. Default true. If false, returns the tail and closes. |
| level | DEBUG, INFO, WARNING. Default INFO. |
| tail | Initial lines to send. Default 100. |

Events:

| Event | Payload |
| --- | --- |
| log | { "level": "...", "message": "...", "timestamp": "...", ... } (coordinator-defined). |
| error | { "error": "stream ended" } if the upstream stream terminates with an error. |
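
A sketch of consuming the stream on the client side. It assumes the standard Server-Sent Events text framing (event: and data: fields, with a blank line ending each event) and that each data payload is a JSON object; the parser name is illustrative. It takes any iterable of text lines, such as the decoded lines of a streaming HTTP response.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) pairs from SSE-formatted lines."""
    event, data = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, a single leading space is dropped.
            data.append(value[1:] if value.startswith(" ") else value)
```

A caller would typically stop reading once an error event arrives, since it signals that the upstream stream has ended.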

GET /jobs/{job_id}/metrics/stream

Server-Sent Event stream of metric snapshots.

Query parameters: interval (seconds, 1-60, default 5).

Events:

| Event | Payload |
| --- | --- |
| metrics | Proto-derived metrics snapshot for the most recent step. |
| error | { "error": "stream ended" } on failure. |

Cluster

GET /cluster

Cluster-wide status.

```json
{
  "data": {
    "total_nodes": 8,
    "online_nodes": 7,
    "draining_nodes": 1,
    "offline_nodes": 0,
    "total_gpus": 64,
    "allocated_gpus": 32,
    "available_gpus": 32,
    "active_jobs": 4,
    "queued_jobs": 2,
    "cluster_utilization": 0.5
  }
}
```
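
The sample values are consistent with cluster_utilization being allocated_gpus / total_gpus, and with the node and GPU counts summing to their totals. The sketch below checks those relationships. They are inferred from the example only, so the coordinator may define them differently (for instance, with node states not shown here).

```python
def summarise_cluster(data: dict) -> dict:
    """Derive simple consistency checks from a GET /cluster payload."""
    total_gpus = data["total_gpus"]
    return {
        # Assumption: utilisation is the allocated share of all GPUs.
        "gpu_utilization": data["allocated_gpus"] / total_gpus if total_gpus else 0.0,
        "nodes_add_up": (
            data["online_nodes"] + data["draining_nodes"] + data["offline_nodes"]
            == data["total_nodes"]
        ),
        "gpus_add_up": data["allocated_gpus"] + data["available_gpus"] == total_gpus,
    }
```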

GET /cluster/nodes

List nodes.

Query parameters: status_filter (coordinator-defined string), page_size (1-100, default 20), page_token.

```json
{
  "data": {
    "nodes": [ { "node_id": "...", "status": "ONLINE", "...": "..." } ],
    "next_page_token": "...",
    "total_count": 8
  }
}
```

GET /cluster/nodes/{node_id}

Fetch one node's full status.

POST /cluster/nodes/{node_id}/drain

Stop scheduling new jobs onto a node. Requires scaimind:cluster_admin.

Body (optional):

```json
{ "force": false, "reason": "scheduled maintenance" }
```

When force is true, in-flight jobs on that node are aborted; when false, they are allowed to finish.

POST /cluster/nodes/{node_id}/enable

Re-enable a drained or disabled node. Requires scaimind:cluster_admin.

Queue

GET /queue

Queue depth and estimated waits.

Query parameters: queue_name (string, default empty for all queues the tenant can see).

```json
{
  "data": {
    "queued_jobs": [
      {
        "job_id": "...", "tenant_id": "...", "name": "...",
        "priority": 6, "gpu_count": 4, "gpu_type": "A100-80GB",
        "queued_at": "...", "position": 2, "estimated_wait_seconds": 1200
      }
    ],
    "total_queued": 2,
    "estimated_wait_seconds": 1200
  }
}
```

Evaluations

GET /evaluations

List evaluation runs. Server-side, this calls ListJobs with label_filter={"type":"evaluation"}, so the response shape matches /jobs.

Query parameters: page_size, page_token.

POST /evaluations

Submit an evaluation. Returns HTTP 201 on success. Body:

| Field | Notes |
| --- | --- |
| job_id | The training job whose model is being evaluated. |
| model_uri | Where the model artefact lives. |
| checkpoint_id | Optional; evaluate a specific checkpoint instead of the final. |
| benchmarks | List of { "name", "dataset", "num_samples", "parameters" }. |
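
A sketch of assembling the evaluation body. The function name is illustrative. Which benchmark keys are mandatory is not stated here, so the helper only rejects keys outside the documented four.

```python
BENCHMARK_KEYS = frozenset({"name", "dataset", "num_samples", "parameters"})


def build_evaluation_body(job_id: str, model_uri: str, benchmarks: list,
                          checkpoint_id: str = "") -> dict:
    """Build a POST /evaluations body from the documented fields."""
    for bench in benchmarks:
        unknown = set(bench) - BENCHMARK_KEYS
        if unknown:
            raise ValueError(f"unknown benchmark keys: {sorted(unknown)}")
    body = {"job_id": job_id, "model_uri": model_uri, "benchmarks": list(benchmarks)}
    if checkpoint_id:
        # Omitted by default so the final model is evaluated.
        body["checkpoint_id"] = checkpoint_id
    return body
```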

GET /evaluations/{evaluation_id}

Fetch one evaluation's full record, including per-benchmark results once COMPLETED.

Data

POST /data/validate

Check that the coordinator can reach and parse one or more data sources before queueing a job. Forwards ScaiDrive and ScaiAtlas tokens.

Body:

```json
{
  "data_sources": [
    { "path": "scaidrive://.../foo.jsonl", "format": "jsonl" }
  ]
}
```

Response:

```json
{
  "data": {
    "valid": true,
    "validations": [
      {
        "path": "...", "accessible": true,
        "size_bytes": 12345, "record_count": 2000,
        "format": "jsonl", "error_message": ""
      }
    ]
  }
}
```
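
Before submitting a job, a client might gate on this response. The sketch below collects the sources that failed, assuming a source has failed when accessible is false or error_message is non-empty; that interpretation and the function name are assumptions.

```python
def failed_sources(data: dict) -> list:
    """Return (path, error_message) for each data source that did not validate."""
    failures = []
    for entry in data.get("validations", []):
        message = entry.get("error_message", "")
        # Assumption: inaccessible or carrying an error message means failed.
        if not entry.get("accessible", False) or message:
            failures.append((entry.get("path", ""), message))
    return failures
```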

GET /data/cache

Inspect the coordinator-side dataset cache. Forwards ScaiDrive and ScaiAtlas tokens.

Query parameters: node_id (optional; empty for cluster-wide).

```json
{
  "data": {
    "total_cache_size_bytes": 0,
    "used_cache_size_bytes": 0,
    "datasets": [
      {
        "path": "...", "size_bytes": 0,
        "cached_at": "...", "last_accessed": "...",
        "reference_count": 0
      }
    ]
  }
}
```

Errors

All endpoints return ScaiGrid's standard error envelope. Underlying gRPC status codes map onto HTTP as follows:

| gRPC code | HTTP | Detail handling |
| --- | --- | --- |
| NOT_FOUND | 404 | Coordinator detail passed through. |
| INVALID_ARGUMENT | 400 | Coordinator detail passed through. |
| PERMISSION_DENIED | 403 | Coordinator detail passed through. |
| UNAUTHENTICATED | 401 | Coordinator detail passed through. |
| ALREADY_EXISTS | 409 | Coordinator detail passed through. |
| FAILED_PRECONDITION | 409 | Coordinator detail passed through. |
| RESOURCE_EXHAUSTED | 429 | Coordinator detail passed through. |
| DEADLINE_EXCEEDED | 504 | Friendly message; original logged server-side. |
| UNAVAILABLE | 503 | Friendly message; original logged server-side. |
| INTERNAL | 500 | Sanitised; raw detail captured in scaimind_coordinator_error log. |
| UNKNOWN | 500 | Sanitised; raw detail captured in log. |
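
The mapping above, expressed as a lookup table. This is a client-side illustration rather than the module's actual code, and the fallback to 500 for codes not listed in the table is an assumption.

```python
GRPC_TO_HTTP = {
    "NOT_FOUND": 404,
    "INVALID_ARGUMENT": 400,
    "PERMISSION_DENIED": 403,
    "UNAUTHENTICATED": 401,
    "ALREADY_EXISTS": 409,
    "FAILED_PRECONDITION": 409,
    "RESOURCE_EXHAUSTED": 429,
    "DEADLINE_EXCEEDED": 504,
    "UNAVAILABLE": 503,
    "INTERNAL": 500,
    "UNKNOWN": 500,
}

# Codes whose coordinator detail reaches the caller unchanged.
DETAIL_PASSED_THROUGH = frozenset({
    "NOT_FOUND", "INVALID_ARGUMENT", "PERMISSION_DENIED", "UNAUTHENTICATED",
    "ALREADY_EXISTS", "FAILED_PRECONDITION", "RESOURCE_EXHAUSTED",
})


def http_status(grpc_code: str) -> int:
    """Map a gRPC status name to HTTP; unlisted codes fall back to 500 (assumption)."""
    return GRPC_TO_HTTP.get(grpc_code, 500)


def detail_is_passed_through(grpc_code: str) -> bool:
    """True when the table says the coordinator's detail is returned as-is."""
    return grpc_code in DETAIL_PASSED_THROUGH
```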

The sanitisation logic in errors.py looks for traceback markers in the coordinator's details() string and replaces the body with a generic message when it finds one, so callers never receive leaked SQL, Python stack traces, or other infrastructure noise.
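
A rough sketch of what marker-based sanitisation can look like. This is not the contents of errors.py: the marker list, the generic message, and the function name are all hypothetical.

```python
# Hypothetical markers; the real list in errors.py is not documented here.
TRACEBACK_MARKERS = ("Traceback (most recent call last)", 'File "')

# Hypothetical wording for the replacement message.
GENERIC_MESSAGE = "Internal coordinator error"


def sanitise_detail(detail: str) -> str:
    """Replace the detail with a generic message if it looks like a stack trace."""
    if any(marker in detail for marker in TRACEBACK_MARKERS):
        return GENERIC_MESSAGE
    return detail
```

Substring matching is deliberately conservative: a false positive only costs a less specific message, while a false negative could leak internals.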

Updated 2026-05-18 15:01:31, rev 12