Architecture

ScaiMind is a thin REST-to-gRPC bridge. There is no training engine inside ScaiGrid — the actual scheduling, GPU allocation, and training loop live in an external service called MindCoordinator. ScaiMind's job is to translate REST requests into gRPC calls, forward downstream credentials, and cache results locally for dashboard reads.

Components

flowchart LR
    Caller[Caller]
    subgraph SG ["ScaiGrid"]
        Bridge["/v1/modules/scaimind/... REST-to-gRPC bridge"]
        Cache["Local cache mod_scaimind_jobs"]
    end
    MC["MindCoordinator GPU cluster: Nodes, Queues, Jobs, Checkpoints, Evaluations, Data cache"]
    Caller -- "REST /v1/" --> Bridge
    Bridge -- gRPC --> MC
    Caller <-- "SSE streams" --> Bridge
    Bridge <-- streams --> MC
    Bridge --> Cache

The bridge runs in-process inside ScaiGrid's FastAPI app under the same auth, logging, and middleware as every other module. MindCoordinator is a separate process you deploy alongside the GPU cluster; ScaiMind reaches it via a single gRPC channel held open at module init.

Request flow for one submission

Caller sends POST /v1/modules/scaimind/jobs with a SubmitJobRequest body.
Auth resolves the bearer token to a CurrentUser and enforces the scaimind:manage permission.
Token exchange mints short-lived downstream credentials for ScaiDrive and ScaiAtlas via TokenExchangeDep. These are propagated to the coordinator as gRPC metadata so the coordinator can fetch datasets and model artefacts on the caller's behalf.
Schema conversion translates the Pydantic body into a protobuf SubmitJobRequest (build_submit_job_request in converters.py).
gRPC call runs through client.submit_job(...). The channel was opened at module init and is held for the process lifetime. Metadata always includes x-api-secret and x-tenant-id, and optionally x-user-id, x-scaidrive-token, and x-scaiatlas-token.
Response conversion runs the protobuf reply through proto_to_dict and returns it inside ScaiGrid's standard success() envelope.
Error mapping is handled by grpc_to_http() — see "Error translation" below.
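
The metadata step of this flow can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name build_grpc_metadata and its parameters are hypothetical, and it assumes the user id and the two downstream tokens are simply omitted when absent. Only the header names come from the description above.

```python
from typing import Optional


def build_grpc_metadata(
    api_secret: str,
    tenant_id: str,
    user_id: Optional[str] = None,
    scaidrive_token: Optional[str] = None,
    scaiatlas_token: Optional[str] = None,
) -> list[tuple[str, str]]:
    """Assemble per-call gRPC metadata as (key, value) pairs.

    Hypothetical helper: gRPC metadata keys must be lowercase, and
    Python gRPC stubs accept a sequence of 2-tuples as `metadata=`.
    """
    # Always present on every call.
    metadata = [
        ("x-api-secret", api_secret),
        ("x-tenant-id", str(tenant_id)),
    ]
    # Assumption: optional headers are left out entirely when unset.
    optional = {
        "x-user-id": user_id,
        "x-scaidrive-token": scaidrive_token,
        "x-scaiatlas-token": scaiatlas_token,
    }
    metadata.extend((key, value) for key, value in optional.items() if value)
    return metadata
```

A read-side call would pass only the secret and tenant id, while a submission would also pass the freshly minted downstream tokens.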

Read-side flows (GET /jobs, GET /cluster, etc.) are the same minus the token exchange. List endpoints additionally call JobService.upsert to refresh the local cache.

Streaming endpoints

GET /jobs/{job_id}/logs and GET /jobs/{job_id}/metrics/stream open server-streaming gRPC calls (StreamJobLogs, StreamJobMetrics) and re-emit each protobuf message as a Server-Sent Event:

event: log
data: {"level": "INFO", "message": "...", "timestamp": "..."}

event: metrics
data: {"step": 200, "loss": 0.34, ...}

Stream calls use their own stream_timeout, separate from the unary timeout. Both are set on the client at init.
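
The re-emission step can be sketched like this, assuming each protobuf message has already been converted to a plain dict (for example by proto_to_dict). The names format_sse and relay are hypothetical and only illustrate the Server-Sent Events framing.

```python
import json
from typing import Any, Iterable, Iterator


def format_sse(event: str, payload: dict[str, Any]) -> str:
    """Frame one message as a Server-Sent Event.

    json.dumps escapes newlines inside strings, so the payload always
    fits on a single `data:` line. The trailing blank line ends the event.
    """
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def relay(event: str, messages: Iterable[dict[str, Any]]) -> Iterator[str]:
    """Lazily turn a stream of decoded messages into SSE frames."""
    for message in messages:
        yield format_sse(event, message)
```

Because relay is a generator, frames are produced one at a time as upstream messages arrive, which is the shape a streaming HTTP response expects.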

State

Authoritative state lives in MindCoordinator: job lifecycle, queue, node inventory, evaluations, checkpoints.
Cache state lives in the MariaDB table mod_scaimind_jobs: job id, tenant id, status, priority, the five config blobs as JSON, error info, timestamps. It is refreshed best-effort on every GET /jobs list call and is never authoritative; dashboards always reconcile against a fresh coordinator read when the user opens a job.
No client-side state matters — the caller holds a job_id and an optional page_token. Losing them just means re-listing.
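
"Best-effort" here means a cache failure should never fail the list request itself. A minimal sketch of that behaviour, assuming an upsert callable that stands in for JobService.upsert (its real signature is not documented here) and jobs represented as plain dicts with a hypothetical job_id key:

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("scaimind.cache")


def refresh_cache_best_effort(
    jobs: list[dict[str, Any]],
    upsert: Callable[[dict[str, Any]], None],
) -> int:
    """Upsert each job into the local cache, swallowing failures.

    Returns how many jobs were cached. A failed upsert is logged and
    skipped so the caller still gets the coordinator's fresh listing.
    """
    cached = 0
    for job in jobs:
        try:
            upsert(job)
            cached += 1
        except Exception:
            # The cache is never authoritative, so log and move on.
            logger.warning(
                "cache refresh failed for job %s", job.get("job_id"), exc_info=True
            )
    return cached
```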

Error translation

modules/scaimind/errors.py maps gRPC status codes to HTTP responses:

gRPC code	HTTP	Notes
`OK`	200
`NOT_FOUND`	404	Detail passed through.
`INVALID_ARGUMENT`	400	Detail passed through.
`PERMISSION_DENIED`	403	Detail passed through.
`UNAUTHENTICATED`	401	Detail passed through.
`ALREADY_EXISTS`	409	Detail passed through.
`FAILED_PRECONDITION`	409	Detail passed through.
`RESOURCE_EXHAUSTED`	429	Detail passed through.
`DEADLINE_EXCEEDED`	504	Friendly message; original logged.
`UNAVAILABLE`	503	Friendly message; original logged.
`INTERNAL`	500	Always sanitised; raw detail logged server-side.
`UNKNOWN`	500	Always sanitised; raw detail logged server-side.

INTERNAL and UNKNOWN details are scrubbed before reaching the caller because they often contain coordinator-side tracebacks, SQL errors, or other infrastructure leakage. The original is captured in structured logs under scaimind_coordinator_error.
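
The table can be read as a lookup plus a pass-through flag. The sketch below is not the contents of errors.py: it keys on status names as strings to stay free of the grpc dependency, the friendly messages are placeholders, and treating unlisted codes as a sanitised 500 is an assumption.

```python
import logging

logger = logging.getLogger("scaimind")

# gRPC status name -> (HTTP status, whether the detail is passed through).
_STATUS_MAP = {
    "OK": (200, True),
    "NOT_FOUND": (404, True),
    "INVALID_ARGUMENT": (400, True),
    "PERMISSION_DENIED": (403, True),
    "UNAUTHENTICATED": (401, True),
    "ALREADY_EXISTS": (409, True),
    "FAILED_PRECONDITION": (409, True),
    "RESOURCE_EXHAUSTED": (429, True),
    "DEADLINE_EXCEEDED": (504, False),
    "UNAVAILABLE": (503, False),
    "INTERNAL": (500, False),
    "UNKNOWN": (500, False),
}

# Placeholder wording; the real friendly messages are not documented here.
_FRIENDLY = {
    "DEADLINE_EXCEEDED": "The training coordinator took too long to respond.",
    "UNAVAILABLE": "The training coordinator is currently unavailable.",
}


def grpc_to_http_sketch(code: str, detail: str) -> tuple[int, str]:
    """Map a gRPC status name and detail to (HTTP status, caller-facing message)."""
    # Assumption: codes missing from the table are handled like INTERNAL.
    status, passthrough = _STATUS_MAP.get(code, (500, False))
    if passthrough:
        return status, detail
    # Keep the raw detail in server-side logs only.
    logger.error("scaimind_coordinator_error code=%s detail=%s", code, detail)
    return status, _FRIENDLY.get(code, "Internal error.")
```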

Trust boundary

The REST surface is inside ScaiGrid's normal trust perimeter — bearer token, tenant scoping, RBAC. Beyond that, three trust contracts matter:

ScaiMind → coordinator authenticates with a static x-api-secret set in ScaiGrid config (scaimind_api_secret). The coordinator MUST validate this on every call; do not deploy a coordinator that ignores it.
Coordinator → ScaiDrive / ScaiAtlas authenticates with the per-request tokens forwarded as x-scaidrive-token / x-scaiatlas-token. These are minted fresh per call by TokenExchangeDep and scoped to the caller's identity.
Tenant isolation is enforced by the coordinator using the x-tenant-id header. ScaiMind always sends str(user.tenant_id); the coordinator must reject any request that tries to read or write a job in a different tenant.
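
On the coordinator side, the first and third contracts amount to two checks per call. The sketch below is a hypothetical illustration of what a coordinator would need to do, not MindCoordinator's actual code: the Rejected exception, the check_call name, and the choice of UNAUTHENTICATED and PERMISSION_DENIED as the returned status names are all assumptions.

```python
import hmac


class Rejected(Exception):
    """Hypothetical rejection carrying a gRPC status name and a detail."""

    def __init__(self, code: str, detail: str) -> None:
        super().__init__(detail)
        self.code = code
        self.detail = detail


def check_call(metadata: dict[str, str], expected_secret: str, job_tenant_id: str) -> None:
    """Validate the API secret, then the tenant, for one incoming call."""
    presented = metadata.get("x-api-secret", "")
    # Constant-time comparison avoids leaking the secret through timing.
    if not hmac.compare_digest(presented.encode(), expected_secret.encode()):
        raise Rejected("UNAUTHENTICATED", "invalid API secret")
    # The job's tenant must match the tenant the bridge asserted.
    if metadata.get("x-tenant-id") != job_tenant_id:
        raise Rejected("PERMISSION_DENIED", "job belongs to another tenant")
```

Checking the secret before the tenant means an unauthenticated caller learns nothing about which tenants or jobs exist.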

If scaimind_tls_enabled is set, the gRPC channel uses mTLS with the configured cert, key, and CA. Otherwise it is plaintext — fine for in-cluster deployments behind a service mesh, not fine for anything traversing untrusted networks.

How it differs from calling the coordinator directly

A coordinator with an exposed gRPC port is callable directly by anyone who has the API secret. ScaiMind adds:

Concern	Direct gRPC	ScaiMind
Auth	API secret only	ScaiGrid bearer token + tenant scoping + RBAC
Tenant isolation	You enforce it	Enforced by middleware on every call
Downstream tokens	You manage them	Minted per request via `TokenExchangeDep`
Error sanitisation	Tracebacks pass through	`INTERNAL`/`UNKNOWN` scrubbed; raw detail logged
Listings	gRPC only	Synced to local cache for fast dashboard reads
Stream framing	gRPC over HTTP/2	Server-Sent Events over HTTP/1.1

For programmatic access from inside the trust perimeter, talking to the coordinator directly is faster. For multi-tenant, externally-exposed access, the REST surface is the only sane option — the same module-permission keys gate everything.