Architecture
ScaiMind is a thin REST-to-gRPC bridge. There is no training engine inside ScaiGrid — the actual scheduling, GPU allocation, and training loop live in an external service called MindCoordinator. ScaiMind's job is to translate REST requests into gRPC calls, forward downstream credentials, and cache results locally for dashboard reads.
Components
The bridge runs in-process inside ScaiGrid's FastAPI app under the same auth, logging, and middleware as every other module. MindCoordinator is a separate process you deploy alongside the GPU cluster; ScaiMind reaches it via a single gRPC channel held open at module init.
Request flow for one submission
- Caller sends `POST /v1/modules/scaimind/jobs` with a `SubmitJobRequest` body.
- Auth resolves the bearer token to a `CurrentUser` and enforces the `scaimind:manage` permission.
- Token exchange mints short-lived downstream credentials for ScaiDrive and ScaiAtlas via `TokenExchangeDep`. These are propagated to the coordinator as gRPC metadata so the coordinator can fetch datasets and model artefacts on the caller's behalf.
- Schema conversion translates the Pydantic body into a protobuf `SubmitJobRequest` (`build_submit_job_request` in `converters.py`).
- gRPC call runs through `client.submit_job(...)`. The channel was opened at module init and is held for the process lifetime; metadata always includes `x-api-secret`, `x-tenant-id`, optionally `x-user-id`, `x-scaidrive-token`, `x-scaiatlas-token`.
- Response conversion runs the protobuf reply through `proto_to_dict` and returns it inside ScaiGrid's standard `success()` envelope.
- Error mapping is handled by `grpc_to_http()`; see "Error translation" below.
Read-side flows (GET /jobs, GET /cluster, etc.) are the same minus the token exchange. List endpoints additionally call JobService.upsert to refresh the local cache.
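As a rough illustration of how the call metadata described above might be assembled, here is a minimal sketch. The header names come from the text; the helper name `build_metadata`, its signature, and the sample values are assumptions made for this example and are not taken from the ScaiMind code.

```python
from typing import Optional


def build_metadata(
    api_secret: str,
    tenant_id: object,
    user_id: Optional[str] = None,
    scaidrive_token: Optional[str] = None,
    scaiatlas_token: Optional[str] = None,
) -> list[tuple[str, str]]:
    """Assemble gRPC call metadata as (key, value) pairs.

    Hypothetical helper: only the header names are from the docs.
    """
    # These two headers are sent on every call.
    metadata = [
        ("x-api-secret", api_secret),
        ("x-tenant-id", str(tenant_id)),
    ]
    # Optional headers are added only when a value is present.
    optional = [
        ("x-user-id", user_id),
        ("x-scaidrive-token", scaidrive_token),
        ("x-scaiatlas-token", scaiatlas_token),
    ]
    metadata.extend((key, value) for key, value in optional if value)
    return metadata


# Write path: downstream tokens from the token exchange are forwarded.
write_md = build_metadata("s3cret", 42, "u-1", "drive-tok", "atlas-tok")
# Read path: the same call, minus the exchanged tokens.
read_md = build_metadata("s3cret", 42, "u-1")
```

A list of string pairs is the shape gRPC client libraries generally accept for call metadata, so a helper like this could feed both the write and the read paths.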
Streaming endpoints
`GET /jobs/{job_id}/logs` and `GET /jobs/{job_id}/metrics/stream` open server-streaming gRPC calls (`StreamJobLogs`, `StreamJobMetrics`) and re-emit each protobuf message as a Server-Sent Event.
Stream calls use a stream_timeout that is separate from the unary timeout; both are set on the client at init.
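To make the re-emission step concrete, the sketch below frames already-converted messages as Server-Sent Events. It assumes each protobuf message has been turned into a plain dict first (for example by something like `proto_to_dict`); the function name `sse_frames`, the `log` event name, and the payload fields are hypothetical, and the exact frames ScaiMind emits may differ.

```python
import json
from typing import Iterable, Iterator


def sse_frames(messages: Iterable[dict], event: str = "log") -> Iterator[str]:
    """Yield one Server-Sent Event frame per message."""
    for message in messages:
        payload = json.dumps(message, separators=(",", ":"))
        # An SSE frame is a set of "field: value" lines ended by a blank line.
        yield f"event: {event}\ndata: {payload}\n\n"


# Hypothetical log messages, standing in for converted protobuf replies.
frames = list(sse_frames([{"line": "epoch 1 done"}, {"line": "epoch 2 done"}]))
```

Because the function is a generator, frames can be written to the HTTP response as each gRPC message arrives rather than after the stream ends.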
State
- Authoritative state lives in MindCoordinator: job lifecycle, queue, node inventory, evaluations, checkpoints.
- Cache state lives in MariaDB table `mod_scaimind_jobs`: job id, tenant id, status, priority, the five config blobs as JSON, error info, timestamps. Refreshed best-effort on every `GET /jobs` list call. Never authoritative; dashboards always reconcile against a fresh coordinator read when the user opens a job.
- No client-side state matters: the caller holds a `job_id` and an optional `page_token`. Losing them just means re-listing.
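One way to read "refreshed best-effort" is that a failed cache write is logged and skipped rather than failing the list request. The sketch below illustrates that idea only; `refresh_cache`, the in-memory `table` standing in for `mod_scaimind_jobs`, and the status values are all assumptions for this example, not the real `JobService` API.

```python
import logging
from typing import Callable

logger = logging.getLogger("scaimind.cache")


def refresh_cache(jobs: list[dict], upsert: Callable[[dict], None]) -> int:
    """Upsert each listed job and return how many rows were refreshed.

    Failures are logged and skipped so that a cache problem never
    fails the list request itself.
    """
    refreshed = 0
    for job in jobs:
        try:
            upsert(job)
            refreshed += 1
        except Exception:  # best-effort: the coordinator stays authoritative
            logger.warning("cache upsert failed for job %s", job.get("job_id"))
    return refreshed


# In-memory stand-in for the cache table, keyed by job id.
table: dict[str, dict] = {}


def upsert(job: dict) -> None:
    table[job["job_id"]] = job  # insert or overwrite


listed = [
    {"job_id": "j-1", "status": "RUNNING"},
    {"status": "QUEUED"},  # malformed row: no job_id, so upsert raises KeyError
    {"job_id": "j-1", "status": "SUCCEEDED"},
]
count = refresh_cache(listed, upsert)
```

The malformed row is skipped and the later entry for `j-1` overwrites the earlier one, which is the behaviour you would want from a cache that is never treated as authoritative.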
Error translation
modules/scaimind/errors.py maps gRPC status codes to HTTP responses:
| gRPC code | HTTP | Notes |
|---|---|---|
| `OK` | 200 | |
| `NOT_FOUND` | 404 | Detail passed through. |
| `INVALID_ARGUMENT` | 400 | Detail passed through. |
| `PERMISSION_DENIED` | 403 | Detail passed through. |
| `UNAUTHENTICATED` | 401 | Detail passed through. |
| `ALREADY_EXISTS` | 409 | Detail passed through. |
| `FAILED_PRECONDITION` | 409 | Detail passed through. |
| `RESOURCE_EXHAUSTED` | 429 | Detail passed through. |
| `DEADLINE_EXCEEDED` | 504 | Friendly message; original logged. |
| `UNAVAILABLE` | 503 | Friendly message; original logged. |
| `INTERNAL` | 500 | Always sanitised; raw detail logged server-side. |
| `UNKNOWN` | 500 | Always sanitised; raw detail logged server-side. |
INTERNAL and UNKNOWN details are scrubbed before reaching the caller because they often contain coordinator-side tracebacks, SQL errors, or other infrastructure leakage. The original is captured in structured logs under scaimind_coordinator_error.
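The mapping can be pictured as a lookup plus a scrubbing step. The following is a minimal sketch under stated assumptions, not the contents of `errors.py`: it takes the gRPC status as a plain string to stay dependency-free, the replacement messages are invented, and treating unlisted codes as a sanitised 500 is a guess.

```python
import logging

logger = logging.getLogger("scaimind")

# gRPC status name -> HTTP status, as listed in the table above.
HTTP_STATUS = {
    "OK": 200, "NOT_FOUND": 404, "INVALID_ARGUMENT": 400,
    "PERMISSION_DENIED": 403, "UNAUTHENTICATED": 401,
    "ALREADY_EXISTS": 409, "FAILED_PRECONDITION": 409,
    "RESOURCE_EXHAUSTED": 429, "DEADLINE_EXCEEDED": 504,
    "UNAVAILABLE": 503, "INTERNAL": 500, "UNKNOWN": 500,
}

# Codes whose detail is replaced before it reaches the caller.
# The replacement wording is invented for this sketch.
REPLACED_DETAIL = {
    "DEADLINE_EXCEEDED": "The coordinator did not respond in time.",
    "UNAVAILABLE": "The coordinator is currently unavailable.",
    "INTERNAL": "Internal coordinator error.",
    "UNKNOWN": "Internal coordinator error.",
}


def grpc_to_http(code: str, detail: str) -> tuple[int, str]:
    """Return (http_status, caller_visible_detail) for a gRPC status name."""
    status = HTTP_STATUS.get(code, 500)  # assumption: unlisted codes become 500
    if code in REPLACED_DETAIL or code not in HTTP_STATUS:
        # Keep the raw detail in the logs only.
        logger.error("scaimind_coordinator_error code=%s detail=%s", code, detail)
        return status, REPLACED_DETAIL.get(code, "Internal coordinator error.")
    return status, detail


passthrough = grpc_to_http("NOT_FOUND", "job j-1 not found")
scrubbed = grpc_to_http("INTERNAL", "Traceback: SELECT * FROM jobs ...")
```

Keeping the table as data rather than a chain of conditionals makes it easy to audit against the documented mapping.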
Trust boundary
The REST surface is inside ScaiGrid's normal trust perimeter — bearer token, tenant scoping, RBAC. Beyond that, three trust contracts matter:
- ScaiMind → coordinator authenticates with a static `x-api-secret` set in ScaiGrid config (`scaimind_api_secret`). The coordinator MUST validate this on every call; do not deploy a coordinator that ignores it.
- Coordinator → ScaiDrive / ScaiAtlas authenticates with the per-request tokens forwarded as `x-scaidrive-token` / `x-scaiatlas-token`. These are minted fresh per call by `TokenExchangeDep` and scoped to the caller's identity.
- Tenant isolation is enforced by the coordinator using the `x-tenant-id` header. ScaiMind always sends `str(user.tenant_id)`; the coordinator must reject any request that tries to read or write a job in a different tenant.
If scaimind_tls_enabled is set, the gRPC channel uses mTLS with the configured cert, key, and CA. Otherwise it is plaintext — fine for in-cluster deployments behind a service mesh, not fine for anything traversing untrusted networks.
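The first and third contracts place obligations on the coordinator. As a hedged sketch of what honouring them could look like on that side, the example below checks the secret in constant time and then compares tenants. Everything except the two header names is hypothetical: `check_call`, the `Rejected` exception, and the choice of `UNAUTHENTICATED` versus `PERMISSION_DENIED` are assumptions, and a real coordinator would do this inside its gRPC handlers.

```python
import hmac


class Rejected(Exception):
    """Raised when a call fails one of the trust checks."""


def check_call(metadata: dict[str, str], expected_secret: str, job_tenant_id: str) -> None:
    """Validate the API secret, then the tenant, for one incoming call."""
    presented = metadata.get("x-api-secret", "")
    # Constant-time comparison avoids leaking the secret through timing.
    if not hmac.compare_digest(presented.encode(), expected_secret.encode()):
        raise Rejected("UNAUTHENTICATED")
    # The job being read or written must belong to the calling tenant.
    if metadata.get("x-tenant-id") != job_tenant_id:
        raise Rejected("PERMISSION_DENIED")


def outcome(metadata: dict[str, str]) -> str:
    """Run the checks against sample values and report the result as text."""
    try:
        check_call(metadata, expected_secret="s3cret", job_tenant_id="42")
        return "OK"
    except Rejected as exc:
        return str(exc)


results = [
    outcome({"x-api-secret": "s3cret", "x-tenant-id": "42"}),
    outcome({"x-api-secret": "wrong", "x-tenant-id": "42"}),
    outcome({"x-api-secret": "s3cret", "x-tenant-id": "7"}),
]
```

Checking the secret before the tenant means an unauthenticated caller learns nothing about which tenants or jobs exist.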
How it differs from calling the coordinator directly
A coordinator with an exposed gRPC port is callable directly by anyone who has the API secret. ScaiMind adds:
| Concern | Direct gRPC | ScaiMind |
|---|---|---|
| Auth | API secret only | ScaiGrid bearer token + tenant scoping + RBAC |
| Tenant isolation | You enforce it | Enforced by middleware on every call |
| Downstream tokens | You manage them | Minted per request via TokenExchangeDep |
| Error sanitisation | Tracebacks pass through | INTERNAL/UNKNOWN scrubbed; raw detail logged |
| Listings | gRPC only | Synced to local cache for fast dashboard reads |
| Stream framing | gRPC over HTTP/2 | Server-Sent Events over HTTP/1.1 |
For programmatic access from inside the trust perimeter, talking to the coordinator directly is faster. For multi-tenant, externally-exposed access, the REST surface is the only sane option — the same module-permission keys gate everything.