
Architecture

ScaiMind is a thin REST-to-gRPC bridge. There is no training engine inside ScaiGrid — the actual scheduling, GPU allocation, and training loop live in an external service called MindCoordinator. ScaiMind's job is to translate REST requests into gRPC calls, forward downstream credentials, and cache results locally for dashboard reads.

Components

flowchart LR
    Caller[Caller]
    subgraph SG ["ScaiGrid"]
        Bridge["/v1/modules/scaimind/...<br/>REST-to-gRPC bridge"]
        Cache["Local cache<br/>mod_scaimind_jobs"]
    end
    MC["MindCoordinator<br/>GPU cluster:<br/>Nodes, Queues, Jobs,<br/>Checkpoints, Evaluations,<br/>Data cache"]
    Caller -- "REST /v1/" --> Bridge
    Bridge -- gRPC --> MC
    Caller <-- "SSE streams" --> Bridge
    Bridge <-- streams --> MC
    Bridge --> Cache

The bridge runs in-process inside ScaiGrid's FastAPI app under the same auth, logging, and middleware as every other module. MindCoordinator is a separate process you deploy alongside the GPU cluster; ScaiMind reaches it via a single gRPC channel held open at module init.

Request flow for one submission

  1. Caller sends POST /v1/modules/scaimind/jobs with a SubmitJobRequest body.
  2. Auth resolves the bearer token to a CurrentUser and enforces the scaimind:manage permission.
  3. Token exchange mints short-lived downstream credentials for ScaiDrive and ScaiAtlas via TokenExchangeDep. These are propagated to the coordinator as gRPC metadata so the coordinator can fetch datasets and model artefacts on the caller's behalf.
  4. Schema conversion translates the Pydantic body into a protobuf SubmitJobRequest (build_submit_job_request in converters.py).
  5. gRPC call runs through client.submit_job(...). The channel was opened at module init and is held for the process lifetime. Metadata always includes x-api-secret and x-tenant-id, and optionally x-user-id, x-scaidrive-token, and x-scaiatlas-token.
  6. Response conversion runs the protobuf reply through proto_to_dict and returns it inside ScaiGrid's standard success() envelope.
  7. Error mapping is handled by grpc_to_http() — see "Error translation" below.
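
The metadata described in step 5 can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name build_grpc_metadata and its parameters are hypothetical, and the only facts taken from the flow above are the header names and which of them are always present. gRPC Python accepts call metadata as a sequence of (key, value) pairs, which is the shape returned here.

```python
from typing import List, Optional, Tuple


def build_grpc_metadata(
    api_secret: str,
    tenant_id: str,
    user_id: Optional[str] = None,
    scaidrive_token: Optional[str] = None,
    scaiatlas_token: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Assemble per-call gRPC metadata as (key, value) pairs (hypothetical helper)."""
    # These two headers are sent on every call.
    metadata = [("x-api-secret", api_secret), ("x-tenant-id", str(tenant_id))]
    # Optional headers are attached only when a value is available,
    # e.g. the downstream tokens exist only after a token exchange.
    optional = {
        "x-user-id": user_id,
        "x-scaidrive-token": scaidrive_token,
        "x-scaiatlas-token": scaiatlas_token,
    }
    metadata.extend((key, value) for key, value in optional.items() if value)
    return metadata
```

A read-side call would pass only the secret and tenant id, while a submission would also pass the freshly minted downstream tokens.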

Read-side flows (GET /jobs, GET /cluster, etc.) are the same minus the token exchange. List endpoints additionally call JobService.upsert to refresh the local cache.

Streaming endpoints

GET /jobs/{job_id}/logs and GET /jobs/{job_id}/metrics/stream open server-streaming gRPC calls (StreamJobLogs, StreamJobMetrics) and re-emit each protobuf message as a Server-Sent Event:

event: log
data: {"level": "INFO", "message": "...", "timestamp": "..."}

event: metrics
data: {"step": 200, "loss": 0.34, ...}

Stream calls use a separate stream_timeout from the unary timeout, both set on the client at init.
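
The re-emission step can be illustrated with a small framing helper. This is a sketch under assumptions: to_sse and sse_stream are hypothetical names, and the messages are taken to be plain dicts (for example, the output of a protobuf-to-dict conversion). Only the wire format, an event line, a data line, and a blank line, follows the example above.

```python
import json
from typing import Dict, Iterable, Iterator


def to_sse(event: str, payload: Dict) -> str:
    """Frame one message as a Server-Sent Event (hypothetical helper)."""
    # json.dumps without indent emits a single line, so one data: field suffices.
    data = json.dumps(payload)
    # A blank line terminates the event.
    return f"event: {event}\ndata: {data}\n\n"


def sse_stream(event: str, messages: Iterable[Dict]) -> Iterator[str]:
    """Lazily re-emit each upstream message as one SSE frame."""
    for message in messages:
        yield to_sse(event, message)
```

Because sse_stream is a generator, frames are produced only as the upstream iterator yields, which matches the pass-through behaviour of a streaming bridge.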

State

  • Authoritative state lives in MindCoordinator: job lifecycle, queue, node inventory, evaluations, checkpoints.
  • Cache state lives in MariaDB table mod_scaimind_jobs: job id, tenant id, status, priority, the five config blobs as JSON, error info, timestamps. Refreshed best-effort on every GET /jobs list call. Never authoritative; dashboards always reconcile against a fresh coordinator read when the user opens a job.
  • No client-side state matters — the caller holds a job_id and an optional page_token. Losing them just means re-listing.
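
A best-effort cache refresh of the kind described above might look like the sketch below. It is an assumption-heavy stand-in: it uses an in-memory SQLite table instead of MariaDB, a reduced column set, and hypothetical names (open_cache, upsert_job, jobs_cache). It does not reproduce JobService.upsert; it only shows the idea that a failed cache write is reported rather than raised, because the cache is never authoritative.

```python
import json
import sqlite3
from typing import Dict


def open_cache() -> sqlite3.Connection:
    """Create an in-memory stand-in for the cache table (not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs_cache ("
        " job_id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL,"
        " status TEXT NOT NULL, config_json TEXT NOT NULL)"
    )
    return conn


def upsert_job(conn: sqlite3.Connection, job: Dict) -> bool:
    """Insert or refresh one cached job; return False instead of raising."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO jobs_cache (job_id, tenant_id, status, config_json)"
                " VALUES (?, ?, ?, ?)"
                " ON CONFLICT(job_id) DO UPDATE SET"
                " status = excluded.status, config_json = excluded.config_json",
                (
                    job["job_id"],
                    job["tenant_id"],
                    job["status"],
                    json.dumps(job.get("config", {})),
                ),
            )
        return True
    except (sqlite3.Error, KeyError):
        # Best-effort: a cache failure must not fail the list request.
        return False
```

On MariaDB the equivalent statement would typically use INSERT ... ON DUPLICATE KEY UPDATE rather than SQLite's ON CONFLICT clause.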

Error translation

modules/scaimind/errors.py maps gRPC status codes to HTTP responses:

| gRPC code | HTTP | Notes |
|---|---|---|
| OK | 200 | |
| NOT_FOUND | 404 | Detail passed through. |
| INVALID_ARGUMENT | 400 | Detail passed through. |
| PERMISSION_DENIED | 403 | Detail passed through. |
| UNAUTHENTICATED | 401 | Detail passed through. |
| ALREADY_EXISTS | 409 | Detail passed through. |
| FAILED_PRECONDITION | 409 | Detail passed through. |
| RESOURCE_EXHAUSTED | 429 | Detail passed through. |
| DEADLINE_EXCEEDED | 504 | Friendly message; original logged. |
| UNAVAILABLE | 503 | Friendly message; original logged. |
| INTERNAL | 500 | Always sanitised; raw detail logged server-side. |
| UNKNOWN | 500 | Always sanitised; raw detail logged server-side. |

INTERNAL and UNKNOWN details are scrubbed before reaching the caller because they often contain coordinator-side tracebacks, SQL errors, or other infrastructure leakage. The original is captured in structured logs under scaimind_coordinator_error.
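
The mapping can be sketched as a lookup table plus two special cases. This is not the contents of errors.py: map_grpc_error is a hypothetical name, status codes are represented by their string names (in grpcio, grpc.StatusCode members expose these via .name), the friendly and sanitised message texts are invented for illustration, and treating unlisted codes as a sanitised 500 is an assumption.

```python
import logging
from typing import Tuple

logger = logging.getLogger("scaimind")

# gRPC status name -> HTTP status, as listed in the table above.
STATUS_MAP = {
    "OK": 200, "NOT_FOUND": 404, "INVALID_ARGUMENT": 400,
    "PERMISSION_DENIED": 403, "UNAUTHENTICATED": 401,
    "ALREADY_EXISTS": 409, "FAILED_PRECONDITION": 409,
    "RESOURCE_EXHAUSTED": 429, "DEADLINE_EXCEEDED": 504,
    "UNAVAILABLE": 503, "INTERNAL": 500, "UNKNOWN": 500,
}
# Message texts below are illustrative placeholders, not the real ones.
FRIENDLY = {
    "DEADLINE_EXCEEDED": "The coordinator took too long to respond.",
    "UNAVAILABLE": "The coordinator is currently unavailable.",
}
SANITISED = {"INTERNAL", "UNKNOWN"}


def map_grpc_error(code_name: str, detail: str) -> Tuple[int, str]:
    """Return (http_status, caller_visible_message) for a gRPC failure."""
    if code_name in SANITISED or code_name not in STATUS_MAP:
        # Assumption: codes missing from the table are treated like INTERNAL.
        logger.error("scaimind_coordinator_error code=%s detail=%s", code_name, detail)
        return 500, "Internal error"
    if code_name in FRIENDLY:
        # The raw detail is logged, never returned.
        logger.warning("scaimind_coordinator_error code=%s detail=%s", code_name, detail)
        return STATUS_MAP[code_name], FRIENDLY[code_name]
    # Remaining codes pass the coordinator's detail through unchanged.
    return STATUS_MAP[code_name], detail
```

Keeping the raw detail in the log call and out of the return value is the property that matters here: the caller-visible string for a sanitised code never depends on the coordinator's message.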

Trust boundary

The REST surface is inside ScaiGrid's normal trust perimeter — bearer token, tenant scoping, RBAC. Beyond that, three trust contracts matter:

  • ScaiMind → coordinator authenticates with a static x-api-secret set in ScaiGrid config (scaimind_api_secret). The coordinator MUST validate this on every call; do not deploy a coordinator that ignores it.
  • Coordinator → ScaiDrive / ScaiAtlas authenticates with the per-request tokens forwarded as x-scaidrive-token / x-scaiatlas-token. These are minted fresh per call by TokenExchangeDep and scoped to the caller's identity.
  • Tenant isolation is enforced by the coordinator using the x-tenant-id header. ScaiMind always sends str(user.tenant_id); the coordinator must reject any request that tries to read or write a job in a different tenant.
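
The coordinator-side obligations in the first and third contracts can be sketched as a single check run before any job is read or written. Everything here is hypothetical: the coordinator's real implementation is not shown in this document, and the names Rejected and authorize_call, the dict-shaped metadata, and the choice of status codes are assumptions made for illustration.

```python
import hmac
from typing import Dict, Optional


class Rejected(Exception):
    """Raised when a call must be refused; carries a gRPC status name."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code


def authorize_call(
    metadata: Dict[str, str],
    expected_secret: str,
    job_tenant_id: Optional[str] = None,
) -> str:
    """Validate the API secret and tenant scope; return the caller's tenant id."""
    presented = metadata.get("x-api-secret", "")
    # Constant-time comparison avoids leaking the secret through timing.
    if not hmac.compare_digest(presented.encode(), expected_secret.encode()):
        raise Rejected("UNAUTHENTICATED", "invalid API secret")
    tenant_id = metadata.get("x-tenant-id")
    if not tenant_id:
        raise Rejected("INVALID_ARGUMENT", "missing x-tenant-id")
    # When the call targets an existing job, it must belong to the same tenant.
    if job_tenant_id is not None and job_tenant_id != tenant_id:
        # NOT_FOUND (rather than PERMISSION_DENIED) hides the job's existence.
        raise Rejected("NOT_FOUND", "job not found")
    return tenant_id
```

Returning NOT_FOUND for a cross-tenant job is one possible design choice; a coordinator could equally return PERMISSION_DENIED, at the cost of confirming that the job id exists.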

If scaimind_tls_enabled is set, the gRPC channel uses mTLS with the configured cert, key, and CA. Otherwise it is plaintext — fine for in-cluster deployments behind a service mesh, not fine for anything traversing untrusted networks.

How it differs from calling the coordinator directly

A coordinator with an exposed gRPC port is callable directly by anyone who has the API secret. ScaiMind adds:

| Concern | Direct gRPC | ScaiMind |
|---|---|---|
| Auth | API secret only | ScaiGrid bearer token + tenant scoping + RBAC |
| Tenant isolation | You enforce it | Enforced by middleware on every call |
| Downstream tokens | You manage them | Minted per request via TokenExchangeDep |
| Error sanitisation | Tracebacks pass through | INTERNAL/UNKNOWN scrubbed; raw detail logged |
| Listings | gRPC only | Synced to local cache for fast dashboard reads |
| Stream framing | gRPC over HTTP/2 | Server-Sent Events over HTTP/1.1 |

For programmatic access from inside the trust perimeter, talking to the coordinator directly is faster. For multi-tenant, externally-exposed access, the REST surface is the only sane option — the same module-permission keys gate everything.

Updated 2026-05-18 15:01:31, rev 12