---
summary: How the REST endpoint, the gRPC bridge, the local cache, and the external
  MindCoordinator fit together.
title: Architecture
path: concepts/architecture
status: published
---

ScaiMind is a thin REST-to-gRPC bridge. There is no training engine inside ScaiGrid — the actual scheduling, GPU allocation, and training loop live in an external service called MindCoordinator. ScaiMind's job is to translate REST requests into gRPC calls, forward downstream credentials, and cache results locally for dashboard reads.

## Components

```mermaid
flowchart LR
    Caller[Caller]
    subgraph SG ["ScaiGrid"]
        Bridge["/v1/modules/scaimind/...<br/>REST-to-gRPC bridge"]
        Cache["Local cache<br/>mod_scaimind_jobs"]
    end
    MC["MindCoordinator<br/>GPU cluster:<br/>Nodes, Queues, Jobs,<br/>Checkpoints, Evaluations,<br/>Data cache"]
    Caller -- "REST /v1/" --> Bridge
    Bridge -- gRPC --> MC
    Caller <-- "SSE streams" --> Bridge
    Bridge <-- streams --> MC
    Bridge --> Cache
```

The bridge runs in-process inside ScaiGrid's FastAPI app under the same auth, logging, and middleware as every other module. MindCoordinator is a separate process you deploy alongside the GPU cluster; ScaiMind reaches it via a single gRPC channel held open at module init.

## Request flow for one submission

1. **Caller** sends `POST /v1/modules/scaimind/jobs` with a `SubmitJobRequest` body.
2. **Auth** resolves the bearer token to a `CurrentUser` and enforces the `scaimind:manage` permission.
3. **Token exchange** mints short-lived downstream credentials for ScaiDrive and ScaiAtlas via `TokenExchangeDep`. These are propagated to the coordinator as gRPC metadata so the coordinator can fetch datasets and model artefacts on the caller's behalf.
4. **Schema conversion** translates the Pydantic body into a protobuf `SubmitJobRequest` (`build_submit_job_request` in `converters.py`).
5. **gRPC call** runs through `client.submit_job(...)`. The channel was opened at module init and is held for the process lifetime. Metadata always includes `x-api-secret` and `x-tenant-id`; `x-user-id`, `x-scaidrive-token`, and `x-scaiatlas-token` are added when available.
6. **Response conversion** runs the protobuf reply through `proto_to_dict` and returns it inside ScaiGrid's standard `success()` envelope.
7. **Error mapping** is handled by `grpc_to_http()` — see "Error translation" below.

Read-side flows (`GET /jobs`, `GET /cluster`, etc.) are the same minus the token exchange. List endpoints additionally call `JobService.upsert` to refresh the local cache.
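
As an illustration of step 5, the metadata assembly could look roughly like the sketch below. The function name `build_metadata` and its parameters are hypothetical and are not taken from the ScaiMind source; only the header names come from the description above.

```python
from typing import Optional


def build_metadata(
    api_secret: str,
    tenant_id: object,
    user_id: Optional[str] = None,
    scaidrive_token: Optional[str] = None,
    scaiatlas_token: Optional[str] = None,
) -> list[tuple[str, str]]:
    """Assemble gRPC call metadata as (key, value) pairs (hypothetical helper)."""
    # Always present: the shared secret and the caller's tenant, sent as a string.
    metadata = [
        ("x-api-secret", api_secret),
        ("x-tenant-id", str(tenant_id)),
    ]
    # Optional entries are attached only when a value is available,
    # e.g. read-side flows skip the token exchange and carry no downstream tokens.
    optional = {
        "x-user-id": user_id,
        "x-scaidrive-token": scaidrive_token,
        "x-scaiatlas-token": scaiatlas_token,
    }
    metadata.extend((key, value) for key, value in optional.items() if value is not None)
    return metadata
```

A read-side call would pass only the secret, tenant, and possibly the user id, while a submission would also pass the two freshly minted tokens.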

## Streaming endpoints

`GET /jobs/{job_id}/logs` and `GET /jobs/{job_id}/metrics/stream` open server-streaming gRPC calls (`StreamJobLogs` and `StreamJobMetrics` respectively) and re-emit each protobuf message as a Server-Sent Event. The log stream emits `log` events and the metrics stream emits `metrics` events:

```
event: log
data: {"level": "INFO", "message": "...", "timestamp": "..."}

event: metrics
data: {"step": 200, "loss": 0.34, ...}
```

Streaming calls are bounded by `stream_timeout`, which is configured separately from the `timeout` used for unary calls. Both are set on the client at init.
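
The re-framing itself is simple. A minimal sketch, assuming each protobuf message has already been converted to a plain dict (the helper names `format_sse` and `relay` are illustrative, not the module's actual functions):

```python
import json
from typing import Any, Iterable, Iterator


def format_sse(event: str, payload: dict[str, Any]) -> str:
    """Render one Server-Sent Event frame: event line, data line, blank line."""
    # json.dumps escapes newlines inside strings, so the data stays on one line.
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def relay(event: str, messages: Iterable[dict[str, Any]]) -> Iterator[str]:
    """Turn a stream of already-converted messages into SSE frames."""
    for message in messages:
        yield format_sse(event, message)
```

In a FastAPI handler, a generator like `relay(...)` would typically be wrapped in a `StreamingResponse` with media type `text/event-stream`.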

## State

- **Authoritative state** lives in MindCoordinator: job lifecycle, queue, node inventory, evaluations, checkpoints.
- **Cache state** lives in MariaDB table `mod_scaimind_jobs`: job id, tenant id, status, priority, the five config blobs as JSON, error info, timestamps. Refreshed best-effort on every `GET /jobs` list call. Never authoritative; dashboards always reconcile against a fresh coordinator read when the user opens a job.
- **No client-side state matters** — the caller holds a `job_id` and an optional `page_token`. Losing them just means re-listing.
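
"Best-effort" here means a failed cache write should never fail the read that triggered it. A minimal sketch of that pattern, with the storage call abstracted as a callable; `refresh_cache` and the `job_id` key are assumptions for illustration, not the real `JobService` interface:

```python
import logging
from typing import Any, Callable, Iterable

logger = logging.getLogger("scaimind.cache")  # hypothetical logger name


def refresh_cache(
    jobs: Iterable[dict[str, Any]],
    upsert: Callable[[dict[str, Any]], None],
) -> int:
    """Try to upsert every job; log and skip failures. Returns rows written."""
    written = 0
    for job in jobs:
        try:
            upsert(job)
            written += 1
        except Exception:
            # The cache is never authoritative, so a failed write is logged
            # and skipped rather than propagated to the caller.
            logger.warning("cache upsert failed for job %s", job.get("job_id"), exc_info=True)
    return written
```

The coordinator's reply is returned to the caller regardless of how many rows were written.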

## Error translation

`modules/scaimind/errors.py` maps gRPC status codes to HTTP responses:

| gRPC code | HTTP | Notes |
|---|---|---|
| `OK` | 200 | |
| `NOT_FOUND` | 404 | Detail passed through. |
| `INVALID_ARGUMENT` | 400 | Detail passed through. |
| `PERMISSION_DENIED` | 403 | Detail passed through. |
| `UNAUTHENTICATED` | 401 | Detail passed through. |
| `ALREADY_EXISTS` | 409 | Detail passed through. |
| `FAILED_PRECONDITION` | 409 | Detail passed through. |
| `RESOURCE_EXHAUSTED` | 429 | Detail passed through. |
| `DEADLINE_EXCEEDED` | 504 | Friendly message; original logged. |
| `UNAVAILABLE` | 503 | Friendly message; original logged. |
| `INTERNAL` | 500 | Always sanitised; raw detail logged server-side. |
| `UNKNOWN` | 500 | Always sanitised; raw detail logged server-side. |

`INTERNAL` and `UNKNOWN` details are scrubbed before reaching the caller because they often contain coordinator-side tracebacks, SQL errors, or other infrastructure leakage. The original is captured in structured logs under `scaimind_coordinator_error`.
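
The table can be read as three behaviours: pass the detail through, replace it with a friendly message, or sanitise it. The sketch below encodes that with plain strings for the gRPC code names so it runs without the `grpc` package. It is not the actual `grpc_to_http()`; the function name `translate`, the message texts, and the treatment of codes missing from the table (sanitised 500) are assumptions.

```python
import logging
from typing import NamedTuple

logger = logging.getLogger("scaimind")  # hypothetical logger name


class HttpError(NamedTuple):
    status: int
    detail: str


# Codes whose coordinator detail is safe to show to the caller.
PASS_THROUGH = {
    "NOT_FOUND": 404, "INVALID_ARGUMENT": 400, "PERMISSION_DENIED": 403,
    "UNAUTHENTICATED": 401, "ALREADY_EXISTS": 409,
    "FAILED_PRECONDITION": 409, "RESOURCE_EXHAUSTED": 429,
}
# Codes that get a fixed message; the wording here is a placeholder.
FRIENDLY = {
    "DEADLINE_EXCEEDED": (504, "The coordinator took too long to respond."),
    "UNAVAILABLE": (503, "The coordinator is currently unavailable."),
}


def translate(code_name: str, detail: str) -> HttpError:
    """Map a gRPC status code name and detail to an HTTP status and safe detail."""
    if code_name in PASS_THROUGH:
        return HttpError(PASS_THROUGH[code_name], detail)
    # Everything else hides the raw detail, so log it server-side first.
    logger.error("scaimind_coordinator_error",
                 extra={"grpc_code": code_name, "grpc_detail": detail})
    if code_name in FRIENDLY:
        return HttpError(*FRIENDLY[code_name])
    # INTERNAL, UNKNOWN, and (by assumption) any unlisted code: sanitised 500.
    return HttpError(500, "Internal error.")
```

For example, `translate("NOT_FOUND", "job 42 not found")` keeps the coordinator's detail, while `translate("INTERNAL", "Traceback ...")` returns a generic 500 and leaves the traceback in the logs only.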

## Trust boundary

The REST surface is inside ScaiGrid's normal trust perimeter — bearer token, tenant scoping, RBAC. Beyond that, three trust contracts matter:

- **ScaiMind → coordinator** authenticates with a static `x-api-secret` set in ScaiGrid config (`scaimind_api_secret`). The coordinator MUST validate this on every call; do not deploy a coordinator that ignores it.
- **Coordinator → ScaiDrive / ScaiAtlas** authenticates with the per-request tokens forwarded as `x-scaidrive-token` / `x-scaiatlas-token`. These are minted fresh per call by `TokenExchangeDep` and scoped to the caller's identity.
- **Tenant isolation** is enforced by the coordinator using the `x-tenant-id` header. ScaiMind always sends `str(user.tenant_id)`; the coordinator must reject any request that tries to read or write a job in a different tenant.

If `scaimind_tls_enabled` is set, the gRPC channel uses mTLS with the configured cert, key, and CA. Otherwise it is plaintext — fine for in-cluster deployments behind a service mesh, not fine for anything traversing untrusted networks.

## How it differs from calling the coordinator directly

A coordinator with an exposed gRPC port is callable directly by anyone who has the API secret. ScaiMind adds:

| Concern | Direct gRPC | ScaiMind |
|---|---|---|
| Auth | API secret only | ScaiGrid bearer token + tenant scoping + RBAC |
| Tenant isolation | You enforce it | `x-tenant-id` set from the authenticated user on every call |
| Downstream tokens | You manage them | Minted per request via `TokenExchangeDep` |
| Error sanitisation | Tracebacks pass through | `INTERNAL`/`UNKNOWN` scrubbed; raw detail logged |
| Listings | gRPC only | Synced to local cache for fast dashboard reads |
| Stream framing | gRPC over HTTP/2 | Server-Sent Events over HTTP/1.1 |

For programmatic access from inside the trust perimeter, talking to the coordinator directly is faster. For multi-tenant, externally-exposed access, the REST surface is the only sane option — the same module-permission keys gate everything.