ScaiMind
ScaiMind is the training product on top of ScaiGrid. You submit a training job — SFT, LoRA, QLoRA, DPO, RLHF, continued pretraining, or full fine-tune — through the REST API; ScaiMind translates each call into gRPC against an external MindCoordinator service that schedules the job on a GPU cluster, streams metrics and logs back, and runs evaluations against the resulting model.
It is a thin REST-to-gRPC bridge: every endpoint is a translation of an HTTP request into a MindCoordinator call. A local job cache in ScaiGrid's MariaDB powers dashboard queries without round-tripping every read to the coordinator.
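As a rough illustration of the REST side of that bridge, the sketch below builds (but does not send) a job submission request. Only the `/v1/modules/scaimind/` mount point and the `LORA` and `HF_TRAINER` values come from this page; the host name, the `jobs` path segment, the bearer-token header, and every payload field name are assumptions made for the example, so check the API reference for the real shapes.

```python
import json
import urllib.request

# Assumption: the host, the "jobs" path segment, and all payload field names
# are hypothetical. Only the /v1/modules/scaimind/ prefix is documented here.
BASE_URL = "https://scaigrid.example.com"
MODULE_PREFIX = "/v1/modules/scaimind/"


def build_submit_request(token: str, payload: dict) -> urllib.request.Request:
    """Build a POST request that would submit a training job (nothing is sent)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=BASE_URL + MODULE_PREFIX + "jobs",  # hypothetical endpoint
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


example_payload = {
    "training_type": "LORA",                      # value listed on this page
    "framework": "HF_TRAINER",                    # value listed on this page
    "base_model": "my-base-model",                # hypothetical field
    "dataset_uri": "s3://my-bucket/train.jsonl",  # hypothetical field
}

request = build_submit_request("YOUR_API_TOKEN", example_payload)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be run without a ScaiGrid deployment.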
When to use it
- You want to fine-tune a model on your own data and don't want to run a training stack yourself.
- You need multi-GPU scheduling — LoRA on 4xA100, DeepSpeed on 8xH100, FSDP across nodes.
- You want training metrics (loss, learning rate, throughput, GPU utilisation) streamed live without writing your own collector.
- You need to evaluate trained models against named benchmarks and compare runs.
- You operate the GPU cluster and need to drain nodes, watch queues, and inspect data caches.
If you only need to call an already-trained model, you don't need ScaiMind — use ScaiGrid's /v1/inference/chat.
What you get
- REST submission for seven training types. SFT, LORA, QLORA, DPO, RLHF, CONTINUED_PRETRAIN, FULL_FINETUNE.
- Five framework targets. HF_TRAINER, DEEPSPEED, FSDP, MEGATRON, CUSTOM.
- Lifecycle controls. Submit, cancel, pause, resume, retry — each with optional checkpoint handling.
- Live monitoring. Point-in-time metrics, server-sent-event log streams, server-sent-event metric snapshots.
- Evaluation runs. Submit named benchmarks against a job's output model, list and inspect results.
- Cluster operations. Cluster status, node listing, node drain and enable, queue depth, data cache inspection.
- Local cache. Every list and read syncs to mod_scaimind_jobs for fast dashboard queries.
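A client can check the training type and framework locally before submitting. The sketch below encodes the seven training types and five framework targets listed above as Python enums. The enum classes and the `parse_job_kind` helper are illustrative names, not part of any ScaiMind SDK.

```python
from enum import Enum
from typing import Tuple


class TrainingType(str, Enum):
    SFT = "SFT"
    LORA = "LORA"
    QLORA = "QLORA"
    DPO = "DPO"
    RLHF = "RLHF"
    CONTINUED_PRETRAIN = "CONTINUED_PRETRAIN"
    FULL_FINETUNE = "FULL_FINETUNE"


class Framework(str, Enum):
    HF_TRAINER = "HF_TRAINER"
    DEEPSPEED = "DEEPSPEED"
    FSDP = "FSDP"
    MEGATRON = "MEGATRON"
    CUSTOM = "CUSTOM"


def parse_job_kind(training_type: str, framework: str) -> Tuple[TrainingType, Framework]:
    """Hypothetical helper: reject values that are not in the documented lists."""
    # Enum lookup by value raises ValueError for anything not listed above.
    return TrainingType(training_type), Framework(framework)


print(parse_job_kind("LORA", "DEEPSPEED"))
```

Whether every training type is valid with every framework is not stated here, so the helper checks each value independently and leaves combination rules to the server.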
Two-minute mental model
You manage three nouns and one verb:
- A Job is one training run with its training, data, resource, output, and scheduling config.
- A Node is one machine in the GPU cluster, with a status (online, draining, offline) and a GPU inventory.
- An Evaluation is a benchmark run against a job's output model.
- And the verb: you submit — a job, a retry, an evaluation — and ScaiMind forwards it to the coordinator.
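The three nouns can be pictured as plain records. This is a minimal sketch for orientation only: the node statuses and the five job config groups come from the list above, while every field name and the `online_nodes` helper are assumptions rather than the actual API schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class NodeStatus(str, Enum):
    ONLINE = "online"
    DRAINING = "draining"
    OFFLINE = "offline"


@dataclass
class Job:
    """One training run; the five config groups mirror the list above."""
    job_id: str
    training: Dict[str, object] = field(default_factory=dict)
    data: Dict[str, object] = field(default_factory=dict)
    resources: Dict[str, object] = field(default_factory=dict)
    output: Dict[str, object] = field(default_factory=dict)
    scheduling: Dict[str, object] = field(default_factory=dict)


@dataclass
class Node:
    """One machine in the GPU cluster (field names are hypothetical)."""
    name: str
    status: NodeStatus
    gpus: List[str] = field(default_factory=list)


@dataclass
class Evaluation:
    """A benchmark run against a job's output model."""
    job_id: str
    benchmark: str


def online_nodes(nodes: List[Node]) -> List[Node]:
    """Hypothetical helper: keep only nodes whose status is 'online'."""
    return [node for node in nodes if node.status is NodeStatus.ONLINE]


cluster = [
    Node("gpu-a", NodeStatus.ONLINE, ["A100"] * 4),
    Node("gpu-b", NodeStatus.DRAINING, ["H100"] * 8),
]
print([node.name for node in online_nodes(cluster)])
```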
ScaiMind itself stores no training state of record; the coordinator owns it. ScaiGrid's local table is a read-through cache for the admin UI.
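One way to picture that split is sketched below: API reads go to the coordinator and refresh a local row as a side effect, while dashboard lookups touch only the local copy. This is an interpretation of the description above, not ScaiGrid's implementation. The in-memory dict stands in for the `mod_scaimind_jobs` table, and `JobRecord`, `JobCache`, the `RUNNING` status, and the fake coordinator are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class JobRecord:
    job_id: str
    status: str  # hypothetical status string


class JobCache:
    """Toy stand-in for the local job table; the coordinator stays authoritative."""

    def __init__(self, fetch_from_coordinator: Callable[[str], JobRecord]) -> None:
        self._fetch = fetch_from_coordinator
        self._rows: Dict[str, JobRecord] = {}

    def read_job(self, job_id: str) -> JobRecord:
        # API read: always ask the coordinator, then sync the local row.
        record = self._fetch(job_id)
        self._rows[job_id] = record
        return record

    def dashboard_lookup(self, job_id: str) -> Optional[JobRecord]:
        # Dashboard query: served from the local copy, no coordinator call.
        return self._rows.get(job_id)


def fake_coordinator(job_id: str) -> JobRecord:
    """Stand-in for a MindCoordinator gRPC call."""
    return JobRecord(job_id=job_id, status="RUNNING")


cache = JobCache(fake_coordinator)
print(cache.dashboard_lookup("job-1"))  # None: nothing has been synced yet
cache.read_job("job-1")
print(cache.dashboard_lookup("job-1"))
```

Because the local rows are only a copy, losing them costs nothing but a resync on the next read.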
How it differs from calling inference
A ScaiGrid chat-completion call gives you tokens out of a model that already exists. ScaiMind produces the model in the first place. The two products are deliberately separate:
| Concern | Inference (/v1/inference/) | ScaiMind (/v1/modules/scaimind/) |
|---|---|---|
| Latency | Milliseconds per request | Minutes to days per job |
| State | Stateless per request | Long-lived, multi-step lifecycle |
| Hardware | Inference fleet | Training cluster (separate GPUs) |
| Tenancy | Per-call accounting | Per-job scheduling and budgeting |
| Output | A response | A model artefact |
When a ScaiMind job completes, the resulting artefact can be registered as a ScaiGrid backend and routed to from /v1/inference/chat like any other model. The hand-off is done outside ScaiMind — through ScaiGrid's models and routing endpoints.
Where to go next
- Quickstart — submit your first LoRA job and watch its metrics.
- Architecture — REST, gRPC, the coordinator, the local cache.
- Training jobs — config shape, lifecycle, retries, checkpoints.
- API reference — every endpoint, request, response.
- Submit a LoRA fine-tune — full walkthrough.
- Run an evaluation — benchmark a trained model.
ScaiMind's module ID inside ScaiGrid is scaimind; its API is mounted at /v1/modules/scaimind/.