ScaiMind

ScaiMind is the training product on top of ScaiGrid. You submit a training job — SFT, LoRA, QLoRA, DPO, RLHF, continued pretraining, or full fine-tune — through the REST API; ScaiMind translates each call into gRPC against an external MindCoordinator service that schedules the job on a GPU cluster, streams metrics and logs back, and runs evaluations against the resulting model.

It is a thin REST-to-gRPC bridge: every endpoint translates an HTTP request into a MindCoordinator call. A local job cache in ScaiGrid's MariaDB powers dashboard queries without round-tripping every read to the coordinator.
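The split between coordinator reads and cached dashboard reads can be sketched in a few lines of Python. This is an illustration only, not ScaiMind's implementation: `FakeCoordinator`, `JobBridge`, the method names, and the `status` field are all hypothetical, and a plain dict stands in for the MariaDB table.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeCoordinator:
    """Stand-in for a MindCoordinator gRPC client (hypothetical interface)."""
    jobs: Dict[str, dict] = field(default_factory=dict)
    calls: int = 0

    def get_job(self, job_id: str) -> dict:
        self.calls += 1
        return dict(self.jobs[job_id])


@dataclass
class JobBridge:
    """Forwards API reads to the coordinator and mirrors the results locally."""
    coordinator: FakeCoordinator
    # Assumption: a dict stands in for the local job table in MariaDB.
    cache: Dict[str, dict] = field(default_factory=dict)

    def read_job(self, job_id: str) -> dict:
        # API read: the coordinator owns the state, so always ask it,
        # then sync the answer into the local cache.
        job = self.coordinator.get_job(job_id)
        self.cache[job_id] = job
        return job

    def dashboard_jobs(self) -> List[dict]:
        # Dashboard query: served from the local cache, no coordinator call.
        return sorted(self.cache.values(), key=lambda j: j["id"])


coordinator = FakeCoordinator(jobs={"job-1": {"id": "job-1", "status": "running"}})
bridge = JobBridge(coordinator)
bridge.read_job("job-1")
rows = bridge.dashboard_jobs()
```

In this sketch the dashboard path never touches the coordinator, so a cached row can lag behind the coordinator until the next API read refreshes it.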

When to use it

You want to fine-tune a model on your own data and don't want to run a training stack yourself.
You need multi-GPU scheduling — LoRA on 4xA100, DeepSpeed on 8xH100, FSDP across nodes.
You want training metrics (loss, learning rate, throughput, GPU utilisation) streamed live without writing your own collector.
You need to evaluate trained models against named benchmarks and compare runs.
You operate the GPU cluster and need to drain nodes, watch queues, and inspect data caches.

If you only need to call an already-trained model, you don't need ScaiMind — use ScaiGrid's /v1/inference/chat.

What you get

REST submission for seven training types. SFT, LORA, QLORA, DPO, RLHF, CONTINUED_PRETRAIN, FULL_FINETUNE.
Five framework targets. HF_TRAINER, DEEPSPEED, FSDP, MEGATRON, CUSTOM.
Lifecycle controls. Submit, cancel, pause, resume, retry — each with optional checkpoint handling.
Live monitoring. Point-in-time metrics, server-sent-event log streams, server-sent-event metric snapshots.
Evaluation runs. Submit named benchmarks against a job's output model, list and inspect results.
Cluster operations. Cluster status, node listing, node drain and enable, queue depth, data cache inspection.
Local cache. Every list and read call syncs its result to the mod_scaimind_jobs table for fast dashboard queries.
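The log and metric streams listed above use server-sent events, which any client can consume by splitting the response into blank-line-delimited events. The parser below is a minimal sketch of the standard SSE wire format; the `metrics` event name and the JSON fields in the sample are assumptions for illustration, not ScaiMind's documented payload.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per server-sent event from an iterable of text lines."""
    event_type = "message"  # default event type in the SSE format
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield {"event": event_type, "data": "\n".join(data_lines)}
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive line
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # strip the single optional leading space
        if name == "event":
            event_type = value
        elif name == "data":
            data_lines.append(value)


# Hypothetical stream content: event name and fields are illustrative only.
sample = [
    ": keep-alive\n",
    "event: metrics\n",
    'data: {"step": 100, "loss": 1.25}\n',
    "\n",
]
events = list(parse_sse(sample))
snapshot = json.loads(events[0]["data"])
```

In practice the lines would come from a streaming HTTP response rather than a list; the parser only needs an iterable of text lines.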

Two-minute mental model

You manage three nouns and one verb:

A Job is one training run with its training, data, resource, output, and scheduling config.
A Node is one machine in the GPU cluster, with a status (online, draining, offline) and a GPU inventory.
An Evaluation is a benchmark run against a job's output model.
And the verb: you submit — a job, a retry, an evaluation — and ScaiMind forwards it to the coordinator.

ScaiMind itself stores no training state of record; the coordinator owns it. ScaiGrid's local table is a read-through cache for the admin UI.
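The three nouns can be written down as a small client-side data model. This is a sketch under stated assumptions: the enum values come from the training types, framework targets, and node statuses named in this page, but the class names, field names, and the choice to attach a training type and framework directly to a Job are hypothetical, not the API's schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TrainingType(Enum):
    SFT = "SFT"
    LORA = "LORA"
    QLORA = "QLORA"
    DPO = "DPO"
    RLHF = "RLHF"
    CONTINUED_PRETRAIN = "CONTINUED_PRETRAIN"
    FULL_FINETUNE = "FULL_FINETUNE"


class Framework(Enum):
    HF_TRAINER = "HF_TRAINER"
    DEEPSPEED = "DEEPSPEED"
    FSDP = "FSDP"
    MEGATRON = "MEGATRON"
    CUSTOM = "CUSTOM"


class NodeStatus(Enum):
    ONLINE = "online"
    DRAINING = "draining"
    OFFLINE = "offline"


@dataclass
class Job:
    job_id: str  # hypothetical field name
    training_type: TrainingType
    framework: Framework
    # The five config groups a Job carries; their inner fields are not
    # specified on this page, so they are left as open dicts.
    training: dict = field(default_factory=dict)
    data: dict = field(default_factory=dict)
    resources: dict = field(default_factory=dict)
    output: dict = field(default_factory=dict)
    scheduling: dict = field(default_factory=dict)


@dataclass
class Node:
    node_id: str  # hypothetical field name
    status: NodeStatus
    gpus: List[str] = field(default_factory=list)  # GPU inventory


@dataclass
class Evaluation:
    job_id: str     # the job whose output model is benchmarked
    benchmark: str  # a named benchmark


job = Job("job-1", TrainingType("LORA"), Framework.HF_TRAINER)
node = Node("node-a", NodeStatus.DRAINING, gpus=["A100"] * 4)
evaluation = Evaluation(job_id=job.job_id, benchmark="example-benchmark")
```

Using enums means an unsupported value such as `TrainingType("PPO")` raises `ValueError` on the client before any request is sent.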

How it differs from calling inference

A ScaiGrid chat-completion call gives you tokens out of a model that already exists. ScaiMind produces the model in the first place. The two products are deliberately separate:

Concern	Inference (`/v1/inference/`)	ScaiMind (`/v1/modules/scaimind/`)
Latency	Milliseconds per request	Minutes to days per job
State	Stateless per request	Long-lived, multi-step lifecycle
Hardware	Inference fleet	Training cluster (separate GPUs)
Tenancy	Per-call accounting	Per-job scheduling and budgeting
Output	A response	A model artefact

When a ScaiMind job completes, the resulting artefact can be registered as a ScaiGrid backend and routed to from /v1/inference/chat like any other model. The hand-off is done outside ScaiMind — through ScaiGrid's models and routing endpoints.

Where to go next

Quickstart — submit your first LoRA job and watch its metrics.
Architecture — REST, gRPC, the coordinator, the local cache.
Training jobs — config shape, lifecycle, retries, checkpoints.
API reference — every endpoint, request, response.
Submit a LoRA fine-tune — full walkthrough.
Run an evaluation — benchmark a trained model.

ScaiMind's module ID inside ScaiGrid is scaimind; its API is mounted at /v1/modules/scaimind/.