---
summary: "Training orchestration bridge \u2014 submit fine-tuning jobs, monitor metrics\
  \ and logs, evaluate models, and manage GPU clusters through a single REST API over\
  \ MindCoordinator."
title: ScaiMind
path: overview
status: published
---

ScaiMind is the training product on top of ScaiGrid. You submit a training job — SFT, LoRA, QLoRA, DPO, RLHF, continued pretraining, or full fine-tune — through the REST API; ScaiMind translates each call into gRPC against an external MindCoordinator service that schedules the job on a GPU cluster, streams metrics and logs back, and runs evaluations against the resulting model.

It is a thin REST-to-gRPC bridge: every endpoint is a translation of an HTTP request into a MindCoordinator call. A local job cache in ScaiGrid's MariaDB powers dashboard queries without round-tripping every read to the coordinator.
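As a rough illustration of the REST side of the bridge, the sketch below builds (but does not send) a job submission request. Only the `/v1/modules/scaimind/` mount point and the training-type and framework names come from this page. The `jobs` path segment, the payload field names, the bearer-token header, and the host are assumptions made for illustration; the API reference has the real request shape.

```python
import json
import urllib.request

# Enum names as listed on this page; everything else here is hypothetical.
TRAINING_TYPES = {"SFT", "LORA", "QLORA", "DPO", "RLHF", "CONTINUED_PRETRAIN", "FULL_FINETUNE"}
FRAMEWORKS = {"HF_TRAINER", "DEEPSPEED", "FSDP", "MEGATRON", "CUSTOM"}

MODULE_PREFIX = "/v1/modules/scaimind/"


def build_submit_request(base_url, token, training_type, framework, base_model, dataset_uri):
    """Build (without sending) a POST for a hypothetical job-submission endpoint."""
    if training_type not in TRAINING_TYPES:
        raise ValueError(f"unknown training type: {training_type}")
    if framework not in FRAMEWORKS:
        raise ValueError(f"unknown framework: {framework}")

    # Assumed payload layout: one block per config area.
    payload = {
        "training": {"type": training_type, "framework": framework, "base_model": base_model},
        "data": {"dataset_uri": dataset_uri},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + MODULE_PREFIX + "jobs",  # "jobs" is an assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


request = build_submit_request(
    "https://scaigrid.example.com",
    "TOKEN",
    "LORA",
    "HF_TRAINER",
    "example-base-model",
    "s3://example-bucket/train.jsonl",
)
print(request.get_method(), request.full_url)
```

Validating the enum values on the client is optional; it simply fails fast before a request ever reaches the coordinator.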

## When to use it

- You want to fine-tune a model on your own data and don't want to run a training stack yourself.
- You need multi-GPU scheduling — LoRA on 4xA100, DeepSpeed on 8xH100, FSDP across nodes.
- You want training metrics (loss, learning rate, throughput, GPU utilisation) streamed live without writing your own collector.
- You need to evaluate trained models against named benchmarks and compare runs.
- You operate the GPU cluster and need to drain nodes, watch queues, and inspect data caches.

If you only need to call an already-trained model, you don't need ScaiMind — use ScaiGrid's `/v1/inference/chat`.

## What you get

- **REST submission for seven training types.** SFT, LORA, QLORA, DPO, RLHF, CONTINUED_PRETRAIN, FULL_FINETUNE.
- **Five framework targets.** HF_TRAINER, DEEPSPEED, FSDP, MEGATRON, CUSTOM.
- **Lifecycle controls.** Submit, cancel, pause, resume, retry — each with optional checkpoint handling.
- **Live monitoring.** Point-in-time metrics, plus log streams and metric snapshots delivered as server-sent events (SSE).
- **Evaluation runs.** Submit named benchmarks against a job's output model, list and inspect results.
- **Cluster operations.** Cluster status, node listing, node drain and enable, queue depth, data cache inspection.
- **Local cache.** Every list and read call syncs its results to the `mod_scaimind_jobs` table for fast dashboard queries.
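The log and metric streams mentioned above are server-sent events. A minimal sketch of parsing that wire format on the client follows, using the standard `text/event-stream` framing in which a blank line ends each event. The event names and JSON fields in the sample are hypothetical; the actual stream contents are described in the API reference.

```python
import json


def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates one event; dispatch it only if it carried data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # the format strips a single leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical sample stream; event names and fields are illustrative only.
sample = [
    ": keep-alive\n",
    "event: metric\n",
    'data: {"step": 100, "loss": 1.25}\n',
    "\n",
    "event: log\n",
    "data: epoch 1 finished\n",
    "\n",
]

events = list(parse_sse(sample))
for name, payload in events:
    print(name, json.loads(payload) if name == "metric" else payload)
```

In practice the `lines` argument would be the decoded line iterator of a streaming HTTP response rather than a list.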

## Two-minute mental model

You manage three nouns and one verb:

- A **Job** is one training run with its training, data, resource, output, and scheduling config.
- A **Node** is one machine in the GPU cluster, with a status (online, draining, offline) and a GPU inventory.
- An **Evaluation** is a benchmark run against a job's output model.
- And the verb: you **submit** — a job, a retry, an evaluation — and ScaiMind forwards it to the coordinator.

ScaiMind itself stores no training state of record; the coordinator owns it. ScaiGrid's local table is a read-through cache for the admin UI.
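One way to picture the split between the state of record and the cache, as a sketch only: API reads go to the coordinator and refresh the local row, while dashboard queries touch only the local table. The class names, method names, and status strings below are hypothetical. An in-memory dict stands in for the `mod_scaimind_jobs` table and a stub stands in for the gRPC client.

```python
class StubCoordinator:
    """Stand-in for the MindCoordinator gRPC client (hypothetical interface)."""

    def __init__(self):
        self.jobs = {"job-1": {"id": "job-1", "status": "RUNNING"}}
        self.calls = 0

    def get_job(self, job_id):
        self.calls += 1
        return dict(self.jobs[job_id])  # copy, as a real RPC response would be


class JobCache:
    """In-memory stand-in for the mod_scaimind_jobs table."""

    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.rows = {}

    def read_job(self, job_id):
        # API read: ask the coordinator (state of record), then sync the cache row.
        job = self.coordinator.get_job(job_id)
        self.rows[job_id] = job
        return job

    def dashboard_jobs(self):
        # Dashboard query: served from local rows only, no coordinator call.
        return sorted(self.rows.values(), key=lambda row: row["id"])


coordinator = StubCoordinator()
cache = JobCache(coordinator)

cache.read_job("job-1")                            # first API read fills the cache
coordinator.jobs["job-1"]["status"] = "COMPLETED"  # state changes at the coordinator
stale = cache.dashboard_jobs()                     # dashboard still sees the last sync
fresh = cache.read_job("job-1")                    # next API read resyncs the row
print(stale[0]["status"], fresh["status"], coordinator.calls)
```

The consequence of this design is that the dashboard can lag the coordinator until the next list or read call refreshes the row.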

## How it differs from calling inference

A ScaiGrid chat-completion call gives you tokens out of a model that already exists. ScaiMind produces the model in the first place. The two products are deliberately separate:

| Concern | Inference (`/v1/inference/`) | ScaiMind (`/v1/modules/scaimind/`) |
|---|---|---|
| Latency | Milliseconds per request | Minutes to days per job |
| State | Stateless per request | Long-lived, multi-step lifecycle |
| Hardware | Inference fleet | Training cluster (separate GPUs) |
| Tenancy | Per-call accounting | Per-job scheduling and budgeting |
| Output | A response | A model artefact |

When a ScaiMind job completes, the resulting artefact can be registered as a ScaiGrid backend and served through `/v1/inference/chat` like any other model. That hand-off happens outside ScaiMind, through ScaiGrid's models and routing endpoints.

## Where to go next

- [Quickstart](./quickstart) — submit your first LoRA job and watch its metrics.
- [Architecture](./concepts/architecture) — REST, gRPC, the coordinator, the local cache.
- [Training jobs](./concepts/training-jobs) — config shape, lifecycle, retries, checkpoints.
- [API reference](./reference/api) — every endpoint, request, response.
- [Submit a LoRA fine-tune](./tutorials/submit-a-lora-finetune) — full walkthrough.
- [Run an evaluation](./tutorials/run-an-evaluation) — benchmark a trained model.

ScaiMind's module ID inside ScaiGrid is `scaimind`; its API is mounted at `/v1/modules/scaimind/`.
