ScaiMind

ScaiMind is the training product on top of ScaiGrid. You submit a training job (SFT, LoRA, QLoRA, DPO, RLHF, continued pretraining, or full fine-tune) through the REST API. ScaiMind translates each call into gRPC against an external MindCoordinator service, which schedules the job on a GPU cluster, streams metrics and logs back, and runs evaluations against the resulting model.

It is a thin REST-to-gRPC bridge: every endpoint is a translation of an HTTP request into a MindCoordinator call. A local job cache in ScaiGrid's MariaDB powers dashboard queries without round-tripping every read to the coordinator.
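
The caching behaviour described above can be sketched roughly as follows. This is a minimal illustration, not ScaiMind's implementation: an in-memory SQLite database stands in for ScaiGrid's MariaDB, the column layout and the `job_id`/`status` field names are assumptions, and only the table name `mod_scaimind_jobs` comes from this page.

```python
import json
import sqlite3


def open_cache() -> sqlite3.Connection:
    # In-memory SQLite stands in for ScaiGrid's MariaDB (assumption).
    conn = sqlite3.connect(":memory:")
    # Column layout is hypothetical; only the table name comes from the docs.
    conn.execute(
        "CREATE TABLE mod_scaimind_jobs ("
        "job_id TEXT PRIMARY KEY, status TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return conn


def sync_job(conn: sqlite3.Connection, job: dict) -> None:
    """Upsert one job record, as returned by the coordinator, into the cache."""
    conn.execute(
        "INSERT OR REPLACE INTO mod_scaimind_jobs (job_id, status, payload) "
        "VALUES (?, ?, ?)",
        (job["job_id"], job["status"], json.dumps(job)),
    )
    conn.commit()


def dashboard_status_counts(conn: sqlite3.Connection) -> dict:
    """Answer a dashboard query from the cache alone, with no coordinator call."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM mod_scaimind_jobs GROUP BY status"
    )
    return dict(rows.fetchall())
```

In this sketch, `sync_job` would run whenever a list or read response comes back from the coordinator, while dashboard code only ever calls functions like `dashboard_status_counts`.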

When to use it

  • You want to fine-tune a model on your own data and don't want to run a training stack yourself.
  • You need multi-GPU scheduling — LoRA on 4xA100, DeepSpeed on 8xH100, FSDP across nodes.
  • You want training metrics (loss, learning rate, throughput, GPU utilisation) streamed live without writing your own collector.
  • You need to evaluate trained models against named benchmarks and compare runs.
  • You operate the GPU cluster and need to drain nodes, watch queues, and inspect data caches.

If you only need to call an already-trained model, you don't need ScaiMind — use ScaiGrid's /v1/inference/chat.

What you get

  • REST submission for seven training types. SFT, LORA, QLORA, DPO, RLHF, CONTINUED_PRETRAIN, FULL_FINETUNE.
  • Five framework targets. HF_TRAINER, DEEPSPEED, FSDP, MEGATRON, CUSTOM.
  • Lifecycle controls. Submit, cancel, pause, resume, retry — each with optional checkpoint handling.
  • Live monitoring. Point-in-time metrics, server-sent-event log streams, server-sent-event metric snapshots.
  • Evaluation runs. Submit named benchmarks against a job's output model, list and inspect results.
  • Cluster operations. Cluster status, node listing, node drain and enable, queue depth, data cache inspection.
  • Local cache. Every list and read syncs to mod_scaimind_jobs for fast dashboard queries.
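
For the server-sent-event streams mentioned above, a client needs to split the response into events. The sketch below follows the standard SSE framing (`data:` fields, blank line as event terminator, `:` for comments). It is a generic parser, not ScaiMind client code; the shape of the payload inside each event is not specified on this page, so the helper returns raw strings.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event in a stream of lines."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            # Comment line, commonly used as a keep-alive.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            # The format allows one optional space after the colon.
            value = value[1:]
        if field == "data":
            # Multiple data lines in one event are joined with newlines.
            data_parts.append(value)
    # An unterminated event at end of stream is discarded.
```

Any iterable of text lines works as input, for example the line iterator of a streaming HTTP response.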

Two-minute mental model

You manage three nouns and one verb:

  • A Job is one training run with its training, data, resource, output, and scheduling config.
  • A Node is one machine in the GPU cluster, with a status (online, draining, offline) and a GPU inventory.
  • An Evaluation is a benchmark run against a job's output model.
  • And the verb: you submit — a job, a retry, an evaluation — and ScaiMind forwards it to the coordinator.

ScaiMind itself stores no training state of record; the coordinator owns it. ScaiGrid's local table is a read-through cache for the admin UI.
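
As a rough client-side model of the nouns above, the sketch below encodes the seven training types, five framework targets, and three node statuses as enums, and groups a job's five config sections in a dataclass. The enum values are the ones listed on this page; the class names, the field names, and the shape of the request body are hypothetical, as is placing the framework inside the training config.

```python
from dataclasses import dataclass, field
from enum import Enum


class TrainingType(str, Enum):
    SFT = "SFT"
    LORA = "LORA"
    QLORA = "QLORA"
    DPO = "DPO"
    RLHF = "RLHF"
    CONTINUED_PRETRAIN = "CONTINUED_PRETRAIN"
    FULL_FINETUNE = "FULL_FINETUNE"


class Framework(str, Enum):
    HF_TRAINER = "HF_TRAINER"
    DEEPSPEED = "DEEPSPEED"
    FSDP = "FSDP"
    MEGATRON = "MEGATRON"
    CUSTOM = "CUSTOM"


class NodeStatus(str, Enum):
    ONLINE = "online"
    DRAINING = "draining"
    OFFLINE = "offline"


@dataclass
class Job:
    """One training run; the five sections mirror the configs named above."""
    training_type: TrainingType
    framework: Framework
    data: dict = field(default_factory=dict)
    resources: dict = field(default_factory=dict)
    output: dict = field(default_factory=dict)
    scheduling: dict = field(default_factory=dict)

    def to_request_body(self) -> dict:
        # Hypothetical JSON layout; the real request schema is not shown here.
        return {
            "training": {
                "type": self.training_type.value,
                "framework": self.framework.value,
            },
            "data": dict(self.data),
            "resources": dict(self.resources),
            "output": dict(self.output),
            "scheduling": dict(self.scheduling),
        }
```

Constructing the enums from strings, for example `TrainingType("LORA")`, rejects unknown values with a `ValueError` before any request is sent.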

How it differs from calling inference

A ScaiGrid chat-completion call gives you tokens out of a model that already exists. ScaiMind produces the model in the first place. The two products are deliberately separate:

| Concern | Inference (/v1/inference/) | ScaiMind (/v1/modules/scaimind/) |
| --- | --- | --- |
| Latency | Milliseconds per request | Minutes to days per job |
| State | Stateless per request | Long-lived, multi-step lifecycle |
| Hardware | Inference fleet | Training cluster (separate GPUs) |
| Tenancy | Per-call accounting | Per-job scheduling and budgeting |
| Output | A response | A model artefact |

When a ScaiMind job completes, the resulting artefact can be registered as a ScaiGrid backend and routed to from /v1/inference/chat like any other model. The hand-off is done outside ScaiMind — through ScaiGrid's models and routing endpoints.

Where to go next

ScaiMind's module ID inside ScaiGrid is scaimind; its API is mounted at /v1/modules/scaimind/.

Updated 2026-05-18 15:01:31, rev 12