Quickstart

In five minutes you will have a LoRA fine-tune queued on the cluster and a live stream of metrics coming back to your terminal.

You need:

A ScaiGrid API key with the scaimind:manage permission (any tenant admin has this).
A reachable MindCoordinator cluster with at least one online node.
A training dataset already accessible to the coordinator (typically a path resolvable via the per-request ScaiDrive token).

bash
export SCAIGRID_HOST="https://scaigrid.scailabs.ai"
export SCAIGRID_API_KEY="sgk_..."

1. Check cluster status

Before queueing anything, confirm the cluster has capacity.

bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/cluster" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

python
import httpx, os
r = httpx.get(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/cluster",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
)
r.raise_for_status()  # surface 4xx/5xx errors before reading the body
print(r.json()["data"])

javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/cluster`, {
  headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
});
console.log((await r.json()).data);

The response reports counts for online_nodes, total_gpus, available_gpus, and queued_jobs, plus the overall cluster_utilization.
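If you script the submission, you may want to gate it on this response. The following is a minimal sketch, not part of the ScaiMind API: has_capacity is a hypothetical helper, the snapshot values are made up, and it assumes online_nodes and available_gpus are integers in the data object.

```python
def has_capacity(cluster: dict, gpus_needed: int = 1) -> bool:
    """Return True if the cluster snapshot has a node online and enough free GPUs.

    `cluster` is the "data" object returned by the cluster endpoint. Only the
    field names come from the docs; treating them as integers is an assumption.
    """
    online = int(cluster.get("online_nodes", 0))
    free = int(cluster.get("available_gpus", 0))
    return online >= 1 and free >= gpus_needed


# Made-up snapshot shaped like the fields listed above.
snapshot = {
    "online_nodes": 2,
    "total_gpus": 8,
    "available_gpus": 3,
    "queued_jobs": 1,
    "cluster_utilization": 0.62,
}
print(has_capacity(snapshot, gpus_needed=1))  # True
print(has_capacity(snapshot, gpus_needed=4))  # False
```

A False result does not mean the job would be rejected; it would likely just sit in the queue longer.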

2. Submit a LoRA job

The job request is one envelope with five nested configs: training_config, data_config, resource_config, and the optional output_config and scheduling_config.

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "quickstart-lora",
    "training_config": {
      "training_type": "LORA",
      "base_model": {"model_id": "meta-llama/Llama-3-8B"},
      "framework": {"type": "HF_TRAINER"},
      "hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"}
    },
    "data_config": {
      "sources": [{"path": "scaidrive://my-tenant/support.jsonl", "format": "jsonl"}],
      "max_seq_length": 2048,
      "batch_size": 8
    },
    "resource_config": {"gpu_count": 1, "gpu_type": "A100-80GB"},
    "labels": {"team": "quickstart"}
  }'

python
import httpx, os
r = httpx.post(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/jobs",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    json={
        "name": "quickstart-lora",
        "training_config": {
            "training_type": "LORA",
            "base_model": {"model_id": "meta-llama/Llama-3-8B"},
            "framework": {"type": "HF_TRAINER"},
            "hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"},
        },
        "data_config": {
            "sources": [{"path": "scaidrive://my-tenant/support.jsonl", "format": "jsonl"}],
            "max_seq_length": 2048,
            "batch_size": 8,
        },
        "resource_config": {"gpu_count": 1, "gpu_type": "A100-80GB"},
        "labels": {"team": "quickstart"},
    },
)
r.raise_for_status()  # fail on 4xx/5xx instead of a KeyError on "data"
job = r.json()["data"]
print(job["job_id"], job["status"])

javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/jobs`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    name: "quickstart-lora",
    training_config: {
      training_type: "LORA",
      base_model: { model_id: "meta-llama/Llama-3-8B" },
      framework: { type: "HF_TRAINER" },
      hyperparameters: { learning_rate: "2e-4", num_train_epochs: "3" },
    },
    data_config: {
      sources: [{ path: "scaidrive://my-tenant/support.jsonl", format: "jsonl" }],
      max_seq_length: 2048,
      batch_size: 8,
    },
    resource_config: { gpu_count: 1, gpu_type: "A100-80GB" },
    labels: { team: "quickstart" },
  }),
});
const { data: job } = await r.json();
console.log(job.job_id, job.status);

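Save the returned job_id; the remaining steps read it from the JOB_ID environment variable (for example, export JOB_ID="<job_id>"). The response also includes estimated_wait_seconds, an estimate based on the current queue.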
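The envelope described at the start of this step has three required config blocks and two optional ones. If you build requests programmatically, a small helper can catch a missing block before the request leaves your machine. This is a sketch under that reading of the schema: build_job_request is a hypothetical helper, not a ScaiMind client function, and the server remains the authority on what is valid.

```python
from typing import Optional

REQUIRED_CONFIGS = ("training_config", "data_config", "resource_config")
OPTIONAL_CONFIGS = ("output_config", "scheduling_config")


def build_job_request(name: str, labels: Optional[dict] = None, **configs: dict) -> dict:
    """Assemble a job envelope, rejecting missing or unknown config blocks."""
    missing = [key for key in REQUIRED_CONFIGS if key not in configs]
    if missing:
        raise ValueError(f"missing required configs: {', '.join(missing)}")
    allowed = REQUIRED_CONFIGS + OPTIONAL_CONFIGS
    unknown = [key for key in configs if key not in allowed]
    if unknown:
        raise ValueError(f"unknown config blocks: {', '.join(unknown)}")
    body = {"name": name, **configs}
    if labels:
        body["labels"] = labels
    return body


# Same payload as the examples above; optional blocks are simply left out.
request = build_job_request(
    "quickstart-lora",
    labels={"team": "quickstart"},
    training_config={
        "training_type": "LORA",
        "base_model": {"model_id": "meta-llama/Llama-3-8B"},
        "framework": {"type": "HF_TRAINER"},
        "hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"},
    },
    data_config={
        "sources": [{"path": "scaidrive://my-tenant/support.jsonl", "format": "jsonl"}],
        "max_seq_length": 2048,
        "batch_size": 8,
    },
    resource_config={"gpu_count": 1, "gpu_type": "A100-80GB"},
)
print(sorted(request))
```

The returned dict can be passed as the json argument of the httpx.post call shown earlier.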

3. Poll the job

bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

The status walks the lifecycle: PENDING → QUEUED → SCHEDULING → PREPARING → TRAINING → CHECKPOINTING → EVALUATING → EXPORTING → COMPLETED. A pause moves it to PAUSED; a cancel ends it in CANCELLED; an unrecoverable error ends it in FAILED.
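A polling loop only needs to know which statuses are final. The sketch below treats COMPLETED, CANCELLED, and FAILED as terminal, per the lifecycle above, and keeps polling through PAUSED. wait_for_job is a hypothetical helper; fetch_status stands in for whatever function you write to call the job endpoint and return its status string.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"COMPLETED", "CANCELLED", "FAILED"}


def wait_for_job(
    fetch_status: Callable[[], str],
    interval: float = 10.0,
    max_polls: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status is seen or max_polls fetches were made.

    Returns the last status seen, which may be non-terminal on timeout.
    """
    status = fetch_status()
    for _ in range(max_polls - 1):
        if status in TERMINAL_STATUSES:
            break
        sleep(interval)
        status = fetch_status()
    return status


# Demo with a canned status sequence instead of real HTTP calls.
fake_statuses = iter(["QUEUED", "TRAINING", "TRAINING", "COMPLETED"])
final = wait_for_job(lambda: next(fake_statuses), interval=0, sleep=lambda s: None)
print(final)  # COMPLETED
```

Injecting fetch_status and sleep keeps the loop testable without a cluster; in real use, pick an interval that respects whatever rate limits apply to your tenant.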

4. Stream metrics

Once the job reaches TRAINING, open the metrics stream. Each event is a JSON snapshot of loss, learning rate, throughput, and GPU utilisation.

bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/metrics/stream?interval=5" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

python
import httpx, os
with httpx.stream(
    "GET",
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/jobs/{os.environ['JOB_ID']}/metrics/stream",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    params={"interval": 5},
    timeout=None,
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line)

javascript
const resp = await fetch(
  `${process.env.SCAIGRID_HOST}/v1/modules/scaimind/jobs/${process.env.JOB_ID}/metrics/stream?interval=5`,
  { headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` } },
);
const reader = resp.body.getReader();
const dec = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries
  process.stdout.write(dec.decode(value, { stream: true }));
}

For a point-in-time read instead of a stream, use GET /jobs/{job_id}/metrics?max_points=200.
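As a rough sketch of the point-in-time read, the helper below builds the full URL and fetches it with the standard library. It assumes the short path above is relative to the same /v1/modules/scaimind prefix used by the other endpoints; metrics_url and fetch_metrics are hypothetical names, and the response shape is not specified here, so the parsed JSON is returned as-is.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

MODULE_PREFIX = "/v1/modules/scaimind"  # assumed prefix, as in the other examples


def metrics_url(host: str, job_id: str, max_points: int = 200) -> str:
    """Build the URL for a one-shot metrics read."""
    path = f"{MODULE_PREFIX}/jobs/{quote(job_id, safe='')}/metrics"
    return f"{host.rstrip('/')}{path}?{urlencode({'max_points': max_points})}"


def fetch_metrics(host: str, api_key: str, job_id: str, max_points: int = 200):
    """GET the metrics endpoint and return the parsed JSON body unchanged."""
    req = urllib.request.Request(
        metrics_url(host, job_id, max_points),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# "job-123" is a placeholder id, not a real job.
print(metrics_url("https://scaigrid.scailabs.ai", "job-123"))
```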

5. Tail logs

In another terminal:

bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/logs?follow=true&level=INFO&tail=100" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Log entries arrive as Server-Sent Events tagged log. Pass follow=false for a one-shot tail.
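If you consume the stream from Python rather than curl, you need to split it into events yourself. The sketch below is a small Server-Sent Events parser that works on already-decoded lines, such as those from resp.iter_lines() in step 4. iter_sse_events is a hypothetical helper, and the level and message fields in the sample payload are invented for illustration; the actual log payload shape is not documented on this page.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_type, data) pairs from decoded SSE lines."""
    event, data = "message", []  # "message" is the SSE default event type
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
    if data:
        # Lenient: flush an event left open when the stream ends.
        yield event, "\n".join(data)


# Invented sample stream; only the "log" tag comes from the docs.
sample = [
    "event: log",
    'data: {"level": "INFO", "message": "epoch 1 started"}',
    "",
    ": keep-alive",
    "event: log",
    'data: {"level": "INFO", "message": "step 100"}',
    "",
]
for event, data in iter_sse_events(sample):
    if event == "log":
        print(json.loads(data)["message"])
```

The same parser should work for the metrics stream in step 4, since both are framed as SSE; filter on whatever event tag that stream uses.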

What just happened

Your POST /jobs request was translated into a SubmitJob gRPC call against MindCoordinator, with your tenant id, user id, and (when configured) a downstream ScaiDrive token forwarded as gRPC metadata.
The coordinator queued the job, picked a node with a free GPU matching gpu_type, fetched the dataset (using the forwarded token), and started training under the requested framework.
ScaiMind's local cache (mod_scaimind_jobs) is populated by job list calls, which are sibling endpoints of the cluster call from step 1, so the admin UI dashboard already shows the job.
The metric and log streams are direct passthroughs of MindCoordinator's gRPC streams, framed as SSE events.

Next

Submit a LoRA fine-tune for a production-shape walkthrough.
Training jobs for the full config schema and lifecycle.
Run an evaluation once your job reaches COMPLETED.