---
summary: "Submit a LoRA fine-tune, watch its metrics, fetch the result \u2014 five\
  \ minutes end-to-end against a running cluster."
title: Quickstart
path: quickstart
status: published
---

In five minutes you will have a LoRA fine-tune queued on the cluster and a live stream of metrics coming back to your terminal.

You need:

- A ScaiGrid API key with the `scaimind:manage` permission (any tenant admin has this).
- A reachable MindCoordinator cluster with at least one online node.
- A training dataset already accessible to the coordinator (typically a path resolvable via the per-request ScaiDrive token).

```bash
export SCAIGRID_HOST="https://scaigrid.scailabs.ai"
export SCAIGRID_API_KEY="sgk_..."
```

## 1. Check cluster status

Before queueing anything, confirm the cluster has capacity.

```bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/cluster" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

```python
import httpx, os
r = httpx.get(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/cluster",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
)
print(r.json()["data"])
```

```javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/cluster`, {
  headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
});
console.log((await r.json()).data);
```

You will see counts for `online_nodes`, `total_gpus`, `available_gpus`, `queued_jobs`, and overall `cluster_utilization`.
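A minimal sketch of a pre-flight check on that response, assuming the listed fields arrive as plain numbers inside `data`. The helper name `has_capacity` and the sample values are hypothetical and not part of the ScaiMind API:

```python
def has_capacity(cluster: dict, gpus_needed: int = 1) -> bool:
    """Return True if the cluster snapshot suggests a job could start soon.

    Assumes `online_nodes` and `available_gpus` are integers, as the field
    names suggest; the real response may carry more detail.
    """
    if cluster.get("online_nodes", 0) < 1:
        return False
    return cluster.get("available_gpus", 0) >= gpus_needed


# Made-up snapshot shaped like the fields listed above.
snapshot = {
    "online_nodes": 2,
    "total_gpus": 8,
    "available_gpus": 3,
    "queued_jobs": 1,
    "cluster_utilization": 0.62,
}
print(has_capacity(snapshot, gpus_needed=1))
```

If the check fails, the job can still be submitted; it will likely just wait in the queue longer.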

## 2. Submit a LoRA job

The job request is one envelope with five nested configs: `training_config`, `data_config`, `resource_config`, and the optional `output_config` and `scheduling_config`.
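A quick local check of the envelope shape can catch a missing config before the request goes out. This is only a sketch: it assumes the three configs not marked optional are required, and `missing_configs` is a hypothetical helper, not something the API provides.

```python
# Configs the text above does not mark as optional (assumed required).
REQUIRED_CONFIGS = ("training_config", "data_config", "resource_config")


def missing_configs(envelope: dict) -> list:
    """Return the required nested configs absent from a job envelope."""
    return [key for key in REQUIRED_CONFIGS if key not in envelope]


# Incomplete draft envelope used only for illustration.
draft = {
    "name": "quickstart-lora",
    "training_config": {"training_type": "LORA"},
    "data_config": {"sources": []},
}
print(missing_configs(draft))  # → ['resource_config']
```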

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "quickstart-lora",
    "training_config": {
      "training_type": "LORA",
      "base_model": {"model_id": "meta-llama/Llama-3-8B"},
      "framework": {"type": "HF_TRAINER"},
      "hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"}
    },
    "data_config": {
      "sources": [{"path": "scaidrive://my-tenant/support.jsonl", "format": "jsonl"}],
      "max_seq_length": 2048,
      "batch_size": 8
    },
    "resource_config": {"gpu_count": 1, "gpu_type": "A100-80GB"},
    "labels": {"team": "quickstart"}
  }'
```

```python
import httpx, os
r = httpx.post(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/jobs",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    json={
        "name": "quickstart-lora",
        "training_config": {
            "training_type": "LORA",
            "base_model": {"model_id": "meta-llama/Llama-3-8B"},
            "framework": {"type": "HF_TRAINER"},
            "hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"},
        },
        "data_config": {
            "sources": [{"path": "scaidrive://my-tenant/support.jsonl", "format": "jsonl"}],
            "max_seq_length": 2048,
            "batch_size": 8,
        },
        "resource_config": {"gpu_count": 1, "gpu_type": "A100-80GB"},
        "labels": {"team": "quickstart"},
    },
)
job = r.json()["data"]
print(job["job_id"], job["status"])
```

```javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/jobs`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    name: "quickstart-lora",
    training_config: {
      training_type: "LORA",
      base_model: { model_id: "meta-llama/Llama-3-8B" },
      framework: { type: "HF_TRAINER" },
      hyperparameters: { learning_rate: "2e-4", num_train_epochs: "3" },
    },
    data_config: {
      sources: [{ path: "scaidrive://my-tenant/support.jsonl", format: "jsonl" }],
      max_seq_length: 2048,
      batch_size: 8,
    },
    resource_config: { gpu_count: 1, gpu_type: "A100-80GB" },
    labels: { team: "quickstart" },
  }),
});
const { data: job } = await r.json();
console.log(job.job_id, job.status);
```

Save the returned `job_id`, for example in a `JOB_ID` environment variable, which the remaining steps read. The response also includes `estimated_wait_seconds`, an estimate based on the current queue.

## 3. Poll the job

```bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

The status walks the lifecycle: `PENDING` → `QUEUED` → `SCHEDULING` → `PREPARING` → `TRAINING` → `CHECKPOINTING` → `EVALUATING` → `EXPORTING` → `COMPLETED`. A pause moves it to `PAUSED`; a cancel ends it in `CANCELLED`; an unrecoverable error ends it in `FAILED`.
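The lifecycle above lends itself to a simple polling loop that stops on a terminal state. The sketch below assumes `COMPLETED`, `CANCELLED`, and `FAILED` are the only terminal states. `wait_for_job` and its `fetch_status` argument are hypothetical names; in practice `fetch_status` would wrap the `GET` request shown above and return the job's `status` string.

```python
import time
from typing import Callable

# Terminal states taken from the lifecycle described above.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED"}


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_seconds: float = 10.0,
    max_polls: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state and return that state."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        # PAUSED and every in-progress state keep the loop waiting.
        sleep(poll_seconds)
    raise TimeoutError(f"job not finished after {max_polls} polls")


# Dry run against a canned sequence instead of the real API.
canned = iter(["QUEUED", "TRAINING", "EXPORTING", "COMPLETED"])
print(wait_for_job(lambda: next(canned), sleep=lambda s: None))
```

Injecting `sleep` keeps the loop testable without real delays; the poll interval and cap are arbitrary defaults.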

## 4. Stream metrics

Once the job reaches `TRAINING`, open the metrics stream. Each event is a JSON snapshot — loss, learning rate, throughput, GPU utilisation.

```bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/metrics/stream?interval=5" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

```python
import httpx, os
with httpx.stream(
    "GET",
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/jobs/{os.environ['JOB_ID']}/metrics/stream",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    params={"interval": 5},
    timeout=None,
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line)
```

```javascript
const resp = await fetch(
  `${process.env.SCAIGRID_HOST}/v1/modules/scaimind/jobs/${process.env.JOB_ID}/metrics/stream?interval=5`,
  { headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` } },
);
const reader = resp.body.getReader();
const dec = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries.
  process.stdout.write(dec.decode(value, { stream: true }));
}
```

For a point-in-time read instead of a stream, use `GET /jobs/{job_id}/metrics?max_points=200`.

## 5. Tail logs

In another terminal:

```bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/logs?follow=true&level=INFO&tail=100" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

Log lines arrive as Server-Sent Events with the event type `log`. Pass `follow=false` for a one-shot tail.
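To work with the stream programmatically rather than reading raw output, the SSE framing can be parsed line by line. The sketch below handles only the basic `event:` and `data:` fields of the SSE format. `parse_sse` is a hypothetical helper, and the `level` and `message` keys in the sample payload are assumptions made for illustration, not the documented log schema.

```python
import json
from typing import Iterable, Iterator, Tuple


def _field_value(line: str, name: str) -> str:
    """Return the value after 'name:', dropping one leading space if present."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_type, data) pairs from SSE text lines.

    A blank line ends each event. Comment lines (starting with ':') and
    other fields such as `id:` are ignored.
    """
    event_type, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_lines:
                yield event_type, "\n".join(data_lines)
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue
        elif line.startswith("event:"):
            event_type = _field_value(line, "event")
        elif line.startswith("data:"):
            data_lines.append(_field_value(line, "data"))


# Made-up sample shaped like a log stream; the payload keys are assumed.
sample = [
    "event: log",
    'data: {"level": "INFO", "message": "epoch 1 started"}',
    "",
    ": keep-alive",
    "event: log",
    'data: {"level": "INFO", "message": "step 100"}',
    "",
]
for event_type, data in parse_sse(sample):
    if event_type == "log":
        print(json.loads(data)["message"])
```

The same generator could in principle be fed the lines from a streaming HTTP response, such as the `resp.iter_lines()` loop used for metrics in step 4.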

## What just happened

- Your `POST /jobs` request was translated into a `SubmitJob` gRPC call against MindCoordinator, with your tenant id, user id, and (when configured) a downstream ScaiDrive token forwarded as gRPC metadata.
- The coordinator queued the job, picked a node with a free GPU matching `gpu_type`, fetched the dataset (using the forwarded token), and started training under the requested framework.
- ScaiMind's local cache (`mod_scaimind_jobs`) is populated by job-list calls (sibling endpoints of the one used in step 1), so the admin UI dashboard shows the job once such a call has run.
- The metric and log streams are direct passthroughs of MindCoordinator's gRPC streams, framed as SSE events.

## Next

- [Submit a LoRA fine-tune](./tutorials/submit-a-lora-finetune) for a production-shape walkthrough.
- [Training jobs](./concepts/training-jobs) for the full config schema and lifecycle.
- [Run an evaluation](./tutorials/run-an-evaluation) once your job reaches `COMPLETED`.
