Quickstart
In five minutes you will have a LoRA fine-tune queued on the cluster and a live stream of metrics coming back to your terminal.
You need:
- A ScaiGrid API key with the scaimind:manage permission (any tenant admin has this).
- A reachable MindCoordinator cluster with at least one online node.
- A training dataset already accessible to the coordinator (typically a path resolvable via the per-request ScaiDrive token).
| export SCAIGRID_HOST="https://scaigrid.scailabs.ai"
export SCAIGRID_API_KEY="sgk_..."
|
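The Python examples on this page read both variables from the environment on every call. As a minimal sketch, you could collect them once and fail early when one is missing. `load_settings` and `auth_headers` are hypothetical helper names for illustration, not part of any ScaiGrid SDK.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the two quickstart variables, failing early if either is unset."""
    missing = [k for k in ("SCAIGRID_HOST", "SCAIGRID_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        # Strip a trailing slash so f"{host}/v1/..." never yields "//v1".
        "host": env["SCAIGRID_HOST"].rstrip("/"),
        "api_key": env["SCAIGRID_API_KEY"],
    }


def auth_headers(settings: dict) -> dict:
    """Build the bearer header used by every request in this quickstart."""
    return {"Authorization": f"Bearer {settings['api_key']}"}
```

With these in place, each request can pass `headers=auth_headers(settings)` instead of rebuilding the header inline.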
1. Check cluster status
Before queueing anything, confirm the cluster has capacity.
| curl "$SCAIGRID_HOST/v1/modules/scaimind/cluster" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
| import httpx, os
r = httpx.get(
f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/cluster",
headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
)
print(r.json()["data"])
|
| const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/cluster`, {
headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
});
console.log((await r.json()).data);
|
You will see counts for online_nodes, total_gpus, available_gpus, queued_jobs, and overall cluster_utilization.
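To turn those counts into a go/no-go decision, a small check like the following may help. It is a sketch that assumes online_nodes and available_gpus are integers inside the `data` object. `has_capacity` is a hypothetical helper, and the sample values are invented.

```python
def has_capacity(cluster: dict, gpus_needed: int = 1) -> bool:
    """Return True when at least one node is online and enough GPUs are free."""
    return (
        cluster.get("online_nodes", 0) >= 1
        and cluster.get("available_gpus", 0) >= gpus_needed
    )


# Illustrative payload only; the values are made up for this sketch.
sample = {"online_nodes": 2, "total_gpus": 8, "available_gpus": 3, "queued_jobs": 1}
print(has_capacity(sample, gpus_needed=1))  # True
print(has_capacity(sample, gpus_needed=4))  # False
```

The check only looks at totals. A free GPU of a different gpu_type than the one you request would still leave your job waiting in the queue.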
2. Submit a LoRA job
The job request is one envelope with five nested configs: training_config, data_config, resource_config, and the optional output_config and scheduling_config.
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "quickstart-lora",
"training_config": {
"training_type": "LORA",
"base_model": {"model_id": "meta-llama/Llama-3-8B"},
"framework": {"type": "HF_TRAINER"},
"hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"}
},
"data_config": {
"sources": [{"path": "scaidrive://my-tenant/support.jsonl", "format": "jsonl"}],
"max_seq_length": 2048,
"batch_size": 8
},
"resource_config": {"gpu_count": 1, "gpu_type": "A100-80GB"},
"labels": {"team": "quickstart"}
}'
|
| import httpx, os
r = httpx.post(
f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/jobs",
headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
json={
"name": "quickstart-lora",
"training_config": {
"training_type": "LORA",
"base_model": {"model_id": "meta-llama/Llama-3-8B"},
"framework": {"type": "HF_TRAINER"},
"hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"},
},
"data_config": {
"sources": [{"path": "scaidrive://my-tenant/support.jsonl", "format": "jsonl"}],
"max_seq_length": 2048,
"batch_size": 8,
},
"resource_config": {"gpu_count": 1, "gpu_type": "A100-80GB"},
"labels": {"team": "quickstart"},
},
)
job = r.json()["data"]
print(job["job_id"], job["status"])
|
| const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/jobs`, {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
name: "quickstart-lora",
training_config: {
training_type: "LORA",
base_model: { model_id: "meta-llama/Llama-3-8B" },
framework: { type: "HF_TRAINER" },
hyperparameters: { learning_rate: "2e-4", num_train_epochs: "3" },
},
data_config: {
sources: [{ path: "scaidrive://my-tenant/support.jsonl", format: "jsonl" }],
max_seq_length: 2048,
batch_size: 8,
},
resource_config: { gpu_count: 1, gpu_type: "A100-80GB" },
labels: { team: "quickstart" },
}),
});
const { data: job } = await r.json();
console.log(job.job_id, job.status);
|
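If you build the envelope programmatically, a client-side check can catch a missing block before the request leaves your machine. The sketch below assumes the three blocks not marked optional (training_config, data_config, resource_config) are required. `build_job_request` is a hypothetical helper, and the server's own validation rules may differ.

```python
from typing import Optional

REQUIRED_CONFIGS = ("training_config", "data_config", "resource_config")
OPTIONAL_CONFIGS = ("output_config", "scheduling_config")


def build_job_request(name: str, labels: Optional[dict] = None, **configs: dict) -> dict:
    """Assemble the job envelope, rejecting missing or unknown config blocks."""
    missing = [k for k in REQUIRED_CONFIGS if k not in configs]
    unknown = [k for k in configs if k not in REQUIRED_CONFIGS + OPTIONAL_CONFIGS]
    if missing:
        raise ValueError(f"missing required configs: {missing}")
    if unknown:
        raise ValueError(f"unknown configs: {unknown}")
    request = {"name": name, **configs}
    if labels:
        request["labels"] = labels
    return request
```

The returned dict can be passed as the `json=` argument of the `httpx.post` call shown above.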
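Save the returned job_id; the remaining steps read it from the JOB_ID environment variable (for example, export JOB_ID=<job_id>). The response also includes estimated_wait_seconds, based on the current queue.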
3. Poll the job
| curl "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
The status walks the lifecycle: PENDING → QUEUED → SCHEDULING → PREPARING → TRAINING → CHECKPOINTING → EVALUATING → EXPORTING → COMPLETED. A pause moves it to PAUSED; a cancel ends it in CANCELLED; an unrecoverable error ends it in FAILED.
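The lifecycle above has three end states, so a polling loop only needs to stop on COMPLETED, CANCELLED, or FAILED and keep waiting through everything else, including PAUSED. Here is a minimal sketch. `wait_for_job` is a hypothetical helper, and `get_status` is any callable you supply that returns the current status string.

```python
import time
from typing import Callable

TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED"}


def wait_for_job(
    get_status: Callable[[], str],
    interval: float = 10.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state, then return that state."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still {status} after {waited:.0f}s")
        # PAUSED and every in-flight state fall through to another poll.
        sleep(interval)
        waited += interval
```

In practice `get_status` would wrap the GET request from this step and pull the status field out of the response. The exact response shape is not shown here, so check it before relying on a particular key path.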
4. Stream metrics
Once the job reaches TRAINING, open the metrics stream. Each event is a JSON snapshot — loss, learning rate, throughput, GPU utilisation.
| curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/metrics/stream?interval=5" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
| import httpx, os
with httpx.stream(
"GET",
f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/jobs/{os.environ['JOB_ID']}/metrics/stream",
headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
params={"interval": 5},
timeout=None,
) as resp:
for line in resp.iter_lines():
if line:
print(line)
|
| const resp = await fetch(
`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/jobs/${process.env.JOB_ID}/metrics/stream?interval=5`,
{ headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` } },
);
const reader = resp.body.getReader();
const dec = new TextDecoder();
while (true) {
const { value, done } = await reader.read();
if (done) break;
// stream: true keeps multi-byte characters intact when split across chunks
process.stdout.write(dec.decode(value, { stream: true }));
}
|
For a point-in-time read instead of a stream, use GET /jobs/{job_id}/metrics?max_points=200.
5. Tail logs
In another terminal:
| curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/logs?follow=true&level=INFO&tail=100" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
Log entries arrive as Server-Sent Events tagged log. Pass follow=false for a one-shot tail.
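If you consume the log stream from Python rather than curl, you need to group the raw lines back into events. The sketch below follows the standard Server-Sent Events framing (`event:` and `data:` fields, with a blank line ending each event). `iter_sse_events` is a hypothetical helper, and the shape of each log payload is not specified here, so the data is returned as an undecoded string.

```python
from typing import Iterable, Iterator, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one optional space after the colon
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```

Feeding it `resp.iter_lines()` from an `httpx.stream(...)` call and keeping only pairs whose name is `log` gives you one string per log entry. The same parser should work for the metrics stream in step 4.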
What just happened
- Your POST /jobs request was translated into a SubmitJob gRPC call against MindCoordinator, with your tenant id, user id, and (when configured) a downstream ScaiDrive token forwarded as gRPC metadata.
- The coordinator queued the job, picked a node with a free GPU matching gpu_type, fetched the dataset (using the forwarded token), and started training under the requested framework.
- ScaiMind's local cache (mod_scaimind_jobs) was populated by the list call you made in step 1's siblings, so the admin UI dashboard already shows the job.
- The metric and log streams are direct passthroughs of MindCoordinator's gRPC streams, framed as SSE events.
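To make the metadata forwarding in the first bullet concrete, here is a rough sketch of how such metadata could be assembled. The key names (`x-tenant-id`, `x-user-id`, `x-scaidrive-token`) and `build_grpc_metadata` are invented for illustration; the actual keys ScaiMind uses are not documented here. The only firm part is the shape: grpc-python accepts metadata as a sequence of (key, value) pairs with lowercase keys.

```python
from typing import List, Optional, Tuple

# Hypothetical key names; the real ones are not given in this quickstart.
TENANT_KEY = "x-tenant-id"
USER_KEY = "x-user-id"
DRIVE_TOKEN_KEY = "x-scaidrive-token"


def build_grpc_metadata(
    tenant_id: str, user_id: str, scaidrive_token: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Return metadata as (key, value) pairs, the shape grpc-python accepts."""
    metadata = [(TENANT_KEY, tenant_id), (USER_KEY, user_id)]
    if scaidrive_token:
        # The token is only forwarded when one is configured.
        metadata.append((DRIVE_TOKEN_KEY, scaidrive_token))
    return metadata
```

A generated gRPC stub would take this list through the `metadata=` keyword argument of the call.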
Next