Submit a LoRA fine-tune
You are going from a JSONL dataset on ScaiDrive to a finished LoRA adapter. The shape stays the same for SFT, QLORA, and DPO; only the training_type and a few hyperparameters change.
Time required: roughly the duration of the training run itself, typically minutes to hours depending on dataset size and GPU count.
1. Validate the data first
Before queueing GPU time, ask the coordinator if it can reach and parse your dataset.
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/data/validate" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"data_sources": [
{"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
]
}'
|
| import httpx, os
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
r = httpx.post(
f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/data/validate",
headers=H,
json={"data_sources": [
{"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
]},
)
print(r.json()["data"])
|
| const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/data/validate`, {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
data_sources: [
{ path: "scaidrive://acme/training/support.jsonl", format: "jsonl" },
],
}),
});
console.log((await r.json()).data);
|
Each validation result reports accessible, size_bytes, record_count, format, and an error_message if the source failed.
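A pre-flight check can turn those fields into a hard stop before you submit. The sketch below assumes the results arrive as a list of objects carrying the field names listed above. Where that list sits inside the response's data object is not specified here, so the function takes the list directly. The helper name failed_sources and the "path" key used for display are hypothetical.

```python
from typing import Any


def failed_sources(results: list[dict[str, Any]]) -> list[str]:
    """Return one human-readable line per source that failed validation.

    Assumes each result carries accessible, record_count and error_message;
    the "path" key is an assumption used only for labelling.
    """
    problems = []
    for res in results:
        label = res.get("path", "<unknown source>")
        if not res.get("accessible", False):
            problems.append(f"{label}: {res.get('error_message') or 'not accessible'}")
        elif res.get("record_count", 0) == 0:
            problems.append(f"{label}: parsed, but contains no records")
    return problems


# Hand-written results for illustration (not real API output).
sample = [
    {"path": "a.jsonl", "accessible": True, "record_count": 1200},
    {"path": "b.jsonl", "accessible": False, "error_message": "permission denied"},
]
for line in failed_sources(sample):
    print(line)
```

Treating an empty but accessible source as a failure is a design choice of this sketch, not documented API behaviour.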
2. Submit the job
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "acme-support-lora-v3",
"training_config": {
"training_type": "LORA",
"base_model": {
"model_id": "meta-llama/Llama-3-8B",
"dtype": "bfloat16"
},
"framework": {
"type": "HF_TRAINER",
"config": {"lora_r": "16", "lora_alpha": "32"}
},
"hyperparameters": {
"learning_rate": "2e-4",
"num_train_epochs": "3",
"warmup_ratio": "0.05"
},
"max_retries": 1,
"priority": 6
},
"data_config": {
"sources": [
{"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
],
"preprocess": {"chat_template": "llama-3"},
"max_seq_length": 4096,
"batch_size": 8,
"gradient_accumulation_steps": 4,
"validation_split": 0.05,
"seed": 42
},
"resource_config": {
"gpu_count": 4,
"gpu_type": "A100-80GB",
"ram_min_mb": 65536
},
"output_config": {
"output_model_name": "acme-support-lora-v3",
"output_path": "scaidrive://acme/models/support-v3/",
"checkpoint": {
"save_strategy": "steps",
"save_steps": 500,
"save_total_limit": 3,
"metric_for_best_model": "eval_loss",
"greater_is_better": false
},
"merge_lora": false
},
"scheduling_config": {
"queue": "default",
"priority": 6,
"preemptible": false,
"max_runtime_seconds": 21600
},
"labels": {
"team": "support",
"experiment": "lora-v3",
"dataset_version": "2026-05-01"
}
}'
|
Save the returned job_id; every later step needs it. The response also includes estimated_wait_seconds, an estimate of how long the job will sit in the queue before it starts.
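Because only training_type and a few hyperparameters change between methods, one request body can serve as a template. The following is a minimal sketch of deriving a variant from it. The helper name with_training_type, the variant job name, and the QLORA learning rate are illustrative assumptions, not recommended values.

```python
import copy
from typing import Any


def with_training_type(payload: dict[str, Any], training_type: str,
                       **hyperparameters: str) -> dict[str, Any]:
    """Return a copy of a job payload with a different training_type.

    Overrides are merged into training_config.hyperparameters; the
    original payload is left untouched.
    """
    variant = copy.deepcopy(payload)
    config = variant["training_config"]
    config["training_type"] = training_type
    config.setdefault("hyperparameters", {}).update(hyperparameters)
    return variant


# Trimmed-down version of the request body shown above.
lora_job = {
    "name": "acme-support-lora-v3",
    "training_config": {
        "training_type": "LORA",
        "hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"},
    },
}

# Hypothetical QLORA variant; name and learning rate are made up.
qlora_job = with_training_type(lora_job, "QLORA", learning_rate="1e-4")
qlora_job["name"] = "acme-support-qlora-v1"
print(qlora_job["training_config"])
```

Hyperparameter values stay strings here to match the request body above.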
3. Watch it move through the lifecycle
| import httpx, os, time
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
JOB = os.environ["JOB_ID"]
HOST = os.environ["SCAIGRID_HOST"]
# Poll every 10 seconds until the job reaches a terminal status.
while True:
    r = httpx.get(f"{HOST}/v1/modules/scaimind/jobs/{JOB}", headers=H).json()["data"]
    print(r["status"])
    if r["status"] in {"COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"}:
        break
    time.sleep(10)
|
| const H = { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` };
const HOST = process.env.SCAIGRID_HOST;
const JOB = process.env.JOB_ID;
while (true) {
const r = await fetch(`${HOST}/v1/modules/scaimind/jobs/${JOB}`, { headers: H });
const { data } = await r.json();
console.log(data.status);
if (["COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"].includes(data.status)) break;
await new Promise(res => setTimeout(res, 10_000));
}
|
| while true; do
s=$(curl -s "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
| python -c "import sys,json; print(json.load(sys.stdin)['data']['status'])")
echo "$s"
case "$s" in COMPLETED|FAILED|CANCELLED|PREEMPTED) break ;; esac
sleep 10
done
|
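The loops above poll forever, which is fine interactively but risky in automation. A possible variant with a deadline is sketched below. It takes the status lookup as a callable, so the same function can wrap the GET request above or be exercised without a network. The function name wait_for_job, the default timeout, and the non-terminal status names in the demo are assumptions.

```python
import time
from typing import Callable

# Terminal statuses as used in the polling loops above.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"}


def wait_for_job(get_status: Callable[[], str], timeout_s: float = 6 * 3600,
                 interval_s: float = 10.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until a terminal status or the deadline.

    get_status is any zero-argument callable returning the status string.
    Raises TimeoutError if no terminal status is seen in time.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        # Stop if the next poll would land after the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"job still {status} after {timeout_s} s")
        sleep(interval_s)


# Demo with canned statuses; QUEUED and RUNNING are assumed names.
canned = iter(["QUEUED", "RUNNING", "RUNNING", "COMPLETED"])
print(wait_for_job(lambda: next(canned), sleep=lambda s: None))
```

In real use, get_status would be a small function that performs the GET request and returns data["status"].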
4. Stream metrics while it trains
In another terminal:
| curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/metrics/stream?interval=10" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
Each "event: metrics" message in the stream carries a JSON snapshot with the current step, loss, learning rate, throughput, and GPU utilisation. The same data is plotted live on the Training Monitor page of the admin UI.
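To consume the stream from a script rather than a terminal, the events need to be split and decoded. The sketch below assumes the endpoint emits standard server-sent events ("event:" and "data:" lines, with a blank line ending each event), which the "event: metrics" marker suggests but which is not confirmed here. The JSON keys in the sample are also assumptions.

```python
import json
from typing import Any, Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, Any]]:
    """Yield (event_name, parsed_json) for each event in an SSE line stream.

    Handles a common subset of the format: "event:" and "data:" fields,
    comment lines starting with ":", and a blank line ending each event.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:                      # blank line: dispatch the event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):        # comment or keep-alive
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip())


# Hand-written sample; the JSON keys are assumptions, not the real schema.
sample = [
    "event: metrics\n",
    'data: {"step": 500, "loss": 1.42}\n',
    "\n",
]
for name, snapshot in iter_sse_events(sample):
    if name == "metrics":
        print(snapshot["step"], snapshot["loss"])
```

Any iterable of text lines works as input, for example the line iterator of a streaming HTTP response.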
5. Tail logs
| curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/logs?follow=true&level=INFO&tail=200" \
-H "Authorization: Bearer $SCAIGRID_API_KEY"
|
For a one-shot dump of the last N lines, set follow=false.
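When the log request is built in code, the query string is easier to keep correct with a small helper. This sketch uses only the three query parameters shown above. The function name logs_url, the example host, and the job ID are placeholders.

```python
from urllib.parse import urlencode


def logs_url(host: str, job_id: str, *, follow: bool = False,
             level: str = "INFO", tail: int = 200) -> str:
    """Build the logs URL using the query parameters shown above."""
    query = urlencode({
        "follow": "true" if follow else "false",  # lowercase, as in the curl call
        "level": level,
        "tail": tail,
    })
    return f"{host}/v1/modules/scaimind/jobs/{job_id}/logs?{query}"


# One-shot dump of the last 50 lines; host and job ID are placeholders.
print(logs_url("https://scaigrid.example.com", "job-123", tail=50))
```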
6. Handle a failure with retry
If the job ends in FAILED or PREEMPTED, inspect the error, then retry from the most recent checkpoint, optionally with different resources:
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/retry" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"checkpoint_id": "",
"modify_resources": true,
"new_resource_config": {
"gpu_count": 8,
"gpu_type": "H100"
}
}'
|
A new child job is created with parent_job_id set to the failed run. The original stays in FAILED as the audit trail.
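In an automated pipeline, the retry decision and request body can be derived from the final status. The following is a sketch only: the helper name retry_body is hypothetical, and it assumes that sending modify_resources as false with no new_resource_config is accepted when the resources stay the same.

```python
from typing import Any, Optional

# Statuses the text above describes as worth retrying.
RETRYABLE = {"FAILED", "PREEMPTED"}


def retry_body(status: str, gpu_count: Optional[int] = None,
               gpu_type: Optional[str] = None,
               checkpoint_id: str = "") -> Optional[dict[str, Any]]:
    """Return a retry request body, or None if the status is not retryable.

    checkpoint_id defaults to the empty string, as in the request above.
    """
    if status not in RETRYABLE:
        return None
    body: dict[str, Any] = {"checkpoint_id": checkpoint_id,
                            "modify_resources": False}
    requested = {"gpu_count": gpu_count, "gpu_type": gpu_type}
    resources = {k: v for k, v in requested.items() if v is not None}
    if resources:
        # Only ask for a resource change when something was overridden.
        body["modify_resources"] = True
        body["new_resource_config"] = resources
    return body


print(retry_body("FAILED", gpu_count=8, gpu_type="H100"))
print(retry_body("COMPLETED"))
```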
7. Pause to free GPUs
If a higher-priority experiment lands and you want to give up GPU time without throwing work away:
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/pause" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{"save_checkpoint": true}'
|
Resume later:
| curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/resume" \
-H "Authorization: Bearer $SCAIGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{"checkpoint_id": ""}'
|
8. Fetch the trained artefact
GET /jobs/{id} on a COMPLETED job returns the full proto-derived response. The output_config.output_path you supplied is where the coordinator wrote the trained adapter or merged model. From there, use ScaiDrive (or whichever storage protocol the path points at) to download the artefact, or hand it off to a ScaiGrid backend registration.
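A script that picks up the artefact first has to read the output location from the job record. The sketch below assumes output_config sits at the top level of the job's data object, mirroring the submit request. The helper name adapter_location is hypothetical, and what the authority part of a scaidrive:// URI denotes is not specified here.

```python
from typing import Any
from urllib.parse import urlsplit


def adapter_location(job: dict[str, Any]) -> tuple[str, str, str]:
    """Return (scheme, authority, path) of a completed job's output_path."""
    if job.get("status") != "COMPLETED":
        raise ValueError(f"job is {job.get('status')}, not COMPLETED")
    parts = urlsplit(job["output_config"]["output_path"])
    return parts.scheme, parts.netloc, parts.path


# Hand-written job record using the output_path from the submit request.
job = {
    "status": "COMPLETED",
    "output_config": {"output_path": "scaidrive://acme/models/support-v3/"},
}
print(adapter_location(job))
```

Checking the status first avoids pointing a download step at a path that may hold only partial checkpoints.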
Done
You have a finished LoRA run with checkpoints, metrics, logs, and a child-retry pattern in place. Vary training_type to SFT, QLORA, DPO, or RLHF and the rest of the recipe stays largely the same.