Submit a LoRA fine-tune

You are going from a JSONL dataset on ScaiDrive to a finished LoRA adapter. The shape stays the same for SFT, QLORA, and DPO; only the training_type and a few hyperparameters change.

Expect the walkthrough to take roughly as long as the training run itself: typically minutes to hours, depending on dataset size and GPU count.

1. Validate the data first

Before queueing GPU time, ask the coordinator if it can reach and parse your dataset.

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/data/validate" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data_sources": [
      {"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
    ]
  }'
python
import httpx, os
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
r = httpx.post(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/data/validate",
    headers=H,
    json={"data_sources": [
        {"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
    ]},
)
print(r.json()["data"])
javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/data/validate`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    data_sources: [
      { path: "scaidrive://acme/training/support.jsonl", format: "jsonl" },
    ],
  }),
});
console.log((await r.json()).data);

Each validation result reports accessible, size_bytes, record_count, format, and an error_message if the source failed.
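
A small gate on those results can stop you from queueing a job against an unreadable dataset. The sketch below assumes the validation payload is a list of result objects carrying the fields named above. The helper name unusable_sources and the path key on each result are illustrative assumptions, not confirmed parts of the API.

```python
def unusable_sources(results):
    """Return (path, reason) pairs for sources that should block submission."""
    problems = []
    for res in results:
        # Assumption: each result echoes the path of the source it describes.
        path = res.get("path", "<unknown>")
        if not res.get("accessible", False):
            problems.append((path, res.get("error_message") or "not accessible"))
        elif res.get("record_count", 0) == 0:
            problems.append((path, "no records parsed"))
    return problems


# Hand-written sample results, shaped after the fields listed above.
sample = [
    {"path": "scaidrive://acme/training/support.jsonl", "accessible": True,
     "size_bytes": 1048576, "record_count": 1200, "format": "jsonl"},
    {"path": "scaidrive://acme/training/missing.jsonl", "accessible": False,
     "error_message": "not found"},
]
for path, reason in unusable_sources(sample):
    print(f"{path}: {reason}")
```

If the list comes back empty, the dataset is at least reachable and non-empty, which is a reasonable bar before spending GPU time.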

2. Submit the job

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "acme-support-lora-v3",
    "training_config": {
      "training_type": "LORA",
      "base_model": {
        "model_id": "meta-llama/Llama-3-8B",
        "dtype": "bfloat16"
      },
      "framework": {
        "type": "HF_TRAINER",
        "config": {"lora_r": "16", "lora_alpha": "32"}
      },
      "hyperparameters": {
        "learning_rate": "2e-4",
        "num_train_epochs": "3",
        "warmup_ratio": "0.05"
      },
      "max_retries": 1,
      "priority": 6
    },
    "data_config": {
      "sources": [
        {"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
      ],
      "preprocess": {"chat_template": "llama-3"},
      "max_seq_length": 4096,
      "batch_size": 8,
      "gradient_accumulation_steps": 4,
      "validation_split": 0.05,
      "seed": 42
    },
    "resource_config": {
      "gpu_count": 4,
      "gpu_type": "A100-80GB",
      "ram_min_mb": 65536
    },
    "output_config": {
      "output_model_name": "acme-support-lora-v3",
      "output_path": "scaidrive://acme/models/support-v3/",
      "checkpoint": {
        "save_strategy": "steps",
        "save_steps": 500,
        "save_total_limit": 3,
        "metric_for_best_model": "eval_loss",
        "greater_is_better": false
      },
      "merge_lora": false
    },
    "scheduling_config": {
      "queue": "default",
      "priority": 6,
      "preemptible": false,
      "max_runtime_seconds": 21600
    },
    "labels": {
      "team": "support",
      "experiment": "lora-v3",
      "dataset_version": "2026-05-01"
    }
  }'

Save the returned job_id; every later step uses it. The response also gives estimated_wait_seconds, so you know roughly when the job will start running.
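
To turn those two fields into something you can act on, a helper like the one below may be handy. It is a sketch that assumes job_id and estimated_wait_seconds sit together in one object (for example under the data envelope the other endpoints use). The function name and the sample values are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def expected_start(submit_data, now=None):
    """Return (job_id, estimated start time in UTC) from a submit response."""
    now = now or datetime.now(timezone.utc)
    # float() also copes with the value arriving as a string.
    wait = timedelta(seconds=float(submit_data["estimated_wait_seconds"]))
    return submit_data["job_id"], now + wait


# Hand-written example values; a real response comes from POST .../jobs.
job_id, start = expected_start(
    {"job_id": "job-123", "estimated_wait_seconds": 90},
    now=datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc),
)
print(f"export JOB_ID={job_id}  # expected to start around {start.isoformat()}")
```

Exporting JOB_ID this way lines up with the polling, metrics, and log commands in the following steps, which all read it from the environment.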

3. Watch it move through the lifecycle

python
import httpx, os, time
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
JOB = os.environ["JOB_ID"]
HOST = os.environ["SCAIGRID_HOST"]

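while True:
    resp = httpx.get(f"{HOST}/v1/modules/scaimind/jobs/{JOB}", headers=H)
    resp.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    status = resp.json()["data"]["status"]
    print(status)
    # Stop on any terminal state; anything else means the job is still in flight.
    if status in {"COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"}:
        break
    time.sleep(10)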
javascript
const H = { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` };
const HOST = process.env.SCAIGRID_HOST;
const JOB = process.env.JOB_ID;
while (true) {
  const r = await fetch(`${HOST}/v1/modules/scaimind/jobs/${JOB}`, { headers: H });
  const { data } = await r.json();
  console.log(data.status);
  if (["COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"].includes(data.status)) break;
  await new Promise(res => setTimeout(res, 10_000));
}
bash
while true; do
  s=$(curl -s "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID" \
    -H "Authorization: Bearer $SCAIGRID_API_KEY" \
    | python -c "import sys,json; print(json.load(sys.stdin)['data']['status'])")
  echo "$s"
  case "$s" in COMPLETED|FAILED|CANCELLED|PREEMPTED) break ;; esac
  sleep 10
done
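
The loops above poll every 10 seconds forever. For long runs you may prefer a deadline and a growing delay. The sketch below takes the status lookup as a callable so it stays independent of any HTTP client. The function name, the backoff values, and the non-terminal status names in the demo (QUEUED, RUNNING) are assumptions for illustration; the default timeout mirrors the max_runtime_seconds of 21600 used in the submit call.

```python
import time

# Terminal states, as used in the polling loops above.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"}


def wait_for_terminal(get_status, timeout_s=6 * 3600, first_delay_s=10,
                      max_delay_s=120, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal status or until timeout_s elapses."""
    deadline = clock() + timeout_s
    delay = first_delay_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if clock() + delay > deadline:
            raise TimeoutError(f"job still {status} after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off so long runs poll less often


# Demo with a canned sequence instead of real HTTP calls.
# QUEUED and RUNNING are hypothetical placeholder names for non-terminal states.
statuses = iter(["QUEUED", "RUNNING", "RUNNING", "COMPLETED"])
print(wait_for_terminal(lambda: next(statuses), sleep=lambda s: None))
```

In real use, get_status would wrap the GET request from the Python loop above and return the status string.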

4. Stream metrics while it trains

In another terminal:

bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/metrics/stream?interval=10" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Each "event: metrics" entry in the stream carries a JSON snapshot with the current step, loss, learning rate, throughput, and GPU utilisation. The same data is plotted live on the admin UI's Training Monitor page.
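
If you want the snapshots in a script rather than a terminal, the stream can be parsed line by line. The sketch below assumes the endpoint emits standard server-sent events (an event: line, one or more data: lines, then a blank line). The function name and the keys inside the sample snapshot are illustrative; check a real event for the actual field names.

```python
import json


def iter_metrics(lines):
    """Yield decoded JSON payloads from `event: metrics` blocks in an SSE stream."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":  # a blank line ends the current event
            if event == "metrics" and data:
                yield json.loads("\n".join(data))
            event, data = None, []
        # Any other line (for example a ": comment" keep-alive) is ignored.


# Hand-written sample; the keys inside the snapshot are illustrative.
sample = [
    'event: metrics\n',
    'data: {"step": 100, "loss": 1.92}\n',
    '\n',
    ': keep-alive comment\n',
    '\n',
]
for snap in iter_metrics(sample):
    print(snap["step"], snap["loss"])
```

With httpx, the lines could come from a streamed response, for example by iterating response.iter_lines() inside a client.stream("GET", ...) context.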

5. Tail logs

bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/logs?follow=true&level=INFO&tail=200" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

For a one-shot dump of the last N lines, set follow=false.
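
When you switch between following and one-shot dumps often, building the URL in code avoids quoting mistakes in the query string. This sketch uses only the three query parameters shown in the curl call above. The helper name and the example host are placeholders.

```python
from urllib.parse import urlencode


def logs_url(host, job_id, follow=True, level="INFO", tail=200):
    """Build the logs endpoint URL with the query parameters used above."""
    query = urlencode({
        "follow": "true" if follow else "false",  # lowercase, as in the curl call
        "level": level,
        "tail": tail,
    })
    return f"{host}/v1/modules/scaimind/jobs/{job_id}/logs?{query}"


# One-shot dump of the last 50 lines; host and job id are placeholders.
print(logs_url("https://grid.example.com", "job-123", follow=False, tail=50))
```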

6. Handle a failure with retry

If the job ends in FAILED or PREEMPTED, inspect the error and retry from the most recent checkpoint with optionally different resources:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/retry" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "checkpoint_id": "",
    "modify_resources": true,
    "new_resource_config": {
      "gpu_count": 8,
      "gpu_type": "H100"
    }
  }'

A new child job is created with parent_job_id set to the failed run. The original stays in FAILED as the audit trail.
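
If retries are scripted, it helps to keep the decision and the request body in one place. The sketch below builds the same body as the curl call above. It assumes an empty checkpoint_id selects the most recent checkpoint, as in that example, and that sending modify_resources as false without new_resource_config keeps the original resources. Both are assumptions to verify against the API reference, and the helper names are hypothetical.

```python
# Statuses this guide retries from.
RETRYABLE = {"FAILED", "PREEMPTED"}


def should_retry(status):
    """True when the job ended in a state this guide retries from."""
    return status in RETRYABLE


def retry_body(checkpoint_id="", gpu_count=None, gpu_type=None):
    """Build the JSON body for POST .../jobs/{id}/retry."""
    body = {"checkpoint_id": checkpoint_id, "modify_resources": False}
    new_resources = {}
    if gpu_count is not None:
        new_resources["gpu_count"] = gpu_count
    if gpu_type is not None:
        new_resources["gpu_type"] = gpu_type
    if new_resources:
        # Only ask for a resource change when something was actually overridden.
        body["modify_resources"] = True
        body["new_resource_config"] = new_resources
    return body


if should_retry("PREEMPTED"):
    print(retry_body(gpu_count=8, gpu_type="H100"))
```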

7. Pause to free GPUs

If a higher-priority experiment lands and you want to give up GPU time without throwing work away:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/pause" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"save_checkpoint": true}'

Resume later:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/resume" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"checkpoint_id": ""}'

8. Fetch the trained artefact

GET /v1/modules/scaimind/jobs/{id} on a COMPLETED job returns the full proto-derived response. The output_config.output_path you supplied is where the coordinator wrote the trained adapter or merged model. From there, use ScaiDrive (or whichever storage protocol output_path points at) to download the artefact or hand it off to a ScaiGrid backend registration.

Done

You have a finished LoRA run with checkpoints, metrics, logs, and a child-retry pattern in place. Vary training_type to SFT, QLORA, DPO, or RLHF and the rest of the recipe stays largely the same.
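
One way to reuse the recipe is to keep the job payload as a dict and derive variants from it. The sketch below copies a payload and swaps training_type plus any hyperparameter overrides. The helper name is hypothetical, the trimmed base payload stands in for the full submit body from step 2, and the learning rate shown is a placeholder rather than a recommendation.

```python
import copy


def with_training_type(job, training_type, **hyperparameters):
    """Return a copy of a job payload with a new training_type and overrides."""
    variant = copy.deepcopy(job)  # leave the original payload untouched
    cfg = variant["training_config"]
    cfg["training_type"] = training_type
    # Hyperparameters are sent as strings in the submit call above.
    cfg.setdefault("hyperparameters", {}).update(
        {key: str(value) for key, value in hyperparameters.items()}
    )
    return variant


# Trimmed stand-in for the full submit payload.
base = {
    "name": "acme-support-lora-v3",
    "training_config": {
        "training_type": "LORA",
        "hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"},
    },
}
qlora = with_training_type(base, "QLORA", learning_rate="1e-4")
print(qlora["training_config"])
```

You would probably also change name, output_model_name, and output_path on the copy so the variant does not overwrite the original run's artefacts.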

Updated 2026-05-18 15:01:31, rev 12