Submit a LoRA fine-tune

You are going from a JSONL dataset on ScaiDrive to a finished LoRA adapter. The shape stays the same for SFT, QLORA, and DPO; only the training_type and a few hyperparameters change.

Time required: roughly the duration of the training run itself, typically minutes to hours depending on dataset size and GPU count.

1. Validate the data first

Before queueing GPU time, ask the coordinator whether it can reach and parse your dataset.

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/data/validate" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data_sources": [
      {"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
    ]
  }'

python
import httpx, os
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
r = httpx.post(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/data/validate",
    headers=H,
    json={"data_sources": [
        {"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
    ]},
)
print(r.json()["data"])

javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/data/validate`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    data_sources: [
      { path: "scaidrive://acme/training/support.jsonl", format: "jsonl" },
    ],
  }),
});
console.log((await r.json()).data);

Each validation result reports accessible, size_bytes, record_count, format, and an error_message if the source failed.
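
A small gate on these results keeps an unreadable dataset from ever reaching the queue. The sketch below assumes the per-source results arrive as a list of objects carrying the fields named above; that list shape, the helper name unusable_sources, and the sample values are assumptions for illustration, not documented behaviour.

```python
def unusable_sources(results):
    """Return (label, reason) for every source that should block submission.

    `results` is assumed to be a list of dicts with the fields described
    above (accessible, record_count, error_message).
    """
    problems = []
    for i, res in enumerate(results):
        # The result may not echo the path back, so fall back to the index.
        label = res.get("path", f"source[{i}]")
        if not res.get("accessible", False):
            problems.append((label, res.get("error_message") or "not accessible"))
        elif res.get("record_count", 0) == 0:
            problems.append((label, "parsed, but contains no records"))
    return problems


# Made-up sample results, for illustration only.
sample = [
    {"accessible": True, "size_bytes": 1048576, "record_count": 5000, "format": "jsonl"},
    {"accessible": False, "error_message": "permission denied"},
]
for label, reason in unusable_sources(sample):
    print(f"{label}: {reason}")
```

An empty return value means every source was reachable and non-empty, so it is reasonable to move on to submission.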

2. Submit the job

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "acme-support-lora-v3",
    "training_config": {
      "training_type": "LORA",
      "base_model": {
        "model_id": "meta-llama/Llama-3-8B",
        "dtype": "bfloat16"
      },
      "framework": {
        "type": "HF_TRAINER",
        "config": {"lora_r": "16", "lora_alpha": "32"}
      },
      "hyperparameters": {
        "learning_rate": "2e-4",
        "num_train_epochs": "3",
        "warmup_ratio": "0.05"
      },
      "max_retries": 1,
      "priority": 6
    },
    "data_config": {
      "sources": [
        {"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
      ],
      "preprocess": {"chat_template": "llama-3"},
      "max_seq_length": 4096,
      "batch_size": 8,
      "gradient_accumulation_steps": 4,
      "validation_split": 0.05,
      "seed": 42
    },
    "resource_config": {
      "gpu_count": 4,
      "gpu_type": "A100-80GB",
      "ram_min_mb": 65536
    },
    "output_config": {
      "output_model_name": "acme-support-lora-v3",
      "output_path": "scaidrive://acme/models/support-v3/",
      "checkpoint": {
        "save_strategy": "steps",
        "save_steps": 500,
        "save_total_limit": 3,
        "metric_for_best_model": "eval_loss",
        "greater_is_better": false
      },
      "merge_lora": false
    },
    "scheduling_config": {
      "queue": "default",
      "priority": 6,
      "preemptible": false,
      "max_runtime_seconds": 21600
    },
    "labels": {
      "team": "support",
      "experiment": "lora-v3",
      "dataset_version": "2026-05-01"
    }
  }'

Save the returned job_id; every later call in this guide uses it as $JOB_ID. The response also includes estimated_wait_seconds, an estimate of how long the job will sit in the queue before it starts.
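
To sanity-check save_steps against the dataset size, it can help to estimate how many optimizer steps the run will take. The sketch below uses the values from the request above and assumes batch_size is per GPU, as with the per-device batch size in Hugging Face Trainer; the source does not state this, so treat the result as a rough guide. The record count of 50,000 is made up, and estimate_steps is a hypothetical helper.

```python
import math


def estimate_steps(record_count, validation_split, batch_size,
                   gradient_accumulation_steps, gpu_count, num_train_epochs):
    """Return (effective_batch_size, approximate_total_optimizer_steps)."""
    # Records left for training after the validation split is held out.
    train_records = record_count - round(record_count * validation_split)
    # ASSUMPTION: batch_size is per GPU, so one optimizer step consumes
    # batch_size * gradient_accumulation_steps * gpu_count records.
    effective_batch = batch_size * gradient_accumulation_steps * gpu_count
    steps_per_epoch = math.ceil(train_records / effective_batch)
    return effective_batch, steps_per_epoch * num_train_epochs


effective, total = estimate_steps(
    record_count=50_000,          # made-up; use record_count from validation
    validation_split=0.05,
    batch_size=8,
    gradient_accumulation_steps=4,
    gpu_count=4,
    num_train_epochs=3,
)
print(effective, total, total // 500)  # last value: checkpoints at save_steps=500
```

If the last number is zero, save_steps is larger than the whole run and no step-based checkpoint would be written before the end.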

3. Watch it move through the lifecycle

python
import httpx, os, time
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
JOB = os.environ["JOB_ID"]
HOST = os.environ["SCAIGRID_HOST"]

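while True:
    resp = httpx.get(f"{HOST}/v1/modules/scaimind/jobs/{JOB}", headers=H)
    resp.raise_for_status()  # fail loudly on auth or server errors
    r = resp.json()["data"]
    print(r["status"])
    # Stop once the job reaches a terminal state.
    if r["status"] in {"COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"}:
        break
    time.sleep(10)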

javascript
const H = { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` };
const HOST = process.env.SCAIGRID_HOST;
const JOB = process.env.JOB_ID;
while (true) {
  const r = await fetch(`${HOST}/v1/modules/scaimind/jobs/${JOB}`, { headers: H });
  const { data } = await r.json();
  console.log(data.status);
  if (["COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"].includes(data.status)) break;
  await new Promise(res => setTimeout(res, 10_000));
}

bash
while true; do
  s=$(curl -s "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID" \
    -H "Authorization: Bearer $SCAIGRID_API_KEY" \
    | python3 -c "import sys,json; print(json.load(sys.stdin)['data']['status'])")
  echo "$s"
  case "$s" in COMPLETED|FAILED|CANCELLED|PREEMPTED) break ;; esac
  sleep 10
done
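
The loops above run until the job reaches a terminal state, however long that takes. In a script or CI job, a client-side timeout is often useful. The sketch below wraps the same polling logic in a hypothetical wait_for_job helper that accepts any status-fetching callable; the helper, its parameters, and the non-terminal status names in the demo are assumptions, not part of the API.

```python
import time

# Terminal states, as used in the polling loops above.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"}


def wait_for_job(fetch_status, interval=10.0, timeout=None,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a terminal status.

    fetch_status is any zero-argument callable returning the status string,
    for example a wrapper around the GET request shown above. Raises
    TimeoutError if `timeout` seconds pass without a terminal status.
    """
    deadline = None if timeout is None else clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout}s")
        sleep(interval)


# Demo with a scripted sequence instead of real HTTP calls. "QUEUED" and
# "RUNNING" are placeholder non-terminal names, not confirmed status values.
scripted = iter(["QUEUED", "RUNNING", "COMPLETED"])
final = wait_for_job(lambda: next(scripted), sleep=lambda seconds: None)
print(final)
```

Passing sleep and clock as parameters keeps the helper testable without real waiting; in normal use the defaults are enough.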

4. Stream metrics while it trains

In another terminal:

bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/metrics/stream?interval=10" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Each "event: metrics" message in the stream carries a JSON snapshot with the current step, loss, learning rate, throughput, and GPU utilisation. The same data is plotted live on the Training Monitor page of the admin UI.
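
To consume the stream from a script rather than a terminal, the frames need to be split and decoded. The sketch below assumes the endpoint uses standard Server-Sent Events framing (an "event:" line, one or more "data:" lines, then a blank line), which the "event: metrics" label suggests but the source does not confirm. The helper name iter_metrics and the JSON keys in the sample are assumptions.

```python
import json


def iter_metrics(lines):
    """Yield the decoded JSON payload of each `event: metrics` frame.

    `lines` is any iterable of text lines, for example the result of
    httpx's Response.iter_lines() on a streamed GET request.
    """
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current frame.
            if event == "metrics" and data:
                yield json.loads("\n".join(data))
            event, data = None, []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # Leading spaces are dropped; harmless for JSON payloads.
            data.append(line[len("data:"):].lstrip(" "))


# Made-up sample frames; the key names "step" and "loss" are assumptions.
sample_stream = [
    "event: metrics",
    'data: {"step": 120, "loss": 1.87}',
    "",
    ": keep-alive",
    "event: metrics",
    'data: {"step": 130, "loss": 1.81}',
    "",
]
for snapshot in iter_metrics(sample_stream):
    print(snapshot)
```

Because the parser only needs an iterable of lines, it can be exercised against recorded output before being pointed at a live stream.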

5. Tail logs

bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/logs?follow=true&level=INFO&tail=200" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

For a one-shot dump of the last N lines, set follow=false; the tail parameter sets N.

6. Handle a failure with retry

If the job ends in FAILED or PREEMPTED, inspect the error, then retry from the most recent checkpoint, optionally with different resources:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/retry" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "checkpoint_id": "",
    "modify_resources": true,
    "new_resource_config": {
      "gpu_count": 8,
      "gpu_type": "H100"
    }
  }'

A new child job is created with parent_job_id set to the failed run. The original stays in FAILED as the audit trail.
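
When retries are scripted, building the request body in one place keeps modify_resources and new_resource_config consistent. The sketch below is a hypothetical helper, build_retry_body. It assumes that an empty checkpoint_id selects the most recent checkpoint (as the example above implies) and that new_resource_config can be omitted when resources are unchanged; neither is confirmed by the source.

```python
import json


def build_retry_body(checkpoint_id="", new_resource_config=None):
    """Build the JSON body for POST .../jobs/{id}/retry.

    ASSUMPTION: checkpoint_id="" means "most recent checkpoint", and
    new_resource_config may be left out when modify_resources is false.
    """
    body = {
        "checkpoint_id": checkpoint_id,
        "modify_resources": new_resource_config is not None,
    }
    if new_resource_config is not None:
        body["new_resource_config"] = dict(new_resource_config)
    return body


# Retry with the same resources as the failed run.
print(json.dumps(build_retry_body()))

# Retry on larger hardware, matching the curl example above.
print(json.dumps(build_retry_body(
    new_resource_config={"gpu_count": 8, "gpu_type": "H100"},
)))
```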

7. Pause to free GPUs

If a higher-priority experiment lands and you want to give up GPU time without throwing work away:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/pause" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"save_checkpoint": true}'

Resume later:

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/resume" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"checkpoint_id": ""}'

8. Fetch the trained artefact

GET /jobs/{id} on a COMPLETED job returns the full proto-derived response. The output_config.output_path you supplied is where the coordinator wrote the trained adapter or merged model. From there, use ScaiDrive, or whichever storage protocol output_path points at, to download the artefact or hand it off to a ScaiGrid backend registration.

Done

You have a finished LoRA run with checkpoints, metrics, logs, and a child-retry pattern in place. Set training_type to SFT, QLORA, DPO, or RLHF and the rest of the recipe stays largely the same.