---
summary: "End-to-end recipe \u2014 validate data, submit a LoRA job, watch metrics\
  \ and logs, fetch the trained artefact."
title: Submit a LoRA fine-tune
path: tutorials/submit-a-lora-finetune
status: published
---

You are going from a JSONL dataset on ScaiDrive to a finished LoRA adapter. The shape stays the same for SFT, QLORA, and DPO; only the `training_type` and a few hyperparameters change.

Expect this to take roughly as long as the training run itself, typically minutes to hours depending on dataset size and GPU count.

## 1. Validate the data first

Before queueing GPU time, ask the coordinator whether it can reach and parse your dataset.

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/data/validate" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data_sources": [
      {"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
    ]
  }'
```

```python
import httpx, os
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
r = httpx.post(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/data/validate",
    headers=H,
    json={"data_sources": [
        {"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
    ]},
)
print(r.json()["data"])
```

```javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/data/validate`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    data_sources: [
      { path: "scaidrive://acme/training/support.jsonl", format: "jsonl" },
    ],
  }),
});
console.log((await r.json()).data);
```

Each validation result reports `accessible`, `size_bytes`, `record_count`, `format`, and an `error_message` if the source failed.
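
To gate submission on that result, something like the following sketch may help. It assumes `data` is a list of per-source result objects carrying the fields listed above plus a `path` field; the exact response shape is not specified here, so adjust the unpacking to what your coordinator returns. `failed_sources` is a hypothetical helper, not part of any ScaiGrid SDK.

```python
from typing import Any


def failed_sources(results: list[dict[str, Any]]) -> list[str]:
    """Return one message per source that should block job submission."""
    problems = []
    for res in results:
        # "path" is an assumed field; fall back to a placeholder if it is absent.
        label = res.get("path", "<unknown source>")
        if not res.get("accessible", False):
            problems.append(f"{label}: {res.get('error_message') or 'not accessible'}")
        elif res.get("record_count", 0) == 0:
            problems.append(f"{label}: parsed but contains no records")
    return problems


# Sample results shaped like the fields described above (values are made up).
sample = [
    {"path": "scaidrive://acme/training/support.jsonl", "accessible": True,
     "size_bytes": 1048576, "record_count": 1200, "format": "jsonl"},
    {"path": "scaidrive://acme/training/missing.jsonl", "accessible": False,
     "error_message": "file not found"},
]
for problem in failed_sources(sample):
    print(problem)
```

If the returned list is non-empty, fix the dataset before moving on to step 2.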

## 2. Submit the job

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "acme-support-lora-v3",
    "training_config": {
      "training_type": "LORA",
      "base_model": {
        "model_id": "meta-llama/Llama-3-8B",
        "dtype": "bfloat16"
      },
      "framework": {
        "type": "HF_TRAINER",
        "config": {"lora_r": "16", "lora_alpha": "32"}
      },
      "hyperparameters": {
        "learning_rate": "2e-4",
        "num_train_epochs": "3",
        "warmup_ratio": "0.05"
      },
      "max_retries": 1,
      "priority": 6
    },
    "data_config": {
      "sources": [
        {"path": "scaidrive://acme/training/support.jsonl", "format": "jsonl"}
      ],
      "preprocess": {"chat_template": "llama-3"},
      "max_seq_length": 4096,
      "batch_size": 8,
      "gradient_accumulation_steps": 4,
      "validation_split": 0.05,
      "seed": 42
    },
    "resource_config": {
      "gpu_count": 4,
      "gpu_type": "A100-80GB",
      "ram_min_mb": 65536
    },
    "output_config": {
      "output_model_name": "acme-support-lora-v3",
      "output_path": "scaidrive://acme/models/support-v3/",
      "checkpoint": {
        "save_strategy": "steps",
        "save_steps": 500,
        "save_total_limit": 3,
        "metric_for_best_model": "eval_loss",
        "greater_is_better": false
      },
      "merge_lora": false
    },
    "scheduling_config": {
      "queue": "default",
      "priority": 6,
      "preemptible": false,
      "max_runtime_seconds": 21600
    },
    "labels": {
      "team": "support",
      "experiment": "lora-v3",
      "dataset_version": "2026-05-01"
    }
  }'
```

Save the returned `job_id`; every later step uses it. The response also includes `estimated_wait_seconds`, an estimate of how long the job will wait in the queue before it starts.
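
A small sketch of pulling those two fields out of the response body. It assumes the submit response uses the same `data` envelope as the other endpoints in this tutorial; `summarize_submission` and the sample body are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def summarize_submission(body: dict, now: datetime) -> tuple[str, datetime]:
    """Return the job ID and the estimated start time from a submit response."""
    data = body["data"]  # assumption: same "data" envelope as the other endpoints
    wait_seconds = float(data.get("estimated_wait_seconds", 0))
    return data["job_id"], now + timedelta(seconds=wait_seconds)


# Made-up response body for illustration; the real job_id format may differ.
sample_response = {"data": {"job_id": "job-123", "estimated_wait_seconds": 90}}
job_id, starts_at = summarize_submission(sample_response, datetime.now(timezone.utc))
print(f"export JOB_ID={job_id}")
print(f"estimated start: {starts_at.isoformat(timespec='seconds')}")
```

Exporting `JOB_ID` this way lets the polling, metrics, and log commands in the following steps run unchanged.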

## 3. Watch it move through the lifecycle

```python
import httpx, os, time
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
JOB = os.environ["JOB_ID"]
HOST = os.environ["SCAIGRID_HOST"]

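while True:
    resp = httpx.get(f"{HOST}/v1/modules/scaimind/jobs/{JOB}", headers=H)
    resp.raise_for_status()  # surface auth or server errors instead of a KeyError below
    r = resp.json()["data"]
    print(r["status"])
    # Stop polling once the job reaches a terminal state.
    if r["status"] in {"COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"}:
        break
    time.sleep(10)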
```

```javascript
const H = { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` };
const HOST = process.env.SCAIGRID_HOST;
const JOB = process.env.JOB_ID;
while (true) {
  const r = await fetch(`${HOST}/v1/modules/scaimind/jobs/${JOB}`, { headers: H });
  const { data } = await r.json();
  console.log(data.status);
  if (["COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"].includes(data.status)) break;
  await new Promise(res => setTimeout(res, 10_000));
}
```

```bash
while true; do
  s=$(curl -s "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID" \
    -H "Authorization: Bearer $SCAIGRID_API_KEY" \
    | python3 -c "import sys,json; print(json.load(sys.stdin)['data']['status'])")
  echo "$s"
  case "$s" in COMPLETED|FAILED|CANCELLED|PREEMPTED) break ;; esac
  sleep 10
done
```
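
The loops above poll forever if the job never reaches a terminal state. One possible variation, sketched here, adds an upper bound on the wait. `wait_for_terminal` is a hypothetical helper: `fetch_status` stands in for the GET request shown above, and the `QUEUED` and `RUNNING` statuses in the demo are placeholders for whatever non-terminal states your coordinator reports.

```python
import time
from typing import Callable

# Terminal states taken from the polling loops above.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "PREEMPTED"}


def wait_for_terminal(
    fetch_status: Callable[[], str],
    interval: float = 10.0,
    max_wait: float = 21600.0,  # mirrors max_runtime_seconds from the job spec
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job is terminal, or raise TimeoutError after max_wait seconds."""
    deadline = clock() + max_wait
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        # Give up if the next poll would land past the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"job still {status} after {max_wait} seconds")
        sleep(interval)


# Demo with canned statuses and a no-op sleep, so nothing touches the network.
statuses = iter(["QUEUED", "RUNNING", "COMPLETED"])
print(wait_for_terminal(lambda: next(statuses), sleep=lambda _s: None))
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without real delays. Queue time counts against `max_wait` here, so pick a bound that covers both waiting and training.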

## 4. Stream metrics while it trains

In another terminal:

```bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/metrics/stream?interval=10" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

Each `event: metrics` message carries a JSON snapshot with the current step, loss, learning rate, throughput, and GPU utilisation. The same data is plotted live on the Training Monitor page of the admin UI.
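
To consume the stream from a script instead of `curl`, a parser along these lines could work. It is a minimal sketch that assumes the endpoint emits standard Server-Sent Events framing (an `event:` line, one or more `data:` lines, then a blank line). The `step` and `loss` keys in the sample are hypothetical; check a real snapshot for the actual field names.

```python
import json
from typing import Iterable, Iterator


def iter_metrics(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of every `metrics` event in an SSE line stream."""
    event, data_lines = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if event == "metrics" and data_lines:
                yield json.loads("\n".join(data_lines))
            event, data_lines = None, []
        elif line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # Per the SSE format, drop at most one space after the colon.
            data_lines.append(line[len("data:"):].removeprefix(" "))


# Canned stream for illustration; the payload keys are assumptions.
sample_stream = [
    "event: metrics\n",
    'data: {"step": 100, "loss": 1.82}\n',
    "\n",
    ": keep-alive\n",
    "event: metrics\n",
    'data: {"step": 110, "loss": 1.79}\n',
    "\n",
]
for snapshot in iter_metrics(sample_stream):
    print(snapshot["step"], snapshot["loss"])
```

With `httpx`, the same function can be fed `response.iter_lines()` from inside a `with httpx.stream("GET", url, headers=H, timeout=None) as response:` block.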

## 5. Tail logs

```bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/logs?follow=true&level=INFO&tail=200" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

For a one-shot dump of the last `tail` lines, set `follow=false`.

## 6. Handle a failure with retry

If the job ends in `FAILED` or `PREEMPTED`, inspect the error and retry from the most recent checkpoint, optionally with different resources:

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/retry" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "checkpoint_id": "",
    "modify_resources": true,
    "new_resource_config": {
      "gpu_count": 8,
      "gpu_type": "H100"
    }
  }'
```

A new child job is created with `parent_job_id` set to the failed run. The original stays in `FAILED` as the audit trail.
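
When retries are scripted, the request body can be assembled conditionally. The sketch below is one way to do that; `build_retry_body` and `RETRYABLE` are hypothetical names, the empty `checkpoint_id` is copied from the request above, and it is an assumption that sending `modify_resources: false` without `new_resource_config` reuses the original resources.

```python
import json
from typing import Optional

# Statuses this tutorial treats as candidates for a retry.
RETRYABLE = {"FAILED", "PREEMPTED"}


def build_retry_body(
    checkpoint_id: str = "",
    gpu_count: Optional[int] = None,
    gpu_type: Optional[str] = None,
) -> dict:
    """Build the JSON body for POST /jobs/{id}/retry."""
    body = {"checkpoint_id": checkpoint_id, "modify_resources": False}
    new_resources = {}
    if gpu_count is not None:
        new_resources["gpu_count"] = gpu_count
    if gpu_type is not None:
        new_resources["gpu_type"] = gpu_type
    # Only request a resource change when at least one override was given.
    if new_resources:
        body["modify_resources"] = True
        body["new_resource_config"] = new_resources
    return body


final_status = "FAILED"  # in practice, the last status seen by the polling loop
if final_status in RETRYABLE:
    print(json.dumps(build_retry_body(gpu_count=8, gpu_type="H100"), indent=2))
```

The printed body matches the `curl` payload above and can be passed as `json=` to `httpx.post`.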

## 7. Pause to free GPUs

If a higher-priority experiment lands and you want to give up GPU time without throwing work away:

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/pause" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"save_checkpoint": true}'
```

Resume later:

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/resume" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"checkpoint_id": ""}'
```

## 8. Fetch the trained artefact

`GET /jobs/{id}` on a `COMPLETED` job returns the full proto-derived response. The `output_config.output_path` you supplied is where the coordinator wrote the trained adapter or merged model. From there, use ScaiDrive (or whichever storage protocol the path uses) to download the artefact or hand it off to a ScaiGrid backend registration.

## Done

You have a finished LoRA run with checkpoints, metrics, logs, and a child-retry pattern in place. Change `training_type` to `SFT`, `QLORA`, `DPO`, or `RLHF` and the rest of the recipe stays largely the same.
