Quickstart

In five minutes you will have a LoRA fine-tune queued on the cluster and a live stream of metrics coming back to your terminal.

You need:

  • A ScaiGrid API key with the scaimind:manage permission (any tenant admin has this).
  • A reachable MindCoordinator cluster with at least one online node.
  • A training dataset already accessible to the coordinator (typically a path resolvable via the per-request ScaiDrive token).

bash
export SCAIGRID_HOST="https://scaigrid.scailabs.ai"
export SCAIGRID_API_KEY="sgk_..."
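
The Python snippets on this page read both variables straight from os.environ, so a missing one surfaces only as a bare KeyError. As a minimal sketch, the check can be made explicit up front. load_config is a hypothetical helper written for this page, not part of any ScaiGrid SDK.

```python
import os


def load_config(env=os.environ):
    """Return (base_url, headers) built from the two quickstart variables.

    Hypothetical helper for illustration; not a ScaiGrid SDK function.
    """
    missing = [k for k in ("SCAIGRID_HOST", "SCAIGRID_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    # Strip a trailing slash so f"{base_url}/v1/..." never contains "//".
    base_url = env["SCAIGRID_HOST"].rstrip("/")
    headers = {"Authorization": f"Bearer {env['SCAIGRID_API_KEY']}"}
    return base_url, headers
```

Calling load_config() once at the top of a script gives every later request the same base URL and Authorization header.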

1. Check cluster status

Before queueing anything, confirm the cluster has capacity.

bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/cluster" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

python
import httpx, os
r = httpx.get(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/cluster",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
)
r.raise_for_status()  # raise on a 4xx/5xx response before reading the body
print(r.json()["data"])

javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/cluster`, {
  headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` },
});
console.log((await r.json()).data);

You will see counts for online_nodes, total_gpus, available_gpus, queued_jobs, and overall cluster_utilization.
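
To turn those counts into a go/no-go decision, a script can compare them against what the job needs. The sketch below assumes online_nodes and available_gpus are plain integers inside the "data" object; has_capacity is a hypothetical helper name.

```python
def has_capacity(cluster, gpus_needed=1):
    """Return True when the cluster snapshot could start a job right away.

    `cluster` is the "data" object from GET /v1/modules/scaimind/cluster.
    Assumption: online_nodes and available_gpus are integers.
    """
    return (
        cluster.get("online_nodes", 0) >= 1
        and cluster.get("available_gpus", 0) >= gpus_needed
    )
```

A False result does not have to block submission: the job can still be queued, and the submit response reports estimated_wait_seconds.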

2. Submit a LoRA job

The job request is one envelope with five nested configs: training_config, data_config, resource_config, and the optional output_config and scheduling_config.
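
One way to keep the required and optional parts apart is a small builder that adds the optional configs only when they are supplied. This is a sketch: build_job_request is a hypothetical helper, and the inner fields of output_config and scheduling_config are not described on this page, so they are passed through untouched.

```python
def build_job_request(name, training_config, data_config, resource_config,
                      output_config=None, scheduling_config=None, labels=None):
    """Assemble the job envelope, adding optional sections only when given."""
    body = {
        "name": name,
        "training_config": training_config,
        "data_config": data_config,
        "resource_config": resource_config,
    }
    # Optional sections are omitted entirely rather than sent as null.
    optional = {
        "output_config": output_config,
        "scheduling_config": scheduling_config,
        "labels": labels,
    }
    for key, value in optional.items():
        if value is not None:
            body[key] = value
    return body
```

The returned dict can be passed as the json= argument of an HTTP client call.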

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/jobs" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "quickstart-lora",
    "training_config": {
      "training_type": "LORA",
      "base_model": {"model_id": "meta-llama/Llama-3-8B"},
      "framework": {"type": "HF_TRAINER"},
      "hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"}
    },
    "data_config": {
      "sources": [{"path": "scaidrive://my-tenant/support.jsonl", "format": "jsonl"}],
      "max_seq_length": 2048,
      "batch_size": 8
    },
    "resource_config": {"gpu_count": 1, "gpu_type": "A100-80GB"},
    "labels": {"team": "quickstart"}
  }'

python
import httpx, os
r = httpx.post(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/jobs",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    json={
        "name": "quickstart-lora",
        "training_config": {
            "training_type": "LORA",
            "base_model": {"model_id": "meta-llama/Llama-3-8B"},
            "framework": {"type": "HF_TRAINER"},
            "hyperparameters": {"learning_rate": "2e-4", "num_train_epochs": "3"},
        },
        "data_config": {
            "sources": [{"path": "scaidrive://my-tenant/support.jsonl", "format": "jsonl"}],
            "max_seq_length": 2048,
            "batch_size": 8,
        },
        "resource_config": {"gpu_count": 1, "gpu_type": "A100-80GB"},
        "labels": {"team": "quickstart"},
    },
)
r.raise_for_status()  # raise on a 4xx/5xx response before reading the body
job = r.json()["data"]
print(job["job_id"], job["status"])

javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/jobs`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    name: "quickstart-lora",
    training_config: {
      training_type: "LORA",
      base_model: { model_id: "meta-llama/Llama-3-8B" },
      framework: { type: "HF_TRAINER" },
      hyperparameters: { learning_rate: "2e-4", num_train_epochs: "3" },
    },
    data_config: {
      sources: [{ path: "scaidrive://my-tenant/support.jsonl", format: "jsonl" }],
      max_seq_length: 2048,
      batch_size: 8,
    },
    resource_config: { gpu_count: 1, gpu_type: "A100-80GB" },
    labels: { team: "quickstart" },
  }),
});
const { data: job } = await r.json();
console.log(job.job_id, job.status);

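Save the returned job_id; the remaining steps read it from the JOB_ID environment variable (for example, export JOB_ID="<job_id from the response>"). The response also includes estimated_wait_seconds, an estimate based on the current queue.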

3. Poll the job

bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

The status walks the lifecycle: PENDING → QUEUED → SCHEDULING → PREPARING → TRAINING → CHECKPOINTING → EVALUATING → EXPORTING → COMPLETED. A pause moves it to PAUSED; a cancel ends it in CANCELLED; an unrecoverable error ends it in FAILED.
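
A polling loop over that lifecycle might look like the sketch below, which uses only the standard library. It assumes the GET response wraps the job in "data" the same way the submit response does; fetch_status and wait_for are hypothetical helper names.

```python
import json
import time
import urllib.request

# States the lifecycle description treats as final.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED"}


def fetch_status(host, api_key, job_id):
    """GET the job and return its status string.

    Assumption: the body has the shape {"data": {"status": ...}}.
    """
    req = urllib.request.Request(
        f"{host}/v1/modules/scaimind/jobs/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]["status"]


def wait_for(get_status, targets, interval=10.0, sleep=time.sleep):
    """Poll get_status() until it returns a target or terminal status."""
    while True:
        status = get_status()
        if status in targets or status in TERMINAL_STATES:
            return status
        sleep(interval)
```

For example, wait_for(lambda: fetch_status(host, key, job_id), {"TRAINING"}) returns once the job is training or has already ended. PAUSED is not terminal here, so the loop keeps waiting through a pause; a job can also move past TRAINING between two polls, so passing several acceptable states is safer than passing one.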

4. Stream metrics

Once the job reaches TRAINING, open the metrics stream. Each event is a JSON snapshot of loss, learning rate, throughput, and GPU utilisation.

bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/metrics/stream?interval=5" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

python
import httpx, os
with httpx.stream(
    "GET",
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/jobs/{os.environ['JOB_ID']}/metrics/stream",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    params={"interval": 5},
    timeout=None,
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line)

javascript
const resp = await fetch(
  `${process.env.SCAIGRID_HOST}/v1/modules/scaimind/jobs/${process.env.JOB_ID}/metrics/stream?interval=5`,
  { headers: { "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}` } },
);
const reader = resp.body.getReader();
const dec = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact when they span two chunks
  process.stdout.write(dec.decode(value, { stream: true }));
}

For a point-in-time read instead of a stream, use GET /jobs/{job_id}/metrics?max_points=200.
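
The streaming examples above print the raw event-stream text. To work with the snapshots as Python objects, the lines can be grouped into events and the data payloads decoded as JSON. The sketch below follows the standard Server-Sent Events framing (event: and data: fields, a blank line between events). The event name and JSON keys in the usage note are illustrative, not taken from the API.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space is part of the framing
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def iter_snapshots(lines):
    """Decode each event's data payload as a JSON object."""
    for _event, payload in iter_sse_events(lines):
        yield json.loads(payload)
```

With the httpx example, this would be used as for snap in iter_snapshots(resp.iter_lines()): print(snap.get("loss")), assuming the snapshot exposes a loss key under that name.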

5. Tail logs

In another terminal:

bash
curl -N "$SCAIGRID_HOST/v1/modules/scaimind/jobs/$JOB_ID/logs?follow=true&level=INFO&tail=100" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

Log entries arrive as Server-Sent Events tagged log. Pass follow=false for a one-shot tail.
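
When building this URL from Python, note that str(True) is "True", not the lowercase true used in the curl example. A small sketch that normalises the flag (logs_url is a hypothetical helper; only the three query parameters shown above are used):

```python
from urllib.parse import urlencode


def logs_url(host, job_id, follow=True, level="INFO", tail=100):
    """Build the logs endpoint URL with the follow, level and tail parameters."""
    query = urlencode({
        "follow": "true" if follow else "false",  # lowercase, as in the curl example
        "level": level,
        "tail": tail,
    })
    return f"{host}/v1/modules/scaimind/jobs/{job_id}/logs?{query}"
```

Calling logs_url(host, job_id, follow=False) gives the one-shot variant.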

What just happened

  • Your POST /jobs request was translated into a SubmitJob gRPC call against MindCoordinator, with your tenant id, user id, and (when configured) a downstream ScaiDrive token forwarded as gRPC metadata.
  • The coordinator queued the job, picked a node with a free GPU matching gpu_type, fetched the dataset (using the forwarded token), and started training under the requested framework.
  • ScaiMind's local cache (mod_scaimind_jobs) is populated by job list calls, the sibling endpoints of the cluster check in step 1, so the admin UI dashboard already shows the job.
  • The metric and log streams are direct passthroughs of MindCoordinator's gRPC streams, framed as SSE events.

Updated 2026-05-18 15:01:31, rev 12