---
summary: The five-config submission shape, the lifecycle of a job, lifecycle controls
  (cancel, pause, resume, retry), and where checkpoints fit.
title: Training jobs
path: concepts/training-jobs
status: published
---

A job is the unit of work in ScaiMind. Everything else — metrics, logs, evaluations, queue position — is a property of a job. Understanding how one is shaped, how it moves through its lifecycle, and which controls are available is most of what you need.

## The submission shape

A `SubmitJobRequest` has a top-level `name`, optional `labels`, and five nested configs:

```json
{
  "name": "support-lora-v3",
  "training_config": { ... },
  "data_config": { ... },
  "resource_config": { ... },
  "output_config": { ... },
  "scheduling_config": { ... },
  "labels": {"team": "support", "experiment": "lr-sweep"}
}
```
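
As a minimal sketch, the same shape can be assembled and sanity-checked in Python before it is sent. `build_submission` is a hypothetical helper, not part of ScaiMind, and it assumes all five configs must be present, which the coordinator may or may not enforce.

```python
# Illustrative only: build_submission is a hypothetical helper, not a ScaiMind API.
REQUIRED_CONFIGS = (
    "training_config",
    "data_config",
    "resource_config",
    "output_config",
    "scheduling_config",
)


def build_submission(name, configs, labels=None):
    """Return a SubmitJobRequest-shaped dict, or raise if a config is missing."""
    # Assumption: all five nested configs are expected on every submission.
    missing = [key for key in REQUIRED_CONFIGS if key not in configs]
    if missing:
        raise ValueError(f"missing configs: {', '.join(missing)}")
    request = {"name": name}
    request.update({key: configs[key] for key in REQUIRED_CONFIGS})
    if labels:
        # labels is optional and maps strings to strings
        request["labels"] = {str(k): str(v) for k, v in labels.items()}
    return request
```

Calling `build_submission("support-lora-v3", configs, {"team": "support"})` with a complete `configs` dict yields a payload with the same top-level keys as the JSON above.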

### `training_config`

Defines what model is being trained and how.

| Field | Notes |
|---|---|
| `training_type` | One of `SFT`, `LORA`, `QLORA`, `DPO`, `RLHF`, `CONTINUED_PRETRAIN`, `FULL_FINETUNE`. |
| `base_model.model_id` | Required. Identifier the coordinator can resolve (HuggingFace ref, ScaiAtlas id, etc.). |
| `base_model.revision`, `source`, `tokenizer_id`, `dtype`, `trust_remote_code`, `model_kwargs` | Optional details. |
| `framework.type` | One of `HF_TRAINER`, `DEEPSPEED`, `FSDP`, `MEGATRON`, `CUSTOM`. |
| `framework.config` | Free-form `dict[str,str]` of framework-specific options. |
| `framework.distributed` | `world_size`, `backend` (default `nccl`), `strategy`, `config`. |
| `framework.custom_script_path` | When `type` is `CUSTOM`, path to the user-supplied training script. |
| `hyperparameters` | `dict[str,str]` — coordinator parses by framework convention. |
| `environment` | `dict[str,str]` of env vars to set on the training process. |
| `resume_from_checkpoint` | Checkpoint id to resume from. |
| `max_retries` | Per-job retry cap for transient failures. |
| `priority` | 0-10, default 5. |
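
Because `hyperparameters` (like `framework.config` and `environment`) is a `dict[str,str]`, numeric and boolean values need to be sent as strings. The sketch below shows one way to do that. The `stringify` helper, the placeholder `model_id`, the hyperparameter names, and the lowercase `true`/`false` spelling are all assumptions for illustration; the coordinator's framework convention decides what is actually accepted.

```python
def stringify(options):
    """Convert option values to strings, as dict[str,str] fields expect."""
    out = {}
    for key, value in options.items():
        if isinstance(value, bool):
            out[key] = "true" if value else "false"  # assumed boolean spelling
        else:
            out[key] = str(value)
    return out


training_config = {
    "training_type": "LORA",
    "base_model": {"model_id": "example-org/example-7b"},  # placeholder id
    "framework": {"type": "HF_TRAINER", "config": {}},
    # Hyperparameter names here are illustrative, not a confirmed schema.
    "hyperparameters": stringify(
        {"learning_rate": 2e-4, "num_train_epochs": 3, "bf16": True}
    ),
    "max_retries": 2,
    "priority": 5,
}
```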

### `data_config`

Where the training data lives and how it should be loaded.

| Field | Notes |
|---|---|
| `sources` | List of `DataSource` entries. Each has a `path`, optional `type`/`split`/`format`, column projection, filters, sampling. |
| `preprocess` | Tokenizer config, BOS/EOS handling, padding, truncation, chat template, field mapping. |
| `max_seq_length` | Token cap per sample. |
| `batch_size` | Per-step batch size. |
| `gradient_accumulation_steps` | Default 1. |
| `num_workers`, `pin_memory` | DataLoader knobs. |
| `validation_split` | 0.0-1.0; fraction reserved for validation. |
| `seed` | Determinism seed. |

Paths use protocols the coordinator understands. With token exchange enabled, `scaidrive://...` paths are dereferenced using the forwarded ScaiDrive token; `scaiatlas://...` paths use the forwarded ScaiAtlas token.
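
The scheme-to-token rule above can be checked client-side before submitting. This is a sketch under stated assumptions: `tokens_needed` is a hypothetical helper, and the example paths and field values are made up for illustration.

```python
from urllib.parse import urlsplit

# Mapping taken from the paragraph above; the labels are display names only.
TOKEN_FOR_SCHEME = {"scaidrive": "ScaiDrive", "scaiatlas": "ScaiAtlas"}


def tokens_needed(data_config):
    """Return the forwarded tokens that the listed source paths would rely on."""
    needed = set()
    for source in data_config.get("sources", []):
        scheme = urlsplit(source["path"]).scheme
        if scheme in TOKEN_FOR_SCHEME:
            needed.add(TOKEN_FOR_SCHEME[scheme])
    return needed


# Hypothetical data_config; paths and values are placeholders.
data_config = {
    "sources": [
        {"path": "scaidrive://datasets/support/train.jsonl", "split": "train"},
        {"path": "scaiatlas://corpora/faq", "split": "train"},
    ],
    "max_seq_length": 2048,
    "batch_size": 8,
    "validation_split": 0.1,
    "seed": 42,
}
```

Here `tokens_needed(data_config)` reports both tokens, which is a cue to confirm that token exchange is enabled for both services before the job is queued.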

### `resource_config`

GPU and host requirements.

| Field | Notes |
|---|---|
| `gpu_count` | Number of GPUs requested; defaults to 1 when omitted. |
| `gpu_type` | Free string (e.g. `A100-80GB`, `H100`). Coordinator matches against node inventory. |
| `gpu_memory_min_mb`, `cpu_cores`, `ram_min_mb`, `storage_min_mb` | Minimums. |
| `node_affinity`, `node_anti_affinity` | Lists of node labels to prefer or avoid. |

### `output_config`

Where to put the result, how to checkpoint, optional Hub push.

| Field | Notes |
|---|---|
| `output_model_name` | Logical name for the artefact. |
| `output_path` | Coordinator-resolvable path for the saved model. |
| `checkpoint.save_strategy` | `steps` (default) or `epoch`. |
| `checkpoint.save_steps` | Default 500. |
| `checkpoint.save_total_limit` | Default 3. |
| `checkpoint.metric_for_best_model`, `greater_is_better` | For best-checkpoint tracking. |
| `checkpoint.save_on_each_node` | Distributed-only. |
| `checkpoint.resume_from_best` | Whether `resume_from_checkpoint` should snap to the best. |
| `push_to_hub`, `hub_repo_id`, `hub_token` | HuggingFace Hub push on success. |
| `export_format`, `merge_lora` | Post-training export controls. |
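
To make the checkpoint defaults concrete, the sketch below computes which step-based checkpoints would still exist at a given step. It assumes the coordinator keeps only the `save_total_limit` most recent checkpoints and ignores any pinning of the best checkpoint; actual retention is coordinator-specific, and `retained_checkpoint_steps` is a hypothetical helper.

```python
def retained_checkpoint_steps(current_step, save_steps=500, save_total_limit=3):
    """Steps whose checkpoints survive under an assumed keep-the-newest policy."""
    # Every multiple of save_steps up to and including current_step is saved.
    saved = list(range(save_steps, current_step + 1, save_steps))
    # Keep only the newest save_total_limit entries.
    return saved[max(0, len(saved) - save_total_limit):]
```

With the defaults, a job at step 2300 has written checkpoints at 500, 1000, 1500, and 2000, and under this assumption only the last three remain.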

### `scheduling_config`

How the queue scheduler should treat this job.

| Field | Notes |
|---|---|
| `queue` | Queue name; default `default`. |
| `priority` | 0-10, default 5. |
| `preemptible` | If true, higher-priority jobs may preempt this one. |
| `max_runtime_seconds` | Hard timeout. |
| `required_capabilities` | List of node-capability tags this job needs. |

## Lifecycle

The `JobStatus` enum is the source of truth:

`PENDING` → `QUEUED` → `SCHEDULING` → `PREPARING` → `TRAINING` → `CHECKPOINTING` → `EVALUATING` → `EXPORTING` → `COMPLETED`

Side branches:

- `PAUSED`: entered from `TRAINING` via `POST /jobs/{job_id}/pause`. `POST /jobs/{job_id}/resume` re-queues the job, which returns to `TRAINING` after passing through `QUEUED` and `SCHEDULING` again.
- `CANCELLED`: terminal state from `POST /jobs/{job_id}/cancel`. Reachable from any non-terminal state.
- `FAILED`: terminal, with the error captured in `error_message` / `error_type`.
- `PREEMPTED`: the coordinator reclaimed the GPUs; retryable.

`COMPLETED`, `FAILED`, `CANCELLED`, and `PREEMPTED` are terminal. `CHECKPOINTING`, `EVALUATING`, and `EXPORTING` are transient phases inside the broader "running" lifetime.
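
A client can mirror these rules locally to decide which controls make sense for a job. The sketch below is a client-side approximation: it assumes the enum's wire values are the upper-case names shown above, and the helper functions are hypothetical.

```python
from enum import Enum

HAPPY_PATH = [
    "PENDING", "QUEUED", "SCHEDULING", "PREPARING", "TRAINING",
    "CHECKPOINTING", "EVALUATING", "EXPORTING", "COMPLETED",
]
SIDE_STATES = ["PAUSED", "CANCELLED", "FAILED", "PREEMPTED"]

# Assumption: each status is serialised as its own name.
JobStatus = Enum("JobStatus", [(name, name) for name in HAPPY_PATH + SIDE_STATES])

TERMINAL = {
    JobStatus.COMPLETED,
    JobStatus.FAILED,
    JobStatus.CANCELLED,
    JobStatus.PREEMPTED,
}


def is_terminal(status):
    """True once the job can no longer change state."""
    return JobStatus(status) in TERMINAL


def can_cancel(status):
    """Cancel is reachable from any non-terminal state."""
    return not is_terminal(status)


def can_pause(status):
    """Pause is entered from TRAINING only."""
    return JobStatus(status) is JobStatus.TRAINING
```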

## Lifecycle controls

### Cancel

`POST /jobs/{job_id}/cancel`, body `{ "reason": "..." }`. The coordinator stops the run, releases the GPUs, and marks the job `CANCELLED`. A checkpoint that was being written at cancel time may or may not be retained, depending on the coordinator implementation.

### Pause

`POST /jobs/{job_id}/pause`, body `{ "save_checkpoint": true }`. The coordinator saves a checkpoint by default, releases the GPUs, and marks the job `PAUSED`. Useful when you need the hardware for a higher-priority job without losing progress.

### Resume

`POST /jobs/{job_id}/resume`, body `{ "checkpoint_id": "..." }`. An empty `checkpoint_id` resumes from the most recent checkpoint. The job is re-queued, so its status walks `QUEUED` → `SCHEDULING` → ... again before it reaches `TRAINING`.
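
The pause and resume calls can be sketched with the standard library as follows. The host name and job id are placeholders, `control_request` is a hypothetical helper, and authentication headers are omitted because they are not covered on this page. The requests are built but not sent.

```python
import json
from urllib.request import Request

BASE_URL = "https://scaimind.example.com"  # placeholder host


def control_request(job_id, action, body):
    """Build (but do not send) a POST for a lifecycle control endpoint."""
    return Request(
        url=f"{BASE_URL}/jobs/{job_id}/{action}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Pause and keep progress by saving a checkpoint first.
pause = control_request("job-123", "pause", {"save_checkpoint": True})
# An empty checkpoint_id asks for the most recent checkpoint.
resume = control_request("job-123", "resume", {"checkpoint_id": ""})
```

Passing either object to `urllib.request.urlopen` would send it; the same helper also covers `cancel` with a `{"reason": "..."}` body.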

### Retry

`POST /jobs/{job_id}/retry`, body:

```json
{
  "checkpoint_id": "",
  "modify_resources": false,
  "new_resource_config": null
}
```

Creates a child job (`parent_job_id` set on the child) that starts from the given checkpoint. The child reuses the original `resource_config` unless `modify_resources` is `true` and `new_resource_config` is supplied. Use this when a job failed for a recoverable reason, such as an out-of-memory error at a larger batch size or a node going away, and you want to try again, optionally with adjusted resources.
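
A small helper keeps `modify_resources` and `new_resource_config` consistent with each other. This is a sketch: `retry_body` is hypothetical, and the checkpoint id and resource values in the usage note are made up.

```python
def retry_body(checkpoint_id="", new_resource_config=None):
    """Body for POST /jobs/{job_id}/retry; resources change only when supplied."""
    return {
        "checkpoint_id": checkpoint_id,
        # The flag is derived so it can never disagree with the config.
        "modify_resources": new_resource_config is not None,
        "new_resource_config": new_resource_config,
    }
```

For example, `retry_body("ckpt-0042", {"gpu_count": 2, "gpu_type": "A100-80GB"})` asks for a retry on different hardware, while `retry_body()` reproduces the default body shown above.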

## Checkpoints

Checkpoints are owned by the coordinator. ScaiMind does not expose a "list checkpoints" endpoint; checkpoint ids surface via the job detail response, the metrics response, and any pause/resume/retry calls you make. The `output_config.checkpoint` block controls cadence and retention.

To restart a job from a checkpoint, supply the id in `training_config.resume_from_checkpoint` on a fresh `POST /jobs`, or pass it as `checkpoint_id` in the body of `POST /jobs/{job_id}/resume` or `POST /jobs/{job_id}/retry`.

## Labels and queries

`labels` is a free-form `dict[str,str]`. The coordinator indexes labels and accepts them as query filters, for example `GET /jobs?labels.team=support` (filtering is handled coordinator-side; see the API reference for current support). By convention, labels carry team, experiment, dataset version, and model family: anything you would want to slice the dashboard by.
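
The filter URL can be built from a label dict as sketched below. `jobs_query` is a hypothetical helper, and combining several labels with `&` is an assumption; only the single-label form appears above.

```python
from urllib.parse import urlencode


def jobs_query(labels):
    """Build a GET /jobs path that filters on the given labels."""
    if not labels:
        return "/jobs"
    # Sort keys so the same labels always produce the same URL.
    params = {f"labels.{key}": value for key, value in sorted(labels.items())}
    return "/jobs?" + urlencode(params)
```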
