Training jobs

A job is the unit of work in ScaiMind. Everything else (metrics, logs, evaluations, queue position) is a property of a job. Once you understand how a job is shaped, how it moves through its lifecycle, and which controls are available, you know most of what you need.

The submission shape

A SubmitJobRequest has a top-level name, optional labels, and five nested configs:

```json
{
  "name": "support-lora-v3",
  "training_config": { ... },
  "data_config": { ... },
  "resource_config": { ... },
  "output_config": { ... },
  "scheduling_config": { ... },
  "labels": {"team": "support", "experiment": "lr-sweep"}
}
```
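As a rough illustration of the shape above, the request can be assembled client-side as a plain dict before serialising it. This is a minimal sketch: the helper name `build_submit_request` is hypothetical, and treating all five nested configs as mandatory is an assumption made here for the check, not something the shape itself states.

```python
import json

# Top-level nested configs named in the submission shape.
NESTED_CONFIGS = (
    "training_config",
    "data_config",
    "resource_config",
    "output_config",
    "scheduling_config",
)


def build_submit_request(name, configs, labels=None):
    """Assemble a SubmitJobRequest-shaped dict (hypothetical helper)."""
    if not name:
        raise ValueError("name must be a non-empty string")
    # Assumption: every nested config must be present, even if empty.
    missing = [key for key in NESTED_CONFIGS if key not in configs]
    if missing:
        raise ValueError(f"missing nested configs: {missing}")
    request = {"name": name}
    for key in NESTED_CONFIGS:
        request[key] = dict(configs[key])  # shallow copy of the caller's dict
    if labels:
        # labels is dict[str,str], so coerce keys and values to strings.
        request["labels"] = {str(k): str(v) for k, v in labels.items()}
    return request


payload = build_submit_request(
    "support-lora-v3",
    {key: {} for key in NESTED_CONFIGS},  # empty placeholders for illustration
    labels={"team": "support", "experiment": "lr-sweep"},
)
body = json.dumps(payload)  # JSON text suitable for a request body
```

The empty placeholder configs are only there to keep the example short; a real submission would fill each one in as described in the sections that follow.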

`training_config`

Defines what model is being trained and how.

| Field | Notes |
| --- | --- |
| `training_type` | One of `SFT`, `LORA`, `QLORA`, `DPO`, `RLHF`, `CONTINUED_PRETRAIN`, `FULL_FINETUNE`. |
| `base_model.model_id` | Required. Identifier the coordinator can resolve (HuggingFace ref, ScaiAtlas id, etc.). |
| `base_model.revision`, `source`, `tokenizer_id`, `dtype`, `trust_remote_code`, `model_kwargs` | Optional details. |
| `framework.type` | One of `HF_TRAINER`, `DEEPSPEED`, `FSDP`, `MEGATRON`, `CUSTOM`. |
| `framework.config` | Free-form `dict[str,str]` of framework-specific options. |
| `framework.distributed` | `world_size`, `backend` (default `nccl`), `strategy`, `config`. |
| `framework.custom_script_path` | When `type` is `CUSTOM`, path to the user-supplied training script. |
| `hyperparameters` | `dict[str,str]`; the coordinator parses values by framework convention. |
| `environment` | `dict[str,str]` of env vars to set on the training process. |
| `resume_from_checkpoint` | Checkpoint id to resume from. |
| `max_retries` | Per-job retry cap for transient failures. |
| `priority` | 0-10, default 5. |
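Because `hyperparameters`, `environment`, and `framework.config` are all `dict[str,str]`, native Python values need to be turned into strings before submission. The sketch below shows one way to do that. The helper `to_str_dict`, the lowercase `true`/`false` spelling for booleans, the model id, and the hyperparameter names (borrowed from common HF Trainer usage) are all assumptions for illustration.

```python
def to_str_dict(values):
    """Convert native Python values into a dict[str,str] (hypothetical helper)."""
    out = {}
    for key, value in values.items():
        if isinstance(value, bool):
            # Assumed spelling; check what your framework convention expects.
            out[key] = "true" if value else "false"
        else:
            out[key] = str(value)
    return out


training_config = {
    "training_type": "LORA",
    "base_model": {"model_id": "example-org/example-7b"},  # hypothetical id
    "framework": {"type": "HF_TRAINER"},
    "hyperparameters": to_str_dict(
        {"learning_rate": 2e-4, "num_train_epochs": 3, "bf16": True}
    ),
    "max_retries": 2,
    "priority": 5,
}
```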

`data_config`

Where the training data lives and how it should be loaded.

| Field | Notes |
| --- | --- |
| `sources` | List of `DataSource` entries. Each has a `path`, optional `type`/`split`/`format`, column projection, filters, sampling. |
| `preprocess` | Tokenizer config, BOS/EOS handling, padding, truncation, chat template, field mapping. |
| `max_seq_length` | Token cap per sample. |
| `batch_size` | Per-step batch size. |
| `gradient_accumulation_steps` | Default 1. |
| `num_workers`, `pin_memory` | DataLoader knobs. |
| `validation_split` | 0.0-1.0; fraction reserved for validation. |
| `seed` | Determinism seed. |

Paths use protocols the coordinator understands. With token exchange enabled, scaidrive://... paths are dereferenced using the forwarded ScaiDrive token; scaiatlas://... paths use the forwarded ScaiAtlas token.
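To make the scheme-to-token rule concrete, the sketch below classifies a source path by its URL scheme. It only mirrors the two mappings stated above; the function name and the example paths are hypothetical, and the real dereferencing happens coordinator-side.

```python
from urllib.parse import urlparse

# Scheme -> forwarded token, as described in the text. Other schemes are unmapped.
TOKEN_FOR_SCHEME = {
    "scaidrive": "ScaiDrive",
    "scaiatlas": "ScaiAtlas",
}


def forwarded_token_for(path):
    """Return which forwarded token a path would rely on, or None."""
    scheme = urlparse(path).scheme  # urlparse lowercases the scheme
    return TOKEN_FOR_SCHEME.get(scheme)


sources = [
    {"path": "scaidrive://team-space/support/train.jsonl"},  # hypothetical path
    {"path": "scaiatlas://datasets/support-v3"},             # hypothetical path
    {"path": "/mnt/shared/train.jsonl"},                     # no scheme
]
tokens = [forwarded_token_for(source["path"]) for source in sources]
```

A check like this can be useful before submission to confirm that token exchange is enabled for every scheme a job's `sources` actually use.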

`resource_config`

GPU and host requirements.

| Field | Notes |
| --- | --- |
| `gpu_count` | Required; default 1. |
| `gpu_type` | Free string (e.g. `A100-80GB`, `H100`). Coordinator matches against node inventory. |
| `gpu_memory_min_mb`, `cpu_cores`, `ram_min_mb`, `storage_min_mb` | Minimums. |
| `node_affinity`, `node_anti_affinity` | Lists of node labels to prefer or avoid. |
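The matching logic itself lives in the coordinator and is not specified here. Purely as an illustration of how minimums like these can be read, the sketch below checks a single node against a `resource_config`. The node record shape (`gpu_memory_mb`, `ram_mb`, `storage_mb`) and the exact-match rule for `gpu_type` are assumptions, not the coordinator's actual algorithm.

```python
# resource_config minimum -> assumed capacity key on a node record (hypothetical).
MINIMUM_TO_CAPACITY = {
    "gpu_memory_min_mb": "gpu_memory_mb",
    "cpu_cores": "cpu_cores",
    "ram_min_mb": "ram_mb",
    "storage_min_mb": "storage_mb",
}


def node_satisfies(node, resource_config):
    """Return True if a hypothetical node record meets the requested minimums."""
    wanted_type = resource_config.get("gpu_type")
    # Assumption: gpu_type must match the node's type exactly when given.
    if wanted_type and node.get("gpu_type") != wanted_type:
        return False
    if node.get("gpu_count", 0) < resource_config.get("gpu_count", 1):
        return False
    for minimum_key, capacity_key in MINIMUM_TO_CAPACITY.items():
        if node.get(capacity_key, 0) < resource_config.get(minimum_key, 0):
            return False
    return True


node = {
    "gpu_type": "A100-80GB",
    "gpu_count": 8,
    "gpu_memory_mb": 81920,
    "cpu_cores": 64,
    "ram_mb": 512000,
    "storage_mb": 2000000,
}
request = {"gpu_count": 4, "gpu_type": "A100-80GB", "gpu_memory_min_mb": 40000}
fits = node_satisfies(node, request)
```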

`output_config`

Where to put the result, how to checkpoint, optional Hub push.

| Field | Notes |
| --- | --- |
| `output_model_name` | Logical name for the artefact. |
| `output_path` | Coordinator-resolvable path for the saved model. |
| `checkpoint.save_strategy` | `steps` (default) or `epoch`. |
| `checkpoint.save_steps` | Default 500. |
| `checkpoint.save_total_limit` | Default 3. |
| `checkpoint.metric_for_best_model`, `greater_is_better` | For best-checkpoint tracking. |
| `checkpoint.save_on_each_node` | Distributed-only. |
| `checkpoint.resume_from_best` | Whether `resume_from_checkpoint` should snap to the best. |
| `push_to_hub`, `hub_repo_id`, `hub_token` | HuggingFace Hub push on success. |
| `export_format`, `merge_lora` | Post-training export controls. |
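To get a feel for how `save_steps` and `save_total_limit` interact, the sketch below estimates which step-based checkpoints would still be on disk at a given point in a run. It assumes the oldest checkpoints are rotated out first and ignores any pinning of the best checkpoint; the actual retention policy is up to the coordinator and framework.

```python
def retained_checkpoint_steps(total_steps, save_steps=500, save_total_limit=3):
    """Estimate retained checkpoint steps under oldest-first rotation (assumed)."""
    if save_steps <= 0:
        raise ValueError("save_steps must be positive")
    if save_total_limit < 1:
        raise ValueError("save_total_limit must be at least 1")
    # One checkpoint at every multiple of save_steps up to total_steps.
    saved = list(range(save_steps, total_steps + 1, save_steps))
    # Keep only the most recent save_total_limit checkpoints.
    return saved[-save_total_limit:]


after_2600 = retained_checkpoint_steps(2600)  # defaults: every 500, keep 3
after_900 = retained_checkpoint_steps(900)
```

With the defaults, a run that has reached step 2600 would have written five checkpoints and, under this assumption, kept the last three.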

`scheduling_config`

How the queue scheduler should treat this job.

| Field | Notes |
| --- | --- |
| `queue` | Queue name; default `default`. |
| `priority` | 0-10, default 5. |
| `preemptible` | If true, higher-priority jobs may preempt this one. |
| `max_runtime_seconds` | Hard timeout. |
| `required_capabilities` | List of node-capability tags this job needs. |
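A small client-side builder can catch out-of-range values before a job is submitted. The sketch below applies the documented defaults for `queue` and `priority`; the helper name, the inclusive 0 to 10 range, and `preemptible` defaulting to false are assumptions.

```python
def make_scheduling_config(queue="default", priority=5, preemptible=False,
                           max_runtime_seconds=None, required_capabilities=()):
    """Build a scheduling_config dict with basic validation (hypothetical helper)."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise ValueError("priority must be an integer")
    if not 0 <= priority <= 10:  # assumed inclusive range
        raise ValueError("priority must be between 0 and 10")
    config = {"queue": queue, "priority": priority, "preemptible": bool(preemptible)}
    if max_runtime_seconds is not None:
        if max_runtime_seconds <= 0:
            raise ValueError("max_runtime_seconds must be positive")
        config["max_runtime_seconds"] = int(max_runtime_seconds)
    if required_capabilities:
        config["required_capabilities"] = list(required_capabilities)
    return config


default_config = make_scheduling_config()
sweep_config = make_scheduling_config(
    priority=3, preemptible=True, max_runtime_seconds=6 * 3600
)
```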

Lifecycle

The JobStatus enum is the source of truth:

PENDING → QUEUED → SCHEDULING → PREPARING → TRAINING → CHECKPOINTING → EVALUATING → EXPORTING → COMPLETED

Side branches:

PAUSED: entered from TRAINING via POST /jobs/{id}/pause. POST /jobs/{id}/resume re-queues the job, so it passes back through QUEUED and SCHEDULING before it reaches TRAINING again.
CANCELLED: terminal state from POST /jobs/{id}/cancel. Reachable from any non-terminal state.
FAILED: terminal; the error is captured in error_message / error_type.
PREEMPTED: the coordinator reclaimed the GPUs; retryable.

COMPLETED, FAILED, CANCELLED, and PREEMPTED are terminal. CHECKPOINTING, EVALUATING, and EXPORTING are transient phases inside the broader "running" lifetime.
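The lifecycle can be modelled client-side as a small state machine, which is handy for polling loops and for rejecting controls that cannot apply. The sketch below is a simplification under stated assumptions: it treats the main path as strictly linear (in practice CHECKPOINTING probably recurs during training), assumes FAILED and PREEMPTED can be reached from any active state, and follows the Resume description in sending a paused job back to QUEUED. The names `MAIN_PATH`, `TERMINAL`, and `can_transition` are hypothetical.

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "PENDING"
    QUEUED = "QUEUED"
    SCHEDULING = "SCHEDULING"
    PREPARING = "PREPARING"
    TRAINING = "TRAINING"
    CHECKPOINTING = "CHECKPOINTING"
    EVALUATING = "EVALUATING"
    EXPORTING = "EXPORTING"
    COMPLETED = "COMPLETED"
    PAUSED = "PAUSED"
    CANCELLED = "CANCELLED"
    FAILED = "FAILED"
    PREEMPTED = "PREEMPTED"


S = JobStatus
MAIN_PATH = [S.PENDING, S.QUEUED, S.SCHEDULING, S.PREPARING, S.TRAINING,
             S.CHECKPOINTING, S.EVALUATING, S.EXPORTING, S.COMPLETED]
TERMINAL = {S.COMPLETED, S.FAILED, S.CANCELLED, S.PREEMPTED}


def can_transition(current, target):
    """Return True if the simplified model allows current -> target."""
    if current in TERMINAL:
        return False  # terminal states have no outgoing transitions
    if target is S.CANCELLED:
        return True  # reachable from any non-terminal state
    if target is S.PAUSED:
        return current is S.TRAINING
    if current is S.PAUSED:
        return target is S.QUEUED  # resume re-queues the job
    if target in (S.FAILED, S.PREEMPTED):
        return True  # assumption: possible from any active state
    # Otherwise only a single forward step along the main path is allowed.
    return MAIN_PATH.index(target) == MAIN_PATH.index(current) + 1
```

For example, `can_transition(JobStatus.TRAINING, JobStatus.PAUSED)` holds, while pausing a job that is still QUEUED does not.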

Lifecycle controls

Cancel

POST /jobs/{job_id}/cancel, body { "reason": "..." }. Coordinator stops the run, releases GPUs, marks CANCELLED. Whether a checkpoint that was partially written at that moment is retained depends on the coordinator implementation.

Pause

POST /jobs/{job_id}/pause, body { "save_checkpoint": true }. Saves a checkpoint by default, releases GPUs, marks PAUSED. Useful when you need the hardware for a higher-priority job without losing progress.

Resume

POST /jobs/{job_id}/resume, body { "checkpoint_id": "..." }. Empty checkpoint_id resumes from the most recent. Re-queues the job; status walks QUEUED → SCHEDULING → ... again.
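The cancel, pause, and resume calls share the same shape, so a single helper can build all three. The sketch below only constructs the requests with the standard library and does not send them. The base URL is a placeholder, authentication headers are omitted, and the helper name is hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "https://scaimind.example.com/api"  # placeholder, not a real endpoint
CONTROL_ACTIONS = {"cancel", "pause", "resume"}


def control_request(job_id, action, body=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for a lifecycle control."""
    if action not in CONTROL_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    url = f"{base_url}/jobs/{quote(job_id, safe='')}/{action}"
    data = json.dumps(body or {}).encode("utf-8")
    return Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


cancel = control_request("job-123", "cancel", {"reason": "wrong dataset"})
pause = control_request("job-123", "pause", {"save_checkpoint": True})
# An empty checkpoint_id asks for the most recent checkpoint.
resume = control_request("job-123", "resume", {"checkpoint_id": ""})
```

Passing one of these objects to `urllib.request.urlopen` would issue the call; in practice you would also add whatever authentication your deployment requires.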

Retry

POST /jobs/{job_id}/retry, body:

```json
{
  "checkpoint_id": "",
  "modify_resources": false,
  "new_resource_config": null
}
```

Creates a child job (parent_job_id set on the child) starting from the given checkpoint. The child keeps the original resource_config unless modify_resources is true and new_resource_config is supplied. Use this when a job failed for a recoverable reason (an OOM at a larger batch size, a node that went away) and you want to try again, optionally with adjusted resources.
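Since `modify_resources` and `new_resource_config` have to agree, it can help to derive one from the other when building the retry body. A minimal sketch, with a hypothetical helper name and checkpoint id:

```python
def retry_body(checkpoint_id="", new_resource_config=None):
    """Build a retry request body; modify_resources follows new_resource_config."""
    return {
        "checkpoint_id": checkpoint_id,
        # Only flag a resource change when a replacement config is supplied.
        "modify_resources": new_resource_config is not None,
        "new_resource_config": new_resource_config,
    }


same_resources = retry_body()
bigger_gpus = retry_body(
    checkpoint_id="ckpt-1500",  # hypothetical checkpoint id
    new_resource_config={"gpu_count": 2, "gpu_type": "H100"},
)
```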

Checkpoints

Checkpoints are owned by the coordinator. ScaiMind does not expose a "list checkpoints" endpoint; checkpoint ids surface via the job detail response, the metrics response, and any pause/resume/retry calls you make. The output_config.checkpoint block controls cadence and retention.

To restart a job from a checkpoint, set training_config.resume_from_checkpoint to its id on a fresh POST /jobs, or pass the id as checkpoint_id in the body of POST /jobs/{id}/resume or POST /jobs/{id}/retry.

Labels and queries

labels is a free-form dict[str,str]. The coordinator indexes labels and accepts them as filters, for example GET /jobs?labels.team=support (filtering is handled coordinator-side; see the API reference for current support). By convention, labels carry team, experiment, dataset version, and model family: anything you would want to slice the dashboard by.
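A label filter is an ordinary query string, so it can be built with the standard library. The sketch below follows the `labels.<key>=<value>` pattern from the example above. Combining several labels in one query is an assumption, so check the API reference before relying on it; the helper name is hypothetical.

```python
from urllib.parse import urlencode


def label_query(labels):
    """Build a /jobs path with one labels.<key> parameter per label."""
    # Sort keys so the resulting URL is stable across runs.
    params = {f"labels.{key}": value for key, value in sorted(labels.items())}
    return "/jobs?" + urlencode(params)


by_team = label_query({"team": "support"})
by_team_and_experiment = label_query({"team": "support", "experiment": "lr-sweep"})
```

`urlencode` also percent-encodes values that contain spaces or reserved characters, which matters once label values stop being simple slugs.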