Training jobs
A job is the unit of work in ScaiMind. Everything else — metrics, logs, evaluations, queue position — is a property of a job. Understanding how one is shaped, how it moves through its lifecycle, and which controls are available is most of what you need.
The submission shape
A SubmitJobRequest has a top-level name, optional labels, and five nested configs: training_config, data_config, resource_config, output_config, and scheduling_config.
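As a rough illustration of that shape, the sketch below builds a request body as a plain Python dict. Only the field names are taken from the tables on this page; every value (model id, paths, names, label values) is a placeholder, and whether the API accepts exactly this JSON layout is an assumption.

```python
import json

# Hypothetical values throughout; only the field names come from this page.
submit_job_request = {
    "name": "support-bot-lora",
    "labels": {"team": "support"},
    "training_config": {
        "training_type": "LORA",
        "base_model": {"model_id": "example-org/example-model"},
        "framework": {"type": "HF_TRAINER"},
        # hyperparameters is dict[str,str], so numeric values are passed as strings
        "hyperparameters": {"learning_rate": "2e-4"},
    },
    "data_config": {
        "sources": [{"path": "scaidrive://example/train.jsonl"}],
        "max_seq_length": 2048,
        "batch_size": 8,
    },
    "resource_config": {"gpu_count": 1, "gpu_type": "A100-80GB"},
    "output_config": {"output_model_name": "support-bot-lora-v1"},
    "scheduling_config": {"queue": "default", "priority": 5},
}

# Serialise to the JSON body that would be sent with POST /jobs.
body = json.dumps(submit_job_request, indent=2)
```

Fields left out here are expected to fall back to the defaults listed in the sections that follow.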
training_config
Defines what model is being trained and how.
| Field | Notes |
|---|---|
| training_type | One of SFT, LORA, QLORA, DPO, RLHF, CONTINUED_PRETRAIN, FULL_FINETUNE. |
| base_model.model_id | Required. Identifier the coordinator can resolve (HuggingFace ref, ScaiAtlas id, etc.). |
| base_model.revision, source, tokenizer_id, dtype, trust_remote_code, model_kwargs | Optional details. |
| framework.type | One of HF_TRAINER, DEEPSPEED, FSDP, MEGATRON, CUSTOM. |
| framework.config | Free-form dict[str,str] of framework-specific options. |
| framework.distributed | world_size, backend (default nccl), strategy, config. |
| framework.custom_script_path | When type is CUSTOM, path to the user-supplied training script. |
| hyperparameters | dict[str,str]; coordinator parses by framework convention. |
| environment | dict[str,str] of env vars to set on the training process. |
| resume_from_checkpoint | Checkpoint id to resume from. |
| max_retries | Per-job retry cap for transient failures. |
| priority | 0-10, default 5. |
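A client-side pre-flight check can catch obvious mistakes before a job is submitted. The sketch below is one possible approach, not the coordinator's own validation: the function name is hypothetical, and treating custom_script_path as mandatory for CUSTOM frameworks is an assumption drawn from the table's wording.

```python
TRAINING_TYPES = {"SFT", "LORA", "QLORA", "DPO", "RLHF",
                  "CONTINUED_PRETRAIN", "FULL_FINETUNE"}
FRAMEWORK_TYPES = {"HF_TRAINER", "DEEPSPEED", "FSDP", "MEGATRON", "CUSTOM"}


def check_training_config(cfg: dict) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if cfg.get("training_type") not in TRAINING_TYPES:
        problems.append("training_type is not a known value")
    if not cfg.get("base_model", {}).get("model_id"):
        problems.append("base_model.model_id is required")
    framework = cfg.get("framework", {})
    if framework.get("type") not in FRAMEWORK_TYPES:
        problems.append("framework.type is not a known value")
    # Assumption: a CUSTOM framework is unusable without a script path.
    if framework.get("type") == "CUSTOM" and not framework.get("custom_script_path"):
        problems.append("framework.custom_script_path is expected when type is CUSTOM")
    # priority falls back to the documented default of 5 when omitted.
    if not 0 <= cfg.get("priority", 5) <= 10:
        problems.append("priority must be between 0 and 10")
    return problems
```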
data_config
Where the training data lives and how it should be loaded.
| Field | Notes |
|---|---|
| sources | List of DataSource entries. Each has a path, optional type/split/format, column projection, filters, sampling. |
| preprocess | Tokenizer config, BOS/EOS handling, padding, truncation, chat template, field mapping. |
| max_seq_length | Token cap per sample. |
| batch_size | Per-step batch size. |
| gradient_accumulation_steps | Default 1. |
| num_workers, pin_memory | DataLoader knobs. |
| validation_split | 0.0-1.0; fraction reserved for validation. |
| seed | Determinism seed. |
Paths use protocols the coordinator understands. With token exchange enabled, scaidrive://... paths are dereferenced using the forwarded ScaiDrive token; scaiatlas://... paths use the forwarded ScaiAtlas token.
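To make the scheme-to-token relationship concrete, the following sketch inspects a source path and reports which forwarded token it would depend on. The helper name and the returned labels are illustrative only; how the coordinator actually resolves paths is not specified here.

```python
from urllib.parse import urlparse

# Mapping taken from the paragraph above; the values are human-readable labels,
# not real configuration keys.
TOKEN_FOR_SCHEME = {
    "scaidrive": "ScaiDrive token",
    "scaiatlas": "ScaiAtlas token",
}


def token_needed(path: str):
    """Return the forwarded token a source path relies on, or None if no token applies."""
    scheme = urlparse(path).scheme
    return TOKEN_FOR_SCHEME.get(scheme)
```

A check like this can be run over data_config.sources before submission to confirm that token exchange is enabled for every scheme in use.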
resource_config
GPU and host requirements.
| Field | Notes |
|---|---|
| gpu_count | Required; default 1. |
| gpu_type | Free string (e.g. A100-80GB, H100). Coordinator matches against node inventory. |
| gpu_memory_min_mb, cpu_cores, ram_min_mb, storage_min_mb | Minimums. |
| node_affinity, node_anti_affinity | Lists of node labels to prefer or avoid. |
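The matching of a resource_config against node inventory can be pictured roughly as follows. This is a guess at the general idea, not the coordinator's algorithm: the node dictionary keys (free_gpus, gpu_memory_mb, ram_mb, storage_mb) are hypothetical, gpu_type is compared by exact string equality as an assumption, and affinity lists are ignored.

```python
# resource_config minimum field -> hypothetical node inventory key
MINIMUMS = {
    "gpu_memory_min_mb": "gpu_memory_mb",
    "cpu_cores": "cpu_cores",
    "ram_min_mb": "ram_mb",
    "storage_min_mb": "storage_mb",
}


def node_satisfies(node: dict, resource_config: dict) -> bool:
    """Return True if the node meets every requirement that the config states."""
    if node.get("free_gpus", 0) < resource_config.get("gpu_count", 1):
        return False
    wanted_type = resource_config.get("gpu_type")
    if wanted_type and node.get("gpu_type") != wanted_type:
        return False
    # Unset minimums default to 0, so they never exclude a node.
    for config_field, node_key in MINIMUMS.items():
        if node.get(node_key, 0) < resource_config.get(config_field, 0):
            return False
    return True
```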
output_config
Where to put the result, how to checkpoint, optional Hub push.
| Field | Notes |
|---|---|
| output_model_name | Logical name for the artefact. |
| output_path | Coordinator-resolvable path for the saved model. |
| checkpoint.save_strategy | steps (default) or epoch. |
| checkpoint.save_steps | Default 500. |
| checkpoint.save_total_limit | Default 3. |
| checkpoint.metric_for_best_model, greater_is_better | For best-checkpoint tracking. |
| checkpoint.save_on_each_node | Distributed-only. |
| checkpoint.resume_from_best | Whether resume_from_checkpoint should snap to the best. |
| push_to_hub, hub_repo_id, hub_token | HuggingFace Hub push on success. |
| export_format, merge_lora | Post-training export controls. |
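The interaction between save_steps and save_total_limit is easiest to see with numbers. The sketch below assumes the steps strategy and assumes that the oldest checkpoints are discarded first once the limit is reached; it ignores best-checkpoint tracking, which may keep an additional checkpoint alive. The function name is hypothetical.

```python
def retained_checkpoints(total_steps: int, save_steps: int = 500,
                         save_total_limit: int = 3) -> list[int]:
    """Return the step numbers of checkpoints still on disk after total_steps."""
    if save_steps < 1 or save_total_limit < 1:
        raise ValueError("save_steps and save_total_limit must be at least 1")
    # A checkpoint is written every save_steps steps.
    saved = list(range(save_steps, total_steps + 1, save_steps))
    # Assumption: only the newest save_total_limit checkpoints are kept.
    return saved[-save_total_limit:]
```

With the defaults, a run that has reached step 2600 would have written checkpoints at steps 500 through 2500 and kept the last three: 1500, 2000, and 2500.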
scheduling_config
How the queue scheduler should treat this job.
| Field | Notes |
|---|---|
| queue | Queue name; default default. |
| priority | 0-10, default 5. |
| preemptible | If true, higher-priority jobs may preempt this one. |
| max_runtime_seconds | Hard timeout. |
| required_capabilities | List of node-capability tags this job needs. |
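One way to keep these defaults in a single place on the client side is a small dataclass. This is a sketch, not part of any ScaiMind SDK: the class is hypothetical, and the defaults for preemptible, max_runtime_seconds, and required_capabilities are assumptions because this page does not state them.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SchedulingConfig:
    queue: str = "default"                     # documented default
    priority: int = 5                          # documented default, range 0-10
    preemptible: bool = False                  # assumption: default not documented
    max_runtime_seconds: Optional[int] = None  # assumption: None means no timeout set
    required_capabilities: list = field(default_factory=list)

    def __post_init__(self):
        if not 0 <= self.priority <= 10:
            raise ValueError("priority must be between 0 and 10")


# asdict() produces the plain dict that would go under scheduling_config.
payload = asdict(SchedulingConfig(priority=8, preemptible=True))
```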
Lifecycle
The JobStatus enum is the source of truth:
PENDING → QUEUED → SCHEDULING → PREPARING → TRAINING → CHECKPOINTING → EVALUATING → EXPORTING → COMPLETED
Side branches:
- PAUSED: entered from TRAINING via POST /jobs/{id}/pause. POST /jobs/{id}/resume re-queues the job so that it re-enters TRAINING.
- CANCELLED: terminal state from POST /jobs/{id}/cancel. Reachable from any non-terminal state.
- FAILED: terminal, error captured in error_message / error_type.
- PREEMPTED: coordinator reclaimed the GPUs; retryable.
COMPLETED, FAILED, CANCELLED, and PREEMPTED are terminal. CHECKPOINTING, EVALUATING, and EXPORTING are transient phases inside the broader "running" lifetime.
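The states and rules above can be mirrored in client code roughly as follows. The member names come from this page; using the same strings as wire values is an assumption, and the helper functions are hypothetical conveniences rather than part of the API.

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "PENDING"
    QUEUED = "QUEUED"
    SCHEDULING = "SCHEDULING"
    PREPARING = "PREPARING"
    TRAINING = "TRAINING"
    CHECKPOINTING = "CHECKPOINTING"
    EVALUATING = "EVALUATING"
    EXPORTING = "EXPORTING"
    COMPLETED = "COMPLETED"
    PAUSED = "PAUSED"
    CANCELLED = "CANCELLED"
    FAILED = "FAILED"
    PREEMPTED = "PREEMPTED"


TERMINAL = {JobStatus.COMPLETED, JobStatus.FAILED,
            JobStatus.CANCELLED, JobStatus.PREEMPTED}


def is_terminal(status: JobStatus) -> bool:
    return status in TERMINAL


def can_cancel(status: JobStatus) -> bool:
    # CANCELLED is reachable from any non-terminal state.
    return not is_terminal(status)


def can_pause(status: JobStatus) -> bool:
    # PAUSED is only entered from TRAINING.
    return status is JobStatus.TRAINING
```

A polling loop can use is_terminal to decide when to stop watching a job.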
Lifecycle controls
Cancel
POST /jobs/{job_id}/cancel, body { "reason": "..." }. Coordinator stops the run, releases GPUs, marks CANCELLED. A partial checkpoint that was being written at the time may or may not be retained, depending on the coordinator implementation.
Pause
POST /jobs/{job_id}/pause, body { "save_checkpoint": true }. Saves a checkpoint by default, releases GPUs, marks PAUSED. Useful when you need the hardware for a higher-priority job without losing progress.
Resume
POST /jobs/{job_id}/resume, body { "checkpoint_id": "..." }. Empty checkpoint_id resumes from the most recent. Re-queues the job; status walks QUEUED → SCHEDULING → ... again.
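The cancel, pause, and resume calls share the same shape, so a single helper can build all three. The sketch below only constructs the requests with the standard library and does not send them. The base URL is a placeholder, the JSON content type is an assumption, and authentication headers are omitted because they are not described on this page.

```python
import json
from urllib.request import Request

BASE_URL = "https://scaimind.example.com"  # hypothetical host


def lifecycle_request(job_id: str, action: str, body: dict) -> Request:
    """Build (but do not send) a POST request for a lifecycle control."""
    if action not in {"cancel", "pause", "resume"}:
        raise ValueError(f"unsupported action: {action}")
    return Request(
        url=f"{BASE_URL}/jobs/{job_id}/{action}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


pause = lifecycle_request("job-123", "pause", {"save_checkpoint": True})
# An empty checkpoint_id asks for the most recent checkpoint.
resume = lifecycle_request("job-123", "resume", {"checkpoint_id": ""})
```

Passing either object to urllib.request.urlopen would perform the call against a real deployment.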
Retry
POST /jobs/{job_id}/retry, with a JSON body that carries the checkpoint to start from and any resource overrides.
Creates a child job (parent_job_id set on the child) starting from the given checkpoint, with the original resource_config unless modify_resources: true and new_resource_config is supplied. Use this when a job failed for a transient reason (OOM at higher batch, node went away) and you want to try again with adjusted resources.
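A retry body along those lines might be assembled like this. modify_resources and new_resource_config are named in the text above; the checkpoint_id key is an assumption borrowed from the resume call, and the helper itself is hypothetical.

```python
from typing import Optional


def retry_body(checkpoint_id: str,
               new_resource_config: Optional[dict] = None) -> dict:
    """Build a retry request body, overriding resources only when asked to."""
    body = {"checkpoint_id": checkpoint_id}  # assumed field name
    if new_resource_config is None:
        # Keep the original resource_config of the parent job.
        body["modify_resources"] = False
    else:
        body["modify_resources"] = True
        body["new_resource_config"] = new_resource_config
    return body


# Example: retry an out-of-memory failure on larger GPUs (placeholder values).
bigger = retry_body("ckpt-0042", {"gpu_count": 2, "gpu_type": "A100-80GB"})
```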
Checkpoints
Checkpoints are owned by the coordinator. ScaiMind does not expose a "list checkpoints" endpoint; checkpoint ids surface via the job detail response, the metrics response, and any pause/resume/retry calls you make. The output_config.checkpoint block controls cadence and retention.
To restart a job from a checkpoint, supply the id in training_config.resume_from_checkpoint on a fresh POST /jobs, or on POST /jobs/{id}/resume / POST /jobs/{id}/retry.
Labels and queries
labels is a free-form dict[str,str]. The coordinator indexes labels and supports them as filters, for example GET /jobs?labels.team=support (filtering is handled coordinator-side; see the API reference for current support). The convention is to use labels for team, experiment, dataset version, and model family: anything you would want to slice the dashboard by.
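Building the label filter by hand invites quoting mistakes, so a small helper is sketched below. The labels.<key> parameter form follows the example above; combining several labels in one query, and how the coordinator would interpret that, is an assumption.

```python
from urllib.parse import urlencode


def jobs_query(labels: dict) -> str:
    """Return the path and query string for listing jobs filtered by labels."""
    # Sort the keys so the same labels always produce the same URL.
    params = {f"labels.{key}": value for key, value in sorted(labels.items())}
    return "/jobs?" + urlencode(params)
```

For a single label, jobs_query({"team": "support"}) yields /jobs?labels.team=support, matching the form shown above.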