Training jobs

A job is the unit of work in ScaiMind. Everything else — metrics, logs, evaluations, queue position — is a property of a job. Understanding how one is shaped, how it moves through its lifecycle, and which controls are available is most of what you need.

The submission shape

A SubmitJobRequest has a top-level name, optional labels, and five nested configs:

```json
{
  "name": "support-lora-v3",
  "training_config": { ... },
  "data_config": { ... },
  "resource_config": { ... },
  "output_config": { ... },
  "scheduling_config": { ... },
  "labels": {"team": "support", "experiment": "lr-sweep"}
}
```
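As a rough illustration, the request body can be assembled as a plain dictionary and serialised to JSON. The sketch below uses only field names that appear on this page. The concrete values (model id, data path) are made-up placeholders, and `build_submit_request` is a hypothetical helper, not part of any ScaiMind SDK. It also assumes all five nested configs are supplied, which this page does not state explicitly.

```python
import json

# The five nested configs named in the submission shape.
NESTED_CONFIGS = (
    "training_config",
    "data_config",
    "resource_config",
    "output_config",
    "scheduling_config",
)


def build_submit_request(name, configs, labels=None):
    """Assemble a SubmitJobRequest-shaped dict (hypothetical helper)."""
    # Assumption: every nested config is expected to be present.
    missing = [key for key in NESTED_CONFIGS if key not in configs]
    if missing:
        raise ValueError(f"missing nested configs: {missing}")
    body = {"name": name}
    for key in NESTED_CONFIGS:
        body[key] = configs[key]
    if labels:
        body["labels"] = dict(labels)  # labels are optional
    return body


request_body = build_submit_request(
    "support-lora-v3",
    {
        # Placeholder values for illustration only.
        "training_config": {
            "training_type": "LORA",
            "base_model": {"model_id": "example-org/example-7b"},
        },
        "data_config": {"sources": [{"path": "scaidrive://example/train.jsonl"}]},
        "resource_config": {"gpu_count": 1},
        "output_config": {"output_model_name": "support-lora-v3"},
        "scheduling_config": {"queue": "default"},
    },
    labels={"team": "support", "experiment": "lr-sweep"},
)
print(json.dumps(request_body, indent=2))
```

Keeping the body as a plain dict makes it easy to diff two submissions or to template a sweep by varying a single nested config.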

training_config

Defines what model is being trained and how.

| Field | Notes |
| --- | --- |
| training_type | One of SFT, LORA, QLORA, DPO, RLHF, CONTINUED_PRETRAIN, FULL_FINETUNE. |
| base_model.model_id | Required. Identifier the coordinator can resolve (HuggingFace ref, ScaiAtlas id, etc.). |
| base_model.revision, source, tokenizer_id, dtype, trust_remote_code, model_kwargs | Optional details. |
| framework.type | One of HF_TRAINER, DEEPSPEED, FSDP, MEGATRON, CUSTOM. |
| framework.config | Free-form dict[str,str] of framework-specific options. |
| framework.distributed | world_size, backend (default nccl), strategy, config. |
| framework.custom_script_path | When type is CUSTOM, path to the user-supplied training script. |
| hyperparameters | dict[str,str]; the coordinator parses values by framework convention. |
| environment | dict[str,str] of env vars to set on the training process. |
| resume_from_checkpoint | Checkpoint id to resume from. |
| max_retries | Per-job retry cap for transient failures. |
| priority | 0-10, default 5. |
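Because hyperparameters is typed dict[str,str], native Python values need to be stringified before submission. The sketch below shows one way to do that. The `stringify` helper is hypothetical, the lowercase "true"/"false" encoding for booleans is an assumption, and the hyperparameter names follow Hugging Face Trainer conventions purely as an example; which names the coordinator accepts depends on the framework.

```python
TRAINING_TYPES = {
    "SFT", "LORA", "QLORA", "DPO", "RLHF", "CONTINUED_PRETRAIN", "FULL_FINETUNE",
}


def stringify(options):
    """Convert native values into the dict[str, str] shape (hypothetical helper)."""
    out = {}
    for key, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase strings.
            out[key] = "true" if value else "false"
        else:
            out[key] = str(value)
    return out


def make_training_config(training_type, model_id, hyperparameters):
    """Build a minimal training_config using field names from the table above."""
    if training_type not in TRAINING_TYPES:
        raise ValueError(f"unknown training_type: {training_type}")
    return {
        "training_type": training_type,
        "base_model": {"model_id": model_id},  # model_id is the required field
        "framework": {"type": "HF_TRAINER"},
        "hyperparameters": stringify(hyperparameters),
        "priority": 5,  # documented default
    }


training_config = make_training_config(
    "LORA",
    "example-org/example-7b",  # placeholder model id
    {"learning_rate": 2e-4, "num_train_epochs": 3, "bf16": True},
)
print(training_config["hyperparameters"])
```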

data_config

Where the training data lives and how it should be loaded.

| Field | Notes |
| --- | --- |
| sources | List of DataSource entries. Each has a path, optional type/split/format, column projection, filters, sampling. |
| preprocess | Tokenizer config, BOS/EOS handling, padding, truncation, chat template, field mapping. |
| max_seq_length | Token cap per sample. |
| batch_size | Per-step batch size. |
| gradient_accumulation_steps | Default 1. |
| num_workers, pin_memory | DataLoader knobs. |
| validation_split | 0.0-1.0; fraction reserved for validation. |
| seed | Determinism seed. |

Paths use protocols the coordinator understands. With token exchange enabled, scaidrive://... paths are dereferenced using the forwarded ScaiDrive token; scaiatlas://... paths use the forwarded ScaiAtlas token.
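The scheme-to-token mapping described above can be pictured as a small lookup. This is only a client-side illustration of the rule, not coordinator code; `forwarded_token_for` is a hypothetical name and the example paths are placeholders.

```python
from urllib.parse import urlparse

# Scheme -> which forwarded token is used to dereference the path,
# as described in the paragraph above.
TOKEN_FOR_SCHEME = {
    "scaidrive": "ScaiDrive",
    "scaiatlas": "ScaiAtlas",
}


def forwarded_token_for(path):
    """Return the name of the forwarded token a path would need, or None."""
    scheme = urlparse(path).scheme
    return TOKEN_FOR_SCHEME.get(scheme)


for example in (
    "scaidrive://example-share/train.jsonl",
    "scaiatlas://example-dataset",
    "/mnt/local/train.jsonl",
):
    print(example, "->", forwarded_token_for(example))
```

A check like this can be useful before submission to confirm that every source path uses a scheme for which a token will actually be forwarded.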

resource_config

GPU and host requirements.

| Field | Notes |
| --- | --- |
| gpu_count | Required; default 1. |
| gpu_type | Free string (e.g. A100-80GB, H100). Coordinator matches against node inventory. |
| gpu_memory_min_mb, cpu_cores, ram_min_mb, storage_min_mb | Minimums. |
| node_affinity, node_anti_affinity | Lists of node labels to prefer or avoid. |
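To make the matching idea concrete, here is a minimal sketch of how hard requirements could be checked against a node inventory. The actual coordinator logic is not documented on this page, so everything here is an assumption: the node record shape (`free_gpus`, `gpu_memory_mb`, and so on), exact string matching on gpu_type, and the `node_satisfies` helper itself. Affinity preferences are left out.

```python
# Maps each resource_config minimum to a field on the hypothetical node record.
MINIMUM_FIELDS = {
    "gpu_memory_min_mb": "gpu_memory_mb",
    "cpu_cores": "cpu_cores",
    "ram_min_mb": "ram_mb",
    "storage_min_mb": "storage_mb",
}


def node_satisfies(node, resource_config):
    """Return True if a hypothetical node record meets the hard requirements."""
    if node["free_gpus"] < resource_config.get("gpu_count", 1):
        return False
    wanted_type = resource_config.get("gpu_type")
    # Assumption: gpu_type is compared as an exact string.
    if wanted_type is not None and node["gpu_type"] != wanted_type:
        return False
    for field, node_key in MINIMUM_FIELDS.items():
        if node[node_key] < resource_config.get(field, 0):
            return False
    return True


# Made-up inventory for illustration.
nodes = [
    {"name": "node-a", "gpu_type": "A100-80GB", "free_gpus": 4,
     "gpu_memory_mb": 81920, "cpu_cores": 64, "ram_mb": 524288, "storage_mb": 2000000},
    {"name": "node-b", "gpu_type": "H100", "free_gpus": 8,
     "gpu_memory_mb": 81920, "cpu_cores": 96, "ram_mb": 1048576, "storage_mb": 4000000},
]

resource_config = {"gpu_count": 2, "gpu_type": "H100", "ram_min_mb": 262144}
candidates = [node["name"] for node in nodes if node_satisfies(node, resource_config)]
print(candidates)
```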

output_config

Where to put the result, how to checkpoint, optional Hub push.

| Field | Notes |
| --- | --- |
| output_model_name | Logical name for the artefact. |
| output_path | Coordinator-resolvable path for the saved model. |
| checkpoint.save_strategy | steps (default) or epoch. |
| checkpoint.save_steps | Default 500. |
| checkpoint.save_total_limit | Default 3. |
| checkpoint.metric_for_best_model, greater_is_better | For best-checkpoint tracking. |
| checkpoint.save_on_each_node | Distributed-only. |
| checkpoint.resume_from_best | Whether resume_from_checkpoint should snap to the best. |
| push_to_hub, hub_repo_id, hub_token | HuggingFace Hub push on success. |
| export_format, merge_lora | Post-training export controls. |
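The interaction between save_steps and save_total_limit is easiest to see with numbers. The sketch below assumes the steps strategy and assumes that retention drops the oldest checkpoints first, which this page does not confirm. It also ignores any special handling that might keep the best checkpoint. `retained_checkpoint_steps` is a hypothetical helper.

```python
def retained_checkpoint_steps(total_steps, save_steps=500, save_total_limit=3):
    """Return the step numbers of checkpoints still on disk after total_steps.

    Assumes oldest-first rotation; defaults mirror the documented ones.
    """
    if save_steps < 1 or save_total_limit < 1:
        raise ValueError("save_steps and save_total_limit must be at least 1")
    saved = list(range(save_steps, total_steps + 1, save_steps))
    return saved[-save_total_limit:]


# With the defaults, a run at step 2600 has written checkpoints at
# 500, 1000, 1500, 2000 and 2500, and keeps only the newest three.
print(retained_checkpoint_steps(2600))
```

Under these assumptions, at most save_steps × save_total_limit steps of history are recoverable at any point, which is worth keeping in mind when choosing a checkpoint to resume or retry from.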

scheduling_config

How the queue scheduler should treat this job.

| Field | Notes |
| --- | --- |
| queue | Queue name; default default. |
| priority | 0-10, default 5. |
| preemptible | If true, higher-priority jobs may preempt this one. |
| max_runtime_seconds | Hard timeout. |
| required_capabilities | List of node-capability tags this job needs. |
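A client can apply the documented defaults and range check locally before submitting. This is a minimal sketch under the assumption that the server applies the same defaults. `normalise_scheduling` is a hypothetical helper, and the positivity check on max_runtime_seconds is an assumption rather than a documented rule.

```python
# Defaults taken from the table above.
SCHEDULING_DEFAULTS = {"queue": "default", "priority": 5}


def normalise_scheduling(config):
    """Fill in documented defaults and validate the priority range."""
    merged = {**SCHEDULING_DEFAULTS, **config}
    priority = merged["priority"]
    if not isinstance(priority, int) or not 0 <= priority <= 10:
        raise ValueError(f"priority must be an integer in 0-10, got {priority!r}")
    max_runtime = merged.get("max_runtime_seconds")
    # Assumption: a hard timeout only makes sense as a positive number.
    if max_runtime is not None and max_runtime <= 0:
        raise ValueError("max_runtime_seconds must be positive")
    return merged


scheduling_config = normalise_scheduling(
    {"preemptible": True, "max_runtime_seconds": 6 * 60 * 60}
)
print(scheduling_config)
```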

Lifecycle

The JobStatus enum is the source of truth:

PENDING → QUEUED → SCHEDULING → PREPARING → TRAINING → CHECKPOINTING → EVALUATING → EXPORTING → COMPLETED

Side branches:

  • PAUSED: entered from TRAINING via POST /jobs/{id}/pause. Returns to TRAINING via POST /jobs/{id}/resume, after passing through the queue again.
  • CANCELLED: terminal state from POST /jobs/{id}/cancel. Reachable from any non-terminal state.
  • FAILED: terminal; the error is captured in error_message / error_type.
  • PREEMPTED: the coordinator reclaimed the GPUs; retryable.

COMPLETED, FAILED, CANCELLED, and PREEMPTED are terminal. CHECKPOINTING, EVALUATING, and EXPORTING are transient phases inside the broader "running" lifetime.
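The states and side branches above can be modelled client-side as a small enum plus a helper that reports which controls apply. This is a sketch built only from the rules stated on this page. The string values of the enum members are an assumption about the wire format, and `allowed_controls` is a hypothetical helper.

```python
from enum import Enum


class JobStatus(Enum):
    # Main path, in order.
    PENDING = "PENDING"
    QUEUED = "QUEUED"
    SCHEDULING = "SCHEDULING"
    PREPARING = "PREPARING"
    TRAINING = "TRAINING"
    CHECKPOINTING = "CHECKPOINTING"
    EVALUATING = "EVALUATING"
    EXPORTING = "EXPORTING"
    COMPLETED = "COMPLETED"
    # Side branches.
    PAUSED = "PAUSED"
    CANCELLED = "CANCELLED"
    FAILED = "FAILED"
    PREEMPTED = "PREEMPTED"


TERMINAL = frozenset(
    {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED, JobStatus.PREEMPTED}
)


def is_terminal(status):
    """True once a job can no longer change state on its own."""
    return status in TERMINAL


def allowed_controls(status):
    """List the lifecycle controls that apply in a given state."""
    controls = []
    if status not in TERMINAL:
        controls.append("cancel")  # reachable from any non-terminal state
    if status is JobStatus.TRAINING:
        controls.append("pause")  # PAUSED is entered from TRAINING
    if status is JobStatus.PAUSED:
        controls.append("resume")
    return controls


print(allowed_controls(JobStatus.TRAINING))
print(allowed_controls(JobStatus.PAUSED))
print(is_terminal(JobStatus.PREEMPTED))
```

A polling loop can use `is_terminal` as its stop condition, treating CHECKPOINTING, EVALUATING, and EXPORTING as still running.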

Lifecycle controls

Cancel

POST /jobs/{job_id}/cancel, body { "reason": "..." }. The coordinator stops the run, releases the GPUs, and marks the job CANCELLED. A partial checkpoint that was being written may or may not be retained, depending on the coordinator implementation.

Pause

POST /jobs/{job_id}/pause, body { "save_checkpoint": true }. Saves a checkpoint by default, releases GPUs, marks PAUSED. Useful when you need the hardware for a higher-priority job without losing progress.

Resume

POST /jobs/{job_id}/resume, body { "checkpoint_id": "..." }. Empty checkpoint_id resumes from the most recent. Re-queues the job; status walks QUEUED → SCHEDULING → ... again.
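The cancel, pause, and resume calls share one shape, so a client can build them with a single helper. The sketch below constructs the requests with the standard library and does not send them. The base URL is a placeholder, the job id is made up, `control_request` is a hypothetical helper, and authentication is omitted because it is not covered on this page.

```python
import json
from urllib.request import Request

BASE_URL = "https://scaimind.example"  # placeholder, not a real endpoint

CONTROL_ACTIONS = {"cancel", "pause", "resume"}


def control_request(job_id, action, body):
    """Build (but do not send) a POST /jobs/{job_id}/{action} request."""
    if action not in CONTROL_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return Request(
        url=f"{BASE_URL}/jobs/{job_id}/{action}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Pause with a checkpoint, then resume from the most recent checkpoint
# by sending an empty checkpoint_id.
pause = control_request("job-123", "pause", {"save_checkpoint": True})
resume = control_request("job-123", "resume", {"checkpoint_id": ""})
cancel = control_request("job-123", "cancel", {"reason": "superseded by v4"})

for request in (pause, resume, cancel):
    print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

Passing one of these objects to `urllib.request.urlopen` would perform the call; that step is left out so the example runs without network access.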

Retry

POST /jobs/{job_id}/retry, body:

```json
{
  "checkpoint_id": "",
  "modify_resources": false,
  "new_resource_config": null
}
```

Creates a child job (parent_job_id set on the child) starting from the given checkpoint, with the original resource_config unless modify_resources: true and new_resource_config is supplied. Use this when a job failed for a transient reason (OOM at higher batch, node went away) and you want to try again with adjusted resources.
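The coupling between modify_resources and new_resource_config can be enforced when building the body. In this sketch, `retry_body` is a hypothetical helper that sets modify_resources only when a replacement config is supplied. The checkpoint id and resource values are placeholders, and what an empty checkpoint_id means for retry is not stated on this page.

```python
import json


def retry_body(checkpoint_id="", new_resource_config=None):
    """Build a retry request body with modify_resources kept consistent."""
    return {
        "checkpoint_id": checkpoint_id,
        # True only when a replacement resource_config is actually provided.
        "modify_resources": new_resource_config is not None,
        "new_resource_config": new_resource_config,
    }


# Retry with the original resources.
same_resources = retry_body(checkpoint_id="ckpt-example")

# Retry after an OOM, asking for a GPU type with more memory (placeholder values).
bigger_gpu = retry_body(
    checkpoint_id="ckpt-example",
    new_resource_config={"gpu_count": 1, "gpu_type": "A100-80GB"},
)

print(json.dumps(same_resources))
print(json.dumps(bigger_gpu))
```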

Checkpoints

Checkpoints are owned by the coordinator. ScaiMind does not expose a "list checkpoints" endpoint; checkpoint ids surface via the job detail response, the metrics response, and any pause/resume/retry calls you make. The output_config.checkpoint block controls cadence and retention.

To restart a job from a checkpoint, supply the id in training_config.resume_from_checkpoint on a fresh POST /jobs, or on POST /jobs/{id}/resume / POST /jobs/{id}/retry.

Labels and queries

labels is a free-form dict[str,str]. The coordinator indexes labels and supports them as filters on the job list, for example GET /jobs?labels.team=support (filtering is handled coordinator-side; see the API reference for current support). The convention is to use labels for team, experiment, dataset version, and model family: anything you would want to slice the dashboard by.
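A label filter like the one above can be generated from a dict with the standard library. `jobs_query` is a hypothetical helper; combining several labels in one query is an assumption, so check the API reference before relying on it.

```python
from urllib.parse import urlencode


def jobs_query(labels):
    """Build a GET /jobs path that filters on the given labels."""
    # Sorting keeps the generated URL stable across runs.
    params = {f"labels.{key}": value for key, value in sorted(labels.items())}
    return "/jobs?" + urlencode(params)


print(jobs_query({"team": "support"}))
print(jobs_query({"team": "support", "experiment": "lr-sweep"}))
```

`urlencode` also takes care of escaping label values that contain spaces or other reserved characters.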

Updated 2026-05-18 15:01:31, rev 12