Changelog
User-visible changes only. Internal refactors and infrastructure work omitted.
v1.0 - Launch
First generally-available release.
- REST submission for seven training types: `SFT`, `LORA`, `QLORA`, `DPO`, `RLHF`, `CONTINUED_PRETRAIN`, `FULL_FINETUNE`.
- Five framework targets: `HF_TRAINER`, `DEEPSPEED`, `FSDP`, `MEGATRON`, `CUSTOM`.
- Full lifecycle: submit, cancel, pause, resume, retry (with optional resource modification).
- Point-in-time job metrics plus Server-Sent Event streams for live logs and metric snapshots.
- Cluster operations: status, node listing, node drain (force-able) and enable, queue depth.
- Evaluation submission and retrieval, with multiple named benchmarks per run.
- Data operations: pre-flight validation of training sources, coordinator-side dataset cache inspection.
- Local job-state cache (`mod_scaimind_jobs`) auto-synced on list calls for fast dashboard reads.
- Downstream token forwarding for ScaiDrive and ScaiAtlas, scoped per request.
- gRPC error sanitisation: `INTERNAL` and `UNKNOWN` coordinator details are replaced with friendly messages and captured in structured logs.
- Admin UI: Training Dashboard, Job Creator, Training Monitor, Evaluation Center, Hardware Monitor.
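
For readers wondering what the gRPC error sanitisation entry means in practice, here is a minimal sketch of the idea, not the shipped implementation. The function name, logger name, log field names, and message wording are all hypothetical, and passing other status codes through unchanged is an assumption made for illustration.

```python
import json
import logging

# Hypothetical logger name; the real one is not stated in this changelog.
logger = logging.getLogger("scaimind.errors")

# Hypothetical friendly wording for the two sanitised status codes.
FRIENDLY_MESSAGES = {
    "INTERNAL": "The training coordinator hit an internal error. Please try again later.",
    "UNKNOWN": "The training coordinator returned an unexpected error. Please try again later.",
}


def sanitise_grpc_error(code: str, details: str) -> str:
    """Return a message that is safe to show to API clients."""
    friendly = FRIENDLY_MESSAGES.get(code)
    if friendly is None:
        # Assumption: other status codes are passed through unchanged.
        return details
    # Raw coordinator details go to a structured log record only.
    logger.error(json.dumps({
        "event": "grpc_error_sanitised",
        "code": code,
        "details": details,
    }))
    return friendly
```

With this sketch, `sanitise_grpc_error("INTERNAL", "<raw coordinator details>")` returns the friendly text and logs the raw details, while a code such as `NOT_FOUND` is returned as-is.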