Run an evaluation
An evaluation runs one or more named benchmarks against a model produced by a completed job. The shape is small: a job id, a model URI, a list of benchmarks. The coordinator queues the run as a separate workload (labelled type=evaluation so the listing endpoints can distinguish it).
What you need
- A job_id that has reached COMPLETED (or at least produced a checkpoint you want to score).
- A model_uri the coordinator can resolve to the artefact under evaluation.
- One or more benchmarks the coordinator knows about: name plus optional dataset, sample count, and parameters.
1. Submit the evaluation
The response returns the new evaluation_id and an initial status.
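As a minimal sketch of the request shape described above, the helper below assembles a submission body as a plain dictionary. The keys job_id, model_uri, benchmarks and checkpoint_id are taken from this page; the helper name, the placeholder identifiers, and the per-benchmark key num_samples are assumptions made for illustration, so check them against your coordinator's schema before relying on them.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def build_evaluation_request(
    job_id: str,
    model_uri: str,
    benchmarks: list[dict[str, Any]],
    checkpoint_id: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble a submission body. Hypothetical helper, not a client API."""
    if not benchmarks:
        raise ValueError("at least one benchmark is required")
    for bench in benchmarks:
        if not bench.get("name"):
            raise ValueError("every benchmark needs a name")
    body: dict[str, Any] = {
        "job_id": job_id,
        "model_uri": model_uri,
        # Copy each entry so later edits to the caller's list do not leak in.
        "benchmarks": [dict(bench) for bench in benchmarks],
    }
    if checkpoint_id is not None:
        body["checkpoint_id"] = checkpoint_id
    return body


if __name__ == "__main__":
    example = build_evaluation_request(
        job_id="job-123",              # placeholder id
        model_uri="example://model",   # placeholder URI
        benchmarks=[
            {"name": "mmlu"},
            {"name": "humaneval", "num_samples": 50},  # num_samples is assumed
        ],
    )
    print(json.dumps(example, indent=2))
```

Building the body separately from the HTTP call keeps the shape easy to inspect and log before anything is queued.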
2. Poll for results
The full result payload comes back from the coordinator as a proto-to-dict serialisation. Field names and the nesting of per-benchmark results depend on the coordinator version; treat the response as the contract.
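One way to poll, sketched under assumptions: the loop below calls a caller-supplied fetch function until the payload reports a terminal status. The status field name, every terminal status other than COMPLETED, and the helper itself are hypothetical. The HTTP call is left to the caller because the endpoint path is not reproduced here.

```python
from __future__ import annotations

import time
from typing import Any, Callable

# Assumed terminal states; only COMPLETED is named on this page.
TERMINAL_STATUSES = frozenset({"COMPLETED", "FAILED", "CANCELLED"})


def wait_for_evaluation(
    fetch: Callable[[], dict[str, Any]],
    interval_s: float = 10.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch() until the payload reports a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        payload = fetch()
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("evaluation did not finish in time")
        sleep(interval_s)
```

Because the payload is returned untouched, field names that vary between coordinator versions stay the caller's concern.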
3. List past evaluations
The list endpoint reuses ListJobs server-side with a type=evaluation label filter, so the response shape mirrors job listings.
To paginate, pass the next_page_token value from each response as page_token in the next request.
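The pagination loop can be sketched as a generator that follows next_page_token until it comes back empty. The fetch_page callable stands in for the real list request, and the items key "jobs" is an assumption based on the response mirroring job listings; adjust it to whatever your coordinator returns.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator, Optional


def iter_evaluations(
    fetch_page: Callable[[Optional[str]], dict[str, Any]],
    items_key: str = "jobs",  # assumed key; check your coordinator's response
) -> Iterator[dict[str, Any]]:
    """Yield every listed item across pages, following next_page_token."""
    token: Optional[str] = None
    while True:
        # fetch_page receives the page_token to send (None for the first page).
        page = fetch_page(token)
        yield from page.get(items_key, [])
        # Treat a missing or empty token as the end of the listing.
        token = page.get("next_page_token") or None
        if token is None:
            return
```

A generator keeps memory use flat on long histories and lets the caller stop early without fetching the remaining pages.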
Patterns
Pin temperature to zero on benchmarks. Stochastic generation undermines reproducibility — set parameters.temperature = "0.0" (or whichever knob your benchmark exposes) so re-runs match.
Use checkpoint_id to score intermediate states. If you want to know whether epoch 2 already plateaus, pass the corresponding checkpoint id rather than the final model.
Tag with labels. Although the submission body doesn't carry top-level labels, the coordinator may surface them in the underlying job record. Use job-level labels on the parent training job so evaluations group cleanly in the dashboard.
Don't loop benchmarks per evaluation call. Submit one evaluation with multiple benchmarks rather than one evaluation per benchmark. It's cheaper for the coordinator and gives you one record to track.
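Two of the patterns above, pinning temperature and batching benchmarks into one call, combine naturally. The sketch below sets parameters.temperature on every benchmark and then places the whole list in a single body. The helper name, the placeholder identifiers, and the max_tokens parameter are hypothetical.

```python
from __future__ import annotations

import copy
from typing import Any


def pin_temperature(
    benchmarks: list[dict[str, Any]], value: str = "0.0"
) -> list[dict[str, Any]]:
    """Return a deep copy of the list with parameters.temperature set."""
    pinned = copy.deepcopy(benchmarks)
    for bench in pinned:
        params = dict(bench.get("parameters") or {})
        params["temperature"] = value
        bench["parameters"] = params
    return pinned


if __name__ == "__main__":
    benchmarks = pin_temperature(
        [
            {"name": "mmlu"},
            # max_tokens is an illustrative parameter, not a documented one.
            {"name": "humaneval", "parameters": {"max_tokens": "512"}},
        ]
    )
    # One evaluation carrying every benchmark, rather than one call each.
    body = {
        "job_id": "job-123",             # placeholder id
        "model_uri": "example://model",  # placeholder URI
        "benchmarks": benchmarks,
    }
    print(body)
```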
Limits and gotchas
- The set of recognised benchmark name values is owned by the coordinator. Check what your deployment supports: names like mmlu and humaneval are typical, but custom-* patterns require you to provide a dataset path the coordinator can read.
- Custom datasets must be reachable by the coordinator using its own credentials or the standard token forwarding. If a dataset lives on ScaiDrive, the data validation endpoint is a good prechecker.
- Evaluations consume GPU time. They share the queue with training jobs and respect the same scheduling rules.
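A client-side precheck can catch the naming rule above before a run is queued. This is a sketch under assumptions: the set of known names must come from your own deployment, and the function name and message wording are hypothetical. It does not verify that the coordinator can actually read a dataset path.

```python
from __future__ import annotations

from fnmatch import fnmatchcase
from typing import Any, Iterable


def check_benchmarks(
    benchmarks: Iterable[dict[str, Any]], known_names: Iterable[str]
) -> list[str]:
    """Return a list of human-readable problems; empty means nothing found."""
    known = set(known_names)
    problems: list[str] = []
    for bench in benchmarks:
        name = bench.get("name", "")
        if fnmatchcase(name, "custom-*"):
            # Custom benchmarks must carry a dataset path.
            if not bench.get("dataset"):
                problems.append(f"{name}: custom benchmarks need a dataset path")
        elif name not in known:
            problems.append(f"{name!r}: not among the deployment's known names")
    return problems
```

Running it over the benchmark list before submission turns a rejected or failed run into an immediate local error.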