Run an evaluation

An evaluation runs one or more named benchmarks against a model produced by a completed job. The request body is small: a job ID, a model URI, and a list of benchmarks. The coordinator queues the run as a separate workload, labelled type=evaluation so the listing endpoints can distinguish it from training jobs.

What you need

A job_id that has reached COMPLETED (or at least produced a checkpoint you want to score).
A model_uri the coordinator can resolve to the artefact under evaluation.
One or more benchmarks the coordinator knows about — name plus optional dataset, sample count, and parameters.
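
The three inputs above can be assembled and sanity-checked before any request is sent. The following is a minimal sketch, not part of the ScaiMind API: `build_evaluation_request` is a hypothetical helper, and its checks reflect only what this page states (a job id, a model URI, and at least one named benchmark).

```python
from typing import Any, Optional


def build_evaluation_request(
    job_id: str,
    model_uri: str,
    benchmarks: list[dict[str, Any]],
    checkpoint_id: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble a submission body (hypothetical helper, client-side only)."""
    if not job_id:
        raise ValueError("job_id is required")
    if not model_uri:
        raise ValueError("model_uri is required")
    if not benchmarks:
        raise ValueError("at least one benchmark is required")
    for bench in benchmarks:
        # Every benchmark entry needs a name; the other keys are optional.
        if not bench.get("name"):
            raise ValueError(f"benchmark entry is missing a name: {bench!r}")

    body: dict[str, Any] = {
        "job_id": job_id,
        "model_uri": model_uri,
        "benchmarks": benchmarks,
    }
    # Only include checkpoint_id when the caller actually supplied one.
    if checkpoint_id:
        body["checkpoint_id"] = checkpoint_id
    return body


body = build_evaluation_request(
    "job_abc123",
    "scaidrive://acme/models/support-v3/",
    [{"name": "mmlu", "num_samples": 1000}, {"name": "humaneval"}],
)
```

The resulting `body` can be passed as the JSON payload of the submission request in step 1. The coordinator remains the authority on what is valid; these checks only catch obviously incomplete input early.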

1. Submit the evaluation

bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/evaluations" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": "job_abc123",
    "model_uri": "scaidrive://acme/models/support-v3/",
    "checkpoint_id": "",
    "benchmarks": [
      {"name": "mmlu", "num_samples": 1000},
      {"name": "humaneval"},
      {
        "name": "custom-acme-support",
        "dataset": "scaidrive://acme/eval/support-testset.jsonl",
        "parameters": {"max_new_tokens": "256", "temperature": "0.0"}
      }
    ]
  }'

python
import httpx, os
r = httpx.post(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/evaluations",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    json={
        "job_id": "job_abc123",
        "model_uri": "scaidrive://acme/models/support-v3/",
        "benchmarks": [
            {"name": "mmlu", "num_samples": 1000},
            {"name": "humaneval"},
            {
                "name": "custom-acme-support",
                "dataset": "scaidrive://acme/eval/support-testset.jsonl",
                "parameters": {"max_new_tokens": "256", "temperature": "0.0"},
            },
        ],
    },
)
r.raise_for_status()  # surface HTTP errors before parsing the body
eval_record = r.json()["data"]
print(eval_record["evaluation_id"], eval_record["status"])

javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/evaluations`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    job_id: "job_abc123",
    model_uri: "scaidrive://acme/models/support-v3/",
    benchmarks: [
      { name: "mmlu", num_samples: 1000 },
      { name: "humaneval" },
      {
        name: "custom-acme-support",
        dataset: "scaidrive://acme/eval/support-testset.jsonl",
        parameters: { max_new_tokens: "256", temperature: "0.0" },
      },
    ],
  }),
});
const { data: evalRec } = await r.json();
console.log(evalRec.evaluation_id, evalRec.status);

The response returns the new evaluation_id and an initial status.

2. Poll for results

bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/evaluations/$EVAL_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

The full result payload comes back from the coordinator as a proto-to-dict serialisation. Field names and the nesting of per-benchmark results depend on the coordinator version, so inspect a real response from your deployment and treat that as the contract.

python
import httpx, os, time
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
HOST = os.environ["SCAIGRID_HOST"]
EVAL = os.environ["EVAL_ID"]

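while True:
    resp = httpx.get(f"{HOST}/v1/modules/scaimind/evaluations/{EVAL}", headers=H)
    resp.raise_for_status()  # fail on HTTP errors instead of a KeyError below
    r = resp.json()["data"]
    print(r.get("status"))
    # Stop once the evaluation reaches a terminal state.
    if r.get("status") in {"COMPLETED", "FAILED", "CANCELLED"}:
        break
    time.sleep(15)
print(r)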
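
The loop above polls forever if the evaluation never reaches a terminal status. A possible variant with an overall deadline is sketched below. `wait_for_evaluation` is a hypothetical helper, the default timeout is an arbitrary choice, and the fetch function is injected so the logic can be exercised without a live coordinator. The terminal statuses are the three used in the loop above.

```python
import time
from typing import Any, Callable

# Terminal statuses taken from the polling loop in this guide.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_evaluation(
    fetch: Callable[[], dict[str, Any]],
    timeout_s: float = 3600.0,   # assumed default, tune for your workloads
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict[str, Any]:
    """Call fetch() until the record is terminal or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        record = fetch()
        if record.get("status") in TERMINAL_STATUSES:
            return record
        if clock() >= deadline:
            raise TimeoutError(
                f"evaluation still {record.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)


# Example wiring (not executed here), reusing HOST, EVAL and H from above:
#   def fetch():
#       resp = httpx.get(f"{HOST}/v1/modules/scaimind/evaluations/{EVAL}", headers=H)
#       resp.raise_for_status()
#       return resp.json()["data"]
#   result = wait_for_evaluation(fetch)
```

Injecting `sleep` and `clock` is a testing convenience; in normal use the defaults are enough.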

3. List past evaluations

The list endpoint reuses ListJobs server-side with a type=evaluation label filter, so the response shape mirrors job listings:

bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/evaluations?page_size=20" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"

json
{
  "data": {
    "evaluations": [
      {"job_id": "eval_xyz", "name": "...", "status": "COMPLETED", "...": "..."}
    ],
    "next_page_token": "...",
    "total_count": 42
  }
}

To fetch the next page, pass the value returned in next_page_token as the page_token query parameter on the following request.
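
Walking every page can be sketched as a generator. This is an illustration under assumptions: `iter_evaluations` and `fetch_page` are hypothetical names, and the sketch assumes that an empty or missing next_page_token marks the last page, which this page does not state explicitly.

```python
from typing import Any, Callable, Iterator, Optional


def iter_evaluations(
    fetch_page: Callable[[Optional[str]], dict[str, Any]],
) -> Iterator[dict[str, Any]]:
    """Yield evaluation records across all pages.

    fetch_page(token) must return the "data" object of one list response;
    token is None for the first page.
    """
    token: Optional[str] = None
    while True:
        data = fetch_page(token)
        yield from data.get("evaluations", [])
        token = data.get("next_page_token")
        if not token:  # assumption: no token means there are no more pages
            break


# Example wiring (not executed here), using the list endpoint shown above:
#   def fetch_page(token):
#       params = {"page_size": 20}
#       if token:
#           params["page_token"] = token
#       resp = httpx.get(f"{HOST}/v1/modules/scaimind/evaluations",
#                        headers=H, params=params)
#       resp.raise_for_status()
#       return resp.json()["data"]
```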

Patterns

Pin temperature to zero on benchmarks. Stochastic generation undermines reproducibility — set parameters.temperature = "0.0" (or whichever knob your benchmark exposes) so re-runs match.
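
One way to apply this consistently is to pin the value on every benchmark before submitting. The sketch below uses a hypothetical helper, `pin_temperature`, and assumes every benchmark in the list accepts a `temperature` parameter, which may not hold for all of them. The value is kept as a string to match the request examples in this guide.

```python
import copy
from typing import Any


def pin_temperature(
    benchmarks: list[dict[str, Any]], value: str = "0.0"
) -> list[dict[str, Any]]:
    """Return a copy of benchmarks with parameters.temperature set to value."""
    pinned = copy.deepcopy(benchmarks)  # leave the caller's list untouched
    for bench in pinned:
        # Create the parameters dict if absent, keep any existing entries.
        bench.setdefault("parameters", {})["temperature"] = value
    return pinned


benchmarks = [
    {"name": "humaneval"},
    {"name": "custom-acme-support", "parameters": {"max_new_tokens": "256"}},
]
pinned = pin_temperature(benchmarks)
```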

Use checkpoint_id to score intermediate states. If you want to know whether epoch 2 already plateaus, pass the corresponding checkpoint id rather than the final model.
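
As a rough illustration, the request bodies for scoring several checkpoints of the same job could be built as follows. The checkpoint ids are made up, `bodies_per_checkpoint` is a hypothetical helper, and how the coordinator combines model_uri with checkpoint_id is not specified on this page, so treat the field layout as an assumption based on the curl example in step 1.

```python
from typing import Any


def bodies_per_checkpoint(
    job_id: str,
    model_uri: str,
    benchmarks: list[dict[str, Any]],
    checkpoint_ids: list[str],
) -> list[dict[str, Any]]:
    """Build one submission body per checkpoint, same benchmarks for each."""
    return [
        {
            "job_id": job_id,
            "model_uri": model_uri,
            "checkpoint_id": ckpt,
            "benchmarks": list(benchmarks),  # shallow copy per body
        }
        for ckpt in checkpoint_ids
    ]


# Hypothetical checkpoint ids; use the ids your training job produced.
bodies = bodies_per_checkpoint(
    "job_abc123",
    "scaidrive://acme/models/support-v3/",
    [{"name": "mmlu", "num_samples": 1000}],
    ["ckpt_epoch1", "ckpt_epoch2"],
)
```

Each body would then be posted to the submission endpoint as a separate evaluation, and the per-checkpoint scores compared once all of them finish.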

Tag with labels. The submission body has no top-level labels field, although the coordinator may surface labels in the underlying job record. Put labels on the parent training job so that its evaluations group cleanly in the dashboard.

Don't submit one evaluation per benchmark. Send a single evaluation that lists every benchmark instead. It's cheaper for the coordinator and gives you one record to track.

Limits and gotchas

The set of recognised benchmark name values is owned by the coordinator. Check what your deployment supports — names like mmlu and humaneval are typical, but custom-* patterns require you to provide a dataset path the coordinator can read.
Custom datasets must be reachable by the coordinator using its own credentials or the standard token forwarding. If a dataset lives on ScaiDrive, the data validation endpoint is a good pre-check.
Evaluations consume GPU time. They share the queue with training jobs and respect the same scheduling rules.
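
The first gotcha can be caught client-side before submitting. The sketch below flags custom-* benchmarks that have no dataset path. `custom_benchmarks_missing_dataset` is a hypothetical helper, and it only checks that a dataset is present, not that the coordinator can actually read it.

```python
from typing import Any


def custom_benchmarks_missing_dataset(
    benchmarks: list[dict[str, Any]],
) -> list[str]:
    """Return names of custom-* benchmarks that lack a dataset path."""
    return [
        bench.get("name", "")
        for bench in benchmarks
        if bench.get("name", "").startswith("custom-") and not bench.get("dataset")
    ]


missing = custom_benchmarks_missing_dataset(
    [
        {"name": "mmlu", "num_samples": 1000},
        {"name": "custom-acme-support"},  # no dataset: should be flagged
        {
            "name": "custom-acme-billing",  # hypothetical benchmark name
            "dataset": "scaidrive://acme/eval/support-testset.jsonl",
        },
    ]
)
if missing:
    print("custom benchmarks without a dataset:", ", ".join(missing))
```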