---
summary: Submit a benchmark run against a completed job, poll for results, list past
  evaluations.
title: Run an evaluation
path: tutorials/run-an-evaluation
status: published
---

An evaluation runs one or more named benchmarks against a model produced by a completed job. The request is small: a job id, a model URI, and a list of benchmarks. The coordinator queues the run as a separate workload, labelled `type=evaluation` so the listing endpoints can distinguish it.

## What you need

- A `job_id` that has reached `COMPLETED` (or at least produced a checkpoint you want to score).
- A `model_uri` the coordinator can resolve to the artefact under evaluation.
- One or more benchmarks the coordinator knows about — name plus optional dataset, sample count, and parameters.

## 1. Submit the evaluation

```bash
curl -X POST "$SCAIGRID_HOST/v1/modules/scaimind/evaluations" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": "job_abc123",
    "model_uri": "scaidrive://acme/models/support-v3/",
    "checkpoint_id": "",
    "benchmarks": [
      {"name": "mmlu", "num_samples": 1000},
      {"name": "humaneval"},
      {
        "name": "custom-acme-support",
        "dataset": "scaidrive://acme/eval/support-testset.jsonl",
        "parameters": {"max_new_tokens": "256", "temperature": "0.0"}
      }
    ]
  }'
```

```python
import httpx, os
r = httpx.post(
    f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/evaluations",
    headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
    json={
        "job_id": "job_abc123",
        "model_uri": "scaidrive://acme/models/support-v3/",
        "benchmarks": [
            {"name": "mmlu", "num_samples": 1000},
            {"name": "humaneval"},
            {
                "name": "custom-acme-support",
                "dataset": "scaidrive://acme/eval/support-testset.jsonl",
                "parameters": {"max_new_tokens": "256", "temperature": "0.0"},
            },
        ],
    },
)
r.raise_for_status()  # surface 4xx/5xx errors before reading the body
eval_record = r.json()["data"]
print(eval_record["evaluation_id"], eval_record["status"])
```

```javascript
const r = await fetch(`${process.env.SCAIGRID_HOST}/v1/modules/scaimind/evaluations`, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.SCAIGRID_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    job_id: "job_abc123",
    model_uri: "scaidrive://acme/models/support-v3/",
    benchmarks: [
      { name: "mmlu", num_samples: 1000 },
      { name: "humaneval" },
      {
        name: "custom-acme-support",
        dataset: "scaidrive://acme/eval/support-testset.jsonl",
        parameters: { max_new_tokens: "256", temperature: "0.0" },
      },
    ],
  }),
});
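// Fail early on a non-2xx response instead of destructuring an error body.
if (!r.ok) throw new Error(`Evaluation submit failed: HTTP ${r.status}`);
const { data: evalRec } = await r.json();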
console.log(evalRec.evaluation_id, evalRec.status);
```

The response returns the new `evaluation_id` and an initial `status`.

## 2. Poll for results

```bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/evaluations/$EVAL_ID" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

The full result payload comes back from the coordinator as a proto-to-dict serialisation. Field names and the nesting of per-benchmark results depend on the coordinator version, so treat the response your deployment actually returns as the contract rather than assuming a fixed schema.

```python
import httpx, os, time
H = {"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"}
HOST = os.environ["SCAIGRID_HOST"]
EVAL = os.environ["EVAL_ID"]

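while True:
    resp = httpx.get(f"{HOST}/v1/modules/scaimind/evaluations/{EVAL}", headers=H)
    resp.raise_for_status()  # stop on HTTP errors rather than polling forever
    r = resp.json()["data"]
    status = r.get("status")
    print(status)
    # Stop once the evaluation reaches a terminal status.
    if status in {"COMPLETED", "FAILED", "CANCELLED"}:
        break
    time.sleep(15)
print(r)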
```
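
An unbounded `while True` loop is fine interactively, but in automation it usually needs a deadline. The sketch below wraps the same polling logic in a function that gives up after a timeout. The name `wait_for_evaluation` and its parameters are illustrative and not part of any client library; `fetch` is any zero-argument callable that returns the evaluation record as a dict.

```python
import time

# Terminal statuses, matching the polling loop in this section.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_evaluation(fetch, timeout_s=3600.0, interval_s=15.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until the record is terminal or timeout_s would be exceeded.

    `sleep` and `clock` are injectable so the loop can be exercised without
    real waiting. This helper is a hypothetical sketch, not an official API.
    """
    deadline = clock() + timeout_s
    while True:
        record = fetch()
        if record.get("status") in TERMINAL:
            return record
        # Give up if the next poll would land after the deadline.
        if clock() + interval_s > deadline:
            raise TimeoutError(
                f"evaluation still {record.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

With `httpx`, `fetch` could be something like `lambda: httpx.get(url, headers=H).json()["data"]`, where `url` is the evaluation URL used in the loop above.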

## 3. List past evaluations

The list endpoint reuses `ListJobs` server-side with a `type=evaluation` label filter, so the response shape mirrors job listings:

```bash
curl "$SCAIGRID_HOST/v1/modules/scaimind/evaluations?page_size=20" \
  -H "Authorization: Bearer $SCAIGRID_API_KEY"
```

```json
{
  "data": {
    "evaluations": [
      {"job_id": "eval_xyz", "name": "...", "status": "COMPLETED", "...": "..."}
    ],
    "next_page_token": "...",
    "total_count": 42
  }
}
```

To fetch the next page, pass the value returned in `next_page_token` as the `page_token` query parameter.
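
Walking every page can be sketched as a small generator. The helper names `iter_evaluations` and `fetch_page_httpx` are hypothetical, and the code assumes that a missing or empty `next_page_token` marks the last page, which the response sample above does not confirm.

```python
def iter_evaluations(fetch_page):
    """Yield every evaluation record, following next_page_token page by page.

    fetch_page(page_token) must return the "data" object of one list
    response; page_token is None for the first page.
    """
    token = None
    while True:
        data = fetch_page(token)
        yield from data.get("evaluations", [])
        token = data.get("next_page_token")
        if not token:  # assumption: empty or absent token means no more pages
            return


def fetch_page_httpx(page_token, page_size=20):
    """One possible fetch_page, using httpx like the other examples here."""
    import os
    import httpx  # imported here so iter_evaluations has no dependencies

    params = {"page_size": page_size}
    if page_token:
        params["page_token"] = page_token
    r = httpx.get(
        f"{os.environ['SCAIGRID_HOST']}/v1/modules/scaimind/evaluations",
        headers={"Authorization": f"Bearer {os.environ['SCAIGRID_API_KEY']}"},
        params=params,
    )
    r.raise_for_status()
    return r.json()["data"]
```

Usage would then be `for ev in iter_evaluations(fetch_page_httpx): print(ev["job_id"], ev["status"])`.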

## Patterns

**Pin temperature to zero on benchmarks.** Stochastic generation undermines reproducibility — set `parameters.temperature = "0.0"` (or whichever knob your benchmark exposes) so re-runs match.
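
One way to apply this consistently is to pin the value on the benchmark list before submitting. The helper below is a sketch: `pin_temperature` is a made-up name, and it assumes every benchmark in the list accepts a `temperature` parameter passed as a string, as in the submission example earlier. Drop or adapt it for benchmarks that expose a different knob.

```python
import copy


def pin_temperature(benchmarks, value="0.0"):
    """Return a copy of `benchmarks` with parameters.temperature set to `value`.

    Assumption: each benchmark accepts a string-valued "temperature"
    parameter. Existing temperature values are overwritten.
    """
    pinned = copy.deepcopy(benchmarks)  # leave the caller's list untouched
    for bench in pinned:
        bench.setdefault("parameters", {})["temperature"] = value
    return pinned


benchmarks = pin_temperature([
    {"name": "mmlu", "num_samples": 1000},
    {"name": "custom-acme-support",
     "dataset": "scaidrive://acme/eval/support-testset.jsonl",
     "parameters": {"max_new_tokens": "256"}},
])
```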

**Use `checkpoint_id` to score intermediate states.** If you want to know whether epoch 2 already plateaus, pass the corresponding checkpoint id rather than the final model.
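
As a sketch, the only difference between scoring the final model and scoring a checkpoint is the extra field in the request body. `build_evaluation_payload` is a hypothetical helper and `ckpt_epoch2` is a made-up id; use whatever checkpoint ids your job actually reports. The code also assumes that omitting `checkpoint_id` selects the final model.

```python
def build_evaluation_payload(job_id, model_uri, benchmarks, checkpoint_id=None):
    """Build the JSON body for POST /v1/modules/scaimind/evaluations."""
    payload = {
        "job_id": job_id,
        "model_uri": model_uri,
        "benchmarks": list(benchmarks),
    }
    if checkpoint_id:
        # Only sent when scoring an intermediate checkpoint.
        payload["checkpoint_id"] = checkpoint_id
    return payload


benchmarks = [{"name": "mmlu", "num_samples": 1000}]
final_run = build_evaluation_payload(
    "job_abc123", "scaidrive://acme/models/support-v3/", benchmarks)
epoch2_run = build_evaluation_payload(
    "job_abc123", "scaidrive://acme/models/support-v3/", benchmarks,
    checkpoint_id="ckpt_epoch2")  # hypothetical checkpoint id
```

Submitting both payloads with the same benchmarks gives two records whose scores can be compared directly.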

**Tag with labels.** The submission body has no top-level labels field, although the coordinator may still surface labels in the underlying job record. Put labels on the parent training job so its evaluations group cleanly in the dashboard.

**Don't submit one evaluation per benchmark.** Submit a single evaluation that lists all of the benchmarks instead. It's cheaper for the coordinator and gives you one record to track.

## Limits and gotchas

- The set of recognised benchmark `name` values is owned by the coordinator. Check what your deployment supports — names like `mmlu` and `humaneval` are typical, but `custom-*` patterns require you to provide a `dataset` path the coordinator can read.
- Custom datasets must be reachable by the coordinator, either with its own credentials or through standard token forwarding. If a dataset lives on ScaiDrive, the data validation endpoint is a good pre-check.
- Evaluations consume GPU time. They share the queue with training jobs and respect the same scheduling rules.
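
A cheap client-side check can catch the missing-`dataset` case before a request is queued. This is only a sketch: `missing_datasets` is a hypothetical helper, and the `custom-` prefix is taken from the naming pattern above rather than from a documented rule, so adjust it to your deployment.

```python
def missing_datasets(benchmarks, custom_prefix="custom-"):
    """Return the names of custom benchmarks that do not specify a dataset."""
    return [
        bench["name"]
        for bench in benchmarks
        if bench.get("name", "").startswith(custom_prefix)
        and not bench.get("dataset")
    ]


problems = missing_datasets([
    {"name": "mmlu", "num_samples": 1000},
    {"name": "custom-acme-support"},  # no dataset: flagged
])
if problems:
    print("Add a dataset for:", ", ".join(problems))
```

The check says nothing about whether the coordinator can actually read the dataset; it only confirms that a path was supplied.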
