Checkpoints
A checkpoint is a paused execution inside a Core, persisted to MariaDB while it waits for a human (or another system) to make a decision. The IR-level details of when a checkpoint is created live in the language — see /docs/scaicore. This page is about what ScaiGrid does with the checkpoint once it exists.
The row
Every checkpoint is one row in mod_scaicore_checkpoints. The fields you care about as a wrapper user:
| Field | Purpose |
|---|---|
| `id` | UUID; also the correlation id used by the ScaiQueue HITL bridge. |
| `core_id`, `tenant_id` | Owning Core and tenant. |
| `execution_id`, `flow_name`, `block_index` | Where in the program execution suspended. |
| `instance_key` | For entity-mode Cores, the entity that suspended. |
| `checkpoint_type` | Kind of decision: approval, choice, freeform, etc. (set by the program). |
| `prompt` | The question shown to the assignee. |
| `options` | Optional JSON list of canonical answer choices. |
| `assignee_raw` | The raw assignee string from the program, e.g. `user:alice@acme`, `group:approvers`, `role:tenant_admin`. |
| `assignee_type` | Parsed type: `user`, `group`, `role`, `delegated_user`, or `unrouted`. |
| `assignee_resolved` | Parsed structure with the type + value. |
| `context` | Arbitrary JSON the program attached. Includes `resolved_skills` when bound. |
| `status` | `pending`, `resolved`, `expired`, `cancelled`. |
| `priority` | `low`, `normal`, `high`, `critical`. |
| `expires_at`, `expiry_action`, `escalation_target` | Lifecycle controls. |
| `notification_sent`, `reminder_count`, `reminder_interval_m` | Notification bookkeeping. |
| `resolution`, `resolved_by`, `resolved_at` | Populated when a human acts. |
The state blob (the captured execution state the engine needs to resume) is stored in S3, keyed by state_s3_key. The row is the index; S3 holds the bytes.
Assignment
The program writes a raw assignee string. The wrapper parses the <type>:<value> prefix:
- `user:<email-or-id>`: one human.
- `group:<group-id>`: anyone in the group can act.
- `role:<role>`: anyone holding the role can act.
- (no prefix): treated as `user:` with the raw value.
If the Core is delegated to a specific user, ScaiCore programs can also produce delegated_user checkpoints — they route to whatever human the delegation is currently scoped to.
When no assignee can be resolved (typical for offline-debugging Cores), the type is unrouted. Unrouted checkpoints accumulate in the admin UI's checkpoint queue for a tenant admin to claim manually.
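The prefix rules above can be sketched in a few lines of Python. This is an illustration, not the wrapper's actual parser: the `Assignee` class and `parse_assignee` function are hypothetical names, and the handling of an empty value after a known prefix (treated here as `unrouted`) is an assumption. `delegated_user` is left out because the page describes it as produced by the program rather than parsed from a prefix.

```python
from dataclasses import dataclass

# Prefixes named on this page; anything else is handled as "no prefix".
KNOWN_PREFIXES = {"user", "group", "role"}


@dataclass(frozen=True)
class Assignee:
    type: str   # "user", "group", "role", or "unrouted"
    value: str


def parse_assignee(raw):
    """Parse a raw assignee string into a type + value pair."""
    text = (raw or "").strip()
    if not text:
        # Nothing to resolve: the checkpoint waits in the admin queue.
        return Assignee("unrouted", "")
    prefix, sep, rest = text.partition(":")
    if sep and prefix in KNOWN_PREFIXES:
        # Assumption: a known prefix with an empty value cannot be routed.
        return Assignee(prefix, rest) if rest else Assignee("unrouted", "")
    # No recognised prefix: the whole string is treated as a user.
    return Assignee("user", text)
```

For example, `parse_assignee("group:approvers")` yields a `group` assignee, while a bare `bob@acme` falls through to `user`.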
Notifications
If a notifier is configured (email transport, Slack, etc.), the wrapper sends a notification at create time and marks notification_sent = true. The 15-minute send_checkpoint_reminders cron re-pings still-pending checkpoints whose reminder_interval_m is set, incrementing reminder_count each time.
Emails are sent to addresses extracted from the resolved assignee — addresses that look like emails are passed straight through; group / role resolution to email lists is the notifier's responsibility.
Expiry
The 5-minute expire_checkpoints cron picks up rows where expires_at <= now() and status = pending. The expiry_action decides what happens:
- `cancel`: the row is marked `cancelled`. The program never resumes from this checkpoint.
- `default_option`: the row is marked `resolved` with `decision: "default"` and `auto_expired: true`. The runtime resumes as if a human had picked the default.
- `escalate`: the row's `assignee_raw` is replaced with `escalation_target`, the notifier is asked to escalate, and the row stays `pending` for the new assignee.
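As a rough sketch of the three expiry actions, the function below applies them to one row held as a plain dict. The function name and the dict representation are assumptions made for illustration; the real cron works against `mod_scaicore_checkpoints`. The notifier call on escalation is omitted, and the page does not say how `expires_at` is updated after an escalation, so the sketch leaves it untouched.

```python
from datetime import datetime, timezone


def expire_checkpoint(row, now):
    """Apply expiry_action to one checkpoint row (a dict) in place.

    Returns True if the row was due and has been processed.
    """
    if row["status"] != "pending" or row["expires_at"] > now:
        return False  # not due, or already handled

    action = row["expiry_action"]
    if action == "cancel":
        row["status"] = "cancelled"
    elif action == "default_option":
        row["status"] = "resolved"
        row["resolution"] = {"decision": "default", "auto_expired": True}
    elif action == "escalate":
        # Status stays "pending"; only the assignee changes.
        # Notifying the new assignee is left out of this sketch.
        row["assignee_raw"] = row["escalation_target"]
    else:
        raise ValueError(f"unknown expiry_action: {action!r}")
    return True
```

Keeping the status check at the top means a second pass over an already cancelled or resolved row is a no-op.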
Resolution
A human (or any caller with scaicore:checkpoint_resolve) posts the decision to /checkpoints/{id}/resolve.
The row's status flips to resolved, the resolution JSON captures the decision + response + comment, and resolved_by / resolved_at get set. The runtime is responsible for picking up the resolution and resuming the suspended execution.
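A resolve call might be built as follows. The `decision`, `response`, and `comment` field names are taken from the description of the resolution JSON above, but the exact body shape is an assumption, as are the base URL, the bearer-token header, and the `build_resolve_request` helper itself. The sketch only constructs the request; nothing is sent.

```python
import json
import urllib.request


def build_resolve_request(base_url, checkpoint_id, decision,
                          response=None, comment=None, token=""):
    """Build (but do not send) a POST to /checkpoints/{id}/resolve."""
    body = {"decision": decision}
    # Optional fields are included only when provided (an assumption).
    if response is not None:
        body["response"] = response
    if comment is not None:
        body["comment"] = comment
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/checkpoints/{checkpoint_id}/resolve",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Hypothetical auth scheme; use whatever your deployment requires.
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it.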
A checkpoint can also be cancelled (POST /checkpoints/{id}/cancel) without a decision, or reassigned (POST /checkpoints/{id}/reassign) to a different assignee — which re-runs the assignment parser and re-fires the notifier.
Frozen skill versions
If the Core has bound ScaiSkills, the resolved skill set at checkpoint-creation time is frozen into the row's context.resolved_skills. On resume, the runtime reads those pinned versions back instead of re-resolving — so the resumed execution uses the exact same skill versions it suspended with, even if a new version has been published or yanked since.
Yanked pinned versions are allowed through with a warning (ScaiSkills ERRATA-v0.2 option 2). Missing pinned skill rows are also tolerated; the warning is logged but the resume proceeds.
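The tolerant resume behaviour can be illustrated with a small sketch. It assumes `context.resolved_skills` is a list of objects with `name` and `version` keys, and that the skill registry can be modelled as a dict keyed by `(name, version)`; neither shape is confirmed by this page, and `skills_for_resume` is a hypothetical name.

```python
import logging

log = logging.getLogger("checkpoint_resume")


def skills_for_resume(context, catalog):
    """Return (pinned_skills, warnings) for a resuming execution.

    context: the checkpoint's context JSON as a dict.
    catalog: hypothetical mapping of (name, version) -> {"yanked": bool}.
    """
    pinned = context.get("resolved_skills") or []
    warnings = []
    for skill in pinned:
        label = f"{skill['name']}@{skill['version']}"
        entry = catalog.get((skill["name"], skill["version"]))
        if entry is None:
            warnings.append(f"pinned skill {label} is missing")
        elif entry.get("yanked"):
            warnings.append(f"pinned skill {label} was yanked")
    for message in warnings:
        log.warning(message)
    # The pinned list is returned unchanged: warnings never block the resume.
    return pinned, warnings
```

The point of the design is that the pinned list, not the current registry state, decides what the resumed execution runs with.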
ScaiQueue HITL bridge
Programs can publish a checkpoint as a hitl_request message into ScaiQueue (with hitl_message_id recorded on the row). When the ScaiQueue message is completed, an in-process scaiqueue.message.completed event fires. The wrapper's handler — registered by the module — looks the checkpoint up by correlation_id == checkpoint_id and resolves it internally.
The handler is idempotent: a re-delivered completion event finds the checkpoint already resolved and returns cleanly. If the checkpoint was never ours (different correlation_id), the handler exits without touching anything.
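The idempotency described above comes down to two early exits, sketched here against an in-memory dict standing in for the checkpoint table. The event shape (`correlation_id` and `result` keys), the return values, and the function name are all assumptions made for illustration.

```python
def on_message_completed(event, checkpoints):
    """Handle a scaiqueue.message.completed event.

    checkpoints: dict of checkpoint id -> row dict (stand-in for the table).
    Returns "ignored", "noop", or "resolved".
    """
    row = checkpoints.get(event.get("correlation_id"))
    if row is None:
        # Not one of our checkpoints: exit without touching anything.
        return "ignored"
    if row["status"] != "pending":
        # Re-delivered event, or the checkpoint was closed another way.
        return "noop"
    row["status"] = "resolved"
    row["resolution"] = event.get("result")
    return "resolved"
```

Delivering the same event twice resolves the checkpoint once and returns `"noop"` the second time, leaving the first resolution in place.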
Audit history
GET /checkpoints/{id}/history returns a simplified event list — created, notification sent, resolved. It is not a full append-only audit log; for that, use ScaiGrid's audit-events pipeline filtered by module=scaicore.