Service marked unreachable

A registered ScaiLabs service shows health_status='unreachable' in /admin/registry, and the operations team sees alerts. Use this page to diagnose why ScaiControl can't see its heartbeats.

How "unreachable" is determined

The registry_heartbeat_monitor cron runs every REGISTRY_HEARTBEAT_MONITOR_INTERVAL seconds (default 60). For each registered service:

It looks at last_heartbeat_at from service_registry.
Grace period = service's own heartbeat_interval_seconds × REGISTRY_HEARTBEAT_GRACE_MULTIPLIER (default ×3).
If now - last_heartbeat_at > grace, the consecutive_misses counter increments.

Thresholds (configurable via env):

Consecutive misses	Health status
0	`healthy`
`REGISTRY_HEARTBEAT_DEGRADED_THRESHOLD` (default 3)	`degraded`
`REGISTRY_HEARTBEAT_UNREACHABLE_THRESHOLD` (default 10)	`unreachable`

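So with defaults, a service heartbeating every 30 seconds has a 90-second grace period. Under the rule above, the monitor adds at most one miss per 60-second cycle once that grace period is exceeded, so the service reaches degraded after about 4 minutes of silence and unreachable after about 11 minutes.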
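The rule and thresholds above can be sketched in a few lines of Python. This is an illustration, not ScaiControl's actual code: the function names are hypothetical, the constants are the documented defaults, and the treatment of miss counts between 1 and the degraded threshold as healthy is an assumption (the table only lists 0 explicitly).

```python
from datetime import datetime, timedelta, timezone

# Documented defaults; the real values come from the environment.
GRACE_MULTIPLIER = 3
DEGRADED_THRESHOLD = 3
UNREACHABLE_THRESHOLD = 10


def grace_period(heartbeat_interval_seconds: int,
                 multiplier: int = GRACE_MULTIPLIER) -> timedelta:
    """Grace period = the service's own interval times the multiplier."""
    return timedelta(seconds=heartbeat_interval_seconds * multiplier)


def is_miss(now: datetime, last_heartbeat_at: datetime,
            heartbeat_interval_seconds: int) -> bool:
    """True when the last heartbeat is older than the grace period."""
    return now - last_heartbeat_at > grace_period(heartbeat_interval_seconds)


def status_for(consecutive_misses: int) -> str:
    """Map a miss count to a health status.

    Assumption: counts below the degraded threshold still read as healthy.
    """
    if consecutive_misses >= UNREACHABLE_THRESHOLD:
        return "unreachable"
    if consecutive_misses >= DEGRADED_THRESHOLD:
        return "degraded"
    return "healthy"
```

For a service with a 30-second interval, `grace_period(30)` is 90 seconds, and the monitor would count a miss on any cycle where the last heartbeat is older than that.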

Step 1 — Is the service actually running?

Standard process check — ps, systemctl status, kubectl get pods, whatever your runtime exposes. If the service is down, that's the answer; start it.

Step 2 — Is it heartbeating?

Look at the most recent heartbeat in ScaiControl:

sql
SELECT id, slug, name, last_heartbeat_at, consecutive_misses, health_status,
       heartbeat_interval_seconds
FROM service_registry
WHERE slug = '<service-slug>';

If last_heartbeat_at is very recent but health_status is still unreachable, the monitor hasn't run yet — wait one cycle.

If last_heartbeat_at is stale, the heartbeats are not arriving. Move to Step 3.
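If you run this check often, the decision can be scripted. The sketch below takes one row from the query above as a plain dict and returns the next action. It is a hypothetical helper, and it makes two assumptions that the text does not state: "very recent" is read as "within the grace period", and a NULL last_heartbeat_at is treated as stale.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_action(row: dict, now: datetime, grace_multiplier: int = 3) -> str:
    """Decide what to do next from one service_registry row.

    `row` needs the keys last_heartbeat_at (timezone-aware datetime or None),
    heartbeat_interval_seconds (int) and health_status (str).
    """
    last: Optional[datetime] = row["last_heartbeat_at"]
    grace = timedelta(seconds=row["heartbeat_interval_seconds"] * grace_multiplier)

    # Assumption: no heartbeat on record counts as stale.
    if last is None or now - last > grace:
        return "stale: heartbeats are not arriving, go to Step 3"
    if row["health_status"] != "healthy":
        return "recent: wait one monitor cycle and re-run the query"
    return "ok: heartbeats are arriving and the status is healthy"
```

Call it with `datetime.now(timezone.utc)` as `now` so that the comparison against a timezone-aware database timestamp is valid.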

Step 3 — Can the service reach ScaiControl?

Heartbeats are POST /api/v1/registry/heartbeat with a service token. Test from the service host:

bash
curl -i -X POST "$SCAICONTROL_URL/api/v1/registry/heartbeat" \
     -H "Authorization: Bearer $SERVICE_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"status":"healthy"}'

Expected: 200 {"ok": true}.

Possible failures:

HTTP / network	Meaning
Network timeout / connection refused	Service can't reach ScaiControl's URL. Check DNS, firewall, ingress rules
`401`	Service token invalid, expired, or wrong issuer. The service might be reading stale credentials
`403`	Token valid but lacks `registry:manage` scope. Re-issue via ScaiKey
`404`	Wrong URL (e.g. missing `/api/v1`)
`5xx`	ScaiControl problem — check its logs
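The same probe can be run from Python when curl is not available on the service host. This is a minimal sketch using only the standard library. The function names are hypothetical, the endpoint and payload are the ones shown in the curl example, and the diagnosis strings paraphrase the table above.

```python
import json
import urllib.error
import urllib.request

# Paraphrased from the failure table above.
DIAGNOSIS = {
    401: "token invalid, expired, or wrong issuer: check for stale credentials",
    403: "token lacks the registry:manage scope: re-issue via ScaiKey",
    404: "wrong URL, for example a missing /api/v1 prefix",
}


def diagnose(status: int) -> str:
    """Translate an HTTP status code into the table's diagnosis."""
    if status == 200:
        return "ok"
    if 500 <= status <= 599:
        return "ScaiControl problem: check its logs"
    return DIAGNOSIS.get(status, f"unexpected HTTP {status}")


def send_heartbeat(base_url: str, token: str, timeout: float = 10.0) -> str:
    """POST one heartbeat and return a diagnosis string."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/registry/heartbeat",
        data=json.dumps({"status": "healthy"}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return diagnose(resp.status)
    except urllib.error.HTTPError as exc:
        # 4xx and 5xx responses are raised as HTTPError; catch them first.
        return diagnose(exc.code)
    except OSError as exc:
        # URLError and socket timeouts are OSError subclasses:
        # DNS failure, connection refused, or timeout.
        return f"network failure ({exc}): check DNS, firewall, ingress rules"
```

Run `send_heartbeat(os.environ["SCAICONTROL_URL"], os.environ["SERVICE_TOKEN"])` from the service host, after `import os`, to get the same signal as the curl command.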

Step 4 — Is the heartbeat being recorded?

If the service reports successful heartbeats but ScaiControl still shows a stale last_heartbeat_at, the requests are probably reaching a different ScaiControl instance (for example, a load balancer fronting multiple deployments with separate databases), or you are reading a stale cached value. Verify that the service is sending to the SCAICONTROL_URL you expect.

Backend log line:

text
INFO  registry.heartbeat slug=<slug> status=healthy

Grep for it; absence at the expected time means the request didn't land.
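When the logs are large or shared by several services, a small filter helps avoid false matches on slugs that share a prefix. The sketch below assumes only the fragment shown above (`registry.heartbeat slug=<slug>`). Anything before it on the line, such as a timestamp, is ignored because the real log format is not documented here. The function name is hypothetical.

```python
import re
from typing import Iterable, Iterator


def heartbeat_log_lines(lines: Iterable[str], slug: str) -> Iterator[str]:
    """Yield log lines that record a heartbeat for exactly this slug."""
    # The lookahead stops "billing" from matching "billing-eu".
    pattern = re.compile(
        r"registry\.heartbeat\s+slug=" + re.escape(slug) + r"(?=\s|$)"
    )
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\n")
```

Pass an open file object as `lines` to scan a log file without loading it into memory.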

Step 5 — Is the monitor cron running?

bash
ps aux | grep -E 'arq|heartbeat_monitor'

The cron lives inside the arq worker. If the worker is down, consecutive_misses is not re-evaluated even after heartbeats resume. last_heartbeat_at still updates from the live POSTs, so health_status can look stuck at unreachable until the cron next runs.

Restart the worker; one cycle resets the counter.

Step 6 — Service is up but ScaiControl is misconfigured

Mismatch in the registered URL. ScaiControl's service_registry.base_url is the address it uses to call the service back, not the address heartbeats come from. If you changed the service's deployment URL without re-registering, downstream provisioning calls will fail, and the service is marked unreachable by ScaiControl's reverse health checks rather than by missed heartbeats.

sql
SELECT slug, base_url, callback_url FROM service_registry WHERE slug = '<slug>';

Update via PATCH /api/v1/admin/registry/{id} if wrong.
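A minimal sketch of that PATCH call with the Python standard library follows. Only the endpoint path comes from this page. The JSON body with a `base_url` field, the bearer-token authentication, and the function name are assumptions, so check the admin API reference before relying on them.

```python
import json
import urllib.request


def build_registry_patch(scaicontrol_url: str, service_id: int,
                         base_url: str, admin_token: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH that corrects a service's base_url.

    Assumptions: the endpoint accepts a JSON body with a "base_url" field
    and authenticates with a bearer token.
    """
    return urllib.request.Request(
        f"{scaicontrol_url.rstrip('/')}/api/v1/admin/registry/{service_id}",
        data=json.dumps({"base_url": base_url}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
```

Send the request with `urllib.request.urlopen(req)`, then re-run the SELECT above to confirm the stored value changed.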

Step 7 — Force-reset the status

Once the underlying issue is fixed and heartbeats are flowing, the service moves back to healthy automatically on the next successful heartbeat (the heartbeat handler clears consecutive_misses and sets health_status='healthy' in the same transaction). No manual action required.

If you need to nudge it for testing:

sql
UPDATE service_registry
SET health_status = 'healthy', consecutive_misses = 0, last_heartbeat_at = NOW()
WHERE slug = '<slug>';

This is purely cosmetic: if the underlying issue persists, misses start accumulating again once the grace period lapses, and the monitor walks the status back to degraded and then unreachable.

"Approved" vs "healthy" — different concepts

Don't conflate them:

registration_status ∈ {pending, approved, rejected} — administrative gate; only approved services can heartbeat or be provisioned to.
health_status ∈ {healthy, degraded, unreachable} — operational signal, derived from heartbeats.

A service can be approved + unreachable (just down right now). It cannot be pending + healthy — a pending service has no token to heartbeat with.
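One way to keep the two concepts apart in tooling is to model them as separate enums. The sketch below is illustrative only: the class and function names are hypothetical, and it encodes just the one rule stated above, that only approved services can heartbeat.

```python
from enum import Enum


class RegistrationStatus(Enum):
    """Administrative gate, set by an admin decision."""
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


class HealthStatus(Enum):
    """Operational signal, derived from heartbeats."""
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNREACHABLE = "unreachable"


def can_heartbeat(registration: RegistrationStatus) -> bool:
    """Only approved services can heartbeat or be provisioned to."""
    return registration is RegistrationStatus.APPROVED
```

Because a pending service cannot heartbeat, `can_heartbeat(RegistrationStatus.PENDING)` is false, which is why pending + healthy never occurs.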