Service marked unreachable
A registered ScaiLabs service shows health_status='unreachable' in /admin/registry, and the operations team sees alerts. Use this page to diagnose why ScaiControl can't see its heartbeats.
How "unreachable" is determined
The registry_heartbeat_monitor cron runs every REGISTRY_HEARTBEAT_MONITOR_INTERVAL seconds (default 60). For each registered service:
- It reads last_heartbeat_at from service_registry.
- Grace period = the service's own heartbeat_interval_seconds × REGISTRY_HEARTBEAT_GRACE_MULTIPLIER (default ×3).
- If now - last_heartbeat_at > grace, the consecutive_misses counter increments.
Thresholds (configurable via env):
| Consecutive misses | Health status |
|---|---|
| 0 | healthy |
| REGISTRY_HEARTBEAT_DEGRADED_THRESHOLD (default 3) | degraded |
| REGISTRY_HEARTBEAT_UNREACHABLE_THRESHOLD (default 10) | unreachable |
With the defaults, a service that heartbeats every 30 seconds has a grace period of 90 seconds and reaches unreachable after about 10 missed grace windows, roughly 15 minutes of silence.
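The rules above can be sketched in a few lines of Python. This is a minimal illustration, not ScaiControl's actual implementation: the function names are hypothetical, the constants mirror the documented defaults, and treating fewer misses than the degraded threshold as still healthy is an assumption, since the table only lists 0 and the two thresholds.

```python
# Documented defaults; in a real deployment these come from the environment.
GRACE_MULTIPLIER = 3        # REGISTRY_HEARTBEAT_GRACE_MULTIPLIER
DEGRADED_THRESHOLD = 3      # REGISTRY_HEARTBEAT_DEGRADED_THRESHOLD
UNREACHABLE_THRESHOLD = 10  # REGISTRY_HEARTBEAT_UNREACHABLE_THRESHOLD


def grace_seconds(heartbeat_interval_seconds: int,
                  multiplier: int = GRACE_MULTIPLIER) -> int:
    """Grace period = the service's own heartbeat interval x the multiplier."""
    return heartbeat_interval_seconds * multiplier


def status_for_misses(consecutive_misses: int) -> str:
    """Map the miss counter to a health status (hypothetical helper)."""
    if consecutive_misses >= UNREACHABLE_THRESHOLD:
        return "unreachable"
    if consecutive_misses >= DEGRADED_THRESHOLD:
        return "degraded"
    # Assumption: below the degraded threshold the service still reads as healthy.
    return "healthy"


grace = grace_seconds(30)
print(grace)                               # 90
print(status_for_misses(3))                # degraded
print(UNREACHABLE_THRESHOLD * grace / 60)  # 15.0 (minutes, this page's estimate)
```

The last line reproduces this page's own arithmetic (threshold × grace window); the exact time to unreachable in a live system also depends on how often the monitor cron runs.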
Step 1: Is the service actually running?
Standard process check — ps, systemctl status, kubectl get pods, whatever your runtime exposes. If the service is down, that's the answer; start it.
Step 2: Is it heartbeating?
Look at the most recent heartbeat in ScaiControl:
If last_heartbeat_at is very recent but health_status is still unreachable, the monitor hasn't run yet — wait one cycle.
If last_heartbeat_at is stale, the heartbeats are not arriving. Move to Step 3.
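The two cases above can be expressed as a small check on a single service_registry row. This is only a sketch: `triage` is a hypothetical helper, and reading "very recent" as "within the grace period" is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone


def triage(last_heartbeat_at: datetime, health_status: str,
           grace_seconds: int, now: datetime) -> str:
    """Suggest the next step from one service_registry row (hypothetical helper)."""
    age = now - last_heartbeat_at
    if age > timedelta(seconds=grace_seconds):
        # Heartbeats are not arriving at all.
        return "stale heartbeat: go to Step 3"
    if health_status == "unreachable":
        # Fresh heartbeat, but the status has not been recomputed yet.
        return "fresh heartbeat: wait one monitor cycle"
    return "fresh heartbeat: no action"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(triage(now - timedelta(seconds=20), "unreachable", 90, now))
# fresh heartbeat: wait one monitor cycle
print(triage(now - timedelta(minutes=20), "unreachable", 90, now))
# stale heartbeat: go to Step 3
```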
Step 3: Can the service reach ScaiControl?
Heartbeats are POST /api/v1/registry/heartbeat with a service token. Test from the service host:
Expected: 200 {"ok": true}.
Possible failures:
| HTTP / network | Meaning |
|---|---|
| Network timeout / connection refused | Service can't reach ScaiControl's URL. Check DNS, firewall, ingress rules |
| 401 | Service token invalid, expired, or wrong issuer. The service might be reading stale credentials |
| 403 | Token valid but lacks registry:manage scope. Re-issue via ScaiKey |
| 404 | Wrong URL (e.g. missing /api/v1) |
| 5xx | ScaiControl problem; check its logs |
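A probe along these lines can be run from the service host to exercise the heartbeat endpoint and translate the result using the table above. Treat it as a sketch under assumptions: the Bearer authorization scheme and the empty JSON body are guesses, not the documented request format, and `build_heartbeat_request`, `explain` and `probe` are hypothetical names.

```python
import urllib.error
import urllib.request

HEARTBEAT_PATH = "/api/v1/registry/heartbeat"  # endpoint named on this page


def build_heartbeat_request(scaicontrol_url: str, token: str) -> urllib.request.Request:
    """Build the POST; the Bearer scheme and empty JSON body are assumptions."""
    return urllib.request.Request(
        scaicontrol_url.rstrip("/") + HEARTBEAT_PATH,
        data=b"{}",
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def explain(status: int) -> str:
    """Translate an HTTP status into the diagnosis from the failure table."""
    if status == 200:
        return "heartbeat accepted"
    if status == 401:
        return "token invalid, expired, or wrong issuer"
    if status == 403:
        return "token lacks registry:manage scope"
    if status == 404:
        return "wrong URL (e.g. missing /api/v1)"
    if 500 <= status <= 599:
        return "ScaiControl problem; check its logs"
    return f"unexpected status {status}"


def probe(scaicontrol_url: str, token: str, timeout: float = 5.0) -> str:
    """Send one heartbeat and describe the outcome."""
    request = build_heartbeat_request(scaicontrol_url, token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return explain(response.status)
    except urllib.error.HTTPError as exc:            # 4xx / 5xx responses
        return explain(exc.code)
    except (urllib.error.URLError, OSError) as exc:  # DNS, refused, timeout
        return f"cannot reach ScaiControl: {exc}"
```

Calling `probe(url, token)` with the service's configured ScaiControl URL and its service token returns one line of diagnosis; nothing is sent until `probe` is called.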
Step 4: Is the heartbeat being recorded?
If the service reports successful heartbeats but ScaiControl still shows a stale last_heartbeat_at, the requests are probably reaching a different ScaiControl instance (for example, a load balancer fronting multiple deployments with separate databases), or you are looking at a stale cache. Verify that the service is hitting the SCAICONTROL_URL it should.
Backend log line:
Grep for it; absence at the expected time means the request didn't land.
Step 5: Is the monitor cron running?
The cron lives inside the arq worker. If the worker is down, consecutive_misses is not reset even after heartbeats resume. last_heartbeat_at still updates from the live POSTs, so health_status looks stuck at unreachable until the cron next runs.
Restart the worker; one cycle resets the counter.
Step 6: Service is up but ScaiControl is misconfigured
The registered URL may be wrong. ScaiControl's service_registry.base_url is the address ScaiControl uses to reach back to the service, not where heartbeats come from. If you changed the service's deployment URL without re-registering, downstream provisioning calls fail, and the service is marked unreachable through ScaiControl's reverse health checks rather than through missed heartbeats.
Update via PATCH /api/v1/admin/registry/{id} if wrong.
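As a rough sketch, the PATCH could be built like this. The endpoint path comes from this page; the `base_url` body field (named after the service_registry column), the Bearer scheme, and the example hosts and IDs are assumptions, and `build_base_url_patch` is a hypothetical helper.

```python
import json
import urllib.request


def build_base_url_patch(scaicontrol_url: str, service_id: str,
                         new_base_url: str, admin_token: str) -> urllib.request.Request:
    """Build PATCH /api/v1/admin/registry/{id}; body field and auth scheme are assumed."""
    body = json.dumps({"base_url": new_base_url}).encode("utf-8")
    return urllib.request.Request(
        f"{scaicontrol_url.rstrip('/')}/api/v1/admin/registry/{service_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
request = build_base_url_patch(
    "https://scaicontrol.example.com", "42",
    "https://svc.example.com", "admin-token",
)
print(request.get_method(), request.full_url)
# PATCH https://scaicontrol.example.com/api/v1/admin/registry/42
```

Passing the request to `urllib.request.urlopen` sends it; check the admin API reference for the real payload shape before relying on this.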
Step 7: Force-reset the status
Once the underlying issue is fixed and heartbeats are flowing, the service moves back to healthy automatically on the next successful heartbeat (the heartbeat handler clears consecutive_misses and sets health_status='healthy' in the same transaction). No manual action required.
If you need to nudge it for testing:
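One way to do that is a direct update of the registry row, sketched here with Python's sqlite3 module as an in-memory stand-in for the real database. The table and the health_status and consecutive_misses columns are named on this page; the `id` column, the reduced schema, and the `force_healthy` helper are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for the real database; only the columns needed here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE service_registry ("
    "id INTEGER PRIMARY KEY, health_status TEXT, consecutive_misses INTEGER)"
)
conn.execute("INSERT INTO service_registry VALUES (1, 'unreachable', 12)")


def force_healthy(conn: sqlite3.Connection, service_id: int) -> int:
    """Reset one service's status and miss counter; returns rows changed."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE service_registry "
            "SET health_status = 'healthy', consecutive_misses = 0 "
            "WHERE id = ?",
            (service_id,),
        )
    return cursor.rowcount


print(force_healthy(conn, 1))  # 1
print(conn.execute(
    "SELECT health_status, consecutive_misses FROM service_registry WHERE id = 1"
).fetchone())  # ('healthy', 0)
```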
This is purely cosmetic: if the underlying issue persists, the next monitor cycle reverts the status.
"Approved" vs "healthy": different concepts
Don't conflate them:
- registration_status ∈ {pending, approved, rejected}: administrative gate; only approved services can heartbeat or be provisioned to.
- health_status ∈ {healthy, degraded, unreachable}: operational signal, derived from heartbeats.
A service can be approved + unreachable (just down right now). It cannot be pending + healthy — a pending service has no token to heartbeat with.
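The distinction can be captured in a small consistency check. This is a hypothetical sketch: it encodes only the rules stated here (approval gates heartbeating, and pending + healthy cannot occur) and makes no claim about other combinations.

```python
REGISTRATION_STATUSES = {"pending", "approved", "rejected"}
HEALTH_STATUSES = {"healthy", "degraded", "unreachable"}


def can_heartbeat(registration_status: str) -> bool:
    """Only approved services can heartbeat or be provisioned to."""
    if registration_status not in REGISTRATION_STATUSES:
        raise ValueError(f"unknown registration_status: {registration_status!r}")
    return registration_status == "approved"


def is_impossible(registration_status: str, health_status: str) -> bool:
    """Flag the one combination this page rules out: pending + healthy."""
    if health_status not in HEALTH_STATUSES:
        raise ValueError(f"unknown health_status: {health_status!r}")
    return registration_status == "pending" and health_status == "healthy"


print(can_heartbeat("approved"))                 # True
print(is_impossible("approved", "unreachable"))  # False
print(is_impossible("pending", "healthy"))       # True
```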
See also
- Reference: configuration — heartbeat env vars
- Reference: state-machines — registry health transitions
- Concepts: architecture — service registry's role