Service marked unreachable

A registered ScaiLabs service shows health_status='unreachable' in /admin/registry, and the operations team sees alerts. Use this page to diagnose why ScaiControl can't see its heartbeats.

How "unreachable" is determined

The registry_heartbeat_monitor cron runs every REGISTRY_HEARTBEAT_MONITOR_INTERVAL seconds (default 60). For each registered service:

  • It looks at last_heartbeat_at from service_registry.
  • Grace period = service's own heartbeat_interval_seconds × REGISTRY_HEARTBEAT_GRACE_MULTIPLIER (default ×3).
  • If now - last_heartbeat_at > grace, the consecutive_misses counter increments.

Thresholds (configurable via env):

| Consecutive misses | Health status |
| --- | --- |
| 0 | healthy |
| REGISTRY_HEARTBEAT_DEGRADED_THRESHOLD (default 3) | degraded |
| REGISTRY_HEARTBEAT_UNREACHABLE_THRESHOLD (default 10) | unreachable |

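So with defaults, a service heartbeating every 30 seconds has a grace period of 90 seconds. Going by the rule above, the first miss is counted on the first monitor run after 90 seconds of silence, and one more on each 60-second run after that. A silent service therefore reaches degraded (3 misses) after roughly 4 minutes and unreachable (10 misses) after roughly 11 minutes.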
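The rules above can be sketched in a few lines of Python. This is an illustration only, not ScaiControl's implementation: the function names are hypothetical, the constants are just the defaults quoted on this page, and the sketch assumes a status applies once the miss counter is at or above its threshold (the table does not say how counts of 1 or 2 are treated).

```python
from datetime import datetime, timedelta, timezone

# Defaults quoted on this page; the real values come from the environment.
GRACE_MULTIPLIER = 3
DEGRADED_THRESHOLD = 3
UNREACHABLE_THRESHOLD = 10


def grace_period(heartbeat_interval_seconds: int) -> timedelta:
    """Grace = the service's own heartbeat interval times the multiplier."""
    return timedelta(seconds=heartbeat_interval_seconds * GRACE_MULTIPLIER)


def is_miss(now: datetime, last_heartbeat_at: datetime,
            heartbeat_interval_seconds: int) -> bool:
    """True when the last heartbeat is older than the grace period."""
    return now - last_heartbeat_at > grace_period(heartbeat_interval_seconds)


def status_for(consecutive_misses: int) -> str:
    """Map a miss counter to a status (assumed 'at or above' semantics)."""
    if consecutive_misses >= UNREACHABLE_THRESHOLD:
        return "unreachable"
    if consecutive_misses >= DEGRADED_THRESHOLD:
        return "degraded"
    return "healthy"
```

For example, `grace_period(30)` is 90 seconds, and a heartbeat that is 91 seconds old counts as a miss while one that is exactly 90 seconds old does not.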

Step 1 — Is the service actually running?

Standard process check — ps, systemctl status, kubectl get pods, whatever your runtime exposes. If the service is down, that's the answer; start it.

Step 2 — Is it heartbeating?

Look at the most recent heartbeat in ScaiControl:

```sql
SELECT id, slug, name, last_heartbeat_at, consecutive_misses, health_status,
       heartbeat_interval_seconds
FROM service_registry
WHERE slug = '<service-slug>';
```

If last_heartbeat_at is very recent but health_status is still unreachable, the monitor hasn't run yet — wait one cycle.

If last_heartbeat_at is stale, the heartbeats are not arriving. Move to Step 3.
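The two cases above can be told apart mechanically once the row is in hand. The sketch below is a hypothetical helper, not part of ScaiControl: it takes the row as a plain dict with the column names from the query, and it assumes "stale" means older than the grace period with the default multiplier of 3.

```python
from datetime import datetime, timedelta, timezone

GRACE_MULTIPLIER = 3  # default quoted on this page (assumption for this sketch)


def triage(row: dict, now: datetime) -> str:
    """Suggest the next step from one service_registry row."""
    last = row["last_heartbeat_at"]
    if last is None:
        # No heartbeat has ever been recorded for this service.
        return "stale: heartbeats are not arriving, go to Step 3"
    grace = timedelta(seconds=row["heartbeat_interval_seconds"] * GRACE_MULTIPLIER)
    if now - last > grace:
        return "stale: heartbeats are not arriving, go to Step 3"
    if row["health_status"] != "healthy":
        return "fresh: heartbeats are arriving, wait one monitor cycle"
    return "ok: heartbeats are arriving and status is healthy"


# Usage with a hand-built row; in practice the values come from the query.
now = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
row = {
    "last_heartbeat_at": now - timedelta(minutes=20),
    "heartbeat_interval_seconds": 30,
    "health_status": "unreachable",
}
print(triage(row, now))
```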

Step 3 — Can the service reach ScaiControl?

Heartbeats are POST /api/v1/registry/heartbeat with a service token. Test from the service host:

```bash
curl -i -X POST "$SCAICONTROL_URL/api/v1/registry/heartbeat" \
     -H "Authorization: Bearer $SERVICE_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"status":"healthy"}'
```

Expected: HTTP 200 with the body {"ok": true}.

Possible failures:

| HTTP / network | Meaning |
| --- | --- |
| Network timeout / connection refused | Service can't reach ScaiControl's URL. Check DNS, firewall, ingress rules |
| 401 | Service token invalid, expired, or wrong issuer. The service might be reading stale credentials |
| 403 | Token valid but lacks registry:manage scope. Re-issue via ScaiKey |
| 404 | Wrong URL (e.g. missing /api/v1) |
| 5xx | ScaiControl problem; check its logs |
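If you would rather script this check than run curl by hand, the same request and the failure table can be sketched with the Python standard library. The function names are hypothetical, and the request mirrors only what the curl command above sends; nothing else about the endpoint is assumed.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

# Hints copied from the failure table above.
HINTS = {
    401: "token invalid, expired, or wrong issuer; check for stale credentials",
    403: "token lacks registry:manage scope; re-issue via ScaiKey",
    404: "wrong URL (e.g. missing /api/v1)",
}


def explain(status: Optional[int]) -> str:
    """Translate an HTTP status (None = no response at all) into a hint."""
    if status is None:
        return "network: cannot reach ScaiControl; check DNS, firewall, ingress"
    if status == 200:
        return "ok: heartbeat accepted"
    if status in HINTS:
        return HINTS[status]
    if 500 <= status <= 599:
        return "ScaiControl problem; check its logs"
    return f"unexpected status {status}"


def send_heartbeat(base_url: str, token: str, timeout: float = 10.0) -> Optional[int]:
    """POST one heartbeat; return the HTTP status, or None on network failure."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/registry/heartbeat",
        data=json.dumps({"status": "healthy"}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered with a 4xx/5xx status
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...
```

A typical call would be `print(explain(send_heartbeat(os.environ["SCAICONTROL_URL"], os.environ["SERVICE_TOKEN"])))`, run from the service host so the network path matches the real heartbeats.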

Step 4 — Is the heartbeat being recorded?

If the service reports successful heartbeats but ScaiControl still shows a stale last_heartbeat_at, the requests are either reaching a different ScaiControl instance (for example, a load balancer fronting multiple deployments with separate databases) or being answered from a stale cache. Verify that the SCAICONTROL_URL the service uses points at the deployment you are inspecting.

Backend log line:

```text
INFO  registry.heartbeat slug=<slug> status=healthy
```

Grep for it; absence at the expected time means the request didn't land.
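When the log is large or you want to filter by slug exactly, a small script can do the same search. This sketch assumes the line format is exactly as shown above; any timestamp or other prefix your logger adds is ignored, and the function name is hypothetical.

```python
import re
from typing import Iterable, List

# Matches the log line shown above, wherever it appears within a line.
HEARTBEAT_RE = re.compile(
    r"INFO\s+registry\.heartbeat\s+slug=(?P<slug>\S+)\s+status=(?P<status>\S+)"
)


def heartbeat_lines(lines: Iterable[str], slug: str) -> List[str]:
    """Return the log lines that record a heartbeat for the given slug."""
    matches = []
    for line in lines:
        found = HEARTBEAT_RE.search(line)
        # Compare the captured slug exactly, so 'billing' does not match 'billing-eu'.
        if found and found.group("slug") == slug:
            matches.append(line.rstrip("\n"))
    return matches


# Usage with made-up sample lines; the slugs here are placeholders.
sample = [
    "INFO  registry.heartbeat slug=billing status=healthy\n",
    "INFO  registry.heartbeat slug=billing-eu status=healthy\n",
    "INFO  something.else happened\n",
]
print(len(heartbeat_lines(sample, "billing")))
```

An empty result for the time window you expect means the request never reached this backend.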

Step 5 — Is the monitor cron running?

```bash
# Exclude the grep process itself, which would otherwise always match.
ps aux | grep -E 'arq|heartbeat_monitor' | grep -v grep
```

The cron lives inside the arq worker. If the worker is down, the monitor never re-evaluates the service: last_heartbeat_at still updates from the live POSTs, but consecutive_misses is not reset, so health_status looks stuck at unreachable until the cron next runs.

Restart the worker; one cycle resets the counter.

Step 6 — Service is up but ScaiControl is misconfigured

The registered URL may not match the deployment. ScaiControl's service_registry.base_url is the address it uses to reach back to the service, not the address heartbeats come from. If you changed the service's deployment URL without re-registering, downstream provisioning calls will fail, and the service is marked unreachable by ScaiControl's reverse health checks rather than by missed heartbeats.

```sql
SELECT slug, base_url, callback_url FROM service_registry WHERE slug = '<slug>';
```

Update via PATCH /api/v1/admin/registry/{id} if wrong.
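To confirm a mismatch without eyeballing two URLs, you can compare the registered base_url with the address the service is actually deployed at. This is a hypothetical helper, and the normalisation rules (case-insensitive host, default ports, trailing slash ignored) are assumptions about what should count as the same endpoint, not ScaiControl behaviour.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalise(url: str) -> tuple:
    """Reduce a URL to (scheme, host, port, path) for comparison."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Fill in the scheme's default port when none is given explicitly.
    port = parts.port or DEFAULT_PORTS.get(scheme)
    path = parts.path.rstrip("/")
    return (scheme, host, port, path)


def same_endpoint(registered: str, deployed: str) -> bool:
    """True when both URLs point at the same scheme, host, port and path."""
    return normalise(registered) == normalise(deployed)


# Usage with placeholder hostnames.
print(same_endpoint("https://svc.example.com:443/", "https://SVC.example.com"))
print(same_endpoint("https://svc.example.com", "https://svc-new.example.com"))
```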

Step 7 — Force-reset the status

Once the underlying issue is fixed and heartbeats are flowing, the service moves back to healthy automatically on the next successful heartbeat (the heartbeat handler clears consecutive_misses and sets health_status='healthy' in the same transaction). No manual action required.

If you need to nudge it for testing:

```sql
UPDATE service_registry
SET health_status = 'healthy', consecutive_misses = 0, last_heartbeat_at = NOW()
WHERE slug = '<slug>';
```

This is purely cosmetic: if the underlying issue persists, the next monitor cycle will revert the status.

"Approved" vs "healthy": different concepts

Don't conflate them:

  • registration_status ∈ {pending, approved, rejected} — administrative gate; only approved services can heartbeat or be provisioned to.
  • health_status ∈ {healthy, degraded, unreachable} — operational signal, derived from heartbeats.

A service can be approved + unreachable (just down right now). It cannot be pending + healthy — a pending service has no token to heartbeat with.

Updated 2026-05-18 01:48:40, rev 2