Health and Monitoring

Health checks, Prometheus metrics, log format, and what to alert on.

Health endpoints

GET /health

Basic liveness check. Returns 200 OK if the process is running and can serve requests.

bash
curl https://scaigrid.scailabs.ai/health

json
{"status": "ok"}

No authentication required. Use for load-balancer health checks.

GET /health/ready

Readiness check. Returns 200 OK only if the process can handle traffic — database reachable, Redis reachable, essential modules loaded.

json
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "modules": "ok"
  }
}

Returns 503 if any check fails. Use for Kubernetes readiness probes — traffic won't route to a not-ready pod.
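When a readiness probe fails, the first question is usually which check failed. The sketch below parses a readiness body and lists the checks that are not "ok". The success shape is taken from the example above; the shape of a failing body (the "not_ready" status and the "error" value) is an assumption made for illustration, not documented behavior.

```python
import json


def failed_checks(body: str) -> list[str]:
    """Return the names of readiness checks whose value is not "ok"."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(name for name, state in checks.items() if state != "ok")


# Hypothetical 503 body: the "not_ready" and "error" values are assumed.
sample = (
    '{"status": "not_ready", '
    '"checks": {"database": "ok", "redis": "error", "modules": "ok"}}'
)
print(failed_checks(sample))
```

An empty list means every reported check passed.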

GET /health/detailed

Detailed status — per-module, per-dependency, with error messages. Admin-only.

bash
curl https://scaigrid.scailabs.ai/health/detailed \
  -H "Authorization: Bearer $ADMIN_TOKEN"

The response includes every module's init status, the last heartbeat from external dependencies (ScaiInfer nodes, ScaiBunker workers), and Redis and MariaDB ping latencies.

Useful for diagnostics during incidents.

Prometheus metrics

Endpoint: GET /metrics. No authentication is required by default; in production, put a firewall or basic auth in front of it.

Core metrics

Request-level:

scdoc

scaigrid_requests_total{model, status, protocol}              counter
scaigrid_request_duration_seconds{model, backend}             histogram
scaigrid_tokens_total{model, direction, tenant_id}            counter
scaigrid_time_to_first_token_seconds{model}                   histogram

Backend health:

scdoc

scaigrid_backend_health{backend_id}                            gauge (1 healthy, 0 unhealthy)
scaigrid_backend_inflight_requests{backend_id}                 gauge
scaigrid_circuit_breaker_state{backend_id}                     gauge (0 closed, 1 open, 2 half-open)

Accounting pipeline:

scdoc

scaigrid_accounting_flush_lag_seconds                          gauge
scaigrid_event_bus_consumer_lag                                gauge
scaigrid_redis_stream_length{stream_name}                      gauge

Budgets:

scdoc

scaigrid_budget_utilization_ratio{scope, scope_id}             gauge (0.0 to > 1.0)

Session / activity:

scdoc

scaigrid_active_sessions                                       gauge
scaigrid_active_cores                                          gauge
scaigrid_checkpoint_pending_count                              gauge

Webhooks:

scdoc

scaigrid_webhook_delivery_failures_total{webhook_id, event_type}  counter
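For a quick check without a Prometheus server, the text returned by /metrics can be scanned directly. The following is a minimal sketch that reads Prometheus text exposition output and lists backends whose scaigrid_backend_health gauge is 0. The backend IDs in the sample are made up, and the parser is deliberately simple: it assumes label values contain no "}" or escaped quotes.

```python
import re

# Matches sample lines such as: scaigrid_backend_health{backend_id="x"} 0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def unhealthy_backends(metrics_text: str) -> list[str]:
    """Return backend_id values whose health gauge reports 0."""
    bad = []
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comment lines
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != "scaigrid_backend_health":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if float(match.group("value")) == 0:
            bad.append(labels.get("backend_id", "<unknown>"))
    return bad


# Hypothetical scrape output; backend IDs are illustrative only.
sample_metrics = """\
# TYPE scaigrid_backend_health gauge
scaigrid_backend_health{backend_id="backend-a"} 1
scaigrid_backend_health{backend_id="backend-b"} 0
scaigrid_backend_inflight_requests{backend_id="backend-a"} 3
"""
print(unhealthy_backends(sample_metrics))
```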

Module metrics

Each module contributes its own metrics with scai{module}_* naming:

ScaiBunker: scaibunker_bunkers_active, scaibunker_exec_total, scaibunker_placement_duration_seconds, etc.
ScaiCore: scaicore_invocations_total, scaicore_llm_calls_total, scaicore_plugin_calls_total
ScaiQueue: (documented in ScaiQueue's internal spec)

Full list: scrape /metrics on a running instance to see what's exposed.

Logging

ScaiGrid emits structured JSON logs. Every log line has:

json
{
  "timestamp": "2026-04-22T14:30:01.234Z",
  "level": "info",
  "logger": "app.services.inference",
  "event": "chat_completion",
  "request_id": "req_abc",
  "tenant_id": "tenant_acme",
  "user_id": "user_alice",
  "model": "scailabs/poolnoodle-omni",
  "latency_ms": 842,
  ...
}

Critical fields for tracing:

request_id — correlates across middleware, handlers, dispatchers, database, accounting pipeline
tenant_id / user_id — for per-tenant/per-user investigations
event — the logical event name (snake_case)

Tenant admins can retrieve logs via the admin UI. For platform operators, logs flow to stdout; point them at your log aggregation stack (Loki, Datadog, Elasticsearch, CloudWatch).
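Because every line is a JSON object carrying request_id, following one request through the system comes down to filtering on that field. Here is a minimal sketch that works on any iterable of log lines (a file, stdin, or an aggregator export). The field names come from the example above; the "backend_timeout" event name in the sample is invented for illustration.

```python
import json
from typing import Iterable, Iterator


def events_for_request(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield parsed log records that belong to the given request_id."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines mixed into the stream
        if isinstance(record, dict) and record.get("request_id") == request_id:
            yield record


# Hypothetical log excerpt; only the field names follow the format above.
sample_lines = [
    '{"level": "info", "event": "chat_completion", "request_id": "req_abc"}',
    "Traceback (most recent call last):",
    '{"level": "error", "event": "backend_timeout", "request_id": "req_abc"}',
    '{"level": "info", "event": "chat_completion", "request_id": "req_xyz"}',
]
for record in events_for_request(sample_lines, "req_abc"):
    print(record["level"], record["event"])
```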

Recommended alerts

P0 — Wake someone up:

/health/ready returns non-200 for > 2 minutes
scaigrid_backend_health == 0 for > 50% of backends
scaigrid_accounting_flush_lag_seconds > 300 — accounting pipeline stuck
MariaDB cluster has fewer than a majority of its nodes healthy

P1 — Investigate in business hours:

sum(rate(scaigrid_requests_total{status=~"5.."}[5m])) / sum(rate(scaigrid_requests_total[5m])) > 0.01 (error rate above 1%)
histogram_quantile(0.99, sum by (le) (rate(scaigrid_request_duration_seconds_bucket[5m]))) > 10 (p99 latency over 10 seconds)
scaigrid_circuit_breaker_state == 1 for any backend — circuit open

P2 — Keep an eye on:

scaigrid_budget_utilization_ratio > 0.8 for any budget — approaching limits
rate(scaigrid_webhook_delivery_failures_total[1h]) > 0 — webhook delivery issues
scaigrid_event_bus_consumer_lag > 1000 — event processing backing up
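To make the P1 error-rate threshold concrete, the sketch below computes the same ratio by hand from two readings of the request counters taken a few minutes apart. The counter values are invented, and the helper ignores counter resets, which Prometheus's rate() handles for you.

```python
def error_ratio(
    errors_then: float, errors_now: float, total_then: float, total_now: float
) -> float:
    """Share of requests that failed between two counter readings."""
    total_delta = total_now - total_then
    if total_delta <= 0:
        return 0.0  # no traffic in the window (or a counter reset)
    return (errors_now - errors_then) / total_delta


# Hypothetical readings five minutes apart: 12 new 5xx out of 800 new requests.
ratio = error_ratio(errors_then=40, errors_now=52, total_then=10_000, total_now=10_800)
print(f"{ratio:.3f}", "alert" if ratio > 0.01 else "ok")
```

Here 12 / 800 = 0.015, which is above the 0.01 threshold, so the alert condition holds for this window.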

Tracing

ScaiGrid propagates request IDs but doesn't ship with OpenTelemetry instrumentation out of the box. For distributed tracing:

Set X-Request-ID on incoming requests from your frontend load balancer.
ScaiGrid passes it through all downstream calls (database, Redis, upstream LLM APIs, webhook deliveries).
Your logging pipeline correlates by request ID.

For full OTel spans, plug in via the optional instrumentation hook. Ask your ScaiGrid support contact for the latest integration guide.
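If the load balancer cannot inject the header itself, a small shim in front of ScaiGrid can do it. The following is a minimal sketch of the "set X-Request-ID on incoming requests" step: it keeps an ID the client already supplied and otherwise generates one. The "req_" prefix mirrors the log example earlier on this page, but the ID format ScaiGrid expects is an assumption here.

```python
import uuid


def ensure_request_id(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of headers that always carries a non-empty X-Request-ID."""
    out = dict(headers)
    # HTTP header names are case-insensitive, so compare in lowercase.
    has_id = any(k.lower() == "x-request-id" and v for k, v in out.items())
    if not has_id:
        # Assumed format: "req_" plus 32 hex characters.
        out["X-Request-ID"] = f"req_{uuid.uuid4().hex}"
    return out


print(ensure_request_id({"x-request-id": "req_abc"}))
print(sorted(ensure_request_id({"Accept": "application/json"})))
```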

Dashboards

Import our reference Grafana dashboards:

ScaiGrid Overview — request rate, latency, error rate, backend health
ScaiGrid Per-Tenant — same metrics sliced by tenant_id
ScaiGrid Modules — per-module metrics for enabled modules
ScaiGrid Accounting — token consumption, cost, budget utilization

Dashboard JSON files are in the ScaiGrid source repository under ops/grafana/.

What to check first during an incident

GET /health/ready — is the basic plumbing alive?
GET /health/detailed — which specific component is unhealthy?
Search the logs for recent errors: entries with "level": "error"
Check backend health: scaigrid_backend_health{backend_id=...}
Check upstream provider status pages (OpenAI, Anthropic, etc.) if a specific provider is failing
Check Redis and MariaDB cluster state
