
Health and Monitoring

Health checks, Prometheus metrics, log format, and what to alert on.

Health endpoints

GET /health

Basic liveness check. Returns 200 OK if the process is running and can serve requests.

bash
curl https://scaigrid.scailabs.ai/health
json
{"status": "ok"}

No authentication required. Use for load-balancer health checks.

GET /health/ready

Readiness check. Returns 200 OK only if the process can handle traffic — database reachable, Redis reachable, essential modules loaded.

json
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "modules": "ok"
  }
}

Returns 503 if any check fails. Use for Kubernetes readiness probes — traffic won't route to a not-ready pod.
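
When a probe fails, it helps to know which dependency caused it. The following is a minimal sketch that turns a readiness response into a list of failing checks. It assumes a 503 body keeps the same `status`/`checks` shape as the 200 example above; the failure body is not documented here, and the `not_ready` and `error` values in the sample are hypothetical.

```python
import json


def failing_checks(status_code: int, body: str) -> list[str]:
    """Return the names of readiness checks that are not "ok".

    Assumption: a 503 body has the same {"status", "checks"} shape as the
    documented 200 response.
    """
    if status_code == 200:
        return []
    try:
        checks = json.loads(body).get("checks", {})
        failed = sorted(name for name, state in checks.items() if state != "ok")
    except (json.JSONDecodeError, AttributeError):
        # Body was not a JSON object with a "checks" mapping.
        return ["unparseable response"]
    # A non-200 without a named culprit is still a failure.
    return failed or ["unknown"]


# Hypothetical 503 body, for illustration only.
sample_503 = (
    '{"status": "not_ready", '
    '"checks": {"database": "ok", "redis": "error", "modules": "ok"}}'
)
print(failing_checks(503, sample_503))  # ['redis']
```

If the real failure body differs, only the parsing inside the `try` block should need to change.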

GET /health/detailed

Detailed status — per-module, per-dependency, with error messages. Admin-only.

bash
curl https://scaigrid.scailabs.ai/health/detailed \
  -H "Authorization: Bearer $ADMIN_TOKEN"

The response includes every module's init status, the last heartbeat from each external dependency (ScaiInfer nodes, ScaiBunker workers), and ping latencies for Redis and MariaDB.

Useful for diagnostics during incidents.

Prometheus metrics

Endpoint: GET /metrics (no authentication required by default; in production, put it behind a firewall or basic auth)

Core metrics

Request-level:

scdoc
scaigrid_requests_total{model, status, protocol}              counter
scaigrid_request_duration_seconds{model, backend}             histogram
scaigrid_tokens_total{model, direction, tenant_id}            counter
scaigrid_time_to_first_token_seconds{model}                   histogram

Backend health:

scdoc
scaigrid_backend_health{backend_id}                            gauge (1 healthy, 0 unhealthy)
scaigrid_backend_inflight_requests{backend_id}                 gauge
scaigrid_circuit_breaker_state{backend_id}                     gauge (0 closed, 1 open, 2 half-open)

Accounting pipeline:

scdoc
scaigrid_accounting_flush_lag_seconds                          gauge
scaigrid_event_bus_consumer_lag                                gauge
scaigrid_redis_stream_length{stream_name}                      gauge

Budgets:

scdoc
scaigrid_budget_utilization_ratio{scope, scope_id}             gauge (0.0 to > 1.0)

Session / activity:

scdoc
scaigrid_active_sessions                                       gauge
scaigrid_active_cores                                          gauge
scaigrid_checkpoint_pending_count                              gauge

Webhooks:

carbon
scaigrid_webhook_delivery_failures_total{webhook_id, event_type}  counter

Module metrics

Each module contributes its own metrics with scai{module}_* naming:

  • ScaiBunker: scaibunker_bunkers_active, scaibunker_exec_total, scaibunker_placement_duration_seconds, etc.
  • ScaiCore: scaicore_invocations_total, scaicore_llm_calls_total, scaicore_plugin_calls_total
  • ScaiQueue: (documented in ScaiQueue's internal spec)

Full list: scrape /metrics on a running instance to see what's exposed.
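
For a quick look at a scrape without a Prometheus server, the text exposition format can be read with a few lines of code. This is a rough sketch, not a full parser: it handles plain `name{labels} value` lines only and will misread label values that contain `}` or escaped quotes. The sample text and backend IDs are made up for illustration.

```python
import re

# Matches simple sample lines such as: metric_name{label="value"} 1.0
_LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = _LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def unhealthy_backends(text):
    """Return backend IDs whose scaigrid_backend_health gauge is 0."""
    return sorted(
        labels.get("backend_id", "")
        for name, labels, value in parse_metrics(text)
        if name == "scaigrid_backend_health" and value == 0
    )


# Hypothetical scrape output; real backend IDs will differ.
SAMPLE_TEXT = """\
# TYPE scaigrid_backend_health gauge
scaigrid_backend_health{backend_id="backend-a"} 1
scaigrid_backend_health{backend_id="backend-b"} 0
scaigrid_active_sessions 12
"""
print(unhealthy_backends(SAMPLE_TEXT))  # ['backend-b']
```

For anything beyond ad hoc inspection, a maintained Prometheus client library is the safer choice.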

Logging

ScaiGrid emits structured JSON logs. Every log line has:

json
{
  "timestamp": "2026-04-22T14:30:01.234Z",
  "level": "info",
  "logger": "app.services.inference",
  "event": "chat_completion",
  "request_id": "req_abc",
  "tenant_id": "tenant_acme",
  "user_id": "user_alice",
  "model": "scailabs/poolnoodle-omni",
  "latency_ms": 842,
  ...
}

Critical fields for tracing:

  • request_id — correlates across middleware, handlers, dispatchers, database, accounting pipeline
  • tenant_id / user_id — for per-tenant/per-user investigations
  • event — the logical event name (snake_case)
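
Because every line is a JSON object, following one request through the system is a matter of filtering on `request_id`. The sketch below assumes one JSON object per line and skips anything that does not parse; the sample lines are invented, and only the field names come from the example above.

```python
import json


def lines_for_request(log_lines, request_id):
    """Yield parsed log records whose request_id matches."""
    for raw in log_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines, e.g. a startup banner
        if isinstance(record, dict) and record.get("request_id") == request_id:
            yield record


# Hypothetical log lines, for illustration only.
sample_lines = [
    '{"level": "info", "event": "chat_completion", "request_id": "req_abc"}',
    "not json at all",
    '{"level": "error", "event": "backend_timeout", "request_id": "req_xyz"}',
    '{"level": "info", "event": "accounting_flush", "request_id": "req_abc"}',
]
events = [r["event"] for r in lines_for_request(sample_lines, "req_abc")]
print(events)  # ['chat_completion', 'accounting_flush']
```

Any iterable of strings works as input, so the same function can read from `sys.stdin` when logs are piped in.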

Tenant admins can retrieve logs via the admin UI. For platform operators, logs flow to stdout; point them at your log aggregation stack (Loki, Datadog, Elasticsearch, CloudWatch).

What to alert on

P0 — Wake someone up:

  • /health/ready returns non-200 for > 2 minutes
  • scaigrid_backend_health == 0 for > 50% of backends
  • scaigrid_accounting_flush_lag_seconds > 300 — accounting pipeline stuck
  • MariaDB cluster has fewer than a majority of nodes healthy

P1 — Investigate in business hours:

  • sum(rate(scaigrid_requests_total{status=~"5.."}[5m])) / sum(rate(scaigrid_requests_total[5m])) > 0.01 (error rate above 1%)
  • histogram_quantile(0.99, sum by (le) (rate(scaigrid_request_duration_seconds_bucket[5m]))) > 10 (p99 latency over 10 seconds)
  • scaigrid_circuit_breaker_state == 1 for any backend — circuit open

P2 — Keep an eye on:

  • scaigrid_budget_utilization_ratio > 0.8 for any budget — approaching limits
  • rate(scaigrid_webhook_delivery_failures_total[1h]) > 0 — webhook delivery issues
  • scaigrid_event_bus_consumer_lag > 1000 — event processing backing up
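
Several of these conditions only matter when they persist, for example a readiness failure lasting more than 2 minutes. In Prometheus this is normally expressed with the `for:` clause of an alerting rule. The sketch below only illustrates the semantics for a hand-rolled probe; the class name and the timestamps are illustrative and not part of ScaiGrid.

```python
class SustainedCondition:
    """Fire only when a condition has held continuously for longer than a threshold."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self._since = None  # time at which the condition first became true

    def update(self, failing, now):
        """Record one observation; return True if the alert should fire."""
        if not failing:
            self._since = None  # any healthy observation resets the clock
            return False
        if self._since is None:
            self._since = now
        return now - self._since > self.hold_seconds


# Readiness probe results at hypothetical timestamps (in seconds).
alert = SustainedCondition(hold_seconds=120)
print(alert.update(True, now=0))     # False: just started failing
print(alert.update(True, now=60))    # False: failing for 60 s
print(alert.update(True, now=121))   # True: failing for more than 120 s
print(alert.update(False, now=130))  # False: recovered, clock reset
print(alert.update(True, now=140))   # False: new failure window starts
```

Passing `now` in explicitly rather than reading the clock inside the class keeps the logic easy to test.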

Tracing

ScaiGrid propagates request IDs but doesn't ship with OpenTelemetry instrumentation out of the box. For distributed tracing:

  1. Set X-Request-ID on incoming requests from your frontend load balancer.
  2. ScaiGrid passes it through all downstream calls (database, Redis, upstream LLM APIs, webhook deliveries).
  3. Your logging pipeline correlates by request ID.
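
Step 1 can be sketched from the client or proxy side as a helper that guarantees every outgoing request carries an ID. The `req_` prefix mirrors the log example earlier on this page, but the ID format is an assumption here, as is the case-insensitive handling of an existing header.

```python
import uuid

REQUEST_ID_HEADER = "X-Request-ID"


def with_request_id(headers):
    """Return a copy of headers that is guaranteed to carry a request ID.

    An existing non-empty ID is preserved (header name matched
    case-insensitively); otherwise a new one is generated.
    """
    wanted = REQUEST_ID_HEADER.lower()
    existing = next(
        (value for name, value in headers.items()
         if name.lower() == wanted and value),
        None,
    )
    # Drop any differently cased or empty variants, then set one canonical header.
    out = {name: value for name, value in headers.items() if name.lower() != wanted}
    out[REQUEST_ID_HEADER] = existing or f"req_{uuid.uuid4().hex}"
    return out


print(with_request_id({"x-request-id": "req_abc"}))  # {'X-Request-ID': 'req_abc'}
print(with_request_id({"Accept": "application/json"}))  # adds a generated ID
```

Generating the ID at the outermost hop means a single value is visible in every log line for that request.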

For full OTel spans, plug in via the optional instrumentation hook. Ask your ScaiGrid support contact for the latest integration guide.

Dashboards

Import our reference Grafana dashboards:

  • ScaiGrid Overview — request rate, latency, error rate, backend health
  • ScaiGrid Per-Tenant — same metrics sliced by tenant_id
  • ScaiGrid Modules — per-module metrics for enabled modules
  • ScaiGrid Accounting — token consumption, cost, budget utilization

Dashboard JSON files are in the ScaiGrid source repository under ops/grafana/.

What to check first during an incident

  1. GET /health/ready — is the basic plumbing alive?
  2. GET /health/detailed — which specific component is unhealthy?
  3. Search the logs for recent errors: entries whose level field is "error"
  4. Check backend health: scaigrid_backend_health{backend_id=...}
  5. Check upstream provider status pages (OpenAI, Anthropic, etc.) if a specific provider is failing
  6. Check Redis and MariaDB cluster state

Updated 2026-05-18 15:01:28, rev 17