Health and Monitoring
Health checks, Prometheus metrics, log format, and what to alert on.
Health endpoints
GET /health
Basic liveness check. Returns 200 OK if the process is running and can serve requests.
No authentication required. Use for load-balancer health checks.
GET /health/ready
Readiness check. Returns 200 OK only if the process can handle traffic — database reachable, Redis reachable, essential modules loaded.
Returns 503 if any check fails. Use for Kubernetes readiness probes — traffic won't route to a not-ready pod.
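As a rough illustration of how the two endpoints above differ, the following minimal sketch probes both and maps the status codes to a coarse state. The function names, the "0 means unreachable" convention, and the base URL in the usage note are assumptions for illustration, not part of ScaiGrid.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET on `url`, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx responses; the code is still the answer we want.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def classify(live_status: int, ready_status: int) -> str:
    """Combine the /health and /health/ready results into one state."""
    if live_status != 200:
        return "down"        # process is not serving requests at all
    if ready_status != 200:
        return "not-ready"   # process is up, but a dependency check failed (e.g. 503)
    return "ready"
```

A caller would pass the two probe results, for example `classify(probe(base + "/health"), probe(base + "/health/ready"))`, where `base` is a hypothetical address such as `http://localhost:8000`.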
GET /health/detailed
Detailed status — per-module, per-dependency, with error messages. Admin-only.
The response includes each module's init status, the last heartbeat from external dependencies (ScaiInfer nodes, ScaiBunker workers), and ping latencies for Redis and MariaDB.
Useful for diagnostics during incidents.
Prometheus metrics
Endpoint: GET /metrics (no auth required by default; in production, put a firewall or basic auth in front of it)
Core metrics
Core metrics fall into six groups: request-level, backend health, accounting pipeline, budgets, session / activity, and webhooks.
Module metrics
Each module contributes its own metrics with scai{module}_* naming:
- ScaiBunker: scaibunker_bunkers_active, scaibunker_exec_total, scaibunker_placement_duration_seconds, etc.
- ScaiCore: scaicore_invocations_total, scaicore_llm_calls_total, scaicore_plugin_calls_total
- ScaiQueue: (documented in ScaiQueue's internal spec)
Full list: scrape /metrics on a running instance to see what's exposed.
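To see which metrics a given module exposes, the scraped text can be reduced to a list of distinct metric names. The sketch below assumes the endpoint returns the standard Prometheus text exposition format; the sample values, help text, and label are made up for illustration.

```python
import re

# A metric name at the start of a sample line in the Prometheus text format.
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition: str, prefix: str = "") -> list[str]:
    """Return the sorted, distinct metric names that start with `prefix`."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = _NAME.match(line)
        if match and match.group(1).startswith(prefix):
            names.add(match.group(1))
    return sorted(names)


# Illustrative scrape output; values and the `status` label are hypothetical.
SAMPLE = """\
# HELP scaibunker_bunkers_active Illustrative help text.
# TYPE scaibunker_bunkers_active gauge
scaibunker_bunkers_active 3
scaibunker_exec_total{status="ok"} 120
scaicore_invocations_total 42
"""

print(metric_names(SAMPLE, "scaibunker_"))
```

Passing an empty prefix returns every metric name in the scrape.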
Logging
ScaiGrid emits structured JSON logs; every log line is a JSON object.
Critical fields for tracing:
- request_id: correlates across middleware, handlers, dispatchers, database, accounting pipeline
- tenant_id / user_id: for per-tenant/per-user investigations
- event: the logical event name (snake_case)
Tenant admins can retrieve logs via the admin UI. For platform operators, logs flow to stdout; point them at your log aggregation stack (Loki, Datadog, Elasticsearch, CloudWatch).
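When an aggregation stack is not at hand, the request_id field can be used to pull one request's trail straight from stdout. This is a minimal sketch that assumes one JSON object per line; the sample records, IDs, and event names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def records_for_request(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield parsed log records whose request_id matches, in input order."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON lines mixed into the stream
        if isinstance(record, dict) and record.get("request_id") == request_id:
            yield record


# Hypothetical log lines; only the field names come from the documentation above.
SAMPLE = [
    '{"level": "info", "request_id": "req-1", "tenant_id": "t-1", "event": "request_received"}',
    '{"level": "error", "request_id": "req-2", "tenant_id": "t-2", "event": "backend_timeout"}',
    "not json",
    '{"level": "error", "request_id": "req-1", "tenant_id": "t-1", "event": "dispatch_failed"}',
]

for rec in records_for_request(SAMPLE, "req-1"):
    print(rec["level"], rec["event"])
```

The same generator works on `sys.stdin` when logs are piped in from a container runtime.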
Recommended alerts
P0 — Wake someone up:
- /health/ready returns non-200 for > 2 minutes
- scaigrid_backend_health == 0 for > 50% of backends
- scaigrid_accounting_flush_lag_seconds > 300 (accounting pipeline stuck)
- MariaDB cluster has fewer than a majority of nodes healthy
P1 — Investigate in business hours:
- sum(rate(scaigrid_requests_total{status=~"5.."}[5m])) / sum(rate(scaigrid_requests_total[5m])) > 0.01 (error rate above 1%)
- histogram_quantile(0.99, sum by (le) (rate(scaigrid_request_duration_seconds_bucket[5m]))) > 10 (p99 latency over 10 seconds)
- scaigrid_circuit_breaker_state == 1 for any backend (circuit open)
P2 — Keep an eye on:
- scaigrid_budget_utilization_ratio > 0.8 for any budget (approaching limits)
- rate(scaigrid_webhook_delivery_failures_total[1h]) > 0 (webhook delivery issues)
- scaigrid_event_bus_consumer_lag > 1000 (event processing backing up)
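To make the P1 error-rate threshold concrete, the following sketch works through the arithmetic on two counter samples taken at the start and end of the same window. Because both rates share the window, the ratio of rates equals the ratio of increases. The function names are illustrative, and counter resets are deliberately ignored, which Prometheus itself handles inside rate().

```python
def error_ratio(total_before: float, total_after: float,
                errors_before: float, errors_after: float) -> float:
    """Fraction of requests in the window that ended with a 5xx status."""
    total = total_after - total_before
    errors = errors_after - errors_before
    if total <= 0:
        return 0.0  # no traffic (or a counter reset, which this sketch does not handle)
    return errors / total


def p1_error_rate_fires(ratio: float, threshold: float = 0.01) -> bool:
    """Apply the 1% threshold from the P1 rule above."""
    return ratio > threshold


# Example: 2,000 requests in the window, 30 of them 5xx, so 1.5%.
ratio = error_ratio(10_000, 12_000, 50, 80)
print(ratio, p1_error_rate_fires(ratio))
```

A window with exactly 1% errors does not fire, since the rule uses a strict comparison.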
Tracing
ScaiGrid propagates request IDs but doesn't ship with OpenTelemetry instrumentation out of the box. For distributed tracing:
- Set X-Request-ID on incoming requests from your frontend load balancer.
- ScaiGrid passes it through all downstream calls (database, Redis, upstream LLM APIs, webhook deliveries).
- Your logging pipeline correlates by request ID.
For full OTel spans, plug in via the optional instrumentation hook. Ask your ScaiGrid support contact for the latest integration guide.
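The first step of the request-ID approach, setting X-Request-ID at the edge, can be sketched as a small helper that keeps an ID the caller already supplied and generates one otherwise. Using a UUID4 string as the ID format is an assumption; the documentation above does not prescribe one.

```python
import uuid

HEADER = "X-Request-ID"


def with_request_id(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of `headers` that is guaranteed to carry a request ID."""
    # Drop an empty request-ID header (any casing) so it can be replaced cleanly.
    out = {
        key: value
        for key, value in headers.items()
        if not (key.lower() == HEADER.lower() and not value)
    }
    # Header names are case-insensitive, so compare in lowercase.
    if not any(key.lower() == HEADER.lower() for key in out):
        out[HEADER] = str(uuid.uuid4())  # assumed ID format
    return out
```

Preserving an existing ID matters when a component further upstream has already assigned one, because overwriting it would break correlation with that component's logs.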
Dashboards
Import our reference Grafana dashboards:
- ScaiGrid Overview — request rate, latency, error rate, backend health
- ScaiGrid Per-Tenant — same metrics sliced by tenant_id
- ScaiGrid Modules — per-module metrics for enabled modules
- ScaiGrid Accounting — token consumption, cost, budget utilization
Dashboard JSON files are in the ScaiGrid source repository under ops/grafana/.
What to check first during an incident
- GET /health/ready: is the basic plumbing alive?
- GET /health/detailed: which specific component is unhealthy?
- Grep logs for recent ERRORs: level=error
- Check backend health: scaigrid_backend_health{backend_id=...}
- Check upstream provider status pages (OpenAI, Anthropic, etc.) if a specific provider is failing
- Check Redis and MariaDB cluster state