---
title: Health and Monitoring
path: operations/health-and-monitoring
status: published
---

# Health and Monitoring

Health checks, Prometheus metrics, log format, and what to alert on.

## Health endpoints

### GET /health

Basic liveness check. Returns `200 OK` if the process is running and can serve requests.

```bash
curl https://scaigrid.scailabs.ai/health
```

```json
{"status": "ok"}
```

No authentication required. Use for load-balancer health checks.

### GET /health/ready

Readiness check. Returns `200 OK` only if the process can handle traffic — database reachable, Redis reachable, essential modules loaded.

```json
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "modules": "ok"
  }
}
```

Returns `503` if any check fails. Use for Kubernetes readiness probes — traffic won't route to a not-ready pod.
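
When a probe fails, it helps to know which check caused it. The sketch below parses a readiness body and lists the checks that are not `"ok"`. It assumes the `503` response uses the same JSON shape as the `200` response shown above, which this page does not confirm; the `failing_checks` helper and the `"error"` value are hypothetical.

```python
import json


def failing_checks(body: str) -> list[str]:
    """Return the names of readiness checks whose value is not "ok"."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(name for name, state in checks.items() if state != "ok")


# Body copied from the documented 200 response.
ready_body = (
    '{"status": "ready", '
    '"checks": {"database": "ok", "redis": "ok", "modules": "ok"}}'
)

# Hypothetical 503 body: the shape and the "error" value are assumptions.
not_ready_body = (
    '{"status": "not_ready", '
    '"checks": {"database": "ok", "redis": "error", "modules": "ok"}}'
)

print(failing_checks(ready_body))      # []
print(failing_checks(not_ready_body))  # ['redis']
```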

### GET /health/detailed

Detailed status — per-module, per-dependency, with error messages. Admin-only.

```bash
curl https://scaigrid.scailabs.ai/health/detailed \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

The response includes every module's init status, the last heartbeat from each external dependency (ScaiInfer nodes, ScaiBunker workers), and Redis and MariaDB ping latencies.

Useful for diagnostics during incidents.

## Prometheus metrics

**Endpoint:** `GET /metrics`. No authentication is required by default; in production, put a firewall or basic auth in front of it.

### Core metrics

**Request-level:**

```
scaigrid_requests_total{model, status, protocol}              counter
scaigrid_request_duration_seconds{model, backend}             histogram
scaigrid_tokens_total{model, direction, tenant_id}            counter
scaigrid_time_to_first_token_seconds{model}                   histogram
```

**Backend health:**

```
scaigrid_backend_health{backend_id}                            gauge (1 healthy, 0 unhealthy)
scaigrid_backend_inflight_requests{backend_id}                 gauge
scaigrid_circuit_breaker_state{backend_id}                     gauge (0 closed, 1 open, 2 half-open)
```

**Accounting pipeline:**

```
scaigrid_accounting_flush_lag_seconds                          gauge
scaigrid_event_bus_consumer_lag                                gauge
scaigrid_redis_stream_length{stream_name}                      gauge
```

**Budgets:**

```
scaigrid_budget_utilization_ratio{scope, scope_id}             gauge (0.0 and up; above 1.0 means over budget)
```

**Session / activity:**

```
scaigrid_active_sessions                                       gauge
scaigrid_active_cores                                          gauge
scaigrid_checkpoint_pending_count                              gauge
```

**Webhooks:**

```
scaigrid_webhook_delivery_failures_total{webhook_id, event_type}  counter
```

### Module metrics

Each module contributes its own metrics with `scai{module}_*` naming:

- ScaiBunker: `scaibunker_bunkers_active`, `scaibunker_exec_total`, `scaibunker_placement_duration_seconds`, etc.
- ScaiCore: `scaicore_invocations_total`, `scaicore_llm_calls_total`, `scaicore_plugin_calls_total`
- ScaiQueue: (documented in ScaiQueue's internal spec)

Full list: scrape `/metrics` on a running instance to see what's exposed.
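
For a quick look at a scrape without a Prometheus server, the text output can be parsed directly. The following is a minimal sketch of a parser for simple lines in the Prometheus text exposition format. It is not a full implementation of the format; the sample text and the backend IDs are made up for illustration, and `parse_samples` is a hypothetical helper.

```python
import re

# name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# key="value" pairs; tolerates backslash escapes inside the value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition text into (metric name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this sketch does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Made-up scrape output using metric names from this page.
scrape = """\
# TYPE scaigrid_backend_health gauge
scaigrid_backend_health{backend_id="backend-a"} 1
scaigrid_backend_health{backend_id="backend-b"} 0
scaigrid_active_sessions 12
"""

unhealthy = [
    labels["backend_id"]
    for name, labels, value in parse_samples(scrape)
    if name == "scaigrid_backend_health" and value == 0
]
print(unhealthy)  # ['backend-b']
```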

## Logging

ScaiGrid emits structured JSON logs. Every log line has:

```json
{
  "timestamp": "2026-04-22T14:30:01.234Z",
  "level": "info",
  "logger": "app.services.inference",
  "event": "chat_completion",
  "request_id": "req_abc",
  "tenant_id": "tenant_acme",
  "user_id": "user_alice",
  "model": "scailabs/poolnoodle-omni",
  "latency_ms": 842,
  ...
}
```

**Critical fields for tracing:**

- `request_id` — correlates across middleware, handlers, dispatchers, database, accounting pipeline
- `tenant_id` / `user_id` — for per-tenant/per-user investigations
- `event` — the logical event name (snake_case)

Tenant admins can retrieve logs via the admin UI. For platform operators, logs flow to stdout; point them at your log aggregation stack (Loki, Datadog, Elasticsearch, CloudWatch).
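
Because every line is a JSON object, following one request through the logs is a matter of filtering on `request_id`. A minimal sketch, assuming one JSON object per line on stdout; the sample lines and the `backend_selected` and `accounting_flush` event names are invented for illustration.

```python
import json


def lines_for_request(log_lines, request_id):
    """Return parsed log records for one request, ordered by timestamp."""
    matched = []
    for raw in log_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore non-JSON output such as stack traces
        if isinstance(record, dict) and record.get("request_id") == request_id:
            matched.append(record)
    # ISO 8601 UTC timestamps in one format sort correctly as strings.
    return sorted(matched, key=lambda r: r.get("timestamp", ""))


# Invented sample lines; only the field names come from this page.
sample_logs = [
    '{"timestamp": "2026-04-22T14:30:01.234Z", "level": "info", '
    '"event": "chat_completion", "request_id": "req_abc"}',
    'not a json line',
    '{"timestamp": "2026-04-22T14:30:00.900Z", "level": "info", '
    '"event": "backend_selected", "request_id": "req_abc"}',
    '{"timestamp": "2026-04-22T14:30:01.000Z", "level": "error", '
    '"event": "accounting_flush", "request_id": "req_xyz"}',
]

for record in lines_for_request(sample_logs, "req_abc"):
    print(record["timestamp"], record["event"])
```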

## Recommended alerts

**P0 — Wake someone up:**

- `/health/ready` returns non-200 for > 2 minutes
- `scaigrid_backend_health == 0` for > 50% of backends
- `scaigrid_accounting_flush_lag_seconds > 300` — accounting pipeline stuck
- Fewer than a majority of MariaDB cluster nodes are healthy

**P1 — Investigate in business hours:**

- `sum(rate(scaigrid_requests_total{status=~"5.."}[5m])) / sum(rate(scaigrid_requests_total[5m])) > 0.01` — > 1% error rate
- `histogram_quantile(0.99, sum by (le) (rate(scaigrid_request_duration_seconds_bucket[5m]))) > 10` — p99 latency over 10 seconds
- `scaigrid_circuit_breaker_state == 1` for any backend — circuit open

**P2 — Keep an eye on:**

- `scaigrid_budget_utilization_ratio > 0.8` for any budget — approaching limits
- `rate(scaigrid_webhook_delivery_failures_total[1h]) > 0` — webhook delivery issues
- `scaigrid_event_bus_consumer_lag > 1000` — event processing backing up
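
The P1 error-rate condition compares the growth of 5xx requests with the growth of all requests over a window. The sketch below shows that arithmetic on two scrapes of `scaigrid_requests_total`, summed by `status`. The counter values are invented, and `error_ratio` is a hypothetical helper rather than part of ScaiGrid.

```python
def error_ratio(before: dict[str, float], after: dict[str, float]) -> float:
    """Share of requests with a 5xx status between two counter snapshots.

    Both arguments map an HTTP status code to the counter value for that
    status, already summed across the other labels.
    """
    total = 0.0
    errors = 0.0
    for status, end_value in after.items():
        delta = end_value - before.get(status, 0.0)
        if delta < 0:
            # Counter went backwards: treat it as a restart from zero.
            delta = end_value
        total += delta
        if len(status) == 3 and status.startswith("5"):
            errors += delta
    return errors / total if total > 0 else 0.0


# Invented counter values from two scrapes five minutes apart.
before = {"200": 1000.0, "429": 40.0, "500": 5.0}
after = {"200": 1970.0, "429": 50.0, "500": 25.0}

ratio = error_ratio(before, after)
print(ratio > 0.01)  # True: 20 of 1000 new requests failed with 5xx
```

Summing before dividing matters: dividing per-series values would compare each 5xx series only with itself.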

## Tracing

ScaiGrid propagates request IDs but doesn't ship with OpenTelemetry instrumentation out of the box. For distributed tracing:

1. Set `X-Request-ID` on incoming requests from your frontend load balancer.
2. ScaiGrid passes it through all downstream calls (database, Redis, upstream LLM APIs, webhook deliveries).
3. Your logging pipeline correlates by request ID.
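
If no load balancer sets the header, a client can set it itself. A minimal sketch using only the Python standard library: it keeps an existing `X-Request-ID` and generates one otherwise. The `req_` prefix mirrors the log example on this page, but the ID format is an assumption, and both helper functions are hypothetical.

```python
import urllib.request
import uuid


def with_request_id(headers=None, request_id=None):
    """Return a copy of headers that is guaranteed to carry X-Request-ID."""
    merged = dict(headers or {})
    # An ID that is already present wins, so upstream IDs are preserved.
    merged.setdefault("X-Request-ID", request_id or f"req_{uuid.uuid4().hex}")
    return merged


def build_health_request(base_url, request_id=None):
    """Build (but do not send) a GET /health request carrying the ID."""
    headers = with_request_id(request_id=request_id)
    return urllib.request.Request(f"{base_url}/health", headers=headers)


request = build_health_request("https://scaigrid.scailabs.ai", "req_abc")
print(request.full_url)  # https://scaigrid.scailabs.ai/health
# To send it: urllib.request.urlopen(request, timeout=5)
```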

For full OTel spans, plug in via the optional instrumentation hook. Ask your ScaiGrid support contact for the latest integration guide.

## Dashboards

Import our reference Grafana dashboards:

- **ScaiGrid Overview** — request rate, latency, error rate, backend health
- **ScaiGrid Per-Tenant** — same metrics sliced by `tenant_id`
- **ScaiGrid Modules** — per-module metrics for enabled modules
- **ScaiGrid Accounting** — token consumption, cost, budget utilization

Dashboard JSON files are in the ScaiGrid source repository under `ops/grafana/`.

## What to check first during an incident

1. `GET /health/ready` — is the basic plumbing alive?
2. `GET /health/detailed` — which specific component is unhealthy?
3. Search the logs for recent errors: filter on `"level": "error"`
4. Check backend health: `scaigrid_backend_health{backend_id=...}`
5. Check upstream provider status pages (OpenAI, Anthropic, etc.) if a specific provider is failing
6. Check Redis and MariaDB cluster state

## Related

- [Deployment](./01-deployment.md)
- [Troubleshooting](./03-troubleshooting.md)
- [Errors](../03-core-concepts/07-errors.md)
