---
title: Health and Monitoring
path: operations/health-and-monitoring
status: published
---

# Health and Monitoring

Endpoints and signals for keeping ScaiDNS healthy in production.

## Health endpoints

### GET /health/live

Liveness check. Returns `200` as long as the API process is running. Use for container orchestrator `livenessProbe` configuration.

**Response:**

```json
{"status": "ok"}
```

Public (no auth required).

### GET /health/ready

Readiness check. Returns `200` only when ScaiDNS can serve traffic — database, Redis, PowerDNS reachable; JWKS cache populated.

**Response (healthy):**

```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "pdns": "ok",
    "scaikey_jwks": "ok"
  }
}
```

**Response (degraded):** `503` status code, with per-check failures:

```json
{
  "status": "degraded",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "pdns": "connection refused",
    "scaikey_jwks": "ok"
  }
}
```

Use for `readinessProbe` and for load-balancer health checks.
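
For scripts that run outside an orchestrator (a cron check, a deploy gate), the sketch below shows one way to read this endpoint and list the checks that are not `ok`. The helper names are hypothetical and not part of ScaiDNS; the response shape is taken from the examples above.

```python
import json
import urllib.error
import urllib.request


def failing_checks(body: dict) -> dict:
    """Return only the checks whose value is not "ok"."""
    checks = body.get("checks", {})
    return {name: value for name, value in checks.items() if value != "ok"}


def fetch_readiness(base_url: str, timeout: float = 2.0) -> tuple[int, dict]:
    """GET /health/ready and return (status_code, parsed_body).

    urllib raises HTTPError for a 503, so the degraded body is read
    from the exception object instead of the normal response.
    """
    url = f"{base_url}/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        return err.code, json.load(err)
```

A caller might run `status, body = fetch_readiness("http://localhost:8000")` (the address is an assumption) and treat any non-empty `failing_checks(body)` as a reason to hold traffic.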

## Metrics

If `METRICS_ENABLED=true`, Prometheus metrics are served at `/metrics` (public by default — restrict in your reverse proxy).

Key metrics exposed:

| Metric | Type | Notes |
|--------|------|-------|
| `scaidns_http_requests_total` | counter | Labeled by method, path, status |
| `scaidns_http_request_duration_seconds` | histogram | Request latency |
| `scaidns_db_queries_total` | counter | Database query count |
| `scaidns_db_query_duration_seconds` | histogram | Database query latency |
| `scaidns_pdns_requests_total` | counter | PowerDNS API requests |
| `scaidns_pdns_request_duration_seconds` | histogram | PowerDNS API latency |
| `scaidns_validation_checks_total` | counter | DNS validation attempts; labeled by result |
| `scaidns_api_key_usage_total` | counter | API key authentications; labeled by key_id |
| `scaidns_webhook_events_total` | counter | Incoming webhook events; labeled by event_type, status |
| `scaidns_worker_jobs_total` | counter | Background worker job executions |
| `scaidns_worker_jobs_duration_seconds` | histogram | Job latency |

## Logs

The application logs to stdout. Set `LOG_FORMAT=json` for structured logs: each line is a self-contained JSON object suitable for ingestion by ELK, Loki, or Datadog.

Log levels:

- `DEBUG` — verbose, internal state
- `INFO` — request lifecycle, webhook events, major actions
- `WARNING` — recoverable issues (validation skipped, webhook signature missing)
- `ERROR` — unhandled exceptions, dependency failures

Use `INFO` in production. Reserve `DEBUG` for development or troubleshooting, because it is noisy.

### Request logging

Every request is logged with:

- `method`, `path`, `status_code`, `duration_ms`
- `user_id`, `tenant_id`, `api_key_id` (whichever applies)
- `ip`, `user_agent`
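
With `LOG_FORMAT=json`, these fields can be filtered with a few lines of code. The sketch below assumes the fields listed above appear as top-level keys of each JSON log line, which this page does not state explicitly; the function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def slow_requests(lines: Iterable[str], threshold_ms: float = 1000.0) -> Iterator[dict]:
    """Yield request-log entries whose duration_ms exceeds threshold_ms."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line, skip it
        if not isinstance(entry, dict):
            continue
        duration = entry.get("duration_ms")
        if isinstance(duration, (int, float)) and duration > threshold_ms:
            yield entry
```

Passing an open file or `sys.stdin` as `lines` lets this run as a filter behind a log tail, for example to list the paths behind a latency alert.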

## Tracing

Optional. If `TRACING_ENABLED=true`, OpenTelemetry traces export to the configured collector. Set:

```
TRACING_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_SERVICE_NAME=scaidns
```

Traces include HTTP handlers, database queries, Redis operations, and PowerDNS API calls.

## Dashboards

Things worth alerting on:

### Availability

- **`scaidns_http_requests_total{status=~"5.."}` rate** above a small threshold tuned to your normal traffic. 5xx responses indicate server-side errors.
- **`/health/ready` returning non-200** for more than a minute.
- **PowerDNS unreachable** (`scaidns_pdns_requests_total{status=~"5..|error"}`) — data plane partial outage.

### Latency

- **P99 API latency** > 1 second. A common cause is slow database queries.
- **P99 PowerDNS latency** > 500ms.
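
The P99 values above come from the histogram metrics, which expose cumulative bucket counts rather than exact percentiles. The sketch below shows how a quantile can be estimated from such buckets by linear interpolation, roughly the approach PromQL's `histogram_quantile` takes. The function name and the bucket values in the usage note are made up for illustration.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The bucket with upper bound +Inf must be included; its count is the
    total number of observations.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the highest finite bucket.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (2.5, 100), (math.inf, 100)]`, the 99th of 100 observations falls in the 1.0 to 2.5 second bucket, so the estimate lands above the 1 second threshold. The precision of the estimate depends on how finely the buckets are spaced around the threshold.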

### Authentication

- **API key usage dropping to zero** for a key that's normally active — possible credential rotation failure.
- **`scaidns_webhook_events_total{status="error"}` rate** > 0 — ScaiKey events failing to process.

### Workers

- **Background job queue backlog** — via arq's metrics endpoint if enabled, or via Redis directly.
- **Failed jobs** — retries exhausted.
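
For the "via Redis directly" option, a minimal sketch follows. It assumes arq's default queue key `arq:queue`, stored as a sorted set scored by the time in milliseconds at which each job becomes due; verify both against your arq version and any custom `queue_name` before relying on it. The function name is hypothetical, and `client` is any object with redis-py style `zcard` and `zcount` methods.

```python
import time


def queue_backlog(client, queue_name: str = "arq:queue") -> dict:
    """Count jobs in the arq queue: all queued jobs, and those already due.

    Assumes the queue is a Redis sorted set whose scores are due times
    in Unix milliseconds (an assumption about arq's storage layout).
    """
    now_ms = int(time.time() * 1000)
    return {
        "queued": client.zcard(queue_name),
        "due": client.zcount(queue_name, "-inf", now_ms),
    }
```

With redis-py installed, `queue_backlog(redis.Redis(host="localhost"))` would return both counts; the host is a placeholder. A `due` count that keeps growing suggests workers are not keeping up, while a large `queued` count with a small `due` count may only reflect deferred jobs.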

## Database

MariaDB query patterns:

- Hot path: `users` lookups by ID or email. Index exists.
- Hot path: `domains` lookups by tenant + name. Composite index.
- Hot path: `records` lookups by domain. Foreign key with index.
- Audit log writes are batched; spikes are normal.

If you see slow queries, check:

- Missing indexes on custom-added columns.
- Connection pool exhaustion — increase `DATABASE_POOL_SIZE`.
- Lock contention during bulk operations.

## Redis

Used for:

- **JWKS cache.** Refreshed every `SCAIKEY_JWKS_CACHE_TTL` seconds. If Redis is unreachable, JWKS is fetched on every JWT validation — slow but not broken.
- **Token introspection cache.** Short-lived (`SCAIKEY_TOKEN_CACHE_TTL` seconds).
- **Rate limiting.** Per-user and per-key counters.
- **Worker queue.** arq uses Redis as its job store.

Monitor Redis memory pressure; evictions will cause cache misses (not data loss — caches regenerate).

## PowerDNS

ScaiDNS calls PowerDNS's HTTP API on every zone/record/DNSSEC mutation. Watch:

- **PowerDNS API latency.** Should be consistently under 100ms. Slower points to PowerDNS backend issues.
- **Sync status** (`GET /api/v1/admin/sync-status`): zones with a `failed` status or a stale `last_synced_at`.

## Common failure patterns

### All validation checks failing

Symptom: new domains stuck in `pending_validation`.

Check:

- Is outbound DNS working from the API host? ScaiDNS queries public resolvers.
- Is the challenge actually published at the user's current DNS provider?
- For self-hosted resolvers, is the resolver caching a stale NXDOMAIN response?

### Webhooks not arriving

Symptom: users added in ScaiKey don't show up in ScaiDNS.

Check:

- ScaiKey's webhook delivery log — any failures?
- `SCAIKEY_WEBHOOK_SECRET` matches on both sides.
- Webhook URL in ScaiKey points to your external URL.
- Users are assigned to the ScaiDNS application in ScaiKey (otherwise events aren't sent).

### PowerDNS sync drift

Symptom: `sync-status` shows zones with `failed`.

Check:

- PowerDNS logs for the failed zone — often a malformed record.
- PowerDNS API key in `.env` is correct.
- Network reachability between ScaiDNS and PowerDNS.

### High API latency

Symptom: P99 > 1s, affecting all endpoints.

Check in order:

1. Database query duration metric — is one query slow?
2. PowerDNS API latency — is the downstream slow?
3. Redis latency: is the rate-limit check slow?
4. Garbage collection: is Python's garbage collector causing micro-pauses at scale?

## Backups

Back up these daily:

- **MariaDB database.** Full dumps. Everything ScaiDNS knows lives here.
- **PowerDNS database.** Zone data lives here; ScaiDNS's audit log references it.
- **Configuration files.** `.env` — secrets are the hard part.

Retention: 30 days is typical. Longer for audit compliance.

Recovery: restore both databases to a consistent point, then restart services. `scaidns sync` may be needed to catch up on webhook events missed during the outage.

## What's next

- [Deployment](./deployment.md) — first-time setup.
- [Audit Log](../reference/audit-log.md) — action history.
