Health and Monitoring

Endpoints and signals for keeping ScaiDNS healthy in production.

Health endpoints

GET /health/live

Liveness check. Returns 200 as long as the API process is running. Use for container orchestrator livenessProbe configuration.

Response:

json
{"status": "ok"}

Public (no auth required).

GET /health/ready

Readiness check. Returns 200 only when ScaiDNS can serve traffic: the database, Redis, and PowerDNS are reachable, and the JWKS cache is populated.

Response (healthy):

json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "pdns": "ok",
    "scaikey_jwks": "ok"
  }
}

Response (degraded): 503 status code, with per-check failures:

json
{
  "status": "degraded",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "pdns": "connection refused",
    "scaikey_jwks": "ok"
  }
}

Use for readinessProbe and for load-balancer health checks.
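
For ad hoc checks outside an orchestrator, the degraded response shown above can be reduced to just the failing dependencies. The following is a minimal sketch using only the Python standard library; the function names `failing_checks` and `probe_ready` are illustrative and not part of ScaiDNS.

```python
import json
import urllib.error
import urllib.request
from typing import Dict, Tuple


def failing_checks(body: dict) -> Dict[str, str]:
    """Return the checks whose value is anything other than "ok"."""
    checks = body.get("checks", {})
    return {name: state for name, state in checks.items() if state != "ok"}


def probe_ready(base_url: str, timeout: float = 2.0) -> Tuple[int, Dict[str, str]]:
    """Call GET /health/ready and return (status_code, failing_checks)."""
    url = base_url.rstrip("/") + "/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on a 503, but the JSON body is still readable.
        status, raw = err.code, err.read()
    return status, failing_checks(json.loads(raw))
```

Calling `probe_ready` with the base URL of your deployment returns the HTTP status together with a dictionary of unhealthy checks, which is empty when everything is ok. Connection-level failures (for example, the API process being down) are not caught here and surface as `urllib.error.URLError`.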

Metrics

If METRICS_ENABLED=true, Prometheus metrics are served at /metrics. The endpoint is public by default, so restrict access to it in your reverse proxy.

Key metrics exposed:

| Metric | Type | Notes |
| --- | --- | --- |
| `scaidns_http_requests_total` | counter | Labeled by method, path, status |
| `scaidns_http_request_duration_seconds` | histogram | Request latency |
| `scaidns_db_queries_total` | counter | Database query count |
| `scaidns_db_query_duration_seconds` | histogram | Database query latency |
| `scaidns_pdns_requests_total` | counter | PowerDNS API requests |
| `scaidns_pdns_request_duration_seconds` | histogram | PowerDNS API latency |
| `scaidns_validation_checks_total` | counter | DNS validation attempts; labeled by result |
| `scaidns_api_key_usage_total` | counter | API key authentications; labeled by key_id |
| `scaidns_webhook_events_total` | counter | Incoming webhook events; labeled by event_type, status |
| `scaidns_worker_jobs_total` | counter | Background worker job executions |
| `scaidns_worker_jobs_duration_seconds` | histogram | Job latency |

Logs

The application logs to stdout. Set LOG_FORMAT=json for structured logs: each line is a self-contained JSON object suitable for ingestion by ELK, Loki, or Datadog.

Log levels:

DEBUG — verbose, internal state
INFO — request lifecycle, webhook events, major actions
WARNING — recoverable issues (validation skipped, webhook signature missing)
ERROR — unhandled exceptions, dependency failures

Use INFO in production. Use DEBUG only in development or while troubleshooting, because it is noisy.

Request logging

Every request is logged with:

method, path, status_code, duration_ms
user_id, tenant_id, api_key_id (whichever applies)
ip, user_agent
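
With LOG_FORMAT=json, these fields make it straightforward to pull slow or failing requests out of a log stream. The sketch below assumes the JSON keys are spelled exactly as listed above (`status_code`, `duration_ms`) and that both are numeric; `slow_or_failed` is an illustrative helper, not a ScaiDNS tool.

```python
import json
from typing import Iterable, Iterator


def slow_or_failed(lines: Iterable[str], max_ms: float = 1000.0) -> Iterator[dict]:
    """Yield request log entries that returned 5xx or took longer than max_ms."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict) or "status_code" not in entry:
            continue  # not a request log entry
        if entry["status_code"] >= 500 or entry.get("duration_ms", 0) > max_ms:
            yield entry
```

Passing `sys.stdin` as `lines` lets the helper sit at the end of a pipe from your log collector; each yielded entry still carries `method`, `path`, and the identity fields for follow-up.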

Tracing

Tracing is optional. If TRACING_ENABLED=true, OpenTelemetry traces are exported to the configured collector. Set:

TRACING_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_SERVICE_NAME=scaidns

Traces include HTTP handlers, database queries, Redis operations, and PowerDNS API calls.

Dashboards

Things worth alerting on:

Availability

Rate of scaidns_http_requests_total{status=~"5.."} above a small threshold. 5xx responses indicate server-side errors rather than client mistakes.
/health/ready returning non-200 for more than a minute.
PowerDNS unreachable (scaidns_pdns_requests_total{status=~"5..|error"}) — data plane partial outage.
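
The first and third alerts are rates over counters. Prometheus normally computes these for you, but the arithmetic is worth seeing once. This is a minimal sketch of a per-second rate between two scrapes of a counter, including the reset that happens when the process restarts; the function names are illustrative, and the 1% threshold in the usage note is an assumed value, not a ScaiDNS recommendation.

```python
def counter_rate(prev: float, curr: float, seconds: float) -> float:
    """Per-second increase between two samples of a monotonic counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # A drop means the process restarted and the counter began again at zero,
    # so the whole current value is the increase since the reset.
    increase = curr - prev if curr >= prev else curr
    return increase / seconds


def error_ratio(errors_per_s: float, total_per_s: float) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return errors_per_s / total_per_s if total_per_s > 0 else 0.0
```

For example, if the 5xx counter rose by 60 and the total counter by 3000 over the same 60-second window, `error_ratio(counter_rate(0, 60, 60), counter_rate(0, 3000, 60))` is 0.02, which would trip an alert set at 1%.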

Latency

P99 API latency > 1 second. A common cause is slow database queries.
P99 PowerDNS latency > 500ms.
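
Both thresholds are percentiles estimated from the histogram metrics, not exact measurements. The sketch below shows how a quantile can be estimated from cumulative buckets by linear interpolation, which is broadly the approach PromQL's `histogram_quantile` takes. The bucket boundaries in the usage note are made up for illustration; the source does not state which buckets ScaiDNS uses.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Past the last finite bucket (or an empty one): report the
                # highest bound we can still vouch for.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            width = bound - prev_bound
            return prev_bound + width * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]`, the median lands at the 0.1 s boundary and the 75th percentile is interpolated to about 0.35 s. The estimate can never be finer than the bucket layout, so a P99 alert at 1 second is only meaningful if a bucket boundary sits at or near 1 second.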

Authentication

API key usage dropping to zero for a key that's normally active — possible credential rotation failure.
scaidns_webhook_events_total{status="error"} rate > 0 — ScaiKey events failing to process.

Workers

Background job queue backlog — via arq's metrics endpoint if enabled, or via Redis directly.
Failed jobs — retries exhausted.
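
Reading the backlog from Redis directly can be sketched as follows. This assumes ScaiDNS uses arq's default queue key, `arq:queue`, which arq keeps as a sorted set scored by each job's scheduled run time in milliseconds; if your deployment configures a different queue name, pass it explicitly. `queue_backlog` is an illustrative helper that accepts any client exposing redis-py style `zcard` and `zcount` methods.

```python
import time


def queue_backlog(client, queue_name: str = "arq:queue") -> dict:
    """Count queued jobs, and the subset that is already due to run."""
    now_ms = int(time.time() * 1000)
    return {
        # Every job in the queue, including ones deferred into the future.
        "total": int(client.zcard(queue_name)),
        # Jobs whose scheduled time has passed but have not been picked up.
        "due": int(client.zcount(queue_name, "-inf", now_ms)),
    }
```

With redis-py installed, `queue_backlog(redis.Redis.from_url(url))` returns both counts. A steadily growing `due` value is the more useful alert signal, since deferred jobs inflate `total` without indicating that workers are falling behind.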

Database

MariaDB query patterns:

Hot path: users lookups by ID or email. Index exists.
Hot path: domains lookups by tenant + name. Composite index.
Hot path: records lookups by domain. Foreign key with index.
Audit log writes are batched; spikes are normal.

If you see slow queries, check:

Missing indexes on custom-added columns.
Connection pool exhaustion — increase DATABASE_POOL_SIZE.
Lock contention during bulk operations.

Redis

Used for:

JWKS cache. Refreshed every SCAIKEY_JWKS_CACHE_TTL seconds. If Redis is unreachable, JWKS is fetched on every JWT validation — slow but not broken.
Token introspection cache. Short-lived (SCAIKEY_TOKEN_CACHE_TTL seconds).
Rate limiting. Per-user and per-key counters.
Worker queue. arq uses Redis as its job store.

Monitor Redis memory pressure; evictions will cause cache misses (not data loss — caches regenerate).
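
Memory pressure and evictions are both visible in the output of the Redis INFO command. The sketch below summarizes the relevant fields (`used_memory`, `maxmemory`, `evicted_keys`) from an INFO dictionary such as the one redis-py's `info()` returns; `redis_pressure` is an illustrative helper.

```python
from typing import Optional


def redis_pressure(info: dict) -> dict:
    """Summarize memory pressure from a Redis INFO dictionary."""
    used = int(info.get("used_memory", 0))
    limit = int(info.get("maxmemory", 0))  # 0 means no limit is configured
    ratio: Optional[float] = used / limit if limit > 0 else None
    return {
        "used_ratio": ratio,
        # Cumulative since server start, so alert on growth, not on the value.
        "evicted_keys": int(info.get("evicted_keys", 0)),
    }
```

With redis-py installed, `redis_pressure(redis.Redis.from_url(url).info())` gives a quick reading. When `maxmemory` is unset, `used_ratio` is `None` because Redis is not enforcing a limit and will not evict.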

PowerDNS

ScaiDNS calls PowerDNS's HTTP API on every zone/record/DNSSEC mutation. Watch:

PowerDNS API latency. Should be consistently under 100ms. Slower points to PowerDNS backend issues.
Sync status (GET /api/v1/admin/sync-status): look for zones marked failed or with a stale last_synced_at.

Common failure patterns

All validation checks failing

Symptom: new domains stuck in pending_validation.

Check:

Is outbound DNS working from the API host? ScaiDNS queries public resolvers.
Is the challenge actually published at the user's current DNS provider?
For self-hosted resolvers, is a stale NXDOMAIN response being served from cache?

Webhooks not arriving

Symptom: users added in ScaiKey don't show up in ScaiDNS.

Check:

Does ScaiKey's webhook delivery log show any failures?
Does SCAIKEY_WEBHOOK_SECRET match on both sides?
Does the webhook URL in ScaiKey point to your external URL?
Are the users assigned to the ScaiDNS application in ScaiKey? Otherwise events aren't sent.
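
A mismatched secret usually shows up as signature failures, because the two sides derive different signatures from the same payload. The sketch below illustrates this with HMAC-SHA256 over the raw request body. That scheme is an assumption made for illustration only: this page does not specify how ScaiKey signs webhooks, so check the ScaiKey documentation for the real algorithm and header name.

```python
import hashlib
import hmac


def sign(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed scheme, see note above)."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()


def signature_matches(secret: str, body: bytes, received: str) -> bool:
    """Compare in constant time to avoid leaking how many characters matched."""
    expected = sign(secret, body).encode()
    return hmac.compare_digest(expected, received.encode())
```

Signing the same body with two different secrets yields signatures that never match, and even a trailing newline or stray whitespace in the configured value counts as a different secret.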

PowerDNS sync drift

Symptom: sync-status shows zones with a failed status.

Check:

PowerDNS logs for the failed zone — often a malformed record.
PowerDNS API key in .env is correct.
Network reachability between ScaiDNS and PowerDNS.

High API latency

Symptom: P99 > 1s, affecting all endpoints.

Check in order:

Database query duration metric: is one query slow?
PowerDNS API latency: is the downstream slow?
Redis latency: is the rate-limit check slow?
Runtime pauses: Python has no JVM-style stop-the-world pauses, but its garbage collection can still cause micro-pauses at scale.

Backups

Back up these daily:

MariaDB database. Full dumps. Everything ScaiDNS knows lives here.
PowerDNS database. Zone data lives here; ScaiDNS's audit log references it.
Configuration files. .env — secrets are the hard part.

Retention: 30 days is typical. Longer for audit compliance.

Recovery: restore both databases to a consistent point in time, then restart services. Running scaidns sync afterwards may be needed to catch up on webhook events missed during the outage.

What's next

Deployment — first-time setup.
Audit Log — action history.