Health and Monitoring
Endpoints and signals for keeping ScaiDNS healthy in production.
Health endpoints#
GET /health/live#
Liveness check. Returns 200 as long as the API process is running. Use for container orchestrator livenessProbe configuration.
Response: 200 status code.
Public (no auth required).
GET /health/ready#
Readiness check. Returns 200 only when ScaiDNS can serve traffic — database, Redis, PowerDNS reachable; JWKS cache populated.
Response (healthy): 200 status code.
Response (degraded): 503 status code, with per-check failures.
Use for readinessProbe and for load-balancer health checks.
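For checks that run outside an orchestrator (a deploy script, for example), the readiness endpoint can be polled directly. The following is a minimal sketch that relies only on the status codes described above; the function names, base URL handling, and timeout are assumptions, and the response body is deliberately not parsed.

```python
import urllib.error
import urllib.request


def classify_ready(status_code: int) -> str:
    """Map a /health/ready status code to a coarse state."""
    if status_code == 200:
        return "ready"
    if status_code == 503:
        return "degraded"
    return "unknown"


def check_ready(base_url: str, timeout: float = 2.0) -> str:
    """Probe GET /health/ready; returns ready, degraded, unknown, or unreachable."""
    url = base_url.rstrip("/") + "/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_ready(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises for non-2xx responses; the status code is on the exception.
        return classify_ready(exc.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return "unreachable"
```

A deploy script might call `check_ready(...)` in a loop and proceed only once it returns `"ready"`.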
Metrics#
If METRICS_ENABLED=true, Prometheus metrics are served at /metrics (public by default — restrict in your reverse proxy).
Key metrics exposed:
| Metric | Type | Notes |
|---|---|---|
| scaidns_http_requests_total | counter | Labeled by method, path, status |
| scaidns_http_request_duration_seconds | histogram | Request latency |
| scaidns_db_queries_total | counter | Database query count |
| scaidns_db_query_duration_seconds | histogram | Database query latency |
| scaidns_pdns_requests_total | counter | PowerDNS API requests |
| scaidns_pdns_request_duration_seconds | histogram | PowerDNS API latency |
| scaidns_validation_checks_total | counter | DNS validation attempts; labeled by result |
| scaidns_api_key_usage_total | counter | API key authentications; labeled by key_id |
| scaidns_webhook_events_total | counter | Incoming webhook events; labeled by event_type, status |
| scaidns_worker_jobs_total | counter | Background worker job executions |
| scaidns_worker_jobs_duration_seconds | histogram | Job latency |
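To inspect these metrics without a Prometheus server, the text served at /metrics can be parsed directly. The sketch below reads the Prometheus text exposition format and sums scaidns_http_requests_total by status class. The sample text, its paths, and its values are invented for illustration, and the parser is simplified (it does not handle escaped quotes inside label values).

```python
import re
from collections import defaultdict

# name, optional {labels}, value, optional timestamp
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) per sample line; skips comments and blanks."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def requests_by_status_class(text):
    """Sum scaidns_http_requests_total into 2xx, 4xx, 5xx, ... buckets."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "scaidns_http_requests_total" and labels.get("status"):
            totals[labels["status"][0] + "xx"] += value
    return dict(totals)


# Hypothetical scrape output; real paths and counts will differ.
SAMPLE = '''\
# TYPE scaidns_http_requests_total counter
scaidns_http_requests_total{method="GET",path="/health/live",status="200"} 120
scaidns_http_requests_total{method="POST",path="/example",status="201"} 30
scaidns_http_requests_total{method="GET",path="/example",status="503"} 2
'''

print(requests_by_status_class(SAMPLE))
```

Counters are cumulative since process start, so a single scrape gives totals, not rates.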
Logs#
The application logs to stdout. Set LOG_FORMAT=json for structured logs: each line is a self-contained JSON object suitable for ingestion by ELK, Loki, or Datadog.
Log levels:
- DEBUG: verbose, internal state
- INFO: request lifecycle, webhook events, major actions
- WARNING: recoverable issues (validation skipped, webhook signature missing)
- ERROR: unhandled exceptions, dependency failures
Use INFO in production. Use DEBUG only in development or when troubleshooting, since it is noisy.
Request logging#
Every request is logged with:
- method, path, status_code, duration_ms
- user_id, tenant_id, api_key_id (whichever applies)
- ip, user_agent
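With LOG_FORMAT=json, these fields make ad hoc analysis straightforward. The sketch below filters request log lines for slow requests. It assumes the fields listed above appear as top-level keys of each JSON object, which this page does not confirm; the 1000 ms threshold and the sample records are illustrative.

```python
import json


def slow_requests(lines, threshold_ms=1000.0):
    """Return parsed log records whose duration_ms exceeds threshold_ms."""
    slow = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON line (for example, a startup banner)
        if not isinstance(record, dict):
            continue
        duration = record.get("duration_ms")
        if isinstance(duration, (int, float)) and duration > threshold_ms:
            slow.append(record)
    return slow


# Hypothetical log lines for illustration.
EXAMPLE_LINES = [
    '{"method": "GET", "path": "/health/live", "status_code": 200, "duration_ms": 3}',
    '{"method": "POST", "path": "/example", "status_code": 201, "duration_ms": 1450.5}',
    "not json",
]

for record in slow_requests(EXAMPLE_LINES):
    print(record["method"], record["path"], record["duration_ms"])
```

The same function accepts any iterable of lines, such as `sys.stdin` when piping container logs into a script.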
Tracing#
Optional. If TRACING_ENABLED=true, OpenTelemetry traces are exported to the collector configured in your environment.
Traces include HTTP handlers, database queries, Redis operations, and PowerDNS API calls.
Dashboards#
Things worth alerting on:
Availability#
- scaidns_http_requests_total{status=~"5.."} rate > a small threshold. 5xx responses indicate real errors.
- /health/ready returning non-200 for more than a minute.
- PowerDNS unreachable (scaidns_pdns_requests_total{status=~"5..|error"}): data plane partial outage.
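A "rate" here is the per-second increase of a counter between two scrapes. The sketch below shows that arithmetic for the 5xx counter, including the reset case when the process restarts. The function names and the 0.1 errors-per-second threshold are assumptions; choose a threshold that fits your traffic.

```python
def per_second_rate(previous, current, elapsed_seconds):
    """Per-second increase of a counter between two scrapes."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = current - previous
    if delta < 0:
        # The counter went backwards, so the process restarted; count from zero.
        delta = current
    return delta / elapsed_seconds


def should_alert(prev_5xx, curr_5xx, elapsed_seconds, threshold_per_second=0.1):
    """True when the 5xx rate over the interval exceeds the threshold."""
    return per_second_rate(prev_5xx, curr_5xx, elapsed_seconds) > threshold_per_second


# 60 new 5xx responses over a 60-second interval is 1.0 per second.
print(per_second_rate(100, 160, 60))
```

In a Prometheus setup this calculation is normally done by the server's own rate function; the sketch only illustrates what the alert condition measures.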
Latency#
- P99 API latency > 1 second. Common cause is slow database queries.
- P99 PowerDNS latency > 500ms.
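P99 values are estimated from histogram buckets, not measured exactly. The sketch below reproduces the usual approach (linear interpolation inside the bucket that contains the target rank), in the spirit of Prometheus's histogram_quantile. The bucket bounds and counts in the example are hypothetical; the actual buckets ScaiDNS exposes are not listed on this page.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    buckets must include a final (math.inf, total) entry.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf:
                # Quantile falls in the overflow bucket; report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative buckets: (upper bound in seconds, request count).
EXAMPLE_BUCKETS = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
p99 = histogram_quantile(0.99, EXAMPLE_BUCKETS)
print(f"estimated P99: {p99:.3f}s, over 1s: {p99 > 1.0}")
```

For alerting, apply this to bucket increases over a recent window, not to totals since process start; otherwise old traffic masks a current slowdown.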
Authentication#
- API key usage dropping to zero for a key that's normally active — possible credential rotation failure.
- scaidns_webhook_events_total{status="error"} rate > 0: ScaiKey events failing to process.
Workers#
- Background job queue backlog — via arq's metrics endpoint if enabled, or via Redis directly.
- Failed jobs — retries exhausted.
Database#
MariaDB query patterns:
- Hot path: users lookups by ID or email. Index exists.
- Hot path: domains lookups by tenant + name. Composite index.
- Hot path: records lookups by domain. Foreign key with index.
- Audit log writes are batched; spikes are normal.
If you see slow queries, check:
- Missing indexes on custom-added columns.
- Connection pool exhaustion: increase DATABASE_POOL_SIZE.
- Lock contention during bulk operations.
Redis#
Used for:
- JWKS cache. Refreshed every SCAIKEY_JWKS_CACHE_TTL seconds. If Redis is unreachable, JWKS is fetched on every JWT validation, which is slow but not broken.
- Token introspection cache. Short-lived (SCAIKEY_TOKEN_CACHE_TTL seconds).
- Rate limiting. Per-user and per-key counters.
- Worker queue. arq uses Redis as its job store.
Monitor Redis memory pressure; evictions will cause cache misses (not data loss — caches regenerate).
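One way to watch for memory pressure is to read a few fields from the Redis INFO output. The sketch below takes an INFO-style dict (such as the one returned by redis-py's `info()`) and reports the memory ratio and any new evictions since the last check. The field names used_memory, maxmemory, and evicted_keys are standard Redis INFO fields; the function name and the 0.9 warning ratio are assumptions.

```python
def redis_memory_report(info, previous_evicted_keys=0, warn_ratio=0.9):
    """Summarize memory pressure and new evictions from a Redis INFO dict."""
    used = int(info.get("used_memory", 0))
    limit = int(info.get("maxmemory", 0))  # 0 means no limit configured
    evicted = int(info.get("evicted_keys", 0))
    ratio = used / limit if limit > 0 else None
    return {
        "memory_ratio": ratio,
        "memory_pressure": ratio is not None and ratio >= warn_ratio,
        # evicted_keys is cumulative, so compare against the previous reading.
        "new_evictions": max(evicted - previous_evicted_keys, 0),
    }


# Hypothetical INFO values for illustration.
print(redis_memory_report(
    {"used_memory": 950, "maxmemory": 1000, "evicted_keys": 12},
    previous_evicted_keys=10,
))
```

Keeping the function free of network calls makes it easy to feed from whichever Redis client or exporter is already in place.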
PowerDNS#
ScaiDNS calls PowerDNS's HTTP API on every zone/record/DNSSEC mutation. Watch:
- PowerDNS API latency. Should be consistently under 100ms. Slower points to PowerDNS backend issues.
- Sync status (GET /api/v1/admin/sync-status): zones marked failed or with a stale last_synced_at.
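The sync-status check can be automated once the response is loaded. The sketch below flags zones that are failed or whose last_synced_at is older than a cutoff. The response schema is not documented on this page, so the shape used here (a list of zone objects with name, status, and an ISO 8601 last_synced_at) is hypothetical, as is the 15-minute staleness window; only failed and last_synced_at come from the text above.

```python
from datetime import datetime, timedelta, timezone


def zones_needing_attention(zones, now=None, max_age=timedelta(minutes=15)):
    """Return (zone name, reason) pairs for failed, never-synced, or stale zones."""
    now = now or datetime.now(timezone.utc)
    flagged = []
    for zone in zones:
        name = zone.get("name")  # hypothetical field name
        if zone.get("status") == "failed":  # hypothetical field name
            flagged.append((name, "failed"))
            continue
        raw = zone.get("last_synced_at")
        if raw is None:
            flagged.append((name, "never synced"))
            continue
        synced = datetime.fromisoformat(raw)
        if synced.tzinfo is None:
            # Assume UTC when the timestamp carries no offset.
            synced = synced.replace(tzinfo=timezone.utc)
        if now - synced > max_age:
            flagged.append((name, "stale"))
    return flagged


# Hypothetical zone entries for illustration.
EXAMPLE_ZONES = [
    {"name": "a.example", "status": "ok", "last_synced_at": "2024-01-01T11:55:00+00:00"},
    {"name": "b.example", "status": "failed", "last_synced_at": "2024-01-01T11:59:00+00:00"},
    {"name": "c.example", "status": "ok", "last_synced_at": "2024-01-01T10:00:00+00:00"},
]
EXAMPLE_NOW = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(zones_needing_attention(EXAMPLE_ZONES, now=EXAMPLE_NOW))
```

Adjust the field names to match the actual response before relying on this.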
Common failure patterns#
All validation checks failing#
Symptom: new domains stuck in pending_validation.
Check:
- Is outbound DNS working from the API host? ScaiDNS queries public resolvers.
- Is the challenge actually published at the user's current DNS provider?
- For self-hosted resolvers, is caching stale NXDOMAIN?
Webhooks not arriving#
Symptom: users added in ScaiKey don't show up in ScaiDNS.
Check:
- ScaiKey's webhook delivery log — any failures?
- SCAIKEY_WEBHOOK_SECRET matches on both sides.
- Webhook URL in ScaiKey points to your external URL.
- Users are assigned to the ScaiDNS application in ScaiKey (otherwise events aren't sent).
PowerDNS sync drift#
Symptom: sync-status shows zones marked failed.
Check:
- PowerDNS logs for the failed zone — often a malformed record.
- PowerDNS API key in .env is correct.
- Network reachability between ScaiDNS and PowerDNS.
High API latency#
Symptom: P99 > 1s, affecting all endpoints.
Check in order:
- Database query duration metric — is one query slow?
- PowerDNS API latency — is the downstream slow?
- Redis latency: is rate-limit checking slow?
- Runtime pauses? JVM-style long GC pauses do not apply to Python, but Python garbage collection can still cause micro-pauses at scale.
Backups#
Back up these daily:
- MariaDB database. Full dumps. Everything ScaiDNS knows lives here.
- PowerDNS database. Zone data lives here; ScaiDNS's audit log references it.
- Configuration files, including .env. Secrets are the hard part.
Retention: 30 days is typical. Longer for audit compliance.
Recovery: restore both databases to a consistent point, then restart services. Running scaidns sync may be needed to catch up on webhook events missed during the outage.
What's next#
- Deployment — first-time setup.
- Audit Log — action history.