Health and Monitoring

Endpoints and signals for keeping ScaiDNS healthy in production.

Health endpoints

GET /health/live

Liveness check. Returns 200 as long as the API process is running. Use for container orchestrator livenessProbe configuration.

Response:

```json
{"status": "ok"}
```

Public (no auth required).

GET /health/ready

Readiness check. Returns 200 only when ScaiDNS can serve traffic: the database, Redis, and PowerDNS are reachable, and the JWKS cache is populated.

Response (healthy):

```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "pdns": "ok",
    "scaikey_jwks": "ok"
  }
}
```

Response (degraded): 503 status code, with per-check failures:

```json
{
  "status": "degraded",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "pdns": "connection refused",
    "scaikey_jwks": "ok"
  }
}
```

Use for readinessProbe and for load-balancer health checks.
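A probe only needs the status code, but the per-check detail is useful in alert messages. The following is a minimal sketch, assuming the response body has exactly the shape shown above. The helper name `failing_checks` is hypothetical and not part of ScaiDNS.

```python
import json


def failing_checks(body: str) -> dict[str, str]:
    """Return {check_name: reported_state} for every check that is not "ok"."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return {name: state for name, state in checks.items() if state != "ok"}


# The degraded example from this page, used as sample input.
degraded = """
{
  "status": "degraded",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "pdns": "connection refused",
    "scaikey_jwks": "ok"
  }
}
"""

failed = failing_checks(degraded)
print(failed)
```

For the degraded sample, only the `pdns` entry is returned, which is the piece an on-call engineer would want in the alert text.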

Metrics

If METRICS_ENABLED=true, Prometheus metrics are served at /metrics (public by default — restrict in your reverse proxy).

Key metrics exposed:

| Metric | Type | Notes |
|---|---|---|
| scaidns_http_requests_total | counter | Labeled by method, path, status |
| scaidns_http_request_duration_seconds | histogram | Request latency |
| scaidns_db_queries_total | counter | Database query count |
| scaidns_db_query_duration_seconds | histogram | Database query latency |
| scaidns_pdns_requests_total | counter | PowerDNS API requests |
| scaidns_pdns_request_duration_seconds | histogram | PowerDNS API latency |
| scaidns_validation_checks_total | counter | DNS validation attempts; labeled by result |
| scaidns_api_key_usage_total | counter | API key authentications; labeled by key_id |
| scaidns_webhook_events_total | counter | Incoming webhook events; labeled by event_type, status |
| scaidns_worker_jobs_total | counter | Background worker job executions |
| scaidns_worker_jobs_duration_seconds | histogram | Job latency |
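To illustrate how the labeled counters can be consumed outside Prometheus, here is a rough sketch that sums `scaidns_http_requests_total` by status class from the text exposition format. The sample values are invented for the example, the function name `requests_by_status_class` is hypothetical, and the simple regular expressions assume label values contain no `}` or escaped quotes.

```python
import re
from collections import Counter

# Matches one sample line of the HTTP request counter, with its label set.
SAMPLE_RE = re.compile(
    r'^scaidns_http_requests_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def requests_by_status_class(metrics_text: str) -> Counter:
    """Sum request counts per status class ("2xx", "5xx", ...)."""
    totals: Counter = Counter()
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        status = labels.get("status", "")
        status_class = f"{status[0]}xx" if status else "unknown"
        totals[status_class] += float(match.group("value"))
    return totals


# Invented sample scrape; real output will contain many more series.
sample = """
# TYPE scaidns_http_requests_total counter
scaidns_http_requests_total{method="GET",path="/health/live",status="200"} 300
scaidns_http_requests_total{method="GET",path="/health/ready",status="200"} 1200
scaidns_http_requests_total{method="GET",path="/health/ready",status="503"} 4
scaidns_http_requests_total{method="GET",path="/api/v1/admin/sync-status",status="500"} 6
scaidns_db_queries_total 5400
"""

totals = requests_by_status_class(sample)
error_ratio = totals["5xx"] / sum(totals.values())
print(totals, round(error_ratio, 4))
```

Counters are cumulative, so a real alert would compare two scrapes (or use a rate in Prometheus) rather than the raw totals.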

Logs

The application logs to stdout. Set LOG_FORMAT=json for structured logs: each line is a self-contained JSON object suitable for ingestion by ELK, Loki, or Datadog.

Log levels:

  • DEBUG — verbose, internal state
  • INFO — request lifecycle, webhook events, major actions
  • WARNING — recoverable issues (validation skipped, webhook signature missing)
  • ERROR — unhandled exceptions, dependency failures

Use INFO in production. Use DEBUG only in development or when troubleshooting, because it is noisy.

Request logging

Every request is logged with:

  • method, path, status_code, duration_ms
  • user_id, tenant_id, api_key_id (whichever applies)
  • ip, user_agent
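With LOG_FORMAT=json, these fields can be filtered without a log platform. The sketch below assumes the field names listed above appear as top-level JSON keys. The function name `slow_requests` and the sample log lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def slow_requests(lines: Iterable[str], threshold_ms: float = 1000.0) -> Iterator[dict]:
    """Yield request log records whose duration_ms meets or exceeds the threshold."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON line, for example a startup banner
        if not isinstance(record, dict):
            continue
        duration = record.get("duration_ms")
        if isinstance(duration, (int, float)) and duration >= threshold_ms:
            yield record


# Invented sample lines using the documented field names.
log_lines = [
    '{"method": "GET", "path": "/health/ready", "status_code": 200, "duration_ms": 12}',
    'plain text line that is not JSON',
    '{"method": "GET", "path": "/api/v1/admin/sync-status", "status_code": 200, "duration_ms": 1840}',
]

slow = list(slow_requests(log_lines))
for record in slow:
    print(record["method"], record["path"], record["duration_ms"])
```

In practice the input would be a file handle or `sys.stdin`, so the same function works in a pipeline behind `docker logs` or `kubectl logs`.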

Tracing

Optional. If TRACING_ENABLED=true, OpenTelemetry traces are exported to the configured collector. Set:

```
TRACING_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_SERVICE_NAME=scaidns
```

Traces include HTTP handlers, database queries, Redis operations, and PowerDNS API calls.

Dashboards

Things worth alerting on:

Availability

  • Rate of scaidns_http_requests_total{status=~"5.."} above a small threshold tuned to your traffic. 5xx responses indicate real errors.
  • /health/ready returning non-200 for more than a minute.
  • PowerDNS unreachable (scaidns_pdns_requests_total{status=~"5..|error"}) — data plane partial outage.

Latency

  • P99 API latency > 1 second. Common cause is slow database queries.
  • P99 PowerDNS latency > 500ms.
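The P99 figures above come from the histogram metrics, which expose cumulative bucket counts rather than raw latencies. As a worked illustration, the sketch below estimates a quantile by linear interpolation inside the bucket that contains the target rank, which is the same idea Prometheus's `histogram_quantile` uses. The bucket boundaries and counts are invented; the real buckets of `scaidns_http_request_duration_seconds` are not stated on this page.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with an infinite bound,
    as in the Prometheus exposition format (le="+Inf").
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Cannot interpolate: fall back to the last finite bound.
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative counts for buckets le=0.1, 0.5, 1.0, 2.5, +Inf (seconds).
sample_buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (2.5, 1000), (math.inf, 1000)]

p99 = histogram_quantile(0.99, sample_buckets)
print(f"estimated P99: {p99:.3f}s, above 1s: {p99 > 1.0}")
```

With these sample counts the target rank is 990, which falls in the 0.5 to 1.0 second bucket, so the estimate lands between those bounds. The accuracy of any such estimate depends on how finely the buckets are spaced around the alert threshold.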

Authentication

  • API key usage dropping to zero for a key that's normally active — possible credential rotation failure.
  • scaidns_webhook_events_total{status="error"} rate > 0 — ScaiKey events failing to process.

Workers

  • Background job queue backlog — via arq's metrics endpoint if enabled, or via Redis directly.
  • Failed jobs — retries exhausted.
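Checking the backlog "via Redis directly" can be sketched as follows. This assumes ScaiDNS uses arq's default queue name, `arq:queue`, and that the queue is a Redis sorted set, which is how arq stores pending jobs; verify both against your deployment. The `FakeRedis` class is a stand-in so the example runs without a server; with the redis-py library you would pass a `redis.Redis` instance, which provides the same `zcard` method.

```python
from typing import Protocol


class SupportsZcard(Protocol):
    """Any client exposing Redis's ZCARD command, such as redis.Redis."""

    def zcard(self, name: str) -> int: ...


# arq's default queue name; whether ScaiDNS overrides it is an assumption to check.
ARQ_DEFAULT_QUEUE = "arq:queue"


def queue_backlog(client: SupportsZcard, queue_name: str = ARQ_DEFAULT_QUEUE) -> int:
    """Return the number of jobs currently held in the queue's sorted set."""
    return int(client.zcard(queue_name))


class FakeRedis:
    """In-memory stand-in used only to keep this example self-contained."""

    def __init__(self, sorted_sets: dict[str, dict[str, float]]) -> None:
        self._sorted_sets = sorted_sets

    def zcard(self, name: str) -> int:
        return len(self._sorted_sets.get(name, {}))


client = FakeRedis({"arq:queue": {"job-1": 1.0, "job-2": 2.0, "job-3": 3.0}})
backlog = queue_backlog(client)
print(f"jobs waiting: {backlog}")
```

The count includes jobs deferred to a future time, so a steadily growing value is a more reliable signal than any single reading.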

Database

MariaDB query patterns:

  • Hot path: users lookups by ID or email. Index exists.
  • Hot path: domains lookups by tenant + name. Composite index.
  • Hot path: records lookups by domain. Foreign key with index.
  • Audit log writes are batched; spikes are normal.

If you see slow queries, check:

  • Missing indexes on custom-added columns.
  • Connection pool exhaustion — increase DATABASE_POOL_SIZE.
  • Lock contention during bulk operations.

Redis

Used for:

  • JWKS cache. Refreshed every SCAIKEY_JWKS_CACHE_TTL seconds. If Redis is unreachable, JWKS is fetched on every JWT validation — slow but not broken.
  • Token introspection cache. Short-lived (SCAIKEY_TOKEN_CACHE_TTL seconds).
  • Rate limiting. Per-user and per-key counters.
  • Worker queue. arq uses Redis as its job store.

Monitor Redis memory pressure; evictions will cause cache misses (not data loss — caches regenerate).

PowerDNS

ScaiDNS calls PowerDNS's HTTP API on every zone/record/DNSSEC mutation. Watch:

  • PowerDNS API latency. Should be consistently under 100ms. Slower responses point to PowerDNS backend issues.
  • Sync status (GET /api/v1/admin/sync-status) — zones with failed or stale last_synced_at.
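The sync-status check can be automated along the lines below. The response schema is not documented on this page, so the record shape here is hypothetical: each zone is assumed to be a dict with `name`, `status`, and an ISO 8601 `last_synced_at`. Only the `failed` status and the `last_synced_at` field are taken from the text above, and the 15-minute staleness window is an arbitrary example value.

```python
from datetime import datetime, timedelta, timezone


def zones_needing_attention(
    zones: list[dict],
    now: datetime,
    max_age: timedelta = timedelta(minutes=15),
) -> list[str]:
    """Return names of zones that are failed or have not synced within max_age."""
    flagged = []
    for zone in zones:
        synced = zone.get("last_synced_at")
        is_failed = zone.get("status") == "failed"
        # A zone that has never synced is treated as stale.
        is_stale = synced is None or now - datetime.fromisoformat(synced) > max_age
        if is_failed or is_stale:
            flagged.append(zone["name"])
    return flagged


# Hypothetical payload; field names other than last_synced_at are assumptions.
sample_zones = [
    {"name": "example.com", "status": "ok", "last_synced_at": "2026-05-17T02:30:00+00:00"},
    {"name": "example.org", "status": "failed", "last_synced_at": "2026-05-17T02:29:00+00:00"},
    {"name": "example.net", "status": "ok", "last_synced_at": "2026-05-16T20:00:00+00:00"},
]

check_time = datetime(2026, 5, 17, 2, 38, tzinfo=timezone.utc)
flagged = zones_needing_attention(sample_zones, check_time)
print(flagged)
```

Timestamps must carry a UTC offset so they can be compared with the timezone-aware `now`; adjust the parsing if the real endpoint returns a different format.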

Common failure patterns

All validation checks failing

Symptom: new domains stuck in pending_validation.

Check:

  • Is outbound DNS working from the API host? ScaiDNS queries public resolvers.
  • Is the challenge actually published at the user's current DNS provider?
  • If you use self-hosted resolvers, are they serving a stale cached NXDOMAIN?

Webhooks not arriving

Symptom: users added in ScaiKey don't show up in ScaiDNS.

Check:

  • ScaiKey's webhook delivery log — any failures?
  • SCAIKEY_WEBHOOK_SECRET matches on both sides.
  • Webhook URL in ScaiKey points to your external URL.
  • Users are assigned to the ScaiDNS application in ScaiKey (otherwise events aren't sent).

PowerDNS sync drift

Symptom: sync-status shows zones with a failed status.

Check:

  • PowerDNS logs for the failed zone — often a malformed record.
  • PowerDNS API key in .env is correct.
  • Network reachability between ScaiDNS and PowerDNS.

High API latency

Symptom: P99 > 1s, affecting all endpoints.

Check in order:

  1. Database query duration metric — is one query slow?
  2. PowerDNS API latency — is the downstream slow?
  3. Redis latency: is the rate-limit check slow?
  4. Runtime pauses: Python has no JVM-style stop-the-world pauses, but its garbage collection can still cause micro-pauses at scale.

Backups

Back up these daily:

  • MariaDB database. Full dumps. Everything ScaiDNS knows lives here.
  • PowerDNS database. Zone data lives here; ScaiDNS's audit log references it.
  • Configuration files, especially .env. The secrets it holds are the hard part to recover.

Retention: 30 days is typical. Longer for audit compliance.

Recovery: restore both databases to a consistent point, then restart services. Running scaidns sync may be needed to catch up on webhook events missed during the outage.

Updated 2026-05-17 02:38:19 (rev 1)