Health and Monitoring
How to know ScaiVault is working, and what to watch for when it isn't.
Health endpoints
| Endpoint | Checks | Auth |
|---|---|---|
| GET /v1/health | Process alive | No |
| GET /v1/health/ready | DB, Redis, encryption all reachable | No |
| GET /v1/health/detailed | Above plus latencies, pool stats, KMS key status | admin |
Use /v1/health for liveness probes and /v1/health/ready for readiness probes. Never put authentication in front of these endpoints: load balancers and orchestrators can't negotiate auth.
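As a rough illustration of what an unauthenticated probe does, the sketch below polls both endpoints from Python. The base URL is whatever your deployment uses, and treating HTTP 200 as "healthy" is an assumption; this page does not state which status codes the endpoints return.

```python
import urllib.error
import urllib.request

LIVENESS_PATH = "/v1/health"
READINESS_PATH = "/v1/health/ready"


def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def probe(base_url, fetch=http_status):
    """Check liveness and readiness; no credentials are sent to either endpoint."""
    base = base_url.rstrip("/")
    return {
        # Assumption: a 200 response means the check passed.
        "live": fetch(base + LIVENESS_PATH) == 200,
        "ready": fetch(base + READINESS_PATH) == 200,
    }
```

The `fetch` parameter exists only so the function can be exercised without a running server; in normal use, call `probe("http://your-scaivault-host")` with your own host.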
Prometheus metrics
GET /v1/metrics exposes metrics in Prometheus format. Key series:
Request metrics
- scaivault_requests_total{method, path_category, status}: counter.
- scaivault_request_duration_seconds{method, path_category}: histogram.
- scaivault_active_requests{path_category}: gauge.
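For a quick check without a full Prometheus setup, the request counter can be read straight from a scrape. The sketch below parses the text exposition format and computes the share of requests with a 5xx status. It assumes the `status` label holds a value beginning with "5" for server errors (for example "500" or "5xx"); the exact label values are not documented here.

```python
import re

# One sample line of the Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def server_error_ratio(text):
    """Fraction of scaivault_requests_total whose status label starts with '5'."""
    total = errors = 0.0
    for name, labels, value in parse_samples(text):
        if name != "scaivault_requests_total":
            continue
        total += value
        if labels.get("status", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0
```

Because counters are cumulative, this gives the ratio since the process started. For a ratio over a time window, take the difference between two scrapes first.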
Rate limiting
- scaivault_rate_limit_hits_total{category}: counter of 429 responses.
- scaivault_rate_limit_bucket_fill{category, identity}: gauge of current bucket level. Per-identity series are high-cardinality; aggregate with a recording rule.
Secrets
- scaivault_secrets_total{tenant}: gauge of live secret count.
- scaivault_secret_reads_total{tenant, secret_type}: counter.
- scaivault_secret_writes_total{tenant}: counter.
- scaivault_secret_rotations_total{status}: counter (success|failed).
PKI
- scaivault_certificates_active{ca_id}: gauge.
- scaivault_certificates_expiring_soon{within_days}: gauge, useful for alerting.
- scaivault_acme_orders_total{provider, status}: counter.
Dynamic secrets
- scaivault_leases_active{engine}: gauge.
- scaivault_leases_generated_total{engine, role}: counter.
- scaivault_engine_health{engine}: gauge, 1 healthy / 0 unreachable.
Dependencies
- scaivault_db_latency_seconds: histogram.
- scaivault_redis_latency_seconds: histogram.
- scaivault_kms_latency_seconds{operation}: histogram (encrypt, decrypt, sign).
- scaivault_db_pool_size, scaivault_db_pool_available: gauges.
Background jobs
- scaivault_rotation_queue_depth: gauge.
- scaivault_webhook_queue_depth: gauge.
- scaivault_webhook_delivery_duration_seconds: histogram.
Alert rules
Suggestions. Tune to your SLO.
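To make the kind of condition worth alerting on concrete, here is a small Python sketch that evaluates one scrape against a few thresholds. It is not ScaiVault's recommended rule set: the alert names, the `queue_depth_limit` default, and the choice of conditions are all assumptions, drawn only from the metric descriptions on this page. A real alerting rule would also require the condition to hold for some duration before firing.

```python
def firing_alerts(samples, queue_depth_limit=50):
    """Return alert names for one scrape of (name, labels, value) samples.

    Alert names and the default limit are illustrative, not part of ScaiVault.
    """
    alerts = []
    for name, labels, value in samples:
        if name == "scaivault_engine_health" and value == 0:
            # Gauge is 1 when healthy and 0 when the engine is unreachable.
            alerts.append("EngineUnreachable:" + labels.get("engine", "unknown"))
        elif name == "scaivault_certificates_expiring_soon" and value > 0:
            alerts.append("CertificatesExpiringSoon")
        elif name == "scaivault_rotation_queue_depth" and value > queue_depth_limit:
            alerts.append("RotationQueueBacklog")
        elif name == "scaivault_db_pool_available" and value == 0:
            alerts.append("DbPoolExhausted")
    return alerts
```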
Logs
Logs are JSON-formatted when LOG_FORMAT=json (the default).
Key fields for filtering:
- request_id: trace a single call across components.
- tenant_id, identity_id: scope to customer or account.
- level: debug, info, warn, error.
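The fields above lend themselves to simple filtering. The sketch below assumes each log entry is a single JSON object on its own line; the function name and the idea of a minimum level are illustrative, and only the field names come from this page.

```python
import json

# Severity order for the documented levels, lowest first.
LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def filter_logs(lines, min_level="info", **fields):
    """Yield parsed entries at or above min_level whose fields all match."""
    threshold = LEVELS[min_level]
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not JSON
        if not isinstance(entry, dict):
            continue
        if LEVELS.get(str(entry.get("level")), -1) < threshold:
            continue
        if all(entry.get(key) == want for key, want in fields.items()):
            yield entry
```

For example, `filter_logs(open("scaivault.log"), min_level="warn", request_id="req-1")` would follow one call through its warnings and errors; the file name and request ID here are placeholders.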
Distributed tracing
ScaiVault emits OpenTelemetry spans if OTEL_EXPORTER_OTLP_ENDPOINT is set.
Spans cover: incoming HTTP requests, DB queries, Redis calls, KMS operations, outbound HTTP (ScaiKey, webhooks, federated backends, ACME). Client-supplied X-Request-ID becomes the trace ID when present.
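On the client side, supplying your own X-Request-ID makes it possible to find the matching trace and log entries later. A minimal sketch, assuming a 32-character hex string is an acceptable ID (this page does not specify a required format):

```python
import urllib.request
import uuid


def build_traced_request(url, request_id=None):
    """Build a GET request carrying X-Request-ID; returns (request, request_id)."""
    request_id = request_id or uuid.uuid4().hex  # 32 hex characters
    req = urllib.request.Request(url, method="GET")
    # urllib stores the header name as "X-request-id"; HTTP header names
    # are case-insensitive, so the server sees the same header.
    req.add_header("X-Request-ID", request_id)
    return req, request_id
```

Record the returned ID alongside your own logs so that a failing call can be looked up by request_id on the ScaiVault side.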
Audit-driven alerts
Some signals are only in the audit log:
- Spike in policy_violation events. Someone or something is trying to access things they can't. Investigate.
- Reads of a "dormant" secret. If a secret hasn't been read in months and suddenly is, find out why.
- New identity reading a sensitive path. Pair with ownership metadata and alert on unexpected readers.
Pull the audit log into your SIEM (POST /v1/audit/export to S3, ingest from there) and run the detection there. ScaiVault's audit endpoints are not designed for high-QPS detection traffic; export and query elsewhere.
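Once the events are outside ScaiVault, the dormant-secret check is a short pass over them. The sketch below is one way to do it, under loud assumptions: the event field names (`event_type`, `path`, `timestamp`, `identity_id`), the `secret_read` event type, and the 90-day window are all hypothetical, since the export schema is not described on this page.

```python
from datetime import datetime, timedelta


def dormant_reads(events, dormant_after=timedelta(days=90)):
    """Return read events whose secret had gone unread for at least dormant_after.

    events must be sorted by timestamp, oldest first. Field names are assumed.
    """
    last_read = {}  # secret path -> datetime of the previous read
    flagged = []
    for event in events:
        if event.get("event_type") != "secret_read":
            continue
        when = datetime.fromisoformat(event["timestamp"])
        previous = last_read.get(event["path"])
        if previous is not None and when - previous >= dormant_after:
            flagged.append(event)
        last_read[event["path"]] = when
    return flagged
```

A secret's first read in the exported window is never flagged, because there is no earlier read to compare against. Timestamps are expected in a form `datetime.fromisoformat` accepts, such as `2024-06-01T00:00:00+00:00`.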
Dashboards
Useful panels to start with:
- Request rate and status — stacked by path category and 2xx/4xx/5xx.
- P50/P95/P99 latency — per endpoint category.
- Active leases — by engine. Watch for runaway growth (usually means a client isn't revoking).
- Rotation queue depth — should be near zero; sustained growth is a misconfiguration somewhere.
- Certificates expiring in the next 30 days — counts and a table of which.
- Webhook success rate (24h) — per webhook. Below 95% is worth investigating.
- Top readers — identity-keyed, last 1h. Catches changes in traffic shape.
What's next
- Troubleshooting — common issues and fixes.
- Deployment.