Health and Monitoring

How to know ScaiVault is working, and what to watch for when it isn't.

Health endpoints

Endpoint	Checks	Auth
`GET /v1/health`	Process alive	No
`GET /v1/health/ready`	DB, Redis, encryption all reachable	No
`GET /v1/health/detailed`	Above plus latencies, pool stats, KMS key status	`admin`

Use /v1/health for liveness probes and /v1/health/ready for readiness probes. Never put authentication in front of these two endpoints: load balancers and orchestrators generally can't negotiate auth.
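On Kubernetes, the probe mapping described above might look like the following sketch. Only the two paths come from this page; the port and timing values are illustrative assumptions, not ScaiVault defaults.

```yaml
# Container spec fragment (Kubernetes).
# Port and timings are assumptions; adjust to your deployment.
livenessProbe:
  httpGet:
    path: /v1/health          # process alive
    port: 8080                # assumed listen port
  periodSeconds: 10
  failureThreshold: 3         # restart after ~30s of failed checks
readinessProbe:
  httpGet:
    path: /v1/health/ready    # DB, Redis, encryption reachable
    port: 8080                # assumed listen port
  periodSeconds: 5
  failureThreshold: 2         # remove from rotation after ~10s
```

Keeping the liveness probe on the cheaper endpoint means a slow dependency takes the instance out of rotation rather than restarting it.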

Prometheus metrics

GET /v1/metrics exposes metrics in the Prometheus text format. Key series:

Request metrics

scaivault_requests_total{method, path_category, status} — counter.
scaivault_request_duration_seconds{method, path_category} — histogram.
scaivault_active_requests{path_category} — gauge.

Rate limiting

scaivault_rate_limit_hits_total{category} — counter of 429 responses.
scaivault_rate_limit_bucket_fill{category, identity} — gauge of current bucket level. Per-identity series are high-cardinality; aggregate with a recording rule.
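A recording rule for the per-identity gauge could be sketched as follows. The rule name follows the usual Prometheus level:metric:operation convention. Using `min` assumes that a lower bucket level means an identity is closer to being limited; use `max` or `avg` if the gauge has the opposite meaning in your deployment.

```yaml
groups:
  - name: scaivault-rate-limit
    rules:
      # Collapse the identity label: one series per category instead of
      # one per (category, identity) pair.
      - record: category:scaivault_rate_limit_bucket_fill:min
        expr: min by (category) (scaivault_rate_limit_bucket_fill)
```

Dashboards and alerts can then query the recorded series. The raw per-identity series are still ingested, so this lowers query cost rather than storage cost.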

Secrets

scaivault_secrets_total{tenant} — gauge of live secret count.
scaivault_secret_reads_total{tenant, secret_type} — counter.
scaivault_secret_writes_total{tenant} — counter.
scaivault_secret_rotations_total{status} — counter (success | failed).

PKI

scaivault_certificates_active{ca_id} — gauge.
scaivault_certificates_expiring_soon{within_days} — gauge, useful for alerting.
scaivault_acme_orders_total{provider, status} — counter.

Dynamic secrets

scaivault_leases_active{engine} — gauge.
scaivault_leases_generated_total{engine, role} — counter.
scaivault_engine_health{engine} — gauge, 1 healthy / 0 unreachable.

Dependencies

scaivault_db_latency_seconds — histogram.
scaivault_redis_latency_seconds — histogram.
scaivault_kms_latency_seconds{operation} — histogram (encrypt, decrypt, sign).
scaivault_db_pool_size, scaivault_db_pool_available — gauges.

Background jobs

scaivault_rotation_queue_depth — gauge.
scaivault_webhook_queue_depth — gauge.
scaivault_webhook_delivery_duration_seconds — histogram.
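To collect the series listed above, a Prometheus scrape job might look like this sketch. The target host, port, scheme, and interval are hypothetical. This page does not say whether /v1/metrics requires credentials, so add an authorization section if your deployment needs one.

```yaml
scrape_configs:
  - job_name: scaivault            # the job label that alert expressions can match on
    metrics_path: /v1/metrics      # path from this page; the Prometheus default is /metrics
    scheme: https                  # assumed
    scrape_interval: 30s           # assumed
    static_configs:
      - targets:
          - scaivault.internal:8443   # hypothetical host:port
```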

Alert rules

These are suggested starting points. Tune the thresholds and durations to your SLOs.

yaml
# P95 request latency
- alert: ScaiVaultHighLatency
  expr: histogram_quantile(0.95, rate(scaivault_request_duration_seconds_bucket[5m])) > 1.0
  for: 10m
  annotations:
    summary: "ScaiVault P95 > 1s"

# Error rate
- alert: ScaiVaultErrorRate
  expr: sum(rate(scaivault_requests_total{status=~"5.."}[5m])) / sum(rate(scaivault_requests_total[5m])) > 0.01
  for: 5m

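# KMS unreachable: no KMS operations recorded while requests are in flight.
# Both sides are aggregated with sum() so their label sets match for `and`.
- alert: ScaiVaultKMSFailing
  expr: sum(rate(scaivault_kms_latency_seconds_count[5m])) == 0 and sum(scaivault_active_requests) > 0
  for: 2m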

# Target down or unscrapable (`up` tracks scrape success, not /v1/health/ready)
- alert: ScaiVaultNotReady
  expr: up{job="scaivault"} == 0
  for: 2m

# Certificates approaching expiry
- alert: ScaiVaultCertsExpiringSoon
  expr: scaivault_certificates_expiring_soon{within_days="14"} > 0
  for: 1h

# Rotation backlog
- alert: ScaiVaultRotationQueueDeep
  expr: scaivault_rotation_queue_depth > 100
  for: 15m

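# Webhook delivery failures
# (scaivault_webhook_deliveries_total is not in the series list on this page;
# confirm it appears in your /v1/metrics output before relying on this rule)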
- alert: ScaiVaultWebhookDeliveryFailing
  expr: rate(scaivault_webhook_deliveries_total{status="failed"}[15m]) > 0.1
  for: 10m
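Other series from the metrics list can be alerted on in the same style. The two rules below are a sketch: the alert names, thresholds, and durations are assumptions to adapt, and only the metric names come from this page.

```yaml
# Dynamic secrets engine unreachable (gauge is 1 healthy / 0 unreachable)
- alert: ScaiVaultEngineUnhealthy
  expr: scaivault_engine_health == 0
  for: 5m
  annotations:
    summary: "ScaiVault engine {{ $labels.engine }} unreachable"

# DB connection pool nearly exhausted: under 10% of connections available
- alert: ScaiVaultDBPoolNearlyExhausted
  expr: scaivault_db_pool_available / scaivault_db_pool_size < 0.1
  for: 5m
```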

Logs

Logs are JSON-formatted when LOG_FORMAT=json (the default). Each entry includes:

json
{
  "timestamp": "2026-04-23T14:00:00.123Z",
  "level": "info",
  "message": "secret read",
  "request_id": "req_abc",
  "tenant_id": "tnt_xyz",
  "identity_id": "sa:reporting",
  "path": "integrations/salesforce/oauth",
  "duration_ms": 12,
  "status": 200
}

Key fields for filtering:

request_id — trace a single call across components.
tenant_id, identity_id — scope to customer or account.
level — debug, info, warn, error.

Distributed tracing

ScaiVault emits OpenTelemetry spans if OTEL_EXPORTER_OTLP_ENDPOINT is set.

bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_SERVICE_NAME=scaivault
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Spans cover: incoming HTTP requests, DB queries, Redis calls, KMS operations, outbound HTTP (ScaiKey, webhooks, federated backends, ACME). Client-supplied X-Request-ID becomes the trace ID when present.
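On the receiving side, a minimal OpenTelemetry Collector configuration that accepts these spans over OTLP/gRPC on port 4317 might look like the sketch below. It assumes a recent Collector release that ships the `debug` exporter. In practice you would swap in the exporter for your tracing backend.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # matches the port in OTEL_EXPORTER_OTLP_ENDPOINT

exporters:
  debug: {}                      # prints received spans; replace with your backend's exporter

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```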

Audit-driven alerts

Some signals are only in the audit log:

Spike in policy_violation events. Someone or something is attempting access that policy denies. Investigate.
Reads of a "dormant" secret. If a secret hasn't been read in months and suddenly is, find out why.
New identity reading a sensitive path. Pair with ownership metadata and alert on unexpected readers.

Pull the audit log into your SIEM (POST /v1/audit/export to S3, ingest from there) and run the detection there. ScaiVault's audit endpoints are not designed for high-QPS detection traffic; export and query elsewhere.

Dashboards

Useful panels to start with:

Request rate and status — stacked by path category and 2xx/4xx/5xx.
P50/P95/P99 latency — per endpoint category.
Active leases — by engine. Watch for runaway growth (usually means a client isn't revoking).
Rotation queue depth — should be near zero; sustained growth usually points to a misconfiguration.
Certificates expiring in the next 30 days — a count plus a table listing them.
Webhook success rate (24h) — per webhook. Below 95% is worth investigating.
Top readers — identity-keyed, last 1h. Catches changes in traffic shape.
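Several of these panels map directly onto the series listed earlier. The sketch below writes three of the queries as recording rules so they can live next to the alert rules; the rule names are assumptions, and the expressions work equally well pasted straight into a dashboard panel.

```yaml
groups:
  - name: scaivault-dashboard
    rules:
      # Request rate, stacked by path category and status code
      - record: path_category_status:scaivault_requests:rate5m
        expr: sum by (path_category, status) (rate(scaivault_requests_total[5m]))

      # P95 latency per endpoint category (change 0.95 for P50 or P99)
      - record: path_category:scaivault_request_duration_seconds:p95_5m
        expr: histogram_quantile(0.95, sum by (le, path_category) (rate(scaivault_request_duration_seconds_bucket[5m])))

      # Active leases by engine
      - record: engine:scaivault_leases_active:sum
        expr: sum by (engine) (scaivault_leases_active)
```

Aggregating the histogram buckets by `le` before calling `histogram_quantile` gives one quantile per category across all instances, rather than one per scraped series.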

What's next

Troubleshooting — common issues and fixes.
Deployment.