Health and Monitoring

How to know ScaiVault is working, and what to watch for when it isn't.

Health endpoints#

Endpoint                  Checks                                              Auth
GET /v1/health            Process alive                                       No
GET /v1/health/ready      DB, Redis, encryption all reachable                 No
GET /v1/health/detailed   Above plus latencies, pool stats, KMS key status    admin

Use /v1/health for liveness probes and /v1/health/ready for readiness probes. Never put authentication in front of either: load balancers and orchestrators generally can't negotiate auth.
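
As an illustration, on Kubernetes the two unauthenticated endpoints map directly onto container probes. This is a sketch only: the container name, image reference, port 8080, and timing values are assumptions, not ScaiVault defaults.

```yaml
# Fragment of a Deployment/Pod container spec.
# Name, image, port, and timings are hypothetical placeholders.
containers:
  - name: scaivault
    image: example.invalid/scaivault:latest
    ports:
      - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /v1/health          # process alive
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /v1/health/ready    # DB, Redis, encryption all reachable
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
```

A failing liveness probe restarts the container, while a failing readiness probe only removes the pod from service endpoints, so a dependency outage does not cause restart loops.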

Prometheus metrics#

GET /v1/metrics exposes metrics in Prometheus format. Key series:

Request metrics#

  • scaivault_requests_total{method, path_category, status} — counter.
  • scaivault_request_duration_seconds{method, path_category} — histogram.
  • scaivault_active_requests{path_category} — gauge.

Rate limiting#

  • scaivault_rate_limit_hits_total{category} — counter of 429 responses.
  • scaivault_rate_limit_bucket_fill{category, identity} — gauge of current bucket level. Per-identity series are high-cardinality; aggregate with a recording rule.
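
One way to aggregate the per-identity series is a Prometheus recording rule that drops the identity label. The sketch below is hedged: the group name and recorded series names are hypothetical, and since the text does not say whether a high or a low bucket level means "close to the limit", both extremes are recorded.

```yaml
groups:
  - name: scaivault-rate-limit-recording   # hypothetical group name
    rules:
      # Lowest bucket level per category across all identities.
      - record: category:scaivault_rate_limit_bucket_fill:min
        expr: min by (category) (scaivault_rate_limit_bucket_fill)
      # Highest bucket level per category across all identities.
      - record: category:scaivault_rate_limit_bucket_fill:max
        expr: max by (category) (scaivault_rate_limit_bucket_fill)
```

A recording rule makes dashboards and alerts cheaper to evaluate, but Prometheus still ingests the raw per-identity series. Reducing ingestion itself would need relabelling at scrape time.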

Secrets#

  • scaivault_secrets_total{tenant} — gauge of live secret count.
  • scaivault_secret_reads_total{tenant, secret_type} — counter.
  • scaivault_secret_writes_total{tenant} — counter.
  • scaivault_secret_rotations_total{status} — counter (success | failed).
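
The status label on the rotation counter lends itself to a failure alert. A minimal sketch, using the same rule shape as the alert suggestions on this page; the alert name and the one-hour window are assumptions to tune.

```yaml
# Rules fragment: belongs under groups[].rules in a Prometheus rule file.
- alert: ScaiVaultRotationFailures        # hypothetical alert name
  expr: sum(increase(scaivault_secret_rotations_total{status="failed"}[1h])) > 0
  annotations:
    summary: "At least one secret rotation failed in the last hour"
```

Note that Prometheus cannot compute an increase for a series that has only just appeared, so the very first failure after a process start may go unnoticed if the failed series did not exist before it.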

PKI#

  • scaivault_certificates_active{ca_id} — gauge.
  • scaivault_certificates_expiring_soon{within_days} — gauge, useful for alerting.
  • scaivault_acme_orders_total{provider, status} — counter.

Dynamic secrets#

  • scaivault_leases_active{engine} — gauge.
  • scaivault_leases_generated_total{engine, role} — counter.
  • scaivault_engine_health{engine} — gauge, 1 healthy / 0 unreachable.
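
Because scaivault_engine_health is a 1/0 gauge, it can be alerted on directly, and the active-lease gauge can be watched for sustained growth. The alert names, durations, and the threshold of 500 leases below are placeholders, not recommended values.

```yaml
# Rules fragment: belongs under groups[].rules in a Prometheus rule file.
- alert: ScaiVaultEngineUnreachable       # hypothetical alert name
  expr: scaivault_engine_health == 0
  for: 5m
  annotations:
    summary: "Dynamic secrets engine {{ $labels.engine }} is unreachable"

# Net growth in active leases over six hours; 500 is a placeholder threshold.
- alert: ScaiVaultLeaseGrowth             # hypothetical alert name
  expr: delta(scaivault_leases_active[6h]) > 500
  for: 30m
```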

Dependencies#

  • scaivault_db_latency_seconds — histogram.
  • scaivault_redis_latency_seconds — histogram.
  • scaivault_kms_latency_seconds{operation} — histogram (encrypt, decrypt, sign).
  • scaivault_db_pool_size, scaivault_db_pool_available — gauges.
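
The two pool gauges can be combined into a saturation signal, and the latency histogram into a quantile. The following is a sketch under assumptions: the alert names, the 10% free-connection floor, and the 250 ms P95 threshold are illustrative only.

```yaml
# Rules fragment: belongs under groups[].rules in a Prometheus rule file.
# Fewer than 10% of pool connections free (placeholder threshold).
- alert: ScaiVaultDBPoolNearlyExhausted   # hypothetical alert name
  expr: scaivault_db_pool_available / scaivault_db_pool_size < 0.1
  for: 5m

# P95 database latency above 250 ms (placeholder threshold).
- alert: ScaiVaultDBSlow                  # hypothetical alert name
  expr: histogram_quantile(0.95, sum by (le) (rate(scaivault_db_latency_seconds_bucket[5m]))) > 0.25
  for: 10m
```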

Background jobs#

  • scaivault_rotation_queue_depth — gauge.
  • scaivault_webhook_queue_depth — gauge.
  • scaivault_webhook_delivery_duration_seconds — histogram.
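
To collect all of the series above, Prometheus needs a scrape job pointed at /v1/metrics. A minimal sketch: the target host, port, and scrape interval are placeholders, and whether /v1/metrics requires credentials is not stated here, so none are configured.

```yaml
# prometheus.yml fragment; target address and interval are hypothetical.
scrape_configs:
  - job_name: scaivault            # the job label used by up{job="scaivault"}
    metrics_path: /v1/metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["scaivault.example.internal:8080"]
```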

Alert rules#

Suggestions. Tune to your SLO.

yaml
# Rules fragment: these entries belong under groups[].rules in a Prometheus rule file.

# P95 request latency
- alert: ScaiVaultHighLatency
  expr: histogram_quantile(0.95, rate(scaivault_request_duration_seconds_bucket[5m])) > 1.0
  for: 10m
  annotations:
    summary: "ScaiVault P95 > 1s"

# Error rate: more than 1% of requests returning 5xx
- alert: ScaiVaultErrorRate
  expr: sum(rate(scaivault_requests_total{status=~"5.."}[5m])) / sum(rate(scaivault_requests_total[5m])) > 0.01
  for: 5m

# KMS unreachable: no KMS operations recorded while requests are in flight.
# Both sides are aggregated with sum() so their label sets match for `and`.
- alert: ScaiVaultKMSFailing
  expr: sum(rate(scaivault_kms_latency_seconds_count[5m])) == 0 and sum(scaivault_active_requests) > 0
  for: 2m

# Readiness: the scrape target is down (Prometheus cannot reach /v1/metrics)
- alert: ScaiVaultNotReady
  expr: up{job="scaivault"} == 0
  for: 2m

# Certificates approaching expiry
- alert: ScaiVaultCertsExpiringSoon
  expr: scaivault_certificates_expiring_soon{within_days="14"} > 0
  for: 1h

# Rotation backlog
- alert: ScaiVaultRotationQueueDeep
  expr: scaivault_rotation_queue_depth > 100
  for: 15m

# Webhook delivery failures
# Note: scaivault_webhook_deliveries_total is not among the key series listed above;
# confirm it appears on /v1/metrics before relying on this rule.
- alert: ScaiVaultWebhookDeliveryFailing
  expr: rate(scaivault_webhook_deliveries_total{status="failed"}[15m]) > 0.1
  for: 10m

Logs#

Logs are JSON-formatted when LOG_FORMAT=json (the default). Each log entry includes:

json
{
  "timestamp": "2026-04-23T14:00:00.123Z",
  "level": "info",
  "message": "secret read",
  "request_id": "req_abc",
  "tenant_id": "tnt_xyz",
  "identity_id": "sa:reporting",
  "path": "integrations/salesforce/oauth",
  "duration_ms": 12,
  "status": 200
}

Key fields for filtering:

  • request_id — trace a single call across components.
  • tenant_id, identity_id — scope to customer or account.
  • level: debug, info, warn, error.
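
If the logs are shipped to Loki with Promtail, the JSON fields can be parsed at ingestion. This is a sketch under assumptions: the job name, log file path, and the choice of Promtail itself are illustrative, and nothing here is specific to ScaiVault beyond the field names shown in the sample entry.

```yaml
# Promtail scrape config fragment; job name and log path are hypothetical.
scrape_configs:
  - job_name: scaivault-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: scaivault
          __path__: /var/log/scaivault/*.log
    pipeline_stages:
      # Extract fields from each JSON log line.
      - json:
          expressions:
            level: level
            timestamp: timestamp
      # Promote only the low-cardinality field to an index label.
      - labels:
          level:
      # Use the entry's own timestamp rather than the ingestion time.
      - timestamp:
          source: timestamp
          format: RFC3339Nano
```

Fields such as request_id, tenant_id, and identity_id are deliberately not promoted to labels because of their cardinality. They remain in the log line and can be filtered at query time.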

Distributed tracing#

ScaiVault emits OpenTelemetry spans if OTEL_EXPORTER_OTLP_ENDPOINT is set.

bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_SERVICE_NAME=scaivault
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

Spans cover: incoming HTTP requests, DB queries, Redis calls, KMS operations, outbound HTTP (ScaiKey, webhooks, federated backends, ACME). Client-supplied X-Request-ID becomes the trace ID when present.
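
On the receiving side, a minimal OpenTelemetry Collector configuration that accepts these spans might look like the sketch below. It assumes the exporter speaks OTLP over gRPC (4317 is the conventional gRPC port, but the protocol is not stated here) and uses the debug exporter purely as a stand-in for a real tracing backend.

```yaml
# OpenTelemetry Collector config sketch; replace the debug exporter
# with your tracing backend's exporter.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  debug:
    verbosity: basic
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```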

Audit-driven alerts#

Some signals are only in the audit log:

  • Spike in policy_violation events. Someone or something is trying to access things they can't. Investigate.
  • Reads of a "dormant" secret. If a secret hasn't been read in months and suddenly is, find out why.
  • New identity reading a sensitive path. Pair with ownership metadata and alert on unexpected readers.

Pull the audit log into your SIEM (POST /v1/audit/export to S3, ingest from there) and run the detection there. ScaiVault's audit endpoints are not designed for high-QPS detection traffic; export and query elsewhere.

Dashboards#

Useful panels to start with:

  1. Request rate and status — stacked by path category and 2xx/4xx/5xx.
  2. P50/P95/P99 latency — per endpoint category.
  3. Active leases — by engine. Watch for runaway growth (usually means a client isn't revoking).
  4. Rotation queue depth — should be near zero; sustained growth is a misconfiguration somewhere.
  5. Certificates expiring in the next 30 days — counts and a table of which.
  6. Webhook success rate (24h) — per webhook. Below 95% is worth investigating.
  7. Top readers — identity-keyed, last 1h. Catches changes in traffic shape.
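
The first three panels can be backed by recording rules built only from the series listed on this page. The group name and recorded series names below are hypothetical; treat this as a starting sketch rather than a shipped dashboard definition.

```yaml
groups:
  - name: scaivault-dashboard   # hypothetical group name
    rules:
      # Panel 1: request rate by path category and status class (2xx/4xx/5xx).
      - record: path_category_status_class:scaivault_requests:rate5m
        expr: |
          sum by (path_category, status_class) (
            label_replace(
              rate(scaivault_requests_total[5m]),
              "status_class", "${1}xx", "status", "(.).."
            )
          )
      # Panel 2: P95 latency per path category.
      - record: path_category:scaivault_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (path_category, le) (rate(scaivault_request_duration_seconds_bucket[5m]))
          )
      # Panel 3: active leases by engine.
      - record: engine:scaivault_leases_active:sum
        expr: sum by (engine) (scaivault_leases_active)
```

The P50 and P99 variants of panel 2 follow the same pattern with 0.5 and 0.99 as the quantile.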

Updated 2026-05-17 13:26:51, rev 2