---
title: Health and Monitoring
path: operations/health-and-monitoring
status: published
---

# Health and Monitoring

How to know ScaiVault is working, and what to watch for when it isn't.

## Health endpoints

| Endpoint | Checks | Auth |
|----------|--------|------|
| `GET /v1/health` | Process alive | No |
| `GET /v1/health/ready` | DB, Redis, encryption all reachable | No |
| `GET /v1/health/detailed` | Above plus latencies, pool stats, KMS key status | `admin` |

Point liveness probes at `/v1/health` and readiness probes at `/v1/health/ready`. Never put authentication in front of either endpoint: load balancers and orchestrators can't negotiate auth.
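
For a quick manual check of the two unauthenticated endpoints, a sketch like the following may help. The `check_endpoint` and `probe` helpers are hypothetical, and the sketch assumes that a healthy endpoint answers with HTTP 200 and that anything else means unhealthy; neither is confirmed by this page.

```python
import urllib.error
import urllib.request


def check_endpoint(base_url, path, opener=urllib.request.urlopen, timeout=2.0):
    """Return True if GET base_url + path answers with HTTP 200.

    Assumption: 200 means healthy; any other status, a connection
    error, or a timeout is treated as unhealthy.
    """
    url = base_url.rstrip("/") + path
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTPError (raised for 4xx/5xx) is a subclass of URLError.
        return False


def probe(base_url, opener=urllib.request.urlopen):
    """Check liveness and readiness without sending any credentials."""
    return {
        "live": check_endpoint(base_url, "/v1/health", opener),
        "ready": check_endpoint(base_url, "/v1/health/ready", opener),
    }
```

Calling `probe("http://localhost:8200")` (a placeholder address) returns a dictionary such as `{"live": True, "ready": False}`, which would suggest the process is up but a dependency is not reachable. The `opener` parameter exists only so the function can be exercised without a network.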

## Prometheus metrics

`GET /v1/metrics` exposes metrics in the Prometheus text format. Key series:

### Request metrics

- `scaivault_requests_total{method, path_category, status}` — counter.
- `scaivault_request_duration_seconds{method, path_category}` — histogram.
- `scaivault_active_requests{path_category}` — gauge.

### Rate limiting

- `scaivault_rate_limit_hits_total{category}` — counter of `429` responses.
- `scaivault_rate_limit_bucket_fill{category, identity}` — gauge of current bucket level. Per-identity series are high-cardinality; aggregate with a recording rule.

### Secrets

- `scaivault_secrets_total{tenant}` — gauge of live secret count.
- `scaivault_secret_reads_total{tenant, secret_type}` — counter.
- `scaivault_secret_writes_total{tenant}` — counter.
- `scaivault_secret_rotations_total{status}` — counter (`success` | `failed`).

### PKI

- `scaivault_certificates_active{ca_id}` — gauge.
- `scaivault_certificates_expiring_soon{within_days}` — gauge, useful for alerting.
- `scaivault_acme_orders_total{provider, status}` — counter.

### Dynamic secrets

- `scaivault_leases_active{engine}` — gauge.
- `scaivault_leases_generated_total{engine, role}` — counter.
- `scaivault_engine_health{engine}` — gauge, 1 healthy / 0 unreachable.

### Dependencies

- `scaivault_db_latency_seconds` — histogram.
- `scaivault_redis_latency_seconds` — histogram.
- `scaivault_kms_latency_seconds{operation}` — histogram (`encrypt`, `decrypt`, `sign`).
- `scaivault_db_pool_size`, `scaivault_db_pool_available` — gauges.

### Background jobs

- `scaivault_rotation_queue_depth` — gauge.
- `scaivault_webhook_queue_depth` — gauge.
- `scaivault_webhook_delivery_duration_seconds` — histogram.

## Alert rules

These rules are suggestions. Tune the thresholds and `for` durations to your SLOs.

```yaml
# P95 request latency
- alert: ScaiVaultHighLatency
  expr: histogram_quantile(0.95, rate(scaivault_request_duration_seconds_bucket[5m])) > 1.0
  for: 10m
  annotations:
    summary: "ScaiVault P95 > 1s"

# Error rate
- alert: ScaiVaultErrorRate
  expr: sum(rate(scaivault_requests_total{status=~"5.."}[5m])) / sum(rate(scaivault_requests_total[5m])) > 0.01
  for: 5m

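# KMS unreachable: no KMS operations complete while requests are in flight.
# Both sides are aggregated so the label sets match for the `and`.
- alert: ScaiVaultKMSFailing
  expr: sum(rate(scaivault_kms_latency_seconds_count[5m])) == 0 and sum(scaivault_active_requests) > 0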
  for: 2m

# Scrape target down (`up` tracks scrape success, not /v1/health/ready)
- alert: ScaiVaultNotReady
  expr: up{job="scaivault"} == 0
  for: 2m

# Certificates approaching expiry
- alert: ScaiVaultCertsExpiringSoon
  expr: scaivault_certificates_expiring_soon{within_days="14"} > 0
  for: 1h

# Rotation backlog
- alert: ScaiVaultRotationQueueDeep
  expr: scaivault_rotation_queue_depth > 100
  for: 15m

# Webhook delivery failures
- alert: ScaiVaultWebhookDeliveryFailing
  expr: rate(scaivault_webhook_deliveries_total{status="failed"}[15m]) > 0.1
  for: 10m
```

## Logs

Logs are JSON-formatted when `LOG_FORMAT=json` (the default). Each entry includes:

```json
{
  "timestamp": "2026-04-23T14:00:00.123Z",
  "level": "info",
  "message": "secret read",
  "request_id": "req_abc",
  "tenant_id": "tnt_xyz",
  "identity_id": "sa:reporting",
  "path": "integrations/salesforce/oauth",
  "duration_ms": 12,
  "status": 200
}
```

Key fields for filtering:

- `request_id` — trace a single call across components.
- `tenant_id`, `identity_id` — scope to customer or account.
- `level` — `debug`, `info`, `warn`, `error`.
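
Filtering on these fields takes only a few lines of code. The sketch below assumes logs arrive as one JSON object per line, which is a common convention but is not confirmed here; `filter_logs` is a hypothetical helper and the sample entries are made up for illustration.

```python
import json


def filter_logs(lines, **criteria):
    """Yield parsed log entries whose fields equal every given criterion.

    Assumes one JSON object per line. Blank lines and lines that are
    not valid JSON objects are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        if all(entry.get(key) == value for key, value in criteria.items()):
            yield entry


# Illustrative sample data, not real ScaiVault output.
sample = [
    '{"level": "info", "message": "secret read", "request_id": "req_abc", "tenant_id": "tnt_xyz", "status": 200}',
    '{"level": "error", "message": "kms timeout", "request_id": "req_def", "tenant_id": "tnt_xyz", "status": 503}',
    "not json",
]

errors_for_tenant = list(filter_logs(sample, tenant_id="tnt_xyz", level="error"))
print([entry["request_id"] for entry in errors_for_tenant])  # ['req_def']
```

Passing `request_id="req_abc"` instead narrows the stream to a single call, which is usually the first step when following one request across components.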

## Distributed tracing

ScaiVault emits OpenTelemetry spans if `OTEL_EXPORTER_OTLP_ENDPOINT` is set.

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_SERVICE_NAME=scaivault
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
```

Spans cover: incoming HTTP requests, DB queries, Redis calls, KMS operations, outbound HTTP (ScaiKey, webhooks, federated backends, ACME). Client-supplied `X-Request-ID` becomes the trace ID when present.
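
To correlate a client call with its trace, the client can choose the `X-Request-ID` itself. The following is a minimal sketch with a hypothetical `build_traced_request` helper and a placeholder URL. It generates a 32-character hex value because that is the width of an OpenTelemetry trace ID; whether ScaiVault requires that format is an assumption, not something stated here.

```python
import urllib.request
import uuid


def build_traced_request(url, request_id=None):
    """Build a GET request carrying a client-chosen X-Request-ID header."""
    # Assumption: a 32-character hex string is an acceptable value.
    # The required format is not documented on this page.
    request_id = request_id or uuid.uuid4().hex
    request = urllib.request.Request(url, method="GET")
    request.add_header("X-Request-ID", request_id)
    return request, request_id


# Placeholder URL. The request is only constructed here, not sent.
request, request_id = build_traced_request("http://localhost:8200/v1/health")
print(request_id)
```

After sending the request with `urllib.request.urlopen(request)`, searching the tracing backend or the `request_id` log field for the printed value should locate the call.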

## Audit-driven alerts

Some signals are only in the audit log:

- **Spike in `policy_violation` events.** Someone or something is attempting access that policy denies. Investigate.
- **Reads of a "dormant" secret.** If a secret hasn't been read in months and suddenly is, find out why.
- **New identity reading a sensitive path.** Pair with ownership metadata and alert on unexpected readers.

Export the audit log to S3 with `POST /v1/audit/export`, ingest it into your SIEM from there, and run the detection in the SIEM. ScaiVault's audit endpoints are not designed for high-QPS detection traffic, so export and query elsewhere.

## Dashboards

Useful panels to start with:

1. **Request rate and status** — stacked by path category and 2xx/4xx/5xx.
2. **P50/P95/P99 latency** — per endpoint category.
3. **Active leases** — by engine. Watch for runaway growth (usually means a client isn't revoking).
4. **Rotation queue depth** — should be near zero; sustained growth is a misconfiguration somewhere.
5. **Certificates expiring in the next 30 days** — counts and a table of which.
6. **Webhook success rate (24h)** — per webhook. Below 95% is worth investigating.
7. **Top readers** — identity-keyed, last 1h. Catches changes in traffic shape.

## What's next

- [Troubleshooting](./troubleshooting) — common issues and fixes.
- [Deployment](./deployment).