Health and Monitoring

Liveness, readiness, metrics, and the signals that actually matter when something breaks.

Health endpoints

GET /api/v1/health

Liveness check. No authentication, no dependencies checked.

Response:

json
{"status": "healthy", "version": "0.1.0"}

Returns 200 as long as the process is up and serving HTTP. Use this for container-level liveness probes.

GET /api/v1/ready

Readiness check. No authentication. Hits every dependency.

Response:

json
{
  "status": "ready",
  "checks": {
    "database": true,
    "redis": true,
    "storage": true,
    "scaikey": true,
    "weaviate": true
  }
}

Returns 200 only when all checks pass; otherwise 503. Use this for container-level readiness probes. During startup or a dependency outage, readiness goes false and the load balancer stops routing traffic to the instance.

weaviate is optional; if semantic search is disabled, the field is omitted rather than reported false.
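
When readiness fails, the first question is usually which dependency is down. The sketch below pulls the failing check names out of a readiness response. It assumes the 503 body has the same shape as the 200 body shown above, with the failing checks set to false; the function name failing_checks is hypothetical.

```python
import json


def failing_checks(status_code: int, body: str) -> list[str]:
    """Return the names of checks that are not true in a /api/v1/ready body.

    Assumption: a 503 response uses the same "checks" object as the 200
    response, with failing dependencies reported as false.
    """
    checks = json.loads(body).get("checks", {})
    # An omitted check (e.g. weaviate when semantic search is disabled)
    # is simply absent from the dict, so it is never counted as failing.
    failed = sorted(name for name, ok in checks.items() if ok is not True)
    # A non-200 response that names no failing check is still a failure.
    if status_code != 200 and not failed:
        failed = ["unknown"]
    return failed
```

Feeding the result into an alert annotation or a log line saves a round trip to the endpoint while the incident is in progress.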

GET /api/v1/search/health

Checks the vectorization subsystem specifically. Useful when semantic search is intermittent and you want to isolate the fault.

Prometheus metrics

Exposed at /metrics (authentication depends on your ingress; typically scrape-only).

Key metrics:

Request metrics

scaidrive_requests_total{method,path,status} — counter
scaidrive_request_duration_seconds{method,path} — histogram
scaidrive_rate_limit_rejections_total{dimension} — counter

Sync metrics

scaidrive_sync_changes_fetched_total{share_id} — counter
scaidrive_sync_conflicts_total{share_id,type} — counter
scaidrive_sync_websocket_connections — gauge
scaidrive_sync_cursor_lag_seconds{device_id} — gauge

Storage metrics

scaidrive_uploads_bytes_total{tenant_id} — counter
scaidrive_downloads_bytes_total{tenant_id} — counter
scaidrive_chunks_deduplicated_total{tenant_id} — counter
scaidrive_storage_bytes{tenant_id,type} — gauge. type = files, versions, trash

Queue metrics (workers)

scaidrive_queue_depth{queue} — gauge
scaidrive_queue_jobs_processed_total{queue,result} — counter. result = ok, error, retry
scaidrive_queue_job_duration_seconds{queue,task} — histogram

Vectorization metrics

scaidrive_vectorization_chunks_indexed_total — counter
scaidrive_vectorization_pending — gauge
scaidrive_vectorization_errors_total{provider} — counter

Connector metrics

scaidrive_connector_syncs_total{type,status} — counter. type = smb, sharepoint
scaidrive_connector_files_synced_total{connector_id} — counter
scaidrive_connector_errors_total{connector_id} — counter
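
For ad hoc checks without a Prometheus server, the /metrics output can be read directly. The sketch below is a deliberately small parser for the Prometheus text exposition format that extracts the samples of one metric. It is not a full implementation of the format (for production use, prometheus_client.parser.text_string_to_metric_families is the more robust option), and the function name metric_samples is hypothetical.

```python
import re

# metric_name{label="value",...} value [timestamp]
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def metric_samples(text: str, metric: str) -> list[tuple[dict, float]]:
    """Return (labels, value) pairs for every sample of `metric` in `text`."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        m = _SAMPLE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(m.group("labels") or ""))
        out.append((labels, float(m.group("value"))))
    return out
```

Note that a histogram such as scaidrive_request_duration_seconds is exposed as separate _bucket, _sum, and _count series, so those names have to be requested individually.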

Logs

Logs go to stdout as structured JSON (production) or pretty-printed text (dev, SCAIDRIVE_LOG_LEVEL=DEBUG).

Every log record includes:

request_id — correlates with X-Request-Id response header.
tenant_id, user_id — when a request has them.
level — debug, info, warning, error, critical.
event — a terse event name like file.uploaded or auth.rejected.

Searchability: in a log aggregator, you should be able to go from a support ticket's request ID to every log line for that request in one query.
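
Without an aggregator at hand, the same lookup can be done against raw JSON log output. This is a minimal sketch that assumes one JSON object per line, as in production mode, and relies only on the request_id field listed above; the function name records_for_request is hypothetical.

```python
import json
from typing import Iterable, Iterator


def records_for_request(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield every parsed log record whose request_id matches."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Not a JSON log line (e.g. pretty-printed dev output); skip it.
            continue
        if isinstance(record, dict) and record.get("request_id") == request_id:
            yield record
```

Because it accepts any iterable of strings, it works equally on an open file or on sys.stdin when piped from a container log command.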

What to alert on

Tier-1 alerts (page on-call):

/api/v1/ready returns non-200 for >2 minutes.
P95 request latency >5s for >5 minutes.
Error rate (5xx) >1% for >5 minutes.
Depth of the high-priority queue >1000 and growing for >10 minutes.
MariaDB or Redis unreachable from the API pod.

Tier-2 alerts (ticket, no page):

Connector sync failures on any connector for >1 hour.
Vectorization queue growing for >6 hours without drain.
Any tenant hitting QUOTA_EXCEEDED >100 times in 10 minutes (indicates a stuck client).
WebSocket connection count dropping >30% in 5 minutes (possible LB config issue).
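
Most of the conditions above have the form "value over a threshold for longer than N minutes". The sketch below shows that logic in isolation, similar in spirit to the for clause of a Prometheus alerting rule: a single dip below the threshold resets the timer. The function name breached_for and the (timestamp, value) sample shape are assumptions for illustration.

```python
from typing import Sequence, Tuple


def breached_for(
    samples: Sequence[Tuple[float, float]],
    threshold: float,
    min_seconds: float,
) -> bool:
    """True if the trailing run of samples above `threshold` spans more
    than `min_seconds`. Samples are (unix_timestamp, value) pairs and are
    assumed to be sorted by timestamp."""
    run_start = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts  # a new run of breaching samples begins
        else:
            run_start = None  # any sample at or below threshold resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start > min_seconds
```

For the 5xx rule, for example, the samples would be the error ratio computed once per scrape interval, with threshold=0.01 and min_seconds=300.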

Dashboards

A minimum-viable production dashboard:

Traffic — request rate, latency P50/P95/P99, error rate.
Uploads — bytes/sec in, dedup ratio, session age distribution.
Downloads — bytes/sec out, cache hit rate (if a CDN is in front).
Sync — active WebSocket connections, changes/sec, cursor lag.
Queue — depth per queue, job duration P95, failure rate.
Storage — total used, per-tenant pie chart, dedup savings.

The repository's deploy/grafana/ directory ships starter dashboards matching the metric names above.

Tracing

ScaiDrive emits OpenTelemetry spans when SCAIDRIVE_OTEL_ENABLED=true. Configure the exporter:

bash
SCAIDRIVE_OTEL_ENDPOINT=https://otel-collector.example.com:4317
SCAIDRIVE_OTEL_SERVICE_NAME=scaidrive-api

Spans cover HTTP handlers, DB queries, S3 operations, and outbound calls (ScaiKey, Weaviate). The span that carries the request ID links each trace to the matching lines in your log aggregator.

Audit events as monitoring signal

Audit events aren't just for compliance. They're also an operational signal:

Sudden spike in file_access.download by one user → possible data exfiltration.
compliance.dlp_violation with severity high → should page security.
admin_action from non-admin IP → suspicious.

See Enterprise Compliance for the audit API. Stream events to your SIEM and alert there.
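
As an illustration of the download-spike signal, the sketch below flags users whose download count in a window far exceeds their usual rate. The event field names (event_type, user_id), the baseline input, and the default thresholds are all assumptions made for the example; the real audit event schema is described in Enterprise Compliance, and in practice this rule would normally live in the SIEM.

```python
from collections import Counter
from typing import Iterable, Mapping


def download_spikes(
    events: Iterable[Mapping],
    baseline_per_user: Mapping[str, float],
    factor: float = 10.0,
    min_count: int = 50,
) -> list[str]:
    """Return users whose file_access.download count in this window is at
    least `min_count` and more than `factor` times their baseline.

    `event_type` and `user_id` are hypothetical field names.
    """
    counts = Counter(
        e["user_id"]
        for e in events
        if e.get("event_type") == "file_access.download" and e.get("user_id")
    )
    flagged = []
    for user, n in counts.items():
        expected = baseline_per_user.get(user, 0.0)
        # min_count keeps low-volume users from tripping the ratio test.
        if n >= min_count and n > factor * expected:
            flagged.append(user)
    return sorted(flagged)
```

The absolute floor matters as much as the ratio: a user with a baseline of zero would otherwise be flagged on their first download.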