Health and Monitoring
Liveness, readiness, metrics, and the signals that actually matter when something breaks.
Health endpoints
GET /api/v1/health
Liveness check. No authentication, no dependencies checked.
Returns 200 as long as the process is up and serving HTTP. Use this for container-level liveness probes.
GET /api/v1/ready
Readiness check. No authentication. Hits every dependency.
Returns 200 only when all checks pass; otherwise 503. Use this for container-level readiness probes: during startup or a dependency outage, readiness fails and the load balancer stops routing traffic to the instance.
weaviate is optional; if semantic search is disabled, the field is omitted rather than reported false.
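The 200/503 contract above can be checked from a script as well as from a container probe. The following is a minimal sketch using only the Python standard library; the function names, the base URL argument, and the "unknown" and "unreachable" labels are illustrative assumptions, not part of ScaiDrive.

```python
import urllib.error
import urllib.request


def classify_readiness(status_code: int) -> str:
    """Map a /api/v1/ready status code to a routing decision."""
    if status_code == 200:
        return "ready"      # every dependency check passed
    if status_code == 503:
        return "not_ready"  # at least one check failed; stop routing
    return "unknown"        # not a status the readiness contract describes


def probe(base_url: str, timeout: float = 2.0) -> str:
    """Call the readiness endpoint and classify the result."""
    url = f"{base_url}/api/v1/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_readiness(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises on non-2xx responses; the status code is still available.
        return classify_readiness(exc.code)
    except OSError:
        # Connection refused, DNS failure, or timeout.
        return "unreachable"
```

Calling `probe("http://localhost:8000")` (a hypothetical address) would return one of the four labels; only "ready" should keep an instance in rotation.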
GET /api/v1/search/health
Checks the vectorization subsystem specifically. Useful when semantic search is intermittent and you want to isolate the cause.
Prometheus metrics
Exposed at /metrics (authentication depends on your ingress; typically scrape-only).
Key metrics:
Request metrics
- scaidrive_requests_total{method,path,status}: counter
- scaidrive_request_duration_seconds{method,path}: histogram
- scaidrive_rate_limit_rejections_total{dimension}: counter
Sync metrics
- scaidrive_sync_changes_fetched_total{share_id}: counter
- scaidrive_sync_conflicts_total{share_id,type}: counter
- scaidrive_sync_websocket_connections: gauge
- scaidrive_sync_cursor_lag_seconds{device_id}: gauge
Storage metrics
- scaidrive_uploads_bytes_total{tenant_id}: counter
- scaidrive_downloads_bytes_total{tenant_id}: counter
- scaidrive_chunks_deduplicated_total{tenant_id}: counter
- scaidrive_storage_bytes{tenant_id,type}: gauge; type is one of files, versions, trash
Queue metrics (workers)
- scaidrive_queue_depth{queue}: gauge
- scaidrive_queue_jobs_processed_total{queue,result}: counter; result is one of ok, error, retry
- scaidrive_queue_job_duration_seconds{queue,task}: histogram
Vectorization metrics
- scaidrive_vectorization_chunks_indexed_total: counter
- scaidrive_vectorization_pending: gauge
- scaidrive_vectorization_errors_total{provider}: counter
Connector metrics
- scaidrive_connector_syncs_total{type,status}: counter; type is one of smb, sharepoint
- scaidrive_connector_files_synced_total{connector_id}: counter
- scaidrive_connector_errors_total{connector_id}: counter
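To illustrate how the status label on scaidrive_requests_total can be used, here is a rough sketch that parses Prometheus text exposition output and groups request counts by status class. The sample values and the /api/v1/files path are invented for the example, and the parser is deliberately simplified (it assumes label values contain no closing brace or escaped quotes); a real setup would normally do this in PromQL or with a Prometheus client library.

```python
import re
from collections import defaultdict

# Illustrative sample only; the numbers and path are made up.
SAMPLE = """\
# TYPE scaidrive_requests_total counter
scaidrive_requests_total{method="GET",path="/api/v1/files",status="200"} 980
scaidrive_requests_total{method="POST",path="/api/v1/files",status="201"} 15
scaidrive_requests_total{method="GET",path="/api/v1/files",status="503"} 5
"""

# name{labels} value  (simplified: no timestamps, no braces inside label values)
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def requests_by_status_class(text: str) -> dict:
    """Sum scaidrive_requests_total samples per status class (2xx, 5xx, ...)."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match or match.group("name") != "scaidrive_requests_total":
            continue  # comments, other metrics, or lines this sketch cannot parse
        labels = dict(LABEL.findall(match.group("labels")))
        status = labels.get("status", "")
        totals[status[:1] + "xx"] += float(match.group("value"))
    return dict(totals)


def error_ratio(text: str) -> float:
    """Share of all counted requests that ended in a 5xx status."""
    totals = requests_by_status_class(text)
    total = sum(totals.values())
    return totals.get("5xx", 0.0) / total if total else 0.0
```

Because counters are cumulative, this ratio covers everything since the process started. An alerting rule would compare the increase between two scrapes instead.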
Logs
Logs are structured and written to stdout, as JSON in production or pretty-printed in dev (SCAIDRIVE_LOG_LEVEL=DEBUG).
Every log record includes:
- request_id: correlates with the X-Request-Id response header.
- tenant_id, user_id: present when a request has them.
- level: debug, info, warning, error, critical.
- event: a terse event name like file.uploaded or auth.rejected.
Searchability: in a log aggregator, you should be able to go from a support ticket's request ID to every log line for that request in one query.
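As a small illustration of the request-ID lookup described above, the sketch below filters JSON log lines by request_id. The sample records and the "req-1" style IDs are invented; only the field names request_id, level, and event come from the list above. In practice the same filter would be a single query in your log aggregator.

```python
import json


def lines_for_request(log_lines, request_id):
    """Return the parsed JSON log records that belong to one request."""
    matches = []
    for raw in log_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines, e.g. pretty-printed dev output
        if isinstance(record, dict) and record.get("request_id") == request_id:
            matches.append(record)
    return matches


# Hypothetical sample lines for demonstration.
SAMPLE_LOGS = [
    '{"request_id": "req-1", "level": "info", "event": "file.uploaded"}',
    '{"request_id": "req-2", "level": "warning", "event": "auth.rejected"}',
    "not json",
    '{"request_id": "req-1", "level": "error", "event": "file.uploaded"}',
]
```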
What to alert on
Tier-1 alerts (page on-call):
- /api/v1/ready returns non-200 for >2 minutes.
- P95 request latency >5s for >5 minutes.
- Error rate (5xx) >1% for >5 minutes.
- Queue depth on high priority >1000 and growing for >10 minutes.
- MariaDB or Redis unreachable from the API pod.
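Each of the threshold alerts above pairs a limit with a duration, so a single bad sample should not page anyone. The sketch below shows one way to express "above the threshold for longer than N seconds" over a list of timestamped samples. The function name and sample format are assumptions made for illustration; an alerting system such as Prometheus Alertmanager would normally handle this for you.

```python
def sustained_above(samples, threshold, min_duration_s):
    """Return True if the value has stayed above `threshold` for longer than
    `min_duration_s`, up to and including the most recent sample.

    `samples` is a list of (timestamp_seconds, value) pairs, oldest first.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of the current breach
        else:
            breach_start = None  # any recovery resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > min_duration_s
```

For the 5xx rule, `sustained_above(error_rate_samples, 0.01, 300)` would fire only once the error rate has exceeded 1% continuously for more than five minutes.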
Tier-2 alerts (ticket, no page):
- Connector sync failures on any connector for >1 hour.
- Vectorization queue growing for >6 hours without drain.
- Any tenant hitting QUOTA_EXCEEDED >100 times in 10 minutes (indicates a stuck client).
- WebSocket connection count dropping >30% in 5 minutes (possible LB config issue).
Dashboards
A minimum-viable production dashboard:
- Traffic — request rate, latency P50/P95/P99, error rate.
- Uploads — bytes/sec in, dedup ratio, session age distribution.
- Downloads — bytes/sec out, cache hit rate (if a CDN is in front).
- Sync — active WebSocket connections, changes/sec, cursor lag.
- Queue — depth per queue, job duration P95, failure rate.
- Storage — total used, per-tenant pie chart, dedup savings.
The repository's deploy/grafana/ directory ships starter dashboards matching the metric names above.
Tracing
ScaiDrive emits OpenTelemetry spans when SCAIDRIVE_OTEL_ENABLED=true. The exporter must also be configured.
Spans cover HTTP handlers, DB queries, S3 operations, and outbound calls (ScaiKey, Weaviate). The span carrying the request ID ties into your log aggregator.
Audit events as monitoring signal
Audit events aren't just for compliance. They're also an operational signal:
- Sudden spike in file_access.download by one user → possible data exfiltration.
- compliance.dlp_violation with severity high → should page security.
- admin_action from non-admin IP → suspicious.
See Enterprise Compliance for the audit API. Stream events to your SIEM and alert there.
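The download-spike signal can be approximated with a per-user count over a time window. The sketch below assumes each audit event is a dict with event, user_id, and timestamp keys; that shape is hypothetical and is not taken from the audit API, so adapt the field names to what your SIEM actually receives.

```python
from collections import Counter


def download_spikes(events, window_start, window_end, max_per_user):
    """Return {user_id: count} for users whose file_access.download count
    inside [window_start, window_end) exceeds max_per_user.

    The event shape (keys "event", "user_id", "timestamp") is an assumption.
    """
    counts = Counter(
        e["user_id"]
        for e in events
        if e["event"] == "file_access.download"
        and window_start <= e["timestamp"] < window_end
    )
    return {user: n for user, n in counts.items() if n > max_per_user}
```

A fixed threshold is the simplest choice. Comparing against each user's own baseline would cut false positives for users who legitimately download a lot.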