Health and Monitoring

Liveness, readiness, metrics, and the signals that actually matter when something breaks.

Health endpoints

GET /api/v1/health

Liveness check. No authentication, no dependencies checked.

Response:

json
{"status": "healthy", "version": "0.1.0"}

Returns 200 as long as the process is up and serving HTTP. Use this for container-level liveness probes.

GET /api/v1/ready

Readiness check. No authentication. Hits every dependency.

Response:

json
{
  "status": "ready",
  "checks": {
    "database": true,
    "redis": true,
    "storage": true,
    "scaikey": true,
    "weaviate": true
  }
}

Returns 200 only when all checks pass; otherwise 503. Use this for container-level readiness probes: during startup or a dependency outage, readiness fails and the load balancer stops routing traffic to the instance.

weaviate is optional; if semantic search is disabled, the field is omitted rather than reported false.
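
When readiness fails, the first question is usually which dependency is down. The following is a minimal sketch of how a probe script might answer that, assuming the 503 body has the same shape as the 200 body shown above. The source only shows the passing response, so the "not_ready" status value and the failing payload in the sample are assumptions, and failing_checks is a hypothetical helper name.

```python
import json


def failing_checks(body: str) -> list[str]:
    """Return the names of dependency checks that reported false.

    A check that is absent from the payload (for example weaviate when
    semantic search is disabled) is treated as not applicable, not as failed.
    """
    checks = json.loads(body).get("checks", {})
    return sorted(name for name, ok in checks.items() if not ok)


# Assumed shape of a failing response; only the passing shape is documented.
sample = (
    '{"status": "not_ready", "checks": '
    '{"database": true, "redis": false, "storage": true, "scaikey": true}}'
)
print(failing_checks(sample))
```

Because an omitted weaviate field is simply not iterated over, a deployment with semantic search disabled does not produce a false alarm.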

GET /api/v1/search/health

Checks the vectorization subsystem specifically. Useful when semantic search is intermittent and you want to isolate the fault.

Prometheus metrics

Exposed at /metrics (authentication depends on your ingress; typically scrape-only).

Key metrics:

Request metrics

  • scaidrive_requests_total{method,path,status} — counter
  • scaidrive_request_duration_seconds{method,path} — histogram
  • scaidrive_rate_limit_rejections_total{dimension} — counter

Sync metrics

  • scaidrive_sync_changes_fetched_total{share_id} — counter
  • scaidrive_sync_conflicts_total{share_id,type} — counter
  • scaidrive_sync_websocket_connections — gauge
  • scaidrive_sync_cursor_lag_seconds{device_id} — gauge

Storage metrics

  • scaidrive_uploads_bytes_total{tenant_id} — counter
  • scaidrive_downloads_bytes_total{tenant_id} — counter
  • scaidrive_chunks_deduplicated_total{tenant_id} — counter
  • scaidrive_storage_bytes{tenant_id,type} — gauge. type = files, versions, trash

Queue metrics (workers)

  • scaidrive_queue_depth{queue} — gauge
  • scaidrive_queue_jobs_processed_total{queue,result} — counter. result = ok, error, retry
  • scaidrive_queue_job_duration_seconds{queue,task} — histogram

Vectorization metrics

  • scaidrive_vectorization_chunks_indexed_total — counter
  • scaidrive_vectorization_pending — gauge
  • scaidrive_vectorization_errors_total{provider} — counter

Connector metrics

  • scaidrive_connector_syncs_total{type,status} — counter. type = smb, sharepoint
  • scaidrive_connector_files_synced_total{connector_id} — counter
  • scaidrive_connector_errors_total{connector_id} — counter
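
For ad hoc checks without a Prometheus server, the /metrics output can be read directly. The sketch below sums one counter by one label from text in the Prometheus exposition format. It is a simplified parser, not a full implementation of the format, and the sample values, HELP text, and the /api/v1/files path are made up for illustration.

```python
import re
from collections import defaultdict

# metric_name{labels} value [timestamp]; the labels part is optional.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# key="value" pairs, allowing backslash-escaped characters inside the value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(text: str, metric: str, label: str) -> dict[str, float]:
    """Sum all samples of `metric`, grouped by the value of `label`."""
    totals: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)


# Illustrative scrape output; values and path are invented.
SAMPLE = """\
# HELP scaidrive_requests_total Total HTTP requests.
# TYPE scaidrive_requests_total counter
scaidrive_requests_total{method="GET",path="/api/v1/files",status="200"} 120
scaidrive_requests_total{method="POST",path="/api/v1/files",status="200"} 30
scaidrive_requests_total{method="GET",path="/api/v1/files",status="503"} 3
scaidrive_sync_websocket_connections 42
"""
print(sum_by_label(SAMPLE, "scaidrive_requests_total", "status"))
```

Counters are cumulative, so a single scrape gives totals since process start; rates need the difference between two scrapes.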

Logs

Structured logs to stdout in JSON (production) or pretty-printed (dev, SCAIDRIVE_LOG_LEVEL=DEBUG).

Every log record includes:

  • request_id — correlates with X-Request-Id response header.
  • tenant_id, user_id — when a request has them.
  • level: debug, info, warning, error, or critical.
  • event — a terse event name like file.uploaded or auth.rejected.

Searchability: in a log aggregator, you should be able to go from a support ticket's request ID to every log line for that request in one query.
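
Outside an aggregator, the same correlation can be done on raw JSON log lines. This is a small sketch under the assumption that each production log line is one JSON object carrying the fields listed above; the request IDs and other field values in the sample are invented, and lines_for_request is a hypothetical helper.

```python
import json


def lines_for_request(log_lines, request_id):
    """Return the parsed log records whose request_id matches."""
    matched = []
    for raw in log_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines, e.g. pretty-printed dev output
        if isinstance(record, dict) and record.get("request_id") == request_id:
            matched.append(record)
    return matched


# Invented sample lines using the documented field names.
logs = [
    '{"request_id": "req-123", "tenant_id": "t-1", "user_id": "u-9", "level": "info", "event": "file.uploaded"}',
    '{"request_id": "req-456", "level": "warning", "event": "auth.rejected"}',
    "this line is not JSON",
    '{"request_id": "req-123", "level": "warning", "event": "auth.rejected"}',
]
for record in lines_for_request(logs, "req-123"):
    print(record["level"], record["event"])
```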

What to alert on

Tier-1 alerts (page on-call):

  • /api/v1/ready returns non-200 for >2 minutes.
  • P95 request latency >5s for >5 minutes.
  • Error rate (5xx) >1% for >5 minutes.
  • Queue depth on high priority >1000 and growing for >10 minutes.
  • MariaDB or Redis unreachable from the API pod.
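
The "for >5 minutes" conditions above are sustained-breach checks, not single-sample checks. In practice these would normally be written as alerting rules in your monitoring system; the Python below only sketches the logic for the 5xx error-rate alert, assuming per-status totals of scaidrive_requests_total taken from two scrapes and one rate reading per minute. Function names and sample numbers are hypothetical.

```python
def error_rate(before: dict[str, float], after: dict[str, float]) -> float:
    """Fraction of requests between two counter snapshots that returned 5xx.

    Both arguments map an HTTP status code (as a string) to a cumulative count.
    """
    total = 0.0
    errors = 0.0
    for status, count in after.items():
        delta = count - before.get(status, 0.0)
        if delta < 0:  # counter went backwards: process restarted, count from zero
            delta = count
        total += delta
        if status.startswith("5"):
            errors += delta
    return errors / total if total else 0.0


def breached_for(rates: list[float], threshold: float, samples: int) -> bool:
    """True when the most recent `samples` readings all exceed the threshold."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    recent = rates[-samples:]
    return len(recent) == samples and all(r > threshold for r in recent)


rate = error_rate({"200": 1000, "500": 5}, {"200": 1980, "500": 25})
print(f"{rate:.3f}")

# One reading per minute; page only if the last five all exceed 1%.
per_minute = [0.002, 0.02, 0.03, 0.015, 0.02, 0.011]
print(breached_for(per_minute, threshold=0.01, samples=5))
```

Requiring every recent sample to exceed the threshold keeps a single bad scrape from paging on-call.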

Tier-2 alerts (ticket, no page):

  • Connector sync failures on any connector for >1 hour.
  • Vectorization queue growing for >6 hours without drain.
  • Any tenant hitting QUOTA_EXCEEDED >100 times in 10 minutes (indicates a stuck client).
  • WebSocket connection count dropping >30% in 5 minutes (possible LB config issue).
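
The QUOTA_EXCEEDED rule is a count-in-window condition. A sliding-window counter is one common way to evaluate it; the sketch below keeps one counter per tenant and is an illustration of the rule, not ScaiDrive's implementation. The class name and timestamps are hypothetical.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events that fall within the trailing window_seconds."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._times: deque[float] = deque()

    def record(self, timestamp: float) -> int:
        """Record one event and return how many remain inside the window.

        Timestamps are expected in non-decreasing order.
        """
        self._times.append(timestamp)
        cutoff = timestamp - self.window_seconds
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()  # drop events that aged out of the window
        return len(self._times)


# 10-minute window, alert when a tenant exceeds 100 rejections.
counter = SlidingWindowCounter(window_seconds=600)
count = 0
for second in range(101):  # one QUOTA_EXCEEDED per second
    count = counter.record(float(second))
print(count > 100)
```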

Dashboards

A minimum-viable production dashboard:

  1. Traffic — request rate, latency P50/P95/P99, error rate.
  2. Uploads — bytes/sec in, dedup ratio, session age distribution.
  3. Downloads — bytes/sec out, cache hit rate (if a CDN is in front).
  4. Sync — active WebSocket connections, changes/sec, cursor lag.
  5. Queue — depth per queue, job duration P95, failure rate.
  6. Storage — total used, per-tenant pie chart, dedup savings.

The repository's deploy/grafana/ directory ships starter dashboards matching the metric names above.

Tracing

ScaiDrive emits OpenTelemetry spans when SCAIDRIVE_OTEL_ENABLED=true. Configure the exporter:

bash
SCAIDRIVE_OTEL_ENDPOINT=https://otel-collector.example.com:4317
SCAIDRIVE_OTEL_SERVICE_NAME=scaidrive-api

Spans cover HTTP handlers, DB queries, S3 operations, and outbound calls (ScaiKey, Weaviate). The span carrying the request ID ties into your log aggregator.

Audit events as monitoring signal

Audit events aren't just for compliance. They're also an operational signal:

  • Sudden spike in file_access.download by one user → possible data exfiltration.
  • compliance.dlp_violation with severity high → should page security.
  • admin_action from non-admin IP → suspicious.
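
The download-spike signal can be sketched as a per-user count over a time window. The audit event schema is not shown here, so the field names event_type, user_id, and timestamp (epoch seconds) are assumptions, as are the threshold and the sample data; in a real deployment this logic would more likely live in a SIEM rule.

```python
from collections import Counter


def download_spikes(events, window_start, window_end, threshold):
    """Return {user_id: count} for users exceeding `threshold` downloads
    with window_start <= timestamp < window_end.

    Each event is a dict with assumed keys: event_type, user_id, timestamp.
    """
    counts = Counter(
        event["user_id"]
        for event in events
        if event["event_type"] == "file_access.download"
        and window_start <= event["timestamp"] < window_end
    )
    return {user: n for user, n in counts.items() if n > threshold}


# Invented data: u-1 downloads 50 files in under a minute, u-2 downloads 2.
events = [
    {"event_type": "file_access.download", "user_id": "u-1", "timestamp": t}
    for t in range(50)
]
events += [
    {"event_type": "file_access.download", "user_id": "u-2", "timestamp": 10},
    {"event_type": "file_access.download", "user_id": "u-2", "timestamp": 20},
    {"event_type": "admin_action", "user_id": "u-2", "timestamp": 30},
]
print(download_spikes(events, window_start=0, window_end=300, threshold=20))
```

A fixed threshold is the simplest choice; comparing against each user's own baseline would reduce false positives for legitimately heavy users.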

See Enterprise Compliance for the audit API. Stream events to your SIEM and alert there.

Updated 2026-05-18 15:04:23, rev 2