---
title: Health and Monitoring
path: operations/health-and-monitoring
status: published
---

Liveness, readiness, metrics, and the signals that actually matter when something breaks.

## Health endpoints

### GET /api/v1/health

Liveness check. No authentication, no dependencies checked.

**Response:**

```json
{"status": "healthy", "version": "0.1.0"}
```

Returns 200 as long as the process is up and serving HTTP. Use this for container-level liveness probes.

### GET /api/v1/ready

Readiness check. No authentication. Hits every dependency.

**Response:**

```json
{
  "status": "ready",
  "checks": {
    "database": true,
    "redis": true,
    "storage": true,
    "scaikey": true,
    "weaviate": true
  }
}
```

Returns 200 only when all checks pass; otherwise 503. Use this for container-level readiness probes: during startup or a dependency outage, readiness goes false and the load balancer stops routing traffic to the instance.

`weaviate` is optional; if semantic search is disabled, the field is omitted rather than reported false.
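When readiness fails, the first question is usually which dependency is down. The sketch below polls the endpoint and lists the checks reported as `false`. It assumes the 503 response carries the same JSON shape as the 200 response shown above, which this page does not confirm; `failed_checks` and `fetch_readiness` are illustrative names, not part of ScaiDrive.

```python
import json
import urllib.error
import urllib.request


def failed_checks(payload: dict) -> list[str]:
    """Return the names of readiness checks that are reported as false."""
    checks = payload.get("checks", {})
    # An omitted check (e.g. weaviate with semantic search disabled)
    # is not treated as a failure.
    return sorted(name for name, ok in checks.items() if ok is False)


def fetch_readiness(base_url: str, timeout: float = 5.0) -> tuple[int, dict]:
    """GET /api/v1/ready and return (HTTP status, parsed JSON body)."""
    url = base_url.rstrip("/") + "/api/v1/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # urllib raises on 503. Assumption: the error body is still JSON
        # in the same shape as the success body.
        return err.code, json.load(err)
```

Calling `failed_checks` on the body returned by `fetch_readiness` gives a short list such as `["redis"]` that can go straight into an alert message.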

### GET /api/v1/search/health

Checks the vectorization subsystem specifically. Useful when semantic search is failing intermittently and you want to isolate the fault to that subsystem.

## Prometheus metrics

Exposed at `/metrics` (authentication depends on your ingress; typically scrape-only).

Key metrics:

**Request metrics**

- `scaidrive_requests_total{method,path,status}` — counter
- `scaidrive_request_duration_seconds{method,path}` — histogram
- `scaidrive_rate_limit_rejections_total{dimension}` — counter

**Sync metrics**

- `scaidrive_sync_changes_fetched_total{share_id}` — counter
- `scaidrive_sync_conflicts_total{share_id,type}` — counter
- `scaidrive_sync_websocket_connections` — gauge
- `scaidrive_sync_cursor_lag_seconds{device_id}` — gauge

**Storage metrics**

- `scaidrive_uploads_bytes_total{tenant_id}` — counter
- `scaidrive_downloads_bytes_total{tenant_id}` — counter
- `scaidrive_chunks_deduplicated_total{tenant_id}` — counter
- `scaidrive_storage_bytes{tenant_id,type}` — gauge. `type` = `files`, `versions`, `trash`

**Queue metrics** (workers)

- `scaidrive_queue_depth{queue}` — gauge
- `scaidrive_queue_jobs_processed_total{queue,result}` — counter. `result` = `ok`, `error`, `retry`
- `scaidrive_queue_job_duration_seconds{queue,task}` — histogram

**Vectorization metrics**

- `scaidrive_vectorization_chunks_indexed_total` — counter
- `scaidrive_vectorization_pending` — gauge
- `scaidrive_vectorization_errors_total{provider}` — counter

**Connector metrics**

- `scaidrive_connector_syncs_total{type,status}` — counter. `type` = `smb`, `sharepoint`
- `scaidrive_connector_files_synced_total{connector_id}` — counter
- `scaidrive_connector_errors_total{connector_id}` — counter
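As a quick sanity check outside of Prometheus, the request counter can be turned into a 5xx ratio by reading the `/metrics` text directly. This is a minimal sketch that assumes the standard Prometheus text exposition format and that `status` holds the numeric HTTP status code as a string; `parse_samples` and `error_ratio` are illustrative helpers, and the parser does not handle escaped quotes inside label values.

```python
import re

# One sample line: name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str, metric: str) -> list[tuple[dict, float]]:
    """Return (labels, value) pairs for every sample of the given metric."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((labels, float(m.group("value"))))
    return samples


def error_ratio(text: str) -> float:
    """Fraction of scaidrive_requests_total samples with a 5xx status."""
    samples = parse_samples(text, "scaidrive_requests_total")
    total = sum(value for _, value in samples)
    errors = sum(
        value for labels, value in samples
        if labels.get("status", "").startswith("5")
    )
    return errors / total if total else 0.0
```

Counters are cumulative, so this gives the ratio since the process started. For a windowed error rate, take two scrapes and apply the same calculation to the difference between them.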

## Logs

Logs are structured and written to stdout: JSON in production, pretty-printed in dev (`SCAIDRIVE_LOG_LEVEL=DEBUG`).

Every log record includes:

- `request_id` — correlates with `X-Request-Id` response header.
- `tenant_id`, `user_id` — when a request has them.
- `level` — `debug`, `info`, `warning`, `error`, `critical`.
- `event` — a terse event name like `file.uploaded` or `auth.rejected`.

Searchability: in a log aggregator, you should be able to go from a support ticket's request ID to every log line for that request in one query.
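Without an aggregator, the same lookup can be done on raw JSON log output. The sketch below filters a stream of log lines by `request_id`; it assumes one JSON object per line and silently skips anything that does not parse. `lines_for_request` is an illustrative name.

```python
import json
from typing import Iterable, Iterator


def lines_for_request(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield every parsed log record whose request_id matches."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as pretty-printed dev output
        if isinstance(record, dict) and record.get("request_id") == request_id:
            yield record
```

Because it accepts any iterable of strings, it works on an open file handle or on `sys.stdin` when piping container logs into it.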

## What to alert on

Tier-1 alerts (page on-call):

- `/api/v1/ready` returns non-200 for >2 minutes.
- P95 request latency >5s for >5 minutes.
- Error rate (5xx) >1% for >5 minutes.
- Queue depth on `high` priority >1000 and growing for >10 minutes.
- MariaDB or Redis unreachable from the API pod.

Tier-2 alerts (ticket, no page):

- Connector sync failures on any connector for >1 hour.
- Vectorization queue growing for >6 hours without drain.
- Any tenant hitting `QUOTA_EXCEEDED` >100 times in 10 minutes (indicates a stuck client).
- WebSocket connection count dropping >30% in 5 minutes (possible LB config issue).
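Every alert above has the form "condition holds for longer than N minutes", which avoids paging on a single bad sample. In a Prometheus setup this is usually expressed with the `for:` clause of an alerting rule; the sketch below shows the same logic in plain Python for custom checkers. `SustainedCondition` is a hypothetical helper, not something ScaiDrive ships.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SustainedCondition:
    """Fires once a condition has held continuously for longer than hold_seconds."""

    hold_seconds: float
    _since: Optional[float] = field(default=None, init=False, repr=False)

    def update(self, breached: bool, now: float) -> bool:
        """Record one observation; return True if the alert should fire."""
        if not breached:
            # Any healthy sample resets the timer.
            self._since = None
            return False
        if self._since is None:
            self._since = now
        return (now - self._since) > self.hold_seconds
```

For the readiness alert, create `SustainedCondition(hold_seconds=120)` and call `update(status != 200, time.time())` on each poll.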

## Dashboards

A minimum-viable production dashboard:

1. **Traffic** — request rate, latency P50/P95/P99, error rate.
2. **Uploads** — bytes/sec in, dedup ratio, session age distribution.
3. **Downloads** — bytes/sec out, cache hit rate (if a CDN is in front).
4. **Sync** — active WebSocket connections, changes/sec, cursor lag.
5. **Queue** — depth per queue, job duration P95, failure rate.
6. **Storage** — total used, per-tenant pie chart, dedup savings.

The repository's `deploy/grafana/` directory ships starter dashboards matching the metric names above.

## Tracing

ScaiDrive emits OpenTelemetry spans when `SCAIDRIVE_OTEL_ENABLED=true`. Configure the exporter:

```bash
SCAIDRIVE_OTEL_ENDPOINT=https://otel-collector.example.com:4317
SCAIDRIVE_OTEL_SERVICE_NAME=scaidrive-api
```

Spans cover HTTP handlers, DB queries, S3 operations, and outbound calls (ScaiKey, Weaviate). The request ID carried on the span lets you correlate a trace with the matching log lines in your log aggregator.

## Audit events as monitoring signal

Audit events aren't just for compliance. They're also an operational signal:

- Sudden spike in `file_access.download` by one user → possible data exfiltration.
- `compliance.dlp_violation` with severity `high` → should page security.
- `admin_action` from a non-admin IP → suspicious.

See [Enterprise Compliance](/docs/scaidrive/advanced/enterprise-compliance) for the audit API. Stream events to your SIEM and alert there.
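If your SIEM does not offer spike detection out of the box, the download case can be approximated with a per-user count against a baseline. This is a rough sketch: the `event_type` and `user_id` field names are assumptions about the audit event schema, and the `factor` and `min_count` thresholds are placeholders to tune.

```python
from collections import Counter
from typing import Iterable


def download_spikes(
    events: Iterable[dict],
    baseline: dict[str, float],
    factor: float = 10.0,
    min_count: int = 50,
) -> dict[str, int]:
    """Return users whose download count in this window far exceeds their baseline.

    events:   audit events for one time window (field names are assumed).
    baseline: typical downloads per window for each user.
    """
    counts = Counter(
        e["user_id"]
        for e in events
        if e.get("event_type") == "file_access.download" and "user_id" in e
    )
    return {
        user: n
        for user, n in counts.items()
        # Require an absolute floor so low-volume users do not trigger on noise.
        if n >= min_count and n > factor * baseline.get(user, 0.0)
    }
```

A user with no baseline entry is flagged as soon as they reach `min_count`, which is deliberate: a brand-new account downloading heavily is worth a look.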

## What's next

- [Deployment](/docs/scaidrive/operations/deployment)
- [Troubleshooting](/docs/scaidrive/operations/troubleshooting)
- [Rate Limiting](/docs/scaidrive/advanced/rate-limiting)