---
title: Health and Monitoring
path: troubleshooting/health-and-monitoring
status: published
---

# Health and Monitoring

What to observe in a running ScaiSend deployment, and how to tell when something's wrong. The short version: three process families to watch, four metric families to alert on, and two health endpoints.

## Health endpoints

| Endpoint | Returns |
|----------|---------|
| `GET /health` | `200 {"status": "healthy"}` if the API process is alive |
| `GET /ready` | `200 {"status": "ready"}` if DB is reachable; `503` otherwise |

Point your load-balancer health probes at `/ready`. In Kubernetes:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  periodSeconds: 10
```

The Worker and SMTP services don't expose HTTP endpoints, so they can't be probed this way. Rely on a process supervisor (systemd, Kubernetes) to restart them.
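For environments without native HTTP probes (a post-deploy smoke test, for example), the same two checks can be scripted. This is a minimal sketch using only the standard library. The function names `interpret` and `probe` are illustrative, and the base URL is an assumption taken from the port in the YAML above.

```python
import json
import urllib.error
import urllib.request


def interpret(status, raw_body, expected):
    """True when the endpoint answered 200 with a JSON body {"status": expected}."""
    if status != 200:
        return False
    try:
        return json.loads(raw_body).get("status") == expected
    except (ValueError, AttributeError):  # not JSON, or JSON that is not an object
        return False


def probe(base_url, path, expected, timeout=2.0):
    """Fetch one health endpoint; any connection failure counts as unhealthy."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return interpret(resp.status, resp.read(), expected)
    except urllib.error.HTTPError as err:  # urllib raises on non-2xx, e.g. 503 from /ready
        return interpret(err.code, err.read(), expected)
    except OSError:  # connection refused, DNS failure, timeout
        return False


# Example (assumed base URL):
#   probe("http://localhost:8000", "/health", "healthy")
#   probe("http://localhost:8000", "/ready", "ready")
```

Treating every failure mode as "unhealthy" keeps the caller simple: it only needs a boolean to decide whether to route traffic.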

## Process liveness

Every service logs to stdout. Keep an eye on:

- **API:** request volume, 5xx rate, latency percentiles.
- **Worker:** queue depth on `email:process`, job completion rate, render errors.
- **SMTP:** outbound connection success rate, retry volume, bounce rate.

If you're running under systemd, `journalctl -u scaisend-api -f` is the minimal tail. For real monitoring, ship logs to your log aggregator and add structured alerts.

## Core metrics to alert on

Four signals cover most real incidents:

### 1. API error rate

```
5xx_responses / total_responses
```

Alert if > 0.5% over 5 minutes. A steady stream of 5xx usually means MySQL or Redis is unhealthy, or the API is OOMing.
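If your monitoring stack doesn't compute this ratio for you, a sliding window over access-log entries is enough. The sketch below assumes you can feed it a timestamp and status code per request. The class name and its defaults (300 seconds, 0.5%) are illustrative and simply mirror the threshold above.

```python
from collections import deque


class ErrorRateWindow:
    """Sliding-window 5xx ratio over the last `window_seconds` of requests."""

    def __init__(self, window_seconds=300, threshold=0.005):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # (timestamp, is_5xx), oldest first
        self._errors = 0

    def record(self, timestamp, status_code):
        is_5xx = 500 <= status_code <= 599
        self._events.append((timestamp, is_5xx))
        self._errors += is_5xx
        self._evict(timestamp)

    def _evict(self, now):
        # Drop requests that fell out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_5xx = self._events.popleft()
            self._errors -= was_5xx

    def rate(self):
        total = len(self._events)
        return self._errors / total if total else 0.0

    def should_alert(self):
        return self.rate() > self.threshold
```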

### 2. Queue depth

```
llen(email:process) + llen(smtp:deliver) + llen(webhook:deliver)
```

Alert if queue depth grows monotonically for 15+ minutes, or if absolute depth exceeds a ceiling (e.g., 100k). Growing queues mean the Worker or SMTP services can't keep up. Check CPU, connection pool saturation, and SMTP retry storms.
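Both conditions can be evaluated from a series of periodic depth samples. The following is a sketch, not a prescribed implementation: `queue_alert` is a hypothetical helper, and "growing monotonically" is interpreted here as every sample being strictly higher than the one before it.

```python
def queue_alert(samples, ceiling=100_000, growth_window=15 * 60):
    """Decide whether total queue depth warrants an alert.

    samples: list of (unix_ts, total_depth) tuples, oldest first.
    Returns "ceiling", "growing", or None.
    """
    if not samples:
        return None
    latest_ts, latest_depth = samples[-1]
    if latest_depth > ceiling:
        return "ceiling"

    # Walk backward to find where the current strictly increasing run began.
    run_start = latest_ts
    for i in range(len(samples) - 1, 0, -1):
        if samples[i][1] <= samples[i - 1][1]:
            break
        run_start = samples[i - 1][0]

    if latest_ts - run_start >= growth_window:
        return "growing"
    return None
```

A single flat or falling sample resets the run, which keeps brief bursts from paging anyone.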

### 3. Bounce rate

```
GET /v3/stats/sum?start_date=today
# then compute: bounces / requests
```

Alert if > 2% for any single tenant. A high bounce rate is a deliverability emergency: ISPs start rate-limiting you within a day or two.

### 4. Spam-report rate

```
spam_reports / delivered
```

Alert if > 0.1% for any single tenant. Spam reports damage sender reputation fast. A sudden spike indicates either a mailing-list blunder (unexpected recipients) or a compromised API key (someone else sending from your tenant).
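The two per-tenant thresholds (bounce rate and spam-report rate) can be checked in one pass over per-tenant totals. This sketch assumes you have already collected those totals into a dict. The field names `requests`, `delivered`, `bounces`, and `spam_reports` follow the formulas above, but the actual shape of the stats response is an assumption here.

```python
BOUNCE_LIMIT = 0.02  # bounces / requests
SPAM_LIMIT = 0.001   # spam_reports / delivered


def tenant_alerts(stats_by_tenant):
    """Return (tenant_id, reason) pairs for tenants over either threshold.

    stats_by_tenant: {tenant_id: {"requests": int, "delivered": int,
                                  "bounces": int, "spam_reports": int}}
    (assumed shape; adapt to whatever your stats query returns).
    """
    alerts = []
    for tenant_id, s in sorted(stats_by_tenant.items()):
        requests = s.get("requests", 0)
        delivered = s.get("delivered", 0)
        # Skip tenants with no traffic to avoid dividing by zero.
        if requests and s.get("bounces", 0) / requests > BOUNCE_LIMIT:
            alerts.append((tenant_id, "bounce_rate"))
        if delivered and s.get("spam_reports", 0) / delivered > SPAM_LIMIT:
            alerts.append((tenant_id, "spam_report_rate"))
    return alerts
```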

## Queue depth probing

If you have Redis access:

```bash
redis-cli LLEN arq:queue:default       # default arq queue
redis-cli LLEN smtp:deliver            # outbound SMTP queue
redis-cli LLEN webhook:deliver         # webhook delivery queue
```

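Exact key names depend on your deployment config (check `REDIS_QUEUE_PREFIX` if you've customized it). arq keeps pending jobs in a sorted set rather than a list, so if `LLEN` fails with a `WRONGTYPE` error on the arq key, use `ZCARD` instead. If you're running arq in cluster mode, you'll have a queue key per worker.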
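For scripted probing, a small helper can pick the right length command per key. This is a sketch that assumes a redis-py-style client exposing `type`, `llen`, and `zcard`. The helper names are illustrative, and the key names are the ones shown above, which may differ in your deployment.

```python
def queue_depth(client, key):
    """Return the number of pending entries under `key`, whatever its Redis type."""
    key_type = client.type(key)
    if isinstance(key_type, bytes):  # redis-py returns bytes unless decode_responses=True
        key_type = key_type.decode()
    if key_type == "list":
        return client.llen(key)
    if key_type == "zset":  # arq stores its queue as a sorted set
        return client.zcard(key)
    if key_type == "none":  # an absent key means an empty queue
        return 0
    raise ValueError(f"{key!r} is a {key_type}, not a list or sorted set")


def total_depth(client, keys):
    """Sum the depth of several queues, e.g. for the alert in 'Queue depth' above."""
    return sum(queue_depth(client, k) for k in keys)
```

With redis-py installed, `total_depth(redis.Redis(), ["arq:queue:default", "smtp:deliver", "webhook:deliver"])` would return the combined depth, assuming those key names match your configuration.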

## Database query hot-spots

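The MySQL process list shows which statements are running right now and how long each has been running. It is a point-in-time snapshot, so run it a few times while the problem is happening; for historical data, use the slow query log.

```sql
-- Statements executing right now, grouped by their first 80 characters.
SELECT SUBSTRING(info, 1, 80) AS query,
       COUNT(*)               AS running,
       AVG(time)              AS avg_seconds
FROM information_schema.processlist
WHERE command = 'Query'
  AND info IS NOT NULL
GROUP BY query
ORDER BY avg_seconds DESC
LIMIT 10;
```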

Common issues:

- **`email_events` inserts bottleneck.** Every `delivered` / `open` / `click` is a row. For high-volume tenants, partition by month or archive old events.
- **`email_messages` list queries with unfiltered subject search.** Add a composite index on `(tenant_id, created_at)` so the subject filter scans only one tenant's rows instead of the whole table.
- **Stats aggregation under load.** If `/v3/stats/rebuild` is running concurrently with live sends, writes to `daily_stats` can contend. Run rebuilds off-peak.

## SMTP deliverability metrics

Beyond the ScaiSend API metrics, watch:

- **Postmaster Tools (Gmail):** spam rate, reputation, authentication success.
- **Microsoft SNDS:** reputation bucket (Green / Yellow / Red), complaint rate.
- **Your outbound IP's RBL status:** use an [MXToolbox-like service](https://mxtoolbox.com/blacklists.aspx) to check whether you've been added to any blocklist.

Drops in reputation are leading indicators — they show up days before your bounce rate climbs. Catch them early.

## Logging conventions

ScaiSend logs structured JSON by default. Every log entry includes:

| Field | Purpose |
|-------|---------|
| `level` | debug, info, warning, error, critical |
| `message` | Human-readable |
| `request_id` | Set on API requests; correlate with `X-Request-ID` header |
| `tenant_id` | When in tenant context |
| `message_id` | When relevant to a specific ScaiSend message |

Search by `request_id` when triaging a user-reported issue — they've likely captured the header.
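Without a log aggregator, the same lookup works on raw log output. The sketch below filters JSON log lines by any of the fields in the table above. `matching_entries` is a hypothetical helper name, and non-JSON lines such as tracebacks are skipped.

```python
import json


def matching_entries(lines, **wanted):
    """Yield parsed log entries whose fields equal every key=value in `wanted`."""
    for line in lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # not a JSON log line (e.g. a traceback); skip it
        if isinstance(entry, dict) and all(entry.get(k) == v for k, v in wanted.items()):
            yield entry


# Example: read a saved log file and print one request's trail.
#   with open("api.log") as fh:
#       for e in matching_entries(fh, request_id="<value of X-Request-ID>"):
#           print(e.get("level"), e.get("message"))
```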

## Key operational metrics by service

### API service

| Metric | Source | Normal |
|--------|--------|--------|
| RPS | access log | depends on fleet size; track trend |
| p99 latency | access log | `/v3/mail/send` < 200ms; reads < 100ms |
| 5xx rate | access log | < 0.1% |
| Active connections | load balancer | < max_connections |
| Memory RSS | process stats | < 2 GB per process |
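If your access log records a duration per request, latency percentiles can be derived directly from it. This sketch uses the nearest-rank method, which is one of several percentile definitions; your metrics backend may interpolate and give slightly different numbers. The function names are illustrative.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return p50, p95, and p99 for a batch of request durations in milliseconds."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Compute the summary per route rather than across the whole API, since the targets above differ between `/v3/mail/send` and read endpoints.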

### Worker service

| Metric | Source | Normal |
|--------|--------|--------|
| Jobs/sec processed | log output | depends on tenant volume |
| Render error rate | log output | < 0.01% |
| Webhook delivery success | `webhook_deliveries` table | > 99% 2xx |
| Average job time | log output | < 500ms per message (template render) |

### SMTP service

| Metric | Source | Normal |
|--------|--------|--------|
| Outbound connections/sec | log output | depends on send volume |
| MX resolution failures | log output | < 0.01% |
| TLS handshake failures | log output | < 0.1% |
| Retry queue depth | Redis | stable; growing means upstream problems |
| DKIM signing latency | log output | < 10ms |
| Inbound DSN rate | log output | track trend; sudden spikes = recent send went bad |
| Inbound FBL rate | log output | track trend; spikes = spam-complaint problem |

## Dashboards you'll want

Build at minimum:

1. **Send rate and delivery rate**: requests/sec and delivered/sec as side-by-side lines.
2. **Bounce rate, per tenant**: stacked area by tenant.
3. **Spam report rate, per tenant**: same chart style, with the 0.1% threshold line highlighted.
4. **Queue depth**: all three queues on one chart.
5. **API latency**: p50, p95, p99 for `/v3/mail/send` specifically.
6. **Open/click rates**: aggregate open and click rates across tracked messages.

Grafana with MySQL and Redis datasources is enough. If you're on Prometheus, expose an exporter for the `daily_stats` table and scrape it.

## Related

- [Deployment](deployment) — what you're monitoring.
- [Troubleshooting](index) — what to do when an alert fires.
- [Rate Limiting](../concepts/rate-limiting) — 429s as a signal.
