Health and Monitoring
What to observe in a running ScaiSend deployment, and how to tell when something's wrong. The short version: three process families to watch, four metric families to alert on, and a handful of health endpoints.
Health endpoints
| Endpoint | Returns |
|---|---|
| GET /health | 200 {"status": "healthy"} if the API process is alive |
| GET /ready | 200 {"status": "ready"} if the DB is reachable; 503 otherwise |
Point your load-balancer health probes at /ready. In Kubernetes, configure an HTTP readiness probe against the same path.
The Worker and SMTP services don't expose HTTP. Rely on process supervisors (systemd, Kubernetes) to restart them.
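For environments without a built-in HTTP probe, the /ready contract from the table above can be checked with a short script. This is a minimal sketch, not part of ScaiSend: the function names `interpret_ready` and `probe_ready`, the `base_url` parameter, and the 2-second timeout are assumptions chosen for illustration.

```python
import json
import urllib.error
import urllib.request


def interpret_ready(status_code: int, body: bytes) -> bool:
    """Return True only for a 200 response whose JSON body is {"status": "ready"}."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # not JSON, or not decodable text
        return False
    return isinstance(payload, dict) and payload.get("status") == "ready"


def probe_ready(base_url: str, timeout: float = 2.0) -> bool:
    """GET <base_url>/ready; any network failure counts as not ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/ready", timeout=timeout) as resp:
            return interpret_ready(resp.status, resp.read())
    except urllib.error.HTTPError as err:  # e.g. 503 when the DB is unreachable
        return interpret_ready(err.code, err.read())
    except (urllib.error.URLError, OSError):  # refused connection, DNS failure, timeout
        return False
```

Treating every failure mode as "not ready" keeps the probe conservative: an instance only receives traffic when it has positively reported that its database is reachable.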
Process liveness
Every service logs to stdout. Keep an eye on:
- API: request volume, 5xx rate, latency percentiles.
- Worker: queue depth on email:process, job completion rate, render errors.
- SMTP: outbound connection success rate, retry volume, bounce rate.
If you're running under systemd, journalctl -u scaisend-api -f is the minimal tail. For real monitoring, ship logs to your log aggregator and add structured alerts.
Core metrics to alert on
Four signals cover most real incidents:
1. API error rate
Alert if > 0.5% over 5 minutes. A steady stream of 5xx usually means MySQL or Redis is unhealthy, or the API is OOMing.
2. Queue depth
Alert if queue depth is growing monotonically for 15+ minutes, or if absolute depth exceeds a ceiling (e.g., 100k). Growing queues mean workers or SMTP services can't keep up; check CPU, connection pool saturation, SMTP retry storms.
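The two queue-depth conditions can be expressed as a small check over periodic depth samples. This is a sketch under stated assumptions: samples are `(unix_timestamp, depth)` pairs sorted by time, "growing monotonically" is read as strictly increasing from one sample to the next, and the function name `queue_alert` is hypothetical. The 100k ceiling is the example figure from the text, not a recommended value.

```python
from typing import Optional, Sequence, Tuple

DEPTH_CEILING = 100_000          # example ceiling from the text; tune per deployment
GROWTH_WINDOW_SECONDS = 15 * 60  # "growing for 15+ minutes"


def queue_alert(samples: Sequence[Tuple[float, int]]) -> Optional[str]:
    """Return "ceiling", "growing", or None for time-ordered (timestamp, depth) samples."""
    if not samples:
        return None
    if samples[-1][1] > DEPTH_CEILING:
        return "ceiling"
    # Walk backwards while each sample is strictly larger than the one before it,
    # which finds the start of the current uninterrupted growth run.
    start = len(samples) - 1
    while start > 0 and samples[start][1] > samples[start - 1][1]:
        start -= 1
    if samples[-1][0] - samples[start][0] >= GROWTH_WINDOW_SECONDS:
        return "growing"
    return None
```

A single dip resets the growth run, so a queue that drains even briefly does not page anyone; the absolute ceiling catches the case where depth is already too high regardless of trend.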
3. Bounce rate
Alert if > 2% for any single tenant. High bounce rate is a deliverability emergency — ISPs start rate-limiting you within a day or two.
4. Spam-report rate
Alert if > 0.1% for any single tenant. Spam reports damage sender reputation fast. A sudden spike indicates either a mailing-list blunder (unexpected recipients) or a compromised API key (someone else sending from your tenant).
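The per-tenant bounce and spam-report thresholds above can be evaluated with a few lines of code once the counts are available. This is an illustrative sketch: the `TenantCounts` type and `tenant_alerts` function are hypothetical, and using the number of sent messages as the denominator is an assumption, since the source does not say how the rates are computed.

```python
from dataclasses import dataclass
from typing import Dict, List

BOUNCE_THRESHOLD = 0.02  # 2% per tenant
SPAM_THRESHOLD = 0.001   # 0.1% per tenant


@dataclass
class TenantCounts:
    sent: int          # assumed denominator for both rates
    bounced: int
    spam_reports: int


def tenant_alerts(counts: Dict[str, TenantCounts]) -> List[str]:
    """Return one alert string per tenant and threshold that is exceeded."""
    alerts = []
    for tenant_id, c in sorted(counts.items()):
        if c.sent == 0:
            continue  # no traffic in the window, so no meaningful rate
        bounce_rate = c.bounced / c.sent
        spam_rate = c.spam_reports / c.sent
        if bounce_rate > BOUNCE_THRESHOLD:
            alerts.append(f"{tenant_id}: bounce rate {bounce_rate:.2%}")
        if spam_rate > SPAM_THRESHOLD:
            alerts.append(f"{tenant_id}: spam-report rate {spam_rate:.2%}")
    return alerts
```

Evaluating each tenant separately matters here: a single misbehaving tenant can sit well above 2% while the fleet-wide average still looks healthy.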
Queue depth probing
If you have Redis access, you can read queue lengths directly from the queue keys.
Exact key names depend on deployment config (check REDIS_QUEUE_PREFIX if you've customized it). If you're running arq in cluster mode, you'll have a queue key per worker.
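As a rough sketch of reading queue lengths from Redis, the helper below takes any client object exposing redis-py's `type`, `llen`, and `zcard` methods and a list of key names. The function name `queue_depths` is hypothetical, and the key names are left to the caller because they depend on your prefix configuration. It handles both list-backed and sorted-set-backed queues, since the source does not state which structure each queue uses.

```python
from typing import Dict, Iterable


def queue_depths(client, keys: Iterable[str]) -> Dict[str, int]:
    """Return the number of entries stored under each key."""
    depths = {}
    for key in keys:
        kind = client.type(key)
        if isinstance(kind, bytes):  # redis-py returns bytes unless decode_responses=True
            kind = kind.decode()
        if kind == "list":
            depths[key] = client.llen(key)
        elif kind == "zset":
            depths[key] = client.zcard(key)
        else:  # "none" (key does not exist) or an unexpected type
            depths[key] = 0
    return depths
```

With redis-py installed, this could be called as `queue_depths(redis.Redis(host="localhost"), [...])`, passing the queue keys that exist in your deployment. An empty queue usually has no key at all in Redis, which is why a missing key is reported as depth 0 rather than as an error.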
Database query hot-spots
MySQL can show you where time is spent, for example through its slow query log or the performance_schema statement summaries.
Common issues:
- email_events inserts bottleneck. Every delivered/open/click is a row. For high-volume tenants, partition by month or archive old events.
- email_messages list queries with unfiltered subject search. Add an index on (tenant_id, created_at).
- Stats aggregation under load. If /v3/stats/rebuild is running concurrently with live sends, writes to daily_stats can contend. Run rebuilds off-peak.
SMTP deliverability metrics
Beyond the ScaiSend API metrics, watch:
- Postmaster Tools (Gmail): spam rate, reputation, authentication success.
- Microsoft SNDS: reputation bucket (Green / Yellow / Red), complaint rate.
- Your outbound IP's RBL status: use an MXToolbox-like service to check whether you've been added to any blocklist.
Drops in reputation are leading indicators — they show up days before your bounce rate climbs. Catch them early.
Logging conventions
ScaiSend logs structured JSON by default. Every log entry includes:
| Field | Purpose |
|---|---|
| level | debug, info, warning, error, critical |
| message | Human-readable |
| request_id | Set on API requests; correlate with X-Request-ID header |
| tenant_id | When in tenant context |
| message_id | When relevant to a specific ScaiSend message |
Search by request_id when triaging a user-reported issue — they've likely captured the header.
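When no log aggregator is at hand, the same triage can be done against raw log output. The sketch below assumes one JSON object per line with the fields from the table above; the function name `entries_for_request` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def entries_for_request(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield parsed log entries whose request_id matches, in input order."""
    for line in lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines such as stack traces or supervisor output
        if isinstance(entry, dict) and entry.get("request_id") == request_id:
            yield entry
```

Under systemd, the output of `journalctl -u scaisend-api -o cat` can be piped in and passed to this function as `sys.stdin`. Entries logged outside a request context carry no request_id and are skipped.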
Key operational metrics by service
API service
| Metric | Source | Normal |
|---|---|---|
| RPS | access log | depends on fleet size; track trend |
| p99 latency | access log | /v3/mail/send < 200ms; reads < 100ms |
| 5xx rate | access log | < 0.1% |
| Active connections | load balancer | < max_connections |
| Memory RSS | process stats | < 2 GB per process |
Worker service
| Metric | Source | Normal |
|---|---|---|
| Jobs/sec processed | log output | depends on tenant volume |
| Render error rate | log output | < 0.01% |
| Webhook delivery success | webhook_deliveries table | > 99% 2xx |
| Average job time | log output | < 500ms per message (template render) |
SMTP service
| Metric | Source | Normal |
|---|---|---|
| Outbound connections/sec | log output | depends on send volume |
| MX resolution failures | log output | < 0.01% |
| TLS handshake failures | log output | < 0.1% |
| Retry queue depth | Redis | stable; growing means upstream problems |
| DKIM signing latency | log output | < 10ms |
| Inbound DSN rate | log output | track trend; sudden spikes = recent send went bad |
| Inbound FBL rate | log output | track trend; spikes = spam-complaint problem |
Dashboards you'll want
Build at minimum:
- Send rate and delivery rate: requests/sec and delivered/sec as side-by-side lines.
- Bounce rate, per tenant: stacked area by tenant.
- Spam report rate, per tenant: same chart type, with the 0.1% threshold line highlighted.
- Queue depth: all three queues on one chart.
- API latency: p50, p95, p99 for /v3/mail/send specifically.
- Open/click rates: aggregate tracking response.
Grafana with MySQL and Redis datasources is enough. If you're on Prometheus, expose an exporter for the daily_stats table and scrape.
Related
- Deployment — what you're monitoring.
- Troubleshooting — what to do when an alert fires.
- Rate Limiting — 429s as a signal.