Health and Monitoring

What to observe in a running ScaiSend deployment, and how to tell when something's wrong. The short version: three process families to watch, four signals to alert on, and two health endpoints.

Health endpoints

Endpoint	Returns
`GET /health`	`200 {"status": "healthy"}` if the API process is alive
`GET /ready`	`200 {"status": "ready"}` if DB is reachable; `503` otherwise

Point your load-balancer health probes at /ready. In Kubernetes:

yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  periodSeconds: 10

The Worker and SMTP services don't expose HTTP. Rely on process supervisors (systemd, Kubernetes) to restart them.
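If you want to probe the two endpoints from a script rather than from Kubernetes, a minimal Python sketch could look like the following. The function name `probe` and the injectable `opener` parameter are illustrative and not part of ScaiSend; only the paths and status codes come from the table above.

```python
import urllib.error
import urllib.request


def probe(base_url, path, timeout=2.0, opener=urllib.request.urlopen):
    """Return (ok, status_code) for one health endpoint.

    ok is True only for an HTTP 200 response. status_code is None when
    no HTTP response was received at all.
    """
    try:
        with opener(base_url + path, timeout=timeout) as resp:
            return resp.status == 200, resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx, so a 503 from /ready lands here.
        return False, exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False, None
```

Calling `probe("http://localhost:8000", "/ready")` distinguishes "API up but DB unreachable" (`(False, 503)`) from "API not answering" (`(False, None)`), assuming the API listens on port 8000 as in the probe config above.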

Process liveness

Every service logs to stdout. Keep an eye on:

API: request volume, 5xx rate, latency percentiles.
Worker: queue depth on email:process, job completion rate, render errors.
SMTP: outbound connection success rate, retry volume, bounce rate.

If you're running under systemd, journalctl -u scaisend-api -f is the minimal tail. For real monitoring, ship logs to your log aggregator and add structured alerts.

Core metrics to alert on

Four signals cover most real incidents:

1. API error rate

text
5xx_responses / total_responses

Alert if > 0.5% over 5 minutes. A steady stream of 5xx usually means MySQL or Redis is unhealthy, or the API is OOMing.
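As a rough illustration of the ratio and threshold, here is a Python sketch that evaluates one 5-minute window of status codes. How you collect that window (access log, load balancer metrics) is up to your setup; the function names are illustrative.

```python
ALERT_THRESHOLD = 0.005  # 0.5%, the threshold suggested in the text


def error_rate(status_codes):
    """Fraction of responses with a 5xx status; 0.0 for an empty window."""
    total = len(status_codes)
    if total == 0:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / total


def should_alert(status_codes, threshold=ALERT_THRESHOLD):
    """True when the window's 5xx rate is strictly above the threshold."""
    return error_rate(status_codes) > threshold
```

Treating an empty window as 0.0 avoids a division by zero during quiet periods; you may prefer a separate "no traffic" alert for that case.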

2. Queue depth

text
llen(email:process) + llen(smtp:deliver) + llen(webhook:deliver)

Alert if queue depth is growing monotonically for 15+ minutes, or if absolute depth exceeds a ceiling (e.g., 100k). Growing queues mean workers or SMTP services can't keep up; check CPU, connection pool saturation, SMTP retry storms.
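The two alert conditions can be sketched in Python as follows, assuming you sample the summed depth once per minute. The sampling interval, the function names, and the string return values are assumptions for illustration; the 15-minute window and the 100k ceiling come from the text.

```python
def is_strictly_increasing(samples):
    """True if every sample is greater than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


def queue_alert(samples, ceiling=100_000, window=16):
    """Classify a series of total queue depths, oldest first.

    With one sample per minute, 16 samples span 15 minutes.
    Returns "ceiling", "growing", or None.
    """
    if samples and samples[-1] > ceiling:
        return "ceiling"
    if len(samples) >= window and is_strictly_increasing(samples[-window:]):
        return "growing"
    return None
```

Strict growth is deliberately conservative: a single minute where the queue drains resets the condition, which keeps short bursts from paging anyone.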

3. Bounce rate

scdoc
GET /v3/stats/sum?start_date=today
# then compute: bounces / requests

Alert if > 2% for any single tenant. High bounce rate is a deliverability emergency — ISPs start rate-limiting you within a day or two.

4. Spam-report rate

scdoc
spam_reports / delivered

Alert if > 0.1% for any single tenant. Spam reports damage sender reputation fast. A sudden spike indicates either a mailing-list blunder (unexpected recipients) or a compromised API key (someone else sending from your tenant).
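Both per-tenant ratios (bounce rate and spam-report rate) can be checked in one pass. The Python sketch below assumes you have already fetched stats per tenant into a dict; that dict shape is hypothetical, and only the four counter names are taken from the formulas above.

```python
BOUNCE_LIMIT = 0.02  # 2% of requests
SPAM_LIMIT = 0.001   # 0.1% of delivered


def ratio(numerator, denominator):
    """Safe division: a tenant with no traffic has a rate of 0.0."""
    return numerator / denominator if denominator else 0.0


def tenant_alerts(stats_by_tenant):
    """Return (tenant_id, reason) pairs for tenants over either limit.

    stats_by_tenant maps tenant_id to a dict with the keys
    requests, bounces, delivered, spam_reports (assumed shape).
    """
    alerts = []
    for tenant_id, s in sorted(stats_by_tenant.items()):
        if ratio(s["bounces"], s["requests"]) > BOUNCE_LIMIT:
            alerts.append((tenant_id, "bounce_rate"))
        if ratio(s["spam_reports"], s["delivered"]) > SPAM_LIMIT:
            alerts.append((tenant_id, "spam_report_rate"))
    return alerts
```

Evaluating per tenant matters because one bad sender can hide inside a healthy aggregate.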

Queue depth probing

If you have Redis access:

bash
redis-cli LLEN arq:queue:default       # default arq queue
redis-cli LLEN smtp:deliver            # outbound SMTP queue
redis-cli LLEN webhook:deliver         # webhook delivery queue

Exact key names depend on deployment config (check REDIS_QUEUE_PREFIX if you've customized). If you're running arq in cluster mode, you'll have a queue key per worker.
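If you would rather probe from Python, the sketch below checks the key's Redis type before counting. This is a hedge rather than a statement about ScaiSend's internals: `LLEN` only works on lists, and arq, in the versions we are aware of, keeps pending jobs in a sorted set, where `ZCARD` is the matching command. The `client` argument is expected to behave like a redis-py `Redis` instance (`type`, `llen`, `zcard`); the function name is illustrative.

```python
def queue_depth(client, key):
    """Return the number of entries under key, for list or sorted-set queues."""
    kind = client.type(key)
    if isinstance(kind, bytes):
        # redis-py returns bytes unless decode_responses=True is set.
        kind = kind.decode()
    if kind == "list":
        return client.llen(key)
    if kind == "zset":
        return client.zcard(key)
    if kind == "none":
        # Key absent: either the queue is empty or the key name is wrong.
        return 0
    raise ValueError(f"{key!r} holds a {kind}, which is not a queue type")
```

With redis-py installed, `queue_depth(redis.Redis(), "smtp:deliver")` would return the same number as the `redis-cli LLEN` call above when that key is a list.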

Database query hot-spots

MySQL will show you where time is spent:

sql
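-- Snapshot of statements running right now, slowest first.
-- time is the number of seconds each thread has spent in its current state.
SELECT SUBSTRING(info, 1, 80) AS query,
       COUNT(*) AS running,
       AVG(time) AS avg_seconds
FROM information_schema.processlist
WHERE command = 'Query'
GROUP BY query
ORDER BY avg_seconds DESC
LIMIT 10;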

Common issues:

email_events inserts bottleneck. Every delivered / open / click is a row. For high-volume tenants, partition by month or archive old events.
email_messages list queries with unfiltered subject search. Add a composite index on (tenant_id, created_at) so that queries scoped to a tenant and date range avoid a full table scan.
Stats aggregation under load. If /v3/stats/rebuild is running concurrently with live sends, writes to daily_stats can contend. Run rebuilds off-peak.

SMTP deliverability metrics

Beyond the ScaiSend API metrics, watch:

Postmaster Tools (Gmail): spam rate, reputation, authentication success.
Microsoft SNDS: reputation bucket (Green / Yellow / Red), complaint rate.
Your outbound IP's RBL status: use an MXToolbox-like service to check whether you've been added to any blocklist.

Drops in reputation are leading indicators — they show up days before your bounce rate climbs. Catch them early.

Logging conventions

ScaiSend logs structured JSON by default. Every log entry includes:

Field	Purpose
`level`	debug, info, warning, error, critical
`message`	Human-readable
`request_id`	Set on API requests; correlate with `X-Request-ID` header
`tenant_id`	When in tenant context
`message_id`	When relevant to a specific ScaiSend message

Search by request_id when triaging a user-reported issue — they've likely captured the header.
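When no log aggregator is at hand, a few lines of Python are enough to pull one request's entries out of a raw log stream. This sketch assumes one JSON object per line, which the text implies but does not state outright; the function name is illustrative.

```python
import json


def entries_for_request(lines, request_id):
    """Yield parsed log entries whose request_id field matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip non-JSON lines, for example a stray traceback.
            continue
        if isinstance(entry, dict) and entry.get("request_id") == request_id:
            yield entry
```

Passing `sys.stdin` as `lines` lets you pipe `journalctl` output straight into it.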

Key operational metrics by service

API service

Metric	Source	Normal
RPS	access log	depends on fleet size; track trend
p99 latency	access log	`/v3/mail/send` < 200ms; reads < 100ms
5xx rate	access log	< 0.1%
Active connections	load balancer	< max_connections
Memory RSS	process stats	< 2 GB per process

Worker service

Metric	Source	Normal
Jobs/sec processed	log output	depends on tenant volume
Render error rate	log output	< 0.01% of jobs
Webhook delivery success	`webhook_deliveries` table	> 99% 2xx
Average job time	log output	< 500ms per message (template render)

SMTP service

Metric	Source	Normal
Outbound connections/sec	log output	depends on send volume
MX resolution failures	log output	< 0.01%
TLS handshake failures	log output	< 0.1%
Retry queue depth	Redis	stable; growing means upstream problems
DKIM signing latency	log output	< 10ms
Inbound DSN rate	log output	track trend; sudden spikes = recent send went bad
Inbound FBL rate	log output	track trend; spikes = spam-complaint problem

Dashboards you'll want

Build at minimum:

Send rate and delivery rate: requests/sec and delivered/sec as side-by-side lines.
Bounce rate, per tenant: stacked area by tenant.
Spam report rate, per tenant: same chart type, with the 0.1% alert threshold drawn as a highlighted line.
Queue depth: all three queues on one chart.
API latency: p50, p95, p99 for /v3/mail/send specifically.
Open/click rates: aggregate open and click tracking across tenants.

Grafana with MySQL and Redis datasources is enough. If you're on Prometheus, expose an exporter for the daily_stats table and scrape it.
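For the Prometheus route, the core of such an exporter is just rendering rows in the Prometheus text exposition format. The Python sketch below does only that step. The column names (`tenant_id`, `requests`, `delivered`, `bounces`) and the `scaisend_daily_` metric prefix are hypothetical, since the daily_stats schema is not described here; adapt them to the real table.

```python
def render_metrics(rows):
    """Render daily_stats-like rows as Prometheus text exposition format.

    rows is a list of dicts with tenant_id plus one numeric value per
    metric (assumed column names). Tenant IDs are assumed to contain no
    quotes, backslashes, or newlines, which would need escaping.
    """
    lines = []
    for metric in ("requests", "delivered", "bounces"):
        name = "scaisend_daily_" + metric
        lines.append("# TYPE " + name + " gauge")
        for row in rows:
            lines.append('%s{tenant_id="%s"} %s' % (name, row["tenant_id"], row[metric]))
    return "\n".join(lines) + "\n"
```

Serving this string from an HTTP endpoint with content type `text/plain` is what Prometheus scrapes; the official client libraries handle that part, plus escaping, if you prefer not to hand-roll it.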

Deployment — what you're monitoring.
Troubleshooting — what to do when an alert fires.
Rate Limiting — 429s as a signal.