Health and Monitoring

What to observe in a running ScaiSend deployment, and how to tell when something's wrong. The short version: three process families to watch, four metric families to alert on, and two health endpoints.

Health endpoints

| Endpoint | Returns |
|---|---|
| GET /health | 200 {"status": "healthy"} if the API process is alive |
| GET /ready | 200 {"status": "ready"} if DB is reachable; 503 otherwise |

Point your load-balancer health probes at /ready. In Kubernetes:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  periodSeconds: 10
```

The Worker and SMTP services don't expose HTTP. Rely on process supervisors (systemd, Kubernetes) to restart them.
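For an external check outside Kubernetes, the two endpoints above can be polled from a script. The following is a minimal sketch, not ScaiSend tooling: the function names and the base URL you pass in are assumptions, and only the paths, status codes, and response bodies come from the table above.

```python
import json
import urllib.error
import urllib.request

# Expected "status" value per endpoint, as documented in the table above.
EXPECTED_STATUS = {"/health": "healthy", "/ready": "ready"}


def interpret_probe(path: str, status: int, body: str) -> bool:
    """Return True when a probe response matches the documented healthy shape."""
    if path not in EXPECTED_STATUS or status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == EXPECTED_STATUS[path]


def probe(base_url: str, path: str, timeout: float = 2.0) -> bool:
    """Fetch base_url + path; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return interpret_probe(path, resp.status, body)
    except urllib.error.HTTPError as err:
        # Non-2xx answers (for example the 503 from /ready) arrive as HTTPError.
        return interpret_probe(path, err.code, "")
    except (urllib.error.URLError, OSError):
        return False
```

Calling `probe("http://localhost:8000", "/ready")` would return False while the database is unreachable, assuming the API listens on port 8000 as in the probe config above.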

Process liveness

Every service logs to stdout. Keep an eye on:

  • API: request volume, 5xx rate, latency percentiles.
  • Worker: queue depth on email:process, job completion rate, render errors.
  • SMTP: outbound connection success rate, retry volume, bounce rate.

If you're running under systemd, journalctl -u scaisend-api -f is the minimal tail. For real monitoring, ship logs to your log aggregator and add structured alerts.

Core metrics to alert on

Four signals cover most real incidents:

1. API error rate

```text
5xx_responses / total_responses
```

Alert if > 0.5% over 5 minutes. A steady stream of 5xx usually means MySQL or Redis is unhealthy, or the API is OOMing.
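As an illustration of the ratio and threshold above, here is a small sketch that evaluates one 5-minute window of status codes. How you collect the codes (access log, load balancer) is up to you; the function names are hypothetical.

```python
from typing import Iterable

ERROR_RATE_THRESHOLD = 0.005  # 0.5%, the alert threshold stated above


def error_rate(status_codes: Iterable[int]) -> float:
    """Fraction of responses in the window that carry a 5xx status."""
    total = 0
    errors = 0
    for code in status_codes:
        total += 1
        if 500 <= code <= 599:
            errors += 1
    # An empty window has no errors by definition.
    return errors / total if total else 0.0


def should_alert(status_codes: Iterable[int]) -> bool:
    """True when the window's 5xx rate is strictly above the threshold."""
    return error_rate(status_codes) > ERROR_RATE_THRESHOLD
```

Feed it the status codes from the last five minutes; an empty window is treated as healthy rather than as an error.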

2. Queue depth

```text
llen(email:process) + llen(smtp:deliver) + llen(webhook:deliver)
```

Alert if queue depth is growing monotonically for 15+ minutes, or if absolute depth exceeds a ceiling (e.g., 100k). Growing queues mean workers or SMTP services can't keep up; check CPU, connection pool saturation, and SMTP retry storms.
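The two conditions above (sustained growth, absolute ceiling) can be sketched as a check over periodic depth samples. This is one possible interpretation: "growing monotonically" is read as strictly increasing at every sample, and the sample format and function name are assumptions.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, int]  # (unix timestamp in seconds, total queue depth)


def queue_alert(samples: Sequence[Sample],
                window_s: float = 900.0,
                ceiling: int = 100_000) -> bool:
    """True if the latest depth exceeds the ceiling, or if depth rose at
    every sample for at least window_s seconds. Samples must be time-ordered."""
    if not samples:
        return False
    if samples[-1][1] > ceiling:
        return True
    # Walk backwards while each sample is strictly deeper than its predecessor.
    start = len(samples) - 1
    while start > 0 and samples[start][1] > samples[start - 1][1]:
        start -= 1
    # Duration of the uninterrupted growth run that ends at the latest sample.
    return samples[-1][0] - samples[start][0] >= window_s
```

A single dip resets the growth run, so a noisy but stable queue does not trigger the alert. Loosen the comparison if your sampling is coarse.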

3. Bounce rate

```text
GET /v3/stats/sum?start_date=today
# then compute: bounces / requests
```

Alert if > 2% for any single tenant. High bounce rate is a deliverability emergency — ISPs start rate-limiting you within a day or two.

4. Spam-report rate

```text
spam_reports / delivered
```

Alert if > 0.1% for any single tenant. Spam reports damage sender reputation fast. A sudden spike indicates either a mailing-list blunder (unexpected recipients) or a compromised API key (someone else sending from your tenant).
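The bounce and spam-report thresholds are both per-tenant ratios, so one pass over per-tenant counters covers both. The sketch below assumes you have already collected the counters; the dictionary keys (`requests`, `delivered`, `bounces`, `spam_reports`) are hypothetical names, not the documented shape of the `/v3/stats/sum` response.

```python
from typing import Dict, List, Mapping

BOUNCE_LIMIT = 0.02  # 2% of requests, per the bounce-rate rule above
SPAM_LIMIT = 0.001   # 0.1% of delivered, per the spam-report rule above


def tenant_alerts(stats: Mapping[str, Mapping[str, int]]) -> Dict[str, List[str]]:
    """Map each offending tenant to the list of thresholds it exceeded."""
    alerts: Dict[str, List[str]] = {}
    for tenant, counters in stats.items():
        reasons: List[str] = []
        requests = counters.get("requests", 0)
        delivered = counters.get("delivered", 0)
        # Skip the ratio entirely when the denominator is zero.
        if requests and counters.get("bounces", 0) / requests > BOUNCE_LIMIT:
            reasons.append("bounce_rate")
        if delivered and counters.get("spam_reports", 0) / delivered > SPAM_LIMIT:
            reasons.append("spam_report_rate")
        if reasons:
            alerts[tenant] = reasons
    return alerts
```

Evaluating per tenant rather than in aggregate matters here: one bad tenant can hide inside a healthy fleet-wide average.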

Queue depth probing

If you have Redis access:

```bash
redis-cli LLEN arq:queue:default       # default arq queue
redis-cli LLEN smtp:deliver            # outbound SMTP queue
redis-cli LLEN webhook:deliver         # webhook delivery queue
```

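Exact key names depend on deployment config (check REDIS_QUEUE_PREFIX if you've customized it). If you're running arq in cluster mode, you'll have a queue key per worker. Note that arq stores its queue as a sorted set, so if LLEN returns a WRONGTYPE error on the arq key, use ZCARD on that key instead.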
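If you would rather probe from a script than from redis-cli, the sketch below sums queue depths through any client object that offers `type`, `llen`, and `zcard`, as redis-py's `redis.Redis` does. It picks the length command from the key's Redis type, because lists and sorted sets need different commands. The function names are hypothetical and the key names are the ones shown above, which may differ in your deployment.

```python
from typing import Iterable


def queue_depth(client, key: str) -> int:
    """Return the number of entries under key, whether it is a list or a sorted set."""
    kind = client.type(key)
    if isinstance(kind, bytes):
        # Clients created without decode_responses return bytes.
        kind = kind.decode("ascii")
    if kind == "list":
        return client.llen(key)
    if kind == "zset":
        return client.zcard(key)
    if kind == "none":
        # A missing key means the queue is empty.
        return 0
    raise ValueError(f"{key!r} holds a {kind}, which is not a queue type")


def total_depth(client, keys: Iterable[str]) -> int:
    """Sum the depth of every queue key."""
    return sum(queue_depth(client, key) for key in keys)
```

With redis-py installed, something like `total_depth(redis.Redis(), ["arq:queue:default", "smtp:deliver", "webhook:deliver"])` would give the combined figure to chart or alert on.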

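Database query hot-spots

MySQL's process list shows which statements are running right now and how long each has been running. It is a snapshot, not a history, so run the query a few times under load to see where time is spent: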

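```sql
-- Snapshot of currently running statements, grouped by their first 80 characters.
SELECT SUBSTRING(info, 1, 80) AS query,
       COUNT(*) AS running,
       AVG(time) AS avg_seconds   -- seconds spent in the current state
FROM information_schema.processlist
WHERE command = 'Query'
GROUP BY query
ORDER BY avg_seconds DESC
LIMIT 10;
```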

Common issues:

  • email_events inserts bottleneck. Every delivered / open / click is a row. For high-volume tenants, partition by month or archive old events.
  • email_messages list queries with unfiltered subject search. Add a composite index on (tenant_id, created_at) so the query narrows by tenant and date range before it scans subjects.
  • Stats aggregation under load. If /v3/stats/rebuild is running concurrently with live sends, writes to daily_stats can contend. Run rebuilds off-peak.

SMTP deliverability metrics

Beyond the ScaiSend API metrics, watch:

  • Postmaster Tools (Gmail): spam rate, reputation, authentication success.
  • Microsoft SNDS: reputation bucket (Green / Yellow / Red), complaint rate.
  • Your outbound IP's RBL status: use an MXToolbox-like service to check whether you've been added to any blocklist.

Drops in reputation are leading indicators — they show up days before your bounce rate climbs. Catch them early.

Logging conventions

ScaiSend logs structured JSON by default. Every log entry includes:

| Field | Purpose |
|---|---|
| level | debug, info, warning, error, critical |
| message | Human-readable |
| request_id | Set on API requests; correlate with X-Request-ID header |
| tenant_id | When in tenant context |
| message_id | When relevant to a specific ScaiSend message |

Search by request_id when triaging a user-reported issue — they've likely captured the header.
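When no log aggregator is at hand, a few lines of Python can pull every entry for one request out of a raw log stream. This is a sketch that assumes one JSON object per line with the `request_id` field from the table above; the function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def entries_for_request(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield the parsed log entries whose request_id matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except ValueError:
            # Skip anything that is not JSON, such as a stray stack trace line.
            continue
        if isinstance(entry, dict) and entry.get("request_id") == request_id:
            yield entry
```

It accepts any iterable of lines, so an open file or `sys.stdin` (for example, piped from journalctl) both work.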

Key operational metrics by service

API service

| Metric | Source | Normal |
|---|---|---|
| RPS | access log | depends on fleet size; track trend |
| p99 latency | access log | /v3/mail/send < 200ms; reads < 100ms |
| 5xx rate | access log | < 0.1% |
| Active connections | load balancer | < max_connections |
| Memory RSS | process stats | < 2 GB per process |
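If your access log records request durations, the p99 figure in the table can be computed with a nearest-rank percentile. This is a minimal sketch under that assumption; the log format and the function name are not part of ScaiSend.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of
    the samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Rank is 1-based; multiply before dividing to limit float rounding.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Compute it separately for /v3/mail/send and for read endpoints, since the table gives them different targets.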

Worker service

| Metric | Source | Normal |
|---|---|---|
| Jobs/sec processed | log output | depends on tenant volume |
| Render error rate | log output | < 0.01% |
| Webhook delivery success | webhook_deliveries table | > 99% 2xx |
| Average job time | log output | < 500ms per message (template render) |

SMTP service

| Metric | Source | Normal |
|---|---|---|
| Outbound connections/sec | log output | depends on send volume |
| MX resolution failures | log output | < 0.01% |
| TLS handshake failures | log output | < 0.1% |
| Retry queue depth | Redis | stable; growing means upstream problems |
| DKIM signing latency | log output | < 10ms |
| Inbound DSN rate | log output | track trend; sudden spikes = recent send went bad |
| Inbound FBL rate | log output | track trend; spikes = spam-complaint problem |

Dashboards you'll want

Build at minimum:

  1. Send rate and delivery rate: requests/sec and delivered/sec as side-by-side lines.
  2. Bounce rate, per tenant: stacked area by tenant.
  3. Spam report rate, per tenant: same layout, with the 0.1% threshold line highlighted.
  4. Queue depth: all three queues on one chart.
  5. API latency: p50, p95, p99 for /v3/mail/send specifically.
  6. Open/click rates: aggregate open and click tracking rates.

Grafana with MySQL and Redis datasources is enough. If you're on Prometheus, expose an exporter for the daily_stats table and scrape it.
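For the Prometheus route, an exporter boils down to turning daily_stats rows into the Prometheus text exposition format and serving the result over HTTP. The sketch below covers only the rendering step. The column names (`tenant_id`, `requests`, `delivered`, `bounces`, `spam_reports`) and the `scaisend_daily_` metric prefix are assumptions for illustration; check them against your actual schema.

```python
from typing import Iterable, List, Mapping

# Hypothetical daily_stats columns to export; adjust to the real schema.
METRICS = ("requests", "delivered", "bounces", "spam_reports")


def _escape_label(value: str) -> str:
    """Escape a label value as the Prometheus text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_daily_stats(rows: Iterable[Mapping[str, object]]) -> str:
    """Render rows as gauge samples, one metric family per exported column."""
    materialized = list(rows)
    lines: List[str] = []
    for metric in METRICS:
        name = f"scaisend_daily_{metric}"
        lines.append(f"# TYPE {name} gauge")
        for row in materialized:
            tenant = _escape_label(str(row["tenant_id"]))
            value = int(row.get(metric, 0))
            lines.append(f'{name}{{tenant_id="{tenant}"}} {value}')
    return "\n".join(lines) + "\n"
```

Serve the returned string from a `/metrics` endpoint with any HTTP server. Gauges are used here because the values are read from the table as-is rather than counted by the exporter process.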

Updated 2026-05-17 01:33:27, rev 1