Architecture
ScaiQueue is a thin product layer over MariaDB and Redis. Messages are durable rows; the per-queue order and the claim locks live in Redis; system agents enforce visibility timeouts, retries, archival, and dead-lettering on a cron schedule.
Components#
A producer publishes over HTTP. ScaiGrid's MessageService writes the row, updates Redis, and (optionally) invokes the routing engine. A consumer claims via HTTP; on completion ScaiGrid emits an event on its internal event bus. System agents do the housekeeping out-of-band.
There is no separate ScaiQueue daemon. The HTTP endpoints, the routing engine, and the cron-driven system agents all run inside ScaiGrid's existing FastAPI and arq processes.
Lifecycle of one message#
- Publish.
MessageService.publishvalidates scope and queue state (paused,archived, ordrainingreject), writes the message row withstate=pending, computes a sort score based on the queue's ordering mode, andZADDs the id into the per-queue pending sorted set. The routing engine then evaluatesmessage_published-trigger rules; if one matches it may route, transform, escalate, or archive. - Claim. A consumer calls claim. The service
ZPOPMINs the lowest-score id, takes a RedisSET NX EXlock keyed by the message id withvisibility_timeout_sTTL, then updates the row tostate=claimed, setsclaimed_by_*, incrementsattempts. The queue'sdepth_pendingdecrements anddepth_claimedincrements. - Complete or fail. Complete moves the row to
completed, releases the Redis lock, decrementsdepth_claimed, and emits ascaiqueue.message.completedevent on ScaiGrid's event bus with the originalcorrelation_id. Fail recordsfailure_reason; ifattempts < max_retriesthe message returns to pending with a backoff, otherwise it is moved to the scope's_dead_letterqueue. - Reclaim on crash. If the consumer never calls complete or fail, the claim lock expires. The
visibility_timeout_enforcercron job (runs every second) flips the row back topendingand re-adds it to the pending sorted set.
Ordering modes#
The score in the Redis sorted set depends on the queue's ordering:
fifo—created_atas epoch seconds. Lowest is oldest, soZPOPMINreturns the oldest first.priority—-(tier*1000 + score)thencreated_at. Higher tier-plus-score wins; ties broken by age.deadline—deadlineas epoch seconds. Earliest deadline wins.
The consumer_mode field is independent: competing (default — one consumer claims each message), broadcast (every subscriber sees a copy), or other modes that may exist on system queues.
State#
- Scopes, queues, messages, routing rules, schemas, HITL patterns, ACLs, API keys, audit log, system-agent registry — all in MariaDB tables prefixed
mod_scaiqueue_. - Pending order, claim locks, visibility timeouts, subscription dedup sets — Redis keys under
sq:{tenant}:{scope}:.... - Archived messages — moved from
mod_scaiqueue_messagestomod_scaiqueue_message_archiveby themessage_archiveragent for terminal messages older than one hour.
If Redis is unavailable, MessageService falls back to a DB-only claim path that scans pending rows in created_at order. Slower; still correct.
System agents#
These run inside ScaiGrid's arq workers on a cron schedule. You don't start them; they appear in GET /system-agents.
| Agent | Schedule | Role |
|---|---|---|
visibility_timeout_enforcer |
every second | Reclaims expired claims, returns messages to pending. |
expiry_enforcer |
every minute | Marks TTL- and deadline-expired pending messages as terminal. |
message_archiver |
every 5 minutes | Moves terminal messages older than 1 hour to the archive table. |
dead_letter_monitor |
every minute | Emits structured warnings when _dead_letter queues grow. |
priority_aging_worker |
every 5 minutes | Bumps priority_score on long-pending messages so low-priority work doesn't starve. |
How it differs from rolling your own#
Building this on raw Redis Streams or a hosted broker gives you a topic and a delivery guarantee. ScaiQueue adds the typed payload, the routing engine, the HITL spec, the dead-letter queue, the audit log, and the system agents — all wired into ScaiGrid's auth, accounting, and admin UI. If you only need fire-and-forget pub/sub, you don't need ScaiQueue. If you need durable, retryable, observable work distribution with optional human steps, you do.