Troubleshooting

A short list of things that go wrong with ScaiQueue and how to fix them. If none of these match, check the request id in the response envelope and the audit log at GET /scopes/{scope_id}/audit?correlation_id=....

Publish returns `SCAIQUEUE_SCOPE_PAUSED` or `SCAIQUEUE_QUEUE_PAUSED`

Someone paused the scope or queue (or it's archived / draining). Check state via GET /scopes/{scope_id} and GET /scopes/{scope_id}/queues/{queue_id}. Resume with POST /scopes/{scope_id}/resume or POST /scopes/{scope_id}/queues/{queue_id}/resume. Archived scopes must be unarchived first.

Publish returns `SCAIQUEUE_QUEUE_FULL`

The queue has a max_depth set and has reached it. Options: drain the queue (consume more aggressively), raise max_depth, or change overflow_policy away from reject.

Publish returns `SCAIQUEUE_IDEMPOTENCY_CONFLICT`

You sent the same idempotency_key twice with a different body. Either re-use the same body (which returns the original message id) or use a fresh key.
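
One way to rule this conflict out on the client side is to derive the key from the body itself, so a retry with an unchanged body always reuses the same key and a changed body always gets a new one. The sketch below is only an illustration: the helper name `idempotency_key_for`, the `pub` prefix, and the choice of SHA-256 over canonical JSON are assumptions, not part of ScaiQueue.

```python
import hashlib
import json


def idempotency_key_for(body: dict, prefix: str = "pub") -> str:
    """Derive a deterministic key from a JSON-serializable message body.

    Hypothetical helper: hashing the body guarantees that one key is
    never paired with two different bodies.
    """
    # sort_keys plus compact separators give a canonical encoding, so
    # dict ordering does not change the resulting key.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:32]}"
```

The trade-off is that two intentionally identical messages collapse into one, so this only fits producers where an identical body really does mean "the same message".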

Claim returns an empty list when you know there's work

A few causes, in priority order:

1. Queue is paused: GET /scopes/{scope_id}/queues/{queue_id} and check state.
2. Scope is paused or archived: checking only the queue isn't enough, so also check GET /scopes/{scope_id}.
3. Message is not yet visible: delay_until or visible_at is set in the future.
4. Redis is unhealthy: claim falls back to the DB path, which is slower; check ScaiGrid logs for Redis errors.
5. Another consumer claimed it first: claims are atomic, so in a competing-consumer queue only one consumer wins. Increase batch_size if you have many consumers and small messages.
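
The first three checks can be folded into a small diagnostic that runs against values you have already fetched. This is a rough sketch: the function name, the exact state strings, and the assumption that `visible_at` has been parsed into a timezone-aware `datetime` are all hypothetical and not taken from the API reference.

```python
from datetime import datetime, timezone
from typing import Optional

# State names come from this page; the exact strings the API returns
# are an assumption.
QUEUE_BLOCKING_STATES = {"paused"}
SCOPE_BLOCKING_STATES = {"paused", "archived"}


def why_claim_is_empty(
    scope_state: str,
    queue_state: str,
    visible_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> str:
    """Return the first likely cause, in the same order as the checklist."""
    now = now or datetime.now(timezone.utc)
    if queue_state in QUEUE_BLOCKING_STATES:
        return f"queue is {queue_state}"
    if scope_state in SCOPE_BLOCKING_STATES:
        return f"scope is {scope_state}"
    if visible_at is not None and visible_at > now:
        return "message is not yet visible"
    # Nothing local explains it; the remaining causes need logs.
    return "check Redis health and competing consumers"
```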

Messages keep being reclaimed by other consumers

Your consumer isn't calling complete or fail (or extend) before visibility_timeout_s expires. The visibility_timeout_enforcer runs every second and flips abandoned claims back to pending. Either:

Raise visibility_timeout_s on publish (or on the claim call) to cover the long tail of your processing time, or
Call POST /scopes/{scope_id}/messages/{msg_id}/extend periodically from long-running workers.
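
For the second option, a long-running worker can extend its claim between units of work. The following is a minimal sketch, assuming the job can be split into steps that are each much shorter than the timeout. `process_with_extend` and the `extend` callable are hypothetical; `extend` stands in for whatever code issues the POST to the extend endpoint.

```python
import time
from typing import Callable, Iterable


def process_with_extend(
    steps: Iterable[Callable[[], None]],
    extend: Callable[[], None],
    visibility_timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run each step, extending the claim once half the timeout has elapsed.

    Returns how many times extend() was called.
    """
    threshold = visibility_timeout_s / 2  # leave headroom before expiry
    last_extended = clock()
    extensions = 0
    for step in steps:
        step()
        now = clock()
        if now - last_extended >= threshold:
            extend()  # caller supplies the actual HTTP call
            last_extended = now
            extensions += 1
    return extensions
```

If a single step can outlast the timeout on its own, this pattern is not enough, and the extension would have to run from a separate thread or task instead.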

Messages land in `_dead_letter`

A message lands in dead-letter after it has been failed max_retries times (default 3). Find it via GET /scopes/{scope_id}/queues/<dead-letter-queue-id>/messages and inspect failure_reason. After fixing the root cause, republish the body to the original queue manually.
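
When many messages have piled up, grouping them by failure_reason usually shows which root cause to fix first. The sketch below assumes the list response has already been decoded into a list of dicts that each carry a `failure_reason` field; the response shape and the helper name are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def summarize_failures(messages: Iterable[Mapping]) -> List[Tuple[str, int]]:
    """Count dead-lettered messages per failure_reason, most common first."""
    reasons = Counter(
        # Messages without a reason are grouped under "unknown".
        (message.get("failure_reason") or "unknown")
        for message in messages
    )
    return reasons.most_common()
```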

Routing rule never fires

In priority order:

1. Rule is disabled. GET /scopes/{scope_id}/routing-rules/{rule_id} and check enabled.
2. Trigger mismatch. Make sure the rule's trigger event matches what you expect (default rules fire on message_published).
3. Conditions don't match. Run POST /scopes/{scope_id}/routing-rules/test with a realistic test message and see which rules match.
4. A higher-priority rule wins first. Rules are evaluated in ascending priority order and the first match wins. Give the rule a lower priority value so it is evaluated earlier, or tighten the conditions on the rules ahead of it.
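
The shadowing effect in the last item is easier to see in a toy model of first-match-wins evaluation. This is not the routing engine's code: the `Rule` class and its fields are hypothetical stand-ins that mirror only the behavior described above (ascending priority, disabled rules skipped, first match wins).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Rule:
    name: str
    priority: int
    matches: Callable[[dict], bool]  # stand-in for the rule's conditions
    enabled: bool = True


def first_matching_rule(rules: Iterable[Rule], message: dict) -> Optional[str]:
    """Return the name of the rule that would fire, or None."""
    # Lower priority values are evaluated first.
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.enabled and rule.matches(message):
            return rule.name
    return None


catch_all = Rule("catch_all", 10, lambda m: True)
invoices = Rule("invoices", 20, lambda m: m.get("type") == "invoice")
# catch_all is evaluated first and matches everything, so invoices never fires.
shadowed = first_matching_rule([invoices, catch_all], {"type": "invoice"})
```

Here `shadowed` is `"catch_all"`; setting `invoices.priority` to a value below 10 lets the more specific rule fire instead.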

Routing loop / "circuit_breaker" audit entries

The routing engine refuses to apply more than 5 hops per message and writes a routing.circuit_breaker audit entry. Inspect your rule graph for cycles — typically a rule routes back into a queue that's a source for another rule.
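
If the rule graph is too large to inspect by eye, a cycle search over (source queue, target queue) pairs can point at the loop. The sketch below uses a standard depth-first search. How you extract the pairs from your rules is left open, and the function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def find_routing_cycle(edges: List[Tuple[str, str]]) -> Optional[List[str]]:
    """Return one cycle as a list of queue names, or None if there is none."""
    graph: Dict[str, List[str]] = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                # Back edge: nxt is already on the current path.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNVISITED:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNVISITED:
            found = visit(node)
            if found:
                return found
    return None
```

For example, the edges `[("a", "b"), ("b", "c"), ("c", "a")]` yield `["a", "b", "c", "a"]`.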

Stream never completes

The stream is stuck open. Causes:

The producer never published a final chunk (stream_final=1).
expected_chunks was set but not all sequences arrived.
timeout_seconds (default 300) elapsed: the stream is technically expired, but assembly still works on whatever arrived.

Either cancel via POST /scopes/{scope_id}/streams/{stream_id}/cancel or fetch what you have via GET /scopes/{scope_id}/streams/{stream_id}/assembled.
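
When expected_chunks is set, comparing it against the sequence numbers that actually arrived shows which chunks the producer still owes. This is a small sketch: whether sequences start at 0 or 1 is not stated here, so `first_seq` is an assumption you should adjust, and the function name is hypothetical.

```python
from typing import Iterable, List


def missing_sequences(
    expected_chunks: int,
    received: Iterable[int],
    first_seq: int = 0,  # assumption: change to 1 if sequences are 1-based
) -> List[int]:
    """List the sequence numbers that have not arrived yet."""
    seen = set(received)
    expected = range(first_seq, first_seq + expected_chunks)
    return [seq for seq in expected if seq not in seen]
```

For example, `missing_sequences(5, [0, 1, 3])` returns `[2, 4]`.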

API key rejected after rotation

Old keys remain valid for grace_period_seconds after rotation (default 300). After that, they are revoked. If a consumer fails right after a rotation, check whether it picked up the new key.
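
To tell whether a rejection is explained by the grace window, compare the rotation time with the time of the failed request. The sketch assumes you know both timestamps. The function name is hypothetical, and treating the exact boundary instant as already revoked is an assumption.

```python
from datetime import datetime, timedelta


def old_key_still_valid(
    rotated_at: datetime,
    now: datetime,
    grace_period_seconds: int = 300,
) -> bool:
    """True while the pre-rotation key is still inside its grace period."""
    return now < rotated_at + timedelta(seconds=grace_period_seconds)
```

If this returns False for the failing request, the consumer was most likely still using the old key after it had been revoked.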

"Unauthorized" / `PERMISSION_DENIED` from a tenant-admin

Every ScaiQueue endpoint requires the caller to have a tenant_id on their token. A super_admin without a tenant context (cross-tenant operator) is forbidden. Either re-issue the token scoped to a tenant, or operate via that tenant's admin user.

System agents say `status: idle` but timeouts aren't being enforced

idle is the resting state between runs. Check last_run_at and total_runs: if they're not advancing, the arq worker isn't running. Make sure a ScaiGrid process is running in worker mode (SCAIGRID_MODE=worker).
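
"Not advancing" can be checked by taking two snapshots of the agent's status some time apart and comparing them. The sketch assumes each snapshot is a dict with the `last_run_at` and `total_runs` fields named above. How the snapshots are fetched, and the function name, are assumptions.

```python
from typing import Mapping


def agent_is_advancing(before: Mapping, after: Mapping) -> bool:
    """True if the agent ran at least once between the two snapshots."""
    return (
        after["total_runs"] > before["total_runs"]
        or after["last_run_at"] != before["last_run_at"]
    )
```

Take the snapshots far enough apart to cover at least one scheduled run; otherwise a healthy agent will look stalled.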
