Webhooks Deep Dive

Advanced patterns for production webhook consumers. For basics, see Events and Webhooks and Webhooks Reference.

Idempotency is your responsibility

ScaiGrid delivers events at-least-once, not exactly-once. A network blip during your 2xx response means we retry, and you receive the event twice. Your webhook handler must be idempotent.

Use the event_id as a deduplication key:

python
# already_processed, mark_processed, do_work, and transaction are
# placeholders for your own storage layer.

def handle_webhook(event):
    event_id = event["event_id"]
    if already_processed(event_id):
        return  # duplicate; ignore
    with transaction():
        do_work(event)
        mark_processed(event_id)

Store processed event IDs with a TTL matching your retention needs (typically 7–14 days; after that, duplicates shouldn't arrive).
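
To make the TTL idea concrete, here is a minimal in-memory sketch of a deduplication store. The class name `ProcessedEventStore` and its injectable `clock` parameter are illustrative assumptions, not part of ScaiGrid. A production consumer would more likely keep these IDs in shared storage, such as a database table with a unique constraint on the event ID or a Redis key with an expiry, so that every handler instance sees the same state.

```python
import time
from typing import Callable, Dict


class ProcessedEventStore:
    """In-memory record of processed event IDs that expire after ttl_s seconds."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._expires_at: Dict[str, float] = {}

    def _purge(self) -> None:
        # Forget IDs whose TTL has elapsed.
        now = self._clock()
        expired = [eid for eid, exp in self._expires_at.items() if exp <= now]
        for eid in expired:
            del self._expires_at[eid]

    def already_processed(self, event_id: str) -> bool:
        self._purge()
        return event_id in self._expires_at

    def mark_processed(self, event_id: str) -> None:
        self._expires_at[event_id] = self._clock() + self._ttl_s
```

With a 7-day TTL this would be constructed as `ProcessedEventStore(ttl_s=7 * 24 * 3600)`.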

Verify signatures before parsing

python
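import json
import os
from hashlib import sha256
from hmac import compare_digest, new

from fastapi import FastAPI, HTTPException, Request  # assumes a FastAPI app

app = FastAPI()
# Assumed env var name; load the shared secret however you store it.
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]

def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    expected = "sha256=" + new(
        secret.encode(), raw_body, sha256
    ).hexdigest()
    # Constant-time comparison. Compare bytes, because compare_digest
    # raises TypeError when given a non-ASCII str.
    return compare_digest(expected.encode(), header_value.encode())

@app.post("/webhooks")
async def webhook(request: Request):
    body = await request.body()  # raw bytes, exactly as they were signed
    sig = request.headers.get("X-ScaiGrid-Signature", "")
    if not verify_signature(body, sig, WEBHOOK_SECRET):
        raise HTTPException(401, "Invalid signature")
    event = json.loads(body)
    # now safe to act on event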

Never skip this step. An unverified webhook could come from anyone.
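
The sketch below shows why verification has to run on the raw bytes rather than on parsed-and-reserialized JSON. It reuses the `sha256=<hex digest>` header format from the example above. `sign_payload` is a hypothetical helper for local testing, and the secret and payload are made-up test values.

```python
import hashlib
import hmac
import json


def sign_payload(raw_body: bytes, secret: str) -> str:
    """Build a header value in the "sha256=<hex digest>" form used above."""
    digest = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    expected = sign_payload(raw_body, secret)
    return hmac.compare_digest(expected.encode(), header_value.encode())


# Made-up test values, for illustration only.
secret = "test-secret"
body = b'{"event_id":"evt_test_1"}'
header = sign_payload(body, secret)

# Same content, different bytes: re-serializing adds whitespace.
reserialized = json.dumps(json.loads(body)).encode()

print(verify_signature(body, header, secret))          # the original bytes verify
print(verify_signature(reserialized, header, secret))  # the re-serialized bytes do not
```

The second check fails because `json.dumps` inserts a space after the colon, so the bytes no longer match what was signed.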

Replay protection

Beyond signature verification, check the timestamp to prevent replay attacks:

python
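from datetime import datetime, timezone

def check_timestamp(header_value: str, max_age_s: int = 300) -> bool:
    try:
        ts = datetime.fromtimestamp(int(header_value), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        return False  # missing or malformed timestamp header
    now = datetime.now(tz=timezone.utc)
    # abs() also rejects timestamps too far in the future
    return abs((now - ts).total_seconds()) <= max_age_s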

Reject events whose timestamp is more than 5 minutes away from the current time. An attacker replaying a valid-signature event hours later is blocked.

Fast ack, async work

Your webhook handler must return 2xx within 10 seconds. Don't do expensive work inline:

python
# BAD — holding the connection during long work
@app.post("/webhooks")
async def webhook(request: Request):
    event = verify_and_parse(request)
    await send_emails(event)      # could take 30 seconds
    await update_crm(event)       # could take another 20
    return {"ok": True}

# GOOD — ack immediately, queue for async processing
@app.post("/webhooks")
async def webhook(request: Request):
    event = verify_and_parse(request)
    await queue.publish(event)    # Redis, ScaiQueue, SQS, whatever
    return {"ok": True}

A worker pulls from the queue and does the real work. Your webhook endpoint stays fast and never times out.
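
A minimal sketch of the worker side, using an in-process `asyncio.Queue` as a stand-in for whatever queue you actually run. An in-process queue is not durable, so events are lost if the process dies; treat this as an illustration of the control flow only. The `worker` and `demo` names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List


async def worker(queue: asyncio.Queue, handle: Callable[[dict], Awaitable[None]]) -> None:
    """Pull events off the queue forever and process them one at a time."""
    while True:
        event = await queue.get()
        try:
            await handle(event)
        except Exception as exc:
            # One failing event must not kill the worker; log it and move on.
            print(f"event {event.get('event_id')} failed: {exc!r}")
        finally:
            queue.task_done()


async def demo() -> List[str]:
    processed: List[str] = []

    async def handle(event: dict) -> None:
        if event["event_id"] == "evt_2":
            raise RuntimeError("downstream error")
        processed.append(event["event_id"])

    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(worker(queue, handle))
    for i in range(1, 4):
        queue.put_nowait({"event_id": f"evt_{i}"})  # what the endpoint would do
    await queue.join()  # wait until every queued event has been handled
    task.cancel()
    return processed
```

In the demo, the event that raises is logged and skipped while the others are processed. A real worker would typically send failed events to a retry or dead-letter queue instead of only logging them.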

Handling burst events

During a model incident, you might get hundreds of request.failed events in a minute. Rate-limit your processing to avoid cascading failures:

python
import asyncio
import time
from collections import deque

RECENT_EVENTS = deque(maxlen=1000)

async def process_with_backoff(event):
    now = time.time()
    # Forget timestamps older than 60 seconds
    while RECENT_EVENTS and now - RECENT_EVENTS[0] > 60:
        RECENT_EVENTS.popleft()
    if len(RECENT_EVENTS) > 100:  # > 100 events/minute
        await asyncio.sleep(0.1)  # slow down
    RECENT_EVENTS.append(now)
    await do_work(event)

Or better, use a queue with a controlled-concurrency worker pool.
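
One way to bound concurrency is sketched below. It uses an `asyncio.Semaphore` rather than a fixed pool of worker tasks, which is a lighter-weight alternative with the same cap on simultaneous work. The function name `process_bounded` and the default limit of 5 are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def process_bounded(
    events: Iterable[dict],
    handle: Callable[[dict], Awaitable[None]],
    max_concurrency: int = 5,
) -> None:
    """Run handle(event) for every event, never more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(event: dict) -> None:
        async with semaphore:  # waits here while max_concurrency handlers are running
            await handle(event)

    # gather propagates the first exception raised by any handler.
    await asyncio.gather(*(run_one(e) for e in events))
```

A burst of several hundred events then reaches downstream systems at most `max_concurrency` at a time, instead of all at once.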

Don't wildcard-subscribe to everything. Each event type has different semantics, and generic handlers become spaghetti:

bash
# GOOD — narrow subscriptions per responsibility
POST /v1/webhooks
{
  "url": ".../billing",
  "events": ["budget.soft_limit_reached", "budget.hard_limit_reached"]
}

POST /v1/webhooks
{
  "url": ".../incidents",
  "events": ["request.failed", "webhook.auto_disabled"]
}

Separate webhooks isolate failures — your billing endpoint being down doesn't block incident events.

Delivery inspection

When a webhook mysteriously misses events, check delivery history:

bash
curl "https://scaigrid.scailabs.ai/v1/webhooks/{webhook_id}/deliveries?status=failed&limit=50" \
  -H "Authorization: Bearer $TOKEN"

Look at status_code, error_message, and duration_ms. Common patterns:

status_code: 408 and high duration_ms — your endpoint is too slow; ack faster
status_code: 401 — signature verification failing; check your shared secret
status_code: 500 — your endpoint is throwing; check your logs
status_code: 0, error_message: connection refused — endpoint is down
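
The patterns above can be applied mechanically to a batch of delivery records. This sketch assumes each record is a dict carrying the `status_code` and `error_message` fields named above; the exact response shape of the deliveries endpoint is not shown here, so adjust the field access to match what the API actually returns. `diagnose` and `summarize` are hypothetical helper names.

```python
from collections import Counter
from typing import Iterable


def diagnose(delivery: dict) -> str:
    """Map one failed-delivery record to a likely cause, following the patterns above."""
    status = delivery.get("status_code")
    error = (delivery.get("error_message") or "").lower()
    if status == 0 and "connection refused" in error:
        return "endpoint down"
    if status == 408:
        return "endpoint too slow"
    if status == 401:
        return "signature verification failing"
    if status == 500:
        return "endpoint throwing"
    return "unclassified"


def summarize(deliveries: Iterable[dict]) -> Counter:
    """Count failed deliveries per likely cause."""
    return Counter(diagnose(d) for d in deliveries)
```

Calling `summarize(records).most_common()` on the fetched records lists the dominant failure mode first, which is usually the one to fix.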

Replay when your endpoint was down

If your endpoint was unreachable during an outage, replay the events it missed:

bash
curl -X POST "https://scaigrid.scailabs.ai/v1/webhooks/{webhook_id}/replay?since=2026-04-22T08:00:00Z" \
  -H "Authorization: Bearer $TOKEN"

Events are replayed from the event bus retention window (7 days by default). If you need longer retention, consume via a different mechanism (Redis Streams directly, or subscribe a queue you control that persists indefinitely).
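
A small sketch for building the `since` value in the same UTC format as the example above, and for checking it against the default 7-day retention window. The 5-minute safety margin and both function names are assumptions. Starting the replay slightly before the outage means some events arrive twice, which an idempotent handler absorbs.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # default event bus retention window stated above


def replay_since(outage_start: datetime, margin: timedelta = timedelta(minutes=5)) -> str:
    """Format a UTC `since` value that starts slightly before the outage began."""
    if outage_start.tzinfo is None:
        raise ValueError("outage_start must be timezone-aware")
    start = (outage_start - margin).astimezone(timezone.utc)
    return start.strftime("%Y-%m-%dT%H:%M:%SZ")


def within_retention(since: datetime, now: datetime, retention: timedelta = RETENTION) -> bool:
    """True if events from `since` should still be available for replay."""
    return now - since <= retention
```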

Circuit breaker on your side

If your downstream system (CRM, email service, Slack) goes down, don't let it cascade:

python
from circuitbreaker import CircuitBreakerError, circuit

@circuit(failure_threshold=5, recovery_timeout=30)
async def send_to_crm(event):
    await crm_client.post(...)

async def handle_event(event):
    try:
        await send_to_crm(event)
    except CircuitBreakerError:
        # CRM is down; queue for retry later
        await retry_queue.publish(event)

Return 2xx to ScaiGrid regardless — the retry is your problem, not ScaiGrid's.
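
Since the retry is yours to run, here is a minimal sketch of what a consumer of that retry queue might do with each event: retry with capped exponential backoff and give up after a fixed number of attempts. The function names, the 1-second base, the 300-second cap, and the limit of 5 attempts are all illustrative assumptions.

```python
import asyncio
from typing import Awaitable, Callable


def retry_delay_s(attempt: int, base_s: float = 1.0, cap_s: float = 300.0) -> float:
    """Delay after the given 1-based attempt: base_s, 2*base_s, 4*base_s, ... up to cap_s."""
    return min(cap_s, base_s * 2 ** (attempt - 1))


async def retry_event(
    event: dict,
    send: Callable[[dict], Awaitable[None]],
    max_attempts: int = 5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Try send(event) up to max_attempts times; return True on success."""
    for attempt in range(1, max_attempts + 1):
        try:
            await send(event)
            return True
        except Exception:
            if attempt < max_attempts:
                await sleep(retry_delay_s(attempt))
    return False  # caller decides what to do, for example move to a dead-letter queue
```

The `sleep` parameter exists so the schedule can be tested without real waiting.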

Webhook auto-disable

After 50 consecutive failed deliveries across all events to a webhook, ScaiGrid marks it failing and stops sending. You get a webhook.auto_disabled event (to another webhook, if one subscribes) and an admin-UI alert.

Re-enable after fixing:

bash
curl -X PUT "https://scaigrid.scailabs.ai/v1/webhooks/{webhook_id}" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "active"}'

Missed events during the disabled window need replay or alternative recovery.

Monitoring your webhook endpoint

Track:

Delivery success rate (from our delivery history API) — should be > 99%
Your handler latency (your metrics) — p99 under 5 seconds
Queue depth if you're queueing for async work — alert on growth

If your success rate drops, investigate before auto-disable kicks in.
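
The three signals above can be turned into a simple health check. This is a sketch under assumptions: the inputs are counts and samples you have already collected from the delivery history and your own metrics, the function names are hypothetical, the percentile uses the nearest-rank method, and "growth" is defined here as every queue-depth sample being larger than the previous one.

```python
from typing import List, Sequence


def success_rate(succeeded: int, total: int) -> float:
    return succeeded / total if total else 1.0


def p99(latencies_s: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of the latency samples."""
    if not latencies_s:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def is_growing(depths: Sequence[int]) -> bool:
    """True when every sample is larger than the one before it."""
    return len(depths) >= 2 and all(b > a for a, b in zip(depths, depths[1:]))


def check_health(
    succeeded: int,
    total: int,
    latencies_s: Sequence[float],
    queue_depths: Sequence[int],
) -> List[str]:
    """Return one alert message per threshold from the list above that is breached."""
    alerts = []
    if success_rate(succeeded, total) <= 0.99:
        alerts.append("delivery success rate at or below 99%")
    if latencies_s and p99(latencies_s) >= 5.0:
        alerts.append("handler p99 latency at or above 5 seconds")
    if is_growing(queue_depths):
        alerts.append("queue depth growing")
    return alerts
```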