---
title: Webhooks Deep Dive
path: advanced/webhooks-deep-dive
status: published
---

# Webhooks Deep Dive

Advanced patterns for production webhook consumers. For basics, see [Events and Webhooks](../03-core-concepts/06-events-and-webhooks.md) and [Webhooks Reference](../06-reference/08-webhooks.md).

## Idempotency is your responsibility

ScaiGrid delivers events **at-least-once**, not exactly-once. A network blip during your `2xx` response means we retry, and you receive the event twice. Your webhook handler must be idempotent.

Use the `event_id` as a deduplication key:

```python
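def handle_webhook(event):
    # already_processed, transaction, do_work, and mark_processed are
    # placeholders for your own storage and business logic.
    event_id = event["event_id"]
    if already_processed(event_id):
        return  # duplicate; ignore
    # Do the work and record the ID in the same transaction, so a crash
    # between the two steps cannot leave the event half-processed.
    with transaction():
        do_work(event)
        mark_processed(event_id)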
```

Store processed event IDs with a TTL matching your retention needs (typically 7–14 days; after that, duplicates shouldn't arrive).
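
A TTL-backed deduplication store might look like the following minimal in-memory sketch. `TTLDedupStore` and `claim` are hypothetical names, not part of any ScaiGrid SDK. With several handler processes you would more likely keep this state in a shared store (for example a database table with a unique constraint on `event_id`) so that every process sees the same IDs.

```python
import time


class TTLDedupStore:
    """In-memory record of processed event IDs that expire after ttl_s seconds.

    Hypothetical helper for illustration only.
    """

    def __init__(self, ttl_s: float, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock   # injectable so tests can control time
        self._expiry = {}     # event_id -> time at which the entry expires

    def claim(self, event_id: str) -> bool:
        """Return True the first time event_id is seen within the TTL."""
        now = self._clock()
        # Purge expired entries so the dict does not grow without bound.
        for key in [k for k, exp in self._expiry.items() if exp <= now]:
            del self._expiry[key]
        if event_id in self._expiry:
            return False      # duplicate delivery
        self._expiry[event_id] = now + self._ttl_s
        return True


store = TTLDedupStore(ttl_s=14 * 24 * 3600)   # 14 days
print(store.claim("evt_123"))   # first delivery
print(store.claim("evt_123"))   # retry of the same event
```

Note that this sketch records the ID before the work runs. If the work can fail, either remove the ID on failure or keep the transactional pattern shown above, where the work and the marker commit together.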

## Verify signatures before parsing

```python
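import hmac
import json
import os
from hashlib import sha256

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]  # assumed to be set in the environment


def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    # Recompute the HMAC over the exact bytes received, not re-serialized JSON.
    expected = "sha256=" + hmac.new(
        secret.encode(), raw_body, sha256
    ).hexdigest()
    # Constant-time comparison. Comparing bytes avoids a TypeError when the
    # header contains non-ASCII characters.
    return hmac.compare_digest(expected.encode(), header_value.encode())


@app.post("/webhooks")
async def webhook(request: Request):
    body = await request.body()
    sig = request.headers.get("X-ScaiGrid-Signature", "")
    if not verify_signature(body, sig, WEBHOOK_SECRET):
        raise HTTPException(401, "Invalid signature")
    event = json.loads(body)
    # now safe to act on event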
```

Never skip this step. An unverified webhook could come from anyone.
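
To exercise the verifier locally, you can sign a test body yourself. The sketch below assumes the same scheme as the handler above (`sha256=` followed by the hex HMAC-SHA256 of the raw body); `sign_payload` is a hypothetical test helper, and the secret is a throwaway value.

```python
import hmac
from hashlib import sha256


def sign_payload(raw_body: bytes, secret: str) -> str:
    """Build a header value in the `sha256=<hex digest>` form assumed above."""
    return "sha256=" + hmac.new(secret.encode(), raw_body, sha256).hexdigest()


def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    expected = sign_payload(raw_body, secret)
    return hmac.compare_digest(expected.encode(), header_value.encode())


body = b'{"event_id": "evt_123"}'
signature = sign_payload(body, "test-secret")

print(verify_signature(body, signature, "test-secret"))          # untouched body
print(verify_signature(body + b" ", signature, "test-secret"))   # one extra byte
print(verify_signature(body, signature, "other-secret"))         # wrong secret
```

The second call shows why the signature has to be checked against the raw bytes: even a whitespace change, such as one introduced by parsing and re-serializing the JSON, produces a different digest.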

## Replay protection

Beyond signature verification, check the timestamp to prevent replay attacks:

```python
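from datetime import datetime, timezone


def check_timestamp(header_value: str, max_age_s: int = 300) -> bool:
    try:
        ts = datetime.fromtimestamp(int(header_value), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        return False  # missing or malformed timestamp
    now = datetime.now(tz=timezone.utc)
    # abs() also rejects timestamps too far in the future (clock skew or forgery).
    return abs((now - ts).total_seconds()) <= max_age_s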
```

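Reject events whose timestamp is more than 5 minutes away from the current time, so an attacker who replays a validly signed event hours later is blocked. The check only helps if the timestamp is covered by the signature; otherwise an attacker can rewrite it. Confirm in the [Webhooks Reference](../06-reference/08-webhooks.md) how the timestamp is bound to the signature.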

## Fast ack, async work

Your webhook handler must return `2xx` within 10 seconds. Don't do expensive work inline:

```python
# BAD — holding the connection during long work
@app.post("/webhooks")
async def webhook(request: Request):
    event = verify_and_parse(request)
    await send_emails(event)      # could take 30 seconds
    await update_crm(event)       # could take another 20
    return {"ok": True}

# GOOD — ack immediately, queue for async processing
@app.post("/webhooks")
async def webhook(request: Request):
    event = verify_and_parse(request)
    await queue.publish(event)    # Redis, ScaiQueue, SQS, whatever
    return {"ok": True}
```

A worker pulls from the queue and does the real work. Your webhook endpoint stays fast and never times out.
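
The worker side can be sketched as follows. This uses an in-process `asyncio.Queue` purely for illustration; `worker` and `handle` are hypothetical names, and a production setup would use a durable broker so that queued events survive a restart.

```python
import asyncio


async def worker(queue: asyncio.Queue, handle) -> None:
    """Pull events off the queue and run the slow work outside the HTTP path."""
    while True:
        event = await queue.get()
        try:
            await handle(event)
        except Exception as exc:
            # One failed event must not kill the worker; log and continue.
            print(f"failed to process {event.get('event_id')}: {exc}")
        finally:
            queue.task_done()


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    processed = []

    async def handle(event):
        await asyncio.sleep(0.01)   # stand-in for slow work (emails, CRM)
        processed.append(event["event_id"])

    task = asyncio.create_task(worker(queue, handle))
    for i in range(3):
        queue.put_nowait({"event_id": f"evt_{i}"})
    await queue.join()              # wait until every queued event is handled
    task.cancel()
    return processed


processed = asyncio.run(main())
print(processed)
```

Because the worker may see the same event more than once (retries, redeliveries from the broker), the idempotency check belongs in the worker as well as, or instead of, the HTTP handler.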

## Handling burst events

During a model incident, you might get hundreds of `request.failed` events in a minute. Rate-limit your processing to avoid cascading failures:

```python
import asyncio
import time
from collections import deque

RECENT_EVENTS = deque(maxlen=1000)  # timestamps of recently processed events

async def process_with_backoff(event):
    now = time.time()
    # Forget timestamps older than 60 seconds
    while RECENT_EVENTS and now - RECENT_EVENTS[0] > 60:
        RECENT_EVENTS.popleft()
    if len(RECENT_EVENTS) > 100:  # > 100 events/minute
        await asyncio.sleep(0.1)  # slow down
    RECENT_EVENTS.append(now)
    await do_work(event)
```

Or better, use a queue with a controlled-concurrency worker pool.
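
One way to bound concurrency is a semaphore around each handler call. This is a minimal sketch, not a full worker pool; `process_all` and the limit of 5 are illustrative assumptions.

```python
import asyncio


async def process_all(events, handle, max_concurrency: int = 5) -> None:
    """Process events with at most max_concurrency handlers running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(event):
        async with semaphore:
            await handle(event)

    await asyncio.gather(*(bounded(e) for e in events))


async def demo() -> int:
    running = 0
    peak = 0

    async def handle(event):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)   # stand-in for downstream work
        running -= 1

    events = [{"event_id": f"evt_{i}"} for i in range(20)]
    await process_all(events, handle, max_concurrency=5)
    return peak


peak = asyncio.run(demo())
print(f"peak concurrency: {peak}")
```

Unlike the sleep-based throttle, this caps how much load reaches downstream systems no matter how fast events arrive; the backlog waits in the queue instead.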

## Subscribe to specific event types

Don't wildcard-subscribe to everything. Each event type has different semantics, and generic handlers become spaghetti:

```bash
# GOOD — narrow subscriptions per responsibility
POST /v1/webhooks
{
  "url": ".../billing",
  "events": ["budget.soft_limit_reached", "budget.hard_limit_reached"]
}

POST /v1/webhooks
{
  "url": ".../incidents",
  "events": ["request.failed", "webhook.auto_disabled"]
}
```

Separate webhooks isolate failures — your billing endpoint being down doesn't block incident events.

## Delivery inspection

When a webhook mysteriously misses events, check delivery history:

```bash
curl "https://scaigrid.scailabs.ai/v1/webhooks/{webhook_id}/deliveries?status=failed&limit=50" \
  -H "Authorization: Bearer $TOKEN"
```

Look at `status_code`, `error_message`, `duration_ms`. Common patterns:

- `status_code: 408` and high `duration_ms` — your endpoint is too slow; ack faster
- `status_code: 401` — signature verification failing; check your shared secret
- `status_code: 500` — your endpoint is throwing; check your logs
- `status_code: 0, error_message: connection refused` — endpoint is down
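
The patterns above can be turned into a small triage script. This sketch assumes each delivery record is a dict containing the `status_code`, `error_message`, and `duration_ms` fields mentioned above; the exact response shape of the deliveries endpoint is not shown here, so adapt the field access to what the API returns. `diagnose` and `summarize` are hypothetical helpers.

```python
from collections import Counter


def diagnose(delivery: dict) -> str:
    """Map one delivery record to a likely cause, following the patterns above."""
    code = delivery.get("status_code")
    if code == 408:
        return "endpoint too slow; acknowledge faster"
    if code == 401:
        return "signature verification failing; check the shared secret"
    if code == 500:
        return "endpoint throwing; check application logs"
    if code == 0:
        return "endpoint unreachable: " + str(delivery.get("error_message", "unknown error"))
    return f"unclassified (status_code={code})"


def summarize(deliveries: list) -> Counter:
    """Count failed deliveries per likely cause."""
    return Counter(diagnose(d) for d in deliveries)


sample = [
    {"status_code": 408, "error_message": "timeout", "duration_ms": 10000},
    {"status_code": 408, "error_message": "timeout", "duration_ms": 10000},
    {"status_code": 0, "error_message": "connection refused", "duration_ms": 12},
]
for cause, count in summarize(sample).most_common():
    print(count, cause)
```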

## Replay when endpoint was down

If your endpoint was unreachable during an outage:

```bash
curl -X POST "https://scaigrid.scailabs.ai/v1/webhooks/{webhook_id}/replay?since=2026-04-22T08:00:00Z" \
  -H "Authorization: Bearer $TOKEN"
```

Events are replayed from the event bus retention window (7 days by default). If you need longer retention, consume via a different mechanism (Redis Streams directly, or subscribe a queue you control that persists indefinitely).
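
When scripting a replay, it can help to clamp the `since` value to the retention window so the request does not ask for events that no longer exist. A minimal sketch, assuming the 7-day default stated above; `replay_since` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def replay_since(outage_start: datetime, now: datetime,
                 retention: timedelta = timedelta(days=7)) -> str:
    """Return a `since` value for a replay, clamped to the retention window.

    Both datetimes must be timezone-aware. The 7-day default mirrors the
    default retention described above; adjust it if your deployment differs.
    """
    earliest = now - retention
    start = max(outage_start, earliest)
    return start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


now = datetime(2026, 4, 23, 12, 0, tzinfo=timezone.utc)

# Outage inside the window: use its start time as-is.
print(replay_since(datetime(2026, 4, 22, 8, 0, tzinfo=timezone.utc), now))
# 2026-04-22T08:00:00Z

# Outage older than the window: clamped to now minus 7 days.
print(replay_since(datetime(2026, 4, 1, 0, 0, tzinfo=timezone.utc), now))
# 2026-04-16T12:00:00Z
```

If the value gets clamped, events between the outage start and the clamped time are no longer available for replay and need another recovery path.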

## Circuit breaker on your side

If your downstream system (CRM, email service, Slack) goes down, don't let it cascade:

```python
from circuitbreaker import CircuitBreakerError, circuit

@circuit(failure_threshold=5, recovery_timeout=30)
async def send_to_crm(event):
    await crm_client.post(...)

async def handle_event(event):
    try:
        await send_to_crm(event)
    except CircuitBreakerError:
        # CRM is down; queue for retry later
        await retry_queue.publish(event)
```

Return `2xx` to ScaiGrid regardless — the retry is your problem, not ScaiGrid's.
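
Since the retry is yours to schedule, the consumer of the retry queue needs a delay policy. One common choice is exponential backoff with full jitter, sketched below. `backoff_delay`, the 1-second base, and the 300-second cap are illustrative assumptions, not ScaiGrid behavior.

```python
import random


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0,
                  rng=random.random) -> float:
    """Full-jitter exponential backoff.

    Returns a delay drawn uniformly from [0, min(cap_s, base_s * 2**attempt)).
    `rng` is injectable so tests can make the result deterministic.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng() * ceiling


for attempt in range(5):
    print(f"attempt {attempt}: wait up to {min(300.0, 2 ** attempt):.0f}s, "
          f"chose {backoff_delay(attempt):.2f}s")
```

The jitter spreads retries out so that a backlog built up during a CRM outage does not hit the recovering service all at once.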

## Webhook auto-disable

After 50 consecutive failed deliveries across all events to a webhook, ScaiGrid marks it `failing` and stops sending. You get a `webhook.auto_disabled` event (to another webhook, if one subscribes) and an admin-UI alert.

Re-enable after fixing:

```bash
curl -X PUT "https://scaigrid.scailabs.ai/v1/webhooks/{webhook_id}" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "active"}'
```

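Events missed while the webhook was disabled are not redelivered automatically. Recover them with the replay endpoint (see "Replay when endpoint was down") or through an alternative mechanism.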

## Monitoring your webhook endpoint

Track:

- Delivery success rate (from our delivery history API) — should be > 99%
- Your handler latency (your metrics) — p99 under 5 seconds
- Queue depth if you're queueing for async work — alert on growth

If your success rate drops, investigate before auto-disable kicks in.
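
A simple health check over recent delivery records could look like this. It assumes records are dicts with a `status_code` field, ordered oldest first; the alert threshold of 10 trailing failures is an arbitrary illustrative value chosen to fire well before the 50-failure auto-disable limit. All function names are hypothetical.

```python
def is_success(delivery: dict) -> bool:
    return 200 <= delivery["status_code"] < 300


def success_rate(deliveries: list) -> float:
    """Fraction of deliveries that received a 2xx response."""
    if not deliveries:
        return 1.0  # no traffic is not a failure
    return sum(1 for d in deliveries if is_success(d)) / len(deliveries)


def trailing_failures(deliveries: list) -> int:
    """Number of consecutive failed deliveries at the end of the list."""
    count = 0
    for d in reversed(deliveries):
        if is_success(d):
            break
        count += 1
    return count


def should_alert(deliveries: list, min_rate: float = 0.99,
                 max_trailing: int = 10) -> bool:
    """Alert on a low success rate or a run of recent failures."""
    return (success_rate(deliveries) < min_rate
            or trailing_failures(deliveries) >= max_trailing)


history = [{"status_code": 200}] * 95 + [{"status_code": 500}] * 5
print(success_rate(history), trailing_failures(history), should_alert(history))
```

Tracking the trailing run separately matters because a webhook can keep a healthy overall rate while its most recent deliveries are all failing.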

## Related

- [Events and Webhooks (concepts)](../03-core-concepts/06-events-and-webhooks.md)
- [Webhooks Reference](../06-reference/08-webhooks.md)
- [Errors](../03-core-concepts/07-errors.md)
