Webhooks Deep Dive

Advanced patterns for production webhook consumers. For basics, see Events and Webhooks and Webhooks Reference.

Idempotency is your responsibility

ScaiGrid delivers events at-least-once, not exactly-once. A network blip during your 2xx response means we retry, and you receive the event twice. Your webhook handler must be idempotent.

Use the event_id as a deduplication key:

python
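def handle_webhook(event):
    event_id = event["event_id"]
    if already_processed(event_id):
        return  # duplicate; ignore
    # Do the work and record the ID in one transaction, so a crash between
    # the two steps cannot leave the event processed but unrecorded.
    with transaction():
        do_work(event)
        mark_processed(event_id)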

Store processed event IDs with a TTL matching your retention needs (typically 7–14 days; after that, duplicates shouldn't arrive).
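One way to back `already_processed` and `mark_processed` is a table with a unique key on the event ID, so the database itself rejects duplicates. The following is a minimal sketch using SQLite from the standard library; the `ProcessedEvents` class, its `claim` method, and the table name are hypothetical names chosen for illustration, not part of ScaiGrid.

```python
import sqlite3
import time
from typing import Optional


class ProcessedEvents:
    """Remembers event IDs for ttl_s seconds (hypothetical helper)."""

    def __init__(self, conn: sqlite3.Connection, ttl_s: int = 14 * 24 * 3600):
        self.conn = conn
        self.ttl_s = ttl_s
        conn.execute(
            "CREATE TABLE IF NOT EXISTS processed_events ("
            "event_id TEXT PRIMARY KEY, seen_at REAL NOT NULL)"
        )

    def claim(self, event_id: str, now: Optional[float] = None) -> bool:
        """Return True the first time an ID is seen, False for a duplicate."""
        now = time.time() if now is None else now
        with self.conn:  # commits on success, rolls back on exception
            # Expire IDs older than the TTL, then let the primary key
            # decide whether this ID is new.
            self.conn.execute(
                "DELETE FROM processed_events WHERE seen_at < ?",
                (now - self.ttl_s,),
            )
            cur = self.conn.execute(
                "INSERT OR IGNORE INTO processed_events VALUES (?, ?)",
                (event_id, now),
            )
            return cur.rowcount == 1
```

In a real handler you would likely run the claim and `do_work` inside the same database transaction, so that a failure in `do_work` rolls the claim back and the retried delivery is processed normally.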

Verify signatures before parsing

python
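import hmac
import json
import os
from hashlib import sha256

# The handler below looks like FastAPI, so these imports assume it.
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = os.environ["SCAIGRID_WEBHOOK_SECRET"]  # assumed variable name

def verify_signature(raw_body: bytes, header_value: str, secret: str) -> bool:
    expected = "sha256=" + hmac.new(
        secret.encode(), raw_body, sha256
    ).hexdigest()
    # Constant-time comparison on bytes; str arguments must be ASCII-only.
    return hmac.compare_digest(expected.encode(), header_value.encode())

@app.post("/webhooks")
async def webhook(request: Request):
    body = await request.body()  # raw bytes, exactly as they were signed
    sig = request.headers.get("X-ScaiGrid-Signature", "")
    if not verify_signature(body, sig, WEBHOOK_SECRET):
        raise HTTPException(401, "Invalid signature")
    event = json.loads(body)
    # now safe to act on event
    return {"ok": True}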
Never skip this step. An unverified webhook could come from anyone.

Replay protection

Beyond signature verification, check the timestamp to prevent replay attacks:

python
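from datetime import datetime, timezone

def check_timestamp(header_value: str, max_age_s: int = 300) -> bool:
    try:
        ts = datetime.fromtimestamp(int(header_value), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        return False  # missing, malformed, or out-of-range timestamp
    now = datetime.now(tz=timezone.utc)
    # abs() also rejects timestamps too far in the future (clock skew).
    return abs((now - ts).total_seconds()) <= max_age_s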

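Reject events whose timestamp is more than 5 minutes away from your server's clock. An attacker replaying a validly signed event hours later is then blocked. This only helps if the timestamp is covered by the signature, so confirm that in the Webhooks Reference.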

Fast ack, async work

Your webhook handler must return 2xx within 10 seconds. Don't do expensive work inline:

python
# BAD: holding the connection during long work
@app.post("/webhooks")
async def webhook(request: Request):
    event = await verify_and_parse(request)  # reads the body, so it must be awaited
    await send_emails(event)      # could take 30 seconds
    await update_crm(event)       # could take another 20
    return {"ok": True}

# GOOD: ack immediately, queue for async processing
@app.post("/webhooks")
async def webhook(request: Request):
    event = await verify_and_parse(request)
    await queue.publish(event)    # Redis, ScaiQueue, SQS, whatever
    return {"ok": True}

A worker pulls from the queue and does the real work. Your webhook endpoint stays fast and never times out.
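The worker side can be sketched with an in-process `asyncio.Queue` standing in for whichever broker you use. The `worker` and `demo` functions and the `dead_letters` list are hypothetical names for illustration; the point is that a failing event is parked for inspection rather than stopping the loop.

```python
import asyncio


async def worker(queue: asyncio.Queue, handle, dead_letters: list) -> None:
    """Pull events until cancelled; one failing event never kills the loop."""
    while True:
        event = await queue.get()
        try:
            await handle(event)
        except Exception as exc:
            # Park the event for later inspection instead of losing it.
            dead_letters.append((event, repr(exc)))
        finally:
            queue.task_done()


async def demo():
    queue: asyncio.Queue = asyncio.Queue()
    done, dead = [], []

    async def handle(event):
        if event.get("fail"):
            raise RuntimeError("downstream unavailable")
        done.append(event["event_id"])

    task = asyncio.create_task(worker(queue, handle, dead))
    events = (
        {"event_id": "evt_1"},
        {"event_id": "evt_2", "fail": True},
        {"event_id": "evt_3"},
    )
    for event in events:
        queue.put_nowait(event)  # what the webhook endpoint would do
    await queue.join()           # wait until every event has been handled
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return done, dead
```

With a real broker the worker usually runs in a separate process, so a slow or crashed worker cannot affect how quickly the endpoint acknowledges deliveries.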

Handling burst events

During a model incident, you might get hundreds of request.failed events in a minute. Rate-limit your processing to avoid cascading failures:

python
import asyncio
import time
from collections import deque

RECENT_EVENTS = deque(maxlen=1000)  # timestamps of recently processed events

async def process_with_backoff(event):
    now = time.time()
    # Forget timestamps that have left the 60-second window
    while RECENT_EVENTS and now - RECENT_EVENTS[0] > 60:
        RECENT_EVENTS.popleft()
    if len(RECENT_EVENTS) > 100:  # > 100 events/minute
        await asyncio.sleep(0.1)  # slow down
    RECENT_EVENTS.append(now)
    await do_work(event)

Or better, use a queue with a controlled-concurrency worker pool.
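A controlled-concurrency pool can be as small as a fixed number of worker coroutines draining one queue. The sketch below processes a batch with at most `concurrency` handlers in flight; `run_pool` is a hypothetical name, and a long-running service would block on `queue.get()` instead of exiting when the queue is empty.

```python
import asyncio


async def run_pool(events, handle, concurrency: int = 4) -> None:
    """Process events with at most `concurrency` handlers in flight."""
    queue: asyncio.Queue = asyncio.Queue()
    for event in events:
        queue.put_nowait(event)

    async def worker() -> None:
        while True:
            try:
                event = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # batch drained; this worker is finished
            await handle(event)

    # The number of workers is the concurrency limit.
    await asyncio.gather(*(worker() for _ in range(concurrency)))
```

Because the limit is the number of workers, a burst of hundreds of events lengthens the queue instead of multiplying the load on your downstream systems.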

Subscribe to specific event types

Don't wildcard-subscribe to everything. Each event type has different semantics, and generic handlers become spaghetti:

bash
# GOOD — narrow subscriptions per responsibility
POST /v1/webhooks
{
  "url": ".../billing",
  "events": ["budget.soft_limit_reached", "budget.hard_limit_reached"]
}

POST /v1/webhooks
{
  "url": ".../incidents",
  "events": ["request.failed", "webhook.auto_disabled"]
}

Separate webhooks isolate failures — your billing endpoint being down doesn't block incident events.
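Inside a single endpoint, an explicit handler table keeps per-type logic apart and makes unexpected event types visible. This is a sketch under assumptions: the `type` field name in the payload is assumed (check the Webhooks Reference for the actual field), and `on`, `dispatch`, `handle_budget`, and `BILLING_LOG` are hypothetical names.

```python
from typing import Awaitable, Callable, Dict, List

Handler = Callable[[dict], Awaitable[None]]
HANDLERS: Dict[str, Handler] = {}
BILLING_LOG: List[str] = []  # stands in for real billing side effects


def on(*event_types: str):
    """Register a coroutine as the handler for one or more event types."""
    def register(func: Handler) -> Handler:
        for event_type in event_types:
            HANDLERS[event_type] = func
        return func
    return register


@on("budget.soft_limit_reached", "budget.hard_limit_reached")
async def handle_budget(event: dict) -> None:
    BILLING_LOG.append(event["event_id"])


async def dispatch(event: dict) -> bool:
    """Run the matching handler; return False for unregistered types."""
    # "type" is an assumed field name for the event type.
    handler = HANDLERS.get(event.get("type", ""))
    if handler is None:
        return False
    await handler(event)
    return True
```

A `False` return is worth logging: it usually means the subscription is wider than the code that consumes it.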

Delivery inspection

When a webhook mysteriously misses events, check delivery history:

bash
curl "https://scaigrid.scailabs.ai/v1/webhooks/{webhook_id}/deliveries?status=failed&limit=50" \
  -H "Authorization: Bearer $TOKEN"

Look at status_code, error_message, and duration_ms. Common patterns:

  • status_code: 408 and high duration_ms — your endpoint is too slow; ack faster
  • status_code: 401 — signature verification failing; check your shared secret
  • status_code: 500 — your endpoint is throwing; check your logs
  • status_code: 0, error_message: connection refused — endpoint is down
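The patterns above can be turned into a quick triage script over the delivery history. This sketch assumes each delivery record is a dict carrying the `status_code` and `error_message` fields named above; the exact response shape is an assumption, and `diagnose` and `summarize` are hypothetical helpers.

```python
from collections import Counter


def diagnose(delivery: dict) -> str:
    """Map one failed delivery record to a likely cause."""
    status = delivery.get("status_code", 0)
    error = (delivery.get("error_message") or "").lower()
    if status == 0:
        if "connection refused" in error:
            return "endpoint down"
        return "no HTTP response"
    if status == 408:
        return "endpoint too slow"
    if status == 401:
        return "signature verification failing"
    if status == 500:
        return "endpoint throwing"
    return "other"


def summarize(deliveries) -> Counter:
    """Count failed deliveries per likely cause."""
    return Counter(diagnose(d) for d in deliveries)
```

Calling `summarize(records).most_common(1)` on a page of failed deliveries gives the dominant failure mode, which is usually the one to fix first.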

Replay when endpoint was down

If your endpoint was unreachable during an outage:

bash
curl -X POST "https://scaigrid.scailabs.ai/v1/webhooks/{webhook_id}/replay?since=2026-04-22T08:00:00Z" \
  -H "Authorization: Bearer $TOKEN"

Events are replayed from the event bus retention window (7 days by default). If you need longer retention, consume via a different mechanism (Redis Streams directly, or subscribe a queue you control that persists indefinitely).
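If you script replays, it can help to build the `since` value from a timezone-aware datetime and to check it against the retention window before calling the API. The sketch below only constructs the URL; `replay_url` is a hypothetical helper, and the 7-day constant mirrors the default mentioned above, which may differ in your deployment.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://scaigrid.scailabs.ai/v1"
RETENTION = timedelta(days=7)  # default retention window; adjust if yours differs


def replay_url(webhook_id: str, since: datetime,
               now: Optional[datetime] = None) -> str:
    """Build the replay URL, refusing timestamps outside the retention window."""
    now = now or datetime.now(tz=timezone.utc)
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    if now - since > RETENTION:
        raise ValueError("since is outside the retention window")
    # Normalise to UTC and format as an ISO 8601 timestamp with a Z suffix.
    stamp = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{BASE_URL}/webhooks/{webhook_id}/replay?" + urlencode({"since": stamp})
```

`urlencode` percent-encodes the colons in the timestamp, which is equivalent to the unencoded form in the curl example.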

Circuit breaker on your side

If your downstream system (CRM, email service, Slack) goes down, don't let it cascade:

python
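from circuitbreaker import CircuitBreakerError, circuit

@circuit(failure_threshold=5, recovery_timeout=30)
async def send_to_crm(event):
    await crm_client.post(...)

async def handle_event(event):
    try:
        await send_to_crm(event)
    except CircuitBreakerError:
        # Circuit is open: CRM is down; queue for retry later
        await retry_queue.publish(event)
    except Exception:
        # The call itself failed while the circuit was still closed;
        # queue this event for retry as well
        await retry_queue.publish(event)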

Return 2xx to ScaiGrid regardless — the retry is your problem, not ScaiGrid's.
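To make the decorator's behaviour concrete, here is a hand-rolled sketch of the closed, open, and half-open cycle. It is not how the circuitbreaker package is implemented; `SimpleBreaker`, `CircuitOpen`, and the injectable `clock` parameter are names invented for illustration.

```python
import time


class CircuitOpen(Exception):
    """Raised when a call is refused because the circuit is open."""


class SimpleBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    async def call(self, func, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpen()  # fail fast without touching downstream
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = await func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Note that the first failures surface as the downstream's own exception; only once the threshold is reached do callers see `CircuitOpen`. A handler therefore needs to treat both cases as "queue for retry".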

Webhook auto-disable

After 50 consecutive failed deliveries across all events to a webhook, ScaiGrid marks it failing and stops sending. You get a webhook.auto_disabled event (to another webhook, if one subscribes) and an admin-UI alert.

Re-enable after fixing:

bash
curl -X PUT https://scaigrid.scailabs.ai/v1/webhooks/{webhook_id} \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status": "active"}'

Missed events during the disabled window need replay or alternative recovery.

Monitoring your webhook endpoint

Track:

  • Delivery success rate (from our delivery history API) — should be > 99%
  • Your handler latency (your metrics) — p99 under 5 seconds
  • Queue depth if you're queueing for async work — alert on growth

If your success rate drops, investigate before auto-disable kicks in.
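The three signals can be computed from data you already have. The sketch below assumes delivery records are dicts with a `status_code` field, listed oldest first, and that you collect handler latencies in milliseconds yourself; `success_rate`, `p99`, and `trailing_failures` are hypothetical helper names.

```python
def _ok(delivery: dict) -> bool:
    return 200 <= delivery["status_code"] < 300


def success_rate(deliveries) -> float:
    """Fraction of deliveries that received a 2xx response."""
    if not deliveries:
        return 1.0
    return sum(1 for d in deliveries if _ok(d)) / len(deliveries)


def p99(latencies_ms):
    """Nearest-rank 99th percentile of handler latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def trailing_failures(deliveries) -> int:
    """Consecutive failures at the end of a chronological delivery list."""
    count = 0
    for delivery in reversed(deliveries):
        if _ok(delivery):
            break
        count += 1
    return count
```

Alerting when `trailing_failures` passes a small fraction of the 50-failure auto-disable threshold leaves time to fix the endpoint before deliveries stop.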

Updated 2026-05-18 15:01:28, rev 17