Webhooks Deep Dive
Advanced patterns for production webhook consumers. For basics, see Events and Webhooks and Webhooks Reference.
Idempotency is your responsibility
ScaiGrid delivers events at-least-once, not exactly-once. A network blip during your 2xx response means we retry, and you receive the event twice. Your webhook handler must be idempotent.
Use the event_id as a deduplication key:
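A minimal sketch of that deduplication, assuming the parsed event is a dict carrying an event_id field. The SeenEvents store and handle_event function are hypothetical names used only for illustration; in production you would more likely back this with a shared store (a cache with key expiry, or a database table with a unique constraint) so that all handler instances see the same IDs.

```python
import time
from typing import Optional


class SeenEvents:
    """In-memory record of processed event IDs with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._expires_at = {}  # event_id -> expiry time in epoch seconds

    def mark_if_new(self, event_id: str, now: Optional[float] = None) -> bool:
        """Return True the first time an ID is seen within the TTL, False for a duplicate."""
        now = time.time() if now is None else now
        # Drop expired entries so the store does not grow without bound.
        for key in [k for k, exp in self._expires_at.items() if exp <= now]:
            del self._expires_at[key]
        if event_id in self._expires_at:
            return False
        self._expires_at[event_id] = now + self.ttl_seconds
        return True


seen = SeenEvents(ttl_seconds=14 * 24 * 3600)  # 14 days
processed = []  # stand-in for the real side effect


def handle_event(event: dict) -> str:
    # A redelivered event is acknowledged without repeating the work.
    if not seen.mark_if_new(event["event_id"]):
        return "duplicate"
    processed.append(event["event_id"])
    return "processed"
```

This sketch marks the ID before doing the work. If your side effect can fail, consider recording the ID in the same transaction as the side effect so a failed attempt can still be retried.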
Store processed event IDs with a TTL matching your retention needs (typically 7–14 days; after that, duplicates shouldn't arrive).
Verify signatures before parsing
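A minimal sketch of verify-then-parse, assuming the signature is a hex-encoded HMAC-SHA256 of the raw request body computed with your shared secret. That scheme is an assumption, as are the function names here; check Webhooks Reference for the actual algorithm and the header that carries the signature.

```python
import hashlib
import hmac
import json


def verify_signature(raw_body: bytes, signature_hex: str, secret: bytes) -> bool:
    """Check a hex HMAC-SHA256 of the raw body (assumed scheme, see lead-in)."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the inputs differ.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def parse_verified(raw_body: bytes, signature_hex: str, secret: bytes) -> dict:
    """Parse the JSON body only after the signature has been verified."""
    if not verify_signature(raw_body, signature_hex, secret):
        raise PermissionError("invalid webhook signature")
    return json.loads(raw_body)
```

Verify against the exact bytes you received. Re-serializing parsed JSON can change whitespace or key order and produce a different digest.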
Never skip this step. An unverified webhook could come from anyone.
Replay protection
Beyond signature verification, check the timestamp to prevent replay attacks:
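A minimal sketch of the freshness check, assuming the event timestamp is available as Unix epoch seconds and is covered by the signature (otherwise an attacker could simply rewrite it). The is_fresh name and the max_skew allowance for clock drift are hypothetical additions.

```python
import time
from typing import Optional

MAX_AGE_SECONDS = 5 * 60  # events older than 5 minutes are rejected


def is_fresh(
    event_timestamp: float,
    now: Optional[float] = None,
    max_age: float = MAX_AGE_SECONDS,
    max_skew: float = 60.0,
) -> bool:
    """Return True if the event is recent enough to accept."""
    now = time.time() if now is None else now
    age = now - event_timestamp
    # Reject stale events, and also events dated implausibly far in the future.
    return -max_skew <= age <= max_age
```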
Reject events older than 5 minutes. An attacker replaying a valid-signature event hours later is blocked.
Fast ack, async work
Your webhook handler must return 2xx within 10 seconds. Don't do expensive work inline:
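A minimal sketch of the pattern using only the standard library, with the web framework left out. The receive, process, and worker names are hypothetical, and the in-process queue is a stand-in: events in it are lost if the process dies, so a durable queue is the safer choice in production.

```python
import logging
import queue
import threading

work_queue = queue.Queue(maxsize=10_000)
results = []  # stand-in for the outcome of the real work


def receive(event: dict) -> int:
    """Webhook entry point: enqueue and acknowledge immediately."""
    try:
        work_queue.put_nowait(event)
    except queue.Full:
        # Report failure so the sender retries later instead of losing the event.
        return 503
    return 202


def process(event: dict) -> None:
    """Placeholder for the expensive work (database writes, API calls)."""
    results.append(event["event_id"])


def worker() -> None:
    while True:
        event = work_queue.get()
        try:
            process(event)
        except Exception:
            # Keep the worker alive if one event fails.
            logging.exception("failed to process event")
        finally:
            work_queue.task_done()


threading.Thread(target=worker, daemon=True).start()
```

Returning 503 when the queue is full is a design choice in this sketch, not documented ScaiGrid behavior: it trades an extra retry for not dropping the event.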
A worker pulls from the queue and does the real work. Your webhook endpoint stays fast and never times out.
Handling burst events
During a model incident, you might get hundreds of request.failed events in a minute. Rate-limit your processing to avoid cascading failures:
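One way to rate-limit processing is a token bucket, sketched below. The TokenBucket class and its parameters are illustrative choices, not part of ScaiGrid. Events that are not allowed through should be deferred (left in your queue), not dropped.

```python
import time
from typing import Optional


class TokenBucket:
    """Allow short bursts up to capacity, then a steady rate_per_second."""

    def __init__(self, rate_per_second: float, capacity: int, now: Optional[float] = None):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Consume one token if available; return whether processing may proceed."""
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```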
Or better, use a queue with a controlled-concurrency worker pool.
Subscribe to specific event types
Don't wildcard-subscribe to everything. Each event type has different semantics; generic handlers become spaghetti:
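A sketch of how you might keep a subscription plan on your side and guard it against wildcards. The WebhookSubscription type, the validate helper, and the example.com URLs are hypothetical; only the request.failed and webhook.auto_disabled event names come from this page. How the plan is submitted to ScaiGrid is not shown here; see Webhooks Reference for the actual calls.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WebhookSubscription:
    url: str
    event_types: Tuple[str, ...]


def validate(sub: WebhookSubscription) -> None:
    """Reject empty and wildcard subscriptions before they are registered."""
    if not sub.event_types:
        raise ValueError(f"{sub.url}: no event types listed")
    for event_type in sub.event_types:
        if "*" in event_type:
            raise ValueError(f"{sub.url}: wildcard subscription {event_type!r}")


# One endpoint per concern, each with an explicit list of event types.
SUBSCRIPTIONS = [
    WebhookSubscription("https://hooks.example.com/incidents", ("request.failed",)),
    WebhookSubscription("https://hooks.example.com/webhook-health", ("webhook.auto_disabled",)),
]

for subscription in SUBSCRIPTIONS:
    validate(subscription)
```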
Separate webhooks isolate failures — your billing endpoint being down doesn't block incident events.
Delivery inspection
When a webhook mysteriously misses events, check its delivery history.
Look at status_code, error_message, duration_ms. Common patterns:
- status_code: 408 and high duration_ms. Your endpoint is too slow; ack faster.
- status_code: 401. Signature verification is failing; check your shared secret.
- status_code: 500. Your endpoint is throwing; check your logs.
- status_code: 0, error_message: connection refused. Your endpoint is down.
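The patterns above can be turned into a small triage helper. This sketch assumes each delivery record is available as a dict with the status_code and error_message fields named above; the diagnose and summarize functions are hypothetical, and fetching the history itself is not shown.

```python
from collections import Counter


def diagnose(delivery: dict) -> str:
    """Map one delivery record to the likely cause listed above."""
    status = delivery.get("status_code")
    error = (delivery.get("error_message") or "").lower()
    if status == 0 and "connection refused" in error:
        return "endpoint is down"
    if status == 408:
        return "endpoint too slow; ack faster"
    if status == 401:
        return "signature verification failing; check your shared secret"
    if status == 500:
        return "endpoint is throwing; check your logs"
    if status is not None and 200 <= status < 300:
        return "delivered"
    return "unclassified"


def summarize(deliveries: list) -> Counter:
    """Count diagnoses across a batch of delivery records."""
    return Counter(diagnose(d) for d in deliveries)
```

Sorting the summary by count shows which failure mode dominates before you dig into individual deliveries.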
Replay when endpoint was down
If your endpoint was unreachable during an outage, request a replay of the deliveries it missed.
Events are replayed from the event bus retention window (7 days by default). If you need longer retention, consume via a different mechanism (Redis Streams directly, or subscribe a queue you control that persists indefinitely).
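Before requesting a replay, it helps to work out how much of the outage is still inside the retention window. The sketch below is plain date arithmetic under the default 7-day retention stated above; the replayable_window function is a hypothetical helper, not a ScaiGrid API.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

RETENTION = timedelta(days=7)  # default retention window


def replayable_window(
    outage_start: datetime,
    outage_end: datetime,
    now: datetime,
    retention: timedelta = RETENTION,
) -> Optional[Tuple[datetime, datetime]]:
    """Return the part of the outage that can still be replayed, or None."""
    oldest_retained = now - retention
    # Anything before oldest_retained has already aged out.
    start = max(outage_start, oldest_retained)
    if start >= outage_end:
        return None
    return start, outage_end
```

If the returned start is later than the outage start, events from the gap have aged out and need a different recovery path.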
Circuit breaker on your side
If your downstream system (CRM, email service, Slack) goes down, don't let it cascade:
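A minimal circuit breaker sketch. The CircuitBreaker class, its thresholds, and the deliver_downstream wrapper are illustrative, and the retry_later list stands in for whatever durable retry queue you use. The wrapper acknowledges every event and parks the ones it could not deliver.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after failure_threshold consecutive failures; retry after reset_after seconds."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return now - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed trial call


def deliver_downstream(event: dict, send: Callable[[dict], None],
                       breaker: CircuitBreaker, retry_later: list,
                       now: Optional[float] = None) -> int:
    """Try the downstream call; park the event on failure. Always acknowledges with 200."""
    if not breaker.allow(now):
        retry_later.append(event)
        return 200
    try:
        send(event)
    except Exception:
        breaker.record_failure(now)
        retry_later.append(event)
    else:
        breaker.record_success()
    return 200
```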
Return 2xx to ScaiGrid regardless: once you have accepted the event, retrying the downstream call is your responsibility, not ScaiGrid's.
Webhook auto-disable
After 50 consecutive failed deliveries across all events to a webhook, ScaiGrid marks it failing and stops sending. You get a webhook.auto_disabled event (to another webhook, if one subscribes) and an admin-UI alert.
Re-enable the webhook after fixing the underlying problem.
Missed events during the disabled window need replay or alternative recovery.
Monitoring your webhook endpoint
Track:
- Delivery success rate (from our delivery history API) — should be > 99%
- Your handler latency (your metrics) — p99 under 5 seconds
- Queue depth if you're queueing for async work — alert on growth
If your success rate drops, investigate before auto-disable kicks in.
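A sketch of an early-warning check over recent delivery results, assuming you have already pulled the status codes from delivery history in oldest-to-newest order. The delivery_health function is hypothetical, and alerting at half the auto-disable threshold is an arbitrary choice for illustration.

```python
AUTO_DISABLE_THRESHOLD = 50  # consecutive failures before auto-disable
TARGET_SUCCESS_RATE = 0.99   # healthy webhooks stay above this


def delivery_health(status_codes: list) -> dict:
    """Summarize success rate and the current run of consecutive failures."""
    total = len(status_codes)
    successes = sum(1 for code in status_codes if 200 <= code < 300)
    # Count failures back from the newest delivery until the last success.
    trailing_failures = 0
    for code in reversed(status_codes):
        if 200 <= code < 300:
            break
        trailing_failures += 1
    success_rate = successes / total if total else 1.0
    return {
        "success_rate": success_rate,
        "trailing_failures": trailing_failures,
        "alert": (
            success_rate <= TARGET_SUCCESS_RATE
            or trailing_failures >= AUTO_DISABLE_THRESHOLD // 2
        ),
    }
```

Watching the trailing failure count alongside the rate matters because a long healthy history can keep the rate above 99% while the consecutive-failure counter climbs toward 50.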