Never lose a webhook: exponential backoff and DLQ architecture
Webhooks are the nervous system of modern SaaS integrations — but they're inherently unreliable. The sender fires and forgets. The receiver must be up, fast, and correct at the exact moment of delivery. When it isn't, the event is gone.
For an automation platform like ZapRail, a lost webhook is a lost workflow execution. That's an unfulfilled order, a missing invoice, a customer notification that never went out. We can't accept that.
Here's how we built a delivery guarantee layer on top of an inherently best-effort protocol.
The three failure modes
Before designing the retry system, we catalogued the failure modes we needed to handle:
1. Downstream unavailability. The target API (NetSuite, Salesforce, etc.) is temporarily down or rate-limiting. These are transient — a retry after a short delay will usually succeed.
2. Mapping errors. The payload structure doesn't match what the action expects — missing fields, wrong types, schema drift. These won't resolve on their own; they need human attention.
3. Connector misconfiguration. Invalid credentials, expired OAuth tokens, wrong endpoint URLs. Again, not self-healing.
The retry system should handle case 1 automatically and surface cases 2 and 3 as fast as possible so the user can fix them.
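To make the split concrete, here is a minimal sketch of a failure classifier. The transient status codes are the ones listed in the retry queue section below. Everything else is an assumption made for illustration: the `FailureKind` enum, the `classify_failure` name, and the mapping of 401/403/404 to connector misconfiguration are hypothetical, not ZapRail's actual rules.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"          # case 1: safe to retry automatically
    MAPPING = "mapping"              # case 2: payload problem, needs a human
    MISCONFIGURED = "misconfigured"  # case 3: connector problem, needs a human


# Status codes the article treats as transient.
TRANSIENT_STATUSES = frozenset({429, 500, 502, 503, 504})

# Assumption for this sketch: auth failures and unknown endpoints point at
# the connector configuration rather than at the payload.
MISCONFIGURED_STATUSES = frozenset({401, 403, 404})


def classify_failure(status: int) -> FailureKind:
    """Map an HTTP status from the target API to one of the three failure modes."""
    if status in TRANSIENT_STATUSES:
        return FailureKind.TRANSIENT
    if status in MISCONFIGURED_STATUSES:
        return FailureKind.MISCONFIGURED
    # Assumption: any other failure (for example 400 or 422) is a mapping error.
    return FailureKind.MAPPING
```

Only `FailureKind.TRANSIENT` would be handed to the retry queue; the other two kinds would be surfaced to the user straight away.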
The retry queue
ZapRail's retry queue is implemented as a Redis sorted set. When a webhook execution fails with a transient error (HTTP 429, 500, 502, 503, 504), the execution ID is added to the sorted set with a score equal to the next retry timestamp.
A background worker polls the sorted set every 10 seconds, pulls all entries with score ≤ now, and re-dispatches them to the Temporal workflow worker.
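The enqueue and poll steps can be sketched roughly as follows. This is an illustration rather than ZapRail's code: the key name `zaprail:retries` and the function names are hypothetical, and `client` is assumed to be any object exposing the redis-py methods `zadd`, `zrangebyscore`, and `zrem` (for example a `redis.Redis` instance created with `decode_responses=True`).

```python
import time
from typing import Callable, List, Optional

RETRY_KEY = "zaprail:retries"  # hypothetical key name


def schedule_retry(client, execution_id: str, retry_at: float) -> None:
    """Add the execution to the sorted set, scored by its next retry timestamp."""
    client.zadd(RETRY_KEY, {execution_id: retry_at})


def poll_due(client, dispatch: Callable[[str], None],
             now: Optional[float] = None) -> List[str]:
    """Dispatch every execution whose retry time has passed; return their IDs."""
    now = time.time() if now is None else now
    due = client.zrangebyscore(RETRY_KEY, "-inf", now)
    dispatched = []
    for execution_id in due:
        # zrem returns the number of members removed. Dispatching only when
        # it is non-zero stops two workers from re-dispatching the same entry.
        if client.zrem(RETRY_KEY, execution_id):
            dispatch(execution_id)
            dispatched.append(execution_id)
    return dispatched
```

A worker loop would call `poll_due` and then sleep for 10 seconds. Note that this sketch removes the entry before dispatching, so a crash between the two steps would drop the retry; a production worker would need to guard against that.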
The retry schedule uses a truncated, roughly exponential backoff with jitter:
Attempt 1: immediate
Attempt 2: 30 seconds + jitter
Attempt 3: 2 minutes + jitter
Attempt 4: 10 minutes + jitter
Attempt 5: 30 minutes + jitter
The jitter (±20% of the base interval) prevents thundering herd behaviour when many workflows fail simultaneously during a downstream outage.
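As a small sketch of how this schedule could be computed, the function below looks up the base delay for an attempt and applies a uniform ±20% jitter. The names `BASE_DELAYS_SECONDS` and `retry_delay` are hypothetical, and a uniform distribution is an assumption; the article only states the ±20% range.

```python
import random
from typing import Optional

# Base delay before each attempt, in seconds (attempts 1 through 5).
BASE_DELAYS_SECONDS = [0, 30, 120, 600, 1800]
JITTER_FRACTION = 0.20  # ±20% of the base interval


def retry_delay(attempt: int,
                rng: Optional[random.Random] = None) -> Optional[float]:
    """Return the delay in seconds before `attempt`, or None once retries are exhausted."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt > len(BASE_DELAYS_SECONDS):
        return None  # no attempts left; the caller should dead-letter the execution
    base = BASE_DELAYS_SECONDS[attempt - 1]
    source = rng if rng is not None else random
    # Uniform jitter in [-20%, +20%] of the base; attempt 1 stays immediate.
    return base + source.uniform(-JITTER_FRACTION, JITTER_FRACTION) * base
```

With these values, the delay before attempt 2 falls somewhere between 24 and 36 seconds, so executions that failed at the same moment are spread across a 12-second window rather than retried together.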
The dead letter queue
After 5 failed attempts, the execution moves to the dead letter queue — a separate Redis sorted set with a 7-day TTL. The user receives an email alert (via Resend) with a link to the execution detail page, where they can inspect the full error history and manually re-trigger the execution after fixing the underlying cause.
The DLQ is also queryable via the ZapRail API, so enterprise customers can build their own alerting or monitoring pipelines on top of it.
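The hand-off to the DLQ might look something like the sketch below. The article does not say how the 7-day limit is enforced, so this makes an assumption: each entry is scored by the time it was dead-lettered, and entries older than seven days are pruned on write. The key name and function names are hypothetical, and `client` is assumed to expose the redis-py methods `zadd`, `zrangebyscore`, and `zremrangebyscore`.

```python
import time
from typing import List, Optional

MAX_ATTEMPTS = 5
DLQ_KEY = "zaprail:dlq"  # hypothetical key name
DLQ_RETENTION_SECONDS = 7 * 24 * 3600  # 7 days


def should_dead_letter(failed_attempts: int) -> bool:
    """True once an execution has used up all of its attempts."""
    return failed_attempts >= MAX_ATTEMPTS


def move_to_dlq(client, execution_id: str, now: Optional[float] = None) -> None:
    """Record the execution in the DLQ and prune entries past the retention window."""
    now = time.time() if now is None else now
    client.zadd(DLQ_KEY, {execution_id: now})
    # Assumption: retention is enforced by score, since a Redis key TTL
    # would expire the whole sorted set rather than individual members.
    client.zremrangebyscore(DLQ_KEY, "-inf", now - DLQ_RETENTION_SECONDS)


def list_dlq(client, now: Optional[float] = None) -> List[str]:
    """Return dead-lettered execution IDs still inside the retention window."""
    now = time.time() if now is None else now
    return client.zrangebyscore(DLQ_KEY, now - DLQ_RETENTION_SECONDS, "+inf")
```

Sending the email alert and exposing `list_dlq` through an API endpoint are left out of this sketch.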
Webhook signature verification
Reliability isn't just about retries — it's also about trust. We verify the authenticity of every incoming webhook using HMAC-SHA256 signature verification. The sender signs the payload with a shared secret; ZapRail recomputes the signature and rejects payloads that don't match.
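A minimal sketch of the verification step, using only the Python standard library, is shown below. It assumes the signature arrives as a hex-encoded string; the exact header name and encoding vary by sender and are not specified here, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes) -> str:
    """Compute the hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: bytes, received_signature: str) -> bool:
    """Recompute the signature and compare it to the one the sender supplied."""
    expected = sign(secret, payload)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters matched. Encoding to bytes keeps it safe for
    # non-ASCII input in the received value.
    return hmac.compare_digest(expected.encode("utf-8"),
                               received_signature.encode("utf-8"))
```

The signature must be computed over the raw request bytes, before any JSON parsing, because re-serialising the payload can change whitespace or key order and make a valid signature fail.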
For connectors that don't support HMAC (a surprising number of enterprise systems), we fall back to bearer token authentication on the ingress URL.
The result
Since rolling out this system, we've reduced silent workflow failures by 94%. The remaining 6% are genuine non-transient errors (schema mismatches, expired credentials), which now end up in the DLQ with a clear error message rather than disappearing. No more lost events.