How to Build a Webhook Retry System with Exponential Backoff in Node.js

by Fahim

Webhooks fail. All the time. A customer’s server goes down for maintenance, their database locks up, or they suddenly rate-limit your requests. If you don’t have a solid retry strategy, those event payloads are gone forever—and you’ll have angry users demanding to know why their data is missing.

We’re going to build a production-grade webhook retry system in Node.js using BullMQ and Redis. It’ll handle exponential backoff dynamically, catch failures without crashing, and quarantine dead payloads so they don’t clog up the queue.

The Architecture: Why Simple Loops Fail in Production

When I first built a webhook sender, I fell into the classic trap: I wrote a basic try/catch block with a quick loop to retry three times with a one-second delay. That failed spectacularly the first time a customer’s server went down for maintenance. Retrying three times in three seconds is useless if their API is down for ten minutes. Worse, when their server finally comes back online, a backlog of instant retries acts like a self-inflicted DDoS attack.

To do this right, we need an asynchronous queue. When an event happens, we'll push it to the queue and immediately return a 202 Accepted to our client. A background worker handles the actual HTTP POST. If it fails, the worker schedules a retry with exponential backoff (like 1s, 2s, 4s, 8s).

We’re using Redis and BullMQ for this. It keeps our main API fast and ensures we don’t lose jobs if our own server restarts.

Setting Up the Project

Let’s get the boilerplate out of the way. We’ll need Express for our test server, Axios to dispatch the webhooks, BullMQ for the queue, and dotenv to manage our environment variables. Run this in your terminal to set things up:

mkdir webhook-retry-system
cd webhook-retry-system
npm init -y
npm install express bullmq ioredis axios dotenv

Next, toss a .env file in your root directory to store your port and Redis connection details. Here’s what mine looks like:

PORT=3000
REDIS_HOST=127.0.0.1
REDIS_PORT=6379

Make sure you actually have Redis running locally. If you’re on a Mac, running brew install redis and brew services start redis is the easiest way to get it going.

Setting Up the Webhook Queue with BullMQ

BullMQ uses Redis under the hood to manage jobs. I prefer keeping my queue configuration separate from my Express routes so things don’t turn into spaghetti. If you’re planning a larger app, you might want to check out our guide on the best folder structure for your next project. Create a file called queue.js. This is where we’ll configure the Redis connection and export our queue instance:

const { Queue } = require('bullmq');
const IORedis = require('ioredis');
require('dotenv').config();

const connection = new IORedis({
  host: process.env.REDIS_HOST || '127.0.0.1',
  port: process.env.REDIS_PORT || 6379,
  maxRetriesPerRequest: null,
});

const webhookQueue = new Queue('webhookQueue', {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: {
      type: 'exponential',
      delay: 2000, // Initial delay of 2 seconds
    },
    removeOnComplete: true,
    removeOnFail: false,
  },
});

module.exports = { webhookQueue, connection };

Here, we’re telling BullMQ to try each job up to 5 times. The backoff configuration uses an exponential strategy starting at 2000ms (2 seconds). That means if a delivery fails, BullMQ waits 2 seconds before the first retry, 4 seconds for the second, 8 seconds for the third, and so on.
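To see what that schedule adds up to, here is a small sketch of the arithmetic described above: the wait doubles after each failed attempt, starting from the configured delay. The helper name exponentialSchedule is made up for illustration and is not part of BullMQ.

```javascript
// Hypothetical helper (not a BullMQ API): lists the wait before each retry,
// assuming the delay is baseDelay * 2^(failedAttempts - 1).
function exponentialSchedule(attempts, baseDelay) {
  const delays = [];
  // With N attempts there are at most N - 1 retries.
  for (let failed = 1; failed < attempts; failed++) {
    delays.push(baseDelay * Math.pow(2, failed - 1));
  }
  return delays;
}

const delays = exponentialSchedule(5, 2000);
const total = delays.reduce((sum, d) => sum + d, 0);
console.log(delays); // [ 2000, 4000, 8000, 16000 ]
console.log(`Total wait before giving up: ${total / 1000}s`); // 30s
```

With these defaults a job gives up after about 30 seconds of waiting, which is nowhere near enough to ride out the ten-minute outage from earlier. That is one reason to tune the numbers yourself.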

Writing Custom Exponential Backoff Rules

BullMQ’s built-in backoff is fine, but in production, you usually want custom rules. For instance, you don’t want retries stretching out over days, so you cap the maximum delay at 1 hour. You also want to add “jitter” (a bit of randomness) to prevent a “thundering herd” where hundreds of failed jobs retry at the exact same millisecond. We can register a custom backoff strategy directly in our worker. Create worker.js and let’s write the worker logic:

const { Worker } = require('bullmq');
const axios = require('axios');
const { connection } = require('./queue');

const customBackoffStrategy = (attemptsMade, type, err, job) => {
  // Any other custom backoff type falls back to a flat 1-second delay
  if (type !== 'customExponential') {
    return 1000;
  }

  const initialDelay = 1000; // 1 second
  const maxDelay = 3600000; // 1 hour

  // Calculate delay: initialDelay * 2^(attemptsMade - 1)
  let delay = initialDelay * Math.pow(2, attemptsMade - 1);

  // Add some jitter (randomness between 0 and 500ms) to avoid synchronized retries
  const jitter = Math.floor(Math.random() * 500);
  delay = Math.min(delay + jitter, maxDelay);

  return delay;
};

const worker = new Worker(
  'webhookQueue',
  async (job) => {
    const { url, payload, headers } = job.data;
    console.log(`[Job ${job.id}] Attempting to send webhook to ${url} (Attempt ${job.attemptsMade + 1})`);

    await axios.post(url, payload, {
      headers: headers || { 'Content-Type': 'application/json' },
      timeout: 5000, // Timeout after 5 seconds
    });

    console.log(`[Job ${job.id}] Webhook delivered successfully!`);
  },
  {
    connection,
    settings: {
      // BullMQ 3 and later take a single backoffStrategy function that receives
      // the backoff type; older releases used a backoffStrategies map instead.
      backoffStrategy: customBackoffStrategy,
    },
  }
);

worker.on('failed', (job, err) => {
  if (job.attemptsMade >= job.opts.attempts) {
    console.error(`[Job ${job.id}] Permanently failed after ${job.attemptsMade} attempts. Error: ${err.message}`);
  } else {
    console.warn(`[Job ${job.id}] Failed attempt. Retrying in background... Error: ${err.message}`);
  }
});

module.exports = worker;

This worker does the heavy lifting. If Axios throws an error (which it does for any non-2xx response), BullMQ catches it. It then looks up our customExponential strategy, calculates the next delay with some added jitter, and schedules the job to run again later. This simple math keeps your system from hammering a recovering server.
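If you want to sanity-check the numbers without waiting on real retries, the delay math can be pulled out into a plain function. This is a sketch with the same constants as the worker; the name cappedBackoff and the injectable random parameter are my own additions for illustration, not BullMQ features.

```javascript
// Same arithmetic as customBackoffStrategy, but with the random source passed in
// so the result can be checked deterministically. `random` is an illustrative
// parameter, not something BullMQ provides.
function cappedBackoff(attemptsMade, random = Math.random) {
  const initialDelay = 1000; // 1 second
  const maxDelay = 3600000;  // 1 hour
  const base = initialDelay * Math.pow(2, attemptsMade - 1);
  const jitter = Math.floor(random() * 500); // 0 to 499 ms
  return Math.min(base + jitter, maxDelay);
}

const noJitter = () => 0;
for (const attempt of [1, 2, 3, 12, 13]) {
  console.log(`after failure ${attempt}: ${cappedBackoff(attempt, noJitter)} ms`);
}
```

Run it and you'll see the 1-hour cap only kicks in at the 13th failed attempt. With attempts set to 5 it never comes into play, so the cap is a safety net for when you raise the attempt count later.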

Building the Express Server and Test Endpoints

Now we need an API endpoint to receive incoming webhook requests and push them onto the queue. To test this properly without waiting for a real server to crash, we’ll also build a mock receiver endpoint that randomly fails. Let’s create server.js to tie the queue, worker, and routes together:

const express = require('express');
const { webhookQueue } = require('./queue');
require('./worker'); // Start the worker background process

const app = express();
app.use(express.json());

// Endpoint to enqueue a new webhook dispatch
app.post('/api/send-webhook', async (req, res) => {
  const { url, payload } = req.body;

  if (!url || !payload) {
    return res.status(400).json({ error: 'Missing url or payload' });
  }

  try {
    const job = await webhookQueue.add(
      'sendWebhook',
      { url, payload },
      { attempts: 5, backoff: { type: 'customExponential' } }
    );
    return res.status(202).json({ message: 'Webhook accepted and queued', jobId: job.id });
  } catch (error) {
    console.error('Failed to queue webhook job:', error);
    return res.status(500).json({ error: 'Internal server error' });
  }
});

// A mock endpoint that fails 80% of the time to test our retry system
app.post('/mock-receiver', (req, res) => {
  const shouldFail = Math.random() < 0.8;

  // Response bodies are placeholders; the retry logic only cares about the status code
  if (shouldFail) {
    console.log('--- Mock Receiver: Simulating 500 Internal Server Error ---');
    return res.status(500).json({ error: 'Internal Server Error' });
  }

  console.log('--- Mock Receiver: Success! 200 OK received ---');
  return res.status(200).json({ received: true });
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});

This server gives us two endpoints. The /api/send-webhook route is what your app calls to trigger a webhook. It takes the target URL and payload, dumps it into BullMQ, and instantly returns a 202 Accepted status. The second endpoint, /mock-receiver, simulates a flaky third-party API that fails 80% of the time so we can watch our retries in action.

Testing the Retry System in Real Time

Let’s fire up the server and see if this actually works. Start the server in your terminal:

node server.js

Now, open a second terminal tab and trigger a webhook using curl. We’ll point the payload to our flaky /mock-receiver endpoint:

curl -X POST http://localhost:3000/api/send-webhook \
  -H "Content-Type: application/json" \
  -d '{"url": "http://localhost:3000/mock-receiver", "payload": {"event": "order.created", "id": 99283}}'

You’ll instantly get a JSON response with a job ID back. Over in your server logs, you’ll see the retry system spring to life. Here’s what my console output looked like when the mock receiver kept failing:

Server running on port 3000
[Job 1] Attempting to send webhook to http://localhost:3000/mock-receiver (Attempt 1)
--- Mock Receiver: Simulating 500 Internal Server Error ---
[Job 1] Failed attempt. Retrying in background... Error: Request failed with status code 500
[Job 1] Attempting to send webhook to http://localhost:3000/mock-receiver (Attempt 2)
--- Mock Receiver: Simulating 500 Internal Server Error ---
[Job 1] Failed attempt. Retrying in background... Error: Request failed with status code 500
[Job 1] Attempting to send webhook to http://localhost:3000/mock-receiver (Attempt 3)
--- Mock Receiver: Simulating 500 Internal Server Error ---
[Job 1] Failed attempt. Retrying in background... Error: Request failed with status code 500
[Job 1] Attempting to send webhook to http://localhost:3000/mock-receiver (Attempt 4)
--- Mock Receiver: Success! 200 OK received ---
[Job 1] Webhook delivered successfully!

Watch the pauses between attempts as the logs scroll by. The output above doesn’t print timestamps, but the gaps are easy to spot: roughly 1 second, then 2, then 4, plus a little jitter. This gives the receiving server some breathing room to recover instead of getting slammed with constant requests.

Handling Poison Pills and Dead Letter Queues (DLQ)

What happens if all 5 attempts fail? The job is marked as failed. Because we set removeOnFail: false, it stays in BullMQ’s failed set, but nobody is going to notice it there unless they go looking, and to your customer that payload is as good as gone. In production, you need a Dead Letter Queue (DLQ) or a database log of these permanent failures so you can manually review or replay them later. We can hook into the worker’s failed event to push these dead jobs to a separate Redis list. Here’s how I set up a basic DLQ in worker.js:

const IORedis = require('ioredis');
const dlqConnection = new IORedis({
  host: process.env.REDIS_HOST || '127.0.0.1',
  port: process.env.REDIS_PORT || 6379,
});

worker.on('failed', async (job, err) => {
  if (job.attemptsMade >= job.opts.attempts) {
    const deadPayload = {
      jobId: job.id,
      url: job.data.url,
      payload: job.data.payload,
      failedAt: new Date().toISOString(),
      error: err.message,
    };

    // Push the failed payload to a Redis list named 'webhook:dlq'
    await dlqConnection.rpush('webhook:dlq', JSON.stringify(deadPayload));
    console.error(`[DLQ] Job ${job.id} moved to Dead Letter Queue.`);
  }
});

With this in place, you can easily build a simple admin dashboard or run a CLI script to inspect the webhook:dlq list. When your customer finally fixes their server, your support team can manually trigger a replay for those failed payloads.
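A replay script could look roughly like the sketch below. It pops entries off webhook:dlq and re-adds them to the queue. The function name replayDeadLetters and the limit parameter are assumptions for illustration. It expects redis to expose lpop (as ioredis does) and queue to expose add (as a BullMQ Queue does); the stand-ins at the bottom exist only so the snippet runs without a Redis server.

```javascript
// Sketch only: `redis` needs an lpop(key) method and `queue` an add(name, data, opts)
// method. In the real app these would be the ioredis connection and webhookQueue.
async function replayDeadLetters(redis, queue, limit = 100) {
  let replayed = 0;
  while (replayed < limit) {
    const raw = await redis.lpop('webhook:dlq');
    if (raw === null) break; // list is empty
    const { url, payload } = JSON.parse(raw);
    await queue.add('sendWebhook', { url, payload }, {
      attempts: 5,
      backoff: { type: 'customExponential' },
    });
    replayed++;
  }
  return replayed;
}

// In-memory stand-ins so the sketch can be run without Redis.
const fakeList = [
  JSON.stringify({
    jobId: '1',
    url: 'http://localhost:3000/mock-receiver',
    payload: { event: 'order.created' },
  }),
];
const fakeRedis = { lpop: async () => (fakeList.length ? fakeList.shift() : null) };
const added = [];
const fakeQueue = { add: async (name, data, opts) => { added.push({ name, data, opts }); } };

replayDeadLetters(fakeRedis, fakeQueue).then((count) => {
  console.log(`Replayed ${count} dead-lettered webhook(s)`);
});
```

One caveat with this shape: the entry is popped before it is re-queued, so if queue.add throws, that payload is lost. If that worries you, read the entry first and only remove it from the list once the add has succeeded.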

Preventing Duplicate Deliveries with Idempotency

Here’s a major edge case: what if the receiving server actually processed your webhook, but timed out right before sending back the 200 OK? If you retry, you’re going to send a duplicate payload. To prevent this, always attach a unique event ID in your headers (like X-Webhook-Event-Id). The receiver should store this ID in a fast cache like Redis and check it before processing any incoming payload. If you’re building the receiver side of this equation, check out our guide on how to build a Redis-backed idempotency middleware for Express to handle duplicates gracefully.
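Here is a minimal sketch of both halves of that idea. The helper names buildWebhookJob and isFirstDelivery, the webhook:seen: key prefix, and the 24-hour TTL are all assumptions for illustration. The receiver check assumes a store that behaves like ioredis, where set(key, value, 'EX', ttl, 'NX') resolves to 'OK' only when the key did not exist yet; an in-memory stand-in is used here so the snippet runs on its own.

```javascript
const { randomUUID } = require('node:crypto');

// Sender side: generate the ID once, when the job is created, so every retry
// of the same job carries the same X-Webhook-Event-Id.
function buildWebhookJob(url, payload) {
  return {
    url,
    payload,
    headers: {
      'Content-Type': 'application/json',
      'X-Webhook-Event-Id': randomUUID(),
    },
  };
}

// Receiver side: returns true the first time an ID is seen, false on repeats.
// `store` is assumed to behave like ioredis: set(key, value, 'EX', ttl, 'NX')
// resolves to 'OK' when the key was created and null when it already existed.
async function isFirstDelivery(store, eventId, ttlSeconds = 86400) {
  const result = await store.set(`webhook:seen:${eventId}`, '1', 'EX', ttlSeconds, 'NX');
  return result === 'OK';
}

// In-memory stand-in for Redis so the sketch runs on its own (it ignores the TTL).
const seen = new Set();
const fakeStore = {
  set: async (key) => (seen.has(key) ? null : (seen.add(key), 'OK')),
};

const job = buildWebhookJob('http://localhost:3000/mock-receiver', { event: 'order.created' });
const id = job.headers['X-Webhook-Event-Id'];
isFirstDelivery(fakeStore, id)
  .then((first) => {
    console.log('first delivery processed:', first);
    return isFirstDelivery(fakeStore, id);
  })
  .then((second) => console.log('retry treated as new:', second));
```

The object returned by buildWebhookJob has the same url, payload, and headers fields the worker reads from job.data, so it could be passed to webhookQueue.add as the job data. The important part is that the ID is created at enqueue time, not inside the worker, or every retry would get a fresh one.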

Frequently Asked Questions

Why not just use setTimeout for retries?

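Because if your Node.js process crashes or restarts, every single pending setTimeout is wiped from memory. With BullMQ, pending jobs live in Redis rather than in your Node.js process, so they survive app crashes and restarts. Just make sure Redis persistence (RDB snapshots or AOF) is turned on if jobs also need to survive a Redis restart.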

How many times should I retry?

A solid production standard is 5 to 10 retries spread over 24 hours. Start fast (like 15 seconds) and scale up to a maximum delay of 2 hours between attempts.

How do I handle rate limits on the receiver’s end?

If they return a 429 Too Many Requests status, look for a Retry-After header and dynamically adjust your next delay to match it. To manage heavy outgoing traffic safely, check out our guide on how to handle external API rate limits with BullMQ and Redis.
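Retry-After comes in two flavors, a number of seconds or an HTTP date, so it needs a little parsing before you can use it as a delay. Here is a sketch of that step. The function name retryAfterToMs is hypothetical, and returning null for a missing or unparseable header is my own choice so the caller can fall back to the normal backoff.

```javascript
// Hypothetical helper: converts a Retry-After header value into milliseconds.
// The header is either a whole number of seconds or an HTTP date.
function retryAfterToMs(headerValue, now = Date.now()) {
  if (headerValue === undefined || headerValue === null) return null;
  const value = String(headerValue).trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  const date = Date.parse(value);
  if (Number.isNaN(date)) return null;
  return Math.max(date - now, 0); // a date in the past means "retry now"
}

console.log(retryAfterToMs('120')); // 120000
const now = Date.parse('Wed, 21 Oct 2015 07:28:00 GMT');
console.log(retryAfterToMs('Wed, 21 Oct 2015 07:28:30 GMT', now)); // 30000
console.log(retryAfterToMs('soon')); // null
```

One place this could plug in is the custom backoff strategy, which receives the error as its third argument. Assuming the Axios error still carries the response there, you would read err.response.headers['retry-after'], pass it through this helper, and use the result instead of the computed delay whenever it isn't null.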

Should I encrypt payloads in Redis?

If you’re dealing with sensitive data (like PII or payment details), absolutely. Encrypt the payload before pushing it to the queue, and decrypt it inside the worker right before sending.
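As a rough sketch of what that could look like, here is an encrypt/decrypt pair built on Node’s crypto module with AES-256-GCM. The function names and the shape of the encrypted object are assumptions for illustration, and the key is generated on the spot only so the example runs; a real key would be loaded from a secret store or environment variable.

```javascript
const crypto = require('node:crypto');

// Illustrative only: in production the 32-byte key would come from a secret
// manager or environment variable, not be generated at startup.
const key = crypto.randomBytes(32);

function encryptPayload(payload, key) {
  const iv = crypto.randomBytes(12); // fresh 96-bit nonce per message
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(JSON.stringify(payload), 'utf8'),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag();
  // Job data must be JSON-serializable, so everything is base64-encoded.
  return {
    iv: iv.toString('base64'),
    tag: tag.toString('base64'),
    data: ciphertext.toString('base64'),
  };
}

function decryptPayload(encrypted, key) {
  const iv = Buffer.from(encrypted.iv, 'base64');
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(Buffer.from(encrypted.tag, 'base64'));
  const plaintext = Buffer.concat([
    decipher.update(Buffer.from(encrypted.data, 'base64')),
    decipher.final(), // throws if the data or tag was tampered with
  ]);
  return JSON.parse(plaintext.toString('utf8'));
}

const sealed = encryptPayload({ event: 'order.created', id: 99283 }, key);
console.log(decryptPayload(sealed, key)); // { event: 'order.created', id: 99283 }
```

In this setup you would call encryptPayload before webhookQueue.add and decryptPayload at the top of the worker, so only ciphertext ever sits in Redis. Keep in mind the DLQ handler copies job.data.payload too, so dead-lettered entries stay encrypted as long as you store them as-is.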

Next Steps

Now that you have a bulletproof retry system, you should secure your webhook receiver to make sure only authorized sources can hit your endpoints. Check out our guide on how to build a secure GoHighLevel webhook listener with Node.js to learn how to verify webhook signatures and block malicious payloads.
