Webhook Design: Retries, Signatures, and Replay Protection

Sign requests. Dedupe by event id. Apply idempotently by resource id. Ack fast, process async. Tolerate out-of-order. Five concerns that turn a webhook into critical infrastructure.

By @oliviadelgado

December 29, 2025 · Updated May 18, 2026

The webhook receiver I inherited at one job had three properties I would now consider unshippable. It accepted any POST to /webhooks/payment. It returned 200 OK instantly without persisting anything. It had no signature verification. The team's reasoning sounded sensible at the time: "the sender is trusted, we just need to react fast". It held up until the day the payments provider had a routing bug and replayed three days of events in fifteen minutes. We processed every charge twice, refunded customers in confused triplicate, and spent a week reconciling.

Webhooks are easy to bolt on and hard to do correctly. Every webhook receiver I have shipped since that day handles five concerns: signature verification, replay protection, idempotency, fast acknowledgment with asynchronous processing, and ordering. My stance: if your webhook receiver does not handle all five, it is not production-ready, regardless of how well it works on a happy day. The cost of getting any one of them wrong is not "an occasional bug" but "data corruption that takes a quarter to clean up".

The webhook contract, in one line

A webhook is "the sender promises to POST you a JSON event when something happens, eventually, possibly more than once, possibly out of order, possibly with the original payload tampered with by a network attacker". Each of those qualifiers is a thing your receiver has to handle. Build for the worst behavior the contract allows, not the best.

What the sender promises (loosely)
  - Eventually deliver each event at least once
  - Sign the body with a secret you both share
  - Include event metadata (id, type, created_at)
  - Retry on failure for some bounded period

What the sender does NOT promise
  - Exactly once
  - In-order delivery
  - That you will not be replayed days later
  - That every retry has the same payload (idempotency keys help here, see below)

That contract shape is true for Stripe, GitHub, Slack, Twilio, and most webhook providers I have integrated. The exact retry policy and signature scheme differ, but the shape is universal.
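To make the metadata in that contract concrete, here is a minimal sketch of an event envelope and a defensive parser. The field names follow the list above, but the exact shape, the Unix-seconds assumption for created_at, and the parseEnvelope helper itself are assumptions for illustration; real providers name and nest these fields differently.

```typescript
// Hypothetical envelope; real providers name and nest these fields differently.
interface WebhookEnvelope {
    id: string;          // unique per event, used for deduplication
    type: string;        // e.g. "payment.succeeded"
    created_at: number;  // assumed here to be Unix seconds
    data: unknown;       // provider-specific payload, validated later
}

// Returns the envelope if the metadata is present and well-typed, otherwise null.
function parseEnvelope(rawBody: string): WebhookEnvelope | null {
    let parsed: unknown;
    try {
        parsed = JSON.parse(rawBody);
    } catch {
        return null; // not JSON at all
    }
    if (typeof parsed !== 'object' || parsed === null) return null;
    const obj = parsed as Record<string, unknown>;
    const id = obj['id'];
    const type = obj['type'];
    const createdAt = obj['created_at'];
    if (typeof id !== 'string' || id.length === 0) return null;
    if (typeof type !== 'string') return null;
    if (typeof createdAt !== 'number') return null;
    return { id, type, created_at: createdAt, data: obj['data'] };
}
```

Call this only after the signature has been verified against the raw body; parsing is for routing and deduplication, not for trust.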

Concern 1: signature verification

If your endpoint is on the public internet, anyone can POST to it. Without signature verification, any attacker can fabricate events. The fix is HMAC over the raw request body (and usually a timestamp), with a secret you share with the sender out of band.

The general shape (Stripe-style):

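import crypto from 'node:crypto';

const TOLERANCE_SECONDS = 300; // 5-minute window, in the past or the future

function verifyWebhook(rawBody: string, header: string, secret: string): boolean {
    // Header looks like "t=1700000000,v1=<hex signature>"
    const parts: Record<string, string> = {};
    for (const pair of header.split(',')) {
        const idx = pair.indexOf('=');
        if (idx > 0) parts[pair.slice(0, idx).trim()] = pair.slice(idx + 1).trim();
    }
    const timestamp = Number(parts.t);
    const expectedSig = parts.v1;
    if (!Number.isFinite(timestamp) || !expectedSig) return false;

    const signedPayload = `${parts.t}.${rawBody}`;
    const computedSig = crypto.createHmac('sha256', secret)
        .update(signedPayload)
        .digest('hex');

    // timingSafeEqual throws on a length mismatch, so compare lengths first
    const expected = Buffer.from(expectedSig);
    const computed = Buffer.from(computedSig);
    if (expected.length !== computed.length || !crypto.timingSafeEqual(expected, computed)) {
        return false;
    }

    // Reject events outside the window in either direction (replay, clock skew)
    const ageSeconds = Date.now() / 1000 - timestamp;
    return Math.abs(ageSeconds) <= TOLERANCE_SECONDS;
}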

The four things to get right:

  1. Verify against the raw body, not the parsed JSON. Most signature schemes hash the bytes the sender actually sent, before any framework re-serialization. In Express, this means using a raw body parser on the webhook route. In Next.js API routes, read the buffer before any JSON parse.
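  2. Use timingSafeEqual (constant-time comparison). A naive === between strings can leak the expected signature one byte at a time through timing differences, which lets an attacker forge a valid signature without ever learning the secret. The attack is real and well-documented. Note that timingSafeEqual throws if the two buffers differ in length, so check the lengths first.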
  3. Include the timestamp in the signed payload. Without it, a captured request can be replayed against your endpoint forever. The timestamp window (commonly 5 minutes) bounds replay attacks.
  4. Reject any event whose timestamp is too old or too far in the future. Future-dated events are usually a clock-skew bug; very old events are usually replay attempts.
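Point 1 is the one that bites most often, so here is a small demonstration of why the raw body matters. It is a sketch under assumptions: the sign helper is a hypothetical test-side function that mimics the Stripe-style "timestamp.body" scheme above, and the secret is a placeholder. The plain === at the end is only there to compare two locally computed values in a demo; it is not how you should check an incoming signature.

```typescript
import { createHmac } from 'node:crypto';

// Test-side helper (hypothetical): signs "<timestamp>.<body>" the same way
// the Stripe-style verifier above expects.
function sign(body: string, timestamp: number, secret: string): string {
    return createHmac('sha256', secret).update(`${timestamp}.${body}`).digest('hex');
}

const secret = 'whsec_test_only'; // placeholder, not a real secret
const timestamp = 1700000000;

// The bytes the sender actually sent, including its own whitespace.
const rawBody = '{ "id": "evt_abc123",  "amount": 100 }';
// What you get back if a framework parses and re-serializes the body.
const reserialized = JSON.stringify(JSON.parse(rawBody));

const sentSignature = sign(rawBody, timestamp, secret);
const rawMatches = sign(rawBody, timestamp, secret) === sentSignature;
const reserializedMatches = sign(reserialized, timestamp, secret) === sentSignature;

// The raw body reproduces the signature; the re-serialized one does not,
// because the whitespace the sender used is gone.
console.log({ rawMatches, reserializedMatches });
```

The same helper is handy for the test plan later in this article: it lets you craft valid, stale, and tampered requests against your own endpoint.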

The worst sin in webhook receivers is to log the secret. I have seen this in two production codebases. The secret ends up in error tracker logs along with the request body. Do not log the secret. Mask it like any other credential.

Concern 2: replay protection beyond the timestamp

The timestamp window in step 4 above protects against a replay an hour later. It does not protect against a replay one minute later. A network attacker who captures a webhook can replay it within the window before your endpoint expires it.

The defense is to track every event id you have seen. The sender includes a unique id (evt_abc123); you store it the moment you accept the event:

CREATE TABLE webhook_events_seen (
    event_id    TEXT PRIMARY KEY,
    received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed   BOOLEAN NOT NULL DEFAULT false
);

The processing flow:

async function handleWebhook(eventId: string, body: any) {
    // Atomic insert: if the row already exists, this is a replay
    const inserted = await db.query(
        'INSERT INTO webhook_events_seen (event_id) VALUES ($1) ON CONFLICT DO NOTHING RETURNING event_id',
        [eventId]
    );

    if (inserted.rowCount === 0) {
        // Already seen; safe to ack but do not process
        return { status: 'duplicate', acked: true };
    }

    // First time we see this event; process it
    try {
        await processEvent(body);
    } catch (err) {
        // Forget the id so the sender's retry is not mistaken for a duplicate
        await db.query('DELETE FROM webhook_events_seen WHERE event_id = $1', [eventId]);
        throw err;
    }
    await db.query('UPDATE webhook_events_seen SET processed = true WHERE event_id = $1', [eventId]);
    return { status: 'processed' };
}

Three things matter here:

  1. The insert and the check are atomic. A naive "check if exists, then process, then mark seen" has a race condition where two concurrent receivers both pass the check and both process the event.
  2. You ack the duplicate. Returning 200 OK for a duplicate prevents the sender from retrying. If you returned 4xx, the sender would keep retrying, and you would keep recognizing it as a duplicate. Both sides stay stuck in that loop until the sender's retry window runs out.
  3. The table needs a TTL. Events older than the sender's retry window can be safely deleted. For Stripe (3-day retry), I keep two weeks of history; for systems with shorter retry windows, less.
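The first-sight check and the TTL from the list above can be sketched with an in-memory stand-in for webhook_events_seen. The SeenEvents class and its method names are hypothetical, and the 14-day retention simply mirrors the two weeks mentioned above. In Postgres the prune step would be a periodic DELETE on received_at rather than a method call.

```typescript
// In-memory stand-in for the webhook_events_seen table (illustration only).
class SeenEvents {
    private readonly receivedAt = new Map<string, number>();
    private readonly retentionMs: number;

    constructor(retentionMs: number) {
        this.retentionMs = retentionMs;
    }

    // Mirrors INSERT ... ON CONFLICT DO NOTHING: true only on first sight.
    markSeen(eventId: string, nowMs: number): boolean {
        if (this.receivedAt.has(eventId)) return false;
        this.receivedAt.set(eventId, nowMs);
        return true;
    }

    // Mirrors DELETE ... WHERE received_at < now() - retention.
    // Returns the number of entries removed.
    prune(nowMs: number): number {
        let removed = 0;
        this.receivedAt.forEach((at, eventId) => {
            if (nowMs - at > this.retentionMs) {
                this.receivedAt.delete(eventId);
                removed++;
            }
        });
        return removed;
    }
}

const DAY_MS = 24 * 60 * 60 * 1000;
const seen = new SeenEvents(14 * DAY_MS);
```

Keep the retention comfortably longer than the sender's retry window. If you prune too early, a late retry looks like a brand-new event.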

Concern 3: idempotency at the business level

Replay protection at the event-id level is necessary but not sufficient. You also need idempotency at the business level: "if the same payment success event is delivered three times, our user gets credited once".

The pattern: every effect of a webhook is keyed by something stable. For a payment.succeeded event, the effect is "credit the user with this payment". The key is the payment id, not the event id. If two different events both report success for the same payment, you still want the credit to apply once.

async function applyPaymentCredit(paymentId: string, amount: number) {
    await db.query(
        `INSERT INTO payment_credits (payment_id, amount, applied_at)
         VALUES ($1, $2, now())
         ON CONFLICT (payment_id) DO NOTHING`,
        [paymentId, amount]
    );
}

The ON CONFLICT (payment_id) DO NOTHING is what makes this idempotent. The unique constraint on payment_id is what enforces it. The combination of replay protection (event-id deduplication) and business-level idempotency (resource-id-keyed effects) is what makes the receiver safe under any combination of duplicates and reorderings.
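To see why the key has to be the payment id rather than the event id, here is a small in-memory sketch. The Map stands in for the payment_credits table and its unique constraint; the event ids, the second event type, and the applyCreditOnce name are made up for illustration.

```typescript
// In-memory stand-in for payment_credits with a unique key on payment_id.
const paymentCredits = new Map<string, number>();

// Returns true if the credit was applied, false if it already existed
// (the in-memory equivalent of ON CONFLICT (payment_id) DO NOTHING).
function applyCreditOnce(paymentId: string, amount: number): boolean {
    if (paymentCredits.has(paymentId)) return false;
    paymentCredits.set(paymentId, amount);
    return true;
}

// Two distinct events (hypothetical ids and types) reporting the same payment.
const events = [
    { id: 'evt_1', type: 'payment.succeeded', paymentId: 'pay_42', amount: 1000 },
    { id: 'evt_2', type: 'invoice.paid', paymentId: 'pay_42', amount: 1000 },
];

// Event-id dedupe would let both through; the resource key stops the second.
const applied = events.map(e => applyCreditOnce(e.paymentId, e.amount));
console.log(applied); // first applies, second is a no-op
```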

Concern 4: respond fast, process async

A webhook receiver that does the work synchronously inside the HTTP handler is going to time out under load and cause the sender to retry, which compounds into duplicate processing and angry pages. The pattern is: validate, persist the raw event, ack, then process asynchronously.

Webhook handler shape
  1. Read raw body
  2. Verify signature (reject 400 on failure, no retry)
  3. Insert into webhook_events_seen (deduplicate)
  4. Insert raw event payload into a queue or outbox table
  5. Return 200 OK to the sender
  6. Worker reads from queue, processes, marks done

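With only verification and two inserts in the request path, the handler can stay under 50 milliseconds. When the dedupe table and the outbox live in the same database, do steps 3 and 4 in a single transaction; otherwise a crash between them leaves an event marked as seen but never queued, and the sender's retry gets acked as a duplicate. The processing happens in the background, where retries are within your control. The outbox table lets you replay events if your worker crashes mid-process; the queue lets you scale processing horizontally without scaling your HTTP frontend.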

The traps I have hit:

  • Returning 200 before signature verification. Always verify first.
  • Returning 200 before persistence. If you ack before the event is in the queue or table, a crash between ack and persist drops the event silently.
  • Putting the heavy work in the handler. Sender timeouts (often 30 seconds, sometimes 10) make this fragile.
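The ordering that avoids all three traps can be sketched as a framework-free function. Everything here is hypothetical: ReceiverDeps, receive, and the choice of 400 and 500 status codes are assumptions for illustration, and persist is assumed to write the event id and the raw payload atomically.

```typescript
// All names below are hypothetical; the dependencies are injected so the
// ordering of steps is visible without tying the sketch to a framework.
interface ReceiverDeps {
    verify(rawBody: string, signatureHeader: string): boolean;
    // Assumed to insert the event id and the raw payload in ONE transaction.
    // Resolves to false if the event id was already present.
    persist(eventId: string, rawBody: string): Promise<boolean>;
}

async function receive(
    rawBody: string,
    signatureHeader: string,
    deps: ReceiverDeps
): Promise<number> {
    // Verify before anything else touches the payload.
    if (!deps.verify(rawBody, signatureHeader)) return 400;

    // Only parse after the signature has been checked.
    let eventId: unknown;
    try {
        eventId = (JSON.parse(rawBody) as { id?: unknown }).id;
    } catch {
        return 400;
    }
    if (typeof eventId !== 'string') return 400;

    // Dedupe and enqueue. A throw here means nothing was persisted,
    // so answer 500 and let the sender retry.
    try {
        await deps.persist(eventId, rawBody);
    } catch {
        return 500;
    }

    // Ack. Duplicates are acked too; the worker does the real work later.
    return 200;
}
```

The function returns a status code instead of writing a response so it can be dropped behind whichever HTTP layer you use, as long as that layer hands it the unparsed body.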

Concern 5: ordering, or the lack of it

Most webhook providers do not guarantee in-order delivery. A payment.succeeded event and a payment.refunded event for the same payment can arrive in either order, and the latter can be retried multiple times.

Your processing logic has to be robust to out-of-order events. Two patterns I use:

Use the resource state as the source of truth, not the event sequence. When you receive payment.refunded, do not assume payment.succeeded was processed first. Look up the payment in the source-of-truth API (Stripe, your payments provider) and re-fetch its current state. Apply your local effect based on that state, not based on the inferred history from the event stream.
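A minimal sketch of that first pattern, assuming the current state has already been fetched from the provider. The status values, the LocalPayment shape, and the reconcile function are all hypothetical; the point is only that the triggering event type is not an input.

```typescript
// Hypothetical shapes: what the provider's API reports vs. what we hold locally.
type RemoteStatus = 'pending' | 'succeeded' | 'refunded';
interface LocalPayment {
    credited: boolean;
}
type Action = 'apply_credit' | 'reverse_credit' | 'none';

// Decides the local effect from the CURRENT remote state only.
// The event type that triggered the re-fetch is deliberately not an input.
function reconcile(remote: RemoteStatus, local: LocalPayment): Action {
    if (remote === 'succeeded' && !local.credited) return 'apply_credit';
    if (remote === 'refunded' && local.credited) return 'reverse_credit';
    return 'none';
}
```

If payment.refunded arrives before payment.succeeded, both re-fetches report 'refunded' for a payment that was never credited locally, so both calls return 'none' and the final state is correct without any knowledge of event order.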

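Embed a monotonically-increasing version on the resource. When the sender includes a version (object_version: 42), and you have already applied version 45, ignore the older event. Not every provider sends one. Stripe's update events, for example, carry previous_attributes (the old values of the fields that changed) rather than a version counter, so with providers like that the re-fetch pattern above is the safer choice.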
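The version check can be sketched in a few lines. The Map is an in-memory stand-in for a version column on the resource row, and applyIfNewer is a hypothetical name; object_version is the illustrative field from the paragraph above.

```typescript
// Last applied version per resource id (in-memory stand-in for a DB column).
const appliedVersions = new Map<string, number>();

// Applies the update only if it is newer than what we already hold.
function applyIfNewer(
    resourceId: string,
    objectVersion: number,
    apply: () => void
): boolean {
    const current = appliedVersions.get(resourceId);
    // Equal versions are duplicates, lower versions are stale: skip both.
    if (current !== undefined && objectVersion <= current) return false;
    apply();
    appliedVersions.set(resourceId, objectVersion);
    return true;
}
```

In a database the same guard usually lives in the write itself, for example an UPDATE whose WHERE clause requires the stored version to be lower than the incoming one, so the check and the write cannot be separated by a concurrent worker.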

Designing your webhook handler as "stateless reactor that re-syncs the resource on every event" is more robust than "stateful machine that walks the resource through a state diagram driven by event types". The first one tolerates any combination of out-of-order, duplicate, and missing events. The second is fragile to all three.

A test plan I run before launch

Before I ship a webhook receiver, I run through this checklist:

Test | What I check
Send a valid event | 200 OK, event processed
Send an event with a bad signature | 400 or 401, no processing
Send the same event twice within 1 second | First is processed, second is acked but not double-processed
Send an event with a 6-minute-old timestamp | Rejected as too old
Send an event and crash the worker mid-process | Event re-processed cleanly when worker restarts
Send 100 events in 1 second | All accepted, queued, processed (no timeouts)
Send a payment.refunded before payment.succeeded | Final state is correct (re-sync from source of truth)
Send the same effect from different event types | Resource-keyed idempotency prevents double-apply

If any one of these fails, I do not ship. Each failure is a real production incident waiting to happen.

What changed about webhooks in the last few years

Two trends worth noting. First, more providers are converging on similar signing schemes: HMAC-SHA256 over the raw body, usually combined with a timestamp, is now the common pattern. The differences are mostly in header names, how the timestamp and body are joined, and how the signature is encoded, not in cryptographic shape. Libraries that handle the verification correctly are easy to find for every major language.

Second, Server-Sent Events and WebSocket-based event streams are becoming more popular as alternatives for use cases where you want low-latency event delivery and you can keep a persistent connection. They sidestep some of the webhook problems (no public endpoint to attack, ordered stream by default) but introduce new ones (connection management, reconnect on failure, missed events when disconnected). If your event volume is high or latency-critical, evaluate them; if your event volume is low and you have a stable HTTPS endpoint, webhooks are still the right shape.

The receiver I build today

The webhook receiver I would build today is straightforward to describe, even though there is real engineering inside each layer:

Receiver architecture I keep reusing
  Public POST /webhooks/{provider}
    -> Read raw body (no JSON parse first)
    -> HMAC verify with provider secret + timestamp window
    -> Insert event_id into webhook_events_seen (atomic, on conflict ack)
    -> Insert event payload into webhook_outbox table
    -> Return 200 OK in <50ms
  Background worker
    -> SELECT FOR UPDATE SKIP LOCKED on webhook_outbox
    -> Apply effect with resource-keyed idempotency
    -> Re-fetch resource from source of truth if state-dependent
    -> Mark outbox row as processed
    -> Retry with exponential backoff on transient failures
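Two pieces of the worker are worth sketching: the claim query and the backoff schedule. The outbox columns (processed, attempts, next_attempt_at), the one-second base delay, and the one-hour cap are assumptions for illustration, not values from any provider.

```typescript
// Hypothetical outbox columns: id, payload, processed, attempts, next_attempt_at.
// SKIP LOCKED lets several workers poll the same table without blocking
// each other; run this inside the transaction that also marks the row done.
const CLAIM_NEXT_SQL = `
    SELECT id, payload, attempts
    FROM webhook_outbox
    WHERE processed = false AND next_attempt_at <= now()
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED`;

// Exponential backoff with a cap: 1s, 2s, 4s, ... up to maxMs.
// `attempt` is 1 for the first retry.
function backoffMs(attempt: number, baseMs = 1000, maxMs = 60 * 60 * 1000): number {
    const exponent = Math.max(0, attempt - 1);
    return Math.min(maxMs, baseMs * Math.pow(2, exponent));
}
```

On a transient failure the worker would bump attempts and set next_attempt_at to now plus backoffMs(attempts). Adding random jitter to the delay is a common refinement that is left out here for brevity.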

That shape handles every failure mode I have seen in production, and it is genuinely not much code. The first time I shipped it I was surprised how much webhook handling shrinks once you separate "accept the event" from "do the work". Everything in the second category becomes a regular background job, with all the patterns and tooling you already have for those.

Treat webhook receivers as critical infrastructure

The receiver I inherited at the start of this article was not unusual. I have seen many like it, and the pattern is always the same: a webhook is bolted on as a side feature, written by someone in an afternoon, never load-tested, and never reviewed for the failure modes the contract permits. Then the day comes, and the receiver turns out to be the shortest path to data corruption in your whole system.

Build it like critical infrastructure from the start. Sign every request. Deduplicate by event id. Apply idempotently by resource id. Ack fast, process async. Test the failure modes deliberately. The five concerns at the top of this article are not optional best practices, they are the minimum bar for any receiver that is going to live on the public internet. Below that bar, you do not have a webhook handler, you have a financial bug waiting for the right replay event to surface it.
