
Building Payment Webhooks That Don't Lose Events

By Farnaz Bagheri · 14 min read

The first webhook implementation I ever shipped was 40 lines of code. It received POST requests from the processor, parsed the JSON, and updated the transaction in the database. It worked perfectly in testing. It worked perfectly for the first week in production. Then we had a deploy that took the webhook endpoint down for 90 seconds, and the processor delivered 200 webhooks during that window, and the processor's retry policy happened to be "fire once, log the failure, move on." Those 200 events were gone. We reconstructed them over the next three days from settlement files and API polls.

I have rewritten webhook handlers more times than any other component of a payment system, and each rewrite has taught me something about what the pattern actually requires. The short version is that webhook handling is a distributed systems problem disguised as an HTTP handler, and treating it as the latter produces systems that look fine until they aren't.

Why webhooks are harder than they look

A webhook is an HTTP endpoint you expose to a processor. The processor POSTs to it when something happens — a transaction settles, a chargeback arrives, a payment method is updated. Your endpoint receives the request, processes it, returns 200, and moves on. That's the happy path, and it works approximately 99% of the time.

The other 1% is where the entire design lives. Webhooks can fail at every layer: DNS, TLS, load balancer, application, database. They can duplicate. They can arrive out of order. They can be signed with keys that rotated before your server learned about the rotation. They can be spoofed by attackers. They can be delayed by minutes or hours. They can reference resources that don't exist yet in your system, or resources that have been superseded. Each of these failure modes has a specific mitigation, and a webhook handler that handles none of them will eventually produce wrong state.

The thing that makes webhooks especially challenging in payment systems is that they're often the only notification you'll get about events that affect money. A settlement that you miss is a reconciliation problem. A chargeback notification that you miss is a missed response deadline and a lost dispute. A payment method update that you miss is a future transaction that declines because the card data is stale. Each missed webhook is a concrete financial consequence.

Signature validation is not optional

Before anything else, the webhook handler has to verify that the request is actually from the processor. Without this, anyone on the internet can POST fake events to your endpoint and potentially confuse your system into a state the processor doesn't know about.

Every major processor provides a signature mechanism: the processor signs the request body with a shared secret (or uses a public/private key pair), and includes the signature in a header. Your handler computes the expected signature from the body and compares it to the received signature. If they don't match, reject the request — not with a 4xx error, but with silence, to avoid giving attackers information.

The signature check has to be on the raw body, not the parsed version. This matters because JSON serialization is not canonical — {"a":1,"b":2} and {"b":2,"a":1} produce different signatures but the same parsed object. If you parse first and re-serialize, the signature will never match. The handler reads the raw bytes, verifies the signature against them, and only then parses.

Processors also differ in where the signature lives: most put it in a header, some use query parameters, and a few use a separate authentication header entirely. Read the docs carefully. Some processors rotate keys automatically; your handler has to support multiple active keys and try each one, most recent first, until a match is found.
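A minimal sketch of the verification, assuming an HMAC-SHA256 scheme with a hex-encoded signature (the actual algorithm, encoding, and header name vary by processor — check the docs). The key points are that it hashes the raw bytes, supports multiple active keys, and uses a constant-time comparison:

```python
import hmac
import hashlib

def verify_signature(raw_body: bytes, received_sig: str,
                     active_secrets: list[bytes]) -> bool:
    """Verify the raw request bytes against every active key, newest first.

    HMAC-SHA256 with a hex-encoded signature is an assumption here;
    substitute your processor's actual scheme.
    """
    for secret in active_secrets:
        expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
        # Constant-time comparison to avoid leaking timing information.
        if hmac.compare_digest(expected, received_sig):
            return True
    return False
```

Only after this returns True does the handler parse the body. Note that the raw bytes go in untouched; parsing and re-serializing first would break the comparison for exactly the canonicalization reason above.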

Idempotency: the webhook's first rule

Every webhook should be processed idempotently. The processor will retry. Network issues will produce duplicates. Your load balancer may replay requests. If processing a webhook twice is not safe, you have a bug, and it will eventually manifest.

The pattern:

CREATE TABLE processed_webhooks (
  processor_id  TEXT NOT NULL,
  event_id      TEXT NOT NULL,
  received_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  processed_at  TIMESTAMPTZ,
  result        JSONB,
  PRIMARY KEY (processor_id, event_id)
);

When a webhook arrives, the handler:

  1. Verifies the signature.
  2. Parses the body and extracts the event ID.
  3. Attempts to insert into processed_webhooks with an ON CONFLICT DO NOTHING.
  4. If the insert succeeded, this is a new event. Process it. On success, update the row with the result and return 200.
  5. If the insert failed (conflict), this is a duplicate. Look up the prior result and return 200.
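The steps above can be sketched against the table from earlier. This uses SQLite so it runs standalone (INSERT OR IGNORE is SQLite's spelling of ON CONFLICT DO NOTHING); the `process` function is a stand-in for your real business logic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processed_webhooks (
        processor_id TEXT NOT NULL,
        event_id     TEXT NOT NULL,
        result       TEXT,
        PRIMARY KEY (processor_id, event_id)
    )
""")

def process(event_id: str) -> str:
    # Placeholder for the real business logic.
    return f"applied {event_id}"

def handle_webhook(processor_id: str, event_id: str) -> str:
    # Step 3: attempt the insert; the primary key enforces dedup.
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_webhooks (processor_id, event_id) "
        "VALUES (?, ?)",
        (processor_id, event_id),
    )
    if cur.rowcount == 1:
        # Step 4: new event. Process it, record the result, return 200.
        result = process(event_id)
        conn.execute(
            "UPDATE processed_webhooks SET result = ? "
            "WHERE processor_id = ? AND event_id = ?",
            (result, processor_id, event_id),
        )
        return result
    # Step 5: duplicate. Skip processing, return the prior result with a 200.
    row = conn.execute(
        "SELECT result FROM processed_webhooks "
        "WHERE processor_id = ? AND event_id = ?",
        (processor_id, event_id),
    ).fetchone()
    return row[0]
```

Calling this twice with the same event ID processes the event once; the second call falls through to the duplicate branch and returns the stored result.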

The event ID comes from the processor. Every processor provides one — sometimes it's the event's own ID, sometimes it's a delivery ID, sometimes it's the resource ID. You have to pick the right one for each processor, because they have different semantics. An event ID that changes across retries is not useful for deduplication.

Step 5 — returning 200 for duplicates — is important. If you return an error, the processor will retry, generating more duplicates. Return 200 with the prior result, and the processor's retry logic calms down.

Don't process inline

The naive webhook handler processes the event inline: verify signature, parse, update state, return 200. This is wrong in subtle ways. The problem is that the processing can take seconds, and while it's running, the processor's request is held open. If your processing takes longer than the processor's timeout (often 10-30 seconds), the processor thinks the webhook failed and retries. Now you're processing two instances of the same event in parallel.

The correct pattern is to persist the webhook synchronously and process it asynchronously:

  1. Verify signature.
  2. Parse the body.
  3. Atomically insert into processed_webhooks and into a processing queue.
  4. Return 200.
  5. A separate worker picks up the queue and processes each event.

The queue insert and the processed_webhooks insert happen in the same database transaction (or, better, processed_webhooks contains the queue). This preserves at-least-once semantics — the event is durably persisted before the 200 is returned. If the worker crashes mid-processing, it picks up again and retries. If the processing itself is idempotent (which it must be, since duplicates are possible), this is safe.
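One way to sketch the "processed_webhooks contains the queue" variant: a status column turns the dedup table itself into the work queue, so persistence and enqueueing are a single insert. The column names and statuses here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processed_webhooks (
        event_id TEXT PRIMARY KEY,
        status   TEXT NOT NULL DEFAULT 'pending',
        payload  TEXT NOT NULL
    )
""")

def receive(event_id: str, payload: str) -> int:
    """Handler path: verify, parse, persist durably, return 200.
    No business processing happens here."""
    with conn:  # one transaction: the row IS the queue entry
        conn.execute(
            "INSERT OR IGNORE INTO processed_webhooks (event_id, payload) "
            "VALUES (?, ?)",
            (event_id, payload),
        )
    return 200

def work_once() -> int:
    """Worker path: pick up pending rows and process them. Returns the
    number of events processed; safe to re-run after a crash because
    unfinished rows stay 'pending'."""
    rows = conn.execute(
        "SELECT event_id, payload FROM processed_webhooks "
        "WHERE status = 'pending'"
    ).fetchall()
    for event_id, _payload in rows:
        # ... apply the event to transaction state (idempotently) ...
        conn.execute(
            "UPDATE processed_webhooks SET status = 'done' "
            "WHERE event_id = ?",
            (event_id,),
        )
    conn.commit()
    return len(rows)
```

In a real deployment the worker would also claim rows (to support multiple workers) and mark the status update in the same transaction as the state change, but the shape is the same: the 200 depends only on the durable insert, never on processing.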


This inversion is essential for reliability. The handler's job is to receive and persist; the worker's job is to process. Coupling the two produces fragile behavior.

Order is not guaranteed

Webhooks do not arrive in the order events happened. This sounds like something that shouldn't be true — the processor processes events in order and sends them in order, right? — but the path between the processor's event bus and your handler has network reordering, retry delays, and parallel delivery workers. By the time events reach your handler, they may be well out of order.

The practical consequence is that your handler must not assume that earlier events have been processed when handling a later event. You might receive a payment.succeeded event before the payment.authorized event it refers to. You might receive a refund.completed before the original payment.succeeded. You might receive a transaction.settled weeks before the transaction record is otherwise updated.

There are two approaches to handling this:

Event-sourced state with event reordering. Append every webhook event to a per-transaction event log, ordered by the event timestamp (not arrival time). The current state of the transaction is derived from the sorted event log. Late-arriving events are inserted at their correct position, and the derived state is recomputed.

Eventual consistency with deferred processing. When a webhook arrives that references a resource that doesn't exist yet, defer processing the webhook and set an alarm. If the expected prior event arrives, process the deferred webhook. If it doesn't arrive within a timeout, escalate — it's a real problem.

I lean toward the first approach because it's more robust, but it requires event-sourced infrastructure. The second approach works with traditional state-update patterns and is easier to add to an existing system, but it has more failure modes.
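The first approach reduces to a simple invariant: derived state is a function of the sorted event log, so arrival order stops mattering. A toy sketch, with an illustrative event shape and state machine (both assumptions, not a real processor's schema):

```python
def derive_state(events: list[tuple[float, str]]) -> str:
    """Recompute transaction state from the full event log, sorted by
    the processor's event timestamp -- not by arrival order. A
    late-arriving event just lands at its correct position on the
    next recompute."""
    state = "created"
    for _ts, event_type in sorted(events):
        if event_type == "payment.authorized":
            state = "authorized"
        elif event_type == "payment.succeeded":
            state = "succeeded"
        elif event_type == "refund.completed":
            state = "refunded"
    return state
```

Because the function is pure and the log is sorted before replay, delivering the same events in any order produces the same final state, which is exactly the property the webhook pipeline can't otherwise guarantee.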

Retry and backoff

Webhooks fail. Not all failures are retryable — signature validation failures, malformed payloads, and some business errors are poisoned messages that will never succeed no matter how many times you retry. Other failures are transient — database unavailability, temporary network issues, race conditions — and will succeed on retry.

The retry policy has to distinguish:

  • Do not retry: signature failures, malformed payloads, events for resources that are permanently invalid. These go to a dead letter queue for human review.
  • Retry with exponential backoff: transient infrastructure failures, deadlocks, timeouts. These retry several times with increasing delays, and if they continue to fail, they go to a dead letter queue.
  • Retry indefinitely: nothing. Infinite retry is a bug waiting to happen.

The exponential backoff should have jitter. Without jitter, retries from many workers cluster together and produce thundering herd problems. With jitter (randomized delay additions), retries spread out.

A common mistake is to retry too aggressively. A webhook that failed once and succeeded on retry is a good outcome. A webhook that gets retried 50 times over an hour while the underlying system is down is pathological — you're generating load against a broken system while deferring the real problem. Exponential backoff with a reasonable cap (usually 5-10 retries over 30-60 minutes) is the right shape.
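The delay calculation itself is a one-liner. This sketch uses full jitter (a uniformly random delay up to the exponential bound); the base and cap constants are illustrative, not recommendations:

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 600.0) -> float:
    """Exponential backoff with full jitter: the bound grows as
    base**attempt, is capped, and the actual delay is drawn uniformly
    from [0, bound] so many workers don't retry in lockstep."""
    return random.uniform(0.0, min(cap, base ** attempt))
```

A separate retry counter, compared against the cap from the policy above, decides when to stop retrying and route the event to the dead letter queue.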

The reconciliation backstop

Webhooks are not sufficient on their own. They can be lost — by the processor, by the network, by your infrastructure — and if you rely on them as your only source of event truth, every lost webhook is a permanent gap in your data.

The backstop is reconciliation. The processor's records are authoritative, and you can query them (via API or settlement files) to verify that your webhook-driven state matches the processor's state. If they diverge, you fetch the missing events and process them.

The pattern:

  1. Webhook handler processes events in near-real-time, updating state.
  2. Reconciliation pipeline runs daily (or hourly), comparing your records to the processor's records.
  3. Missing events are identified by comparing IDs. For each missing event, fetch it from the processor's API and feed it to the same processing logic as a webhook.
  4. Divergent events (same ID, different state) are flagged for investigation.
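The comparison in steps 3 and 4 is set arithmetic over event IDs. A sketch, assuming both sides have been reduced to id-to-state mappings (how you fetch and normalize them is processor-specific):

```python
def reconcile(processor_events: dict[str, str],
              local_events: dict[str, str]) -> tuple[set[str], set[str]]:
    """Compare the processor's records (id -> state) against ours.

    Returns (missing, divergent): missing IDs get fetched from the API
    and fed through the normal webhook processing logic; divergent IDs
    get flagged for investigation.
    """
    missing = set(processor_events) - set(local_events)
    divergent = {
        event_id
        for event_id in set(processor_events) & set(local_events)
        if processor_events[event_id] != local_events[event_id]
    }
    return missing, divergent
```

Feeding the missing events through the same pipeline as live webhooks (rather than a parallel code path) is what keeps the backstop honest: idempotency and ordering handling apply to reconciled events for free.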

This backstop is what makes webhook-based systems robust. Without it, missed webhooks are invisible until a merchant notices a discrepancy. With it, missed webhooks are caught automatically and filled in.

Some teams treat the reconciliation backstop as optional. It isn't. Every processor I've worked with has lost or delayed webhooks at some point. The question is not whether your webhook pipeline will miss events; it's whether your reconciliation pipeline will catch the misses.

Multi-tenant webhook routing

If your platform is multi-tenant, you have an additional problem: when a webhook arrives, which merchant does it belong to?

The answer depends on how you've configured the processor. Some processors send per-merchant webhooks with the merchant's account ID in the payload. Others send aggregated webhooks for all your merchants, and you have to identify the right one from fields inside the event. Others let you register separate webhook URLs per merchant.

Each approach has tradeoffs:

Sending aggregated webhooks to a single endpoint is simplest to manage but concentrates load — a spike from any one merchant affects processing for all of them. The payload has to carry enough context to identify the merchant, and your handler has to route internally.

Per-merchant webhook URLs scale better and provide natural isolation, but require configuration per merchant. When a merchant signs up, you create a URL, register it with the processor, and make sure it's monitored.

Path-based routing (e.g., /webhooks/{merchant_id}) is a middle ground. One deployment, one codebase, but each merchant has their own URL, and routing happens at the path level. This is usually the right default.

Whichever you choose, the handler has to derive the merchant context early — before processing — and carry it through all downstream operations. Misrouting a webhook to the wrong merchant is a data isolation failure, which is a bigger problem than losing the webhook entirely.

Dead letter handling is real work

Dead letter queues for webhooks are not theoretical. They accumulate. Every poisoned event, every persistent failure, every weird edge case your code doesn't handle ends up there. Ignoring the DLQ is how payment systems slowly rot.

What works:

  • The DLQ has an owner. Someone on the team is responsible for draining it weekly.
  • Each entry in the DLQ has context: the original event, the error, the stack trace, the timestamp, the retry history. Enough information that someone can diagnose without digging through logs.
  • The DLQ has alerting on growth. A DLQ that's adding faster than it's being drained is a signal that something is broken.
  • The DLQ items can be replayed. Once the underlying issue is fixed, the operator resubmits the events through the normal processing pipeline.
  • There's a retention policy. DLQ items older than N days are archived or deleted, with an audit trail.

A DLQ that's treated as a black hole becomes a graveyard of unresolved financial events. A DLQ that's actively managed is a forcing function for fixing real bugs.

Webhook replay tools

Eventually, you will need to replay webhooks. Either because a bug corrupted state and you need to rebuild it, or because your processor pushed a correction, or because you're migrating to new infrastructure.

Build the tools for this before you need them. The capability to take an event from any source — a DLQ entry, a historical record, a manually provided payload — and run it through the normal processing pipeline is one of the most valuable operational tools a payment platform can have. It should:

  • Require authentication and produce an audit trail.
  • Verify signatures (for events that should be signed) or explicitly mark events as "replay" (for events that come from internal sources).
  • Run through the same idempotency checks as normal webhooks, so replays of already-processed events don't duplicate work.
  • Be rate-limited, because bulk replays can hammer the database.

Without a replay tool, every historical correction is bespoke engineering work. With one, corrections are operational work that anyone on the team can execute.

The deeper point

Webhook handling is where payment systems meet distributed systems reality. The processor is a remote service you don't control, sending events through networks you don't control, on a schedule you don't control. Your job is to receive those events reliably, process them idempotently, recover from failures gracefully, and keep your state consistent with the processor's state.

None of this is implied by the words "HTTP endpoint." The implementation has to be deliberate at every step. Signature validation, idempotency, async processing, retry with backoff, dead letter handling, reconciliation backstop, multi-tenant routing — these are all required, not optional. A webhook handler that skips any of them is a ticking clock.

The teams that get this right treat the webhook handler as one of the most important components in the payment platform. The teams that get it wrong treat it as a 40-line HTTP handler and spend years debugging the consequences.

If your webhook handler is less code than your reconciliation pipeline, one of them is wrong. Usually the webhook handler.


This is part of a series on payment systems architecture. See also event-driven architecture for payment processing and idempotency in payment systems.