POS Systems · Payment Infrastructure · Event-Driven Architecture · Distributed Systems

Event-Driven Architecture for Payment Processing

By Farnaz Bagheri · 13 min read

There is a version of event-driven architecture that is almost exactly right for payment systems, and there is a version of it that is almost exactly wrong, and they look identical on an architecture diagram. The difference is in how events are produced, how they're consumed, what guarantees are preserved at each hop, and what happens when something fails. Payment systems that get this right are robust in ways that no synchronous system can match. Payment systems that get it wrong drop transactions, double-charge customers, and produce audit logs that don't match reality.

I want to be concrete about what the right version looks like, because the generic advice — "use Kafka," "publish events," "decouple producers from consumers" — is true but unhelpful. Payment systems need specific patterns that most event-driven tutorials don't mention and many event-driven tools don't support well.

Events versus messages versus commands

The first clarifying distinction is that "event" has at least three different meanings in a payment context, and they behave differently.

Domain events are records that something has already happened. PaymentAuthorized, TransactionSettled, RefundIssued. They are past tense, immutable, and broadcast to any number of consumers. A domain event is a fact. You don't reject it, you don't negotiate with it. If it happened, it's in the log.

Commands are instructions to do something. AuthorizePayment, IssueRefund, CaptureTransaction. They are future tense, addressed to a specific handler, and carry intent. A command can be rejected. It can fail. Commands are not events, even though they often flow through the same infrastructure, and treating them identically is where most architectural confusion starts.

Integration messages are envelopes that carry domain events or commands across service or vendor boundaries. They add concerns like ordering, delivery guarantees, and encoding. A single domain event might be wrapped in different integration messages for different consumers.

When engineers say "we're event-driven," they usually mean they have an infrastructure that transports serialized data between services. That infrastructure has to be used differently for domain events, commands, and integration messages. A system that conflates them ends up with commands treated as facts (and therefore never rejected properly), or with events treated as commands (and therefore never delivered to multiple consumers correctly).
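The distinction can be made concrete in code. This is a minimal sketch, not a production model — the type names follow the examples above, but the handler and its rejection rule are illustrative assumptions:

```python
# Sketch only: a domain event is an immutable fact; a command is a request
# that a handler may reject. Names beyond the article's examples are invented.
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

@dataclass(frozen=True)  # immutable: a fact cannot be changed after it happens
class PaymentAuthorized:
    """Domain event: past tense, already happened, broadcast to any consumer."""
    transaction_id: str
    amount: Decimal

@dataclass
class AuthorizePayment:
    """Command: future tense, addressed to one handler, carries intent."""
    transaction_id: str
    amount: Decimal

def handle(cmd: AuthorizePayment) -> Optional[PaymentAuthorized]:
    # A command handler can reject; an event consumer cannot.
    if cmd.amount <= 0:
        return None  # rejected: no fact is produced
    return PaymentAuthorized(cmd.transaction_id, cmd.amount)
```

The asymmetry is the point: `handle` can return nothing, but once a `PaymentAuthorized` exists, every consumer must treat it as settled history.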


The outbox pattern is non-negotiable

The most important pattern in event-driven payment architecture is the transactional outbox. If you take one thing from this article, make it this.

The problem: your handler has to do two things atomically — write to the database and publish an event. The database write has to succeed (otherwise the state is lost) and the event publication has to succeed (otherwise downstream consumers don't know about the state change). But these are two different systems. If you write first and then publish, a crash in between drops the event. If you publish first and then write, a crash in between creates an event for state that doesn't exist.

The naive solution — "just retry on failure" — doesn't work, because you can't know from outside whether the failure happened before or after the other side succeeded. You end up with either lost events or duplicate events, depending on which way you retry.

The outbox pattern solves this by moving the event into the same transaction as the state change:

BEGIN;
  INSERT INTO transaction_events (...) VALUES (...);
  INSERT INTO outbox (event_type, payload, created_at, status)
    VALUES ('PaymentAuthorized', '{...}', now(), 'pending');
COMMIT;

Both writes commit atomically. Either both happen or neither does. A separate publisher process reads the outbox, sends events to the message broker, and marks them sent. If the publisher crashes, it restarts and picks up where it left off. Events are at-least-once delivered (duplicates are possible), but never lost.
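The producer side can be sketched end to end. This uses sqlite3 so it runs anywhere; a real system would use its production database and an actual broker. The table shapes and the `published` list standing in for a broker are assumptions for illustration:

```python
# Sketch of the transactional outbox, assuming an in-memory sqlite3 database
# and a list standing in for the message broker.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE transaction_events (id TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT, payload TEXT, status TEXT DEFAULT 'pending'
    );
""")

def authorize(txn_id: str, amount: str) -> None:
    # Both writes commit in one transaction: either both happen or neither.
    with db:
        db.execute("INSERT INTO transaction_events VALUES (?, 'authorized')", (txn_id,))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("PaymentAuthorized", json.dumps({"txn": txn_id, "amount": amount})),
        )

published = []  # stand-in for the message broker

def publish_pending() -> int:
    # The relay: read pending rows, publish, then mark sent. A crash between
    # publishing and marking re-sends the event — at-least-once, never lost.
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox WHERE status = 'pending'"
    ).fetchall()
    for row_id, event_type, payload in rows:
        published.append((event_type, json.loads(payload)))
        with db:
            db.execute("UPDATE outbox SET status = 'sent' WHERE id = ?", (row_id,))
    return len(rows)
```

Running `authorize` then `publish_pending` delivers the event exactly as the pattern describes; running `publish_pending` again finds nothing pending.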

Every event-driven payment system I've ever seen that didn't use the outbox pattern had a class of bug where occasional events went missing and nobody could reproduce it. The bug is real; it just fires rarely enough to be dismissed as "a transient glitch" until it happens during a high-volume day and produces real financial discrepancies.

Consumer idempotency is the other non-negotiable

At-least-once delivery means consumers will see some events twice. The consumer has to tolerate this. If the consumer updates state by processing the event, it has to update state idempotently — processing the same event twice must produce the same result as processing it once.

The simplest pattern is a processed-events table:

CREATE TABLE processed_events (
  consumer_id  TEXT NOT NULL,
  event_id     UUID NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (consumer_id, event_id)
);

Before processing an event, the consumer attempts to insert a row. If the insert succeeds, this is a new event — process it. If the insert fails due to the primary key constraint, this is a duplicate — skip it. The processing itself happens inside a database transaction that includes this insert, so partial failures don't leave the tracking table out of sync with the actual work.

This is more disciplined than using the event broker's native deduplication mechanisms, which tend to have narrow windows and weak guarantees. The processed-events table is explicit, auditable, and works identically regardless of what broker is underneath.
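A consumer built on this table can be sketched in a few lines. Again sqlite3 for portability, and the `balances` table is an invented example of "the actual work" — any state update the consumer performs would sit in the same transaction:

```python
# Sketch of an idempotent consumer: the dedup insert and the real work share
# one transaction, so a duplicate event is detected and skipped atomically.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE processed_events (
        consumer_id TEXT NOT NULL,
        event_id    TEXT NOT NULL,
        PRIMARY KEY (consumer_id, event_id)
    )
""")
db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, cents INTEGER)")
db.execute("INSERT INTO balances VALUES ('merchant-1', 0)")
db.commit()

def handle_event(consumer_id: str, event_id: str, account: str, cents: int) -> bool:
    """Apply the event exactly once; return False for a duplicate."""
    try:
        with db:  # one transaction: dedup insert + the actual state change
            db.execute(
                "INSERT INTO processed_events (consumer_id, event_id) VALUES (?, ?)",
                (consumer_id, event_id),
            )
            db.execute(
                "UPDATE balances SET cents = cents + ? WHERE account = ?",
                (cents, account),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary-key violation: already processed, skip
```

Delivering the same event twice leaves the balance updated exactly once, which is the whole contract.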

Ordering guarantees are usually lies

Kafka provides partition-level ordering. RabbitMQ provides queue-level ordering. SQS FIFO provides message group ordering. All of these are useful. None of them provide the ordering guarantee you actually want for payment events.

The guarantee you want is: for a single transaction, events are processed in the order they happened. What the infrastructure gives you is: within a partition, events are delivered in the order they were produced — but events about the same transaction can end up in different partitions if your partitioning key is wrong, and the network between producer and broker can reorder events even within the same partition under certain failure conditions.

The practical answer is to partition by the thing whose ordering you care about, usually transaction_id or merchant_id. This guarantees that all events about a given transaction go to the same partition and are processed in order. Events about different transactions can interleave, which is fine.
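The mechanics are just a stable hash of the key. This sketch is illustrative — real brokers have their own partitioners (Kafka's default is not SHA-256, for instance), but the property that matters is the same: the same key always maps to the same partition.

```python
# Sketch of key-based partitioning. SHA-256 is used here only because it is
# deterministic across processes and runs; the partition count is arbitrary.
import hashlib

def partition_for(transaction_id: str, num_partitions: int) -> int:
    digest = hashlib.sha256(transaction_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```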

But even this isn't enough, because events can fail to process and get retried. Naive retry moves the event to the end of the queue, and now it's out of order. The fix is either to stop the partition until the failing event succeeds (which blocks all events behind it) or to treat ordering as a consumer-side concern and reorder on receive.

I prefer the latter. The consumer uses the event's sequence number (emitted as part of the event payload by the producer) to detect gaps and buffer events until they can be processed in order. When a gap persists beyond a timeout, it's a real problem — something was dropped — and it alerts rather than silently skipping.
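The buffering logic can be sketched directly. This assumes, as described above, that the producer stamps each event with a per-transaction sequence number starting at 1; the class name and policy details are illustrative:

```python
# Sketch of consumer-side reordering: buffer out-of-order arrivals and
# release events only when the sequence is contiguous.
class ReorderBuffer:
    def __init__(self):
        self.next_seq = 1
        self.pending = {}  # seq -> event, buffered out-of-order arrivals

    def accept(self, seq: int, event: dict) -> list:
        """Return the events now ready to process, in order."""
        if seq < self.next_seq:
            return []  # duplicate of something already delivered
        self.pending[seq] = event
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

    def gap(self) -> bool:
        """True when events are stuck behind a missing sequence number.
        If this persists past a timeout, alert — something was dropped."""
        return bool(self.pending)
```

Event 2 arriving before event 1 is held; when event 1 arrives, both are released in order, and `gap()` is what the timeout-based alert would watch.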


Event schemas have to be versionable from day one

Events are stored for compliance and replayed for testing and audit. They live longer than the code that produced them. The schemas used to serialize them will change — new fields, removed fields, renamed fields. If you don't plan for evolution, you end up in the nightmare situation of trying to replay a six-month-old event stream through code that expects fields that didn't exist then.

Two things matter:

Explicit schema versions. Every event carries a schema version in its header. Consumers know how to interpret each version, or they reject what they don't understand. Adding a version is not optional — the cost of not having one shows up the first time you need to change a field and realize your data is full of un-versioned events that nobody can safely migrate.

Additive-first changes. Never remove a field. Never rename a field. Never change the meaning of a field. When requirements change, add a new field and leave the old one populated with a best-effort value. Deprecate fields slowly and on a known timeline. The cost of maintaining old fields is lower than the cost of breaking event replay.
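One common way to apply both rules at once is an "upcaster" that lifts old events to the current shape before the consumer sees them. The field names and version numbers here are hypothetical — the point is the pattern: dispatch on the version, backfill added fields with best-effort values, reject what you don't understand.

```python
# Sketch of version-aware event reading. Versions and fields are invented
# for illustration; real events would carry the version in their header.
def upcast(event: dict) -> dict:
    version = event["schema_version"]
    if version == 1:
        # v2 added `currency`; backfill a best-effort default for old events.
        event = {**event, "currency": "USD", "schema_version": 2}
        version = 2
    if version == 2:
        return event
    raise ValueError(f"unknown schema version: {version}")
```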

Schema registries (Confluent, Apicurio, home-grown) are useful for enforcing compatibility rules at publish time. The rule I enforce is "backward compatible only" — a new schema must be readable by old consumers. This limits the kinds of changes you can make, which is the point.

Dead letters are for bugs, not for noise

Dead letter queues are often misused as a catch-all for any event the consumer couldn't process. That's wrong. A dead letter should mean "we've tried, we can't process this, a human needs to look." It should not mean "we failed once and gave up."

The retry policy has to distinguish between failure types:

  • Transient failures (network, timeout, temporary database issues): retry with exponential backoff, up to some maximum.
  • Poisoned messages (malformed payload, schema violation, unknown event type): fail fast, dead letter immediately, alert.
  • Business failures (consistency violation, referential integrity error): these are usually bugs and need investigation. Dead letter with full context.
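The routing logic that the list above implies can be sketched as a small dispatcher. The exception classes are illustrative stand-ins for whatever the consumer actually raises; the attempt cap and the return values are assumptions:

```python
# Sketch of a retry policy that routes by failure type rather than treating
# every error identically.
class TransientError(Exception): pass    # network, timeout, db hiccup
class PoisonedMessage(Exception): pass   # malformed payload, schema violation
class BusinessFailure(Exception): pass   # consistency violation: likely a bug

MAX_ATTEMPTS = 5

def dispatch(process, event, attempt: int) -> str:
    """Return the routing decision: 'ok', 'retry', or 'dead_letter'."""
    try:
        process(event)
        return "ok"
    except TransientError:
        # retry with backoff up to a cap, then give up to the DLQ
        return "retry" if attempt < MAX_ATTEMPTS else "dead_letter"
    except (PoisonedMessage, BusinessFailure):
        # no point retrying; dead-letter immediately with full context
        return "dead_letter"
```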

A dead letter queue that's growing is a leading indicator of a bug. A dead letter queue that's at zero means either your system is perfect (unlikely) or your retry policy is swallowing real failures. Neither extreme is healthy.

For payment systems specifically, dead letters need manual handling more often than in other domains. An event that can't be processed might represent real money that's now in an ambiguous state. The operations team needs to see it, understand it, and decide what to do — possibly contact the processor, possibly manually intervene, possibly mark the transaction as under investigation and continue processing others. Automating this away is a mistake.

Event sourcing as the deepest form of event-driven

Event sourcing is event-driven architecture taken to its logical conclusion: the events are not just notifications about state changes, they are the only record of state changes. The current state is derived by replaying events.

For payment systems, this is the right model. The event log is the system of record. Everything else — the ledger, the reconciliation tables, the merchant dashboard — is a projection built from the event log. If a projection is ever wrong, you rebuild it from the events. You never rebuild the events from a projection, because the events are truth.

This changes how you reason about the system. Bugs don't corrupt the data permanently — they produce incorrect projections, which can be rebuilt. Schema changes don't require destructive migrations — they require writing a new projection. Audit is trivial — the event log contains everything that ever happened, with timestamps, causality, and context.
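Mechanically, a projection is just a fold over the log. This sketch uses invented event shapes (the types match the article's earlier examples, the fields are assumptions) to show the core move — current state derived entirely from events, rebuildable at any time:

```python
# Sketch of a projection: fold the event log into a current balance.
# Event shapes are hypothetical; the log itself is the only input.
def project_balance(events: list) -> int:
    """Derive a settled balance (in cents) by replaying the event log."""
    balance = 0
    for e in events:
        if e["type"] == "PaymentAuthorized":
            pass  # an authorization holds funds but moves nothing
        elif e["type"] == "TransactionSettled":
            balance += e["amount_cents"]
        elif e["type"] == "RefundIssued":
            balance -= e["amount_cents"]
    return balance
```

Because replay is deterministic, a corrupted balance table is never a crisis: drop it and run the fold again.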

The tradeoff is that event sourcing is more work upfront. You have to think carefully about event design, handle schema evolution, manage projection rebuilds, and tolerate the increased storage cost of keeping everything. For systems where correctness matters more than convenience — payment systems, accounting systems, audit systems — the tradeoff is worth it. For systems where most data is transient and the current state is all that matters, event sourcing is overkill.


The hybrid model most teams actually end up with

Pure event sourcing is rare. Most payment systems I've worked on use a hybrid: events are the source of truth for transaction state specifically, but other parts of the system (user accounts, merchant configuration, product catalog) use traditional CRUD models. This is fine. The discipline is in knowing which is which and not conflating them.

The payment flow lives entirely in event-sourced territory. Every authorization, capture, refund, dispute, and settlement is an event. The transaction state is derived from the events. The ledger is a projection.

Around this, the rest of the application uses normal patterns. A merchant updating their business name does not need an event log — it's just a row update. A report query doesn't need to replay events — it queries the projection. The boundary between event-sourced and CRUD is deliberate and narrow.

What to use for infrastructure

I'm agnostic about the specific technology. Kafka, Pulsar, NATS, RabbitMQ, SQS, Google Pub/Sub — they all work for payment systems if used correctly. The choice depends on operational preferences, existing expertise, and scale requirements, not on any technical superiority for this domain.

What matters more than the broker is that you use it with the patterns above: outbox on the producer side, idempotency and ordering on the consumer side, schema versioning, disciplined dead letters. A team with RabbitMQ and these patterns will build a more robust system than a team with Kafka and none of them.

The one thing I'll say firmly: don't use your database as a message queue. Building your own pub/sub on top of Postgres (LISTEN/NOTIFY, polling tables, etc.) is tempting because it seems simpler than adding infrastructure. It's not. The failure modes are worse, the observability is weaker, and the performance doesn't scale. Use a real broker.

The deeper point

Event-driven architecture for payment systems is not a pattern. It's a discipline. The patterns — outbox, idempotency, ordering, versioning, dead letters — are the minimum requirements for correctness. Skipping any of them produces a system that looks event-driven, passes the architecture review, and silently loses or corrupts events in production.

The reason this matters is that the events are your audit trail, your recovery mechanism, and your source of truth. A system whose event log is incomplete or inconsistent is a system that cannot answer the question "what happened?" — and for payment systems, that's the only question that matters when something goes wrong.

Build the event log like it's evidence. Because when there's a dispute, an audit, or an incident, it is.


This is part of a series on payment systems architecture. See also database patterns for payment systems that actually work and idempotency in payment systems.