Why Payment State Machines Are Harder Than You Think

If you've worked on payment systems for any length of time, you've drawn a state machine on a whiteboard. It usually starts with three or four boxes — Authorized, Captured, Refunded, Voided — connected by a few arrows. It looks correct. It is also wrong, and the wrongness is not a matter of missing details. It's a matter of missing entire dimensions of state.

The state machine you draw on the first day is the state machine of the happy path. The state machine you discover after two years of production traffic is something else entirely: a graph with dozens of nodes, conditional transitions, asynchronous events that arrive out of order, and edges that exist for some processors and not others. The gap between the two is where a large share of payment system bugs come from, and better upfront design alone will not close it. It closes only as production experience shows you what the first model left out.

The state machine you draw first

Every payment system starts with something like this:

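[Diagram: the basic payment state machine, with Authorized, Captured, Refunded, and Voided states connected by the authorize, capture, void, and refund operations]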

This is a fine model of what a payment is supposed to do. It maps cleanly to the operations exposed by the processor's API: authorize, capture, void, refund. Engineers build their data model around it, store a status column on the transaction record, and write business logic that branches on the value of that column.

The problems start the day a real customer does something the diagram doesn't account for.

What the diagram is missing

Consider some events that real payment systems must handle:

An authorization is sent to the processor. The processor never responds. Is the transaction Authorized, Failed, or something else?
A capture is requested. The processor returns a successful response, but the settlement file the next day shows the transaction was rejected at settlement time. Is it Captured or Failed?
A refund is issued. The processor returns success. Three days later, the customer disputes the original charge anyway. Is the transaction Refunded or Disputed?
An authorization is partially captured for $30 of a $100 hold. The remaining $70 expires. The system needs to know that no further captures are possible. What state is that?
A void request is sent for a transaction that has already settled. The processor accepts it but converts it to a refund automatically. Is the transaction Voided or Refunded?
A chargeback arrives for a transaction that was already refunded. Is that valid? What state should it produce?

Each of these scenarios is common. These are not edge cases, and none of them fits cleanly into the four-state model. The state machine has to grow to accommodate them, and the way it grows is rarely planned. It accretes.

The dimensions you didn't know existed

A real payment state machine has at least three dimensions of state, not one.

Lifecycle state. This is what most people mean when they say "transaction state." Authorized, captured, refunded, voided, disputed, settled. It tracks where the transaction is in its operational journey.

Money state. This is what the merchant cares about. How much money was authorized? How much was captured? How much has been refunded? How much is currently held? How much has been disputed? Money state is not the same as lifecycle state — a transaction can be Captured with $50 refunded and $25 disputed, leaving $25 of the original $100 still owed to the merchant. A single status field cannot represent this.

Knowledge state. This is what your system knows for sure versus what it is uncertain about. After a network timeout on a capture request, the transaction's knowledge state is Capture Pending, Unknown: your system does not know whether the capture succeeded. This is a real state, and treating the transaction as either Captured or Authorized produces wrong behavior.

[Diagram: lifecycle state, money state, and knowledge state shown as three separate dimensions]

Naive state machines collapse all three dimensions into one column and end up with states like partially_refunded_with_pending_dispute_unknown_capture, which is what happens when you encode three orthogonal axes into a single enum.
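
As a rough illustration of keeping the three dimensions apart, here is a minimal sketch in Python. The class names, enum members, and the net_to_merchant helper are assumptions made for this example, not a schema from any particular system. The numbers reuse the $100 transaction with $50 refunded and $25 disputed from above.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class Lifecycle(Enum):
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    VOIDED = "voided"


class Knowledge(Enum):
    CONFIRMED = "confirmed"    # the processor told us and nothing contradicts it
    UNKNOWN = "unknown"        # e.g. a request timed out and the outcome is not known
    CONFLICTED = "conflicted"  # two sources disagree


@dataclass(frozen=True)
class Money:
    authorized: Decimal = Decimal("0")
    captured: Decimal = Decimal("0")
    refunded: Decimal = Decimal("0")
    disputed: Decimal = Decimal("0")

    @property
    def net_to_merchant(self) -> Decimal:
        # Hypothetical helper: captured funds minus refunds and disputed amounts.
        return self.captured - self.refunded - self.disputed


@dataclass(frozen=True)
class TransactionState:
    lifecycle: Lifecycle   # where the transaction is operationally
    money: Money           # how much sits in each bucket
    knowledge: Knowledge   # how sure we are about the two fields above


state = TransactionState(
    lifecycle=Lifecycle.CAPTURED,
    money=Money(
        authorized=Decimal("100"),
        captured=Decimal("100"),
        refunded=Decimal("50"),
        disputed=Decimal("25"),
    ),
    knowledge=Knowledge.CONFIRMED,
)
print(state.money.net_to_merchant)
```

With three fields, a partial refund changes only the money dimension, and a timeout changes only the knowledge dimension. Neither requires a new enum member.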

The asynchronous event problem

Payment events don't arrive in the order they happen. A settlement notification might arrive before the original capture confirmation, because settlement files are batched and the capture confirmation is delayed by a retry. A dispute notification can arrive months after the transaction. A reversal can land before the original authorization has been fully processed by your system.

Your state machine needs to handle events that arrive in any order. The naive design — "when event X arrives, transition from state A to state B" — fails the moment X arrives before A has been reached. The system either rejects the event (losing data) or applies it incorrectly (corrupting state).

The fix is to model state transitions as the result of event reconciliation, not direct application. You don't transition from Authorized to Captured because a capture event arrived. You receive the capture event, append it to an event log, and recompute the transaction's state from the entire event history. If a settlement event arrives before the capture event, you store it. When the capture event finally arrives, you recompute the state and discover that the transaction is now both Captured and Settled. The order of arrival did not matter.

This is a fundamentally different architectural pattern from the one most teams start with. It requires that transactions be modeled as event logs, not as mutable rows. It requires that state be derived rather than stored. And it requires careful thought about how to handle events that contradict each other — because those happen too.
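
The append-and-recompute idea can be sketched in a few lines of Python. This is a simplified illustration, not a production design: the Event fields, the three event kinds, and the awaiting_capture_confirmation flag are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str  # hypothetical unique id, used to drop duplicate deliveries
    kind: str      # "authorized", "captured", or "settled" in this sketch


def derive_state(events):
    """Recompute state from the whole history; arrival order is irrelevant."""
    unique = {e.event_id: e for e in events}.values()
    kinds = {e.kind for e in unique}
    return {
        "authorized": "authorized" in kinds,
        "captured": "captured" in kinds,
        "settled": "settled" in kinds,
        # Settlement seen without a capture: keep the event, but flag the gap.
        "awaiting_capture_confirmation": "settled" in kinds and "captured" not in kinds,
    }


class TransactionLog:
    """Append-only log; state is always derived, never stored."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return derive_state(self._events)


log = TransactionLog()
log.append(Event("e1", "authorized"))
early = log.append(Event("e3", "settled"))   # settlement arrives first
final = log.append(Event("e2", "captured"))  # capture confirmation arrives late
```

Because derive_state looks only at the set of events it has seen, replaying the same events in any order gives the same result. The early settlement is stored and flagged instead of being rejected.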

When events contradict each other

A processor sends a webhook saying a transaction was approved. The next day's settlement file says the same transaction was declined. Which is correct?

The honest answer is that you don't know. The settlement file is usually authoritative, but not always — sometimes the file is wrong, and the webhook was right. Sometimes both are right, and the transaction was approved and then administratively reversed. Sometimes one of them is referring to a different transaction with a similar reference number.

Your state machine has to model this. You need a state that means "we received conflicting information and are not sure what's true." Not as a transient state, but as a real state that the system can be in for hours or days while operators investigate. If you don't have this state, you'll either pick one event arbitrarily and be wrong, or pick the most recent event and create flapping behavior as new evidence arrives.

I have seen production systems where transactions silently flipped between Approved and Declined every time a new event arrived, because the state machine had no concept of conflict. The merchant's dashboard showed different totals depending on when they refreshed.
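
One way to avoid that flapping is to make conflict an explicit, sticky outcome that only an operator decision can clear. The sketch below assumes a hypothetical Report record and a resolve_outcome function. Real systems would carry far more evidence per report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    source: str   # e.g. "webhook" or "settlement_file"
    outcome: str  # "approved" or "declined"


def resolve_outcome(reports, operator_decision=None):
    """Return 'approved', 'declined', 'unknown', or 'conflicted'."""
    if operator_decision is not None:
        # A human investigated and decided; that overrides the raw evidence.
        return operator_decision
    outcomes = {r.outcome for r in reports}
    if not outcomes:
        return "unknown"
    if len(outcomes) == 1:
        return next(iter(outcomes))
    # Sources disagree. Do not pick the latest one; stay conflicted.
    return "conflicted"


reports = [
    Report("webhook", "approved"),
    Report("settlement_file", "declined"),
]
status = resolve_outcome(reports)
```

Adding a third report to the list above does not flip the result back to Approved or Declined. It stays conflicted until operator_decision is supplied.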

The state machine as a function of processor

Here is something most people miss: payment state machines are not universal. Every processor has its own set of legal transitions, and they don't agree.

For one processor, you can void a transaction any time before settlement. For another, the void window is a fixed two hours. For a third, voids are converted to refunds automatically after a different threshold. For a fourth, the void operation doesn't exist at all — you have to use a credit transaction that the system treats differently for reporting purposes.

If your state machine is universal, it has to model the union of all processor capabilities, and it has to know which transitions are legal for which processor. This means the state machine isn't a static diagram. It's a function of the processor that handled the transaction: the legal transitions for transaction T are determined by looking up the profile of the processor that handled it.

[Diagram: state transitions that are legal only for some processors]

Even this diagram is incomplete, because it doesn't show that the same transition has different timing rules and side effects per processor. To represent it fully, you'd need a separate state machine per processor, plus a translation layer that maps between them when transactions migrate or when the abstraction layer mediates between merchant intent and processor capability.
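
To make "a function of the processor" concrete, here is a minimal sketch of a profile lookup for the void transition alone. The processor names, the ProcessorProfile fields, and the choice to measure the void window from the authorization time are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ProcessorProfile:
    name: str
    supports_void: bool
    # None means "any time before settlement"; otherwise a fixed window.
    void_window: Optional[timedelta]


# Hypothetical profiles mirroring three of the cases described above.
PROFILES = {
    "proc_a": ProcessorProfile("proc_a", supports_void=True, void_window=None),
    "proc_b": ProcessorProfile("proc_b", supports_void=True, void_window=timedelta(hours=2)),
    "proc_d": ProcessorProfile("proc_d", supports_void=False, void_window=None),
}


def can_void(processor, authorized_at, now, settled):
    """Is the void transition legal for this transaction right now?"""
    profile = PROFILES[processor]
    if not profile.supports_void or settled:
        return False
    if profile.void_window is None:
        return True
    # Assumption: the window is measured from the authorization time.
    return now - authorized_at <= profile.void_window


t0 = datetime(2024, 1, 1, 12, 0)
three_hours_later = t0 + timedelta(hours=3)
```

The same question, asked three hours after authorization, gets a different answer for each of the three profiles. A single static diagram cannot express that.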

Time as a state

Most state machines treat time as something that happens between transitions. Payment state machines have to treat time as a first-class part of the state itself.

A transaction in state Authorized is not the same after eight hours as it was at minute one. The authorization is decaying — it has a hold on the customer's funds, and that hold expires after a window that varies by card brand, by issuer, and by transaction type. After expiration, the funds are released back to the customer, and any subsequent capture attempt will fail in a way that's specific to the processor and possibly silent.

Your state machine has to know about these temporal constraints and act on them. Some transactions need to be auto-captured before their hold expires. Some need to be voided automatically if they haven't been captured by a deadline. Some need to be flagged for review when they cross a timing threshold. None of this can be expressed as a simple state diagram — it requires a scheduler that's coupled to the state machine and aware of every transaction's lifecycle clock.
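
A scheduler coupled to the state machine can be as simple as a periodic sweep that asks which holds are close to expiring. The sketch below is one possible shape, with assumed names throughout: hold_duration would in practice come from a per-brand or per-issuer lookup, and auto_capture is a hypothetical merchant setting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AuthorizedTxn:
    txn_id: str
    authorized_at: datetime
    hold_duration: timedelta  # assumed to come from a brand/issuer lookup
    auto_capture: bool        # hypothetical merchant setting


def due_actions(txns, now, safety_margin=timedelta(hours=6)):
    """Return (txn_id, action) pairs for holds that are expiring or already expired."""
    actions = []
    for t in txns:
        expires_at = t.authorized_at + t.hold_duration
        if now >= expires_at:
            # Too late to capture safely; send to a human.
            actions.append((t.txn_id, "flag_expired"))
        elif now >= expires_at - safety_margin:
            # Inside the safety margin: act before the hold lapses.
            actions.append((t.txn_id, "capture" if t.auto_capture else "void"))
    return actions


t0 = datetime(2024, 1, 1)
txns = [
    AuthorizedTxn("t1", t0, timedelta(days=7), auto_capture=True),
    AuthorizedTxn("t2", t0, timedelta(days=7), auto_capture=False),
]
near_expiry = due_actions(txns, now=t0 + timedelta(days=6, hours=20))
```

Passing now as an argument, instead of reading the clock inside the function, keeps the timing rules testable.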

State versioning across deployments

The state machine evolves. You add new states. You add new transitions. You modify the rules. And when you do, you need to handle every transaction that's currently in the old version of the state machine, including transactions that have been sitting in unusual states for months.

Most teams don't think about this until they're in the middle of a migration and discover that the new state machine has no opinion about transactions that started in a state that no longer exists. The fix is to version your state machine explicitly. Every transaction records the version of the state machine that governs it, and the system can run multiple versions concurrently. New transactions use the new version. Old transactions complete their lifecycle under the old version. Migration happens when transactions reach a terminal state and can be marked archived.

This sounds like overkill until you've shipped a state machine change and watched it break a long-running transaction that was waiting for an event that the new code no longer handles.
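
Explicit versioning can be sketched as a registry of state-derivation functions keyed by version, with each transaction recording the version that governs it. The function names, the dictionary layout, and the "disputed" state added in version 2 are assumptions for this example.

```python
def derive_v1(events):
    # Version 1 knows only about authorization and capture.
    if "captured" in events:
        return "captured"
    if "authorized" in events:
        return "authorized"
    return "new"


def derive_v2(events):
    # Version 2 adds a "disputed" state on top of version 1.
    if "disputed" in events:
        return "disputed"
    return derive_v1(events)


# Old versions stay registered so in-flight transactions can finish under them.
MACHINES = {1: derive_v1, 2: derive_v2}
CURRENT_VERSION = 2


def new_transaction():
    # Every transaction records which version governs it.
    return {"machine_version": CURRENT_VERSION, "events": []}


def current_state(txn):
    return MACHINES[txn["machine_version"]](txn["events"])


old_txn = {"machine_version": 1, "events": ["authorized", "captured"]}
new_txn = new_transaction()
new_txn["events"] += ["authorized", "captured", "disputed"]
```

The important property is that deploying version 2 does not remove version 1 from the registry. An old version is retired only when no live transaction still references it.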

Why this is hard to get right upfront

Payment state machines accumulate complexity over time because the complexity comes from the real world, not from the design. You cannot anticipate the edge cases that will matter to your business, because they depend on which processors you use, which merchants you serve, which payment methods are involved, and which weird operational scenarios occur in production. The state machine has to grow as you discover them.

What you can do upfront is build the state machine in a way that allows it to grow safely. Three things matter:

Events, not status updates. Model transitions as events appended to a log, not as writes to a status column. This makes new states cheap to add, new event types easy to handle, and historical state reconstructible.

Derived state, not stored state. The current state of a transaction should be a function of its event history, not a field that gets updated. This eliminates the class of bugs where state and history disagree.

Explicit uncertainty. Model knowledge state as a first-class concept. A transaction can be Authorized and known, Authorized and pending confirmation, or Authorized and contradicted by a later event. These are different situations and need different handling.

The state machine as the system of record

The deepest mistake teams make is treating the state machine as an implementation detail of the payment system. It isn't. The state machine is the system. Everything else — the API, the database schema, the reconciliation logic, the support tools, the dashboards — is a view on top of the state machine. If the state machine is wrong, nothing built on top of it can be right.

This means the state machine needs to be the most carefully designed, most rigorously tested, and most precisely documented part of the entire codebase. It needs property tests that verify invariants under arbitrary event sequences. It needs replay tests that run real production event streams through new versions and verify that the derived states match the old ones everywhere except where a change was intended. It needs documentation that an operator can read at three in the morning and understand what state a transaction is in and what's allowed to happen next.
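
A small-scale version of such a property test can be written with the standard library alone, by checking every arrival order of a short event list. This is a sketch under stated assumptions: derive_money and the "refunded never exceeds captured" invariant are illustrative stand-ins for your own derivation and invariants, and a property-testing library such as Hypothesis could generate the event lists instead of a hand-written one.

```python
from itertools import permutations


def derive_money(events):
    """Fold (kind, amount_in_cents) events into totals, ignoring unknown kinds."""
    totals = {"captured": 0, "refunded": 0}
    for kind, cents in events:
        if kind in totals:
            totals[kind] += cents
    return totals


def check_invariants(events):
    """Check every arrival order of `events`; return the single derived result."""
    results = set()
    for order in permutations(events):
        totals = derive_money(order)
        # Invariant: you can never have refunded more than you captured.
        assert totals["refunded"] <= totals["captured"], "refunded more than captured"
        results.add((totals["captured"], totals["refunded"]))
    # Property: the derived totals do not depend on arrival order.
    assert len(results) == 1, "derived state depends on arrival order"
    return results.pop()


events = [
    ("captured", 10000),
    ("refunded", 3000),
    ("refunded", 2000),
    ("settled", 0),
]
result = check_invariants(events)
```

Exhaustive permutation only works for short lists, since the number of orderings grows factorially. For longer histories, randomly sampled orderings are the practical substitute.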

Most teams don't treat it this way. They treat it as a status enum on a database table, modify it casually, and pay the cost in production incidents.

What I tell engineers starting on payments

If I'm onboarding an engineer to a payment system, the first thing I want them to internalize is this: the state machine is going to surprise you. Whatever model you have of how transactions move through states, real transactions will violate it within days. The only defense is humility — assume your model is incomplete, build in mechanisms for discovering what you missed, and treat every production anomaly as a signal that the state machine needs another state.

Payment state machines are not a problem you solve once. They're a discipline you maintain forever. The teams that get this right build their state machine as a living document, refined by every incident and every weird transaction. The teams that don't end up with a status column nobody trusts and a database full of transactions in states that nobody can explain.

If you can name the exact set of states a transaction can be in, you've either built a very simple system or you haven't run it in production long enough.

This is part of a series on payment systems architecture. See also "why every POS platform needs a payment abstraction layer" and "the hardest part of payment systems is reconciliation".