A State Machine Approach to Reliable Payment Processing in POS Systems

Abstract

Payment processing in Point-of-Sale (POS) systems imposes correctness requirements that are uncommon in other transactional domains: every operation must be recoverable from partial failure, attributable to a deterministic outcome, and reconcilable against an external financial ledger that the system does not control. Ad-hoc implementations typically represent transaction state as a column on a database row, mutated imperatively as processor responses arrive. This approach degrades under the conditions that define production payment systems — network partitions, processor timeouts, duplicate webhooks, and ambiguous intermediate responses.

This paper presents a state machine approach to payment processing in POS SaaS platforms. We define an explicit transaction lifecycle, enumerate the permissible transitions under both normal and adversarial conditions, and describe the idempotency, persistence, and recovery mechanisms that make the model correct under realistic failure. The approach does not eliminate payment-system complexity; it localizes that complexity within a structure that can be reasoned about, tested, and evolved without continual archaeology of prior incidents.

Keywords: payment processing; state machine; Point-of-Sale; POS; idempotency; distributed systems; reliability; failure recovery.

1. Introduction

POS platforms integrate with external payment processors over networks that are unreliable by construction. Even in steady state, a non-trivial fraction of requests encounter timeouts, duplicate responses, or delayed settlement notifications. These conditions are routine, not exceptional; a production payment system that does not model them explicitly will silently accumulate incorrect state, and the correction of that state is invariably more expensive than the upfront modeling would have been.

The canonical failure mode is the implicit state machine. A transaction is represented as a mutable record whose status field is updated by whichever code path handles the latest processor response. Over time, branches accumulate, invariants drift, and the system's true behavior exists only in the memory of engineers who have debugged prior incidents. The diagram that described the system at day zero bears little resemblance to the system at year two. The cost of this drift is not borne by the engineers; it is borne by merchants who see phantom charges, missing refunds, or transactions that settle for the wrong amount.

This paper argues for an alternative: an explicit, first-class state machine as the authoritative representation of transaction state. The transition function is total over (state, event) pairs: every pair maps either to a next state or to an explicit rejection. Each transition is persisted when it occurs, rather than whenever a processor call happens to return. Recovery logic operates on state, not on the imperative history that produced it.

Our contribution is pragmatic rather than theoretical. We describe the state model, the transitions, the idempotency discipline, and the recovery behaviors that we have found to be correct under production conditions in multi-processor POS SaaS systems. We make no claim to formal verification; we claim only that the model, as described, survives the failure modes that an implicit state machine does not.

2. Problem Definition

A payment transaction in a POS context originates as an intent: a merchant, via a terminal, presents a payment method and an amount for authorization. The system must produce one of three authoritative outcomes: success (funds guaranteed, settlement pending), failure (no financial obligation), or an acknowledged ambiguous state that requires subsequent resolution. Anything short of these three outcomes (for instance, a "maybe" encoded as a logged error with no corresponding record) constitutes a correctness bug. The merchant and the customer bear the financial consequences; the system cannot afford intermediate conditions that neither party can observe.

The problem is rendered non-trivial by several structural conditions:

Network unreliability. The processor call may time out, the response may be lost in transit, or both the request and the response may be delivered with the latter significantly delayed.
Processor ambiguity. Not all processors guarantee idempotency for all operations. A naive retry may therefore produce duplicate captures or refunds.
Asynchronous settlement. Authorization and settlement are decoupled by minutes to days; the settlement file may disagree with the original authorization outcome.
Out-of-order events. Webhooks, polling results, and batch settlement files may arrive in orders that contradict the observed sequence of processor calls.
Partial success. A capture may succeed at the processor but fail to persist locally; a refund may persist locally but fail at the processor.

Each of these conditions admits a local fix. What they defeat collectively is an implementation style that treats each condition as an exception handler bolted onto a happy-path implementation. A state machine approach does not eliminate the conditions; it forces them to be considered as first-class transitions in a shared model, where their interactions can be reasoned about rather than discovered.

3. Transaction Lifecycle Model

We define ten states, each representing a distinct observable condition of the transaction.

INITIATED. A transaction record exists; no processor call has been made. The system has committed to attempting the payment but has not yet exposed itself to network failure.
PENDING. A processor call has been issued. The outcome is not yet known.
AUTHORIZED. The processor has confirmed that funds are reserved on the customer's payment instrument. No debit has occurred.
CAPTURED. A capture request has been confirmed. A debit is queued for settlement.
SETTLED. The processor's settlement file confirms the capture. The transaction is financially final from the system's perspective.
VOIDED. An authorization has been released prior to capture.
REFUNDED. A settled or captured transaction has been reversed, in whole or in part.
DECLINED. The processor has rejected the authorization. The transaction is terminal and the merchant has no obligation.
FAILED. The transaction cannot proceed and no processor-side obligation has been incurred (for example, a local validation error before any network call).
UNCERTAIN. The processor's state is unknown to the system. Resolution is pending a status check, a subsequent webhook, or reconciliation against a settlement file.

The UNCERTAIN state is the most consequential. It is the state produced by timeouts, malformed responses, and other conditions in which the processor may or may not have acted on the request. Conflating UNCERTAIN with DECLINED produces phantom declines and, on retry, double charges. Conflating it with AUTHORIZED produces phantom obligations. Making it a first-class state forces the system to confront the condition explicitly rather than resolve it by convenient default.

Terminal states are SETTLED, VOIDED, REFUNDED, DECLINED, and FAILED, with one qualification: a SETTLED transaction may still move to REFUNDED through a post-settlement reversal (Section 4). UNCERTAIN is non-terminal by construction; the system is obligated to resolve it into a known state, from which the lifecycle either continues or ends. Each state also carries side-effect obligations that are part of its definition, not incidental consequences: AUTHORIZED implies a hold on the customer's funds that must be released within a processor-specific window if capture does not follow; SETTLED implies a corresponding entry in the merchant's ledger; REFUNDED implies a reversal record with traceable lineage to the original capture.
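
As an illustration, the ten states and the terminal set can be written as types so that a compiler rejects misspelled or undeclared states. This is a minimal sketch; the names `ALL_STATES`, `TERMINAL_STATES`, `isTerminal`, and `needsResolution` are assumptions introduced here, not part of the system described.

```typescript
// The ten lifecycle states from this section, as a string-literal union.
const ALL_STATES = [
  "INITIATED", "PENDING", "AUTHORIZED", "CAPTURED", "SETTLED",
  "VOIDED", "REFUNDED", "DECLINED", "FAILED", "UNCERTAIN",
] as const;

type TransactionState = (typeof ALL_STATES)[number];

// Terminal states as listed in the text. Section 4 still permits
// SETTLED -> REFUNDED, so this set marks states that require no further
// action, not states that have no outgoing transition at all.
const TERMINAL_STATES: ReadonlySet<TransactionState> = new Set<TransactionState>([
  "SETTLED", "VOIDED", "REFUNDED", "DECLINED", "FAILED",
]);

function isTerminal(state: TransactionState): boolean {
  return TERMINAL_STATES.has(state);
}

// Hypothetical helper: UNCERTAIN must always be resolved, so a background
// job could use this predicate to select transactions that need work.
function needsResolution(state: TransactionState): boolean {
  return state === "UNCERTAIN";
}
```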

4. State Machine Design

The permissible transitions are as follows. Each is annotated by the event that triggers it.

INITIATED → PENDING — authorization request dispatched.
PENDING → AUTHORIZED — successful authorization response received.
PENDING → DECLINED — explicit processor decline.
PENDING → UNCERTAIN — timeout or ambiguous response.
PENDING → FAILED — local error before the request reached the network.
AUTHORIZED → CAPTURED — successful capture response.
AUTHORIZED → VOIDED — successful void response.
AUTHORIZED → UNCERTAIN — capture or void outcome unknown.
CAPTURED → SETTLED — settlement file confirms capture.
CAPTURED → REFUNDED — successful refund response.
CAPTURED → UNCERTAIN — refund outcome unknown.
SETTLED → REFUNDED — post-settlement reversal.
UNCERTAIN → {AUTHORIZED, CAPTURED, DECLINED, FAILED, VOIDED} — resolution via status query, webhook, or reconciliation.

Transitions out of terminal states are rejected, with the single exception of SETTLED → REFUNDED listed above. An attempt to capture a DECLINED transaction, or to void a SETTLED one, must raise an error at the state-machine layer, before any processor call is attempted. This is the primary guardrail against the class of bugs in which an out-of-order event triggers a protocol operation that the processor will reject or, worse, silently accept and reconcile incorrectly.

Each transition is a function (state, event, payload) → next_state. Invalid (state, event) combinations return an explicit rejection. Valid transitions persist the new state atomically with a transition record capturing the timestamp, the prior state, the event type, the processor response data (where applicable), and the actor (system, user, or reconciliation). The state machine is thus both a control structure — governing which processor calls may be issued — and an audit log, recording the full lineage of each transaction in a form that survives crashes and enables deterministic recovery.
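
The transition function can be sketched as a lookup table keyed by (state, event). The table below mirrors the list at the start of this section. The event names (`AUTH_DISPATCHED`, `NO_PROCESSOR_RECORD`, and so on) and the function name `applyEvent` are hypothetical, since the text names triggers only in prose, and persistence of the resulting record is omitted.

```typescript
type State =
  | "INITIATED" | "PENDING" | "AUTHORIZED" | "CAPTURED" | "SETTLED"
  | "VOIDED" | "REFUNDED" | "DECLINED" | "FAILED" | "UNCERTAIN";

// Hypothetical event names, one per trigger described in the list above.
type EventType =
  | "AUTH_DISPATCHED" | "AUTH_APPROVED" | "AUTH_DECLINED" | "LOCAL_ERROR"
  | "OUTCOME_UNKNOWN" | "CAPTURE_CONFIRMED" | "VOID_CONFIRMED"
  | "REFUND_CONFIRMED" | "SETTLEMENT_CONFIRMED" | "NO_PROCESSOR_RECORD";

type Actor = "system" | "user" | "reconciliation";

interface Transition {
  fromState: State;
  toState: State;
  event: EventType;
  actor: Actor;
  occurredAt: Date;
}

// (state, event) -> next state. Any pair that is absent is invalid.
const TABLE: Partial<Record<State, Partial<Record<EventType, State>>>> = {
  INITIATED: { AUTH_DISPATCHED: "PENDING" },
  PENDING: {
    AUTH_APPROVED: "AUTHORIZED",
    AUTH_DECLINED: "DECLINED",
    OUTCOME_UNKNOWN: "UNCERTAIN",
    LOCAL_ERROR: "FAILED",
  },
  AUTHORIZED: {
    CAPTURE_CONFIRMED: "CAPTURED",
    VOID_CONFIRMED: "VOIDED",
    OUTCOME_UNKNOWN: "UNCERTAIN",
  },
  CAPTURED: {
    SETTLEMENT_CONFIRMED: "SETTLED",
    REFUND_CONFIRMED: "REFUNDED",
    OUTCOME_UNKNOWN: "UNCERTAIN",
  },
  SETTLED: { REFUND_CONFIRMED: "REFUNDED" },
  UNCERTAIN: {
    AUTH_APPROVED: "AUTHORIZED",
    CAPTURE_CONFIRMED: "CAPTURED",
    AUTH_DECLINED: "DECLINED",
    NO_PROCESSOR_RECORD: "FAILED",
    VOID_CONFIRMED: "VOIDED",
  },
  // VOIDED, REFUNDED, DECLINED, FAILED have no entries: every event is rejected.
};

type Result =
  | { ok: true; transition: Transition }
  | { ok: false; reason: string };

function applyEvent(
  state: State,
  event: EventType,
  actor: Actor,
  now: Date = new Date(),
): Result {
  const next = TABLE[state]?.[event];
  if (next === undefined) {
    // Explicit rejection: the caller must not issue a processor call.
    return { ok: false, reason: `invalid transition: ${event} in state ${state}` };
  }
  return {
    ok: true,
    transition: { fromState: state, toState: next, event, actor, occurredAt: now },
  };
}
```

Under these assumptions, `applyEvent("DECLINED", "CAPTURE_CONFIRMED", "system")` returns a rejection, while `applyEvent("SETTLED", "REFUND_CONFIRMED", "user")` yields a transition to REFUNDED.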

5. Idempotency and Retry Handling

Retries are necessary for reliability under network failure; idempotency is necessary to prevent retries from producing duplicate side effects. The two are coupled: a retry without idempotency is a double charge waiting to happen.

We adopt the following discipline. Each externally visible operation (authorize, capture, void, refund) is associated with a client-supplied idempotency key, scoped to the tuple (merchant_id, operation_type). The system stores, against this key, the canonical request fingerprint and the canonical response. A retry bearing the same key and a matching fingerprint returns the stored response without issuing a new processor call. A retry bearing the same key but a different fingerprint is rejected — such a retry represents a client bug, not a retry, and silently accepting it produces a class of correctness failures that are indistinguishable from malice.

interface IdempotencyRecord {
  key: string;                  // scoped to (merchantId, operationType)
  requestFingerprint: string;   // hash of canonical request body
  response: StoredResponse;     // canonical response to replay
  createdAt: Date;
  expiresAt: Date;
}

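function handleRequest(req: Request): Response {
  // Compute the fingerprint once; it is used for both comparison and storage.
  const requestFingerprint = fingerprint(req);
  const existing = idempotencyStore.get(req.idempotencyKey);
  if (existing) {
    // Same key, different body: a client bug, not a retry.
    if (existing.requestFingerprint !== requestFingerprint) {
      throw new IdempotencyConflict();
    }
    // Genuine retry: replay the stored response without a processor call.
    return existing.response;
  }
  // First sighting of this key: execute once, then record the outcome.
  // NOTE: get-then-put is not atomic. Concurrent requests bearing the same
  // key need a lock or a unique-key insert ahead of executeAndPersist.
  const response = executeAndPersist(req);
  idempotencyStore.put(req.idempotencyKey, {
    requestFingerprint,
    response,
    // ...
  });
  return response;
}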
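
The listing above leaves key scoping and fingerprinting abstract. One possible shape for both is sketched below. The helpers `scopedKey` and `canonicalize` are hypothetical, the request body is assumed to be JSON-compatible, and the final hashing step (for example SHA-256 over the canonical string) is omitted to keep the sketch free of dependencies.

```typescript
type OperationType = "authorize" | "capture" | "void" | "refund";

// Scope the client-supplied key to (merchantId, operationType) so that the
// same client key used by two merchants, or for two operations, cannot collide.
// JSON array encoding keeps the three parts unambiguous.
function scopedKey(merchantId: string, op: OperationType, clientKey: string): string {
  return JSON.stringify([merchantId, op, clientKey]);
}

// Canonical serialization: object keys are sorted recursively, so two bodies
// with the same content but a different key order produce the same string.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .filter((k) => obj[k] !== undefined) // mirror JSON: skip undefined fields
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  // Primitives; undefined inside arrays is encoded as null, as JSON does.
  return value === undefined ? "null" : JSON.stringify(value);
}
```

With this canonical form, a retry that reorders fields still matches the stored fingerprint, whereas a retry that changes the amount does not and is rejected as a conflict.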

At the processor boundary, each adapter is responsible for translating the system-layer idempotency key into the processor-specific idempotency mechanism — an HTTP header, a request field, or a stateful session identifier, depending on the processor. Where a processor does not guarantee idempotency for a given operation, the adapter is responsible for implementing a status-check-before-retry protocol: before retrying a request whose outcome is unknown, query the processor for the transaction's current state; issue the retry only if no prior attempt is found.
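
The status-check-before-retry protocol might look as follows. The `ProcessorAdapter` interface, its `lookup` and `submit` methods, and the function name are hypothetical and do not correspond to any real processor API; calls are written synchronously for brevity, as in the listing above.

```typescript
type ProcessorStatus = "APPROVED" | "DECLINED" | "NOT_FOUND";

// Hypothetical adapter surface for a processor without native idempotency.
interface ProcessorAdapter {
  lookup(requestId: string): ProcessorStatus;
  submit(requestId: string, amountMinor: number): ProcessorStatus;
}

interface RetryOutcome {
  status: ProcessorStatus;
  resent: boolean; // true only if a new request actually went out
}

function retryWithStatusCheck(
  adapter: ProcessorAdapter,
  requestId: string,
  amountMinor: number,
): RetryOutcome {
  // Ask first: did an earlier attempt reach the processor?
  const prior = adapter.lookup(requestId);
  if (prior !== "NOT_FOUND") {
    // It did. Adopt that outcome and do not send a second request.
    return { status: prior, resent: false };
  }
  // No trace of a prior attempt: sending again is considered safe.
  return { status: adapter.submit(requestId, amountMinor), resent: true };
}
```

This sketch does not close the window in which the original request is still in flight when the lookup runs; a production adapter would need a delay or a processor-side guarantee to cover that case.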

Retries are bounded. An unbounded retry loop against a degraded processor produces queue growth, timeout amplification, and eventual data corruption when the processor recovers and processes a backlog of indistinguishable duplicates. We bound retries by count (typically 3–5) and by elapsed time (typically not exceeding the authorization hold window), after which the transaction transitions to UNCERTAIN and awaits asynchronous resolution.
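
A retry loop bounded by both count and elapsed time can be sketched as below. The names `withBoundedRetry`, `RetryLimits`, and the injected clock are assumptions made for illustration; backoff between attempts is omitted, and the `attempt` callback is assumed to perform any status check before re-sending.

```typescript
type AttemptResult<T> = { kind: "ok"; value: T } | { kind: "unknown" };

interface RetryLimits {
  maxAttempts: number;  // e.g. 3 to 5
  maxElapsedMs: number; // e.g. bounded by the authorization hold window
}

type RetryOutcome<T> =
  | { state: "RESOLVED"; value: T; attempts: number }
  | { state: "UNCERTAIN"; attempts: number };

function withBoundedRetry<T>(
  attempt: () => AttemptResult<T>,
  limits: RetryLimits,
  now: () => number = Date.now, // injectable clock, so tests are deterministic
): RetryOutcome<T> {
  const startedAt = now();
  let attempts = 0;
  // Stop as soon as either bound is exhausted.
  while (attempts < limits.maxAttempts && now() - startedAt < limits.maxElapsedMs) {
    attempts += 1;
    const result = attempt();
    if (result.kind === "ok") {
      return { state: "RESOLVED", value: result.value, attempts };
    }
  }
  // Bounds exhausted without a definitive answer: hand over to
  // asynchronous resolution instead of looping further.
  return { state: "UNCERTAIN", attempts };
}
```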

6. Failure Scenarios and Recovery

Four failure classes dominate operational incidents.

Request timeout. The client issues a request; no response arrives within the timeout window. The request may have been delivered and processed, delivered and not yet processed, or dropped in transit. The state machine transitions the transaction to UNCERTAIN. Resolution proceeds by issuing a status-check request against the processor, keyed by the idempotency key or a unique request identifier. If the status query returns a definitive state, the state machine transitions accordingly. If the status query itself fails, the transaction remains UNCERTAIN; resolution defers to the next settlement file, which provides an authoritative view of the processor's position.
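
The resolution order described above (status query first, settlement file as fallback) can be sketched as a single function. The names `resolveUncertain`, `Lookup`, and `Resolution` are hypothetical, and both lookups are passed in as callbacks so the sketch stays independent of any particular processor.

```typescript
type KnownState = "AUTHORIZED" | "CAPTURED" | "DECLINED" | "FAILED" | "VOIDED";

type Lookup =
  | { kind: "found"; state: KnownState }
  | { kind: "unavailable" }; // query failed, or no settlement file yet

interface Resolution {
  state: KnownState | "UNCERTAIN";
  source: "status_query" | "settlement_file" | "unresolved";
}

function resolveUncertain(
  queryStatus: () => Lookup,
  checkSettlement: () => Lookup,
): Resolution {
  // Preferred path: ask the processor directly.
  const status = queryStatus();
  if (status.kind === "found") {
    return { state: status.state, source: "status_query" };
  }
  // Fallback: the settlement file is the authoritative view.
  const settlement = checkSettlement();
  if (settlement.kind === "found") {
    return { state: settlement.state, source: "settlement_file" };
  }
  // Neither source answered: stay UNCERTAIN and try again later.
  return { state: "UNCERTAIN", source: "unresolved" };
}
```

Recording the `source` alongside the resolved state lets the transition record show which evidence settled the question.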

Duplicate response. The client receives two responses for the same request, typically as a result of the processor's own retry logic interacting with a partitioned or briefly congested network. The second response is ignored: the state machine accepts only one transition per event. The idempotency layer, when functioning, prevents the second response from triggering a duplicate transition by matching against the stored fingerprint. When the idempotency layer is absent or misconfigured, duplicate responses produce duplicate transitions — which is why idempotency is a system-wide invariant, not a per-call option.
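
One way to enforce "one transition per event" is to remember which response or event identifiers have already been applied. The class below is an in-memory sketch with hypothetical names; a production system would presumably store the seen identifiers durably, in the same database transaction as the state change.

```typescript
class ResponseDeduplicator {
  private readonly seen = new Set<string>();

  // Returns true the first time a (transactionId, eventId) pair is offered,
  // and false for every later delivery of the same pair.
  accept(transactionId: string, eventId: string): boolean {
    const key = JSON.stringify([transactionId, eventId]);
    if (this.seen.has(key)) {
      return false; // duplicate delivery: ignore, do not transition again
    }
    this.seen.add(key);
    return true;
  }
}
```

A caller would invoke the state machine only when `accept` returns true, so a second copy of the same response produces no second transition.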

Post-capture decline at settlement. The processor returns a success for a capture, but the settlement file the following day reports the capture as rejected, typically due to issuer-side risk reassessment or batch-level funding failure. The reconciliation pipeline detects the discrepancy between the local state (CAPTURED) and the settlement record (missing or rejected). The state machine transitions the transaction from CAPTURED to FAILED, with a reconciliation-sourced transition record distinguishing the cause from a synchronous failure. This transition supplements the list in Section 4 and is a legitimate part of the model; excluding it produces a class of silent drift between local state and financial reality.
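
The discrepancy detection can be sketched as a comparison between locally CAPTURED transactions and the entries of a settlement file. The record shapes and the `reconcile` function are hypothetical, and the sketch assumes the file covers the full window in which each capture was expected to settle.

```typescript
interface LocalCapture { transactionId: string; amountMinor: number; }
interface SettlementEntry { transactionId: string; amountMinor: number; }

type Finding =
  | { transactionId: string; kind: "CONFIRMED" }
  | { transactionId: string; kind: "MISSING" }
  | { transactionId: string; kind: "AMOUNT_MISMATCH"; expectedMinor: number; settledMinor: number };

function reconcile(captured: LocalCapture[], file: SettlementEntry[]): Finding[] {
  // Index the settlement file by transaction id for constant-time lookup.
  const settled = new Map<string, number>();
  for (const entry of file) {
    settled.set(entry.transactionId, entry.amountMinor);
  }
  return captured.map((c): Finding => {
    const settledMinor = settled.get(c.transactionId);
    if (settledMinor === undefined) {
      // Locally CAPTURED, absent at settlement: candidate for CAPTURED -> FAILED.
      return { transactionId: c.transactionId, kind: "MISSING" };
    }
    if (settledMinor !== c.amountMinor) {
      return {
        transactionId: c.transactionId,
        kind: "AMOUNT_MISMATCH",
        expectedMinor: c.amountMinor,
        settledMinor,
      };
    }
    // Candidate for CAPTURED -> SETTLED.
    return { transactionId: c.transactionId, kind: "CONFIRMED" };
  });
}
```

Each finding would then be fed to the state machine with the reconciliation actor; routing an amount mismatch to manual review is an assumption of this sketch rather than something the model prescribes.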

Partial persistence failure. A capture succeeds at the processor but fails to persist the state transition locally, typically due to a database outage between the processor response and the commit. On recovery, a status query against the processor reveals the capture. The state machine replays the transition from AUTHORIZED to CAPTURED, marking the transition source as recovered from processor state, which distinguishes it from a synchronous transition for audit and operational purposes.

Recovery, in all cases, operates on persistent state and transition records — not on logs, not on application memory, and not on the hope that the next code path happens to handle what the previous one missed. The correctness property we seek under recovery is that the system, given only the persistent state at recovery time, converges to the same terminal outcome regardless of the number or timing of crashes during execution.

7. Implementation Considerations

Storage. Transaction state is persisted in a dedicated table with columns for current state, a state version counter incremented on each transition, the last event observed, and the timestamp of the last transition. A separate transition log table records each historical transition in append-only form. The state table is the authoritative source for current condition; the transition log is the authoritative source for history and audit.

interface TransactionRecord {
  id: string;
  merchantId: string;
  state: TransactionState;
  stateVersion: number;
  lastEvent: EventType;
  lastTransitionAt: Date;
  amount: Money;
  processorTransactionId?: string;
}

interface TransitionRecord {
  transactionId: string;
  fromState: TransactionState;
  toState: TransactionState;
  event: EventType;
  payload: Record<string, unknown>;
  actor: 'system' | 'user' | 'reconciliation';
  occurredAt: Date;
}
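
One plausible use of the version counter is optimistic concurrency: a transition is applied only if the row still carries the version the writer read. The in-memory `TransactionStore` below is a hypothetical sketch of that idea. In a relational store the same check would typically be an UPDATE guarded by the expected version, committed together with the log insert.

```typescript
interface Row { state: string; stateVersion: number; }

interface LogEntry {
  transactionId: string;
  fromState: string;
  toState: string;
  event: string;
  version: number;
}

class TransactionStore {
  private readonly rows = new Map<string, Row>();
  private readonly log: LogEntry[] = [];

  create(id: string, state: string): void {
    this.rows.set(id, { state, stateVersion: 0 });
  }

  get(id: string): Row | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined; // copy, so callers cannot mutate the store
  }

  // Compare-and-swap on stateVersion. Returns false if the row is missing
  // or another writer has already advanced it.
  applyTransition(id: string, expectedVersion: number, toState: string, event: string): boolean {
    const row = this.rows.get(id);
    if (!row || row.stateVersion !== expectedVersion) {
      return false;
    }
    const version = expectedVersion + 1;
    // State row and log entry change together; a database would wrap
    // both writes in a single transaction.
    this.log.push({ transactionId: id, fromState: row.state, toState, event, version });
    this.rows.set(id, { state: toState, stateVersion: version });
    return true;
  }

  history(id: string): LogEntry[] {
    return this.log.filter((entry) => entry.transactionId === id);
  }
}
```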

Transitions precede side effects. A transition to PENDING is written to the database before the processor call is issued. This ordering ensures that a crash between the database write and the processor call leaves a recoverable record: the transaction is in PENDING, and a status-check-on-startup routine will discover its true state at the processor. The inverse ordering (processor call first, persistence second) produces transactions whose existence is unknown to the system after a crash — the exact class of orphaned obligations that reconciliation is least equipped to detect.
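
The ordering argument above can be made concrete with a short sketch. The `StateStore` interface, the `authorize` function, and the `findInterrupted` sweep are hypothetical names; the processor call is passed in as a callback and written synchronously for brevity.

```typescript
type State = "INITIATED" | "PENDING" | "AUTHORIZED" | "DECLINED" | "UNCERTAIN";
type AuthResult = "APPROVED" | "DECLINED";

interface StateStore {
  save(id: string, state: State): void;
  findByState(state: State): string[];
}

function authorize(
  store: StateStore,
  id: string,
  callProcessor: (id: string) => AuthResult,
): State {
  // 1. Record the intent durably before touching the network.
  store.save(id, "PENDING");
  let result: AuthResult;
  try {
    // 2. Only now issue the call. A crash at this point leaves a PENDING
    //    row behind, which the startup sweep below will find.
    result = callProcessor(id);
  } catch {
    // Timeout or transport error: the processor may or may not have acted.
    store.save(id, "UNCERTAIN");
    return "UNCERTAIN";
  }
  // 3. Persist the observed outcome.
  const next: State = result === "APPROVED" ? "AUTHORIZED" : "DECLINED";
  store.save(id, next);
  return next;
}

// Startup sweep: anything still PENDING was interrupted mid-call and
// needs a status check against the processor.
function findInterrupted(store: StateStore): string[] {
  return store.findByState("PENDING");
}
```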

API design. The public API exposes intent (authorize, capture, refund) and state (getTransaction); it does not expose transitions directly. Transitions are an internal concern of the state machine. Client retries are handled by the idempotency layer; clients do not reason about state and should not be permitted to request arbitrary transitions.

Asynchronous flows. Long-running transitions — settlement confirmation, delayed webhook arrival, scheduled authorization expiry — are handled by background workers that poll or subscribe to processor events. Each worker invokes the state machine with the observed event; the state machine remains the sole authority on whether the event produces a transition.

Testing. The state machine admits direct unit testing against a synthetic event stream. Integration tests exercise the full transition surface against sandbox processor endpoints. Property-based tests verify that, for any sequence of valid events, the resulting state is reachable from INITIATED via the defined transitions — and that no sequence produces an undefined state. This class of test is disproportionately valuable in payment systems, where the combinatorial explosion of failure modes defeats example-based testing.
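
A property-style test need not depend on a particular library. The sketch below drives a transition function with seeded pseudo-random event sequences and checks two invariants: the state is always one of the ten defined states, and VOIDED, REFUNDED, DECLINED, and FAILED are never left. The event names, the table, and the helper names are hypothetical stand-ins for the production state machine.

```typescript
const STATES = [
  "INITIATED", "PENDING", "AUTHORIZED", "CAPTURED", "SETTLED",
  "VOIDED", "REFUNDED", "DECLINED", "FAILED", "UNCERTAIN",
] as const;
type State = (typeof STATES)[number];

const EVENTS = [
  "AUTH_DISPATCHED", "AUTH_APPROVED", "AUTH_DECLINED", "LOCAL_ERROR",
  "OUTCOME_UNKNOWN", "CAPTURE_CONFIRMED", "VOID_CONFIRMED",
  "REFUND_CONFIRMED", "SETTLEMENT_CONFIRMED", "NO_PROCESSOR_RECORD",
] as const;
type EventType = (typeof EVENTS)[number];

// Stand-in for the system under test: the transition list from Section 4.
const TABLE: Partial<Record<State, Partial<Record<EventType, State>>>> = {
  INITIATED: { AUTH_DISPATCHED: "PENDING" },
  PENDING: { AUTH_APPROVED: "AUTHORIZED", AUTH_DECLINED: "DECLINED",
             OUTCOME_UNKNOWN: "UNCERTAIN", LOCAL_ERROR: "FAILED" },
  AUTHORIZED: { CAPTURE_CONFIRMED: "CAPTURED", VOID_CONFIRMED: "VOIDED",
                OUTCOME_UNKNOWN: "UNCERTAIN" },
  CAPTURED: { SETTLEMENT_CONFIRMED: "SETTLED", REFUND_CONFIRMED: "REFUNDED",
              OUTCOME_UNKNOWN: "UNCERTAIN" },
  SETTLED: { REFUND_CONFIRMED: "REFUNDED" },
  UNCERTAIN: { AUTH_APPROVED: "AUTHORIZED", CAPTURE_CONFIRMED: "CAPTURED",
               AUTH_DECLINED: "DECLINED", NO_PROCESSOR_RECORD: "FAILED",
               VOID_CONFIRMED: "VOIDED" },
};

// Rejected events leave the state unchanged.
function step(state: State, event: EventType): State {
  return TABLE[state]?.[event] ?? state;
}

// Small linear congruential generator, so failures are reproducible by seed.
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s * 1664525 + 1013904223) % 4294967296;
    return s / 4294967296;
  };
}

const NO_EXIT: readonly State[] = ["VOIDED", "REFUNDED", "DECLINED", "FAILED"];

function checkRandomSequences(runs: number, length: number, seed: number): string[] {
  const rand = lcg(seed);
  const violations: string[] = [];
  for (let r = 0; r < runs; r++) {
    let state: State = "INITIATED";
    for (let i = 0; i < length; i++) {
      const event = EVENTS[Math.floor(rand() * EVENTS.length)];
      const next = step(state, event);
      if (!STATES.includes(next)) {
        violations.push(`undefined state after ${event} in ${state}`);
      }
      if (NO_EXIT.includes(state) && next !== state) {
        violations.push(`left ${state} on ${event}`);
      }
      state = next;
    }
  }
  return violations;
}
```

An empty result from `checkRandomSequences(200, 30, 42)` means no invariant was violated for that seed; it is evidence, not proof, and a dedicated property-testing library would add shrinking of failing sequences.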

8. Discussion and Limitations

The approach described here addresses reliability, but not every problem in payment systems. Several limitations merit acknowledgment.

State explosion with multi-party transactions. Tipping, split tender, and partial refunds produce state models that exceed a single-dimensional lifecycle. In practice, these are handled by decomposing the transaction into sub-transactions, each with its own state machine, coordinated by a parent record that tracks aggregate state. The decomposition itself requires care; errors at the coordination layer are as damaging as errors within the individual state machines and are harder to diagnose.

Processor semantic drift. Processor behavior changes over time, sometimes silently. A state machine captures the system's model of the processor; when the processor's actual behavior diverges from the model, reconciliation exposes the drift, but the state machine itself does not. Ongoing reconciliation is not an optional add-on — it is the mechanism by which the state machine remains accurate over time.

Human intervention. A subset of UNCERTAIN transactions cannot be resolved programmatically. When the processor's status API and the settlement file disagree, a human must adjudicate. A merchant-operations interface is required for such adjudication, and its transitions must be recorded, attributed, and subject to the same audit discipline as automated ones.

No substitute for testing. The state machine makes the system's behavior inspectable; it does not guarantee correctness. The invariants it enforces (terminal states, transition validity) are necessary but not sufficient. Production correctness requires, in addition, a test surface that exercises the failure paths with the same rigor as the happy paths — which, in our experience, is the harder engineering problem.

9. Conclusion

A state machine is neither a novel nor a sufficient approach to payment processing. It is, however, the smallest abstraction that makes it possible to reason about payment-system behavior under failure, and the smallest that forces engineers to confront ambiguous outcomes rather than bury them in exception handlers. The states, transitions, idempotency discipline, and recovery behaviors described in this paper are not a complete specification; they are a minimum load-bearing structure. The engineering work that builds on this structure (testing, reconciliation, observability, merchant operations) is where most of the effort resides. But without the structure, that effort is spent re-discovering the same bugs, whose root cause is the absence of a model against which behavior can be checked.
