
Testing Payment Systems Is Nothing Like Testing Normal Software

By Farnaz Bagheri · 13 min read

The first time I tried to write a comprehensive test suite for a payment system, I produced something that looked thorough on paper and gave me almost no confidence in production. The unit tests passed. The integration tests passed against the processor's sandbox. The end-to-end smoke tests passed. We deployed, and within a week we had bugs that none of the tests caught — bugs that, in retrospect, were not catchable by any of the testing approaches I had used.

I spent the next several years figuring out what testing payment systems actually requires, and the short version is: most of what you know about testing software does not apply, and the parts that do apply have to be combined in ways that aren't standard practice anywhere else.

Why normal testing fails

Normal software testing rests on a few assumptions that payment systems violate.

You can run the system. For most software, you can spin up the service in a test environment, send it inputs, and observe its outputs. For payment systems, the system you're testing includes a payment processor that you don't control, that costs money to interact with, that has its own failure modes, and that behaves differently in test environments than in production.

You can observe the outputs. For most software, the outputs of a function are visible: a return value, a database write, a log line. For payment systems, the outputs include money movement that won't be visible until settlement, records held in the processor's systems that you can only query indirectly, and side effects that happen on someone else's infrastructure.

You can isolate failures. For most software, you can mock dependencies, inject failures, and verify that your code handles them correctly. For payment systems, the failure modes you care about — partial network failures, ambiguous timeouts, processor-side state divergence, settlement file errors — are very hard to simulate accurately, and the behaviors that matter are emergent properties of the interaction between your code and a system you don't control.

The tests that result from ignoring these issues are tests of the happy path against a mocked processor. They run fast, they pass reliably, and they catch almost none of the bugs that actually matter.

The four layers of payment testing

A real testing strategy for a payment system has four distinct layers, each addressing a different class of failure.


Each layer catches different bugs. None of them is sufficient alone. The discipline is in knowing which class of bug each layer can and cannot catch, and not using one layer to build false confidence about a class of bugs that requires a different layer.

Layer 1: Pure logic tests

The parts of a payment system that are pure logic — state machine transitions, fee calculations, currency conversions, validation rules — can be tested with normal unit tests, and they should be. These tests are fast, deterministic, and easy to write.

The mistake is to extend this approach to anything that touches the processor. A unit test that mocks the processor's HTTP client and verifies that the right method gets called with the right arguments is testing nothing useful. It verifies that your code makes the call you wrote it to make, which is tautological. It does not verify that the call is correct, that the response handling is correct, or that the integration as a whole behaves correctly.

The right scope for pure logic tests is the transformation of input data into the request the processor will receive, plus the transformation of the response into your internal representation. Test those transformations exhaustively. Don't test the call itself with mocks.
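
This scoping can be sketched as follows. The function names, the `Capture` type, and the processor's wire format here are all hypothetical, assumed for illustration; the point is that both transformations are pure functions you can test without any mock of the call itself.

```python
# Pure logic tests for the two transformations around the processor call:
# internal representation -> wire request, and wire response -> internal
# representation. All names and field layouts are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Capture:
    transaction_id: str
    amount_minor: int   # amounts in minor units (cents) to avoid float errors
    currency: str

def build_capture_request(capture: Capture) -> dict:
    """Transform the internal representation into the processor's wire format."""
    return {
        "txn_ref": capture.transaction_id,
        "amount": capture.amount_minor,
        "currency": capture.currency.upper(),
    }

def parse_capture_response(body: dict) -> tuple[str, int]:
    """Transform a processor response into (status, captured amount)."""
    return body["status"], int(body["captured_amount"])

def test_build_capture_request():
    req = build_capture_request(Capture("txn_1", 1999, "usd"))
    assert req == {"txn_ref": "txn_1", "amount": 1999, "currency": "USD"}

def test_parse_capture_response():
    status, amount = parse_capture_response(
        {"status": "captured", "captured_amount": "1999"}
    )
    assert status == "captured" and amount == 1999

test_build_capture_request()
test_parse_capture_response()
```

Note that the response fixtures used in these tests should be captured from the real processor, not invented, so the parsing tests exercise the shapes you actually receive.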

Property-based testing belongs here. State machines are particularly amenable to property tests: generate random sequences of events and verify that invariants hold for the resulting state. The amount of money owed to the merchant should equal the sum of captures minus the sum of refunds, for any sequence of events. The transaction state should be one of a known set of values. The event log should never contain contradictory events without an accompanying conflict marker. These are things you can express as properties and verify across thousands of randomly generated histories, and they catch bugs that example-based tests would never find.
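
A minimal sketch of this idea, using only the standard library's `random` module (a dedicated property-testing library such as Hypothesis adds generation strategies and shrinking on failure). The toy state machine and event vocabulary are assumptions for illustration:

```python
# Property test sketch: feed random event sequences into a toy transaction
# state machine and check that the invariants from the text hold for every
# resulting state. The state machine itself is deliberately simplified.

import random

def apply_events(events):
    """Fold a sequence of (kind, amount) events into a state dict."""
    state = {"captured": 0, "refunded": 0, "status": "authorized"}
    for kind, amount in events:
        if kind == "capture" and state["status"] in ("authorized", "captured"):
            state["captured"] += amount
            state["status"] = "captured"
        elif kind == "refund" and state["status"] == "captured":
            # never refund more than has been captured
            amount = min(amount, state["captured"] - state["refunded"])
            state["refunded"] += amount
        elif kind == "void" and state["status"] == "authorized":
            state["status"] = "voided"
    return state

def owed_to_merchant(state):
    return state["captured"] - state["refunded"]

VALID_STATES = {"authorized", "captured", "voided"}

rng = random.Random(42)  # fixed seed so any failure is reproducible
for _ in range(5000):
    events = [(rng.choice(["capture", "refund", "void"]), rng.randrange(1, 100))
              for _ in range(rng.randrange(0, 10))]
    state = apply_events(events)
    # Invariants: the state is always a known value, and the amount owed
    # to the merchant (captures minus refunds) is never negative.
    assert state["status"] in VALID_STATES
    assert owed_to_merchant(state) >= 0
```

Five thousand random histories is cheap to run on every commit, and a single failing seed gives you a reproducible counterexample to turn into a permanent regression test.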

Layer 2: Sandbox integration tests

The processor's sandbox is the lowest-level environment where you can verify that your integration actually works against the real API. It is also misleading in ways that take time to understand.

Sandboxes are designed to demonstrate that the API contract works as documented. They use deterministic test cards with predictable responses. They process requests instantly, return clean responses, and produce settlement files in formats matching the documentation. None of this resembles production behavior, but it does verify that you can talk to the processor correctly.

Useful sandbox tests verify:

  • Request format and signing logic against the live API
  • Response parsing for the documented response shapes
  • Authentication and credential handling
  • Standard happy-path flows: authorize, capture, void, refund
  • Documented error responses for the test cards that produce them
  • Webhook delivery and signature validation
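
The last item can be sketched in a few lines. The HMAC-SHA256-over-body scheme shown is common but processor-specific, and the header format and secret name here are assumptions; check your processor's documentation for the exact construction:

```python
# Hedged sketch of webhook signature validation, the kind of logic a sandbox
# test should exercise against real captured deliveries.

import hashlib
import hmac

def verify_webhook(body: bytes, signature_header: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, signature_header)

# In a sandbox test you would use a real recorded delivery; this demo
# constructs one so the check is self-contained.
body = b'{"event": "charge.captured", "txn_ref": "txn_1"}'
secret = "whsec_test"
sig = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
assert verify_webhook(body, sig, secret)
assert not verify_webhook(body + b" ", sig, secret)  # any body change fails
```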

Sandbox tests should be part of your continuous integration, but they should run on a schedule rather than on every commit. Processors rate-limit sandbox traffic, and tests that hammer the sandbox on every push will get throttled or banned. Run them nightly, on a representative subset of changes, and against every release candidate.

What sandbox tests cannot verify is anything that depends on production behavior. They cannot verify timeout handling, because sandbox timeouts don't match production timeouts. They cannot verify settlement timing, because settlement runs on accelerated schedules in sandboxes. They cannot verify the long tail of error responses, because sandboxes only return errors you explicitly trigger.

Layer 3: Replay and property tests

This is the layer most teams skip, and it's the most valuable.

Replay testing is the practice of capturing real production traffic — requests, responses, webhook events, settlement files — and using it as input for tests. You build a harness that takes a recorded event stream and runs it through your current code, then verifies that the output matches the expected state. When you change the code, you re-run the replay and check that the output is still correct. When a new bug shows up in production, you capture the event sequence that produced it, add it to the replay corpus, and verify that the fix handles it.

The corpus grows over time. After a few years of operation, you have thousands of recorded transaction lifecycles, including every weird edge case anybody has encountered. New code has to pass them all. This is the strongest defense against regressions in payment logic, because the test cases are real transactions, not synthesized ones.

Replay testing requires three things:

Capture infrastructure. You need to be able to record and store the inputs your system receives — every API request, every webhook, every settlement file — in a form that can be replayed deterministically. This is a significant engineering investment, but it pays for itself many times over.

Deterministic replay. The replay harness needs to feed events to the system in a controlled way, intercepting outbound calls and supplying recorded responses instead of making real ones. This means your code has to be structured so that external calls can be intercepted, which is good practice anyway.

Output assertions. The expected output of a replay isn't just "no errors." It's a precise specification of what state the transaction should be in, what events should have been emitted, what records should have been written. You build these by running the replay once, manually verifying the output, and then locking it in as the expected result.
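
The three pieces fit together roughly like this. Everything here — the `Recording` structure, the stand-in client, the event-handler signature — is an illustrative assumption about how such a harness could be shaped, not a prescribed design:

```python
# Minimal replay harness sketch: feed recorded inbound events to the system
# while intercepting outbound calls and serving recorded responses, then
# assert the final state matches the locked-in expectation.

from dataclasses import dataclass

@dataclass
class Recording:
    inbound: list            # events to feed into the system, in order
    outbound: dict           # request key -> recorded processor response
    expected_state: dict     # expected final state, verified once then locked in

class ReplayProcessorClient:
    """Stands in for the real HTTP client during replay; no network calls."""
    def __init__(self, outbound):
        self.outbound = outbound
    def send(self, request_key):
        return self.outbound[request_key]

def run_replay(recording, handle_event):
    client = ReplayProcessorClient(recording.outbound)
    state = {}
    for event in recording.inbound:
        handle_event(state, event, client)
    assert state == recording.expected_state, (state, recording.expected_state)
    return state

# Demo: a one-event lifecycle replayed deterministically.
def handle_event(state, event, client):
    if event["type"] == "capture_requested":
        resp = client.send(("capture", event["txn"]))
        state[event["txn"]] = resp["status"]

rec = Recording(
    inbound=[{"type": "capture_requested", "txn": "txn_1"}],
    outbound={("capture", "txn_1"): {"status": "captured"}},
    expected_state={"txn_1": "captured"},
)
run_replay(rec, handle_event)
```

Because the client is injected rather than constructed inside the handler, the same code path runs in production and in replay; the only difference is where responses come from.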

Property testing also lives at this layer when applied to event sequences. Generate random but plausible sequences of payment events — authorizations followed by captures, refunds, voids, disputes, late settlements — and verify that the system's invariants hold. This catches the class of bug where individual operations work correctly but their interaction produces an inconsistent state.


Layer 4: Production verification

The fourth layer is the only one that can catch the class of bugs that exist only in production: bugs caused by real network behavior, real processor quirks, real data, and real timing. These bugs are not detectable by any test you can run before deployment. They require continuous verification of the running system.

Production verification has three components:

Reconciliation as a test. The reconciliation pipeline that compares your records to the processor's settlement files is, in effect, a continuous test of whether your system is recording transactions correctly. Drift between your records and the processor's records is a test failure. Treat reconciliation alerts the same way you'd treat a CI test failure: investigate immediately, find the root cause, fix it, and add a check that prevents recurrence.
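
The core of such a pipeline is a diff, which can be sketched as follows. The field names and the flat record shape are assumptions; real settlement formats vary widely by processor:

```python
# Reconciliation-as-a-test sketch: diff local transaction records against
# parsed settlement-file rows. Any non-empty result is a "test failure"
# that should be investigated like a CI failure.

def reconcile(local_records, settlement_rows):
    """Return (missing_from_settlement, missing_locally, amount_mismatches)."""
    local = {r["txn_ref"]: r["amount_minor"] for r in local_records}
    settled = {r["txn_ref"]: r["amount_minor"] for r in settlement_rows}
    missing_from_settlement = sorted(local.keys() - settled.keys())
    missing_locally = sorted(settled.keys() - local.keys())
    mismatches = sorted(
        ref for ref in local.keys() & settled.keys()
        if local[ref] != settled[ref]
    )
    return missing_from_settlement, missing_locally, mismatches

local = [{"txn_ref": "t1", "amount_minor": 500},
         {"txn_ref": "t2", "amount_minor": 300}]
settled = [{"txn_ref": "t1", "amount_minor": 500},
           {"txn_ref": "t3", "amount_minor": 200}]
# t2 never settled, t3 has no local record, amounts for t1 agree:
assert reconcile(local, settled) == (["t2"], ["t3"], [])
```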

Invariant monitoring. Define invariants that should hold for the system as a whole — total money in equals total money out, no transaction is in a state without a known transition out of it, no settlement record exists without a corresponding transaction record — and run continuous queries that check them. Alert when an invariant is violated. These are integration tests that run against production data, and they catch bugs that no other layer can.
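
A sketch of what such checks might look like, shown over in-memory rows for self-containedness; in practice these would be scheduled SQL queries against production data, with any non-empty result firing an alert. The state names and record fields are illustrative assumptions:

```python
# Continuous invariant checks against production records. Each violation is
# returned as a (reason, detail) tuple; an empty list means the system is
# healthy with respect to these invariants.

KNOWN_STATES = {"authorized", "captured", "voided", "refunded", "settled"}

def check_invariants(transactions, settlements):
    violations = []
    txn_ids = {t["id"] for t in transactions}
    for t in transactions:
        # every transaction must be in a known state
        if t["state"] not in KNOWN_STATES:
            violations.append(("unknown_state", t["id"]))
    for s in settlements:
        # no settlement record without a corresponding transaction record
        if s["txn_id"] not in txn_ids:
            violations.append(("orphan_settlement", s["txn_id"]))
    # money out must not exceed money in
    total_captured = sum(t["captured_minor"] for t in transactions)
    total_settled = sum(s["amount_minor"] for s in settlements)
    if total_settled > total_captured:
        violations.append(("settled_exceeds_captured",
                           total_settled - total_captured))
    return violations

txns = [{"id": "t1", "state": "settled", "captured_minor": 500}]
setls = [{"txn_id": "t1", "amount_minor": 500},
         {"txn_id": "t2", "amount_minor": 100}]
assert check_invariants(txns, setls) == [
    ("orphan_settlement", "t2"),
    ("settled_exceeds_captured", 100),
]
```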

Synthetic transactions. Run real transactions through production at a controlled rate, using test merchant accounts and test cards. Verify that they produce the expected outcomes end-to-end — including settlement. This is the only way to test the full path from API request to settlement reconciliation against the actual production environment, and it's how you catch breakage in components that aren't exercised by normal traffic.
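
A probe of this kind might be shaped as below. The client interface, step names, and response fields are all assumptions; note also that the settlement leg cannot be asserted synchronously — it surfaces later through reconciliation of the test merchant account:

```python
# Synthetic-transaction probe sketch: drive one real (test-merchant)
# transaction through production and report per-step pass/fail. Refunding
# at the end keeps the test account balanced.

def run_synthetic_probe(client, amount_minor=100):
    auth = client.authorize(amount_minor=amount_minor)
    capture = client.capture(auth["txn_ref"])
    refund = client.refund(auth["txn_ref"], amount_minor)
    return {
        "authorized": auth["status"] == "authorized",
        "captured": capture["status"] == "captured",
        "refunded": refund["status"] == "refunded",
    }

# A stub stands in here so the sketch is runnable; in production this would
# be the real API client configured with a test merchant account.
class StubClient:
    def authorize(self, amount_minor):
        return {"status": "authorized", "txn_ref": "syn_1"}
    def capture(self, txn_ref):
        return {"status": "captured"}
    def refund(self, txn_ref, amount_minor):
        return {"status": "refunded"}

assert all(run_synthetic_probe(StubClient()).values())
```

Run the probe on a schedule and alert on any failed step, the same way you would alert on a failed health check.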

The certification trap

Certification is not testing, even though it looks like it. Certification is a contractual exercise where you prove to the processor that your integration handles their required test cases in the way they expect. The cases are usually narrow and don't cover most production scenarios.

The trap is treating certification as if it were a comprehensive test of your integration. Passing certification tells you that you've handled the cases the processor cares about for compliance purposes. It does not tell you that your integration handles the cases your merchants will actually generate. Many teams ship integrations that passed certification cleanly and broke in production within days, because the certification cases didn't include any of the actual edge cases.

Treat certification as a regulatory hurdle, not as a test. Your real testing happens in layers one through four.

Things that look like tests but aren't

A few practices feel like testing but provide much less safety than they appear to.

Mocking the processor at the HTTP layer. As discussed, this verifies that your code makes the call you wrote it to make. It doesn't verify that the call is correct or that the response is handled correctly. The only reasonable use of HTTP mocks is in pure logic tests where you need to provide a response shape — and in that case, the mock should return responses captured from the real processor, not synthesized ones.

End-to-end tests through a staging environment. Staging environments for payment systems are usually a mix of real components and mocked ones, configured in ways that don't match production. Tests that pass in staging often fail in production for reasons that have nothing to do with the code being tested. Use staging for manual exploration and demonstration, not as a source of automated test signal.

Smoke tests after deployment. A smoke test that runs one happy-path transaction after a deploy will catch some catastrophic failures, but it gives you false confidence about everything else. The bugs that matter rarely manifest on the first transaction of the day.

Code coverage. Coverage metrics measure which lines of code your tests execute, not whether the tests verify the right behaviors. A payment system can have 95% line coverage and zero useful tests. Stop measuring coverage. Measure whether you have tests that would have caught your last three production incidents, and if not, write them.

What I actually do

My current approach to testing a payment system, in priority order:

  1. Pure logic tests for state machines, calculations, and validation, with property-based testing for state machine invariants.
  2. A replay corpus built from production traffic, with new entries added every time a bug is found.
  3. Sandbox integration tests for the basic happy paths and documented error cases, run on a schedule.
  4. Production reconciliation as a continuous test, with alerts on drift.
  5. Invariant monitoring as continuous integration tests against production data.
  6. Synthetic transactions for end-to-end verification of components that wouldn't otherwise be exercised.

I do not write HTTP-mocked unit tests. I do not rely on staging environments for automated testing. I do not measure coverage. I do not treat certification as testing.

The result is a test suite that's slower to build than a normal one, more expensive to maintain, and dramatically more effective at catching the bugs that matter. The first time it catches a bug that would have cost a merchant real money, the investment pays for itself.

The deeper point

Testing a payment system is not about verifying that your code does what you wrote it to do. It's about building confidence that the system will behave correctly in situations you haven't anticipated. Every layer of testing is a different strategy for surfacing the unanticipated. Pure logic tests find the bugs in your model. Sandbox tests find the bugs in your protocol. Replay tests find the bugs in your edge case handling. Production verification finds the bugs that only exist when the real world is involved.

If your test strategy only covers one of these layers, your system will fail in the others, and you won't know until a merchant tells you.

The goal is not test coverage. The goal is sleeping through the night. You only get there by testing in ways that match the actual structure of the failures you're trying to prevent.


This is part of a series on payment systems architecture. See also "Why payment state machines are harder than you think" and "The hardest part of payment systems is reconciliation".