This is the last article in the series. Twenty-seven articles ago, I started writing about payment infrastructure for POS systems because I was tired of watching different teams make the same mistakes. Some of those mistakes were mine, made years earlier and still echoing through the systems I'd worked on. Others were being made in real time by smart engineers who had every reason to know better but somehow didn't.
The series has covered a lot: the structural problem of processor fragmentation, the deceptive complexity of state machines, the discipline of reconciliation, the realities of compliance, the operational shape of multi-tenancy, the politics of migrations. Each article was its own argument. This one is the synthesis: if I were starting over with what I know now, what would I do differently?
The list is shorter than I expected. Most of the architectural decisions I'd make today are ones I'd argue for at the time too, even if I couldn't always win the argument. The bigger shift is in what I'd take seriously earlier.
I'd treat the database schema as the most important code in the system
This one I'd shout from the rooftops if I could. The schema is the only part of the system that survives every refactor, every team change, every framework migration. Everything else is replaceable. The schema is forever.
I've spent too much of my career working around schemas that were designed casually. Status columns that mean different things at different lifecycle stages. Floating-point money. Soft deletes that turn every query into a maybe. Foreign keys that don't actually constrain because the constraints were turned off "for performance" and never turned back on.
If I could give one piece of architectural advice to a payment platform starting out, it would be: the schema gets four times the design attention you'd give the rest of the codebase. It's reviewed by your most senior engineer. It's enforced with constraints, not just application logic. It's documented exhaustively. Changes go through a formal process. It's the thing that doesn't get fixed later, because by the time you realize it needs fixing, fixing it requires migrating millions of rows.
This isn't unique to payments, but it's existential in payments. A schema that allows invalid states will eventually contain invalid states, and invalid states in a payment system are real money in the wrong place.
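To make that concrete, here's a minimal sketch of schema-level enforcement, using SQLite purely for illustration. The tables, columns, and status values are hypothetical, not a production design; the point is that the database, not the application, is what rejects the invalid state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # a constraint that's switched off might as well not exist

conn.executescript("""
CREATE TABLE merchants (
    merchant_id TEXT PRIMARY KEY
);

CREATE TABLE transactions (
    transaction_id TEXT PRIMARY KEY,
    merchant_id    TEXT NOT NULL REFERENCES merchants(merchant_id),
    amount_minor   INTEGER NOT NULL CHECK (amount_minor > 0),  -- minor units, never floats
    currency       TEXT NOT NULL CHECK (length(currency) = 3),
    status         TEXT NOT NULL CHECK (
        status IN ('authorized', 'captured', 'settled', 'refunded', 'voided')
    ),
    created_at     TEXT NOT NULL                               -- ISO-8601, UTC
);
""")

# Unknown merchant, negative amount, made-up status: the schema rejects all of it.
try:
    conn.execute(
        "INSERT INTO transactions VALUES ('t1', 'm1', -500, 'USD', 'paid', '2024-01-01T00:00:00Z')"
    )
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```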
I'd build the orchestration layer before the second processor
The mistake I see repeatedly: integrate one processor (clean, focused, fast). Integrate the second processor by extending the first integration (becomes a fork). Integrate the third by trying to retrofit an abstraction (mostly fails). End up with a tangle of processor-specific code paths that nobody can safely modify.
The right answer is to build the orchestration layer when you start integrating the second processor — earlier than the volume justifies, but before the cost of retrofitting becomes prohibitive. Yes, it's premature for one processor. Yes, it adds complexity. The complexity is much smaller than the complexity of three integrations without an abstraction.
I'd also resist the urge to "just make the second one work like the first." The semantic differences between processors are real, and an orchestration layer that pretends they aren't is a leaky abstraction that will break in production at the worst times. Build the orchestration layer to mediate semantic differences, not just protocol differences.
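Here's a sketch of what mediating semantics rather than protocol can look like. The adapter interface, the "Acme" processor, and its decline codes are all hypothetical:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Protocol


class DeclineReason(Enum):
    """The platform's vocabulary, independent of any processor's raw codes."""
    INSUFFICIENT_FUNDS = "insufficient_funds"
    DO_NOT_RETRY = "do_not_retry"
    RETRYABLE = "retryable"


@dataclass
class AuthResult:
    approved: bool
    processor_ref: str
    decline_reason: Optional[DeclineReason] = None


class ProcessorAdapter(Protocol):
    """Each adapter owns the translation from one processor's semantics
    into the platform's vocabulary; callers never see raw codes."""

    def authorize(self, amount_minor: int, currency: str, token: str) -> AuthResult:
        ...


class AcmeAdapter:
    # Hypothetical mapping: this processor's '51' means insufficient funds,
    # and its '05' is a hard decline that must never be retried.
    _DECLINES = {"51": DeclineReason.INSUFFICIENT_FUNDS,
                 "05": DeclineReason.DO_NOT_RETRY}

    def authorize(self, amount_minor: int, currency: str, token: str) -> AuthResult:
        raw = self._call_api(amount_minor, currency, token)
        if raw["result"] == "approved":
            return AuthResult(approved=True, processor_ref=raw["id"])
        reason = self._DECLINES.get(raw["code"], DeclineReason.RETRYABLE)
        return AuthResult(approved=False, processor_ref=raw["id"], decline_reason=reason)

    def _call_api(self, amount_minor, currency, token):
        # Stand-in for the real network call.
        return {"result": "declined", "code": "05", "id": "acme-123"}
```

The mapping table is the point: the decision about what a decline means gets made once, inside the adapter, instead of leaking into every caller.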
I'd treat events as the source of truth from day one
Mutable rows are tempting because they're easy. The transaction has a status column; you update it as things happen; the current status is a single fast query. It works.
It works until it doesn't. Until two events arrive concurrently and one overwrites the other. Until you need to reconstruct what the system thought yesterday. Until reconciliation finds a discrepancy and you have no audit trail of how the discrepancy emerged.
The event-sourced model is harder to set up but far easier to operate. The events are immutable. Order is preserved. State is derived. Audit is automatic. Replay is possible.
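In miniature, the model looks something like this. The event kinds are hypothetical, and a real event store adds persistence, versioning, and idempotency on top:

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # events are immutable facts, never updated in place
class Event:
    transaction_id: str
    kind: str          # e.g. 'authorized', 'captured', 'refunded'
    amount_minor: int
    occurred_at: str   # ISO-8601, UTC


def captured_balance(events: list[Event], transaction_id: str) -> int:
    """State is derived by folding over the log. Replaying the same events
    always reproduces the same answer, which is what makes audit and
    incident reconstruction possible."""
    balance = 0
    for e in sorted(events, key=lambda e: e.occurred_at):
        if e.transaction_id != transaction_id:
            continue
        if e.kind == "captured":
            balance += e.amount_minor
        elif e.kind == "refunded":
            balance -= e.amount_minor
    return balance


log = [
    Event("t1", "authorized", 5000, "2024-01-01T10:00:00Z"),
    Event("t1", "captured", 5000, "2024-01-01T10:05:00Z"),
    Event("t1", "refunded", 2000, "2024-01-02T09:00:00Z"),
]
assert captured_balance(log, "t1") == 3000  # derived, auditable, replayable
```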
The cost is real: more storage, more write volume, projection rebuilds when schema changes. The benefit is also real: when something goes wrong (and it will), you have ground truth to investigate. The teams I've worked on that stored events as truth recovered from production incidents in hours. The teams that stored mutable rows recovered from the same incidents in days, sometimes weeks.
For payment systems specifically, this isn't optional in my book anymore. The cost of mutable state is too high.
I'd resist the appeal of microservices for the wrong reasons
The default architecture pattern in 2020s engineering is "split into microservices." For payment systems, this is often the wrong move.
Payment processing requires consistency. A transaction's authorization, capture, and settlement need to be reasoned about together, with strong invariants. Splitting the orchestration layer into microservices fragments the consistency boundaries. Now you have to reason about distributed transactions, eventual consistency, partial failures across services. You've turned a hard problem into a much harder problem.
The teams I've worked with that went hard on microservices for payments often regret it later. Not because microservices are bad, but because the reasons they chose them (team autonomy, deployability, scaling) didn't apply with sufficient force to justify the consistency complexity they introduced.
A modular monolith for the payment core, with explicit boundaries and clear interfaces, is usually the right answer for years longer than people expect. Split out specific services (notification, reporting, third-party integration) when they have genuinely different scaling or deployment requirements. Don't split the core for ideological reasons.
I'd invest more in observability earlier
The observability mistake I made repeatedly: build the system, then add observability when something breaks. The right ordering is the opposite. The system isn't complete until you can see it operating.
For payment systems, observability isn't just dashboards. It's invariant monitoring (continuous queries that verify the system is in a valid state). It's drift detection (comparing internal state against processor records). It's synthetic transactions (real transactions exercising real paths to verify the system is alive). It's structured logs that link transaction IDs across components. It's metrics that distinguish operational health from business correctness.
I've watched teams ship a payment platform without these and then spend the next year retrofitting them, learning along the way which metrics they actually needed. The cost of the retrofitting was higher than the cost of building observability into the system from the start.
Most observability tooling is generic; payment-specific observability requires domain knowledge. Build it yourself, custom to your platform's invariants. The off-the-shelf APM dashboards won't tell you that money is moving incorrectly.
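The shape of invariant monitoring is simple even though the invariants themselves are not. A sketch, with a hypothetical table and two hypothetical invariants:

```python
import sqlite3

# Each invariant is a query that should return no rows;
# any row it returns is money moving incorrectly.
INVARIANT_QUERIES = {
    "capture_exceeds_auth": """
        SELECT transaction_id FROM transactions
        WHERE captured_minor > authorized_minor
    """,
    "settled_without_capture": """
        SELECT transaction_id FROM transactions
        WHERE status = 'settled' AND captured_minor = 0
    """,
}


def check_invariants(conn: sqlite3.Connection) -> list[str]:
    """Returns the names of violated invariants. In production this runs
    on a schedule and pages someone on any non-empty result."""
    return [
        name
        for name, query in INVARIANT_QUERIES.items()
        if conn.execute(query).fetchone() is not None
    ]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE transactions (
        transaction_id TEXT, status TEXT,
        authorized_minor INTEGER, captured_minor INTEGER)""")
    conn.execute("INSERT INTO transactions VALUES ('t1', 'settled', 5000, 0)")
    print(check_invariants(conn))  # ['settled_without_capture']
```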
I'd accept that processor relationships are partnerships, not vendor relationships
Processors are not vendors in the way AWS is a vendor. You don't buy a service and consume it. You enter a multi-year partnership with shared liability, intertwined operational dependencies, and contractual obligations that affect each other's businesses.
Treating the processor as a vendor — pick one based on price, integrate, ignore otherwise — leads to bad outcomes. The processor's behavior changes don't get monitored. The processor's incidents are surprising. The processor's team isn't reachable when you need them. By the time you realize you need a deeper relationship, you've been burned.
The platforms with mature processor relationships have:
- Named contacts at the processor at multiple levels (technical, operational, executive).
- Regular meetings (monthly or quarterly) covering performance, roadmap, issues.
- Shared incident response procedures with defined escalation paths.
- Visibility into the processor's roadmap and changes affecting the platform.
- Contractual SLAs that are actually enforced.
Building this takes time and political effort. It pays back when something goes wrong. The platforms without these relationships are the ones that find out about a processor's planned API change from a customer support email.
I'd hire payments expertise earlier
The biggest mistake I made was thinking payments was a technical problem that engineers could figure out from the documentation. It is partly that, but it's also a domain with deep institutional knowledge that is not written down anywhere.
The shape of card networks. The political economy of interchange. The actual behavior of issuers vs. what their published APIs claim. The rituals of dispute response. The unwritten norms of underwriting. The hidden semantics of decline codes. This knowledge lives in the heads of people who have spent years in the industry, and you cannot derive it from first principles.
The teams I've watched succeed had at least one person with deep payments background — usually someone who had worked at a processor or a payments-focused company. The teams that struggled were the ones where everyone was new to payments, learning together, missing the things that an experienced person would have caught immediately.
Hire the experienced person early. Pay above-market if you have to. The ROI is enormous because they prevent expensive mistakes you'd otherwise make.
I'd take onboarding as seriously as the API
I covered this in its own article, but it bears repeating in the synthesis. Onboarding is the gating step for every customer relationship. A great API with bad onboarding loses customers. A mediocre API with great onboarding wins them.
Most engineering-led platforms underinvest in onboarding because it's not technically interesting. It's mostly compliance, document collection, KYC, manual review, operational orchestration. Building a great onboarding flow is also building infrastructure: identity verification integrations, underwriting tooling, processor onboarding APIs, status tracking, communication automation. It's a real product, not a form.
If I could give startup payment platforms one operational priority, it would be: build a great onboarding experience before you build great analytics. The analytics improve customer satisfaction marginally; the onboarding determines whether you have customers at all.
I'd be honest about the cost of certifying processors
Every new processor integration takes six to eight months of calendar time, even when the engineering itself is fast. The certification process is paced by the processor, not by you. Plan accordingly.
I've watched teams promise BD that "the second processor will be live in two months" and then explain a quarter later why it wasn't. The two-month estimate was the engineering estimate. The actual timeline included certification, sandbox testing, production rollout, and the inevitable edge cases discovered after launch.
Communicate this internally, repeatedly. The platforms that get bitten are the ones where engineering says "two months" and BD plans the marketing campaign around that timeline. When the campaign date arrives and the integration isn't ready, both teams blame each other.
The honest framing: integration is engineering's two months, certification is the processor's two-to-four months, hardening is operations' two months. Total: six to eight months from sign to live. Anything faster is luck.
I'd build for migration from day one
I've discussed this elsewhere too. Worth restating. Every payment platform eventually changes processors. The teams that built for migration find this routine. The teams that didn't find it traumatic.
The architectural investments that make migration possible are also the architectural investments that make daily operation cleaner: a clean orchestration layer, processor-agnostic tokenization, unified reconciliation, configurable routing. These pay back every day, not just during migrations.
The temptation to "just hardcode the first processor" is real and shortsighted. You don't save much by hardcoding; you commit to a long-term cost of migration that you'd rather not pay.
I'd take time more seriously
Payments is a domain where time matters in ways it doesn't in most other software.
Settlement timing. Authorization expiration windows. Dispute response deadlines. Tokenization expiration. Hold release schedules. Batch processing windows. Statement cycles. Reporting periods. Tax reporting deadlines.
Each of these is a clock running in the background of every transaction. Miss one and there are real consequences: refunds become charges, captures fail, disputes are auto-lost, taxes are wrong.
I've seen platforms treat these clocks casually, assuming the underlying systems handle them. They don't always. Or they handle them in subtly wrong ways that the platform inherits.
The systems I'd build today have explicit time tracking: every transaction record includes the relevant temporal constraints, every operation checks those constraints, alerts fire when transactions approach deadlines. Time is a first-class concern, not an afterthought.
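A sketch of what first-class looks like in code. The field names and warning windows are illustrative, not any network's actual rules:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class TransactionClocks:
    """Temporal constraints carried on the transaction record itself,
    as timezone-aware UTC datetimes."""
    transaction_id: str
    auth_expires_at: datetime                      # capture after this will fail
    dispute_respond_by: Optional[datetime] = None  # miss this, lose the dispute


def approaching_deadlines(
    clocks: list[TransactionClocks],
    warn_window: timedelta = timedelta(hours=24),
) -> list[tuple[str, str]]:
    """The sweep that feeds alerting: flag anything inside the warning
    window before its clock runs out, instead of discovering it after."""
    now = datetime.now(timezone.utc)
    flagged = []
    for c in clocks:
        if now >= c.auth_expires_at - warn_window:
            flagged.append((c.transaction_id, "auth_expiring"))
        if c.dispute_respond_by and now >= c.dispute_respond_by - warn_window:
            flagged.append((c.transaction_id, "dispute_deadline"))
    return flagged
```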
What I wouldn't change
A few things I'd keep doing the same way:
The discipline of testing. Every team I've worked with that tested rigorously had fewer production incidents. Not zero — payments has surprises that no test catches — but fewer. The investment in real testing (replay corpora, synthetic transactions, invariant monitoring, property tests of state machines; there's a sketch of that last one after this list) pays back many times over.
Treating reconciliation as a first-class system. Most teams treat reconciliation as a script that runs once a day. The teams that treat it as a continuous, monitored, alerting system catch problems earlier and have better operational confidence. I'd keep doing this.
Strict tokenization boundaries. Routing all card data through a narrow tokenization boundary, on day one, with the assumption that PCI scope is the most expensive thing the company can have. The teams that do this never have a "PCI scope expansion" project. The teams that don't run one every couple of years.
Honest communication with stakeholders. "This will take longer than the optimistic estimate." "This integration has more edge cases than we initially planned." "We need more operational headcount as we add merchants." Saying these things early and repeatedly is unpopular but works. Hiding them is satisfying short-term and disastrous long-term.
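On the property-testing point specifically, here is a minimal sketch using the hypothesis library. The state machine and its transition table are illustrative:

```python
from hypothesis import given, strategies as st

# Hypothetical transition table: the statuses each event may act on.
TRANSITIONS = {
    "authorize": {"new"},
    "capture": {"authorized"},
    "settle": {"captured"},
    "refund": {"captured", "settled"},
    "void": {"authorized"},
}
NEXT_STATE = {"authorize": "authorized", "capture": "captured",
              "settle": "settled", "refund": "refunded", "void": "voided"}


def apply_event(status: str, event: str) -> str:
    # Illegal transitions are rejected (state unchanged), not applied.
    return NEXT_STATE[event] if status in TRANSITIONS[event] else status


@given(st.lists(st.sampled_from(sorted(TRANSITIONS)), max_size=20))
def test_refund_requires_prior_capture(events):
    """Property: no sequence of events, in any order, can produce a
    refund that wasn't preceded by a capture."""
    status, history = "new", ["new"]
    for e in events:
        status = apply_event(status, e)
        history.append(status)
    if "refunded" in history:
        assert history.index("captured") < history.index("refunded")


if __name__ == "__main__":
    test_refund_requires_prior_capture()  # hypothesis generates the sequences
```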
The deeper pattern
Looking across all of these, the pattern is the same: payment systems reward conservative, deliberate engineering. They punish optimism, shortcuts, and trend-following.
This is uncomfortable in an industry where engineers are rewarded for moving fast. Payment systems don't move fast. They move carefully. The engineers who do well in this domain are the ones who can sustain the discipline over years, not the ones who ship the most code per sprint.
It's also uncomfortable because payment systems don't generate exciting demos. Reconciliation is not a demo. State machine invariants are not a demo. Strict tokenization boundaries are not a demo. The work that makes payment systems good is mostly invisible to anyone who isn't doing it.
But the work that makes payment systems bad is very visible. It's visible in incident reports, in chargeback rates, in customer churn, in compliance findings. The bad work shows up after the demos are over.
If I were starting a payment platform today, I would commit to this discipline from day one. Not because it's fashionable. Because it's how payment systems get built well, and there's no shortcut.
A final thought
The series has been my attempt to externalize what I've learned, partly because writing it down helps me understand it better, and partly because the next generation of payment engineers shouldn't have to make every mistake themselves to learn the same lessons.
If you've read along, I hope it was useful. If you're working on a payment system right now and you saw your own challenges in these articles, that's because the challenges are universal. The patterns repeat across companies, across teams, across decades.
The work is hard. It's also worthwhile, because every payment that goes through cleanly is a small piece of trust the system delivers on, between a customer and a merchant, between a merchant and their bank, between every party and the financial system as a whole. That trust isn't a marketing concept. It's an engineering output. And engineering it well is what we do.
Thanks for reading.
This is the final article in the series on payment systems architecture. The complete series, in chronological order, is available on the articles page.