A normal software outage has a simple structure: the system was working, now it isn't, fix it and restore service. The steps are well-practiced. Detect, diagnose, remediate, communicate, postmortem. There are frameworks for this that are widely taught and generally correct.
Payment system incidents are different. They have a different shape, a different set of decisions, a different definition of "fixed," and a different set of consequences for acting too fast or too slow. A team that runs a payment incident the same way they'd run a web-service incident will either restore service while leaving bad data behind or hesitate so long that merchants start calling their banks.
I want to walk through what actually happens, step by step, when a payment system breaks, and what the decisions look like in practice.
The three kinds of payment incident
Before anything else, the responder has to diagnose which kind of incident this is, because the response is different for each.
System unavailable. Your API is returning errors. Merchants can't process transactions. No money is moving. The symptoms are obvious, and the priority is to restore availability. This is the most familiar kind of incident.
System degraded. Transactions are being processed but at elevated error rates, slower latency, or reduced authorization success. Some transactions succeed, some fail, and the failure pattern may be specific to a processor, a card brand, or a geography. The priority is to identify what's failing and either fix it or route around it.
System producing wrong results. Transactions appear to be succeeding, but they're being recorded incorrectly, routed to the wrong ledger account, settled at wrong amounts, or otherwise producing financial records that don't match reality. This is the worst case. It's the hardest to detect and the hardest to recover from, and it's usually discovered by reconciliation rather than by monitoring.
The first diagnostic question is: is this type 1, 2, or 3? The triage for type 1 and 2 is operational. The triage for type 3 is forensic. Treating a type 3 incident as if it were type 1 — restoring service quickly without understanding what the system was doing while broken — makes the incident worse by generating more bad data.
The first five minutes
When an alert fires, the first five minutes are about confirming the problem and containing it.
Confirm. Is the alert real? Synthetic transactions are the fastest confirmation — they run end-to-end through production, and their success or failure is unambiguous. If synthetic transactions are failing, you have a real incident. If they're succeeding but other alerts are firing, you may have a metric issue or a narrow problem affecting specific paths.
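A synthetic probe of this kind can be sketched in a few lines. Everything here is an assumption for illustration: `create_charge` and `void_charge` stand in for whatever client calls your own payment API exposes, and the card number is a standard test PAN, not a real one.

```python
def run_synthetic_check(create_charge, void_charge, attempts=3):
    """Run end-to-end synthetic transactions and report pass/fail.

    create_charge / void_charge are hypothetical calls into your own
    payment API; the PAN below is a standard test card number.
    """
    errors = []
    for _ in range(attempts):
        try:
            charge = create_charge(amount_cents=100, card="4111111111111111")
            void_charge(charge["id"])  # clean up so no money actually moves
        except Exception as exc:  # any end-to-end failure counts
            errors.append(str(exc))
    # All attempts failing => real incident. Partial failures => a narrow
    # path problem worth scoping before declaring a full outage.
    return {"attempts": attempts, "failures": len(errors), "errors": errors}
```

The void after each charge matters: a probe that captures real money is itself an incident waiting to happen.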
Determine scope. Is this affecting all merchants or some? All processors or one? All transaction types or specific ones? Scope determines priority and also suggests cause. A problem hitting one processor is almost certainly that processor. A problem hitting one merchant is likely a configuration issue. A problem hitting everyone is likely your infrastructure.
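The scope question is mechanical enough to automate: group recent failures along each dimension and see where they concentrate. A sketch, with illustrative field names rather than a real schema:

```python
from collections import Counter

def failure_concentration(failed_txns,
                          dimensions=("processor", "merchant_id", "txn_type")):
    """Show where recent failures concentrate. A dominant value on one
    dimension suggests the cause: one processor => that processor, one
    merchant => configuration, spread evenly => your own infrastructure.
    Field names are illustrative, not a real schema.
    """
    report = {}
    for dim in dimensions:
        counts = Counter(t.get(dim, "unknown") for t in failed_txns)
        top_value, top_count = counts.most_common(1)[0]
        report[dim] = {
            "top_value": top_value,
            "share": top_count / len(failed_txns),  # 1.0 = all failures here
        }
    return report
```

A `share` near 1.0 on exactly one dimension is the fastest scope answer you can get; a flat distribution everywhere points back at shared infrastructure.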
Decide on containment. This is the payment-specific decision that most generic playbooks miss. Even before you fix the problem, you may need to stop the system from processing more transactions, to prevent the incident from generating more bad data. This is called putting the system into "read-only" or "maintenance" mode.
The decision is: is it worse to stop processing and have merchants frustrated, or to keep processing and produce more transactions that may need to be reversed later? There's no universal answer. For type 1 (unavailable), it's already stopped. For type 2 (degraded), the calculus depends on the severity and nature of the degradation. For type 3 (wrong results), stopping immediately is almost always the right call — every additional transaction is more work to unwind.
Stop the bleeding
For type 3 incidents especially, and often for type 2, the first priority is to prevent the incident from generating more bad data. This usually means one of:
- Fail closed. Configure the payment system to reject incoming transactions rather than process them. Merchants see errors; no new transactions happen. The alternative — letting transactions through while broken — produces more bad records to fix.
- Route around the failure. If the problem is isolated to one processor, move traffic to another. If it's isolated to one code path, disable that path via feature flag. If it's isolated to one merchant's configuration, disable that merchant's processing temporarily.
- Read-only mode. For systems that have it, read-only mode keeps the historical data accessible and the reporting working, but prevents any new writes. Merchants can see their existing data and report on it; they just can't process new transactions until the incident resolves.
The choice depends on what's broken. The decision should be automatic, not political — whoever is the incident commander has the authority to halt processing, no escalation required. Delays waiting for approval are where incidents get worse.
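These containment options can be collapsed into one gate that every incoming transaction passes through. This is a minimal in-memory sketch; in production the state would live in a feature-flag service or config store that the incident commander can flip without a deploy. All names here are illustrative.

```python
from enum import Enum

class ProcessingMode(Enum):
    NORMAL = "normal"
    READ_ONLY = "read_only"      # reads and reporting work; no new writes
    FAIL_CLOSED = "fail_closed"  # reject all new transactions outright

# Illustrative in-memory switches; real ones belong in a flag service.
_mode = ProcessingMode.NORMAL
_disabled_processors = set()

def set_mode(mode):
    global _mode
    _mode = mode

def disable_processor(name):
    _disabled_processors.add(name)

def admit(txn):
    """Gate an incoming transaction through the containment state.
    Returns (allowed, reason)."""
    if _mode is ProcessingMode.FAIL_CLOSED:
        return False, "processing halted for incident"
    if _mode is ProcessingMode.READ_ONLY and txn.get("is_write", True):
        return False, "system is read-only"
    if txn.get("processor") in _disabled_processors:
        return False, "processor disabled; route elsewhere"
    return True, "ok"
```

The point of the single gate is speed under pressure: halting processing or cutting off one processor is one flag flip, not a code change waiting on review.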
Assess the damage
Once containment is in place, the next task is to figure out what damage the incident has caused so far. This is where payment incidents diverge most sharply from general software incidents.
For a web service outage, "damage" is usually just "time the service was unavailable" and maybe lost revenue. For a payment system incident, damage means:
- Transactions that were authorized but not recorded in your system (money held on customer cards, no merchant record of it)
- Transactions that were recorded but not actually processed (merchant thinks they got paid, no actual authorization)
- Transactions recorded at wrong amounts (reconciliation will catch this later, but you want to catch it now)
- Transactions routed to the wrong processor, wrong merchant account, or wrong ledger account
- Refunds or voids that were requested but not processed
- Settlement records that don't match the transactions they claim to cover
- Customer-visible effects: duplicate charges, missing credits, failed operations that look successful
Each category requires different remediation. Some are reversible; some aren't. Some are obvious; some only show up in reconciliation days later. The damage assessment has to happen before you can plan remediation, because the plan depends on what's actually broken.
The tool for damage assessment is your invariant monitoring. Every invariant you've defined — ledger balance, transaction completeness, settlement matching — gets re-run against the data produced during the incident window. The violations that appear are the damage. Invariants that didn't exist before the incident get written now, because the incident has taught you about a failure mode your monitoring missed.
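The re-run itself is simple once the invariants exist as code. A sketch, assuming invariants are functions that take a batch of transactions and return the violating ids; the field names and the example invariant are illustrative, not a real schema:

```python
def rerun_invariants(txns, window_start, window_end, invariants):
    """Re-run every invariant check against transactions created during
    the incident window. The violations that come back are the damage.
    `invariants` maps a name to a function returning violating txn ids.
    """
    in_window = [t for t in txns
                 if window_start <= t["created_at"] <= window_end]
    damage = {}
    for name, check in invariants.items():
        violations = check(in_window)
        if violations:
            damage[name] = violations
    return damage

# Example invariant (transaction completeness): every captured
# transaction must have a matching ledger entry.
def missing_ledger_entries(txns):
    return [t["id"] for t in txns
            if t["status"] == "captured" and not t.get("ledger_entry_id")]
```

The output is the damage assessment: a map from invariant name to the concrete transactions that need remediation, scoped to the incident window rather than all of history.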
Restore service carefully
For types 1 and 2, restoring service is usually straightforward once the root cause is identified. Deploy a fix, roll back a bad change, restart a service, fail over to a replica. Standard operational playbook.
For type 3, restoring service is dangerous. You fixed the code that was producing wrong results. Now you restart processing. The system starts handling new transactions correctly, but the old transactions — the ones that were mishandled during the incident — are still in a broken state in your database. If you process new transactions that reference the broken old ones, you compound the damage.
The safer path for type 3 incidents:
- Fix the code that produced wrong results.
- Before resuming processing, identify every transaction affected by the incident.
- Correct or flag the affected transactions in a controlled process (not through the normal API — through a data-repair pipeline that's designed for this).
- Once the historical data is either correct or explicitly marked as under investigation, resume processing.
This takes longer than just restarting the service, sometimes much longer. The tradeoff is that merchants experience a longer outage, but the system's records come out of the incident in a consistent state. If you cut this corner and resume processing too early, the compound effects can easily take weeks of engineering time to unwind.
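The repair step can be sketched as a pass over the affected transactions: apply a correction where one exists, and visibly quarantine anything that can't be corrected yet, so nothing is left silently wrong. Field names and the shape of `repairs` are assumptions for illustration.

```python
def repair_incident_window(txns, affected_ids, repairs):
    """Data-repair pass before resuming after a type-3 incident.

    affected_ids: ids of transactions mishandled during the incident.
    repairs: maps transaction id to its corrected fields.
    Every affected transaction ends up either corrected (with an audit
    note) or explicitly flagged for investigation -- never left looking
    normal while wrong.
    """
    corrected, flagged = [], []
    for t in txns:
        if t["id"] not in affected_ids:
            continue
        if t["id"] in repairs:
            t.update(repairs[t["id"]])
            t["repair_note"] = "corrected via incident repair pipeline"
            corrected.append(t["id"])
        else:
            t["under_investigation"] = True  # visible flag, not silent bad data
            flagged.append(t["id"])
    return {"corrected": corrected, "flagged": flagged}
```

Processing resumes when this summary shows every affected transaction in one of the two buckets; the flagged ones become the operations backlog rather than latent reconciliation surprises.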
Communicating during payment incidents
Payment incidents are customer-visible in a specific and high-stakes way. A merchant who can't process transactions loses revenue. A customer who has been double-charged is upset and may file a dispute. Support is flooded. The communication strategy has to account for all of this.
The principles:
Communicate early and often. Status pages, automated emails, in-app banners, and direct outreach to high-value merchants. The worst case is not over-communication; it's under-communication, which produces rumors and destroys trust.
Be specific about impact, not cause. "Transactions may be failing for some merchants using Processor X" is useful. "A bug in our routing layer caused an edge case in our idempotency logic to produce duplicate records" is not useful to anyone outside engineering.
Tell merchants what to do. If there's a workaround, state it clearly. "You can continue processing by manually selecting Processor Y for now." If there isn't, say that too. "There is no workaround at this time; we recommend delaying new transactions until we publish an update."
Commit to an update cadence. "Next update in 30 minutes" is infinitely better than radio silence, even if the next update is "we're still working on it." Merchants can plan around updates; they can't plan around silence.
Do not speculate about cause publicly. Internal investigation is ongoing; public statements are for impact and timing. Premature cause attribution creates confusion when the real cause turns out to be something else, and "the real cause" is often not known for hours or days.
Money disposition during outages
This is the part most generic incident response frameworks miss entirely: during and after a payment incident, there is money in ambiguous states, and someone has to decide what to do with it.
Transactions that authorized but didn't capture: the money is held on the customer's card. If you don't capture, the hold eventually releases, usually within a few days. The merchant doesn't get paid. The customer isn't charged. This is often the safe default during an incident — let the holds expire rather than process captures against data you don't fully trust.
Transactions that captured but didn't settle: depending on the processor, these may settle automatically later or may require manual intervention. The processor's records show captures that have no counterpart in your records. Reconciliation will raise these as exceptions; operations has to investigate each one.
Refunds that were requested but not processed: the customer expects their money back and didn't get it. If the incident lasts long enough to generate customer complaints, these have to be processed manually, out-of-band, with an audit trail.
Chargebacks that arrived during the incident: you may have missed the response window. Chargebacks that should have been contested become automatically accepted. The money is gone, and the merchant takes the loss.
Each of these requires specific work during and after the incident. The plan for handling them has to be part of the incident response, not an afterthought. This is the kind of work that distinguishes a real payment operations team from a generic devops team — dealing with money that's in the wrong place is a domain skill.
The postmortem that matters
Most postmortems focus on the technical root cause: what broke, why it broke, how to prevent similar failures. For payment incidents, the postmortem has to cover more:
Financial impact. How much money was affected? How many transactions? How many merchants? How many customers? The numbers have to be precise, because they drive conversations with merchants, with processors, possibly with regulators.
Reconciliation status. Are all affected transactions now in a consistent state? If not, what's left? What's the plan for resolving it and by when?
Merchant impact. Which merchants were most affected? What did they experience? Have they been contacted? Is any restitution required?
Process failures. What process or tooling gap allowed the incident to occur or to go undetected for as long as it did? Usually this is the most valuable section. Technical bugs are one-time; process failures recur.
Prevention. What invariant check, what synthetic transaction, what alert, what code change would prevent this in the future? Commit to implementing it, with a deadline.
The postmortem for a payment incident is usually longer and more detailed than for a typical web-service incident. It often includes financial attachments and regulatory notes. It needs to be readable by non-engineers — CFO, compliance, sometimes auditors. Write it for that audience, not just for engineering.
The runbook approach
Everything I've described works better if it's already documented before the incident. Runbooks for payment incidents have specific structure:
- Triage questions. What to check first, in what order, to identify the type of incident.
- Containment actions. How to halt processing, how to fail over, how to disable specific paths — with the exact commands, not approximations.
- Diagnostic queries. SQL queries to check invariants, inspect specific transactions, identify affected scope.
- Recovery procedures. How to correct affected transactions, how to reverse bad state, how to process stuck refunds.
- Communication templates. Pre-written status page updates for common incident types, with blanks for specifics.
- Escalation paths. Who to contact, when, for what kinds of decisions.
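A runbook diagnostic query is most useful when it's kept alongside a tiny harness that runs it, so it can be exercised before the incident. A sketch using SQLite as a stand-in; the table and column names are an assumed illustrative schema, and the real version would point at a production replica.

```python
import sqlite3

# Runbook invariant query: every captured transaction must have a ledger
# entry. Rows returned are violations to investigate. Schema names are
# illustrative assumptions, not a real production schema.
INVARIANT_SQL = """
SELECT t.id
FROM transactions t
LEFT JOIN ledger_entries l ON l.transaction_id = t.id
WHERE t.status = 'captured' AND l.id IS NULL
ORDER BY t.id
"""

def captured_without_ledger(conn):
    """Run the completeness invariant; returns violating transaction ids."""
    return [row[0] for row in conn.execute(INVARIANT_SQL)]
```

Keeping the query in version control next to the runbook, rather than in someone's shell history, is what makes it usable at three in the morning.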
Runbooks are not documentation; they are operational tools. They get used at three in the morning by someone who hasn't thought about payment systems in a week. They need to be actionable, not educational. If a step requires thinking, the thinking should already be captured in the runbook.
The discipline
Payment incident response is a discipline you build over time, and you build it by treating every small anomaly as practice. A reconciliation drift of ten cents is not worth an all-hands incident, but it is worth an investigation that follows the same structure as an incident. You practice the triage. You exercise the runbooks. You identify the gap that allowed the drift, and you close it.
When the real incident comes — and one will — the team that has practiced handles it calmly. The team that hasn't invents the process on the fly, misses steps, and compounds the damage.
The cost of not practicing is not that the first real incident is harder. It's that the first real incident is the first time anyone finds out the runbooks are wrong, the tooling doesn't work, the communication channels aren't set up, and the escalation path ends at someone who left the company two years ago.
What I've seen work
The payment teams I've worked with that handled incidents best shared a few traits:
- They had a clear incident commander role, rotated weekly, with explicit authority to halt processing.
- They had real runbooks that got updated after every incident.
- They practiced — tabletop exercises, game days, intentional small disruptions.
- They had invariant monitoring that caught problems before merchants did.
- They treated reconciliation drift as an incident signal, not an accounting exercise.
- They were willing to take longer outages to ensure data correctness.
The teams that handled incidents worst were not incompetent — they just hadn't internalized that payment incidents are different. They brought web-service instincts to financial-system problems, optimized for uptime when they should have been optimizing for correctness, and kept processing when they should have stopped.
When your system breaks and money is on the line, speed is not the goal. Accuracy is the goal. The incident response that gets this right is the one that builds enough trust to survive the next one.
This is part of a series on payment systems architecture. See also observability in payment systems and the hardest part of payment systems is reconciliation.