The worst production incidents I've seen in payment systems were all traceable to the database schema. Not to the application code, not to the processor integration, not to the network layer — to the database schema. A schema that lets invalid states be represented is a schema that will eventually contain invalid states, and in a payment system, invalid states are money moving in ways nobody intended.
The reason database design is so consequential in payment systems is that the database is the only component that persists. Everything else is ephemeral — application processes restart, caches invalidate, queues drain, network connections drop. The database is what remains, and whatever is written there becomes the ground truth. If the ground truth is wrong, it's wrong forever, and downstream systems will confidently report the wrong answer to every question they're asked.
I want to walk through the patterns that I've found actually hold up under production load, and to be specific about why the common alternatives fail.
Money is not a floating-point number
I know. Everyone knows this. And yet I still find floating-point columns in payment schemas, usually because someone's ORM picked a default type and nobody checked.
Use integers. Store the amount in the smallest indivisible unit of the currency — cents for USD, yen for JPY (which has no subunit), fils for BHD (which has three decimal places, so 1 BHD = 1000 fils). Store the currency code alongside the amount. Never perform arithmetic on amounts in different currencies without an explicit conversion step.
The schema should look like:
```sql
amount_minor  BIGINT NOT NULL,
currency_code CHAR(3) NOT NULL,
CONSTRAINT valid_currency CHECK (currency_code IN (...)),
CONSTRAINT nonneg_amount  CHECK (amount_minor >= 0)
```
Not DECIMAL(10,2). Not NUMERIC. And definitely not FLOAT. A single division by three in a floating-point codebase and you will eventually owe a customer a penny and not know why. I have debugged this exact bug and it took three days.
The BIGINT matters because INT is 32 bits and will overflow at about $21 million if you store cents. That's smaller than a single day's processing for a mid-size platform. Use 64 bits.
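If the failure mode sounds theoretical, it reproduces in one query. A sketch in PostgreSQL's dialect (which the schema examples here assume):

```sql
-- Binary floats cannot represent most decimal fractions exactly:
SELECT 0.1::float8 + 0.2::float8 = 0.3::float8;   -- false

-- Integer minor units make the rounding decision explicit instead of
-- silent: split $10.00 three ways and the leftover cent is visible.
SELECT 1000 / 3 AS per_share_minor, 1000 % 3 AS remainder_minor;
```

The remainder cent then gets allocated by an explicit rule (largest share, round-robin, whatever the business decides) rather than vanishing into rounding.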
Events, not rows
The most important structural decision is whether transactions are mutable rows or append-only event streams. If I could enforce one architectural choice across every payment system I touch, it would be this: transactions are event streams.
A mutable transaction row looks like this:
```sql
CREATE TABLE transactions (
  id           UUID PRIMARY KEY,
  merchant_id  UUID NOT NULL,
  amount_minor BIGINT NOT NULL,
  status       TEXT NOT NULL,  -- 'authorized', 'captured', etc.
  last_updated TIMESTAMPTZ NOT NULL,
  ...
);
```
Every time something happens to the transaction — capture, refund, settlement, dispute — the row is updated. The status column gets overwritten. Previous values are lost. This is the most common pattern I see, and it's wrong in a way that doesn't become visible until it's too late.
The problem is that it makes the current state the only state. You cannot answer the question "what did this transaction look like last Tuesday?" without a separate audit log, which then has to be kept in sync with the main table. Worse, the update-in-place approach creates race conditions: two events arriving concurrently can overwrite each other, and the final state depends on which thread committed last. In payment systems, this is catastrophic — a capture and a void arriving at the same time can end up with neither having happened, or both.
The append-only pattern:
```sql
CREATE TABLE transaction_events (
  event_id        UUID PRIMARY KEY,
  transaction_id  UUID NOT NULL,
  event_type      TEXT NOT NULL,
  event_payload   JSONB NOT NULL,
  processor_ref   TEXT,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  sequence_number BIGINT NOT NULL
);

CREATE UNIQUE INDEX ON transaction_events (transaction_id, sequence_number);
```
The current state of any transaction is computed by folding its event stream. You never update. You only append. Race conditions between concurrent events are resolved by the UNIQUE constraint on (transaction_id, sequence_number) — exactly one event wins each slot, and the loser's handler sees the failure and can re-compute the state it should have used.
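The append step can be sketched as a single statement (hypothetical bind parameters `$1`–`$3`; real code would retry on a unique-violation error):

```sql
-- Claim the next slot in the stream. A concurrent writer computes the
-- same sequence_number, hits the unique index, and must re-read and retry.
INSERT INTO transaction_events
  (event_id, transaction_id, event_type, event_payload, sequence_number)
SELECT gen_random_uuid(), $1, $2, $3,
       COALESCE(MAX(sequence_number), 0) + 1
FROM transaction_events
WHERE transaction_id = $1;
```

The aggregate subquery returns one row even for a brand-new transaction, so the first event lands at sequence 1 without a special case.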
The derived state can be materialized into a cache table for query convenience, but the events are the source of truth. If the cache is ever wrong, you rebuild it from the events. You never rebuild the events from the cache, because the cache is lossy by definition.
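A fold over the stream is just an aggregate. A sketch, assuming the payload carries an `amount_minor` field (an assumption about payload shape, not something the schema enforces):

```sql
-- Derive per-transaction totals by folding the event stream.
SELECT
  transaction_id,
  SUM(CASE WHEN event_type = 'captured'
           THEN (event_payload->>'amount_minor')::BIGINT ELSE 0 END) AS captured_minor,
  SUM(CASE WHEN event_type = 'refunded'
           THEN (event_payload->>'amount_minor')::BIGINT ELSE 0 END) AS refunded_minor,
  MAX(sequence_number) AS last_sequence
FROM transaction_events
GROUP BY transaction_id;
```

A materialized cache table is just this query's output written down; rebuilding it is re-running the fold.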
The idempotency table needs more structure than people give it
Most idempotency tables I've seen look like this:
```sql
CREATE TABLE idempotency_keys (
  key        TEXT PRIMARY KEY,
  response   JSONB,
  created_at TIMESTAMPTZ NOT NULL
);
```
This is wrong in several ways. First, it has no scope — if two merchants happen to generate the same key (which happens more than you'd expect with poorly-generated keys), they collide. Second, it has no request fingerprint — a retry with a different payload returns the cached response for the original payload. Third, it has no concept of an in-flight operation, so it can't distinguish "we're processing this now" from "this was never seen."
What it should look like:
```sql
CREATE TABLE idempotency_keys (
  merchant_id    UUID NOT NULL,
  operation_type TEXT NOT NULL,
  key            TEXT NOT NULL,
  request_hash   TEXT NOT NULL,
  state          TEXT NOT NULL,  -- 'in_progress', 'complete', 'pending_uncertain'
  response       JSONB,
  created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  completed_at   TIMESTAMPTZ,
  expires_at     TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (merchant_id, operation_type, key),
  CONSTRAINT state_valid CHECK (state IN ('in_progress', 'complete', 'pending_uncertain'))
);
```
The composite primary key is what makes the atomic insert work. An INSERT ... ON CONFLICT DO NOTHING RETURNING * (or equivalent) atomically reserves the slot, and the application can distinguish between "we just claimed this" (row inserted) and "this already existed" (nothing returned, but existing row can be fetched).
The request_hash lets the application detect the case where a client reuses a key for a different request. If the hash doesn't match on retry, the server returns an error instead of the wrong cached response.
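Wired together, the claim-or-replay step is one statement (a sketch; `$1`–`$4` are hypothetical bind parameters, and the 7-day expiry follows the retention advice in this section):

```sql
-- Atomically claim the key, or discover it already exists.
INSERT INTO idempotency_keys
  (merchant_id, operation_type, key, request_hash, state, expires_at)
VALUES ($1, $2, $3, $4, 'in_progress', now() + interval '7 days')
ON CONFLICT (merchant_id, operation_type, key) DO NOTHING
RETURNING *;
-- Row returned: we own the operation; do the work, then record state/response.
-- No row: SELECT the existing row; if its request_hash <> $4, reject the
-- request rather than replay a response that belongs to a different payload.
```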
The expires_at matters because idempotency keys should not be retained forever — they bloat the table and carry no value after the retry window closes. But expiry has to be longer than the longest reasonable retry window for the operation. For authorizations, I've seen this set as short as an hour (wrong — retries can happen days later) and as long as 90 days (wasteful, but fine). 24 hours is the floor; 7 days is safe for most operations.
Double-entry accounting at the schema level
If your payment system is tracking money, it should be using double-entry accounting, and the double-entry structure should be enforced at the database level.
Every event that moves money consists of paired entries: one debit and one credit, to two different accounts, in equal amounts. The merchant's receivable goes up when a transaction is captured; the customer's obligation is recorded simultaneously. A refund reverses both. A chargeback creates a new pair.
```sql
CREATE TABLE ledger_entries (
  entry_id      UUID PRIMARY KEY,
  event_id      UUID NOT NULL,  -- groups paired entries
  account_id    UUID NOT NULL,
  direction     TEXT NOT NULL,  -- 'debit' or 'credit'
  amount_minor  BIGINT NOT NULL,
  currency_code CHAR(3) NOT NULL,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  CONSTRAINT nonneg CHECK (amount_minor >= 0),
  CONSTRAINT valid_direction CHECK (direction IN ('debit', 'credit'))
);

CREATE INDEX ON ledger_entries (event_id);
CREATE INDEX ON ledger_entries (account_id, created_at);
```
The invariant you enforce is that for every event_id, the sum of debits equals the sum of credits. You can enforce this with a deferred constraint, or with a trigger that fires on transaction commit, or with a post-commit validation job that alerts on violations. I prefer the trigger approach because it fails loudly at write time rather than letting bad data in and catching it later.
```sql
CREATE OR REPLACE FUNCTION check_ledger_balance() RETURNS trigger AS $$
DECLARE
  imbalance BIGINT;
BEGIN
  SELECT SUM(CASE WHEN direction = 'debit' THEN amount_minor ELSE -amount_minor END)
  INTO imbalance
  FROM ledger_entries
  WHERE event_id = NEW.event_id;

  IF imbalance <> 0 THEN
    RAISE EXCEPTION 'ledger imbalance for event %: %', NEW.event_id, imbalance;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Registered as a deferred constraint trigger so the check runs at commit,
-- after every entry of the pair has been inserted — a per-statement check
-- would fail the moment the first half of a pair lands.
CREATE CONSTRAINT TRIGGER ledger_balance_check
  AFTER INSERT ON ledger_entries
  DEFERRABLE INITIALLY DEFERRED
  FOR EACH ROW EXECUTE FUNCTION check_ledger_balance();
```
The merchant's balance at any point in time is a single aggregate query over the ledger. The balance cannot be wrong — it is derived. If the derivation disagrees with the processor's settlement records, you have found a reconciliation issue, and it's a real one, not a presentation bug.
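That aggregate looks like this — a sketch, with the sign convention (credits increase a merchant account) stated as an assumption rather than something the schema pins down:

```sql
-- A merchant account's balance, derived on demand from the ledger.
SELECT SUM(CASE WHEN direction = 'credit' THEN amount_minor
                ELSE -amount_minor END) AS balance_minor
FROM ledger_entries
WHERE account_id = $1
  AND currency_code = 'USD';
```

The currency filter matters: summing entries across currencies is exactly the cross-currency arithmetic the amount section forbids.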
Constraints over code
The most valuable database work I've done on payment systems was translating application-level invariants into database-level constraints. Anything you can express in SQL should be expressed in SQL, because application-level checks are defeated by:
- Direct SQL access from admin tools
- Migrations that bypass the ORM
- Race conditions between check and insert
- Old versions of the application during deploys
- New engineers who don't know the rules
A refund cannot exceed the captured amount of its parent transaction. This is usually implemented in application code, where it's checked before insert. In production, I have seen this constraint violated repeatedly — because the check and the insert happen in different transactions, because two refunds were issued in parallel, or because someone ran a "quick fix" query against the database directly.
Express it as a constraint. A plain CHECK constraint can't reference other rows, so in practice this means a trigger:
```sql
CREATE OR REPLACE FUNCTION check_refund_amount() RETURNS trigger AS $$
DECLARE
  captured BIGINT;
  refunded BIGINT;
BEGIN
  -- Amounts live in the event payload; transaction_events has no
  -- amount_minor column of its own.
  SELECT COALESCE(SUM((event_payload->>'amount_minor')::BIGINT), 0) INTO captured
  FROM transaction_events
  WHERE transaction_id = NEW.transaction_id AND event_type = 'captured';

  SELECT COALESCE(SUM((event_payload->>'amount_minor')::BIGINT), 0) INTO refunded
  FROM transaction_events
  WHERE transaction_id = NEW.transaction_id AND event_type = 'refunded';

  IF refunded + (NEW.event_payload->>'amount_minor')::BIGINT > captured THEN
    RAISE EXCEPTION 'refund exceeds captured amount for transaction %', NEW.transaction_id;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Fire only on refund events. Concurrent refunds serialize on the
-- (transaction_id, sequence_number) unique index, so two parallel
-- inserts cannot both slip past this check unseen.
CREATE TRIGGER refund_amount_check
  BEFORE INSERT ON transaction_events
  FOR EACH ROW WHEN (NEW.event_type = 'refunded')
  EXECUTE FUNCTION check_refund_amount();
```
Triggers have a reputation for being slow and opaque. In payment systems, that's an acceptable tradeoff. The alternative — invariant violations that go undetected until a merchant complains — is worse than a few milliseconds of write latency.
Partitioning strategy matters earlier than you think
The default advice on partitioning is "don't optimize until you have to." In payment systems, you have to earlier than in most domains. Transaction volume grows monotonically and is retained for years for compliance. A single unpartitioned transactions table becomes unmanageable faster than most teams expect — around the time you hit 100 million rows, depending on your database and hardware.
The partitioning key needs to satisfy two constraints: almost every query includes it, and it grows over time. For payment systems, the answer is almost always date — usually month or quarter.
```sql
CREATE TABLE transaction_events (
  event_id       UUID,
  transaction_id UUID NOT NULL,
  event_type     TEXT NOT NULL,
  ...
  created_at     TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

CREATE TABLE transaction_events_2025_01
  PARTITION OF transaction_events
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
```
Date-based partitioning plays well with the access pattern: active data is in the current and previous partition, and older partitions can be moved to cheaper storage or detached entirely. When a merchant asks for their historical data, the query is naturally bounded by the date range.
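Retiring old data then becomes a metadata operation rather than a mass DELETE. A sketch, using a hypothetical partition name for an expired month:

```sql
-- Detach a closed month from the live table. The data survives as a
-- standalone table that can be dumped to cold storage and later dropped,
-- without touching the hot partitions.
ALTER TABLE transaction_events DETACH PARTITION transaction_events_2022_01;
```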
Merchant-based partitioning is tempting but usually wrong. Merchant sizes vary by orders of magnitude, and partitioning by merchant produces wildly uneven partitions. One Fortune 500 merchant's partition will be 1000x the size of a small merchant's. You get no performance benefit from the partitioning and you complicate queries.
Soft deletes are a trap
Never soft-delete payment data. If you need to mark something as no longer active, use a state transition, not a deleted_at column. The reason is that soft-delete columns are an invitation for every query to forget to filter on them, and in a payment system, querying deleted data is how you double-count money or refund transactions twice.
If compliance requires that records be kept, keep them. If they need to be marked superseded, add a proper state to the state machine. If they need to be obscured for privacy, do a real deletion and keep an anonymized reference. Soft deletes are the worst of all worlds: the data is there, but it's filtered out inconsistently, and any single missed filter is a bug.
Reconciliation as a table, not a script
Reconciliation between your records and the processor's records is something most teams implement as a script that runs nightly. The script reads both sides, compares them, and writes out a report. If the script is wrong, the report is wrong, and the error is not caught until someone notices.
Better: make reconciliation a table. Every settlement line from the processor becomes a row. Every matched internal record is a foreign key reference. Unmatched rows are visible as NULL foreign keys. Discrepancies are expressed as rows that fail a CHECK constraint on amount equality.
```sql
CREATE TABLE settlement_records (
  settlement_id       UUID PRIMARY KEY,
  processor_batch_id  TEXT NOT NULL,
  processor_ref       TEXT NOT NULL,
  amount_minor        BIGINT NOT NULL,
  currency_code       CHAR(3) NOT NULL,
  settled_at          DATE NOT NULL,
  matched_transaction UUID REFERENCES transactions,
  match_state         TEXT NOT NULL,  -- 'matched', 'unmatched', 'amount_mismatch', 'investigating'
  ...
);
```
A query against this table tells you, at any moment, exactly how much of the settlement activity has been reconciled and how much hasn't. Unmatched rows are work items for operations. The ratio of matched to unmatched is an SLI. Alerts can be driven by the table, not by the nightly script's exit code.
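The SLI itself is one aggregate — a sketch of the kind of query an alert can sit on:

```sql
-- Reconciliation coverage per settlement day: anything other than a
-- matched count equal to the total is open operations work.
SELECT settled_at,
       COUNT(*) FILTER (WHERE match_state = 'matched') AS matched,
       COUNT(*)                                        AS total
FROM settlement_records
GROUP BY settled_at
ORDER BY settled_at;
```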
What breaks and how to know
A payment database that's working correctly has the following properties, and each one can be checked with a query:
- Every transaction has a complete, gapless `sequence_number` sequence in its events.
- The sum of debits equals the sum of credits for every `event_id` in the ledger.
- No refund exceeds the captured amount of its parent transaction.
- Every settlement record is matched to an internal transaction, or explicitly marked under investigation.
- No two idempotency keys with different request hashes share a composite primary key.
- Every merchant's balance equals the sum of their ledger entries.
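The first check, for instance, is a single aggregate — a sketch, assuming every transaction's sequence starts at 1 (the unique index already rules out duplicates):

```sql
-- Transactions whose highest sequence_number does not equal their event
-- count have a gap somewhere in the stream.
SELECT transaction_id
FROM transaction_events
GROUP BY transaction_id
HAVING MAX(sequence_number) <> COUNT(*);
```

An empty result set is the healthy state; any row is an incident.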
Run these checks continuously. Alert on any failure. Most of them will never fire — but when one does, it's a signal that something has gone wrong at a level deep enough to corrupt the financial record, and that's a situation where fast detection is the difference between a small bug and a corporate crisis.
The principle underneath all of this
The database schema of a payment system is the only part that's really yours. You don't control the processor, you don't control the network, you don't control the card networks. But you control your own database, and every correctness guarantee you want to provide to your merchants has to be rooted in that database. If the database lets invalid states be represented, invalid states will eventually be represented. If the database enforces invariants, the rest of the system becomes much easier to reason about.
Write the schema like it's going to be audited by a regulator and debugged at 3 AM by an engineer who has never seen it. Because eventually both will happen.
This is part of a series on payment systems architecture. See also why payment state machines are harder than you think and the hardest part of payment systems is reconciliation.