Featured image of post Database Deadlocks, Part 2: Diagnosis, Retries, and Prevention

Database Deadlocks, Part 2: Diagnosis, Retries, and Prevention

Reading InnoDB and PostgreSQL deadlock logs end-to-end, building retry logic that doesn't mask the real bug, hot-row isolation strategies, SERIALIZABLE serialization failures vs. deadlocks, DDL migration windows, and NOWAIT / SKIP LOCKED as prevention primitives. The operational half of the deadlocks series.

TL;DR
Part 1 covered the patterns — lock ordering, gap locks, unique-index conflicts, FK shared locks, index-scan amplification, and parallelism cycles. This post is the operational half: reading the deadlock log to identify which pattern fired, designing retries that fail loudly instead of hiding the real bug, isolating hot rows before they become incidents, the SERIALIZABLE/SSI distinction that trips up every team moving from READ COMMITTED, DDL migration windows and lock_timeout idioms, NOWAIT / SKIP LOCKED as prevention primitives, and innodb_autoinc_lock_mode.

The patterns in Part 1 answer “why did this deadlock happen?” This post answers the next three questions every team ends up asking: “which pattern fired?”, “what should the application do when it does?”, and “how do we stop the ones that hurt most?”. Each has its own tooling, failure modes, and production idioms — and almost all of them are learned after the first incident, not before.

The trap worth naming up front: tuning retry logic is easy, reading a deadlock log is harder, and the second is what tells you whether the first matters. A team that only tunes retries is optimizing the symptom without ever seeing the cause. This post is structured in the opposite order — log-reading first, then retries, then the prevention patterns that remove entire categories from the workload.

Reading the MySQL deadlock log

SHOW ENGINE INNODB STATUS dumps the most recent deadlock in the LATEST DETECTED DEADLOCK section. The catch: only the most recent. On a busy system, deadlocks overwrite each other faster than anyone can log in and copy the output. Before anything else, set innodb_print_all_deadlocks = ON in every production deployment — it writes every deadlock to the error log instead of relying on a single overwriting slot. The volume is negligible, the diagnostic value is high, and there is no downside.
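
The variable is dynamic, so enabling it needs no restart; on MySQL 8.0, SET PERSIST also carries it across restarts:

```sql
-- Takes effect immediately:
SET GLOBAL innodb_print_all_deadlocks = ON;

-- MySQL 8.0+: persist across restarts without editing my.cnf:
SET PERSIST innodb_print_all_deadlocks = ON;
```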

A representative entry looks like this:

*** (1) TRANSACTION:
TRANSACTION 4823941, ACTIVE 3 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 892, OS thread handle 0x7f..., query id 18293 ...
UPDATE orders SET status = 'shipped' WHERE id = 1001

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823941 lock_mode X locks rec but not gap
waiting

*** (2) TRANSACTION:
TRANSACTION 4823942, ACTIVE 2 sec starting index read
UPDATE orders SET status = 'paid' WHERE id = 1002

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823942 lock_mode X locks rec but not gap

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823942 lock_mode X locks rec but not gap
waiting

*** WE ROLL BACK TRANSACTION (2)

The parts that matter:

  • lock_mode and the gap qualifiers. X is exclusive, S is shared. locks rec but not gap is a pure record lock; locks gap before rec is a gap lock; an unadorned X under REPEATABLE READ is usually a next-key lock (record + gap). Matching lock_mode S locks rec but not gap against Part 1’s unique-index section tells you immediately that this is a duplicate-key-on-insert deadlock.
  • index name. index PRIMARY vs. index idx_customer reveals whether the cycle formed on the clustered index or a secondary one. Two transactions approaching the same rows from different indexes is the “secondary index locks on InnoDB” pattern from Part 1 — the fix is usually consolidating access paths.
  • The query text. This is the last statement the transaction executed before the deadlock, not necessarily the one that caused it. A transaction holding locks from three earlier statements can deadlock on the fourth, and the log only shows the fourth. Cross-reference with the application’s structured logs to reconstruct the full transaction.
  • trx id is monotonically increasing and stable for the life of the transaction. Searching the general log or slow-query log for that trx id reconstructs the full statement sequence — but only if general-query logging is on for the window in question, which it usually isn’t.

performance_schema.data_locks and data_lock_waits give a real-time view of current locks and waits. Useful for catching a deadlock-adjacent pathology (long wait chains, hot rows) before the cycle forms:

SELECT
  bl.lock_type, bl.lock_mode, bl.object_name, bl.index_name,
  w.REQUESTING_ENGINE_TRANSACTION_ID AS waiting_trx,
  w.BLOCKING_ENGINE_TRANSACTION_ID AS blocking_trx
FROM performance_schema.data_lock_waits w
JOIN performance_schema.data_locks bl
  ON w.BLOCKING_ENGINE_LOCK_ID = bl.ENGINE_LOCK_ID
WHERE bl.OBJECT_SCHEMA = 'shop';

Reading the PostgreSQL deadlock log

PostgreSQL’s diagnostic story is narrower by design. Deadlocks are logged automatically when the cycle is detected — log_lock_waits = on logs any wait exceeding deadlock_timeout (default 1s), which catches the wait-chain escalation before the detector fires. There’s no equivalent to SHOW ENGINE INNODB STATUS; everything lives in postgresql.log or the extensions you’ve installed.

A representative deadlock entry:

ERROR:  deadlock detected
DETAIL:  Process 14234 waits for ShareLock on transaction 89234;
         blocked by process 14235.
         Process 14235 waits for ShareLock on transaction 89233;
         blocked by process 14234.
         Process 14234: UPDATE accounts SET balance = balance - 100 WHERE id = 2
         Process 14235: UPDATE accounts SET balance = balance + 50 WHERE id = 1
HINT:  See server log for query details.
CONTEXT:  while updating tuple (0,18) in relation "accounts"

The ShareLock on transaction X wording is PostgreSQL-specific — one transaction is waiting to see the commit status of another (a row-lock wait manifests as waiting on the holder’s transaction ID). The tuple identifier (0,18) points to the exact physical row (page 0, tuple 18 in the heap), which is useful for reproducing the scenario but changes as rows are updated (MVCC creates new versions at new (page, tuple) locations).

For real-time inspection, pg_locks joined against pg_stat_activity shows live lock state:

SELECT
  a.pid, a.usename, a.state,
  a.wait_event_type, a.wait_event,
  l.mode, l.locktype, l.relation::regclass,
  pg_blocking_pids(a.pid) AS blocked_by,
  LEFT(a.query, 80) AS query
FROM pg_stat_activity a
LEFT JOIN pg_locks l ON l.pid = a.pid AND NOT l.granted
WHERE a.state != 'idle'
ORDER BY a.xact_start;

pg_blocking_pids(pid) returns the array of PIDs blocking a given transaction. Walking it recursively reconstructs the live wait-for graph — the same data the deadlock detector uses, just before a cycle forms. For hot systems, pg_stat_statements combined with pg_stat_activity snapshots at regular intervals builds a picture of which statements accumulate the most wait time, which is almost always the right first place to look.
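
That recursive walk can be sketched engine-agnostically. Here `get_blockers` stands in for a per-pid `SELECT pg_blocking_pids(%s)` round trip; the function and its return shape are my own framing, not a PostgreSQL API:

```python
from typing import Callable, Iterable

def walk_wait_graph(pids: Iterable[int],
                    get_blockers: Callable[[int], list[int]]) -> dict:
    """Reconstruct the live wait-for graph from a blocker lookup.

    get_blockers(pid) returns the PIDs blocking `pid` -- in PostgreSQL,
    the result of SELECT pg_blocking_pids(pid). Follows the first
    blocker at each hop, which is enough to surface chains and cycles.
    """
    edges: dict[int, list[int]] = {}
    cycles: list[list[int]] = []
    for start in pids:
        path: list[int] = []
        seen: set[int] = set()
        pid = start
        while pid not in seen:
            seen.add(pid)
            path.append(pid)
            blockers = edges.setdefault(pid, get_blockers(pid))
            if not blockers:
                break  # chain ends at a running (unblocked) pid
            pid = blockers[0]
        else:
            # Loop exited because pid repeated: a cycle, reported from
            # its first occurrence in the path.
            cycles.append(path[path.index(pid):])
    return {"edges": edges, "cycles": cycles}
```

Feeding it the pids from the pg_stat_activity query above yields the same graph the deadlock detector inspects, just ahead of the cycle forming.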

Row locks are invisible in pg_locks

PostgreSQL’s row-level locks (the result of SELECT ... FOR UPDATE, FK checks, and plain UPDATE/DELETE) are stored on the tuple itself — in the xmax system column — not in the shared lock table. They don’t show up in pg_locks. The only way to see them is through the pgrowlocks extension, which scans the heap directly:

CREATE EXTENSION pgrowlocks;
SELECT * FROM pgrowlocks('accounts');

This is the single biggest difference between PG and InnoDB lock introspection, and the reason PG operators often feel blind to row-level contention until a cycle actually forms.

PostgreSQL SERIALIZABLE: serialization failures are not deadlocks

Under SERIALIZABLE isolation, PostgreSQL uses Serializable Snapshot Isolation (SSI) — an optimistic mechanism based on SIREAD predicate locks that track read-write dependencies between transactions. SSI cannot deadlock by design: it never blocks on lock acquisition. What it does is abort one transaction with a serialization failure when it detects a dangerous read-write dependency cycle that would violate serializability.

The two failure codes look similar but have fundamentally different semantics:

  • 40001 serialization_failure — SSI detected a dependency cycle and aborted the transaction before it could commit a non-serializable result. The transaction did nothing wrong; the combination of its operations with a concurrent transaction would have produced an inconsistency. Retrying is always safe and usually succeeds (the concurrent transaction will have committed or aborted, so the second attempt doesn’t see the same conflict).
  • 40P01 deadlock_detected — a cycle in the wait-for graph was broken by killing a victim. Retrying may or may not succeed depending on what caused the cycle. If the cycle was deterministic (two code paths with inconsistent ordering), it will keep recurring.

The practical consequence for retry architecture: an application running under SERIALIZABLE must handle 40001 — it’s not a deadlock, it’s the normal failure mode of SSI, and retries are the only recovery path. An application running under READ COMMITTED never sees 40001. An application that handles 40001 identically to 40P01 is correct but coarse. The right granularity is: always retry 40001 (the workload’s own correctness guarantee assumes this); retry 40P01 with caution and escalate on repeat.

import random
import time

import psycopg

def backoff(attempt):
    # Exponential backoff with full jitter, capped at ~400ms.
    time.sleep(random.uniform(0, min(0.1 * 2**attempt, 0.4)))

def retry_on_conflict(fn, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return fn()
        except psycopg.errors.SerializationFailure:
            # 40001: always retry. SSI guarantees make this safe.
            backoff(attempt)
        except psycopg.errors.DeadlockDetected:
            # 40P01: retry with caution; log for root-cause analysis.
            log_deadlock_for_analysis(...)
            backoff(attempt)
    raise TransactionRetryExhausted

Retry architecture that doesn’t hide the bug

Every database driver documentation says “deadlocks happen, retry the transaction.” That’s true. It’s also incomplete. The dangers are subtle enough that a naive retry loop becomes part of the problem:

Retries without backoff make the cycle worse. The condition that caused the deadlock (contention on a hot key set) is still in effect when the retry runs. A tight retry loop turns one deadlock into a thundering herd — all victims retry simultaneously, all hit the same contention, all deadlock again. Use exponential backoff with full jitter, capped at a few hundred milliseconds.

Retries mask lock-ordering bugs. If an application deadlocks 10x/minute but retries successfully, the operator sees no failures — but the underlying transactions are doing up to 20x the work (original + retry). The deadlock rate itself is a metric worth tracking, not just the post-retry error rate. When the rate grows, the fix is diagnosing the pattern, not tuning the retry limit.

Retries aren’t always safe. A transaction that sent an external notification, wrote to a message queue, or called a non-idempotent HTTP endpoint before the deadlock can’t be blindly retried — the external side effect already happened. Retries belong on database-only transactions, or on transactions where the external calls are idempotent and tolerant of duplicate execution. The boundary is architectural, not a library setting.

Retries need a budget. If a transaction can’t complete after ~3 attempts, the problem is no longer transient contention — it’s either a systemic hot spot or a bug that retries will never resolve. Escalate (alert, fail the request, enqueue for manual review), don’t loop forever.

The retry pattern that works in production:

for attempt in range(3):
    try:
        with db.transaction():
            do_work()
        return
    except (DeadlockError, SerializationFailure) as e:
        metrics.increment("db.retry", tags={"error": e.code})
        time.sleep(random.uniform(0, 0.1 * 2**attempt))
raise RetryBudgetExhausted()

Three attempts, exponential backoff up to ~400ms, metrics emitted on every retry, hard failure past the budget. The metric matters as much as the retry — without it, the team never learns which transactions are retrying frequently and why.

Idempotency is a transaction-shape property
A transaction is safe to retry iff re-running it produces the same observable state. That includes downstream side effects. A transaction that writes to a table AND sends a webhook is not safe to retry even if both operations are internally correct — the second attempt sends a duplicate webhook. The fix is the outbox pattern: write the webhook-send intent to a table in the same transaction, then have a separate worker process the outbox with its own idempotency guarantees. This is a non-negotiable part of building deadlock-retry-safe systems at scale.
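
A minimal outbox sketch, using sqlite3 as a stand-in engine (table and column names are illustrative; `send` is whatever delivers the webhook):

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id TEXT PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0);
""")

def ship_order(order_id: str) -> None:
    # Business write and webhook intent commit atomically: a deadlock
    # retry re-runs both or neither, so no duplicate webhook escapes.
    with conn:  # one transaction
        conn.execute("INSERT OR REPLACE INTO orders VALUES (?, 'shipped')",
                     (order_id,))
        conn.execute(
            "INSERT INTO outbox (id, payload) VALUES (?, ?)",
            (str(uuid.uuid4()),
             json.dumps({"event": "shipped", "order": order_id})),
        )

def drain_outbox(send) -> int:
    # Worker: deliver unsent intents, marking each as sent.
    rows = conn.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        send(json.loads(payload))  # the outbox row id doubles as a dedupe key
        conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return len(rows)
```

Delivery is at-least-once here; the receiving side dedupes on the outbox row id if the worker crashes between sending and marking.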

Consistent lock ordering: the highest-leverage fix

Part 1’s lock-ordering deadlock — two transactions updating the same set of rows in opposite orders — is the single most common production pattern and the one with the highest-leverage fix. If every code path that writes to a set of tables acquires locks in the same order, the wait-for graph literally cannot form a cycle on those rows. The engine still takes the locks, still holds them for the duration of the transaction, but the second transaction waits cleanly for the first instead of grabbing a lock the first will need.

The rule is: sort the rows by a stable key — usually the primary key — in application code, before any SQL is issued. Lock acquisition order in both engines is determined by the order the engine processes rows, which for most write patterns is the order the application submitted them. Sort once up front, and N workers all doing the same thing can’t cycle because they all approach the row set from the same end.

The three batch shapes that matter in practice, and where the ordering actually happens:

1. Loop of per-row UPDATEs — the classic batch worker:

# Sort in the application; the iteration order IS the lock acquisition order.
rows.sort(key=lambda r: r.id)
for row in rows:
    cursor.execute(
        "UPDATE accounts SET balance = balance + %s WHERE id = %s",
        (row.delta, row.id),
    )

Each UPDATE locks its target row at execution time; the loop order determines acquisition order. No SQL-level ORDER BY involved — the fix lives in the .sort() call.

2. Bulk UPSERT: INSERT ... ON DUPLICATE KEY UPDATE (MySQL) or INSERT ... ON CONFLICT (PostgreSQL):

# Sort rows by the unique key BEFORE building the batch.
rows.sort(key=lambda r: r.id)
execute_values(
    cursor,
    "INSERT INTO accounts (id, balance) VALUES %s "
    "ON CONFLICT (id) DO UPDATE SET balance = accounts.balance + EXCLUDED.balance",
    [(r.id, r.delta) for r in rows],
)

The engine processes the VALUES list in order and acquires locks as it goes. Two concurrent batches sorted by the same key approach the key space from the same end; without the sort, one batch might process (1, 2, 3) while another processes (3, 2, 1) and they deadlock on the middle row. This is the exact shape of the bulk-UPSERT deadlocks called out in Part 1’s unique-index section.

3. Small multi-row transactions (the canonical bank transfer):

-- Always update the lower id first, regardless of transfer direction.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

For 2–3 rows with per-row different values, the application computes sorted((X, Y)) and issues UPDATEs in that order. Same principle as the batch case, just smaller.
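
As a helper (the function name is mine), so no call site can get the direction-dependent order wrong:

```python
def ordered_transfer_updates(src: int, dst: int, amount: int):
    """Build the UPDATEs for a transfer, lower account id first.

    Both transfer directions between the same pair produce the same
    lock-acquisition order, so two opposing transfers cannot cycle.
    """
    if src == dst:
        raise ValueError("transfer endpoints must differ")
    deltas = {src: -amount, dst: +amount}
    return [
        ("UPDATE accounts SET balance = balance + %s WHERE id = %s",
         (deltas[i], i))
        for i in sorted(deltas)
    ]
```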

Where SELECT ... FOR UPDATE ORDER BY actually earns its keep. Most batches don’t need it — they control lock order through the submission order. The one shape where it’s the right answer is a single UPDATE statement over a derived table where the engine decides scan order and you can’t control it from outside:

UPDATE accounts a
SET balance = a.balance + u.delta
FROM (VALUES (1, 50), (2, 100), (3, 25)) AS u(id, delta)
WHERE a.id = u.id;

Here, sorting the VALUES list in application code doesn’t reliably control lock order — the planner picks the scan. A SELECT id FROM accounts WHERE id IN (...) ORDER BY id FOR UPDATE up front pre-acquires locks in deterministic order before the UPDATE runs. Or refactor into shape 1 or 2.

ORDER BY controls result order, not scan order
ORDER BY on a SELECT ... FOR UPDATE controls the result ordering, but lock acquisition happens during the scan. With a primary-key or unique-index predicate (WHERE id IN (...) on a PK), the planner does ordered index lookups and locks land in ORDER BY order in practice. For non-indexed predicates or range scans on non-unique columns, the planner may scan in a different order and sort results afterward — locks get acquired in scan order. Verify with EXPLAIN before relying on this pattern against non-PK predicates. Also: MySQL’s UPDATE ... ORDER BY syntax applies one SET clause to all matching rows — it doesn’t help when rows need different values.

This sounds trivial. It almost never is in practice — the ordering has to hold across every code path that writes to the same tables: the main request handler, backfill scripts, admin scripts, scheduled jobs, ORM bulk-save methods, and whatever migration scripts run during releases. One path that writes in a different order is enough to reopen the cycle. The durable fix is encoding the order in the access layer so individual query sites can’t diverge from it:

  • A repository function that always sorts by PK before writing.
  • A stored procedure or database function that owns the multi-row write.
  • A service method with the ordering baked in, and a lint rule or review check forbidding direct table writes from elsewhere.

Two places this invariant breaks without anyone noticing:

ORM bulk-save methods. ORMs hide whether they process in input order or reorder internally. Django’s Model.objects.bulk_update, SQLAlchemy’s bulk_update_mappings, ActiveRecord’s upsert_all, Hibernate’s batch inserts — some process in input order, some sort by PK internally, some chunk before doing either. If you can’t tell from docs, test it: two concurrent bulk-saves with opposing-ordered input lists will either deadlock (proving input order matters) or not (proving the ORM sorts internally). Either way, sorting the input collection before handing it to the ORM is cheap insurance.

Batches sourced from a SELECT in the same transaction. A common pattern: “grab N pending rows, process them.” If the feeder SELECT doesn’t have an ORDER BY, rows come back in scan order — non-deterministic across workers, which reopens the cycle. The fix is an explicit ORDER BY on the feeder query, not in the subsequent loop:

# Bad: scan order feeds the loop.
rows = cursor.execute("SELECT id, delta FROM pending WHERE status = 'ready' LIMIT 100")
for row in rows:  # Whatever the scan produced — varies across workers.
    ...

# Good: deterministic order, same across every worker.
rows = cursor.execute(
    "SELECT id, delta FROM pending WHERE status = 'ready' ORDER BY id LIMIT 100"
)
for row in rows:
    ...

The lint-rule angle matters more than it sounds. Deadlock-ordering violations are almost impossible to catch in code review — two PRs that each look correct in isolation can introduce inconsistent ordering when combined. The check that actually works is structural: no direct writes to tables X, Y from anywhere except the repository. Once that invariant is enforced, the ordering invariant follows.

Multi-table transactions need the same rule applied to table order. A transaction that updates users then orders in one code path, and orders then users in another, can deadlock through the FK chain even with per-table row ordering. The rule generalizes: sort rows within a table, and sort tables within a transaction, by a stable convention the whole codebase agrees on (alphabetical is fine; just pick one).
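
A sketch of a repository-layer generator that encodes both conventions, assuming every table keys on a common `id` column (names are illustrative):

```python
def ordered_writes(writes: dict[str, list[dict]], pk: str = "id"):
    """Yield (table, row) pairs in the canonical lock order.

    `writes` maps table name -> rows to write. Tables are visited
    alphabetically and rows within each table by primary key, so every
    transaction built through this function approaches the key space
    from the same end, regardless of which code path assembled it.
    """
    for table in sorted(writes):
        for row in sorted(writes[table], key=lambda r: r[pk]):
            yield table, row
```

Issuing all writes through this one generator is the structural enforcement the lint-rule argument above asks for.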

Isolation-level trade-offs

The isolation level a workload runs under determines which deadlock categories even apply. Most MySQL deadlock incidents stem from REPEATABLE READ’s gap locks — a category that doesn’t exist on PostgreSQL or on MySQL at READ COMMITTED. Changing the isolation level is the single highest-leverage tuning lever, and also the one with the most potential to quietly break application correctness.

Dropping MySQL from REPEATABLE READ to READ COMMITTED. Under READ COMMITTED, InnoDB still takes row locks but skips most gap locks (they exist only for unique-key and FK enforcement on inserts). Most OLTP workloads don’t need REPEATABLE READ’s range-consistency guarantee — most application code was designed around READ COMMITTED semantics anyway, because that’s what PostgreSQL and SQL Server default to. Teams migrating to READ COMMITTED typically see deadlock rates drop by an order of magnitude with no functional change.

Avoiding range locks on write paths. Independent of isolation level, replacing SELECT ... FOR UPDATE scans over ranges with point lookups by primary key removes the gap-lock surface entirely on the statements that do it. If a write path doesn’t need to lock “all orders for customer 5,” locking just the specific row it’s about to update is both faster and safer.

FK shared locks are shorter-lived under READ COMMITTED. The foreign-key shared-lock pattern from Part 1 (high-write child tables concentrating shared locks on hot parent rows) has a narrower window under READ COMMITTED simply because the lock lifespan is tied to the statement rather than the transaction. The cycle potential is still there, but the wait window is smaller.

Isolation change is a behavior change, not a tuning knob
READ COMMITTED eliminates most gap locks but also changes the visibility semantics of long-running transactions. Any code that relied on re-reading a row and getting the same result (transfer logic, inventory deductions, financial calculations) has to be re-examined. The safe migration is application-by-application, not database-wide. Run it in staging under production-like load and watch for subtle correctness regressions — “phantom” rows appearing inside a transaction that used to see a stable snapshot, inventory counts that shift mid-transaction, calculations that no longer match because an underlying row changed between reads.

Session-scoped change as a migration path. Both engines let you set isolation level per session or per transaction — SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED in MySQL, SET TRANSACTION ISOLATION LEVEL READ COMMITTED per transaction in PostgreSQL. The usual migration pattern is to start with the most contended code paths, move them to session-scoped READ COMMITTED, monitor for regressions, then expand the scope. A global flip from REPEATABLE READ to READ COMMITTED on a large, stable MySQL deployment is rarely the right first step.

Hot-row isolation: removing the pattern instead of retrying it

When the top N deadlocks on a production system concentrate on a small set of rows (a counter, a config row, an AUTO_INCREMENT source of truth), retries don’t converge. Every retry hits the same row, takes the same lock, cycles with the same peers. The fix is removing the row from the hot path, not tuning the retry layer.

Three patterns that work:

Counter tables with sharding (for extreme write-hot counters only). Reach for this only when the counter is taking thousands of writes per second against a single row and the simpler options below aren’t viable. For anything less, the queue pattern below or an external store (Redis atomic INCR, a time-series DB) is almost always the better answer — less complexity, no schema overhead, no sum-across-rows read cost. Sharded counters are the specialized escalation, not the default.

When it does fit the workload: N physical shards per logical counter, keyed on a compact integer (counter_id, shard_id) composite. Application code picks the shard — keeping the random choice out of SQL makes it portable across engines and testable independently:

CREATE TABLE counter_shards (
    counter_id BIGINT NOT NULL,   -- FK to a counters metadata table if you need names
    shard_id SMALLINT NOT NULL,   -- 0..N-1, fixed per-counter
    value BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (counter_id, shard_id)
);

-- Seed the shards once per counter (e.g., when the counter is created):
INSERT INTO counter_shards (counter_id, shard_id, value)
VALUES (42, 0, 0), (42, 1, 0), (42, 2, 0), (42, 3, 0),
       (42, 4, 0), (42, 5, 0), (42, 6, 0), (42, 7, 0),
       (42, 8, 0), (42, 9, 0), (42, 10, 0), (42, 11, 0),
       (42, 12, 0), (42, 13, 0), (42, 14, 0), (42, 15, 0);
# Increment: application picks the shard. Portable, cheap, no SQL-side RAND().
shard = random.randrange(16)
cursor.execute(
    "UPDATE counter_shards SET value = value + 1 "
    "WHERE counter_id = %s AND shard_id = %s",
    (42, shard),
)

# Read: sum across shards.
cursor.execute(
    "SELECT COALESCE(SUM(value), 0) FROM counter_shards WHERE counter_id = %s",
    (42,),
)

16 shards turn one hot row into 16 warm rows; contention drops roughly in proportion to the shard count. The read cost is one SUM across N rows instead of a single-row SELECT — usually acceptable for counter use cases; if not, cache the aggregate.

A common refinement is deriving the shard deterministically from the connection or worker ID (e.g., connection_id % 16) so each worker consistently hits the same shard. That improves InnoDB buffer-pool locality and reduces cross-shard interference, at the cost of slightly less even distribution if worker counts aren’t balanced.
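
Both selection strategies side by side; `worker_id` is whatever stable per-worker integer the deployment already has (process index, connection id):

```python
import random

NUM_SHARDS = 16

def pick_shard_random() -> int:
    # Uniform spread: best distribution, worst buffer-pool locality.
    return random.randrange(NUM_SHARDS)

def pick_shard_sticky(worker_id: int) -> int:
    # Each worker always hits the same shard: better locality, but the
    # distribution is only as even as the worker population.
    return worker_id % NUM_SHARDS
```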

Advisory locks for app-level serialization. Both MySQL and PostgreSQL support advisory locks — named locks that exist outside the table model and don’t take row locks. For operations that need to be serialized at the application level (leader election, rate limiting, config migration), advisory locks are dramatically cheaper than row locks and can’t participate in a table-lock cycle:

-- PostgreSQL: advisory lock keyed by a bigint.
SELECT pg_advisory_xact_lock(hashtext('refresh_cache:customer_42'));
-- Lock released at transaction end. Only one worker per key runs at a time.

-- MySQL equivalent:
SELECT GET_LOCK('refresh_cache:customer_42', 10);
-- Returns 1 if acquired, 0 on timeout. Must explicitly RELEASE_LOCK.

The caveats: advisory locks are application-layer discipline — they don’t enforce anything the database checks. Use them where the application chooses to serialize, not where correctness requires it.
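
One wrinkle in practice: pg_advisory_xact_lock takes a bigint, not a string, so string keys get hashed application-side (hashtext works, as in the example above, but it is a PostgreSQL-internal function). A sketch using a stable application-side hash:

```python
import hashlib
import struct

def advisory_key(name: str) -> int:
    """Map a string lock name to a signed 64-bit key for pg_advisory_xact_lock.

    Stable across processes and versions, unlike Python's built-in hash()
    (salted per process). Collisions are possible but harmless for
    serialization: two colliding names simply serialize together.
    """
    digest = hashlib.sha256(name.encode()).digest()
    return struct.unpack(">q", digest[:8])[0]

# Usage (hypothetical cursor):
# cur.execute("SELECT pg_advisory_xact_lock(%s)",
#             (advisory_key("refresh_cache:customer_42"),))
```

MySQL’s GET_LOCK takes the string name directly, so this applies only on the PostgreSQL side.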

Queue patterns instead of direct updates. For counter-like workloads, write intents to an append-only table and aggregate periodically:

INSERT INTO counter_events (counter_key, delta, created_at) VALUES (?, ?, NOW());
-- No contention: every insert creates a new row.

-- Periodic aggregation job (counter_events carries a processed flag,
-- default FALSE; run both statements in one transaction, bounded by a
-- max id/timestamp captured up front if inserts race the job):
INSERT INTO counter_totals (counter_key, total)
SELECT counter_key, SUM(delta) FROM counter_events
WHERE processed = FALSE
GROUP BY counter_key
ON CONFLICT (counter_key) DO UPDATE SET total = counter_totals.total + EXCLUDED.total;

UPDATE counter_events SET processed = TRUE WHERE processed = FALSE;

Trades real-time accuracy for throughput. The right trade-off for page-view counters, metric accumulation, any workload where eventual consistency is acceptable.
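
The same lifecycle end to end, sketched with sqlite3 (schema mirrors the SQL above; with a single connection the race between summing and marking cannot arise, which is why the sketch skips the max-id bound a real engine needs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE counter_events (id INTEGER PRIMARY KEY, counter_key TEXT,
                                 delta INTEGER, processed INTEGER DEFAULT 0);
    CREATE TABLE counter_totals (counter_key TEXT PRIMARY KEY, total INTEGER);
""")

def record(key: str, delta: int) -> None:
    # Hot path: append-only, never contends on an existing row.
    conn.execute("INSERT INTO counter_events (counter_key, delta) VALUES (?, ?)",
                 (key, delta))
    conn.commit()

def aggregate() -> None:
    # Periodic job: fold unprocessed events into totals and mark them,
    # in one transaction so a crash cannot double-count.
    with conn:
        rows = conn.execute(
            "SELECT counter_key, SUM(delta) FROM counter_events "
            "WHERE processed = 0 GROUP BY counter_key").fetchall()
        for key, total in rows:
            conn.execute(
                "INSERT INTO counter_totals VALUES (?, ?) "
                "ON CONFLICT (counter_key) DO UPDATE "
                "SET total = total + excluded.total",
                (key, total))
        conn.execute("UPDATE counter_events SET processed = 1 WHERE processed = 0")
```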

Hot parent rows behind FK chains. Part 1 described how a high-write child table concentrates shared locks on a hot parent row, and how any exclusive-lock request on that parent (a name update, a soft-delete, a trigger-driven counter update) becomes a contention point. Two levers that work:

  • Move high-frequency parent-row updates to a side table. The last_activity_at timestamp, the cached counter, the updated_at that a trigger bumps on every child insert — none of these need to live on the parent table. Moving them to customer_activity(customer_id, last_seen_at) or customer_counters(customer_id, order_count) eliminates the exclusive-lock contention entirely. The parent row stops changing on hot paths, the shared locks from FK checks coexist fine, and the cycle potential disappears.
  • Narrow the FK scope where integrity can tolerate it. Not every child table needs an enforced FK to every parent — logs, events, and audit tables are often the biggest offenders, and often have the least need for strict integrity (an orphaned log row is rarely a correctness problem). Dropping the FK removes the shared-lock dependency entirely. This trades integrity for throughput — a decision that belongs with the team owning the data, not a default.

Under READ COMMITTED, both levers matter less because the FK shared locks release at statement end rather than transaction end. A workload that runs on REPEATABLE READ and can’t change isolation level (because of application semantics) gets the most benefit from these two fixes.

Long-running transactions are a deadlock amplifier

The longer a transaction holds locks, the wider the window for a cycle. Two patterns recur in production:

Application-layer long transactions. A transaction that opens at request start, makes several queries, calls an external API, then commits. The external call is where the transaction actually spends its time — seconds of open transaction holding row locks the whole time. Every concurrent transaction that touches those rows waits. Deadlock probability scales with transaction duration. The fix is the inverse of the outbox pattern: do the external call outside the transaction, passing in any needed IDs.

Idle-in-transaction sessions. A session that runs BEGIN, some writes, then stalls — idle but not committed. In PostgreSQL, this blocks vacuum on touched tables, bloats MVCC, and holds locks indefinitely. pg_stat_activity shows state = 'idle in transaction'. MySQL’s equivalent is a thread with an open transaction and no current query.

PostgreSQL has a first-class timeout for this; MySQL does not:

-- PostgreSQL: kill idle-in-transaction sessions after 5 minutes.
-- (A bare integer is read as milliseconds; explicit units are clearer.)
SET idle_in_transaction_session_timeout = '5min';

MySQL has no direct equivalent. wait_timeout and interactive_timeout govern idle connections, not sessions idle inside an open transaction — a connection that did BEGIN then stopped sending queries will hold its locks until the connection drops or the client commits. The production workaround is either a watchdog script (e.g., Percona’s pt-kill) that polls information_schema.innodb_trx and terminates transactions exceeding a duration threshold, or a connection pool that enforces per-connection transaction lifetimes. Connection pools that acquire a connection, start a transaction, then return the connection to the pool without committing (rare but real) will otherwise produce sessions that hold locks indefinitely.

Finding long-running transactions:

-- PostgreSQL
SELECT pid, usename, state, xact_start, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL AND now() - xact_start > INTERVAL '30 seconds'
ORDER BY duration DESC;

-- MySQL
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_rows_locked, trx_query
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 30;

Alerting on any transaction exceeding 30s in an OLTP workload catches most of the long-transaction-induced deadlocks before they produce incidents.

Triggers and cascades are invisible lock sources

A trigger that updates a second table on every write to the first adds an edge to the wait-for graph that isn’t visible in the original query. ON DELETE CASCADE foreign keys behave similarly — one delete can take locks on every child row in the cascade, and if the cascade order differs between two concurrent deletes, they can deadlock through tables neither statement directly referenced.

This is the origin of the “why is my DELETE FROM users deadlocking against an INSERT INTO events?” question. The DELETE triggered a cascade to user_preferences, which had a trigger that updated a counter in tenants, which was locked by the INSERT. Four tables in the cycle, two in the application’s explicit query, and zero mention of the other two in any log entry until someone reads the DDL.

The operational pattern: when a deadlock log mentions a table the application’s code doesn’t explicitly reference, check (1) FK cascades on the tables that are in the query, (2) triggers on those tables, (3) generated columns that fire on update. All three are non-obvious lock sources, and all three are fixable — but only after they’re identified.
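The three checks can be scripted. A hypothetical helper that emits the diagnostic queries for a table that showed up in a deadlock log but not in the application's SQL — the information_schema queries are MySQL-flavored (PostgreSQL has equivalents in pg_catalog), the wrapper itself is just a convenience:

```python
def hidden_lock_source_queries(table: str) -> dict:
    """Diagnostic SQL for the three non-obvious lock sources on `table`."""
    return {
        # (1) FK cascades that fan out from deletes/updates on this table
        "cascades": f"""
            SELECT constraint_name, table_name, delete_rule, update_rule
            FROM information_schema.referential_constraints
            WHERE referenced_table_name = '{table}'
              AND (delete_rule = 'CASCADE' OR update_rule = 'CASCADE')
        """,
        # (2) triggers that fire on writes to this table
        "triggers": f"""
            SELECT trigger_name, event_manipulation, action_timing
            FROM information_schema.triggers
            WHERE event_object_table = '{table}'
        """,
        # (3) generated columns recomputed on update
        "generated": f"""
            SELECT column_name, generation_expression
            FROM information_schema.columns
            WHERE table_name = '{table}' AND generation_expression <> ''
        """,
    }
```

Running all three against every table named in a deadlock log is fast, and usually turns up the missing edge in the wait-for graph.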

innodb_autoinc_lock_mode and the AUTO-INC lock

MySQL InnoDB’s AUTO_INCREMENT column has its own lock, historically a source of contention and occasional deadlock. The innodb_autoinc_lock_mode parameter controls the behavior:

  • Mode 0 (traditional) — table-level AUTO-INC lock held for the duration of the statement. Serialized across inserts. Safe for statement-based replication, terrible for concurrency.
  • Mode 1 (consecutive) — a lighter lock for simple inserts (single-row or known-row-count), and the traditional table lock for bulk inserts (INSERT ... SELECT). This was the default through MySQL 5.7.
  • Mode 2 (interleaved) — no AUTO-INC table lock; IDs are assigned per-row as needed, possibly interleaved across concurrent statements. Default in MySQL 8.0. Fastest, and correct for row-based replication (which is also the 8.0 default).

The mode-2 default in 8.0 eliminated a substantial source of historical deadlocks and contention — bulk inserts that used to serialize on the AUTO-INC lock now proceed in parallel. If you’re migrating from 5.7 to 8.0, this is a free win. If you’re still on binlog_format = STATEMENT (uncommon but not unheard of in legacy deployments), you cannot safely run mode 2 — the replica may generate different IDs than the source, corrupting the data. Switch to binlog_format = ROW first, then adopt mode 2.
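The preflight check is one condition, sketched here as a function. The safety rule comes from the MySQL documentation: interleaved mode is safe with row-based or mixed-format replication, not with statement-based:

```python
def autoinc_mode2_is_safe(binlog_format: str) -> bool:
    """Is innodb_autoinc_lock_mode = 2 safe for this binlog_format?

    STATEMENT replication replays INSERTs textually, so interleaved ID
    assignment on the source can diverge from the replica. ROW ships the
    actual row images; MIXED falls back to ROW for unsafe statements.
    """
    return binlog_format.upper() in ("ROW", "MIXED")
```

Check `SELECT @@binlog_format` before flipping the mode; both settings require a restart to change on older versions, so plan them in the same maintenance window.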

DDL migration windows and lock_timeout

Online schema change isn’t deadlock-prone in the classical sense, but it interacts with deadlocks in a specific operational way: DDL takes heavy locks that queue behind ongoing DML, and while the DDL waits, every subsequent query on that table queues behind the DDL. In PostgreSQL, a DDL taking ACCESS EXCLUSIVE that waits for an existing long-running SELECT will cause every new SELECT to wait behind the DDL. The system grinds to a halt, and application logs fill with timeout errors that look like deadlocks but aren’t — it’s a queue, not a cycle.

The standard prevention idiom for PostgreSQL migrations:

SET lock_timeout = '2s';
ALTER TABLE orders ADD COLUMN notes TEXT;

If the ALTER can’t acquire its lock in 2 seconds, it fails instead of queueing. The migration tool catches the error and retries with backoff. This prevents the queue-behind-DDL outage entirely — the cost is that some migrations need multiple attempts to land, which is almost always the right trade-off.
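A sketch of the retry loop a migration runner wraps around that idiom. `run_ddl` is a hypothetical callable that executes the SET lock_timeout + ALTER pair and raises on PostgreSQL's lock_not_available error (SQLSTATE 55P03); the exception class here stands in for the driver's:

```python
import time

class LockNotAvailable(Exception):
    """Stand-in for the driver's SQLSTATE 55P03 (lock_timeout expired)."""

def apply_with_backoff(run_ddl, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry lock_timeout-guarded DDL with exponential backoff.
    Returns the number of attempts it took; re-raises after the last one."""
    for attempt in range(attempts):
        try:
            run_ddl()  # runs: SET lock_timeout = '2s'; ALTER TABLE ...
            return attempt + 1
        except LockNotAvailable:
            if attempt == attempts - 1:
                raise  # fail loudly: a human decides, not an infinite loop
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The bounded attempt count matters: a migration that can never acquire its lock should surface as a failed deploy, not a silent retry loop competing with production traffic.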

MySQL’s equivalent tooling is pt-online-schema-change (Percona) and gh-ost (GitHub). Both create a copy of the table, stream writes to both via trigger or binlog, and swap at the end. They run concurrent DML against the original and the copy, so they inflate deadlock rates during the migration window — not because the tool is buggy, but because there are now more transactions touching the same rows. The operational practice: run migrations at low-traffic windows, watch the deadlock counter during the run (not just replica lag), and have a rollback path ready.

DDL inside transactions is engine-dependent
PostgreSQL supports transactional DDL — BEGIN; ALTER TABLE ...; COMMIT; is atomic. MySQL does not; every DDL statement implicitly commits the current transaction. A migration script that assumes it can roll back mid-migration works on PostgreSQL and silently half-applies on MySQL. Know which engine you’re writing migrations for.

NOWAIT and SKIP LOCKED as prevention primitives

Both engines support two SQL-level concurrency primitives that remove the need for application-layer deadlock handling in specific patterns:

  • SELECT ... FOR UPDATE NOWAIT — if the row is locked by another transaction, fail immediately with an error instead of waiting. Useful for user-facing paths where “I can’t get this resource right now” is a better UX than “wait 500ms and maybe deadlock anyway.” Also useful for detecting lock contention synthetically in tests.
  • SELECT ... FOR UPDATE SKIP LOCKED — if rows are locked by another transaction, skip them and return only rows the current transaction can lock. Transforms a contended queue-processor pattern into a lock-free one: N workers each grab a different set of rows, zero contention, zero deadlocks.
-- Queue processor: deadlock-free, contention-free.
SELECT * FROM jobs
WHERE status = 'pending'
ORDER BY priority, created_at
LIMIT 10
FOR UPDATE SKIP LOCKED;

-- Fast-fail acquisition: don't wait, fail now.
SELECT * FROM leader_election
WHERE resource_id = 'cache-refresh'
FOR UPDATE NOWAIT;

SKIP LOCKED arrived in PostgreSQL 9.5 and MySQL 8.0. Before those versions, queue-processor patterns required either advisory locks or application-level coordination (Redis, Zookeeper). Post-SKIP LOCKED, they can live entirely in the database with a single primitive. For any workload where workers pull from a shared queue, this is the pattern — not retry loops on FOR UPDATE.
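Why the pattern is contention-free can be shown with an in-memory simulation of the claim semantics: each worker takes the first N unlocked pending rows (in ORDER BY order) and locks them, so concurrent workers always receive disjoint sets. In a real system this is the SELECT ... FOR UPDATE SKIP LOCKED above, run inside each worker's transaction:

```python
def claim_jobs(jobs, locked, limit):
    """Simulate one worker's SKIP LOCKED claim.

    jobs:   list of (priority, created_at, job_id), pre-sorted like ORDER BY
    locked: set of job_ids another transaction currently holds
    """
    claimed = []
    for _, _, job_id in jobs:
        if job_id in locked:
            continue              # SKIP LOCKED: don't wait, move on
        locked.add(job_id)        # FOR UPDATE: this worker now holds the row
        claimed.append(job_id)
        if len(claimed) == limit:
            break                 # LIMIT reached
    return claimed
```

Two workers calling this against the same queue never block each other and never receive the same job — the property that makes retry loops unnecessary here.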

Monitoring that actually catches regressions

The single most useful metric is deadlock rate over time. Not error rate, not retry rate — the raw count of deadlocks per minute or per thousand transactions. A workload with 0.1 deadlocks per thousand transactions is healthy; 10 per thousand is a paging threshold; 100 per thousand means retries aren’t converging and something is structurally wrong.
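Those reference points translate into a simple alert classifier. The cut-offs below (1 and 100 per thousand) interpolate the numbers in the text and are a sketch to calibrate against your own workload, not universal constants:

```python
def deadlock_health(deadlocks: int, transactions: int) -> str:
    """Classify a window of (deadlock count, transaction count)."""
    per_thousand = 1000 * deadlocks / max(transactions, 1)
    if per_thousand < 1:
        return "healthy"      # 0.1/1000 territory: retries absorb it
    if per_thousand < 100:
        return "page"         # 10/1000 territory: someone should look
    return "structural"       # 100/1000: retries aren't converging
```

Feed it the deltas of the cumulative counters below over a fixed window; cumulative totals alone hide regressions.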

For MySQL: there’s no Innodb_deadlocks status variable. The correct source is performance_schema.events_errors_summary_global_by_error, which is enabled by default:

-- Cumulative deadlock count (compare over time windows).
SELECT SUM_ERROR_RAISED AS deadlock_count
FROM performance_schema.events_errors_summary_global_by_error
WHERE ERROR_NAME = 'ER_LOCK_DEADLOCK';

-- Lock-wait activity (useful for adjacent contention, NOT a deadlock counter):
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_waits';

-- Plus the error log, searchable for "LATEST DETECTED DEADLOCK"
-- once innodb_print_all_deadlocks=ON.

Innodb_row_lock_waits is commonly misread as a deadlock counter — the manual defines it as “the number of times operations on InnoDB tables had to wait for a row lock,” which is contention in general. Pair it with the events_errors_summary query, not in place of it.

For PostgreSQL:

-- pg_stat_database exposes a per-database deadlock counter.
SELECT datname, deadlocks, xact_commit, xact_rollback
FROM pg_stat_database
WHERE datname = current_database();

Scrape both into Prometheus (mysqld_exporter and postgres_exporter both expose these), compute the rate, alert on sustained rises. Pair the deadlock rate with a retry rate from the application layer — a spike in one without the other means either the retry logic is broken or the workload shape changed. A spike in both means a real regression.

Beyond the rate itself, the top-K pairs of statements involved in deadlocks (extracted from innodb_print_all_deadlocks logs or PG’s deadlock log entries) identify exactly which code paths are fighting. This list rarely changes — the same two or three patterns account for most deadlocks in any given system. Fix those and the rate drops by an order of magnitude.
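The top-K extraction is a small script once the statement pairs are parsed out of the log entries. Parsing the raw innodb_print_all_deadlocks format (or PostgreSQL's deadlock DETAIL lines) is engine-specific and omitted here; this sketch handles the counting, with a naive literal-normalization so different row IDs count as the same statement shape:

```python
import re
from collections import Counter

def normalize(sql: str) -> str:
    # Collapse numeric literals so "WHERE id = 7" and "WHERE id = 9"
    # count as one statement shape. Real fingerprinting (pt-fingerprint,
    # pg_stat_statements queryid) is more thorough.
    return re.sub(r"\b\d+\b", "?", sql.strip().lower())

def top_pairs(pairs, k=3):
    """pairs: iterable of (statement_a, statement_b) from deadlock logs.
    Returns the k most frequent normalized pairs with counts."""
    counts = Counter(
        tuple(sorted((normalize(a), normalize(b)))) for a, b in pairs
    )
    return counts.most_common(k)
```

Sorting each pair before counting matters: a deadlock between statements A and B is the same conflict whether the log happened to list A or B as the victim.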

The mental model for Part 2

Part 1’s patterns answer why deadlocks happen. Part 2’s operations answer what to do about them — and the useful reframe is that the answer is almost never “tune the retry logic.” Retries are the recovery mechanism that keeps the application running while the actual fix lands. The actual fix is almost always one of:

  • Identify the pattern from the log. This is step zero; skipping it means you’re tuning blind.
  • Enforce consistent lock ordering at the access-layer level. Highest-leverage fix for the lock-ordering pattern; deterministically eliminates the cycle rather than shrinking its window.
  • Change the code path to use SKIP LOCKED / NOWAIT where the pattern matches (queue processors, resource acquisition).
  • Isolate hot rows (counter tables, shards, advisory locks, queue patterns; move high-frequency parent-row updates to side tables).
  • Shorten transactions. Move external calls out, enforce idle-transaction timeouts.
  • Lower the isolation level (e.g., REPEATABLE READ to READ COMMITTED on MySQL) where the workload allows it — session-scoped first, global only after regression testing. Eliminates the gap-lock category.
  • Remove cascades/triggers from the hot path when they’re the hidden lock source.
  • Handle SERIALIZABLE’s 40001 as a normal event if you’re on SERIALIZABLE, and don’t confuse it with 40P01.
  • Plan DDL windows with lock_timeout and watch the deadlock counter through the migration.

Retries let the application survive while the fix is in flight. Monitoring tells you which fix to prioritize. Each of the above removes a category from the workload entirely — which is what “eliminating deadlocks” actually looks like in production. Not zero deadlocks forever, but a workload where the few remaining are rare enough that the retry layer handles them invisibly and the team’s attention can go elsewhere.

That’s the goal. Not a deadlock-free system — that’s a theoretical fiction at realistic concurrency — but a system where deadlocks are a managed failure mode rather than a recurring incident. The patterns in Part 1 tell you what to look for. The tooling and idioms here are how to do something about what you find.

SELECT insights FROM experience WHERE downtime = 0; -- Ruslan Tolkachev