Query-Performance on EXPLAIN ANALYZE

Covering Index Traps: When Adding One Column Breaks Your Query

Fri, 18 Jul 2025 00:00:00 +0000

TL;DR

An index-only scan is the fastest way a relational database can answer a query: the engine reads the index and never touches the table. Adding a single column to the SELECT list that isn’t in the index silently breaks the optimization, and the query that ran in a millisecond now takes seconds. The SELECT list is part of the query’s performance contract with the index.

Here’s a query that ran in production for a year with sub-millisecond latency:

SELECT status, created_at FROM orders WHERE customer_id = 42;

The orders table has a composite index on (customer_id, status, created_at). Every column the query needs (customer_id for the filter, status and created_at for the output) is in that index. The database reads the index, returns the results, and never touches the table. This is an index-only scan: one of the most significant optimizations a relational engine makes, and the mechanism behind “covering” queries.

Then a feature request: “show the order total on this page.” The change looks trivial.

SELECT status, created_at, total_cents FROM orders WHERE customer_id = 42;

One column added. The query is still correct, and the index still matches the filter. But total_cents isn’t in the index, so for every matching row the engine now follows a pointer back to the table to fetch that one extra column. On a table with millions of rows, that’s up to one random I/O per match. The query that took 0.4 ms now takes 1243 ms.

The obvious fix is “just don’t add columns to queries.” That doesn’t work; features need data. The slightly-less-obvious fix is “always project the minimum columns,” which is fine as advice and ignored in practice because most ORMs default to SELECT *. The actual fix is to treat the SELECT list as part of the query’s performance contract with the index, and to know what that contract is before changing it.

What’s actually happening

The execution plan tells the whole story:

-- Before: index-only scan
EXPLAIN ANALYZE SELECT status, created_at FROM orders WHERE customer_id = 42;
-- Index Only Scan using idx_orders_cust_status_created on orders
-- Heap Fetches: 0
-- Execution Time: 0.4 ms

-- After: index scan + table lookups
EXPLAIN ANALYZE SELECT status, created_at, total_cents FROM orders WHERE customer_id = 42;
-- Index Scan using idx_orders_cust_status_created on orders
-- Execution Time: 1243.7 ms

Same index. Same filter. Same rows returned. The only difference is the select list, and it moves the query from a pure index walk to an index walk plus one random I/O per matching row.
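
The same flip can be reproduced locally. The sketch below uses SQLite through Python’s sqlite3 module purely as a stand-in (an assumption; the plans above are PostgreSQL’s), because its EXPLAIN QUERY PLAN output names a covering index explicitly. The empty table and the plan helper are illustrative, not part of the original setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        status TEXT NOT NULL,
        created_at TEXT NOT NULL,
        total_cents INTEGER NOT NULL
    );
    CREATE INDEX idx_orders_cust_status_created
        ON orders (customer_id, status, created_at);
""")

def plan(sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail

covered = plan(
    "SELECT status, created_at FROM orders WHERE customer_id = 42")
uncovered = plan(
    "SELECT status, created_at, total_cents FROM orders WHERE customer_id = 42")

print(covered)    # mentions COVERING INDEX: answered from the index alone
print(uncovered)  # same index, but each match also visits the table row
```

SQLite reports the difference as COVERING INDEX versus plain INDEX; PostgreSQL reports it as Index Only Scan versus Index Scan. The wording differs, the mechanism is the same.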

Buffer pool pollution compounds the damage. When the engine fetches full rows from the table instead of reading compact index entries, it loads entire data pages into the buffer pool. Those pages (carrying every column of every matched row, most of which the query doesn’t need) evict pages that other queries do need. On a busy system with a finite buffer pool, one query losing its covering index degrades performance for unrelated queries across the database. The slow query you noticed is rarely the only thing getting slower.

Nothing in the query results tells you anything changed. The rows come back correctly. The response looks the same. A SELECT COUNT(*) returns the same count. The only place the degradation is visible is in the execution plan, and nobody checks the execution plan when the feature ships.

ORM defaults

Most ORMs emit SELECT * unless explicitly told otherwise. ActiveRecord needs .select(:id, :status); Django needs .only('id', 'status'); SQLAlchemy needs explicit column specification; Prisma needs an explicit select block. On a high-traffic table, a one-line change to project only the needed columns is one of the highest-leverage optimizations available. Worth checking what your ORM actually generates on the query paths that matter; the generated SQL is the contract, not the method call.

The fix: match the index or extend it

There are two workable fixes when a query loses its covering index, and they trade different costs:

Project only what the index covers. If the new column isn’t worth fetching from the table on every row, don’t fetch it. Split the query: one covered query for the list view, a targeted lookup for the detail row the user actually wants. Most feature requests that “need” an extra column on a list page are actually fine with lazy-loading the value on click.

Extend the index to include the new column. If the column is genuinely needed on every row, add it to the index, either as an additional indexed column or (in PostgreSQL 11 and later) via an INCLUDE clause that stores the value in the leaf pages without making it part of the B-tree key:

-- PostgreSQL: add total_cents as a non-key included column
CREATE INDEX idx_orders_cust_status_created_total
 ON orders (customer_id, status, created_at)
 INCLUDE (total_cents);

INCLUDE is the right tool when you need the column covered but don’t want it affecting the sort order or filter path. The trade-off is write cost: the index is now larger, and every update to total_cents has to update the index entry. On a write-heavy table that’s meaningful; on a read-heavy table it’s usually negligible compared to the read speedup.

MySQL (InnoDB) doesn’t support INCLUDE but has a natural equivalent: every secondary index already contains the primary key at its leaves, and you can extend the secondary index to cover additional columns by adding them as regular key columns. The planner is smart enough to use the covered form when the column is present.
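
A minimal sketch of the key-column route, again using SQLite via Python’s sqlite3 as a stand-in (an assumption: like InnoDB it has no INCLUDE clause, so the extra column goes in as a trailing key column). The index name idx_orders_list_view is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER, status TEXT, created_at TEXT, total_cents INTEGER
    );
    -- No INCLUDE here: total_cents rides along as a trailing key column.
    CREATE INDEX idx_orders_list_view
        ON orders (customer_id, status, created_at, total_cents);
""")

query = ("SELECT status, created_at, total_cents "
         "FROM orders WHERE customer_id = 42")
details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
print(details)  # the plan should name idx_orders_list_view as a covering index
```

The cost is that total_cents now participates in the index ordering, so the index is larger and every change to that column moves an index entry.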

When covering isn’t the right call

Covering indexes aren’t a universal good. Three cases where chasing a covering index is the wrong move:

Low-selectivity filters. If customer_id = 42 matches 80% of the table, the planner won’t use the index at all; a sequential scan is cheaper. Index-only scans matter when the filter is selective. On a low-selectivity predicate, covering changes nothing.

Write-heavy tables. Every index slows writes. A table taking 50,000 inserts per second with five secondary indexes already pays a real cost for every index entry. Adding a covering variant of an existing index to shave read latency from 15 ms to 3 ms is a bad trade if the table is write-dominated; the write penalty compounds on every row, and only the reads benefit.

Rapidly changing projections. If the feature team is adding and removing columns from the list view every sprint, chasing the covering index is a losing game. Freeze the list-view columns as a contract, document them in the schema, and let the index match that contract, or don’t bother indexing for coverage at all.

One more column, silently uncovered

The archetypal AI-generated version of this bug is a one-line change that adds a column to the SELECT list. A feature request says “show the order total on the list page”; the assistant reads the existing query, adds total_cents to the projection, and returns the patch. The query still runs, the list page still renders, and the p99 quietly moves from 0.4 ms to 1200 ms because the index-only scan became a heap-fetch scan and nobody noticed until the dashboard did.

Coverage checking itself isn’t hard reasoning. Given a query and an index definition, working out whether the SELECT list stays inside the index is a short syntactic check any capable model can do. The catalog exposes the ingredients: PostgreSQL’s pg_index separates key columns from INCLUDE ones via indnkeyatts vs indnatts, MySQL’s information_schema.STATISTICS lists all columns per index. The signal is there. What fails in practice is subtler. The relevant index often isn’t in the prompt’s context window; schema-aware tools pull catalog metadata, but whether idx_orders_cust_status_created lands in the retrieved context for “add total_cents to the list view” depends on retrieval heuristics, not the model’s capability. Even when the index definition is available, the default behavior for “modify this query” is to modify the query; re-verifying that the projection stays covered is a step the assistant rarely takes unsolicited. And only the planner’s actual choice is authoritative; static analysis gets most of the way, but nothing short of EXPLAIN tells you which index the query will use under real statistics.
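
As a rough illustration of how short that syntactic check is, here is a sketch in Python. It assumes the column lists have already been extracted from the query and the catalog; the function name uncovered_columns is hypothetical, and the check ignores expressions, SELECT *, and whatever the planner actually decides.

```python
def uncovered_columns(select_cols, filter_cols, index_cols):
    """Return the referenced columns that the index does not carry.

    An empty result means the query *could* be answered from the index
    alone; whether the planner actually does so still needs EXPLAIN.
    """
    # Deduplicate while keeping first-seen order.
    needed = list(dict.fromkeys([*filter_cols, *select_cols]))
    available = set(index_cols)
    return [col for col in needed if col not in available]

# Key columns plus any INCLUDE columns of idx_orders_cust_status_created.
index = ["customer_id", "status", "created_at"]

before = uncovered_columns(["status", "created_at"], ["customer_id"], index)
after = uncovered_columns(
    ["status", "created_at", "total_cents"], ["customer_id"], index)

print(before)  # []
print(after)   # ['total_cents']
```

A non-empty result is a prompt to run EXPLAIN, not a verdict by itself.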

The fix at the schema level is what makes the coverage relationship legible to the next reader, human or model: name indexes after the query they support (idx_orders_list_view tells you what depends on it), document INCLUDE columns in the index comment, and put a comment on the query itself pointing at the index. None of this is novel advice. It becomes load-bearing once an assistant is routinely modifying queries: the explicit link between query and covering index is the signal that tells the assistant (and the human reviewer) “this change has an index implication” rather than silently shipping the uncovered patch.

The bigger picture

The SELECT list is a performance contract in most code reviewers’ blind spot. WHERE clauses get scrutinized because they’re obviously performance-relevant. JOINs get scrutinized because cardinality mistakes are visible. The SELECT list gets waved through because “it’s just what we display”, and then a one-column addition drops a query from 0.4 ms to 1243 ms with no code-review signal to catch it.

EXPLAIN ANALYZE is the only authority here. Reading execution plans isn’t glamorous, but it’s the difference between a query that works and a query that works at scale, and between a select-list change that’s free and one that silently broke the optimization the index existed to enable. On the queries that carry the most traffic, the execution plan belongs in code review alongside the query itself.

Non-SARGable Predicates: How a Function in WHERE Kills Your Index

Sat, 14 Jun 2025 00:00:00 +0000

TL;DR

A predicate is SARGable (Search ARGument able) if the database can use an index to evaluate it. Wrapping a column in a function makes the predicate non-SARGable: the engine has to compute the function on every row before it can filter, which means a full table scan no matter what indexes exist on the bare column. The fix isn’t always to rewrite the predicate (sometimes the column’s type or collation is wrong and the code is masking it), but every non-SARGable predicate on a hot path is a performance bug waiting for the table to grow.

Here are two queries that return the exact same rows:

-- Version A
SELECT id, status FROM events
WHERE YEAR(created_at) = 2025;

-- Version B
SELECT id, status FROM events
WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';

On a 10,000-row events table, both run in under a millisecond and nobody notices the difference. On a 200-million-row events table with an index on created_at, version A does a sequential scan and takes 45 seconds; version B does an index range scan and takes 12 milliseconds. Neither query is wrong. They don’t even disagree about the answer. One just does the same work in a way the planner can’t optimize.

The obvious fix is “rewrite every function-wrapped predicate as a range.” That works for the date-extraction case and a few others. For WHERE LOWER(email) = 'alice@example.com', the rewrite needs to know whether the column’s collation is case-insensitive, and if it isn’t, there’s no direct equivalent, only a functional index or a schema change. The fix depends on why the function is there, and “why” usually points back at something in the schema that’s pretending to be something it isn’t.

What SARGable means in practice

An index on created_at is a sorted structure: the engine can jump to any date range in O(log n) time by walking the B-tree. For the planner to use that index on a predicate, the predicate has to be expressible as “the column is in this range”: a direct comparison between the column and a constant or parameter.

created_at >= '2025-01-01' meets that contract. The planner translates it to “walk the index to the first entry ≥ 2025-01-01, read forward from there.” That’s a range scan.

YEAR(created_at) = 2025 doesn’t meet the contract. The value being compared isn’t created_at; it’s the output of YEAR() applied to created_at. The index on created_at doesn’t know the output of YEAR() for any row without computing it. So the planner falls back to evaluating the function on every row (a sequential scan) and only then filtering.
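
The contrast is easy to see in a plan. This sketch uses SQLite through Python’s sqlite3 as a stand-in engine (an assumption), with strftime('%Y', ...) playing the role of YEAR(); the schema is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        created_at TEXT NOT NULL   -- ISO-8601 text, so string order = date order
    );
    CREATE INDEX idx_events_created_at ON events (created_at);
""")

def plan(sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# Function wrapped around the column: the index on created_at cannot help.
wrapped = plan(
    "SELECT id, status FROM events WHERE strftime('%Y', created_at) = '2025'")
# Bare column compared against constants: a half-open range.
ranged = plan(
    "SELECT id, status FROM events "
    "WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'")

print(wrapped)  # a SCAN of the whole table
print(ranged)   # a SEARCH using idx_events_created_at
```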

Common forms of the same mistake:

-- Non-SARGable: function on column → full scan
WHERE LOWER(email) = 'alice@example.com'
WHERE DATE(created_at) = '2025-01-15'
WHERE CAST(price AS INT) > 100
WHERE CONCAT(first_name, ' ', last_name) = 'Alice Smith'

-- SARGable equivalents
WHERE email = 'alice@example.com' -- if collation is case-insensitive
WHERE created_at >= '2025-01-15' AND created_at < '2025-01-16'
WHERE price > 100 -- fix the type at the schema level
WHERE first_name = 'Alice' AND last_name = 'Smith'

Three of the four non-SARGable forms have clean rewrites. The first one (LOWER(email)) depends on collation, which is where a lot of real-world cases live.
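
For the date rewrites, the bounds are better computed than typed by hand. A small sketch in Python; the helper names year_range and day_range are hypothetical, and the %s placeholder style is an assumption that depends on the driver in use.

```python
from datetime import date, timedelta

def year_range(year):
    """Half-open [start, end) bounds covering one calendar year."""
    return date(year, 1, 1), date(year + 1, 1, 1)

def day_range(day):
    """Half-open [start, end) bounds covering one calendar day."""
    return day, day + timedelta(days=1)

start, end = year_range(2025)
# Placeholder style (%s) varies by driver; the point is the two bounds.
sql = "SELECT id, status FROM events WHERE created_at >= %s AND created_at < %s"
params = (start.isoformat(), end.isoformat())

print(params)  # ('2025-01-01', '2026-01-01')
```

The half-open form avoids guessing at the last representable instant of the period, so it stays correct whatever precision the timestamp column has.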

The collation case

WHERE LOWER(email) = 'alice@example.com' is almost always a tell that the email column has a case-sensitive collation and the application is hiding it at query time. Two real fixes, one cosmetic fix:

Fix the column. If the data should be matched case-insensitively, give the column a case-insensitive type or collation. In PostgreSQL that’s CITEXT or a nondeterministic ICU collation (for example, one created with locale 'und-u-ks-level2' and deterministic = false, available from PostgreSQL 12; the stock "und-x-icu" collation is still case-sensitive); in MySQL it’s a _ci collation (which is usually the default anyway). Once the column’s collation handles the case folding, WHERE email = 'alice@example.com' is SARGable and fast. This is the right fix when case-insensitivity is a property of the data.

Add a functional (expression) index. If you can’t change the column’s collation (there’s a case-sensitive comparison elsewhere in the schema that depends on the current behavior) index the expression itself:

-- PostgreSQL: functional index
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
-- Now WHERE LOWER(email) = '...' uses the index

-- MySQL 8.0.13+: functional key part (note the double parentheses)
ALTER TABLE users ADD INDEX idx_email_lower ((LOWER(email)));

This works, with caveats. The index’s storage and write cost is real. The predicate has to match the indexed expression exactly: LOWER(email) is indexed, but UPPER(email) isn’t, and the planner won’t translate between them. Every non-SARGable expression you want fast needs its own index.
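
The exact-match requirement can be checked in any engine that supports expression indexes. A sketch using SQLite via Python’s sqlite3 as a stand-in (an assumption; PostgreSQL and MySQL follow the same matching rule but print different plans):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE INDEX idx_users_email_lower ON users (lower(email));
""")

def plan(sql):
    """Return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

matching = plan("SELECT id FROM users WHERE lower(email) = 'alice@example.com'")
different = plan("SELECT id FROM users WHERE upper(email) = 'ALICE@EXAMPLE.COM'")

print(matching)   # a SEARCH that uses idx_users_email_lower
print(different)  # a SCAN: upper(email) is not the indexed expression
```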

Cosmetic fix: case-fold at write time. Store the email as already-lowercased. WHERE email = 'alice@example.com' is now SARGable directly, no expression index needed. This usually requires application changes (whoever’s writing has to remember to case-fold), which is why the functional index is more popular even though it’s heavier. “Where business logic lives” covers the general shape of this decision; case-folding at the database with a generated column (GENERATED ALWAYS AS (LOWER(email)) STORED) is often the cleanest answer when the application can’t be trusted to normalize consistently.

Implicit type conversions are the subtler version

The function isn’t always in the query. Sometimes the planner is adding one:

-- account_id is VARCHAR, literal is numeric
WHERE account_id = 12345

MySQL will silently cast every account_id value to a number for comparison: a per-row function call that kills index usage just as effectively as an explicit CAST(). PostgreSQL is stricter and usually errors, but can still do implicit conversions between compatible types that undermine indexes.

The fix is matching types in both directions: the column type should be what the column is (a numeric ID should be BIGINT, not VARCHAR), and the query should write the literal in the column’s type (WHERE account_id = '12345' if the column is genuinely a string). Either fix works; matching the column type to the data’s real shape is usually the durable answer.

This is also where mixed PK strategies show up. Joining a BIGINT id to a UUID id doesn’t just return wrong results; on MySQL it coerces one side to a string, which is the same implicit-function problem dressed up as a join.

When non-SARGable is acceptable

Not every non-SARGable predicate is a bug. Three cases where it’s fine:

Small tables. A 5,000-row lookup table with a function-wrapped predicate scans in microseconds. The planner isn’t going to use an index on that size anyway. WHERE UPPER(code) = 'NY' on a 50-row states table is not worth worrying about.

One-off analytical queries. A one-time data extract that scans a large table is going to scan it regardless. If the query will never run again, the function call isn’t the bottleneck (the table size is) and adding a functional index to optimize one query isn’t worth the write cost on every future insert.

When the function genuinely can’t be avoided. Some predicates legitimately need to compute. WHERE haversine_distance(lat, lng, user_lat, user_lng) < 10 on a geospatial query can’t be rewritten as a simple range; you need a spatial index (PostGIS, MySQL spatial extensions) to make it SARGable in the geometric sense. The fix is a different kind of index, not a rewrite.

Why natural-language-to-SQL tilts non-SARGable

Schema-reading assistants and text-to-SQL models produce this class of bug more often than hand-written queries do. A user asks “events in 2025”; the closest English-to-SQL mapping is WHERE YEAR(created_at) = 2025, and that’s what the model writes. The correct form (a half-open range) requires knowing the calendar boundary of the year and producing two comparison operators, which is a less-direct translation of the question. WHERE LOWER(email) = 'alice@example.com' is the natural translation of “find the user with this email, case-insensitive,” even when the column’s collation already handles case and the function wrap defeats the index it would otherwise use.

The catalog-level fix is the same one the bigger-picture section below points at: model the column so the natural query is already SARGable. Pick a case-insensitive collation on email, store prices as NUMERIC so no cast is needed, partition or index date columns so the range-literal form performs. When the schema matches the shape of the question, the model’s default translation works. When it doesn’t, the model produces a query that runs clean and scans the table, and no plan inspection is built into the generation loop to catch it.

The bigger picture

Non-SARGable predicates are easy to write, and they come from somewhere: almost always a schema decision that’s being papered over at query time. LOWER(email) hides a collation mismatch. CAST(price AS INT) hides a type that should have been NUMERIC from the start. DATE(created_at) hides the fact that the query is answering a date-range question but written in a way that reads more naturally as an equality. Every one of these is a query-level workaround for a schema-level issue, and every one of them costs an index when the table grows large enough to care.

EXPLAIN ANALYZE is the diagnostic. If the plan shows a sequential scan on a predicate that should hit an index, the predicate is almost certainly non-SARGable; look at what’s wrapping the column. Fix the schema if you can, add a functional index if you can’t, and treat non-SARGable predicates on hot paths as latent performance bugs, not style issues.

Database Deadlocks, Part 2: Diagnosis, Retries, and Prevention

Sun, 02 Mar 2025 00:00:00 +0000

TL;DR

Part 1 covered the patterns. This post is the operational half: reading the deadlock log to identify which pattern fired, designing retries that fail loudly instead of hiding the real bug, isolating hot rows before they become incidents, and the prevention primitives (NOWAIT, SKIP LOCKED, isolation-level changes, lock_timeout) that remove entire categories from the workload.

The patterns in Part 1 answer “why did this deadlock happen?”. This post answers the next three questions every team ends up asking: “which pattern fired?”, “what should the application do when it does?”, and “how do we stop the ones that hurt most?”. Each has its own tooling, failure modes, and production idioms, and almost all of them are learned after the first incident, not before.

The trap worth naming up front: tuning retry logic is easy, reading a deadlock log is harder, and the second is what tells you whether the first matters. A team that only tunes retries is optimizing the symptom without ever seeing the cause. This post is structured in the opposite order: log-reading first, then retries, then the prevention patterns that remove entire categories from the workload.

Reading the MySQL deadlock log

SHOW ENGINE INNODB STATUS dumps the most recent deadlock in the LATEST DETECTED DEADLOCK section. The catch: only the most recent. On a busy system, deadlocks overwrite each other faster than someone can log in and copy the output. Before anything else, turn on innodb_print_all_deadlocks = ON in every production deployment. It writes every deadlock to the error log instead of a single overwriting slot. The volume is negligible, the diagnostic value is high, and there is no downside.

A representative entry looks like this:

*** (1) TRANSACTION:
TRANSACTION 4823941, ACTIVE 3 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 892, OS thread handle 0x7f..., query id 18293 ...
UPDATE orders SET status = 'shipped' WHERE id = 1001

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823941 lock_mode X locks rec but not gap
waiting

*** (2) TRANSACTION:
TRANSACTION 4823942, ACTIVE 2 sec starting index read
UPDATE orders SET status = 'paid' WHERE id = 1002

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823942 lock_mode X locks rec but not gap

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823942 lock_mode X locks rec but not gap
waiting

*** WE ROLL BACK TRANSACTION (2)

The parts that matter:

lock_mode vs. lock_type. X is exclusive, S is shared. locks rec but not gap is a pure record lock; locks gap before rec is a gap lock; the unadorned X under REPEATABLE READ is usually next-key (record + gap). Matching lock_mode S locks rec but not gap against Part 1’s unique-index section tells you immediately that this is a duplicate-key-on-insert deadlock.
index name. index PRIMARY vs. index idx_customer reveals whether the cycle formed on the clustered index or a secondary one. Two transactions approaching the same rows from different indexes is the “secondary index locks on InnoDB” pattern from Part 1; the fix is usually consolidating access paths.
The query text. This is the last statement the transaction executed before the deadlock, not necessarily the one that caused it. A transaction holding locks from three earlier statements can deadlock on the fourth, and the log only shows the fourth. Cross-reference with the application’s structured logs to reconstruct the full transaction.
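trx id is monotonically increasing and stable for the life of the transaction, so it identifies the same transaction across log entries. The general log and slow-query log record connection ids rather than InnoDB transaction ids, so reconstruct the full statement sequence by searching them for the MySQL thread id printed in the same entry (892 above). That works only if general-query logging is on for the window in question, which it usually isn’t.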
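
Once innodb_print_all_deadlocks is on, the fields above are worth extracting mechanically. The following is a rough Python sketch, not a complete parser: the regular expression is an assumption fitted to the entry shown above, and real logs vary by MySQL version.

```python
import re

LOCK_RE = re.compile(
    r"index\s+(?P<index>\S+)\s+of table\s+`(?P<schema>[^`]+)`\.`(?P<table>[^`]+)`"
    r"\s+trx id\s+(?P<trx_id>\d+)\s+lock_mode\s+(?P<mode>[A-Z]+)"
    r"(?:\s+locks\s+(?P<kind>rec but not gap|gap before rec))?"
    r"(?P<waiting>\s+waiting)?"
)

KIND_LABELS = {
    "rec but not gap": "record lock",
    "gap before rec": "gap lock",
    None: "next-key lock (record + gap)",  # the unadorned form
}

def parse_locks(log_text):
    """Extract one dict per RECORD LOCKS entry in a deadlock report."""
    return [
        {
            "trx_id": int(m["trx_id"]),
            "table": f'{m["schema"]}.{m["table"]}',
            "index": m["index"],
            "mode": m["mode"],
            "kind": KIND_LABELS[m["kind"]],
            "waiting": m["waiting"] is not None,
        }
        for m in LOCK_RE.finditer(log_text)
    ]

sample = """\
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823941 lock_mode X locks rec but not gap
waiting

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 48 page no 112 n bits 144 index PRIMARY
of table `shop`.`orders` trx id 4823942 lock_mode X locks rec but not gap
"""

for lock in parse_locks(sample):
    print(lock)
```

Grouping the parsed entries by index and kind over a week of logs is usually enough to tell which Part 1 pattern dominates.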

performance_schema.data_locks and data_lock_waits give a real-time view of current locks and waits. Useful for catching a deadlock-adjacent pathology (long wait chains, hot rows) before the cycle forms:

SELECT
 bl.lock_type, bl.lock_mode, bl.object_name, bl.index_name,
 w.REQUESTING_ENGINE_TRANSACTION_ID AS waiting_trx,
 w.BLOCKING_ENGINE_TRANSACTION_ID AS blocking_trx
FROM performance_schema.data_lock_waits w
JOIN performance_schema.data_locks bl
 ON w.BLOCKING_ENGINE_LOCK_ID = bl.ENGINE_LOCK_ID
WHERE bl.OBJECT_SCHEMA = 'shop';

Reading the PostgreSQL deadlock log

PostgreSQL’s diagnostic story is narrower by design. Deadlocks are logged automatically when the cycle is detected. log_lock_waits = on logs any wait exceeding deadlock_timeout (default 1s), which catches the wait-chain escalation before the detector fires. There’s no equivalent to SHOW ENGINE INNODB STATUS; everything lives in postgresql.log or the extensions you’ve installed.

A representative deadlock entry:

ERROR: deadlock detected
DETAIL: Process 14234 waits for ShareLock on transaction 89234;
 blocked by process 14235.
 Process 14235 waits for ShareLock on transaction 89233;
 blocked by process 14234.
 Process 14234: UPDATE accounts SET balance = balance - 100 WHERE id = 2
 Process 14235: UPDATE accounts SET balance = balance + 50 WHERE id = 1
HINT: See server log for query details.
CONTEXT: while updating tuple (0,18) in relation "accounts"

The ShareLock on transaction X wording is PostgreSQL-specific. One transaction is waiting to see the commit status of another (a row-lock wait manifests as waiting on the holder’s transaction ID). The tuple identifier (0,18) points to the exact physical row (page 0, tuple 18 in the heap), which is useful for reproducing the scenario but changes as rows are updated (MVCC creates new versions at new (page, tuple) locations).

For real-time inspection, pg_locks joined against pg_stat_activity shows live lock state:

SELECT
 a.pid, a.usename, a.state,
 a.wait_event_type, a.wait_event,
 l.mode, l.locktype, l.relation::regclass,
 pg_blocking_pids(a.pid) AS blocked_by,
 LEFT(a.query, 80) AS query
FROM pg_stat_activity a
LEFT JOIN pg_locks l ON l.pid = a.pid AND NOT l.granted
WHERE a.state != 'idle'
ORDER BY a.xact_start;

pg_blocking_pids(pid) returns the array of PIDs blocking a given transaction. Walking it recursively reconstructs the live wait-for graph: the same data the deadlock detector uses, just before a cycle forms. For hot systems, pg_stat_statements combined with pg_stat_activity snapshots at regular intervals builds a picture of which statements accumulate the most wait time, which is almost always the right first place to look.
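
The recursive walk can be done client-side on a snapshot. A sketch in Python, assuming the snapshot has already been loaded into a dict mapping each pid to the list pg_blocking_pids() returned for it; the function name root_blockers and the sample pids are hypothetical.

```python
def root_blockers(blocked_by, pid):
    """Follow blocking edges from pid to the sessions at the head of the chain.

    blocked_by maps each waiting pid to its list of blocking pids; pids that
    are not waiting are absent or map to an empty list.
    Returns (roots, has_cycle).
    """
    roots = set()
    has_cycle = False
    state = {}  # pid -> "active" (on the current path) or "done"

    def visit(current):
        nonlocal has_cycle
        state[current] = "active"
        blockers = blocked_by.get(current, [])
        if not blockers and current != pid:
            roots.add(current)       # waits on nobody: head of the chain
        for blocker in blockers:
            if state.get(blocker) == "active":
                has_cycle = True     # edge back into the current path
            elif blocker not in state:
                visit(blocker)
        state[current] = "done"

    visit(pid)
    return roots, has_cycle

# Hypothetical snapshot: 301 waits on 202, which waits on 101.
snapshot = {301: [202], 202: [101], 101: []}
print(root_blockers(snapshot, 301))  # ({101}, False)
```

The root set is where to look first: those are the sessions holding locks without waiting on anyone, typically a long transaction or an idle-in-transaction connection.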

Row locks are invisible in pg_locks

PostgreSQL’s row-level locks (the result of SELECT ... FOR UPDATE, FK checks, and plain UPDATE/DELETE) are stored on the tuple itself, in the xmax system column, not in the shared lock table. They don’t show up in pg_locks. The only way to see them is through the pgrowlocks extension, which scans the heap directly:

CREATE EXTENSION pgrowlocks;
SELECT * FROM pgrowlocks('accounts');

This is the single biggest difference between PG and InnoDB lock introspection, and the reason PG operators often feel blind to row-level contention until a cycle actually forms.

PostgreSQL SERIALIZABLE: serialization failures are not deadlocks

Under SERIALIZABLE isolation, PostgreSQL uses Serializable Snapshot Isolation (SSI), an optimistic mechanism based on SIREAD predicate locks that track read-write dependencies between transactions. SIREAD locks never block, so SSI itself adds no new way to deadlock (ordinary row locks taken by writes can still form a cycle and raise 40P01). What SSI does is abort one transaction with a serialization failure when it detects a dangerous read-write dependency cycle that would violate serializability.

The two failure codes look similar but have fundamentally different semantics:

40001 serialization_failure. SSI detected a dependency cycle and aborted the transaction before it could commit a non-serializable result. The transaction did nothing wrong; the combination of its operations with a concurrent transaction would have produced an inconsistency. Retrying is always safe and usually succeeds (the concurrent transaction will have committed or aborted, so the second attempt doesn’t see the same conflict).
40P01 deadlock_detected. A cycle in the wait-for graph was broken by killing a victim. Retrying may or may not succeed depending on what caused the cycle. If the cycle was deterministic (two code paths with inconsistent ordering), it will keep recurring.

The practical consequence for retry architecture: an application running under SERIALIZABLE must handle 40001. It’s not a deadlock, it’s the normal failure mode of SSI, and retries are the only recovery path. An application running under READ COMMITTED never sees 40001. An application that handles 40001 identically to 40P01 is correct but coarse. The right granularity is: always retry 40001 (the workload’s own correctness guarantee assumes this); retry 40P01 with caution and escalate on repeat.

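import logging
import random
import time
import psycopg  # psycopg 3; error classes live in psycopg.errors

class TransactionRetryExhausted(Exception):
    """Raised once the retry budget is spent."""

def backoff(attempt, base=0.05):
    # Full jitter: sleep between 0 and base * 2**attempt seconds (base is illustrative).
    time.sleep(random.uniform(0, base * 2 ** attempt))

def retry_on_conflict(fn, max_attempts=3):
    # fn must run one complete transaction and roll back on error.
    for attempt in range(max_attempts):
        try:
            return fn()
        except psycopg.errors.SerializationFailure:
            # 40001: always retry. SSI guarantees make this safe.
            backoff(attempt)
        except psycopg.errors.DeadlockDetected:
            # 40P01: retry with caution; log for root-cause analysis.
            logging.warning("deadlock on attempt %d", attempt + 1, exc_info=True)
            backoff(attempt)
    raise TransactionRetryExhausted(f"gave up after {max_attempts} attempts")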

Retry architecture that doesn’t hide the bug

Every database driver documentation says “deadlocks happen, retry the transaction.” That’s true. It’s also incomplete. The dangers are subtle enough that a naive retry loop becomes part of the problem:

Retries without backoff make the cycle worse. The condition that caused the deadlock (contention on a hot key set) is still in effect when the retry runs. A tight retry loop turns one deadlock into a thundering herd: all victims retry simultaneously, all hit the same contention, all deadlock again. Use exponential backoff with full jitter, capped at a few hundred milliseconds.

Retries mask lock-ordering bugs. If an application deadlocks 10x/minute but retries successfully, the operator sees no failures, but every victim transaction is doing at least 2x the work (original + retry). The deadlock rate itself is a metric worth tracking, not just the post-retry error rate. When the rate grows, the fix is diagnosing the pattern, not tuning the retry limit.

Retries aren’t always safe. A transaction that sent an external notification, wrote to a message queue, or called a non-idempotent HTTP endpoint before the deadlock can’t be blindly retried; the external side effect already happened. Retries belong on database-only transactions, or on transactions where the external calls are idempotent and tolerant of duplicate execution. The boundary is architectural, not a library setting.

Retries need a budget. If a transaction can’t complete after ~3 attempts, the problem is no longer transient contention. It’s either a systemic hot spot or a bug that retries will never resolve. Escalate (alert, fail the request, enqueue for manual review), don’t loop forever.

The retry pattern that works in production:

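import random
import time

def run_with_retry(do_work, max_attempts=3):
    # db, metrics, DeadlockError, SerializationFailure and RetryBudgetExhausted
    # stand in for your own driver, metrics client and exception types.
    for attempt in range(max_attempts):
        try:
            with db.transaction():
                do_work()
            return
        except (DeadlockError, SerializationFailure) as e:
            metrics.increment("db.retry", tags={"error": e.code})
            # Full jitter: sleep up to 100 ms, then 200 ms, then 400 ms.
            time.sleep(random.uniform(0, 0.1 * 2**attempt))
    raise RetryBudgetExhausted()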

Three attempts, exponential backoff up to ~400ms, metrics emitted on every retry, hard failure past the budget. The metric matters as much as the retry: without it, the team never learns which transactions are retrying frequently and why.

Idempotency is a transaction-shape property

A transaction is safe to retry iff re-running it produces the same observable state. That includes downstream side effects. A transaction that writes to a table AND sends a webhook is not safe to retry even if both operations are internally correct: the second attempt sends a duplicate webhook. The fix is the outbox pattern: write the webhook-send intent to a table in the same transaction, then have a separate worker process the outbox with its own idempotency guarantees. This is a non-negotiable part of building deadlock-retry-safe systems at scale.
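
A minimal sketch of the outbox shape, using SQLite via Python’s sqlite3 so it runs anywhere (an assumption; the table layout, ship_order, and drain_outbox are hypothetical names, and a production worker would also need claiming and failure handling).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        sent_at TEXT                -- NULL until a worker has delivered it
    );
    INSERT INTO orders (id, status) VALUES (1001, 'paid');
""")

def ship_order(order_id):
    """State change and webhook intent commit (or roll back) together."""
    with conn:  # sqlite3: commits on success, rolls back on exception
        conn.execute(
            "UPDATE orders SET status = 'shipped' WHERE id = ?", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"event": "order.shipped", "order_id": order_id}),),
        )

def drain_outbox(send):
    """Deliver each unsent row, then mark it; `send` is the real webhook call."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent_at IS NULL ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        send(json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE outbox SET sent_at = datetime('now') WHERE id = ?",
                (row_id,))
    return len(rows)

delivered = []
ship_order(1001)
drain_outbox(delivered.append)
drain_outbox(delivered.append)  # second pass finds nothing left to send
print(delivered)
```

If ship_order is rolled back and retried, the outbox row is rolled back with it, so a retry cannot produce a second webhook. A worker crash between send and the sent_at update can still redeliver, which is why the receiver should deduplicate on the outbox id.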

What schema-reading assistants write, and what they skip

AI-generated database code rarely ships with the retry scaffold above, and the reason is structural: nothing in the query text advertises that it can deadlock. An assistant asked to “bulk-upsert inventory” produces the INSERT ... ON DUPLICATE KEY UPDATE and stops. The query looks atomic from the outside, and the catalog doesn’t expose the shared next-key locks, gap-lock retention on FK checks, or unique-index lock escalation that turn it into a deadlock candidate under load. Retry logic, backoff, and idempotency boundaries are all decisions that live outside the query, in places the schema-reading prompt typically doesn’t reach.

The architectural choices are worse. Outbox-pattern correctness (write to an outbox table in the same transaction, process it separately) depends on understanding which external calls must happen exactly once, which is domain knowledge no catalog carries. A model asked to “send a notification after the update” reliably produces the direct call: clean, passing tests, and unsafe to retry. Treat AI-generated DB code as the ORM layer and nothing more. The queries are often fine in isolation; the retry loop, the idempotency boundary, the outbox separation, and the lock-ordering discipline are the pieces a human review still has to add. The article above is the review checklist.

Consistent lock ordering: the highest-leverage fix

Part 1’s lock-ordering deadlock (two transactions updating the same set of rows in opposite orders) is the single most common production pattern and the one with the highest-leverage fix. If every code path that writes to a set of tables acquires locks in the same order, the wait-for graph literally cannot form a cycle on those rows. The engine still takes the locks, still holds them for the duration of the transaction, but the second transaction waits cleanly for the first instead of grabbing a lock the first will need.

The rule is: sort the rows by a stable key (usually the primary key) in application code, before any SQL is issued. Lock acquisition order in both engines is determined by the order the engine processes rows, which for most write patterns is the order the application submitted them. Sort once up front, and N workers all doing the same thing can’t cycle because they all approach the row set from the same end.

The three batch shapes that matter in practice, and where the ordering actually happens:

1. Loop of per-row UPDATEs. The classic batch worker:

# Sort in the application; the iteration order IS the lock acquisition order.
rows.sort(key=lambda r: r.id)
for row in rows:
    cursor.execute(
        "UPDATE accounts SET balance = balance + %s WHERE id = %s",
        (row.delta, row.id),
    )

Each UPDATE locks its target row at execution time; the loop order determines acquisition order. No SQL-level ORDER BY involved; the fix lives in the .sort() call.

2. Bulk UPSERT. INSERT ... ON DUPLICATE KEY UPDATE (MySQL) or INSERT ... ON CONFLICT (PostgreSQL):

from psycopg2.extras import execute_values

# Sort rows by the unique key BEFORE building the batch.
rows.sort(key=lambda r: r.id)
execute_values(
    cursor,
    "INSERT INTO accounts (id, balance) VALUES %s "
    "ON CONFLICT (id) DO UPDATE SET balance = accounts.balance + EXCLUDED.balance",
    [(r.id, r.delta) for r in rows],
)

The engine processes the VALUES list in order and acquires locks as it goes. Two concurrent batches sorted by the same key approach the key space from the same end; without the sort, one batch might process (1, 2, 3) while another processes (3, 2, 1), each locks its first row, and they deadlock as soon as each reaches a row the other already holds. This is the exact shape of the bulk-UPSERT deadlocks called out in Part 1’s unique-index section.

3. Small multi-row transactions. The canonical bank transfer:

-- Always update the lower id first, regardless of transfer direction.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

For 2–3 rows with per-row different values, the application computes sorted((X, Y)) and issues UPDATEs in that order. Same principle as the batch case, just smaller.
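
As a rough sketch of what “the application computes the order” can look like, the helper below keys each delta by account id and then walks the ids in sorted order, so both transfer directions touch the lower id first. SQLite stands in for the real database purely so the example runs (it has no row-level locks); the transfer function name and its return value are illustrative assumptions.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> list:
    """Move `amount` from src to dst, always updating the lower id first.

    Returns the order in which the rows were touched.
    """
    if src == dst:
        raise ValueError("source and destination must differ")
    deltas = {src: -amount, dst: amount}
    order = sorted(deltas)  # lock order depends on the ids, not on transfer direction
    with conn:  # single transaction: both updates commit or neither does
        for account_id in order:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (deltas[account_id], account_id),
            )
    return order
```

A transfer from 1 to 2 and a concurrent transfer from 2 to 1 both issue their first UPDATE against id 1, so the second transaction queues behind the first instead of holding the row the first needs next.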

Where SELECT ... FOR UPDATE ORDER BY actually earns its keep. Most batches don’t need it; they control lock order through the submission order. The one shape where it’s the right answer is a single UPDATE statement over a derived table where the engine decides scan order and you can’t control it from outside:

UPDATE accounts a
SET balance = a.balance + u.delta
FROM (VALUES (1, 50), (2, 100), (3, 25)) AS u(id, delta)
WHERE a.id = u.id;

Here, sorting the VALUES list in application code doesn’t reliably control lock order; the planner picks the scan. A SELECT id FROM accounts WHERE id IN (...) ORDER BY id FOR UPDATE up front pre-acquires locks in deterministic order before the UPDATE runs. Or refactor into shape 1 or 2.

ORDER BY controls result order, not scan order

ORDER BY on a SELECT ... FOR UPDATE controls the result ordering, but lock acquisition happens during the scan. With a primary-key or unique-index predicate (WHERE id IN (...) on a PK), the planner does ordered index lookups and locks land in ORDER BY order in practice. For non-indexed predicates or range scans on non-unique columns, the planner may scan in a different order and sort results afterward; locks get acquired in scan order. Verify with EXPLAIN before relying on this pattern against non-PK predicates. Also: MySQL’s UPDATE ... ORDER BY syntax applies one SET clause to all matching rows; it doesn’t help when rows need different values.

This sounds trivial. It almost never is in practice; the ordering has to hold across every code path that writes to the same tables: the main request handler, backfill scripts, admin scripts, scheduled jobs, ORM bulk-save methods, and whatever migration scripts run during releases. One path that writes in a different order is enough to reopen the cycle. The durable fix is encoding the order in the access layer so individual query sites can’t diverge from it:

A repository function that always sorts by PK before writing.
A stored procedure or database function that owns the multi-row write.
A service method with the ordering baked in, and a lint rule or review check forbidding direct table writes from elsewhere.

Two places this invariant breaks without anyone noticing:

ORM bulk-save methods. ORMs hide whether they process in input order or reorder internally. Django’s Model.objects.bulk_update, SQLAlchemy’s bulk_update_mappings, ActiveRecord’s upsert_all, and Hibernate’s batch inserts do not all behave the same way: some process in input order, some sort by PK internally, some chunk before doing either. If you can’t tell from docs, test it: two concurrent bulk-saves with opposing-ordered input lists will either deadlock (proving input order matters) or not (proving the ORM sorts internally). Either way, sorting the input collection before handing it to the ORM is cheap insurance.

Batches sourced from a SELECT in the same transaction. A common pattern: “grab N pending rows, process them.” If the feeder SELECT doesn’t have an ORDER BY, rows come back in scan order, non-deterministic across workers, which reopens the cycle. The fix is an explicit ORDER BY on the feeder query, not in the subsequent loop:

# Bad: scan order feeds the loop.
cursor.execute("SELECT id, delta FROM pending WHERE status = 'ready' LIMIT 100")
for row in cursor.fetchall():  # Whatever the scan produced; varies across workers.
    ...

# Good: deterministic order, same across every worker.
cursor.execute(
    "SELECT id, delta FROM pending WHERE status = 'ready' ORDER BY id LIMIT 100"
)
for row in cursor.fetchall():
    ...

The lint-rule angle matters more than it sounds. Deadlock-ordering violations are almost impossible to catch in code review: two PRs that each look correct in isolation can introduce inconsistent ordering when combined. The check that actually works is structural: no direct writes to tables X, Y from anywhere except the repository. Once that invariant is enforced, the ordering invariant follows.

Multi-table transactions need the same rule applied to table order. A transaction that updates users then orders in one code path, and orders then users in another, can deadlock through the FK chain even with per-table row ordering. The rule generalizes: sort rows within a table, and sort tables within a transaction, by a stable convention the whole codebase agrees on (alphabetical is fine; just pick one).
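
One way to picture the generalized rule is a small helper that orders a transaction’s pending writes by (table name, primary key) before any of them is issued. This is a sketch under assumptions: the Write tuple and in_lock_order are hypothetical names, alphabetical table order is just the convention picked here, and inserts that must follow FK dependency order (parent before child) would need that constraint layered on top.

```python
from typing import List, NamedTuple


class Write(NamedTuple):
    table: str     # target table name
    pk: int        # primary key of the row being written
    sql: str       # statement to run
    params: tuple  # bound parameters


def in_lock_order(writes: List[Write]) -> List[Write]:
    """Order pending writes by table name, then primary key."""
    return sorted(writes, key=lambda w: (w.table, w.pk))
```

Two code paths that collect the same writes in different orders end up issuing them identically, which is the property that keeps the wait-for graph acyclic.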

Isolation-level trade-offs

The isolation level a workload runs under determines which deadlock categories even apply. A large share of MySQL deadlock incidents stem from REPEATABLE READ’s gap locks, a category that doesn’t exist on PostgreSQL or on MySQL at READ COMMITTED. Changing the isolation level is the single highest-leverage tuning lever, and also the one with the most potential to quietly break application correctness.

Dropping MySQL from REPEATABLE READ to READ COMMITTED. Under READ COMMITTED, InnoDB still takes row locks but skips most gap locks (they exist only for unique-key and FK enforcement on inserts). Most OLTP workloads don’t need REPEATABLE READ’s range-consistency guarantee. Most application code was designed around READ COMMITTED semantics anyway, because that’s what PostgreSQL and SQL Server default to. Teams migrating to READ COMMITTED typically see deadlock rates drop by an order of magnitude with no functional change.

Avoiding range locks on write paths. Independent of isolation level, replacing SELECT ... FOR UPDATE scans over ranges with point lookups by primary key removes the gap-lock surface entirely on the statements that do it. If a write path doesn’t need to lock “all orders for customer 5,” locking just the specific row it’s about to update is both faster and safer.

FK shared locks are shorter-lived under READ COMMITTED. The foreign-key shared-lock pattern from Part 1 (high-write child tables concentrating shared locks on hot parent rows) has a narrower window under READ COMMITTED simply because the lock lifespan is tied to the statement rather than the transaction. The cycle potential is still there, but the wait window is smaller.

Isolation change is a behavior change, not a tuning knob

READ COMMITTED eliminates most gap locks but also changes the visibility semantics of long-running transactions. Any code that relied on re-reading a row and getting the same result (transfer logic, inventory deductions, financial calculations) has to be re-examined. The safe migration is application-by-application, not database-wide. Run it in staging under production-like load and watch for subtle correctness regressions: “phantom” rows appearing inside a transaction that used to see a stable snapshot, inventory counts that shift mid-transaction, calculations that no longer match because an underlying row changed between reads.

Session-scoped change as a migration path. Both engines let you set isolation level per session or per transaction (SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED in MySQL, SET TRANSACTION ISOLATION LEVEL READ COMMITTED per transaction in PostgreSQL). The usual migration pattern is to start with the most contended code paths, move them to session-scoped READ COMMITTED, monitor for regressions, then expand the scope. A global flip from REPEATABLE READ to READ COMMITTED on a large, stable MySQL deployment is rarely the right first step.

Hot-row isolation: removing the pattern instead of retrying it

When the top N deadlocks on a production system concentrate on a small set of rows (a counter, a config row, an AUTO_INCREMENT source of truth), retries don’t converge. Every retry hits the same row, takes the same lock, cycles with the same peers. The fix is removing the row from the hot path, not tuning the retry layer.

Three patterns that work:

Counter tables with sharding, for extreme write-hot counters only. Reach for this only when the counter is taking thousands of writes per second against a single row and the simpler options below aren’t viable. For anything less, the queue pattern below or an external store (Redis atomic INCR, a time-series DB) is almost always the better answer: less complexity, no schema overhead, no sum-across-rows read cost. Sharded counters are the specialized escalation, not the default.

When it does fit the workload: N physical shards per logical counter, keyed on a compact integer (counter_id, shard_id) composite. Application code picks the shard. Keeping the random choice out of SQL makes it portable across engines and testable independently:

CREATE TABLE counter_shards (
    counter_id BIGINT NOT NULL,   -- FK to a counters metadata table if you need names
    shard_id SMALLINT NOT NULL,   -- 0..N-1, fixed per-counter
    value BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (counter_id, shard_id)
);

-- Seed the shards once per counter (e.g., when the counter is created):
INSERT INTO counter_shards (counter_id, shard_id, value)
VALUES (42, 0, 0), (42, 1, 0), (42, 2, 0), (42, 3, 0),
       (42, 4, 0), (42, 5, 0), (42, 6, 0), (42, 7, 0),
       (42, 8, 0), (42, 9, 0), (42, 10, 0), (42, 11, 0),
       (42, 12, 0), (42, 13, 0), (42, 14, 0), (42, 15, 0);

import random

# Increment: application picks the shard. Portable, cheap, no SQL-side RAND().
shard = random.randrange(16)
cursor.execute(
    "UPDATE counter_shards SET value = value + 1 "
    "WHERE counter_id = %s AND shard_id = %s",
    (42, shard),
)

# Read: sum across shards.
cursor.execute(
    "SELECT COALESCE(SUM(value), 0) FROM counter_shards WHERE counter_id = %s",
    (42,),
)

16 shards turn one hot row into 16 warm rows. Contention on any single row falls roughly in proportion to the shard count. The read cost is one SUM across N rows instead of a single-row SELECT, usually acceptable for counter use cases; if not, cache the aggregate.

A common refinement is deriving the shard deterministically from the connection or worker ID (e.g., connection_id % 16) so each worker consistently hits the same shard. That improves InnoDB buffer-pool locality and reduces cross-shard interference, at the cost of slightly less even distribution if worker counts aren’t balanced.
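
A small sketch of the deterministic variant, assuming workers are identified by a string such as a hostname or pool slot (the shard_for name and the string-ID assumption are illustrative). It hashes with CRC32 rather than Python’s built-in hash(), because hash() of a string is salted per process and would send the same worker to a different shard after every restart.

```python
import zlib

NUM_SHARDS = 16  # must match the number of shard rows seeded per counter


def shard_for(worker_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a worker identifier to a stable shard in 0..num_shards-1."""
    # CRC32 gives the same value in every process and on every host.
    return zlib.crc32(worker_id.encode("utf-8")) % num_shards
```

With integer connection IDs the plain connection_id % 16 from the text is enough; the hash only matters when the identifier is not already a number.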

Advisory locks for app-level serialization. Both MySQL and PostgreSQL support advisory locks: named locks that exist outside the table model and don’t take row locks. For operations that need to be serialized at the application level (leader election, rate limiting, config migration), advisory locks are dramatically cheaper than row locks and can’t participate in a table-lock cycle:

-- PostgreSQL: advisory lock keyed by a bigint.
SELECT pg_advisory_xact_lock(hashtext('refresh_cache:customer_42'));
-- Lock released at transaction end. Only one worker per key runs at a time.

-- MySQL equivalent:
SELECT GET_LOCK('refresh_cache:customer_42', 10);
-- Returns 1 if acquired, 0 on timeout. Must explicitly RELEASE_LOCK.

The caveats: advisory locks are application-layer discipline; they don’t enforce anything the database checks. Use them where the application chooses to serialize, not where correctness requires it.
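
PostgreSQL’s advisory-lock functions take a bigint key, while hashtext yields a 32-bit value, so distinct names collide more often than the key space requires. One alternative, sketched here under the assumption that the application is free to compute the key itself, derives a signed 64-bit integer from the lock name; advisory_key is a hypothetical helper name.

```python
import hashlib


def advisory_key(name: str) -> int:
    """Derive a signed 64-bit integer from a lock name, suitable as a bigint key."""
    digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
    # signed=True keeps the value inside PostgreSQL's bigint range.
    return int.from_bytes(digest, byteorder="big", signed=True)


# Usage with a DB-API cursor (not executed here):
# cursor.execute("SELECT pg_advisory_xact_lock(%s)",
#                (advisory_key("refresh_cache:customer_42"),))
```

Every process that wants to serialize on the same name has to derive the key the same way, which is one more reason to keep the helper in a shared module.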

Queue patterns instead of direct updates. For counter-like workloads, write intents to an append-only table and aggregate periodically:

INSERT INTO counter_events (counter_key, delta, created_at) VALUES (?, ?, NOW());
-- No contention: every insert creates a new row.

-- Periodic aggregation job:
INSERT INTO counter_totals (counter_key, total)
SELECT counter_key, SUM(delta) FROM counter_events
WHERE processed = FALSE
GROUP BY counter_key
ON CONFLICT (counter_key) DO UPDATE SET total = counter_totals.total + EXCLUDED.total;
-- In the same transaction, mark exactly the rows just aggregated as processed;
-- otherwise the next run counts them again.

Trades real-time accuracy for throughput. The right trade-off for page-view counters, metric accumulation, any workload where eventual consistency is acceptable.
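
The aggregation job has one subtlety: each event must be folded into the total exactly once, even while new events keep arriving. A sketch of one way to get that, using SQLite so it runs standalone: claim the unprocessed rows with a batch marker first, then aggregate only the claimed rows, all in one transaction. The batch_id column is an assumption that stands in for the processed flag, and SCHEMA and aggregate_once are illustrative names.

```python
import sqlite3
import uuid

# Hypothetical schema: batch_id is NULL until an aggregation run claims the event.
SCHEMA = """
CREATE TABLE counter_events (
    id INTEGER PRIMARY KEY,
    counter_key TEXT NOT NULL,
    delta INTEGER NOT NULL,
    batch_id TEXT
);
CREATE TABLE counter_totals (counter_key TEXT PRIMARY KEY, total INTEGER NOT NULL);
"""


def aggregate_once(conn: sqlite3.Connection) -> int:
    """Fold unclaimed events into counter_totals; returns how many were folded."""
    batch_id = uuid.uuid4().hex
    with conn:  # claim and aggregate commit together
        claimed = conn.execute(
            "UPDATE counter_events SET batch_id = ? WHERE batch_id IS NULL",
            (batch_id,),
        ).rowcount
        # Only rows claimed by this run are summed, so late arrivals wait for the next run.
        conn.execute(
            """
            INSERT INTO counter_totals (counter_key, total)
            SELECT counter_key, SUM(delta) FROM counter_events
            WHERE batch_id = ?
            GROUP BY counter_key
            ON CONFLICT (counter_key) DO UPDATE
                SET total = counter_totals.total + excluded.total
            """,
            (batch_id,),
        )
    return claimed
```

Claiming before summing means an event inserted between the two statements is neither counted nor marked, so it is simply picked up by the next run.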

Hot parent rows behind FK chains. Part 1 described how a high-write child table concentrates shared locks on a hot parent row, and how any exclusive-lock request on that parent (a name update, a soft-delete, a trigger-driven counter update) becomes a contention point. Two levers that work:

Move high-frequency parent-row updates to a side table. The last_activity_at timestamp, the cached counter, the updated_at that a trigger bumps on every child insert; none of these need to live on the parent table. Moving them to customer_activity(customer_id, last_seen_at) or customer_counters(customer_id, order_count) eliminates the exclusive-lock contention entirely. The parent row stops changing on hot paths, the shared locks from FK checks coexist fine, and the cycle potential disappears.
Narrow the FK scope where integrity can tolerate it. Not every child table needs an enforced FK to every parent. Logs, events, and audit tables are often the biggest offenders, and often have the least need for strict integrity (an orphaned log row is rarely a correctness problem). Dropping the FK removes the shared-lock dependency entirely. This trades integrity for throughput, a decision that belongs with the team owning the data, not a default.

Under READ COMMITTED, both levers matter less because the FK shared locks release at statement end rather than transaction end. A workload that runs on REPEATABLE READ and can’t change isolation level (because of application semantics) gets the most benefit from these two fixes.

Long-running transactions are a deadlock amplifier

The longer a transaction holds locks, the wider the window for a cycle. Two patterns recur in production:

Application-layer long transactions. A transaction that opens at request start, makes several queries, calls an external API, then commits. The external call is where the transaction actually spends its time: seconds of open transaction holding row locks the whole time. Every concurrent transaction that touches those rows waits. Deadlock probability scales with transaction duration. The fix is the inverse of the outbox pattern: do the external call outside the transaction, passing in any needed IDs.

Idle-in-transaction sessions. A session that runs BEGIN, some writes, then stalls; idle but not committed. In PostgreSQL, this blocks vacuum on touched tables, bloats MVCC, and holds locks indefinitely. pg_stat_activity shows state = 'idle in transaction'. MySQL’s equivalent is a thread with an open transaction and no current query.

PostgreSQL has a first-class timeout for this; MySQL does not:

-- PostgreSQL: kill idle-in-transaction sessions after 5 minutes.
-- (Spell out the unit; a bare integer is interpreted as milliseconds.)
SET idle_in_transaction_session_timeout = '5min';

MySQL has no direct equivalent. wait_timeout and interactive_timeout govern idle connections, not sessions idle inside an open transaction. A connection that did BEGIN then stopped sending queries will hold its locks until the connection drops or the client commits. The production workaround is either a watchdog script (e.g., Percona’s pt-kill) that polls information_schema.innodb_trx and terminates transactions exceeding a duration threshold, or a connection pool with per-connection transaction lifetimes. Connection pools that acquire a connection, start a transaction, then return the connection to the pool without committing (rare but real) will produce sessions that live indefinitely otherwise.

Finding long-running transactions:

-- PostgreSQL
SELECT pid, usename, state, xact_start, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL AND now() - xact_start > INTERVAL '30 seconds'
ORDER BY duration DESC;

-- MySQL
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_rows_locked, trx_query
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 30;

Alerting on any transaction exceeding 30s in an OLTP workload catches most of the long-transaction-induced deadlocks before they produce incidents.

Triggers and cascades are invisible lock sources

A trigger that updates a second table on every write to the first adds an edge to the wait-for graph that isn’t visible in the original query. ON DELETE CASCADE foreign keys behave similarly: one delete can take locks on every child row in the cascade, and if the cascade order differs between two concurrent deletes, they can deadlock through tables neither statement directly referenced.

This is the origin of the “why is my DELETE FROM users deadlocking against an INSERT INTO events?” question. The DELETE triggered a cascade to user_preferences, which had a trigger that updated a counter in tenants, which was locked by the INSERT. Four tables in the cycle, two in the application’s explicit query, zero mention of the other two in any log entry until someone reads the DDL.

The operational pattern: when a deadlock log mentions a table the application’s code doesn’t explicitly reference, check (1) FK cascades on the tables that are in the query, (2) triggers on those tables, (3) generated columns that fire on update. All three are non-obvious lock sources, all three are fixable, but only after they’re identified.

`innodb_autoinc_lock_mode` and the AUTO-INC lock

MySQL InnoDB’s AUTO_INCREMENT column has its own lock, historically a source of contention and occasional deadlock. The innodb_autoinc_lock_mode parameter controls the behavior:

Mode 0 (traditional). Table-level AUTO-INC lock held for the duration of the statement. Serialized across inserts. Safe for statement-based replication, terrible for concurrency.
Mode 1 (consecutive). A lighter lock for simple inserts (single-row or known-row-count), and the traditional table lock for bulk inserts (INSERT ... SELECT). Was the default in 5.7.
Mode 2 (interleaved). No AUTO-INC table lock; IDs are assigned per-row as needed, possibly interleaved across concurrent statements. Default in MySQL 8.0. Fastest, and correct for row-based replication (which is also the 8.0 default).

The mode-2 default in 8.0 eliminated a substantial source of historical deadlocks and contention. Bulk inserts that used to serialize on the AUTO-INC lock now proceed in parallel. If you’re migrating from 5.7 to 8.0, this is a free win. If you’re still on binlog_format = STATEMENT (uncommon but not unheard of in legacy deployments), you cannot safely run mode 2; the replica may generate different IDs than the source, corrupting the data. Switch to binlog_format = ROW first, then adopt mode 2.

DDL migration windows and `lock_timeout`

Online schema change isn’t deadlock-prone in the classical sense, but it interacts with deadlocks in a specific operational way: DDL takes heavy locks that queue behind ongoing DML, and while the DDL waits, every subsequent query on that table queues behind the DDL. In PostgreSQL, a DDL taking ACCESS EXCLUSIVE that waits for an existing long-running SELECT will cause every new SELECT to wait behind the DDL. The system grinds to a halt, and application logs fill with timeout errors that look like deadlocks but aren’t. It’s a queue, not a cycle.

The standard prevention idiom for PostgreSQL migrations:

SET lock_timeout = '2s';
ALTER TABLE orders ADD COLUMN notes TEXT;

If the ALTER can’t acquire its lock in 2 seconds, it fails instead of queueing. The migration tool catches the error and retries with backoff. This prevents the queue-behind-DDL outage entirely; the cost is that some migrations need multiple attempts to land, which is almost always the right trade-off.
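
The catch-and-retry half of that idiom might look roughly like the sketch below. Everything here is an assumption for illustration: execute is any callable that runs one SQL statement in autocommit mode, lock_error is whatever exception the driver raises when lock_timeout expires (with psycopg2 that is psycopg2.errors.LockNotAvailable), and the attempt count and delays are placeholders to tune.

```python
import random
import time
from typing import Callable, Type


def run_ddl_with_lock_timeout(
    execute: Callable[[str], None],
    ddl: str,
    lock_error: Type[Exception],
    attempts: int = 5,
    lock_timeout: str = "2s",
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `ddl`, failing fast on lock waits and retrying with jittered backoff.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            execute(f"SET lock_timeout = '{lock_timeout}'")
            execute(ddl)
            return attempt
        except lock_error:
            if attempt == attempts:
                raise  # budget exhausted: surface the failure to the operator
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable when attempts >= 1")
```

Because each failed attempt gives up its place in the lock queue, ordinary queries keep flowing between attempts instead of piling up behind the waiting ALTER.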

MySQL’s equivalent tooling is pt-online-schema-change (Percona) and gh-ost (GitHub). Both create a copy of the table, stream writes to both via trigger or binlog, and swap at the end. They run concurrent DML against the original and the copy, so they inflate deadlock rates during the migration window: not because the tool is buggy, but because there are now more transactions touching the same rows. The operational practice: run migrations at low-traffic windows, watch the deadlock counter during the run (not just replica lag), and have a rollback path ready.

DDL inside transactions is engine-dependent

PostgreSQL supports transactional DDL: BEGIN; ALTER TABLE ...; COMMIT; is atomic. MySQL does not; every DDL statement implicitly commits the current transaction. A migration script that assumes it can roll back mid-migration works on PostgreSQL and silently half-applies on MySQL. Know which engine you’re writing migrations for.

`NOWAIT` and `SKIP LOCKED` as prevention primitives

Both engines support two SQL-level concurrency primitives that remove the need for application-layer deadlock handling in specific patterns:

SELECT ... FOR UPDATE NOWAIT. If the row is locked by another transaction, fail immediately with an error instead of waiting. Useful for user-facing paths where “I can’t get this resource right now” is a better UX than “wait 500ms and maybe deadlock anyway.” Also useful for detecting lock contention synthetically in tests.
SELECT ... FOR UPDATE SKIP LOCKED. If rows are locked by another transaction, skip them and return only rows the current transaction can lock. Transforms a contended queue-processor pattern into a lock-free one: N workers each grab a different set of rows, zero contention, zero deadlocks.

-- Queue processor: deadlock-free, contention-free.
SELECT * FROM jobs
WHERE status = 'pending'
ORDER BY priority, created_at
LIMIT 10
FOR UPDATE SKIP LOCKED;

-- Fast-fail acquisition: don't wait, fail now.
SELECT * FROM leader_election
WHERE resource_id = 'cache-refresh'
FOR UPDATE NOWAIT;

SKIP LOCKED arrived in PostgreSQL 9.5 and MySQL 8.0. Before those versions, queue-processor patterns required either advisory locks or application-level coordination (Redis, Zookeeper). Post-SKIP LOCKED, they can live entirely in the database with a single primitive. For any workload where workers pull from a shared queue, this is the pattern, not retry loops on FOR UPDATE.

Monitoring that actually catches regressions

The single most useful metric is deadlock rate over time. Not error rate, not retry rate; the raw count of deadlocks per minute or per thousand transactions. A workload with 0.1 deadlocks per thousand transactions is healthy; 10 per thousand is a paging threshold; 100 per thousand means retries aren’t converging and something is structurally wrong.

For MySQL: there’s no Innodb_deadlocks status variable. The correct source is performance_schema.events_errors_summary_global_by_error, which is enabled by default:

-- Cumulative deadlock count (compare over time windows).
SELECT SUM_ERROR_RAISED AS deadlock_count
FROM performance_schema.events_errors_summary_global_by_error
WHERE ERROR_NAME = 'ER_LOCK_DEADLOCK';

-- Lock-wait activity (useful for adjacent contention, NOT a deadlock counter):
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_waits';

-- Plus the error log, searchable for "LATEST DETECTED DEADLOCK"
-- once innodb_print_all_deadlocks=ON.

Innodb_row_lock_waits is commonly misread as a deadlock counter, but the manual defines it as “the number of times operations on InnoDB tables had to wait for a row lock,” which is contention in general. Pair it with the events_errors_summary query, not in place of it.

For PostgreSQL:

-- pg_stat_database exposes per-database deadlock counter.
SELECT datname, deadlocks, xact_commit, xact_rollback
FROM pg_stat_database
WHERE datname = current_database();

Scrape both into Prometheus (mysqld_exporter and postgres_exporter both expose these), compute the rate, alert on sustained rises. Pair the deadlock rate with a retry rate from the application layer: a spike in one without the other means either the retry logic is broken or the workload shape changed. A spike in both means a real regression.
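
Since both engines expose cumulative counters, the rate is the difference between two samples. The sketch below shows that arithmetic against the thresholds quoted earlier (10 per thousand pages, 100 per thousand is structural); Sample, deadlocks_per_thousand, and classify are hypothetical names, and in practice the same calculation would usually live in a Prometheus rule rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    deadlocks: int     # cumulative, e.g. pg_stat_database.deadlocks
    transactions: int  # cumulative, e.g. xact_commit + xact_rollback


def deadlocks_per_thousand(prev: Sample, curr: Sample) -> float:
    """Deadlocks per thousand transactions between two counter samples."""
    txns = curr.transactions - prev.transactions
    if txns <= 0:  # no traffic in the window, or the counters were reset
        return 0.0
    return 1000.0 * max(curr.deadlocks - prev.deadlocks, 0) / txns


def classify(rate: float) -> str:
    """Map a rate onto the thresholds used in this article."""
    if rate >= 100:
        return "structural"  # retries are not converging
    if rate >= 10:
        return "page"
    return "ok"
```

Normalizing by transaction count rather than wall-clock time keeps the alert meaningful across traffic peaks and quiet hours.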

Beyond the rate itself, the top-K pairs of statements involved in deadlocks (extracted from innodb_print_all_deadlocks logs or PG’s deadlock log entries) identify exactly which code paths are fighting. This list rarely changes: the same two or three patterns account for most deadlocks in any given system. Fix those and the rate drops by an order of magnitude.

The mental model for Part 2

Part 1’s patterns answer why deadlocks happen. Part 2’s operations answer what to do about them, and the useful reframe is that the answer is almost never “tune the retry logic.” Retries are the recovery mechanism that keeps the application running while the actual fix lands. The actual fix is almost always one of:

Identify the pattern from the log. This is step zero; skipping it means you’re tuning blind.
Enforce consistent lock ordering at the access-layer level. Highest-leverage fix for the lock-ordering pattern; deterministically eliminates the cycle rather than shrinking its window.
Change the code path to use SKIP LOCKED / NOWAIT where the pattern matches (queue processors, resource acquisition).
Isolate hot rows (counter tables, shards, advisory locks, queue patterns; move high-frequency parent-row updates to side tables).
Shorten transactions. Move external calls out, enforce idle-transaction timeouts.
Drop isolation level where the workload allows it; session-scoped first, global only after regression testing. Eliminates the gap-lock category on MySQL.
Remove cascades/triggers from the hot path when they’re the hidden lock source.
Handle SERIALIZABLE’s 40001 as a normal event if you’re on SERIALIZABLE, and don’t confuse it with 40P01.
Plan DDL windows with lock_timeout and watch the deadlock counter through the migration.

Retries let the application survive while the fix is in flight. Monitoring tells you which fix to prioritize. Each of the above removes a category from the workload entirely. The goal is a system where the few remaining deadlocks are rare enough that the retry layer handles them invisibly and the team’s attention can go elsewhere. Not zero (that’s a theoretical fiction at realistic concurrency), but managed.

Database Deadlocks, Part 1: The Patterns

Thu, 13 Feb 2025 00:00:00 +0000

TL;DR

A deadlock is two transactions each holding a lock the other needs, caught in a cycle the engine breaks by killing one. The patterns are finite and repeatable: inconsistent lock ordering across workers, InnoDB gap locks under REPEATABLE READ, foreign-key shared locks on hot parent rows, unique-index conflicts, index-scan lock amplification, and parallelism patterns that only surface on replicas or under worker-pool load. This post is the patterns. Part 2 covers diagnosis, retry architecture, and prevention.

Deadlocks occupy a strange place in production operations. They’re rare enough that most engineers haven’t thought hard about them, frequent enough in high-concurrency workloads to show up as paging incidents, and subtle enough that the first instinct (“just retry”) is right often enough to keep the root cause hidden. The transaction that got killed was syntactically perfect. The one that survived was too. The bug wasn’t in either statement; it was in the order the two transactions touched rows.

That makes deadlocks harder to reason about than most database failures. The query text in the error log isn’t wrong. The lock it was waiting for isn’t held by a misbehaving process. The system is doing exactly what concurrency control says it should. The failure mode is the interaction between transactions, and those interactions are almost never visible from any single query.

This post covers the patterns: the shapes deadlocks take and why each one exists. The companion post covers reading the deadlock log end-to-end, retry architecture, hot-row isolation, SERIALIZABLE’s serialization failures, DDL migration windows, and NOWAIT / SKIP LOCKED as prevention primitives.

What a deadlock actually is

A deadlock is a cycle in the wait-for graph. Transaction A holds lock L1 and needs L2; transaction B holds L2 and needs L1. Neither can proceed. The only way out is to kill one: pick a victim, roll it back, release its locks, and let the other complete. Every modern relational database does this automatically: InnoDB checks for a cycle as soon as a transaction has to wait for a lock, and PostgreSQL checks once a wait has lasted deadlock_timeout (1 second by default).

The three preconditions are always the same:

Two or more transactions hold locks.
Each needs a lock the other holds.
The locks can’t be acquired atomically (there’s no single “grab both or grab nothing” operation).

Remove any one and the deadlock can’t form. In practice, that’s the shape of every prevention strategy: reduce the number of locks held concurrently, reduce the duration they’re held, or make the acquisition order consistent across all code paths.
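
The cycle definition is easy to make concrete. What follows is a minimal sketch of cycle detection over a wait-for graph, assuming the graph is already available as a plain dictionary; the function name and the dictionary shape are illustrative, not how either engine implements its detector.

```python
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle in the wait-for graph as a list of names, or None.

    waits_for maps a transaction to the transactions whose locks it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(txn: str) -> Optional[List[str]]:
        color[txn] = GREY
        path.append(txn)
        for holder in waits_for.get(txn, []):
            state = color.get(holder, WHITE)
            if state == GREY:
                # holder is already on the current path, so this edge closes a cycle
                return path[path.index(holder):] + [holder]
            if state == WHITE:
                found = visit(holder)
                if found:
                    return found
        path.pop()
        color[txn] = BLACK
        return None

    for txn in list(waits_for):
        if color.get(txn, WHITE) == WHITE:
            found = visit(txn)
            if found:
                return found
    return None


deadlocked = {"A": ["B"], "B": ["A"]}        # A waits on B, B waits on A
queued = {"A": ["B"], "B": ["C"], "C": []}   # a long queue, but no cycle

print(find_cycle(deadlocked))  # ['A', 'B', 'A']
print(find_cycle(queued))      # None
```

The second graph is the shape of a lock wait timeout rather than a deadlock: everyone is waiting, but the chain ends somewhere.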

Deadlock vs. lock wait timeout

A deadlock is a cycle. A lock wait timeout is a long queue: transaction A is waiting for transaction B, which is waiting for something reasonable, which is taking too long. No cycle, no victim selection, just a timer expiring. Both produce errors that look similar in application logs, but they’re entirely different failure modes with different fixes. innodb_lock_wait_timeout (MySQL) and lock_timeout (PostgreSQL) govern the second. The deadlock detector is a separate mechanism that fires independently of those timers.
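
Because the two failures look alike in logs, it helps to separate them by error code before deciding what to do. A minimal sketch, assuming the driver exposes MySQL's numeric error code or PostgreSQL's SQLSTATE; the function and constant names are hypothetical.

```python
DEADLOCK = "deadlock"
LOCK_WAIT_TIMEOUT = "lock_wait_timeout"
OTHER = "other"

# MySQL reports numeric error codes: 1213 is ER_LOCK_DEADLOCK,
# 1205 is ER_LOCK_WAIT_TIMEOUT.
_MYSQL = {1213: DEADLOCK, 1205: LOCK_WAIT_TIMEOUT}

# PostgreSQL reports SQLSTATEs: 40P01 is deadlock_detected, 55P03 is
# lock_not_available (raised when lock_timeout expires, and also by NOWAIT).
_POSTGRES = {"40P01": DEADLOCK, "55P03": LOCK_WAIT_TIMEOUT}


def classify_lock_error(engine: str, code) -> str:
    """Map a driver-reported error code to the failure mode it represents."""
    if engine == "mysql":
        return _MYSQL.get(code, OTHER)
    if engine == "postgresql":
        return _POSTGRES.get(code, OTHER)
    return OTHER


print(classify_lock_error("mysql", 1213))          # deadlock
print(classify_lock_error("postgresql", "55P03"))  # lock_wait_timeout
```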

The canonical lock-ordering deadlock

The single most common production deadlock is two transactions updating the same two rows in opposite orders:

-- Transaction A
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

-- Transaction B (concurrent)
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2;
UPDATE accounts SET balance = balance + 50 WHERE id = 1;
COMMIT;

If A and B interleave such that A takes the row-level lock on row 1, B takes the row-level lock on row 2, and each then tries to grab the other row, the engine has a cycle. One gets killed.

sequenceDiagram
 participant A as Transaction A
 participant R1 as Row id=1
 participant R2 as Row id=2
 participant B as Transaction B

 Note over A,B: t=0, both transactions begin
 A->>R1: UPDATE (acquire X-lock)
 R1-->>A: granted
 B->>R2: UPDATE (acquire X-lock)
 R2-->>B: granted

 Note over A,B: t=1, each reaches for the other's row
 A->>R2: UPDATE (request X-lock)
 R2--xA: BLOCKED (held by B)
 B->>R1: UPDATE (request X-lock)
 R1--xB: BLOCKED (held by A)

 Note over A,B: Wait-for graph has a cycle: A → B → A
 Note over A,B: Detector fires; victim: Transaction B
 B->>B: ROLLBACK, release R2 lock
 R2-->>A: now granted
 A->>A: COMMIT

The key property is that neither transaction is wrong in isolation. Each acquires locks in an order that’s locally correct. The cycle forms in the global ordering across concurrent sessions, which no single query can see. That’s the defining shape of the pattern: correct code, in both cases, interacting at the transaction boundary. The fix is in how all code paths agree on an ordering, covered in Part 2: Consistent lock ordering.
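
As an in-memory analogy of what "agreeing on an ordering" means, here is a sketch that uses Python threads and mutexes in place of transactions and row locks. The account ids, balances, and function names are made up for illustration, and the sketch assumes the two ids in a transfer are always different.

```python
import threading

balances = {1: 10_000, 2: 10_000}
row_locks = {1: threading.Lock(), 2: threading.Lock()}


def transfer(src: int, dst: int, amount: int) -> None:
    """Move amount from src to dst; assumes src != dst."""
    # Always lock the lower id first, whatever the direction of the transfer.
    # Both workers then agree on one global order, so no cycle can form.
    first, second = sorted((src, dst))
    with row_locks[first]:
        with row_locks[second]:
            balances[src] -= amount
            balances[dst] += amount


def run(src: int, dst: int, amount: int, times: int) -> None:
    for _ in range(times):
        transfer(src, dst, amount)


# Two workers moving money in opposite directions, as in transactions A and B.
a = threading.Thread(target=run, args=(1, 2, 2, 500))
b = threading.Thread(target=run, args=(2, 1, 1, 500))
a.start()
b.start()
a.join()
b.join()

print(balances)  # {1: 9500, 2: 10500}
```

If `transfer` locked `src` before `dst` instead of sorting, the two workers would take the locks in opposite orders and could hang exactly the way the diagram shows.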

InnoDB gap locks turn inserts into deadlock sources

MySQL’s default isolation level is REPEATABLE READ, and under that isolation level, InnoDB takes next-key locks: a row lock plus a gap lock on the range before it. The gap lock prevents other transactions from inserting into that range, which is how REPEATABLE READ keeps range queries consistent across re-execution.

The consequence: a SELECT ... FOR UPDATE or an UPDATE with a range predicate locks not just the matching rows, but the gaps between them. Two concurrent transactions that both try to insert into the same gap can deadlock without sharing a single row:

-- Table: orders(id BIGINT PK, customer_id BIGINT, amount_cents BIGINT)
-- Index on customer_id
-- Existing rows: customer_id = 5 has orders with ids 100, 200, 300

-- Transaction A
BEGIN;
SELECT * FROM orders WHERE customer_id = 5 FOR UPDATE;
-- Takes next-key locks covering ids 100, 200, 300 AND the gaps between them,
-- AND the gap after 300 extending to the next customer_id.

-- Transaction B (concurrent)
BEGIN;
INSERT INTO orders (id, customer_id, amount_cents) VALUES (250, 5, 10000);
-- Blocks: tries to insert into a gap A has locked.

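-- Transaction A
INSERT INTO orders (id, customer_id, amount_cents) VALUES (150, 5, 5000);
-- A's own gap locks never block A. A cycle needs a second edge: B must already hold
-- a lock A now requests, for example a gap lock B took on this range earlier.
-- Gap locks are compatible with each other, but each blocks the other's insert.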

Two transactions inserting into what looks like “different rows” can still cycle through gap locks. The failure mode is especially insidious because the EXPLAIN plan doesn’t mention gaps - only rows - and the lock information in SHOW ENGINE INNODB STATUS requires reading the next-key notation carefully.

The category is peculiar to InnoDB, and mostly to REPEATABLE READ. PostgreSQL prevents phantom reads through MVCC snapshot isolation starting at its own REPEATABLE READ level (stricter than the SQL standard requires) without any range-locking mechanism, so the whole class of gap-lock deadlocks doesn’t exist on PG, at any isolation level. Switching MySQL to READ COMMITTED is the first lever most teams reach for once this pattern dominates their deadlocks: it disables gap locks for searches and index scans but retains them for foreign-key and duplicate-key checking, so it shrinks the gap-lock category without eliminating it. The isolation-level trade-off and the “avoid range locks on write paths” refactor both live in Part 2: Isolation-level trade-offs.
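
To see why two inserts of "different rows" can meet on the same lock, it helps to compute which gap each new key lands in. This is a toy model only: it assumes plain integer keys in a single index and ignores InnoDB details such as the supremum pseudo-record; the function names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Gap = Tuple[Optional[int], Optional[int]]


def gap_for(existing_keys: List[int], new_key: int) -> Gap:
    """Return the open interval (lower, upper) between existing keys that new_key falls in.

    None marks the open end below the first key or above the last one.
    """
    keys = sorted(existing_keys)
    i = bisect_left(keys, new_key)
    if i < len(keys) and keys[i] == new_key:
        raise ValueError("key already exists: that is a record, not a gap")
    lower = keys[i - 1] if i > 0 else None
    upper = keys[i] if i < len(keys) else None
    return (lower, upper)


def same_gap(existing_keys: List[int], key_a: int, key_b: int) -> bool:
    """True when both new keys would be covered by a lock on the same gap."""
    return gap_for(existing_keys, key_a) == gap_for(existing_keys, key_b)


ids = [100, 200, 300]
print(gap_for(ids, 150))        # (100, 200)
print(gap_for(ids, 250))        # (200, 300)
print(gap_for(ids, 350))        # (300, None)
print(same_gap(ids, 120, 180))  # True
```

A range scan under REPEATABLE READ locks every one of these gaps in the scanned range, which is why the inserts of 150 and 250 in the example above are both covered even though they fall in different gaps.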

Unique-index deadlocks are a category of their own

The detailed patterns are covered in Uniqueness and Selectivity, but the shape worth naming here: when InnoDB detects a duplicate-key error on an INSERT, it acquires a shared lock on the conflicting index record before raising the error. Under REPEATABLE READ that shared lock is next-key (record + gap). Under READ COMMITTED, gap locks are mostly disabled, but duplicate-key checking is one of the documented exceptions where gap locking still occurs, so dropping isolation alone doesn’t eliminate the category. Two concurrent transactions inserting toward the same unique key end up holding shared locks and waiting for each other to release: a deadlock caused entirely by the uniqueness check, not by the rows the application thought it was writing.

INSERT ... ON DUPLICATE KEY UPDATE behaves differently: on conflict it takes an exclusive lock instead of a shared one, because the statement is about to modify the row. This matters for reasoning about cycles. Two concurrent ODKU statements contend on exclusive locks (mutually exclusive, one always waits), whereas two concurrent plain INSERTs can both hold shared locks at once and then deadlock when either tries to upgrade. Blog posts and older documentation sometimes conflate the two; the locking rules are documented in the MySQL reference: Locks Set by Different SQL Statements in InnoDB.

The equivalent in PostgreSQL is less severe (the duplicate-key check doesn’t hold long-lived shared locks the same way) but INSERT ... ON CONFLICT with multiple unique indexes can still produce deadlocks when batches touch overlapping keys in different orders. The shape is the same across engines: the uniqueness check itself is what forces the extra locking, and the cycle forms when two sessions approach the same key from different batches.

Foreign keys take shared locks you didn’t ask for

Both MySQL and PostgreSQL acquire shared locks on the referenced row when you insert or update a row with a foreign key. The purpose is to prevent the referenced row from being deleted mid-transaction; you can’t have an orders.customer_id pointing to a customers.id that’s being concurrently deleted.

The side effect is that a high-write child table concentrates shared locks on hot parent rows:

-- customers has id = 42 (a frequently-used customer)
-- Many concurrent transactions inserting orders for customer 42:

INSERT INTO orders (customer_id, amount_cents) VALUES (42, 1000);
-- Takes a shared lock on customers(id=42)

Shared locks don’t block each other, so concurrent inserts coexist fine. What breaks is the interaction with any transaction that wants an exclusive lock on the parent row: an update to the customer’s name, a soft-delete, a trigger that updates a cached counter. Suddenly, dozens of shared-lock holders are blocking one exclusive-lock request, and if any of them start trying to acquire other locks (say, through a trigger cascade), a cycle can form.

The symptoms: deadlocks that mention tables far removed from the one the application thought it was touching. “Why is my UPDATE customers deadlocking against an INSERT INTO order_items?” Because the order_items insert took a shared lock on customers through the FK chain, and the UPDATE wanted exclusive on the same row.

This is one of the hardest patterns to diagnose on sight, because the offending query never references the contended table explicitly. Mitigations (narrowing FK scope, moving hot parent-row updates to side tables, isolation-level trade-offs) are covered in Part 2: Hot-row isolation.

Index scans lock more rows than queries return

Under InnoDB’s default REPEATABLE READ, an UPDATE with a WHERE clause on a non-indexed column acquires a record lock on every row it scans, not just the ones that match. The engine has to examine each row to check the predicate, and it takes a lock to guarantee the check is stable for the duration of the transaction.

-- Without an index on status, under REPEATABLE READ:
UPDATE orders SET priority = 'high' WHERE status = 'pending';
-- Locks every row in orders during the scan.

If the table has a million rows and only a thousand match, all million get locked for the duration of the update. Any concurrent transaction touching any of those rows has to wait, which inflates the wait-for graph and makes deadlocks more likely.

Under READ COMMITTED, InnoDB narrows this substantially: per the docs, it releases locks on non-matching rows after the WHERE evaluation and uses semi-consistent reads, returning the latest committed version of an already-locked row so the engine can check whether it matches the WHERE before deciding to wait. The net effect is much lower lock footprint and deadlock risk on the same query. PostgreSQL behaves similarly by default: only rows actually updated retain their locks. This is one of the few cases where the same underlying issue (an unindexed predicate) shows up as both a latency problem and a concurrency problem, and where the concurrency angle is specifically a REPEATABLE READ-on-InnoDB amplifier.

Secondary index locks on InnoDB

InnoDB takes locks on both the clustered index (primary key) and any secondary indexes touched by the query. A WHERE status = 'pending' using a status index locks the relevant index entries and the corresponding PK entries. Transactions that approach the same rows from different indexes (one via status, another via customer_id) can deadlock on the PK-side lock even though their index-side locks don’t overlap. This is the most common “why are these two queries deadlocking, they don’t even reference the same columns?” failure mode.

Parallelism-induced deadlocks

The lock-ordering patterns above assume two separate transactions from two separate sessions. Parallelism adds a few variants that don’t fit that frame; the cycle can form inside a single logical unit of work, or show up only on a replica that never issued the original statements.

Worker pools racing on a shared queue. The archetypal production pattern: N application workers pulling jobs from the same table (jobs, outbox, email_queue) and locking rows for processing. If every worker does SELECT ... FOR UPDATE on “the next available batch” without a deterministic ordering, two workers can grab overlapping row sets in opposite orders and cycle. This is the same lock-ordering cycle from earlier, distributed across workers that all look identical from a code-review perspective.

Intra-query parallel workers. PostgreSQL has a full parallel query executor (parallel sequential scans, bitmap heap scans, B-tree index and index-only scans, parallel aggregates, parallel joins) that spawns worker processes to cooperate on a single query. MySQL has a much narrower feature: innodb_parallel_read_threads (added in 8.0.14) enables parallel scanning of the clustered index, used initially by CHECK TABLE and extended to unconditional SELECT COUNT(*) in 8.0.17. It is not general parallel query; MySQL does not parallelize arbitrary SELECTs, joins, or aggregates. In both engines, workers coordinating on a single query don’t deadlock among themselves in normal operation; the engine manages the shared lock state. What can happen is that a parallel worker holds a lock an unrelated transaction needs, and the parallel query itself takes longer than a serial one would, widening the wait window. That is usually not a direct deadlock source, but it changes the timing of existing ones.

Parallel replication on replicas. MySQL’s multi-threaded replica applies committed transactions in parallel. Transactions that committed serially on the source (no possibility of deadlock there) can deadlock on the replica because the applier threads are racing on rows the source never had concurrent writers on. The replica’s deadlock detector resolves them the same way it would a live deadlock, but they show up in the replica’s error log with no corresponding entry on the source. Since MySQL 8.0.27, replica_parallel_type=LOGICAL_CLOCK and replica_parallel_workers=4 are the defaults, and replica_parallel_type was deprecated in 8.0.29; LOGICAL_CLOCK is effectively the only supported mode going forward. The slave_* → replica_* rename happened in 8.0.26; older deployments and blog posts still use the legacy names. PostgreSQL 16 introduced parallel apply for logical replication (streaming = parallel on CREATE SUBSCRIPTION, opt-in in 16 and 17 and the default from 18), which exposes the same class of apply-side cycles on a setup that historically didn’t have them: a surprise for teams that pick it up during an upgrade.

Parallel/online DDL interacting with DML. Tools like pt-online-schema-change and gh-ost run their own DML alongside the application’s writes: pt-online-schema-change mirrors changes through triggers, gh-ost replays them from the binary log, and both copy existing rows in chunks. Under load, the mirrored writes and the copy process can both take locks on the same rows the application is updating, and the wait-for graph gains edges that wouldn’t exist during a normal workload. This rarely manifests as a hard deadlock (the tools are written defensively) but it does manifest as elevated deadlock rates during the migration window.

None of these are properties of the queries themselves. They’re properties of how work gets distributed across workers, processes, or replicas, which means they’re invisible to query-level review and only surface when the deadlock counter is watched over time. The primitives for fixing them (SKIP LOCKED, NOWAIT, advisory locks, DDL timeouts) are covered in Part 2: NOWAIT and SKIP LOCKED.
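
The worker-pool race comes down to two missing properties: a deterministic visiting order and a refusal to wait on a row someone else holds. The sketch below is an in-memory analogy of those two properties using Python mutexes, not database code; the job ids and function names are invented for the example.

```python
import threading
from typing import Dict, List

# One mutex per job row, standing in for row locks on a jobs table.
job_locks: Dict[int, threading.Lock] = {job_id: threading.Lock() for job_id in (1, 2, 3, 4)}


def claim_batch(batch_size: int) -> List[int]:
    """Claim up to batch_size jobs, visiting ids in ascending order and skipping held ones."""
    claimed: List[int] = []
    for job_id in sorted(job_locks):
        if len(claimed) == batch_size:
            break
        # Non-blocking acquire: a worker never waits on a job another worker holds,
        # so it never adds an edge to the wait-for graph.
        if job_locks[job_id].acquire(blocking=False):
            claimed.append(job_id)
    return claimed


def release(job_ids: List[int]) -> None:
    for job_id in job_ids:
        job_locks[job_id].release()


worker_a = claim_batch(2)
worker_b = claim_batch(2)
worker_c = claim_batch(2)
print(worker_a, worker_b, worker_c)  # [1, 2] [3, 4] []
release(worker_a)
release(worker_b)
```

A worker that finds nothing to claim comes back empty-handed instead of queueing behind another worker, which is the behavior SKIP LOCKED provides at the database level.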

Engine-level differences that shape the patterns

The same pattern can deadlock on one engine and not the other. These differences are pattern-shaping; they change which of the above sections apply to your workload. Operational tuning (detector cost, wait timeouts, monitoring) is covered in Part 2: Monitoring.

Default isolation. PostgreSQL defaults to READ COMMITTED. MySQL defaults to REPEATABLE READ (with gap locks). The same application code has measurably different deadlock rates between the two because of this alone, before any other tuning.
Gap locks. Only InnoDB has them, and only under REPEATABLE READ (plus the foreign-key and duplicate-key exceptions that retain gap locking even under READ COMMITTED). PostgreSQL prevents phantom reads through MVCC at its own REPEATABLE READ (stricter than the SQL standard requires) without a range-locking mechanism, so the entire gap-lock deadlock category doesn’t exist on PG at any isolation level.
Lock granularity. PostgreSQL takes row-level (tuple) locks; InnoDB takes record locks on index entries (with next-key extension under REPEATABLE READ). The practical consequence is that InnoDB locks are more entangled with index choice than PostgreSQL’s; changing which index a query uses can change which rows and gaps get locked.
FK lock style. MySQL’s FK check holds a shared lock on the referenced row (next-key under REPEATABLE READ, and the docs list FK checking as one of the places gap locks persist even under READ COMMITTED). PostgreSQL takes a FOR KEY SHARE lock (added in 9.3 specifically to reduce FK lock contention vs. the older FOR SHARE). Hot parent rows are more contended under MySQL as a result.
Row-lock visibility. PostgreSQL row-level locks don’t show up in pg_locks. Per the docs, they’re stored on the tuple header on disk, not in shared memory. A process waiting for a row lock usually appears in pg_locks as waiting for the holder’s transaction ID, not the row. InnoDB’s performance_schema.data_locks exposes row-level lock state directly. More on this in Part 2.

Neither engine is “better.” The behaviors are different, and code that assumes one can deadlock mysteriously when moved to the other.

Why schema-reading assistants hit these patterns

Locking behavior has no syntax in the query text. A SELECT ... FOR UPDATE advertises the intent; a plain INSERT ... ON DUPLICATE KEY UPDATE or INSERT ... ON CONFLICT doesn’t. The shared next-key lock on a duplicate-key violation, the FK shared lock on the parent row, the gap-lock extension under MySQL’s default isolation are all implementation details of the storage engine. Schema-reading assistants read the catalog, which describes tables, columns, and constraints, and the codebase, which describes queries. Neither surfaces lock ordering, gap-lock semantics, or the difference between READ COMMITTED and REPEATABLE READ unless the prompt specifically includes them.

That’s why AI-generated UPSERT and batch-insert code deadlocks in production the way it does. The model reads INSERT ... ON DUPLICATE KEY UPDATE as an atomic upsert, not as “takes an exclusive lock on the conflicting index record, a next-key lock with its gap when the conflict is on a unique secondary index,” and it reads a plain INSERT as lock-free, not as “takes a shared next-key lock before raising the duplicate-key error that the application will retry.” It generates batch INSERTs that process rows in whatever order the application supplies, not sorted by key: fine for a single writer, a lock-ordering cycle under any realistic concurrency. The patterns above are the ones the catalog and query text can’t warn about, and they’re the ones that arrive in production as “intermittent deadlocks under load” after passing every test that didn’t include a second concurrent worker. The fix lives one level up from the query (sorted batches, explicit lock ordering, retry loops), which is the subject of Part 2.

What’s in Part 2

The patterns are the first half. Turning them into working systems takes a different set of skills: reading the deadlock log to identify which pattern is firing, building retry logic that doesn’t mask the real bug, isolating hot rows before they become incident reports, and choosing the right tool for each (NOWAIT, SKIP LOCKED, advisory locks, counter tables, or the isolation-level change that eliminates the category entirely). PostgreSQL’s SERIALIZABLE/SSI produces serialization failures that look like deadlocks but aren’t; the difference matters for retry architecture. AUTO_INCREMENT and sequence-related locking have their own failure modes. DDL migrations on both engines introduce lock queues that manifest as deadlock-like incidents.

All of that is in Database Deadlocks, Part 2: Diagnosis, Retries, and Prevention.

Mental model for the patterns

Deadlocks are what consistent concurrency control does when two transactions make the engine choose between them. The database isn’t misbehaving; it’s refusing to let both of two contradictory orderings win. The error in the application log is a notification, not a fault.

That makes the diagnostic question concrete. Which pattern is firing? Every deadlock in production fits one of the shapes above: lock-ordering cycle, gap lock on a range, duplicate-key shared lock, FK shared lock on a hot parent, unindexed predicate lock amplification, worker-pool race, or replication-apply cycle. Identifying the pattern from the deadlock log narrows the fix enormously. “Two transactions deadlocked, retry the transaction” is true but useless. “Two workers took locks on the same jobs queue in different orders, switch to SKIP LOCKED” is a fix. The work is in the identification.

NULL in SQL: Three-Valued Logic and the Silent Bug Factory

Sun, 26 Jan 2025 00:00:00 +0000

TL;DR

NULL is the absence of a value, and SQL evaluates expressions involving it under three-valued logic (TRUE / FALSE / UNKNOWN). Most operators return UNKNOWN when one of their operands is NULL, so rows with NULLs silently drop out of !=, IN, and NOT IN filters and behave inconsistently across JOIN, GROUP BY, DISTINCT, and aggregate functions. The rules are consistent if you know them, and a source of silently wrong results when you don’t.

There’s a category of SQL bug that shows up in almost every mature codebase. Someone writes a filter like WHERE status != 'closed', expecting it to return every row that isn’t closed. Instead it returns fewer rows than expected: the rows where status is NULL silently dropped out. No error. No warning. The query is doing exactly what the SQL standard says it should, and the result is still wrong for what the author meant.

NULL handling is the single most common source of silently wrong query results in relational databases. The behavior is consistent if you know the rules, but the rules don’t match the intuition most programming languages build. In Java or Python, null != "closed" is true. In SQL, it’s UNKNOWN, and UNKNOWN rows get filtered out. That one difference produces most of the bugs.

NULL is not a value

Every introduction to SQL NULL starts here because it has to. NULL is the absence of a value, a marker that says “this column has no data.” It’s not zero, not empty string, not false. It doesn’t equal itself. It doesn’t not-equal itself either. Any comparison involving NULL returns a third logical state: UNKNOWN.

This is called three-valued logic (3VL), and SQL uses it consistently throughout the language. The three values are TRUE, FALSE, and UNKNOWN. Most operators propagate UNKNOWN: any arithmetic, string, or comparison operation with a NULL operand returns NULL (or UNKNOWN, in a boolean context).

SELECT NULL = NULL; -- NULL (not TRUE)
SELECT NULL = 5; -- NULL
SELECT NULL != 5; -- NULL
SELECT NULL + 1; -- NULL
SELECT 'hello' || NULL; -- NULL in PostgreSQL (ANSI standard behavior)
SELECT CONCAT('hello', NULL); -- NULL in MySQL, 'hello' in PostgreSQL

The CONCAT difference is a good example of how engines diverge even within well-defined territory. MySQL’s CONCAT propagates NULL: any NULL argument makes the whole result NULL. PostgreSQL’s CONCAT function does the opposite, silently skipping NULL arguments and returning the concatenation of the non-NULL parts. (PostgreSQL’s || operator still propagates NULL, matching ANSI.) Two queries that look identical can return different results on different engines, and the difference only shows up when a NULL appears.

For NULL-skipping concatenation that behaves the same on both engines, use CONCAT_WS (concat with separator). Both MySQL and PostgreSQL skip NULL arguments with it:

SELECT CONCAT_WS(' ', first_name, middle_name, last_name) AS full_name FROM users;
-- 'Alice Smith' even if middle_name IS NULL, on both engines.

One gotcha, on both engines: if the separator itself is NULL, the whole result is NULL. The separator is the one argument CONCAT_WS still propagates NULL from. As long as the separator is a literal string, the function is a reliable NULL-safe concat across engines.

IS NULL, not = NULL

The only way to test for NULL is with IS NULL or IS NOT NULL. WHERE col = NULL always returns zero rows, because col = NULL evaluates to NULL, which is not TRUE, so the row is filtered out. This is one of those mistakes every SQL engineer makes exactly once.
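
A quick way to watch this happen, using Python’s built-in sqlite3 module as a stand-in engine. SQLite follows the same three-valued rules for these comparisons; the table and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, None)],
)

# = NULL evaluates to NULL for every row, so nothing survives the filter.
eq_null = conn.execute("SELECT id FROM users WHERE email = NULL").fetchall()

# IS NULL is the test that actually finds the rows.
is_null = conn.execute("SELECT id FROM users WHERE email IS NULL ORDER BY id").fetchall()

# The comparison itself comes back as NULL, which Python sees as None.
null_eq_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

print(eq_null)       # []
print(is_null)       # [(2,), (3,)]
print(null_eq_null)  # None
```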

WHERE clauses filter out UNKNOWN

The rule that drives most NULL bugs: WHERE only keeps rows where the condition evaluates to TRUE. UNKNOWN rows are filtered out, same as FALSE rows.

-- "Users not on the sales team"
SELECT * FROM users WHERE team_id != 3;

If team_id is NULL for unassigned users (a completely normal state) those rows are silently dropped. The expression NULL != 3 evaluates to UNKNOWN, and UNKNOWN is not TRUE, so the row doesn’t survive the filter.

The mental model most developers carry from application code (“anything that isn’t team 3 is included”) is wrong in SQL. To get that behavior, you have to spell it out:

SELECT * FROM users WHERE team_id != 3 OR team_id IS NULL;

This is one of the most common sources of “the numbers don’t match” bugs. A report that’s supposed to count “everyone outside the sales team” quietly excludes every unassigned user, and the total looks plausible because unassigned users aren’t visible in the team-level breakdown either. The discrepancy only surfaces when someone reconciles against a direct row count.
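
The reconciliation gap is easy to reproduce. A small sketch, again with sqlite3 as a stand-in engine and made-up rows, comparing the naive filter with the spelled-out one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, team_id INTEGER)")
conn.executemany(
    "INSERT INTO users (id, team_id) VALUES (?, ?)",
    [(1, 1), (2, 3), (3, None), (4, 2)],  # user 3 is unassigned
)

naive = conn.execute(
    "SELECT id FROM users WHERE team_id != 3 ORDER BY id"
).fetchall()

explicit = conn.execute(
    "SELECT id FROM users WHERE team_id != 3 OR team_id IS NULL ORDER BY id"
).fetchall()

print(naive)     # [(1,), (4,)]
print(explicit)  # [(1,), (3,), (4,)]
```

User 3 is not on team 3, yet only the second query returns them.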

NOT IN is a trap

NOT IN with a nullable subquery is the classic silent-failure NULL bug. The trap only springs when the subquery returns a column that can contain NULL. That rules out primary keys, but it’s extremely common for foreign keys, self-references, and any column that’s optional by design.

-- "Find users who aren't anybody's manager."
SELECT * FROM users
WHERE id NOT IN (SELECT manager_id FROM users);

The subquery returns every manager_id in the table, including NULL for users who don’t have a manager (the CEO, top-level roles, anyone unassigned). The moment the subquery contains a single NULL, the outer query returns zero rows.

The reason is how NOT IN expands. x NOT IN (a, b, c) is equivalent to x != a AND x != b AND x != c. If any of a, b, c is NULL, that comparison returns UNKNOWN, and AND with UNKNOWN can only ever be FALSE or UNKNOWN. The row never passes the filter.
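
The expansion can be checked by hand with a few lines of Python, using None for UNKNOWN. This is a sketch of the logic only, with hypothetical helper names; it is not how any engine evaluates the predicate internally.

```python
from typing import Iterable, Optional


def ne3(x, y) -> Optional[bool]:
    """Three-valued x != y: UNKNOWN (None) if either side is NULL (None)."""
    if x is None or y is None:
        return None
    return x != y


def and3(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued AND: FALSE wins, then UNKNOWN, otherwise TRUE."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def not_in3(x, values: Iterable) -> Optional[bool]:
    """x NOT IN (v1, v2, ...) expanded as x != v1 AND x != v2 AND ..."""
    result: Optional[bool] = True
    for v in values:
        result = and3(result, ne3(x, v))
    return result


print(not_in3(1, [2, 3]))     # True: the row is kept
print(not_in3(1, [2, None]))  # None: UNKNOWN, so WHERE drops the row
print(not_in3(2, [2, None]))  # False: the row is dropped
```

With a NULL in the list, the result is never True for any x, which is the whole trap.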

Safer alternatives:

-- Use NOT EXISTS - handles NULLs correctly
SELECT * FROM users u
WHERE NOT EXISTS (
 SELECT 1 FROM users m WHERE m.manager_id = u.id
);

-- Or filter NULLs out of the subquery explicitly
SELECT * FROM users
WHERE id NOT IN (
 SELECT manager_id FROM users WHERE manager_id IS NOT NULL
);

NOT EXISTS is the better habit. It’s correct regardless of NULL presence, and the query planner handles it at least as well as NOT IN on any modern engine. Treating NOT IN as “suspicious until proven NULL-free” saves a category of bug that’s almost impossible to catch in review.
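
Here are the three forms side by side, run against a four-row table in sqlite3 as a stand-in engine. The rows are invented for the example; user 1 is the CEO with no manager.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO users (id, name, manager_id) VALUES (?, ?, ?)",
    [(1, "ceo", None), (2, "alice", 1), (3, "bob", 1), (4, "carol", 2)],
)

# The subquery contains a NULL (the CEO's manager_id), so no row can pass.
not_in = conn.execute(
    "SELECT id FROM users WHERE id NOT IN (SELECT manager_id FROM users) ORDER BY id"
).fetchall()

not_exists = conn.execute(
    """
    SELECT u.id FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM users m WHERE m.manager_id = u.id)
    ORDER BY u.id
    """
).fetchall()

not_in_filtered = conn.execute(
    """
    SELECT id FROM users
    WHERE id NOT IN (SELECT manager_id FROM users WHERE manager_id IS NOT NULL)
    ORDER BY id
    """
).fetchall()

print(not_in)           # []
print(not_exists)       # [(3,), (4,)]
print(not_in_filtered)  # [(3,), (4,)]
```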

COUNT and NULL: skipped, not zero

The single most important thing to know about aggregates and NULL: NULL is not treated as zero. It’s skipped entirely. Nothing about NULL gets coerced or counted; it’s as if the row weren’t there for the purposes of the aggregate.

COUNT makes this visible because it has three forms that behave differently:

COUNT(*) counts rows, regardless of their contents. NULLs in the row don’t matter.
COUNT(col) counts non-NULL values of col. A row where col IS NULL is skipped.
COUNT(DISTINCT col) counts distinct non-NULL values. NULL is not treated as a distinct value; it’s excluded.

SELECT
 COUNT(*) AS total_rows,
 COUNT(email) AS rows_with_email,
 COUNT(DISTINCT email) AS distinct_emails
FROM users;

On a table of 1,000 users where 200 have NULL emails:

COUNT(*) returns 1000 (all rows)
COUNT(email) returns 800 (NULLs skipped)
COUNT(DISTINCT email) returns ≤ 800 (distinct non-NULL emails only)

This shows up in reports all the time. “How many users signed up this month?” gets answered with COUNT(signup_source) and comes up short because the column was added later and older rows have NULL. The row is there. COUNT(*) would see it. COUNT(signup_source) doesn’t.

The rule: use COUNT(*) when you want rows, COUNT(col) when you specifically want “rows with that column populated.”
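
A scaled-down version of the same comparison, with five made-up rows in sqlite3 as a stand-in engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (id, email) VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, "a@example.com"),  # duplicate email
        (3, "b@example.com"),
        (4, None),
        (5, None),
    ],
)

total_rows, rows_with_email, distinct_emails = conn.execute(
    "SELECT COUNT(*), COUNT(email), COUNT(DISTINCT email) FROM users"
).fetchone()

print(total_rows, rows_with_email, distinct_emails)  # 5 3 2
```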

SUM, AVG, MIN, MAX: also skip NULL

The same rule holds for every aggregate. NULL is not contributed to the sum, not counted in the denominator for the average, not considered for min or max.

SELECT SUM(rating), AVG(rating) FROM reviews;

If half the rows have NULL rating:

SUM(rating) is the sum of the non-NULL half. NULLs don’t contribute 0; they contribute nothing.
AVG(rating) is the sum of the non-NULL half divided by the count of non-NULL rows, not the total row count.

The AVG behavior is the most common source of surprise. If 10,000 rows have rating = 5 and 10,000 have rating = NULL, AVG(rating) is 5.0, not 2.5. The NULL rows don’t pull the average down toward zero. They’re not in the denominator at all.

If you want NULL-as-zero behavior, you have to opt in:

SELECT AVG(COALESCE(rating, 0)) FROM reviews;
-- Now NULLs become 0 and land in both the sum and the denominator.
-- Returns 2.5 in the example above.

SUM of all NULLs is NULL, not zero

SUM(col) over a set where every value is NULL returns NULL, not 0. A SUM that feeds into arithmetic downstream (total + tax, for example) can propagate NULL through the rest of the expression, often somewhere the query author wasn’t expecting. COALESCE(SUM(col), 0) is the idiomatic fix; make the fallback explicit at the aggregate.

The framing that keeps this straight: NULL is not a value, so aggregates have nothing to aggregate. Absent, not zero. If you want absent to mean zero, that’s a COALESCE decision the query author makes; the engine won’t make it for you.
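
The same numbers at small scale, using sqlite3 as a stand-in engine and four invented rows (two ratings of 5, two NULLs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (id INTEGER PRIMARY KEY, rating INTEGER)")
conn.executemany(
    "INSERT INTO reviews (id, rating) VALUES (?, ?)",
    [(1, 5), (2, 5), (3, None), (4, None)],
)

# NULL rows are absent from both the sum and the denominator.
avg_skipping_nulls = conn.execute("SELECT AVG(rating) FROM reviews").fetchone()[0]

# Opting in to NULL-as-zero puts them back into both.
avg_nulls_as_zero = conn.execute(
    "SELECT AVG(COALESCE(rating, 0)) FROM reviews"
).fetchone()[0]

# SUM over nothing but NULLs is NULL, unless the fallback is made explicit.
sum_all_null = conn.execute(
    "SELECT SUM(rating) FROM reviews WHERE rating IS NULL"
).fetchone()[0]
sum_with_fallback = conn.execute(
    "SELECT COALESCE(SUM(rating), 0) FROM reviews WHERE rating IS NULL"
).fetchone()[0]

print(avg_skipping_nulls)  # 5.0
print(avg_nulls_as_zero)   # 2.5
print(sum_all_null)        # None
print(sum_with_fallback)   # 0
```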

GROUP BY and DISTINCT treat NULLs as equal

Here’s where the rules get inconsistent in a way that genuinely surprises people: GROUP BY and DISTINCT treat all NULLs as the same group, even though NULL = NULL returns UNKNOWN.

-- All rows where team_id is NULL land in one group, as if they were equal.
SELECT team_id, COUNT(*) FROM users GROUP BY team_id;
-- team_id | count
-- NULL | 200
-- 1 | 500
-- 2 | 300

-- DISTINCT collapses all NULLs into one row.
SELECT DISTINCT team_id FROM users;
-- NULL
-- 1
-- 2

This is a deliberate exception carved out by the SQL standard. GROUP BY and DISTINCT use a “NULL-safe” equality for grouping purposes, because the alternative (one group per NULL row) would be useless. But it means the behavior is internally inconsistent: WHERE a = b says NULLs aren’t equal, GROUP BY a says they are.

The practical implication: COUNT(DISTINCT col) excludes NULL entirely (consistent with COUNT(col)), while GROUP BY col produces a single row for all NULLs. Two different “null-handling” behaviors under the same umbrella of “treats NULLs as equal for grouping.” Queries that rely on either for correctness should be written with the awareness that the two operations don’t agree.
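
The disagreement shows up directly when the three operations run over the same column. A sketch with sqlite3 as a stand-in engine and five invented rows, two of them unassigned:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, team_id INTEGER)")
conn.executemany(
    "INSERT INTO users (id, team_id) VALUES (?, ?)",
    [(1, None), (2, None), (3, 1), (4, 2), (5, 2)],
)

# GROUP BY puts both NULL rows into a single group.
groups = dict(
    conn.execute("SELECT team_id, COUNT(*) FROM users GROUP BY team_id").fetchall()
)

# DISTINCT collapses the NULLs into one output row.
distinct_values = {
    row[0] for row in conn.execute("SELECT DISTINCT team_id FROM users")
}

# COUNT(DISTINCT ...) leaves NULL out entirely.
count_distinct = conn.execute("SELECT COUNT(DISTINCT team_id) FROM users").fetchone()[0]

print(groups)                # {None: 2, 1: 1, 2: 2}
print(len(distinct_values))  # 3
print(count_distinct)        # 2
```

SELECT DISTINCT reports three values and COUNT(DISTINCT) reports two, over the same column of the same table.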

NULL-safe comparison operators

Both MySQL and PostgreSQL offer operators that treat NULL as equal to NULL, mirroring the GROUP BY behavior for regular comparisons.

-- MySQL
SELECT * FROM users WHERE email <=> NULL;
-- Matches rows where email IS NULL. <=> is the null-safe equal operator.

-- PostgreSQL (ANSI SQL)
SELECT * FROM users WHERE email IS NOT DISTINCT FROM NULL;
-- Same idea. Treats NULLs as equal to each other.

These are useful when joining or filtering on columns that may contain NULL on both sides and you want NULLs to match:

-- Standard equality misses NULL-to-NULL matches
SELECT * FROM a JOIN b ON a.col = b.col;

-- IS NOT DISTINCT FROM treats NULLs as matching (PostgreSQL; MySQL spells it <=>)
SELECT * FROM a JOIN b ON a.col IS NOT DISTINCT FROM b.col;

Neither is used often in practice. The habit most teams settle on is “don’t let NULL be meaningful in join columns”: either constrain the columns NOT NULL or filter NULLs out before joining. The operators are there for the cases where those aren’t options.
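
For a runnable comparison, the sketch below uses sqlite3, where the null-safe comparison is spelled IS. That spelling is SQLite’s and is used here only as a stand-in for <=> and IS NOT DISTINCT FROM; the tables and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (col TEXT)")
conn.execute("CREATE TABLE b (col TEXT)")
conn.executemany("INSERT INTO a (col) VALUES (?)", [("x",), (None,)])
conn.executemany("INSERT INTO b (col) VALUES (?)", [("x",), (None,)])

# Ordinary equality: the NULL-to-NULL pair evaluates to UNKNOWN and is dropped.
plain_matches = conn.execute(
    "SELECT COUNT(*) FROM a JOIN b ON a.col = b.col"
).fetchone()[0]

# Null-safe comparison (SQLite's IS): the NULL-to-NULL pair counts as a match.
null_safe_matches = conn.execute(
    "SELECT COUNT(*) FROM a JOIN b ON a.col IS b.col"
).fetchone()[0]

print(plain_matches)      # 1
print(null_safe_matches)  # 2
```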

ORDER BY: NULL placement varies by engine

When sorting, NULL has to go somewhere. The SQL standard leaves the default placement implementation-defined, and engines disagree.

PostgreSQL. NULLs sort last for ASC and first for DESC by default.
MySQL. NULLs sort first for ASC and last for DESC by default.
Oracle. Matches PostgreSQL’s behavior (NULLs last for ASC).
SQL Server. Matches MySQL’s behavior (NULLs first for ASC).

The fix is to be explicit:

SELECT * FROM events ORDER BY event_time ASC NULLS LAST;
SELECT * FROM events ORDER BY event_time DESC NULLS LAST;

NULLS FIRST / NULLS LAST is ANSI standard and supported by PostgreSQL and Oracle. MySQL doesn’t support the NULLS FIRST/LAST syntax (and neither does SQL Server); in MySQL you fake it with an extra sort expression:

-- MySQL idiom for "NULLS LAST" on an ASC sort
SELECT * FROM events ORDER BY event_time IS NULL, event_time ASC;
-- event_time IS NULL returns 0 for non-nulls, 1 for nulls; 0 sorts first.

Teams that run the same reports against different engines (especially during a migration or in a polyglot analytics stack) hit this one hard. A top-10 leaderboard quietly reorders when the engine changes underneath it.
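The explicit-placement idiom can be checked on any engine that evaluates IS NULL to 0 or 1. A small sketch, assuming a hypothetical events table and using Python's sqlite3 module (SQLite, like MySQL, sorts NULL first on an ascending sort by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, event_time TEXT)")
conn.executemany(
    "INSERT INTO events (id, event_time) VALUES (?, ?)",
    [(1, "2025-01-02"), (2, None), (3, "2025-01-01")],
)

# Engine default: the NULL row (id 2) comes back first.
default_order = [
    row[0] for row in conn.execute("SELECT id FROM events ORDER BY event_time ASC")
]

# Explicit placement: sort on the IS NULL flag first, then on the value.
nulls_last = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM events ORDER BY event_time IS NULL, event_time ASC"
    )
]

print(default_order)  # [2, 3, 1]
print(nulls_last)     # [3, 1, 2]
```

The second query returns the same order on any engine that accepts the expression, which is the point: the placement is stated in the query instead of inherited from the engine.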

JOINs don’t match on NULL

A standard equi-join a.col = b.col doesn’t match rows where either side is NULL. This is consistent with the three-valued logic rule: NULL = NULL is UNKNOWN, so the join predicate fails.

-- Users can have no manager (manager_id IS NULL).
-- This join drops any user with no manager.
SELECT u.name, m.name AS manager_name
FROM users u
JOIN managers m ON u.manager_id = m.id;

If the intent is “every user, with manager info if present,” use a LEFT JOIN. If the intent is “users where manager_id matches some manager row,” the INNER JOIN is correct but it’s worth naming the exclusion: users with NULL manager_id are gone, on purpose.
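The difference between the two intents shows up with a single unmanaged user. A minimal sketch with hypothetical rows, run through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE managers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE users (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  manager_id INTEGER REFERENCES managers(id)
);
-- Hypothetical data: Ada has a manager, Linus does not.
INSERT INTO managers VALUES (10, 'Grace');
INSERT INTO users VALUES (1, 'Ada', 10), (2, 'Linus', NULL);
""")

# INNER JOIN: the NULL manager_id never satisfies the equality, so Linus is dropped.
inner_rows = conn.execute("""
  SELECT u.name, m.name FROM users u
  JOIN managers m ON u.manager_id = m.id
  ORDER BY u.id
""").fetchall()

# LEFT JOIN: every user is kept; the manager columns are NULL where nothing matched.
left_rows = conn.execute("""
  SELECT u.name, m.name FROM users u
  LEFT JOIN managers m ON u.manager_id = m.id
  ORDER BY u.id
""").fetchall()

print(inner_rows)  # [('Ada', 'Grace')]
print(left_rows)   # [('Ada', 'Grace'), ('Linus', None)]
```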

For joins that should treat NULLs as matching (both sides have NULL, and that means “same”), use the null-safe operator:

SELECT *
FROM a JOIN b ON a.external_ref IS NOT DISTINCT FROM b.external_ref;

This is rare but legitimate (e.g., matching optional identifiers where “both unspecified” should be treated as a match). Most of the time, the correct answer is to make the column NOT NULL and use a sentinel if needed (and then deal with the sentinel’s own problems, covered below).

Foreign keys are nullable by default

A foreign key column is nullable unless declared NOT NULL. A nullable FK means the reference is optional: users may or may not have a manager, orders may or may not be linked to a promotion. This is often the correct intent, but it’s frequently unintentional.

-- manager_id is nullable by default. This is intentional if users can be unmanaged.
CREATE TABLE users (
 id BIGINT PRIMARY KEY,
 name TEXT NOT NULL,
 manager_id BIGINT REFERENCES managers(id)
);

Review migration files with this in mind. A column that should always be populated but was added as nullable will accept NULLs forever. Retrofitting NOT NULL later requires backfilling or cleaning up existing NULL rows: easy when the table is small, painful at scale. (Foreign Keys Are Not Optional covers the broader picture of FK enforcement and why application-level validation is an incomplete substitute.)

What NULL actually means is context-dependent

The SQL rules for NULL are unambiguous. What NULL means in a given column is not. NULL can mean:

Unknown. The data exists but we don’t have it. A user’s birthdate where the user declined to share.
Not applicable. The field doesn’t make sense for this row. spouse_name on a row for a single person.
Ongoing or not yet set. The state isn’t finalized. end_date on an active subscription.
Data entry error. The column should have been populated but wasn’t.
Legacy. The column was added after the row was created and never backfilled.

The same column may mean different things in different rows, and the schema doesn’t tell you which is which. This is where schema comments earn their keep, documenting the semantics of NULL in each column in the DDL itself rather than in a wiki page nobody finds.

Sentinel values: the alternative, and its own problems

A common workaround: use a sentinel value instead of NULL. end_date = '9999-12-31' for “ongoing.” status = -1 for “unknown.” deleted_at = '1970-01-01' for “not deleted.”

Sentinels avoid the three-valued-logic rules at the cost of introducing their own bugs. A few to watch for:

Aggregates include sentinels. AVG(rating) over a column where “unknown” is stored as -1 drags the average down. Sentinels break the “aggregates skip missing values” assumption that NULL provides for free.
Range queries break in unexpected directions. WHERE end_date > NOW() returns all the sentinel rows along with real future dates. Every filter has to explicitly exclude the sentinel.
Indexes skew. A column where 80% of the values are the sentinel has a low-selectivity index. The planner may skip the index entirely on queries that filter out the sentinel, because it doesn’t know that’s the intent.
Downstream consumers have to know. Every system that reads the data has to treat 9999-12-31 specially. Miss one consumer and wrong data shows up in a report.
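The aggregate problem is the easiest of these to reproduce. A minimal sketch, assuming a hypothetical reviews table that stores the same three ratings twice, once with NULL for “unknown” and once with a -1 sentinel (run through Python's sqlite3 module):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews ("
    " id INTEGER PRIMARY KEY,"
    " rating_null INTEGER,"               # NULL means unknown
    " rating_sentinel INTEGER NOT NULL)"  # -1 means unknown
)
conn.executemany(
    "INSERT INTO reviews (rating_null, rating_sentinel) VALUES (?, ?)",
    [(4, 4), (5, 5), (None, -1)],
)

# AVG skips the NULL row: (4 + 5) / 2.
avg_null = conn.execute("SELECT AVG(rating_null) FROM reviews").fetchone()[0]

# AVG includes the sentinel: (4 + 5 - 1) / 3.
avg_sentinel = conn.execute("SELECT AVG(rating_sentinel) FROM reviews").fetchone()[0]

# The sentinel version is only right when every query remembers the filter.
avg_filtered = conn.execute(
    "SELECT AVG(rating_sentinel) FROM reviews WHERE rating_sentinel <> -1"
).fetchone()[0]

print(avg_null, avg_filtered)  # 4.5 4.5
print(avg_sentinel < avg_null) # True
```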

The trade-off is real. NULL forces every query author to think about three-valued logic. Sentinels let queries use normal equality but require every author to know the sentinel. Neither is free; they move the cost around.

The pragmatic middle ground: use NULL for genuinely absent data (ongoing subscriptions, optional fields), use sentinels sparingly and document them, and declare NOT NULL everywhere you can enforce presence. A column that’s NOT NULL is the one case where the rules don’t matter, because NULL can’t get in.

Diagnosing a NULL bug

When a query returns fewer rows than it should (or more, or none at all), the fastest way to narrow it down to a NULL issue:

-- Are there NULLs in the columns referenced by the filter?
SELECT
 COUNT(*) AS total,
 COUNT(team_id) AS with_team,
 COUNT(*) - COUNT(team_id) AS no_team
FROM users;

If no_team is non-zero and the filter is team_id != X or team_id NOT IN (...), the NULL rows are the likely culprit. Rewriting with explicit NULL handling (team_id != X OR team_id IS NULL, or NOT EXISTS, or COALESCE(team_id, -1) != X) will reveal whether NULLs were being silently excluded.

For NOT IN, inspect the subquery:

-- Does the NOT IN subquery contain NULL?
SELECT COUNT(*) FROM users WHERE manager_id IS NULL;

If the answer is non-zero, NOT IN is returning an empty set regardless of the outer query’s data.
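To confirm the diagnosis, run the NOT IN form and a NOT EXISTS rewrite side by side. A sketch under assumptions: a hypothetical self-referencing users table where the question is “which users manage nobody,” executed with Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, manager_id INTEGER)")
# Hypothetical data: user 1 has no manager (NULL) and manages users 2 and 3.
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, None), (2, 1), (3, 1)])

# The diagnostic from above: does the subquery column contain NULL?
null_managers = conn.execute(
    "SELECT COUNT(*) FROM users WHERE manager_id IS NULL"
).fetchone()[0]

# NOT IN: one NULL in the subquery turns every comparison into UNKNOWN.
not_in = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM users "
        "WHERE id NOT IN (SELECT manager_id FROM users) ORDER BY id"
    )
]

# NOT EXISTS: the correlated equality simply never matches the NULL row.
not_exists = [
    row[0]
    for row in conn.execute(
        "SELECT u.id FROM users u "
        "WHERE NOT EXISTS (SELECT 1 FROM users r WHERE r.manager_id = u.id) "
        "ORDER BY u.id"
    )
]

print(null_managers)  # 1
print(not_in)         # []
print(not_exists)     # [2, 3]
```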

The mental model

NULL handling is consistent once you internalize the rule set, and the rule set is smaller than it looks:

Any comparison involving NULL returns UNKNOWN. WHERE filters out UNKNOWN rows.
Aggregates skip NULLs. COUNT(*) doesn’t. COUNT(col) does.
GROUP BY, DISTINCT, and ORDER BY treat all NULLs as equivalent (with engine-specific sort placement).
NOT IN against a subquery that returns any NULL returns no rows. Use NOT EXISTS.
Join predicates don’t match NULLs unless you use IS NOT DISTINCT FROM.

Past that, most NULL bugs are prevented by one habit: declare NOT NULL wherever the column should actually be populated. Every NOT NULL column is a column where none of these rules matter, because there’s nothing for them to misbehave on. The fewer nullable columns the schema has, the less of this there is to think about.

The columns where NULL genuinely carries meaning (optional references, ongoing states, data that may not exist) are the ones worth documenting. A schema comment that says “NULL means the subscription is still active” pulls the NULL semantics into the DDL itself, where it’s visible to every engineer, every tool, and every query author who wasn’t around when the decision was made.

Joins That Lie: The Cardinality Problem

Thu, 09 Jan 2025 00:00:00 +0000

TL;DR

Most silently wrong SQL comes from the same root cause: a join that multiplies rows in a way the author didn’t expect. Aggregations built on those rows (SUM, COUNT, AVG) inflate without producing any error. The fix is understanding the cardinality of every join before writing the aggregation, not more careful SQL.

There’s a category of SQL bug that never throws an error, never fails code review, and never shows up in tests. The query runs. The results look reasonable. Someone ships a dashboard, and a month later finance asks why revenue is 40% higher than what the billing system reports. That 40% isn’t a bug in the data; it’s a join that multiplied rows, and a SUM that dutifully added them all up.

The tricky part is that structurally the query is fine. The joins are valid. The filters are valid. The aggregation is valid. Every individual piece is correct. The cardinality of the relationships (how many child rows exist per parent, and how that changes when multiple child tables are joined at once) is doing damage the query never surfaces.

Cardinality, briefly

Cardinality describes the number of rows on each side of a relationship:

One-to-one (1:1). Each row in table A matches at most one row in table B. Less common than 1:N, but legitimately used for optional extensions (splitting off rarely-accessed or sensitive columns into a side table), inheritance patterns (a base table with specialized sub-tables), or separating hot and cold data for caching and storage reasons. A 1:1 join preserves row count.
One-to-many (1:N). Each row in A matches zero or more rows in B. The common case: one order has many order items, one user has many sessions, one post has many comments. Joining A to B duplicates the parent row once per matching child. If a parent has zero children, an inner join drops it entirely; a left join keeps it with NULLs on the child side. This difference matters and it’s the source of another whole class of silent bugs.
Many-to-many (N:M). Rows in A match many rows in B and vice versa. Always implemented through a bridge table (junction table) that sits between them. A bridge is two 1:N relationships back-to-back: the bridge table holds a foreign key to A and a foreign key to B, with each row pairing one A with one B. A has many bridge rows, and B has many bridge rows. Joining through it multiplies by the cardinality on both sides.

The shape of the relationship determines what a join does to row counts. This is where aggregations start to lie.

Schema cardinality vs. data cardinality

There’s a distinction worth naming: what the schema allows vs. what the data actually contains. A foreign key from user_profiles.user_id to users.id with a unique constraint is 1:1 at the schema level. A relationship the schema declares as 1:N can be 1:1 in practice; if every order in your system happens to have exactly one line item, the relationship is legally 1:N but effectively 1:1. This matters for query planning (the optimizer can treat a constraint as a guarantee, while statistics gathered from the data are only estimates), index choice, and reasoning about whether a join can actually multiply rows. A query that’s safe against the current data can break as soon as the data starts exercising the cardinality the schema permits.

The row multiplication problem

The examples in this article use a deliberately simple customers / orders / order_items schema so the mechanics are easy to follow. In real systems the shape changes constantly: invoices and payments, subscriptions and usage events, tickets and messages, events and dimensions in a warehouse. The permutations are endless, but the underlying failure is the same: a join that multiplies rows in a way the author didn’t expect, feeding an aggregation that now lies. Once the pattern is visible in one schema, it’s visible everywhere.

Consider a schema everyone has seen some version of:

CREATE TABLE customers (
 id BIGINT PRIMARY KEY,
 name VARCHAR(255) NOT NULL
);

CREATE TABLE orders (
 id BIGINT PRIMARY KEY,
 customer_id BIGINT NOT NULL REFERENCES customers(id),
 total_cents BIGINT NOT NULL
);

CREATE TABLE order_items (
 id BIGINT PRIMARY KEY,
 order_id BIGINT NOT NULL REFERENCES orders(id),
 price_cents BIGINT NOT NULL,
 quantity INT NOT NULL
);

A question: what’s the total revenue per customer? The obvious query:

SELECT c.name, SUM(o.total_cents) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;

This is correct. One row per order, total_cents summed per customer. Now someone asks: “can we also see how many items they bought?” The change looks trivial; add a join and a count:

SELECT
 c.name,
 SUM(o.total_cents) AS revenue,
 COUNT(oi.id) AS items_purchased
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
GROUP BY c.name;

The items_purchased count is correct. The revenue is wrong.

Here’s what happened. orders to order_items is 1:N. Joining them multiplies each order row by the number of items it contains. An order with 5 items now appears 5 times in the result set, once per item. total_cents, which lives on the orders row, is duplicated in each of those 5 copies.

SUM(o.total_cents) now sums the same order total once per item. A $100 order with 5 items contributes $500. Each order is multiplied by its own item count, so overall revenue is inflated by roughly the average number of items per order.

The query runs. The numbers look like revenue. Nothing is flagged. The dashboard ships.
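The $100-becomes-$500 arithmetic can be reproduced in a few lines. A minimal sketch using the schema above with hypothetical data (one customer, one order, five items), run through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
  id INTEGER PRIMARY KEY,
  customer_id INTEGER NOT NULL REFERENCES customers(id),
  total_cents INTEGER NOT NULL
);
CREATE TABLE order_items (
  id INTEGER PRIMARY KEY,
  order_id INTEGER NOT NULL REFERENCES orders(id),
  price_cents INTEGER NOT NULL,
  quantity INTEGER NOT NULL
);
-- Hypothetical data: one $100 order made of five $20 items.
INSERT INTO customers VALUES (42, 'Acme');
INSERT INTO orders VALUES (1, 42, 10000);
INSERT INTO order_items (order_id, price_cents, quantity)
VALUES (1, 2000, 1), (1, 2000, 1), (1, 2000, 1), (1, 2000, 1), (1, 2000, 1);
""")

# Revenue at the order grain: one row per order.
correct_revenue = conn.execute("""
  SELECT SUM(o.total_cents)
  FROM customers c
  JOIN orders o ON o.customer_id = c.id
  WHERE c.id = 42
""").fetchone()[0]

# Same SUM after joining to items: one row per item, order total repeated.
inflated_revenue, items_purchased = conn.execute("""
  SELECT SUM(o.total_cents), COUNT(oi.id)
  FROM customers c
  JOIN orders o ON o.customer_id = c.id
  JOIN order_items oi ON oi.order_id = o.id
  WHERE c.id = 42
""").fetchone()

print(correct_revenue)   # 10000
print(inflated_revenue)  # 50000
print(items_purchased)   # 5
```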

Why it’s easy to miss

The inflation is proportional to the cardinality of the join, so it affects every row by roughly the same factor. Totals grow uniformly, relative rankings stay intact, and top-10 lists still look “right.” There’s nothing that stands out as obviously wrong, except the grand total doesn’t match the source system.

The bridge table trap

Many-to-many relationships make this problem worse because the multiplication happens in both directions. Take a schema with products, orders, and promotions:

CREATE TABLE order_items (
 id BIGINT PRIMARY KEY,
 order_id BIGINT NOT NULL,
 product_id BIGINT NOT NULL,
 price_cents BIGINT NOT NULL
);

CREATE TABLE order_item_promotions (
 order_item_id BIGINT NOT NULL,
 promotion_id BIGINT NOT NULL,
 PRIMARY KEY (order_item_id, promotion_id)
);

CREATE TABLE promotions (
 id BIGINT PRIMARY KEY,
 name VARCHAR(255) NOT NULL
);

An order item can have multiple promotions applied to it (a percentage discount stacked with a free shipping promo). Query: total revenue, broken down by promotion:

SELECT p.name, SUM(oi.price_cents) AS revenue
FROM order_items oi
JOIN order_item_promotions oip ON oip.order_item_id = oi.id
JOIN promotions p ON p.id = oip.promotion_id
GROUP BY p.name;

If an order item had two promotions, its price_cents shows up twice (once under each promotion). Sum those up and total revenue exceeds actual revenue. Worse, if you then compare “sum across all promotions” to “total revenue from order_items,” the numbers don’t tie out, and there’s no obvious reason why.

The bridge table is doing exactly what it’s supposed to do. The query is doing exactly what the SQL says. The meaning of the aggregation drifts as soon as you cross a many-to-many boundary.
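A small data set shows the numbers failing to tie out. This is a sketch with hypothetical rows (two order items, two promotions, one item carrying both), trimmed to the columns the query touches and run through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_items (id INTEGER PRIMARY KEY, price_cents INTEGER NOT NULL);
CREATE TABLE promotions (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE order_item_promotions (
  order_item_id INTEGER NOT NULL,
  promotion_id INTEGER NOT NULL,
  PRIMARY KEY (order_item_id, promotion_id)
);
-- Hypothetical data: item 1 has both promotions, item 2 has one.
INSERT INTO order_items VALUES (1, 1000), (2, 500);
INSERT INTO promotions VALUES (1, 'TEN_PERCENT'), (2, 'FREE_SHIPPING');
INSERT INTO order_item_promotions VALUES (1, 1), (1, 2), (2, 1);
""")

# Revenue per promotion, exactly as in the query above.
by_promotion = dict(conn.execute("""
  SELECT p.name, SUM(oi.price_cents)
  FROM order_items oi
  JOIN order_item_promotions oip ON oip.order_item_id = oi.id
  JOIN promotions p ON p.id = oip.promotion_id
  GROUP BY p.name
""").fetchall())

# Revenue at the item grain, with no bridge involved.
actual_total = conn.execute("SELECT SUM(price_cents) FROM order_items").fetchone()[0]

print(by_promotion)                # {'FREE_SHIPPING': 1000, 'TEN_PERCENT': 1500}
print(sum(by_promotion.values()))  # 2500
print(actual_total)                # 1500
```

Each per-promotion figure is a defensible answer to “revenue from items that carried this promotion.” The figures stop being additive the moment one item carries two promotions.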

A variation of the grain problem shows up in schemas where related tables each carry their own independently-moving date column: orders vs. shipments, subscriptions vs. invoices, tickets vs. updates, orders vs. returns. When a question is time-bounded (“Q1 revenue from items shipped in Q1”), the date filter has to land on the column that matches the question. Filtering on both tables “to be safe” silently excludes rows whose dates diverge. An order placed in December with items shipping in January is a Q1 shipment; a filter on orders.created_at throws it out.

The rule is the same as for row multiplication: pick the grain that matches the question, once. If the question is about shipments, filter on shipped_at. If it’s about orders, filter on created_at. Combining both feels more rigorous and quietly returns the wrong set.
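The December-order, January-shipment case looks like this in practice. A sketch under assumptions: a hypothetical shipments table with a shipped_at column, ISO-8601 date strings, and Q1 2025 as the window, run through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL);
CREATE TABLE shipments (
  id INTEGER PRIMARY KEY,
  order_id INTEGER NOT NULL REFERENCES orders(id),
  shipped_at TEXT NOT NULL
);
-- Hypothetical data: order 1 is placed in December and ships in January.
INSERT INTO orders VALUES (1, '2024-12-28'), (2, '2025-02-01');
INSERT INTO shipments VALUES (1, 1, '2025-01-05'), (2, 2, '2025-02-03');
""")

# The question is about shipments, so the filter lands on shipped_at only.
q1_shipments = [row[0] for row in conn.execute("""
  SELECT s.order_id FROM shipments s
  WHERE s.shipped_at >= '2025-01-01' AND s.shipped_at < '2025-04-01'
  ORDER BY s.order_id
""")]

# Filtering both tables "to be safe" drops the December order.
both_filters = [row[0] for row in conn.execute("""
  SELECT s.order_id FROM shipments s
  JOIN orders o ON o.id = s.order_id
  WHERE s.shipped_at >= '2025-01-01' AND s.shipped_at < '2025-04-01'
    AND o.created_at >= '2025-01-01' AND o.created_at < '2025-04-01'
  ORDER BY s.order_id
""")]

print(q1_shipments)  # [1, 2]
print(both_filters)  # [2]
```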

How to diagnose it

The symptom is always the same: a number that doesn’t match what another system says it should be. Revenue doesn’t match billing. User counts don’t match the auth service. Item totals don’t match inventory. When that happens, the first thing to check isn’t the aggregation; it’s the row count at each stage of the query.

Take the aggregation off and see what you’re actually summing:

-- Original (wrong) query
SELECT c.name, SUM(o.total_cents) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
GROUP BY c.name;

-- Diagnostic: see the raw rows for one customer
SELECT c.name, o.id AS order_id, o.total_cents, oi.id AS item_id
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
WHERE c.id = 42
ORDER BY o.id;

If the same order_id and total_cents appear on multiple rows, the sum is going to double-count. Seeing the raw rows makes the multiplication obvious in a way the aggregated output never does.

Another useful check: compare counts at each level independently.

-- Count orders directly
SELECT COUNT(*) FROM orders WHERE customer_id = 42;
-- Returns: 3

-- Count orders through the joined query
SELECT COUNT(*) FROM orders o
JOIN order_items oi ON oi.order_id = o.id
WHERE o.customer_id = 42;
-- Returns: 12

-- The 4x multiplication is the join's cardinality

When the two numbers don’t match, the join is multiplying rows. Every aggregation downstream of that join is suspect.

How to solve it

There’s no single fix; the right technique depends on whether the aggregation lives on the parent or the child, and how many cardinality boundaries you’re crossing.

Aggregate at the correct grain, then join

The cleanest approach is usually to do each aggregation at the table where the data actually lives, then join the pre-aggregated results together. This keeps row counts under control and makes the query’s intent obvious.

WITH order_stats AS (
 SELECT customer_id, SUM(total_cents) AS revenue
 FROM orders
 GROUP BY customer_id
),
item_stats AS (
 SELECT o.customer_id, COUNT(oi.id) AS items_purchased
 FROM orders o
 JOIN order_items oi ON oi.order_id = o.id
 GROUP BY o.customer_id
)
SELECT
 c.name,
 order_stats.revenue,
 item_stats.items_purchased
FROM customers c
LEFT JOIN order_stats ON order_stats.customer_id = c.id
LEFT JOIN item_stats ON item_stats.customer_id = c.id;

Revenue is summed from orders where it lives, once per order. Items are counted through the orders→order_items join separately. Then both are joined back to customers. Each aggregation happens at its correct grain, and the final join is 1:1:1, no multiplication.

It looks more verbose. It is. That’s the point. The verbosity is making the cardinality explicit instead of hiding it behind a single flat join.

Use DISTINCT inside the aggregate, with caution

When the multiplication is already there, SUM(DISTINCT ...) can sometimes paper over it:

SELECT
 c.name,
 SUM(DISTINCT o.total_cents) AS revenue -- suspicious
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
GROUP BY c.name;

This only works if total_cents is guaranteed to be different for every one of the customer’s orders, and nothing guarantees that. If two different orders happen to have the same total, DISTINCT collapses them into one and revenue drops. It’s fragile: it undoes the duplication the join introduced, but it also erases legitimate repeats in the data.

COUNT(DISTINCT o.id) is safer because id is the primary key and unique by definition. Use DISTINCT on key columns, not on the values being summed.
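The failure takes only two orders with the same total. A minimal sketch with hypothetical data (one customer, two $50 orders, two items each), run through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
  id INTEGER PRIMARY KEY,
  customer_id INTEGER NOT NULL,
  total_cents INTEGER NOT NULL
);
CREATE TABLE order_items (
  id INTEGER PRIMARY KEY,
  order_id INTEGER NOT NULL REFERENCES orders(id),
  price_cents INTEGER NOT NULL
);
-- Hypothetical data: two different orders that happen to share a total.
INSERT INTO orders VALUES (1, 42, 5000), (2, 42, 5000);
INSERT INTO order_items (order_id, price_cents)
VALUES (1, 2500), (1, 2500), (2, 2500), (2, 2500);
""")

# Ground truth at the order grain.
correct = conn.execute(
    "SELECT SUM(total_cents) FROM orders WHERE customer_id = 42"
).fetchone()[0]

# Three aggregates over the multiplied rows.
plain_sum, distinct_sum, distinct_orders = conn.execute("""
  SELECT SUM(o.total_cents), SUM(DISTINCT o.total_cents), COUNT(DISTINCT o.id)
  FROM orders o
  JOIN order_items oi ON oi.order_id = o.id
  WHERE o.customer_id = 42
""").fetchone()

print(correct)          # 10000
print(plain_sum)        # 20000 (inflated by the join)
print(distinct_sum)     # 5000  (two equal totals collapsed into one)
print(distinct_orders)  # 2     (safe: DISTINCT on the key)
```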

Window functions for “per parent” aggregates

When you need a running or per-group aggregate without collapsing rows, window functions keep the row count intact and do the math within a partition:

SELECT
 o.id AS order_id,
 o.customer_id,
 oi.id AS item_id,
 oi.price_cents,
 SUM(oi.price_cents) OVER (PARTITION BY o.id) AS order_total,
 SUM(oi.price_cents) OVER (PARTITION BY o.customer_id) AS customer_total
FROM orders o
JOIN order_items oi ON oi.order_id = o.id;

No group-by, no row collapsing, totals computed at the right grain. The cost is a result set the size of order_items, so use this pattern when the row-level detail is actually needed, not as a default replacement for GROUP BY.

LATERAL joins and correlated subqueries

When you need a per-row aggregate (the total for each order, or the most recent child row), a lateral join keeps the parent’s grain and evaluates the child aggregation row by row.

-- PostgreSQL: LATERAL join
SELECT o.id, o.customer_id, items.total, items.item_count
FROM orders o,
LATERAL (
 SELECT SUM(price_cents) AS total, COUNT(*) AS item_count
 FROM order_items
 WHERE order_id = o.id
) items;

One row per order, aggregation computed inside the lateral subquery, no multiplication. This is often faster than joining and then grouping, especially when orders is heavily filtered and order_items is large.
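On engines without LATERAL, correlated scalar subqueries in the SELECT list give the same grain. The sketch below is not the LATERAL form; it is the correlated-subquery variant, shown with hypothetical data through Python's sqlite3 module (SQLite has no LATERAL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL);
CREATE TABLE order_items (
  id INTEGER PRIMARY KEY,
  order_id INTEGER NOT NULL REFERENCES orders(id),
  price_cents INTEGER NOT NULL
);
-- Hypothetical data: order 1 has two items, order 2 has none.
INSERT INTO orders VALUES (1, 42), (2, 42);
INSERT INTO order_items (order_id, price_cents) VALUES (1, 300), (1, 700);
""")

# One scalar subquery per aggregate; each runs against the current order row.
per_order = conn.execute("""
  SELECT
    o.id,
    (SELECT SUM(price_cents) FROM order_items WHERE order_id = o.id) AS total,
    (SELECT COUNT(*) FROM order_items WHERE order_id = o.id) AS item_count
  FROM orders o
  ORDER BY o.id
""").fetchall()

print(per_order)  # [(1, 1000, 2), (2, None, 0)]
```

The row count equals the number of orders, including the order with no items (its SUM is NULL and its COUNT is 0). The trade-off against LATERAL is one subquery per aggregate instead of one subquery returning several columns.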

Schema-level defenses

Query-level fixes only work if the person writing the query knows to apply them. Schema-level guarantees work for every query, forever.

Foreign keys tell the query planner about cardinality. PostgreSQL in particular uses FK metadata to estimate join selectivity, which feeds its join-order decisions, and it uses unique constraints to remove redundant left joins during planning. Beyond the integrity benefits, FKs make the shape of the data visible to both humans and the planner. (Foreign Keys Are Not Optional goes deeper on why skipping them compounds into silent corruption over time.)

Unique constraints on bridge tables prevent accidental many-to-many explosions. A bridge table with PRIMARY KEY (a_id, b_id) can’t contain duplicates, so joining through it can’t multiply rows because of duplicate bridge entries (only because of legitimate N:M relationships).

CREATE TABLE order_item_promotions (
 order_item_id BIGINT NOT NULL REFERENCES order_items(id),
 promotion_id BIGINT NOT NULL REFERENCES promotions(id),
 PRIMARY KEY (order_item_id, promotion_id) -- prevents duplicates
);

Without that composite primary key, a bug in the application layer that inserts the same (order_item_id, promotion_id) pair twice would silently double revenue for that item in any query joining through the bridge. With it, the database rejects the duplicate at write time.
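The write-time rejection is quick to verify. A minimal sketch with hypothetical IDs, using Python's sqlite3 module (the foreign keys are left out because only the composite primary key is being exercised):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
  CREATE TABLE order_item_promotions (
    order_item_id INTEGER NOT NULL,
    promotion_id INTEGER NOT NULL,
    PRIMARY KEY (order_item_id, promotion_id)
  )
""")

conn.execute("INSERT INTO order_item_promotions VALUES (1, 7)")

# Simulate the application bug: the same pair is written a second time.
try:
    conn.execute("INSERT INTO order_item_promotions VALUES (1, 7)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

bridge_rows = conn.execute(
    "SELECT COUNT(*) FROM order_item_promotions"
).fetchone()[0]

print(duplicate_rejected)  # True
print(bridge_rows)         # 1
```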

Schema comments on tables and columns document the cardinality and semantics that aren’t visible from the DDL. A line like COMMENT ON TABLE order_item_promotions IS 'N:M bridge. One row per (item, promotion). Joining this multiplies order_item rows by avg promotions-per-item.' tells every future engineer exactly what the table does to row counts. (Comment Your Schema covers the mechanics across MySQL and PostgreSQL and why this metadata layer is almost always empty.)

Denormalized totals, when the trade-off is worth it. For heavily queried aggregates (order totals, user balance, post comment counts), storing the aggregate on the parent table eliminates the join entirely. The write-path cost is keeping the denormalized value consistent: either through application code, triggers, or scheduled reconciliation. For high-read, low-write aggregates, the read simplicity often wins. For everything else, computing on demand is cleaner.

Denormalization has its own failure mode

A stored orders.total_cents that’s out of sync with SUM(order_items.price_cents) is its own form of silent corruption, moved from the query layer to the write layer. Either invest in keeping it consistent (triggers, reconciliation jobs) or don’t denormalize it at all. A half-maintained denormalized aggregate is worse than no denormalization.
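A reconciliation job can be as small as one query that lists the parents whose stored total disagrees with the children. This is a sketch, not a prescribed job: it assumes the order total is defined as SUM(price_cents * quantity) over the order’s items, uses hypothetical data, and runs through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL);
CREATE TABLE order_items (
  id INTEGER PRIMARY KEY,
  order_id INTEGER NOT NULL REFERENCES orders(id),
  price_cents INTEGER NOT NULL,
  quantity INTEGER NOT NULL
);
-- Hypothetical data: order 1 is consistent, order 2 has drifted by $10.
INSERT INTO orders VALUES (1, 3000), (2, 5000);
INSERT INTO order_items (order_id, price_cents, quantity)
VALUES (1, 1000, 2), (1, 1000, 1), (2, 2000, 2);
""")

# Aggregate the children at the order grain and keep only the mismatches.
# LEFT JOIN + COALESCE also catches orders with a non-zero total and no items.
drifted = conn.execute("""
  SELECT o.id,
         o.total_cents,
         COALESCE(SUM(oi.price_cents * oi.quantity), 0) AS computed_cents
  FROM orders o
  LEFT JOIN order_items oi ON oi.order_id = o.id
  GROUP BY o.id, o.total_cents
  HAVING o.total_cents <> COALESCE(SUM(oi.price_cents * oi.quantity), 0)
  ORDER BY o.id
""").fetchall()

print(drifted)  # [(2, 5000, 4000)]
```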

The pause that schema-reading assistants don’t take

A schema-reading assistant asked for “total revenue by customer” reads the catalog, finds the chain of tables it needs, writes the JOINs, adds the SUM, and hands back a query that looks right. The pause described in the section below (“wait, does this join multiply rows?”) is a step the model doesn’t take unless the prompt asks for it. The catalog tells the assistant that customers, orders, order_items, and the order_item_promotions bridge exist; it doesn’t tell it that joining through the bridge duplicates every order_items row once per promotion. The inflated total and the correct one look the same on the way back.

The same schema-level defenses that help humans give the model more to work with. FK metadata lets a catalog-reading tool see which joins are 1:N versus N:M. Composite primary keys on bridge tables prevent the “duplicate-in-bridge” multiplier from ever materializing in the data. Table comments that spell out cardinality (something like 'N:M bridge. Joining this multiplies order_item rows by avg promotions-per-item.') put the warning in the part of the schema the assistant actually reads. This doesn’t replace the pause described below; it narrows the set of cases where the pause has to do all the work.

The mental model

The shortcut that prevents most of these bugs: before writing an aggregation, picture the row count at every stage of the query.

Start with the leftmost table. How many rows?
Each join: does this multiply, preserve, or filter the row count?
At the point where the aggregate runs: what is the grain of each row? What does “one row” represent?
Does the aggregate make sense at that grain?

When the answer is “one row represents an order item, but I’m summing an order-level field,” the bug is already obvious. When the answer is “one row represents an order, and I’m summing order totals,” the query is correct.

This isn’t a skill that scales with query complexity; it’s a habit that kicks in before the query gets written. The senior engineers who never seem to hit these bugs aren’t writing smarter SQL. They’re pausing before the SUM and asking what row they’re actually summing over.

Putting it together

Cardinality bugs are a specific kind of wrong: syntactically valid, semantically broken, and invisible to every automated check. Tests pass. Code reviews approve. Reports render. The numbers just happen to be wrong.

The defense is structural, not tactical. Understand the cardinality of each relationship before writing the join. Aggregate at the grain where the data lives. Use the schema to make cardinality explicit: foreign keys, composite primary keys on bridges, comments that document the shape. When diagnosing a wrong number, strip the aggregation and look at the raw rows; the multiplication is almost always visible as soon as the SUM is out of the way.

The worst thing about silent bugs is that they stay silent. A crash gets fixed; wrong numbers persist for quarters. The habit of thinking about cardinality first (before writing the aggregation, not after someone flags the total) is one of the highest-leverage habits in working with relational data.

Uniqueness and Selectivity: The Two Numbers That Drive Query Plans

Mon, 23 Dec 2024 00:00:00 +0000

TL;DR

Uniqueness governs correctness, selectivity governs performance. The interesting parts of both live in the edge cases: partial unique indexes and their UPSERT targeting quirks, the way partitioning weakens every uniqueness guarantee, correlated columns that defeat planner assumptions, stale statistics that turn a 5ms query into a 5-minute one. Declaring the constraints the planner can see and keeping its statistics fresh buys more than any amount of query rewriting.

Everyone who works with relational databases knows UNIQUE. What they often don’t know is how it behaves under partitioning, how ON CONFLICT targets it (and doesn’t), and what the planner actually does with it beyond rejecting duplicates. Selectivity is in the same category. The definition is trivial, but the behavior that matters lives in composite column ordering, stale statistics, and the correlated-columns problem that breaks the planner’s core assumption.

This is the territory where “the query is correct” and “the query is fast” stop being the same question, and both depend on what the database can actually prove about the data. The constraints are the contract between the schema and the planner. Everything else is inference.

Partial and filtered unique indexes

PostgreSQL supports partial unique indexes: uniqueness enforced only over rows matching a predicate. This is the right tool for the common real-world case “email must be unique among active users”:

-- PostgreSQL: email unique only among non-soft-deleted rows.
CREATE UNIQUE INDEX users_active_email_uniq
 ON users (email)
 WHERE deleted_at IS NULL;

A plain UNIQUE (email) forces a choice: either allow re-registration (and lose referential integrity by reusing emails across deleted and active rows) or block it (and frustrate users whose accounts were long ago soft-deleted). The partial index lets both coexist.
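Both behaviors (re-registration allowed, duplicate active accounts blocked) can be exercised directly. A minimal sketch with a hypothetical users table, using Python's sqlite3 module; SQLite accepts the same partial-index syntax as the PostgreSQL statement above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL, deleted_at TEXT)"
)
conn.execute(
    "CREATE UNIQUE INDEX users_active_email_uniq ON users (email) "
    "WHERE deleted_at IS NULL"
)

# A soft-deleted account, then a fresh registration with the same email: allowed.
conn.execute(
    "INSERT INTO users (email, deleted_at) VALUES ('alice@example.com', '2024-01-01')"
)
conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")

# A second active account with that email: rejected by the partial index.
try:
    conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")
    second_active_rejected = False
except sqlite3.IntegrityError:
    second_active_rejected = True

total_rows = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

print(second_active_rejected)  # True
print(total_rows)              # 2
```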

MySQL doesn’t support partial unique indexes directly. The workaround exploits MySQL’s treatment of NULL as distinct under UNIQUE (covered in NULL: Three-Valued Logic):

-- MySQL idiom: generated column that's NULL for deleted users.
ALTER TABLE users
 ADD COLUMN email_active VARCHAR(255)
 GENERATED ALWAYS AS (CASE WHEN deleted_at IS NULL THEN email END) VIRTUAL,
 ADD UNIQUE KEY users_active_email_uniq (email_active);

The constraint effectively fires only for rows where email_active is non-NULL: exactly partial-index semantics, just expressed through a generated column. Awkward to write, but portable-ish and the ORMs catch on eventually.

Partitioned tables force uniqueness compromises

Partitioned tables in both PostgreSQL and MySQL require the partition key to be part of every unique constraint, including the primary key. The rule exists because uniqueness is enforced by indexes that are local to each partition: without the partition key in the constraint, the database would have to check every partition on every insert to enforce uniqueness, defeating the point of partitioning.

The practical consequence is that PRIMARY KEY (id) isn’t allowed on a table partitioned by created_at. It has to become PRIMARY KEY (id, created_at). The same applies to every other unique constraint: UNIQUE (email) on a users table partitioned by region becomes UNIQUE (email, region), which quietly weakens the guarantee. The schema now allows the same email to exist in multiple regions, whether or not the application ever intended that.
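What the widened constraint permits is easy to demonstrate even without partitions. The sketch below uses Python's sqlite3 module, which has no partitioning at all; it only shows what UNIQUE (email, region) accepts and rejects, with hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
  CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL,
    region TEXT NOT NULL,
    UNIQUE (email, region)
  )
""")

# Same email in two regions: the composite constraint has no objection.
conn.execute("INSERT INTO users (email, region) VALUES ('alice@example.com', 'eu')")
conn.execute("INSERT INTO users (email, region) VALUES ('alice@example.com', 'us')")

# Same email in the same region: still rejected.
try:
    conn.execute("INSERT INTO users (email, region) VALUES ('alice@example.com', 'eu')")
    same_region_rejected = False
except sqlite3.IntegrityError:
    same_region_rejected = True

rows_for_email = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email = 'alice@example.com'"
).fetchone()[0]

print(rows_for_email)        # 2
print(same_region_rejected)  # True
```

If the application still needs globally unique emails, that guarantee now has to come from somewhere else, for example a separate unpartitioned lookup table with its own UNIQUE (email).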

This is one of the sharper trade-offs in partitioning decisions. A uniqueness guarantee the schema used to provide gets weaker, and point lookups that used to be single-row const accesses become ref lookups because the full primary key isn’t spelled out in every query. Designing Partitioning You Don’t Have to Babysit covers the full picture, including why partitioning by the primary key itself (when the PK is monotonically increasing) sidesteps the trade-off entirely.

UPSERT targeting is more specific than it looks

INSERT ... ON CONFLICT (PostgreSQL) and INSERT ... ON DUPLICATE KEY UPDATE (MySQL) bind to specific unique constraints, not to “any uniqueness that happens to apply.” The difference between the two engines is where most of the subtle bugs live.

PostgreSQL is explicit. ON CONFLICT (email) requires a unique constraint or unique index exactly matching email. If none exists, the statement errors out. If a partial unique index exists instead of a plain one, ON CONFLICT (email) does not match it; you need the full predicate:

-- Must match the partial index's predicate to target it.
INSERT INTO users (email, name)
VALUES ('alice@example.com', 'Alice')
ON CONFLICT (email) WHERE deleted_at IS NULL
DO UPDATE SET name = EXCLUDED.name;

If the partial index changes (predicate tightened, column added), every ON CONFLICT targeting it has to change too. This is explicit coupling, but it’s coupling.

MySQL is implicit and more dangerous. ON DUPLICATE KEY UPDATE fires on conflict with any unique key on the table, not just the one the query author had in mind. If the table has UNIQUE (email) and UNIQUE (external_id), an insert that conflicts on either key triggers the update. For rows where the inserted email matches one existing row and the external_id matches a different one, the behavior depends on which index is checked first and is undefined as far as the language is concerned.

The practical implication: adding a new unique key to a table can silently change the semantics of every existing INSERT ... ON DUPLICATE KEY UPDATE against that table. There’s no error, no warning, just different behavior on the next conflict that falls into the new key’s path. On large schemas with dozens of unique keys, this is the UPSERT equivalent of action at a distance.

The mitigation on MySQL is to use INSERT ... ON DUPLICATE KEY UPDATE only when the table has a single obvious unique key. REPLACE is not a way out: it also acts on any unique key, and it deletes every conflicting row before inserting. When the semantics need to be explicit, reach for a SELECT ... FOR UPDATE followed by a conditional UPDATE or INSERT.

Unique indexes also concentrate deadlock pressure

Two deadlock patterns are specific to unique indexes and show up almost nowhere else:

Duplicate-key inserts take locks even when they fail. When InnoDB detects a duplicate on insert, it doesn’t just raise the error; it first acquires a shared next-key lock on the conflicting row. Under REPEATABLE READ (the default), that lock covers the gap too. Two concurrent transactions inserting near the same unique key can deadlock on those shared locks before either sees the duplicate-key error. The most common production signature: a batch-upsert worker hitting the same hot row ranges from multiple threads.

ON DUPLICATE KEY UPDATE batches deadlock when key ordering differs. Each row in a batch insert acquires its lock when the row is processed, not when the batch starts. Two batches touching overlapping keys (A, B) vs (B, A) take locks in opposite order and cycle. The fix is either sorting rows by unique key before the batch (so lock acquisition order is consistent across workers) or switching to INSERT IGNORE (MySQL’s counterpart to PostgreSQL’s ON CONFLICT DO NOTHING) plus a separate targeted UPDATE pass.

Neither of these shows up the same way with non-unique indexes; the uniqueness check itself is what forces the extra locking. It’s the cost of making the database enforce the guarantee, and it scales badly once the hot-key set is small and write concurrency is high. (Database Deadlocks, Part 1 covers the broader patterns; Part 2 covers reading the log, retries, and prevention.)
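The sort-before-batch fix can be sketched as follows, assuming a hypothetical MySQL inventory table with a unique sku column. The sorting itself happens in the application before the statement is built; the SQL only shows the result.

```sql
-- Hypothetical table: inventory(sku VARCHAR(32) UNIQUE, qty INT).

-- Deadlock-prone: two workers list overlapping keys in opposite order.
--   worker 1: VALUES ('B-200', 3), ('A-100', 5)
--   worker 2: VALUES ('A-100', 1), ('B-200', 2)

-- Safer: every worker sorts its rows by the unique key first, so locks
-- are always requested in the same order (A-100, then B-200, then C-300).
INSERT INTO inventory (sku, qty)
VALUES
  ('A-100', 5),
  ('B-200', 3),
  ('C-300', 7)
ON DUPLICATE KEY UPDATE qty = qty + VALUES(qty);
```

Consistent ordering removes the cycle but not the contention: workers still queue behind each other on hot keys, they just no longer deadlock on them.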

Composite index column ordering

The order of columns in a composite index is a selectivity decision that determines whether the index helps the query it was built for. The usual rules compress to three:

Equality filters before range filters. An index on (customer_id, created_at) is efficient for WHERE customer_id = 42 AND created_at > '2026-01-01'. Reversed ((created_at, customer_id)), the index has to scan a wide range of created_at values and filter customer_id as a secondary step, which on a wide date range can cost as much as, or more than, a sequential scan.
More selective column first for equality-only predicates. For filters of the form WHERE a = ? AND b = ?, the column with more distinct values goes first so the first lookup narrows more aggressively.
Match the query’s access pattern. An index on (a, b, c) serves queries filtering by a, (a, b), or (a, b, c). It does not efficiently serve queries filtering by b alone, c alone, or (b, c). The leading column is load-bearing.

These interact with covering index considerations, sort order requirements, and the planner’s ability to combine multiple indexes via bitmap scans. But the starting point is: think about how the index will be read, not what columns are available to throw at it.
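As a rough illustration of the three rules, assuming an orders table with id, customer_id, and created_at columns (the index name is hypothetical):

```sql
-- Equality column first, range column second.
CREATE INDEX orders_customer_created_idx
    ON orders (customer_id, created_at);

-- Served well: equality on the leading column, then a range on the next.
SELECT id FROM orders
WHERE customer_id = 42 AND created_at > '2026-01-01';

-- Served well: the leading column on its own.
SELECT id FROM orders
WHERE customer_id = 42;

-- Not served well: the leading column is missing, so the engine cannot
-- seek to a starting point and has to read far more of the index
-- (or ignore it and scan the table).
SELECT id FROM orders
WHERE created_at > '2026-01-01';
```

Running EXPLAIN on each of the three is the quickest way to confirm which ones the planner actually routes through the index on a given dataset.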

MySQL clustered indexes flip the rule

The above applies cleanly to secondary indexes. The MySQL InnoDB primary key is a different animal: a clustered index, meaning the PK’s leaf pages are the table. The ordering of PK columns decides physical row order on disk, and that often matters more than selectivity.

The canonical example is PRIMARY KEY (tenant_id, id) on a multi-tenant table. tenant_id has maybe 10K distinct values (low selectivity); id is near-unique. By “most selective first,” the answer would be (id, tenant_id), and it would be wrong:

Physical clustering. All rows for one tenant sit contiguously in the B-tree. Tenant-scoped range scans read a narrow slice of pages sequentially, and the buffer pool caches a single tenant’s hot data together. (id, tenant_id) scatters that same tenant’s rows across the whole table.
Secondary index lookups cost less. InnoDB secondary indexes store the PK, not a row pointer. A query that uses a secondary index and then needs a full row does a PK lookup per match. With (tenant_id, id), those lookups for one tenant cluster together. With (id, tenant_id), each is random I/O across the table.
Insert locality. If id is monotonically increasing within a tenant, inserts land on recent pages per tenant, avoiding page splits scattered across the index.

The rule for an InnoDB PK is: put the column that represents the dominant access pattern first, even if it’s less selective. Selectivity cuts rows; clustering cuts I/O. On a large clustered index, I/O usually dominates.

This is also why PRIMARY KEY (id) plus INDEX (tenant_id) on a multi-tenant table is often slower than PRIMARY KEY (tenant_id, id); the secondary index forces a PK-lookup hop on every read that the clustered choice avoids entirely.
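A minimal sketch of the tenant-first layout, using a hypothetical events table. One InnoDB detail worth labelling: an AUTO_INCREMENT column has to be the leading column of some index, so the sketch adds a separate unique key on id.

```sql
-- Hypothetical multi-tenant table clustered by tenant.
CREATE TABLE events (
  tenant_id  INT UNSIGNED    NOT NULL,
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  created_at DATETIME        NOT NULL,
  payload    JSON,
  PRIMARY KEY (tenant_id, id),     -- clustering: one tenant's rows sit together
  UNIQUE KEY uq_events_id (id)     -- required so AUTO_INCREMENT has a leading index
) ENGINE=InnoDB;

-- Tenant-scoped range scan: reads one contiguous slice of the clustered index.
SELECT id, created_at
FROM events
WHERE tenant_id = 7 AND id > 1000000
ORDER BY id
LIMIT 100;
```

The extra unique key costs one more index to maintain on insert, which is the price of keeping a table-wide auto-increment id alongside tenant-first clustering.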

PostgreSQL’s primary key is a separate B-tree unique index, not clustered (a CLUSTER command exists but isn’t maintained as rows are inserted), so the ordering logic there stays closer to the secondary-index rules.

The planner doesn’t read the data - it reads statistics

The planner’s entire decision-making process rests on statistics that summarize the data, not the data itself. PostgreSQL’s per-column statistics live in pg_stats:

SELECT
 attname,
 n_distinct, -- estimated distinct values (negative means fraction)
 null_frac, -- fraction of rows that are NULL
 most_common_vals, -- top values by frequency
 most_common_freqs, -- corresponding frequencies
 histogram_bounds -- distribution of non-common values
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';

MySQL exposes similar information through information_schema.STATISTICS and INNODB_TABLESTATS, though less granularly than PostgreSQL’s statistics. MySQL has no per-column histograms before 8.0; 8.0+ supports them, but only for columns where one has been created explicitly with ANALYZE TABLE ... UPDATE HISTOGRAM.
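On the MySQL side, the rough equivalents look like this. The orders table and its status column are assumptions for illustration, and the bucket count is arbitrary.

```sql
-- Index-level cardinality estimates (MySQL).
SELECT INDEX_NAME, SEQ_IN_INDEX, COLUMN_NAME, CARDINALITY
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'orders';

-- MySQL 8.0+: build a histogram on one column explicitly.
ANALYZE TABLE orders UPDATE HISTOGRAM ON status WITH 32 BUCKETS;

-- Inspect the stored histogram (a JSON document).
SELECT COLUMN_NAME, HISTOGRAM
FROM information_schema.COLUMN_STATISTICS
WHERE SCHEMA_NAME = DATABASE()
  AND TABLE_NAME = 'orders';
```

A histogram built this way is a snapshot: it is refreshed only when the ANALYZE TABLE ... UPDATE HISTOGRAM statement is run again.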

These statistics are gathered by ANALYZE in PostgreSQL (run by hand or by the autovacuum daemon once enough rows have changed) and recalculated automatically by InnoDB in MySQL when a large enough share of the table changes. Either way, they go stale between runs. A table that was analyzed at 10M rows and is now 200M rows has planner statistics that no longer reflect reality. Join reorderings based on those estimates are decisions made on outdated data.
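One way to check how stale a table’s statistics are, and to refresh them, is sketched below. The table name orders is an assumption.

```sql
-- PostgreSQL: when were stats last gathered, and how many rows
-- have changed since?
SELECT relname,
       n_live_tup,
       n_mod_since_analyze,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- PostgreSQL: refresh on demand.
ANALYZE orders;

-- MySQL: the on-demand equivalent for InnoDB's index statistics.
ANALYZE TABLE orders;
```

A large n_mod_since_analyze relative to n_live_tup suggests the planner is working from an old picture of the table.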

The usual symptom is a query that was fast yesterday and slow today, with no schema or query change. The planner’s row estimate for some step has drifted far enough from reality that the plan shape flipped: nested loop where it should have been hash join, or a sequential scan where an index seek would have won. EXPLAIN ANALYZE with its estimated-vs-actual row counts is the fastest way to confirm this:

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- Index Scan using orders_customer_id_idx
-- (cost=0.43..1234.56 rows=1000 width=128)
-- (actual time=0.123..45.678 rows=180000 loops=1)

The rows=1000 is the estimate. The actual rows=180000 is reality. That is a 180x miss; a ratio of 10x or more between the two is the signal. The fix is statistical (refresh stats, increase the statistics target for that column, add extended statistics for correlated columns) and not a query rewrite.
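In PostgreSQL, raising the statistics target for a single column looks roughly like this. The value 1000 is an illustrative choice, not a recommendation; the default target is 100.

```sql
-- Sample this column in more detail (more most-common values and
-- histogram buckets), then regather statistics so the change takes effect.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;

-- Re-check the estimate against reality afterwards.
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
```

A higher target makes ANALYZE slower and planning slightly more expensive, so it is usually applied to the few skewed columns that need it and left at the default elsewhere.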

Cardinality estimation errors and their shape

The single most common cause of bad query plans in production is a bad row-count estimate on an intermediate step. Two flavors, each with distinctive symptoms:

Underestimates. The planner thinks a step will return 10 rows; it actually returns 10 million. The plan picks a nested loop (good for a small outer side), which now runs 10 million iterations. A query that should have been a 50ms hash join takes 50 minutes. The telltale sign in EXPLAIN ANALYZE is loops=10000000 on an inner node that was costed for a handful.

Overestimates. The planner thinks 10 million rows; there are actually 10. The plan prepares for millions: it scans and hashes a whole table, or sorts for a merge join, where a nested loop with an index lookup would have touched a handful of rows, and runs a 5ms lookup in 5 seconds. Less common but more insidious, because the query didn’t “fail” in any obvious way; it just used more memory and I/O than it needed.

Both are failures of the statistics, not the query. Both are especially hard to diagnose because the query text is identical in the fast and slow cases; only the planner’s belief about the data changed. When the ratio between estimated and actual is large and consistent, the problem is upstream of the query.

Correlated columns break the independence assumption

The planner estimates the selectivity of a compound predicate WHERE a = x AND b = y by multiplying the individual selectivities, assuming the columns are statistically independent. When they’re not, the estimate can be off by orders of magnitude.

The canonical example is (country, state):

EXPLAIN ANALYZE
SELECT * FROM addresses WHERE country = 'US' AND state = 'CA';
-- Estimate: (0.25) * (0.02) * N = 0.5% of rows
-- Reality: ~2% of rows - state = 'CA' implies country = 'US'

The planner assumed the two filters cut the rowcount independently. In reality, state = 'CA' already determines country = 'US' (there are no California rows with a different country) so the compound filter isn’t as selective as the multiplication suggests.
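The gap can be measured directly before reaching for a fix. The query below is a PostgreSQL sketch against the same addresses table; it scans the whole table, so it is meant for a one-off check rather than a hot path.

```sql
-- Compare the real combined selectivity with what independence predicts.
-- A boolean cast to int is 1 or 0, so avg() yields the matching fraction.
SELECT
  avg((country = 'US')::int)                  AS frac_country,
  avg((state = 'CA')::int)                    AS frac_state,
  avg((country = 'US')::int)
    * avg((state = 'CA')::int)                AS frac_if_independent,
  avg((country = 'US' AND state = 'CA')::int) AS frac_actual
FROM addresses;
```

When frac_actual is several times larger (or smaller) than frac_if_independent, the columns are correlated enough to mislead the planner.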

PostgreSQL 10+ supports extended statistics to fix this:

CREATE STATISTICS country_state_corr (dependencies, ndistinct)
 ON country, state FROM addresses;
ANALYZE addresses;

The dependencies statistic captures functional dependencies (one column determines another) and is what corrects the independence-assumption multiplication for equality predicates in WHERE. The ndistinct statistic captures the number of distinct combinations of the column set, which improves group-count estimates when the query groups by those columns.
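Once ANALYZE has run, the gathered values can be inspected. This sketch assumes PostgreSQL 12 or later, where the pg_stats_ext view is available, and the statistics object name used above.

```sql
-- What the planner learned about the (country, state) pair.
SELECT statistics_name,
       attnames,
       kinds,
       n_distinct,
       dependencies
FROM pg_stats_ext
WHERE statistics_name = 'country_state_corr';

-- Then confirm the estimate moved closer to the actual row count.
EXPLAIN ANALYZE
SELECT * FROM addresses WHERE country = 'US' AND state = 'CA';
```

A dependency degree close to 1 for state on country is the catalog’s way of saying that one column almost fully determines the other.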

MySQL has no equivalent. Correlated-column estimation errors there are harder to fix at the planner level; the workaround is usually to restructure the query (force a specific join order, introduce an intermediate CTE, or add a covering index that captures the correlated access pattern directly).

UNIQUE as a planner signal, not just a guardrail

A UNIQUE constraint is also a proof the planner can use. Knowing a column is unique lets the optimizer reason about the shape of joins and aggregates in ways it can’t when uniqueness is only implicit:

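Deduplication elimination. SELECT DISTINCT o.id FROM orders o JOIN users u ON u.id = o.user_id can skip the DISTINCT step if the optimizer knows both orders.id and users.id are unique: each order matches at most one user, so the join cannot repeat an o.id and the DISTINCT becomes a no-op. (The reverse query, SELECT DISTINCT u.id over the same join, still needs its DISTINCT, because a user with several orders appears once per order.) Not every engine applies this rewrite, and none can prove it safe without the declared uniqueness.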
Join elimination. When joining A to B on a unique column of B, and selecting only columns from A, the planner can drop the join entirely in some cases (it proved the join doesn’t change the output). This is a real optimization on star-schema queries.
Reorderable joins. Unique constraints make certain join orderings provably equivalent, giving the optimizer more plan shapes to choose from. The more plans it can try, the more likely it finds a good one.
Index-only scan eligibility. Unique indexes are natural targets for index-only scans, which skip the heap/table access when every column the query needs is already in the index.

Schemas that leave uniqueness implicit (enforced in application code, promised in a wiki) can still produce correct results, but the planner can’t trust assumptions it can’t see. The constraint is what turns uniqueness from a property of the data into a property of the schema that the planner reads as a fact.
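Join elimination is the easiest of these to observe. The sketch below uses PostgreSQL, which applies it to LEFT JOINs; the two tables are hypothetical and pared down to the columns that matter.

```sql
-- Hypothetical schema: users.id is declared unique via the primary key.
CREATE TABLE users (
  id    bigint PRIMARY KEY,
  email text NOT NULL
);

CREATE TABLE orders (
  id      bigint PRIMARY KEY,
  user_id bigint REFERENCES users (id),
  status  text NOT NULL
);

-- No column of users is selected, and the unique key guarantees the
-- LEFT JOIN can neither add nor remove order rows, so the planner is
-- free to drop the join. The plan should mention only orders.
EXPLAIN
SELECT o.id, o.status
FROM orders o
LEFT JOIN users u ON u.id = o.user_id;
```

Dropping the PRIMARY KEY from users (leaving id as a plain column) and rerunning the EXPLAIN shows the join coming back, since the planner can no longer prove it is harmless.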

What UNIQUE tells a schema-reading model

The planner isn’t the only consumer of declared uniqueness. Schema-reading assistants (Copilot, MCP-backed agents, text-to-SQL tools) read information_schema.TABLE_CONSTRAINTS and pg_constraint the same way they read column types. A declared UNIQUE is the only signal in the catalog that says “at most one row per X.” Without it, the model has no way to prove 1:1 semantics and either hedges with a defensive LIMIT 1 it can’t justify or writes GROUP BY / DISTINCT passes that shouldn’t be necessary. ON CONFLICT and ON DUPLICATE KEY UPDATE targeting is especially fragile: the model picks the column name that matches the prompt (“upsert by email”) and the query either fails at runtime because no unique constraint exists on that column, or silently targets a different constraint than intended.

Selectivity is the part the model has even less access to. Planner statistics (pg_stats.n_distinct, MySQL’s information_schema.STATISTICS cardinality estimates) aren’t part of the prompt for most schema-aware tools, and the model has no way to query them mid-generation. Asked “how do I speed this query up?” the assistant’s default answer is “add an index,” regardless of whether the indexed column has two distinct values or two million. The same schema discipline that keeps the planner honest (declared unique constraints on every at-most-one relationship, composite primary keys on bridge tables, column-level comments that describe the value shape) is what gives catalog-reading models enough context to produce queries that don’t require a second human pass.

Diagnosing the usual suspects

Three patterns cover most of the uniqueness/selectivity-shaped bugs in production:

“This query got slow and nothing changed.” Run EXPLAIN ANALYZE. Compare estimated to actual row counts on each node. A large ratio (10x+) means the planner has stale statistics, missing extended statistics on correlated columns, or both. Refresh stats with ANALYZE; add extended statistics if a compound predicate is the source.

“I built an index and the planner ignores it.” Check how much of the table the predicate matches; for an evenly distributed column that is roughly one over the number of distinct values. When a lookup returns more than a few percent of the rows (the exact threshold depends on row width, caching, and cost settings), a sequential scan is usually the right choice and the planner isn’t wrong. If the predicate is selective, check for functions in the WHERE clause (non-SARGable predicates), implicit type casts (in MySQL, an indexed VARCHAR column compared with a numeric literal falls off the index), or stale statistics underreporting the column’s uniqueness.

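“My UPSERT corrupts data under load.” Check which unique key it’s targeting. In MySQL, ON DUPLICATE KEY UPDATE fires on conflict with any unique key, including ones added after the query was written. In PostgreSQL, the conflict target has to match an existing unique index, predicate included for a partial one; a mismatch raises an error rather than corrupting data, so look instead for rows that fall outside the partial index’s predicate (they never conflict and are simply inserted) and for conflicts on a different unique index than the one named, which surface as unique-violation errors.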
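For the ignored-index case, two quick checks cover most of the ground. Both are sketches against an assumed orders table with status and created_at columns; the first uses PostgreSQL’s boolean-to-integer cast.

```sql
-- 1. What fraction of the table does the predicate actually match?
--    If this is more than a few percent, the sequential scan is
--    probably the right plan.
SELECT avg((status = 'shipped')::int) AS fraction_matched
FROM orders;

-- 2. Is a function hiding the indexed column from the planner?
--    Non-SARGable: the index on created_at cannot be used for a seek.
SELECT id FROM orders
WHERE date(created_at) = '2026-01-01';

--    SARGable rewrite: same rows, expressed as a range on the bare column.
SELECT id FROM orders
WHERE created_at >= '2026-01-01'
  AND created_at <  '2026-01-02';
```

The half-open range (>= start, < next day) avoids having to guess the last representable timestamp of the day.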

The mental model

Uniqueness and selectivity collapse to two questions that both the planner and the engineer need answered for every table and query:

How many rows per key? Uniqueness. Determines whether joins multiply, whether UPSERTs target the right constraint, and whether aggregations can be trusted.
How many distinct values relative to total? Selectivity. Determines whether indexes help, which join order the planner picks, and how badly a compound filter will miss.

Both answers are visible to the planner if the constraints are declared and the statistics are current. Both become guesswork when they’re not. The habit that pays off isn’t heroic query tuning. It’s keeping the database’s model of the data honest: declare the unique constraints that exist (including composite ones on bridge tables), refresh statistics on busy tables, add extended statistics where correlation has burned you before, and read EXPLAIN ANALYZE for the ratio between estimated and actual rows every time a query slows down.
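Declaring a constraint that the data already violates will fail, so the first step on an existing bridge table is usually a duplicate check. The user_roles table and the constraint name below are hypothetical.

```sql
-- Rows per key: any result here means the "unique" pair isn't unique yet.
SELECT user_id, role_id, count(*) AS copies
FROM user_roles
GROUP BY user_id, role_id
HAVING count(*) > 1;

-- Once the query above returns nothing, make the guarantee part of
-- the schema so the planner (and every other catalog reader) can see it.
ALTER TABLE user_roles
  ADD CONSTRAINT user_roles_user_role_uniq UNIQUE (user_id, role_id);
```

Both statements work as written on PostgreSQL and MySQL; on a large, busy table the ALTER is worth scheduling with the same care as any other index build.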
