A schema that looks clean on paper can hide two categories of problem that only surface under load. The first is uniqueness: a UNIQUE constraint that should exist and doesn’t, letting duplicates quietly accumulate until a query downstream multiplies rows or an UPSERT silently does the wrong thing. The second is selectivity: an index that looked reasonable at design time but the planner never uses, or a join order the planner picks based on stale statistics that no longer reflect the data.
These two ideas are tightly coupled. Uniqueness is the shape of the data — how many distinct rows per key. Selectivity is what the planner knows about that shape. Get them right and queries are correct and fast. Get them wrong and the symptoms show up as slow queries, duplicate rows, and aggregations that don’t tie out.
Uniqueness: what the database guarantees
A UNIQUE constraint is the only mechanism the database has to guarantee that no two rows share the same value (or combination of values) for a given column or column set. Without it, the application layer can promise uniqueness and still fail to deliver it — every direct write path (backfill script, migration, admin UI, new microservice) is a potential source of duplicates.
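Consider a bridge table mapping users to roles. A minimal sketch (the users and roles tables are hypothetical):

```sql
-- the composite primary key is the uniqueness guarantee
CREATE TABLE user_roles (
    user_id    BIGINT NOT NULL REFERENCES users (id),
    role_id    BIGINT NOT NULL REFERENCES roles (id),
    granted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, role_id)  -- no duplicate (user_id, role_id) pairs, ever
);
```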
The composite primary key on user_roles is doing real work. Without it, an application bug that inserts the same (user_id, role_id) pair twice would silently create a duplicate — and any query joining through user_roles would multiply rows. This is precisely the row multiplication trap covered in Joins That Lie: the bridge table’s composite uniqueness is what prevents the cardinality explosion on the join.
The write-time cost of a unique constraint is negligible — a lookup against an index the database already maintains for the constraint. The diagnostic cost of a duplicate that shipped to production and propagated downstream is measured in days.
Partial and filtered unique indexes
PostgreSQL supports partial unique indexes, which enforce uniqueness only for rows matching a predicate. This is the right tool for one of the most common real-world cases: “email must be unique among active users.”
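A sketch, assuming a users table with a soft-delete deleted_at column:

```sql
-- uniqueness is enforced only among rows matching the predicate
CREATE UNIQUE INDEX uq_users_active_email
    ON users (email)
    WHERE deleted_at IS NULL;
```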
Without the partial index, a deleted user’s row would block re-registration with the same email. A plain UNIQUE (email) constraint forces a trade-off: either scrub the email from deleted rows to free it up (and lose the historical record) or keep it and block re-registration (and frustrate users). The partial index lets both goals coexist — uniqueness where it matters, flexibility where it doesn’t.
MySQL doesn’t support partial unique indexes directly. The workaround is a generated column that’s NULL for rows the uniqueness shouldn’t apply to, exploiting the fact that UNIQUE treats NULLs as distinct (in most engines — a quirk covered in NULL: Three-Valued Logic).
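One way to write it, assuming the same users shape as above:

```sql
-- active_email is NULL for soft-deleted rows; the unique key therefore
-- ignores them, because MySQL's UNIQUE permits any number of NULLs
ALTER TABLE users
    ADD COLUMN active_email VARCHAR(255)
        GENERATED ALWAYS AS (CASE WHEN deleted_at IS NULL THEN email END) STORED,
    ADD UNIQUE KEY uq_users_active_email (active_email);
```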
The pattern is awkward, but it works because MySQL’s UNIQUE allows multiple NULLs. The constraint effectively fires only for rows where the generated column is non-NULL, which is exactly the partial-index semantics.
UPSERT correctness depends on uniqueness
INSERT ... ON CONFLICT (PostgreSQL) and INSERT ... ON DUPLICATE KEY UPDATE (MySQL) only work if the conflict target corresponds to an actual unique constraint.
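Both forms, sketched against a hypothetical users table with a plain UNIQUE constraint on email:

```sql
-- PostgreSQL: the conflict target must correspond to a real unique constraint
INSERT INTO users (email, display_name)
VALUES ('ada@example.com', 'Ada')
ON CONFLICT (email)
DO UPDATE SET display_name = EXCLUDED.display_name;

-- MySQL: fires on ANY unique key that conflicts, not a named one;
-- VALUES() refers to the value that would have been inserted
INSERT INTO users (email, display_name)
VALUES ('ada@example.com', 'Ada')
ON DUPLICATE KEY UPDATE display_name = VALUES(display_name);
```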
The engine doesn’t invent uniqueness — it matches the ON CONFLICT target against defined constraints. If the constraint doesn’t exist, the UPSERT fails (PostgreSQL) or does something surprising (MySQL’s ON DUPLICATE KEY UPDATE applies to any unique key, which can be different from the one the query author had in mind).
This is especially fragile with partial unique indexes. In PostgreSQL, ON CONFLICT (email) matches a plain unique index on email but does not match a partial unique index with a WHERE deleted_at IS NULL clause — you need ON CONFLICT (email) WHERE deleted_at IS NULL to target the partial index specifically.
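A sketch targeting the partial index from earlier:

```sql
-- the WHERE clause must match the partial index's predicate
INSERT INTO users (email, display_name)
VALUES ('ada@example.com', 'Ada')
ON CONFLICT (email) WHERE deleted_at IS NULL
DO UPDATE SET display_name = EXCLUDED.display_name;
```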
The broader point: UPSERT semantics are tied to the exact shape of the unique constraints. Changing or removing a constraint can silently change the behavior of queries that reference it, often in ways that only surface under a specific sequence of writes.
Missing unique constraints cause silent duplicates
A constraint missing from the schema is a duplicate waiting to happen. The failure mode is predictable: a column that should be unique (username, external ID, composite bridge key) accumulates duplicates over months or years; some dashboards start showing inflated counts; when someone finally notices, adding the constraint requires a cleanup migration because the duplicates already exist.
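A sketch of the retrofit, assuming duplicates keyed on username and an id column to break ties:

```sql
-- step 1: delete duplicates, keeping the lowest id per username (PostgreSQL)
DELETE FROM users a
USING users b
WHERE a.username = b.username
  AND a.id > b.id;

-- step 2: only now can the constraint be added
ALTER TABLE users
    ADD CONSTRAINT uq_users_username UNIQUE (username);
```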
The migration becomes a multi-step project: find the duplicates, decide which to keep, merge or delete the rest, update any downstream systems that depend on the duplicates, then add the constraint. It’s also a correctness audit — every system that touched this data while duplicates existed is suspect.
The fix, as always, is to add the constraint on day one and let the database reject duplicates at write time. Retrofitting is a second-class option.
Selectivity: what makes an index useful
Selectivity is the ratio of distinct values to total rows in a column. A column with two distinct values (a boolean flag) has selectivity close to zero. A column with unique values has selectivity of one. The planner uses this number — specifically, its estimate of this number — to decide whether to use an index, which index to use, and in what order to join tables.
The intuition: if a filter narrows the row count down to a small fraction of the table, an index lookup is cheap. If the filter matches most of the table, a sequential scan is cheaper because reading pages in order beats jumping around via the index.
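The difference shows up directly in the plan. A sketch against a hypothetical orders table (the planner’s actual choice depends on the data distribution):

```sql
-- selective predicate: few matching rows, expect an Index Scan
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- low-selectivity predicate: most rows match, expect a Seq Scan
EXPLAIN SELECT * FROM orders WHERE status = 'shipped';
```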
The break-even point is engine- and hardware-dependent, but the rule of thumb is that an index starts to lose to a sequential scan once the filter matches somewhere between 5% and 20% of the table. Below that, the index wins. Above that, the planner will usually skip the index entirely — even if you wrote it specifically for that query.
This is why an index on a low-cardinality column like is_deleted is rarely useful on its own — the planner will scan anyway. An index on (is_deleted, created_at) can be useful for queries filtering on both, because the composite lets the planner narrow more aggressively.
Composite index column ordering
For composite indexes, the order of columns matters — it’s a selectivity decision that depends on how the index will be used. The general rules:
- Put equality filters before range filters. An index on (customer_id, created_at) is efficient for WHERE customer_id = 42 AND created_at > '2026-01-01'. The reverse — (created_at, customer_id) — forces the planner to scan a wide range of created_at values and then filter. (A sketch follows this list.)
- Put the more selective column first for equality-only filters. If both columns are filtered with equality, put the column with more distinct values first to maximize the narrowing on the first lookup.
- Match the index order to the query’s filter order when possible. A query that always filters by (a, b, c) benefits from an index in that exact order. Reordering the index means the leading columns aren’t being filtered, which often defeats the index entirely.
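A sketch of the first rule, assuming the same hypothetical orders table:

```sql
-- equality column leads, range column follows
CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at);

-- served efficiently: the index narrows to one customer_id,
-- then range-scans created_at within that slice
SELECT *
FROM orders
WHERE customer_id = 42
  AND created_at > '2026-01-01';
```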
These aren’t strict rules — they interact with cardinality estimates, sort order requirements, and covering index considerations. But they’re the right starting point. The wrong starting point is “put all the filter columns in some order” without thinking about how the index will actually be used.
Planner statistics drive everything
The planner doesn’t read the data — it reads statistics about the data and makes decisions from those. PostgreSQL tracks per-column statistics in pg_stats:
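A sketch of what it holds (orders and status are hypothetical names):

```sql
SELECT attname,          -- column name
       n_distinct,       -- estimated distinct count (negative = fraction of rows)
       null_frac,        -- estimated fraction of NULLs
       most_common_vals  -- the values the planner thinks dominate
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';
```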
MySQL surfaces similar information through information_schema.STATISTICS and information_schema.INNODB_TABLESTATS, though less comprehensively than PostgreSQL’s stats.
These statistics are gathered by ANALYZE (PostgreSQL) or maintained automatically by InnoDB (MySQL). They go stale between runs. A table that was 10 million rows when last analyzed and is now 200 million rows has planner statistics that no longer reflect reality — and the planner is making decisions based on the outdated picture.
The usual symptom of stale statistics: a query that was fast yesterday is slow today, even though nothing changed. The planner’s row estimate for some step is off by orders of magnitude, so it picks a nested loop where it should have picked a hash join, or scans a whole table it should have probed through an index.
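The cheapest first response is to refresh the statistics and re-check. A sketch, in PostgreSQL syntax against the hypothetical orders table:

```sql
ANALYZE orders;  -- re-sample the table and rebuild its statistics

-- then compare estimates to reality again
EXPLAIN ANALYZE
SELECT * FROM orders WHERE created_at > now() - interval '1 day';
```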
EXPLAIN ANALYZE comparing estimated to actual row counts is one of the highest-leverage diagnostic tools in SQL. When the ratio is out of whack, the plan is wrong for the data, and the fix is usually statistical (refresh stats, add extended statistics, increase statistics target) rather than rewriting the query.
Cardinality estimation errors
The single most common cause of bad query plans in production is a bad row-count estimate. Two flavors:
- Underestimates — the planner thinks a step will return 10 rows when it actually returns 10 million. It picks a nested loop (good for small outer side), which then runs 10 million times. What should have been a 50ms query takes 50 minutes.
- Overestimates — the planner thinks a step will return 10 million rows when it actually returns 10. It picks a hash join (good for large sets) that allocates buckets for 10 million entries and spills to disk. What should have been a 5ms query takes 5 seconds.
Both errors are failures of the statistics, not the query. And both are especially hard to diagnose because the query text looks identical in the fast and slow cases — only the underlying data (or the statistics about it) has changed.
Correlated columns break the independence assumption
The planner assumes columns are statistically independent when estimating the selectivity of a compound predicate: WHERE country = 'US' AND state = 'CA' is estimated as (fraction where country=US) × (fraction where state=CA). In reality, state = CA implies country = US — the columns are perfectly correlated, and the compound filter matches many more rows than the independence assumption predicts.
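The PostgreSQL remedy, sketched against a hypothetical addresses table:

```sql
-- declare that state values depend on country values (PostgreSQL 10+)
CREATE STATISTICS addr_country_state (dependencies)
    ON country, state FROM addresses;

-- extended statistics are populated on the next ANALYZE
ANALYZE addresses;
```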
Extended statistics (PostgreSQL 10+) let you declare that certain column combinations are correlated, giving the planner better estimates for compound filters. They’re not free — statistics take space and have to be maintained — but for a few specific high-traffic queries, they can turn a disastrous plan into a good one.
MySQL doesn’t have an equivalent facility. Correlated-column estimation errors are harder to fix at the planner level; the workaround is typically to restructure the query or add indexes that capture the correlated access pattern.
UNIQUE constraints help the planner
A UNIQUE constraint isn’t just a correctness tool — it’s also a signal to the planner. Knowing a column is unique lets the optimizer prove that a join on that column produces at most one match per input row, which enables specific plan optimizations:
- Deduplication elimination. If joining A to B on a unique B column, the planner knows the result set can’t have duplicates and can skip any deduplication step.
- Join reordering. A join on a unique column is guaranteed 1:1 or 1:0, which gives the planner more freedom to reorder joins without changing semantics. More options usually mean better plans.
- Index-only scan eligibility. Unique constraints are backed by unique indexes, which are often the right target for index-only scans.
The practical implication: schemas that declare their unique constraints let the planner do more. Schemas that rely on application-level uniqueness get correct results but may get worse plans, because the planner can’t trust assumptions it can’t see.
Diagnosing uniqueness and selectivity issues
Both problems show up in predictable ways. A short playbook:
“Why is this query slow?”
First, run EXPLAIN ANALYZE and compare estimated to actual row counts.
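A sketch (the join is hypothetical; the habit is reading the estimate against the measurement in every plan node):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'US';
-- each plan node shows rows=<estimate> alongside actual ... rows=<measured>
```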
If estimated and actual are within a factor of 2 or 3, the plan is reasonable and the slowness is elsewhere (missing index, wrong index, large result set). If they differ by 10x or more, the statistics are off or the columns are correlated. Refresh stats (ANALYZE), check for correlation, and re-examine the plan.
“Why isn’t my index being used?”
Check the column’s selectivity:
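A sketch, using PostgreSQL cast syntax and the hypothetical orders.status column:

```sql
-- distinct values / total rows; closer to 1.0 means more selective
SELECT COUNT(DISTINCT status)::numeric / COUNT(*) AS selectivity
FROM orders;
```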
If selectivity is below ~5%, the planner may be skipping the index in favor of a sequential scan — and that’s often the right choice. If selectivity is high and the index still isn’t being used, check for functions in the WHERE clause (non-SARGable predicates), type mismatches (implicit casts that prevent index use), or stale statistics.
“Why do I have duplicates?”
Find them:
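A sketch, assuming users.email should have been unique:

```sql
-- any group with more than one row is a duplicate
SELECT email, COUNT(*) AS copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```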
If duplicates exist, the schema is missing a unique constraint that should have been there. Adding the constraint requires cleaning up first, which is a separate migration. Going forward, every write path has to be considered — an application change that prevents new duplicates doesn’t help the duplicates that already slipped through other paths.
“Why is my UPSERT doing the wrong thing?”
Check which unique constraint the ON CONFLICT (or ON DUPLICATE KEY) target is matching against. In MySQL, ON DUPLICATE KEY UPDATE matches any unique key that conflicts — if a row has a unique constraint on both email and external_id, an insert that conflicts on either triggers the update. In PostgreSQL, ON CONFLICT (col) requires a specific named column or constraint and fails if one doesn’t match — which is annoying when writing the query but means the behavior is explicit.
The mental model
Uniqueness and selectivity map to two questions the planner and the engineer both need answered for every table and every query:
- How many rows per key? (Uniqueness: 1:1, 1:N, N:M.) This determines whether joins multiply, whether UPSERTs do the right thing, and whether aggregations can be trusted.
- How many distinct values relative to total? (Selectivity.) This determines whether indexes are useful and which join order the planner will pick.
Both questions are answerable from the schema if the constraints are declared and the statistics are current. Both questions become hand-waving if they’re not. Schemas that leave uniqueness implicit — enforced in application code, documented in a wiki, “known” by the senior engineers — are schemas where the database can’t help with correctness or performance. The constraints are the contract between the schema and everything that touches it.
The habit that pays off: declare every unique constraint that should exist, including the composite ones on bridge tables; keep planner statistics fresh on busy tables; use EXPLAIN ANALYZE to catch cardinality estimation errors early; and add extended statistics for the handful of correlated-column cases that show up in hot paths. None of this is glamorous. All of it compounds.
The cost of getting it wrong isn’t usually a crash. It’s a query that’s 100x slower than it should be, a report that’s silently inflated, or an UPSERT that corrupts data one weekend when traffic patterns shift. These are the failure modes that make databases feel unpredictable — and they’re almost all traceable back to the same two numbers.