A schema that looks clean on paper can hide two categories of problem that only surface under load. The first is uniqueness: a UNIQUE constraint that should exist and doesn’t, letting duplicates quietly accumulate until a query downstream multiplies rows or an UPSERT silently does the wrong thing. The second is selectivity: an index that looked reasonable at design time but the planner never uses, or a join order the planner picks based on stale statistics that no longer reflect the data.
These two ideas are tightly coupled. Uniqueness is the shape of the data — how many distinct rows per key. Selectivity is what the planner knows about that shape. Get them right and queries are correct and fast. Get them wrong and the symptoms show up as slow queries, duplicate rows, and aggregations that don’t tie out.
Uniqueness: what the database guarantees
A UNIQUE constraint is the only mechanism the database has to guarantee that no two rows share the same value (or combination of values) for a given column or column set. Without it, the application layer can promise uniqueness and still fail to deliver it — every direct write path (backfill script, migration, admin UI, new microservice) is a potential source of duplicates.
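Consider a bridge table mapping users to roles. A minimal sketch (the users and roles tables are hypothetical):

```sql
-- the composite primary key is the uniqueness guarantee
CREATE TABLE user_roles (
    user_id    BIGINT NOT NULL REFERENCES users (id),
    role_id    BIGINT NOT NULL REFERENCES roles (id),
    granted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, role_id)  -- no duplicate (user_id, role_id) pairs, ever
);
```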
The composite primary key on user_roles is doing real work. Without it, an application bug that inserts the same (user_id, role_id) pair twice would silently create a duplicate — and any query joining through user_roles would multiply rows. This is precisely the row multiplication trap covered in Joins That Lie: the bridge table’s composite uniqueness is what prevents the cardinality explosion on the join.
The write-time cost of a unique constraint is negligible — a lookup against an index the database already maintains for the constraint. The diagnostic cost of a duplicate that shipped to production and propagated downstream is measured in days.
Partial and filtered unique indexes
PostgreSQL supports partial unique indexes, which enforce uniqueness only for rows matching a predicate. This is the right tool for one of the most common real-world cases: “email must be unique among active users.”
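A sketch, assuming a users table with a soft-delete deleted_at column:

```sql
-- uniqueness is enforced only among rows matching the predicate
CREATE UNIQUE INDEX uq_users_active_email
    ON users (email)
    WHERE deleted_at IS NULL;
```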
Without the partial index, a deleted user’s row would block re-registration with the same email. A plain UNIQUE (email) constraint forces a trade-off: either scrub the email from deleted rows to free it up (and lose the historical record) or keep it and block re-registration (and frustrate users). The partial index lets both goals coexist — uniqueness where it matters, flexibility where it doesn’t.
MySQL doesn’t support partial unique indexes directly. The workaround is a generated column that’s NULL for rows the uniqueness shouldn’t apply to, exploiting the fact that UNIQUE treats NULLs as distinct (in most engines — a quirk covered in NULL: Three-Valued Logic).
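One way to write it, assuming the same users shape as above:

```sql
-- active_email is NULL for soft-deleted rows; the unique key therefore
-- ignores them, because MySQL's UNIQUE permits any number of NULLs
ALTER TABLE users
    ADD COLUMN active_email VARCHAR(255)
        GENERATED ALWAYS AS (CASE WHEN deleted_at IS NULL THEN email END) STORED,
    ADD UNIQUE KEY uq_users_active_email (active_email);
```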
The pattern is awkward, but it works because MySQL’s UNIQUE allows multiple NULLs. The constraint effectively fires only for rows where the generated column is non-NULL, which is exactly the partial-index semantics.
UPSERT correctness depends on uniqueness
INSERT ... ON CONFLICT (PostgreSQL) and INSERT ... ON DUPLICATE KEY UPDATE (MySQL) only work if the conflict target corresponds to an actual unique constraint.
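Both forms, sketched against a hypothetical users table with a plain UNIQUE constraint on email:

```sql
-- PostgreSQL: the conflict target must correspond to a real unique constraint
INSERT INTO users (email, display_name)
VALUES ('ada@example.com', 'Ada')
ON CONFLICT (email)
DO UPDATE SET display_name = EXCLUDED.display_name;

-- MySQL: fires on ANY unique key that conflicts, not a named one;
-- VALUES() refers to the value that would have been inserted
INSERT INTO users (email, display_name)
VALUES ('ada@example.com', 'Ada')
ON DUPLICATE KEY UPDATE display_name = VALUES(display_name);
```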
The engine doesn’t invent uniqueness — it matches the ON CONFLICT target against defined constraints. If the constraint doesn’t exist, the UPSERT fails (PostgreSQL) or does something surprising (MySQL’s ON DUPLICATE KEY UPDATE applies to any unique key, which can be different from the one the query author had in mind).
This is especially fragile with partial unique indexes. In PostgreSQL, ON CONFLICT (email) matches a plain unique index on email but does not match a partial unique index with a WHERE deleted_at IS NULL clause — you need ON CONFLICT (email) WHERE deleted_at IS NULL to target the partial index specifically.
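A sketch targeting the partial index from earlier:

```sql
-- the WHERE clause must match the partial index's predicate
INSERT INTO users (email, display_name)
VALUES ('ada@example.com', 'Ada')
ON CONFLICT (email) WHERE deleted_at IS NULL
DO UPDATE SET display_name = EXCLUDED.display_name;
```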
The broader point: UPSERT semantics are tied to the exact shape of the unique constraints. Changing or removing a constraint can silently change the behavior of queries that reference it, often in ways that only surface under a specific sequence of writes.
Missing unique constraints cause silent duplicates
A constraint missing from the schema is a duplicate waiting to happen. The failure mode is predictable: a column that should be unique (username, external ID, composite bridge key) accumulates duplicates over months or years; some dashboards start showing inflated counts; when someone finally notices, adding the constraint requires a cleanup migration because the duplicates already exist.
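A sketch of the retrofit, assuming duplicates keyed on username and an id column to break ties:

```sql
-- step 1: delete duplicates, keeping the lowest id per username (PostgreSQL)
DELETE FROM users a
USING users b
WHERE a.username = b.username
  AND a.id > b.id;

-- step 2: only now can the constraint be added
ALTER TABLE users
    ADD CONSTRAINT uq_users_username UNIQUE (username);
```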
The migration becomes a multi-step project: find the duplicates, decide which to keep, merge or delete the rest, update any downstream systems that depend on the duplicates, then add the constraint. It’s also a correctness audit — every system that touched this data while duplicates existed is suspect.
The fix, as always, is to add the constraint on day one and let the database reject duplicates at write time. Retrofitting is a second-class option.
Selectivity: what makes an index useful
Selectivity is the ratio of distinct values to total rows in a column. A column with two distinct values (a boolean flag) has selectivity close to zero. A column with unique values has selectivity of one. The planner uses this number — specifically, its estimate of this number — to decide whether to use an index, which index to use, and in what order to join tables.
The intuition: if a filter narrows the row count down to a small fraction of the table, an index lookup is cheap. If the filter matches most of the table, a sequential scan is cheaper because reading pages in order beats jumping around via the index.
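The difference shows up directly in the plan. A sketch against a hypothetical orders table (the planner’s actual choice depends on the data distribution):

```sql
-- selective predicate: few matching rows, expect an Index Scan
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- low-selectivity predicate: most rows match, expect a Seq Scan
EXPLAIN SELECT * FROM orders WHERE status = 'shipped';
```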
The break-even point is engine- and hardware-dependent, but the rule of thumb is that an index starts to lose to a sequential scan once the filter matches somewhere between 5% and 20% of the table. Below that, the index wins. Above that, the planner will usually skip the index entirely — even if you wrote it specifically for that query.
This is why an index on a low-cardinality column like is_deleted is rarely useful on its own — the planner will scan anyway. An index on (is_deleted, created_at) can be useful for queries filtering on both, because the composite lets the planner narrow more aggressively.
Composite index column ordering
For composite indexes, the order of columns matters — it’s a selectivity decision that depends on how the index will be used. The general rules:
- Put equality filters before range filters. An index on (customer_id, created_at) is efficient for WHERE customer_id = 42 AND created_at > '2026-01-01'. The reverse — (created_at, customer_id) — forces the planner to scan a wide range of created_at values and then filter. (A sketch follows this list.)
- Put the more selective column first for equality-only filters. If both columns are filtered with equality, put the column with more distinct values first to maximize the narrowing on the first lookup.
- Match the index order to the query’s filter order when possible. A query that always filters by (a, b, c) benefits from an index in that exact order. Reordering the index means the leading columns aren’t being filtered, which often defeats the index entirely.
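A sketch of the first rule, assuming the same hypothetical orders table:

```sql
-- equality column leads, range column follows
CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at);

-- served efficiently: the index narrows to one customer_id,
-- then range-scans created_at within that slice
SELECT *
FROM orders
WHERE customer_id = 42
  AND created_at > '2026-01-01';
```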
These aren’t strict rules — they interact with cardinality estimates, sort order requirements, and covering index considerations. But they’re the right starting point. The wrong starting point is “put all the filter columns in some order” without thinking about how the index will actually be used.
Planner statistics drive everything
The planner doesn’t read the data — it reads statistics about the data and makes decisions from those. PostgreSQL tracks per-column statistics in pg_stats:
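A sketch of what it holds (orders and status are hypothetical names):

```sql
SELECT attname,          -- column name
       n_distinct,       -- estimated distinct count (negative = fraction of rows)
       null_frac,        -- estimated fraction of NULLs
       most_common_vals  -- the values the planner thinks dominate
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';
```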
MySQL surfaces similar information through information_schema.STATISTICS and information_schema.INNODB_TABLESTATS, though less comprehensively than PostgreSQL’s stats.
These statistics are gathered by ANALYZE (PostgreSQL) or maintained automatically by InnoDB (MySQL). They go stale between runs. A table that was 10 million rows when last analyzed and is now 200 million rows has planner statistics that no longer reflect reality — and the planner is making decisions based on the outdated picture.
The usual symptom of stale statistics: a query that was fast yesterday is slow today, even though nothing changed. The planner’s row estimate for some step is off by orders of magnitude, so it picks a nested loop where it should have picked a hash join, or scans a whole table it should have probed through an index.
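The cheapest first response is to refresh the statistics and re-check. A sketch, in PostgreSQL syntax against the hypothetical orders table:

```sql
ANALYZE orders;  -- re-sample the table and rebuild its statistics

-- then compare estimates to reality again
EXPLAIN ANALYZE
SELECT * FROM orders WHERE created_at > now() - interval '1 day';
```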
EXPLAIN ANALYZE comparing estimated to actual row counts is one of the highest-leverage diagnostic tools in SQL. When the ratio is out of whack, the plan is wrong for the data, and the fix is usually statistical (refresh stats, add extended statistics, increase statistics target) rather than rewriting the query.
Cardinality estimation errors
The single most common cause of bad query plans in production is a bad row-count estimate. Two flavors:
- Underestimates — the planner thinks a step will return 10 rows when it actually returns 10 million. It picks a nested loop (good for small outer side), which then runs 10 million times. What should have been a 50ms query takes 50 minutes.
- Overestimates — the planner thinks a step will return 10 million rows when it actually returns 10. It picks a hash join (good for large sets) that allocates buckets for 10 million entries and spills to disk. What should have been a 5ms query takes 5 seconds.
Both errors are failures of the statistics, not the query. And both are especially hard to diagnose because the query text looks identical in the fast and slow cases — only the underlying data (or the statistics about it) has changed.
Correlated columns break the independence assumption
The planner assumes columns are statistically independent when estimating the selectivity of a compound predicate: WHERE country = 'US' AND state = 'CA' is estimated as (fraction where country=US) × (fraction where state=CA). In reality, state = CA implies country = US — the columns are perfectly correlated, and the compound filter matches many more rows than the independence assumption predicts.
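The PostgreSQL remedy, sketched against a hypothetical addresses table:

```sql
-- declare that state values depend on country values (PostgreSQL 10+)
CREATE STATISTICS addr_country_state (dependencies)
    ON country, state FROM addresses;

-- extended statistics are populated on the next ANALYZE
ANALYZE addresses;
```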
Extended statistics (PostgreSQL 10+) let you declare that certain column combinations are correlated, giving the planner better estimates for compound filters. They’re not free — statistics take space and have to be maintained — but for a few specific high-traffic queries, they can turn a disastrous plan into a good one.
MySQL doesn’t have an equivalent facility. Correlated-column estimation errors are harder to fix at the planner level; the workaround is typically to restructure the query or add indexes that capture the correlated access pattern.
UNIQUE constraints help the planner
A UNIQUE constraint isn’t just a correctness tool — it’s also a signal to the planner. Knowing a column is unique lets the optimizer prove that a join on that column produces at most one match per input row, which enables specific plan optimizations:
- Deduplication elimination. If joining A to B on a unique B column, the planner knows the result set can’t have duplicates and can skip any deduplication step.
- Join reordering. A join on a unique column is guaranteed 1:1 or 1:0, which gives the planner more freedom to reorder joins without changing semantics. More options usually mean better plans.
- Index-only scan eligibility. Unique constraints are backed by unique indexes, which are often the right target for index-only scans.
The practical implication: schemas that declare their unique constraints let the planner do more. Schemas that rely on application-level uniqueness get correct results but may get worse plans, because the planner can’t trust assumptions it can’t see.
Diagnosing uniqueness and selectivity issues
Both problems show up in predictable ways. A short playbook:
“Why is this query slow?”
First, run EXPLAIN ANALYZE and compare estimated to actual row counts.
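A sketch (the join is hypothetical; the habit is reading the estimate against the measurement in every plan node):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'US';
-- each plan node shows rows=<estimate> alongside actual ... rows=<measured>
```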
If estimated and actual are within a factor of 2 or 3, the plan is reasonable and the slowness is elsewhere (missing index, wrong index, large result set). If they differ by 10x or more, the statistics are off or the columns are correlated. Refresh stats (ANALYZE), check for correlation, and re-examine the plan.
“Why isn’t my index being used?”
Check the column’s selectivity:
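A sketch, using PostgreSQL cast syntax and the hypothetical orders.status column:

```sql
-- distinct values / total rows; closer to 1.0 means more selective
SELECT COUNT(DISTINCT status)::numeric / COUNT(*) AS selectivity
FROM orders;
```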
If selectivity is below ~5%, the planner may be skipping the index in favor of a sequential scan — and that’s often the right choice. If selectivity is high and the index still isn’t being used, check for functions in the WHERE clause (non-SARGable predicates), type mismatches (implicit casts that prevent index use), or stale statistics.
“Why do I have duplicates?”
Find them:
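A sketch, assuming users.email should have been unique:

```sql
-- any group with more than one row is a duplicate
SELECT email, COUNT(*) AS copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY copies DESC;
```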
If duplicates exist, the schema is missing a unique constraint that should have been there. Adding the constraint requires cleaning up first, which is a separate migration. Going forward, every write path has to be considered — an application change that prevents new duplicates doesn’t help the duplicates that already slipped through other paths.
“Why is my UPSERT doing the wrong thing?”
Check which unique constraint the ON CONFLICT (or ON DUPLICATE KEY) target is matching against. In MySQL, ON DUPLICATE KEY UPDATE matches any unique key that conflicts — if a row has a unique constraint on both email and external_id, an insert that conflicts on either triggers the update. In PostgreSQL, ON CONFLICT (col) requires a specific named column or constraint and fails if one doesn’t match — which is annoying when writing the query but means the behavior is explicit.
The mental model
Uniqueness and selectivity map to two questions the planner and the engineer both need answered for every table and every query:
- How many rows per key? (Uniqueness: 1:1, 1:N, N:M.) This determines whether joins multiply, whether UPSERTs do the right thing, and whether aggregations can be trusted.
- How many distinct values relative to total? (Selectivity.) This determines whether indexes are useful and which join order the planner will pick.
Both questions are answerable from the schema if the constraints are declared and the statistics are current. Both questions become hand-waving if they’re not. Schemas that leave uniqueness implicit — enforced in application code, documented in a wiki, “known” by the senior engineers — are schemas where the database can’t help with correctness or performance. The constraints are the contract between the schema and everything that touches it.
The habit that pays off: declare every unique constraint that should exist, including the composite ones on bridge tables; keep planner statistics fresh on busy tables; use EXPLAIN ANALYZE to catch cardinality estimation errors early; and add extended statistics for the handful of correlated-column cases that show up in hot paths. None of this is glamorous. All of it compounds.
The cost of getting it wrong isn’t usually a crash. It’s a query that’s 100x slower than it should be, a report that’s silently inflated, or an UPSERT that corrupts data one weekend when traffic patterns shift. These are the failure modes that make databases feel unpredictable — and they’re almost all traceable back to the same two numbers.