Anti-Patterns on EXPLAIN ANALYZE

Covering Index Traps: When Adding One Column Breaks Your Query

Wed, 22 Apr 2026 00:00:00 +0000

TL;DR

An index-only scan is the fastest way a relational database can answer a query — the engine reads the index and never touches the table. Adding a single column to the SELECT list that isn’t in the index silently breaks that optimization, and the query that ran in a millisecond now takes seconds on the same data. The fix isn’t “never SELECT extra columns” — it’s knowing that the SELECT list is part of the query’s performance contract with the index.

Here’s a query that ran in production for a year with sub-millisecond latency:

SELECT status, created_at FROM orders WHERE customer_id = 42;

The orders table has a composite index on (customer_id, status, created_at). Every column the query needs — customer_id for the filter, status and created_at for the output — is in that index. The database reads the index, returns the results, and never touches the table. This is an index-only scan: one of the most significant optimizations a relational engine makes, and the mechanism behind “covering” queries.

Then a feature request: “show the order total on this page.” The change looks trivial.

SELECT status, created_at, total_cents FROM orders WHERE customer_id = 42;

One column added. The query is still correct. The index still matches the filter. But total_cents isn’t in the index — so for every matching row, the engine now follows a pointer back to the table to fetch that one extra column. On a table with millions of rows, that’s a random I/O per match. The query that was 0.4 ms is now 1243 ms.

The obvious fix is “just don’t add columns to queries.” That doesn’t work — features need data. The slightly-less-obvious fix is “always project the minimum columns,” which is fine as advice and ignored in practice because every ORM defaults to SELECT *. The actual fix is to treat the SELECT list as part of the query’s performance contract with the index, and to know what that contract is before changing it.

What’s actually happening

The execution plan tells the whole story:

-- Before: index-only scan
EXPLAIN ANALYZE SELECT status, created_at FROM orders WHERE customer_id = 42;
-- Index Only Scan using idx_orders_cust_status_created on orders
-- Heap Fetches: 0
-- Execution Time: 0.4 ms

-- After: index scan + table lookups
EXPLAIN ANALYZE SELECT status, created_at, total_cents FROM orders WHERE customer_id = 42;
-- Index Scan using idx_orders_cust_status_created on orders
-- Execution Time: 1243.7 ms

Same index. Same filter. Same rows returned. The only difference is the select list, and it moves the query from a pure index walk to an index walk plus one random I/O per matching row.
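The same before-and-after can be reproduced on a laptop. The sketch below is an illustration, not the production setup: it uses SQLite through Python’s sqlite3 module as a stand-in engine (an assumption; the plans above are PostgreSQL), and the column types are guesses. SQLite labels the covered case COVERING INDEX in its EXPLAIN QUERY PLAN output.

```python
import sqlite3

# Stand-in schema: names follow the post; types and the SQLite engine are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        status      TEXT    NOT NULL,
        created_at  TEXT    NOT NULL,
        total_cents INTEGER NOT NULL
    );
    CREATE INDEX idx_orders_cust_status_created
        ON orders (customer_id, status, created_at);
""")


def plan(sql):
    """Return SQLite's plan for `sql` as one string (the detail text is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


covered = plan(
    "SELECT status, created_at FROM orders WHERE customer_id = 42"
)
uncovered = plan(
    "SELECT status, created_at, total_cents FROM orders WHERE customer_id = 42"
)

print(covered)    # mentions COVERING INDEX
print(uncovered)  # same index, but no longer covering
```

On an empty table the timings mean nothing; the point is only that the plan label changes when the select list does.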

Buffer pool pollution compounds the damage. When the engine fetches full rows from the table instead of reading compact index entries, it loads entire data pages into the buffer pool. Those pages — carrying every column of every matched row, most of which the query doesn’t need — evict pages that other queries do need. On a busy system with a finite buffer pool, one query losing its covering index degrades performance for unrelated queries across the database. The slow query you noticed is rarely the only thing getting slower.

Nothing in the query results tells you. The rows come back correctly. The response looks the same. A SELECT COUNT(*) returns the same count. The only place the degradation is visible is in the execution plan — and nobody checks the execution plan when the feature ships.

ORM defaults

Most ORMs emit SELECT * unless explicitly told otherwise. ActiveRecord needs .select(:id, :status); Django needs .only('id', 'status'); SQLAlchemy needs explicit column specification; Prisma needs an explicit select block. On a high-traffic table, a one-line change to project only the needed columns is one of the highest-leverage optimizations available. Worth checking what your ORM actually generates on the query paths that matter — the generated SQL is the contract, not the method call.

The fix: match the index or extend it

There are two workable fixes when a query loses its covering index, and they trade different costs:

Project only what the index covers. If the new column isn’t worth fetching from the table on every row, don’t fetch it. Split the query: one covered query for the list view, a targeted lookup for the detail row the user actually wants. Most feature requests that “need” an extra column on a list page are actually fine with lazy-loading the value on click.

Extend the index to include the new column. If the column is genuinely needed on every row, add it to the index, either as an additional indexed column or (in PostgreSQL 11 and later) as an INCLUDE clause that adds the value to the leaf pages without making it part of the B-tree ordering:

-- PostgreSQL: add total_cents as a non-key included column
CREATE INDEX idx_orders_cust_status_created_total
    ON orders (customer_id, status, created_at)
    INCLUDE (total_cents);

INCLUDE is the right tool when you need the column covered but don’t want it affecting the sort order or filter path. The trade-off is write cost: the index is now larger, and every update to total_cents has to update the index entry. On a write-heavy table that’s meaningful; on a read-heavy table it’s usually negligible compared to the read speedup.

MySQL (InnoDB) doesn’t support INCLUDE, but the same effect is available: every secondary index already carries the primary key at its leaves, so you cover additional columns by appending them to the index as regular key columns. When every referenced column is in the index, the optimizer reads from the index alone, and EXPLAIN reports Using index in the Extra column.
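A quick way to see the trailing-key-column approach restore coverage, again as a sketch under assumptions: SQLite (which also lacks INCLUDE) stands in for MySQL, the column types are guesses, and the index name reuses the one from the PostgreSQL example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        status      TEXT    NOT NULL,
        created_at  TEXT    NOT NULL,
        total_cents INTEGER NOT NULL
    );
    -- No INCLUDE here: total_cents is appended as a regular trailing key column.
    CREATE INDEX idx_orders_cust_status_created_total
        ON orders (customer_id, status, created_at, total_cents);
""")

QUERY = "SELECT status, created_at, total_cents FROM orders WHERE customer_id = 42"
plan_rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
extended_plan = " | ".join(row[-1] for row in plan_rows)

print(extended_plan)  # the three-column select list is covered again
```

The cost is the one described above: total_cents now takes part in the index ordering and every change to it rewrites an index entry.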

When covering isn’t the right call

Covering indexes aren’t a universal good. Three cases where chasing a covering index is the wrong move:

Low-selectivity filters. If customer_id = 42 matches 80% of the table, the planner will usually skip a non-covering index and run a sequential scan, because chasing most of the table through index pointers costs more than reading it straight through. Index-only scans pay off when the filter is selective. On a low-selectivity predicate, covering buys very little.

Write-heavy tables. Every index slows writes. A table taking 50,000 inserts per second with five secondary indexes already pays a real cost for every index entry. Adding a covering variant of an existing index to shave read latency from 15 ms to 3 ms is a bad trade if the table is write-dominated — the write penalty compounds on every row, and only the reads benefit.

Rapidly changing projections. If the feature team is adding and removing columns from the list view every sprint, chasing the covering index is a losing game. Freeze the list-view columns as a contract, document them in the schema, and let the index match that contract — or don’t bother indexing for coverage at all.

The bigger picture

The SELECT list is a performance contract that sits in most code reviewers’ blind spot. WHERE clauses get scrutinized because they’re obviously performance-relevant. JOINs get scrutinized because cardinality mistakes are visible. The SELECT list gets waved through because “it’s just what we display,” and then a one-column addition takes a query from 0.4 ms to 1243 ms with no code-review signal to catch it.

EXPLAIN ANALYZE is the only authority here. Reading execution plans isn’t glamorous, but it’s the difference between a query that works and a query that works at scale — and between a select-list change that’s free and one that silently broke the optimization the index existed to enable. On the queries that carry the most traffic, the execution plan belongs in code review alongside the query itself.

God Tables: 150 Columns and the Quiet Cost of 'Just Add a Column'

Wed, 22 Apr 2026 00:00:00 +0000

TL;DR

A wide table looks cheap because every column was added for a real reason — the expensive part is that rows grow, every write amplifies, and every secondary index inherits the bloat. The fix isn’t aggressive normalization (which trades one wide table for six-way joins on every read) but splitting by access pattern: columns read together stay together, rarely-touched columns move out.

The schema started clean four years ago: users(id, email, password_hash, created_at), four columns. Today the table is renamed customers and has 184 columns. Billing address. Shipping address. Three additional shipping addresses numbered 2 through 4. preferences_json for user settings. Twelve feature-flag TINYINTs. Three Stripe identifiers from three processor migrations. last_login_at, last_seen_at, last_purchase_at, last_notification_sent_at. Forty more columns whose meaning lives in Confluence, if anywhere. No single ALTER TABLE ADD COLUMN was unreasonable at the time. The accumulated result is an average row size of 6KB, an UPDATE to last_login_at that rewrites every byte of it, and a buffer pool holding two customer rows per 16KB page instead of forty.

The obvious fix is to normalize it — split into customer_profile, customer_billing, customer_addresses, customer_preferences, customer_feature_flags, customer_audit. That’s the textbook answer and it’s the one that breaks the moment you look at the dominant read. The list view on the admin page needs name, email, status, last login, Stripe status, and total spent — now it’s a six-way join on every page load. The fix that looked clean in the migration doc makes the most-frequent query more expensive, not less. The read cost moves to the place it’s paid most often, and somebody — usually a few months later — proposes a materialized view to “just flatten it back out,” which is the god table returning through a different door.

How a row-store actually reads a row

Before the cost math makes sense: OLTP engines like InnoDB and PostgreSQL’s heap store complete rows laid out contiguously on fixed-size pages — typically 16KB in InnoDB, 8KB in PostgreSQL. A page holds as many rows as fit. When a query needs one column of one row, the engine doesn’t read that column alone; it locates the row’s page via an index lookup or scan, loads the whole page into the buffer pool, and reads the requested column out of the in-memory row image.

The one exception is the index-only scan: if every column the query projects and filters on is already present inside an index, the base table doesn’t have to be touched and only the index pages are loaded. See Covering Index Traps for how quickly this optimization disappears — usually the moment a SELECT list grows by one column. Every other read path goes through the row, which means the row’s width sets the floor on how much data the engine moves per lookup. Reading email from a 184-column customer row loads 6KB into memory to return 50 bytes; reading the same column from an 800-byte row loads 800 bytes. The buffer pool is a fixed size and every byte of unused column data in it is displacing something another query needs.

Column stores (ClickHouse, BigQuery, Parquet-backed warehouses) invert this entirely — data is laid out by column, so reading one column reads only that column’s storage. The wide-table cost math doesn’t apply there, which is why this anti-pattern is specifically a row-store OLTP problem and why denormalized fact tables in analytical warehouses are fine at 300 columns.

What 150 columns actually costs

The individual cost of one column is negligible. The system-level cost shows up in several places at once, and none of them are visible in a diff that adds one more.

Row size and write amplification. InnoDB stores full rows on disk pages, and an UPDATE rewrites the entire row even if only one column changed. On a 184-column table averaging 6KB per row, updating last_login_at on every sign-in rewrites 6KB, not 8 bytes. PostgreSQL doesn’t rewrite in place — MVCC creates a new tuple for every UPDATE and marks the old one dead — but the new tuple is 6KB too, and VACUUM has that much more to reclaim. Either engine, the write cost per logical change scales with row width.

Buffer pool density. The page-per-read mechanism above means buffer-pool efficiency scales inversely with row width. At 6KB per row, an InnoDB 16KB page holds two rows; at 400 bytes per row it holds forty. A database with 10GB of buffer pool has the effective working set of a much smaller instance once rows get wide — queries that used to run hot start touching disk for no reason other than that the rows they cared about no longer fit in memory alongside the rows other queries cared about.
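The density arithmetic is simple enough to check directly. This is a back-of-the-envelope sketch that ignores page headers, row headers, and fill factor (all assumptions), so real numbers will be somewhat lower.

```python
def rows_per_page(page_bytes, row_bytes):
    """Whole rows that fit on one page, ignoring page and row overhead."""
    return page_bytes // row_bytes


def rows_cached(pool_bytes, page_bytes, row_bytes):
    """Rows a buffer pool can hold if it contains nothing but this table's pages."""
    return (pool_bytes // page_bytes) * rows_per_page(page_bytes, row_bytes)


INNODB_PAGE = 16 * 1024      # 16KB page
POOL = 10 * 1024 ** 3        # 10GB buffer pool

wide = rows_per_page(INNODB_PAGE, 6 * 1024)    # 2 rows at 6KB each
narrow = rows_per_page(INNODB_PAGE, 400)       # 40 rows at 400 bytes each

wide_cached = rows_cached(POOL, INNODB_PAGE, 6 * 1024)
narrow_cached = rows_cached(POOL, INNODB_PAGE, 400)

print(wide, narrow, narrow_cached // wide_cached)
```

Under these simplifications the same 10GB pool keeps twenty times as many narrow rows hot as wide ones.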

Secondary indexes inherit the width problem. Every secondary index in InnoDB carries a copy of the primary key at its leaves; every index entry is a key-columns + PK-copy record. A wide table tends to accumulate indexes: you index email, Stripe ID, last-login, phone, region, account-manager-ID, each for a different query path. Six secondary indexes on a 184-column table isn’t unusual, and each one adds its own pages to the buffer pool and its own maintenance to every write that touches an indexed column. Covering indexes are also harder to arrange: the list view wants eight columns projected, and indexing eight columns of a 184-column table to cover one query is an expensive trade.

Lock and transaction width. Every UPDATE acquires a row-level lock. Transactions that touch a wide row hold that lock for the duration of the transaction, and because the row spans many concerns — billing, preferences, audit timestamps — transactions from unrelated code paths contend on the same row. A background job updating last_seen_at now serializes against a billing job updating stripe_customer_id on the same customer, because both paths lock the same row. In the split-by-concern shape, they’d contend on different rows of different tables.

Schema migrations get more expensive. ALTER TABLE ADD COLUMN on a 184-column table is slower, holds metadata locks longer, and has a larger blast radius if it fails. MySQL’s online DDL is usually fine for NULL-default additions; PostgreSQL is generally fast for the same case. But any migration that needs to rewrite rows (changing a column type, adding NOT NULL with a backfill) scales with row size, and a 6KB row rewrite on 200 million rows is a different operation than an 800-byte row rewrite on the same count.

Every column is a commitment

The cost of adding a column is small and immediate. The cost of having 150 columns is systemic and deferred — buffer-pool density, index size, write amplification, lock contention, migration cost. None of the deferred costs are visible in the PR that adds one more column, which is why they accumulate uncorrected until the table is painful.

Why LLMs make this worse

Schema drift in the wide-table direction is what language models reinforce by default. A model generating ALTER TABLE for a feature request reads the current schema and proposes the smallest change that makes the feature work — which is almost always adding columns to the table that already holds the related data. Proposing a split requires understanding the access pattern, the transaction boundaries, and the write frequency of the new columns versus the existing ones. None of that is in the CREATE TABLE.

The loop reinforces itself: the wider the table gets, the more natural it is for the next change to widen it further. “Where do loyalty tier and tier expiry go?” The model sees customers has every other user-attached concept in it and adds two columns. The alternative — CREATE TABLE customer_loyalty (customer_id PK FK, tier, expires_at) — requires the model to argue for a split, and splits are rare in the training data compared to additions because splits are rare in real codebases for the same reason: they’re harder to ship than additions. The model is correctly pattern-matching on what humans actually do, which is exactly the problem.

ORMs compound this. One model equals one table is the default shape in ActiveRecord, Django ORM, Prisma, SQLAlchemy, and Ecto. Refactoring a Customer model into three co-owned tables is a change that touches every query, every serializer, every test. The ORM makes “add a column to the existing model” a five-line change and “split the model” a project. Engineers pick the cheap option every time, and the wide table ratchets.

Split by access pattern, not by concept

“Normalize it” isn’t the fix because normalization is a property of data shape, not query cost. The fix is to look at what columns are actually read and written together, and keep those co-located; the rest moves out.

A workable decomposition for the customers example:

Core hot table: the columns read on nearly every query: id, email, name, status, tier, stripe_customer_id, created_at. Maybe twenty columns. This is what the list view, the auth path, and most API responses need.
1:1 cold tables: concerns that are read rarely or in specific flows: customer_audit for login/seen/purchase timestamps, customer_preferences for user settings, customer_feature_flags for the twelve TINYINT flags. Each is a separate table with customer_id as PK and FK, joined only when the flow actually needs it. Writes to last_login_at stop rewriting the billing row.
1:N tables for repeating groups: addresses, payment methods, anything that was modeled as shipping_address_2, shipping_address_3, shipping_address_4 becomes an addresses table with a FK and a type. This collapses polymorphic-ish schema decisions that shouldn’t have been made at the column level in the first place; see Polymorphic References for the related pattern where doing this without a FK goes wrong.
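The three shapes above can be sketched as a schema. This is one possible layout, not a prescribed one: it runs on SQLite via Python for convenience (an assumption), the column types are guesses, and line1 plus the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Core hot table: what the list view and auth path read.
    CREATE TABLE customers (
        id                 INTEGER PRIMARY KEY,
        email              TEXT NOT NULL UNIQUE,
        name               TEXT NOT NULL,
        status             TEXT NOT NULL,
        tier               TEXT,
        stripe_customer_id TEXT,
        created_at         TEXT NOT NULL
    );
    -- 1:1 cold table: customer_id is both PK and FK.
    CREATE TABLE customer_audit (
        customer_id      INTEGER PRIMARY KEY REFERENCES customers(id),
        last_login_at    TEXT,
        last_seen_at     TEXT,
        last_purchase_at TEXT
    );
    -- 1:N table replacing shipping_address_2..4 (line1 is a hypothetical column).
    CREATE TABLE addresses (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        type        TEXT NOT NULL,
        line1       TEXT NOT NULL
    );
""")

conn.execute(
    "INSERT INTO customers VALUES "
    "(1, 'alice@example.com', 'Alice', 'active', 'gold', 'cus_123', '2025-01-01')"
)
conn.execute(
    "INSERT INTO customer_audit (customer_id, last_login_at) "
    "VALUES (1, '2025-06-01T09:00:00')"
)
conn.executemany(
    "INSERT INTO addresses (customer_id, type, line1) VALUES (?, ?, ?)",
    [(1, "billing", "1 Main St"), (1, "shipping", "2 Side St")],
)

# A sign-in now rewrites only the narrow audit row, not the core customer row.
conn.execute(
    "UPDATE customer_audit SET last_login_at = ? WHERE customer_id = ?",
    ("2025-06-02T08:30:00", 1),
)

# The cold path pays one PK-equals-FK join when it needs the timestamps.
audit_row = conn.execute(
    "SELECT c.email, a.last_login_at FROM customers c "
    "JOIN customer_audit a ON a.customer_id = c.id WHERE c.id = 1"
).fetchone()
address_types = [
    row[0]
    for row in conn.execute(
        "SELECT type FROM addresses WHERE customer_id = 1 ORDER BY type"
    )
]
```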

The trade-off is that some queries now join two or three tables instead of reading one. Most of the time this is fine: the joins are PK-equals-FK, the joined tables are narrow, and the read is usually cheaper than pulling a fat row. The joins land mostly on the cold path: the audit screen now joins customers to customer_audit, which costs one indexed lookup and nobody notices. The place to be careful is the query that reads from three of the split tables on every request. If that query is dominant, one of those tables probably belongs merged back in.

When a wide table is actually fine

Not every 100-column table is a god table. Three cases where width is defensible:

Analytical and reporting tables on columnar storage. As noted above, warehouses like ClickHouse, BigQuery, and Redshift invert the cost calculus — reading one column doesn’t load the rest, and the normalization pressure flips: denormalize aggressively because joins are expensive and per-column reads are cheap. This anti-pattern is specifically a row-store OLTP problem.

Small tables that stay small. A tenants table with 80 columns and 500 rows fits entirely in the buffer pool. The write amplification is paid a few thousand times a day, not a few million. The secondary-index cost is negligible because the indexes are small. Width matters when row count is large enough for the per-row cost to dominate — on small tables it doesn’t.

Every query reads every column. Uncommon but real. If the dominant read is “fetch the full customer record for display” and the split would produce a join that runs on every request anyway, the split doesn’t help. The test is whether the queries you actually run touch disjoint column sets — if they do, the split has a real win; if they don’t, it’s architecture for its own sake.

The bigger picture

Relational databases aren’t built for developer convenience. They’re built for storage efficiency and retrieval speed — narrow rows, well-placed indexes, joins on indexed keys, query plans that read only what they need. Normalization isn’t an academic ideal; it’s the shape that lines up with how the engine actually pays its bills. Every cost mechanism in this post — buffer-pool density, write amplification, index bloat, row-lock width — is the engine reporting the same thing in different dialects: the shape you’re asking it to hold isn’t the shape it was optimized for. The SELECT-*-and-done dream is the developer’s cost model, not the database’s.

God tables aren’t designed; they’re the limit of a sequence of rational local decisions where the global cost is invisible at each step. The column count of a mature production table is usually a decent proxy for how long the team has been making the cheap choice, which is most teams most of the time — and that is not by itself a failure. The failure is that the cost goes uncounted. A 6KB row is a write-amplification multiplier on every UPDATE, a buffer-pool multiplier on every read, and an index-size multiplier on every secondary index. None of those costs are on the PR that adds a column; all of them are on the dashboard that shows p99 drifting up quarter after quarter.

The lever is to count the cost at the system level when the table hits a certain width — pick a threshold, sixty columns, a hundred, whatever fits — and make the next column addition a conversation about whether this concern belongs here, not a line in a migration. The answer is often still yes, but it shouldn’t be the default answer. When it’s no, the split is far cheaper at column sixty than at column one-eighty; the table doesn’t care, but every caller of the table does, and the rewrite’s blast radius scales with how long the drift went uncorrected.

Non-SARGable Predicates: How a Function in WHERE Kills Your Index

Wed, 22 Apr 2026 00:00:00 +0000

TL;DR

A predicate is SARGable (Search ARGument able) if the database can use an index to evaluate it. Wrapping a column in a function makes the predicate non-SARGable: the engine has to compute the function on every row before it can filter, which means a full table scan no matter which ordinary column indexes exist. The fix isn’t always to rewrite the predicate (sometimes the column’s type or collation is wrong and the code is masking it), but every non-SARGable predicate on a hot path is a performance bug waiting for the table to grow.

Here are two queries that return the exact same rows:

-- Version A
SELECT id, status FROM events
WHERE YEAR(created_at) = 2025;

-- Version B
SELECT id, status FROM events
WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';

On a 10,000-row events table, both run in under a millisecond and nobody notices the difference. On a 200-million-row events table with an index on created_at, version A does a sequential scan and takes 45 seconds; version B does an index range scan and takes 12 milliseconds. Neither query is wrong. They don’t even disagree about the answer. One just does the same work in a way the planner can’t optimize.

The obvious fix is “rewrite every function-wrapped predicate as a range.” That works for the date-extraction case and a few others. For WHERE LOWER(email) = 'alice@example.com', the rewrite needs to know whether the column’s collation is case-insensitive — and if it isn’t, there’s no direct equivalent, only a functional index or a schema change. The fix depends on why the function is there, and “why” usually points back at something in the schema that’s pretending to be something it isn’t.

What SARGable means in practice

An index on created_at is a sorted structure: the engine can jump to any date range in O(log n) time by walking the B-tree. For the planner to use that index on a predicate, the predicate has to be expressible as “the column is in this range” — a direct comparison between the column and a constant or parameter.

created_at >= '2025-01-01' meets that contract. The planner translates it to “walk the index to the first entry ≥ 2025-01-01, read forward from there.” That’s a range scan.

YEAR(created_at) = 2025 doesn’t meet the contract. The value being compared isn’t created_at; it’s the output of YEAR() applied to created_at. The index on created_at doesn’t know the output of YEAR() for any row without computing it. So the planner falls back to evaluating the function on every row — a sequential scan — and only then filtering.
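The two versions can be compared side by side without a 200-million-row table. The sketch below makes several assumptions: SQLite through Python stands in for the engine, strftime('%Y', ...) stands in for YEAR() (which SQLite doesn’t have), timestamps are stored as ISO-8601 text, and the sample rows are made up. SQLite reports a full scan as SCAN and an index range lookup as SEARCH.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id         INTEGER PRIMARY KEY,
        status     TEXT NOT NULL,
        created_at TEXT NOT NULL      -- ISO-8601 text, so string order is time order
    );
    CREATE INDEX idx_events_created_at ON events (created_at);
    INSERT INTO events (status, created_at) VALUES
        ('old',  '2024-12-31 23:59:59'),
        ('new',  '2025-01-01 00:00:00'),
        ('new',  '2025-12-31 23:59:59'),
        ('next', '2026-01-01 00:00:00');
""")

# Version A wraps the column in a function; version B compares the bare column.
VERSION_A = "SELECT id, status FROM events WHERE strftime('%Y', created_at) = '2025'"
VERSION_B = (
    "SELECT id, status FROM events "
    "WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'"
)


def plan(sql):
    """Return SQLite's plan for `sql` as one string (the detail text is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_a, plan_b = plan(VERSION_A), plan(VERSION_B)
rows_a = sorted(conn.execute(VERSION_A).fetchall())
rows_b = sorted(conn.execute(VERSION_B).fetchall())

print(plan_a)  # a SCAN: every row is visited
print(plan_b)  # a SEARCH using idx_events_created_at
```

Both queries return the same two rows; only the plan differs.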

Common forms of the same mistake:

-- Non-SARGable: function on column → full scan
WHERE LOWER(email) = 'alice@example.com'
WHERE DATE(created_at) = '2025-01-15'
WHERE CAST(price AS INT) > 100
WHERE CONCAT(first_name, ' ', last_name) = 'Alice Smith'

-- SARGable equivalents
WHERE email = 'alice@example.com'              -- if collation is case-insensitive
WHERE created_at >= '2025-01-15' AND created_at < '2025-01-16'
WHERE price > 100                              -- fix the type at the schema level
WHERE first_name = 'Alice' AND last_name = 'Smith'

Three of the four non-SARGable forms have clean rewrites. The first one — LOWER(email) — depends on collation, which is where a lot of real-world cases live.

The collation case

WHERE LOWER(email) = 'alice@example.com' is almost always a tell that the email column has a case-sensitive collation and the application is hiding it at query time. Two real fixes, one cosmetic fix:

Fix the column. If the data should be matched case-insensitively, give the column a case-insensitive collation. In PostgreSQL that’s CITEXT or a nondeterministic ICU collation (PostgreSQL 12+, created with deterministic = false; the stock "und-x-icu" collation is case-sensitive); in MySQL it’s a _ci collation (which is usually the default anyway). Once the column’s collation handles the case folding, WHERE email = 'alice@example.com' is SARGable and fast. This is the right fix when case-insensitivity is a property of the data.
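As a rough illustration of the column-level fix, here is the same idea in SQLite, whose built-in NOCASE collation plays the part of CITEXT or a _ci collation. That substitution is an assumption made for runnability, and NOCASE only folds ASCII letters, so treat this as a sketch of the shape rather than a production recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL COLLATE NOCASE   -- case folding lives in the column
    );
    CREATE UNIQUE INDEX idx_users_email ON users (email);
    INSERT INTO users (email) VALUES ('Alice@Example.com');
""")

# No LOWER() in the predicate: the bare column comparison is case-insensitive.
QUERY = "SELECT id FROM users WHERE email = 'alice@example.com'"
matches = conn.execute(QUERY).fetchall()

plan_rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
email_plan = " | ".join(row[-1] for row in plan_rows)

print(matches)
print(email_plan)  # an index search on idx_users_email
```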

Add a functional (expression) index. If you can’t change the column’s collation — there’s a case-sensitive comparison elsewhere in the schema that depends on the current behavior — index the expression itself:

-- PostgreSQL: functional index
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
-- Now WHERE LOWER(email) = '...' uses the index

-- MySQL 8.0.13+: functional key part (note the extra parentheses around the expression)
ALTER TABLE users ADD INDEX idx_email_lower ((LOWER(email)));

This works, with caveats. The index’s storage and write cost is real. The predicate has to match the indexed expression exactly — LOWER(email) is indexed, but UPPER(email) isn’t, and the planner won’t translate between them. Every non-SARGable expression you want fast needs its own index.
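The exact-match caveat is easy to observe. In this sketch SQLite (which also supports indexes on expressions) stands in for PostgreSQL or MySQL, which is an assumption; the index name is the one from the PostgreSQL example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL
    );
    CREATE INDEX idx_users_email_lower ON users (lower(email));
""")


def plan(sql):
    """Return SQLite's plan for `sql` as one string (the detail text is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Same expression as the index: eligible for an index search.
plan_lower = plan("SELECT id FROM users WHERE lower(email) = 'alice@example.com'")
# Different expression: the planner does not translate upper() into lower().
plan_upper = plan("SELECT id FROM users WHERE upper(email) = 'ALICE@EXAMPLE.COM'")

print(plan_lower)
print(plan_upper)
```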

Cosmetic fix: case-fold at write time. Store the email as already-lowercased. WHERE email = 'alice@example.com' is now SARGable directly, no expression index needed. This usually requires application changes — whoever’s writing has to remember to case-fold — which is why the functional index is more popular even though it’s heavier. Where business logic lives covers the general shape of this decision; case-folding at the database with a generated column (GENERATED ALWAYS AS (LOWER(email)) STORED) is often the cleanest answer when the application can’t be trusted to normalize consistently.

Implicit type conversions are the subtler version

The function isn’t always in the query. Sometimes the planner is adding one:

-- account_id is VARCHAR, literal is numeric
WHERE account_id = 12345

MySQL will silently cast every account_id value to a number for comparison — a per-row function call that kills index usage just as effectively as an explicit CAST(). PostgreSQL is stricter and usually errors, but can still do implicit conversions between compatible types that undermine indexes.
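Why a per-row cast defeats a sorted index can be modeled in a few lines. This is a toy model, not MySQL’s implementation: the account ids are made up, a sorted Python list stands in for the index, and float() only loosely approximates the engine’s string-to-number conversion.

```python
import bisect

# Hypothetical VARCHAR account ids, kept in string order as a B-tree would keep them.
index_keys = sorted(["9", "10", "100", "12345", "012345", "12345.0"])

# String comparison: one binary search lands on the single matching key.
pos = bisect.bisect_left(index_keys, "12345")
string_matches = [index_keys[pos]] if index_keys[pos] == "12345" else []

# Numeric comparison: several different strings equal 12345 once converted,
# and they are not adjacent in string order, so every key has to be converted.
numeric_matches = [key for key in index_keys if float(key) == 12345]
numeric_positions = [index_keys.index(key) for key in numeric_matches]

print(index_keys)
print(string_matches, numeric_matches, numeric_positions)
```

The matching keys sit at scattered positions in the string-ordered list, which is why no single range of the index can answer the numeric comparison.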

The fix is matching types in both directions: the column type should be what the column is (a numeric ID should be BIGINT, not VARCHAR), and the query should write the literal in the column’s type (WHERE account_id = '12345' if the column is genuinely a string). Either fix works; matching the column type to the data’s real shape is usually the durable answer.

This is also where mixed PK strategies show up: joining a BIGINT id to a UUID id doesn’t just return wrong results; on MySQL it coerces the string side to a number for every row, which is the same implicit-function problem dressed up as a join.

When non-SARGable is acceptable

Not every non-SARGable predicate is a bug. Three cases where it’s fine:

Small tables. A 5,000-row lookup table with a function-wrapped predicate scans in microseconds. The planner isn’t going to use an index on that size anyway. WHERE UPPER(code) = 'NY' on a 50-row states table is not worth worrying about.

One-off analytical queries. A one-time data extract that scans a large table is going to scan it regardless. If the query will never run again, the function call isn’t the bottleneck — the table size is — and adding a functional index to optimize one query isn’t worth the write cost on every future insert.

When the function genuinely can’t be avoided. Some predicates legitimately need to compute. WHERE haversine_distance(lat, lng, user_lat, user_lng) < 10 on a geospatial query can’t be rewritten as a simple range; you need a spatial index (PostGIS, MySQL spatial extensions) to make it SARGable in the geometric sense. The fix is a different kind of index, not a rewrite.

The bigger picture

Non-SARGable predicates are easy to write, and they come from somewhere — almost always a schema decision that’s being papered over at query time. LOWER(email) hides a collation mismatch. CAST(price AS INT) hides a type that should have been NUMERIC from the start. DATE(created_at) hides the fact that the query is answering a date-range question but written in a way that reads more naturally as an equality. Every one of these is a query-level workaround for a schema-level issue, and every one of them costs an index when the table grows large enough to care.

EXPLAIN ANALYZE is the diagnostic. If the plan shows a sequential scan on a predicate that should hit an index, the predicate is almost certainly non-SARGable — look at what’s wrapping the column. Fix the schema if you can, add a functional index if you can’t, and treat non-SARGable predicates on hot paths as latent performance bugs, not style issues.

Random UUIDs as Primary Keys: The B-Tree Penalty

Wed, 22 Apr 2026 00:00:00 +0000

TL;DR

UUIDv4 primary keys are globally unique and coordination-free, and the cost is paid every time you write a row: random B-tree positions, page splits, secondary indexes bloated with 16- or 36-byte key copies, and a working set that stops fitting in the buffer pool once the table is large enough. UUIDv7 fixes the insert-locality problem (time-ordered, sortable) without changing storage size; the full fix is picking v7, storing as BINARY(16) or native uuid, and keeping UUIDs at the API boundary rather than internal to every join.

A table configured like this on day one looks unremarkable:

CREATE TABLE orders (
    id CHAR(36) PRIMARY KEY,  -- UUIDv4, generated by the application
    ...
);

Inserts are fast, reads are fast, the ORM is happy. At 100,000 rows, it’s still fine. At 10 million, the nightly ingest job gets noticeably slower. At 200 million, inserts take 50 ms each instead of 2 ms, the buffer pool is constantly churning, and the secondary indexes are three to four times the size they’d be with a BIGINT primary key. Nothing about the schema changed. The table just got large enough for a design decision to start charging rent.

The obvious fix is “use BIGINT auto-increment.” That’s the right answer in a lot of cases and the wrong one in others: it reintroduces coordination requirements, leaks row counts through URL-exposed IDs, and doesn’t work for IDs that need to be generated offline or across shards. UUIDs exist because those constraints are real. The sharper question is: what exactly is UUIDv4 costing you at scale, and which of those costs have cheaper alternatives?

What random keys do to a B-tree

B-tree indexes are sorted structures. When the primary key is an auto-incrementing integer, every new row goes to the end — the rightmost leaf page is the only one that gets written to, and the rest of the index stays in cache undisturbed. Inserts are sequential and cheap.

UUIDv4 is random by design. Every new row lands at a random position in the B-tree. Instead of appending to one page, the engine has to:

Find the right page somewhere in the middle of the tree.
Load it into the buffer pool if it isn’t already (on a large table, it usually isn’t).
Split it if it’s full.
Write both halves back.

On a table with hundreds of millions of rows, the index doesn’t fit in memory, so most inserts trigger a random disk read before they can do anything else. The write amplification is real and measurable: a factor of 5 to 10× versus sequential inserts isn’t unusual.

The damage doesn’t stop at the primary-key index. In InnoDB (MySQL), every secondary index includes a copy of the primary key at its leaves. A 36-byte CHAR(36) UUID embedded in every secondary index entry means larger indexes, more pages, and more I/O than an 8-byte BIGINT would cost. Secondary indexes on a UUID-keyed table are routinely 3–4× the size of the same indexes on a BIGINT-keyed table. Every lookup through a secondary index reads more pages to cover the same rows.
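A back-of-the-envelope calculation shows where a ratio in that range comes from. The sketch below assumes a secondary index on a single 4-byte INT column and a single-byte character set for CHAR(36), and it ignores per-record headers, page fill, and non-leaf pages, so the numbers are rough payload sizes rather than what any engine would report. The helper names are illustrative.

```python
# Rough payload size of one secondary-index entry in a clustered-index
# engine such as InnoDB: the indexed column plus a copy of the primary key.
# Assumption: record headers, page fill, and non-leaf pages are ignored.
PK_BYTES = {"BIGINT": 8, "BINARY(16)": 16, "CHAR(36)": 36}  # CHAR(36) in a 1-byte charset


def entry_bytes(indexed_column_bytes: int, pk_type: str) -> int:
    """Bytes of key data stored per secondary-index entry."""
    return indexed_column_bytes + PK_BYTES[pk_type]


def index_payload_gib(rows: int, indexed_column_bytes: int, pk_type: str) -> float:
    """Total key payload of the index, in GiB."""
    return rows * entry_bytes(indexed_column_bytes, pk_type) / 2**30


if __name__ == "__main__":
    rows = 200_000_000  # the table size used earlier in the article
    column = 4          # hypothetical 4-byte INT being indexed
    baseline = entry_bytes(column, "BIGINT")
    for pk_type in PK_BYTES:
        size = index_payload_gib(rows, column, pk_type)
        ratio = entry_bytes(column, pk_type) / baseline
        print(f"{pk_type:<11} {size:6.2f} GiB  ({ratio:.2f}x BIGINT)")
```

With a 4-byte indexed column, the CHAR(36) entry carries 40 bytes of key data against 12 for BIGINT, a little over 3×. Wider indexed columns shrink the ratio, so the gap is largest on narrow indexes.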

PostgreSQL handles storage differently — its heap means the primary key is just another index, so the physical table isn’t ordered by it. But the primary-key index still suffers the same random-insertion pathology, and the write amplification from random page loads still applies.

Page splits compound over time. When a new UUID lands in a full page, InnoDB splits the page in two, each roughly half full. Over millions of inserts, the index develops internal fragmentation — pages allocated but only partially used. The index is physically larger than it needs to be, and scans read more pages for the same row count. OPTIMIZE TABLE (MySQL) or REINDEX (PostgreSQL) can repack the index, but on a busy table it’s a maintenance window you have to schedule.
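The two effects described in this section, buffer-pool misses on random inserts and half-empty pages left behind by splits, can be reproduced with a toy model. The sketch below is not how InnoDB or PostgreSQL is implemented: it models only leaf pages as sorted Python lists, a tiny LRU cache standing in for the buffer pool, a 50/50 split for mid-tree inserts, and a simplified rightmost-append shortcut. The page capacity, cache size, and function name are arbitrary assumptions.

```python
import bisect
import random
from collections import OrderedDict


def simulate_inserts(keys, page_capacity=64, cache_pages=8):
    """Insert keys into a toy leaf level and report cache misses and page fill."""
    pages = [[]]            # leaf pages in key order, each a sorted list
    lows = [float("-inf")]  # lows[i] = smallest key routed to pages[i]
    ids = [0]               # stable page ids, used as cache keys
    cache = OrderedDict()   # LRU "buffer pool": page id -> None
    misses = 0
    next_id = 1

    def touch(page_id, count_miss=True):
        nonlocal misses
        if page_id in cache:
            cache.move_to_end(page_id)
            return
        if count_miss:
            misses += 1                  # page had to be read from "disk"
        cache[page_id] = None
        if len(cache) > cache_pages:
            cache.popitem(last=False)    # evict the least recently used page

    for key in keys:
        i = bisect.bisect_right(lows, key) - 1   # find the target leaf
        touch(ids[i])
        page = pages[i]
        if len(page) >= page_capacity:
            if i == len(pages) - 1 and key > page[-1]:
                # Append at the right edge: start a fresh page, no split.
                pages.append([key])
                lows.append(key)
                ids.append(next_id)
                touch(next_id, count_miss=False)
                next_id += 1
                continue
            # Mid-tree insert into a full page: split it in half.
            mid = len(page) // 2
            right = page[mid:]
            del page[mid:]
            pages.insert(i + 1, right)
            lows.insert(i + 1, right[0])
            ids.insert(i + 1, next_id)
            touch(next_id, count_miss=False)
            next_id += 1
            if key >= right[0]:
                page = right
        bisect.insort(page, key)

    fill = sum(len(p) for p in pages) / (len(pages) * page_capacity)
    return {"misses": misses, "pages": len(pages), "fill": fill}


if __name__ == "__main__":
    n = 20_000
    sequential = simulate_inserts(range(n))
    scattered = simulate_inserts(random.Random(42).sample(range(10**12), n))
    print("sequential keys:", sequential)
    print("random keys:    ", scattered)
```

In this model the sequential keys only ever touch the rightmost page and pack pages almost full, while the random keys miss the cache on most inserts and leave pages roughly two-thirds full. Real engines add non-leaf pages, write-back, and smarter split heuristics, so treat the output as a picture of the shape of the problem, not a benchmark.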

UUIDv7: the insert-locality fix

UUIDv7 is the version most new code should reach for when UUIDs are the right answer. It encodes a Unix millisecond timestamp into the high 48 bits; of the remaining 80 bits, 6 mark the version and variant and 74 are random. Two practical consequences:

Sortable. IDs generated later sort later (at millisecond granularity), so new rows land at or near the right edge of the B-tree instead of scattered across it. Insert locality is close to a BIGINT’s, and the pathological page-split behaviour of v4 goes away.
Time-parseable. The creation time is embedded in the ID, recoverable from the primary key alone — useful for log correlation, rough time-range filtering, and debugging without reaching for created_at.
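To make the layout concrete, here is a minimal sketch of a v7 generator following the RFC 9562 field order: 48 bits of Unix milliseconds, 4 version bits, 12 random bits, 2 variant bits, 62 random bits. The function name uuid7 and its unix_ms parameter are illustrative. A maintained library (or a runtime that ships its own generator) is the better choice in production, and this version makes no ordering guarantee between IDs created within the same millisecond.

```python
import os
import time
import uuid
from typing import Optional


def uuid7(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUIDv7 by hand; pass unix_ms to pin the timestamp (for tests)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 used
    rand_a = rand >> 68                           # top 12 bits
    rand_b = rand & ((1 << 62) - 1)               # low 62 bits

    value = (unix_ms & ((1 << 48) - 1)) << 80     # bits 127..80: timestamp
    value |= 0x7 << 76                            # bits 79..76: version 7
    value |= rand_a << 64                         # bits 75..64: random
    value |= 0b10 << 62                           # bits 63..62: RFC variant
    value |= rand_b                               # bits 61..0: random
    return uuid.UUID(int=value)


def unix_ms_of(u: uuid.UUID) -> int:
    """Read the embedded millisecond timestamp back out of a v7 UUID."""
    return u.int >> 80


if __name__ == "__main__":
    first = uuid7()
    time.sleep(0.002)  # make sure the second ID gets a later millisecond
    second = uuid7()
    print(first, unix_ms_of(first))
    print(second, unix_ms_of(second))
    print("sorts after the first:", first < second)
```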

-- UUIDv7: time-ordered, so inserts are roughly sequential
-- PostgreSQL 18 ships a built-in uuidv7() function
CREATE TABLE orders (
    id UUID PRIMARY KEY DEFAULT uuidv7(),
    ...
);

-- Recover creation time from the ID — no created_at column needed
SELECT id,
       uuid_extract_timestamp(id) AS created_at,
       uuid_extract_timestamp(id)::date AS created_date
FROM orders
ORDER BY id DESC  -- v7 sorts chronologically, newest first
LIMIT 10;

uuid_extract_timestamp() has existed in PostgreSQL since 17 but only returned a value for UUIDv1. PG 18 extended it to support v7 alongside the new uuidv7() generator. One caveat: calling it in a WHERE clause (WHERE uuid_extract_timestamp(id) >= '2026-04-01') is non-SARGable and forces a scan — see Non-SARGable Predicates. For indexed time-range filtering, keep a created_at column as the query target, or compare against a boundary UUID generated at the target timestamp.
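The boundary-UUID approach can be sketched in application code: build the smallest possible v7 value for the target instant and bind it as an ordinary parameter, so the predicate stays a plain range comparison on the indexed id column. The helper name v7_lower_bound is hypothetical, and the sketch assumes the IDs follow the standard v7 layout and are stored in a type whose sort order matches byte order (native uuid, or unswapped BINARY(16)).

```python
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def v7_lower_bound(moment: datetime) -> uuid.UUID:
    """Smallest well-formed UUIDv7 for `moment` (must be timezone-aware)."""
    unix_ms = (moment - EPOCH) // timedelta(milliseconds=1)
    # Timestamp in the high 48 bits, version 7, RFC variant, random bits zero.
    value = (unix_ms << 80) | (0x7 << 76) | (0b10 << 62)
    return uuid.UUID(int=value)


if __name__ == "__main__":
    bound = v7_lower_bound(datetime(2026, 4, 1, tzinfo=timezone.utc))
    # Bind `bound` as a parameter in a query shaped like
    #   SELECT ... FROM orders WHERE id >= <placeholder>
    # (placeholder syntax depends on the database driver).
    print(bound)
```

Every v7 ID minted at or after the target millisecond compares greater than or equal to this value, and every earlier one compares lower, which is what lets the primary-key index serve the range.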

MySQL 8 doesn’t ship a v7 generator or a timestamp extractor, so application-side generation is the norm there: libraries exist in every major language, and a growing number of ORMs and ID libraries can generate v7 directly. Extraction is manual: for BINARY(16) storage (the recommended form), the first 6 bytes hold the millisecond timestamp.

-- MySQL: manually parse v7's timestamp prefix (BINARY(16) storage)
SELECT id,
       FROM_UNIXTIME(CONV(HEX(SUBSTRING(id, 1, 6)), 16, 10) / 1000) AS created_at,
       DATE(FROM_UNIXTIME(CONV(HEX(SUBSTRING(id, 1, 6)), 16, 10) / 1000)) AS created_date
FROM orders
ORDER BY id DESC  -- v7 sorts chronologically
LIMIT 10;

For CHAR(36) storage, the extraction strips hyphens first: CONCAT(SUBSTRING(id, 1, 8), SUBSTRING(id, 10, 4)) gives the 12 hex characters of the timestamp prefix. If your UUIDs were stored with UUID_TO_BIN(id, 1) (the swap flag that reorders bytes for v1 index locality), the byte layout differs and the substring offsets change. Most v7-generating libraries skip the swap because v7 is already time-ordered without it, so check what yours does before trusting the extraction.
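The same extraction is often easier to do in application code, and a small sketch makes the swap-flag pitfall visible. The functions below are illustrative helpers, not library APIs. mysql_swap mimics the byte reordering that UUID_TO_BIN(id, 1) applies (third hex group first, then the second, then the first), which is worth verifying against your own data before relying on it.

```python
import uuid


def v7_unix_ms_from_bytes(raw: bytes) -> int:
    """Timestamp from unswapped BINARY(16): the first 6 bytes, big-endian."""
    return int.from_bytes(raw[:6], "big")


def v7_unix_ms_from_text(text: str) -> int:
    """Timestamp from CHAR(36): first hex group plus second, hyphen skipped."""
    return int(text[0:8] + text[9:13], 16)


def mysql_swap(raw: bytes) -> bytes:
    """Byte order produced by the swap flag: group 3, group 2, group 1, rest."""
    return raw[6:8] + raw[4:6] + raw[0:4] + raw[8:]


def mysql_unswap(raw: bytes) -> bytes:
    """Undo mysql_swap so the plain 6-byte prefix extraction works again."""
    return raw[4:8] + raw[2:4] + raw[0:2] + raw[8:]


# A hand-built v7 value for an arbitrary instant in April 2026 (random bits zero).
SAMPLE_MS = 1_776_816_000_000
SAMPLE = uuid.UUID(int=(SAMPLE_MS << 80) | (0x7 << 76) | (0b10 << 62))

if __name__ == "__main__":
    raw = SAMPLE.bytes
    print("from bytes:      ", v7_unix_ms_from_bytes(raw))
    print("from text:       ", v7_unix_ms_from_text(str(SAMPLE)))
    print("swapped, naive:  ", v7_unix_ms_from_bytes(mysql_swap(raw)))
    print("swapped, undone: ", v7_unix_ms_from_bytes(mysql_unswap(mysql_swap(raw))))
```

The naive read of a swapped value returns a number that is not the timestamp at all, which is the failure mode to watch for when a table mixes storage conventions.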

What v7 doesn’t change. It’s still 16 bytes on disk, and still 36 if you stored it as CHAR(36) — the insert-locality win doesn’t come with a storage discount, so the overhead versus a BIGINT is the same as v4. The readable creation timestamp is usually a feature and occasionally a problem: in systems where row-creation time is sensitive (order IDs revealing traffic patterns to competitors, user IDs exposing signup timing), it’s the one property v4 had that v7 gives up.

CHAR(36) is the silent tax

The worst-case UUID storage, CHAR(36), is what most ORM-generated schemas default to, because it’s the portable representation. BINARY(16) in MySQL or the native uuid type in PostgreSQL cuts storage by more than half and keeps comparisons on fixed-width 16-byte values instead of collation-aware strings. Pick the narrow form on day one; retrofitting it later is a full-table rewrite that touches every secondary index.
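At the application boundary, converting between the two forms is a one-liner in most languages. A minimal Python sketch using only the standard uuid module follows; the helper names and the sample ID are made up for illustration.

```python
import uuid


def to_binary16(text: str) -> bytes:
    """36-character text form -> 16 raw bytes, suitable for a BINARY(16) column."""
    return uuid.UUID(text).bytes


def to_text(raw: bytes) -> str:
    """16 raw bytes -> canonical lowercase hyphenated text for API responses."""
    return str(uuid.UUID(bytes=raw))


if __name__ == "__main__":
    sample = "018f3c2a-7b1e-7c3d-9a4b-5e6f70819203"  # made-up ID
    raw = to_binary16(sample)
    print(len(sample), "characters as text;", len(raw), "bytes as binary")
    print("round-trips cleanly:", to_text(raw) == sample)
```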

UUID-to-integer mapping: keep UUIDs at the edge

The other workable fix is structural: expose UUIDs externally, use integers internally. A single lookup table maps the external UUID to an internal BIGINT, and every other table in the database uses the BIGINT as its foreign key. The UUID lookup happens once — at the API boundary — and everything downstream is fast, compact, 8-byte integer joins.

CREATE TABLE users (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    external_id BINARY(16) NOT NULL UNIQUE,  -- the UUID the outside world sees
    ...
);

-- Every other table references the BIGINT, not the UUID
CREATE TABLE orders (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(id),  -- MySQL 8 parses but ignores inline REFERENCES
    ...
);

-- API request comes in with a UUID — one indexed lookup to resolve it
SELECT id FROM users WHERE external_id = UUID_TO_BIN('a1b2c3d4-...');
-- From here on, everything uses the BIGINT

The UUID column has a unique index, so the lookup is a single index seek: a handful of page reads that grows only logarithmically with table size, and typically sub-millisecond when the index is cached. The rest of the schema gets 8-byte keys everywhere: smaller indexes, faster joins, no mid-tree page splits, no secondary-index bloat. The external-facing API still uses UUIDs, so you don’t leak sequence information or row counts.

The trade-off is an extra layer of indirection. Every inbound request resolves the UUID before anything else; in practice this is negligible (one indexed lookup), but it means the schema has two identity systems to maintain. For long-lived OLTP applications where every join on every table pays the UUID cost, this structure is often worth the extra lookup.
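The resolve-once pattern can be sketched end to end. The example below uses Python’s built-in sqlite3 purely as a stand-in so it runs anywhere; the schema mirrors the MySQL one above, with BLOB in place of BINARY(16), and the function names and the total_cents column are assumptions made for illustration.

```python
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")
    db.executescript("""
        CREATE TABLE users (
            id          INTEGER PRIMARY KEY,   -- internal integer key
            external_id BLOB NOT NULL UNIQUE   -- 16-byte UUID seen by the API
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            user_id     INTEGER NOT NULL REFERENCES users(id),
            total_cents INTEGER NOT NULL
        );
    """)
    return db


def create_user(db: sqlite3.Connection) -> uuid.UUID:
    """Insert a user and return the external UUID handed to the client."""
    external = uuid.uuid4()
    db.execute("INSERT INTO users (external_id) VALUES (?)", (external.bytes,))
    return external


def resolve_user_id(db: sqlite3.Connection, external_text: str):
    """API boundary: one indexed lookup from UUID text to the internal id."""
    raw = uuid.UUID(external_text).bytes
    row = db.execute("SELECT id FROM users WHERE external_id = ?", (raw,)).fetchone()
    return None if row is None else row[0]


def orders_for(db: sqlite3.Connection, external_text: str):
    """Everything after the boundary works on the integer key only."""
    user_id = resolve_user_id(db, external_text)
    if user_id is None:
        return []
    return db.execute(
        "SELECT id, total_cents FROM orders WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()


if __name__ == "__main__":
    db = open_db()
    external = create_user(db)
    internal = resolve_user_id(db, str(external))
    db.execute("INSERT INTO orders (user_id, total_cents) VALUES (?, ?)", (internal, 1999))
    print(external, "->", internal, orders_for(db, str(external)))
```

The UUID appears in exactly two places, the insert and the resolver; every other query and every foreign key sees only the integer.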

When random UUIDs are actually fine

Not every schema needs to bend. Three cases where UUIDv4 as a primary key is a defensible choice:

Small tables that stay small. A configuration table, a lookup table, a feature-flag table. At 50,000 rows the page-split pathology doesn’t show up, secondary indexes are tiny, and the convenience of client-generated IDs outweighs any cost.

Write rates low enough that random I/O doesn’t matter. An admin tool recording 50 events per minute doesn’t care about write amplification. The index fits in cache, every page is warm, page splits happen rarely enough that fragmentation stays manageable. “Doesn’t survive scale” is only a problem at scale.

Information-leak concerns that outweigh performance. If hiding creation-order is a hard requirement (competitive, privacy, or security), v7’s embedded timestamp is a non-starter and v4 is the only UUID version that meets the requirement. Pay the write-amplification cost and use the UUID-to-integer mapping to contain the damage.

The bigger picture

UUIDv4 is a tool that solved a coordination problem — distributed ID generation without central authority — and accidentally became the default for everything, including the cases where coordination wasn’t a problem and the cost of random writes is non-trivial. “Pick a UUID for your PK” is a decision most schemas make without ever being explicit about what they’re trading.

The decision matrix is short. Do you need globally unique, coordination-free IDs? If no, use BIGINT. If yes, use UUIDv7 and store it as BINARY(16) or native uuid — never CHAR(36). If v7’s embedded timestamp is a problem, use v4 but keep it at the API boundary and use integers inside the schema. Each of those decisions costs almost nothing on day one and saves a lot of rework at 100 million rows — which is the point where “UUIDs as primary keys” stops being a default and starts being a choice with real consequences.