BINARY(16) or native uuid, and keeping UUIDs at the API boundary rather than internal to every join. A table configured like this on day one looks unremarkable:
```sql
-- representative day-one schema: UUIDv4 primary key stored as CHAR(36)
CREATE TABLE orders (
    id          CHAR(36)    NOT NULL PRIMARY KEY,
    customer_id CHAR(36)    NOT NULL,
    status      VARCHAR(20) NOT NULL,
    created_at  TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY idx_orders_customer (customer_id)
);
```
Inserts are fast, reads are fast, the ORM is happy. At 100,000 rows, it’s still fine. At 10 million, the nightly ingest job gets noticeably slower. At 200 million, inserts take 50 ms each instead of 2 ms, the buffer pool is constantly churning, and the secondary indexes are three to four times the size they’d be with a BIGINT primary key. Nothing about the schema changed. The table just got large enough for a design decision to start charging rent.
The obvious fix is “use BIGINT auto-increment.” That’s the right answer in a lot of cases and the wrong one in others — it reintroduces coordination requirements, leaks row counts through URL-exposed IDs, and doesn’t work for schemas that need to be generated offline or across shards. UUIDs exist because those constraints are real. The sharper question is: what exactly is UUIDv4 costing you at scale, and which of those costs have cheaper alternatives?
What random keys do to a B-tree
B-tree indexes are sorted structures. When the primary key is an auto-incrementing integer, every new row goes to the end — the rightmost leaf page is the only one that gets written to, and the rest of the index stays in cache undisturbed. Inserts are sequential and cheap.
UUIDv4 is random by design. Every new row lands at a random position in the B-tree. Instead of appending to one page, the engine has to:
- Find the right page somewhere in the middle of the tree.
- Load it into the buffer pool if it isn’t already (on a large table, it usually isn’t).
- Split it if it’s full.
- Write both halves back.
On a table with hundreds of millions of rows, the index doesn’t fit in memory, so most inserts trigger a random disk read before they can do anything else. The write amplification is real and measurable — a factor of 5–10× versus sequential inserts isn’t unusual.
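The effect is easy to reproduce in miniature. The sketch below is a toy model, not a real B-tree: each key's leaf page is taken to be its rank in the final sorted order, and a small LRU stands in for the buffer pool. The names (`PAGE_KEYS`, `cache_misses`) are invented for illustration.

```python
import random
from collections import OrderedDict

PAGE_KEYS = 128    # keys per leaf page in this toy index
CACHE_PAGES = 8    # "buffer pool" capacity, in pages

def cache_misses(insert_order):
    """Replay inserts against a toy index: a key's leaf page is its rank
    in the final sorted order // PAGE_KEYS. Count LRU cache misses."""
    rank = {k: i for i, k in enumerate(sorted(insert_order))}
    cache = OrderedDict()
    misses = 0
    for k in insert_order:
        page = rank[k] // PAGE_KEYS
        if page in cache:
            cache.move_to_end(page)       # hot page stays resident
        else:
            misses += 1                   # random read before the insert
            cache[page] = True
            if len(cache) > CACHE_PAGES:
                cache.popitem(last=False)
    return misses

n = 50_000
sequential = list(range(n))        # BIGINT auto-increment: append-only
shuffled = sequential[:]
random.shuffle(shuffled)           # UUIDv4: every insert lands anywhere

print(cache_misses(sequential))    # one miss per new page: about n / PAGE_KEYS
print(cache_misses(shuffled))      # approaches n: nearly every insert misses
```

With sequential keys the only cold page is the newly created rightmost one; with shuffled keys almost every insert has to fault a page in first, which is the 5–10× amplification in miniature.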
The damage doesn’t stop at the primary-key index. In InnoDB (MySQL), every secondary index includes a copy of the primary key at its leaves. A 36-byte CHAR(36) UUID embedded in every secondary index entry means larger indexes, more pages, more I/O — compared to an 8-byte BIGINT. Secondary indexes on a UUID-keyed table are routinely 3–4× the size of the same indexes on a BIGINT-keyed table. Every lookup through a secondary index reads more pages to cover the same rows.
PostgreSQL handles storage differently — its heap means the primary key is just another index, so the physical table isn’t ordered by it. But the primary-key index still suffers the same random-insertion pathology, and the write amplification from random page loads still applies.
Page splits compound over time. When a new UUID lands in a full page, InnoDB splits the page in two, each roughly half full. Over millions of inserts, the index develops internal fragmentation — pages allocated but only partially used. The index is physically larger than it needs to be, and scans read more pages for the same row count. OPTIMIZE TABLE (MySQL) or REINDEX (PostgreSQL) can repack the index, but on a busy table it’s a maintenance window you have to schedule.
UUIDv7: the insert-locality fix
UUIDv7 is the version most new code should reach for when UUIDs are the right answer. It encodes a Unix millisecond timestamp into the high 48 bits, with random bits filling the rest. Two practical consequences:
- Sortable. Sequential generation means new IDs land at the end of the B-tree, not scattered across it. Insert locality is close to a BIGINT’s. The pathological page-split behaviour of v4 goes away.
- Time-parseable. The creation time is embedded in the ID, recoverable from the primary key alone — useful for log correlation, rough time-range filtering, and debugging without reaching for created_at.
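To make the layout concrete, here is a hand-rolled v7 generator following RFC 9562's field order: 48 bits of Unix milliseconds, the version nibble, the variant bits, and random filler. In production you'd use a library rather than this sketch.

```python
import os
import time
import uuid

def uuidv7() -> uuid.UUID:
    """Hand-rolled UUIDv7: high 48 bits are Unix milliseconds, the
    version nibble is 7, the variant bits are 10, the rest is random."""
    ms = time.time_ns() // 1_000_000
    value = (ms << 80) | int.from_bytes(os.urandom(10), "big")
    value = (value & ~(0xF << 76)) | (0x7 << 76)   # version = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)   # variant = RFC 9562
    return uuid.UUID(int=value)

a = uuidv7()
time.sleep(0.002)                 # force a later millisecond
b = uuidv7()
assert a.version == 7
assert a < b                      # later millisecond sorts later
```

Note that two IDs generated within the same millisecond are ordered by their random bits, so the sortability guarantee is only millisecond-granular.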
```sql
-- PostgreSQL 18: generate a v7 UUID and read back its embedded timestamp
CREATE TABLE events (
    id      uuid NOT NULL DEFAULT uuidv7() PRIMARY KEY,
    payload jsonb
);

SELECT uuid_extract_timestamp(id) FROM events LIMIT 1;
```
uuid_extract_timestamp() has existed in PostgreSQL since 17 but only returned a value for UUIDv1. PG 18 extended it to support v7 alongside the new uuidv7() generator. One caveat: calling it in a WHERE clause (WHERE uuid_extract_timestamp(id) >= '2026-04-01') is non-SARGable and forces a scan — see Non-SARGable Predicates. For indexed time-range filtering, keep a created_at column as the query target, or compare against a boundary UUID generated at the target timestamp.
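The boundary-UUID approach can be done application-side. The helper below (name invented for illustration) builds the smallest possible v7 UUID for a given instant, with the timestamp bits set and every random bit zeroed, so the query stays a plain indexed range comparison on id:

```python
import uuid
from datetime import datetime, timezone

def v7_lower_bound(ts: datetime) -> uuid.UUID:
    """Smallest possible UUIDv7 for the instant ts: the 48-bit millisecond
    timestamp is set, every random bit is zero."""
    ms = int(ts.timestamp() * 1000)
    return uuid.UUID(int=(ms << 80) | (0x7 << 76) | (0x2 << 62))

# SARGable: WHERE id >= %(cutoff)s walks the primary-key index directly
cutoff = v7_lower_bound(datetime(2026, 4, 1, tzinfo=timezone.utc))
```

Any v7 UUID created at or after the cutoff instant compares greater than or equal to this boundary, so the filter is equivalent to the extraction-function version without the per-row function call.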
MySQL 8 doesn’t ship a v7 generator or a timestamp extractor, so application-side generation is the norm there — libraries exist in every major language, and most modern ORMs can be configured to generate v7 when you ask for UUIDs. Extraction is manual: for BINARY(16) storage (the recommended form), the first 6 bytes hold the millisecond timestamp.
```sql
-- MySQL: recover the creation time from a v7 UUID stored as BINARY(16)
SELECT FROM_UNIXTIME(
           CONV(HEX(SUBSTRING(id, 1, 6)), 16, 10) / 1000
       ) AS created_ts
FROM orders
LIMIT 1;
```
For CHAR(36) storage, the extraction strips hyphens first: CONCAT(SUBSTRING(id, 1, 8), SUBSTRING(id, 10, 4)) gives the 12 hex characters of the timestamp prefix. If your v1 UUIDs were stored with UUID_TO_BIN(id, 1) (the swap flag that reorders bytes for v1 index locality), the byte layout differs and the substring offsets change. Most v7-generating libraries skip the swap because v7 is already time-ordered without it — check what yours does before trusting the extraction.
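Application-side, the same extraction is a couple of lines. The helper below (name invented for illustration) assumes the unswapped byte order just described:

```python
import uuid
from datetime import datetime, timezone

def v7_created_at(raw: bytes) -> datetime:
    """Recover the embedded creation time from the 16 raw bytes of an
    unswapped v7 UUID: the first 6 bytes hold milliseconds since the epoch."""
    ms = int.from_bytes(raw[:6], "big")
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# a v7 UUID whose timestamp field encodes 1,700,000,000,000 ms
u = uuid.UUID(int=(1_700_000_000_000 << 80) | (0x7 << 76) | (0x2 << 62))
print(v7_created_at(u.bytes))   # 2023-11-14 22:13:20+00:00
```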
What v7 doesn’t change. It’s still 16 bytes on disk, and still 36 if you stored it as CHAR(36) — the insert-locality win doesn’t come with a storage discount, so the overhead versus a BIGINT is the same as v4. The readable creation timestamp is usually a feature and occasionally a problem: in systems where row-creation time is sensitive (order IDs revealing traffic patterns to competitors, user IDs exposing signup timing), it’s the one property v4 had that v7 gives up.
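The two representations side by side:

```python
import uuid

u = uuid.uuid4()
assert len(str(u)) == 36     # CHAR(36): 32 hex digits plus 4 hyphens
assert len(u.bytes) == 16    # BINARY(16) / native uuid: raw bytes
```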
CHAR(36) is what most ORM-generated schemas default to, because it’s the portable representation. BINARY(16) in MySQL or the native uuid type in PostgreSQL cuts storage by more than half and keeps comparisons on fixed-width binary instead of strings. Pick the narrow form on day one; retrofitting it later is a full-table rewrite that touches every secondary index.

UUID-to-integer mapping: keep UUIDs at the edge
The other workable fix is structural: expose UUIDs externally, use integers internally. A single lookup table maps the external UUID to an internal BIGINT, and every other table in the database uses the BIGINT as its foreign key. The UUID lookup happens once — at the API boundary — and everything downstream is fast, compact, 8-byte integer joins.
```sql
-- boundary table: the only place the UUID appears
CREATE TABLE entities (
    id   BIGINT     NOT NULL AUTO_INCREMENT PRIMARY KEY,
    uuid BINARY(16) NOT NULL,
    UNIQUE KEY uq_entities_uuid (uuid)
);

-- every other table joins on the 8-byte integer
CREATE TABLE orders (
    id        BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    entity_id BIGINT NOT NULL,
    KEY idx_orders_entity (entity_id)
);
```
The UUID column has a unique index, so the lookup is a single index seek — sub-millisecond regardless of table size. The rest of the schema gets 8-byte keys everywhere: smaller indexes, faster joins, no page splits, no secondary-index bloat. The external-facing API still uses UUIDs, so you don’t leak sequence information or row counts.
The trade-off is an extra layer of indirection. Every inbound request resolves the UUID before anything else; in practice this is negligible (one indexed lookup), but it means the schema has two identity systems to maintain. For long-lived OLTP applications where every join on every table pays the UUID cost, this structure is often worth the extra lookup.
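A minimal sketch of the two-identity layout, using SQLite in place of a production database (table and function names invented for illustration):

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE entities (
        id   INTEGER PRIMARY KEY,   -- internal key, used by every FK
        uuid TEXT NOT NULL UNIQUE   -- external, API-facing identity
    )""")
db.execute("""
    CREATE TABLE orders (
        id        INTEGER PRIMARY KEY,
        entity_id INTEGER NOT NULL REFERENCES entities (id)
    )""")

def resolve(external_id: str) -> int:
    """The one UUID lookup per request: a single seek on the unique index."""
    row = db.execute(
        "SELECT id FROM entities WHERE uuid = ?", (external_id,)
    ).fetchone()
    if row is None:
        raise KeyError(external_id)
    return row[0]

ext = str(uuid.uuid4())
db.execute("INSERT INTO entities (uuid) VALUES (?)", (ext,))
internal = resolve(ext)                       # resolved once, at the boundary
db.execute("INSERT INTO orders (entity_id) VALUES (?)", (internal,))
```

Everything past `resolve()` joins on integers; the UUID never appears in a foreign key.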
When random UUIDs are actually fine
Not every schema needs to bend. Three cases where UUIDv4 as a primary key is a defensible choice:
Small tables that stay small. A configuration table, a lookup table, a feature-flag table. At 50,000 rows the page-split pathology doesn’t show up, secondary indexes are tiny, and the convenience of client-generated IDs outweighs any cost.
Write rates low enough that random I/O doesn’t matter. An admin tool recording 50 events per minute doesn’t care about write amplification. The index fits in cache, every page is warm, page splits happen rarely enough that fragmentation stays manageable. “Doesn’t survive scale” is only a problem at scale.
Information-leak concerns that outweigh performance. If hiding creation-order is a hard requirement (competitive, privacy, or security), v7’s embedded timestamp is a non-starter and v4 is the only UUID version that meets the requirement. Pay the write-amplification cost and use the UUID-to-integer mapping to contain the damage.
The bigger picture
UUIDv4 is a tool that solved a coordination problem — distributed ID generation without central authority — and accidentally became the default for everything, including the cases where coordination wasn’t a problem and the cost of random writes is non-trivial. “Pick a UUID for your PK” is a decision most schemas make without ever being explicit about what they’re trading.
The decision matrix is short. Do you need globally unique, coordination-free IDs? If no, use BIGINT. If yes, use UUIDv7 and store it as BINARY(16) or native uuid — never CHAR(36). If v7’s embedded timestamp is a problem, use v4 but keep it at the API boundary and use integers inside the schema. Each of those decisions costs almost nothing on day one and saves a lot of rework at 100 million rows — which is the point where “UUIDs as primary keys” stops being a default and starts being a choice with real consequences.