
Designing Partitioning You Don't Have to Babysit

Most partitioning schemes fail the same way: a date column gets bolted onto the primary key, every query now needs to include that column, and the partition key leaks into application code forever. There's a better way — partition by what you already have, and let a service manage the boundaries.

TL;DR
The standard way teams adopt range partitioning — bolting created_at onto the primary key so they can range-partition on it — forces every query to include that column in the WHERE clause or suffer a scan across every partition. Partition by the primary key instead, and let a background service manage boundaries based on observed growth. Queries keep using the keys they already have, partition pruning works automatically, and the partition column never leaks into application code. The same “service watches and adjusts” pattern applies to hash and list partitioning with different operations.

Two decisions determine whether partitioning becomes ongoing operational work or quietly disappears into the schema: the column you partition on, and how partition boundaries get managed over time. The first constrains the second — the wrong key choice locks the schema into manual lifecycle work forever, the right one makes a small background service enough.

Partitioning is one of those features that looks great in slides and turns into a maintenance burden in production. The pitch is straightforward — break a huge table into smaller physical pieces, queries only touch the pieces they need, maintenance operations (vacuum, reindex, drops) get cheap. The reality is that most partitioning schemes introduce a new coupling between the schema and every query that touches the table. That coupling tends to be invisible until someone runs a query that should be fast and gets a full-table scan back.

The partition key problem

Both PostgreSQL and MySQL require the partition key to be part of any primary key or unique constraint on the table. That rule exists for correctness — if the primary key didn’t include the partition key, the database couldn’t enforce uniqueness without scanning every partition.

The consequence is that if you want to partition by created_at, you can’t just have PRIMARY KEY (id) anymore. You need PRIMARY KEY (id, created_at). The date column is now part of the primary key whether your application needed it to be or not.

The more subtle cost is that id is no longer unique in the eyes of the database. Uniqueness is enforced on the tuple (id, created_at) — the database will cheerfully accept two rows with the same id as long as they have different timestamps. The application probably still treats id as unique, but nothing in the schema guarantees it. And you can’t recover the guarantee with a separate UNIQUE (id) constraint: both MySQL and PostgreSQL require every unique constraint on a partitioned table to include the partition key columns. The uniqueness property has effectively been traded away.

This isn’t purely cosmetic — it changes the query plans the optimizer is willing to generate:

  • With PRIMARY KEY (id), WHERE id = 1 is a constant-time lookup. MySQL’s EXPLAIN shows this as the const access type; the optimizer knows exactly one row matches and the executor stops after finding it. Joins on id are eq_ref, the fastest join access type.
  • With PRIMARY KEY (id, created_at), the same query becomes a ref lookup — a prefix scan on the leftmost index column that could, as far as the database is concerned, return multiple rows. Joins that used to be eq_ref become ref. Cardinality estimates fall back to index statistics instead of the guaranteed “one row” assumption, which can push the optimizer toward worse plans further up the query tree.

To get the old const plan back, every lookup has to spell out the full primary key:

-- Was a const lookup, now a ref lookup (one of potentially many rows)
SELECT * FROM orders WHERE id = 1;

-- Back to const — but only if the caller knows the created_at
SELECT * FROM orders WHERE id = 1 AND created_at = '2026-04-01 12:34:56';

That’s the same kind of leakage partition pruning causes (covered below), just from a different angle — the partition key forces its way into queries that had nothing to do with dates, once to get pruning and again to get single-row access.

-- Before partitioning
CREATE TABLE orders (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    total_cents INT NOT NULL,
    created_at DATETIME NOT NULL
);

-- After partitioning by month
CREATE TABLE orders (
    id BIGINT AUTO_INCREMENT,
    customer_id BIGINT NOT NULL,
    total_cents INT NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id, created_at)  -- created_at forced into the PK
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p202601 VALUES LESS THAN (TO_DAYS('2026-02-01')),
    PARTITION p202602 VALUES LESS THAN (TO_DAYS('2026-03-01')),
    ...
);

At this point everything still works. The table accepts inserts, queries return correct results, and the partition boundaries exist. The problem shows up the first time someone runs a query that doesn’t include created_at in the WHERE clause.

Partition pruning only works if you ask for it

Partition pruning is the optimization that makes partitioning worth doing — when a query’s WHERE clause restricts the partition key, the database can skip partitions that can’t possibly match. A query for last week’s orders only reads the one or two partitions that contain last week’s data.

That optimization depends entirely on the partition key appearing in the WHERE clause. A query that filters on anything else doesn’t get pruned — it scans every partition.

-- This query scans every partition. There are 36 of them.
SELECT * FROM orders WHERE id = 12345;

-- This one prunes to a single partition
SELECT * FROM orders WHERE id = 12345 AND created_at >= '2026-03-01' AND created_at < '2026-04-01';

The first query is the kind of lookup that happens constantly — fetch an order by its primary key. On a non-partitioned table, it’s a single index seek. On a partitioned table where the pruning key isn’t in the WHERE clause, it’s a separate index probe against every partition — 36 index lookups instead of one. Still fast in absolute terms, but dramatically worse than the non-partitioned version, which is the opposite of why partitioning was introduced.

The “fix” teams usually land on is to add the partition key to every query that touches the table. That’s a leaky abstraction — a physical storage decision has become a contract with every piece of code that queries the table. New code needs to remember to add the partition filter. Old code needs to be audited. ORMs need to be configured. And there’s no error when someone forgets — the query still returns correct results, just slowly and with a lot of extra I/O.

Static partition boundaries don’t age well

The other thing that tends to go wrong is hardcoding partition boundaries at table creation time. The initial layout reflects whatever the team’s growth projection looked like at that moment. Six months later the traffic pattern has changed, some partitions are 10x larger than others, and the p_future catch-all partition is holding half the table.

-- Defined at creation: looks reasonable
PARTITION p2026_q1 VALUES LESS THAN (100000000),
PARTITION p2026_q2 VALUES LESS THAN (200000000),
...

-- Six months later: growth accelerated, p_future is now the entire active workload
PARTITION p_future VALUES LESS THAN MAXVALUE  -- 800M rows and growing

Manually splitting and rebalancing partitions is operational work nobody wants to own. It requires scheduling maintenance windows, running ALTER TABLE ... REORGANIZE PARTITION against tables that might be hundreds of gigabytes, coordinating with application teams, and not making a mistake. It tends not to happen until there’s a performance incident, and at that point the fix is expensive.

The shape of the better approach

The primary key already exists. For tables using BIGINT AUTO_INCREMENT, it’s monotonically increasing — newer rows have larger IDs. That’s exactly the property needed for range partitioning. The primary key is the partition key.

CREATE TABLE orders (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    total_cents INT NOT NULL,
    created_at DATETIME NOT NULL
)
PARTITION BY RANGE (id) (
    PARTITION p0001 VALUES LESS THAN (100000000),
    PARTITION p0002 VALUES LESS THAN (200000000),
    PARTITION p0003 VALUES LESS THAN (300000000),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

Every query that filters by id — which is most of them — gets partition pruning for free, with no changes to application code. Range queries by ID prune across a small number of partitions. Point lookups prune to exactly one. The primary key is already in every WHERE clause that matters, because it’s the primary key.

The apparent trade-off is that partition boundaries aren’t directly defined by time anymore, which looks like it breaks time-based retention. In practice this is less of a trade-off than it looks — the point of partitioning often isn’t retention but keeping index sizes manageable, making maintenance operations cheap, and bounding the blast radius of a bad query. And when retention is a goal, boundaries can still be chosen to align with time — they just get picked at DDL time by the partitioner service, rather than baked into the schema. See Time-aligned boundaries without a date in the key.

Automating range partition management

Everything up to this point assumes range partitioning — partitions defined by continuous boundaries on an ordered value (an ID range, a date range). The operational work is mechanical: watch the active partition fill up, split the MAXVALUE catch-all into a new bounded partition before that happens, and drop partitions that have fallen past the retention threshold. A small service running on a schedule is enough to keep the layout healthy. The hard part isn’t the logic, it’s doing it safely — running DDL against a large table without locking out writes, handling partial failures, and recovering cleanly if the service crashes mid-operation.

-- Split the catch-all partition into a new bounded partition + new catch-all
-- This is the operation the service runs periodically
ALTER TABLE orders REORGANIZE PARTITION p_future INTO (
    PARTITION p0037 VALUES LESS THAN (3700000000),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

REORGANIZE PARTITION on an empty catch-all partition is fast — there’s nothing to move. The key insight is that if you split the catch-all before any rows land above the split point, the operation is essentially metadata-only. The service’s job is to stay ahead of the write workload: split the catch-all when it’s still small or empty, not when it’s already holding hundreds of millions of rows.
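The "stay ahead" condition can be sketched as a small headroom check — split while the write pointer is still comfortably below the catch-all's lower bound. This is a hypothetical sketch; the function name, `min_lead_days` parameter, and thresholds are illustrative, not from any particular implementation.

```python
# Hypothetical sketch: decide whether it's time to split the catch-all.
# catchall_lower_bound is the upper bound of the last bounded partition;
# once max(id) passes it, new rows land in p_future and the split gets expensive.

def should_split(current_max_id: int, catchall_lower_bound: int,
                 rows_per_day: float, min_lead_days: float = 3.0) -> bool:
    """Split while at least min_lead_days of headroom remain, so the
    REORGANIZE still runs against an empty (or nearly empty) catch-all."""
    headroom_rows = catchall_lower_bound - current_max_id
    return headroom_rows < rows_per_day * min_lead_days

# ~5M rows/day, boundary at 3.8B: 10M rows of headroom is under 3 days' lead
print(should_split(3_790_000_000, 3_800_000_000, 5_000_000))  # → True
```

The lead time is the service's safety margin — it should cover at least a few scheduler ticks plus the worst-case DDL duration.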

There’s no single right target — it depends on what’s driving the partitioning in the first place. If the goal is keeping the OLTP working set small via retention, the boundary spacing is really a business decision: how long does the data need to stay queryable in the hot store — one year, seven years, somewhere in between. If the goal is performance, sizing each partition so its indexes fit comfortably in memory is a reasonable rule of thumb, provided there’s no significant key skew concentrating reads or writes on a single partition. The service can be configured against either target and adjust boundary spacing based on observed growth.
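Under the performance-driven target, boundary placement reduces to simple arithmetic over observed growth. A minimal sketch, assuming a `target_rows` budget per partition and a small headroom buffer (both names are illustrative):

```python
# Hypothetical sketch: place the next partition boundary from observed growth.
# target_rows would be derived from the "indexes fit in memory" rule of thumb.

def next_upper_bound(current_max_id: int,
                     rows_per_day: float,
                     target_rows: int,
                     headroom_days: float = 2.0) -> int:
    """Put the boundary one target-partition's worth of rows ahead of the
    current write pointer, plus headroom so the split runs before rows
    start landing above it."""
    headroom = int(rows_per_day * headroom_days)
    return current_max_id + target_rows + headroom

# ~5M rows/day, targeting 100M-row partitions
print(next_upper_bound(3_700_000_000, 5_000_000, 100_000_000))  # → 3810000000
```

Because the inputs are re-measured on every tick, boundary spacing automatically tracks changes in write rate — the property static DDL-time boundaries lack.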

Time-aligned boundaries without a date in the key

Partitioning by id doesn’t mean giving up time-based boundaries — it just means choosing them after the fact. The service can run a single query against the live table to find the ID boundary for any point in time:

-- Where was the ID pointer at the start of March?
SELECT MAX(id) FROM orders WHERE created_at < '2026-03-01';
-- -> 3700842139

That value becomes the upper bound for the next bounded partition. The catch-all stays above it, and future partitions get cut at time-aligned ID boundaries:

-- Partition is still defined by ID range — but chosen to align with a month boundary
ALTER TABLE orders REORGANIZE PARTITION p_future INTO (
    PARTITION p2026_03 VALUES LESS THAN (3700842140),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

The resulting partition p2026_03 contains roughly all orders from March 2026 — but created_at never appears in the primary key, never needs to be in any WHERE clause to get pruning, and never leaks into application code. The date column is used once, at boundary-creation time, by the service running the DDL. Queries continue to filter by id and get pruning for free.

Retention works the same way. To drop data older than twelve months, the service runs SELECT MAX(id) FROM orders WHERE created_at < NOW() - INTERVAL 12 MONTH, identifies every partition with an upper bound below that ID, and drops them. The MAXVALUE catch-all is what makes this pattern work — there’s always a place for new rows to land while the service is deciding where to cut next.
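The drop-selection step is a pure filter over the catalog: any bounded partition whose upper bound sits at or below the cutoff ID holds only expired rows. A hypothetical sketch (function and variable names are illustrative):

```python
# Hypothetical sketch: select partitions eligible for a retention drop.
# cutoff_id is the MAX(id) observed before the retention horizon, per the
# SELECT MAX(id) ... query described above.

def droppable_partitions(partitions: list[tuple[str, int]],
                         cutoff_id: int) -> list[str]:
    """partitions: (name, upper_bound) pairs, excluding the MAXVALUE catch-all.
    A partition is safe to drop when every row in it has id < upper <= cutoff."""
    return [name for name, upper in partitions if upper <= cutoff_id]

layout = [("p2025_01", 1_200_000_000),
          ("p2025_02", 1_450_000_000),
          ("p2026_03", 3_700_842_140)]
print(droppable_partitions(layout, 1_500_000_000))  # → ['p2025_01', 'p2025_02']
```

Using `<=` on the upper bound is deliberately conservative: a partition that straddles the cutoff is kept until the next tick rather than dropped early.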

What the service looks like

The service itself is small. It runs on a schedule — hourly for high-throughput tables, daily for slower-moving ones — and on each tick it does a handful of things:

  • Inventory. Read the current partition layout from the catalog: partition names, upper bounds, and approximate row counts.
  • Sizing check. Look at the active partition (the bounded one just below the catch-all). If it’s filled past a configured threshold of the target size, it’s time to cut the next boundary.
  • Boundary selection. Pick where to cut. For time-aligned partitions, query the live table for the ID that was current at the next month boundary; that ID becomes the upper bound of the new partition.
  • Split. Reorganize the catch-all into a new bounded partition plus a fresh catch-all above it. This is essentially metadata-only as long as the catch-all is still empty when the split runs.
  • Retention pruning. Translate the retention window (e.g. twelve months) into an ID via the same created_at-to-id lookup, then drop any partition whose upper bound sits below that ID.
  • Concurrency guard. A single advisory lock or leader election so two instances don’t run DDL against the same table simultaneously.
  • Metrics and alerting. Per-partition size and row count, time-since-last-tick, and a clear alert if the active partition starts filling faster than the service is splitting ahead of it.
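The decision-making part of a tick can be kept as a pure function that takes the catalog state and returns the DDL to run — which makes it trivially testable and keeps the side effects (locking, executing statements) at the edges. A hypothetical sketch; the thresholds and helper shape are illustrative, not a prescribed design:

```python
# Hypothetical sketch: one tick's decisions as pure data in, DDL out.
# Inputs come from the catalog reads and MAX(id) lookups described above.

TARGET_ROWS = 100_000_000
SPLIT_THRESHOLD = 0.8  # cut the next boundary when the active partition is 80% full

def plan_tick(bounded: list[tuple[str, int]],
              active_rows: int,
              next_boundary: int,
              cutoff_id: int) -> list[str]:
    """bounded: (name, upper_bound) pairs oldest-first, excluding p_future.
    Returns the ordered DDL statements the service would execute this tick."""
    ddl = []
    if active_rows >= SPLIT_THRESHOLD * TARGET_ROWS:  # sizing check
        ddl.append(
            f"ALTER TABLE orders REORGANIZE PARTITION p_future INTO ("
            f"PARTITION p_{next_boundary} VALUES LESS THAN ({next_boundary}), "
            f"PARTITION p_future VALUES LESS THAN MAXVALUE)")
    for name, upper in bounded:                       # retention pruning
        if upper <= cutoff_id:
            ddl.append(f"ALTER TABLE orders DROP PARTITION {name}")
    return ddl

plan = plan_tick([("p0001", 100_000_000)], active_rows=85_000_000,
                 next_boundary=300_000_000, cutoff_id=0)
print(len(plan))  # → 1 (one REORGANIZE, no drops yet)
```

The caller wraps this in the concurrency guard, executes each statement, and emits metrics — the pure core never touches the database.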

Run it as a cron job, a Kubernetes CronJob, or a tiny always-on worker — the operational footprint is intentionally small. The bulk of the production-readiness work goes into the concurrency guard, naming conventions, and handling of split failures from lock contention or concurrent DDL; none of it changes the core idea, which is that a few catalog reads and one DDL statement per tick are enough to keep the layout healthy.

Automating hash and list partitioning

The same “service watches and adjusts” idea transfers to other partitioning strategies, but the operations change.

Hash partitioning distributes rows across a fixed number of partitions via a hash function. With good cardinality, every partition grows at roughly the same rate — there’s nothing to split based on growth. What there is to monitor is skew: a low-cardinality column or a hot key produces one partition that grows faster than the others, which is the failure mode partitioning was supposed to prevent.

Automation here isn’t about adjusting boundaries. Changing the hash partition count rebuilds the entire table, which isn’t something a background service should trigger. The useful work is detection: track per-partition size and growth, surface skew early, run OPTIMIZE or VACUUM FULL on partitions as they bloat. The service flags problems; a human decides whether to reshape the table. The consequence is that hash partition count is a decision to get right the first time — over-provisioning (64 partitions when 16 would do today) is cheap insurance against a later full rebuild.
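The skew check itself is simple: with a well-distributed hash key, each of N partitions should hold roughly 1/N of the rows, so flag any partition whose share exceeds the ideal by a configured factor. A hypothetical sketch (the factor of 2 is an illustrative default):

```python
# Hypothetical sketch: flag hash partitions growing disproportionately.
# row_counts would come from the catalog's approximate per-partition stats.

def skewed_partitions(row_counts: dict[str, int],
                      factor: float = 2.0) -> list[str]:
    """Return partitions holding more than `factor` times their fair share."""
    total = sum(row_counts.values())
    ideal = total / len(row_counts)
    return [name for name, n in row_counts.items() if n > factor * ideal]

# A hot key concentrating writes in p3
counts = {"p0": 10_000, "p1": 11_000, "p2": 9_500, "p3": 62_000}
print(skewed_partitions(counts))  # → ['p3']
```

The output feeds an alert, not an automated rebuild — per the point above, reshaping a hash-partitioned table is a human decision.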

List partitioning maps enumerated values to specific partitions — one partition per region, per significant tenant, etc., with the long tail in a DEFAULT catch-all. The automation problem is partition promotion: when a value in the catch-all grows large enough to deserve its own partition.

-- Starting state: named partitions for known-large tenants, DEFAULT for the rest
CREATE TABLE events (
    id BIGINT AUTO_INCREMENT,
    tenant_id BIGINT NOT NULL,
    payload JSON NOT NULL,
    PRIMARY KEY (id, tenant_id)
)
PARTITION BY LIST (tenant_id) (
    PARTITION p_tenant_42 VALUES IN (42),
    PARTITION p_tenant_73 VALUES IN (73),
    PARTITION p_default VALUES IN (DEFAULT)
);

-- The service notices tenant_id = 108 is now 15% of p_default and growing quickly.
-- It promotes that tenant into its own partition.
ALTER TABLE events REORGANIZE PARTITION p_default INTO (
    PARTITION p_tenant_108 VALUES IN (108),
    PARTITION p_default VALUES IN (DEFAULT)
);

Promotion is more expensive than splitting an empty range catch-all — the rows for that tenant have to physically move out of DEFAULT into the new partition — but it can be batched and scheduled during low-traffic windows. Dormant values can be merged back into DEFAULT in the reverse direction to keep the partition count bounded.
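The promotion decision mirrors the 15% example above: measure each value's share of the DEFAULT partition and flag any that cross a configured threshold. A hypothetical sketch (the threshold is a tunable, not a rule):

```python
# Hypothetical sketch: find values in the DEFAULT partition worth promoting.
# default_counts would come from a GROUP BY over the catch-all's rows.

def promotion_candidates(default_counts: dict[int, int],
                         share_threshold: float = 0.15) -> list[int]:
    """default_counts: rows per tenant_id currently landing in p_default.
    Returns values whose share of the catch-all meets the threshold."""
    total = sum(default_counts.values())
    return [v for v, n in default_counts.items() if n / total >= share_threshold]

counts = {108: 150_000, 211: 40_000, 305: 12_000, 414: 798_000}
print(promotion_candidates(counts))  # → [108, 414]
```

A growth-rate condition (as in the comment above: "growing quickly") is worth adding alongside the share check, so a large-but-dormant value doesn't trigger an expensive promotion for no benefit.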

What actually belongs in the partition key

The question worth asking before adopting any partitioning scheme is: “what column is already in every query that matters?” In most OLTP systems the answer is the primary key, and that should be the default choice — it’s already in every lookup, every join, every foreign-key fetch, so partitioning by it gets pruning for free without leaking into application code. The other answers are real but narrower: tenant_id works in a multi-tenant system only if every query is genuinely tenant-scoped, and a date column works only when the workload is fundamentally time-series and every query already filters by date. If those conditions don’t hold, the partition key will leak into application code the first time someone writes a query without it. Default to the primary key; pick something else only when you can prove it’s already in every query that matters.

The failure mode is partitioning by a column that isn’t already in every query, then retrofitting the query layer to add it. That’s the path that ends with AND created_at >= ? stapled onto queries that have nothing to do with dates, just to avoid a 36-partition scan. Pick a column that’s already in the WHERE clauses, automate the boundary management, and don’t let the scheme leak into application code — that’s what separates partitioning that helps from partitioning that just adds a new category of ongoing work.

SELECT insights FROM experience WHERE downtime = 0; -- Ruslan Tolkachev