Two decisions determine whether partitioning becomes ongoing operational work or quietly disappears into the schema: the column you partition on, and how partition boundaries get managed over time. The first constrains the second — the wrong key choice locks the schema into manual lifecycle work forever; the right one makes a small background service enough. The usual choice — bolting created_at onto the primary key so the table can be range-partitioned on it — forces every query to include that column in the WHERE clause or suffer a scan across every partition. Partition by the primary key instead, and let a background service manage boundaries based on observed growth. Queries keep using the keys they already have, partition pruning works automatically, and the partition column never leaks into application code. The same “service watches and adjusts” pattern applies to hash and list partitioning with different operations.
Partitioning is one of those features that looks great in slides and turns into a maintenance burden in production. The pitch is straightforward — break a huge table into smaller physical pieces, queries only touch the pieces they need, maintenance operations (vacuum, reindex, drops) get cheap. The reality is that most partitioning schemes introduce a new coupling between the schema and every query that touches the table. That coupling tends to be invisible until someone runs a query that should be fast and gets a full-table scan back.
The partition key problem
Both PostgreSQL and MySQL require the partition key to be part of any primary key or unique constraint on the table. That rule exists for correctness — if the primary key didn’t include the partition key, the database couldn’t enforce uniqueness without scanning every partition.
The consequence is that if you want to partition by created_at, you can’t just have PRIMARY KEY (id) anymore. You need PRIMARY KEY (id, created_at). The date column is now part of the primary key whether your application needed it to be or not.
The more subtle cost is that id is no longer unique in the eyes of the database. Uniqueness is enforced on the tuple (id, created_at) — the database will cheerfully accept two rows with the same id as long as they have different timestamps. The application probably still treats id as unique, but nothing in the schema guarantees it. And you can’t recover the guarantee with a separate UNIQUE (id) constraint: both MySQL and PostgreSQL require every unique constraint on a partitioned table to include the partition key columns. The uniqueness property has effectively been traded away.
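A quick illustration of the lost guarantee, using a hypothetical orders table with PRIMARY KEY (id, created_at) (table and column names are illustrative):

```sql
-- Hypothetical table: uniqueness is enforced on (id, created_at), not id alone
INSERT INTO orders (id, customer_id, status, created_at)
VALUES (42, 7, 'new', '2026-01-01 00:00:00');

-- Same id, different timestamp: accepted without a duplicate-key error
INSERT INTO orders (id, customer_id, status, created_at)
VALUES (42, 7, 'new', '2026-01-02 00:00:00');
```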
This isn’t purely cosmetic — it changes the query plans the optimizer is willing to generate:
- With PRIMARY KEY (id), WHERE id = 1 is a constant-time lookup. MySQL’s EXPLAIN shows this as the const access type; the optimizer knows exactly one row matches and the executor stops after finding it. Joins on id are eq_ref, the fastest join access type.
- With PRIMARY KEY (id, created_at), the same query becomes a ref lookup — a prefix scan on the leftmost index column that could, as far as the database is concerned, return multiple rows. Joins that used to be eq_ref become ref. Cardinality estimates fall back to index statistics instead of the guaranteed “one row” assumption, which can push the optimizer toward worse plans further up the query tree.
To get the old const plan back, every lookup has to spell out the full primary key:
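In concrete terms, assuming a hypothetical orders table, the lookups end up looking like this:

```sql
-- Before: const lookup, the optimizer knows exactly one row matches
SELECT * FROM orders WHERE id = 1;

-- After PRIMARY KEY (id, created_at): getting const back means the caller
-- must already know the row's timestamp, not just its id
SELECT * FROM orders WHERE id = 1 AND created_at = '2026-03-14 09:21:07';
```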
That’s the same leakage as partition pruning, just from a different angle — the partition key has forced its way into queries that had nothing to do with dates, first to get pruning and now to get single-row access.
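A sketch of the kind of schema this produces, with illustrative columns and boundaries:

```sql
CREATE TABLE orders (
    id          BIGINT NOT NULL AUTO_INCREMENT,
    customer_id BIGINT NOT NULL,
    status      VARCHAR(20) NOT NULL,
    created_at  DATETIME NOT NULL,
    -- created_at is here only because the partition key requires it
    PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE COLUMNS (created_at) (
    PARTITION p2025_q1 VALUES LESS THAN ('2025-04-01'),
    PARTITION p2025_q2 VALUES LESS THAN ('2025-07-01'),
    PARTITION p2025_q3 VALUES LESS THAN ('2025-10-01'),
    PARTITION p_future VALUES LESS THAN (MAXVALUE)
);
```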
At this point everything still works. The table accepts inserts, queries return correct results, and the partition boundaries exist. The problem shows up the first time someone runs a query that doesn’t include created_at in the WHERE clause.
Partition pruning only works if you ask for it
Partition pruning is the optimization that makes partitioning worth doing — when a query’s WHERE clause restricts the partition key, the database can skip partitions that can’t possibly match. A query for last week’s orders only reads the one or two partitions that contain last week’s data.
That optimization depends entirely on the partition key appearing in the WHERE clause. A query that filters on anything else doesn’t get pruned — it scans every partition.
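A sketch of the two query shapes, assuming the table holds 36 monthly partitions:

```sql
-- Primary-key lookup without the partition key: nothing can be pruned,
-- so EXPLAIN lists every one of the 36 partitions
EXPLAIN SELECT * FROM orders WHERE id = 123456;

-- Same lookup with the partition key added: pruned to a single partition
EXPLAIN SELECT * FROM orders
WHERE id = 123456
  AND created_at >= '2026-03-01'
  AND created_at <  '2026-04-01';
```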
The first query is the kind of lookup that happens constantly — fetch an order by its primary key. On a non-partitioned table, it’s a single index seek. On a partitioned table where the pruning key isn’t in the WHERE clause, it’s a separate index probe against every partition — 36 index lookups instead of one. Still fast in absolute terms, but dramatically worse than the non-partitioned version, which is the opposite of why partitioning was introduced.
The “fix” teams usually land on is to add the partition key to every query that touches the table. That’s a leaky abstraction — a physical storage decision has become a contract with every piece of code that queries the table. New code needs to remember to add the partition filter. Old code needs to be audited. ORMs need to be configured. And there’s no error when someone forgets — the query still returns correct results, just slowly and with a lot of extra I/O.
Static partition boundaries don’t age well
The other thing that tends to go wrong is hardcoding partition boundaries at table creation time. The initial layout reflects whatever the team’s growth projection looked like at that moment. Six months later the traffic pattern has changed, some partitions are 10x larger than others, and the p_future catch-all partition is holding half the table.
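An illustrative layout of the kind that ages badly: boundaries guessed at creation time, with a catch-all absorbing everything the projection missed.

```sql
ALTER TABLE orders PARTITION BY RANGE COLUMNS (created_at) (
    PARTITION p2025_h1 VALUES LESS THAN ('2025-07-01'),
    PARTITION p2025_h2 VALUES LESS THAN ('2026-01-01'),
    -- growth outran the projection; p_future now holds half the table
    PARTITION p_future VALUES LESS THAN (MAXVALUE)
);
```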
Manually splitting and rebalancing partitions is operational work nobody wants to own. It requires scheduling maintenance windows, running ALTER TABLE ... REORGANIZE PARTITION against tables that might be hundreds of gigabytes, coordinating with application teams, and not making a mistake. It tends not to happen until there’s a performance incident, and at that point the fix is expensive.
The shape of the better approach
The primary key already exists. For tables using BIGINT AUTO_INCREMENT, it’s monotonically increasing — newer rows have larger IDs. That’s exactly the property needed for range partitioning. The primary key is the partition key.
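A sketch with illustrative boundaries; note that id stays unique on its own:

```sql
CREATE TABLE orders (
    id          BIGINT NOT NULL AUTO_INCREMENT,
    customer_id BIGINT NOT NULL,
    status      VARCHAR(20) NOT NULL,
    created_at  DATETIME NOT NULL,
    PRIMARY KEY (id)  -- no extra column forced in: the partition key is the PK
)
PARTITION BY RANGE (id) (
    PARTITION p0    VALUES LESS THAN (100000000),
    PARTITION p1    VALUES LESS THAN (200000000),
    PARTITION p_max VALUES LESS THAN (MAXVALUE)
);
```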
Every query that filters by id — which is most of them — gets partition pruning for free, with no changes to application code. Range queries by ID prune across a small number of partitions. Point lookups prune to exactly one. The primary key is already in every WHERE clause that matters, because it’s the primary key.
The apparent trade-off is that partition boundaries aren’t directly defined by time anymore, which looks like it breaks time-based retention. In practice this is less of a trade-off than it looks — the point of partitioning often isn’t retention but keeping index sizes manageable, making maintenance operations cheap, and bounding the blast radius of a bad query. And when retention is a goal, boundaries can still be chosen to align with time — they just get picked at DDL time by the partitioner service, rather than baked into the schema. See Time-aligned boundaries without a date in the key.
Automating range partition management
Everything up to this point assumes range partitioning — partitions defined by continuous boundaries on an ordered value (an ID range, a date range). The operational work is mechanical: watch the active partition fill up, split the MAXVALUE catch-all into a new bounded partition before that happens, and drop partitions that have fallen past the retention threshold. A small service running on a schedule is enough to keep the layout healthy. The hard part isn’t the logic, it’s doing it safely — running DDL against a large table without locking out writes, handling partial failures, and recovering cleanly if the service crashes mid-operation.
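The core DDL the service runs on each tick, with illustrative names and bounds:

```sql
-- Split the still-empty catch-all into a new bounded partition plus a
-- fresh catch-all above it; with no rows to move, this is near metadata-only
ALTER TABLE orders REORGANIZE PARTITION p_max INTO (
    PARTITION p2    VALUES LESS THAN (300000000),
    PARTITION p_max VALUES LESS THAN (MAXVALUE)
);
```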
REORGANIZE PARTITION on an empty catch-all partition is fast — there’s nothing to move. The key insight is that if you split the catch-all before any rows land above the split point, the operation is essentially metadata-only. The service’s job is to stay ahead of the write workload: split the catch-all when it’s still small or empty, not when it’s already holding hundreds of millions of rows.
There’s no single right target — it depends on what’s driving the partitioning in the first place. If the goal is keeping the OLTP working set small via retention, the boundary spacing is really a business decision: how long does the data need to stay queryable in the hot store — one year, seven years, somewhere in between? If the goal is performance, sizing each partition so its indexes fit comfortably in memory is a reasonable rule of thumb, provided there’s no significant key skew concentrating reads or writes on a single partition. The service can be configured against either target and adjust boundary spacing based on observed growth.
Time-aligned boundaries without a date in the key
Partitioning by id doesn’t mean giving up time-based boundaries — it just means choosing them after the fact. The service can run a single query against the live table to find the ID boundary for any point in time:
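For example, to find the ID boundary corresponding to the start of April 2026 (the date is illustrative):

```sql
-- Largest id issued before the boundary; everything at or below it
-- was created before 2026-04-01
SELECT MAX(id) FROM orders WHERE created_at < '2026-04-01';
```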
That value becomes the upper bound for the next bounded partition. The catch-all stays above it, and future partitions get cut at time-aligned ID boundaries:
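Assuming that lookup returned an illustrative value of 912340567, the split looks like:

```sql
-- Upper bound is exclusive, so add 1 to include id 912340567 in p2026_03
ALTER TABLE orders REORGANIZE PARTITION p_max INTO (
    PARTITION p2026_03 VALUES LESS THAN (912340568),
    PARTITION p_max    VALUES LESS THAN (MAXVALUE)
);
```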
The resulting partition p2026_03 contains roughly all orders from March 2026 — but created_at never appears in the primary key, never needs to be in any WHERE clause to get pruning, and never leaks into application code. The date column is used once, at boundary-creation time, by the service running the DDL. Queries continue to filter by id and get pruning for free.
Retention works the same way. To drop data older than twelve months, the service runs SELECT MAX(id) FROM orders WHERE created_at < NOW() - INTERVAL 12 MONTH, identifies every partition with an upper bound below that ID, and drops them. The MAXVALUE catch-all is what makes this pattern work — there’s always a place for new rows to land while the service is deciding where to cut next.
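A sketch of the retention pass, with illustrative partition names:

```sql
-- Cutoff id for the 12-month window
SELECT MAX(id) FROM orders
WHERE created_at < NOW() - INTERVAL 12 MONTH;

-- Every partition whose upper bound sits below that id can go;
-- DROP PARTITION is a cheap metadata operation, not a row-by-row delete
ALTER TABLE orders DROP PARTITION p0, p1;
```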
What the service looks like
The service itself is small. It runs on a schedule — hourly for high-throughput tables, daily for slower-moving ones — and on each tick it does a handful of things:
- Inventory. Read the current partition layout from the catalog: partition names, upper bounds, and approximate row counts.
- Sizing check. Look at the active partition (the bounded one just below the catch-all). If it’s filled past a configured threshold of the target size, it’s time to cut the next boundary.
- Boundary selection. Pick where to cut. For time-aligned partitions, query the live table for the ID that was current at the next month boundary; that ID becomes the upper bound of the new partition.
- Split. Reorganize the catch-all into a new bounded partition plus a fresh catch-all above it. This is essentially metadata-only as long as the catch-all is still empty when the split runs.
- Retention pruning. Translate the retention window (e.g. twelve months) into an ID via the same created_at-to-id lookup, then drop any partition whose upper bound sits below that ID.
- Concurrency guard. A single advisory lock or leader election so two instances don’t run DDL against the same table simultaneously.
- Metrics and alerting. Per-partition size and row count, time-since-last-tick, and a clear alert if the active partition starts filling faster than the service is splitting ahead of it.
Run it as a cron job, a Kubernetes CronJob, or a tiny always-on worker — the operational footprint is intentionally small. The bulk of the production-readiness work goes into the concurrency guard, naming conventions, and handling of split failures from lock contention or concurrent DDL; none of it changes the core idea, which is that a few catalog reads and one DDL statement per tick are enough to keep the layout healthy.
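The inventory step, for instance, is a single catalog query (MySQL shown; the table name is illustrative):

```sql
-- Partition names, upper bounds, and approximate row counts
-- (TABLE_ROWS is an estimate for InnoDB, which is all the service needs)
SELECT PARTITION_NAME, PARTITION_DESCRIPTION, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'orders'
ORDER BY PARTITION_ORDINAL_POSITION;
```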
Automating hash and list partitioning
The same “service watches and adjusts” idea transfers to other partitioning strategies, but the operations change.
Hash partitioning distributes rows across a fixed number of partitions via a hash function. With good cardinality, every partition grows at roughly the same rate — there’s nothing to split based on growth. What there is to monitor is skew: a low-cardinality column or a hot key produces one partition that grows faster than the others, which is the failure mode partitioning was supposed to prevent.
Automation here isn’t about adjusting boundaries. Changing the hash partition count rebuilds the entire table, which isn’t something a background service should trigger. The useful work is detection: track per-partition size and growth, surface skew early, run OPTIMIZE or VACUUM FULL on partitions as they bloat. The service flags problems; a human decides whether to reshape the table. The consequence is that hash partition count is a decision to get right the first time — over-provisioning (64 partitions when 16 would do today) is cheap insurance against a later full rebuild.
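The detection side is the same catalog read, sorted to surface skew (assuming a hypothetical hash-partitioned events table):

```sql
-- A healthy hash layout shows roughly equal TABLE_ROWS per partition;
-- an outlier at the top of this list is the skew signal to alert on
SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'events'
ORDER BY TABLE_ROWS DESC;
```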
List partitioning maps enumerated values to specific partitions — one partition per region, per significant tenant, etc., with the long tail in a DEFAULT catch-all. The automation problem is partition promotion: when a value in the catch-all grows large enough to deserve its own partition.
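A sketch of promotion in PostgreSQL, which supports a true DEFAULT partition (names and the tenant value are illustrative; MySQL list partitioning has no DEFAULT, so the mechanics differ there):

```sql
BEGIN;
-- Detach the catch-all so a new partition for tenant 42 can be created
-- without overlapping the rows it currently holds
ALTER TABLE orders DETACH PARTITION orders_default;
CREATE TABLE orders_tenant_42 PARTITION OF orders FOR VALUES IN (42);

-- Physically move the tenant's rows, then reattach the catch-all
INSERT INTO orders_tenant_42 SELECT * FROM orders_default WHERE tenant_id = 42;
DELETE FROM orders_default WHERE tenant_id = 42;
ALTER TABLE orders ATTACH PARTITION orders_default DEFAULT;
COMMIT;
```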
Promotion is more expensive than splitting an empty range catch-all — the rows for that tenant have to physically move out of DEFAULT into the new partition — but it can be batched and scheduled during low-traffic windows. Dormant values can be merged back into DEFAULT in the reverse direction to keep the partition count bounded.
What actually belongs in the partition key
The question worth asking before adopting any partitioning scheme is: “what column is already in every query that matters?” In most OLTP systems the answer is the primary key, and that should be the default choice — it’s already in every lookup, every join, every foreign-key fetch, so partitioning by it gets pruning for free without leaking into application code. The other answers are real but narrower: tenant_id works in a multi-tenant system only if every query is genuinely tenant-scoped, and a date column works only when the workload is fundamentally time-series and every query already filters by date. If those conditions don’t hold, the partition key will leak into application code the first time someone writes a query without it. Default to the primary key; pick something else only when you can prove it’s already in every query that matters.
The failure mode is partitioning by a column that isn’t already in every query, then retrofitting the query layer to add it. That’s the path that ends with AND created_at >= ? stapled onto queries that have nothing to do with dates, just to avoid a 36-partition scan. Pick a column that’s already in the WHERE clauses, automate the boundary management, and don’t let the scheme leak into application code — that’s what separates partitioning that helps from partitioning that just adds a new category of ongoing work.