It's Almost Always the Queries, Part V: Disk Has Two Alarms, Not One

TL;DR

Two alarms ride the same dashboard tile: the disk filling up, and the disk slowing down. Both have query-level and schema-level fixes that hold for years. Capacity is partition-and-archive, not DELETE. IOPS is covering indexes and the access patterns that go with them. The cloud’s autoscaling and burst credits mask both, and the bill is where the symptom finally surfaces.

On-call gets paged on RDS I/O latency at 2pm Tuesday. The Datadog graph shows read latency at four times its baseline, write latency climbing in lockstep. The engineer on rotation bumps the instance from db.t3.medium to db.t3.large, latency drops back inside the SLO within five minutes, the page closes, and the incident channel goes quiet. Three days later: same alert, same dashboard, same “fix.” Same five-minute window. By the fourth time, somebody pulls the BurstBalance metric out of CloudWatch and the picture changes. The instance upgrade had not actually done anything to the workload. It had reset the gp2 burst-credit pool from zero back to full. The query mix was steady. The variable was the credit accounting, and the dashboard the team was looking at did not graph it.

The obvious fix and why it buys you weeks

Reach for a bigger volume, more provisioned IOPS, a beefier instance class, or all three at once. Each lever is real, and during an active incident with revenue tied to checkout latency, the right call is often whichever one moves the graph fastest. They share a property the postmortem usually skips: each rents capacity proportional to the pattern underneath - the bloat that produces dead pages, the retention nobody set, the SELECT list that stopped being covered when a column got added. The cost recurs every time growth or concurrency pushes the workload back into the same shape. Part I called this renting the bug. The disk case is two bugs sharing one alert.

Two failure modes, two upstream fixes

The disk tile collapses two genuinely different problems into one number. The disk is filling, which is capacity. The disk is slowing, which is IOPS. They have different mechanics, different upstream causes, and different fixes that hold for years rather than weeks. The 2pm incident is almost always the IOPS one. The Friday-afternoon “we’re at 87% storage” thread is the capacity one. Same dashboard, two alarms, two posts’ worth of mechanism.

Capacity is bloat plus growth, and the shape that holds up is partition-and-archive. A team’s first instinct on a too-full disk is to write a DELETE FROM events WHERE created_at < NOW() - INTERVAL '180 days' and ship it. The space does not come back. On PostgreSQL, DELETE marks tuples dead and leaves them on the page; VACUUM makes that space reusable inside the table, but it only returns physical space to the OS when the pages at the very end of the table are empty and can be truncated. VACUUM FULL does return the space, but it rewrites the table under an ACCESS EXCLUSIVE lock, which is not a thing you run on a busy production system. On an UPDATE-heavy table, autovacuum can fall behind the dead-tuple production rate, and bloat grows without bound until somebody intervenes. InnoDB has its own version of the same problem: deletes and updates fragment the clustered index, and a long-running transaction (an analytics session left open, a misbehaving connection pool, an export that took longer than expected) pins the undo log via the history list and prevents purge from cleaning up. SHOW ENGINE INNODB STATUS lists “History list length” precisely so you can spot the case where purge is losing.
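One way to see why autovacuum falls behind on big tables is to work out how many dead tuples have to pile up before it even starts. The sketch below assumes PostgreSQL's stock settings (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2); the function name and the row counts are illustrative, not taken from any system described here.

```python
def autovacuum_trigger(reltuples: int,
                       threshold: int = 50,
                       scale_factor: float = 0.2) -> int:
    """Dead tuples that must accumulate before autovacuum picks up a table.

    Mirrors the documented rule: threshold + scale_factor * reltuples.
    The defaults are PostgreSQL's stock values; a tuned system may differ.
    """
    return int(threshold + scale_factor * reltuples)


# Hypothetical table sizes, chosen only to show how the trigger scales.
small_table = autovacuum_trigger(100_000)
large_table = autovacuum_trigger(500_000_000)

print(f"100k-row table: vacuum after {small_table:,} dead tuples")
print(f"500M-row table: vacuum after {large_table:,} dead tuples")
```

With the default scale factor, the half-billion-row table carries roughly a hundred million dead tuples before the first pass begins, which is why per-table scale-factor overrides are a common accompaniment to large UPDATE-heavy tables.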

The pattern that holds: partition by date or by tenant, drop or detach old partitions on a schedule, offload the detached partitions to cheaper storage if compliance or analytics still need them. PostgreSQL declarative partitioning (PG 10+) with pg_partman handles the rotation; the extension’s background worker can create new partitions ahead of the curve and run the retention drop on a schedule with no external cron. ALTER TABLE ... DETACH PARTITION turns a partition into a standalone table you can dump and drop, or move to a different tablespace on slower disk. MySQL has the same shape via native PARTITION BY RANGE and ALTER TABLE ... DROP PARTITION, which on InnoDB returns the space directly because, with file-per-table tablespaces (the default), each partition lives in its own tablespace file. The space comes back as soon as the partition is dropped, instead of waiting on a VACUUM that never quite catches up. The trade is schema churn upfront, and the rest of the partitioning post is what to know before you commit to a partition key you cannot easily change later.
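As a rough illustration of what a monthly rotation emits, here is a minimal sketch that generates the DDL by hand. It is not pg_partman's implementation; the parent table name events, the name_YYYY_MM naming scheme, and the helper names are all assumptions made for the example.

```python
from datetime import date


def month_start(d: date, offset: int = 0) -> date:
    """First day of the month that is `offset` months away from d's month."""
    idx = d.year * 12 + (d.month - 1) + offset
    return date(idx // 12, idx % 12 + 1, 1)


def rotation_ddl(parent: str, today: date, keep_months: int,
                 premake: int = 2) -> list[str]:
    """DDL for one rotation pass: pre-create future months, detach one old one."""
    stmts = []
    # Create partitions ahead of the curve so inserts never miss a target.
    for i in range(1, premake + 1):
        lo, hi = month_start(today, i), month_start(today, i + 1)
        stmts.append(
            f"CREATE TABLE IF NOT EXISTS {parent}_{lo:%Y_%m} "
            f"PARTITION OF {parent} "
            f"FOR VALUES FROM ('{lo}') TO ('{hi}');"
        )
    # Detach the month that just aged out of the retention window.
    old = month_start(today, -keep_months)
    stmts.append(f"ALTER TABLE {parent} DETACH PARTITION {parent}_{old:%Y_%m};")
    return stmts


# Hypothetical run: six months of retention, one month created ahead.
stmts = rotation_ddl("events", date(2026, 2, 15), keep_months=6, premake=1)
for s in stmts:
    print(s)
```

The detached table is then an ordinary table: dump it, move it, or drop it on whatever schedule compliance allows.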

IOPS is access pattern, and more IOPS is the answer to the wrong question. A query that is “well-indexed” can still saturate the disk if the index does not cover the SELECT list. The classic shape: a composite index on (customer_id, status) happily serves WHERE customer_id = $1 AND status = 'open', but the SELECT projects customer_id, status, total_cents, created_at, and the engine follows a heap pointer for each of the few thousand matching rows to fetch the columns the index does not contain. A few thousand random heap fetches per call, multiplied by call volume, is an IOPS load that no amount of provisioning quietly absorbs. The plan looks correct. The dashboard reads “needs more IOPS.” The covering-index post walks the diagnostic in detail; the fix is INCLUDE columns on PostgreSQL 11+ for the projection-only payload, or a wider composite index on MySQL that carries the projected columns as trailing key columns. Same query, two orders of magnitude fewer pages read, and on a recently vacuumed table the heap-fetch count drops to zero in EXPLAIN (ANALYZE, BUFFERS) output.
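The difference between an index that serves the WHERE clause and one that covers the SELECT list shows up directly in the plan. The sketch below uses Python's bundled SQLite purely because it runs anywhere; SQLite has no INCLUDE clause, so it demonstrates the wide-composite variant, and the orders table and index names are hypothetical. On PostgreSQL the equivalent check is EXPLAIN (ANALYZE, BUFFERS).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "status TEXT, total_cents INTEGER, created_at TEXT)"
)

QUERY = (
    "SELECT customer_id, status, total_cents, created_at "
    "FROM orders WHERE customer_id = ? AND status = 'open'"
)


def plan(sql: str) -> str:
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (42,)).fetchall()
    return " | ".join(row[-1] for row in rows)


# Index matches the predicate but not the projection: table lookups remain.
conn.execute("CREATE INDEX idx_narrow ON orders (customer_id, status)")
narrow_plan = plan(QUERY)

# Index carries every projected column: the table is never touched.
conn.execute("DROP INDEX idx_narrow")
conn.execute(
    "CREATE INDEX idx_wide ON orders "
    "(customer_id, status, total_cents, created_at)"
)
wide_plan = plan(QUERY)

print("narrow:", narrow_plan)
print("wide:  ", wide_plan)
```

Both plans say the index is used; only the second one says COVERING, and that one word is the difference between an index seek and an index seek plus one table lookup per matching row.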

A worked example with real numbers: a February 2026 write-up titled “Between select and disk” documents a single query reading 27,841 blocks (217 MB) to return zero rows: roughly 1,989 IOPS from a query that filtered everything out on the heap because a JSONB predicate could not be evaluated inside a B-tree on account_id. A companion query had the same shape: 12,071 rows fetched, 107 MB, ~1,944 IOPS, zero rows returned. Combined, the two queries demanded ~3,900 IOPS against a 3,000 IOPS provisioned ceiling, with reads briefly hitting 3,668 IOPS as burst credits allowed. The fix was a GIN index that let the JSONB filter run inside the index scan, instead of after the heap fetch. The disk dashboard during the incident read “IOPS saturated”; the actual cause was an index that did not match the predicate.
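The arithmetic behind those figures is worth redoing by hand. This is a small check using only the numbers quoted above, and it assumes PostgreSQL's default 8 KiB block size; the variable names are illustrative.

```python
BLOCK_BYTES = 8192  # PostgreSQL's default block size (assumed, not stated in the write-up)


def blocks_to_mib(blocks: int) -> float:
    """Convert a block count from EXPLAIN (BUFFERS) into MiB read."""
    return blocks * BLOCK_BYTES / 2**20


first_query_mib = blocks_to_mib(27_841)

# Per-query IOPS figures as quoted in the text.
combined_iops = 1_989 + 1_944
provisioned_ceiling = 3_000
overshoot = combined_iops / provisioned_ceiling

print(f"first query read {first_query_mib:.1f} MiB to return zero rows")
print(f"combined demand {combined_iops} IOPS = {overshoot:.2f}x the ceiling")
```

Two queries returning nothing ask for about 1.3 times what the volume is provisioned to deliver, before any useful traffic is counted.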

The query-side moves that keep an index covering: project the columns you actually need (an ORM defaulting to SELECT * defeats coverage the moment any column lands outside the index), prefer keyset pagination over deep OFFSET (a LIMIT 50 OFFSET 100000 reads a hundred thousand index entries to discard them and return the next fifty), match the index’s column order in ORDER BY so the planner skips the sort, and write WHERE predicates the planner can push down to the index leading column. Non-SARGable predicates are the other leg of this: wrap a column in a function, lead a LIKE pattern with a wildcard, or force an implicit cast from bigint to text, and the engine evaluates the predicate per row instead of seeking, and the IOPS graph follows. Each of these is a query-level move with no schema change, and each removes IO that more provisioned IOPS would only hide.
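Keyset pagination is the easiest of these moves to show side by side with OFFSET. The sketch below runs on Python's bundled SQLite with a hypothetical events table; the SQL shape carries over to PostgreSQL and MySQL unchanged apart from placeholder syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (id, payload) VALUES (?, ?)",
    ((i, f"event-{i}") for i in range(1, 1001)),
)

PAGE_SIZE = 50


def page_by_offset(page_no: int) -> list:
    """OFFSET paging: the engine walks and discards page_no * PAGE_SIZE rows."""
    return conn.execute(
        "SELECT id, payload FROM events ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, page_no * PAGE_SIZE),
    ).fetchall()


def page_by_keyset(last_seen_id: int) -> list:
    """Keyset paging: seek straight past the last id the client already saw."""
    return conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, PAGE_SIZE),
    ).fetchall()


offset_page = page_by_offset(10)   # skips 500 rows, then reads 50
keyset_page = page_by_keyset(500)  # seeks to id 500, then reads 50

print(offset_page[0], keyset_page[0])
```

Both calls return the same fifty rows. The keyset version does the same amount of work on page ten as on page ten thousand, at the cost of the client carrying the last id instead of a page number.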

The managed-cloud overlay produces false fixes. Three behaviors on AWS, with analogues on Azure and GCP, make the disk dashboard easy to misread. The first is gp2 burst credits (gp3 drops the credit model in favor of a flat 3,000 IOPS baseline with additional IOPS provisioned separately, so it throttles at a fixed ceiling instead of a draining pool). A gp2 volume earns 3 I/O credits per GiB per second up to a 5.4-million-credit cap, sustains 3,000 IOPS while credits last, and falls to its baseline (as low as 100 IOPS on small volumes) when the pool drains. The AWS blog post that introduced the BurstBalance metric in 2016 is still the cleanest reference. A workload that has been steady for months can hit a credit wall during a backup window or an end-of-quarter report, and the latency graph tells you “the disk got slow” without showing you that the disk was throttled because the credit counter hit zero. Bumping the instance, or growing the volume, resets the picture. Three days later the credits drain again. Same incident, same fix, same cycle, and BurstBalance is the metric that closes the loop.
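The credit wall is predictable once the bucket arithmetic is written down. The sketch below models the gp2 numbers given above (3 credits per GiB per second, 100 IOPS floor, 5.4 million credit cap, 3,000 IOPS burst) plus gp2's 16,000 IOPS baseline cap; it is a simplified model with hypothetical function names, not AWS's own accounting code.

```python
import math

BUCKET_CREDITS = 5_400_000  # full burst bucket
BURST_IOPS = 3_000          # ceiling while credits last
MIN_BASELINE = 100          # floor for small volumes
MAX_BASELINE = 16_000       # gp2 per-volume baseline cap


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline (and credit earn rate): 3 IOPS per GiB, clamped to the limits."""
    return min(MAX_BASELINE, max(MIN_BASELINE, 3 * size_gib))


def seconds_until_empty(size_gib: int, demand_iops: float,
                        credits: float = BUCKET_CREDITS) -> float:
    """How long a sustained demand can run before the bucket hits zero."""
    baseline = gp2_baseline_iops(size_gib)
    # The volume cannot serve more than its ceiling, whatever is asked of it.
    served = min(demand_iops, max(BURST_IOPS, baseline))
    if served <= baseline:
        return math.inf  # demand fits inside the baseline; credits never drain
    return credits / (served - baseline)


def seconds_to_refill(size_gib: int, demand_iops: float = 0) -> float:
    """How long an empty bucket takes to refill at a demand below baseline."""
    baseline = gp2_baseline_iops(size_gib)
    if demand_iops >= baseline:
        return math.inf
    return BUCKET_CREDITS / (baseline - demand_iops)


drain_100gib = seconds_until_empty(100, 3_000)
refill_100gib = seconds_to_refill(100)
print(f"100 GiB at 3,000 IOPS: empty in {drain_100gib / 60:.0f} min")
print(f"100 GiB idle: full again in {refill_100gib / 3600:.0f} h")
```

A 100 GiB volume pushed to its burst ceiling lasts about 33 minutes and needs five idle hours to recover, which is the shape of an incident that "fixes itself" overnight and returns at the next peak.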

The second is Aurora’s no-disk-in-the-traditional-sense model. Aurora storage scales transparently to 128 TB, so the disk-full alarm never fires. On Aurora Standard, IO is billed per request, and the alert nobody sets is on the bill. In May 2023, AWS announced Aurora I/O-Optimized, a flat-rate pricing option that removes per-IO charges in exchange for a higher instance and storage rate. The break-even, per AWS’s own guidance, is roughly 25% of total Aurora spend going to IO; above that, I/O-Optimized wins, below that, Standard does. VGS’s case study from May 2025 puts numbers on it: their Aurora:StorageIOUsage was 30–40% of daily Aurora cost, traced to a Monday cleanup cron job concentrating millions of I/O operations into one window, and the move to I/O-Optimized cut their overall Aurora bill by roughly 20%, which at their scale was hundreds of thousands of dollars per year. The point is not the calculator. The point is that on Aurora the failure mode is not a graph that goes red, it is an invoice line item that climbs, and the cause is the same access pattern that would have shown up as IOPS saturation on RDS.
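The break-even check itself is a one-line ratio. The sketch below hard-codes the roughly 25% threshold quoted above as a rule of thumb; the function names and the dollar figures are hypothetical, and a real decision would use the billing line items rather than round numbers.

```python
IO_SHARE_BREAK_EVEN = 0.25  # the ~25% rule of thumb quoted in the text


def io_share(io_cost: float, total_aurora_cost: float) -> float:
    """Fraction of total Aurora spend that goes to per-request I/O charges."""
    if total_aurora_cost <= 0:
        raise ValueError("total_aurora_cost must be positive")
    return io_cost / total_aurora_cost


def suggested_config(io_cost: float, total_aurora_cost: float) -> str:
    """Which storage configuration the rule of thumb points to."""
    share = io_share(io_cost, total_aurora_cost)
    return "I/O-Optimized" if share > IO_SHARE_BREAK_EVEN else "Standard"


# Hypothetical monthly bills: one I/O-heavy cluster, one quiet one.
heavy = suggested_config(io_cost=3_500, total_aurora_cost=10_000)
light = suggested_config(io_cost=1_000, total_aurora_cost=10_000)
print(heavy, light)
```

A cluster in the 30 to 40% range described in the case study lands on the I/O-Optimized side of the line; the more useful habit is rerunning the ratio whenever the workload changes shape.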

The third is RDS storage autoscaling. Enable it, and the disk-full alarm never fires because the volume grows automatically up to the configured ceiling. The bloat keeps growing, the retention policy still does not exist, and the issue surfaces six months later at finance review when the storage line is double what it was. Autoscaling is fine; running it without a retention policy underneath turns “we need to delete old data” into “we need to delete old data and reclaim a terabyte of provisioned storage we’re paying for.”

What this costs

Each upstream fix has trade-offs the postmortem should name out loud.

Partition-and-archive is schema churn upfront and operational scaffolding forever. Partition key choice, query routing across detached partitions, and the rotation tooling itself are the trade-offs worth making in advance, and the partitioning post is the canonical reference for that decision pass. The thing to keep in mind here is that none of it looks urgent until the disk is full, and that is the wrong moment to design a partition strategy.

Covering indexes are write amplification and storage overhead. Every INSERT, and every UPDATE that touches a covered column, writes to every index that covers it; an INCLUDE clause adds payload columns to the index leaf without making them part of the key, which keeps the index smaller than a wide composite but still means the leaf gets updated on every write to those columns. A covering index designed for today’s SELECT list ages out the moment the SELECT list grows; the same query pattern from the covering-index post reappears six months later when a new feature adds a column. Adding INCLUDE columns interacts with PostgreSQL’s HOT update path too: HOT updates need the new tuple to fit on the same page and not modify any indexed columns, and a wider index payload combined with a fillfactor near 100% can starve the HOT optimization without changing any query. ALTER TABLE ... SET (fillfactor = 90) for write-heavy tables is the standard accompaniment to wide covering indexes, and it is the easy thing to forget.
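Whether a new covering index has starved HOT updates can be read from two counters PostgreSQL already keeps in pg_stat_user_tables: n_tup_upd and n_tup_hot_upd. The sketch below only does the ratio; the counter values, the helper names, and the 10-point tolerance are assumptions for illustration.

```python
from typing import Optional


def hot_update_ratio(n_tup_upd: int, n_tup_hot_upd: int) -> Optional[float]:
    """Share of updates that took the HOT path; None if there were no updates."""
    if n_tup_upd == 0:
        return None
    return n_tup_hot_upd / n_tup_upd


def hot_ratio_dropped(before: float, after: float,
                      tolerance: float = 0.10) -> bool:
    """True if the HOT share fell by more than `tolerance` (absolute)."""
    return (before - after) > tolerance


# Hypothetical counter deltas sampled before and after adding INCLUDE columns.
before = hot_update_ratio(n_tup_upd=1_000, n_tup_hot_upd=850)
after = hot_update_ratio(n_tup_upd=1_000, n_tup_hot_upd=200)

print(f"HOT share before: {before:.0%}, after: {after:.0%}")
print("regressed:", hot_ratio_dropped(before, after))
```

The counters are cumulative, so the comparison is only meaningful on deltas taken over comparable windows, not on the raw totals.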

Cloud-side moves are mostly upside, with one trap worth naming. Aurora I/O-Optimized’s break-even moves with workload. A cluster that was fine on Standard last quarter can cross the I/O-heavy threshold this quarter and nobody notices until the next bill review. AWS publishes an estimator using CloudWatch metrics for the recalculation; running it quarterly catches the drift.

Warning

The most common partition-and-archive footgun is queries that span the archive boundary returning silently incomplete results. A report that used to read three years of data still asks for three years, the partition that holds year three has been detached and archived to S3, and the query returns two years of data with no error. Once is a bug. Recurring is an architecture problem. The fix is making the boundary explicit, either by routing historical queries to a federated view that includes the archive (CREATE FOREIGN TABLE on PostgreSQL, or a UNION ALL against a separate archive schema), or by rejecting queries that ask for ranges the live table does not cover. Failing loud beats answering wrong.
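The "reject the range" option can be as small as a guard in front of the report query. This is a minimal sketch under assumed names: LIVE_FROM stands in for wherever the rotation job records the oldest attached partition, and ArchivedRangeError and check_live_range are hypothetical.

```python
from datetime import date

# Hypothetical boundary: the oldest day still present in the live table.
LIVE_FROM = date(2024, 1, 1)


class ArchivedRangeError(Exception):
    """Raised when a query asks for data that has been detached and archived."""


def check_live_range(start: date, end: date,
                     live_from: date = LIVE_FROM) -> tuple:
    """Return (start, end) unchanged, or fail loudly if it crosses the archive."""
    if start > end:
        raise ValueError("start must not be after end")
    if start < live_from:
        raise ArchivedRangeError(
            f"requested data from {start}, but the live table starts at "
            f"{live_from}; route this query to the archive instead"
        )
    return start, end


# In-range request passes through untouched.
ok_range = check_live_range(date(2025, 1, 1), date(2026, 1, 1))

# A three-year report that reaches past the boundary fails instead of
# silently returning two years of rows.
try:
    check_live_range(date(2023, 1, 1), date(2026, 1, 1))
    rejected = False
except ArchivedRangeError as exc:
    rejected = True
    print("rejected:", exc)
```

The guard costs one comparison per report and turns a wrong answer into a ticket somebody can act on.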

When this doesn’t apply

Three cases where the hardware reading is right and the schema reading is not.

A working set that genuinely does not fit in RAM. If a hot table is 12 GB on an 8 GB instance and the top of pg_stat_statements is dominated by reads against that table, no partition strategy and no covering index change the fact that the buffer cache is too small. The wait events tell the story: IO:DataFileRead dominating the active-session-by-wait graph in Part IV’s terms. The fix is RAM, or a smaller working set, and “smaller working set” usually means partitioning so the active subset shrinks, which means the line between “more RAM” and “fewer rows” is fuzzier than the framing suggests.

Snapshot or backup operations consuming live IOPS during a known window. If the latency spike lines up with the 2am backup window, or with a once-a-month consistency check, and the rest of the day is fine, the answer is scheduling and IO throttling rather than query optimization. RDS snapshots are incremental and cheap to take, but the first snapshot on a fresh volume and any major change to the dataset force a full sweep that competes with live traffic.

A one-time migration off a system that should have moved to cheaper storage years ago. If the disk is full of 2017 data that nobody has read in three years, the fix is dumping it to S3 once and reclaiming the space, not designing a rotation strategy for data that is not being produced anymore. Partition-and-archive is for recurring patterns. One-off cleanup is one-off cleanup.

The bigger picture

Capacity and IOPS are the slowest-to-alert resources a relational database has, and on managed cloud the autoscaling, burst credits, and per-IO billing models hide the cause while the bill quietly absorbs the symptom. Fix-once strategies survive workload growth in a way that “bigger instance” does not: a partition rotation dropping a month every month is not less effective the year you triple traffic, and a covering index that touches zero heap pages is not less effective when the table grows tenfold. The diagnostic discipline is the same one running through Parts II–IV. Pull the top-10 from pg_stat_statements by total_exec_time, read the plan with BUFFERS, check BurstBalance and the storage autoscaling history before resizing the volume. The instance type is the last thing to change.