Featured image of post It's Almost Always the Queries, Part IV: When the Sort Spills

It's Almost Always the Queries, Part IV: When the Sort Spills

work_mem, shared_buffers, innodb_buffer_pool_size: three knobs, and the one that caused your OOM is the one nobody resized. A backend gets SIGKILLed, the postmaster can't trust shared memory, and the whole database runs crash recovery.

TL;DR
A database alerting on memory pressure is almost never a workload that needs more RAM. The dangerous allocation is per-sort, per-hash, per-connection and transient, so the instance-level memory graph never shows it until data size and concurrency line up at the same instant. The exception is a genuinely large working set with a low cache hit ratio, and that one is real.

A reporting endpoint sorts orders by created_at for an account’s quarterly export. Last quarter it ran in about 200 ms. This quarter it takes four seconds, and nobody changed the query. The EXPLAIN ANALYZE output is the whole story:

1
2
3
4
5
6
7
8
9
-- last quarter
Sort  (cost=8420.11..8556.34 rows=54492 width=84)
  Sort Key: created_at
  Sort Method: quicksort  Memory: 3072kB

-- this quarter
Sort  (cost=41922.88..42698.05 rows=310068 width=84)
  Sort Key: created_at
  Sort Method: external merge  Disk: 24960kB

The data grew, which is the one thing that always happens. The sort set crossed work_mem, the executor stopped sorting in memory and started an external merge against a temp file on disk, and a CPU-bound operation became an IO-bound one. No crash, no OOM kill, no page. The memory dashboards show nothing, because a temp file written by one backend for the duration of one sort is not a number that appears on an instance-level “RAM used %” graph. The query just got slow, and the only place the cause is visible is in a plan nobody ran.

More RAM doesn’t raise work_mem

The reflex is to resize the box. The endpoint is “running out of memory,” the instance has 32 GB, so move it to 64 and the sort has room. It’s a clean story and it is wrong, for a reason that is structural rather than situational.

The sort spilled because it exceeded work_mem, and work_mem is a fixed per-operation budget that has nothing to do with how much RAM the box has. The default is 4 MB. It is 4 MB on a 4 GB instance and 4 MB on a 256 GB instance. Doubling the instance does not move that number by a byte. MySQL’s per-thread buffers behave the same way: sort_buffer_size defaults to 256 KB and stays 256 KB on a 4 GB box and on a 256 GB box, exactly as work_mem does. The export’s sort set is 24 MB; it will spill on the bigger box exactly as it spilled on the smaller one, because the limit it crossed is a config value, not a quantity of physical memory. Part I called this renting the bug. The memory case is the cleanest version of the metaphor: you can buy RAM the database is structurally unable to point at the operation that needs it.

So the real fix looks like raising work_mem. Set it to 64 MB, the export’s sort fits in memory, the endpoint is fast again. That works for the export. It also arms a multiplier across every other connection on the system, and the multiplier is the actual subject of this article.

The memory the dashboard can’t see

work_mem is not a global pool the server draws down. It is a per-operation allowance, granted independently to every sort, hash, and materialize node, in every query, in every session, at the same time. The PostgreSQL “Resource Consumption” documentation states the consequence directly:

Note that a complex query might perform several sort and hash operations at the same time, with each operation generally being allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem.

“Many times” is the load-bearing phrase, and it has three separate multipliers stacked inside it.

The first is plan shape. A report query that joins four tables, aggregates, and sorts the result does not allocate work_mem once. It allocates it per memory-using node: a hash for each hash join, a sort for the ORDER BY, another for a DISTINCT, a hash for the GROUP BY. Four memory-using nodes in one plan is ordinary, not pathological. That single query’s peak is 4 × work_mem, not work_mem.

The second is hash_mem_multiplier. Since PostgreSQL 13 a hash_mem_multiplier setting governs this, and since PostgreSQL 15 it defaults to 2.0, so hash-based nodes get a larger allowance than sort-based ones. The “Resource Consumption” docs: “The final limit is determined by multiplying work_mem by hash_mem_multiplier. The default value is 2.0, which makes hash-based operations use twice the usual work_mem base amount.” A hash join in that plan is not budgeted at work_mem. It is budgeted at 2 × work_mem, on a current default, before anyone has touched a setting.

The third is parallel workers. A parallel sequential scan feeding a parallel hash or sort gives each worker its own work_mem allocation for its slice of the work. A query with max_parallel_workers_per_gather set to 4 can have five processes (the leader plus four workers) each holding a work_mem-sized hash for the same node.

Now do the arithmetic the way Part III did it with CPU-seconds. You raised work_mem to 64 MB to stop the export from spilling. A moderately complex query has four memory-using nodes, two of them hashes that get the 2.0 multiplier. Call its peak a conservative 64 MB × 6 once the hash multiplier is counted, roughly 384 MB for one execution. Your Rails app runs 300 worker processes, each holding a connection, and on a busy afternoon 80 of them are running a query of about that shape at the same instant. 384 MB × 80 is just over 30 GB of transient sort and hash memory, none of it in shared_buffers, none of it visible on the memory graph until the second it is all allocated at once. The instance has 32 GB. The export’s sort fit. The OOM killer arrives anyway, two weeks after the change, with no obvious deploy to blame, because the change that caused it was a config edit and the trigger was an ordinary Tuesday with slightly more concurrency than Monday.

The arithmetic above assumes the plan goes as costed. But the planner chose that plan from estimates, not from the rows it would actually find. Sort versus no-sort, hash join versus merge join, HashAggregate versus GroupAggregate are all decided at planning time from row-count guesses in pg_statistic. A stale estimate, or an n_distinct that was always wrong, and the planner picks a memory-hungry plan accurate stats would have avoided, or under-sizes an allocation that blows past work_mem at runtime. GROUP BY is the sharp case. Before PostgreSQL 13, released 2020-09-24, a HashAggregate chosen on a too-low n_distinct estimate had no disk fallback at all: the hash table grew past work_mem with no bound and could OOM the server. PG13 added the spill, and its release notes state the new behavior plainly. Hash aggregation “was avoided if it was expected to use more than work_mem memory”; now the plan can be chosen anyway, and “the hash table will be spilled to disk if it exceeds work_mem times hash_mem_multiplier.” An uncapped OOM became the same external merge trade the export opened on, and the data never had to grow to trigger it. The estimate only had to go stale.

The same structure runs on MySQL with different knob names. The per-thread buffers (sort_buffer_size, join_buffer_size, read_rnd_buffer_size) are allocated per connection, and the MySQL documentation is explicit that join_buffer_size can be allocated more than once for a single query, once per join that cannot use an index. Multiply the per-thread total by max_connections and you have MySQL’s version of the same blind multiplier. MySQL’s optimizer is just as estimate-bound: it picks join order and access paths from index cardinality (refreshed by ANALYZE TABLE, stored under innodb_stats_persistent) and optional column histograms, and stale cardinality pushes it onto a join that cannot use an index and falls back to a block-nested-loop on join_buffer_size. Internal temporary tables add a second multiplier: under the MySQL 8.0 TempTable engine, an in-memory temp table grows until it hits tmp_table_size, at which point, in the documentation’s words, “MySQL automatically converts the in-memory internal temporary table to an InnoDB on-disk internal temporary table.” temptable_max_ram (default 1 GiB) caps the engine’s total RAM before it spills to memory-mapped files. The MySQL spill is the same event as the Postgres external merge, reached through a slightly different accounting path.

Note
This is also why “give the database more memory” so often goes to the wrong place. Told to add memory, teams enlarge shared_buffers (or innodb_buffer_pool_size). That is the buffer cache, fixed at server startup, and it does nothing for sorts and hashes, which allocate from a separate per-backend region. Worse, oversizing it starves the resource the database quietly depends on. PostgreSQL uses no direct IO; every page read goes through the operating system page cache, and RAM you did not hand to shared_buffers or to backends is not idle, it is that cache. The PostgreSQL wiki tuning guide puts the starting point at “1/4 of the memory in your system” and warns that “it’s unlikely you’ll find using more than 40% of RAM to work better than a smaller amount,” precisely because the OS cache needs the rest. Every memory area in this article is a facet of one accounting problem: the instance metric sums physical pages, and none of the allocations that actually break the database are visible at that resolution.

The OOM killer is where the silent spill stops being silent. Linux overcommits memory by default: it hands out address space freely on the assumption that not every process touches all of it. When resident memory across the system exceeds RAM plus swap, the kernel invokes the OOM killer, which picks a victim by oom_score and sends it SIGKILL. If the victim is a PostgreSQL backend, the damage does not stop at that one connection. The PostgreSQL documentation on managing kernel resources explains that the kernel “might terminate the PostgreSQL postmaster” outright, and even when it takes a backend instead, a backend killed by SIGKILL had no chance to release its locks or detach cleanly from shared memory. The postmaster can no longer assume shared memory is consistent, so it does the only safe thing: it terminates every other backend, and the entire instance runs crash recovery. One report query, on one connection, restarts the whole database. MySQL gets to the same place by a shorter route. mysqld is a single multi-threaded process, so the OOM killer has exactly one target; kill it and the entire server goes down at once, and on restart InnoDB runs crash recovery by replaying its redo log. Both engines end in a full restart, Postgres because the postmaster cascades the kill outward and MySQL because there was only ever one process to kill.

Warning
This is the failure mode behind a real, dated incident. In a March 2026 write-up titled “work_mem: it’s a trap!”, PostgreSQL contributor Lætitia Avrot walked through a production cluster with 2 TB of RAM that the OOM killer reaped. work_mem on that cluster was 2 MB, below the 4 MB default, not some reckless 1 GB. A single badly structured query accumulated allocations inside one ExecutorState memory context faster than anything released them. The context’s dump showed 524,059 separate chunks. That memory is not freed until the operation finishes, and the operation never finished, so it climbed until 2 TB was gone. A 2 MB setting and a 2 TB box, and the box still lost. The problem was never the size of the box.

Fixes, and what each one costs

Do not raise work_mem globally. The export needs 64 MB; the rest of the workload does not, and the global setting applies the change to every connection whether it needs it or not. Raise it where the big sort actually runs. SET work_mem = '256MB' inside the reporting session, scoped to that transaction, or ALTER ROLE analytics SET work_mem = '256MB' so the change attaches to the role the reports run as and the OLTP path keeps the small default. MySQL’s per-thread buffers are session-settable in the same way. Keep the my.cnf global values for sort_buffer_size, join_buffer_size, and tmp_table_size small, and SET SESSION sort_buffer_size = ... on the reporting connection. The cost is that this requires knowing which workload is which. It assumes the reporting queries connect as a distinguishable role or run through a distinguishable code path, and on a system where the web app and the nightly export share one database user, that separation is work you have to do first, on either engine.

A connection pooler bounds the other multiplier. The arithmetic above had 80 concurrent heavy queries because 300 app workers each held a real backend. Put PgBouncer in transaction mode in front, sized to a 40-connection pool, and the database can never run more than 40 backends no matter how many app workers exist. The multiplier is capped at 40 instead of 300. MySQL’s analogue is ProxySQL, an external connection multiplexer that fronts the server the way PgBouncer fronts Postgres, plus the thread pool plugin shipped in MySQL Enterprise Edition and in Percona Server, which caps the number of threads executing at once. Bounding concurrent threads bounds the per-thread-buffer multiplier the same way bounding backends bounds it on Postgres. Part III covered pool sizing for the CPU case, and the reasoning transfers exactly: a query that waits briefly for a pool slot and then runs is cheaper than one that starts immediately and helps exhaust memory. The cost is latency under burst, and a pool sized too small turns into its own incident when every slot is held and the queue backs up.

The fix that removes the spill instead of feeding it is fixing the query. The export sorts by created_at; an index on (account_id, created_at) lets the planner return rows already in order and the Sort node disappears from the plan entirely, no work_mem consumed because no sort happens. Keeping statistics fresh is part of the same fix: autovacuum runs ANALYZE, but a bulk load or a fast-growing table outruns it, and a manual ANALYZE (or a raised per-column target via ALTER TABLE ... ALTER COLUMN ... SET STATISTICS, or CREATE STATISTICS for correlated columns) keeps the planner’s estimate close enough to reality that it sizes the plan correctly. MySQL’s equivalent is ANALYZE TABLE for index cardinality and ANALYZE TABLE ... UPDATE HISTOGRAM ON ... for column histograms, with innodb_stats_auto_recalc governing whether InnoDB refreshes cardinality on its own. The diagnostic tell is the one Part II already named: EXPLAIN ANALYZE showing estimated rows and actual rows diverging by orders of magnitude means the planner is flying blind. A hash join’s allocation is proportional to the rows on its build side, so a predicate that filters earlier, or an index that avoids scanning rows the query then discards, shrinks the hash. This is the crossover with the rest of the series: a non-SARGable predicate that forces a scan also inflates every downstream sort and hash that scan feeds, and a covering index that quietly stopped covering when a column joined the SELECT list adds heap fetches that widen the rows a sort has to buffer. Tuning the query to be smaller per call is the per-operation answer; Part V takes the IO side of it further.

Leave shared_buffers near the conventional fraction and do not raise it to “use the memory.” The page cache needs that headroom, and on a managed service the default is usually already in the sane range. The kernel side is worth one deliberate pass: vm.overcommit_memory controls how freely Linux hands out address space, and setting the postmaster’s oom_score_adj lower than its backends’ (PostgreSQL ships PG_OOM_ADJUST_FILE and PG_OOM_ADJUST_VALUE for exactly this) means that when the OOM killer does fire, it reaps a single backend and not the supervisor. That converts a full crash-recovery restart into one dropped connection. It does not fix the multiplier; it makes the multiplier’s worst day cheaper. vm.overcommit_memory is an OS-level setting that applies to both engines, but the per-process oom_score_adj trick has nothing to work with on MySQL. There is no supervisor and worker split, so one mysqld process means nothing to spare. For MySQL the levers are the system-wide overcommit setting and the discipline of not over-provisioning the per-thread buffers in the first place.

When more RAM is the honest answer

The thesis has real exceptions, and a staff engineer wants the boundary, not the slogan.

A genuinely large working set is the first one. If the buffer-cache hit ratio is low and the wait events are dominated by IO:DataFileRead, the database is going to disk because the data it needs does not fit in RAM, and that is a capacity problem that more memory honestly solves. The tell is in the waits, not the memory graph: steady IO waits on reads, a hit ratio that has been falling for weeks. This shades into Part V’s territory, where the line between “needs more RAM for cache” and “needs more IOPS” gets drawn properly.

A correctly isolated analytical workload is the second. A reporting role that runs deliberate large sorts, on its own connection budget, behind a pooler that caps its concurrency, genuinely benefits from a high work_mem for that role. Raising it there is not the bug this article describes. The bug is raising it globally and arming it across 300 OLTP connections. Scoped to a role that runs ten concurrent queries at most, a large work_mem is a correct decision.

And plainly low concurrency. A database with a 20-connection pool and a workload that never runs more than a handful of heavy queries at once has a small multiplier, and raising work_mem globally on that system is safe because work_mem × 20 is a number the box can hold with room to spare. The multiplier is only dangerous when it is large. Measure it before assuming it is.

The number that isn’t on the graph

The memory graph on the instance dashboard is an honest number. It reports resident physical pages, sampled every minute, and it is the wrong instrument for this failure for the same reason the slow-query log was the wrong instrument in Part III. The slow-query log filters for queries that are individually expensive and misses the cheap query run a million times. The memory graph reports memory that is allocated right now and misses the allocation that is small per operation, multiplied by plan nodes, by the hash multiplier, by parallel workers, and by concurrent connections, and exists only in the instant all of those line up. The export sort that spilled to a 24 MB temp file and the 2 TB cluster the OOM killer reaped are the same failure at two scales: a per-operation cost the instance metric was never built to see. Resize the box and you move the ceiling that cost climbs toward. You do not change the cost, and the cost is set by the query and the connection count, the same two numbers every part of this series keeps coming back to.

SELECT insights FROM experience WHERE downtime = 0; -- Ruslan Tolkachev