Featured image of post It's Almost Always the Queries, Part III: When the CPU Is Pegged

It's Almost Always the Queries, Part III: When the CPU Is Pegged

Forty analysts with a dashboard auto-refreshing, twelve microservices each polling a status badge, a connection pool sized four times past the core count. None of it trips the slow-query log, and all of it lands on the same cores.

TL;DR
A relational database pinned at 100% CPU is almost never running one expensive query. It’s running a cheap one too many times. The slow-query log and mean-time sorting both look right past it; total_exec_time is the only view that finds it.

A client portal has a status dropdown at the top of the orders page. It shows a count next to “Open”: SELECT COUNT(*) FROM orders WHERE status = 'open' AND account_id = $1. For any given account that’s around 200 rows. The query runs in 0.4 ms. It has run in 0.4 ms for two years.

Then marketing buys a Super Bowl ad. Thirty seconds of airtime, a short URL, and for the next twenty minutes the portal takes the kind of traffic it normally sees in a quarter. Every visitor lands on the orders page. Every page render fires that COUNT(*). The primary’s CPU graph goes from a comfortable 35% to a flat 100% ceiling and stays there. Checkout latency triples. The on-call engineer pulls up the slow-query log to find the offending statement and the log is empty. Nothing crossed the 100 ms threshold. The slowest query in pg_stat_activity right now is 12 ms. Sorted by mean execution time, the dropdown count doesn’t appear until page four.

It is, by a wide margin, the cheapest query in the system. It is also the entire problem.

More CPU rents the bug

Resize the instance. Double the vCPUs, the graph drops from 100% to 55%, the incident closes. This works, and during a live traffic spike with revenue on the line it is often the correct first move. Part I called this renting the bug, and the CPU case is the cleanest example of why the metaphor holds.

The cost of that COUNT(*) is linear in traffic. A box twice the size moves the ceiling, it does not change the slope. The dropdown still fires once per page render, the render count still tracks visitor count, and visitor count for a business that just discovered TV advertising only goes up. The next campaign, or organic growth over two quarters, walks the bigger box back to the same 100%. Each round of resizing buys time proportional to the headroom purchased, and the bill recurs.

The other reflex is more replicas. For a read-heavy workload that genuinely is read-heavy, spreading reads across replicas is sound. It does not help here for a reason worth being precise about: the COUNT(*) is not slow because the primary is contended. It is slow-in-aggregate because it does real work every single call, and that work executes on whatever node serves the query. Move it to a replica and the replica’s CPU pegs instead. You have not removed the work. You have bought another machine to do it.

Why a 0.4 ms query saturates a core

The arithmetic is the whole mechanism. A query that averages 0.4 ms and fires 50,000 times a minute consumes 50,000 × 0.0004 = 20 CPU-seconds of work every 60 seconds of wall-clock time. That’s one core, a third occupied, by one statement. Push the campaign traffic to 150,000 calls a minute and that statement alone wants a full core. Add the other queries the orders page fires (the order list, the account lookup, the session check) and a handful of cores disappear into a workload where no individual query is doing anything you’d call slow.

This is why the slow-query log is the wrong instrument. A log thresholded at 100 ms is a filter for queries that are individually expensive. The CPU-bound failure mode is a population of queries that are individually trivial and collectively enormous. The log is working exactly as designed; it is designed to miss this. Mean execution time has the same blind spot. The dropdown count’s mean is 0.4 ms and will stay 0.4 ms while it burns four cores, because the mean says nothing about how often the query runs.

The view that sees it is total execution time. pg_stat_statements ordered by total_exec_time, or MySQL’s events_statements_summary_by_digest ordered by SUM_TIMER_WAIT, multiplies per-call cost by call count, which is the number that actually maps to CPU consumed. Sort by that column and the dropdown COUNT(*) is on the first row, with a call count an order of magnitude above anything else in the list. Part II’s Step 6 is the procedure for getting there; this is the failure mode it was written for.

Note
pg_stat_statements aggregates since the last pg_stat_statements_reset() or server start, so the top of the list reflects history, not the current minute. During an incident, reset it and wait sixty seconds, or compare two snapshots a minute apart. A query that dominates a clean sixty-second window is the one burning CPU right now, not the one that happened to run a lot last Tuesday.

COUNT(*) has no shortcut, and that’s structural

The reason this particular query does real work every call, rather than returning a cached number, is MVCC. Under multi-version concurrency control, two transactions running at the same instant can correctly see different row counts for the same table, because each sees the snapshot consistent with its own start. There is no single true count the database could cache and hand back. The PostgreSQL wiki’s “Slow Counting” page states it plainly: PostgreSQL “must walk through all rows to determine visibility,” which “normally results in a sequential scan reading information about every row in the table.” Citus’s 2016 write-up on counting performance puts the same point in one sentence: “There is no single universal row count that the database could cache, so it must scan through all rows counting how many are visible.”

InnoDB works the same way for the same reason. The MySQL documentation notes that InnoDB does not keep an internal stored row count, because a single counter cannot be correct for all transactions at once, and processes COUNT(*) by traversing the smallest available index. This is the detail behind a stubborn piece of stale advice. MyISAM, the old default engine, did keep an exact row count in table metadata, so SELECT COUNT(*) against a MyISAM table really was a constant-time metadata read. Advice written in that era (“COUNT(*) is free, don’t worry about it”) survived the engine that made it true. On InnoDB it is wrong.

An index helps, with conditions. A COUNT(*) filtered by an indexed column scans the index instead of the heap, and PostgreSQL’s index-only scan can satisfy a count from the index alone, but only for the pages the visibility map marks all-visible. The visibility map is maintained by VACUUM. On a table with steady write traffic and autovacuum falling behind, a growing fraction of pages are not marked all-visible, the index-only scan falls back to heap fetches for those pages, and the count gets slower precisely when the table is busiest. The shortcut exists. It is conditional on vacuum keeping pace, and a write-heavy table under a traffic spike is the exact case where vacuum is least likely to be winning.

The CPU-bound family

The dropdown COUNT(*) is the canonical case because it is so cheap per call that it defeats every individually-focused diagnostic. The same shape (cheap-or-medium per call, ridden to 100% by call volume or concurrency) shows up in four other forms worth recognizing on sight.

The first is aggregation. GROUP BY, SUM, AVG, DISTINCT, and the dashboard rollups built on them spend CPU on hashing and sorting, and the work is proportional to rows scanned, not rows returned. A “revenue by region this month” tile that scans 4 million order rows to return 6 numbers does 4 million rows of work every time someone loads the dashboard. One analyst with the dashboard open and a 30-second auto-refresh is 2,880 full rollups a day. A team of forty analysts, each with it open, is a standing CPU load that has nothing to do with how many people are actually looking. The query is not slow. It is medium, and it runs constantly.

The second is parallel-worker starvation. PostgreSQL runs large scans and aggregates across parallel workers drawn from a shared, server-wide pool capped by max_parallel_workers. The PostgreSQL documentation on parallel query is explicit that workers come from one pool and “the requested number of workers may not actually be available at run time.” A few heavy analytics queries, each fanning out to max_parallel_workers_per_gather workers, can drain that pool. Everything else then runs with fewer workers than the planner costed for, or serially. The symptom is strange: CPU pegged, yet many sessions appear to be waiting rather than working, because the plan they got is not the plan they were costed for.

The third is the connection pool sized past the core count. A database with 16 cores can do at most 16 things at once. Point 400 active connections at it and the operating system time-slices 400 runnable processes across 16 cores, and an increasing share of every core’s time goes to context switching rather than query execution. The PgBouncer documentation and the conventional sizing guidance both land near the same place: a pool sized close to the core count, often (cores × 2), beats a much larger pool. The HikariCP “About Pool Sizing” page makes the underlying point bluntly: running two queries sequentially is always faster than time-slicing them across one core, and its Oracle benchmark cut response times from roughly 100 ms to 2 ms by shrinking a pool from 2,048 connections to 96. CPU reads as 100%, but a measurable fraction of it is the scheduler shuffling processes, not the database answering questions.

The fourth is fan-out from a distributed service mesh. Picture twelve microservices, each running six instances, and every instance polling the database on its own timer for state it treats as live: a config row it reloads every few seconds, a routing table it re-reads on a schedule. Each of the seventy-two issues a modest, reasonable number of queries, and none of it looks alarming on that service’s own dashboard. The database underneath sees the sum, a baseline load that no team owns and no team’s monitoring displays. What makes this its own shape, rather than the volume problem restated, is that every caller is asking the identical question and getting back the identical answer. The dropdown COUNT was parameterized per account, and every execution did different work for a different row. Here, seventy-two instances poll for one config row that reads the same for all of them, because the database is the single place every service agrees holds the truth. The work is real and redundant: seventy-one of those reads per tick exist only because nothing in front of the database kept the answer the seventy-second already fetched.

Warning
The unread-count badge is the fan-out problem hiding in your own frontend. A “you have N notifications” badge in the global nav re-runs its COUNT on every page load of every authenticated user. It is not one feature’s query, it is a tax on every route in the application. An ORM N+1 in a list view is the same pattern from the ORM coupling: one page render silently expands into one query per row, and at list-page volume that is a CPU load with no single slow statement to point at.

There is a fifth shape that is per-call expensive rather than per-call cheap, and it belongs here because it pins CPU the same way: predicates the planner cannot turn into an index seek. A LIKE '%term%' with a leading wildcard, a regex match, JSONB extraction in the WHERE clause, a function wrapped around a column, an implicit type cast because the column is bigint and the parameter arrived as text. Each forces the engine to evaluate an expression on every candidate row, which is CPU work that no amount of indexing removes until the predicate itself is rewritten. Non-SARGable predicates covers the rewrite. The reason it shares this article is the diagnostic: a non-SARGable filter under concurrency reads as CPU saturation, and total_exec_time is again where it surfaces.

Fixes, and what each one costs

For the dropdown COUNT(*), the first question is whether the number needs to be exact. Often it does not. A status badge that says “Open: 204” is not measurably more useful than one that says “Open: ~200,” and an estimate is close to free. PostgreSQL’s planner statistics already hold a row estimate in pg_class.reltuples; the PostgreSQL wiki’s “Count estimate” page gives the query and the caveat, that reltuples is maintained by VACUUM and ANALYZE and is only as fresh as the last run. For a filtered count, parsing the row estimate out of EXPLAIN output gets you a per-predicate estimate. The trade-off is accuracy: an estimate can be off by a few percent, and it is wrong to use one where the count drives a financial total or a correctness check.

When the number must be exact, the choice is between caching it and maintaining it. A cached count (in Redis, or a materialized view refreshed on a schedule) turns thousands of COUNT(*) executions into one, at the cost of staleness equal to the refresh interval. A counter table or a counter column, incremented and decremented by trigger or by application code, keeps the count exact and reads in constant time, and it moves the cost to writes. Every insert and delete now also writes the counter row, and if that counter is global, every writer contends on one row. That contention is its own CPU and lock problem, sometimes a worse one than the count you started with. Per-account or per-shard counters spread the contention; a global “total orders” counter concentrates it. The pragmatic middle for a UI badge is the bounded count: SELECT COUNT(*) FROM (SELECT 1 FROM orders WHERE status='open' AND account_id=$1 LIMIT 100) t, which stops scanning at 100 and lets the interface render “99+”. The product question of whether anyone needs the exact number above 99 is usually answered “no” the moment you ask it.

Aggregation rollups want to stop running at read time, and the obvious instrument is a materialized view. On an OLTP primary it is more dependency than it looks. REFRESH MATERIALIZED VIEW still runs the full scan; it has only moved the work off the viewers and onto a schedule, on the same cores you are trying to protect, in periodic spikes rather than a steady drip. Plain REFRESH holds an ACCESS EXCLUSIVE lock that blocks reads of the view until it finishes; REFRESH ... CONCURRENTLY trades that lock for a mandatory unique index and a slower full recompute. Add a scheduler to run it and a staleness window to reason about, and a feature that looked like one line of DDL is a small system to operate. The version that actually removes the scan is a summary table the write path maintains: each order insert also bumps the per-region, per-day total, so the dashboard reads a few pre-computed rows and the 4-million-row scan never runs. The cost shifts onto writes, a couple of extra row updates per transaction, and the numbers stay current with no refresh job at all. Where the data tolerates lag, the other move is to get the aggregation off the primary: point the dashboard’s GROUP BY at a read replica so the scan burns the replica’s cores, or feed a separate analytics database and let the heavy rollups live there. A revenue tile that updates every five minutes is fine; an inventory count a customer sees at checkout usually is not.

Parallel-worker starvation is a sizing problem. Cap max_parallel_workers_per_gather so a single analytics query cannot drain the pool, and size max_parallel_workers against the cores you can spare after the OLTP workload has what it needs. The connection-pool case is the same discipline: size the pooler near the core count rather than near peak concurrency, and let connections queue briefly in the pooler instead of oversubscribing the scheduler. Both fixes feel like throttling, and they are. A query that waits 5 ms for a pool slot and then runs at full speed beats one that starts immediately and fights 399 others for a core.

Fan-out has two fixes, and which applies depends on whether the callers want the same answer. When they do — a config row, a feature-flag set that every instance reads identically — a cache in front of the database is the direct fix. One instance’s read populates a shared cache, the other seventy-one read from the cache, and the database serves the query once per TTL instead of once per instance per tick. When the callers genuinely need different answers, no single query is wrong and the lever is governance: a platform-level view of aggregate QPS broken down by calling service, and a per-service query budget treated as a real constraint. Health checks can poll every 30 seconds instead of every 2. Polling can become a push. None of that happens without someone holding the number that no individual team’s dashboard shows.

Note
Across every fix here, the move is the same: do the work fewer times. Estimate instead of count, cache instead of recompute, refresh on a schedule instead of per-view, poll less often, queue instead of oversubscribe. Tuning a query to be faster per call is the Part IV and Part V conversation. The CPU-bound failure is a volume problem, and volume is what these fixes attack.

When it really is the load

Sometimes the workload genuinely needs the cores. A reporting database that runs heavy analytical queries, an ETL window, a system doing real per-row computation that no rollup can precompute because the parameters change every call, will sit at high CPU because that is the job. The tell is in total_exec_time: when the top of the list is a spread of genuinely heavy statements rather than one trivial query with an enormous call count, you are looking at a workload that wants capacity, and adding cores is the honest answer. The diagnostic distinguishes the two cases; the dropdown COUNT(*) at the top of the list means volume, a flat distribution of expensive queries means load.

And the one-day spike can be a legitimate reason to rent. A Super Bowl ad, a product launch, a Black Friday window: a known, bounded surge where the cost of engineering a permanent fix before the date exceeds the cost of a bigger box for 48 hours. Scale up Friday, scale down Monday, fix the COUNT(*) in the next sprint with the incident graph as the justification. That is renting the bug on purpose, with a return date. The failure is renting it by reflex, with no return date and no ticket, so the bigger box becomes the permanent baseline and the next spike starts the cycle again.

The dropdown count on the orders page was always going to break. The Super Bowl ad only decided the date. A query that does work proportional to traffic, on a system whose traffic only grows, has a ceiling it will reach; the box size sets the date, and the query sets the slope. Sort by total_exec_time before you size the instance, and you find out which one you’re actually fighting.

SELECT insights FROM experience WHERE downtime = 0; -- Ruslan Tolkachev