It's Almost Always the Queries, Part II: Troubleshooting Steps

TL;DR

Database troubleshooting is a learnable skill with a repeatable sequence: observe what’s happening now, categorize the wait, narrow to the specific cause, then act. The sequence matters more than the tools. This article walks through it for engineers who don’t do this often.

Your APM lights up. Endpoint latency on /api/checkout has tripled in the last three minutes. The graph shows a wall of slow requests, no deploy in the window, no traffic spike. Something changed in the database layer. You have maybe fifteen minutes before someone senior asks what’s happening. What do you actually do?

If your instinct is to paste the alert into an LLM and ask what to do, pause. The model will give you something that looks like an answer. It might suggest killing the longest-running query, or restarting the connection pool, or adding an index. It pattern-matches on symptoms (duration, state labels, error messages) without access to the causal structure underneath. If you don’t understand why it’s suggesting what it’s suggesting, you can’t tell when it’s wrong. This applies even if you’re running an agent with MCP access to your database (hopefully read-only). The agent can query pg_stat_activity faster than you can type it, but if you don’t understand what the output means and can’t evaluate whether the agent’s next step is appropriate, you’ve handed control of a production incident to something that can’t distinguish a victim from a cause. When it’s wrong during a live incident, you make things worse. This article builds the mental model that lets you troubleshoot yourself. Use LLMs to learn these concepts on your own time. Don’t rely on them at 3am.

The sequence below is designed to give you understanding before you reach the “act” step.

If you have SQL access to the primary

This is the fuller diagnostic path. You can query the system tables directly. What follows assumes you can connect to the primary (or a replica that exposes these views). System tables show what’s happening right now; dashboards show what happened over the past hour. Both have a place, and Step 7 covers what to look for in dashboards. Learn whatever monitoring you have before you need it. Figuring out which tab shows wait events at 3am is wasted time.

The first thing you want to see is the process list, filtered to active queries and ordered by time. On PostgreSQL that’s pg_stat_activity. On MySQL that’s SHOW PROCESSLIST or, better, performance_schema.processlist.

Before you run anything, protect your own session:

1
2
3
4
5
-- PostgreSQL
SET LOCAL statement_timeout = '5s';

-- MySQL (per-query)
SELECT /*+ MAX_EXECUTION_TIME(5000) */ ...

Why: if the database is under heavy pressure, your diagnostic query competes for the same resources. A five-second timeout means your debugging doesn’t pile onto the problem. If your diagnostic can’t finish in five seconds, that itself tells you something (extreme contention, buffer pressure, WAL pressure).

Now pull the active sessions with the columns that actually help you categorize:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
-- PostgreSQL
SELECT
  pid,
  state,
  wait_event_type,
  wait_event,
  now() - xact_start AS xact_duration,
  now() - query_start AS query_duration,
  pg_blocking_pids(pid) AS blocked_by,
  LEFT(query, 100) AS query_snippet
FROM pg_stat_activity
WHERE state != 'idle'
  AND pid != pg_backend_pid()
ORDER BY xact_start NULLS LAST;

Most diagnostic snippets you’ll find online show pid, query, state, and duration. They skip two columns that matter: wait_event_type and wait_event. These tell you why a query is taking long, not just that it’s taking long.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
-- MySQL (use performance_schema, not INFORMATION_SCHEMA)
SELECT
  t.PROCESSLIST_ID AS pid,
  t.PROCESSLIST_STATE AS state,
  t.PROCESSLIST_TIME AS duration_sec,
  w.BLOCKING_THREAD_ID AS blocked_by_thread,
  LEFT(t.PROCESSLIST_INFO, 100) AS query_snippet
FROM performance_schema.threads t
LEFT JOIN performance_schema.data_lock_waits w
  ON t.THREAD_ID = w.REQUESTING_THREAD_ID
WHERE t.PROCESSLIST_COMMAND != 'Sleep'
  AND t.TYPE = 'FOREGROUND'
ORDER BY t.PROCESSLIST_TIME DESC;

Warning

On MySQL, use performance_schema.threads or performance_schema.processlist (8.0.22+), not INFORMATION_SCHEMA.PROCESSLIST. MySQL Bug #94077 (January 2019) documented a 70% performance drop from polling INFORMATION_SCHEMA.PROCESSLIST under load. Bug #100049 (June 2020) showed the same query causing pending queries to pile up until the server became unresponsive. Both trace to a mutex held during execution. The diagnostic query you run during an incident should not itself become part of the incident.

Note

Worth asking your DBA or platform team to wrap these queries into views (active_queries, blocking_chains, long_transactions) so that during an incident you’re running SELECT * FROM active_queries instead of assembling joins from memory. You don’t want to be copy-pasting SQL from a blog post at 3am.

Step 3: Look for obvious offenders

Start with the longest-running queries in the output. Some problems are visible from the query snippet alone. A SELECT * with no WHERE clause on a large table, a function wrapping a column in the predicate (non-SARGable), a COUNT(*) over millions of rows, an N+1 pattern showing up as dozens of identical queries with different IDs. If you recognize the shape, you already know what to fix.

Also look for DDL in the list. An ALTER TABLE, CREATE INDEX, or DROP INDEX that’s been running longer than you’d expect is usually not doing the work it looks like it’s doing. It’s waiting on a lock. On MySQL this shows up as a metadata lock (MDL): the DDL waits for every transaction still holding the table open to commit or roll back, and meanwhile every new query against the table queues behind the DDL. On PostgreSQL the equivalent is an ACCESS EXCLUSIVE lock at the relation level, with the same cascade: the DDL waits on active transactions, and everything else waits on the DDL.

The non-obvious version is when no heavy query is running but the server is wedged anyway. The trigger is usually something innocuous: a connection pooler holding a session ‘idle in transaction’, an analytics job that opened a transaction and never closed it, a long SELECT still holding its ACCESS SHARE lock, or an autovacuum touching the same table. The DDL blocks on whichever of those it is, then every new query against the table queues behind the DDL. The process list shows a wall of waiting sessions against one table, the DDL at the head of the queue, and the actual root cause somewhere further down (or in a connection that doesn’t look problematic at all).

If nothing jumps out from the query text alone, move to Step 4 and Step 5 to dig deeper into the suspect query. If nothing looks slow at all, skip to Step 6.

Step 4: Check the schema

Once you have a suspect query, look at the table it’s hitting. Pull the table definition (\d tablename in psql, SHOW CREATE TABLE tablename in MySQL) and compare it against what the query needs.

Walk the clauses one at a time. WHERE columns need indexes the optimizer can actually use, and ‘usable’ depends on the predicate shape: equality on a high-cardinality column is the straightforward case, range predicates (>, <, BETWEEN) work but constrain what can follow them in a composite, and a large IN (...) list may flip the planner to a sequential scan even when an index exists. JOIN columns need an index on the inner side of the join (the side being looked up once per outer row); in PostgreSQL, foreign-key columns are not indexed automatically, which is a frequent cause of joins that ran fine at low volume and fell over at scale. ORDER BY can sometimes use an index to skip the sort entirely, but only when the index’s leading columns line up with the ORDER BY columns. GROUP BY is the same shape: an index on the grouping columns lets the planner stream the aggregation instead of building a hash, which on a multi-million-row table can be the difference between a sub-second query and one that exhausts work_mem and spills to disk.

A missing index on a high-cardinality filter column is the single most common cause of queries that worked fine at low volume and fell over at scale. Composite index column order is the runner-up. An index on (status, created_at) serves WHERE status = 'pending' ORDER BY created_at. An index on (created_at, status) does not, even though it contains the same columns. The general rule is equality columns first, then the range column, then columns used only for sort.

Also check for covering index gaps: an index that covered the query last month might have stopped covering it after a column was added to the SELECT list, forcing a heap lookup per row where there used to be an index-only scan.

Step 5: Read the execution plan

If the schema looks right and you still can’t explain the behavior, ask the database what it’s actually doing. EXPLAIN (PostgreSQL) or EXPLAIN (MySQL) shows the planner’s chosen strategy without executing the query. EXPLAIN ANALYZE executes it and shows actual row counts alongside the estimates. Reading these outputs is the most important skill in query troubleshooting. Every claim about what a query ‘should’ do is wrong until the plan confirms it; the optimizer might pick a different index than you expect, fall back to a sequential scan because of stale statistics, or choose a join order you’d never write by hand. The plan is the ground truth.

Run this on a replica if you have one. The plan will be the same (assuming similar data and stats), and you avoid adding load to a primary that’s already under pressure. If you don’t have a replica, plain EXPLAIN (without ANALYZE) gives you the plan without executing the query. It’s an estimate, not a measurement, but it’s often enough to spot the problem.

What to look for in the output: sequential scans on large tables (the planner couldn’t find a usable index), rows estimated vs. rows actual diverging by orders of magnitude (stale statistics or a bad cardinality guess), reading a million rows to return ten (missing or ignored index), nested loops where each iteration does its own index lookup against a large table (the N+1 shape at the engine level).

Warning

EXPLAIN ANALYZE executes the query fully. On a SELECT that takes 30 seconds in production, it will take 30 seconds when you run it too. If the query modifies data (INSERT, UPDATE, DELETE), wrap it in a transaction and roll back: BEGIN; EXPLAIN ANALYZE UPDATE ...; ROLLBACK;. On a system already under pressure, be deliberate about what you choose to execute.

Step 6: If nothing looks slow, check volume

Sometimes every query in the process list finishes in a few milliseconds and nothing looks wrong individually. The problem isn’t one slow query. It’s thousands of fast ones hitting the same resources concurrently.

This is where digest-level views help. pg_stat_statements (PostgreSQL) and performance_schema.events_statements_summary_by_digest (MySQL) aggregate queries by their normalized pattern and track call counts. Sort by total_exec_time (PostgreSQL) or SUM_TIMER_WAIT (MySQL), not by mean time. A query that averages 5 ms but fires 10,000 times per second consumes 50 CPU-seconds per wall-second. It will never show up in a “longest running” list, but it dominates the workload.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-- PostgreSQL
SELECT
  LEFT(query, 80) AS query_snippet,
  calls,
  total_exec_time::int AS total_ms,
  (total_exec_time / calls)::int AS avg_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- MySQL (join current statements to their digest summary)
SELECT
  LEFT(d.DIGEST_TEXT, 80) AS query_snippet,
  d.COUNT_STAR AS calls,
  ROUND(d.SUM_TIMER_WAIT / 1e9) AS total_ms,
  ROUND(d.AVG_TIMER_WAIT / 1e9) AS avg_ms
FROM performance_schema.events_statements_current esc
JOIN performance_schema.events_statements_summary_by_digest d
  ON esc.DIGEST = d.DIGEST
JOIN performance_schema.threads t
  ON esc.THREAD_ID = t.THREAD_ID
WHERE t.PROCESSLIST_COMMAND != 'Sleep'
ORDER BY d.SUM_TIMER_WAIT DESC
LIMIT 10;

This is the case from Part I, where the COUNT(*) averaged 40 ms and never triggered the slow-query log. Part III walks the CPU-bound version of this in detail.

Step 7: Read the dashboards

Dashboards see what the system tables don’t: history. By the time you query pg_stat_activity, you’ve lost the picture of what was happening five minutes ago. Whatever you have (Datadog DBM, Aurora and RDS Performance Insights, pganalyze, Grafana with the right exporters), this is where you bring it in. If you have no SQL access at all, the dashboard is your entire diagnostic toolkit; everything you can learn, you’ll learn here.

Start with timing. A graph with a hard step-change at 14:23 is a different problem from one that climbed slowly over the past hour. The hard step points at a discrete event: a deploy, a config change, a single long-running query that started blocking others. The slow climb points at a workload trend or a plan regression that compounded as the working set grew. Overlay deployment markers, autoscaling events, and scheduled job runs on the latency graph if you can. A latency jump that exactly tracks a deploy is the deploy until proven otherwise.

Active session count is the other timing graph worth pulling. A flat baseline that doubles at 14:23 points at a blocking event. A slow climb over an hour points at workload growth or queue buildup. Session count is harder to lie to than latency: a single slow query can hide in a p99 average, but it can’t hide in the count of sessions waiting on it.

Which resource is pinned tells you which dimension is the cause and which is following. CPU at 100% with IOPS low and stable is a CPU-bound workload, often a missing index causing repeated sorting or hashing, or a regex or JSON predicate doing per-row work the planner can’t push down. IOPS pinned with CPU low and waits on IO:DataFileRead (PostgreSQL) is a buffer-cache miss problem: the working set has outgrown RAM and every query is going to disk. RAM climbing steadily for days while the buffer cache hit rate falls at the same rate is a forecast, not an incident.

Waits are where dashboards earn their keep. Aurora Performance Insights, pganalyze, and Datadog DBM all stack active sessions by wait class over time, which is exactly the information pg_stat_activity can’t give you historically. Locks dominating means contention: a deadlock storm, or an open long-running transaction blocking everything that touches the same rows. IO dominating means buffer pressure or storage saturation. CPU dominating means the queries themselves are doing more work per call than they used to. Client reads dominating points at a slow consumer: the app isn’t reading results fast enough and sessions stack up waiting to send the next page.

WAL and checkpoint pressure are easy to miss because they don’t appear in the active query list. WAL generation rate climbing with no proportional traffic increase points at write amplification: a runaway UPDATE rewriting the same rows, a hot index getting bloated, a trigger writing more than it needs to. Checkpoint duration climbing, or checkpoint frequency increasing, means the system can’t keep pace with the write rate. On MySQL the same signal shows up as InnoDB log file utilization approaching its configured size, with checkpoint-related stalls visible in SHOW ENGINE INNODB STATUS. These often correlate with sudden IOPS spikes, because the checkpoint flush is what saturates the disk, not the workload directly.

Workload composition completes the picture. Most monitoring breaks queries-per-second down by type. A 10x spike in write QPS with read QPS flat is a different incident from a 10x read spike. Within reads, the ratio of index scans to sequential scans is a leading indicator of plan regression: in PostgreSQL, pg_stat_user_tables.seq_scan climbing on a table that previously got index scans; in MySQL, Handler_read_rnd_next rising relative to Handler_read_key. A jump in read-ahead activity (InnoDB’s Innodb_buffer_pool_read_ahead, Aurora’s read-IOPS metrics) often signals large scans that weren’t there before: a new query, or an old query whose plan changed.

You’re not trying to find the exact query from the dashboard. The goal is to narrow the category before going back to the system tables: read-side vs write-side, query-level vs workload-level, lock contention vs buffer pressure vs CPU saturation. That tells you where to focus in the earlier steps, and if you have no SQL access, it’s enough to escalate with specifics. “40 sessions in Lock wait starting at 2:43, no deploy in the window, getting worse” is something whoever owns the database can act on. “It’s slow” is not.

Before the next incident

Everything above assumes you can run the queries when you need them, that you know what their output means, and that you’ve seen what ’normal’ looks like so the abnormal stands out. None of that is true at 3am unless you’ve done the work before then.

Know your access. Can you connect to the primary, or only to a replica? Read-only or with permission to call pg_terminate_backend? Does your role have access to pg_stat_statements (which requires pg_read_all_stats on PostgreSQL 13+) and to the MySQL performance_schema tables? On managed services, some catalog views are hidden behind parameter groups that take a restart to enable. The time to discover you don’t have pg_stat_statements is not the moment you need it.

Know your team’s views. Ops teams often wrap the queries from this article into named views: active_queries, blocking_chains, long_transactions, top-N digest views. Find out whether yours exist. If they do, learn the column names and what they filter out (some hide replication workers, autovacuum, or your own session, which can mislead during an incident if you don’t know). If they don’t, ask your DBA whether they’d take a pull request, or write them yourself. A view you can SELECT * FROM is faster to run and harder to typo at 3am than a 15-line join assembled from memory.

Run the queries on a quiet system. Pull pg_stat_activity against your dev database while nothing stressful is happening. Note what idle connections look like (the pool’s keep-alives, your IDE’s introspection queries, your monitoring’s polling traffic). Pull a pg_stat_statements snapshot and read through the top 20. The point is to know what your environment looks like at rest, so during an incident the abnormal jumps out instead of getting lost in baseline noise.

Read the column documentation once. wait_event_type and wait_event in PostgreSQL have a documented enumeration with dozens of values. MySQL’s performance_schema instruments follow a naming convention you can learn in fifteen minutes. Knowing that IO:DataFileRead means a buffer cache miss and Lock:transactionid means waiting on another transaction’s row lock turns opaque output into a diagnosis.

Practice reading EXPLAIN output. Pull a slow-ish query from your own codebase and run EXPLAIN (ANALYZE, BUFFERS) on it in a quiet environment. The ‘what to look for’ in Step 5 is more useful when you’ve spent thirty minutes staring at a plan in low-stakes context first. Visualizers like explain.depesz.com and explain.dalibo.com help with PostgreSQL output, and MySQL’s tree-format EXPLAIN is more readable than the default table format. But none of those substitutes for knowing what each node type means. Read the docs for sequential scan, index scan, bitmap heap scan, nested loop, hash join, and merge join once.

What this article doesn't cover

None of the seven steps is comprehensive. The full diagnostic surface (lock-graph reconstruction, planner cost-model tuning, statistics histograms, MVCC and vacuum internals, the catalog views nobody talks about) takes years to learn and a book to lay out. What’s here is less than the minimum a DBA assumes you already know - enough to triage with confidence and escalate with specifics, not enough to skip the call to whoever owns the database.