
Your Alert Triage Doesn't Need an Agent

A team's AI ops agent has access to logs, metrics, deploys, traces. Six months in, MTTR is unchanged and “responder trusted the summary” shows up in three of the last ten post-mortems. More access did not make the summaries more correct.

TL;DR
Autonomous agents are the wrong abstraction for alert triage. A scripted playbook of queries curated by a reliability engineer, plus one LLM call to summarize the structured output, gives the responder a triage hint with the raw data attached. The summary saves time on the easy pages; the raw data carries them through the cases the LLM gets wrong.

3:14am page. p99 latency on /api/orders checkout past the 1500ms SLO for six straight minutes. The on-call assistant’s summary at the top of the alert reads “elevated checkout latency correlated with deploy of order-service r8472, 14 minutes ago. Recommend rollback.” The responder pages the deploy author and starts the rollback. Latency stays past the SLO. The actual cause is a worker that hung yesterday holding an open transaction, idle for 18 hours, blocking vacuum the entire time. Bloat on the orders table is what made the checkout query slow enough to finally cross the SLO during normal early-morning traffic. The idle session has been visible in pg_stat_activity the whole time, and the assistant had access to pg_stat_activity. It picked the deploy as the cause because the deploy was the most legible recent change in the data it pulled. The responder did not read pg_stat_activity because the summary said roll back the deploy. Twenty-three minutes page-to-fix, twenty on the wrong path.

The senior reader’s first response is “give the agent better access. Wire in pg_locks, slow-query log, replication slot state, recent lag, the works.” The agent in the scenario already had pg_stat_activity. Missing data was never the problem. The agent had the data and picked the most legible recent change as the cause, because that is what the model defaults to on partial structured data. Adding more sources gives it a longer list to pattern-match against. The summary that comes back is more confident without being more correct, and a responder defers more readily to a confident summary. That is how the failure mode shifts from “agent missed the data” (visible, fixable in tooling) to “agent had the data and misattributed cause” (invisible, the post-mortem has to reconstruct what the responder would have seen without the summary).

What the agent abstraction blends together

“Agent” in the current vendor pitch means autonomy in tool selection: the LLM reads the alert, decides which MCP servers to consult, what queries to run, in what order, what to do with the results. That bundle does two jobs at once. Summarizing structured data into prose is the job where an LLM hallucinates least, though it still fabricates values and misreads fields on the way. Deciding which queries to run for a given symptom is detective work that depends on a system model the LLM does not have. The reliability engineer has the model. They have been on call. They have read every post-mortem. They know that a replication-lag alert wants the slot state, the publisher’s WAL position, the largest active transaction’s age, and the last three deploys, in that order, every time. The LLM does not know that. It can pattern-match toward it on familiar shapes and miss it on the rest.

The right design splits the two jobs. The reliability engineer curates the playbook: for this alert ID, run these queries against these systems, with this scope. A script runs the playbook on every fire of that alert. One LLM call at the end takes the structured output and writes a paragraph: what is affected, what is notable in the data, what jumped out. No tool selection by the model. No causal claims unsupported by the queries the playbook ran. The model is doing the safest thing it can do.
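To make the split concrete, here is a minimal sketch of a playbook as code in version control, assuming a `run_query(query) -> list[dict]` helper the team already has; the type names, the alert ID, and the `deploys` table are illustrative, not a real library.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Query:
    name: str    # key the summary prompt and the artifact refer to
    sql: str
    target: str  # which system the query is allowed to hit


@dataclass
class Playbook:
    alert_id: str
    queries: list[Query] = field(default_factory=list)


# One playbook entry, keyed by alert ID. The deploys table is a stand-in for
# whatever deploy-tracking store the team actually has.
REPLICATION_LAG = Playbook(
    alert_id="pg_replication_lag_high",
    queries=[
        Query("slot_state",
              "SELECT slot_name, active, restart_lsn FROM pg_replication_slots",
              "primary"),
        Query("oldest_transactions",
              "SELECT pid, state, now() - xact_start AS age, left(query, 80) AS query "
              "FROM pg_stat_activity WHERE xact_start IS NOT NULL "
              "ORDER BY xact_start LIMIT 5",
              "primary"),
        Query("recent_deploys",
              "SELECT service, sha, author, deployed_at FROM deploys "
              "ORDER BY deployed_at DESC LIMIT 3",
              "deploy_db"),
    ],
)


def run_playbook(playbook: Playbook, run_query) -> dict[str, list[dict]]:
    """Fetch every query once, in parallel. No model decides anything here."""
    with ThreadPoolExecutor() as pool:
        futures = {q.name: pool.submit(run_query, q) for q in playbook.queries}
        return {name: fut.result() for name, fut in futures.items()}
```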

This is not a smaller version of an agent. The autonomy in tool selection has been removed entirely. The reliability engineer chose the queries, the playbook runs them, the LLM formats the result.

In practice, no team that runs an agent trusts its autonomy unsupervised either. Tool descriptions accumulate, system prompts get tuned with hints like “for slow-query alerts, consider checking pg_stat_activity, pg_locks, and the last three deploys.” That natural-language playbook lives inside the prompt, with no guarantee the model executes it on any given run. The engineer is authoring a playbook either way. The only choice is whether the playbook lives in code that runs every time, or in a prompt the model may or may not honor on this alert.

What every round trip costs

The autonomous-agent design pays for the same data several times. Each tool call’s output goes into the input prompt of the next step, and the loop is serial: read alert, decide query, run query, read result, decide next query, run query, read result. A pg_stat_activity dump fetched at step one is in the input for steps two, three, four, and five. If the dump is 10KB and the run takes six steps to land on a summary, that is roughly 50KB of redundant input the model reads and is billed for, plus the output tokens spent emitting tool-call JSON at each step. At a page rate of a few hundred a day across a platform team, the bill compounds.

The agent also does this fresh on every alert. It carries no schema knowledge between runs the way the reliability engineer does. On a given page it queries pg_stat_activity with a column name that is wrong (renamed, deprecated, or hallucinated), reads the SQL error in the next prompt, retries with a different guess, gets it wrong again, queries information_schema to discover what columns actually exist, dumps that result into the conversation, and finally runs the query it should have run from the start. Every error and every discovery dump piles into the next prompt. Per alert. The playbook does this once when the RE writes it, in version control, against a real database.

And the loop takes time. Every step is one LLM call to decide what to query, one tool call to run the query, and another LLM call to read the result and decide what comes next. LLM calls run several seconds each. A six-step investigation, end to end, is the better part of a minute before any summary text comes back. The responder paged at 3am has opened three dashboards manually by then. The supposed time savings of the agent are negative against a responder who already knows where to look.

The playbook design fetches everything once, in parallel, and passes one shaped bundle into one LLM call. The shaping is where most of the token savings come from. Raw pg_stat_activity output is verbose JSON with thirty columns per row, half of them irrelevant to a triage summary. A playbook can project the four columns the prompt actually needs (pid, state, query_start, query), format them as a small table rather than nested JSON, truncate long query text, and pass a hundred bytes where the agent would have passed ten kilobytes. Page-to-summary time is the slowest single query plus one summarization call, regardless of how many queries the playbook fetches.
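A sketch of that shaping step, assuming the playbook hands back pg_stat_activity rows as dicts; the four-column projection and the 80-character truncation follow the example above, but the exact choices belong to the playbook author.

```python
def shape_pg_stat_activity(rows: list[dict], max_query_len: int = 80) -> str:
    """Project the columns a triage summary needs and render them as a small
    tab-separated table instead of nested JSON."""
    cols = ("pid", "state", "query_start", "query")
    lines = ["\t".join(cols)]
    for row in rows:
        values = [str(row.get(col, "")) for col in cols]
        values[-1] = values[-1][:max_query_len]  # truncate long query text
        lines.append("\t".join(values))
    return "\n".join(lines)
```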

The alert artifact

What the responder gets has three layers.

Tier sits at the top, set by the routing layer: prod page, non-prod channel, Jira queue. The tier picks the playbook. A P0 page activates the prod-read playbook, which can hit replicas and recent deploy state. A Jira ticket runs only the runbook-lookup playbook with no live read access. Tier-as-scope is the security half of the design and falls out for free once the playbook is the unit of action.

Raw data sits in the middle: every query the playbook ran, with its output. The pg_stat_activity snapshot. The lock graph. Replication slot state. Last five deploys with author and SHA. Slow-query log entries from the last fifteen minutes. The artifact attaches all of it because the playbook already paid the cost to fetch it. Re-running the queries from the responder’s terminal at 3am is exactly the time the design exists to save.

Summary sits on top: one paragraph from one LLM call, generated from the structured output of the playbook. “Replication lag of 47 seconds. Slot pub_orders is held with restart_lsn 18 hours stale. No recent deploys touch the publisher service. Largest active transaction is session 88234, idle in transaction for 18h2m.” That sentence is doing the job an LLM hallucinates least on: compressing structured input into readable prose, with the inputs visible to the responder one scroll below. It is not claiming the slot is the cause. The responder reads the summary, scrolls to confirm in the slot-state output, kills the session, slot drains, lag recovers.

The summary is a reading hint, not a source of truth. On the bulk of pages where it is right (bad CPU, lag, slow query, full disk), the responder saves a few minutes of dashboard-tab opening. On the cases where it is wrong, they scroll past the summary, read the raw output the playbook already gathered, and override. They never have to wait on the model to fetch anything.
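The summarization call stays as small as the description implies: one request, the shaped bundle as input, a system prompt that forbids causal claims the queries do not support. A sketch against the OpenAI chat-completions client; the model name and prompt wording are assumptions, and any single-call LLM API has the same shape.

```python
from openai import OpenAI

SYSTEM_PROMPT = (
    "You summarize alert triage data for an on-call responder. "
    "Describe only what is present in the data below. "
    "Do not claim a cause unless a query result directly shows it."
)


def summarize(alert_payload: str, shaped_bundle: str, client: OpenAI) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any summarization-capable model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Alert:\n{alert_payload}\n\nPlaybook output:\n{shaped_bundle}"},
        ],
    )
    return response.choices[0].message.content
```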

Lower tiers do not run the playbook automatically. A non-prod channel post or a Jira ticket lands with the alert payload and an “investigate” button. Most of those alerts get glanced at and dismissed: known flake, the synthetic that fires every Tuesday morning. Running a playbook and a summarization call on every one wastes tokens and clutters the channel. The button is for the alerts the responder decides to look at; pressing it runs the playbook and attaches raw data and summary the same way a P0 page would have them. P0 pages skip the button because the responder is already committed; the summary is there the moment the page opens.
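Tier-as-scope plus the investigate button reduce to a routing table and one flag, roughly like the sketch below; the tier names, playbook names, and injected `run_playbook`/`post_alert` callables are placeholders for the team's own routing layer.

```python
# Which playbook an alert is allowed to run, and whether it runs automatically
# or waits for the responder to press "investigate".
TIER_ROUTING = {
    "p0_page":         {"playbook": "prod_read",      "auto_run": True},
    "nonprod_channel": {"playbook": "runbook_lookup", "auto_run": False},
    "jira_queue":      {"playbook": "runbook_lookup", "auto_run": False},
}


def on_alert(tier: str, alert: dict, run_playbook, post_alert) -> None:
    route = TIER_ROUTING[tier]
    if route["auto_run"]:
        bundle = run_playbook(route["playbook"], alert)
        post_alert(alert, bundle=bundle)                  # P0: data attached up front
    else:
        post_alert(alert, investigate=route["playbook"])  # button runs it on demand
```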

What the post-mortem actually changes

Post-mortem deltas in this design land somewhere specific. Usually the playbook needs another query: pg_prepared_xacts was missing, or the lock graph was dumped without the waiter chain. Sometimes the prompt template needs to surface a signal the playbook already gathered but the LLM ignored. Occasionally the routing tier was wrong and the alert hit the wrong playbook entirely. All three ship as a pull request a reviewer can read.
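With playbooks as code, the pull request from that post-mortem is often a one-query addition. Reusing the hypothetical `Playbook`/`Query` types from the earlier sketch, the pg_prepared_xacts fix might be nothing more than:

```python
REPLICATION_LAG.queries.append(
    Query("prepared_xacts",
          "SELECT gid, owner, database, prepared FROM pg_prepared_xacts "
          "ORDER BY prepared LIMIT 5",
          "primary")
)
```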

The same post-mortem in an autonomous-agent setup is harder to reason about. The agent decided to run queries A, B, C this time. It might run D, E, F next time on a similar-looking alert. The prompt and the run are intertwined, and the fix is “tune the agent’s tool descriptions” with no guarantee the next run reaches for the right tool.

Where the design strains

A few real caveats, none of which the agent design solves either.

Playbook maintenance is work, but the work is the cleanest accuracy lever the engineer has. Adding a query directly improves the summary’s grounding, because the model now reads more of the data the cause lives in. Tuning an agent’s prompt does not have the same property: the model can still ignore the hint, let it lose to a conflicting instruction, or pick a different tool, and there is no deterministic check that any of those did not happen. With the playbook, the bundle either has the data or it does not. The strain is the silent failure mode on the maintenance side. When a query references a column that was renamed or a service that moved, the query returns empty, the bundle gets thinner, and the summary gets less informative without anything visibly breaking. The discipline that catches it is owned playbooks (one team, one engineer named in the file) plus a cadence: post-mortems produce playbook deltas, and a periodic review flags queries that have returned zero rows on every recent run. Without that, the playbook decays.
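The zero-rows review can itself be a small scheduled check, assuming the playbook runner keeps its recent bundles around; the function and field names below are illustrative.

```python
def stale_queries(recent_bundles: list[dict[str, list]], min_runs: int = 20) -> list[str]:
    """Return query names that came back empty on every recent run of a playbook,
    a likely sign the query references a renamed column or a moved service."""
    if len(recent_bundles) < min_runs:
        return []
    query_names = recent_bundles[0].keys()
    return [
        name for name in query_names
        if all(len(bundle.get(name, [])) == 0 for bundle in recent_bundles)
    ]
```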

The summary can still be wrong on data the playbook surfaced. The bundle might include the idle-in-transaction session and the recent deploy side by side, and the model can still pick the deploy because it pattern-matches better to recent-change framing. The raw-data layer is the floor under the summary: the responder scrolls, reads the idle session, and overrides. Curating the input does not fix the model’s tendency toward confident misattribution; it changes how easy the misattribution is to catch.

The summary can also fabricate facts the playbook did not produce. Even with a curated bundle as input, the model can describe values that were not in the data (a lag of 47 seconds when no lag query ran), invent observations from a single-row snapshot, or restate the bundle in a way that adds confidence the data does not support. The raw-data layer is again the floor: the responder catches a fabricated number by reading the actual query output the playbook attached. The agent design has the same failure plus a worse one. The agent’s summary can claim observations that no tool call ever produced, because in an agent trace the summary text and the actual tool calls are separate artifacts and the responder rarely reads both.

Prompt injection is a real exposure. Raw strings from user-controlled fields end up in the bundle: query text, application_name, log message bodies. An attacker who can write into those fields can attempt to steer the summary. Tier-as-scope helps because low-trust alerts get less context to work with, but the playbook design does not eliminate the risk any more than the agent design does. Standard mitigations apply: prompt isolation, output sanitization, and treating the summary as untrusted input to anything downstream.
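One of those standard mitigations, sketched under the assumption that rows arrive as dicts: wrap the user-controlled fields so the summarizer sees them as quoted data rather than instructions. This narrows the exposure; it does not close it.

```python
UNTRUSTED_FIELDS = {"query", "application_name", "message"}


def mark_untrusted(row: dict) -> dict:
    """Wrap user-controlled string fields before they enter the prompt bundle."""
    return {
        key: f"<untrusted>{value}</untrusted>" if key in UNTRUSTED_FIELDS else value
        for key, value in row.items()
    }
```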

Page delivery takes longer if the summary blocks. The LLM call adds 200ms to a couple of seconds, and on paging tiers that is a regression on time-to-acknowledge. The fix is the same shape as the investigate button on lower tiers: lazy. Deliver the raw alert and the playbook output the moment they are ready. The summary lands asynchronously and appends to the thread when the LLM call returns. The responder starts reading the raw data while the summary is still rendering. Blocking page delivery on the summarizer is the kind of regression the design was supposed to prevent.
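If the paging integration is async, the lazy ordering is a few lines; `post_page`, `summarize`, and `append_to_thread` are placeholders for whatever paging and LLM clients the team actually uses.

```python
async def deliver(alert, bundle, post_page, summarize, append_to_thread):
    thread = await post_page(alert, bundle)     # responder is paged immediately,
                                                # raw playbook output attached
    summary = await summarize(alert, bundle)    # LLM call runs after delivery
    await append_to_thread(thread, summary)     # summary appends when it is ready
```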

When this doesn’t apply

A few places the discipline costs more than it pays.

Small systems with a single responder who has the full mental model. A two-service team with five alert types and one on-call does not need playbook authoring overhead and a per-alert LLM call. The responder’s pattern-match resolves the page in thirty seconds and the summary is friction.

Alerts with no diagnostic surface. A boolean health check with no associated query set is not a playbook target. The alert is the data; there is nothing structured to summarize on top.

Novel incidents the playbook has not seen. By design, no playbook matches and no summary is generated. The responder gets the raw alert and reads pg_stat_activity themselves. That is the correct behavior. The alternative, an autonomous agent that reaches for whichever queries it pattern-matches to, would produce a confident summary on a problem the team has never seen, which is the worst case for both MTTR and post-mortem quality.

The bigger picture

Summarization of structured data into prose is the job where LLM hallucinations are smallest. They still happen, but the floor is higher than for tool selection or causal attribution. Where the work depends on a system model the model does not have, the hallucinations are everywhere. The same shape shows up in letting AI manage indexes and in treating prompts as guardrails. The reliability engineer keeps the system model. The playbook is where that model lives in code. The model formats. Choosing what to fetch is the part the engineer is for.

SELECT insights FROM experience WHERE downtime = 0; -- Ruslan Tolkachev