Narrow Tools, Narrow Agents: Where Agent Reliability Actually Comes From

TL;DR

The agents that hallucinate least are the ones calling narrow, opinionated tools and given the smallest possible job. A “diagnostic” endpoint that returns a present-state boolean plus a lag number beats a generic query(sql) surface, and a scoped agent with three tools beats a wide agent with twenty. Even with both halves in place, the output still has inconsistencies on some level. The goal is “hallucinates least”, not “right”.

Here is what an on-call agent posted to the incident channel when pg_stat_replication.replay_lag on the orders cluster crossed the 60s SLO:

1
2
3
4
5
Logical replication slot pub_orders_v1 is behind on the publisher.
restart_lsn has not advanced in the last 12 minutes. Recommend
dropping the subscription and recreating it, then advancing the
slot manually with pg_replication_slot_advance(). Runbook:
ops/runbooks/postgres/replica-lag-orders.md (last reviewed 2023-08).

The slot the agent named was decommissioned twenty months earlier. This is what pg_replication_slots returned the same minute on the same replica:

1
2
3
4
=> SELECT slot_name, slot_type, active FROM pg_replication_slots;
 slot_name | slot_type | active
-----------+-----------+--------
(0 rows)

The summary was produced by an agent with RAG over the full ops repo and a standard query(sql) MCP tool against the replica. The orders cluster moved off logical replication in the Q3 2024 migration to streaming. Two paragraphs of an ADR titled “Move publication off logical replication” explain exactly why. The ADR lives in the same repo, indexed by the same embedding model, available to the same retrieval call. It ranked behind the 2023 runbook, behind two other runbooks for the same alert family, and behind a post-mortem from a different cluster entirely. The ADR’s vocabulary didn’t match the alert’s. The agent never read it. The cluster’s actual problem (a long autovacuum on the orders_2026_05 partition generating WAL faster than the replica could apply) was sitting one query away in pg_stat_progress_vacuum. The agent’s summary never reached for it.

What a smarter model wouldn’t have fixed

The familiar levers are a bigger context window, better embeddings, a smarter model. None of them touch the underlying mechanic. Embeddings retrieve by similarity, not by truth-value. The 2023 runbook scored highest because it talked about the exact alert, with the exact column names, in the exact phrasing the alert text used. A bigger window pulls in more competing documents, including more wrong ones. A better embedding model sharpens the same match against the same stale corpus. A smarter LLM produces a more confident summary on the wrong grounding. The retrieval surface is the problem, and the model is doing what models do.

Anthropic’s Writing effective tools for agents (September 2025) makes the structural version of the same point: more tools and broader tools don’t improve agent outcomes. The team behind Claude found that purposefully narrower tools (search_contacts over list_contacts, with shaped responses) beat broader ones consistently, because agents struggle to extract signal from irrelevant context and burn tokens trying. The same shape applies one layer up. A search_runbooks tool grounded against a corpus where half the docs are out of date is a broad tool dressed in narrow clothes. The narrowness has to live in what the tool actually returns.

The same alert through a narrow tool

Here is what a diagnose_replica_lag(cluster) endpoint returns when the same alert fires:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
{
  "lagging": true,
  "lag_seconds": 47,
  "replication_mode": "streaming",
  "publisher_lsn": "F1/A2C3D400",
  "replay_lsn": "F1/A2B14C00",
  "wal_replay_paused": false,
  "blocking_autovacuum": {
    "relation": "orders_2026_05",
    "phase": "scanning heap",
    "started_at": "2026-05-24T02:18:11Z",
    "duration_seconds": 5421
  }
}

The agent reads present-state structured input. No runbook, no link, no freeform “here’s what this might mean”. The endpoint is the only thing in the stack that knows the cluster moved off logical replication. Whatever slot the response names (if any) is whatever the slot is called today. The blocking autovacuum is computed by joining pg_stat_replication and pg_stat_progress_vacuum on the server side, where the join is cheap and the freshness is guaranteed. The agent’s summary against this input says what the input says. There is no 2023 runbook in the prompt to retrieve from.

The Unix mantra ports to agent tools more directly than most. Each tool does one function, does it well, and shapes its output for the consumer. The consumer is an agent with a token budget, no working memory between calls, and a fondness for whichever input most resembles a familiar pattern. The output has to be small, shaped, current, and complete enough that the agent’s job is to read it, not compose it.

A schema introspector for an agent doing query work returns the active subset of the catalog: tables, columns, and indexes touched by queries in the last thirty days, with column types, foreign keys, and the indexes that have non-zero idx_scan over the same window. It does not dump 4,200 rows of pg_catalog.pg_class joined against pg_attribute. The catalog carries years of accumulated noise: deprecated audit tables, the experiment from 2022 that never got cleaned up, the staging-only mirror copies. An agent given the full dump pattern-matches against the noise as readily as the signal.

A “currently breaking” endpoint returns a pre-joined view of active alerts, the playbook each alert routes to, the affected service, the deploy SHA from the last fifteen minutes, and the on-call’s contact. It does not return three underlying APIs and trust the agent to compose the join. The join is the question. The tool encoded it. Re-deriving the join from raw sources every call is where the agent burns tokens and where the misattributions accumulate.

Web search joins this set when the question is about something fresh. For a CVE on a specific Postgres minor, a web_search(release_notes) tool grounded against pgsql-announce or a vendor advisory beats the same question routed against an internal RAG corpus where the relevant note doesn’t exist yet. Fresh source. Narrow scope. The tool encodes the question.

The pattern across all four: the tool encodes the question, not the source. An agent calling query(sql) gets to compose every question itself, including the badly-worded ones. An agent calling a diagnostic tool only asks the questions the diagnostic was designed for. That constraint is the feature.

Warning

“Narrow” has to be narrow in what the tool returns, regardless of what it’s named. A diagnose_replica_lag tool that runs SELECT * FROM pg_stat_replication and dumps every column on every row is no better than letting the agent write the SQL itself. The shaping (which columns, with what projection, with what filtering, with what defaults) is where the reliability lives. A tool that returns 5,000 rows because nobody put a LIMIT on the server side has handed the responsibility back to the model, which is the responsibility the tool existed to remove.

Narrow tools don’t help a wide agent

Tools shaped for the right answer still get the wrong call from an agent given fifty of them and the instruction “you are the on-call assistant, help the human”. The model picks whichever tool pattern-matches best to whatever the input string looked like, and that pattern-matching is biased toward whichever description sounds most familiar. A scoped agent (one job, three tools, one decision to produce) has nothing else to reach for.

Scope is what the agent is for, expressed narrowly enough that the right tool call is mechanical. “Triage a replica-lag page” is a scope. The tools are the diagnostic, the recent-deploys lookup, and the playbook. The agent calls them in order and produces the summary. “Help the engineer with whatever they ask” is not a scope, it is the absence of one. The same model with the same tools resolves the first job consistently and freestyles the second.

Wide scope produces a failure that’s worse than picking the wrong tool. A wide-scope agent given a hybrid problem reaches for one of the relevant tools and silently drops the other half. The scoping decision lived in the model’s pattern-match against the input string, which is the worst place to put a decision the next on-call wants to debug six months from now. A scoped agent doesn’t have to decide. The scope already decided.

The two halves compound. Narrow tools make the right call mechanical inside a scope. Narrow scope makes it obvious which tools the agent will call. A wide agent with narrow tools wastes the narrowness. A scoped agent with broad tools wastes the scope. The Redis diagnostic API in internal tools are AI 10x is what the canonical version looks like in production: one endpoint, present-state answers (key counts, memory pressure, slow-log entries, replication offset), pre-shaped for whoever is asking. The agent never composes Redis commands. It calls the diagnostic and reads structured output. The same shape ports to Postgres, MySQL, Kafka, ES.

Note

The argument is about agents acting on production systems, where a bad tool call has operational consequences and the corpus the agent reads from has years of accumulated drift. Pure-text Q&A bots over a curated help-center corpus are a different regime. The corpus there is small, maintained, and the cost of a bad retrieval is the user reading a stale answer rather than the system running a destructive command. Retrieval is the right primitive in that regime. The argument here is that “retrieval over the production docs corpus” is the wrong primitive for an agent that’s going to act.

When this doesn’t earn its keep

One-shot scripts and ad-hoc analysis where the human reviews every output. The agent’s tool is query(sql) against a sandbox; the human reads the result and discards it. The blast radius is the engineer’s own time.
Low-stakes generative tasks. Commit messages, variable names, refactoring suggestions, the docstring for a private helper. The cost of a wrong output is noticing and editing.
Greenfield code where there’s no accumulated context to be stale. The agent is the only author the repo has seen, the conventions are whatever it wrote yesterday, and there’s no two-year-old runbook to mis-retrieve.
Small teams with a single agent and a single use case. Three tools is the size the agent already has. Building a narrowing layer for something already narrow is overkill.
Genuinely exploratory questions where the agent is supposed to ask broadly. “What in the catalog looks unused” is a question that wants the full catalog, not the thirty-day-active subset. Exploratory questions need exploratory tools.

The engineering work has moved. The 2023 question was how to prompt the model better. The 2026 question is what tool the agent reaches for, and how small the job is that the harness hands it. Most of the reliability budget lives in those two surfaces and not in the model weights. A team adopting agents on production systems will spend more time building agent-shaped APIs and scoping per-task harnesses than tuning prompts, because that is where the floor on hallucination actually moves. The model is going to misattribute and fabricate and confidently quote the wrong thing on some fraction of calls regardless of how the harness is built. The tool the agent calls and the scope of the call are the surfaces the engineer controls. The 2023 runbook for a slot that doesn’t exist anymore stops being a problem when the agent never reads runbooks, only calls the diagnostic that knows what the cluster actually is today.