<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Platform-Engineering on EXPLAIN ANALYZE</title><link>https://explainanalyze.com/tags/platform-engineering/</link><description>Recent content in Platform-Engineering on EXPLAIN ANALYZE</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Wed, 13 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://explainanalyze.com/tags/platform-engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>The 10x Is Real, on Internal Tools You'd Otherwise Never Ship</title><link>https://explainanalyze.com/p/the-10x-is-real-on-internal-tools-youd-otherwise-never-ship/</link><pubDate>Wed, 13 May 2026 00:00:00 +0000</pubDate><guid>https://explainanalyze.com/p/the-10x-is-real-on-internal-tools-youd-otherwise-never-ship/</guid><description>&lt;img src="https://explainanalyze.com/" alt="Featured image of post The 10x Is Real, on Internal Tools You'd Otherwise Never Ship" /&gt;&lt;div class="tldr-box"&gt;
 &lt;strong&gt;TL;DR&lt;/strong&gt;
 &lt;div&gt;AI coding agents do hit 10x or better on a specific slice of work: greenfield internal tools where the agent authors the code and the conventions it later re-reads. Outside that envelope (existing services, cross-team consumers, regulated code paths) the gain compresses to 10–30% at best and turns net negative on mature codebases, because verification cost dominates typing cost.&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;A DBRE has had a MySQL binlog purge script in the back of their head for six months. Current process: every other Friday, run &lt;code&gt;SHOW REPLICA STATUS&lt;/code&gt; against each replica to find the oldest binlog any of them is still reading from, take the minimum source-log-file across the fleet, factor in any replica that is lagging, then connect to the primary and run &lt;code&gt;mysql -e &amp;quot;PURGE BINARY LOGS TO 'mysql-bin.XXXXXX'&amp;quot;&lt;/code&gt; with a safe cutoff. By hand, in a notebook, while making sure no replica is far enough behind that the cutoff would yank a binlog the replica still needs and break replication. Pre-AI estimate to script it properly: half a day, with the replica-position scan across the fleet, the safe-cutoff math, a dry-run mode, and the README so the next on-call knows what it does. Never made the sprint. Friday afternoon with a coding agent: a small Go binary that walks each replica, parses &lt;code&gt;SHOW REPLICA STATUS&lt;/code&gt;, computes the minimum source-log-file with a configurable safety margin, refuses to act if any replica is more than 30 seconds behind, runs &lt;code&gt;PURGE BINARY LOGS TO&lt;/code&gt; on the primary, supports &lt;code&gt;--dry-run&lt;/code&gt;, emits structured logs, ships with a README the agent wrote in the same pass, and has a one-shot CI job that exercises it against a sandbox primary plus replica. Forty minutes including the test. Ships Monday. The twenty minutes a week of toil it removes is the kind of work that has never made anyone&amp;rsquo;s quarterly goals.&lt;/p&gt;
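&lt;p&gt;The safe-cutoff math is the part of that script worth spelling out, because it&amp;rsquo;s the part that breaks replication if it&amp;rsquo;s wrong. Here is a minimal sketch of the logic in Go, with a hardcoded fleet standing in for the real &lt;code&gt;SHOW REPLICA STATUS&lt;/code&gt; scan; the names, the two-binlog margin, and the rest of the scaffolding are illustrative, not lifted from the actual tool:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-go"&gt;// Sketch of the safe-cutoff math only; the real tool fills the fleet by running
// SHOW REPLICA STATUS against every replica and then executes the purge itself.
package main

import (
	&amp;quot;errors&amp;quot;
	&amp;quot;fmt&amp;quot;
	&amp;quot;log&amp;quot;
	&amp;quot;sort&amp;quot;
	&amp;quot;strconv&amp;quot;
	&amp;quot;strings&amp;quot;
)

// replicaStatus carries the two fields the cutoff needs from SHOW REPLICA STATUS.
type replicaStatus struct {
	host          string
	sourceLogFile string // Source_Log_File: the binlog this replica is still reading
	secondsBehind int    // Seconds_Behind_Source
}

const (
	maxLagSeconds = 30 // refuse to purge if any replica is further behind than this
	safetyMargin  = 2  // keep this many binlogs older than the computed minimum
)

// safeCutoff returns the binlog the primary can purge up to (exclusive).
func safeCutoff(fleet []replicaStatus) (string, error) {
	if len(fleet) == 0 {
		return &amp;quot;&amp;quot;, errors.New(&amp;quot;no replicas reported&amp;quot;)
	}
	files := make([]string, 0, len(fleet))
	for _, r := range fleet {
		if r.secondsBehind &amp;gt; maxLagSeconds {
			return &amp;quot;&amp;quot;, fmt.Errorf(&amp;quot;%s is %ds behind, refusing to purge&amp;quot;, r.host, r.secondsBehind)
		}
		files = append(files, r.sourceLogFile)
	}
	// Binlog names are a fixed prefix plus a zero-padded counter, so the lexical
	// minimum is the oldest file any replica still needs.
	sort.Strings(files)
	return applyMargin(files[0], safetyMargin)
}

// applyMargin backs the cutoff off by n files, preserving the zero padding.
func applyMargin(file string, n int) (string, error) {
	dot := strings.LastIndex(file, &amp;quot;.&amp;quot;)
	if dot &amp;lt; 0 {
		return &amp;quot;&amp;quot;, fmt.Errorf(&amp;quot;unexpected binlog name %q&amp;quot;, file)
	}
	seq, err := strconv.Atoi(file[dot+1:])
	if err != nil {
		return &amp;quot;&amp;quot;, err
	}
	if seq-n &amp;lt; 1 {
		return &amp;quot;&amp;quot;, errors.New(&amp;quot;not enough binlogs left to keep the safety margin&amp;quot;)
	}
	return fmt.Sprintf(&amp;quot;%s.%0*d&amp;quot;, file[:dot], len(file)-dot-1, seq-n), nil
}

func main() {
	fleet := []replicaStatus{
		{host: &amp;quot;replica-1&amp;quot;, sourceLogFile: &amp;quot;mysql-bin.000412&amp;quot;, secondsBehind: 0},
		{host: &amp;quot;replica-2&amp;quot;, sourceLogFile: &amp;quot;mysql-bin.000409&amp;quot;, secondsBehind: 3},
	}
	cutoff, err := safeCutoff(fleet)
	if err != nil {
		log.Fatal(err)
	}
	// Dry-run behavior: print the statement instead of running it against the primary.
	fmt.Printf(&amp;quot;PURGE BINARY LOGS TO '%s';\n&amp;quot;, cutoff)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The lexical minimum works because MySQL zero-pads the binlog counter, and defaulting to printing the &lt;code&gt;PURGE BINARY LOGS TO&lt;/code&gt; statement rather than executing it is the dry-run property that makes a sandbox primary plus replica enough verification to ship on Monday.&lt;/p&gt;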
&lt;h2 id="where-the-multiplier-comes-from"&gt;Where the multiplier comes from
&lt;/h2&gt;&lt;p&gt;&amp;ldquo;AI is faster everywhere now&amp;rdquo; is the reflexive read of the scenario above, and it&amp;rsquo;s the wrong read. The same agent on the customer-facing payments service, modifying a checkout flow three years old with four other teams reading the code, lands at 10–20% faster on a good day and net negative on a bad one. The numbers in the literature back this up.&lt;/p&gt;
&lt;p&gt;Stanford&amp;rsquo;s &lt;a class="link" href="https://softwareengineeringproductivity.stanford.edu/" target="_blank" rel="noopener"
 &gt;Software Engineering Productivity research&lt;/a&gt; on a corpus of more than 100,000 developers found AI gains of roughly 30 to 35 percent on low-complexity greenfield tasks, 10 to 15 percent on high-complexity greenfield, and brownfield work compressing further from there. Google&amp;rsquo;s &lt;a class="link" href="https://dora.dev/dora-report-2025/" target="_blank" rel="noopener"
 &gt;2025 DORA State of AI-Assisted Software Development report&lt;/a&gt; found 90% of developers using AI and over 80% reporting it made them more productive, while organizational software delivery metrics stayed flat. Individual perceived productivity is not translating into faster delivery to customers. DORA&amp;rsquo;s authors describe AI as an amplifier of existing engineering capability and flag a negative relationship between AI adoption and software delivery stability, with 30% of developers reporting little or no trust in the code AI generates. The verification tax is real: time saved on creation is getting spent on audit.&lt;/p&gt;
&lt;p&gt;METR&amp;rsquo;s &lt;a class="link" href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" target="_blank" rel="noopener"
 &gt;July 2025 controlled study&lt;/a&gt; on sixteen experienced open-source developers in repositories averaging a million-plus lines of code and a decade old found something stranger. The developers were 19% slower with AI than without, and they thought they were 20% faster. Mature codebases are an antagonistic environment for an agent and a confusing one for the human paired with it.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;AI&amp;rdquo; by itself doesn&amp;rsquo;t earn the 10x. Four other conditions have to stack on top of it: greenfield, AI-authored conventions, small audience, fast feedback. Remove one and the math compresses to single-digit percentages. Remove two and the agent is net negative.&lt;/p&gt;
&lt;h2 id="greenfield-is-cheap-context"&gt;Greenfield is cheap context
&lt;/h2&gt;&lt;p&gt;An existing codebase has conventions the agent has to infer from reading. The error-handling pattern, the test-fixture convention, the logger wrapper everyone uses, the half-deprecated config layer the new code is supposed to use instead. Some of this is written down in a CONTRIBUTING file from 2022 that is now wrong in two places. Most of it lives in tribal knowledge. The agent reads twelve files and guesses, sometimes wrongly, and the engineer&amp;rsquo;s time goes into correcting the guess. The context tax is real and it doesn&amp;rsquo;t show up on any productivity dashboard.&lt;/p&gt;
&lt;p&gt;Greenfield collapses that tax to zero. The agent writes the first file and decides the convention, because there is no prior convention to defer to. Error handling is whatever the first handler returned. Logging is whatever the agent picked in the first module. The pattern propagates forward because the agent is reading its own work on every subsequent call. The convention isn&amp;rsquo;t reliable in the LLM sense (nothing the model does is reliable), but the distribution narrows considerably when the only style the agent has seen in this repo is the style it wrote yesterday. Less to hallucinate about. Fewer plausible alternatives competing for the next token.&lt;/p&gt;
&lt;h2 id="the-agent-reads-its-own-code-and-stays-in-pattern"&gt;The agent reads its own code and stays in pattern
&lt;/h2&gt;&lt;p&gt;The cleaner version of this story is that the agent authors both the code and the docs and re-reads its own docs on the next session, so everything stays coherent. That&amp;rsquo;s wishful. Agents leave READMEs stale routinely. A session edits the code, claims success, and silently leaves the documentation pointing at a function signature that&amp;rsquo;s been renamed since. Pretending otherwise is the same &amp;ldquo;AI is reliable now&amp;rdquo; framing that fails the moment someone trusts it.&lt;/p&gt;
&lt;p&gt;The real driver is smaller and more mechanical. On a small greenfield repo, the agent&amp;rsquo;s first move on a new session is usually to scan the files that are already there. The code it reads is code the agent wrote last week. The patterns it produces in this session mirror what it finds in the existing files, because the model is doing what models do: sampling tokens that match the recent style it&amp;rsquo;s already seen in the context. Error handling looks the same in module five as it did in module one because the agent read module one before writing module five. Logging conventions, naming, test layout: all of it propagates from re-reading the code, not from a doc the agent has any particular discipline about.&lt;/p&gt;
&lt;div class="warning-box"&gt;
 &lt;strong&gt;Warning&lt;/strong&gt;
 &lt;div&gt;Treat any AI-authored README as a build artifact, not a maintained source of truth. Agents skip README updates routinely, and a stale doc is worse than no doc because the next session will follow the wrong instruction confidently. If docs are sticking around, regenerate them rather than maintain them, and don&amp;rsquo;t trust any README older than the most recent code change. The code is the source of truth. The moment the tool grows to where re-reading the code on each session stops being cheap, it has graduated out of this regime and needs the discipline of any other production system.&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The shape of the speedup is familiar: it is the same reason a single author writes a coherent short piece faster than three co-authors splitting the same word count. Coordination overhead is the multiplier, not raw output speed. In a human team, coordination is meetings, code review, RFC documents, the slow accumulation of shared style. In a single-agent greenfield repo, coordination is one session re-reading the small artifact it wrote yesterday and matching the patterns it finds. The cost of &amp;ldquo;what&amp;rsquo;s our convention here&amp;rdquo; is the cost of one repo scan.&lt;/p&gt;
&lt;h2 id="small-audience-fast-deploy-throwaway"&gt;Small audience, fast deploy, throwaway
&lt;/h2&gt;&lt;p&gt;The MySQL binlog purge script is used by one team. The AZ backup shipper is used by the storage on-call. The diagnostic API in front of Redis is curl&amp;rsquo;d by whoever is debugging tonight. None of these tools have customer SLOs, escalation paths, or a third-party integration to coordinate. v1 broken on Monday is fixed by Tuesday afternoon. The cost of a bug is bounded by how loudly someone says &amp;ldquo;this broke&amp;rdquo; in Slack.&lt;/p&gt;
&lt;p&gt;Fast feedback is what makes the testing strategy work. A &lt;code&gt;--dry-run&lt;/code&gt; flag and a sandbox replica are sufficient verification for the binlog purge script, because the worst case is the script crashes and the on-call reverts to the manual Friday routine, which is what they were already doing. There&amp;rsquo;s no need for a 2,000-case test suite. There&amp;rsquo;s barely a need for a runbook. The tool exists in the negative space between &amp;ldquo;the manual process&amp;rdquo; and &amp;ldquo;a properly engineered service,&amp;rdquo; and both ends of that range are operationally fine.&lt;/p&gt;
&lt;p&gt;And the tools are throwaway. If the binlog purge script turns out to be shaped wrong (the team wanted old binlogs archived to object storage, the agent built straight deletion, the retention knobs are awkward), it gets thrown out and rewritten from scratch. Sunk cost is hours, not weeks. The agent doesn&amp;rsquo;t carry the same emotional attachment to the previous version&amp;rsquo;s design that a human author would, which makes the rewrite cheaper than the original. That&amp;rsquo;s the property that lets a platform team take more shots than they otherwise would. Most of the toil-removal scripts that sit in the backlog die there because the expected effort feels higher than the expected payoff. The 10x reframes the expected effort, and a noticeable chunk of the backlog suddenly clears its own bar.&lt;/p&gt;
&lt;p&gt;The rewrite property has a second face: the replacement is often an upgrade, not just a regeneration. Most platform teams have the internal HTML dashboard from 2012 nobody wants to touch. Bootstrap, jQuery, server-rendered templates, no input validation on any of the forms, no real documentation, and one engineer left who remembers which buttons do what. Pre-AI estimate to modernize: a quarter that nobody schedules. Friday afternoon with a coding agent: a React frontend with form validation built in, a typed backend behind it, a README that documents what the endpoints actually do, and the input-validation gap that has been a low-grade footgun for three years closed by default, because the new stack treats validation as table stakes. The rewrite is cheaper than the original and an upgrade at the same time, because the agent is writing in a stack that already has the properties the old code lacked.&lt;/p&gt;
&lt;div class="note-box"&gt;
 &lt;strong&gt;Note&lt;/strong&gt;
 &lt;div&gt;The other property internal-only buys you: v2 ships side by side with v1. Both links go on the wiki, the team tries the new one in real work for a week, parity gets confirmed against the old one for the handful of operations anyone actually uses, and v1 gets deleted when nobody opens it anymore. No cutover plan, no customer comms, no parallel-infrastructure cost worth pricing. The team migrates itself. Try this rollout shape with a customer-facing service and the conversation goes very differently.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id="how-to-operationalize-this-without-standardizing-it-to-death"&gt;How to operationalize this without standardizing it to death
&lt;/h2&gt;&lt;p&gt;The temptation, as soon as the team notices the multiplier, is to standardize. Pick a Go scaffolding template, pick a logging library, mandate a directory layout, route everything through the platform team for review. Resist it. Standardization is where the multiplier collapses, because the moment two teams have to negotiate conventions, the agent is back in coordination-cost territory.&lt;/p&gt;
&lt;p&gt;A workable shape: every team picks two or three toil-removers from their own backlog and commits to shipping them this quarter. Each tool is owned by one team. Code review is one teammate, same-day, with the explicit understanding that the review is checking the tool does what the README says and doesn&amp;rsquo;t delete prod data, not enforcing a corporate style guide. Testing is on a sandbox or staging instance, not in CI for two weeks. A Backstage plugin backend that surfaces deploy status for the storage team&amp;rsquo;s services lives in the storage team&amp;rsquo;s repo, with the storage team&amp;rsquo;s conventions, and gets reused by other teams only if the other teams want to depend on it (and accept the version it ships in).&lt;/p&gt;
&lt;p&gt;The honest trade-off: tools built this way are good enough for one team and not good enough for the platform catalog. Some will need rewriting if they cross teams. That&amp;rsquo;s fine. The cost of rewriting is hours, and most of them will never cross teams anyway. The cost of insisting on cross-team-grade quality up front is that the diagnostic API the network team would have shipped in a Friday afternoon turns into a Q3 platform initiative, then a Q4 platform initiative, then nothing.&lt;/p&gt;
&lt;p&gt;The same multiplier extends past engineering. Marketing wants to bulk-edit campaign tags. Customer success wants a dashboard joining ticket history against feature adoption. Finance wants a quarterly close helper. With a coding agent, the team that actually knows what shape they want can build the first version themselves. Engineering takes the tool over and hardens it only if it proves valuable enough to graduate. Anything that doesn&amp;rsquo;t prove itself disappears quietly, which is the right outcome for a prototype.&lt;/p&gt;
&lt;div class="warning-box"&gt;
 &lt;strong&gt;Warning&lt;/strong&gt;
 &lt;div&gt;The sandbox is non-negotiable when non-engineers build their own internal tools. The environment has to be one where the tool cannot reach production credentials, cannot read customer data at production scale, and cannot expose anything to the public internet. A finance analyst prototyping a close helper against the prod database with their own AI agent is the worst-case version of this trend. The same prototype against pseudonymized data in a network-isolated environment is the version that pays off. What the platform team owes the business isn&amp;rsquo;t a scaffolding template. It&amp;rsquo;s a safe place to ship in. The infrastructure cost of getting the sandbox right is small. The cost of getting it wrong is the breach.&lt;/div&gt;
&lt;/div&gt;

&lt;h2 id="when-the-math-runs-the-other-way"&gt;When the math runs the other way
&lt;/h2&gt;&lt;p&gt;The regime above doesn&amp;rsquo;t extend to every tool a platform team might build. The decision matrix:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cross-team tools from day one.&lt;/strong&gt; Two teams on the same internal tool means convention negotiation, which means coordination cost, which means the multiplier is gone. Build it the way you build any shared service: design review, versioned API, deprecation policy. The agent is still useful here. It is not 10x useful.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regulated internal systems.&lt;/strong&gt; HR, finance close, anything in SOX, SOC 2, or HIPAA scope. The verification bar rises sharply because the audit trail has to survive an external reviewer, and the AI speed advantage compresses against the human review time the controls require.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools that touch customer data, even internally.&lt;/strong&gt; A script that joins across &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;subscriptions&lt;/code&gt; is a customer-facing risk regardless of who runs it. Read access to PII is read access to PII whether the caller is a customer-facing API, an analytics agent, or a backup shipper. Blast-radius arguments don&amp;rsquo;t get a discount for being operational.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Destructive operations on prod infra.&lt;/strong&gt; A binlog purge script can break replication if it computes the cutoff wrong on a replica that&amp;rsquo;s silently behind. A snapshot shipper writes to S3 buckets other systems read from. The testing rigor required to ship destructive code pulls the productivity gain down. Still positive, often still 3x or 4x, but not the 10x of a pure read-only diagnostic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools that drift into load-bearing.&lt;/strong&gt; The orchestrator that started life as a Friday-afternoon &amp;ldquo;drain, snapshot, upgrade, verify, restore traffic&amp;rdquo; script and is now the deploy path for six services. Once a tool crosses that line it has graduated out of the throwaway regime and into the production-systems regime. The conventions need writing down, the test suite needs to exist, and the next human who edits the code needs to be able to read it without an agent&amp;rsquo;s help. The productivity question is no longer &amp;ldquo;how fast can the agent ship v1&amp;rdquo; but &amp;ldquo;how cheap is the system to operate at year three.&amp;rdquo; Different question. Different answer.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-bigger-picture"&gt;The bigger picture
&lt;/h2&gt;&lt;p&gt;The 10x envelope is narrow: greenfield, internal, replaceable, one team, fast feedback. It contains the toil backlog that has cost the team twenty silent minutes a week per uncreated tool for years, and the cost of writing those tools just dropped by an order of magnitude. The business case runs higher: when a non-engineer ships a tool with a coding agent, the multiplier is closer to 100x, because the pre-AI version of the tool doesn&amp;rsquo;t exist. Nobody ever wrote a ticket for it.&lt;/p&gt;
&lt;p&gt;The most ambitious version of the regime is the agent API in front of the database: greenfield, owned by one team, conventions the agent sets, audience small enough that v1 wrong is recoverable. The team that has been deferring it because it sounds like a quarter of work has a path to ship it in days.&lt;/p&gt;
&lt;p&gt;What doesn&amp;rsquo;t extend: customer-facing systems and customer data don&amp;rsquo;t tolerate the same speed. Once PII leaks into a model provider&amp;rsquo;s buffer, or a destructive script hits the wrong replica, you don&amp;rsquo;t get it back. The apology you&amp;rsquo;ll get is some flavor of &amp;ldquo;you&amp;rsquo;re absolutely right&amp;rdquo;. The phrase has no recall semantics.&lt;/p&gt;</description></item></channel></rss>