Death by a Thousand Cuts: the AI Database Failure You Can't Restore

TL;DR

The catastrophic AI failure (the agent that DROPs a table) is the recoverable one: loud, attributable, and on a gated pipeline it mostly can’t happen anyway. What actually bleeds you is the change that clears every gate because the gates check correctness at review time, and the failure is a function of volume and time, both of which are zero when the PR is open. Plot failures on loud-vs-quiet and recoverable-vs-not, and AI floods the quiet-and-unrecoverable corner the pipeline was never watching.

A scraper ships behind the same pipeline as everything else: feature branch, two approvals, CI green, a day in staging, then deploy. There is already a postings table, one row per job posting the crawler tracks, keyed by posting_id. Part of the change is a new table to track crawl state, storing the content hash of every posting each run sees so the next run can tell what changed. The agent designed the table, a reviewer approved it, and at review time it held zero rows. It looked reasonable:

CREATE TABLE crawl_state (
    id           BIGSERIAL   PRIMARY KEY,
    run_id       BIGINT      NOT NULL,
    posting_id   BIGINT      NOT NULL REFERENCES postings (posting_id),
    content_hash CHAR(64)    NOT NULL,   -- SHA-256 of the posting body
    crawled_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON crawl_state (posting_id);

It was reasonable, for about three months.

Nothing on that screen looks wrong, and that is the problem. A surrogate id, a foreign key to postings, an index on the column you look postings up by. It reviews as boilerplate. The grain is the part nobody states out loud: one row per posting per run. Every pass appends the entire posting set under a fresh run_id instead of updating the rows already there, so the table grows by the full set every time the scraper runs. postings holds roughly 100,000 rows. Three months and a hundred-odd runs later, crawl_state holds 11 million. The job that decides whether a posting changed runs the obvious query:

SELECT content_hash
FROM crawl_state
WHERE posting_id = $1
ORDER BY crawled_at DESC
LIMIT 1;

The posting_id index finds the matching rows, but there are now a hundred-odd of them per posting, and to return the single latest hash it sorts that pile by crawled_at on every call. Across the working set the pipeline is timing out. Asked to fix the slowness, the agent recommends what it always recommends: widen the index to CREATE INDEX ON crawl_state (posting_id, crawled_at DESC), so the latest-hash lookup stops sorting.
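Before accepting an index as the fix, the growth itself can be measured. What follows is a minimal sketch that assumes only the crawl_state table defined above; the column aliases are illustrative, not part of the original system.

```sql
-- Grain check: how many rows does the table hold per posting?
-- A table that stores current state should sit near 1.
-- A table that appends per run will sit near the run count.
SELECT count(*)                   AS total_rows,
       count(DISTINCT posting_id) AS postings_seen,
       count(DISTINCT run_id)     AS runs,
       round(count(*)::numeric
             / NULLIF(count(DISTINCT posting_id), 0), 1) AS rows_per_posting
FROM crawl_state;
```

If rows_per_posting tracks runs, the table is appending the whole posting set each run, and no index changes that.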

An engineer who knows the system reads it differently. The table shouldn’t have run grain at all. A posting’s current hash is one value, stored once and overwritten each run, and there is nothing to accumulate. Better than that, the hash isn’t a separate concern from the posting. It belongs on the postings row that already exists, so there is no second table, no run_id, and no latest-per-posting lookup at all:

ALTER TABLE postings
    ADD COLUMN content_hash    CHAR(64),
    ADD COLUMN hash_checked_at TIMESTAMPTZ;

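-- each run, per posting: write only when the hash differs, so an unchanged
-- posting costs no row update (hash_checked_at then marks the last change)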
UPDATE postings
SET content_hash = $2, hash_checked_at = now()
WHERE posting_id = $1 AND content_hash IS DISTINCT FROM $2;

That stays at one row per posting, roughly 100k rows, no matter how many runs complete; the changed-or-not check is a single primary-key row read, and there is no crawl_state to bloat. The index the agent suggested speeds up the symptom while doubling down on the grain that is the actual bug, paying write cost and storage on a table that should not exist. And there is no deploy to revert. The bad grain is a schema decision three months old plus 11 million rows of accumulated state, and unwinding it is a planned migration and a backfill, reviewed like any other change, not a rollback.
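The backfill half of that migration might look something like the sketch below. It assumes the ALTER TABLE above has already been applied and that the most recent crawl_state row per posting is the one worth keeping; the tie-break on id is an added assumption for rows that share a crawled_at.

```sql
-- Backfill: copy each posting's most recent hash from crawl_state onto postings.
-- DISTINCT ON keeps the first row per posting_id under the ORDER BY below,
-- which is the newest crawl (ties broken by the highest id).
UPDATE postings AS p
SET content_hash    = latest.content_hash,
    hash_checked_at = latest.crawled_at
FROM (
    SELECT DISTINCT ON (posting_id)
           posting_id, content_hash, crawled_at
    FROM crawl_state
    ORDER BY posting_id, crawled_at DESC, id DESC
) AS latest
WHERE p.posting_id = latest.posting_id;
```

On a table of this size the update would probably be run in posting_id batches rather than as one statement, and crawl_state dropped only after every reader has moved to the postings columns.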

The DROP is the lucky case, and here’s the 2x2 that shows why

The failure everyone pictures is the destructive one: the DROP TABLE, the migration that truncates the wrong relation, the script that deletes the binlogs. On a gated system that is the one you have mostly handled, with destructive migrations caught in review, credentials scoped, and a restore runbook practiced. When the Replit agent wiped a production database during a code freeze in July 2025, the data was restored: loud, attributable to one timestamp, recoverable. The scraper cleared that same pipeline because at merge time it was correct, with no rows for the grain bug to express and a CI seed that never reached the row count where it breaks.
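A CI seed that does reach the row count is cheap to generate. The sketch below assumes postings is already seeded with about 100,000 rows and that sha256() is available (PostgreSQL 11 or later); the 110 runs and the 20-hour spacing are illustrative values chosen to approximate three months of history, and the hashes are synthetic.

```sql
-- Seed crawl_state at production shape: every posting, once per simulated run.
INSERT INTO crawl_state (run_id, posting_id, content_hash, crawled_at)
SELECT r.run_id,
       p.posting_id,
       -- synthetic 64-character hex digest, not a real posting body
       encode(sha256(convert_to(p.posting_id::text || ':' || r.run_id::text, 'UTF8')), 'hex'),
       now() - (110 - r.run_id) * interval '20 hours'
FROM postings AS p
CROSS JOIN generate_series(1, 110) AS r(run_id);

ANALYZE crawl_state;

-- Inspect the plan and timing of the latest-hash lookup at that volume.
EXPLAIN (ANALYZE, BUFFERS)
SELECT content_hash
FROM crawl_state
WHERE posting_id = (SELECT min(posting_id) FROM postings)
ORDER BY crawled_at DESC
LIMIT 1;
```

A seed like this shows the latency, not the design error: it tells you the lookup degrades, while the question of whether the table should exist still needs a person.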

Rank a failure on two axes, loud-vs-quiet and recoverable-vs-not, and you get four corners:

Loud and recoverable is the DROP. You know instantly and you roll it back.
Loud and unrecoverable is rarer: a destructive operation you catch but can’t undo.
Quiet and recoverable is the bug that sat unnoticed but is still reversible when you find it.
Quiet and unrecoverable ruins quarters. You don’t find out for months, and by then the prior state is gone or never existed as a clean artifact.

The axes correlate, which is what makes the bad corner deep. Loud failures get caught while the prior state still exists; quiet ones sit, and the longer one sits the more downstream systems consume it as truth, until a recoverable error has been aggregated and propagated into an unrecoverable one. Loudness buys the time, so quiet and unrecoverable travel together.

Note

This post is about the write path: statements that change data, schema, or the planner’s behavior. The read-path failure (an AI-generated SELECT that returns a confidently wrong number) is the sibling problem, covered in What AI Gets Wrong About Your Database. A wrong read misleads a decision; a wrong write becomes the new truth.

AI floods the bad corner for a structural reason. Each change it ships is plausible: it compiles, runs, returns the right shape, passes whatever checks exist. That plausibility is the corruption floor: the same mechanism that makes the output useful also makes it occasionally wrong in a way that looks exactly right. A loud failure is one the output failed to make plausible; the quiet one is the model working as designed. Then volume multiplies it: a team shipping eighty changes a week instead of eight samples that floor ten times as often, on the same review and CI budget. The DROP is the rare draw the pipeline was built to stop. The thousand cuts are the modal draw, and it waves them through.

The quiet corner comes in three shapes. The scraper is the first: a schema correct at zero rows and fatal at eleven million, because the model optimizes from its training distribution, not your scale. Asked to partition a large table, it reaches for partitioning on created_at, the common corpus shape, which pulls created_at into the primary key, rather than the partitioning on the primary key itself that fits a high-scale OLTP table. The second is the value it computes wrong because it doesn’t hold your domain: Cursor’s support bot invented a login policy that didn’t exist and users cancelled before anyone knew, and Air Canada lost in court over a bereavement refund its chatbot made up. Move that same generator onto a write path computing a discount or a tax split and the row is well-typed and wrong about what the number means, with nothing reconciling it against the contract until quarter-end. The third is the change shipped past the author’s own understanding: it looked good and worked, so it went, and the judgment that would have caught it is built by the slow work the agent now skips. That is the paradox of the fast engineer: a July 2025 study measured developers working with AI tools as 19% slower even as they felt faster.

A worked example: the soft-delete leak

The grain bug is loud once you go looking, because it shows up as latency. The worse version of the same class is a write that corrupts a number and never moves a performance metric at all. Soft deletes are where this lives in most enterprise schemas.

The convention is old and unwritten. A row is never physically removed; it gets a deleted_at stamp, and every query that reads the table is expected to filter it out. A billing system for a SaaS company looks like this:

-- one row per recurring line on an account: base plan, seats, add-ons
CREATE TABLE subscription_items (
    id              BIGSERIAL   PRIMARY KEY,
    account_id      BIGINT      NOT NULL REFERENCES accounts (account_id),
    sku             TEXT        NOT NULL,
    monthly_cents   BIGINT      NOT NULL,
    deleted_at      TIMESTAMPTZ          -- set when a customer drops the line
);

When a customer downgrades, the application does not delete the row; it stamps it:

UPDATE subscription_items SET deleted_at = now() WHERE id = $1;

Every existing query that touches money knows this. The MRR rollup, the invoice generator, the revenue dashboard, all of them carry AND deleted_at IS NULL, because the team learned years ago that forgetting it double-counts churned revenue. That knowledge lives in the queries and in the heads of the people who wrote them. It is nowhere in the schema; deleted_at is just a nullable timestamp, and nothing stops a query from ignoring it.

Now an agent is asked to add a board-facing metric, monthly recurring revenue by region, that the nightly ETL persists into the warehouse fact tables the dashboards read. It writes the obvious thing:

CREATE VIEW mrr_by_region AS
SELECT a.region, SUM(si.monthly_cents) AS mrr_cents
FROM accounts a
JOIN subscription_items si ON si.account_id = a.account_id
WHERE a.deleted_at IS NULL          -- remembered on accounts
GROUP BY a.region;

It filtered deleted_at on accounts but not on subscription_items, because nothing in the schema said it had to and the training corpus is full of joins shaped exactly like this. Every cancelled add-on and downgraded seat is now summed back into regional MRR, and each night the ETL reads this view and writes the inflated number into the warehouse the whole company reads as truth. The shape is right and the number is plausible, a little high, and growing as the soft-deleted pool grows.

Nothing catches it. The engineer saw a working view and a deleted_at filter sitting right there and moved on; the AI reviewer flagged a naming nit; the human skimmed the green summary and approved; CI ran on a seed database with almost no deleted rows, so the leak was a rounding error in the test. A missing predicate is an absence, the one thing every reviewer, human or model, reliably fails to see.

The drift is the tell. Small at launch, ten or fifteen percent high a year later in the regions with the most downgrade history, and finance finds it the only way anyone does: reconciling the dashboard against billed revenue, a year in, with no single cause and no deploy to revert.
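The size of that drift can be measured directly against the base tables. This is a sketch under the schema shown above, assuming accounts carries the region and deleted_at columns the view already uses; the output column names are illustrative.

```sql
-- Per region: active MRR, the soft-deleted amount the leaky view adds back,
-- and how far the view overstates the active figure, in percent.
SELECT a.region,
       SUM(si.monthly_cents) FILTER (WHERE si.deleted_at IS NULL)     AS active_mrr_cents,
       SUM(si.monthly_cents) FILTER (WHERE si.deleted_at IS NOT NULL) AS leaked_mrr_cents,
       round(100.0
             * SUM(si.monthly_cents) FILTER (WHERE si.deleted_at IS NOT NULL)
             / NULLIF(SUM(si.monthly_cents) FILTER (WHERE si.deleted_at IS NULL), 0),
             1) AS pct_overstated
FROM accounts AS a
JOIN subscription_items AS si ON si.account_id = a.account_id
WHERE a.deleted_at IS NULL
GROUP BY a.region
ORDER BY pct_overstated DESC NULLS LAST;
```

Run on a schedule, the same query doubles as a drift alarm: pct_overstated only matters if something still reads the unfiltered sum.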

Warning

A soft-delete filter is a contract the schema cannot enforce. Nothing in subscription_items makes a query honor deleted_at, so the rollup should have joined an active_items view (CREATE VIEW active_items AS SELECT ... WHERE deleted_at IS NULL) and a rule should forbid money queries from touching the base table at all. That is worth more than any amount of review attention, because it replaces a predicate every query has to remember with a default no query can forget.
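One way that could be wired up in PostgreSQL is sketched below. The column list for active_items and the reporting_ro role are assumptions for illustration; the post only names the view and the rule.

```sql
-- The soft-delete rule lives in exactly one place.
CREATE VIEW active_items AS
SELECT id, account_id, sku, monthly_cents
FROM subscription_items
WHERE deleted_at IS NULL;

-- The rollup reads the filtered view, so it cannot forget the predicate.
CREATE OR REPLACE VIEW mrr_by_region AS
SELECT a.region, SUM(ai.monthly_cents) AS mrr_cents
FROM accounts AS a
JOIN active_items AS ai ON ai.account_id = a.account_id
WHERE a.deleted_at IS NULL
GROUP BY a.region;

-- reporting_ro is a hypothetical role used by dashboards and the ETL.
-- By default a view runs with its owner's privileges, so the role can read
-- active_items without any grant on the base table.
REVOKE ALL ON subscription_items FROM reporting_ro;
GRANT SELECT ON active_items, accounts, mrr_by_region TO reporting_ro;
```

With the base table out of reach, a money query that skips the filter fails with a permission error instead of returning a plausible number.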

The fix is a business call, not a technical one

The mitigations are known and none of them are clever: reconcile against a source of truth on a cadence the business can stand, alarm on aggregates and drift instead of only errors, run CI against production-shaped volume, bake invariants into views and constraints. All worth doing, all secondary, because every one is a net thrown after the write has committed. The load-bearing decision is upstream of the tooling, and it is a positioning call leadership makes on purpose. Four honest positions:

Ship at full speed, accept the corner. Take the velocity and take the unrecoverable write, the silent data loss, the bug the customer finds, as the cost. Legitimate for a seed-stage product with no prior state worth protecting. Catastrophic for a billing system.
Fast where it’s cheap, gated where it’s not. Agents and junior engineers run on loud, recoverable surfaces (internal tooling, dashboards, throwaway analysis); writes that touch money, schema, or multi-writer tables go through someone who holds the domain. This is where Amazon landed the expensive way, requiring senior sign-off on AI-assisted changes to its sensitive stack after a run of incidents (The Register, April 2026). They named the cost: controlled friction.
SME on everything. Only the smallest or most regulated shops can afford it, and it collapses into the second position the moment change volume outgrows the reviewers.
Encode the domain into tests. The one position that scales without scaling reviewers, and the one with the sharpest trap. An SME who knows the soft-delete convention writes an assertion that fails the build on the leak. Ask the agent to “add tests” under deadline and it writes one that sums the same leaked rows and asserts the inflated total is correct: the bug ships with a green check certifying it. And tests only check behavior, never whether the design should exist. The scraper passes everything you could write against it; the bug is the table, and a green suite is guaranteed to bless an architecture that works exactly as built.
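The assertion an SME would write for the fourth position could look like the following sketch. It computes the expected figure independently from the base tables under the soft-delete rule, instead of re-summing whatever the view returns, and it refuses to pass on a fixture that cannot exercise the leak. It assumes region is never NULL and that the fixture’s soft-deleted lines belong to live accounts.

```sql
DO $$
DECLARE
    mismatched_regions integer;
BEGIN
    -- Guard: the test proves nothing unless the fixture has soft-deleted lines.
    IF NOT EXISTS (SELECT 1 FROM subscription_items WHERE deleted_at IS NOT NULL) THEN
        RAISE EXCEPTION 'fixture has no soft-deleted subscription_items; the leak cannot be exercised';
    END IF;

    -- Compare the view against active-only MRR computed from the base tables.
    SELECT count(*) INTO mismatched_regions
    FROM mrr_by_region AS v
    FULL JOIN (
        SELECT a.region, SUM(si.monthly_cents) AS mrr_cents
        FROM accounts AS a
        JOIN subscription_items AS si ON si.account_id = a.account_id
        WHERE a.deleted_at IS NULL
          AND si.deleted_at IS NULL
        GROUP BY a.region
    ) AS expected USING (region)
    WHERE v.mrr_cents IS DISTINCT FROM expected.mrr_cents;

    IF mismatched_regions > 0 THEN
        RAISE EXCEPTION 'mrr_by_region disagrees with active-only MRR in % region(s)',
            mismatched_regions;
    END IF;
END
$$;
```

The guard is the part written from domain knowledge: a test generated from the view’s own output has no reason to insist on soft-deleted rows in the fixture.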

The trap is choosing the second or fourth on a slide and the first in practice. All four only work if the people doing the reading still have judgment to read with, and the paradox of the fast engineer is draining that pool: hand the slow work that grows an SME to an agent for two years and the sign-off is staffed by people with the title and not the instinct. If you are not willing to lose your seniors, the budget item is the work that makes them, not just the headcount that has the rank today.

What your monitoring is actually for

Everything you monitor fires in the loud quadrant. Error rates, 5xx, failed-job alerts, latency thresholds set above current numbers, all of it watching the corner you were already going to survive, because the DROP has a backup and a practiced runbook. The scraper that dies three months after a clean review, the write that computes the wrong number, the logic bug a customer found first, those never alarm, because the pipeline reads every change at the one moment it is still correct and then stops looking.
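A quiet-quadrant signal does not have to be elaborate. The sketch below snapshots estimated row counts daily and flags tables that have more than doubled in thirty days; the table_size_snapshots table, the 2x threshold, and the 1,000-row floor are all assumptions for illustration, and n_live_tup is a statistics estimate, not an exact count.

```sql
-- Hypothetical snapshot table: one row per table per day.
CREATE TABLE IF NOT EXISTS table_size_snapshots (
    captured_on date   NOT NULL DEFAULT current_date,
    table_name  text   NOT NULL,
    live_rows   bigint NOT NULL,
    PRIMARY KEY (captured_on, table_name)
);

-- Daily job: record the estimated live row count of every user table.
INSERT INTO table_size_snapshots (table_name, live_rows)
SELECT schemaname || '.' || relname, n_live_tup
FROM pg_stat_user_tables
ON CONFLICT (captured_on, table_name)
DO UPDATE SET live_rows = EXCLUDED.live_rows;

-- Alarm query: tables that more than doubled over the last 30 days.
SELECT today.table_name,
       past.live_rows  AS rows_30_days_ago,
       today.live_rows AS rows_today
FROM table_size_snapshots AS today
JOIN table_size_snapshots AS past
  ON past.table_name  = today.table_name
 AND past.captured_on = today.captured_on - 30
WHERE today.captured_on = current_date
  AND today.live_rows > 2 * GREATEST(past.live_rows, 1000);
```

A check like this would have flagged crawl_state within its first months, long before latency crossed any threshold set above current numbers.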

A single one of those is not a crisis. You find it, you trace it, you fix it. The problem is rate. Each is one draw from the corruption floor, and a team shipping ten thousand lines a day draws constantly, laying down a sediment of small wrongs that surface not the day they’re written but a year later, together, as a system nobody fully understands returning numbers nobody fully trusts. By then it is past untangling: a thread to pull assumes a thread, and a year of compounded cuts is the whole fabric. The options shrink to a rewrite or living with numbers you can’t defend, and no senior worth the title signs up to reverse-engineer a year of an agent’s confident guesses.

The fake citations got caught because the judge knew the real ones. That is the whole job, and the agent can’t do it for you: someone has to ship nothing they don’t understand, and understand it the whole way down, what the value means and what it does to every system that reads it later. Your product has no judge unless you are one. The agent makes the drafts faster; knowing what they cost is still the part you can’t hand off.