<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Anti-Patterns on EXPLAIN ANALYZE</title>
        <link>https://explainanalyze.com/tags/anti-patterns/</link>
        <description>Recent content in Anti-Patterns on EXPLAIN ANALYZE</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en-us</language>
        <lastBuildDate>Sun, 26 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://explainanalyze.com/tags/anti-patterns/index.xml" rel="self" type="application/rss+xml" /><item>
            <title>Corruption Is a Feature, Not a Bug: Why LLMs Corrupt by Design</title>
            <link>https://explainanalyze.com/p/corruption-is-a-feature-not-a-bug-why-llms-corrupt-by-design/</link>
            <pubDate>Sun, 26 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://explainanalyze.com/p/corruption-is-a-feature-not-a-bug-why-llms-corrupt-by-design/</guid>
            <description>&lt;div class=&#34;tldr-box&#34;&gt;&#xA;    &lt;strong&gt;TL;DR&lt;/strong&gt;&#xA;    &lt;div&gt;Frontier LLMs corrupt at least 25% of delegated multi-step document work in lab conditions, with the rate rising with document size and turn count and tool use failing to help. The fix isn&amp;rsquo;t a better model - corruption is a property of the architecture, not a defect to be patched, and the only thing that closes the gap is a best-in-class domain expert at every checkpoint.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;p&gt;Microsoft Research has a number on it. &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2604.15597&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;&#xA;    &gt;Laban, Schnabel, and Neville&amp;rsquo;s &lt;em&gt;LLMs Corrupt Your Documents When You Delegate&lt;/em&gt;&lt;/a&gt; (arXiv 2604.15597, April 2026) ran the DELEGATE-52 benchmark across 52 professional domains - coding, crystallography, music notation, professional writing - against 19 frontier LLMs including Claude 4.6 Opus, GPT 5.4, and Gemini 3.1 Pro. Average corruption: 25% of document content by the end of long workflows. Tool use doesn&amp;rsquo;t fix it. Agentic harnesses don&amp;rsquo;t fix it. Larger documents and longer interactions make it worse, not better. The number doesn&amp;rsquo;t depend on which frontier model you pick.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-25-is-a-floor-not-a-ceiling&#34;&gt;The 25% is a floor, not a ceiling&#xA;&lt;/h2&gt;&lt;p&gt;The benchmark is a controlled lab measurement against curated tasks with known ground truth. Real production has every reality catalogued in &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/what-ai-gets-wrong-about-your-database/&#34; &gt;What AI Gets Wrong About Your Database&lt;/a&gt; - undocumented conventions, polysemic columns, four-format date strings, JSON-as-schema, business logic in tribal knowledge, ten-year-old codebases with three &amp;ldquo;current&amp;rdquo; patterns for the same operation. The model in production reads from that impoverished signal, and the rate multiplies. The 25% is what you get on a good day on clean data. Production is not a good day.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-version-doesnt-matter---corruption-is-a-feature&#34;&gt;The version doesn&amp;rsquo;t matter - corruption is a feature&#xA;&lt;/h2&gt;&lt;p&gt;Claude Opus 4.7 is the latest as of writing. DELEGATE-52 measured 4.6, GPT 5.4, Gemini 3.1 Pro. The next generation will measure at the same floor. Not because the labs aren&amp;rsquo;t trying - they are - but because the corruption isn&amp;rsquo;t a defect to patch; it&amp;rsquo;s the property you bought when you bought &amp;ldquo;language model.&amp;rdquo; The same mechanism that makes the model useful (generalizing from a training distribution to plausible novel output) is the one that makes it corrupt your document (generalizing from a training distribution to plausible novel output that doesn&amp;rsquo;t match your specific facts). You can&amp;rsquo;t fix one without losing the other.&lt;/p&gt;&#xA;&lt;p&gt;LLMs are not intelligence. They are probability machines.
That isn&amp;rsquo;t a snub; it&amp;rsquo;s a description, and the engineering implications follow from taking it seriously.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-obvious-fixes-that-dont-work&#34;&gt;The obvious fixes that don&amp;rsquo;t work&#xA;&lt;/h2&gt;&lt;p&gt;The reflex when the rate is 25% is to reach for the things that usually fix software defects. None of them touch the floor:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;&amp;ldquo;Use a better model.&amp;rdquo;&lt;/strong&gt; The benchmark already measured the frontier. Same rate.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;&amp;ldquo;Add tools, RAG, fine-tuning.&amp;rdquo;&lt;/strong&gt; Tool use doesn&amp;rsquo;t change the rate. RAG narrows the prior, but the same sampling mechanism draws from it. Fine-tuning shifts the distribution; it doesn&amp;rsquo;t add deterministic constraints.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;&amp;ldquo;Add agent self-verification.&amp;rdquo;&lt;/strong&gt; The verifier is the same architecture reading the same training distribution as the generator. It will ratify the corruption with the same confidence the generator produced it with.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;&amp;ldquo;Add more context.&amp;rdquo;&lt;/strong&gt; &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/what-ai-gets-wrong-about-your-database/&#34; &gt;What AI Gets Wrong About Your Database&lt;/a&gt; already covered this - more context lowers the rate, doesn&amp;rsquo;t drive it to zero. The hallucination floor is structural.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;These aren&amp;rsquo;t bad ideas. They lower the rate from terrible to bad. They don&amp;rsquo;t make delegation safe.&lt;/p&gt;&#xA;&lt;h2 id=&#34;why-this-is-structural---the-mechanism&#34;&gt;Why this is structural - the mechanism&#xA;&lt;/h2&gt;&lt;p&gt;The mechanism is worth understanding because it&amp;rsquo;s what tells you why &amp;ldquo;use a better model&amp;rdquo; doesn&amp;rsquo;t move the floor.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Embeddings are statistical co-occurrence, not knowledge.&lt;/strong&gt; A model &amp;ldquo;understands&amp;rdquo; &lt;code&gt;users.deleted_at&lt;/code&gt; as a vector position adjacent to other &lt;code&gt;deleted_at&lt;/code&gt; columns it saw during training. There is no concept of &lt;em&gt;your&lt;/em&gt; soft-delete convention, &lt;em&gt;your&lt;/em&gt; tenant filter, the incident your team had last quarter, or the rule you wrote into the catalog comment two months ago. The vector is a fingerprint of what tokens like that one tend to appear next to in a billion training documents. It is not a fact set with internal consistency the model can check against.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Attention is a soft lookup, not retrieval.&lt;/strong&gt; Each output token is a weighted blend of every other token in context, with the weights being learned similarity scores. The model isn&amp;rsquo;t looking up &amp;ldquo;the right answer for this schema.&amp;rdquo; It&amp;rsquo;s computing a weighted average of what tokens like &lt;em&gt;this&lt;/em&gt; tend to be followed by tokens like &lt;em&gt;that&lt;/em&gt; in its training distribution. Correctness is the special case where the distribution happens to be sharply peaked on the correct token.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Generation is sampling from a probability distribution. 
Every token is a guess.&lt;/strong&gt; When training data is dense and consistent for the topic, the distribution is sharp - but a sharp distribution is reliable token-relationship probability, not knowledge of the answer. It means tokens in that region of the data point reliably to a particular continuation; the relationships aren&amp;rsquo;t facts, they&amp;rsquo;re co-occurrence statistics with no internal check against reality. When the underlying relationship happens to match the world, the guess looks right. When it doesn&amp;rsquo;t - a popular misconception in the training corpus, an outdated convention, a pattern typical of training data but not of your specific case - the guess is confidently wrong with the same calibration. When training data is sparse, contradictory, or local to your codebase, the distribution flattens, the guess gets noisier, and the calibration of the model&amp;rsquo;s &lt;em&gt;confidence&lt;/em&gt; stays the same. Nothing in the architecture says &amp;ldquo;I don&amp;rsquo;t know.&amp;rdquo; It says &amp;ldquo;this token has the highest probability in my distribution,&amp;rdquo; even when the distribution is barely above random - or sharp on a relationship that doesn&amp;rsquo;t hold for your case.&lt;/p&gt;&#xA;&lt;p&gt;To collapse those three steps into the operation that actually runs: every token is a vector - a position in a high-dimensional space, typically thousands of dimensions (4,096 in some models, 12,288 in others). &amp;ldquo;Similarity&amp;rdquo; between two tokens is the dot product (or cosine similarity) of their vectors. Attention computes its weights by taking those similarity scores between the current position and every prior token in context, then softmax-normalizing. The &amp;ldquo;probability&amp;rdquo; of the next token is the dot product of the model&amp;rsquo;s predicted direction against every candidate token&amp;rsquo;s vector in the vocabulary, passed through a softmax to produce a distribution. Every probability you read out of the model is a distance computation between vectors in that high-dimensional space. &amp;ldquo;Understanding&amp;rdquo; is position. &amp;ldquo;Probability&amp;rdquo; is geometric proximity. There&amp;rsquo;s no step in the pipeline where knowledge enters - only matrix multiplications and a normalizing function.&lt;/p&gt;
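&lt;p&gt;To make that concrete, here is a minimal sketch - toy vectors and made-up numbers, not any production model and not code from the paper - of that pipeline as nothing but dot products and a softmax:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy illustration: next-token probability as dot products plus a softmax.
# Real models add learned weights and many layers; the shape of the computation is the same.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = rng.normal(size=(10, 4))    # 10 made-up token vectors, 4 dimensions each
context = rng.normal(size=(3, 4))   # 3 tokens already in context
query = context[-1]                 # the current position

# Attention: similarity scores against every prior token, softmax-normalized.
weights = softmax(context @ query)
blended = weights @ context         # a weighted average of the context, not a lookup

# Next-token distribution: the blended direction scored against every vocabulary vector.
probs = softmax(vocab @ blended)
print(probs.argmax(), round(float(probs.max()), 3))
# A sharp peak here reports geometric proximity, never a checked fact.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Scale the dimensions up and stack the layers and nothing about that changes: every probability the model reports is still position and proximity, with no step where a fact about &lt;em&gt;your&lt;/em&gt; schema or &lt;em&gt;your&lt;/em&gt; convention could enter.&lt;/p&gt;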
&lt;p&gt;DELEGATE-52&amp;rsquo;s 25% is the rate at which the distribution flattened across 52 domains&amp;rsquo; edge cases and the sampling collapsed to plausible-sounding hallucination. The confidence reading stayed identical to when the model was right. This is &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/testing-your-database-part-1-why-ai-made-it-mandatory/&#34; &gt;Part 1&amp;rsquo;s &amp;ldquo;confidence is anti-signal&amp;rdquo;&lt;/a&gt; restated at the architecture level: confidence and correctness are produced by different mechanisms, neither tied to the other.&lt;/p&gt;&#xA;&lt;h2 id=&#34;why-best-in-class-sme-is-the-load-bearing-safeguard&#34;&gt;Why best-in-class SME is the load-bearing safeguard&#xA;&lt;/h2&gt;&lt;p&gt;Humans don&amp;rsquo;t operate this way. A crystallographer knows the unit cell parameters have to satisfy specific symmetry constraints - not because she&amp;rsquo;s seen a million similar structures, but because the constraints follow from a small set of facts she can verify against. A senior database engineer knows the soft-delete convention because she wrote it. A composer knows the chord progression doesn&amp;rsquo;t resolve because the leading tone wasn&amp;rsquo;t raised. That knowledge is symbolic, propositional, traceable to evidence the human can produce on demand. It isn&amp;rsquo;t a probability distribution.&lt;/p&gt;&#xA;&lt;p&gt;That is the gap an SME closes. Not &amp;ldquo;any reviewer with a checklist&amp;rdquo; - a generalist reviewer can&amp;rsquo;t tell when the model has silently corrupted the symmetry constraints, the soft-delete predicate, or the leading-tone rule. The corruption looks plausible because the model&amp;rsquo;s job is producing plausible output. Catching it requires someone who holds the actual constraints in their head and can check the model&amp;rsquo;s output against them. The cheaper the SME, the more corruption ships.&lt;/p&gt;&#xA;&lt;p&gt;The labor-market reading of this is already visible. &lt;a class=&#34;link&#34; href=&#34;https://www.ibm.com/think/news/entry-level-roles-get-reset-ai&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;&#xA;    &gt;IBM announced in February 2026 it would triple US entry-level hiring&lt;/a&gt; - explicitly because the AI era hollowed out the rote tasks that used to fill junior roles and left the load-bearing work (judgment, customer interaction, oversight of automated systems) needing humans who grow into it. The pipeline argument is unforgiving: cut juniors to capture the AI-productivity dividend, save short-term, and starve the senior layer the next decade of work depends on. The companies treating today&amp;rsquo;s juniors as a long bet on the experts they&amp;rsquo;ll become are reading the architecture honestly. The ones cutting them to bank the LLM savings will end up paying for the pipeline their competitors are building.&lt;/p&gt;&#xA;&lt;p&gt;The supply side is tightening on both ends. The senior tier is retiring out - the engineers who built the soft-delete conventions, the schema histories, the production-incident memory - and that institutional knowledge isn&amp;rsquo;t transferring into a probability distribution any model can sample from. On the formation side, &lt;a class=&#34;link&#34; href=&#34;https://www.anthropic.com/research/AI-assistance-coding-skills&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;&#xA;    &gt;Anthropic&amp;rsquo;s own research shows engineers using AI assistance score measurably lower on comprehension quizzes about the code they shipped&lt;/a&gt; than engineers who wrote it themselves; the skill-formation loop that turns juniors into SMEs over a decade - write, struggle, debug, internalize - gets shortcut by tooling that produces working code without the struggle. Companies that don&amp;rsquo;t actively design against both effects get the worst of three pressures: SMEs retiring, juniors not deepening, and an architecture that has no internal substitute for either.&lt;/p&gt;&#xA;&lt;p&gt;The unintuitive recommendation that follows: don&amp;rsquo;t fire the humans you have. The SME labor market will tighten faster than LLM tooling can replace what SMEs do - supply is shrinking on both ends, the architecture has no substitute, and today&amp;rsquo;s senior engineer is cheap relative to their replacement cost in three to five years. Companies banking the LLM productivity dividend by cutting senior staff are trading short-term margin for a much steeper rehiring bill against a constrained future market. The math will look obvious in retrospect.
It doesn&amp;rsquo;t look obvious now because the LLM line item lands on the income statement before the SME-shortage bill arrives.&lt;/p&gt;&#xA;&lt;p&gt;This is why &amp;ldquo;best-in-class&amp;rdquo; is load-bearing in the title. A junior with a checklist runs the same architecture-level pattern-matching the model does - recognizes things that look right, doesn&amp;rsquo;t catch silent semantic drift. A top-of-class SME has the constraint set internalized to the point that the wrong answer feels wrong, even when the surface looks correct. That feeling is the safeguard the architecture cannot provide.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-system&#34;&gt;The system&#xA;&lt;/h2&gt;&lt;p&gt;Don&amp;rsquo;t delegate end-to-end. Decompose work into chunks small enough that an SME can verify each one in minutes. Checkpoint between chunks. Route each checkpoint to the SME whose domain it&amp;rsquo;s in. Treat every chunk as 25%-floor untrusted by default. Don&amp;rsquo;t trust agentic chains to self-verify (the verifier reads from the same training distribution as the generator). Don&amp;rsquo;t trust LLM-judge eval as a release gate (&lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/testing-your-database-part-2-what-to-test-and-how/&#34; &gt;Part 2 of the testing series&lt;/a&gt; covered why; the architectural reason is in the mechanism above). The system is decomposition + checkpoints + SMEs, not a better model and not a better prompt.&lt;/p&gt;
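&lt;p&gt;In code terms the loop looks something like this - a hypothetical sketch with stand-in &lt;code&gt;llm_draft&lt;/code&gt; and reviewer callables, not a real framework - where nothing moves past a checkpoint without a sign-off from the SME who owns that domain:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical sketch of the decompose / checkpoint / SME-review loop described above.
# llm_draft and the reviewer registry are stand-ins, not a real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Chunk:
    domain: str        # which SME owns the checkpoint
    instructions: str  # scoped small enough to verify in minutes

def delegate(chunks: list[Chunk],
             llm_draft: Callable[[str], str],
             reviewers: dict[str, Callable[[str], bool]]) -&gt; list[str]:
    approved: list[str] = []
    for chunk in chunks:
        draft = llm_draft(chunk.instructions)  # untrusted by default, 25%-floor assumed
        reviewer = reviewers[chunk.domain]     # routed to the domain SME, never to another model
        if not reviewer(draft):
            raise RuntimeError(f'checkpoint failed in {chunk.domain}: rework before continuing')
        approved.append(draft)                 # only SME-approved output moves forward
    return approved
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The shape matters more than the helpers: the review step is a human gate, not another model call, and a failed checkpoint stops the chain instead of letting corruption compound downstream.&lt;/p&gt;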
&lt;p&gt;The cost is real. SMEs are expensive, the workflow is slower, and the temptation to skip checkpoints when the early ones look fine is constant. The cost of the alternative - silent 25%-floor corruption layered through a long workflow, surfaced six months later when the data has propagated past the recovery window - is much higher and structurally harder to detect. The math is the math the testing series already laid out: catching corruption pre-deployment is a fraction of the cost of finding it after.&lt;/p&gt;&#xA;&lt;h2 id=&#34;when-this-doesnt-apply&#34;&gt;When this doesn&amp;rsquo;t apply&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Drafts you&amp;rsquo;ll discard.&lt;/strong&gt; Brainstorming, throwaway code, content the human will rewrite anyway. The model is generating a starting point, not delegated output.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;The user is the SME.&lt;/strong&gt; A senior database engineer using AI to draft SQL she&amp;rsquo;ll review line-by-line is using the model as autocomplete, not as delegation. The 25% is irrelevant because she&amp;rsquo;s the verification layer.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Low-stakes, recoverable work.&lt;/strong&gt; A typo in a personal email isn&amp;rsquo;t a 25% corruption event you need to system-design around.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Bounded, well-trodden problems.&lt;/strong&gt; Generating boilerplate in a popular language with a well-documented framework is the dense-distribution sweet spot. The rate is much lower because the prior is sharp.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Proof-of-concept and rapid-feedback work.&lt;/strong&gt; &amp;ldquo;Does this idea work at all,&amp;rdquo; an answer needed in minutes. The 25% floor is the right trade because the output is a directional signal, not production code; the cost of being wrong is &amp;ldquo;we tried, didn&amp;rsquo;t pan out.&amp;rdquo;&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;The article is about the rest - production work where corruption is invisible and expensive to fix, and where the team is treating the model as a co-author instead of a guess machine.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-bigger-picture&#34;&gt;The bigger picture&#xA;&lt;/h2&gt;&lt;p&gt;Calling the model &amp;ldquo;intelligence&amp;rdquo; is the framing that gets engineers in trouble. Intelligence implies a knowing entity that holds facts, checks them against evidence, and tells you when it doesn&amp;rsquo;t know. The architecture has none of those properties. It has a learned distribution and a sampling procedure. The output is a guess every time, and the guess is well-calibrated only where the training data was dense and consistent - which is precisely &lt;em&gt;not&lt;/em&gt; where your specific codebase, your specific schema, or your specific domain conventions live.&lt;/p&gt;&#xA;&lt;p&gt;The 25% floor is what that guarantees, in numbers. Versions don&amp;rsquo;t move it. Tools don&amp;rsquo;t move it. Bigger context doesn&amp;rsquo;t drive it to zero. The only thing that closes the gap between the architecture and the work is a human who actually knows the domain, checking the output against constraints the architecture can&amp;rsquo;t represent.&lt;/p&gt;&#xA;&lt;p&gt;Treat the model as what it is - a probability machine - and the engineering decisions get easier. Decompose. Checkpoint. Put the best SME you have on each domain. Build the testing layer the way &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/testing-your-database-part-1-why-ai-made-it-mandatory/&#34; &gt;the testing series&lt;/a&gt; describes. Stop expecting the next model to fix it. The corruption isn&amp;rsquo;t a bug. It&amp;rsquo;s the feature.&lt;/p&gt;&#xA;</description>
        </item></channel>
</rss>
