<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Governance on EXPLAIN ANALYZE</title>
        <link>https://explainanalyze.com/tags/governance/</link>
        <description>Recent content in Governance on EXPLAIN ANALYZE</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en-us</language>
        <lastBuildDate>Sun, 26 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://explainanalyze.com/tags/governance/index.xml" rel="self" type="application/rss+xml" /><item>
            <title>The Hello-World Procurement Problem: Why LLM Tooling Gets Bought Wrong</title>
            <link>https://explainanalyze.com/p/the-hello-world-procurement-problem-why-llm-tooling-gets-bought-wrong/</link>
            <pubDate>Sun, 26 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://explainanalyze.com/p/the-hello-world-procurement-problem-why-llm-tooling-gets-bought-wrong/</guid>
            <description>&lt;div class=&#34;tldr-box&#34;&gt;&#xA;    &lt;strong&gt;TL;DR&lt;/strong&gt;&#xA;    &lt;div&gt;A CTO declares &amp;ldquo;full agentic&amp;rdquo; off a demo. Without an SME watching the rollout, corruption ships and surfaces a year later when a customer reports a wrong number. With an SME, the job is information infrastructure first (so agents have enough context to make high-probability decisions) and guardrails for the cases where context isn&amp;rsquo;t enough.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;p&gt;A CTO sits through a vendor demo. A sales engineer types &amp;ldquo;show me the top ten customers by revenue last quarter&amp;rdquo; into a prompt and a working SQL query materializes in 30 seconds, runs against a sample dataset, returns plausible numbers. The CTO declares the company is going full agentic. Procurement closes the contract by Friday.&lt;/p&gt;&#xA;&lt;p&gt;Two paths follow. Which one the company ends up on depends on whether anyone with veto authority over rollouts can measure how wrong the model will be on the company&amp;rsquo;s actual data.&lt;/p&gt;&#xA;&lt;h2 id=&#34;without-smes&#34;&gt;Without SMEs&#xA;&lt;/h2&gt;&lt;p&gt;If the agent-generated SQL looks like gold to everyone in the room, the Rounders rule applies: if you can&amp;rsquo;t spot the sucker in your first half hour at the table, you are the sucker. 
Without someone in the room who&amp;rsquo;d catch the polysemic &lt;code&gt;tier&lt;/code&gt; column or the undocumented soft-delete convention buried in three tables, the team is approving plausibility on a system optimized to produce it.&lt;/p&gt;&#xA;&lt;p&gt;If the code the tool produces looks good to you, you&amp;rsquo;re probably not the SME.&lt;/p&gt;&#xA;&lt;p&gt;The corruption rate observed in the demo is a lower bound for what the tool produces against real data; production exceeds it, &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/corruption-is-a-feature-not-a-bug-why-llms-corrupt-by-design/&#34; &gt;often by a multiple&lt;/a&gt;. The realities catalogued in &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/what-ai-gets-wrong-about-your-database/&#34; &gt;What AI Gets Wrong About Your Database&lt;/a&gt; (undocumented conventions, polysemic columns, business logic in tribal knowledge, ten-year-old codebases with three &amp;ldquo;current&amp;rdquo; patterns) are exactly the regions of input space where the model&amp;rsquo;s training distribution is sparse and contradictory. Demos run in the dense-distribution sweet spot. Production runs the inverse on every axis.&lt;/p&gt;&#xA;&lt;p&gt;With nobody positioned to measure the gap, nothing flags it. Corruption is silent by construction. It doesn&amp;rsquo;t surface as one identifiable bug; it surfaces as drift across many places at once, traced back to LLM-generated code or queries whose authors can&amp;rsquo;t reconstruct what the model meant. By the time the rate is visible, the corruption has been propagating for weeks or months. The team has too many simultaneous issues to triage one at a time. Backups have rolled past the worst of the window.&lt;/p&gt;&#xA;&lt;p&gt;The detection mode is external. A customer reports a number that doesn&amp;rsquo;t match what they expected. An analyst running LLM-powered queries on the company&amp;rsquo;s data publishes a report that contradicts internal numbers. 
A regulator asks a question and the answer doesn&amp;rsquo;t match the previous quarter&amp;rsquo;s filing. Whatever surfaces it, the failure is now a public one, and the team learning the failure mode is the same team trying to contain it.&lt;/p&gt;&#xA;&lt;h2 id=&#34;with-smes&#34;&gt;With SMEs&#xA;&lt;/h2&gt;&lt;p&gt;The CTO&amp;rsquo;s declaration doesn&amp;rsquo;t change. The job changes. With an SME watching the rollout, the work is infrastructure first.&lt;/p&gt;&#xA;&lt;p&gt;Agents make high-probability decisions when their inputs are dense. That means the schema is documented, polysemic columns are tagged, conventions are written down somewhere the model can reach, and the dataset the agent runs against mirrors production rather than a curated subset. The realities that make a mature codebase mature (patterns evolved over years, decisions encoded in column names, exceptions buried in tribal knowledge) are exactly the inputs the agent doesn&amp;rsquo;t have unless someone puts them there. The SME&amp;rsquo;s first job is documenting what currently lives in heads. Without that, the agent operates in the sparse regions of its training distribution, and the floor on its corruption rate stays high regardless of how the harness is tuned.&lt;/p&gt;&#xA;&lt;p&gt;Guardrails are the second piece, for the cases where dense inputs still aren&amp;rsquo;t enough. Decompose work into chunks small enough to verify. Route checkpoints between chunks to the SME who owns that domain. Make audits produce a failure-rate number against ground truth, not a yes/no. Run recovery drills that test rolling back six months of LLM-generated changes, because that&amp;rsquo;s the realistic detection horizon for silent corruption. The point is to catch the cases where the agent&amp;rsquo;s confidence and its accuracy are decoupled, which is where most of the corruption lives.&lt;/p&gt;&#xA;&lt;p&gt;Both pieces have to be in place before the deployment goes wide. 
Once the rate is visible from outside, the SME bench is already triaging incidents instead of building infrastructure, and the architecture won&amp;rsquo;t grow either piece on its own.&lt;/p&gt;&#xA;&lt;h2 id=&#34;when-this-doesnt-apply&#34;&gt;When this doesn&amp;rsquo;t apply&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Small teams.&lt;/strong&gt; The buyer is the SME, or one degree away. The infrastructure question gets answered by the same person making the rollout call.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Bounded, low-stakes use cases.&lt;/strong&gt; Personal productivity tooling, draft generation, internal-only knowledge work where corruption is recoverable.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Mature vendor categories.&lt;/strong&gt; Office suites, established CI/CD platforms, well-trodden CRM tooling. The failure modes are known and the buyer has reference points. New categories are where the asymmetry lives, and that&amp;rsquo;s exactly where LLM tooling sits in 2026.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;the-bigger-picture&#34;&gt;The bigger picture&#xA;&lt;/h2&gt;&lt;p&gt;If you can&amp;rsquo;t evaluate the output, it looks great. LLM corruption is silent by construction, and silence reads as correctness to anyone without the framework to see what&amp;rsquo;s wrong. The productivity dividend the CTO booked off the demo is real. The bill arrives in the quarter a customer surfaces a number that doesn&amp;rsquo;t match the books.&lt;/p&gt;&#xA;</description>
        </item></channel>
</rss>
