Legacy Schemas Are Sediment, Not Design

Wed, 22 Apr 2026 00:00:00 +0000

TL;DR

A legacy schema looks like a design but reads like a sediment — layers of decisions from different eras, where names that once described the data no longer do and conventions that look uniform aren’t. The fix isn’t renaming (prohibitively expensive once every caller depends on the current names); it’s documenting the drift so the next reader — human or LLM — can navigate what’s actually there.

A new engineer joins the team and reads the schema. tmp_orders looks like scaffolding — something to delete once the real migration ships. The tech lead answers: never delete it. tmp_orders is the main orders table. The temp-to-permanent rename was planned for 2017, nobody shipped it, and every service in the company now writes to the table. The name is a lie the schema tells every new reader — and every LLM generating SQL against the catalog.

The obvious fix is to rename the table. Nothing about the database itself prevents it — drop the tmp_ prefix, update every call site, ship. The reality is that every service, ORM model, report, integration, and runbook references tmp_orders by name. The rename is a multi-quarter effort that crosses team boundaries, and the only justification is legibility. Teams rarely prioritize legibility work, so the name stays, and the schema keeps lying.

What’s drifted

Legacy drift shows up in three visible modes and one invisible one.

Names that stopped describing the data. tmp_ tables that are permanent. old_ columns that are current. deprecated_ fields that every write path still populates. flag1, flag2, status_code — names whose meaning was obvious when the column was added, because the person adding it remembered why. By the time a new reader arrives, the intent is gone and the name is false advertising. Comment Your Schema covers the documentation side of this; legacy schemas are the case where comments would help most and where they’re most often absent.

Conventions per era. The 2014-era backend team used camelCase. The 2019 rewrite adopted snake_case. The 2022 microservice added a third table with PascalCase because the Go team wrote it and nobody pushed back. Now one database has userId, user_id, and UserID — all referring to the same entity across different tables. The LLM that generates business.created_at when the column is actually business.createdDate isn’t wrong in any sense the schema could catch; it’s inferring a convention from one table and applying it to another, which is a reasonable thing to do in a schema that has only one convention.

Tables that were supposed to be temporary. tmp_orders is the canonical example, but every long-lived database has some. Staging tables that got promoted to production. Migration tables that weren’t cleaned up. “Phase 2” tables built for a transitional period that shipped in phase 1 and never came back to finish. The names encode the original intent; the data encodes the current reality; the two diverge a little more with every migration that preserves the name instead of fixing it.

Invisible structural drift. Charsets and collations are the version of drift that doesn’t even show up in the column list. Older tables created before the Unicode migration default to latin1; newer tables use utf8mb4. A join between a VARCHAR(100) column in one table and a VARCHAR(100) column in another — both with the same name, both with the same logical meaning — silently produces different results depending on which side’s collation MySQL picks. In the bad cases, an implicit charset conversion kills index usage and turns the query into a table scan. SHOW TABLE STATUS reveals this; reading the column list doesn’t. Most LLMs read the column list.

Why this is worse for LLMs than for humans

A new human engineer working with a legacy schema can ask. They can ping the on-call channel, look up the original migration in git, trace a column back to the PR that introduced it, or simply ask “what is flag1?” and get an answer from someone who knows. The answer is often wrong or outdated, but it’s a starting point, and the engineer learns to treat the schema with appropriate suspicion.

An LLM generating SQL from the catalog has no such recourse. It sees tmp_orders and reasons from the name — probably “this is a staging table, prefer the non-tmp version if one exists, otherwise deprioritize.” It sees old_price and treats it as historical. It sees flag1 BOOLEAN and infers a generic flag. Each inference is reasonable; each is wrong in the specific case; and the schema gives no signal that this is one of the cases where reasoning from the name produces bad SQL.

This is the sharper version of the generic id primary key problem. Both are failures of the schema to describe itself. The PK case hides what’s being matched; legacy drift hides what anything means. Neither failure shows up at write time — both produce queries that run, return data, and look plausible, because the rows exist and the types match. The wrongness is in the interpretation, which the database has no way to check.

The fix is documentation, not renaming

The obvious fix — rename everything to match intent and convention — fails on cost. Every table, column, and constraint in a mature schema is referenced by services the team has forgotten about: scheduled jobs, Redshift imports, third-party integrations, BI dashboards built by a contractor in 2019, runbooks pasted into wiki pages that nobody has edited since. A rename that looks like a one-line migration touches every surface the table is exposed on, and the projects that survive the attempt usually take a year and leave the schema worse during the transition.

The workable fix is to stop the drift from continuing and make the existing drift visible. Stopping new drift means picking a convention for new tables and columns and writing it down where CI can enforce it (Schema Conventions and Why They Matter covers the mechanics). Making existing drift visible means column and table comments on everything whose name doesn’t match its meaning, plus a per-era mapping somewhere in the repo that says “this database has four naming conventions, used in these periods, applied to these tables.” Legacy schemas are the case where COMMENT ON pays off highest — the names are already wrong, the cost of fixing them is prohibitive, and the comment is the one affordable signal the next reader gets.

COMMENT ON TABLE tmp_orders IS
  'Main orders table. The tmp_ prefix is historical — a 2017 migration was planned to rename this and was never completed. Do not drop.';

COMMENT ON COLUMN customers.flag1 IS
  'VIP customer flag. Legacy name from the 2014 schema — never renamed because of external reporting dependencies.';

One-line migrations, zero risk, and every reader — human and LLM — now has a chance of reading the schema correctly. This isn’t a fix in the sense of “problem solved.” It’s a fix in the sense of “the next reader has a chance.” The drift is structural; the documentation is how you navigate it without making it worse.

When a clean rewrite is actually worth it

Renames and migrations aren’t always wrong. Three cases where the rewrite earns its cost:

A misleading name is actively causing incidents. If tmp_orders is regularly truncated or dropped by someone who reads the name literally and acts on it, the rename cost is less than the recovery cost from the next incident. Usually the practical fix here isn’t a rename — it’s a view, synonym, or ALTER-TABLE-RENAME that exposes orders as the canonical name and leaves tmp_orders as a compatibility alias for legacy callers.

A schema migration is happening anyway. If the team is replatforming the OLTP database or splitting it across services, the rewrite opens a window where renames are cheap because callers are being updated either way. Take the opportunity; don’t schedule a separate naming cleanup six months later when the window has closed.

A database small enough that it fits one person’s head. Early-stage startups, internal tools, bounded-scope services. At twenty tables and three developers, a Saturday afternoon of renames is cheaper than a decade of comments.

In every other case, the schema is load-bearing history, and you renovate it the way you renovate a building with people still living in it: patch, document, and schedule the demolition for a window when it’s genuinely cheap.

The bigger picture

Every production schema is a compressed record of the decisions the team made under pressure. Some of those decisions were good and still fit; some were good at the time and don’t fit now; some were expedient and nobody noticed. The schema can’t tell you which is which, and it was never going to. The fix isn’t to aspire to a clean schema that doesn’t accumulate history — no such schema exists past a three-year horizon — but to leave the next reader enough signal to decompress the sediment without guessing.