
Reading the Schema Is Not Reading the Data

status TINYINT NOT NULL tells you the storage. It doesn't tell you that 1 means 'active' in one table, 'pending' in another, and 'has been processed' in a third. Or that half the tables soft-delete and the other half don't. Or that signup_date VARCHAR(10) arrives in four different formats depending on which year the row was written.

TL;DR

A schema describes the shape the database enforces; the data inside follows a second set of conventions — soft-delete coverage, sentinel values, encoding quirks, format drift — that live nowhere the catalog can show. Queries written from the DDL alone run clean and return results that look right and mean something different. The lever isn’t more schema rigor; it’s treating the data as a second source that has to be read, sampled, and documented alongside the types.

An engineer — or an AI — writes a query to find pending orders:

```sql
SELECT id, total_cents, created_at
FROM orders
WHERE status = 1
  AND created_at > NOW() - INTERVAL 7 DAY;
```

orders.status is TINYINT NOT NULL. The query runs. Forty thousand rows come back. Most of them shipped days ago. The mistake lives in the column’s other life: status on this table is a boolean is_processed flag where 1 means “has been through the fulfillment pipeline.” The order lifecycle state — pending, processing, shipped, delivered, cancelled — is in orders.state, also TINYINT NOT NULL, also uncommented, and whoever read the schema first picked the column whose name they recognized. The DDL was no help; both columns have the same type, the same nullability, and the same look in information_schema. The data was telling the real story, and the data wasn’t read.

The obvious fix is “add comments, use ENUM, lint for ambiguous names.” Each of those helps on new columns and the next migration. None of them touch the existing data, which is where the ambiguity actually lives: forty thousand rows of status = 1 that mean one thing on this table and a different thing on its sibling, ten million VARCHAR dates written by five generations of code in three formats, and a users table where rows with email = '[email protected]' have been on the leaderboard for two years. Fixing forward keeps the problem from growing. Reading the data is how you find out what’s already there.

Five ways the data disagrees with the schema

These are not the exotic cases. They show up in nearly every mature production database, and each one is a place where a schema-only read produces a plausible, wrong query.

TINYINT(1) is polysemic. It stores a boolean flag (is_active, has_seen_onboarding, email_verified), a small enum (lifecycle states, tier levels, priority), a bit-packed byte (eight flags in a single column), or a count that never exceeds 127. All four uses produce identical entries in information_schema. Naming conventions — is_*, has_*, can_* for booleans; _type, _status, _level for enums — are the informal signal, and like every informal signal, they’re applied inconsistently and broken in legacy tables. See Schema Conventions and Why They Matter for the prescriptive side; this is the descriptive reality.
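The four readings can be put side by side. A minimal sketch, with every name and value mapping hypothetical, of the same TINYINT byte interpreted each way:

```python
# One stored byte, four plausible meanings. Nothing in information_schema
# distinguishes these; only the data (and the writing code) can.
value = 5

as_flag = bool(value)  # is_active-style boolean: any nonzero reads as true
as_enum = {1: "pending", 2: "processing", 5: "cancelled"}.get(value, "unknown")
as_bitmask = [bit for bit in range(8) if value & (1 << bit)]  # which flags are set
as_count = value  # small counter that never exceeds 127

print(as_flag, as_enum, as_bitmask, as_count)  # True cancelled [0, 2] 5
```

The same 5 is "true", "cancelled", "flags 0 and 2", or "five of something" depending on a convention the catalog never records.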

Soft-delete coverage is partial. Some tables have deleted_at TIMESTAMP NULL. Some have is_deleted TINYINT(1) DEFAULT 0. Most have neither, because the original author decided the table didn’t need soft deletes and nobody revisited. A query that correctly filters WHERE deleted_at IS NULL on customers returns the right answer; the same pattern applied to addresses either errors out (column doesn’t exist) or silently matches everything (column exists but is always NULL because the application never writes to it). There’s no global rule to encode and no way to know from the catalog which tables fall in which bucket — you have to read the data. Or read the application code that writes to it, which is usually worse.
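One way to find out which bucket each table falls in is a quick audit that checks both the catalog and the rows. A sketch using sqlite3 as a stand-in for the production database; the table names and the deleted_at convention are assumptions carried over from the examples above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER, deleted_at TEXT);
CREATE TABLE addresses (id INTEGER);                   -- no soft-delete column at all
CREATE TABLE invoices  (id INTEGER, deleted_at TEXT);  -- column exists, never written
INSERT INTO customers VALUES (1, NULL), (2, '2024-06-01');
INSERT INTO addresses VALUES (1);
INSERT INTO invoices  VALUES (1, NULL), (2, NULL);
""")

def soft_delete_audit(con, tables):
    """Classify each table: no column / column never used / column in use."""
    report = {}
    for t in tables:
        cols = [row[1] for row in con.execute(f"PRAGMA table_info({t})")]
        if "deleted_at" not in cols:
            report[t] = "no soft-delete column"
        else:
            used = con.execute(
                f"SELECT COUNT(*) FROM {t} WHERE deleted_at IS NOT NULL"
            ).fetchone()[0]
            report[t] = "in use" if used else "column exists, never written"
    return report

print(soft_delete_audit(con, ["customers", "addresses", "invoices"]))
```

The third bucket is the dangerous one: the column exists, so the filter "works," and it silently matches every row.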

VARCHAR dates in multiple formats. A column called signup_date VARCHAR(10) is a tell. The first generation of rows has YYYY-MM-DD. A rewrite that switched import vendors introduced MM/DD/YYYY. An international expansion produced DD/MM/YYYY for rows that came in through a specific endpoint and DD-Mon-YYYY for one partner’s CSV imports (eleven characters, which a non-strict MySQL silently truncates to fit the column). All four formats live in the same column. WHERE signup_date >= '2025-01-01' matches the first generation correctly and matches the third generation essentially at random: '31/12/2019' sorts after '2025-01-01' and gets in, while '15/03/2025' sorts before it and gets left out. The fourth generation is scrambled the same way, because lexicographic order on Mon strings has nothing to do with date order. The query returned rows, so the reviewer moved on.
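A cheap first step before querying such a column is a format census: bucket a sample of values by regex and see how many generations are actually present. A sketch in which the sample values and the format list are hypothetical:

```python
import re
from collections import Counter

# Hypothetical sample pulled from the same VARCHAR column.
rows = ["2023-04-01", "03/15/2024", "15/03/2024", "01-Jan-2025", "2025-02-10", "garbage"]

FORMATS = [
    ("YYYY-MM-DD",  re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("xx/xx/YYYY",  re.compile(r"^\d{2}/\d{2}/\d{4}$")),  # MM/DD vs DD/MM is undecidable per row
    ("DD-Mon-YYYY", re.compile(r"^\d{2}-[A-Za-z]{3}-\d{4}$")),
]

def format_census(values):
    """Count how many rows fall into each known format bucket."""
    buckets = Counter()
    for v in values:
        label = next((name for name, rx in FORMATS if rx.match(v)), "unrecognized")
        buckets[label] += 1
    return buckets

print(format_census(rows))
```

Thirty seconds of this on a 50-row sample tells you whether a plain string comparison is safe or whether the column needs parsing first.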

Sentinel values and test data. Row with user_id = 0 means “anonymous.” Row with email = '[email protected]' is a test account that’s been in production for three years because nobody wanted to take responsibility for deleting it. Row with created_at = '1970-01-01 00:00:00' is a backfill where the original timestamp was unknown and epoch zero got written as a placeholder. Every one of these is an intentional violation of the apparent meaning of the column, and every schema-level read treats them as ordinary data. Copilot ranked DO_NOT_USE as the top customer with $99,999 in revenue because the row had the highest total; the test record had been sitting there for years, visible to anyone who queried the table but invisible to anyone who only read the DDL.

Input-convention drift. VARCHAR(255) accepts “Acme Corp,” “ACME CORPORATION,” “Acme Corp.,” “acme corp,” and “ACME CORP” (two spaces, somebody’s trailing whitespace bug). All five are the same company in different rows. The unique constraint, if it exists, didn’t catch any of them because they’re not byte-identical. Any query that groups or joins on the text field silently double-counts — not by a small amount, by however much the convention drift is worth. Encoding quirks compound: café in NFC and NFD look identical in the terminal and hash differently; case-folding depends on collation; trailing whitespace varies by source system.
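Much of this drift collapses under a single normalization pass; what survives is genuinely different text. A sketch in which the normalization steps are one reasonable choice, not a standard:

```python
import unicodedata

def normalize_key(raw: str) -> str:
    """Collapse common drift: Unicode form, whitespace, trailing punctuation, case."""
    s = unicodedata.normalize("NFC", raw)  # NFC vs NFD: 'café' has two byte encodings
    s = " ".join(s.split())                # collapse whitespace runs, strip ends
    s = s.rstrip(".")                      # 'Acme Corp.' vs 'Acme Corp'
    return s.casefold()                    # aggressive, locale-independent lowercasing

variants = ["Acme Corp", "ACME CORPORATION", "Acme Corp.", "acme corp", "ACME  CORP"]
print({normalize_key(v) for v in variants})
```

Normalization merges four of the five variants; "ACME CORPORATION" is different text, not different formatting, and merging it is an entity-resolution decision no string function should make for you.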

Why the catalog can’t tell you this

information_schema describes the contract the database enforces on writes. That contract is narrow: types, nullability, defaults, constraints, foreign keys. It doesn’t describe what got written before the constraint was added (almost all of it), what gets written by code paths that bypass the ORM (a surprising fraction of it), or what the application decided to write into a column that the database happily accepts because the type matches.

Type compatibility is a floor, not a ceiling. TINYINT NOT NULL excludes strings, NULLs, and integers outside [-128, 127]. It doesn’t exclude 1 meaning five different things in five different tables, because that’s not a type constraint — it’s a semantic one, and the database has no vocabulary for semantics. The same logic applies to NULL handling: the catalog tells you a column is nullable; it doesn’t tell you whether NULL means “unset,” “not applicable,” “still in progress,” or “data lost during the 2019 migration.”

LLMs inherit this limitation directly. A model generating SQL from the catalog sees column names and types, not data distributions. It has no way to tell that status is polysemic across tables, that deleted_at exists on four of the six relevant tables, or that signup_date has four format generations. The LLM’s best guess is the one a new engineer would make: the schema looks uniform, so the data probably is. Neither is wrong in general; both are wrong often enough in mature databases to produce plausibly shaped, semantically hollow query results. This is the generalization of the specific patterns covered in Legacy Schemas Are Sediment — legacy schemas are one source of data drift; there are others, and they’re not all legacy.

Runs clean, returns plausible, means something else

Schema-only queries fail in the quietest way a query can fail. The SQL is syntactically correct. The types match. Rows come back. Some fraction of those rows mean what the author intended, and some fraction mean something else, and there’s no signal at the database level telling you which is which. Reviewers who only look at the query text can’t catch it. The data is where the check has to happen.

The fix is a habit, not a migration

You can’t retroactively enforce a schema on ten years of writes. You can change what the next reader — human or model — has available before they generate the next query.

Profile before you query. Before writing a predicate against an unfamiliar column, run a one-liner: SELECT col, COUNT(*) FROM t GROUP BY col ORDER BY COUNT(*) DESC LIMIT 20. For low-cardinality columns (status, type, flags) this reveals the actual value distribution in thirty seconds and catches the flag-versus-enum mistake before the query ships. For higher-cardinality columns, sample: SELECT col FROM t ORDER BY RAND() LIMIT 50. The time cost is minutes; the catch rate is substantial.
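The one-liner above, wrapped in a helper and pointed at the two columns from the opening example, makes the flag-versus-enum distinction visible immediately. A sqlite3 sketch with synthetic data standing in for the real table:

```python
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (status INTEGER NOT NULL, state INTEGER NOT NULL)")
random.seed(1)
# Synthetic rows: status is secretly a boolean, state a five-value lifecycle enum.
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(random.randint(0, 1), random.randint(1, 5)) for _ in range(1000)],
)

def profile(con, table, col, limit=20):
    """Top values by frequency: the thirty-second pre-query profile."""
    return con.execute(
        f"SELECT {col}, COUNT(*) FROM {table} GROUP BY {col} "
        f"ORDER BY COUNT(*) DESC LIMIT {limit}"
    ).fetchall()

print(profile(con, "orders", "status"))  # two distinct values: almost certainly a flag
print(profile(con, "orders", "state"))   # five distinct values: almost certainly an enum
```

Two distinct values versus five is exactly the signal the DDL withholds, and it costs one query per column.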

Comment the columns the DDL can’t describe. A one-line comment on orders.status ('Boolean: 1 if order has been through fulfillment') and on orders.state ('Pending=1, Processing=2, Shipped=3, Delivered=4, Cancelled=5') is the difference between a reader who gets it right and one who guesses. Comment Your Schema covers the mechanics in full; for the flag/enum disambiguation specifically, this is the highest-leverage fix per character of effort anywhere in schema maintenance.

CHECK constraints for new values. CHECK (state IN (1,2,3,4,5)) is the forcing function for the next writer. It won’t clean up existing rows, and it won’t stop a future engineer from reaching for 6 — but it will fail loudly when they try, instead of silently accepting a value the readers of the table don’t know about. On nullable columns, CHECK (deleted_at IS NULL OR deleted_at > created_at) catches the backfill-sentinel case.
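In miniature, the loud failure looks like this; sqlite3 enforces CHECK the same way, and the table and column names are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE orders_v2 (
        id    INTEGER PRIMARY KEY,
        state INTEGER NOT NULL CHECK (state IN (1, 2, 3, 4, 5))
    )
""")
con.execute("INSERT INTO orders_v2 (state) VALUES (3)")  # documented value: accepted

try:
    con.execute("INSERT INTO orders_v2 (state) VALUES (6)")  # undocumented value
    outcome = "silently accepted"
except sqlite3.IntegrityError:
    outcome = "rejected loudly"

print(outcome)
```

The write fails at insert time, in the writer's stack trace, instead of surfacing two years later as a mystery value in someone's report.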

Migrate VARCHAR dates when you can afford it. The migration is real work — parse each row, fail loudly on unparseable formats, pick a canonical representation, backfill. Leaving VARCHAR in place guarantees the next query is written against whichever format the author happened to sample. The right-sized fix in the meantime: a comment on the column listing the known formats, and a view that exposes a parsed DATE for the queries that can tolerate loss on the unparseable rows.
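The fail-loudly parsing step can be as small as a helper that tries each known format and raises on everything else, so unparseable rows surface instead of silently skewing results. A sketch in which the format list is an assumption; note that a day-first row like '15/03/2024' deliberately fails here, because deciding what it means is a human call:

```python
from datetime import datetime

# Order matters: unambiguous formats first. '%m/%d/%Y' will still silently
# accept genuinely ambiguous rows like '05/03/2024' as May 3 -- a known limit.
KNOWN_FORMATS = ["%Y-%m-%d", "%d-%b-%Y", "%m/%d/%Y"]

def parse_signup_date(raw: str) -> str:
    """Return a canonical ISO date, or raise so bad rows surface loudly."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unparseable signup_date: {raw!r}")

print(parse_signup_date("01-Jan-2025"))  # -> 2025-01-01
print(parse_signup_date("03/15/2024"))   # -> 2024-03-15
```

Run against the full column, the raised exceptions become the worklist for the migration: every row that needs a decision announces itself instead of being quietly mis-sorted.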

Treat data profiling as part of review. When a PR adds a new query, the reviewer’s first question is “does this predicate match the data?” — which requires actually looking at the data, not just the query. For AI-assisted development this is even more load-bearing: the model generated the query from the catalog, so the human review is the only layer that can compare the query’s predicates to the column’s actual contents.

When schema-only reading is fine

Not every database carries this baggage. Three cases where the schema really is the data’s description:

Schemas designed from scratch with strict conventions. New services, greenfield tables, codebases where every column has a comment, every enum is an ENUM type, and every date column is DATE or TIMESTAMPTZ. The drift hasn’t had time to accumulate, and the conventions are enforced by linters on migrations. The failure modes described above can still show up — but they show up as bugs that get caught, not as the steady-state of the table.

Small, single-team databases. Twenty tables, three engineers, all the data flowing through one service. Everyone who writes to the table knows what the conventions are; the data drift is small because there are only three writers. The cost of the habit described above exceeds the cost of the drift it catches. Grow the team or the table count by a factor of ten and the math flips.

Analytical warehouses that expect exploration. In a BigQuery, Snowflake, or ClickHouse dataset built for analytics, everyone who queries the data profiles it as a matter of course — sample the column, check the distribution, look for nulls. The profiling habit is already the workflow; the schema is treated as a hint rather than a contract. This is the part of the data stack where reading the data is assumed, and the failure mode is correspondingly rare.

The bigger picture

A production database has two artifacts worth reading: the DDL the engine enforces, and the data the engine happens to hold. The first is legible, indexed, and comes with tooling; the second is tribal knowledge, distributed across rows written by years of code, and invisible to every tool that stops at the catalog. Everyone from new engineers to LLMs reads the first artifact and assumes it describes the second, which is true in schemas fresh enough to have no drift and false in every schema old enough to have generated any.

The lever isn’t more rigor in the DDL, though rigor on new tables pays off. It’s routine comparison between what the schema says and what the data does — sampling before querying, commenting columns whose meaning isn’t self-evident, treating data profiling as part of review rather than a debugging step. None of this is glamorous, and none of it scales to “we documented the whole schema in one sprint.” It scales the way good schema practice always has: one column at a time, on the columns that are about to be queried, until the fraction of the schema that lies to its readers is small enough to stop costing incidents.

SELECT insights FROM experience WHERE downtime = 0; -- Ruslan Tolkachev