<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>AI on EXPLAIN ANALYZE</title>
        <link>https://explainanalyze.com/tags/ai/</link>
        <description>Recent content in AI on EXPLAIN ANALYZE</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en-us</language>
        <lastBuildDate>Thu, 23 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://explainanalyze.com/tags/ai/index.xml" rel="self" type="application/rss+xml" /><item>
            <title>Reading the Schema Is Not Reading the Data</title>
            <link>https://explainanalyze.com/p/reading-the-schema-is-not-reading-the-data/</link>
            <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://explainanalyze.com/p/reading-the-schema-is-not-reading-the-data/</guid>
            <description>&lt;div class=&#34;tldr-box&#34;&gt;&#xA;    &lt;strong&gt;TL;DR&lt;/strong&gt;&#xA;    &lt;div&gt;A schema describes the shape the database enforces; the data inside follows a second set of conventions — soft-delete coverage, sentinel values, encoding quirks, format drift — that live nowhere the catalog can show. Queries written from the DDL alone run clean and return results that look right and mean something different. The lever isn&amp;rsquo;t more schema rigor; it&amp;rsquo;s treating the data as a second source that has to be read, sampled, and documented alongside the types.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;p&gt;An engineer — or an AI — writes a query to find pending orders:&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;SELECT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;total_cents&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;created_at&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span 
class=&#34;k&#34;&gt;FROM&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;orders&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;WHERE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;status&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;AND&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;created_at&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NOW&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;INTERVAL&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;7&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;DAY&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;&lt;code&gt;orders.status&lt;/code&gt; is &lt;code&gt;TINYINT NOT NULL&lt;/code&gt;. The query runs. Forty thousand rows come back. Most of them shipped days ago. 
The mistake lives in the column&amp;rsquo;s other life: &lt;code&gt;status&lt;/code&gt; on this table is a boolean &lt;code&gt;is_processed&lt;/code&gt; flag where &lt;code&gt;1&lt;/code&gt; means &amp;ldquo;has been through the fulfillment pipeline.&amp;rdquo; The order lifecycle state — pending, processing, shipped, delivered, cancelled — is in &lt;code&gt;orders.state&lt;/code&gt;, also &lt;code&gt;TINYINT NOT NULL&lt;/code&gt;, also no comments, and whoever read the schema first picked the column whose name they recognized. The DDL was no help; both columns have the same type, the same nullability, and the same look in &lt;code&gt;information_schema&lt;/code&gt;. The data was telling the real story, and the data wasn&amp;rsquo;t read.&lt;/p&gt;&#xA;&lt;p&gt;The obvious fix is &amp;ldquo;add comments, use ENUM, lint for ambiguous names.&amp;rdquo; Each of those helps on new columns and the next migration. None of them touch the existing data, which is where the ambiguity actually lives: forty thousand rows of &lt;code&gt;status = 1&lt;/code&gt; that mean one thing on this table and a different thing on its sibling, ten million VARCHAR dates written by five generations of code in three formats, and a &lt;code&gt;users&lt;/code&gt; table where rows with &lt;code&gt;email = &#39;DO_NOT_USE@test.com&#39;&lt;/code&gt; have been on the leaderboard for two years. Fixing forward keeps the problem from growing. Reading the data is how you find out what&amp;rsquo;s already there.&lt;/p&gt;&#xA;&lt;h2 id=&#34;four-ways-the-data-disagrees-with-the-schema&#34;&gt;Four ways the data disagrees with the schema&#xA;&lt;/h2&gt;&lt;p&gt;These are not the exotic cases. 
They show up in nearly every mature production database, and each one is a place where a schema-only read produces a plausible, wrong query.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;&lt;code&gt;TINYINT(1)&lt;/code&gt; is polysemic.&lt;/strong&gt; It stores a boolean flag (&lt;code&gt;is_active&lt;/code&gt;, &lt;code&gt;has_seen_onboarding&lt;/code&gt;, &lt;code&gt;email_verified&lt;/code&gt;), a small enum (lifecycle states, tier levels, priority), a bit-packed byte (eight flags in a single column), or a count that never exceeds 127. All four uses produce identical entries in &lt;code&gt;information_schema&lt;/code&gt;. Naming conventions — &lt;code&gt;is_*&lt;/code&gt;, &lt;code&gt;has_*&lt;/code&gt;, &lt;code&gt;can_*&lt;/code&gt; for booleans; &lt;code&gt;_type&lt;/code&gt;, &lt;code&gt;_status&lt;/code&gt;, &lt;code&gt;_level&lt;/code&gt; for enums — are the informal signal, and like every informal signal, they&amp;rsquo;re applied inconsistently and broken in legacy tables. See &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/schema-conventions-dont-survive-without-automation/&#34; &gt;Schema Conventions and Why They Matter&lt;/a&gt; for the prescriptive side; this is the descriptive reality.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Soft-delete coverage is partial.&lt;/strong&gt; Some tables have &lt;code&gt;deleted_at TIMESTAMP NULL&lt;/code&gt;. Some have &lt;code&gt;is_deleted TINYINT(1) DEFAULT 0&lt;/code&gt;. Most have neither, because the original author decided the table didn&amp;rsquo;t need soft deletes and nobody revisited. A query that correctly filters &lt;code&gt;WHERE deleted_at IS NULL&lt;/code&gt; on &lt;code&gt;customers&lt;/code&gt; returns the right answer; the same pattern applied to &lt;code&gt;addresses&lt;/code&gt; either errors out (column doesn&amp;rsquo;t exist) or silently matches everything (column exists but is always NULL because the application never writes to it). 
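&lt;p&gt;One way to map the coverage (a sketch for MySQL, reusing the &lt;code&gt;addresses&lt;/code&gt; table from above) is to ask the catalog which tables carry the column at all, then ask the data whether anything ever writes it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-sql&#34;&gt;-- Which tables have a soft-delete column in the DDL?
SELECT TABLE_NAME, COLUMN_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND COLUMN_NAME IN (&#39;deleted_at&#39;, &#39;is_deleted&#39;);

-- Per hit: does the application ever set it?
SELECT COUNT(*) AS total_rows,
       COUNT(deleted_at) AS soft_deleted  -- zero means the column is dead weight
FROM addresses;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first query only gives candidates; the second is the one that tells you which bucket the table is actually in.&lt;/p&gt;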
There&amp;rsquo;s no global rule to encode and no way to know from the catalog which tables fall in which bucket — you have to read the data. Or read the application code that writes to it, which is usually worse.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;VARCHAR dates in multiple formats.&lt;/strong&gt; A column called &lt;code&gt;signup_date VARCHAR(10)&lt;/code&gt; is a tell. The first generation of rows has &lt;code&gt;YYYY-MM-DD&lt;/code&gt;. A rewrite that switched import vendors introduced &lt;code&gt;MM/DD/YYYY&lt;/code&gt;. An international expansion produced &lt;code&gt;DD/MM/YYYY&lt;/code&gt; for rows that came in through a specific endpoint and &lt;code&gt;DD-Mon-YYYY&lt;/code&gt; for one partner&amp;rsquo;s CSV imports. All four formats live in the same column. &lt;code&gt;WHERE signup_date &amp;gt;= &#39;2025-01-01&#39;&lt;/code&gt; matches the first generation correctly, matches the third generation backwards (&amp;ldquo;2025-01-01&amp;rdquo; sorts before &amp;ldquo;15/03/2024&amp;rdquo;), and misses the fourth entirely because the sort order doesn&amp;rsquo;t touch &lt;code&gt;Mon&lt;/code&gt; strings. The query returned rows, so the reviewer moved on.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Sentinel values and test data.&lt;/strong&gt; Row with &lt;code&gt;user_id = 0&lt;/code&gt; means &amp;ldquo;anonymous.&amp;rdquo; Row with &lt;code&gt;email = &#39;DO_NOT_USE@test.com&#39;&lt;/code&gt; is a test account that&amp;rsquo;s been in production for three years because nobody wanted to take responsibility for deleting it. Row with &lt;code&gt;created_at = &#39;1970-01-01 00:00:00&#39;&lt;/code&gt; is a backfill where the original timestamp was unknown and epoch zero got written as a placeholder. Every one of these is an intentional violation of the apparent meaning of the column, and every schema-level read treats them as ordinary data. 
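&lt;p&gt;Sentinels are cheap to sweep for once the usual suspects are known. A sketch, assuming the placeholder values above live on &lt;code&gt;orders&lt;/code&gt; and &lt;code&gt;users&lt;/code&gt; as in the examples:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-sql&#34;&gt;-- MySQL: SUM over a boolean expression counts matching rows
SELECT SUM(user_id = 0) AS anonymous_rows,
       SUM(created_at = &#39;1970-01-01 00:00:00&#39;) AS epoch_backfills
FROM orders;

SELECT COUNT(*) AS test_accounts
FROM users
WHERE email LIKE &#39;%@test.com&#39;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nonzero counts are rows that every aggregate over the table is silently including.&lt;/p&gt;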
Copilot ranked &lt;code&gt;DO_NOT_USE&lt;/code&gt; as the top customer with $99,999 in revenue because the row had the highest total; the test record had been sitting there for years, visible to anyone who queried the table but invisible to anyone who only read the DDL.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Input-convention drift.&lt;/strong&gt; &lt;code&gt;VARCHAR(255)&lt;/code&gt; accepts &amp;ldquo;Acme Corp,&amp;rdquo; &amp;ldquo;ACME CORPORATION,&amp;rdquo; &amp;ldquo;Acme Corp.,&amp;rdquo; &amp;ldquo;acme corp,&amp;rdquo; and &amp;ldquo;ACME  CORP&amp;rdquo; (two spaces, somebody&amp;rsquo;s trailing whitespace bug). All five are the same company in different rows. The unique constraint, if it exists, didn&amp;rsquo;t catch any of them because they&amp;rsquo;re not byte-identical. Any query that groups or joins on the text field silently double-counts — not by a small amount, by however much the convention drift is worth. Encoding quirks compound: &lt;code&gt;café&lt;/code&gt; in NFC and NFD look identical in the terminal and hash differently; case-folding depends on collation; trailing whitespace varies by source system.&lt;/p&gt;&#xA;&lt;h2 id=&#34;why-the-catalog-cant-tell-you-this&#34;&gt;Why the catalog can&amp;rsquo;t tell you this&#xA;&lt;/h2&gt;&lt;p&gt;&lt;code&gt;information_schema&lt;/code&gt; describes the contract the database enforces on writes. That contract is narrow: types, nullability, defaults, constraints, foreign keys. It doesn&amp;rsquo;t describe what got written before the constraint was added (almost all of it), what gets written by code paths that bypass the ORM (a surprising fraction of it), or what the application decided to write into a column that the database happily accepts because the type matches.&lt;/p&gt;&#xA;&lt;p&gt;Type compatibility is a floor, not a ceiling. &lt;code&gt;TINYINT NOT NULL&lt;/code&gt; excludes strings, NULLs, and integers outside &lt;code&gt;[-128, 127]&lt;/code&gt;. 
It doesn&amp;rsquo;t exclude &lt;code&gt;1&lt;/code&gt; meaning five different things in five different tables, because that&amp;rsquo;s not a type constraint — it&amp;rsquo;s a semantic one, and the database has no vocabulary for semantics. The same logic applies to &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/null-in-sql-three-valued-logic-and-the-silent-bug-factory/&#34; &gt;NULL handling&lt;/a&gt;: the catalog tells you a column is nullable; it doesn&amp;rsquo;t tell you whether NULL means &amp;ldquo;unset,&amp;rdquo; &amp;ldquo;not applicable,&amp;rdquo; &amp;ldquo;still in progress,&amp;rdquo; or &amp;ldquo;data lost during the 2019 migration.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;LLMs inherit this limitation directly. A model generating SQL from the catalog sees column names and types, not data distributions. It has no way to tell that &lt;code&gt;status&lt;/code&gt; is polysemic across tables, that &lt;code&gt;deleted_at&lt;/code&gt; exists on four of the six relevant tables, or that &lt;code&gt;signup_date&lt;/code&gt; has three format generations. The LLM&amp;rsquo;s best guess is the one a new engineer would make: the schema looks uniform, so the data probably is. Neither is wrong in general; both are wrong often enough in mature databases to produce plausibly-shaped and semantically-hollow query results. This is the generalization of the specific patterns covered in &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/legacy-schemas-are-sediment-not-design/&#34; &gt;Legacy Schemas Are Sediment&lt;/a&gt; — legacy schemas are one source of data drift; there are others, and they&amp;rsquo;re not all legacy.&lt;/p&gt;&#xA;&lt;div class=&#34;warning-box&#34;&gt;&#xA;    &lt;strong&gt;Runs clean, returns plausible, means something else&lt;/strong&gt;&#xA;    &lt;div&gt;Schema-only queries fail in the quietest way a query can fail. The SQL is syntactically correct. The types match. Rows come back. 
Some fraction of those rows mean what the author intended, and some fraction mean something else, and there&amp;rsquo;s no signal at the database level telling you which is which. Reviewers who only look at the query text can&amp;rsquo;t catch it. The data is where the check has to happen.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;h2 id=&#34;the-fix-is-a-habit-not-a-migration&#34;&gt;The fix is a habit, not a migration&#xA;&lt;/h2&gt;&lt;p&gt;You can&amp;rsquo;t retroactively enforce a schema on ten years of writes. You can change what the next reader — human or model — has available before they generate the next query.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Profile before you query.&lt;/strong&gt; Before writing a predicate against an unfamiliar column, run a one-liner: &lt;code&gt;SELECT col, COUNT(*) FROM t GROUP BY col ORDER BY COUNT(*) DESC LIMIT 20&lt;/code&gt;. For low-cardinality columns (status, type, flags) this reveals the actual value distribution in thirty seconds and catches the flag-versus-enum mistake before the query ships. For higher-cardinality columns, sample: &lt;code&gt;SELECT col FROM t ORDER BY RAND() LIMIT 50&lt;/code&gt;. The time cost is minutes; the catch rate is substantial.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Comment the columns the DDL can&amp;rsquo;t describe.&lt;/strong&gt; A one-line comment on &lt;code&gt;orders.status&lt;/code&gt; (&lt;code&gt;&#39;Boolean: 1 if order has been through fulfillment&#39;&lt;/code&gt;) and on &lt;code&gt;orders.state&lt;/code&gt; (&lt;code&gt;&#39;Pending=1, Processing=2, Shipped=3, Delivered=4, Cancelled=5&#39;&lt;/code&gt;) is the difference between a reader who gets it right and one who guesses. 
&lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/comment-your-schema/&#34; &gt;Comment Your Schema&lt;/a&gt; covers the mechanics in full; for the flag/enum disambiguation specifically, this is the highest-leverage fix per character of effort anywhere in schema maintenance.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;CHECK constraints for new values.&lt;/strong&gt; &lt;code&gt;CHECK (status IN (1,2,3,4,5))&lt;/code&gt; is the forcing function for the next writer. It won&amp;rsquo;t clean up existing rows, and it won&amp;rsquo;t stop a future engineer from reaching for &lt;code&gt;6&lt;/code&gt; — but it will fail loudly when they try, instead of silently accepting a value the readers of the table don&amp;rsquo;t know about. On nullable columns, &lt;code&gt;CHECK (deleted_at IS NULL OR deleted_at &amp;gt; created_at)&lt;/code&gt; catches the backfill-sentinel case.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Migrate VARCHAR dates when you can afford it.&lt;/strong&gt; The migration is real work — parse each row, fail loudly on unparseable formats, pick a canonical representation, backfill. Leaving VARCHAR in place guarantees the next query is written against whichever format the author happened to sample. The right-sized fix in the meantime: a comment on the column listing the known formats, and a view that exposes a parsed &lt;code&gt;DATE&lt;/code&gt; for the queries that can tolerate loss on the unparseable rows.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Treat data profiling as part of review.&lt;/strong&gt; When a PR adds a new query, the reviewer&amp;rsquo;s first question is &amp;ldquo;does this predicate match the data?&amp;rdquo; — which requires actually looking at the data, not just the query. 
For AI-assisted development this is even more load-bearing: the model generated the query from the catalog, so the human review is the only layer that can compare the query&amp;rsquo;s predicates to the column&amp;rsquo;s actual contents.&lt;/p&gt;&#xA;&lt;h2 id=&#34;when-schema-only-reading-is-fine&#34;&gt;When schema-only reading is fine&#xA;&lt;/h2&gt;&lt;p&gt;Not every database carries this baggage. Three cases where the schema really is the data&amp;rsquo;s description:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Schemas designed from scratch with strict conventions.&lt;/strong&gt; New services, greenfield tables, codebases where every column has a comment, every enum is an ENUM type, and every date column is &lt;code&gt;DATE&lt;/code&gt; or &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;. The drift hasn&amp;rsquo;t had time to accumulate, and the conventions are enforced by linters on migrations. The failure modes described above can still show up — but they show up as bugs that get caught, not as the steady-state of the table.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Small, single-team databases.&lt;/strong&gt; Twenty tables, three engineers, all the data flowing through one service. Everyone who writes to the table knows what the conventions are; the data drift is small because there are only three writers. The cost of the habit described above exceeds the cost of the drift it catches. Grow the team or the table count by a factor of ten and the math flips.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Analytical warehouses that expect exploration.&lt;/strong&gt; In a BigQuery, Snowflake, or ClickHouse dataset built for analytics, everyone who queries the data profiles it as a matter of course — sample the column, check the distribution, look for nulls. The profiling habit is already the workflow; the schema is treated as a hint rather than a contract. 
This is the part of the data stack where reading the data is assumed, and the failure mode is correspondingly rare.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-bigger-picture&#34;&gt;The bigger picture&#xA;&lt;/h2&gt;&lt;p&gt;A production database has two artifacts worth reading: the DDL the engine enforces, and the data the engine happens to hold. The first is legible, indexed, and comes with tooling; the second is tribal knowledge, distributed across rows written by years of code, and invisible to every tool that stops at the catalog. Everyone from new engineers to LLMs reads the first artifact and assumes it describes the second, which is true in schemas fresh enough to have no drift and false in every schema old enough to have generated any.&lt;/p&gt;&#xA;&lt;p&gt;The lever isn&amp;rsquo;t more rigor in the DDL, though rigor on new tables pays off. It&amp;rsquo;s routine comparison between what the schema says and what the data does — sampling before querying, commenting columns whose meaning isn&amp;rsquo;t self-evident, treating data profiling as part of review rather than a debugging step. None of this is glamorous, and none of it scales to &amp;ldquo;we documented the whole schema in one sprint.&amp;rdquo; It scales the way good schema practice always has: one column at a time, on the columns that are about to be queried, until the fraction of the schema that lies to its readers is small enough to stop costing incidents.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>TEXT and JSON Columns: Where the Schema Goes to Hide</title>
            <link>https://explainanalyze.com/p/text-and-json-columns-where-the-schema-goes-to-hide/</link>
            <pubDate>Thu, 23 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://explainanalyze.com/p/text-and-json-columns-where-the-schema-goes-to-hide/</guid>
            <description>&lt;div class=&#34;tldr-box&#34;&gt;&#xA;    &lt;strong&gt;TL;DR&lt;/strong&gt;&#xA;    &lt;div&gt;A &lt;code&gt;TEXT&lt;/code&gt; or &lt;code&gt;JSON&lt;/code&gt; column moves the schema out of the database catalog and into application code — the data inside has a shape, but the DDL won&amp;rsquo;t tell you what it is. Readers can&amp;rsquo;t query into it without knowing the format, planners can&amp;rsquo;t reason about it, and the shape drifts across years of writes with no signal to the next reader. The fix isn&amp;rsquo;t &amp;ldquo;don&amp;rsquo;t use JSON&amp;rdquo;; it&amp;rsquo;s to promote the fields that actually get queried into real columns and treat the rest as genuinely opaque.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;p&gt;An AI assistant is asked to &amp;ldquo;find customers who upgraded to enterprise in the last quarter.&amp;rdquo; It reads the catalog, finds &lt;code&gt;api_logs(id, endpoint VARCHAR, payload LONGTEXT, created_at DATETIME)&lt;/code&gt;, and generates the reasonable query:&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span 
class=&#34;k&#34;&gt;SELECT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;JSON_EXTRACT&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;payload&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;$.action&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;AS&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;action&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;created_at&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;FROM&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;api_logs&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;WHERE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;JSON_EXTRACT&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;payload&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;$.action&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;upgrade&amp;#39;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  
&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;AND&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;JSON_EXTRACT&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;payload&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;$.plan&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;   &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;enterprise&amp;#39;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;AND&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;created_at&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;&amp;gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NOW&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;INTERVAL&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;90&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;DAY&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;Runs clean. Returns zero rows. 
The actual key was renamed from &lt;code&gt;action&lt;/code&gt; to &lt;code&gt;event.type&lt;/code&gt; two years ago when the team adopted a shared event schema — new rows match &lt;code&gt;$.event.type&lt;/code&gt;, old rows still match &lt;code&gt;$.action&lt;/code&gt;, and no one migrated the historical data because it wasn&amp;rsquo;t queryable anyway. Neither column nor catalog said any of this. The query is syntactically perfect, semantically correct for the key it guessed, and wrong because the key doesn&amp;rsquo;t exist in most of the rows.&lt;/p&gt;&#xA;&lt;p&gt;The obvious fix is &amp;ldquo;switch to JSONB, validate with a JSON schema, add a GIN index.&amp;rdquo; Each one helps at the margin and none of them close the gap. JSONB tells you the blob is valid JSON, not what keys are in it. CHECK constraints with &lt;code&gt;JSON_SCHEMA_VALID&lt;/code&gt; or &lt;code&gt;jsonb_matches_schema&lt;/code&gt; work prospectively, but the six years of rows already in the table were written against five format generations and no validator reaches back in time. A GIN index accelerates key lookups but only if you know which keys to look up. The problem isn&amp;rsquo;t the storage format — it&amp;rsquo;s that the schema emigrated to application code, and changing the column type doesn&amp;rsquo;t bring it back.&lt;/p&gt;&#xA;&lt;h2 id=&#34;what-leaves-the-catalog-when-the-column-becomes-a-blob&#34;&gt;What leaves the catalog when the column becomes a blob&#xA;&lt;/h2&gt;&lt;p&gt;DDL is the contract between the database and everything that reads it. 
A typed column says &amp;ldquo;this value is an integer between −2³¹ and 2³¹−1, and here&amp;rsquo;s the index I&amp;rsquo;ve built over it.&amp;rdquo; A &lt;code&gt;TEXT&lt;/code&gt; or &lt;code&gt;JSON&lt;/code&gt; column says &amp;ldquo;this value is a string the application decided on, and the application can tell you what that means.&amp;rdquo; The second contract is thinner in ways that compound.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Readers can&amp;rsquo;t discover the shape from the schema.&lt;/strong&gt; &lt;code&gt;information_schema.COLUMNS&lt;/code&gt; for a JSON column returns &lt;code&gt;COLUMN_TYPE = &#39;json&#39;&lt;/code&gt; and nothing else. Every tool that reads catalog metadata — MCP servers, ERD generators, typed-client code generators, AI assistants, new engineers running &lt;code&gt;\d+&lt;/code&gt; — sees a blob. The shape lives in the serializer class, the protobuf definition, the TypeScript interface, or nowhere. Whichever of those the reader happens to find is the shape they&amp;rsquo;ll assume. See &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/comment-your-schema/&#34; &gt;Comment Your Schema&lt;/a&gt; for the lowest-effort way to leave a trail, but comments can describe the shape; they can&amp;rsquo;t make the catalog enforce it.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Generational drift is silent.&lt;/strong&gt; Year one the payload is &lt;code&gt;{action, user}&lt;/code&gt;. A migration adds nested metadata: &lt;code&gt;{action, user, metadata: {source}}&lt;/code&gt;. A rewrite flattens and renames: &lt;code&gt;{event: {type, user_id}, source}&lt;/code&gt;. A new service standardizes with a version field: &lt;code&gt;{version: 3, event: {...}}&lt;/code&gt;. All four versions are sitting in the same column with nothing to distinguish them at read time except the keys they happen to have. A JSON_EXTRACT path written against today&amp;rsquo;s producer hits the newest generation and silently misses the older ones. 
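&lt;p&gt;Before writing a path, it is worth asking the column which generations it actually holds. A sketch in MySQL 8.0 against the &lt;code&gt;api_logs&lt;/code&gt; table from the opening example (&lt;code&gt;JSON_KEYS&lt;/code&gt; returns NULL for non-object payloads):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-sql&#34;&gt;-- One row per top-level key shape, with the window each shape was live
SELECT CAST(JSON_KEYS(payload) AS CHAR(100)) AS key_shape,
       COUNT(*) AS row_count,
       MIN(created_at) AS first_seen,
       MAX(created_at) AS last_seen
FROM api_logs
GROUP BY key_shape
ORDER BY row_count DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each distinct shape is a generation, and the &lt;code&gt;first_seen&lt;/code&gt;/&lt;code&gt;last_seen&lt;/code&gt; window dates the migration that produced it.&lt;/p&gt;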
The failure mode is exactly the one described in &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/legacy-schemas-are-sediment-not-design/&#34; &gt;Legacy Schemas Are Sediment&lt;/a&gt;: the schema&amp;rsquo;s history is compressed into the data, and the data can&amp;rsquo;t decompress itself.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Writes are untyped.&lt;/strong&gt; Without CHECK constraints or a JSON-schema validator, the writer is the only guardrail. A service deployed last Tuesday that emits &lt;code&gt;amount&lt;/code&gt; as the string &lt;code&gt;&amp;quot;9900&amp;quot;&lt;/code&gt; instead of the integer &lt;code&gt;9900&lt;/code&gt; silently poisons the column — downstream queries comparing &lt;code&gt;amount &amp;gt; 1000&lt;/code&gt; work on new rows and misbehave on the poisoned batch, because JSON-extract returns a string and the comparison is lexicographic. The same class of mismatch a typed column would reject on INSERT.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;The planner is working blind.&lt;/strong&gt; Row-count estimates on &lt;code&gt;JSON_EXTRACT(payload, &#39;$.event.type&#39;) = &#39;upgrade&#39;&lt;/code&gt; have no histogram to consult; the planner falls back to a default selectivity estimate that&amp;rsquo;s usually wrong. Plans for queries filtered on JSON fields are routinely pessimistic or optimistic by an order of magnitude, and there&amp;rsquo;s no &lt;code&gt;ANALYZE&lt;/code&gt; to fix that because the statistics don&amp;rsquo;t exist for the interior of the blob.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Indexes are per-key, not per-column.&lt;/strong&gt; A functional index on &lt;code&gt;JSON_EXTRACT(payload, &#39;$.event.type&#39;)&lt;/code&gt; accelerates one path. The next query filters on &lt;code&gt;$.source&lt;/code&gt; and scans the table. 
Generated columns are the cleaner version of this — &lt;code&gt;payload_event_type VARCHAR(50) GENERATED ALWAYS AS (JSON_EXTRACT(payload, &#39;$.event.type&#39;)) STORED&lt;/code&gt; — but each one is a schema change with a backfill, and you have to know in advance which keys matter. GIN indexes on JSONB cover arbitrary keys but are large, slow to update, and still don&amp;rsquo;t tell the reader what keys exist.&lt;/p&gt;&#xA;&lt;div class=&#34;warning-box&#34;&gt;&#xA;    &lt;strong&gt;Untyped writes &amp;#43; untyped reads = silent schema drift&lt;/strong&gt;&#xA;    &lt;div&gt;A TEXT or JSON column accepts anything the writer emits and returns exactly that on read. Two services writing to the same column with slightly different shapes don&amp;rsquo;t conflict at the database level — they just produce a column whose contents depend on which service wrote the row. The divergence is invisible until a query tries to read uniformly across both.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;h2 id=&#34;plausible-paths-empty-results&#34;&gt;Plausible paths, empty results&#xA;&lt;/h2&gt;&lt;p&gt;Schema-reading LLMs generate JSON_EXTRACT paths the same way they generate column names in a typed schema — by pattern-matching the column name and the question. Asked about &amp;ldquo;upgrade actions,&amp;rdquo; the model guesses &lt;code&gt;$.action = &#39;upgrade&#39;&lt;/code&gt; because the English-to-JSON-path mapping is obvious. It has no way to know that the key was renamed, that three generations coexist, or that the canonical name is now buried under two layers of nesting. The catalog gives it a column type of &lt;code&gt;json&lt;/code&gt; and nothing else, and the model&amp;rsquo;s best guess is reasonable and wrong.&lt;/p&gt;&#xA;&lt;p&gt;The failure pattern is familiar from other schema-hiding designs. 
&lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/polymorphic-references-are-not-foreign-keys/&#34; &gt;Polymorphic references&lt;/a&gt; hide which table a foreign-key-shaped column points at; &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/the-bare-id-primary-key-when-every-table-joins-to-every-other-table/&#34; &gt;bare &lt;code&gt;id&lt;/code&gt; primary keys&lt;/a&gt; hide which identifier is being compared; TEXT/JSON columns hide what&amp;rsquo;s in the column at all. All three are cases where the LLM generates a plausible query against a schema that isn&amp;rsquo;t telling it enough, and the query returns plausibly-shaped but semantically empty results.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-fix-and-where-it-stops-being-free&#34;&gt;The fix, and where it stops being free&#xA;&lt;/h2&gt;&lt;p&gt;The lever isn&amp;rsquo;t &amp;ldquo;avoid JSON&amp;rdquo; — which is both impractical and sometimes wrong — it&amp;rsquo;s to be honest about what&amp;rsquo;s inside and pick the right storage per field.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Promote fields that get queried.&lt;/strong&gt; If the application filters on &lt;code&gt;event.type&lt;/code&gt; more than occasionally, that&amp;rsquo;s a real column. 
Generated columns are the low-friction middle path: derive a typed, indexable column from the JSON, keep the raw payload as the audit trail.&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;ALTER&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;TABLE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;api_logs&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ADD&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;COLUMN&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;event_type&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;VARCHAR&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;50&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;GENERATED&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ALWAYS&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span 
class=&#34;k&#34;&gt;AS&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;JSON_UNQUOTE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;JSON_EXTRACT&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;payload&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;$.event.type&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)))&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;STORED&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ADD&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;INDEX&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;idx_event_type&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;event_type&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;The trade-off: every promoted field is a migration, and generated columns don&amp;rsquo;t retroactively rewrite rows written with a different shape — you still need the &lt;code&gt;COALESCE(JSON_EXTRACT(payload, &#39;$.event.type&#39;), JSON_EXTRACT(payload, &#39;$.action&#39;))&lt;/code&gt; cleanup for the old generations, and you&amp;rsquo;re doing 
that exactly once as part of the promotion rather than in every query.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Enforce new writes with a JSON schema.&lt;/strong&gt; PostgreSQL&amp;rsquo;s &lt;code&gt;pg_jsonschema&lt;/code&gt; and MySQL 8.0&amp;rsquo;s &lt;code&gt;JSON_SCHEMA_VALID&lt;/code&gt; let a CHECK constraint reject writes that don&amp;rsquo;t match a named schema. Doesn&amp;rsquo;t fix existing rows; does stop the next silent format change from landing. If the team doesn&amp;rsquo;t already have a shared event schema, a CHECK constraint is the forcing function that produces one.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Version the payload explicitly.&lt;/strong&gt; &lt;code&gt;{&amp;quot;version&amp;quot;: 3, &amp;quot;payload&amp;quot;: {...}}&lt;/code&gt; at the top lets every reader dispatch on version instead of inferring it from which keys happen to be present. Doesn&amp;rsquo;t help rows written before versioning started, but bounds the drift going forward and turns &amp;ldquo;which generation is this row?&amp;rdquo; from archaeology into a lookup.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Document what stays inside.&lt;/strong&gt; Comments on the column — &amp;ldquo;see github.com/org/events for the schema; versions 1–3 coexist in rows older than 2024-Q2&amp;rdquo; — won&amp;rsquo;t replace types, but they give the reader a place to look. 
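&lt;/p&gt;&#xA;&lt;p&gt;What that looks like concretely, as a sketch; &lt;code&gt;api_logs&lt;/code&gt; is the running example table, and the wording mirrors the comment above:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code class=&#34;language-sql&#34;&gt;-- PostgreSQL syntax; in MySQL the same text goes in&#xA;-- ALTER TABLE api_logs MODIFY payload JSON COMMENT &#39;...&#39;;&#xA;COMMENT ON COLUMN api_logs.payload IS&#xA;  &#39;See github.com/org/events for the schema; versions 1-3 coexist in rows older than 2024-Q2.&#39;;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;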
&lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/comment-your-schema/&#34; &gt;Comments on the schema&lt;/a&gt; are cheap, in-place, and propagate through every tool that reads the catalog; for genuinely-opaque columns this is the best available signal.&lt;/p&gt;&#xA;&lt;h2 id=&#34;when-json-is-actually-the-right-answer&#34;&gt;When JSON is actually the right answer&#xA;&lt;/h2&gt;&lt;p&gt;The pattern earns its keep in specific shapes where the alternative — typed columns — is worse.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Truly variable shape per row.&lt;/strong&gt; User-supplied settings blobs, custom-field configurations, extension points where the keys are genuinely per-tenant or per-user. Modeling each variant as a column produces a wide table full of NULLs; see &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/god-tables-150-columns-and-the-quiet-cost-of-just-add-a-column/&#34; &gt;God Tables&lt;/a&gt; for the cost of that direction. The column is honest about being schemaless because the data is schemaless.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Audit payloads nobody queries.&lt;/strong&gt; Raw API request/response bodies retained for compliance, debug traces, incident forensics. Written once, read by humans one row at a time, never aggregated. The lack of a queryable schema is fine because no query needs one. A sensible default here is to keep the payload compressed and add a small set of typed columns (&lt;code&gt;endpoint&lt;/code&gt;, &lt;code&gt;status_code&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;) for the predicates the operational queries actually use.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Short-lived staging.&lt;/strong&gt; Job queues, idempotency cache payloads, outbox entries — where the producer and consumer are deployed together, the payload is read once, and the row is deleted on completion. 
Drift can&amp;rsquo;t accumulate in rows that don&amp;rsquo;t stay around.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Document stores on purpose.&lt;/strong&gt; PostgreSQL JSONB with a stable schema, validated on write, with functional indexes on the paths that matter. This is a real design; it&amp;rsquo;s not the unspoken default that most TEXT columns represent. If the team is reaching for JSONB and treating it as a document store, it should look like one — with validation, indexes, and documentation — not like a TEXT column that happens to parse.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-bigger-picture&#34;&gt;The bigger picture&#xA;&lt;/h2&gt;&lt;p&gt;A TEXT or JSON column is a specific architectural choice: move part of the schema out of the catalog, in exchange for cheaper writes and looser contracts between producer and consumer. When the trade is deliberate — genuinely variable data, write-once audit, short-lived buffer — it&amp;rsquo;s the correct shape. When it&amp;rsquo;s the path of least resistance because typed columns would require a migration, the cost is deferred to every future reader who has to reconstruct the format from commit history.&lt;/p&gt;&#xA;&lt;p&gt;Databases are good at enforcing the contracts they know about. The column types are how they know. Every field that matters to a query deserves to be in the part of the schema the database can see; everything else is honestly opaque and should look it. The default drift — &amp;ldquo;stick it in the payload, we&amp;rsquo;ll parse it later&amp;rdquo; — produces columns whose contents nobody fully knows, including the team that wrote them, and the cost is paid in the form of queries that return plausible answers to questions the data can&amp;rsquo;t actually answer.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>God Tables: 150 Columns and the Quiet Cost of &#39;Just Add a Column&#39;</title>
            <link>https://explainanalyze.com/p/god-tables-150-columns-and-the-quiet-cost-of-just-add-a-column/</link>
            <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://explainanalyze.com/p/god-tables-150-columns-and-the-quiet-cost-of-just-add-a-column/</guid>
            <description>&lt;img src=&#34;https://explainanalyze.com/&#34; alt=&#34;Featured image of post God Tables: 150 Columns and the Quiet Cost of &#39;Just Add a Column&#39;&#34; /&gt;&lt;div class=&#34;tldr-box&#34;&gt;&#xA;    &lt;strong&gt;TL;DR&lt;/strong&gt;&#xA;    &lt;div&gt;A wide table looks cheap because every column was added for a real reason — the expensive part is that rows grow, every write amplifies, and every secondary index inherits the bloat. The fix isn&amp;rsquo;t aggressive normalization (which trades one wide table for six-way joins on every read) but splitting by access pattern: columns read together stay together, rarely-touched columns move out.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;p&gt;The schema started clean four years ago: &lt;code&gt;users(id, email, password_hash, created_at)&lt;/code&gt; — four columns. Today the table is renamed &lt;code&gt;customers&lt;/code&gt; and has 184 columns. Billing address. Shipping address. Three additional shipping addresses numbered 2 through 4. &lt;code&gt;preferences_json&lt;/code&gt; for user settings. Twelve feature-flag &lt;code&gt;TINYINT&lt;/code&gt;s. Three Stripe identifiers from three processor migrations. &lt;code&gt;last_login_at&lt;/code&gt;, &lt;code&gt;last_seen_at&lt;/code&gt;, &lt;code&gt;last_purchase_at&lt;/code&gt;, &lt;code&gt;last_notification_sent_at&lt;/code&gt;. Forty more columns whose meaning lives in Confluence, if anywhere. No single &lt;code&gt;ALTER TABLE ADD COLUMN&lt;/code&gt; was unreasonable at the time. 
The accumulated result is an average row size of 6KB, an UPDATE to &lt;code&gt;last_login_at&lt;/code&gt; that rewrites every byte of it, and a buffer pool holding two customer rows per page instead of forty.&lt;/p&gt;&#xA;&lt;p&gt;The obvious fix is to normalize it — split into &lt;code&gt;customer_profile&lt;/code&gt;, &lt;code&gt;customer_billing&lt;/code&gt;, &lt;code&gt;customer_addresses&lt;/code&gt;, &lt;code&gt;customer_preferences&lt;/code&gt;, &lt;code&gt;customer_feature_flags&lt;/code&gt;, &lt;code&gt;customer_audit&lt;/code&gt;. That&amp;rsquo;s the textbook answer and it&amp;rsquo;s the one that breaks the moment you look at the dominant read. The list view on the admin page needs name, email, status, last login, Stripe status, and total spent — now it&amp;rsquo;s a six-way join on every page load. The fix that looked clean in the migration doc makes the most-frequent query more expensive, not less. The read cost moves to the place it&amp;rsquo;s paid most often, and somebody — usually a few months later — proposes a materialized view to &amp;ldquo;just flatten it back out,&amp;rdquo; which is the god table returning through a different door.&lt;/p&gt;&#xA;&lt;h2 id=&#34;how-a-row-store-actually-reads-a-row&#34;&gt;How a row-store actually reads a row&#xA;&lt;/h2&gt;&lt;p&gt;Before the cost math makes sense: OLTP engines like InnoDB and PostgreSQL&amp;rsquo;s heap store complete rows laid out contiguously on fixed-size pages — typically 16KB in InnoDB, 8KB in PostgreSQL. A page holds as many rows as fit. 
When a query needs one column of one row, the engine doesn&amp;rsquo;t read that column alone; it locates the row&amp;rsquo;s page via an index lookup or scan, loads the whole page into the buffer pool, and reads the requested column out of the in-memory row image.&lt;/p&gt;&#xA;&lt;p&gt;The one exception is the index-only scan: if every column the query projects and filters on is already present inside an index, the base table doesn&amp;rsquo;t have to be touched and only the index pages are loaded. See &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/covering-index-traps-when-adding-one-column-breaks-your-query/&#34; &gt;Covering Index Traps&lt;/a&gt; for how quickly this optimization disappears — usually the moment a SELECT list grows by one column. Every other read path goes through the row, which means the row&amp;rsquo;s width sets the floor on how much data the engine moves per lookup. Reading &lt;code&gt;email&lt;/code&gt; from a 184-column customer row loads 6KB into memory to return 50 bytes; reading the same column from an 800-byte row loads 800 bytes. The buffer pool is a fixed size and every byte of unused column data in it is displacing something another query needs.&lt;/p&gt;&#xA;&lt;p&gt;Column stores (ClickHouse, BigQuery, Parquet-backed warehouses) invert this entirely — data is laid out by column, so reading one column reads only that column&amp;rsquo;s storage. The wide-table cost math doesn&amp;rsquo;t apply there, which is why this anti-pattern is specifically a row-store OLTP problem and why denormalized fact tables in analytical warehouses are fine at 300 columns.&lt;/p&gt;&#xA;&lt;h2 id=&#34;what-150-columns-actually-costs&#34;&gt;What 150 columns actually costs&#xA;&lt;/h2&gt;&lt;p&gt;The individual cost of one column is negligible. 
The system-level cost shows up in several places at once, and none of them are visible in a diff that adds one more.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Row size and write amplification.&lt;/strong&gt; InnoDB stores full rows on disk pages, and an UPDATE rewrites the entire row even if only one column changed. On a 184-column table averaging 6KB per row, updating &lt;code&gt;last_login_at&lt;/code&gt; on every sign-in rewrites 6KB, not 8 bytes. PostgreSQL doesn&amp;rsquo;t rewrite in place — MVCC creates a new tuple for every UPDATE and marks the old one dead — but the new tuple is 6KB too, and &lt;code&gt;VACUUM&lt;/code&gt; has that much more to reclaim. In either engine, the write cost per logical change scales with row width.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Buffer pool density.&lt;/strong&gt; The page-per-read mechanism above means buffer-pool efficiency scales inversely with row width. At 6KB per row, an InnoDB 16KB page holds two rows; at 400 bytes per row it holds forty. A database with 10GB of buffer pool has the effective working set of a much smaller instance once rows get wide — queries that used to run hot start touching disk for no reason other than that the rows they cared about no longer fit in memory alongside the rows other queries cared about.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Secondary indexes inherit the width problem.&lt;/strong&gt; Every secondary index in InnoDB carries a copy of the primary key at its leaves; every index entry is a key-columns + PK-copy record. A wide table tends to accumulate indexes — you index email, Stripe ID, last-login, phone, region, account-manager-ID, each for a different query path. Six secondary indexes on a 184-column table isn&amp;rsquo;t unusual, and each one is a separate B-tree that every INSERT, and every UPDATE touching an indexed column, has to maintain. 
&lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/covering-index-traps-when-adding-one-column-breaks-your-query/&#34; &gt;Covering indexes&lt;/a&gt; are also harder to arrange: the list view wants eight columns projected, and indexing eight columns of a 184-column table to cover one query is an expensive trade.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Lock and transaction width.&lt;/strong&gt; Every UPDATE acquires a row-level lock. Transactions that touch a wide row hold that lock for the duration of the transaction, and because the row spans many concerns — billing, preferences, audit timestamps — transactions from unrelated code paths contend on the same row. A background job updating &lt;code&gt;last_seen_at&lt;/code&gt; now serializes against a billing job updating &lt;code&gt;stripe_customer_id&lt;/code&gt; on the same customer, because both paths lock the same row. In the split-by-concern shape, they&amp;rsquo;d contend on different rows of different tables.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Schema migrations get more expensive.&lt;/strong&gt; &lt;code&gt;ALTER TABLE ADD COLUMN&lt;/code&gt; on a 184-column table is slower, holds metadata locks longer, and has a larger blast radius if it fails. MySQL&amp;rsquo;s online DDL is usually fine for NULL-default additions; PostgreSQL is generally fast for the same case. But any migration that needs to rewrite rows (changing a column type, adding NOT NULL with a backfill) scales with row size, and a 6KB row rewrite on 200 million rows is a different operation than an 800-byte row rewrite on the same count.&lt;/p&gt;&#xA;&lt;div class=&#34;warning-box&#34;&gt;&#xA;    &lt;strong&gt;Every column is a commitment&lt;/strong&gt;&#xA;    &lt;div&gt;The cost of adding a column is small and immediate. The cost of having 150 columns is systemic and deferred — buffer-pool density, index size, write amplification, lock contention, migration cost. 
None of the deferred costs are visible in the PR that adds one more column, which is why they accumulate uncorrected until the table is painful.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;h2 id=&#34;why-llms-make-this-worse&#34;&gt;Why LLMs make this worse&#xA;&lt;/h2&gt;&lt;p&gt;Schema drift in the wide-table direction is what language models reinforce by default. A model generating &lt;code&gt;ALTER TABLE&lt;/code&gt; for a feature request reads the current schema and proposes the smallest change that makes the feature work — which is almost always adding columns to the table that already holds the related data. Proposing a split requires understanding the access pattern, the transaction boundaries, and the write frequency of the new columns versus the existing ones. None of that is in the &lt;code&gt;CREATE TABLE&lt;/code&gt;.&lt;/p&gt;&#xA;&lt;p&gt;The loop reinforces itself: the wider the table gets, the more natural it is for the next change to widen it further. &amp;ldquo;Where do loyalty tier and tier expiry go?&amp;rdquo; The model sees &lt;code&gt;customers&lt;/code&gt; has every other user-attached concept in it and adds two columns. The alternative — &lt;code&gt;CREATE TABLE customer_loyalty (customer_id PK FK, tier, expires_at)&lt;/code&gt; — requires the model to argue for a split, and splits are rare in the training data compared to additions because splits are rare in real codebases for the same reason: they&amp;rsquo;re harder to ship than additions. The model is correctly pattern-matching on what humans actually do, which is exactly the problem.&lt;/p&gt;&#xA;&lt;p&gt;ORMs compound this. One model equals one table is the default shape in ActiveRecord, Django ORM, Prisma, SQLAlchemy, and Ecto. Refactoring a &lt;code&gt;Customer&lt;/code&gt; model into three co-owned tables is a change that touches every query, every serializer, every test. 
The ORM makes &amp;ldquo;add a column to the existing model&amp;rdquo; a five-line change and &amp;ldquo;split the model&amp;rdquo; a project. Engineers pick the cheap option every time, and the wide table ratchets.&lt;/p&gt;&#xA;&lt;h2 id=&#34;split-by-access-pattern-not-by-concept&#34;&gt;Split by access pattern, not by concept&#xA;&lt;/h2&gt;&lt;p&gt;&amp;ldquo;Normalize it&amp;rdquo; isn&amp;rsquo;t the fix because normalization is a property of data shape, not query cost. The fix is to look at what columns are actually read and written together, and keep those co-located; the rest moves out.&lt;/p&gt;&#xA;&lt;p&gt;A workable decomposition for the &lt;code&gt;customers&lt;/code&gt; example:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Core hot table&lt;/strong&gt; — the columns read on nearly every query: &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;tier&lt;/code&gt;, &lt;code&gt;stripe_customer_id&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;. Maybe twenty columns. This is what the list view, the auth path, and most API responses need.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;1:1 cold tables&lt;/strong&gt; — concerns that are read rarely or in specific flows: &lt;code&gt;customer_audit&lt;/code&gt; for login/seen/purchase timestamps, &lt;code&gt;customer_preferences&lt;/code&gt; for user settings, &lt;code&gt;customer_feature_flags&lt;/code&gt; for the twelve TINYINT flags. Each is a separate table with &lt;code&gt;customer_id&lt;/code&gt; as PK and FK, joined only when the flow actually needs it. 
Writes to &lt;code&gt;last_login_at&lt;/code&gt; stop rewriting the billing row.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;1:N tables for repeating groups&lt;/strong&gt; — addresses, payment methods, anything that was modeled as &lt;code&gt;shipping_address_2&lt;/code&gt;, &lt;code&gt;shipping_address_3&lt;/code&gt;, &lt;code&gt;shipping_address_4&lt;/code&gt; becomes an &lt;code&gt;addresses&lt;/code&gt; table with a FK and a type. This collapses polymorphic-ish schema decisions that shouldn&amp;rsquo;t have been made at the column level in the first place; see &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/polymorphic-references-are-not-foreign-keys/&#34; &gt;Polymorphic References&lt;/a&gt; for the related pattern where doing this without a FK goes wrong.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;The trade-off is that some queries now join two or three tables instead of reading one. On the hot path this is fine — the joins are on PK-equals-FK, the join tables are small, and the read is usually cheaper than scanning a fat row. The cold path is where it matters: the audit screen now joins &lt;code&gt;customers&lt;/code&gt; to &lt;code&gt;customer_audit&lt;/code&gt;, which costs one indexed lookup and nobody notices. The place to be careful is the query that reads from three of the split tables on every request — if that&amp;rsquo;s dominant, one of those tables probably belongs back in the core table.&lt;/p&gt;&#xA;&lt;h2 id=&#34;when-a-wide-table-is-actually-fine&#34;&gt;When a wide table is actually fine&#xA;&lt;/h2&gt;&lt;p&gt;Not every 100-column table is a god table. 
Three cases where width is defensible:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Analytical and reporting tables on columnar storage.&lt;/strong&gt; As noted above, warehouses like ClickHouse, BigQuery, and Redshift invert the cost calculus — reading one column doesn&amp;rsquo;t load the rest, and the normalization pressure flips: denormalize aggressively because joins are expensive and per-column reads are cheap. This anti-pattern is specifically a row-store OLTP problem.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Small tables that stay small.&lt;/strong&gt; A &lt;code&gt;tenants&lt;/code&gt; table with 80 columns and 500 rows fits entirely in the buffer pool. The write amplification is paid a few thousand times a day, not a few million. The secondary-index cost is negligible because the indexes are small. Width matters when row count is large enough for the per-row cost to dominate — on small tables it doesn&amp;rsquo;t.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Every query reads every column.&lt;/strong&gt; Uncommon but real. If the dominant read is &amp;ldquo;fetch the full customer record for display&amp;rdquo; and the split would produce a join that runs on every request anyway, the split doesn&amp;rsquo;t help. The test is whether the queries you actually run touch disjoint column sets — if they do, the split has a real win; if they don&amp;rsquo;t, it&amp;rsquo;s architecture for its own sake.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-bigger-picture&#34;&gt;The bigger picture&#xA;&lt;/h2&gt;&lt;p&gt;Relational databases aren&amp;rsquo;t built for developer convenience. They&amp;rsquo;re built for storage efficiency and retrieval speed — narrow rows, well-placed indexes, joins on indexed keys, query plans that read only what they need. Normalization isn&amp;rsquo;t an academic ideal; it&amp;rsquo;s the shape that lines up with how the engine actually pays its bills. 
Every cost mechanism in this post — buffer-pool density, write amplification, index bloat, row-lock width — is the engine reporting the same thing in different dialects: the shape you&amp;rsquo;re asking it to hold isn&amp;rsquo;t the shape it was optimized for. The SELECT-*-and-done dream is the developer&amp;rsquo;s cost model, not the database&amp;rsquo;s.&lt;/p&gt;&#xA;&lt;p&gt;God tables aren&amp;rsquo;t designed; they&amp;rsquo;re the limit of a sequence of rational local decisions where the global cost is invisible at each step. The column count of a mature production table is usually a decent proxy for how long the team has been making the cheap choice, which is most teams most of the time — and that is not by itself a failure. The failure is that the cost goes uncounted. A 6KB row is a write-amplification multiplier on every UPDATE, a buffer-pool multiplier on every read, and an index-size multiplier on every secondary index. None of those costs are on the PR that adds a column; all of them are on the dashboard that shows p99 drifting up quarter after quarter.&lt;/p&gt;&#xA;&lt;p&gt;The lever is to count the cost at the system level when the table hits a certain width — pick a threshold, sixty columns, a hundred, whatever fits — and make the next column addition a conversation about whether this concern belongs here, not a line in a migration. The answer is often still yes, but it shouldn&amp;rsquo;t be the default answer. When it&amp;rsquo;s no, the split is far cheaper at column sixty than at column one-eighty; the table doesn&amp;rsquo;t care, but every caller of the table does, and the rewrite&amp;rsquo;s blast radius scales with how long the drift went uncorrected.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>Legacy Schemas Are Sediment, Not Design</title>
            <link>https://explainanalyze.com/p/legacy-schemas-are-sediment-not-design/</link>
            <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://explainanalyze.com/p/legacy-schemas-are-sediment-not-design/</guid>
            <description>&lt;img src=&#34;https://explainanalyze.com/&#34; alt=&#34;Featured image of post Legacy Schemas Are Sediment, Not Design&#34; /&gt;&lt;div class=&#34;tldr-box&#34;&gt;&#xA;    &lt;strong&gt;TL;DR&lt;/strong&gt;&#xA;    &lt;div&gt;A legacy schema looks like a design but reads like a sediment — layers of decisions from different eras, where names that once described the data no longer do and conventions that look uniform aren&amp;rsquo;t. The fix isn&amp;rsquo;t renaming (prohibitively expensive once every caller depends on the current names); it&amp;rsquo;s documenting the drift so the next reader — human or LLM — can navigate what&amp;rsquo;s actually there.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;p&gt;A new engineer joins the team and reads the schema. &lt;code&gt;tmp_orders&lt;/code&gt; looks like scaffolding — something to delete once the real migration ships. The tech lead answers: never delete it. &lt;code&gt;tmp_orders&lt;/code&gt; is the main orders table. The temp-to-permanent rename was planned for 2017, nobody shipped it, and every service in the company now writes to the table. The name is a lie the schema tells every new reader — and every LLM generating SQL against the catalog.&lt;/p&gt;&#xA;&lt;p&gt;The obvious fix is to rename the table. Nothing about the database itself prevents it — drop the &lt;code&gt;tmp_&lt;/code&gt; prefix, update every call site, ship. The reality is that every service, ORM model, report, integration, and runbook references &lt;code&gt;tmp_orders&lt;/code&gt; by name. The rename is a multi-quarter effort that crosses team boundaries, and the only justification is legibility. 
Teams rarely prioritize legibility work, so the name stays, and the schema keeps lying.&lt;/p&gt;&#xA;&lt;h2 id=&#34;whats-drifted&#34;&gt;What&amp;rsquo;s drifted&#xA;&lt;/h2&gt;&lt;p&gt;Legacy drift shows up in three visible modes and one invisible one.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Names that stopped describing the data.&lt;/strong&gt; &lt;code&gt;tmp_&lt;/code&gt; tables that are permanent. &lt;code&gt;old_&lt;/code&gt; columns that are current. &lt;code&gt;deprecated_&lt;/code&gt; fields that every write path still populates. &lt;code&gt;flag1&lt;/code&gt;, &lt;code&gt;flag2&lt;/code&gt;, &lt;code&gt;status_code&lt;/code&gt; — names whose meaning was obvious when the column was added, because the person adding it remembered why. By the time a new reader arrives, the intent is gone and the name is false advertising. &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/comment-your-schema/&#34; &gt;Comment Your Schema&lt;/a&gt; covers the documentation side of this; legacy schemas are the case where comments would help most and where they&amp;rsquo;re most often absent.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Conventions per era.&lt;/strong&gt; The 2014-era backend team used &lt;code&gt;camelCase&lt;/code&gt;. The 2019 rewrite adopted &lt;code&gt;snake_case&lt;/code&gt;. The 2022 microservice added a third table with &lt;code&gt;PascalCase&lt;/code&gt; because the Go team wrote it and nobody pushed back. Now one database has &lt;code&gt;userId&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, and &lt;code&gt;UserID&lt;/code&gt; — all referring to the same entity across different tables. 
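&lt;/p&gt;&#xA;&lt;p&gt;The drift is at least cheap to survey. A hypothetical audit query, MySQL-flavored, that normalizes identifier spellings before comparing so all three variants surface together:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code class=&#34;language-sql&#34;&gt;-- List every column that is some spelling of user id.&#xA;SELECT table_name, column_name&#xA;FROM information_schema.columns&#xA;WHERE table_schema = DATABASE()&#xA;  AND LOWER(REPLACE(column_name, &#39;_&#39;, &#39;&#39;)) = &#39;userid&#39;&#xA;ORDER BY table_name;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;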
The LLM that generates &lt;code&gt;business.created_at&lt;/code&gt; when the column is actually &lt;code&gt;business.createdDate&lt;/code&gt; isn&amp;rsquo;t wrong in any sense the schema could catch; it&amp;rsquo;s inferring a convention from one table and applying it to another, which is a reasonable thing to do in a schema that has only one convention.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Tables that were supposed to be temporary.&lt;/strong&gt; &lt;code&gt;tmp_orders&lt;/code&gt; is the canonical example, but every long-lived database has some. Staging tables that got promoted to production. Migration tables that weren&amp;rsquo;t cleaned up. &amp;ldquo;Phase 2&amp;rdquo; tables built for a transitional period that shipped in phase 1 and never came back to finish. The names encode the original intent; the data encodes the current reality; the two diverge a little more with every migration that preserves the name instead of fixing it.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Invisible structural drift.&lt;/strong&gt; Charsets and collations are the version of drift that doesn&amp;rsquo;t even show up in the column list. Older tables created before the Unicode migration default to &lt;code&gt;latin1&lt;/code&gt;; newer tables use &lt;code&gt;utf8mb4&lt;/code&gt;. A join between a &lt;code&gt;VARCHAR(100)&lt;/code&gt; column in one table and a &lt;code&gt;VARCHAR(100)&lt;/code&gt; column in another — both with the same name, both with the same logical meaning — silently produces different results depending on which side&amp;rsquo;s collation MySQL picks. In the bad cases, an implicit charset conversion kills index usage and turns the query into a table scan. &lt;code&gt;SHOW TABLE STATUS&lt;/code&gt; reveals this; reading the column list doesn&amp;rsquo;t. 
Most LLMs read the column list.&lt;/p&gt;&#xA;&lt;h2 id=&#34;why-this-is-worse-for-llms-than-for-humans&#34;&gt;Why this is worse for LLMs than for humans&#xA;&lt;/h2&gt;&lt;p&gt;A new human engineer working with a legacy schema can ask. They can ping the on-call channel, look up the original migration in git, trace a column back to the PR that introduced it, or simply ask &amp;ldquo;what is &lt;code&gt;flag1&lt;/code&gt;?&amp;rdquo; and get an answer from someone who knows. The answer is often wrong or outdated, but it&amp;rsquo;s a starting point, and the engineer learns to treat the schema with appropriate suspicion.&lt;/p&gt;&#xA;&lt;p&gt;An LLM generating SQL from the catalog has no such recourse. It sees &lt;code&gt;tmp_orders&lt;/code&gt; and reasons from the name — probably &amp;ldquo;this is a staging table, prefer the non-tmp version if one exists, otherwise deprioritize.&amp;rdquo; It sees &lt;code&gt;old_price&lt;/code&gt; and treats it as historical. It sees &lt;code&gt;flag1 BOOLEAN&lt;/code&gt; and infers a generic flag. Each inference is reasonable; each is wrong in the specific case; and the schema gives no signal that this is one of the cases where reasoning from the name produces bad SQL.&lt;/p&gt;&#xA;&lt;p&gt;This is the sharper version of the &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/the-bare-id-primary-key-when-every-table-joins-to-every-other-table/&#34; &gt;generic &lt;code&gt;id&lt;/code&gt; primary key&lt;/a&gt; problem. Both are failures of the schema to describe itself. The PK case hides what&amp;rsquo;s being matched; legacy drift hides what anything &lt;em&gt;means&lt;/em&gt;. Neither failure shows up at write time — both produce queries that run, return data, and look plausible, because the rows exist and the types match. 
The wrongness is in the interpretation, which the database has no way to check.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-fix-is-documentation-not-renaming&#34;&gt;The fix is documentation, not renaming&#xA;&lt;/h2&gt;&lt;p&gt;The obvious fix — rename everything to match intent and convention — fails on cost. Every table, column, and constraint in a mature schema is referenced by services the team has forgotten about: scheduled jobs, Redshift imports, third-party integrations, BI dashboards built by a contractor in 2019, runbooks pasted into wiki pages that nobody has edited since. A rename that looks like a one-line migration touches every surface the table is exposed on, and the projects that survive the attempt usually take a year and leave the schema worse during the transition.&lt;/p&gt;&#xA;&lt;p&gt;The workable fix is to stop the drift from continuing and make the existing drift visible. Stopping new drift means picking a convention for new tables and columns and writing it down where CI can enforce it (&lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/schema-conventions-dont-survive-without-automation/&#34; &gt;Schema Conventions and Why They Matter&lt;/a&gt; covers the mechanics). 
Making existing drift visible means &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/comment-your-schema/&#34; &gt;column and table comments&lt;/a&gt; on everything whose name doesn&amp;rsquo;t match its meaning, plus a per-era mapping somewhere in the repo that says &amp;ldquo;this database has four naming conventions, used in these periods, applied to these tables.&amp;rdquo; Legacy schemas are the case where &lt;code&gt;COMMENT ON&lt;/code&gt; pays off highest — the names are already wrong, the cost of fixing them is prohibitive, and the comment is the one affordable signal the next reader gets.&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;COMMENT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ON&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;TABLE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;tmp_orders&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;IS&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span 
class=&#34;s1&#34;&gt;&amp;#39;Main orders table. The tmp_ prefix is historical — a 2017 migration was planned to rename this and was never completed. Do not drop.&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;COMMENT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ON&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;COLUMN&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;customers&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;flag1&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;IS&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;VIP customer flag. Legacy name from the 2014 schema — never renamed because of external reporting dependencies.&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;One-line migrations, zero risk, and every reader — human and LLM — now has a chance of reading the schema correctly. 
This isn&amp;rsquo;t a fix in the sense of &amp;ldquo;problem solved.&amp;rdquo; It&amp;rsquo;s a fix in the sense of &amp;ldquo;the next reader has a chance.&amp;rdquo; The drift is structural; the documentation is how you navigate it without making it worse.&lt;/p&gt;&#xA;&lt;h2 id=&#34;when-a-clean-rewrite-is-actually-worth-it&#34;&gt;When a clean rewrite is actually worth it&#xA;&lt;/h2&gt;&lt;p&gt;Renames and migrations aren&amp;rsquo;t always wrong. Three cases where the rewrite earns its cost:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;A misleading name is actively causing incidents.&lt;/strong&gt; If &lt;code&gt;tmp_orders&lt;/code&gt; is regularly truncated or dropped by someone who reads the name literally and acts on it, the rename cost is less than the recovery cost from the next incident. Usually the practical fix here isn&amp;rsquo;t the full cross-codebase rename — it&amp;rsquo;s an &lt;code&gt;ALTER TABLE ... RENAME&lt;/code&gt; paired with a view or synonym that exposes &lt;code&gt;orders&lt;/code&gt; as the canonical name and keeps &lt;code&gt;tmp_orders&lt;/code&gt; as a compatibility alias for legacy callers.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;A schema migration is happening anyway.&lt;/strong&gt; If the team is replatforming the OLTP database or splitting it across services, the rewrite opens a window where renames are cheap because callers are being updated either way. Take the opportunity; don&amp;rsquo;t schedule a separate naming cleanup six months later when the window has closed.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;A database small enough to fit in one person&amp;rsquo;s head.&lt;/strong&gt; Early-stage startups, internal tools, bounded-scope services. 
At twenty tables and three developers, a Saturday afternoon of renames is cheaper than a decade of comments.&lt;/p&gt;&#xA;&lt;p&gt;In every other case, the schema is load-bearing history, and you renovate it the way you renovate a building with people still living in it: patch, document, and schedule the demolition for a window when it&amp;rsquo;s genuinely cheap.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-bigger-picture&#34;&gt;The bigger picture&#xA;&lt;/h2&gt;&lt;p&gt;Every production schema is a compressed record of the decisions the team made under pressure. Some of those decisions were good and still fit; some were good at the time and don&amp;rsquo;t fit now; some were expedient and nobody noticed. The schema can&amp;rsquo;t tell you which is which, and it was never going to. The fix isn&amp;rsquo;t to aspire to a clean schema that doesn&amp;rsquo;t accumulate history — no such schema exists past a three-year horizon — but to leave the next reader enough signal to decompress the sediment without guessing.&lt;/p&gt;&#xA;&lt;p&gt;Comment the columns that lie. Document the conventions per era. Treat LLMs generating SQL against the catalog as the same kind of reader a new engineer is, and give them the same written context. The goal isn&amp;rsquo;t a schema without legacy drift; it&amp;rsquo;s a schema whose drift is legible to the people and tools that will inherit it.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>The Bare `id` Primary Key: When Every Table Joins to Every Other Table</title>
            <link>https://explainanalyze.com/p/the-bare-id-primary-key-when-every-table-joins-to-every-other-table/</link>
            <pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://explainanalyze.com/p/the-bare-id-primary-key-when-every-table-joins-to-every-other-table/</guid>
            <description>&lt;img src=&#34;https://explainanalyze.com/&#34; alt=&#34;Featured image of post The Bare `id` Primary Key: When Every Table Joins to Every Other Table&#34; /&gt;&lt;div class=&#34;tldr-box&#34;&gt;&#xA;    &lt;strong&gt;TL;DR&lt;/strong&gt;&#xA;    &lt;div&gt;A bare &lt;code&gt;id&lt;/code&gt; primary key on every table makes &lt;code&gt;a.id = b.id&lt;/code&gt; valid SQL between any two tables, which means neither a human reviewing the query nor an LLM generating one can tell which of those equalities are meaningful. The fix isn&amp;rsquo;t picking the &amp;ldquo;right&amp;rdquo; PK type — it&amp;rsquo;s naming primary keys after the table they identify, so the schema describes its own relationships.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;p&gt;Here&amp;rsquo;s a query an AI assistant generated against a real production schema:&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;SELECT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;u&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;email&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;a&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span 
class=&#34;n&#34;&gt;payload&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;FROM&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;users&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;u&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;JOIN&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;actions&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;a&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ON&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;u&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;a&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;WHERE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;u&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;email&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;alice@example.com&amp;#39;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;&lt;span 
class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;Syntactically clean. Ran without error. Returned zero rows — which the assistant reported back as &amp;ldquo;this user has no actions.&amp;rdquo; The real answer was that &lt;code&gt;users.id&lt;/code&gt; is a &lt;code&gt;BIGINT&lt;/code&gt; and &lt;code&gt;actions.id&lt;/code&gt; is a &lt;code&gt;CHAR(36)&lt;/code&gt; UUID. MySQL compared the integer and the UUID string as numbers — a coercion that collapses most UUIDs to zero or to their leading digits — and found no match. The join wasn&amp;rsquo;t wrong, exactly — it was meaningless, and the database had no way to say so.&lt;/p&gt;&#xA;&lt;p&gt;The experienced reader&amp;rsquo;s first fix is &amp;ldquo;just use UUIDs everywhere&amp;rdquo; or &amp;ldquo;enforce the type at join time.&amp;rdquo; Neither works. The footgun isn&amp;rsquo;t the type mismatch; it&amp;rsquo;s the column name. When every table&amp;rsquo;s primary key is named &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;a.id = b.id&lt;/code&gt; is a valid expression between any two tables in the schema, and nothing in the column names tells you whether that expression means anything. Fix the types and you close one failure mode; the identically-typed, semantically-unrelated case — &lt;code&gt;users.id = orders.id&lt;/code&gt; matching wherever the integers happen to collide — still ships.&lt;/p&gt;&#xA;&lt;h2 id=&#34;what-nobody-can-see&#34;&gt;What nobody can see&#xA;&lt;/h2&gt;&lt;p&gt;The &lt;code&gt;&amp;lt;table&amp;gt;_id&lt;/code&gt; convention is older than most of us, and the case for it is usually framed as clarity or style. 
The sharper framing is that bare &lt;code&gt;id&lt;/code&gt; hides the information that matters most at the point of the join — which table&amp;rsquo;s identity is being compared, and whether comparing them makes sense — from every reader of the query.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;The query&amp;rsquo;s reviewer.&lt;/strong&gt; &lt;code&gt;ON u.id = a.id&lt;/code&gt; gives no hint of what&amp;rsquo;s being matched. A human reviewer has to carry the table-to-alias mapping (&lt;code&gt;u&lt;/code&gt; is &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;a&lt;/code&gt; is &lt;code&gt;actions&lt;/code&gt;) and the table-to-type mapping (&lt;code&gt;users.id&lt;/code&gt; is BIGINT, &lt;code&gt;actions.id&lt;/code&gt; is UUID) in working memory, then cross-check them against the join condition. None of those steps are hard, but reviewers skip them because the column names look symmetric. Two &lt;code&gt;.id&lt;/code&gt; references read as &amp;ldquo;joining on primary keys,&amp;rdquo; which is the kind of join nobody flags.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;The LLM reading the schema.&lt;/strong&gt; An assistant generating SQL from the catalog sees &lt;code&gt;users(id BIGINT, ...)&lt;/code&gt; and &lt;code&gt;actions(id CHAR(36), ...)&lt;/code&gt; as two tables with primary keys named &lt;code&gt;id&lt;/code&gt;. Absent a full column-type check on every candidate join (and most schema-reading prompts don&amp;rsquo;t do this), the natural-looking join between &amp;ldquo;a user and their actions&amp;rdquo; is &lt;code&gt;u.id = a.id&lt;/code&gt; — which is exactly wrong. The schema presented the column as joinable; the LLM took it at face value. The same mistake a tired human makes, but at scale and without fatigue to blame.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;The static analyzer.&lt;/strong&gt; Linters and schema-aware query builders operate on names first and types second. 
A rule that warns on suspicious cross-table joins has no signal to fire on when both sides are &lt;code&gt;.id&lt;/code&gt; — the column names match, so the join is &amp;ldquo;legitimate&amp;rdquo; by shape. The same rule on &lt;code&gt;users.user_id = actions.action_id&lt;/code&gt; would flag it immediately, because the names would be obviously non-corresponding.&lt;/p&gt;&#xA;&lt;p&gt;None of these readers are missing a step they should have taken. They&amp;rsquo;re all doing the reasonable thing, and the reasonable thing produces wrong queries because the schema is telling them &lt;code&gt;id&lt;/code&gt; is &lt;code&gt;id&lt;/code&gt; in both tables.&lt;/p&gt;&#xA;&lt;h2 id=&#34;three-failure-modes-ranked-by-how-loudly-they-fail&#34;&gt;Three failure modes, ranked by how loudly they fail&#xA;&lt;/h2&gt;&lt;p&gt;Three distinct outcomes hide behind &lt;code&gt;a.id = b.id&lt;/code&gt;, and they don&amp;rsquo;t fail equally:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;PostgreSQL, mixed types.&lt;/strong&gt; The comparison errors out with &lt;code&gt;operator does not exist: bigint = uuid&lt;/code&gt;. Loud, caught in development, fixed before merge. The best failure mode.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;MySQL, mixed types.&lt;/strong&gt; Silent numeric coercion, zero rows returned. The opening example. Bad, because &amp;ldquo;no results&amp;rdquo; looks like valid data to every downstream consumer.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Either engine, same type but semantically unrelated.&lt;/strong&gt; &lt;code&gt;BIGINT users.id = 42&lt;/code&gt; matched against &lt;code&gt;BIGINT orders.id = 42&lt;/code&gt; returns the rows where the integers happen to collide. The query runs, the result set isn&amp;rsquo;t empty, and the rows look plausible because they&amp;rsquo;re real rows from real tables. 
The worst failure mode, because nothing about the output signals that the join was nonsense.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;The first two are loud enough to catch in review. The third is the one that ships. And the third is the default once more than one table in the schema uses a plain &lt;code&gt;BIGINT&lt;/code&gt; &lt;code&gt;id&lt;/code&gt; — which is almost every relational schema in existence.&lt;/p&gt;&#xA;&lt;div class=&#34;warning-box&#34;&gt;&#xA;    &lt;strong&gt;Zero rows looks like no data&lt;/strong&gt;&#xA;    &lt;div&gt;A join that silently returns zero rows because of a type coercion is indistinguishable from a join that legitimately has no matches. Code generators, dashboards, and AI assistants all interpret empty results as &amp;ldquo;the relationship exists but has no rows,&amp;rdquo; not &amp;ldquo;the query is nonsense.&amp;rdquo; The failure hides inside success.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;h2 id=&#34;mixed-pk-types-make-the-naming-problem-sharper&#34;&gt;Mixed PK types make the naming problem sharper&#xA;&lt;/h2&gt;&lt;p&gt;Production schemas rarely stay on one PK strategy for long. The original tables are usually &lt;code&gt;BIGINT AUTO_INCREMENT&lt;/code&gt; because the framework defaulted to it; a newer service switches to UUIDs to let clients generate IDs offline or to distribute across shards; join tables pick up composite keys because &lt;code&gt;(user_id, role_id)&lt;/code&gt; is the natural identity. Nothing in the schema announces which tables fall into which bucket — &lt;code&gt;SHOW CREATE TABLE&lt;/code&gt; or &lt;code&gt;\d&lt;/code&gt; is the only source of truth, and even that requires reading every table to know what joins are legal.&lt;/p&gt;&#xA;&lt;p&gt;Mixed types are where the naming footgun turns from theoretical to frequent. When every PK was a BIGINT, the &amp;ldquo;same type but semantically unrelated&amp;rdquo; case was the main risk and reviewers caught most of it. 
Once the schema has BIGINT and UUID sitting next to each other — all named &lt;code&gt;id&lt;/code&gt; — the mismatched-type cases pile on top, and &amp;ldquo;no data found&amp;rdquo; becomes a regular report from any tool generating queries from the schema.&lt;/p&gt;&#xA;&lt;p&gt;The sizing question — when to pick BIGINT versus UUID versus UUIDv7 versus composite, and what each costs at the index level — is covered separately in &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/random-uuids-as-primary-keys-the-b-tree-penalty/&#34; &gt;Random UUIDs as Primary Keys&lt;/a&gt;. The two problems interact but have independent fixes: pick your PK types deliberately, &lt;em&gt;and&lt;/em&gt; name them so the schema describes its own relationships. Neither fix substitutes for the other.&lt;/p&gt;&#xA;&lt;h2 id=&#34;naming-is-the-lever-that-actually-helps&#34;&gt;Naming is the lever that actually helps&#xA;&lt;/h2&gt;&lt;p&gt;Naming is what makes a schema describe its own relationships without requiring the reader — human or otherwise — to open every &lt;code&gt;CREATE TABLE&lt;/code&gt;. Two conventions, consistently applied, close most of the gap:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Name the primary key after the table.&lt;/strong&gt; &lt;code&gt;users.user_id&lt;/code&gt;, &lt;code&gt;orders.order_id&lt;/code&gt;, &lt;code&gt;actions.action_id&lt;/code&gt;. The equality &lt;code&gt;users.user_id = orders.order_id&lt;/code&gt; reads as obvious nonsense, because the column names are no longer identical. Reviewers see it, LLMs don&amp;rsquo;t produce it, linters can flag it. The cost is a small amount of redundancy in queries (&lt;code&gt;users.user_id&lt;/code&gt; instead of &lt;code&gt;users.id&lt;/code&gt;), which is almost always a fair trade. 
This lines up with the broader guidance in &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/schema-conventions-dont-survive-without-automation/&#34; &gt;Schema Conventions and Why They Matter&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Foreign keys mirror the target PK.&lt;/strong&gt; &lt;code&gt;orders.user_id&lt;/code&gt; clearly references &lt;code&gt;users.user_id&lt;/code&gt;. &lt;code&gt;actions.user_id&lt;/code&gt; clearly references &lt;code&gt;users.user_id&lt;/code&gt;. This is already common practice; the only change is that the target&amp;rsquo;s PK name matches, closing the loop. &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/foreign-keys-are-not-optional/&#34; &gt;Foreign Keys Are Not Optional&lt;/a&gt; covers why the FK itself matters; naming is what makes the FK legible without the &lt;code&gt;REFERENCES&lt;/code&gt; clause in hand.&lt;/p&gt;&#xA;&lt;p&gt;The bare &lt;code&gt;id&lt;/code&gt; convention is defensible when the PK column only ever shows up in queries alongside its table name (&lt;code&gt;users.id&lt;/code&gt;) and never as a bare &lt;code&gt;id&lt;/code&gt; in a SELECT list or join condition. That discipline is hard to enforce across a team over years, and every framework&amp;rsquo;s default query builder produces &lt;code&gt;SELECT id FROM users&lt;/code&gt; without thinking about it. The naming fix makes the discipline unnecessary.&lt;/p&gt;&#xA;&lt;h2 id=&#34;when-bare-id-is-actually-fine&#34;&gt;When bare &lt;code&gt;id&lt;/code&gt; is actually fine&#xA;&lt;/h2&gt;&lt;p&gt;Not every schema needs to bend. A small application, a service with a handful of tables, or a database where every query is reviewed by one team has plenty of context to keep the &lt;code&gt;a.id = b.id&lt;/code&gt; landmine out of reach. 
The cost of the convention scales with the number of tables, the number of engineers, and the number of non-human query generators; in the small case it rarely shows up.&lt;/p&gt;&#xA;&lt;p&gt;What changes once any of those numbers grows: nobody remembers which tables are BIGINT versus UUID, the assistant pattern of generating queries from the schema is routine, and the review process that caught &lt;code&gt;a.id = b.id&lt;/code&gt; in a 20-table schema can&amp;rsquo;t read every join in a 400-table one. At that size the convention pays rent, and renaming PKs is a migration that gets slower every quarter.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-bigger-picture&#34;&gt;The bigger picture&#xA;&lt;/h2&gt;&lt;p&gt;A schema&amp;rsquo;s job isn&amp;rsquo;t just to hold data correctly; it&amp;rsquo;s to describe its own shape well enough that the tools reading it can reason about relationships without reading every line. The bare &lt;code&gt;id&lt;/code&gt; PK is a small departure from that — one column name shared across tables — but it&amp;rsquo;s the departure that most consistently produces silent-wrong-answer queries, because SQL has no way to distinguish &amp;ldquo;same name, same meaning&amp;rdquo; from &amp;ldquo;same name, different meaning.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;Name the primary key after the table it identifies, so the schema tells its own story when someone — human or otherwise — joins two of them together. It costs almost nothing on day one and leaves the schema legible at 400 tables, which is where most of us end up.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>Polymorphic References Are Not Foreign Keys</title>
            <link>https://explainanalyze.com/p/polymorphic-references-are-not-foreign-keys/</link>
            <pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://explainanalyze.com/p/polymorphic-references-are-not-foreign-keys/</guid>
            <description>&lt;img src=&#34;https://explainanalyze.com/&#34; alt=&#34;Featured image of post Polymorphic References Are Not Foreign Keys&#34; /&gt;&lt;div class=&#34;tldr-box&#34;&gt;&#xA;    &lt;strong&gt;TL;DR&lt;/strong&gt;&#xA;    &lt;div&gt;A polymorphic reference is a pair of columns — &lt;code&gt;resource_id&lt;/code&gt; plus &lt;code&gt;resource_type&lt;/code&gt; — where the type string chooses which table the ID points to. One column of data, many possible targets. ORMs make it a one-liner (&lt;code&gt;polymorphic: true&lt;/code&gt; in Rails, &lt;code&gt;GenericForeignKey&lt;/code&gt; in Django, &lt;code&gt;morphTo&lt;/code&gt; in Laravel) and the database can&amp;rsquo;t help with any of it: no foreign key, no cascade, no planner metadata, no schema-level description of what the column actually references. Reads need conditional joins or unions. Orphans accumulate silently. For the cases people usually reach for it — comments, notifications, attachments — the alternatives (per-target tables, mutually-exclusive nullable FKs with a &lt;code&gt;CHECK&lt;/code&gt;) restore schema integrity at modest cost. 
The pattern earns its keep only where the relationship is genuinely best-effort, like audit or activity logs.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;h2 id=&#34;what-the-pattern-looks-like&#34;&gt;What the pattern looks like&#xA;&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;CREATE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;TABLE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;notifications&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span 
class=&#34;nb&#34;&gt;BIGINT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;PRIMARY&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;KEY&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;user_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;BIGINT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NOT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;REFERENCES&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;users&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resource_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;BIGINT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NOT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resource_type&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span 
class=&#34;nb&#34;&gt;VARCHAR&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;50&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NOT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;message&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;TEXT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NOT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;created_at&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NOT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;DEFAULT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;NOW&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span 
class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;-- resource_type = &amp;#39;order&amp;#39;   → resource_id references orders.id&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;-- resource_type = &amp;#39;invoice&amp;#39; → resource_id references invoices.id&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;-- resource_type = &amp;#39;ticket&amp;#39;  → resource_id references support_tickets.id&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;The tell is &lt;code&gt;resource_id BIGINT NOT NULL&lt;/code&gt; with no &lt;code&gt;REFERENCES&lt;/code&gt; clause — it can&amp;rsquo;t have one, because there are multiple targets. What the application treats as a foreign key is, at the database level, a plain integer with a sibling tag string.&lt;/p&gt;&#xA;&lt;h2 id=&#34;what-the-database-cant-do&#34;&gt;What the database can&amp;rsquo;t do&#xA;&lt;/h2&gt;&lt;p&gt;The cost shows up as absence — every mechanism the database offers for reasoning about relationships is disabled, because the column&amp;rsquo;s meaning depends on data in another column.&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;No foreign key.&lt;/strong&gt; A &lt;code&gt;REFERENCES&lt;/code&gt; clause names exactly one target. Orphaned &lt;code&gt;resource_id&lt;/code&gt; values are a write-time non-event and a read-time mystery. 
(&lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/foreign-keys-are-not-optional/&#34; &gt;Foreign Keys Are Not Optional&lt;/a&gt; covers the general cost; polymorphic is the case where skipping isn&amp;rsquo;t a choice.)&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;No cascade.&lt;/strong&gt; Delete an order and nothing cleans up the notifications pointing at it. The application has to know every table that might hold a polymorphic reference to &lt;code&gt;orders&lt;/code&gt; and clean each one — new tables added later don&amp;rsquo;t get noticed.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;No planner metadata.&lt;/strong&gt; Foreign keys feed join ordering and row estimates, especially in PostgreSQL. The planner sees &lt;code&gt;resource_id&lt;/code&gt; as a &lt;code&gt;BIGINT&lt;/code&gt; with a histogram and no known target.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;No schema-level description.&lt;/strong&gt; Anything that reads the catalog — ERD tools, query generators, AI assistants, typed-client generators — sees no link between &lt;code&gt;notifications.resource_id&lt;/code&gt; and the tables it points at. The mapping lives in model files and string literals. (&lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/comment-your-schema/&#34; &gt;Comment Your Schema&lt;/a&gt; helps here but can&amp;rsquo;t fully restore the information.)&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;div class=&#34;warning-box&#34;&gt;&#xA;    &lt;strong&gt;Orphans accumulate silently&lt;/strong&gt;&#xA;    &lt;div&gt;A polymorphic column with no FK and no cascade develops orphans over time. Reads paper over them with &lt;code&gt;LEFT JOIN ... WHERE target.id IS NOT NULL&lt;/code&gt;, so the broken rows disappear from the UI but stay in the table. 
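A one-off audit makes the rate visible. A sketch against the &lt;code&gt;notifications&lt;/code&gt; schema above (substitute your own target tables):
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;-- Count orphaned references per type (sketch; one LEFT JOIN per known target)
SELECT n.resource_type, COUNT(*) AS orphans
FROM notifications n
LEFT JOIN orders o          ON n.resource_type = 'order'   AND n.resource_id = o.id
LEFT JOIN invoices i        ON n.resource_type = 'invoice' AND n.resource_id = i.id
LEFT JOIN support_tickets t ON n.resource_type = 'ticket'  AND n.resource_id = t.id
WHERE o.id IS NULL AND i.id IS NULL AND t.id IS NULL
GROUP BY n.resource_type;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
Rows whose &lt;code&gt;resource_type&lt;/code&gt; matches none of the joins surface here too, which is often a second finding.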
In schemas a few years old, the orphan rate is rarely zero — and nobody designed for it.&lt;/div&gt;&#xA;&lt;/div&gt;&#xA;&#xA;&lt;h2 id=&#34;reads-pay-for-the-write-side-convenience&#34;&gt;Reads pay for the write-side convenience&#xA;&lt;/h2&gt;&lt;p&gt;The absent FK is the schema problem. The read-path shape is where the cost becomes daily. A query that needs any column from the referenced row can&amp;rsquo;t write a single join — the target depends on a per-row value, and SQL&amp;rsquo;s join syntax takes a static target.&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;9&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;-- Conditional LEFT JOIN per target&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;SELECT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span 
class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;message&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;       &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;COALESCE&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;o&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;order_number&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;invoice_number&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;t&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ticket_code&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;AS&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ref&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;FROM&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;notifications&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;LEFT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;JOIN&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span 
class=&#34;n&#34;&gt;orders&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;   &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;o&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ON&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resource_type&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;order&amp;#39;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;   &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;AND&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resource_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;o&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;LEFT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;JOIN&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;invoices&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ON&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resource_type&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span 
class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;invoice&amp;#39;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;AND&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resource_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;i&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;LEFT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;JOIN&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;support_tickets&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;t&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;                     &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;ON&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;resource_type&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s1&#34;&gt;&amp;#39;ticket&amp;#39;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;  &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;AND&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span 
class=&#34;n&#34;&gt;resource_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;t&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;WHERE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;n&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;user_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;42&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;;&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;Every new target type adds a join clause here and in every other read-path query that displays a related field. The alternative — a &lt;code&gt;UNION ALL&lt;/code&gt; per target — is narrower per branch but scales linearly with target count and pushes pagination up to the union level. And most ORMs&amp;rsquo; default resolution is one query per &lt;code&gt;(resource_type, resource_id)&lt;/code&gt; group, which is the N+1 pattern that makes polymorphic feeds slow once the target set widens.&lt;/p&gt;&#xA;&lt;p&gt;&amp;ldquo;One column can point at many tables&amp;rdquo; on the write side turns into &amp;ldquo;every read query enumerates every possible table&amp;rdquo; on the read side. 
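&lt;/p&gt;
&lt;p&gt;For concreteness, a sketch of that &lt;code&gt;UNION ALL&lt;/code&gt; shape, mirroring the query above with the same column names. Each branch is typed, but ordering and pagination now sit outside the union:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;-- One branch per target; the union level owns ORDER BY and LIMIT
SELECT n.id, n.message, o.order_number AS ref
FROM notifications n JOIN orders o ON n.resource_id = o.id
WHERE n.resource_type = 'order' AND n.user_id = 42
UNION ALL
SELECT n.id, n.message, i.invoice_number
FROM notifications n JOIN invoices i ON n.resource_id = i.id
WHERE n.resource_type = 'invoice' AND n.user_id = 42
UNION ALL
SELECT n.id, n.message, t.ticket_code
FROM notifications n JOIN support_tickets t ON n.resource_id = t.id
WHERE n.resource_type = 'ticket' AND n.user_id = 42
ORDER BY id DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;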
The symmetry people expect isn&amp;rsquo;t there.&lt;/p&gt;&#xA;&lt;h2 id=&#34;why-the-pattern-spreads&#34;&gt;Why the pattern spreads&#xA;&lt;/h2&gt;&lt;p&gt;It&amp;rsquo;s not a designed-in choice; it&amp;rsquo;s the path of least resistance that framework ergonomics encourage. Rails&amp;rsquo; &lt;code&gt;polymorphic: true&lt;/code&gt;, Django&amp;rsquo;s &lt;code&gt;GenericForeignKey&lt;/code&gt;, and Laravel&amp;rsquo;s &lt;code&gt;morphTo&lt;/code&gt; make a one-liner of what would otherwise be multiple &lt;code&gt;belongs_to&lt;/code&gt; associations and a migration. &amp;ldquo;Comments on orders&amp;rdquo; and &amp;ldquo;comments on invoices&amp;rdquo; look like duplication, so a single &lt;code&gt;comments&lt;/code&gt; table with &lt;code&gt;commentable_id&lt;/code&gt; / &lt;code&gt;commentable_type&lt;/code&gt; feels cleaner. An open-ended &amp;ldquo;add comments to anything&amp;rdquo; product ask reads as an argument against committing to a target list.&lt;/p&gt;&#xA;&lt;p&gt;Each of those framings overweights the write-side cost (another table or another FK column) and underweights the integrity loss (no enforcement, no cascades, schema no longer describes itself). &lt;a class=&#34;link&#34; href=&#34;https://explainanalyze.com/p/orms-are-a-coupling-not-an-abstraction/&#34; &gt;ORMs Are a Coupling&lt;/a&gt; covers the broader trade — polymorphic is the canonical case where the ORM&amp;rsquo;s preferred shape is actively incompatible with what the database wants to enforce.&lt;/p&gt;&#xA;&lt;h2 id=&#34;alternatives&#34;&gt;Alternatives&#xA;&lt;/h2&gt;&lt;p&gt;Each alternative gives back some of the database&amp;rsquo;s relational machinery at different levels of verbosity.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Per-target tables.&lt;/strong&gt; Split along the target dimension: &lt;code&gt;order_notifications&lt;/code&gt;, &lt;code&gt;invoice_notifications&lt;/code&gt;, &lt;code&gt;ticket_notifications&lt;/code&gt;, each with a real FK. 
Real cascades, real planner metadata, self-describing schema. Cost: duplicated column sets and an explicit &lt;code&gt;UNION ALL&lt;/code&gt; for cross-target reads — but that union already exists implicitly in the polymorphic shape, just moved from the read query into typed branches.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Mutually-exclusive nullable FKs with &lt;code&gt;CHECK&lt;/code&gt;.&lt;/strong&gt; One table, one FK column per target, a constraint enforcing exactly one is non-null:&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;CREATE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;TABLE&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;notifications&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span 
class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;BIGINT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;PRIMARY&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;KEY&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;user_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;BIGINT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NOT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;REFERENCES&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;users&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;order_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;BIGINT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;REFERENCES&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;orders&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span 
class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;invoice_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;BIGINT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;REFERENCES&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;invoices&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ticket_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;BIGINT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;REFERENCES&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;support_tickets&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;),&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;message&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;TEXT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NOT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span 
class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;CONSTRAINT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;exactly_one_target&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;CHECK&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;        &lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;order_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;IS&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NOT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)::&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;        &lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;invoice_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;IS&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NOT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)::&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;+&lt;/span&gt;&lt;span 
class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;        &lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;ticket_id&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;IS&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NOT&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;k&#34;&gt;NULL&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)::&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;Real FKs per target, real cascades, row&amp;rsquo;s meaning unambiguous. Scales reasonably up to a handful of targets and stops scaling cleanly somewhere around ten.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Supertype table.&lt;/strong&gt; A shared parent table carries a common ID; each target type&amp;rsquo;s table references the parent. The polymorphic column then points at the parent, which is a single real FK. 
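&lt;/p&gt;
&lt;p&gt;A minimal sketch of the shape, with the parent table name (&lt;code&gt;resources&lt;/code&gt;) invented for illustration:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;-- Hypothetical supertype: every referenceable row gets a parent id first
CREATE TABLE resources (
    id   BIGINT PRIMARY KEY,
    kind VARCHAR(50) NOT NULL  -- 'order', 'invoice', 'ticket'
);

CREATE TABLE orders (
    id BIGINT PRIMARY KEY REFERENCES resources(id)
    -- remaining order columns
);

-- The polymorphic column collapses to one real FK
CREATE TABLE notifications (
    id          BIGINT PRIMARY KEY,
    user_id     BIGINT NOT NULL REFERENCES users(id),
    resource_id BIGINT NOT NULL REFERENCES resources(id),
    message     TEXT NOT NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;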
Cleanest structural answer and the one with the highest adoption cost — retrofitting this onto an existing schema is substantial migration work.&lt;/p&gt;&#xA;&lt;h2 id=&#34;when-polymorphic-is-actually-the-right-call&#34;&gt;When polymorphic is actually the right call&#xA;&lt;/h2&gt;&lt;p&gt;The trade-offs stack up unfavorably for most common uses, but not all. The pattern earns its keep when the relationship is genuinely best-effort — audit events, activity logs, &amp;ldquo;recently viewed&amp;rdquo; lists, undo history — where a lost reference is a recoverable annoyance rather than a correctness incident. The FK was never going to be load-bearing, and the polymorphic shape matches the actual semantics: &amp;ldquo;reference anything, and if it&amp;rsquo;s gone, show a tombstone.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;Outside that zone the default bias should run the other way. A comment system with three possible parents is not a case for polymorphism; it&amp;rsquo;s a case for three comment tables or mutually-exclusive FK columns, with the ORM abstracting the read-side stitching.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-bigger-picture&#34;&gt;The bigger picture&#xA;&lt;/h2&gt;&lt;p&gt;Polymorphic references are a specific case of a broader pattern: designs that move information out of the schema and into the application, in exchange for ergonomics in the model layer. The schema drifts from &amp;ldquo;self-describing relational structure&amp;rdquo; toward &amp;ldquo;indexed key-value store the application interprets.&amp;rdquo; That&amp;rsquo;s a legitimate position — DynamoDB and friends live there on purpose — but a relational database running on polymorphic associations is paying for a relational engine and choosing not to use most of what it offers.&lt;/p&gt;&#xA;&lt;p&gt;The pattern isn&amp;rsquo;t wrong. 
It&amp;rsquo;s an aggressive trade, priced on day one by the convenience of &lt;code&gt;polymorphic: true&lt;/code&gt; and on day three hundred by the silent orphan count, the conditional joins, and &lt;code&gt;resource_id BIGINT&lt;/code&gt; telling no one what the table is related to. Reach for it on purpose, not by default — and keep the option of pulling it back onto typed FK columns open, because the migrations away are slower the longer the schema has been pretending the reference isn&amp;rsquo;t there.&lt;/p&gt;&#xA;</description>
        </item></channel>
</rss>
