Schema-Design on EXPLAIN ANALYZE

TEXT and JSON Columns: Where the Schema Goes to Hide

Thu, 25 Sep 2025 00:00:00 +0000

TL;DR

A TEXT or JSON column moves the schema out of the database catalog and into application code; the data inside has a shape, but the DDL won’t tell you what it is. Promote the fields that actually get queried into real columns, and treat the rest as genuinely opaque.

An AI assistant is asked to “find customers who upgraded to enterprise in the last quarter.” It reads the catalog, finds api_logs(id, endpoint VARCHAR, payload LONGTEXT, created_at DATETIME), and generates the reasonable query:

SELECT JSON_EXTRACT(payload, '$.action') AS action, created_at
FROM api_logs
WHERE JSON_EXTRACT(payload, '$.action') = 'upgrade'
 AND JSON_EXTRACT(payload, '$.plan') = 'enterprise'
 AND created_at >= NOW() - INTERVAL 90 DAY;

Runs clean. Returns zero rows. The actual key was renamed from action to event.type two years ago when the team adopted a shared event schema; new rows match $.event.type, old rows still match $.action, and no one migrated the historical data because it wasn’t queryable anyway. Neither column nor catalog said any of this. The query is syntactically perfect, semantically correct for the key it guessed, and wrong because no row written inside the 90-day window still carries that key.

The obvious fix is “switch to JSONB, validate with a JSON schema, add a GIN index.” Each one helps at the margin and none of them close the gap. JSONB tells you the blob is valid JSON, not what keys are in it. CHECK constraints with JSON_SCHEMA_VALID or jsonb_matches_schema work prospectively, but the six years of rows already in the table were written against five format generations and no validator reaches back in time. A GIN index accelerates key lookups but only if you know which keys to look up. The problem isn’t the storage format. The schema emigrated to application code, and changing the column type doesn’t bring it back.

What leaves the catalog when the column becomes a blob

DDL is the contract between the database and everything that reads it. A typed column says “this value is an integer between −2³¹ and 2³¹−1, and here’s the index I’ve built over it.” A TEXT or JSON column says “this value is a string the application decided on, and the application can tell you what that means.” The second contract is thinner in ways that compound.

Readers can’t discover the shape from the schema. information_schema.COLUMNS for a JSON column returns COLUMN_TYPE = 'json' and nothing else. Every tool that reads catalog metadata (MCP servers, ERD generators, typed-client code generators, AI assistants, new engineers running \d+) sees a blob. The shape lives in the serializer class, the protobuf definition, the TypeScript interface, or nowhere. Whichever of those the reader happens to find is the shape they’ll assume. See Comment Your Schema for the lowest-effort way to leave a trail, but comments can describe the shape; they can’t make the catalog enforce it.

Generational drift is silent. Year one the payload is {action, user}. A migration adds nested metadata: {action, user, metadata: {source}}. A rewrite flattens and renames: {event: {type, user_id}, source}. A new service standardizes with a version field: {version: 3, event: {...}}. All four versions are sitting in the same column with nothing to distinguish them at read time except the keys they happen to have. A JSON_EXTRACT path written against today’s producer hits the newest generation and silently misses the older ones. The failure mode is exactly the one described in Legacy Schemas Are Sediment: the schema’s history is compressed into the data, and the data can’t decompress itself.
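One way to see how many generations a column actually holds is to count rows by the keys they carry. The sketch below uses MySQL 8.0 syntax against the api_logs table from the opening example. The four branches mirror the four shapes listed above; the labels and the order of the checks are assumptions to adjust once real rows have been sampled.

```sql
-- Count rows per payload "generation", inferred from which keys are present.
-- Branch order matters: newer shapes are tested first.
SELECT
  CASE
    WHEN NOT JSON_VALID(payload)                            THEN 'not valid JSON'
    WHEN JSON_CONTAINS_PATH(payload, 'one', '$.version')    THEN 'gen 4: versioned envelope'
    WHEN JSON_CONTAINS_PATH(payload, 'one', '$.event.type') THEN 'gen 3: event.type'
    WHEN JSON_CONTAINS_PATH(payload, 'one', '$.metadata')   THEN 'gen 2: action + metadata'
    WHEN JSON_CONTAINS_PATH(payload, 'one', '$.action')     THEN 'gen 1: action'
    ELSE 'unrecognized'
  END             AS generation,
  COUNT(*)        AS row_count,
  MIN(created_at) AS first_seen,
  MAX(created_at) AS last_seen
FROM api_logs
GROUP BY generation
ORDER BY first_seen;
```

The first_seen and last_seen columns show where one generation hands over to the next, which is usually the fastest way to date a format change nobody wrote down. A large 'unrecognized' bucket means there are more generations than the reader was told about.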

Writes are untyped. Without CHECK constraints or a JSON-schema validator, the writer is the only guardrail. A service deployed last Tuesday that emits amount as the string "9900" instead of the integer 9900 silently poisons the column. Downstream queries comparing amount > 1000 work on the older rows and misbehave on the poisoned batch, because the extracted value is now a string: depending on the engine and the extraction function, the comparison turns lexicographic, falls back to a type-ordering rule, or raises an error. This is the same class of mismatch a typed column would reject on INSERT.
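A quick check for this kind of poisoning is to group by the JSON type of the extracted value rather than by the value itself. A minimal MySQL 8.0 sketch, assuming the field lives at $.amount inside api_logs.payload (the path is illustrative, not taken from a real schema):

```sql
-- One row per JSON type found at $.amount.
-- JSON_TYPE returns e.g. 'INTEGER', 'DOUBLE', 'STRING';
-- it is NULL when the path is absent from the row.
SELECT
  JSON_TYPE(JSON_EXTRACT(payload, '$.amount')) AS amount_type,
  COUNT(*)        AS row_count,
  MIN(created_at) AS first_seen,
  MAX(created_at) AS last_seen
FROM api_logs
WHERE JSON_VALID(payload)
GROUP BY amount_type;
```

More than one non-NULL type in the result means the writers disagree, and first_seen on the minority type points at the deploy that introduced it.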

The planner is working blind. Row-count estimates on JSON_EXTRACT(payload, '$.event.type') = 'upgrade' have no histogram to consult; the planner falls back to a default selectivity estimate that’s usually wrong. Plans for queries filtered on JSON fields are routinely pessimistic or optimistic by an order of magnitude, and there’s no ANALYZE to fix that because the statistics don’t exist for the interior of the blob.

Indexes are per-key, not per-column. A functional index on JSON_EXTRACT(payload, '$.event.type') accelerates one path. The next query filters on $.source and scans the table. Generated columns are the cleaner version of this (payload_event_type VARCHAR(50) GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(payload, '$.event.type'))) STORED) but each one is a schema change with a backfill, and you have to know in advance which keys matter. GIN indexes on JSONB cover arbitrary keys but are large, slow to update, and still don’t tell the reader what keys exist.

Untyped writes + untyped reads = silent schema drift

A TEXT or JSON column accepts anything the writer emits and returns exactly that on read. Two services writing to the same column with slightly different shapes don’t conflict at the database level; they produce a column whose contents depend on which service wrote the row. The divergence is invisible until a query tries to read uniformly across both.

Plausible paths, empty results

Schema-reading LLMs generate JSON_EXTRACT paths the same way they generate column names in a typed schema, by pattern-matching the column name and the question. Asked about “upgrade actions,” the model guesses $.action = 'upgrade' because the English-to-JSON-path mapping is obvious. It has no way to know that the key was renamed, that three generations coexist, or that the canonical name is now buried under two layers of nesting. The catalog gives it a column type of json and nothing else, and the model’s best guess is reasonable and wrong.

The failure pattern is familiar from other schema-hiding designs. Polymorphic references hide which table a foreign-key-shaped column points at; bare id primary keys hide which identifier is being compared; TEXT/JSON columns hide what’s in the column at all. All three are cases where the LLM generates a plausible query against a schema that isn’t telling it enough, and the query returns plausibly-shaped but semantically empty results.

The fix, and where it stops being free

The lever is being honest about what’s inside and picking the right storage per field.

Promote fields that get queried. If the application filters on event.type more than occasionally, that’s a real column. Generated columns are the low-friction middle path: derive a typed, indexable column from the JSON, keep the raw payload as the audit trail.

ALTER TABLE api_logs
 ADD COLUMN event_type VARCHAR(50) GENERATED ALWAYS AS
 (JSON_UNQUOTE(JSON_EXTRACT(payload, '$.event.type'))) STORED,
 ADD INDEX idx_event_type (event_type);

The trade-off: every promoted field is a migration, and generated columns don’t retroactively rewrite rows written with a different shape. You still need the COALESCE(JSON_EXTRACT(payload, '$.event.type'), JSON_EXTRACT(payload, '$.action')) cleanup for the old generations, but you do it exactly once, as part of the promotion, rather than in every query.
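A variant of the ALTER TABLE above that folds the old key into the promoted column might look like the following. It is a sketch in MySQL 8.0 syntax, meant to replace the earlier statement rather than run alongside it, and it assumes only the two key generations from the opening example exist.

```sql
-- Promote the event type, reading whichever key the row carries.
-- Caveats: the ALTER fails if any payload is not valid JSON, and in strict
-- mode it also fails if an extracted value is longer than 50 characters.
-- A JSON null at $.event.type is not SQL NULL, so COALESCE would keep it
-- and JSON_UNQUOTE would store the string 'null'.
ALTER TABLE api_logs
  ADD COLUMN event_type VARCHAR(50) GENERATED ALWAYS AS (
    JSON_UNQUOTE(
      COALESCE(
        JSON_EXTRACT(payload, '$.event.type'),  -- current generation
        JSON_EXTRACT(payload, '$.action')       -- pre-rename generation
      )
    )
  ) STORED,
  ADD INDEX idx_event_type (event_type);

-- Readers filter on the typed column and stop guessing JSON paths.
SELECT id, created_at
FROM api_logs
WHERE event_type = 'upgrade'
  AND created_at >= NOW() - INTERVAL 90 DAY;
```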

Enforce new writes with a JSON schema. The pg_jsonschema extension for PostgreSQL and MySQL’s JSON_SCHEMA_VALID (8.0.17 and later) let a CHECK constraint reject writes that don’t match a declared schema. This doesn’t fix existing rows; it does stop the next silent format change from landing. If the team doesn’t already have a shared event schema, a CHECK constraint is the forcing function that produces one.
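Because MySQL validates existing rows when a CHECK constraint is added, a constraint on a drifted column has to let the old generations through. One possible shape, sketched below, gates the schema check on a cutoff timestamp. The cutoff date, the constraint name, and the schema itself are all placeholders, not the team's real event schema.

```sql
-- Rows created before the cutoff pass unconditionally;
-- rows created on or after it must match the declared shape.
ALTER TABLE api_logs
  ADD CONSTRAINT chk_api_logs_payload_shape CHECK (
    created_at < '2025-10-01 00:00:00'   -- hypothetical cutoff
    OR JSON_SCHEMA_VALID(
         '{
            "type": "object",
            "required": ["version", "event"],
            "properties": {
              "version": {"type": "integer"},
              "event": {
                "type": "object",
                "required": ["type"],
                "properties": {"type": {"type": "string"}}
              }
            }
          }',
         payload
       )
  );
```

JSON_SCHEMA_VALID raises an error, rather than returning false, when its second argument is not valid JSON. If older rows may hold unparseable text, it is worth running a JSON_VALID count first, since the order in which the two sides of the OR are evaluated is not something to rely on.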

Version the payload explicitly. {"version": 3, "payload": {...}} at the top lets every reader dispatch on version instead of inferring it from which keys happen to be present. Doesn’t help rows written before versioning started, but bounds the drift going forward and turns “which generation is this row?” from archaeology into a lookup.

Document what stays inside. Comments on the column (“see github.com/org/events for the schema; versions 1–3 coexist in rows older than 2024-Q2”) won’t replace types, but they give the reader a place to look. Comments on the schema are cheap, in-place, and propagate through every tool that reads the catalog; for genuinely-opaque columns this is the best available signal.

When JSON is actually the right answer

The pattern earns its keep in specific shapes where the alternative (typed columns) is worse.

Truly variable shape per row. User-supplied settings blobs, custom-field configurations, extension points where the keys are genuinely per-tenant or per-user. Modeling each variant as a column produces a wide table full of NULLs; see God Tables for the cost of that direction. The column is honest about being schemaless because the data is schemaless.

Audit payloads nobody queries. Raw API request/response bodies retained for compliance, debug traces, incident forensics. Written once, read by humans one row at a time, never aggregated. The lack of a queryable schema is fine because no query needs one. A sensible default here is to keep the payload compressed and add a small set of typed columns (endpoint, status_code, user_id, created_at) for the predicates the operational queries actually use.

Short-lived staging. Job queues, idempotency cache payloads, outbox entries, where the producer and consumer are deployed together, the payload is read once, and the row is deleted on completion. Drift can’t accumulate in rows that don’t stay around.

Document stores on purpose. PostgreSQL JSONB with a stable schema, validated on write, with functional indexes on the paths that matter. This is a real design; it’s not the unspoken default that most TEXT columns represent. If the team is reaching for JSONB and treating it as a document store, it should look like one (with validation, indexes, and documentation) not like a TEXT column that happens to parse.

The bigger picture

A TEXT or JSON column is a specific architectural choice: move part of the schema out of the catalog, in exchange for cheaper writes and looser contracts between producer and consumer. When the trade is deliberate (genuinely variable data, write-once audit, short-lived buffer) it’s the correct shape. When it’s the path of least resistance because typed columns would require a migration, the cost is deferred to every future reader who has to reconstruct the format from commit history.

Databases are good at enforcing the contracts they know about. The column types are how they know. Every field that matters to a query deserves to be in the part of the schema the database can see; everything else is honestly opaque and should look it. The default drift (“stick it in the payload, we’ll parse it later”) produces columns whose contents nobody fully knows, including the team that wrote them.

Reading the Schema Is Not Reading the Data

Mon, 08 Sep 2025 00:00:00 +0000

TL;DR

A schema describes the shape the database enforces; the data inside follows a second set of conventions (soft-delete coverage, sentinel values, encoding quirks, format drift) that live nowhere the catalog can show. Queries written from the DDL alone run clean and return results that look right and mean something different. Treat the data as a second source that has to be read, sampled, and documented alongside the types.

An engineer (or an AI) writes a query to find pending orders:

SELECT id, total_cents, created_at
FROM orders
WHERE status = 1
 AND created_at > NOW() - INTERVAL 7 DAY;

orders.status is TINYINT NOT NULL. The query runs. Forty thousand rows come back. Most of them shipped days ago. The mistake lives in the column’s other life: status on this table is a boolean is_processed flag where 1 means “has been through the fulfillment pipeline.” The order lifecycle state (pending, processing, shipped, delivered, cancelled) is in orders.state, also TINYINT NOT NULL, also no comments, and whoever read the schema first picked the column whose name they recognized. The DDL was no help; both columns have the same type, the same nullability, and the same look in information_schema. The data was telling the real story, and the data wasn’t read.

The obvious fix is “add comments, use ENUM, lint for ambiguous names.” Each of those helps on new columns and the next migration. None of them touch the existing data, which is where the ambiguity actually lives: forty thousand rows of status = 1 that mean one thing on this table and a different thing on its sibling, ten million VARCHAR dates written by five generations of code in four formats, and a users table where rows with email = 'DO_NOT_USE@test.com' have been on the leaderboard for two years. Fixing forward keeps the problem from growing. Reading the data is how you find out what’s already there.

Five ways the data disagrees with the schema

These are not the exotic cases. They show up in nearly every mature production database, and each one is a place where a schema-only read produces a plausible, wrong query.

TINYINT(1) is polysemic. It stores a boolean flag (is_active, has_seen_onboarding, email_verified), a small enum (lifecycle states, tier levels, priority), a bit-packed byte (eight flags in a single column), or a count that never exceeds 127. All four uses produce identical entries in information_schema. Naming conventions (is_*, has_*, can_* for booleans; _type, _status, _level for enums) are the informal signal, and like every informal signal, they’re applied inconsistently and broken in legacy tables. See Schema Conventions and Why They Matter for the prescriptive side; this is the descriptive reality.

Soft-delete coverage is partial. Some tables have deleted_at TIMESTAMP NULL. Some have is_deleted TINYINT(1) DEFAULT 0. Most have neither, because the original author decided the table didn’t need soft deletes and nobody revisited. A query that correctly filters WHERE deleted_at IS NULL on customers returns the right answer; the same pattern applied to addresses either errors out (column doesn’t exist) or silently matches everything (column exists but is always NULL because the application never writes to it). There’s no global rule to encode and no way to know from the catalog which tables fall in which bucket. You have to read the data, or read the application code that writes to it (which is usually worse).
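The catalog can at least say which tables have a soft-delete column at all; whether the column is ever written takes a second query against the data. A sketch in MySQL 8.0 syntax, assuming the two column names mentioned above are the only conventions in use:

```sql
-- Step 1: which base tables in the current schema have which soft-delete column?
SELECT
  t.TABLE_NAME,
  COALESCE(MAX(c.COLUMN_NAME = 'deleted_at'), 0) AS has_deleted_at,
  COALESCE(MAX(c.COLUMN_NAME = 'is_deleted'), 0) AS has_is_deleted
FROM information_schema.TABLES AS t
LEFT JOIN information_schema.COLUMNS AS c
       ON c.TABLE_SCHEMA = t.TABLE_SCHEMA
      AND c.TABLE_NAME   = t.TABLE_NAME
      AND c.COLUMN_NAME IN ('deleted_at', 'is_deleted')
WHERE t.TABLE_SCHEMA = DATABASE()
  AND t.TABLE_TYPE   = 'BASE TABLE'
GROUP BY t.TABLE_NAME
ORDER BY t.TABLE_NAME;

-- Step 2: for a table that has the column, is it ever populated?
-- COUNT(col) counts only non-NULL values.
SELECT
  COUNT(*)          AS total_rows,
  COUNT(deleted_at) AS soft_deleted_rows
FROM customers;
```

A table where soft_deleted_rows is zero after years of traffic is the "column exists but is never written" bucket: filtering on deleted_at IS NULL there is harmless and also meaningless.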

VARCHAR dates in multiple formats. A column called signup_date VARCHAR(20) is a tell. The first generation of rows has YYYY-MM-DD. A rewrite that switched import vendors introduced MM/DD/YYYY. An international expansion produced DD/MM/YYYY for rows that came in through a specific endpoint and DD-Mon-YYYY for one partner’s CSV imports. All four formats live in the same column. WHERE signup_date >= '2025-01-01' is a string comparison, so it matches the first generation correctly and judges every other row by its leading characters: “31/01/2019” matches because “3” sorts after “2”, “15/03/2025” doesn’t because “1” sorts before “2”, and the DD-Mon-YYYY rows are in or out depending on the day of the month rather than the year. The query returned rows, so the reviewer moved on.
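Counting rows by the pattern they match shows how the column is actually split. The following is a sketch in MySQL 8.0 syntax; the table name users is an assumption, and the patterns only classify by shape, so they cannot tell MM/DD/YYYY from DD/MM/YYYY.

```sql
-- Classify signup_date strings by shape, not by meaning.
SELECT
  CASE
    WHEN signup_date REGEXP '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'    THEN 'YYYY-MM-DD'
    WHEN signup_date REGEXP '^[0-9]{2}/[0-9]{2}/[0-9]{4}$'    THEN 'NN/NN/YYYY (MM/DD or DD/MM)'
    WHEN signup_date REGEXP '^[0-9]{2}-[A-Za-z]{3}-[0-9]{4}$' THEN 'DD-Mon-YYYY'
    ELSE 'other'
  END              AS format_guess,
  COUNT(*)         AS row_count,
  MIN(signup_date) AS example_low,
  MAX(signup_date) AS example_high
FROM users
GROUP BY format_guess
ORDER BY row_count DESC;
```

The slash bucket is the one that needs outside knowledge: a value like 03/04/2024 is a valid date under both readings, so the format has to come from which endpoint or import wrote the row, not from the string.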

Sentinel values and test data. Row with user_id = 0 means “anonymous.” Row with email = 'DO_NOT_USE@test.com' is a test account that’s been in production for three years because nobody wanted to take responsibility for deleting it. Row with created_at = '1970-01-01 00:00:00' is a backfill where the original timestamp was unknown and epoch zero got written as a placeholder. Every one of these is an intentional violation of the apparent meaning of the column, and every schema-level read treats them as ordinary data. Copilot ranked DO_NOT_USE as the top customer with $99,999 in revenue because the row had the highest total; the test record had been sitting there for years, visible to anyone who queried the table but invisible to anyone who only read the DDL.

Input-convention drift. VARCHAR(255) accepts “Acme Corp,” “ACME CORPORATION,” “Acme Corp.,” “acme corp,” and “ACME  CORP” (two spaces, somebody’s trailing whitespace bug). All five are the same company in different rows. The unique constraint, if it exists, didn’t catch any of them because they’re not byte-identical. Any query that groups or joins on the text field silently double-counts - not by a small amount, by however much the convention drift is worth. Encoding quirks compound: café in NFC and NFD look identical in the terminal and hash differently; case-folding depends on collation; trailing whitespace varies by source system.
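A rough way to size this drift is to group on a normalized key and count how many byte-distinct spellings collapse into it. The sketch below uses MySQL 8.0 syntax; the table and column names (customers, company_name) are hypothetical, and the normalization only folds case and whitespace.

```sql
-- Find names that differ only by case or whitespace.
SELECT
  LOWER(TRIM(REGEXP_REPLACE(company_name, '[[:space:]]+', ' '))) AS name_key,
  COUNT(DISTINCT CAST(company_name AS BINARY))                   AS distinct_spellings,
  COUNT(*)                                                       AS row_count
FROM customers
GROUP BY name_key
HAVING distinct_spellings > 1
ORDER BY row_count DESC;
```

This catches “acme corp” versus “ACME  CORP” but not “Acme Corp.” versus “ACME CORPORATION”, and it does nothing for NFC versus NFD. It is a lower bound on the double-counting, not a deduplication.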

Why the catalog can’t tell you this

information_schema describes the contract the database enforces on writes. That contract is narrow: types, nullability, defaults, constraints, foreign keys. It doesn’t describe what got written before the constraint was added (almost all of it), what gets written by code paths that bypass the ORM (a surprising fraction of it), or what the application decided to write into a column that the database happily accepts because the type matches.

Type compatibility is a floor, not a ceiling. TINYINT NOT NULL excludes strings, NULLs, and integers outside [-128, 127]. It doesn’t exclude 1 meaning five different things in five different tables, because that’s not a type constraint - it’s a semantic one, and the database has no vocabulary for semantics. The same logic applies to NULL handling: the catalog tells you a column is nullable; it doesn’t tell you whether NULL means “unset,” “not applicable,” “still in progress,” or “data lost during the 2019 migration.”

LLMs inherit this limitation directly. A model generating SQL from the catalog sees column names and types, not data distributions. It has no way to tell that status is polysemic across tables, that deleted_at exists on four of the six relevant tables, or that signup_date holds four formats. The LLM’s best guess is the one a new engineer would make: the schema looks uniform, so the data probably is. Neither is wrong in general; both are wrong often enough in mature databases to produce plausibly-shaped and semantically-hollow query results. This is the generalization of the specific patterns covered in Legacy Schemas Are Sediment; legacy schemas are one source of data drift, and there are others.

Runs clean, returns plausible, means something else

Schema-only queries fail in the quietest way a query can fail. The SQL is syntactically correct. The types match. Rows come back. Some fraction of those rows mean what the author intended, and some fraction mean something else, and there’s no signal at the database level telling you which is which. Reviewers who only look at the query text can’t catch it. The data is where the check has to happen.

The fix is a habit, not a migration

You can’t retroactively enforce a schema on ten years of writes. You can change what the next reader (human or model) has available before they generate the next query.

Profile before you query. Before writing a predicate against an unfamiliar column, run a one-liner: SELECT col, COUNT(*) FROM t GROUP BY col ORDER BY COUNT(*) DESC LIMIT 20. For low-cardinality columns (status, type, flags) this reveals the actual value distribution in thirty seconds and catches the flag-versus-enum mistake before the query ships. For higher-cardinality columns, sample: SELECT col FROM t ORDER BY RAND() LIMIT 50. The time cost is minutes; the catch rate is substantial.
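Applied to the opening example, the profiling one-liner is most useful as a cross-tab of the two look-alike columns. A sketch against the orders table described earlier:

```sql
-- How do status and state actually co-occur?
SELECT
  status,
  state,
  COUNT(*)        AS row_count,
  MIN(created_at) AS first_seen,
  MAX(created_at) AS last_seen
FROM orders
GROUP BY status, state
ORDER BY row_count DESC
LIMIT 20;
```

A column that only ever holds 0 and 1 while its neighbor ranges over 1 through 5 answers the flag-versus-enum question before any predicate is written. On very large tables, ORDER BY RAND() sorts every row, so a sample bounded by a primary-key or date range is usually the cheaper choice.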

Comment the columns the DDL can’t describe. A one-line comment on orders.state ('Pending=1, Processing=2, Shipped=3, Delivered=4, Cancelled=5') and on orders.status ('Boolean: 1 if order has been through fulfillment') is the difference between a reader who gets it right and one who guesses. Comment Your Schema covers the mechanics in full; for the flag/enum disambiguation specifically, this is the highest-leverage fix per character of effort anywhere in schema maintenance.
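In MySQL the comment is part of the column definition, so adding one means restating the column. The sketch below follows the opening example, where status is the processed flag and state is the lifecycle; it assumes both columns are plain TINYINT NOT NULL with no default, and any other attributes would have to be repeated in the MODIFY clause.

```sql
ALTER TABLE orders
  MODIFY COLUMN status TINYINT NOT NULL
    COMMENT 'Boolean: 1 if order has been through fulfillment',
  MODIFY COLUMN state TINYINT NOT NULL
    COMMENT 'Pending=1, Processing=2, Shipped=3, Delivered=4, Cancelled=5';

-- The comments are now visible to anything that reads the catalog.
SELECT COLUMN_NAME, COLUMN_TYPE, COLUMN_COMMENT
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME   = 'orders'
  AND COLUMN_NAME IN ('status', 'state');
```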

CHECK constraints for new values. CHECK (state IN (1,2,3,4,5)) is the forcing function for the next writer. It won’t clean up existing rows, and it won’t stop a future engineer from reaching for 6, but it will fail loudly when they try, instead of silently accepting a value the readers of the table don’t know about. On nullable columns, CHECK (deleted_at IS NULL OR deleted_at > created_at) catches the backfill-sentinel case.
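As DDL, the constraints might be added like this. It is a sketch in MySQL syntax (CHECK constraints are enforced from 8.0.16), with hypothetical constraint names, and it again takes state to be the lifecycle column and status the flag, as in the opening example.

```sql
-- MySQL validates existing rows when a CHECK is added, so each ALTER
-- fails if any current row already violates its constraint.
ALTER TABLE orders
  ADD CONSTRAINT chk_orders_state  CHECK (state IN (1, 2, 3, 4, 5)),
  ADD CONSTRAINT chk_orders_status CHECK (status IN (0, 1));

ALTER TABLE customers
  ADD CONSTRAINT chk_customers_deleted_after_created
    CHECK (deleted_at IS NULL OR deleted_at > created_at);
```

If an ALTER fails, the error is itself a profiling result: there are rows outside the documented set, and they need to be found with a GROUP BY before the constraint can land.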

Migrate VARCHAR dates when you can afford it. The migration is real work: parse each row, fail loudly on unparseable formats, pick a canonical representation, backfill. Leaving VARCHAR in place guarantees the next query is written against whichever format the author happened to sample. The right-sized fix in the meantime: a comment on the column listing the known formats, and a view that exposes a parsed DATE for the queries that can tolerate loss on the unparseable rows.
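The interim view could be sketched as follows, in MySQL 8.0 syntax. The table name users, the view name, and the id column are assumptions, and the slash-separated formats are deliberately left unparsed because the string alone cannot say whether the day or the month comes first.

```sql
CREATE VIEW users_signup_parsed AS
SELECT
  id,
  signup_date AS signup_date_raw,
  CASE
    WHEN signup_date REGEXP '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'
      THEN STR_TO_DATE(signup_date, '%Y-%m-%d')
    WHEN signup_date REGEXP '^[0-9]{2}-[A-Za-z]{3}-[0-9]{4}$'
      THEN STR_TO_DATE(signup_date, '%d-%b-%Y')
    ELSE NULL  -- ambiguous or unknown format: exposed as NULL, not guessed
  END AS signup_date_parsed
FROM users;

-- Queries that can tolerate the loss filter on the parsed column;
-- the NULL count says how much is being lost.
SELECT
  COUNT(*)                             AS total_rows,
  COUNT(signup_date_parsed)            AS parsed_rows,
  COUNT(*) - COUNT(signup_date_parsed) AS unparsed_rows
FROM users_signup_parsed;
```

Keeping the raw string next to the parsed value makes the view auditable, and returning NULL instead of a best guess keeps a wrong date from ever looking like a right one.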

Treat data profiling as part of review. When a PR adds a new query, the reviewer’s first question is “does this predicate match the data?”, which requires actually looking at the data, not just the query. For AI-assisted development this is even more load-bearing: the model generated the query from the catalog, so the human review is the only layer that can compare the query’s predicates to the column’s actual contents.

When schema-only reading is fine

Not every database carries this baggage. Three cases where the schema really is the data’s description:

Schemas designed from scratch with strict conventions. New services, greenfield tables, codebases where every column has a comment, every enum is an ENUM type, and every date column is DATE or TIMESTAMPTZ. The drift hasn’t had time to accumulate, and the conventions are enforced by linters on migrations. The failure modes described above can still show up; they show up as bugs that get caught, not as the steady-state of the table.

Small, single-team databases. Twenty tables, three engineers, all the data flowing through one service. Everyone who writes to the table knows what the conventions are; the data drift is small because there are only three writers. The cost of the habit described above exceeds the cost of the drift it catches. Grow the team or the table count by a factor of ten and the math flips.

Analytical warehouses that expect exploration. In a BigQuery, Snowflake, or ClickHouse dataset built for analytics, everyone who queries the data profiles it as a matter of course: sample the column, check the distribution, look for nulls. The profiling habit is already the workflow; the schema is treated as a hint rather than a contract. This is the part of the data stack where reading the data is assumed, and the failure mode is correspondingly rare.

The bigger picture

A production database has two artifacts worth reading: the DDL the engine enforces, and the data the engine happens to hold. The first is legible, indexed, and comes with tooling; the second is tribal knowledge, distributed across rows written by years of code, and invisible to every tool that stops at the catalog. Everyone from new engineers to LLMs reads the first artifact and assumes it describes the second, which is true in schemas fresh enough to have no drift and false in every schema old enough to have generated any.

Rigor on new tables pays off, but the larger lever is routine comparison between what the schema says and what the data does: sampling before querying, commenting columns whose meaning isn’t self-evident, treating data profiling as part of review rather than a debugging step. None of it scales to “we documented the whole schema in one sprint.” It scales one column at a time, on the columns that are about to be queried, until the fraction of the schema that lies to its readers is small enough to stop costing incidents.

God Tables: 150 Columns and the Quiet Cost of 'Just Add a Column'

Tue, 05 Aug 2025 00:00:00 +0000

TL;DR

A wide table looks cheap because every column was added for a real reason; the expensive part is that rows grow, every write amplifies, and every secondary index inherits the bloat. The fix is splitting by access pattern (columns read together stay together, rarely-touched columns move out), not aggressive normalization that trades one wide table for six-way joins on every read.

The schema started clean four years ago: users(id, email, password_hash, created_at), four columns. Today the table is renamed customers and has 184 columns. Billing address. Shipping address. Three additional shipping addresses numbered 2 through 4. preferences_json for user settings. Twelve feature-flag TINYINTs. Three Stripe identifiers from three processor migrations. last_login_at, last_seen_at, last_purchase_at, last_notification_sent_at. Forty more columns whose meaning lives in Confluence, if anywhere. No single ALTER TABLE ADD COLUMN was unreasonable at the time. The accumulated result is an average row size of 6KB, an UPDATE to last_login_at that rewrites every byte of it, and a buffer pool holding two customer rows per page instead of forty.

The obvious fix is to normalize it: split into customer_profile, customer_billing, customer_addresses, customer_preferences, customer_feature_flags, customer_audit. That’s the textbook answer and it’s the one that breaks the moment you look at the dominant read. The list view on the admin page needs name, email, status, last login, Stripe status, and total spent. Now it’s a six-way join on every page load. The fix that looked clean in the migration doc makes the most-frequent query more expensive, not less. The read cost moves to the place it’s paid most often, and somebody (usually a few months later) proposes a materialized view to “just flatten it back out,” which is the god table returning through a different door.

How a row-store actually reads a row

Before the cost math makes sense: OLTP engines like InnoDB and PostgreSQL’s heap store complete rows laid out contiguously on fixed-size pages - typically 16KB in InnoDB, 8KB in PostgreSQL. A page holds as many rows as fit. When a query needs one column of one row, the engine doesn’t read that column alone; it locates the row’s page via an index lookup or scan, loads the whole page into the buffer pool, and reads the requested column out of the in-memory row image.

The one exception is the index-only scan: if every column the query projects and filters on is already present inside an index, the base table doesn’t have to be touched and only the index pages are loaded. See Covering Index Traps for how quickly this optimization disappears, usually the moment a SELECT list grows by one column. Every other read path goes through the row, which means the row’s width sets how many useful rows each loaded page carries. Reading email from a 184-column customer row pulls in a 16KB page that holds two such rows to return 50 bytes; with 800-byte rows, the same page would hold about twenty. The buffer pool is a fixed size and every byte of unused column data in it is displacing something another query needs.

Column stores (ClickHouse, BigQuery, Parquet-backed warehouses) invert this entirely. Data is laid out by column, so reading one column reads only that column’s storage. The wide-table cost math doesn’t apply there, which is why this anti-pattern is specifically a row-store OLTP problem and why denormalized fact tables in analytical warehouses are fine at 300 columns.

What 150 columns actually costs

The individual cost of one column is negligible. The system-level cost shows up in several places at once, and none of them are visible in a diff that adds one more.

Row size and write amplification. InnoDB updates rows in place inside 16KB pages, so an UPDATE to one column dirties the whole page the row lives on; the wider the row, the fewer rows share each page and the more pages a given batch of updates has to flush. On a 184-column table averaging 6KB per row, updating last_login_at on every sign-in dirties a page shared with at most one other customer in order to change 8 bytes. PostgreSQL doesn’t rewrite in place (MVCC creates a new tuple for every UPDATE and marks the old one dead), so the new tuple is 6KB too, and VACUUM has that much more to reclaim. Either engine, the write cost per logical change scales with row width.

Buffer pool density. The page-per-read mechanism above means buffer-pool efficiency scales inversely with row width. At 6KB per row, an InnoDB 16KB page holds two rows; at 400 bytes per row it holds forty. A database with 10GB of buffer pool has the effective working set of a much smaller instance once rows get wide. Queries that used to run hot start touching disk for no reason other than that the rows they cared about no longer fit in memory alongside the rows other queries cared about.
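The rows-per-page arithmetic can be approximated from the catalog. The sketch below reads MySQL's information_schema.TABLES; TABLE_ROWS and AVG_ROW_LENGTH are estimates for InnoDB, and the division ignores page headers and fill factor, so the result is an upper bound rather than a measurement.

```sql
-- Widest tables in the current schema and roughly how many rows fit per page.
SELECT
  TABLE_NAME,
  TABLE_ROWS,
  AVG_ROW_LENGTH,
  FLOOR(@@innodb_page_size / AVG_ROW_LENGTH) AS approx_rows_per_page
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND ENGINE = 'InnoDB'
  AND AVG_ROW_LENGTH > 0
ORDER BY AVG_ROW_LENGTH DESC
LIMIT 20;
```

With the default 16KB page, the two figures from the paragraph above fall out directly: 16384 / 6144 rounds down to 2 rows per page, and 16384 / 400 rounds down to 40.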

Secondary indexes inherit the width problem. Every secondary index in InnoDB carries a copy of the primary key at its leaves; every index entry is a key-columns + PK-copy record. A wide table tends to accumulate indexes: you index email, Stripe ID, last-login, phone, region, account-manager-ID, each for a different query path. Six secondary indexes on a 184-column table isn’t unusual. The entries themselves are sized by the key columns and the PK copy, not by the row, but every lookup that isn’t covered ends with a fetch of the 6KB row from the clustered index, and every INSERT or UPDATE of an indexed column has six more structures to maintain. Covering indexes are also harder to arrange: the list view wants eight columns projected, and indexing eight columns of a 184-column table to cover one query is an expensive trade.

Lock and transaction width. Every UPDATE acquires a row-level lock. Transactions that touch a wide row hold that lock for the duration of the transaction, and because the row spans many concerns (billing, preferences, audit timestamps) transactions from unrelated code paths contend on the same row. A background job updating last_seen_at now serializes against a billing job updating stripe_customer_id on the same customer, because both paths lock the same row. In the split-by-concern shape, they’d contend on different rows of different tables.

Schema migrations get more expensive. ALTER TABLE ADD COLUMN on a 184-column table is slower, holds metadata locks longer, and has a larger blast radius if it fails. MySQL’s online DDL is usually fine for NULL-default additions; PostgreSQL is generally fast for the same case. Any migration that needs to rewrite rows (changing a column type, adding NOT NULL with a backfill) scales with row size, and a 6KB row rewrite on 200 million rows is a different operation than an 800-byte row rewrite on the same count.

Every column is a commitment

The cost of adding a column is small and immediate. The cost of having 150 columns is systemic and deferred: buffer-pool density, index size, write amplification, lock contention, migration cost. None of the deferred costs are visible in the PR that adds one more column, which is why they accumulate uncorrected until the table is painful.

Why LLMs make this worse

Schema drift in the wide-table direction is what language models reinforce by default. A model generating ALTER TABLE for a feature request reads the current schema and proposes the smallest change that makes the feature work, which is almost always adding columns to the table that already holds the related data. Proposing a split requires understanding the access pattern, the transaction boundaries, and the write frequency of the new columns versus the existing ones. None of that is in the CREATE TABLE.

The loop reinforces itself: the wider the table gets, the more natural it is for the next change to widen it further. “Where do loyalty tier and tier expiry go?” The model sees customers has every other user-attached concept in it and adds two columns. The alternative (CREATE TABLE customer_loyalty (customer_id PK FK, tier, expires_at)) requires the model to argue for a split, and splits are rare in the training data compared to additions because splits are rare in real codebases for the same reason: they’re harder to ship than additions. The model is correctly pattern-matching on what humans actually do, which is exactly the problem.

ORMs compound this. One model equals one table is the default shape in ActiveRecord, Django ORM, Prisma, SQLAlchemy, and Ecto. Refactoring a Customer model into three co-owned tables is a change that touches every query, every serializer, every test. The ORM makes “add a column to the existing model” a five-line change and “split the model” a project. Engineers pick the cheap option every time, and the wide table ratchets.

Split by access pattern, not by concept

“Normalize it” isn’t the fix because normalization is a property of data shape, not query cost. The fix is to look at what columns are actually read and written together, and keep those co-located; the rest moves out.

A workable decomposition for the customers example:

Core hot table. The columns read on nearly every query: id, email, name, status, tier, stripe_customer_id, created_at. Maybe twenty columns. This is what the list view, the auth path, and most API responses need.
1:1 cold tables. Concerns that are read rarely or in specific flows: customer_audit for login/seen/purchase timestamps, customer_preferences for user settings, customer_feature_flags for the twelve TINYINT flags. Each is a separate table with customer_id as PK and FK, joined only when the flow actually needs it. Writes to last_login_at stop rewriting the billing row.
1:N tables for repeating groups. Addresses, payment methods, anything that was modeled as shipping_address_2, shipping_address_3, shipping_address_4 is an addresses table with a FK and a type. This collapses polymorphic-ish schema decisions that shouldn’t have been made at the column level in the first place; see Polymorphic References for the related pattern where doing this without a FK goes wrong.

The trade-off is that some queries now join two or three tables instead of reading one. Most of the time this is fine: the joins are PK-equals-FK, the joined tables are narrow, and the read is usually cheaper than scanning a fat row. The cold path is where the extra join lands: the audit screen now joins customers to customer_audit, which costs one indexed lookup and nobody notices. The place to be careful is the query that reads from three of the split tables on every request. If that query dominates, one of those tables probably belongs merged back in.
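A minimal sketch of the hot/cold split, assuming Python's built-in sqlite3 module as a stand-in for the production engine. The column lists are trimmed for illustration, and only customers and customer_audit are shown.

```python
import sqlite3

# In-memory SQLite stands in for the real OLTP engine; columns are trimmed.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (            -- core hot table
    id     INTEGER PRIMARY KEY,
    email  TEXT NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE customer_audit (       -- 1:1 cold table; the PK doubles as the FK
    customer_id   INTEGER PRIMARY KEY REFERENCES customers(id),
    last_login_at TEXT
);
INSERT INTO customers VALUES (1, 'alice@example.com', 'active');
INSERT INTO customer_audit VALUES (1, '2025-01-01 09:00:00');
""")

# The login path writes only the narrow audit row.
conn.execute(
    "UPDATE customer_audit SET last_login_at = ? WHERE customer_id = ?",
    ("2025-06-01 08:30:00", 1),
)

# The list view reads only the hot table; no join involved.
hot_row = conn.execute(
    "SELECT email, status FROM customers WHERE id = 1"
).fetchone()

# The audit screen pays for one PK-equals-FK join.
audit_row = conn.execute("""
    SELECT c.email, a.last_login_at
    FROM customers c
    JOIN customer_audit a ON a.customer_id = c.id
    WHERE c.id = 1
""").fetchone()

print(hot_row)
print(audit_row)
```

The login update never touches the customers row, and the only query that pays for the join is the one that actually needs the audit columns.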

When a wide table is actually fine

Not every 100-column table is a god table. Three cases where width is defensible:

Analytical and reporting tables on columnar storage. As noted above, warehouses like ClickHouse, BigQuery, and Redshift invert the cost calculus. Reading one column doesn’t load the rest, and the normalization pressure flips: denormalize aggressively because joins are expensive and per-column reads are cheap. This anti-pattern is specifically a row-store OLTP problem.

Small tables that stay small. A tenants table with 80 columns and 500 rows fits entirely in the buffer pool. The write amplification is paid a few thousand times a day, not a few million. The secondary-index cost is negligible because the indexes are small. Width matters when row count is large enough for the per-row cost to dominate; on small tables it doesn’t.

Every query reads every column. Uncommon but real. If the dominant read is “fetch the full customer record for display” and the split would produce a join that runs on every request anyway, the split doesn’t help. The test is whether the queries you actually run touch disjoint column sets. If they do, the split has a real win; if they don’t, it’s architecture for its own sake.

The bigger picture

Relational databases aren’t built for developer convenience. They’re built for storage efficiency and retrieval speed: narrow rows, well-placed indexes, joins on indexed keys, query plans that read only what they need. Normalization isn’t an academic ideal; it’s the shape that lines up with how the engine actually pays its bills. Every cost mechanism in this post (buffer-pool density, write amplification, index bloat, row-lock width) is the engine reporting the same thing in different dialects: the shape you’re asking it to hold isn’t the shape it was optimized for.

God tables are the limit of a sequence of rational local decisions where the global cost is invisible at each step. The column count of a mature production table is usually a decent proxy for how long the team has been making the cheap choice, which is most teams most of the time, and that is not by itself a failure. The failure is that the cost goes uncounted. A 6KB row is a write-amplification multiplier on every UPDATE, a buffer-pool multiplier on every read, and an index-size multiplier on every secondary index. None of those costs are on the PR that adds a column; all of them are on the dashboard that shows p99 drifting up quarter after quarter.

The lever is to count the cost at the system level when the table hits a certain width (pick a threshold: sixty columns, a hundred, whatever fits) and make the next column addition a conversation about whether this concern belongs here, not a line in a migration. The answer is often still yes, but it shouldn’t be the default answer.
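One way to make the threshold mechanical is a small check that counts columns per table and flags anything over the limit. The sketch below assumes SQLite via Python's sqlite3 module; wide_tables and WIDTH_THRESHOLD are hypothetical names, and the threshold is set artificially low so the toy schema trips it. On MySQL or PostgreSQL the same count would come from information_schema.columns.

```python
import sqlite3

WIDTH_THRESHOLD = 5  # hypothetical; a real schema would use something like 60 or 100


def wide_tables(conn, threshold):
    """Return {table_name: column_count} for tables wider than the threshold."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    flagged = {}
    for name in names:
        # PRAGMA table_info returns one row per column of the table.
        column_count = len(conn.execute(f'PRAGMA table_info("{name}")').fetchall())
        if column_count > threshold:
            flagged[name] = column_count
    return flagged


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tenants (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customers (
    id INTEGER PRIMARY KEY, email TEXT, name TEXT, status TEXT,
    tier TEXT, last_login_at TEXT, marketing_opt_in INTEGER
);
""")

flagged = wide_tables(conn, WIDTH_THRESHOLD)
print(flagged)
```

Run in CI against a migrated scratch database, a check like this turns the next column on a flagged table into a reviewed decision rather than a default.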

Legacy Schemas Are Sediment, Not Design

Tue, 01 Jul 2025 00:00:00 +0000

TL;DR

A legacy schema looks like a design and reads like sediment: layers of decisions from different eras, where names that once described the data no longer do and conventions that look uniform aren’t. Renaming is prohibitively expensive once every caller depends on the current names. The workable fix is documenting the drift so the next reader (human or LLM) can navigate what’s actually there.

A new engineer joins the team and reads the schema. tmp_orders looks like scaffolding, something to delete once the real migration ships. The tech lead answers: never delete it. tmp_orders is the main orders table. The temp-to-permanent rename was planned for 2017, nobody shipped it, and every service in the company now writes to the table. The name is a lie the schema tells every new reader, and every LLM generating SQL against the catalog.

The obvious fix is to rename the table. Nothing about the database itself prevents it: drop the tmp_ prefix, update every call site, ship. The reality is that every service, ORM model, report, integration, and runbook references tmp_orders by name. The rename is a multi-quarter effort that crosses team boundaries, and the only justification is legibility. Teams rarely prioritize legibility work, so the name stays, and the schema keeps lying.

What’s drifted

Legacy drift shows up in three visible modes and one invisible one.

Names that stopped describing the data. tmp_ tables that are permanent. old_ columns that are current. deprecated_ fields that every write path still populates. flag1, flag2, status_code: names whose meaning was obvious when the column was added, because the person adding it remembered why. By the time a new reader arrives, the intent is gone and the name is false advertising. Comment Your Schema covers the documentation side of this; legacy schemas are the case where comments would help most and where they’re most often absent.

Conventions per era. The 2014-era backend team used camelCase. The 2019 rewrite adopted snake_case. The 2022 microservice added a third convention, PascalCase, because the Go team wrote it and nobody pushed back. Now one database has userId, user_id, and UserID, all referring to the same entity across different tables. The LLM that generates business.created_at when the column is actually business.createdDate isn’t wrong in any sense the schema could catch; it’s inferring a convention from one table and applying it to another, which is a reasonable thing to do in a schema that has only one convention.

Tables that were supposed to be temporary. tmp_orders is the canonical example, but every long-lived database has some. Staging tables that got promoted to production. Migration tables that weren’t cleaned up. “Phase 2” tables built for a transitional period that shipped in phase 1 and never came back to finish. The names encode the original intent; the data encodes the current reality; the two diverge a little more with every migration that preserves the name instead of fixing it.

Invisible structural drift. Charsets and collations are the version of drift that doesn’t even show up in the column list. Older tables created before the Unicode migration default to latin1; newer tables use utf8mb4. A join between a VARCHAR(100) column in one table and a VARCHAR(100) column in another (both with the same name, both with the same logical meaning) silently produces different results depending on which side’s collation MySQL picks. In the bad cases, an implicit charset conversion kills index usage and turns the query into a table scan. SHOW TABLE STATUS reveals each table’s default collation, and SHOW FULL COLUMNS (or information_schema.COLUMNS) reveals it per column; reading the column list doesn’t. Most LLMs read the column list.

Why this is worse for LLMs than for humans

A new human engineer working with a legacy schema can ask. They can ping the on-call channel, look up the original migration in git, trace a column back to the PR that introduced it, or simply ask “what is flag1?” and get an answer from someone who knows. The answer is often wrong or outdated, but it’s a starting point, and the engineer learns to treat the schema with appropriate suspicion.

An LLM generating SQL from the catalog has no such recourse. It sees tmp_orders and reasons from the name (probably “this is a staging table, prefer the non-tmp version if one exists, otherwise deprioritize”). It sees old_price and treats it as historical. It sees flag1 BOOLEAN and infers a generic flag. Each inference is reasonable; each is wrong in the specific case; the schema gives no signal that this is one of the cases where reasoning from the name produces bad SQL.

This is the sharper version of the generic id primary key problem. Both are failures of the schema to describe itself. The PK case hides what’s being matched; legacy drift hides what anything means. Neither failure shows up at write time; both produce queries that run, return data, and look plausible, because the rows exist and the types match. The wrongness is in the interpretation, which the database has no way to check.

The fix is documentation, not renaming

The obvious fix (rename everything to match intent and convention) fails on cost. Every table, column, and constraint in a mature schema is referenced by services the team has forgotten about: scheduled jobs, Redshift imports, third-party integrations, BI dashboards built by a contractor in 2019, runbooks pasted into wiki pages that nobody has edited since. A rename that looks like a one-line migration touches every surface the table is exposed on, and the projects that survive the attempt usually take a year and leave the schema worse during the transition.

The workable fix is to stop the drift from continuing and make the existing drift visible. Stopping new drift means picking a convention for new tables and columns and writing it down where CI can enforce it (Schema Conventions and Why They Matter covers the mechanics). Making existing drift visible means column and table comments on everything whose name doesn’t match its meaning, plus a per-era mapping somewhere in the repo that says “this database has four naming conventions, used in these periods, applied to these tables.” Legacy schemas are the case where COMMENT ON pays off highest. The names are already wrong, the cost of fixing them is prohibitive, and the comment is the one affordable signal the next reader gets.

COMMENT ON TABLE tmp_orders IS
 'Main orders table. The tmp_ prefix is historical: a 2017 migration was planned to rename this and was never completed. Do not drop.';

COMMENT ON COLUMN customers.flag1 IS
 'VIP customer flag. Legacy name from the 2014 schema; never renamed because of external reporting dependencies.';

One-line migrations, near-zero risk, and every reader (human and LLM) now has a chance of reading the schema correctly. (The syntax shown is PostgreSQL’s; MySQL attaches the same text through a COMMENT clause on ALTER TABLE.) This isn’t a fix in the sense of “problem solved.” It’s a fix in the sense of “the next reader has a chance.” The drift is structural; the documentation is how you navigate it without making it worse.
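For the other half of the fix (stopping new drift where CI can enforce it), a rough sketch of a name check with a grandfather list. Everything here is an assumption for illustration: snake_case as the chosen convention, the LEGACY_ALLOWED entries, and the (table, column) pairs, which in practice would be read from the catalog.

```python
import re

# Assumed convention for new columns: lowercase snake_case.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Hypothetical grandfather list: legacy names that are documented, not renamed.
LEGACY_ALLOWED = {("business", "createdDate"), ("accounts", "userId")}


def convention_violations(columns):
    """Return (table, column) pairs that break the convention and are not grandfathered."""
    return [
        (table, column)
        for table, column in columns
        if not SNAKE_CASE.match(column) and (table, column) not in LEGACY_ALLOWED
    ]


# Illustrative catalog contents; a real check would query the database.
catalog = [
    ("business", "createdDate"),    # older camelCase, grandfathered
    ("accounts", "userId"),         # grandfathered
    ("orders", "user_id"),          # matches the convention
    ("shipments", "TrackingCode"),  # new drift: should fail the check
]

violations = convention_violations(catalog)
print(violations)
```

The grandfather list doubles as the per-era mapping: it is the written record of which names are known to be off-convention and why they stay.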

When a clean rewrite is actually worth it

Renames and migrations aren’t always wrong. Three cases where the rewrite earns its cost:

A misleading name is actively causing incidents. If tmp_orders is regularly truncated or dropped by someone who reads the name literally and acts on it, the rename cost is less than the recovery cost from the next incident. Usually the practical fix here isn’t a rename; it’s a view, synonym, or ALTER-TABLE-RENAME that exposes orders as the canonical name and leaves tmp_orders as a compatibility alias for legacy callers.

A schema migration is happening anyway. If the team is replatforming the OLTP database or splitting it across services, the rewrite opens a window where renames are cheap because callers are being updated either way. Take the opportunity; don’t schedule a separate naming cleanup six months later when the window has closed.

A database small enough that it fits in one person’s head. Early-stage startups, internal tools, bounded-scope services. At twenty tables and three developers, a Saturday afternoon of renames is cheaper than a decade of comments.

In every other case, the schema is load-bearing history, and you renovate it the way you renovate a building with people still living in it: patch, document, and schedule the demolition for a window when it’s genuinely cheap.
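The compatibility-alias option from the first case can be sketched as a rename plus a view under the old name. This uses SQLite through Python's sqlite3 module as a stand-in, with a hypothetical two-column orders table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tmp_orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL);
INSERT INTO tmp_orders VALUES (1, 1999), (2, 4500);

-- Promote the real name, then leave the old one behind as an alias.
ALTER TABLE tmp_orders RENAME TO orders;
CREATE VIEW tmp_orders AS SELECT id, total_cents FROM orders;
""")

# New code writes to the canonical name.
conn.execute("INSERT INTO orders VALUES (3, 250)")

# Legacy readers keep working against the old name.
legacy_rows = conn.execute(
    "SELECT id, total_cents FROM tmp_orders ORDER BY id"
).fetchall()
print(legacy_rows)
```

One caveat on the stand-in: SQLite views are read-only unless INSTEAD OF triggers are added, so this only covers legacy readers. PostgreSQL and MySQL can treat a simple single-table view like this one as updatable, which is what lets legacy writers keep using the old name there.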

The bigger picture

Every production schema is a compressed record of the decisions the team made under pressure. Some of those decisions were good and still fit; some were good at the time and don’t fit now; some were expedient and nobody noticed. The schema can’t tell you which is which, and it was never going to. The aspiration isn’t a clean schema that doesn’t accumulate history (no such schema exists past a three-year horizon) but enough signal for the next reader to decompress the sediment without guessing.

Comment the columns that lie. Document the conventions per era. Treat LLMs generating SQL against the catalog as the same kind of reader a new engineer is, and give them the same written context.

The Bare `id` Primary Key: When Every Table Joins to Every Other Table

Tue, 27 May 2025 00:00:00 +0000

TL;DR

A bare id primary key on every table makes a.id = b.id valid SQL between any two tables, which means neither a human reviewing the query nor an LLM generating one can tell which of those equalities are meaningful. Name primary keys after the table they identify, and the schema describes its own relationships.

Here’s a query an AI assistant generated against a real production schema:

SELECT u.email, a.payload
FROM users u
JOIN actions a ON u.id = a.id
WHERE u.email = 'alice@example.com';

Syntactically clean. Ran without error. Returned zero rows, which the assistant reported back as “this user has no actions.” The real answer: users.id is a BIGINT and actions.id is a CHAR(36) UUID. MySQL compared the two as numbers, which reduces each UUID string to whatever leading digits it has, and none of them happened to equal Alice’s id. The join wasn’t wrong, exactly. It was meaningless, and the database had no way to say so.

The experienced reader’s first fix is “just use UUIDs everywhere” or “enforce the type at join time.” Neither works. The footgun isn’t the type mismatch; it’s the column name. When every table’s primary key is named id, a.id = b.id is a valid expression between any two tables in the schema, and nothing in the column names tells you whether that expression means anything. Fix the types and you close one failure mode; the identically typed, semantically unrelated case, where users.id = 42 happens to equal orders.id = 42, still ships.

What nobody can see

The <table>_id convention is older than most of us, and the case for it is usually framed as clarity or style. The sharper framing is that bare id hides the information that matters most at the point of the join (which table’s identity is being compared, and whether comparing them makes sense) from every reader of the query.

The query’s reviewer. ON u.id = a.id gives no hint of what’s being matched. A human reviewer has to carry the table-to-alias mapping (u is users, a is actions) and the table-to-type mapping (users.id is BIGINT, actions.id is UUID) in working memory, then cross-check them against the join condition. None of those steps are hard, but reviewers skip them because the column names look symmetric. Two .id references read as “joining on primary keys,” which is the kind of join nobody flags.

The LLM reading the schema. An assistant generating SQL from the catalog sees users(id BIGINT, ...) and actions(id CHAR(36), ...) as two tables with primary keys named id. Absent a full column-type check on every candidate join (and most schema-reading prompts don’t do this), the natural-looking join between “a user and their actions” is u.id = a.id, which is exactly wrong. The schema presented the column as joinable; the LLM took it at face value. The same mistake a tired human makes, but at scale and without fatigue to blame.

The static analyzer. Linters and schema-aware query builders operate on names first and types second. A rule that warns on suspicious cross-table joins has no signal to fire on when both sides are .id; the column names match, so the join is “legitimate” by shape. The same rule on users.user_id = actions.action_id would flag it immediately, because the names would be obviously non-corresponding.

None of these readers are missing a step they should have taken. They’re all doing the reasonable thing, and the reasonable thing produces wrong queries because the schema is telling them id is id in both tables.

Three failure modes, ranked by how loudly they fail

Three distinct outcomes hide behind a.id = b.id, and they don’t fail equally:

PostgreSQL, mixed types. The comparison errors out with operator does not exist: bigint = uuid. Loud, caught in development, fixed before merge. The best failure mode.
MySQL, mixed types. Silent numeric coercion: the UUID string is compared as a number, which usually matches nothing and returns zero rows. The opening example. Bad, because “no results” looks like valid data to every downstream consumer.
Either engine, same type but semantically unrelated. BIGINT users.id = 42 matched against BIGINT orders.id = 42 returns the rows where the integers happen to collide. The query runs, the result set isn’t empty, and the rows look plausible because they’re real rows from real tables. The worst failure mode, because nothing about the output signals that the join was nonsense.

The first is caught in development, and the second at least returns an empty result that someone may question. The third is the one that ships, and it is the default once more than one table in the schema uses a plain BIGINT id, which is almost every relational schema in existence.
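The third failure mode is easy to reproduce on any engine. A small sketch, assuming SQLite via Python's sqlite3 module and a hypothetical actions.user_id column as the real link between the two tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE actions (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT);
INSERT INTO users VALUES (1, 'alice@example.com'), (2, 'bob@example.com');
-- Action 1 belongs to Bob; actions 2 and 3 belong to Alice.
INSERT INTO actions VALUES (1, 2, 'bob: login'),
                           (2, 1, 'alice: login'),
                           (3, 1, 'alice: purchase');
""")

QUERY = """
    SELECT a.payload
    FROM users u
    JOIN actions a ON {condition}
    WHERE u.email = 'alice@example.com'
    ORDER BY a.id
"""

# PK-to-PK join: runs, returns a real row, and the row belongs to Bob.
nonsense = conn.execute(QUERY.format(condition="u.id = a.id")).fetchall()

# FK-to-PK join: the relationship the schema actually models.
correct = conn.execute(QUERY.format(condition="a.user_id = u.id")).fetchall()

print(nonsense)
print(correct)
```

The PK-to-PK version returns Bob’s login as if it were Alice’s, and nothing in the result set marks it as wrong.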

Zero rows looks like no data

A join that silently returns zero rows because of a type coercion is indistinguishable from a join that legitimately has no matches. Code generators, dashboards, and AI assistants all interpret empty results as “the relationship exists but has no rows,” not “the query is nonsense.” The failure hides inside success.

Mixed PK types make the naming problem sharper

Production schemas rarely stay on one PK strategy for long. The original tables are usually BIGINT AUTO_INCREMENT because the framework defaulted to it; a newer service switches to UUIDs to let clients generate IDs offline or to distribute across shards; join tables pick up composite keys because (user_id, role_id) is the natural identity. Nothing in the schema announces which tables fall into which bucket; SHOW CREATE TABLE or \d is the only source of truth, and even that requires reading every table to know what joins are legal.

Mixed types are where the naming footgun turns from theoretical to frequent. When every PK was a BIGINT, the “same type but semantically unrelated” case was the main risk and reviewers caught most of it. Once the schema has BIGINT and UUID sitting next to each other (all named id) the mismatched-type cases pile on top, and “no data found” becomes a regular report from any tool generating queries from the schema.

The sizing question (when to pick BIGINT versus UUID versus UUIDv7 versus composite, and what each costs at the index level) is covered separately in Random UUIDs as Primary Keys. The two problems interact but have independent fixes: pick your PK types deliberately, and name them so the schema describes its own relationships. Neither fix substitutes for the other.

Naming is the lever that actually helps

Naming is what makes a schema describe its own relationships without requiring the reader (human or otherwise) to open every CREATE TABLE. Two conventions, consistently applied, close most of the gap:

Name the primary key after the table. users.user_id, orders.order_id, actions.action_id. The equality users.user_id = orders.order_id reads as obvious nonsense, because the column names are no longer identical. Reviewers see it, LLMs are much less likely to produce it, linters can flag it. The cost is a small amount of redundancy in queries (users.user_id instead of users.id), which is almost always a fair trade. This lines up with the broader guidance in Schema Conventions and Why They Matter.

Foreign keys mirror the target PK. orders.user_id clearly references users.user_id. actions.user_id clearly references users.user_id. This is already common practice; the only change is that the target’s PK name matches, closing the loop. Foreign Keys Are Not Optional covers why the FK itself matters; naming is what makes the FK legible without the REFERENCES clause in hand.
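With both conventions in place, a name-only lint rule becomes possible: in an equi-join, the two column names should match. The sketch below is a deliberately naive regex-based check, not a SQL parser. mismatched_joins is a hypothetical helper, and it only inspects the first condition after each ON.

```python
import re

# Matches equi-join conditions of the form: ON alias.col = alias.col
JOIN_CONDITION = re.compile(
    r"\bON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", re.IGNORECASE
)


def mismatched_joins(sql):
    """Return join conditions whose two column names differ."""
    flagged = []
    for left_alias, left_col, right_alias, right_col in JOIN_CONDITION.findall(sql):
        if left_col.lower() != right_col.lower():
            flagged.append(f"{left_alias}.{left_col} = {right_alias}.{right_col}")
    return flagged


# Table-named PKs: the legitimate join passes, the nonsense join is flagged.
good = "SELECT * FROM users u JOIN orders o ON o.user_id = u.user_id"
bad = "SELECT * FROM users u JOIN orders o ON u.user_id = o.order_id"

# Bare id on both sides: the names match, so the rule has nothing to fire on.
bare = "SELECT * FROM users u JOIN actions a ON u.id = a.id"

print(mismatched_joins(good))
print(mismatched_joins(bad))
print(mismatched_joins(bare))
```

The last case is the point: the same rule that catches u.user_id = o.order_id stays silent on u.id = a.id, because the bare names give it no signal.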

The bare id convention is defensible when the PK column only ever shows up in queries alongside its table name (users.id) and never as a bare id in a SELECT list or join condition. That discipline is hard to enforce across a team over years, and every framework’s default query builder produces SELECT id FROM users without thinking about it. The naming fix makes the discipline unnecessary.

When bare `id` is actually fine

Not every schema needs to bend. A small application, a service with a handful of tables, or a database where every query is reviewed by one team has plenty of context to keep the a.id = b.id landmine out of reach. The cost of the convention scales with the number of tables, the number of engineers, and the number of non-human query generators; in the small case it rarely shows up.

What changes once any of those numbers grows: nobody remembers which tables are BIGINT versus UUID, assistants generating queries from the schema become routine, and the review process that caught a.id = b.id in a 20-table schema can’t read every join in a 400-table one. At that size the convention pays rent, and renaming PKs is a migration that gets slower every quarter.

The bigger picture

A schema’s job is to hold data correctly and describe its own shape well enough that the tools reading it can reason about relationships without reading every line. The bare id PK is a small departure from that (one column name shared across tables) but it’s the departure that most consistently produces silent-wrong-answer queries, because SQL has no way to distinguish “same name, same meaning” from “same name, different meaning.”

Name the primary key after the table it identifies, so the schema tells its own story when someone (human or otherwise) joins two of them together. It costs almost nothing on day one and leaves the schema legible at 400 tables.

Polymorphic References Are Not Foreign Keys

Sat, 10 May 2025 00:00:00 +0000

TL;DR

A polymorphic reference is resource_id plus resource_type where the type string chooses which table the ID points to. ORMs make it a one-liner; the database enforces nothing. Reads need conditional joins, orphans accumulate silently, and for most uses (comments, notifications, attachments) per-target tables or mutually-exclusive FKs are the better trade.

What the pattern looks like

CREATE TABLE notifications (
    id            BIGINT PRIMARY KEY,
    user_id       BIGINT NOT NULL REFERENCES users(id),
    resource_id   BIGINT NOT NULL,
    resource_type VARCHAR(50) NOT NULL,
    message       TEXT NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- resource_type = 'order' → resource_id references orders.id
-- resource_type = 'invoice' → resource_id references invoices.id
-- resource_type = 'ticket' → resource_id references support_tickets.id

The tell is resource_id BIGINT NOT NULL with no REFERENCES clause; it can’t have one, because there are multiple targets. What the application treats as a foreign key is, at the database level, a plain integer with a sibling tag string.

What the database can’t do

The cost shows up as absence: every mechanism the database offers for reasoning about relationships is disabled, because the column’s meaning depends on data in another column.

No foreign key. A REFERENCES clause names exactly one target. Orphaned resource_id values are a write-time non-event and a read-time mystery. (Foreign Keys Are Not Optional covers the general cost; polymorphic is the case where skipping isn’t a choice.)
No cascade. Delete an order and nothing cleans up the notifications pointing at it. The application has to know every table that might hold a polymorphic reference to orders and clean each one. New tables added later don’t get noticed.
No planner metadata. Foreign keys feed join ordering and row estimates, especially in PostgreSQL. The planner sees resource_id as a BIGINT with a histogram and no known target.
No schema-level description. Anything that reads the catalog (ERD tools, query generators, AI assistants, typed-client generators) sees no link between notifications.resource_id and the tables it points at. The mapping lives in model files and string literals. (Comment Your Schema helps here but can’t fully restore the information.)

Orphans accumulate silently

A polymorphic column with no FK and no cascade develops orphans over time. Reads paper over them with LEFT JOIN ... WHERE target.id IS NOT NULL, so the broken rows disappear from the UI but stay in the table. In schemas a few years old, the orphan rate is rarely zero, and nobody designed for it.
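Measuring the orphan rate takes one anti-join per target type, which is itself a symptom: the target list has to be enumerated by hand. A minimal sketch, assuming SQLite via Python's sqlite3 module and a trimmed two-target version of the notifications table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders   (id INTEGER PRIMARY KEY);
CREATE TABLE invoices (id INTEGER PRIMARY KEY);
CREATE TABLE notifications (
    id            INTEGER PRIMARY KEY,
    resource_id   INTEGER NOT NULL,
    resource_type TEXT NOT NULL
);
INSERT INTO orders VALUES (1), (2);
INSERT INTO invoices VALUES (1);
INSERT INTO notifications VALUES
    (10, 1, 'order'), (11, 2, 'order'), (12, 1, 'invoice');

-- No FK, so nothing stops this delete or cleans up after it.
DELETE FROM orders WHERE id = 2;
""")

# Count notifications whose target row no longer exists, per target type.
orphans = conn.execute("""
    SELECT n.resource_type, COUNT(*)
    FROM notifications n
    LEFT JOIN orders o   ON n.resource_type = 'order'   AND n.resource_id = o.id
    LEFT JOIN invoices i ON n.resource_type = 'invoice' AND n.resource_id = i.id
    WHERE (n.resource_type = 'order'   AND o.id IS NULL)
       OR (n.resource_type = 'invoice' AND i.id IS NULL)
    GROUP BY n.resource_type
""").fetchall()

print(orphans)
```

A resource_type value the query doesn’t know about is silently excluded from the count, so the number it reports is a lower bound.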

Reads pay for the write-side convenience

The absent FK is the schema problem. The read-path shape is where the cost becomes daily. A query that needs any column from the referenced row can’t write a single join; the target depends on a per-row value, and SQL’s join syntax takes a static target.

-- Conditional LEFT JOIN per target
SELECT n.id, n.message,
 COALESCE(o.order_number, i.invoice_number, t.ticket_code) AS ref
FROM notifications n
LEFT JOIN orders o ON n.resource_type = 'order' AND n.resource_id = o.id
LEFT JOIN invoices i ON n.resource_type = 'invoice' AND n.resource_id = i.id
LEFT JOIN support_tickets t
 ON n.resource_type = 'ticket' AND n.resource_id = t.id
WHERE n.user_id = 42;

Every new target type adds a join clause here and in every other read-path query that displays a related field. The alternative (a UNION ALL per target) is narrower per branch but scales linearly with target count and pushes pagination up to the union level. Most ORMs’ default resolution is one query per (resource_type, resource_id) group, which is the N+1 pattern that makes polymorphic feeds slow once the target set widens.

“One column can point at many tables” on the write side turns into “every read query enumerates every possible table” on the read side. The symmetry people expect isn’t there.

Why the pattern spreads

It’s the path of least resistance that framework ergonomics encourage. Rails’ polymorphic: true, Django’s GenericForeignKey, and Laravel’s morphTo turn into a one-liner what would otherwise be multiple belongs_to associations and a migration. “Comments on orders” and “comments on invoices” look like duplication, so a single comments table with commentable_id / commentable_type feels cleaner. An open-ended “add comments to anything” product ask reads as an argument against committing to a target list.

Each of those framings overweights the write-side cost (another table or another FK column) and underweights the integrity loss (no enforcement, no cascades, schema no longer describes itself). ORMs Are a Coupling covers the broader trade. Polymorphic is the canonical case where the ORM’s preferred shape is actively incompatible with what the database wants to enforce.

What the schema-reading assistant sees

A tool reading the catalog (Copilot on a schema dump, an MCP-backed agent, a RAG pipeline indexing DDL) sees notifications.resource_id BIGINT NOT NULL with no REFERENCES clause and no way to tell the column is anything other than an integer. Asked for “notifications about orders,” the assistant’s best guess is notifications.resource_id = orders.id: a join that runs clean, returns every notification whose resource_id happens to collide with an order ID (which includes invoice notifications, ticket notifications, and anything else pointing at an integer that also appears in orders), and surfaces plausible-looking but semantically nonsense rows. The resource_type filter that would make the join correct is the piece the schema doesn’t advertise.
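The guessed join and the correct one differ by a single predicate. A small reproduction, assuming SQLite via Python's sqlite3 module and three notifications that all carry resource_id 7 but point at different target types:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, order_number TEXT);
CREATE TABLE notifications (
    id            INTEGER PRIMARY KEY,
    resource_id   INTEGER NOT NULL,
    resource_type TEXT NOT NULL,
    message       TEXT NOT NULL
);
INSERT INTO orders VALUES (7, 'ORD-0007');
INSERT INTO notifications VALUES
    (1, 7, 'order',   'Your order shipped'),
    (2, 7, 'invoice', 'Invoice is overdue'),
    (3, 7, 'ticket',  'Ticket was closed');
""")

# What the catalog suggests: join on the integer alone.
guessed = conn.execute("""
    SELECT n.message
    FROM notifications n
    JOIN orders o ON n.resource_id = o.id
    ORDER BY n.id
""").fetchall()

# What the application knows: the type tag is part of the key.
filtered = conn.execute("""
    SELECT n.message
    FROM notifications n
    JOIN orders o ON n.resource_type = 'order' AND n.resource_id = o.id
    ORDER BY n.id
""").fetchall()

print(guessed)
print(filtered)
```

The guessed join returns all three messages as notifications about order 7; only the filtered one returns the single row that actually refers to an order.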

This is the structural version of the problem covered in the bare id primary key: schema that can’t describe its own relationships forces every reader to guess, and schema-reading models guess confidently. Pulling the polymorphic column apart (per-target tables, mutually-exclusive FKs, supertype) restores the signal in the catalog. The assistant stops hallucinating the join; any RAG system indexing the schema picks up real REFERENCES metadata; the next engineer reading the table doesn’t need to grep the ORM models to find out which target types exist. The integrity win and the catalog-legibility win come in the same migration.

Alternatives

Each alternative gives back some of the database’s relational machinery at different levels of verbosity.

Per-target tables. Split along the target dimension: order_notifications, invoice_notifications, ticket_notifications, each with a real FK. Real cascades, real planner metadata, self-describing schema. Cost: duplicated column sets and an explicit UNION ALL for cross-target reads. The polymorphic shape already contains that union implicitly; per-target tables make its branches explicit and typed.

Mutually-exclusive nullable FKs with CHECK. One table, one FK column per target, a constraint enforcing exactly one is non-null:

CREATE TABLE notifications (
    id         BIGINT PRIMARY KEY,
    user_id    BIGINT NOT NULL REFERENCES users(id),
    order_id   BIGINT REFERENCES orders(id),
    invoice_id BIGINT REFERENCES invoices(id),
    ticket_id  BIGINT REFERENCES support_tickets(id),
    message    TEXT NOT NULL,
    CONSTRAINT exactly_one_target CHECK (
        (order_id IS NOT NULL)::int +
        (invoice_id IS NOT NULL)::int +
        (ticket_id IS NOT NULL)::int = 1
    )
);

Real FKs per target, real cascades, row’s meaning unambiguous. Scales reasonably up to a handful of targets and stops scaling cleanly somewhere around ten.
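To see the constraint and the FKs doing their jobs, here is a runnable variant, assuming SQLite via Python's sqlite3 module. Two things differ from the DDL in the post and are assumptions of this sketch: the ON DELETE CASCADE clauses (added to show a cascade) and the absence of ::int casts, which SQLite doesn’t need. The table is trimmed to two targets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE orders   (id INTEGER PRIMARY KEY);
CREATE TABLE invoices (id INTEGER PRIMARY KEY);
CREATE TABLE notifications (
    id         INTEGER PRIMARY KEY,
    order_id   INTEGER REFERENCES orders(id)   ON DELETE CASCADE,
    invoice_id INTEGER REFERENCES invoices(id) ON DELETE CASCADE,
    message    TEXT NOT NULL,
    -- SQLite comparisons already yield 0 or 1, so no cast is needed.
    CONSTRAINT exactly_one_target CHECK (
        (order_id IS NOT NULL) + (invoice_id IS NOT NULL) = 1
    )
);
INSERT INTO orders VALUES (1);
INSERT INTO invoices VALUES (1);
INSERT INTO notifications VALUES (100, 1, NULL, 'Order shipped');
""")


def rejected(sql):
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


both_targets = rejected("INSERT INTO notifications VALUES (101, 1, 1, 'ambiguous')")
no_target = rejected("INSERT INTO notifications VALUES (102, NULL, NULL, 'dangling')")
missing_parent = rejected("INSERT INTO notifications VALUES (103, 99, NULL, 'orphan')")

# Deleting the order removes its notification through the cascade.
conn.execute("DELETE FROM orders WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM notifications").fetchone()[0]

print(both_targets, no_target, missing_parent, remaining)
```

All three bad inserts are refused by the database rather than by application code, and the delete leaves no orphan behind.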

Supertype table. A shared parent table carries a common ID; each target type’s table references the parent. The polymorphic column then points at the parent, which is a single real FK. Cleanest structural answer and the one with the highest adoption cost; retrofitting this onto an existing schema is substantial migration work.
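A rough sketch of the supertype shape, again assuming SQLite via Python's sqlite3 module. The resources table and its columns are hypothetical names for the shared parent, and only one target type is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Hypothetical parent table: one row per referenceable thing.
CREATE TABLE resources (
    resource_id   INTEGER PRIMARY KEY,
    resource_type TEXT NOT NULL
);
-- Each target table shares its key with the parent.
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY REFERENCES resources(resource_id),
    order_number TEXT NOT NULL
);
-- The formerly polymorphic column is now a single real FK.
CREATE TABLE notifications (
    notification_id INTEGER PRIMARY KEY,
    resource_id     INTEGER NOT NULL REFERENCES resources(resource_id),
    message         TEXT NOT NULL
);
INSERT INTO resources VALUES (1, 'order');
INSERT INTO orders VALUES (1, 'ORD-0001');
INSERT INTO notifications VALUES (10, 1, 'Order shipped');
""")

# A notification pointing at a resource that does not exist is refused.
try:
    conn.execute("INSERT INTO notifications VALUES (11, 999, 'points nowhere')")
    dangling_allowed = True
except sqlite3.IntegrityError:
    dangling_allowed = False

print(dangling_allowed)
```

The cost described above shows up in the insert order: every order now needs a resources row first, which is the part that makes retrofitting expensive.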

When polymorphic is actually the right call

The trade-offs stack up unfavorably for most common uses, but not all. The pattern earns its keep when the relationship is genuinely best-effort: audit events, activity logs, “recently viewed” lists, undo history, where a lost reference is a recoverable annoyance rather than a correctness incident. The FK was never going to be load-bearing, and the polymorphic shape matches the actual semantics: “reference anything, and if it’s gone, show a tombstone.”

Outside that zone the default bias should run the other way. A comment system with three possible parents is not a case for polymorphism; it’s a case for three comment tables or mutually-exclusive FK columns, with the ORM abstracting the read-side stitching.

The bigger picture

Polymorphic references are a specific case of a broader pattern: designs that move information out of the schema and into the application, in exchange for ergonomics in the model layer. The schema drifts from “self-describing relational structure” toward “indexed key-value store the application interprets.” That’s a legitimate position (DynamoDB and friends live there on purpose) but a relational database running on polymorphic associations is paying for a relational engine and choosing not to use most of what it offers.

The pattern isn’t wrong. It’s an aggressive trade, priced on day one by the convenience of polymorphic: true and on day three hundred by the silent orphan count, the conditional joins, and resource_id BIGINT telling no one what the table is related to. Reach for it on purpose. Keep the option of pulling it back onto typed FK columns open, because the migrations away are slower the longer the schema has been pretending the reference isn’t there.

ORMs Are a Coupling, Not an Abstraction

Wed, 23 Apr 2025 00:00:00 +0000

TL;DR

An ORM is a coupling between schema shape and code shape, not an abstraction over it. The coupling pays off in year one and compounds against you in year five. For long-lived OLTP systems, a thinner layer over raw SQL (sqlc, jOOQ, typed query builders) ages better.

There’s a period early in a project where an ORM feels like pure upside. You define a model, the framework generates a migration, and User.where(email: …) returns typed objects. No SQL to write, no mapping layer to maintain, no integration boilerplate. Five years later the same project has four migration directories, a model class with thirty custom methods overriding the ORM defaults, team memory of which relations are lazy-loaded and which aren’t, and a quarterly discussion about whether it’s time to upgrade Rails 4 to Rails 7 or skip straight to something else entirely.

Somewhere between those two points, the ORM stopped being an abstraction and became a coupling: a bidirectional contract between schema and code that both sides have to honor for every change. The contract shapes more than how changes propagate. It also shapes the schema itself, because an ORM’s default output is a database structured like the class graph rather than one designed for the workload. Short-lived prototypes and simple CRUD apps still benefit from ORMs. The defensible use cases are narrower than the industry’s default deployment pattern suggests, and the coupling is real, durable, and consistently underestimated at the point a team decides to adopt one.

The oddity worth pausing on

SQL is arguably the most widely deployed, longest-lived programming language in the industry. Every major database speaks it, every backend engineer eventually learns it, and the core DDL and DML haven’t meaningfully changed in decades. The ORMs wrapping it are the opposite: framework-specific, tied to a particular version of a particular stack, with conventions that differ across ecosystems and shift across major releases. The default across most engineering orgs is to go out of their way to adopt the less portable, less stable of the two and hide the more durable one behind it. A team joining a new project expects to relearn the ORM. Nobody expects to relearn SELECT.

The rest of this post offers one answer for why that’s the default: the coupling an ORM introduces hides its cost long enough that the trade looks very different in year one than it does in year five.

What the ORM is actually doing

The name “object-relational mapping” suggests abstraction, as if the mapping were hidden plumbing. The practical reality is the opposite: the mapping is the product. An ORM takes your schema shape and projects it onto code shape. Columns become fields. Tables become classes. Foreign keys become methods. Indexes are invisible until you care about them. Constraints are whatever the ORM’s DSL exposes and nothing more.

That projection is useful. It lets application code avoid SQL, most of the time. It also means the code and schema are now two views of the same data model, and those views have to stay in sync, a job that falls to you, your migration framework, your tests, and every developer who touches either side.

Stay in sync, in practice, means every schema change is also a code change. Every code change that adds a field triggers a schema change. Every migration is a coordinated edit across multiple files. The coupling isn’t an implementation detail; it’s the defining characteristic of the tool.

Source of truth: pick one, know which

Every ORM ecosystem has a default answer to “where does the schema canonically live”, and most teams never think about it.

Model-first. Django generates migrations from changes to model classes; Rails has you write each migration in its Ruby DSL, and the models pick up their attributes from the resulting tables. Either way the framework’s code is the source of truth and the schema follows. Running rails db:schema:dump produces a schema.rb that describes the current state, and the migration files are the history of how it got there.
Schema-first. sqlc and jOOQ read SQL DDL files and generate typed client code. The schema is the source of truth; the code follows.
Hybrid / unclear. Hibernate can do either, depending on configuration. SQLAlchemy lets you declare models in Python and generate migrations via Alembic, or go the other way and build models from an existing schema (reflection, automap, or a generator such as sqlacodegen). Teams that don’t decide end up doing both.

The hybrid case is where the real damage happens. Over years, a team that migrates from model-first to schema-first (or vice versa) without a clean cutover ends up with a schema that neither the models nor the migration history correctly describes. Rows backfilled by a DBA with direct SQL don’t show up in the ORM’s understanding of the world. Columns added by a production hotfix get rediscovered six months later when someone regenerates models from the database.

The fix isn’t to prefer one approach over the other. It’s to decide, document, and enforce, the way you would any other convention.

Migrations stop being “DB work”

In a raw-SQL codebase, a schema migration is a single file: CREATE TABLE, ALTER TABLE, DROP COLUMN. The migration is the change.

In an ORM codebase, a single logical schema change is typically:

A migration file (add_email_to_users.rb).
The model class (User#email getter, validation, serialize calls).
The serializer (UserSerializer#email).
The API contract (OpenAPI spec, GraphQL schema, whatever the team uses).
Fixtures and factories (FactoryBot, factory_boy, test data).
Query helpers that need to know the new column.
Type stubs or generated types (TypeScript declarations, Python stubs).
Admin UI config, sometimes.

What should be a single metadata-level change is now a coordinated edit across five to eight files, and missing any one of them produces a subtly broken application. The ORM didn’t create the complexity; it distributed it. The schema change is still one change. It just has to be propagated to every place the code has a mirror of the schema.

At small scale this is fine. The friction compounds once the team is big enough that the people writing the migration aren’t the same people owning the serializers and the API consumers. A schema change now requires coordinating across teams, each with their own view of the data model, each needing their files updated. The schema itself didn’t get harder to change. The ORM layer around it did.

Hidden queries

The ORM generates SQL you didn’t write. That’s the value proposition. It’s also a persistent failure mode.

Lazy loading. user.orders triggers a query. user.orders.first.line_items triggers another. In a loop over 100 users, that’s at least 101 queries, none of them visible in the code. The classic N+1.
Implicit joins. .includes(:orders) eager-loads associations, but only if someone remembers to write it. The default is lazy. Defaults win.
Magic methods. where(status: :active).first_or_create(email: …) is three or four queries depending on the code path, and the code says nothing about it.
Generated sort and filter. User.order(:created_at).limit(10) on a table without an index on created_at does a full table scan. The query was generated by the ORM; the reviewer never saw it.

None of these are the ORM doing something wrong. They’re the ORM doing exactly what it said it would. The cost is that the SQL the database actually runs isn’t in version control, isn’t code-reviewed, and isn’t profiled until it shows up in slow-query logs. Every ORM codebase accumulates query shapes nobody intentionally wrote.
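The lazy-loading arithmetic is easy to reproduce without any ORM. The sketch below uses Python’s sqlite3 and a hypothetical CountingConnection wrapper to count statements; totals_lazy imitates what a lazily loaded association does in a loop, and totals_joined is the single hand-written query that returns the same answer. The table layout and data are invented for the illustration.

```python
import sqlite3

class CountingConnection:
    """Thin wrapper that counts every statement sent through it."""
    def __init__(self, conn):
        self.conn = conn
        self.statements = 0

    def execute(self, sql, params=()):
        self.statements += 1
        return self.conn.execute(sql, params)

raw = sqlite3.connect(":memory:")
raw.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, total INTEGER NOT NULL);
""")
raw.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, f"user{i}@example.com") for i in range(1, 101)])
raw.executemany("INSERT INTO orders (user_id, total) VALUES (?, ?)",
                [(i, 10 * i) for i in range(1, 101)])

def totals_lazy(db):
    """What a lazy-loaded association does: one query per parent row."""
    totals = {}
    for (user_id,) in db.execute("SELECT id FROM users").fetchall():
        rows = db.execute("SELECT total FROM orders WHERE user_id = ?", (user_id,)).fetchall()
        totals[user_id] = sum(t for (t,) in rows)
    return totals

def totals_joined(db):
    """The same answer as one statement a reviewer can read."""
    rows = db.execute("""
        SELECT u.id, COALESCE(SUM(o.total), 0)
          FROM users u LEFT JOIN orders o ON o.user_id = u.id
         GROUP BY u.id
    """).fetchall()
    return dict(rows)

lazy_db, joined_db = CountingConnection(raw), CountingConnection(raw)
assert totals_lazy(lazy_db) == totals_joined(joined_db)
print(lazy_db.statements, joined_db.statements)  # 101 1
```

Both functions look like a few lines of ordinary code; only the counter shows that one of them sent a hundred extra statements.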

The queries you don't see

The SQL emitted by an ORM is invisible until something breaks. Code review covers the method call; the database sees three joins and a subquery. Teams relying heavily on ORMs end up needing separate tooling (query logs, APM, pg_stat_statements, EXPLAIN on every slow path) just to know what’s actually running.

Two query languages, neither complete

Past the CRUD ceiling, every ORM codebase ends up with raw SQL living alongside ORM calls. Window functions, recursive CTEs, PostgreSQL DISTINCT ON, LATERAL joins, MySQL INSERT ... ON DUPLICATE KEY UPDATE with complex update clauses, exclusion constraints, full-text search, spatial queries: the list of things awkward or impossible to express through the ORM grows over the life of the project.

The result is a codebase with two query languages coexisting. Reviewers have to know both. Type safety is uneven; ORM calls produce typed objects, raw SQL produces hashes or arrays that need manual mapping. The two styles drift. The ORM-side queries follow the ORM’s conventions; the raw-SQL queries follow whatever the author happened to write that day.

The honest consequence: past a certain complexity threshold, the ORM isn’t reducing the SQL surface area, it’s adding a second layer on top of it. The SQL didn’t go away. It got pushed into the half of the codebase that’s harder to trace.

Bidirectional coupling

The part that surprises teams is how hard it is to leave.

Migrating a database schema (renaming a column, changing a type, splitting a table) is mechanical. It’s a migration file and a deploy window. The mechanics are well-understood and the blast radius is bounded.

Migrating off an ORM is not mechanical. The ORM’s conventions have bled into:

Controller and API code. JSON shapes match model attributes. as_json, serializable_hash, and ORM callbacks define what the outside world sees.
Test suites. Fixtures, factories, and in-memory SQLite test databases depend on the ORM being there.
Third-party integrations. Export formats, webhooks, analytics pipelines, all built against the ORM’s JSON representation of the data.
Admin UIs. Rails Admin, Django Admin, Laravel Nova; hard-wired to specific ORM conventions.
Query helpers. Every scope, every association, every callback is ORM-native.
Team knowledge. Every engineer who’s been there more than a year thinks in the ORM’s abstractions.

None of this is the database’s problem. It’s the surrounding code that grew up expecting the ORM to be there. Replacing the ORM means replacing or rewriting every one of those layers. A schema migration is a weekend project; an ORM migration is a yearlong initiative.

The asymmetry is worth naming. The coupling is bidirectional, and one direction (schema → code) is much harder to undo than the other. Teams that adopt an ORM for velocity rarely account for the exit cost.

Database-side logic doesn’t round-trip

Most ORMs have a tunnel-vision view of the schema: they see what they created. They don’t see:

CHECK constraints. The model layer usually has no concept of them, even where the migration DSL can declare one. A constraint like CHECK (amount >= 0) is invisible to the model; the ORM’s validations become the only gatekeeper the application knows about.
Triggers. A trigger that mutates a row after insert produces data the ORM didn’t know would be there. Reading back the row often requires an explicit reload.
Generated columns. MySQL’s GENERATED ALWAYS AS (…) STORED and PostgreSQL’s equivalent produce values the ORM treats as regular columns, but they can’t be written to, and the ORM’s default behavior is to try.
Partial and expression indexes. The ORM sees the column, not the index. A query that should hit a partial index on WHERE deleted_at IS NULL gets generated without that predicate and misses the index.
Exclusion constraints. PostgreSQL EXCLUDE USING gist (…). Completely outside the ORM’s worldview.

The ORM’s view of the schema is a subset of the real schema. Writes composed against that subset can violate invariants the database enforces. The model’s validations pass, the INSERT comes back with a constraint violation, and the application code has no idea why. Teams paper over this with application-level validation that duplicates the database’s, and then the two drift, which is its own class of production incident.
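A small sketch of that drift, again with Python’s sqlite3. The payments table, model_is_valid, and save_payment are all hypothetical; the point is only that the application-side check and the database CHECK are two separate rules that nothing keeps aligned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE payments (
    id     INTEGER PRIMARY KEY,
    amount INTEGER NOT NULL CHECK (amount >= 0)   -- the invariant the database enforces
)
""")

def model_is_valid(amount):
    """Hypothetical application-side validation: checks the type, not the sign."""
    return isinstance(amount, int)

def save_payment(amount):
    """Return 'saved', 'rejected by model', or 'rejected by database'."""
    if not model_is_valid(amount):
        return "rejected by model"
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO payments (amount) VALUES (?)", (amount,))
    except sqlite3.IntegrityError:
        return "rejected by database"
    return "saved"

results = [save_payment(50), save_payment(-5), save_payment("50")]
```

The middle call is the interesting one: the model approves -5 and the database refuses it, so the failure surfaces as an exception from a layer the model never described.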

Relational modeling isn’t object modeling

The coupling goes one direction that’s easy to see: schema changes require code changes. It also goes the other direction, which is harder to see. The ORM’s object model is what shapes the schema in the first place. For simple data, a User with an email and a password hash, that’s fine. For non-trivial domains, the shape inherited from object modeling produces schemas that look like class hierarchies and perform like poorly-designed databases.

This mismatch has a name: the object-relational impedance mismatch. Its practical consequence is that ORM-driven schemas get shaped by class hierarchies rather than by the relationships and access patterns the workload actually has.

Normalization doesn’t look like inheritance. A properly normalized schema is structured by the shape of the relationships between entities, not by a class graph. Consider a scheduling application with three kinds of entries: appointments, days off, and product launches. All of them are events. They have a start time, an end time, and an owner. Each has different additional fields.

The relational answer is a supertype/subtype pattern (sometimes called class table inheritance): a base events table with the shared fields, and specialized tables for each subtype, each with event_id as a primary key that’s also a foreign key back to events:

CREATE TABLE events (
    id BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id),
    starts_at TIMESTAMPTZ NOT NULL,
    ends_at TIMESTAMPTZ NOT NULL,
    kind TEXT NOT NULL CHECK (kind IN ('appointment', 'day_off', 'launch'))
);

CREATE TABLE appointments (
    event_id BIGINT PRIMARY KEY REFERENCES events(id) ON DELETE CASCADE,
    client_id BIGINT NOT NULL REFERENCES clients(id),
    location TEXT,
    notes TEXT
);

CREATE TABLE days_off (
    event_id BIGINT PRIMARY KEY REFERENCES events(id) ON DELETE CASCADE,
    reason TEXT,
    paid BOOLEAN NOT NULL
);

CREATE TABLE launches (
    event_id BIGINT PRIMARY KEY REFERENCES events(id) ON DELETE CASCADE,
    product_id BIGINT NOT NULL REFERENCES products(id),
    audience TEXT
);

Each subtype has its own columns, indexes, and constraints. Each can evolve independently. A new field on appointments doesn’t touch events, days_off, or launches. Dropping the launches feature drops one table and a CHECK-constraint value. Queries that only care about one subtype hit a narrow, well-indexed table instead of scanning across fifty columns of mostly-null data.
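A runnable sketch of how this shape behaves, using Python’s sqlite3 with the DDL trimmed to two subtypes and adapted to SQLite types (timestamps as TEXT, no users or clients tables). add_appointment is a hypothetical helper, not something the schema provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE events (
    id        INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL,
    starts_at TEXT NOT NULL,
    ends_at   TEXT NOT NULL,
    kind      TEXT NOT NULL CHECK (kind IN ('appointment', 'day_off', 'launch'))
);
CREATE TABLE appointments (
    event_id  INTEGER PRIMARY KEY REFERENCES events(id) ON DELETE CASCADE,
    client_id INTEGER NOT NULL,
    location  TEXT
);
CREATE TABLE days_off (
    event_id INTEGER PRIMARY KEY REFERENCES events(id) ON DELETE CASCADE,
    reason   TEXT,
    paid     INTEGER NOT NULL
);
""")

def add_appointment(event_id, user_id, starts_at, ends_at, client_id, location):
    """Insert the shared row and the subtype row in one transaction."""
    with conn:
        conn.execute("INSERT INTO events VALUES (?, ?, ?, ?, 'appointment')",
                     (event_id, user_id, starts_at, ends_at))
        conn.execute("INSERT INTO appointments VALUES (?, ?, ?)",
                     (event_id, client_id, location))

add_appointment(1, 7, "2025-05-01T09:00Z", "2025-05-01T10:00Z", 42, "Room 3")

# A subtype-only read joins the narrow table to the shared columns.
row = conn.execute("""
    SELECT e.starts_at, a.location
      FROM appointments a JOIN events e ON e.id = a.event_id
     WHERE e.user_id = ?
""", (7,)).fetchone()

with conn:
    conn.execute("DELETE FROM events WHERE id = 1")  # cascades to appointments
remaining = conn.execute("SELECT COUNT(*) FROM appointments").fetchone()[0]
```

The two-row insert is the visible cost of the pattern; the cascade and the narrow subtype read are what it buys.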

The ORM-driven shape tends to produce something different. Rails’ single-table inheritance (STI) collapses everything into one wide table with a type column and every possible subtype field nullable. Django’s multi-table inheritance is closer to the relational answer but introduces implicit joins the developer didn’t ask for. Hibernate offers three inheritance strategies (SINGLE_TABLE, JOINED, TABLE_PER_CLASS), but most teams pick SINGLE_TABLE because it’s the default and the fastest for small-scale CRUD.

STI-style tables start showing their cost around the 10-million-row mark. Every query now scans a table with dozens of nullable columns. Indexes have to include the type column to be useful. Adding a field to one subtype means adding a nullable column visible to every other subtype. The schema looks like a class hierarchy and performs like one table doing the job of four.

Complex relationships don’t fit class graphs. Many-to-many bridges with their own columns, polymorphic references (one column that points to different tables depending on a sibling column’s value), temporal tables, recursive self-references; once the data model has these, the object graph starts fraying. The ORM’s answer is usually a custom association that looks natural in code and generates SQL nobody would write by hand.

Normalization decisions are driven by access patterns, not classes. A well-designed schema decides what to normalize and what to denormalize based on read/write ratios, query patterns, and storage trade-offs. The ORM-first approach tends to normalize by class structure, which is mostly correlated with good access-pattern normalization at small scale and mostly uncorrelated with it at scale.

The coupling here isn’t only code to schema. It’s class-graph to schema-shape, and that second form is the one that dictates how the database performs under real traffic.

When scale exposes the modeling

The class-shaped schema is cheap at small scale. Its cost is hidden until the workload grows, and because the schema shape is coupled to the class graph the application assumes, fixing it isn’t a schema migration. It’s an application restructure. The ORM’s opinions about data modeling are fine at 1,000 rows. Tolerable at 1 million. Breaking at 10 million. At 100 million, the patterns that were quietly suboptimal become the production incidents of the quarter.

Wide STI tables that scanned fine for 100k rows become the reason a query times out at 100M, because the planner can’t pick an efficient path through dozens of columns of mostly-null data with mixed cardinalities.
Lazy-loaded associations that were 200ms at small scale are now 60-second requests fanning out to a thousand queries.
find_or_create_by races that never mattered when two users hit the same endpoint now cause daily deadlocks on hot rows.
Unindexed ORM-generated sorts that worked at 10k rows become sequential scans over hundreds of gigabytes.
Connection-pool exhaustion from ORMs that hold connections across application logic becomes a top-of-funnel incident when traffic grows.

At this point, teams reach for tools that weren’t supposed to be in the solution space for an OLTP application. Materialized views are the common one. They’re legitimately useful for analytical workloads, wrong for write-heavy OLTP because they have to be refreshed, and refresh windows during traffic either stall the primary or serve stale reads. Read replicas with application-level routing get bolted on not because the read workload demands it, but because the primary is buckling under queries that would have been cheap on a better-designed schema. Caching layers get introduced to paper over query shapes the ORM insists on generating. Each of these has legitimate uses. None of them is a fix for a schema that wasn’t designed for the access pattern it’s getting.

Materialized views aren't an OLTP tool

A materialized view is a precomputed query result stored as a table. In an OLTP system with heavy writes, the refresh cost either stalls the primary during the refresh or leaves the view stale. Neither is acceptable for a live application. Materialized views are an analytical-workload tool; reaching for them to fix an OLTP performance problem is a sign the underlying schema shape is wrong.

The pattern: ORM-driven schemas work until they don’t, and when they don’t, the options are rewrite the schema (hard, because the ORM’s conventions are everywhere) or add infrastructure that papers over the problem (expensive, and eventually stops working too). The schema that was designed to be ergonomic for the ORM at 1,000 rows is now the binding constraint on what the application can do at 100M.

The thinner alternatives

There’s a spectrum between “hand-roll every query with database/sql” and “full ORM with identity map, lazy loading, and 200-line models.” Several tools occupy the middle ground by treating SQL as the source of truth and generating typed code from it, without the mapping layer.

sqlc. Go, Kotlin, Python, TypeScript. You write SQL queries in .sql files; sqlc generates type-safe client code. The schema is canonical, the queries are code-reviewed SQL, and there’s no runtime layer to reason about. Migrations stay plain DDL.
jOOQ. JVM. Reads your schema and produces a fluent, type-safe DSL for building queries. Feels like SQL, reads like SQL, with compile-time type checking. Schema-first, no model mapping.
Kysely. TypeScript. Typed query builder with no ORM layer. You describe the schema in types; Kysely ensures queries match. The full SQL surface area is reachable.
Drizzle. TypeScript. Despite the “ORM” in its name, closer to a typed query builder than a classical ORM. Schema declared in code, queries written in a SQL-like DSL, no identity map.
Plain database/sql or pgx with a small query helper. Go in particular has a tradition of “raw SQL plus a thin wrapper.” More boilerplate, minimal coupling.

The common thread across these tools: schema is the source of truth, queries are code-reviewed first-class artifacts, and there’s no mapping layer pretending the database doesn’t exist. The payoff is predictability; the SQL you see is the SQL that runs. The cost is some of the magic: no User.find(1).orders.where(total: 100..).first_or_create one-liners.
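To make “the SQL you see is the SQL that runs” concrete, here is a hand-written imitation of the style in Python with sqlite3. It is not the output of sqlc or any other generator; User, GET_USER_BY_EMAIL, and get_user_by_email are names invented for this sketch.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class User:
    id: int
    email: str

# The query is a named, reviewable artifact rather than something assembled at runtime.
GET_USER_BY_EMAIL = "SELECT id, email FROM users WHERE email = ?"

def get_user_by_email(conn: sqlite3.Connection, email: str) -> Optional[User]:
    """Run exactly one statement and map the row to a typed value."""
    row = conn.execute(GET_USER_BY_EMAIL, (email,)).fetchone()
    return User(*row) if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
conn.execute("INSERT INTO users (id, email) VALUES (1, 'ada@example.com')")

found = get_user_by_email(conn, "ada@example.com")
missing = get_user_by_email(conn, "nobody@example.com")
```

Every call site maps to one statement that sits in version control next to it. The boilerplate is real, which is the cost the paragraph above describes.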

For long-lived OLTP systems with non-trivial query shapes, that predictability is worth more than the magic. For short-lived CRUD apps, it isn’t.

When ORMs still earn their place

ORMs have a place. It’s narrower than the industry’s default deployment suggests. The workloads where the velocity payoff consistently outweighs the coupling cost share two properties: they’re bounded in scope and they’re bounded in lifespan.

Short-lived prototypes and experiments. Projects that will be rewritten, replaced, or discarded within a year. Model-first iteration is genuinely faster when the schema is fluid, and the coupling cost doesn’t compound if the project doesn’t live long enough to hit it.
CRUD-heavy internal tools and admin UIs. Query shapes are uniform and simple, the workload won’t scale past the ORM’s comfort zone, and the system doesn’t outlive the product it supports. The ORM’s constraints function as a style guide rather than as a limit on what the application can do.

That’s the list. Not “projects where the team knows Rails.” Not “workloads with uniform query shape, for now.” Not “small teams.” Those framings start as short-lived exceptions and end up as the default, and once the project outlives its original scope the coupling cost compounds silently until it’s too expensive to remove.

The failure mode isn’t picking an ORM for a prototype. It’s keeping it ten years later, after the prototype has become the company’s main production system, after the workload has grown past its original shape, and after migrating off costs more than a rewrite of the application. Most of the ORM codebases engineers end up cursing started in one of the two bullets above and were never reconsidered when they outgrew them.

Trade-offs

Everything in this post has a counter-argument, and the counter-arguments are real.

ORMs save real time on simple queries. User.find(1) is shorter than SELECT * FROM users WHERE id = 1. Across a codebase it adds up.
Type safety in the application layer. Rails and ActiveRecord don’t give compile-time types. Hibernate’s entity types do, and Django’s model fields and SQLAlchemy’s typed columns at least declare field types that static checkers can work with. Raw SQL’s answer is schema-first code generation (sqlc, jOOQ), which works but requires tooling.
Domain modeling. Some teams legitimately want their data model to have methods, validations, and behavior co-located with the data. An ORM gives that for free; a query builder doesn’t.
Team familiarity. A team that knows Rails deeply will out-ship a team learning sqlc for the same project. The right answer depends on the team, not the abstract merits.
The middle ground isn’t free. Typed query builders require maintained type definitions. Schema-first code generation adds a build step. “No ORM” means a different abstraction, maintained by you.

The choice isn’t ideological. It’s a trade between two failure modes: the ORM’s coupling cost versus the query-builder’s boilerplate and maintenance cost. For short-lived systems, the ORM wins. For long-lived systems, the thinner layer wins. The catch is that most systems surviving their first year are long-lived, and most teams underestimate how long their system will live. If the project is still running three years from now, you’re probably in the second category whether or not you planned to be.

The bigger picture

The thing an ORM sells is a mapping between code and schema. The thing it delivers is a coupling. For short-lived projects (the prototype, the internal CRUD tool, the bounded experiment) the trade is worth it; the coupling cost is deferred, and by the time it would catch up the project has served its purpose or been replaced.

For projects that live long enough and grow complex enough (which is almost any project that survives its first year) the coupling becomes the dominant cost. Every major framework upgrade is a migration of its own. Every scale inflection requires working around the ORM’s opinions. Every query past the CRUD ceiling is raw SQL anyway. The better default for an application the team expects to still be running in three years is schema-first: keep the DDL canonical, keep queries as first-class code-reviewed artifacts, use a thin typed layer (sqlc, jOOQ, Kysely, Drizzle) to bridge to the application, and leave the ORM in the toolbox for cases that genuinely match its narrow strengths.

If you’re starting a project expected to live more than a year, default to schema-first. Inside an existing ORM codebase where the signals are showing up (raw-SQL ratio creeping up, migrations that require cross-team coordination, queries the ORM can’t express, performance paths that bypass it anyway) the useful question isn’t whether to migrate off. It’s where to draw the schema-first boundary for new work. Usually at new subsystems, not legacy code. Grandfather what’s there, pick up sqlc or jOOQ or Kysely for new code, and let the boundary move over years.

Schema Conventions Don't Survive Without Automation

Sun, 06 Apr 2025 00:00:00 +0000

TL;DR

Schema conventions only survive when automation enforces them. A rule a linter, ORM, migration runner, or IaC module checks will hold for years; a rule the team merely agreed to won’t outlast the people who agreed. Pick the conventions your automation needs and skip the purely subjective ones, because they’ll drift regardless of how strongly anyone feels.

Every long-lived schema accumulates conventions whether anyone picked them or not. The real question is which ones will still be followed two years from now. The answer, reliably, is the ones a piece of automation is enforcing. Everything else drifts. A new engineer joins, prefers camelCase, adds a few tables. The next one prefers plural names, adds a few more. The convention wasn’t wrong, and nobody broke any rule. There was no rule to break. The schema simply recorded every preference of everyone who ever touched it.

The corollary is the thesis of this post. Don’t pick conventions for human reasons alone. Pick them because a tool needs them, enforce them with that tool in CI, and leave the rest alone. If a question is purely about taste (where a timestamp column sits in column order, whether to prefix table names with a service name) and no automation will fail when the answer changes, skipping the decision is cheaper than picking one and pretending it’ll hold.

The inconsistency cost isn’t linear either. Two generations of conventions coexisting is annoying but manageable. Four or five (introduced gradually, each time someone decided to “do it the new way”) compounds into something nobody can reason about and no tool can rely on.

What “conventions” means here

Conventions in this post means the decisions that apply across every table, not the design of any particular table:

Naming. snake_case, camelCase, or ALLCAPS for tables and columns.
Table names. Singular (user) or plural (users).
Primary keys. Bare id or <table>_id. BIGINT, UUID, or composite.
Foreign keys. user_id referencing users.id, or ad-hoc names like owner and creator.
Mandatory columns. created_at, updated_at, deleted_at, created_by. Which tables need them and which don’t.
Status and enum patterns. INT with documented values, CHECK constraint, or native ENUM. Zero-indexed or one-indexed.
Boolean naming. is_active, has_completed, can_edit, or bare active / completed.
Timestamp types. TIMESTAMP, DATETIME, TIMESTAMPTZ. Timezone-aware or naive.
Character sets and collations. utf8mb4 vs latin1; en_US.UTF-8 vs C.

None of these have one right answer. All of them have consequences that multiply across the lifetime of the schema.

Humans benefit, but not durably

Consistent schemas are easier for humans. Onboarding is faster, review is mechanical, queries are predictable. These benefits are real. They’re also entirely dependent on something other than memory holding the convention in place.

A new engineer spends less time building a mental model when PKs, FKs, and timestamps are named the same way everywhere. True, and the convention enabling it exists only as long as someone is actively keeping it enforced.

A migration adding CustomerReference INT in a codebase where everything else is customer_id BIGINT gets flagged when conventions are consistent. True, and whether it actually gets flagged depends on whether the reviewer remembers the rule or a linter is enforcing it.

JOIN users ON orders.user_id = users.id works without a lookup when the convention is <table>_id. True, and the query is right only because every prior migration followed the rule, which is only the case if something kept them on track.

The pattern: every human benefit is downstream of enforcement. A rule that exists only because the current team agreed to it lasts exactly as long as that team does. People change jobs, preferences evolve, new hires bring their own instincts. Within a few quarters of turnover, a human-only convention is gone, and so is the benefit.

The reasons worth picking a convention are the reasons a machine can enforce it.

Why it matters for automation

Automation is the only thing that holds a convention over time. A linter fails the build when snake_case becomes camelCase and keeps failing until someone addresses it; a team agreement doesn’t. The tools below are both the enforcement mechanisms and, by that logic, the only reasons a convention is worth picking in the first place. If none of them apply to your stack, the convention probably isn’t worth the debate.

Every tool that touches the schema reads conventions implicitly. When conventions are consistent, the tool works without configuration. When they’re not, someone has to tell the tool how to handle each exception. Usually in a config file nobody maintains.

ORMs rely on naming rules. ActiveRecord assumes a table named users has a primary key id and that a user_id column is the foreign key. Deviate and you write explicit mappings. Every non-standard table adds a line of configuration; every belongs_to :author, foreign_key: :creator_ref is convention drift showing up as code. Other ORMs are more explicit but still benefit from predictable column names: autogeneration works, inference works, magic methods work.
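The inference itself is simple, which is why it breaks so quietly. The Python sketch below is a deliberately naive stand-in for what convention-based tooling does: infer_fk_target is hypothetical, and pluralizing by appending “s” is an assumption of the sketch, not how ActiveRecord or any other ORM inflects names.

```python
from typing import Optional, Set

def infer_fk_target(column: str, tables: Set[str]) -> Optional[str]:
    """Guess the referenced table from a '<singular>_id' column name.

    Deliberately naive: pluralizes by appending 's', which is an assumption
    of this sketch and not how any particular ORM inflects names.
    """
    if not column.endswith("_id"):
        return None
    candidate = column[: -len("_id")] + "s"
    return candidate if candidate in tables else None

tables = {"users", "orders", "invoices"}
columns = ["user_id", "order_id", "creator_ref", "owner"]

inferred = {c: infer_fk_target(c, tables) for c in columns}
# Columns the convention cannot explain need a hand-written mapping somewhere.
needs_explicit_mapping = [c for c, target in inferred.items() if target is None]
```

For creator_ref and owner the function cannot even tell whether the column is a foreign key, let alone where it points; that knowledge has to live in configuration.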

Code generators produce better output. sqlc, Prisma, jOOQ, and similar tools read schema metadata and emit type-safe client code. Consistent naming means the generated output looks like hand-written code. Inconsistent naming produces getCustomerReferenceByUserId() sitting next to getOrderByUserId(), same concept, different shape, every caller has to remember the difference.

Framework-managed timestamps depend on mandatory columns. Frameworks that manage created_at / updated_at automatically assume every table has them. Tables that omit these columns silently break the assumption: inserts work, updates work, but the “last modified” display in an admin UI shows null for some tables and not others.

Deployment pipelines assume a consistent migration shape. Migration runners that execute schema changes as part of CI/CD (Flyway, Liquibase, Alembic, Atlas, skeema) rely on migration files following a predictable naming and ordering convention, up/down scripts that mirror each other, and tables that don’t need per-case special-handling. Zero-downtime patterns like expand-and-contract assume updated_at exists for cache invalidation, that new columns are nullable or have defaults so old and new application versions can both write the table, and that soft-delete markers are consistent so rolling deploys across mixed versions don’t resurrect rows one version thought were gone. Every convention that drifts turns a deploy playbook into a per-table checklist, and the checklists are what get skipped under time pressure.

Schema diffing and drift detection depend on consistent shape. Tools like Atlas and skeema compare the desired schema (in version control) to the actual state of each environment and generate the migration to reconcile them. They work well when naming, types, and mandatory columns are uniform, and produce noisy diffs, false positives, and hand-maintained exception lists when they aren’t. Environment parity between dev, staging, and prod degrades the same way: the drift the team never notices becomes the one that breaks a deploy at the worst time.

Schema linters only work if there’s a rule to check. SQLFluff, sqruff, and similar tools can enforce naming conventions, require certain columns on new tables, reject forbidden types, and flag style issues. But the lint rule has to match the team’s convention. No convention, no rule. No rule, no enforcement.

Documentation generators like tbls and SchemaSpy produce browsable schema docs straight from the catalog. Consistent conventions make the generated output navigable. Inconsistent ones make it look like a dump.

Schema-reading LLMs and RAG pipelines have joined the same list. Copilot, MCP-backed agents, text-to-SQL tools, and retrieval-augmented coding systems pull column names and types from information_schema and pattern-match them against natural-language questions. When one table uses createdAt, another uses created_date, and a third uses date_created, the model either generalizes from the most-frequent variant and gets the other two wrong, or hedges and produces verbose conditional SQL. Uniform naming lets the model carry an assumption across tables without re-checking the catalog for every column; the accuracy gains from clean conventions stack on top of the 27% lift studies attribute to column comments alone. Conventions that were about making humans and codegen tools agree turn out to matter just as much for the machine-reading layer.

The common thread: tools treat conventions as a contract. When the contract holds, tools work. When it doesn’t, tools either break or force the team to maintain exceptions forever.

The contract is implicit

Nobody writes down that created_at must be a TIMESTAMPTZ or that FKs must be named <table>_id; the tooling silently starts expecting it. The moment a table violates the expectation, every tool built on it starts producing surprises. Conventions are a contract whether or not anyone acknowledges them, and the tools are the ones keeping score.

Each decision below matters only if something in your stack cares about it. The notes below lean on what tools typically expect. Pick the option that matches your automation. If nothing in your stack cares either way, skip the decision; it won’t survive the next round of team change regardless of which side “won” the debate.

Naming: snake vs camel

snake_case is the idiomatic choice for PostgreSQL and MySQL. PostgreSQL folds unquoted identifiers to lowercase, so created_at is unaffected but an unquoted createdAt becomes createdat. A column created as "createdAt" therefore has to be quoted in every query that touches it. camelCase works if the team is disciplined about quoting, but most teams aren’t. Pick snake_case unless there’s a specific reason not to.
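A minimal sketch of the folding behavior in PostgreSQL. The table name folding_demo is hypothetical and exists only for this illustration.

```sql
-- Hypothetical table, for illustration only (PostgreSQL).
CREATE TABLE folding_demo (
    created_at  TIMESTAMPTZ,
    "createdAt" TIMESTAMPTZ   -- quoted at creation, so the mixed case is preserved
);

SELECT created_at  FROM folding_demo;  -- works: the name is already lowercase
SELECT "createdAt" FROM folding_demo;  -- works: quoted, matches the stored name
SELECT createdAt   FROM folding_demo;  -- fails: folded to createdat, which does not exist
```

The third statement is the one that bites: it looks identical to the column name in the DDL, but the parser never sees the capital A.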

Table names: singular or plural

Both work. Rails defaults to plural (users); Django defaults to <app>_<model>, which comes out singular (auth_user). CREATE TABLE user fails in PostgreSQL unless the name is quoted, because user is a reserved word, which is an argument for plural. Singular reads cleaner in joins (user.id feels like “the user’s id”). This is the smallest decision on the list in terms of consequences. The real requirement is that whatever you pick, you use it everywhere.

Primary keys: `id` vs `<table>_id`

Bare id is shorter and matches the default of most ORMs. It also creates a subtle hazard: table_a.id = table_b.id is syntactically valid SQL that silently returns wrong results. <table>_id (so user_id on the users table) makes an accidental cross-table join hard to miss, because each identifier says which table the ID belongs to, and orders.order_id = users.user_id reads as wrong on sight.

The trade-off is that ORM defaults expect id, so using <table>_id means configuring every model. For teams that rely heavily on an ORM’s conventions, staying with id is pragmatic. For teams with more ad-hoc SQL, <table>_id pays off.
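The hazard is easier to see side by side. This is a sketch with hypothetical tables (the _a and _b suffixes only distinguish the two naming variants), not a recommendation for either layout.

```sql
-- Variant A: bare id on every table.
CREATE TABLE users_a  (id BIGINT PRIMARY KEY);
CREATE TABLE orders_a (
    id      BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users_a (id)
);

-- Parses and runs, but compares an order's key to a user's key.
SELECT o.id
FROM orders_a o
JOIN users_a u ON o.id = u.id;

-- Variant B: <table>_id primary keys.
CREATE TABLE users_b  (user_id BIGINT PRIMARY KEY);
CREATE TABLE orders_b (
    order_id BIGINT PRIMARY KEY,
    user_id  BIGINT NOT NULL REFERENCES users_b (user_id)
);

-- The same slip still runs, but order_id = user_id is visibly wrong in review.
SELECT o.order_id
FROM orders_b o
JOIN users_b u ON o.order_id = u.user_id;

-- The intended join; matching column names also allow USING.
SELECT o.order_id
FROM orders_b o
JOIN users_b u USING (user_id);
```

Neither variant makes the database reject the bad join. The difference is only in how quickly a reviewer or a linter can spot it.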

Foreign key naming

user_id referencing users.id is the convention most tools expect. Ad-hoc names like owner, creator, assigned_to, ref_id are sometimes necessary (multiple FKs to the same table need different names) but should be explicit about what they reference, either in the column name (owner_user_id) or in a schema comment. A column named owner with no comment and no FK is a question nobody can answer from the schema alone.
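A sketch of two foreign keys to the same table, with the target spelled out in the column name, the constraint, and a comment. The projects table and both column names are hypothetical.

```sql
-- PostgreSQL syntax; users is reduced to its key for the example.
CREATE TABLE users (id BIGINT PRIMARY KEY);

CREATE TABLE projects (
    id               BIGINT PRIMARY KEY,
    owner_user_id    BIGINT NOT NULL REFERENCES users (id),
    reviewer_user_id BIGINT          REFERENCES users (id)
);

-- The comment lands in the catalog, where tools and reviewers can read it.
COMMENT ON COLUMN projects.reviewer_user_id
    IS 'User who signs off on the project; NULL until assigned.';
```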

Mandatory columns

Decide which columns every table must have. Common choices:

created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(). Row creation time.
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(). Last modification, driven by a trigger or application logic.
created_by / updated_by. Audit fields, if the team needs them.
deleted_at TIMESTAMPTZ. Soft-delete marker.
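One way the trigger-driven updated_at could look in PostgreSQL. The function name set_updated_at and the widgets table are assumptions for the example; EXECUTE FUNCTION needs PostgreSQL 11 or later (older versions spell it EXECUTE PROCEDURE).

```sql
-- Shared trigger function: stamp the row on every UPDATE.
CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.updated_at := NOW();
    RETURN NEW;
END;
$$;

-- Hypothetical table carrying the mandatory columns.
CREATE TABLE widgets (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       TEXT        NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    deleted_at TIMESTAMPTZ              -- NULL means the row is live
);

CREATE TRIGGER widgets_set_updated_at
BEFORE UPDATE ON widgets
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
```

If the application or ORM already maintains updated_at, the trigger is redundant; the point is that exactly one mechanism should own the column on every table.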

Partial adoption is worse than none

If 80% of tables have deleted_at and 20% don’t, every query has to remember which tables to filter and which not to. The queries that forget silently return soft-deleted rows from some tables and not others. Pick a rule (“every table has created_at, updated_at; soft-delete tables have deleted_at”) and apply it uniformly.

Status and enum patterns

Three common strategies, each with trade-offs:

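INT with documented values. status TINYINT NOT NULL COMMENT '1=active, 2=paused, 3=cancelled'. Compact, fast, relies on comments for semantics. The pattern works across engines, though the syntax shown is MySQL’s; PostgreSQL has no TINYINT or inline COMMENT, so it would use SMALLINT plus COMMENT ON COLUMN.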
CHECK constraint. status VARCHAR(20) CHECK (status IN ('active', 'paused', 'cancelled')). Self-documenting in the DDL, slightly larger storage, human-readable in query results.
Native ENUM. PostgreSQL has first-class ENUM types, MySQL has ENUM(...). Compact and typed, but changing the set requires a schema migration; in PostgreSQL, removing a value is genuinely hard.

Any of these is fine. Mixing them (one table uses INT, another uses CHECK, a third uses ENUM) is what creates the problem. Every query that aggregates across tables has to handle three value formats.
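A rough way to see whether the strategies are already mixed, assuming PostgreSQL, the public schema, and that status columns are literally named status (all three are assumptions to adjust for the real schema).

```sql
-- One row per distinct type used for a column named status.
-- More than one row means the schema mixes strategies.
SELECT data_type,
       udt_name,                 -- for native enums this is the enum type's name
       COUNT(*) AS column_count
FROM information_schema.columns
WHERE table_schema = 'public'
  AND column_name = 'status'
GROUP BY data_type, udt_name
ORDER BY column_count DESC;
```

Native enums show up with data_type USER-DEFINED, while INT and VARCHAR variants appear under their own type names, so the three strategies separate cleanly in the output.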

Boolean prefixes

is_active, has_completed, can_edit make filter expressions self-documenting: WHERE is_active AND NOT is_deleted. Bare names like active or completed create ambiguity in review. Is this column a flag or a timestamp? Adjective or verb? Prefixing eliminates the ambiguity at no runtime cost.
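If the prefix rule is adopted, it can be asserted against the catalog. A sketch for PostgreSQL, assuming the public schema and the three prefixes named above as the allowed set:

```sql
-- Boolean columns that do not start with is_, has_, or can_.
-- A non-empty result would fail the check.
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND data_type = 'boolean'
  AND column_name::text !~ '^(is|has|can)_';
```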

Timestamp types

The choice matters more than the name. TIMESTAMP in MySQL auto-converts between UTC and the session timezone, which is usually not what you want. DATETIME stores the literal value with no timezone awareness. PostgreSQL’s TIMESTAMPTZ stores UTC with automatic conversion on input and output, the most forgiving option for most applications.

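Mixing types across related tables is where silent timezone bugs come from. A created_at TIMESTAMPTZ on one table compared with a plain TIMESTAMP on another in PostgreSQL, or a TIMESTAMP compared with a DATETIME in MySQL, is implicitly cast using the session time zone, so the same join can match or miss depending on how the session is configured. Pick one per engine and apply it everywhere.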
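In PostgreSQL the hazard appears when TIMESTAMPTZ meets plain TIMESTAMP: the zone-less value is interpreted in the session time zone before the comparison. A small demonstration using literals only, so no tables are assumed:

```sql
SET TIME ZONE 'UTC';
SELECT TIMESTAMPTZ '2025-01-01 12:00:00+00'
     = TIMESTAMP   '2025-01-01 12:00:00' AS same_instant;
-- true: the zone-less value is read as 12:00 UTC

SET TIME ZONE 'America/New_York';
SELECT TIMESTAMPTZ '2025-01-01 12:00:00+00'
     = TIMESTAMP   '2025-01-01 12:00:00' AS same_instant;
-- false: the same zone-less value is now read as 12:00 New York time (17:00 UTC)
```

The expression is identical in both statements; only the session setting changed. A join condition between mismatched columns behaves the same way.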

Character sets and collations

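utf8mb4 in MySQL, UTF-8 in PostgreSQL. Anything else in 2026 is a legacy holdover. The subtle hazard is in MySQL, which allows a different charset and collation per column: joins between mismatched text columns can fail with an “Illegal mix of collations” error, or run with an implicit conversion that skips the index and compares differently than expected. PostgreSQL fixes the encoding per database, so only collations can vary per column; MySQL is more permissive and more dangerous because of it.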
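For MySQL, a catalog query along these lines can list the columns that deviate. It is a sketch that assumes the check runs while connected to the application database, so DATABASE() returns the schema to inspect.

```sql
-- MySQL: text columns whose character set is not utf8mb4.
SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = DATABASE()
  AND character_set_name IS NOT NULL      -- non-text columns have no charset
  AND character_set_name <> 'utf8mb4';
```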

Conventions beyond the schema

Schema conventions usually stop at the DDL, but the automation layer around the database depends on naming decisions that live outside it: secrets, endpoints, users, roles, hostnames, backup files, environment variables. Those names show up in Terraform modules, Vault paths, Kubernetes resources, IAM policies, service-discovery records, monitoring dashboards, and every deploy pipeline. When they’re consistent, the infrastructure is self-describing and IaC modules stay generic. When they aren’t, every piece of automation grows a special case.

Common places this shows up:

Secret names. prod/db/orders/primary/password vs prod-orders-db-pw vs orders_prod_password. A clear prefix/suffix pattern lets secret rotation scripts, IAM scopes (arn:aws:secretsmanager:*:*:secret:prod/db/*), and environment-promotion automation use wildcards instead of hardcoded lists.
Hostnames and endpoints. db-orders-rw.internal and db-orders-ro.internal for reader/writer splits, db-orders-primary-0.us-east-1 for cluster node addressing. Consistent patterns mean DR runbooks, connection pools, and failover scripts can resolve endpoints by transforming a base name rather than reading from config.
Database users and roles. app_orders_rw, app_orders_ro, migration_bot, readonly_analytics. The role name should say what it can do. Teams without a convention end up with svc_user_42, rails, monitoring, and nobody can audit privileges without a spreadsheet.
Database names. orders_prod vs prod_orders vs orders-production. Consistent environment placement (always suffix or always prefix) means wildcard grants, backup pattern matching, and cross-environment queries stay simple.
Environment variables. DB_ORDERS_HOST, DB_ORDERS_USER, DB_ORDERS_PASSWORD_SECRET. A per-service naming convention lets config loaders and IaC modules generate the full variable set from a single identifier.
Backup and snapshot names. orders-prod-20260420-0000 vs backup_orders_20260420. Retention jobs, restore runbooks, and compliance audits all read these names by pattern.
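For the database-role item, a sketch of what the pattern buys in PostgreSQL. It assumes the orders service's tables live in the public schema; the role names come from the list above, everything else is illustrative.

```sql
-- Read-only role: the name states the privilege.
CREATE ROLE app_orders_ro NOLOGIN;
GRANT USAGE ON SCHEMA public TO app_orders_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_orders_ro;

-- Read-write role builds on the read-only one.
CREATE ROLE app_orders_rw NOLOGIN;
GRANT app_orders_ro TO app_orders_rw;
GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_orders_rw;

-- Auditing by pattern instead of by spreadsheet.
SELECT rolname
FROM pg_roles
WHERE rolname LIKE 'app\_%'
ORDER BY rolname;
```

The audit query only works because the prefix is uniform; with names like svc_user_42 there is nothing to match on.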

These aren’t schema conventions in the strict sense; they’re operational conventions that happen to be tied to the schema. They follow the same rules: pick a pattern, apply it everywhere, document it where the infrastructure code lives, and enforce it in the IaC linter (tflint, checkov) or the Kubernetes admission controller so new resources can’t be named off-pattern.

The failure mode is the same as inside the schema. A team with three secret-naming patterns needs a custom script per resource. A team with three hostname patterns runs DR runbooks twice as long as they should be. Operational conventions have the same compounding cost as schema conventions, in a different layer; the tooling to enforce them is different (IaC linters instead of SQLFluff), but the discipline is identical.

Enforcement: conventions without enforcement decay

Written conventions that nobody enforces last until the next person who didn’t read the doc. The only conventions that hold over years are the ones CI checks.

Schema linters

SQLFluff is the most popular for PostgreSQL and MySQL. It runs on migration files in CI and, through its built-in rules plus custom rules shipped as plugins, can enforce:

Naming rules (snake_case only, specific prefixes/suffixes).
Required columns on CREATE TABLE (every table must have created_at).
Forbidden types (reject TIMESTAMP in favor of TIMESTAMPTZ).
Style (trailing commas, keyword casing, indentation).

The alternative is a custom linter, a script that parses migration files and checks them against a ruleset. More work to build but more flexible if the rules are unusual. Teams with strong opinions often end up here.

CI checks on the schema itself

Beyond linting migration files, a CI job can introspect the database after migrations are applied and assert properties of the final schema:

-- Every table in the application schema has created_at
SELECT table_name
FROM information_schema.columns
WHERE table_schema = 'public'
GROUP BY table_name
HAVING COUNT(*) FILTER (WHERE column_name = 'created_at') = 0;

If the result is non-empty, fail the build. This catches the migration that adds a new table without the mandatory columns, the case a file-level linter can miss if the CREATE TABLE was split across migrations.

Other useful assertions:

-- No table uses TIMESTAMP without timezone
SELECT table_name, column_name
FROM information_schema.columns
WHERE data_type = 'timestamp without time zone'
 AND table_schema = 'public';

-- Every FK column has an index
-- (expensive to query but worth running on schedule)

Introspection-based checks run against the shape of the schema after migrations are applied; they catch drift the file-level linter can’t see.
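The FK-index assertion that the block above leaves as a comment could be sketched like this for PostgreSQL. It reads the system catalogs directly and assumes the application tables live in the public schema; treat it as a starting point rather than a complete check.

```sql
-- Foreign keys with no index whose leading columns cover the FK columns.
SELECT c.conrelid::regclass AS table_name,
       c.conname            AS fk_constraint
FROM pg_constraint c
WHERE c.contype = 'f'
  AND c.connamespace = 'public'::regnamespace
  AND NOT EXISTS (
        SELECT 1
        FROM pg_index i
        WHERE i.indrelid = c.conrelid
          AND i.indpred IS NULL          -- ignore partial indexes
          -- the index's first N columns must contain all N FK columns, in any order
          AND (i.indkey::int2[])[0:cardinality(c.conkey) - 1] @> c.conkey
  );
```

A non-empty result lists the constraints to look at. Whether each one deserves an index is still a judgment call, since small or rarely joined tables may not need one.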

Pre-commit hooks

Developer-machine enforcement: running sqlfluff on staged migration files before commit. Faster feedback than CI, but only works if every developer has the hook installed. Treat pre-commit hooks as a developer experience improvement, not as the real gate. CI is the gate.

CODEOWNERS on migration directories

Putting a small group of owners on migrations/ forces review by someone who understands the conventions. This is a human check, not a mechanical one, but it catches things the linter can’t (“this new table has all the right columns but the design is wrong”). The owner doesn’t have to be one person; a rotating review responsibility works.

Review templates

A PR template that includes a checklist for schema changes (“does this follow the naming convention? does it include mandatory columns? are the types consistent with existing tables?”) nudges the author to check before review. The cost is zero; the benefit is that most issues get caught before they reach a reviewer.

Scope: strict for new, lenient for legacy

The enforcement question that derails most teams: do existing tables have to meet the convention? Trying to retrofit decades of legacy is an impossible project; requiring only new tables to meet the convention is achievable. The practical pattern:

New tables. Linter is strict. No exceptions without a documented reason.
Existing tables. Grandfathered. Linter skips them or only checks newly-added columns.
Legacy migrations. An explicit backlog, prioritized by frequency of use and onboarding pain.

This splits the problem into “hold the line on new work” and “improve legacy opportunistically.” Both are manageable. Trying to do both at once isn’t.
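One way to encode the split is an explicit exemption list that the CI assertion consults. The schema_lint_exemptions table and the rule label require_created_at are hypothetical names for this sketch (PostgreSQL, public schema assumed).

```sql
-- Grandfathered tables are listed explicitly, each with a reason.
CREATE TABLE schema_lint_exemptions (
    table_name TEXT NOT NULL,
    rule       TEXT NOT NULL,
    reason     TEXT NOT NULL,
    PRIMARY KEY (table_name, rule)
);

-- Tables missing created_at that are NOT on the exemption list.
SELECT t.table_name
FROM information_schema.tables t
WHERE t.table_schema = 'public'
  AND t.table_type = 'BASE TABLE'
  AND t.table_name <> 'schema_lint_exemptions'
  AND NOT EXISTS (
        SELECT 1
        FROM information_schema.columns c
        WHERE c.table_schema = t.table_schema
          AND c.table_name   = t.table_name
          AND c.column_name  = 'created_at')
  AND NOT EXISTS (
        SELECT 1
        FROM schema_lint_exemptions e
        WHERE e.table_name = t.table_name
          AND e.rule = 'require_created_at');
```

New tables fail the build unless someone adds a row with a written reason, and the exemption table doubles as the legacy backlog.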

The hardest part: changing conventions without creating a new one

Conventions decay not because they were bad, but because they changed faster than the team could propagate the change. The result isn’t “the new convention”. It’s a schema with three coexisting conventions, none of which applies everywhere.

The discipline is straightforward, even if it’s not always followed.

Write the convention down

Before enforcement, before any migration, there has to be a single authoritative document: a SCHEMA-CONVENTIONS.md in the repo, or a runbook, or an RFC. Not a Slack thread, not tribal knowledge. Something a new engineer can read and apply.

The doc is short by design: a page or two, not a book. It answers “what naming convention do we use?” and “what columns does every table need?” and “which timestamp type?”. It doesn’t try to teach relational design. Short docs get read; long ones don’t.

Use a lightweight RFC process for changes

When someone wants to change a convention (switch from id to <table>_id, add updated_by as a mandatory column, move from INT to UUID primary keys) it goes through a written proposal:

What’s changing and why.
Impact on existing tables (migrate all, grandfather, or cutover by date).
Impact on tools, ORMs, dashboards, and downstream consumers.
Who decides (single decision-maker or review board).
Explicit cutover date if changing for new work only.

The RFC doesn’t have to be heavyweight. A paragraph in a shared doc, reviewed by two or three people, approved by a named owner. The value isn’t the document. It’s the forcing function that prevents conventions from changing by PR comment.

Decide: migrate, grandfather, or both

Three options, each with a different risk profile:

Migrate everything. Rename columns across the schema, update every query, every ORM model, every dashboard. This is the clean option and almost never the practical one. Retroactive renaming breaks downstream consumers the team may not even know exist: analytics jobs, exports, integration partners, cached query plans.
Grandfather legacy, enforce on new. Old tables stay as-is; new tables follow the new rule. The schema ends up with two conventions coexisting, but it’s predictable: “tables before this date use X, tables after use Y.”
Cutover with a migration window. Pick a date, migrate the highest-traffic or highest-visibility tables before the date, grandfather the rest, close out the long tail opportunistically.

The grandfather option is the most common in practice because it respects the reality that the schema is a shared resource nobody fully owns. Write the decision down (“before 2025-Q3, tables used camelCase; after, snake_case”) so future engineers know the split exists and isn’t a bug.

The two-generation rule

Two is the limit

One convention is best. Two coexisting conventions is survivable: new engineers can be told “look at the table’s creation date.” Three or more is where schemas become unreviewable. Any proposal to change a convention needs to answer: “are we ending up with two generations, or a third?” A third generation is a forcing function to finish migrating the first one first, not to introduce a new one.

This is a heuristic, not a hard rule, but it’s a useful test. When a proposed change would create a third convention without a plan to eliminate one of the existing two, the change probably isn’t worth it.

When to accept legacy drift

Not every legacy convention is worth fixing. The calculation:

How often does the old convention cause bugs? Column names nobody can remember, types that force implicit casts, missing mandatory columns that break tooling. Real costs, worth migrating.
How often is the table touched? A table used by ten queries a day is different from one used by ten thousand. Migration risk scales with usage.
What breaks downstream? ORM models, dashboards, exports, cached plans, monitoring. Every consumer of the table name or column name has to update. If the count is unknown, it’s higher than you think.
Is there a cheap alternative? A VIEW that exposes the table under the new convention, while the underlying table keeps its legacy name, can bridge the gap without a full migration.

The honest answer is often “leave it alone and document why.” A comment in the schema, or a note in the conventions doc, is cheaper than a migration and accomplishes the main goal: making the inconsistency visible and intentional.
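Both cheap options, the bridging view and the documenting comment, fit in a few lines. The legacy table and column names below are invented for the example (PostgreSQL syntax).

```sql
-- Hypothetical legacy table with pre-convention naming.
CREATE TABLE "tblCustomer" (
    "CustID"      BIGINT PRIMARY KEY,
    "CustName"    TEXT        NOT NULL,
    "CreatedDate" TIMESTAMPTZ NOT NULL
);

-- New code reads the table under the current convention.
CREATE VIEW customers AS
SELECT "CustID"      AS id,
       "CustName"    AS name,
       "CreatedDate" AS created_at
FROM "tblCustomer";

-- The inconsistency is recorded where the next engineer will look.
COMMENT ON TABLE "tblCustomer"
    IS 'Legacy naming kept on purpose; new code should use the customers view.';
```

In PostgreSQL a simple single-table view like this is also automatically updatable, so writers can move over gradually, though that is worth testing against the real table before relying on it.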

Trade-offs

Conventions have a cost. A rule that doesn’t serve automation is noise. It takes space in the conventions doc, invites bikeshedding in review, and adds nothing to the schema’s consistency over time, because there’s nothing to keep it from decaying the moment the people who cared move on. The heuristic: if no tool fails when the rule is violated, the rule doesn’t need to exist.

Over-specifying is the second failure mode. A team with thirty linter rules will find a way around them or ignore them. Rules that block common, legitimate cases get bypassed with -- noqa comments until the linter stops being a gate.

The lightweight approach:

A small set of rules, each one tied to a specific tool that cares (naming, mandatory columns, forbidden types).
A larger set of advisory warnings, not blockers.
A clear escape hatch for exceptions, with the exception documented.
Periodic review. Rules that fire too often are wrong; rules that never fire are noise.

Strict conventions are a feature as long as enforcement keeps up with the rule count. Beyond that point, they become a tax on every change. The right level is the smallest set automation will actually enforce without constant arguments.

The bigger picture

The useful question is what your automation needs, and whether a machine can enforce it. If yes, pick the convention your automation needs and wire it into CI. If no, skip the decision; debating aesthetics in the absence of enforcement produces nothing that will still be true a year from now. People change, teams turn over, preferences drift. A convention enforced by a linter doesn’t care who wrote the migration; a convention enforced by “we agreed last quarter” does.

The schemas that age well are the ones where the only surviving conventions are ones a linter, ORM, migration runner, or IaC module is actively enforcing. Everything else (bikeshed questions about singular vs. plural, religious debates about column ordering) drifts the moment the people who cared stop working there. That’s the predictable result of anchoring a rule to something as ephemeral as a team’s current preference.

Where Business Logic Lives - Database vs. Application

Wed, 19 Mar 2025 00:00:00 +0000

TL;DR

Keep the database narrow: NOT NULL, UNIQUE, FK within a service, simple CHECK for per-row invariants, generated columns for stable derived values. Put everything else (orchestration, computation, rules that change weekly, anything crossing services) in an application-layer library every writer uses. “Dumb database” is half right: dumb across service boundaries, narrowly smart within one.

amount >= 0 lives in three places. A CHECK on the column, a Pydantic validator in the API model, a guard in the order-creation service. Added in different quarters by different teams. Out of sync since GDPR forced a change to the validator that nobody propagated to the constraint. The migration tightening the CHECK to match fails on 4,000 rows the application thought were fine.

This is the default state of any rule about valid data, eventually. It lives in more than one place. The places drift. The reflex answer, “both layers for safety,” is what produced the drift in the first place; “application-only because we have microservices” is the same answer applied to a different fashion cycle. Neither is a decision, both are defaults. The useful question is what each layer can enforce, what it costs, and how often the rule will change. Four axes do the work: scope, cadence, cost, and write-path count.

The short history of the “dumb database” position

The microservices canon and the cloud databases built to support it have already answered one half of this question.

Chris Richardson’s database-per-service pattern rules out cross-service foreign keys as a design choice: each service owns its schema and no one else touches it. Fowler and Lewis’s “Microservices” article coined “smart endpoints and dumb pipes” and “decentralized data management”. Neither the middleware nor a shared database holds cross-service logic. Fowler calls the alternative, integration through a shared database, the canonical encapsulation breach. Vaughn Vernon’s DDD work puts the consistency boundary at the aggregate, enforced in process, not in the DBMS.

The storage layer follows suit. Google Spanner does not support user-defined stored procedures or triggers; its docs explicitly say that on migration, “business logic implemented by database-level stored procedures and triggers must be moved into the application.” DynamoDB has no CHECK, no foreign keys, no triggers; integrity is a per-item conditional write. Cassandra, Bigtable, and Uber’s Schemaless are the same story. Facebook’s TAO keeps the social graph’s integrity inside TAO itself; the underlying MySQL shards don’t enforce it. Shopify, even inside a Rails monolith, doesn’t enforce relationships at the database layer; foreign keys are maintained only in the model code, a choice driven by their sharding and cell architecture.

That’s the position the last fifteen years of large-scale engineering has converged on, and it’s right in the scope it applies to. Across service boundaries, the database physically can’t enforce most cross-cutting rules, the dominant cloud storage engines won’t host procs or triggers, and the pattern literature has codified the split.

The mistake is generalizing from this to “the database should be dumb, period.” That collapses two different debates into one slogan.

Where the position is strong and where it isn’t

The near-unanimous consensus is about cross-service integrity: FK between services, triggers as integration glue, stored procs as the coordination layer. There the answer is genuinely settled. Application-layer, usually in a shared library, sometimes in an orchestration service.

The within-service question is different. Inside a single service’s private schema, with one team owning the reads and writes, the database still sees every write path the service produces: the normal request path, backfill scripts, admin tools, the occasional DBA command at 2am, the new code path the team added last sprint. Richardson, Fowler, and Vernon don’t argue against NOT NULL, CHECK, or UNIQUE inside that boundary. Shopify’s position is an outlier driven by sharding operations, not ideology. Yugabyte goes further and defends stored procedures and triggers inside a service boundary.

So the real framing: the “dumb database” position is unanimous across service boundaries and contested within them. The rest of this post is about where the line actually sits within a service. The honest answer is still “mostly keep the database lean, but not empty,” for reasons that have more to do with deployment cadence and scaling economics than with purity.

The four axes that actually decide the split

The rule-by-rule question is a balance across four properties of the system, not a preference between layers.

1. Scope: does this rule cross service boundaries?

If the rule spans services, the database can’t enforce it. A foreign key into another service’s database doesn’t exist. A trigger that writes to tables owned by another team isn’t compatible with any sane microservices pattern. Cross-service correctness lives in application code, typically in a library that every writing service depends on, or in event-driven compensation (sagas, outbox patterns, eventual-consistency protocols).

The only databases that let you enforce cross-service rules are ones the pattern literature treats as an anti-pattern on purpose: shared databases with multiple writers.

2. Cadence: how often does this rule change?

Application code deploys in minutes. Schema migrations deploy on a migration window, with expand-and-contract dances, NOT VALID + VALIDATE phases, and careful ordering across rolling deploys. A rule that lives in the database inherits the database’s deployment cadence.

That’s fine for rules that change annually or never: “email column is not null”, “amount is non-negative”, “status is one of four values for the life of the product”. It’s painful for rules that change with product experiments: pricing logic, promotion codes, fraud thresholds, discount stacking rules, feature gates. The friction of modifying a CHECK constraint or a stored procedure for a rule that’s going to change again next quarter adds up to “this probably shouldn’t have been in the database in the first place.”

3. Cost: where can this rule run cheapest?

The application tier scales horizontally. The primary database, for most OLTP workloads, scales vertically until sharding, and sharding is a project, not a tuning knob. Every CPU cycle spent inside the database is a cycle not spent on I/O, lock management, query planning, or serving other requests. A busy primary at 80% CPU doesn’t have slack for an additional stored procedure body to run on every write.

For a simple CHECK (amount >= 0), the cost is measured in nanoseconds per write. Irrelevant. For a trigger that recomputes an aggregate on every insert, the cost is a hot row plus whatever the aggregation costs, charged to the most scarce compute tier in the system. For a stored procedure that loops over rows, the cost is full procedure-body CPU on the primary for every call.

Application code, by contrast, has near-free horizontal scale. Adding a pod is cheap. Adding database CPU is vertical-scaling dollars until you’ve run out of instance sizes, then it’s a sharding project.

The database is a vertical-scaling tier

Moving computation into the database moves it toward the scaling ceiling. Declarative constraints (CHECK, FK, UNIQUE) are cheap enough to be irrelevant. Triggers that do nontrivial work, procedures that run loops, and anything that touches multiple rows per call eat CPU on the one tier that’s hardest to scale. The “app can do this orders of magnitude faster” intuition is right when “faster” is measured in throughput under load: not because a single call is faster, but because the application tier absorbs more of them without a scaling event.

4. Write-path count: how many things write to this schema?

One service, one codebase, one team, one ORM writing to a schema the team fully owns: application-layer enforcement works. A shared library is the single choke point; every write goes through it.

More than one writer (multiple services, admin tools in a different language, backfill scripts maintained by a different team, DBA incident-response SQL) and the library has gaps. Every writer that isn’t the library bypasses the validation. The database is the only layer that catches them all, and the cost of catching them is a small set of declarative constraints.

Two writers isn’t a lot. Most systems that survive a few years accumulate more: data-migration jobs for a table split, an admin dashboard written in a different stack than the service, a reporting ETL that occasionally writes aggregates back, a partner integration that writes through a shared DB user.

The balance that holds in practice

The four axes point at a consistent split. Keep the database narrow and declarative. Put everything else in application code, ideally in a library every writer depends on.

The narrow set the database earns its keep on:

NOT NULL, UNIQUE, FOREIGN KEY within a service’s private schema.
Simple CHECK constraints for per-row invariants: ranges, regex on identifiers, enum membership.
Generated columns for derived values that are deterministic, stable, and cheap to compute.
Indexes the application needs for performance (not business logic, but a reminder they belong in the schema, not in code).

These are declarative, near-zero CPU cost per write, cover every write path, and change rarely enough that the schema’s deployment cadence isn’t a problem. Foreign keys in particular are the canonical within-service example. A separate post on foreign keys goes deeper on why application-layer referential integrity consistently loses to database-enforced FKs over time; that argument is this post’s framework applied to one specific constraint.
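As an illustration of the generated-column item, a sketch in PostgreSQL (stored generated columns need version 12 or later). The order_lines table and its columns are hypothetical.

```sql
CREATE TABLE order_lines (
    id               BIGINT PRIMARY KEY,
    quantity         INT    NOT NULL CHECK (quantity > 0),
    unit_price_cents BIGINT NOT NULL CHECK (unit_price_cents >= 0),
    -- Deterministic, cheap, and unlikely to change: a fit for the schema.
    line_total_cents BIGINT GENERATED ALWAYS AS (quantity * unit_price_cents) STORED
);

-- Writers supply only the inputs; the total cannot drift from them.
INSERT INTO order_lines (id, quantity, unit_price_cents) VALUES (1, 3, 250);
SELECT line_total_cents FROM order_lines WHERE id = 1;  -- 750
```

Anything involving discounts, tax rules, or rounding policy would fail the "stable" test and belongs in application code instead.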

What stays in application code:

Orchestration across multiple statements, services, or external calls.
Rules that depend on request context, caller identity, time-of-day, or anything outside the row.
Rules that change with product experiments.
Rules that span services.
Computation that would cost measurable database CPU per call.
Derived values that involve complex business logic or are likely to change.

If there’s one writer, a shared library is the single source of truth. If there are multiple writers (or there will be, which is most systems after a year), the library is still valuable but needs a narrow safety net in the database for the invariants that would corrupt data if they slipped.

The library as the primary, the schema as the safety net

The pattern that works in practice: a validation library (or a rich domain model) owns the full rule set, including validation messages, business logic, cross-field checks, everything the UI and API need. The schema carries only the declarative subset the database can enforce cheaply: NOT NULL, CHECK, UNIQUE, FK. When the library drifts and lets through a write the schema forbids, the database rejects it. The schema is the safety net, not the primary enforcement path. Violations surface as 500s that flag the drift instead of turning into silent corruption.

CHECK constraints, the cheap, defensible middle ground

Declarative CHECK constraints are the strongest example of database-side logic that justifies itself on every axis.

CREATE TABLE orders (
 id BIGINT PRIMARY KEY,
 user_id BIGINT NOT NULL REFERENCES users(id),
 amount_cents BIGINT NOT NULL CHECK (amount_cents >= 0),
 currency CHAR(3) NOT NULL CHECK (currency ~ '^[A-Z]{3}$'),
 status TEXT NOT NULL CHECK (status IN ('pending', 'paid', 'shipped', 'refunded')),
 placed_at TIMESTAMPTZ NOT NULL,
 shipped_at TIMESTAMPTZ CHECK (shipped_at IS NULL OR shipped_at >= placed_at)
);

Scope is within the service’s schema, so the constraint applies. Cadence is annual or never; adding a new status value is a planned migration, not a product-experiment iteration. Cost is near zero: the engine evaluates the expression once per row written (at execution time, not in the planner), and for the operators shown that is negligible next to the write itself. Write-path count covers every path, including the backfill job someone writes next year in a different language.

The trade-off is real but small. Error messages from a constraint violation are less friendly than a hand-crafted validation message, and adding a CHECK to a large existing table is a migration project (MySQL rewrites the table; PostgreSQL needs NOT VALID then VALIDATE CONSTRAINT to avoid long locks). Both are known problems with known workarounds.

The common pattern that holds up: application library owns the error message and UX, the database owns the enforcement. The library’s check is a fast-path for better errors; the constraint is the gate.
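
A minimal sketch of that split, assuming SQLite as a stand-in for the production engine. The constraint names, the message table, and the create_order helper are hypothetical. The library checks status up front for a friendly error, deliberately leaves the amount rule to the database, and translates whatever the gate rejects back into the library's own message.

```python
import sqlite3

# Hypothetical names throughout; SQLite stands in for the production engine.
ALLOWED_STATUSES = ("pending", "paid", "shipped", "refunded")

# Friendly messages keyed by constraint name, owned by the library.
CONSTRAINT_MESSAGES = {
    "orders_amount_nonnegative": "Amount cannot be negative.",
    "orders_status_valid": "Status must be one of: " + ", ".join(ALLOWED_STATUSES) + ".",
}


class ValidationError(ValueError):
    """Raised for any rule violation, whichever layer caught it."""


def connect():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            amount_cents INTEGER NOT NULL
                CONSTRAINT orders_amount_nonnegative CHECK (amount_cents >= 0),
            status TEXT NOT NULL
                CONSTRAINT orders_status_valid
                CHECK (status IN ('pending', 'paid', 'shipped', 'refunded'))
        )
        """
    )
    return conn


def translate(exc):
    """Map a constraint failure to the library's message when the name is known."""
    text = str(exc)
    for name, message in CONSTRAINT_MESSAGES.items():
        if name in text:
            return ValidationError(message)
    return ValidationError("The write violated a database constraint.")


def create_order(conn, amount_cents, status):
    # Fast path: fail before the round trip, with the friendly message.
    if status not in ALLOWED_STATUSES:
        raise ValidationError(CONSTRAINT_MESSAGES["orders_status_valid"])
    try:
        # The gate: the database still checks whatever the library missed.
        with conn:
            conn.execute(
                "INSERT INTO orders (amount_cents, status) VALUES (?, ?)",
                (amount_cents, status),
            )
    except sqlite3.IntegrityError as exc:
        raise translate(exc) from exc
```

Naming the constraints is what makes the translation possible: the engine reports the constraint name in the error, and the library maps it to wording a user can act on. A writer that skips create_order entirely still hits the same CHECK.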

Generated columns, the most underused declarative tool

Generated columns produce a derived value from other columns in the same row. MySQL since 5.7, PostgreSQL since 12. Indexable. Can’t be written to. Consistency guaranteed by the engine.

CREATE TABLE line_items (
 id BIGINT PRIMARY KEY,
 order_id BIGINT NOT NULL REFERENCES orders(id),
 unit_price_cents BIGINT NOT NULL CHECK (unit_price_cents >= 0),
 quantity INT NOT NULL CHECK (quantity > 0),
 total_cents BIGINT GENERATED ALWAYS AS (unit_price_cents * quantity) STORED
);

CREATE TABLE users (
 id BIGINT PRIMARY KEY,
 email TEXT NOT NULL,
 email_normalized TEXT GENERATED ALWAYS AS (LOWER(email)) STORED,
 UNIQUE (email_normalized)
);

On the four axes: scope is within-service, cadence is stable (the formula is an identity, not a business rule), cost is negligible (pure arithmetic or string operations), write-path count covers everything because every writer gets the same result automatically. Generated columns are the cleanest way to handle derived values that would otherwise be maintained by discipline.

The cost: the derivation has to be stable. Changing email_normalized = LOWER(email) to add Unicode normalization is a migration. If the formula is an active business rule, it’s the wrong tool.
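
To see those guarantees from a writer's side, here is a small sketch that assumes SQLite 3.31 or newer as a stand-in for PostgreSQL or MySQL. The index name and the try_write helper are made up for the illustration. It checks three things: the derived value appears without the writer computing it, a case-variant duplicate is rejected, and a direct write to the generated column fails.

```python
import sqlite3

# SQLite (3.31 or newer) stands in for PostgreSQL or MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        email_normalized TEXT GENERATED ALWAYS AS (LOWER(email)) STORED
    );
    CREATE UNIQUE INDEX users_email_normalized_key ON users (email_normalized);
    """
)

# The writer supplies only the source column; the engine derives the rest.
conn.execute("INSERT INTO users (email) VALUES ('Ada@Example.com')")


def try_write(sql):
    """Run one statement and return the error class name, or None on success."""
    try:
        conn.execute(sql)
        return None
    except sqlite3.Error as exc:
        return type(exc).__name__


normalized = conn.execute("SELECT email_normalized FROM users").fetchone()[0]
# Same address with different casing collides on the generated column.
duplicate = try_write("INSERT INTO users (email) VALUES ('ADA@example.COM')")
# No writer can overwrite the derived value directly.
direct_write = try_write("UPDATE users SET email_normalized = 'x'")
```

No writer had to remember to call a normalization function, which is the whole point: the rule holds for the backfill script and the admin tool as well as the web request path.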

Triggers, for schema migrations only

Triggers run procedural code on insert, update, or delete. That’s exactly what makes them wrong for application logic. A trigger mutates rows the caller didn’t ask to change, fires cascades the caller didn’t initiate, and makes “this update touches one column” a lie. The caller’s application logs say one thing; the database does something else. When a bug surfaces, the stack trace points at application code that never ran the hidden logic.

The usual justifications for triggers (updated_at maintenance, audit logging, soft-delete cascades, counter caches) are all better handled in application code.

updated_at belongs in the ORM’s model callback, the shared write library, or a middleware that sets it on every persist. Every writer that uses the library already goes through that path, and adding a timestamp is one line. If backfill scripts or admin tools bypass the library, the fix is to make them use the library, not to paper over the gap with a trigger.

Audit logs need application context: the user ID, the request ID, the reason, the session, the tenant. A trigger can’t see any of that without awkward session-variable tricks that break across connection pools. Write the audit row in application code, next to the logic that knows why the change is happening.

Soft-delete cascades are business rules. Which child rows get deleted when a parent is soft-deleted, in what order, with what side effects, is a product decision, not a storage concern. Orchestrate it in the application.

Counter caches via trigger create a hot row where every concurrent write serializes on the same parent lock. Application-side counters, background rollups, or a separate events-with-aggregation pipeline all scale better and leave the hot path free.

The general principle: application logic should be visible in application code. A trigger that modifies data the application wrote is a hidden side effect, and hidden side effects are an anti-pattern for the same reason global variables are. They make the reachable state of the system larger than the code the reader is looking at.
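
As an illustration of the application-side version, the sketch below keeps updated_at and the audit row in one visible write helper. The accounts and audit_log tables, the change_plan function, and its parameters are all hypothetical, and SQLite stands in for the real engine. The caller's context (actor, request, reason) is passed in explicitly, which is the part a trigger cannot see.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema and helper names; SQLite stands in for the real engine.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        plan TEXT NOT NULL,
        updated_at TEXT NOT NULL
    );
    CREATE TABLE audit_log (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES accounts(id),
        actor_id INTEGER NOT NULL,
        request_id TEXT NOT NULL,
        reason TEXT NOT NULL,
        old_plan TEXT NOT NULL,
        new_plan TEXT NOT NULL,
        changed_at TEXT NOT NULL
    );
    """
)


def change_plan(conn, account_id, new_plan, *, actor_id, request_id, reason, now=None):
    """Update the plan, stamp updated_at, and write the audit row in one transaction."""
    now = now or datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if anything below raises
        row = conn.execute(
            "SELECT plan FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        if row is None:
            raise LookupError(f"account {account_id} not found")
        conn.execute(
            "UPDATE accounts SET plan = ?, updated_at = ? WHERE id = ?",
            (new_plan, now, account_id),
        )
        # The caller's context is available here; a trigger would not have it.
        conn.execute(
            "INSERT INTO audit_log (account_id, actor_id, request_id, reason,"
            " old_plan, new_plan, changed_at) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (account_id, actor_id, request_id, reason, row[0], new_plan, now),
        )
```

Everything the write does is on the page: one UPDATE, one INSERT, one transaction. An engineer reading the stack trace sees the same statements the database ran.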

The debugging cost is the real cost

When an on-call engineer is looking at a production incident, they read the application code that ran. A trigger that fired three levels down, in a language they may not read fluently, mutating rows nobody expected, is the single biggest source of “the code says X, the database did Y” incidents. That’s not a tooling problem. It’s a design choice, and it can be avoided by keeping application logic out of triggers.

This gap widens when an ORM sits between the application and the database. ORMs model what they created (columns and relations) and don’t reflect triggers, CHECK constraints, or generated columns in the model class. A trigger that mutates a row after insert produces data the ORM didn’t know would be there, and the in-memory object diverges from the persisted row until someone thinks to reload. The ORM coupling post covers this failure mode in more depth; triggers are one of the specific shortfalls that show up as “the model says one thing, the database did another.”

The legitimate case: schema migrations

The one place triggers earn their keep is time-bounded, explicit migration work. During an expand-and-contract schema change (renaming a column, splitting a table, changing a type), a trigger can dual-write between the old and new shape so that mixed old-application and new-application traffic both see consistent data. The trigger exists for the duration of the migration window and is dropped once the backfill is complete and all writers are on the new shape.

This is trigger-as-scaffolding. A temporary mechanism that bridges a specific transition, with a clear removal criterion. It doesn’t hide business logic; it handles transitional compatibility between two versions of a schema while the application rolls forward.
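
A toy version of that scaffolding, assuming a hypothetical rename of customers.name to customers.full_name. SQLite stands in for the real engine (PostgreSQL and MySQL trigger syntax differs), and the trigger names are invented for the sketch. The triggers keep both columns in step while old and new writers coexist, and tear_down is the explicit removal step.

```python
import sqlite3

# Hypothetical migration: customers.name is being replaced by customers.full_name.
# SQLite stands in for the real engine; PostgreSQL and MySQL trigger syntax differs.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT,       -- old shape, written by not-yet-upgraded code
        full_name TEXT   -- new shape, written by upgraded code
    );

    -- Inserts from either generation fill in whichever column was left NULL.
    CREATE TRIGGER customers_sync_insert AFTER INSERT ON customers
    BEGIN
        UPDATE customers
           SET name = COALESCE(NEW.name, NEW.full_name),
               full_name = COALESCE(NEW.full_name, NEW.name)
         WHERE id = NEW.id;
    END;

    -- Old writers update name; copy it forward. The WHEN guard stops ping-pong.
    CREATE TRIGGER customers_sync_old AFTER UPDATE OF name ON customers
    WHEN NEW.name IS NOT NEW.full_name
    BEGIN
        UPDATE customers SET full_name = NEW.name WHERE id = NEW.id;
    END;

    -- New writers update full_name; copy it back for old readers.
    CREATE TRIGGER customers_sync_new AFTER UPDATE OF full_name ON customers
    WHEN NEW.full_name IS NOT NEW.name
    BEGIN
        UPDATE customers SET name = NEW.full_name WHERE id = NEW.id;
    END;
    """
)

SCAFFOLDING = ("customers_sync_insert", "customers_sync_old", "customers_sync_new")


def tear_down(conn):
    """The removal step: run once every writer is on the new shape."""
    for trigger in SCAFFOLDING:
        conn.execute(f"DROP TRIGGER IF EXISTS {trigger}")


def remaining_triggers(conn):
    """List triggers still installed, so the migration can assert a clean exit."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'trigger'")
    return [row[0] for row in rows]
```

Keeping the trigger names in one tuple next to the tear-down function is a small design choice that makes the removal criterion checkable: the migration is not finished until remaining_triggers returns an empty list.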

The most common real-world instance of this pattern in MySQL is Percona’s pt-online-schema-change: it creates a shadow table with the target schema, installs INSERT/UPDATE/DELETE triggers on the original to replicate writes into the shadow while data is copied in chunks, then swaps the tables with an atomic rename and drops the triggers. The triggers exist for the migration’s duration and nothing longer. In PostgreSQL, pgroll does the same kind of dual-write-via-trigger for zero-downtime schema changes. Both treat triggers exactly as this section argues they should be treated: time-bounded scaffolding with an explicit tear-down step.

Worth noting the counter-example. GitHub’s gh-ost performs the same migrations without triggers, reading the binlog instead. Their stated reason is that triggers add synchronous load to the primary during the migration and share its locking fate. That argument is about migration tooling trade-offs, not a defense of triggers in application logic. The conclusion in both camps is the same: triggers outside of migration scaffolding don’t earn their keep.

Everything outside that narrow case (cross-cutting concerns, derived values, audit logs, product rules) belongs in application code where it’s visible, testable, and traceable from the same stack trace as the logic that caused the write.

How companies end up with triggers anyway

A large share of production databases carrying heavy trigger logic didn’t get there by choice. They got there by losing track of the write boundary. The pattern is predictable. A database starts as one service’s store. A second team needs the same data and connects directly because it’s easier than building an API. A data-warehouse ETL starts writing back aggregates. An analytics job needs a “last seen” column updated. A partner integration gets a read-write user “just for this quarter.” Five years later the database has a dozen clients, some inside the company, some not, some on systems nobody actively maintains, and nobody has a full list.

At that point, asking every writer to go through a shared library stops being possible. The library is only the single source of truth if every writer imports it, and “every writer” now includes a Java batch job, a Go analytics worker, a legacy PHP admin tool, a vendor ETL, and a spreadsheet someone’s been running for years. The company doesn’t know where all the calls are coming from, so moving rules into an API layer isn’t an option. There’s no API layer every caller can be forced through.

The database, meanwhile, sees every writer. That’s how a team ends up with a trigger enforcing a rule that should have been in application code. The trigger is the only remaining place. It’s a symptom of losing the boundary, not a design choice made on its merits.

The real lesson is that the boundary is the thing worth defending. Once multiple unknown clients are writing to a schema, every future rule either becomes a trigger by necessity or goes un-enforced. Greenfield systems should treat “who is allowed to write to this schema” as a first-class architectural decision, with one service in front of it and everyone else going through that service. Migrations out of the trap exist (service extraction, proxying direct-DB clients through a write API, introducing a write-time event bus) but they’re multi-quarter projects, and the trigger layer usually stays in place throughout because it’s doing the job nothing else is available to do.

Stored procedures, the vertical-scaling trap

Stored procedures move application logic into the database process. They’re the tool most directly opposed to the “database as storage” position, and the one with the clearest scaling argument against them. On the four axes, stored procedures fail most of them for general business logic.

Scope is within one database. Across services, impossible (which is part of why Spanner and DynamoDB don’t support them). Cadence is schema-migration speed; a product rule that needs a hotfix takes a migration. Cost is the procedure body running on the primary’s CPU, competing with every query for the same scarce resource, when the application tier could run the same logic on a pod that scales horizontally. Write-path count is the one axis where procedures are strongest: if the procedure is the only way to perform the operation, every write path is covered.

The narrow case for stored procedures is the intersection of those trade-offs. Operations that must be atomic, must cover every write path, and would be prohibitively expensive to run row-by-row over the network. Bulk data operations that are genuinely row-by-row expensive. Security boundaries where the application is explicitly not trusted with direct table access. Legacy systems where procedures are the system of record.

Outside those cases, stored procedures trade a scaling-ceiling problem and a deployment-cadence problem for centralization that a shared application library provides at lower cost. The argument that “a stored procedure prevents the application from drifting” is real, and the same argument applies to a validation library without the scaling or deployment penalty.

Views, the quietly useful option

Views don’t enforce writes but they do shape reads, and shaping reads affects correctness in practice. A view that filters soft-deleted rows means every consumer sees the same definition of “active”. Updatable views can also be a migration-compatibility tool.

CREATE VIEW active_orders AS
SELECT * FROM orders WHERE deleted_at IS NULL;

Scope is within-service. Cadence is fine either way; view bodies change as often as the underlying queries. Cost is the planner expanding views at query time, and complex views can hide expensive plans from the caller. Write-path count is read-time only, so views don’t help with integrity.

Views are underused for their cheap benefits (canonical join shapes, soft-delete filtering, migration shims) and overused when they become a layer of logic the calling code can’t see. Materialized views are a separate topic; they add refresh-cadence questions the live-query tools don’t.
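
A short sketch of the soft-delete case, with SQLite standing in for the real engine and hypothetical helper names. The consumer queries the view and never restates the deleted_at rule; the second helper asks the engine for the plan, as a reminder that the view expands into work on the underlying table.

```python
import sqlite3

# SQLite stands in for the real engine; helper names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        deleted_at TEXT
    );
    CREATE VIEW active_orders AS
    SELECT id, status FROM orders WHERE deleted_at IS NULL;

    INSERT INTO orders (id, status, deleted_at) VALUES
        (1, 'paid', NULL),
        (2, 'paid', '2025-01-05T00:00:00Z'),
        (3, 'pending', NULL);
    """
)


def paid_order_ids(conn):
    """A consumer that relies on the view's definition of 'active'."""
    rows = conn.execute(
        "SELECT id FROM active_orders WHERE status = 'paid' ORDER BY id"
    )
    return [row[0] for row in rows]


def view_plan(conn):
    """Return the plan detail lines for a query against the view."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM active_orders WHERE status = 'paid'"
    )
    return [row[-1] for row in rows]
```

The plan output names the base table, not just the view, which is the visibility point: the caller wrote a one-line query, and the cost is whatever the expanded query costs.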

Derived columns and counter caches, implicit logic

Comment counts, follower counts, status summaries, running totals. Every one of these encodes business logic; the question is which mechanism maintains it.

CREATE TABLE posts (
 id BIGINT PRIMARY KEY,
 author_id BIGINT NOT NULL REFERENCES users(id),
 comment_count INT NOT NULL DEFAULT 0,
 last_comment_at TIMESTAMPTZ
);

Through the four-axis lens, four mechanisms:

Application code maintains it. Cadence is fast. Cost is zero on the DB, per-write work on the app tier. Write-path count fails if any writer skips the library. Scope is fine within the service.
Materialized view or batch job. Cadence is decoupled from the write. Cost is the refresh window. Write-path count covers everything, but the value is stale between refreshes. Scope is within-service.
Read-time aggregation. Cadence is irrelevant. Cost is per-read and can be expensive on feed-style queries. Write-path count is always correct. Scope is within-service.
Separate counter service with async events. Cadence is fast. Cost is extra infrastructure and delivery semantics to reason about. Write-path count covers everything if every writer publishes the event. Scope is any.

A trigger is conspicuously absent from that list on purpose. Counter-cache triggers are the canonical example of hidden logic causing a contention problem the application team can’t see: every concurrent comment insert serializes on the parent post’s row lock, and the debugging path goes straight through PL/pgSQL the service engineers didn’t write. The four-axis analysis points instead at the library-maintained counter when there’s one writer, the background rollup when reads are hot, and a separate counter service at scale or across boundaries.
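
Two of those mechanisms side by side, as a rough sketch: read-time aggregation and a background rollup. SQLite stands in for the real engine, the comments table is an assumption added for the example, and the function names are hypothetical. The rollup recomputes from source rows, so it picks up writes from any path, at the price of staleness between runs.

```python
import sqlite3

# SQLite stands in for the real engine; the comments table is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        comment_count INTEGER NOT NULL DEFAULT 0,
        last_comment_at TEXT
    );
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        post_id INTEGER NOT NULL REFERENCES posts(id),
        created_at TEXT NOT NULL
    );
    """
)


def live_comment_count(conn, post_id):
    """Read-time aggregation: always correct, paid for on every read."""
    row = conn.execute(
        "SELECT COUNT(*) FROM comments WHERE post_id = ?", (post_id,)
    ).fetchone()
    return row[0]


def cached_comment_count(conn, post_id):
    """The cached value: cheap to read, only as fresh as the last rollup."""
    row = conn.execute(
        "SELECT comment_count FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    return row[0]


def refresh_counters(conn):
    """Background rollup: recompute every cached counter from the source rows."""
    with conn:
        conn.execute(
            """
            UPDATE posts SET
                comment_count = (
                    SELECT COUNT(*) FROM comments WHERE comments.post_id = posts.id
                ),
                last_comment_at = (
                    SELECT MAX(created_at) FROM comments WHERE comments.post_id = posts.id
                )
            """
        )
```

A full-table recompute like this is the simplest form; a production rollup would more plausibly touch only posts with recent comments, but the trade-off it illustrates is the same.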

The library pattern, done seriously

The natural consequence of “narrow database, logic in application” is that the application layer’s logic has to be reusable. A validation that only lives in one service’s Rails app isn’t a library, it’s service code. A library every writer imports is the actual mechanism.

Four shapes show up in practice:

Monolith, one language. A package inside the codebase, imported by every write path. Works well. Admin tools and background jobs depend on the same package as the web request path. Backfill scripts should depend on it too; in practice this is where discipline breaks down.
Microservices, one language. A shared library published as a package. Every service depends on the same version, or accepts that a rollout takes a deploy cycle across services. Version skew is the operational tax.
Polyglot services. A shared library doesn’t exist. Validation gets reimplemented per service, or pushed into a validation service that every caller hits over RPC. The RPC option is real and works; it turns “shared library” into “shared service” with the same logical role.
Schema-first code generation. Tools like sqlc and jOOQ generate typed client code from the schema, which gives a narrow kind of library reuse (type safety and query shapes) without attempting to encode business logic. For logic itself, schemas aren’t enough; the library is separate.

The discipline that makes this work: the library is the only write path, and if it isn’t, the database’s declarative constraints are the backup. The two pieces reinforce each other. The library holds the full rule set: fast, rich, and horizontally scalable. The schema holds the small subset the database can enforce cheaply and that every writer, library or not, has to pass through.

The duplication trap

The most common failure mode isn’t picking the wrong layer. It’s picking both without deciding which is authoritative.

Application validator: email must match regex A.
Database CHECK: email must match regex B.
Over the years, one gets updated (for GDPR, for internationalization); the other doesn’t.
Legacy rows exist that pass the old version but not the new one.
A migration that tries to tighten the CHECK fails on legacy rows the application thought were fine.

The pattern repeats with status enums, numeric ranges, referential rules, and soft-delete semantics. Two versions of the truth stay in sync as long as someone is actively keeping them in sync, and then they don’t.

The useful framing: pick one layer as authoritative and name the other as a UX mirror or a safety net. The authoritative layer is the one that runs when the other doesn’t, which, for correctness invariants where write paths multiply, still points at the database for the narrow declarative subset.

-- authoritative: the declarative CHECK
CHECK (status IN ('pending', 'active', 'closed'))

# mirror in the library: better errors, fast-fail before the round trip
class ValidationError(ValueError):
    """Raised when a value fails a library-side check."""

def validate_status(value):
    if value not in ("pending", "active", "closed"):
        raise ValidationError("Status must be pending, active, or closed.")

If the library and the schema disagree, the schema wins and the write fails. The failure is loud, traceable, and tells you the drift exists, instead of the silent corruption you get when neither layer enforces a rule.
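
One way to catch the drift before production does is a test that probes the authoritative layer directly. The sketch below assumes SQLite as a stand-in, a hypothetical accounts table, and invented helper names. It tries each candidate value against the CHECK, rolls the probe back, and reports every value on which the library's list and the schema disagree.

```python
import sqlite3

# Hypothetical table and helpers; SQLite stands in for the real engine.
LIBRARY_STATUSES = ("pending", "active", "closed")  # the library's mirror


def connect():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE accounts (
            id INTEGER PRIMARY KEY,
            status TEXT NOT NULL CHECK (status IN ('pending', 'active', 'closed'))
        )
        """
    )
    return conn


def schema_accepts(conn, status):
    """Probe the authoritative layer: try the write, then always roll it back.

    Use a scratch connection; the rollback discards any pending work on it.
    """
    try:
        conn.execute("INSERT INTO accounts (status) VALUES (?)", (status,))
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.rollback()


def find_drift(conn, library_values, candidates):
    """Return the values on which the library and the schema disagree."""
    return sorted(
        value
        for value in candidates
        if (value in library_values) != schema_accepts(conn, value)
    )
```

Run in CI against a database built from the real migrations, a check like this turns "one gets updated, the other doesn't" into a failing test on the day it happens.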

Rules the assistant can see

The choice of where to put a rule is, among other things, a choice about which readers can see it. An AI assistant writing SQL or application code against the schema reads the catalog (column types, constraints, FKs, CHECK definitions) and whatever source files the prompt happens to include. Declarative rules show up in information_schema and pg_constraint. The assistant can reason about them without being pointed at additional files. A CHECK (status IN ('pending', 'active', 'closed')) is visible to any schema-reading tool on day one.

Rules living in triggers, stored procedures, ORM callbacks, or a shared Python validation library don’t surface when the same tool reads the catalog. The write path enforces them at runtime; the schema doesn’t describe them. A model generating an INSERT statement against a table whose uniqueness is enforced only by a before-insert trigger will produce a query that looks correct and violates an invariant the catalog never mentioned. This doesn’t change the conclusion that most logic belongs in the application, but it does tip the math, at the margin, toward the narrow set of correctness invariants where declarative constraints pay double: they enforce on every write path, and they’re the only form of the rule a schema-reading assistant sees for free.
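
The asymmetry is easy to demonstrate. In this sketch SQLite's sqlite_master plays the role that information_schema and pg_constraint play elsewhere, and the library rule (validate_email_domain, with its made-up blocked domain) is hypothetical. The catalog text contains the CHECK and has no trace of the library rule.

```python
import sqlite3

# SQLite's sqlite_master stands in for information_schema / pg_constraint.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        status TEXT NOT NULL CHECK (status IN ('pending', 'active', 'closed'))
    );
    """
)


def validate_email_domain(email):
    """A library-only rule (hypothetical): nothing in the catalog records it."""
    if email.endswith("@example.invalid"):
        raise ValueError("This email domain is not allowed.")


def catalog_text(conn):
    """Everything a schema-reading tool gets for free: the stored DDL."""
    rows = conn.execute("SELECT sql FROM sqlite_master WHERE sql IS NOT NULL")
    return "\n".join(row[0] for row in rows)
```

A tool that reads only catalog_text can generate a valid status on the first try and has no way to know the email rule exists unless someone also hands it the library source.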

Trade-offs

Every position in this post has counter-arguments, and they’re real.

Declarative database constraints lock you into SQL semantics. A CHECK constraint doesn’t survive a migration to DynamoDB or Spanner without rework. Teams building for a future migration accept less database-side logic in exchange for portability. The trade is real; the frequency of actual cross-engine migrations is lower than the frequency of discussions about them.
Schema changes are slow enough that even “simple” constraints are friction. Adding a CHECK to a 500M-row table is a migration project. For teams shipping schema changes weekly, every constraint is a cost, and sometimes the cheaper answer is to accept looser database-side invariants and stricter application-side ones.
Application-side validation is easier to test, version, and roll back. A library’s tests run in milliseconds; a constraint’s tests need a real database. Teams with weak integration-testing infrastructure end up under-testing database-side rules.
Horizontal-scaling arithmetic isn’t universal. For services running on a single database at moderate load, the primary has plenty of headroom and the “vertical scaling ceiling” argument is theoretical. It matters more as traffic grows.
Shopify’s position is internally consistent. No database-level foreign keys, all integrity in models, sharded storage. It works because every write path goes through Rails and because the operational investment in model-layer integrity is serious. A smaller team without that investment can’t safely adopt the same pattern; the constraints in the database are what a smaller team can afford.
Stored procedures aren’t universally bad. The Yugabyte post is right that in a single-service OLTP context, procedures can centralize logic effectively. The scaling argument is real but not always the binding constraint. Teams with deep SQL skills and disciplined version-control-for-procedures can extract more value than the “avoid them” position suggests.

The balance described above is what holds across the most common cases. Specific cases have specific answers. The failure mode is rarely picking the wrong point on the axis. It’s not picking at all.

A rule-by-rule framework

Instead of a blanket policy, a set of questions that point at the right layer per rule.

Does the rule cross service boundaries? If yes, application library or orchestration service. The database can’t help.
Would violation corrupt data? If yes, the database should enforce it as a declarative constraint, because every write path has to be covered.
Is the rule a derived value with a stable formula? Generated column. Cheap, covers every writer, zero sync code.
Is the rule a derived value with a changing formula or external inputs? Application library.
Does the rule depend on anything outside the row (request context, external services, feature flags)? Application library.
Does the rule change more often than quarterly? Application library.
Is the rule a cross-cutting concern every write path needs (timestamps, audit logs)? Application library that every writer imports, not a trigger. The trigger hides the logic; the library makes it visible to the reader of the code that caused the write.
Does the rule involve non-trivial computation or touch multiple rows per call? Application library. Database CPU is the scarce tier.
Is there more than one write path? The library alone isn’t enough; declarative constraints in the schema are the backup.

The questions don’t eliminate judgment (several rules will land on edges) but they make the trade-offs visible and keep decisions from being driven by which layer the author was working in when the rule came up.

The bigger picture

Across services, the database is storage and logic lives in services and shared libraries. That’s the direction Spanner, DynamoDB, Cassandra, and the pattern literature all point, and the cross-service question is genuinely settled. Within a service it’s softer. The database can enforce things the application can’t, a narrow set of declarative constraints costs almost nothing, and the schema is the only layer that sees every writer the library’s author didn’t plan for. Keep the database lean. Put the full rule set in a library the application owns. Let the schema carry the small subset that catches the writes the library missed (which is more writes than anyone planning the system thought there would be).
