Testing Your Database, Part 2: What to Test, and How

Squawk catches the locking ALTER. pgTAP catches the missing UNIQUE. Testcontainers with a prod-shaped snapshot catches the migration that takes 40 minutes against real volume. Five categories of database test, each invisible to the others, each addressed by tools that have been stable for years and that almost no team has assembled into one suite.

TL;DR
What most teams call “database tests” is application tests with a database underneath — they cover whether the code reads and writes correctly, not whether the database does what its catalog claims. Real database testing covers five distinct categories, each requiring a different tool, and each invisible to the others; building one without the rest closes part of the surface and leaves the rest open.

The CI suite has 600 tests. Every one is green. Judged on what they actually exercise: 480 are unit tests using a stub database for fixtures, 80 are application integration tests using a real Postgres container with seeded fixtures, 20 are migration tests that run db:migrate against the empty test database, 20 are end-to-end tests that hit the API and observe response shape. Coverage looks comprehensive. None of these tests catches: an ALTER TABLE that locks the table for 40 minutes against production volume, a CHECK constraint that’s syntactically valid and semantically wrong, an ON CONFLICT (email) that depends on a UNIQUE constraint nobody declared, a JOIN that multiplies rows through a missing bridge UNIQUE, a query that returns the right shape with the wrong number, a generated column whose definition drifts from its dependencies after a migration. The reason isn’t that the suite is bad. The failures live in categories the suite doesn’t cover.

The obvious response is “use Testcontainers.” Testcontainers is a real tool, addresses a real gap (the test container is the production engine, not SQLite-as-Postgres), and most teams should adopt it. It still only addresses one of the categories below. A team that adopts Testcontainers and stops there has moved from “no database tests” to “application tests with a real database engine” — better, and still missing the four other categories. The same applies to every other “the one thing we should do” answer: each tool below addresses a class of failure the others can’t see, and an AI-introduced bug can land in any of them.

Five categories of database test

1. DDL safety: catch the lock and rewrite patterns before they ship

The category most teams don’t have at all. The test runs against the migration source itself — not the database — and checks for patterns known to lock or rewrite tables in production: ADD COLUMN ... NOT NULL without a default-then-backfill split (mandatory pre-Postgres 11; on 11+ a constant default is stored metadata-only, but a volatile default still forces a full rewrite), ALTER COLUMN TYPE that triggers a table rewrite, ADD FOREIGN KEY without NOT VALID, DROP COLUMN on a table other services still write, indexes created without CONCURRENTLY (Postgres) or without ALGORITHM=INPLACE, LOCK=NONE (MySQL).
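
A minimal before/after sketch of the highest-impact pattern (table and column names are illustrative):

```sql
-- Flagged by the lint: on a 50M-row table this takes an ACCESS EXCLUSIVE lock
-- and, pre-Postgres 11 (or with a volatile default), rewrites every row.
ALTER TABLE users ADD COLUMN tier integer NOT NULL DEFAULT 0;

-- The safe split (Postgres): each step holds only a short lock.
ALTER TABLE users ADD COLUMN tier integer;               -- metadata-only
UPDATE users SET tier = 0 WHERE tier IS NULL;            -- backfill, batched in practice
ALTER TABLE users ADD CONSTRAINT tier_not_null
  CHECK (tier IS NOT NULL) NOT VALID;                    -- no scan yet
ALTER TABLE users VALIDATE CONSTRAINT tier_not_null;     -- scans without blocking writes
ALTER TABLE users ALTER COLUMN tier SET NOT NULL;        -- PG12+ reuses the validated check
ALTER TABLE users ALTER COLUMN tier SET DEFAULT 0;
```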

  • Tool: Squawk for Postgres; lints SQL migrations for known-bad patterns. Configurable rule set, fast failure, runs as a CI step or pre-commit hook. Atlas for cross-engine coverage (Postgres, MySQL, ClickHouse) — its 2025 analyzer set covers destructive changes, data-dependent modifications like ADD COLUMN NOT NULL without a default, nested transactions, and SQL-injection-prone migration code, with hooks designed to gate AI-authored migrations specifically. For MySQL, pt-online-schema-change and gh-ost defaults plus a custom lint script that flags raw ALTER on tables over a configured row threshold.
  • What it catches: The locking-migration class from Part 1: ALTER TABLE users ADD COLUMN tier TINYINT NOT NULL DEFAULT 0 against a 50M-row table. Squawk’s adding-not-nullable-field, disallowed-unique-constraint, and require-concurrent-index-creation rules surface this pattern before the migration is applied anywhere.
  • What it misses: Anything that requires running the migration to detect. It’s a syntactic check. A migration that’s safe by lint but breaks an invariant the team relied on still passes.

2. Schema invariants: assert the catalog says what you think it says

The catalog is full of declarations that are easy to write incorrectly and hard to read back. Did the FK actually cascade, or did it default to NO ACTION? Does the partial index cover the predicate the query actually uses? Does the CHECK reject the values it claims to? After a migration applies, are the constraints, indexes, and triggers the team intended actually present, with the names, columns, and behavior they intended? Tests in this category run after migrations apply and assert against the resulting schema and the resulting behavior — not the application’s view of it.

A note specifically on database-resident business logic. The companion post Where Business Logic Lives argues to keep most of it in the application layer, where review density and tooling are higher. Triggers, stored procedures, functions, RLS policies, generated-column expressions, and multi-statement CHECK constraints encoding state-machine rules are running code that doesn’t show up on a normal PR diff and doesn’t execute in local development the way application code does. If business logic is in the database — by accident, by legacy, or by deliberate choice — every one of those needs unit-test coverage the same way an application function would. For stored procedures and functions: assert every branch, every EXCEPTION block, every side effect on the rows the procedure touches, every return value the caller depends on. For triggers: assert each firing condition (BEFORE / AFTER × INSERT / UPDATE / DELETE), every WHEN filter, the actual state change the action performs, and any interaction with other triggers on the same table. The same pgTAP / tSQLt harness covers all of it; the discipline is treating database code as code, not as configuration. A trigger that’s never been asserted against isn’t a guarantee, it’s a hope.
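
What trigger coverage looks like in pgTAP, as a sketch — the trigger, tables, and columns here are hypothetical, standing in for whatever audit or state-machine trigger the schema actually carries:

```sql
BEGIN;
SELECT plan(3);

-- The trigger exists at all (catches a migration that silently dropped it).
SELECT has_trigger('orders', 'orders_audit_trg');

-- Fire it and assert the actual state change, not just the definition.
UPDATE orders SET status = 'shipped' WHERE id = 42;

SELECT is(
  (SELECT count(*) FROM orders_audit WHERE order_id = 42),
  1::bigint,
  'AFTER UPDATE wrote exactly one audit row'
);
SELECT is(
  (SELECT new_status FROM orders_audit WHERE order_id = 42 LIMIT 1),
  'shipped'::text,
  'audit row captured the new status'
);

SELECT * FROM finish();
ROLLBACK;
```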

  • Tool: pgTAP for Postgres, tSQLt for SQL Server, utPLSQL for Oracle. Schema-level assertions: has_column('users', 'tier'), col_type_is('users', 'tier', 'integer'), has_index('users', 'users_email_idx'), fk_ok('orders', 'user_id', 'users', 'id'). Behavior-level assertions: insert an illegal row and assert the CHECK rejects it, insert a parent and child and delete the parent and assert the cascade fired, attempt a forbidden state transition and assert it’s rejected. A runnable sketch follows this list.
  • What it catches: The CHECK that lists valid values but doesn’t constrain transitions. The FK declared without ON DELETE CASCADE despite the team thinking it was. The partial index whose WHERE clause has drifted from the queries that use it. The UNIQUE constraint the upsert depends on but nobody declared.
  • What it misses: Anything outside the catalog: query result correctness, lock duration, performance regressions, runtime data invariants.
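
A pgTAP file combining both levels — catalog and behavior — with illustrative names:

```sql
BEGIN;
SELECT plan(4);

-- Catalog: the migration produced the schema the team intended.
SELECT has_column('users', 'tier');
SELECT col_type_is('users', 'tier', 'integer');
SELECT fk_ok('orders', 'user_id', 'users', 'id');

-- Behavior: the CHECK actually rejects an illegal row.
SELECT throws_ok(
  $$ INSERT INTO users (email, tier) VALUES ('x@example.com', -1) $$,
  '23514',   -- check_violation
  NULL,
  'tier CHECK rejects negative values'
);

SELECT * FROM finish();
ROLLBACK;
```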

3. Query result regressions: assert the answer, not the shape

The hardest category and the most under-covered. The query is syntactically valid, the result set is the right shape, the EXPLAIN is clean, and the number is wrong by 30%. The test that catches this asserts the result against a known dataset — given a fixture with 1,000 users where 200 are soft-deleted, the active-users query returns 800; given five orders with stacked discounts, the revenue query returns the discounted total, not the row-multiplied total.
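
The soft-delete case in pgTAP form, assuming a harness that loads the 1,000-user fixture first (names illustrative):

```sql
BEGIN;
SELECT plan(1);

-- Fixture: 1,000 users, 200 with deleted_at set. The assertion is on the
-- number, not the shape: a soft-delete-naive query returns 1,000 and fails here.
SELECT is(
  (SELECT count(*) FROM users WHERE deleted_at IS NULL),
  800::bigint,
  'active-users query excludes soft-deleted rows'
);

SELECT * FROM finish();
ROLLBACK;
```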

Example-based fixtures aren’t enough on their own. The AI-generated bugs in this category live in inputs the test author didn’t think to write. Take a paginated query the model emits as ORDER BY created_at LIMIT 100 OFFSET ?. An example test inserts ten orders, paginates through them, gets all ten back, passes. The bug — created_at isn’t unique, so rows with identical timestamps swap positions between pages and rows get skipped or duplicated — never surfaces against ten hand-written rows. A property-based test that asserts “every row appears exactly once across all pages” finds the timestamp collision in seconds and the fix is a deterministic tie-breaker (ORDER BY created_at, id). The same pattern applies to round-trip (insert and read back unchanged), conservation (sum of children equals parent total), and idempotency (running twice equals running once).
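
The bug, the fix, and the property in SQL terms — the real harness would drive this from Hypothesis or fast-check with generated fixtures, and the table, column, and paged_ids collection names are illustrative:

```sql
-- Nondeterministic under ties: rows sharing a created_at can change relative
-- order between scans, so page boundaries skip or duplicate rows.
SELECT id FROM orders ORDER BY created_at LIMIT 100 OFFSET 200;

-- Deterministic: the unique id breaks ties, the ordering is total, and the
-- pages partition the table.
SELECT id FROM orders ORDER BY created_at, id LIMIT 100 OFFSET 200;

-- Conservation property, checked after the harness collects every page's ids
-- into paged_ids: all three counts must be equal, or a row was skipped or
-- returned twice.
SELECT (SELECT count(*) FROM orders)              AS total_rows,
       (SELECT count(*) FROM paged_ids)           AS rows_seen,
       (SELECT count(DISTINCT id) FROM paged_ids) AS distinct_rows_seen;
```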

Property and fixture tests assert that the result is correct against a known input. They don’t assert that the query means what the human asked. The second tier addresses that gap: dual-track evaluation, where the query runs programmatically (count, aggregate, expected rows) and a separate LLM judge scores semantic alignment against the original natural-language intent, the schema, and the result set. Thomson Reuters’ internal SQL agent shipped with a 73% silent-failure rate on time-based analyses (predicates landed on the parent date column but not the joined ones); adding a “consistent time constraint across joined tables” validator plus dual-track judge eval drove it below 10%.

The 10% is lab-clean and best-case. TR measured it against curated analytical queries with known ground truth on an instrumented internal data lake. The gap from there to a typical legacy schema — undeclared FKs, polysemic TINYINT status columns, tribal-knowledge soft-deletes, four-format VARCHAR dates, JSON-as-schema (the realities in What AI Gets Wrong About Your Database) — multiplies that rate, because the judge reads the same impoverished catalog the generator does. And 10% is unshippable on its own: a daily financial query at 10% silent-failure lays down corrupted data every week, errors layer into next month’s inputs, and by the time a customer flags the discrepancy six months later the WAL retention is exhausted and the backups have rolled past. That’s a continuity event, not a bug. Treat LLM-judge eval as a floor-raiser for what reaches human review — never the release gate. The gate is property tests + fixture-based result assertions + human review against representative data; the judge sits underneath all three.

  • Tool: dbt tests for analytical/transformation SQL — built-in unique, not_null, accepted_values, relationships, plus custom SQL tests (a sketch follows this list). Soda Core for production data quality assertions. For application SQL, custom integration tests that load a representative fixture, run the query, and assert the count and one or two known aggregates. Hypothesis (Python), fast-check (TypeScript), or PropEr (Erlang) for property-based generators that exercise the input distribution the fixture doesn’t. data-diff for regression: run the query against the same dataset before and after a change, fail if the result diff is larger than expected. For semantic verification of AI-generated SQL: LLM-judge templates from Arize, Evidently, Langfuse, or Monte Carlo, scoped to the referenced tables only — schema bloat poisons the judge the same way it poisons the generator. Differential testing — running the agent’s query and a hand-written reference against the same dataset and diffing — is the natural extension; no productized tool exists for it yet, and any team adopting AI-authored SQL at scale should be ready to roll their own harness.
  • What it catches: The opening incident from Part 1 — the soft-delete-naive LEFT JOIN that over-reported revenue by 7%. The JOIN cardinality blowup through a bridge table without composite UNIQUE. The polysemic-TINYINT predicate landing on the wrong meaning. The pagination that drops rows on timestamp ties. The temporal-misalignment failure from the Thomson Reuters case — predicates landing on the parent date column but not the joined ones. The aggregate that ran in 80ms and was off by $1.4M. Anything where the failure shape is “result is plausible but wrong.”
  • What it misses: Anything that only manifests at production scale, under concurrency, or against data shapes the fixture and the property generators didn’t include. The LLM-judge tier specifically misses any failure mode invisible from the result set — a query that returns the right number for the wrong reason still passes.
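
For the conservation case, a dbt singular test is one SQL file under tests/ that must return zero rows; the model and column names here are hypothetical:

```sql
-- tests/assert_revenue_not_row_multiplied.sql (hypothetical)
-- A JOIN cardinality blowup inflates per-order revenue; any order where the
-- mart disagrees with the line-item total comes back as a failing row.
SELECT r.order_id,
       r.revenue,
       t.line_item_total
FROM {{ ref('fct_revenue') }} r
JOIN {{ ref('stg_order_totals') }} t USING (order_id)
WHERE r.revenue <> t.line_item_total
```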

4. Lock and performance budgets: run the migration, measure what it does

The category that catches the locking migration from Part 1. The test applies the migration to a database the size of production (or a representative fraction), measures lock duration with pg_locks or information_schema.innodb_trx, runs concurrent reads and writes against the table to surface metadata-lock contention, and asserts against a budget. Same idea for queries: run EXPLAIN against a representative dataset, assert the planner uses the index the team meant, assert the cost is below a budget, assert the query plan didn’t change unexpectedly between the previous and current revisions.
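
A minimal version of the pg_locks watcher, polled from a second session while the migration applies (the relation name is illustrative):

```sql
-- Who is queued behind the migration's lock on the target table, and for how long.
SELECT a.pid,
       l.mode,
       l.granted,
       now() - a.query_start AS waiting_for,
       left(a.query, 80)     AS query
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE l.relation = 'users'::regclass
ORDER BY l.granted, waiting_for DESC;
-- The harness fails the test if any ungranted lock waits longer than the budget.
```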

  • Tool: Testcontainers for spinning up a real database engine inside the test runner, seeded with an anonymized prod-shaped snapshot (pg_anonymizer or equivalent for the snapshot pipeline). A test harness that applies the migration with a stopwatch and a pg_locks watcher running in a parallel session. EXPLAIN budgets via per-query assertions in CI; production query-plan regressions via pg_stat_statements snapshots.
  • What it catches: The 40-minute migration. The query that’s fast on 10K rows and slow on 10M. The index the planner ignores. The migration that succeeds in isolation and deadlocks against concurrent writers.
  • What it misses: Catalog-level invariants the migration doesn’t change but the test should still verify; data invariants that drift over time independent of any migration.

5. Data invariants: catch the drift after the schema is correct

The category most useful for catching production drift, not pre-deployment bugs. Once a day, in CI or as a scheduled job, run a set of assertions against actual production data: every order has a user, every user with an active subscription has a payment method, the soft-delete column is consistent across related tables, the JSON keys in the column match the documented shape, the count of users.deleted_at IS NOT NULL matches the count of soft-delete audit records. These assertions don’t run against fixtures; they run against the real data and surface inconsistencies the schema can’t enforce — the integrity rules that live in application code, not in the catalog.
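
Each assertion is a query that returns violating rows; the scheduled job fails and pages on any non-empty result. Table and column names here are assumptions:

```sql
-- Every order has a user: catches orphans an unenforced or NOT VALID FK let through.
SELECT o.id
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE u.id IS NULL;

-- Soft-delete consistency: no live order may belong to a deleted user.
SELECT o.id
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE u.deleted_at IS NOT NULL
  AND o.deleted_at IS NULL;
```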

  • Tool: Soda Core, Great Expectations, or custom SQL assertions wrapped in a test runner that fails the job and pages on a missed assertion. Schemathesis for property-based testing of API contracts that hit the database. For schema-vs-data drift specifically: a periodic job asserting every documented invariant from the Where Business Logic Lives post against the live data.
  • What it catches: Soft-delete inconsistencies between related tables, orphaned rows the FK should have caught but didn’t (because the FK was declared after the orphans existed and was created NOT VALID), JSON shape drift, business-logic invariants that live in application code and got bypassed by a backfill or a one-off script.
  • What it misses: Pre-deployment bugs. By the time a data invariant fires, the bad data is already there. This category is the safety net under the others, not a replacement for them.

The minimum useful subset

Five categories is more than most teams will adopt at once. The order to add them, ranked by leverage per hour invested:

  1. DDL safety lint (Squawk or equivalent). One config file, zero runtime cost, catches the highest-impact failure mode: schema migrations that lock production. Adopt this week.
  2. Schema invariants (pgTAP for Postgres, tSQLt for SQL Server). One test file per migration, asserting the migration produced the schema the team intended and that constraints behave the way the team thinks. Catches the constraints that look right and aren’t.
  3. Lock and performance budgets (Testcontainers + prod-shaped snapshot). The largest setup cost — the snapshot pipeline has to be built and maintained — and the largest payoff. Catches the failures that only manifest at production scale.
  4. Query result regressions (dbt tests or custom integration tests). High value for analytical workloads, lower for transactional. Pick the queries that drive business decisions and assert their results against fixtures; expand from there.
  5. Data invariants (Soda or scheduled SQL). Useful once the pre-deployment categories are in place. Without them, you’re chasing drift the earlier categories should have prevented.

A team that has none of these and adopts the first three closes most of the AI-introduced surface from Part 1. A team that has all five has built the verification layer that, pre-AI, lived in the heads of senior engineers — now written down, runnable on demand, and cheap to re-run on every change.

The recovery layer

The five categories test the database on the way in. They don’t test the path back when something AI-introduced makes it past every category and corrupts data anyway. Backups that have never been restored aren’t backups; they’re unverified hopes. Replication and failover that have never been broken on purpose aren’t HA; they’re configuration that hasn’t been disproved.

The drills are mechanical: pull a random backup from the last seven days, restore to a fresh ephemeral instance, run a checksum query against a known table, destroy the instance. Daily. Alert on failure. Same idea for point-in-time recovery — pick a timestamp from yesterday, restore to it, verify. Same for failover — kill the primary on a schedule in staging, confirm promotion, restore. Each drill costs an hour or two to automate and catches the class of failure where backups silently stopped working three months ago and nobody noticed because nobody needed them yet.
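
The checksum step is any deterministic fingerprint run against both the source and the restored instance and compared — a sketch, with an assumed table and columns:

```sql
-- Order-independent content fingerprint: identical data yields an identical md5.
SELECT count(*) AS row_count,
       md5(string_agg(id::text || ':' || coalesce(email, ''), ',' ORDER BY id)) AS fingerprint
FROM users;
```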

The framing shifts with agents in the loop. An agent that holds write permissions on production can cause damage at machine speed; recovery has to work at machine speed too. A four-hour restore that requires three engineers isn’t a recovery procedure, it’s a postmortem.

The performance environment

Category 4 covers per-PR migration safety and per-query plan budgets — good for CI feedback loops, where a Testcontainers engine spins up, runs one operation, asserts a budget, and tears down. That’s the wrong shape for the failure modes that only emerge under sustained load: throughput collapse under realistic concurrency, p95/p99 latency creeping past SLO under traffic mix, replication lag under write pressure that staging never produces, connection pool saturation when a new query plan blows the average query duration, buffer cache thrash when an added index pushes hot data out of shared_buffers, and the slow degradation that only shows up after the cache has warmed and the workload has run for hours.

The environment those tests need is a shadow of production: same data volume, same replication topology, ideally same hardware shape, populated from an anonymized snapshot refreshed on a regular cadence. Traffic comes from replay rather than synthesis — capture full query logs over a representative window (pg_stat_statements gives the normalized statement mix, but replay needs the logged statements with their parameters) and replay them with pgreplay, Percona Playback, or a custom harness against the shadow at production rate. Synthetic load generators (k6, JMeter) work for application-level scenarios but miss the long-tail query distribution that production carries, which is exactly where AI-introduced regressions hide.
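
Verifying that the replay window actually covers the long tail is one query against pg_stat_statements (column names per the Postgres 13+ view):

```sql
-- The statements that dominate total execution time, including the long tail
-- a synthetic load script would never think to include.
SELECT queryid,
       calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 100)                   AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 500;
```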

The payoff is the failure class CI cannot reach: a migration passes every Testcontainers check, deploys to production, and degrades p99 latency 40% three hours later because the new index shifted the planner’s choice for a different query the CI run never executed. That regression is invisible in a five-second container and obvious in a four-hour replay.

The cloud collapses most of the setup cost. Aurora’s fast database cloning is copy-on-write against the source — ready in minutes regardless of database size, you only pay for the diff, and you tear it down when done. Neon’s branches do the same for managed Postgres outside Aurora. Google Cloud SQL clones and Azure SQL database copy are slower but in the same family. Plain RDS without Aurora is slower still (snapshot + restore) but cheaper than building the pipeline yourself. The “we’d need to maintain a parallel copy of production” objection that used to kill this kind of testing infrastructure is a one-API-call problem in 2026 for anyone on managed cloud Postgres or MySQL — clone prod on demand, run the migration and the replay against the clone, throw it away. The blocker shifted from infrastructure to discipline: are AI-authored migrations actually running this gate before merge, or is the clone capability sitting unused?

A read-only replica in production is the degenerate version of this — it’s the cheapest shadow you’ll ever build, and the answer to “is the new query going to scan a billion rows” is to run it on the replica before merging. Many teams already have one; far fewer route AI-generated queries through it as a standing gate.

When this doesn’t apply

The minimum useful subset assumes a production-facing system with multiple writers, frequent migrations, and AI in the loop. Cases where less is enough:

  • A read-only analytical workload with no migrations. Categories 1 and 2 don’t apply. Category 3 (result regressions) and category 5 (data invariants) carry the load.
  • A throwaway service with one writer and no enduring data. None of this is necessary; the cost of a wrong query is recoverable.
  • A team with zero AI in its data layer. The case for the test categories is weaker — the implicit human review is intact. The categories still matter, but they aren’t load-bearing in the same way.
  • A schema small and stable enough to fit in one head. Twenty tables, three engineers, one service writing every row. The reviewer who wrote the migration is the test, the same way they always were. Grow the team or the schema by an order of magnitude and the math flips.

For everything else, the cost of building the test layer is a fraction of the cost of one production incident. AI made the math obvious.

The bigger picture

Tests are part of the schema’s contract. The schema declares what the database should enforce; the tests assert that the schema does what it claims and that the queries against it return what they claim. Pre-AI, that contract was partially carried by the engineer writing each migration and each query — slowly, with friction, with mistakes, but the friction itself was a check. AI removes the friction. The contract has to be written down or it isn’t a contract.

The five categories above are the contract written down. Each catches a class of failure the others can’t see, each has tools that have been stable for years, and each costs a fraction of the incident it prevents. The teams that build them now will catch what AI introduces before it ships. The teams that don’t will catch it in pages from oncall — the same pages that, before AI, were rare enough to absorb and are about to stop being rare.

The frontier under those five is intent verification — none of them asserts that a query means what the human asked. Dual-track LLM-judge eval is the early answer, covered above, and not yet a release gate. Until it is, the practical floor is the discipline already under your control: Part 1 covered why this became mandatory; this part covered what the layer looks like; pick one category and start.

SELECT insights FROM experience WHERE downtime = 0; -- Ruslan Tolkachev