I have caused exactly one production outage with a database migration. It was a column rename in the user table, deployed at 3pm on a Wednesday, and the application started returning 500s the moment the migration committed because the running app code referenced the old column name and the new one had not landed yet. The fix was to roll the migration back, but the rollback also failed, because by then the new code was deployed and was now referencing only the new column. We were in a position where neither the old code nor the new code worked against the database in either state. The recovery was an hour of frantic patching. The lesson stuck.
Since then I have run hundreds of migrations across production databases and not taken anything down. The discipline is straightforward, mechanical, and fits on a card. This article is that card. Three rules, the expand-contract pattern, the four-step rename, the safe-vs-dangerous checklist, and the operational pieces (locking, replicas, large-table strategies) that make the difference between a migration you can ship at any hour and one you have to schedule a maintenance window for.
The three rules
I distill the entire playbook down to three rules. Every migration I ship satisfies all three or it is not safe.
Rule 1: every database state must be readable and writable by both the previous and the next version of the application code. Said differently: if a migration runs and the new code is not yet deployed, the old code must still work. If the new code is deployed and the migration has not run, the new code must still work. There is always a window where one is ahead of the other; the database state must be valid for both.
Rule 2: never combine schema changes with data changes in the same migration. Add the column in one migration. Backfill it in a separate, batched, idempotent step. Add the NOT NULL constraint in a third migration. Combining them either takes a long lock, fails partway through, or both.
Rule 3: every migration runs concurrently and never blocks reads or writes longer than a few milliseconds. This means using CREATE INDEX CONCURRENTLY (Postgres) or its equivalent, splitting big updates into batches, and avoiding any DDL that takes an ACCESS EXCLUSIVE lock on a busy table.
Those three rules cover almost every migration disaster I have seen. Violating any one of them is how you take production down.
Schema migrations vs data migrations
The second rule above (never combine schema and data changes) hides a deeper distinction worth naming explicitly: schema migrations and data migrations have different blast radii and need different rollback strategies, and conflating them is how the worst incidents happen.
A schema migration changes the catalog: adds a column, drops an index, alters a constraint. It is fast, atomic, and a single SQL statement. The blast radius is the entire table for the duration of the lock; the rollback is another schema migration in the opposite direction. If it fails, it usually fails in the first millisecond (the lock could not be acquired) and leaves no half-applied state.
A data migration changes rows: a backfill, a re-encoding, a soft-delete sweep. It is slow, often runs in batches, and touches the application's data. The blast radius is whatever rows it has already updated when something goes wrong; rollback is much harder, because the inverse of a backfill is rarely "set everything back to NULL". If the original values were derived from anywhere reachable (an event stream, an audit table, a backup), the recovery is a re-derivation. If they were not, the data is gone.
The practical consequence: I version-control schema migrations alongside the application code (one migration tool, one file per change, applied during deploy), but I version-control data migrations as standalone idempotent scripts that record their own progress (last id processed, batch size, started/finished timestamps in a migration_runs table). Schema migrations run in the deploy pipeline; data migrations run as separate jobs that can be paused, resumed, throttled, or aborted without touching deploys. Treating both as "just SQL files" loses the distinction and tempts the engineer into running a 4-hour backfill inline with prisma migrate deploy, blocking every subsequent deploy until it finishes.
The expand-contract pattern
The single most useful pattern is expand-contract: every breaking schema change becomes a sequence of additive, fully-rolling-deployable steps, ending with the contraction that drops the old shape.
Take adding a NOT NULL column with a default. The naive migration is:
On older Postgres versions (and on most other databases) this rewrites the entire table to add the default value to every row, holding an ACCESS EXCLUSIVE lock the whole time. On a 10 million row table that can be 10 minutes of full table downtime.
Postgres 11 and later optimize the case of a constant default (the default is stored in the catalog, no rewrite). You can verify this is what is happening on a specific column by querying pg_attribute: rows where atthasmissing is true and attmissingval is non-null are using the catalog-stored default rather than a per-row written value, which means the column add was free. If you ever ship a default-add and want to be sure it did not silently rewrite the table, that is the column to check. On tables with checks or triggers, or on databases that lack the optimization, the naive form is slow and will rewrite.
The expand-contract version, safe everywhere:
Each step is short, non-blocking, and individually rollbackable. If migration 2 fails halfway, you fix the bad rows and resume. If migration 3 fails, the column is still nullable, the app still works, you debug at leisure.
The four-step rename
Renaming a column is the migration that taught me the most. The naive form is ALTER TABLE users RENAME COLUMN email_address TO email. It takes a brief lock, runs in milliseconds, and violates rule 1 immediately: between the migration and the deploy, the running app references email_address and the column does not exist.
The four-step safe rename:
Each step is independently deployable and rollbackable. At every point between steps, both the old code and the new code work. The total time is days or weeks, not minutes, but that is the price of zero downtime. I have shipped renames in 30 minutes by skipping steps; I have also taken sites down doing exactly that.
Drop column: the contract phase done right
Dropping a column looks easy. It is the second easiest way to take production down (after rename).
The mistake: drop the column in the same deploy as the code that stops referencing it. Now your old running pods (still being drained, still serving traffic) reference the column. The column is gone. They 500.
The discipline:
I run a check before every drop: a SELECT count(*) on the table grouped by whether the column is null, the most recent timestamp it was written. If the column has been written recently, something is still touching it.
A concrete example from a few months ago. We had a signup_source_legacy column on the users table, untouched in the application code for the better part of a year. Our usual drop checklist was: grep the entire monorepo for the column name, check that no scheduled job mentions it, wait one deploy cycle, drop it. We were about to do that when one of the staff engineers asked a question that had not been on the checklist before: "Does the analytics warehouse read this column?" We checked. The nightly Airflow DAG that sync-ed the users table to BigQuery for the marketing team's dashboards was still selecting signup_source_legacy explicitly, with the column hard-coded in the SELECT list. Drop the column, the DAG breaks, the marketing dashboard breaks the next morning, somebody pages me on a Saturday. We added a permanent step to our drop checklist: grep across the warehouse repo, the log indexer's field-extraction config, the BI tool's saved queries, and any external SELECT * consumer that might have cached column names. The actual ALTER TABLE ... DROP COLUMN is the easy part of a column drop. Finding everything else that secretly depends on the column is the work.
Indexes: CONCURRENTLY, always
Adding an index on a busy table can take an ACCESS EXCLUSIVE lock if you forget the CONCURRENTLY keyword.
Two operational facts about CONCURRENTLY:
- It cannot run inside a transaction. Most migration tools default to wrapping each migration file in a
BEGIN/COMMIT, which makesCONCURRENTLYimmediately fail. The pattern that works in every tool I have used is to author the migration as raw SQL with no explicitBEGIN, and rely on the tool committing each statement on its own. Rails exposesdisable_ddl_transaction!for this; Sqitch lets you set:no_transaction; with Prisma you write a raw-SQL migration file (Prisma's migration engine does not wrap raw.sqlfiles in an explicit transaction, so each statement commits independently). Whatever the tool, verify withpg_stat_activitythat the index build is happening outside a transaction; if it is not, it will hold a lock you do not want. - If it fails (a duplicate value collides with the unique index, the table is locked by another DDL), it leaves an INVALID index behind. The index exists but is not used. Drop it explicitly (
DROP INDEX CONCURRENTLY ...) and re-run. I have lost an afternoon to a forgotten INVALID index that the application thought was helping.
Adding a unique index CONCURRENTLY is fine; the uniqueness is enforced once the index is valid. Adding a unique constraint is not the same operation; it currently still locks. If you need uniqueness on a hot table, build the unique index CONCURRENTLY first, then add the constraint that references the index (an extra step, but lock-free).
A few migration shapes I keep a recipe card for
Beyond add/drop/rename, there are a handful of shapes I have seen go badly often enough that I keep mental recipe cards for them.
Changing a column type. Postgres can sometimes change a type in place if the new type is binary-compatible (e.g., VARCHAR(50) to TEXT). When it cannot (e.g., INT to BIGINT), the table is rewritten under an exclusive lock. Treat it as an expand-contract: add a new column with the new type, dual-write, backfill, switch reads, drop the old. Identical to rename in shape.
Adding a foreign key constraint to existing data. The constraint check scans the whole table and locks it. The safe form is ADD CONSTRAINT ... NOT VALID (which adds the constraint but does not check existing rows; new rows are validated immediately) followed later by VALIDATE CONSTRAINT ..., which scans the table without holding an exclusive lock. Two migrations, no downtime.
Splitting one table into two. Almost always done by adding the new table, copying data in batches, dual-writing for a deploy or two, then making the new table the source of truth and dropping the old. The hard part is cross-table consistency during the dual-write window; some teams use database triggers as a safety net, but in practice careful application code plus a reconciliation job has been enough for me.
Soft-delete to hard-delete migration. The old table has a deleted_at column; the new model wants the row physically gone. Run the delete in batches (rule 3); never DELETE FROM users WHERE deleted_at IS NOT NULL in one statement on a large table. The same id > last_id batching loop above works for deletes too.
Large-table strategies
When the table has hundreds of millions of rows, the rules tighten. Three patterns I use:
Batched updates with progress. Every backfill is split into batches of 1k to 10k rows, with explicit progress logged.
Three details: the id > last_id clause uses the primary key index for fast batching, the pg_sleep(0.1) pauses give the database room to breathe between batches (your application queries do not get starved), and the COMMIT inside the loop releases locks so reads see current data.
Online-DDL tools. For MySQL, pt-online-schema-change (Percona) and gh-ost (GitHub) perform schema changes by creating a shadow table, copying data, syncing changes via triggers or binlog, and atomically swapping. For Postgres, pg_repack does similar things for table rewrites; for actual DDL, you usually rely on the built-in concurrent operations and the expand-contract discipline above.
Partitioning. When the table is so big that even batched writes are slow, partition first. Partitioning splits the table into smaller per-time-range or per-tenant sub-tables. New migrations only touch one partition at a time. This is itself a migration, and a non-trivial one (declarative partitioning is fine on Postgres 12+, but converting an existing table is still a multi-step expand-contract). Worth it for the largest tables.
Replicas and read traffic
Most production setups have read replicas. Migrations apply on the primary and stream to replicas. Two pitfalls:
Replication lag during a big migration. A 10-minute table rewrite on the primary will stall replication for those 10 minutes; reads going to the replica will see stale data, then all of a sudden the new state. Run schema changes during low-traffic windows even if they are non-blocking; the replicas might still notice.
Read-after-write consistency. Right after a migration that adds a column, code reading from a replica might not see the new column yet because the replication packet has not arrived. The fix is to read from the primary for any code path that needs to see the latest schema, and to wait for replicas to catch up before completing the deploy.
An incident that taught me to take this seriously. We backfilled a last_active_at column on a 200 million row sessions table during what we thought was a low-traffic window. The primary was fine; CPU was modest, no lock waits, the batched UPDATE was chewing through about 50k rows per second. What we did not notice for the first eight minutes was that one of the two read replicas had fallen 4 GB of WAL behind. The application's read traffic was load-balanced across both replicas, and roughly half of all dashboard queries were now reading data that was eight minutes stale. Logged-in users started seeing their own activity feeds with their most recent actions missing. Customer support saw the tickets before our monitoring did.
The metric that would have caught it earlier is pg_stat_replication.lag_bytes (or pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) on the primary), exposed via your Postgres exporter. We now alert at 100 MB of lag, page at 1 GB, and the migration runner watches the lag itself and pauses the batch loop if any replica goes above the threshold. The pseudocode is straightforward:
Wrap the batched UPDATE in a circuit breaker: if max_lag_bytes exceeds, say, 200 MB, sleep for 30 seconds and re-check; if it has not recovered after 5 minutes, abort the migration and page the on-call. The point is that the migration can run all night if it has to, but it never gets so far ahead of the replicas that read traffic suffers. I would rather take 6 hours to backfill a table cleanly than 90 minutes with a stale-read incident in the middle.
What to watch during a migration deploy
During any non-trivial migration, I keep four dashboards open. If any of them spikes, I either pause the migration (for batched data work) or accept the early signal and prepare to roll back (for schema work).
Lock waits are the early-warning siren. A single long-held lock can cascade into hundreds of waiters within seconds; the query in pg_stat_activity with wait_event_type = 'Lock' and the oldest xact_start is usually your culprit. A practical query to keep on hand:
Replication lag I covered above. The threshold worth alerting on depends on your traffic shape, but a reasonable starting point is: warn at 30 seconds of lag time, page at 5 minutes.
Application errors and p99 latency are the lagging indicators that customers feel. If errors or latency move during a migration window, even slightly, it is almost always the migration's fault until proven otherwise. Roll back, investigate, ship the fix, retry. The cost of an aborted migration is one redo; the cost of pushing through degradation is incident retrospectives and trust loss.
The pre-flight checklist
Before I ship any migration, I run through this list. It takes about five minutes and has saved me from at least three incidents.
- Does this migration violate rule 1? (Old and new app code both work against both schema states.) If yes, split into expand and contract.
- Does this migration take an
ACCESS EXCLUSIVEorSHARE EXCLUSIVElock? (RunEXPLAIN-equivalent or check the docs for the operation.) If yes, find the lock-free alternative or schedule a window. - If it backfills data, is the backfill batched? Idempotent? Resumable from the last committed point if it crashes mid-run?
- If it adds an index, is it
CONCURRENTLY? - Is the rollback path tested? (Roll forward to the new state, then roll back to the old. Both code states must work against both schema states; running this exercise reveals violations of rule 1.)
- What is the recovery plan if the migration fails partway through? (Resumable script? Re-run? Manual cleanup?)
A failure I will not repeat
The outage I opened with was a textbook violation of rule 1. The migration renamed email_address to email in one step. The deploy of the new code (which referenced email) had not yet rolled out. The old code (still serving traffic) referenced email_address. Between the migration committing and the deploy completing, every request on an old pod hit a non-existent column and 500'd.
The four-step rename above would have prevented it. Step 1 (add the new column) and step 2 (dual-write deploy) would have left both email_address and email populated. Step 3 (backfill) would have copied historical values. Step 4 (read-from-new deploy + later drop) would have happened over days. At no point would there have been a window where neither code version worked against the schema.
What I would tell the team
If your team is shipping migrations weekly without a playbook like this, you are accumulating incident risk that has not yet hit you. The cost of writing migrations the disciplined way is small: one extra step or two, longer cycle times for breaking changes, more thinking up front. The cost of skipping the discipline is a customer-visible outage at the moment your highest-traffic feature is being rolled out. I have been on both sides of that ledger. The disciplined side wins, every time, and the ergonomic cost is much smaller than people think.
The pushback I hear most often is on cycle time. Expand-contract turns a one-line rename into a three-deploy, week-long sequence; the engineer who wanted to ship a feature today now has a ticket open for a week. That feels slow, and it is slow, in a narrow sense. The trade is worth understanding precisely: expand-contract eliminates an entire class of incident (the one where running app code and a freshly migrated database disagree), at the cost of cycle time on schema changes specifically. It does not slow down feature work that does not touch the schema. It does not slow down hotfixes. It changes the cost curve from "99% of migrations are quick, 1% are an outage" to "100% of migrations are slightly slower, 0% are an outage". I have not yet met a team that, after a real incident, preferred the first curve. The cycle-time cost is paid in calendar time and is recoverable; the incident cost is paid in customer trust and is not.
The discipline beats heroics every time
Zero-downtime migrations are not glamorous. Each one is two or three deploys instead of one, spread over days. There is no neat one-liner; every shape (add, drop, rename, type change) has its own multi-step recipe. The discipline is mechanical and a little tedious, and the mechanism is what makes it safe. The first time you ship a column rename in four steps over a week and have nothing break, you stop minding the extra steps. The first time you skip a step and take production down, you stop being willing to skip steps. After enough of both, the discipline is not even a choice; it is just how you work.
