Data Migration & Schema Evolution¶

Status: 🟢 Active | Owner: Data Engineering

Production databases evolve continuously. Doing this safely — without downtime, without breaking running application versions, and without risking data integrity — requires a disciplined process. These standards apply to all schema changes across operational databases.

Core Principles¶

Schema changes are code. They are version-controlled, reviewed, and tested with the same rigour as application code.

Every migration must be backward compatible during the deployment window. In a rolling deployment, the old version of the service runs against the new schema while the new version is starting up, and vice versa during rollback. The schema must work with both.

Destructive operations are always deferred. Dropping a column, renaming a column, or removing a constraint happens in a separate release, after the application code that references that column has been removed and deployed.

Migrations are never edited after they are merged to main. If a migration has a bug, a new migration fixes it. Editing a committed migration causes checksums to fail on other environments.

Approved Migration Tools¶

Language Stack	Tool	Standard
Java / Spring Boot	Flyway	Required
Python	Alembic	Required
TypeScript / Node.js	Prisma Migrate (Prisma projects) or db-migrate	Required
Any	Raw SQL migrations via Flyway	Acceptable for multi-language services

Flyway is the default. If your service uses a different language but needs migration management, Flyway's command-line tool can be used independently of the application language.

Flyway Standards (Java)¶

File Naming¶

V<version>__<description>.sql
V20240315001__create_orders_table.sql
V20240315002__add_customer_email_index.sql
V20240322001__add_order_status_column.sql

Undo migrations (optional, for rollback support):
U20240315001__drop_orders_table.sql

Use the date + sequence format (YYYYMMDDNNN) for version numbers — not sequential integers. This prevents merge conflicts when multiple teams develop migrations simultaneously.

Repeatable Migrations¶

Use repeatable migrations (prefix R__) for objects that should be recreated on every change: views, stored functions, permissions grants.

R__create_order_summary_view.sql
R__grant_readonly_permissions.sql

Required Configuration¶

# application.yml
spring:
  flyway:
    enabled: true
    locations: classpath:db/migration
    baseline-on-migrate: false   # Never auto-baseline in production
    out-of-order: false          # Migrations must run in order
    validate-on-migrate: true    # Checksum validation — never bypass
    clean-disabled: true         # REQUIRED — prevent accidental data destruction

clean-disabled: true prevents flyway clean from wiping the database. This must never be disabled in any environment that contains real data, including staging.

The Backward-Compatible Migration Pattern¶

The key constraint in zero-downtime deployments is that new code and old code must both work against the new schema for the duration of the deployment window.

The Expand-Contract Pattern¶

Schema evolution follows a three-phase lifecycle:

Phase 1: EXPAND       Phase 2: MIGRATE      Phase 3: CONTRACT
─────────────────     ──────────────────    ─────────────────
Add new column        Deploy new app code   Remove old column
(nullable, no         that writes to new    (separate release,
default required)     column AND reads       after old code is
Old code ignores it.  from new column.       fully retired)

Step-by-Step: Renaming a Column¶

Never rename a column in a single migration. Follow these phases:

Release 1 — Add new column:

-- V20240315001__add_customer_ref_column.sql
ALTER TABLE orders ADD COLUMN customer_reference TEXT;

-- Backfill existing rows (run as separate migration if table is large)
-- V20240315002__backfill_customer_reference.sql
UPDATE orders SET customer_reference = customer_id::TEXT
WHERE customer_reference IS NULL;

ALTER TABLE orders ALTER COLUMN customer_reference SET NOT NULL;

Release 2 — Migrate application code: - Deploy application code that reads from customer_reference and writes to both customer_id and customer_reference. - Validate in production.

Release 3 — Remove old column (separate release, minimum one sprint later):

-- V20240401001__drop_customer_id_column.sql
ALTER TABLE orders DROP COLUMN customer_id;

Safe vs. Unsafe Operations¶

Operation	Safe in Zero-Downtime?	Notes
`ADD COLUMN` (nullable)	✅ Safe	Immediately safe
`ADD COLUMN` with `DEFAULT` (PostgreSQL 11+)	✅ Safe	Metadata-only in PG 11+
`ADD COLUMN NOT NULL` without default	❌ Unsafe	Rewrites table; blocks reads/writes
`DROP COLUMN`	⚠️ Safe with guard	Only after all code referencing it is removed
`ADD INDEX CONCURRENTLY`	✅ Safe	Does not lock table
`ADD INDEX` (without CONCURRENTLY)	❌ Unsafe	Locks table for the duration
`ADD CONSTRAINT` (CHECK, NOT NULL)	❌ Unsafe	Full table scan with lock
`ADD CONSTRAINT NOT VALID` + `VALIDATE CONSTRAINT`	✅ Safe	Two-phase: add without validation, then validate with ShareLock
`RENAME COLUMN`	❌ Unsafe	Breaks code referencing old name until fully deployed
`RENAME TABLE`	❌ Unsafe	As above
`ALTER COLUMN TYPE`	❌ Unsafe (usually)	Rewrites table unless cast is implicit

Adding a Non-Nullable Column¶

-- ❌ Locks table for full rewrite
ALTER TABLE orders ADD COLUMN priority INTEGER NOT NULL DEFAULT 0;

-- ✅ Safe three-step approach
-- Step 1: Add nullable
ALTER TABLE orders ADD COLUMN priority INTEGER;

-- Step 2: Set default for future rows
ALTER TABLE orders ALTER COLUMN priority SET DEFAULT 0;

-- Step 3: Backfill existing rows in batches (see Large Table Migrations below)
UPDATE orders SET priority = 0 WHERE priority IS NULL;

-- Step 4: Add NOT NULL constraint using NOT VALID first
ALTER TABLE orders ADD CONSTRAINT chk_orders_priority_not_null
    CHECK (priority IS NOT NULL) NOT VALID;

-- Step 5: Validate (ShareLock — does not block reads/writes on PG 14+)
ALTER TABLE orders VALIDATE CONSTRAINT chk_orders_priority_not_null;

Large Table Migrations¶

For tables with millions of rows, batch operations are required. A single UPDATE orders SET ... on a large table creates a long-running transaction that holds locks and generates excessive WAL.

Batch Update Pattern¶

-- V20240315003__backfill_priority_batched.sql
-- This script is designed to be run as a Flyway migration or a one-time script.
-- It updates rows in batches of 10,000 to avoid long-running transactions.

DO $$
DECLARE
    batch_size INT := 10000;
    rows_updated INT;
BEGIN
    LOOP
        UPDATE orders
        SET priority = 0
        WHERE id IN (
            SELECT id FROM orders
            WHERE priority IS NULL
            LIMIT batch_size
            FOR UPDATE SKIP LOCKED
        );
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;
        PERFORM pg_sleep(0.1);  -- Brief pause between batches
    END LOOP;
END $$;

Index Creation on Large Tables¶

Always use CONCURRENTLY:

-- ❌ Locks the table for minutes on large tables
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);

-- ✅ Runs without locking — takes longer but safe
CREATE INDEX CONCURRENTLY idx_orders_customer_status
    ON orders (customer_id, status);

Note: Flyway runs migrations inside a transaction by default. CREATE INDEX CONCURRENTLY cannot run inside a transaction. Use Flyway's @NonTransactional annotation:

// Java annotation to disable transaction wrapping for this migration
@NonTransactional
public class V20240315004__AddConcurrentIndex implements JdbcMigration {
    public void migrate(Connection conn) throws Exception {
        conn.createStatement().execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_status " +
            "ON orders (customer_id, status)"
        );
    }
}

Or use a raw SQL file with -- flyway:nonTransactional comment.

Alembic Standards (Python)¶

# alembic/env.py — required configuration
from logging.config import fileConfig
from sqlalchemy import engine_from_config, pool
from alembic import context

# Use offline mode for generating SQL scripts (useful for review)
# Use online mode for applying to a real database

def run_migrations_online():
    connectable = engine_from_config(
        config.get_section(config.config_ini_section),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,  # No pooling in migration context
    )
    with connectable.connect() as connection:
        context.configure(connection=connection, target_metadata=target_metadata)
        with context.begin_transaction():
            context.run_migrations()

Migration files use op.execute() with raw SQL for complex operations:

# migrations/versions/2024_03_15_001_add_priority.py
def upgrade():
    # Safe: ADD COLUMN with nullable first
    op.add_column('orders', sa.Column('priority', sa.Integer(), nullable=True))
    op.execute("UPDATE orders SET priority = 0 WHERE priority IS NULL")
    op.alter_column('orders', 'priority', nullable=False)

def downgrade():
    op.drop_column('orders', 'priority')

Data Migration (Operational to Analytical)¶

Beyond schema migrations, data must periodically be migrated between the operational and analytical planes. These have different requirements.

Operational → Analytical (ELT Pipeline)¶

Data flows from PostgreSQL to Snowflake via:

Change Data Capture (CDC) using Debezium (reads PostgreSQL WAL). Recommended for low-latency replication.
Scheduled extract via Fivetran or custom Airflow DAG. Acceptable for daily-batch analytical datasets.

All analytical ingestion pipelines are owned and maintained by Data Engineering — not by the product squad.

Product squads are responsible for:

Notifying Data Engineering when schemas change (via the schema change review process).
Ensuring new tables/columns that should be replicated are documented and have appropriate data classification labels.
Not dropping columns without a coordination window with Data Engineering (the pipeline will break if a source column disappears).

Schema Change Coordination¶

Whenever a migration adds, removes, or renames a column in a table that is replicated to the analytical plane, the squad must:

Open a coordination ticket with Data Engineering at least 5 business days before the migration is deployed to production.
Data Engineering updates the ELT pipeline to handle the change.
Both the schema migration and the pipeline update deploy in a coordinated release window.

References¶

Last reviewed: 2025-Q4 | Owner: Data Engineering