Zero-Downtime Deployments and Database Migration Strategies in ECS

If you've watched a rolling deployment kill live traffic because a migration dropped a column that old tasks still needed, you understand why this matters. Zero-downtime deployments and database migrations in ECS aren't separate concerns—they're tightly coupled, and breaking one breaks the other.

The Core Problem

ECS rolling deployments run old and new task versions simultaneously. If your migration removes a column, renames a field, or changes a constraint, old tasks will error the moment that migration runs. You've achieved partial downtime.

The solution isn't just "run migrations before deployment." It's a multi-phase migration strategy paired with deliberate deployment sequencing.

Expand → Contract Migrations

Never ship a breaking schema change in one deployment.

Step 1 — Expand (additive only): Add the new column or table. Make new columns nullable or give them a default, so inserts from old tasks that know nothing about the column still succeed. Deploy new code that writes to both old and new structures. Old tasks remain unaffected.

Step 2 — Backfill: Run a background job or one-off ECS task to populate the new column for existing rows. Keep this separate from application deployment.

Step 3 — Contract (removal): Once all rows are populated, deploy code that reads and writes only the new column. When that release is stable and no running task references the old column anymore, ship a follow-up migration removing it.

This requires more deployments. It also costs you zero downtime. That trade is correct.
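The first two steps can be sketched in a few lines. This is a minimal illustration, not a production migration: it uses Python's built-in sqlite3 as a stand-in for the real database, and the `users` table, the `name` and `full_name` columns, and the batch size are all hypothetical.

```python
import sqlite3

BATCH_SIZE = 2  # tiny on purpose so the demo needs several batches


def expand(conn):
    # Step 1: additive change only. Old code never references full_name,
    # so tasks running the previous release keep working.
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def create_user(conn, name):
    # New code dual-writes: the old column for old tasks, the new one for itself.
    conn.execute(
        "INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name)
    )


def backfill(conn, batch_size=BATCH_SIZE):
    # Step 2: copy old values into the new column in small batches, committing
    # after each one so no single transaction runs for long.
    # Returns the number of rows updated.
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET full_name = name WHERE id IN ("
            "  SELECT id FROM users WHERE full_name IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",), ("alan",)]
)
expand(conn)
create_user(conn, "edsger")   # written by new code after the expand
backfilled = backfill(conn)   # touches only the three pre-existing rows
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
).fetchone()[0]
# Step 3 (contract) belongs in a later migration, once no deployed code
# reads or writes users.name: ALTER TABLE users DROP COLUMN name
```

The contract step is left as a comment deliberately: it should not ship in the same release as the expand.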

Wiring Migrations Into ECS

Run Migrations as One-Off Tasks

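Don't run migrations in your application entrypoint. Your Dockerfile's CMD should start your server, not your schema manager. Instead, run the migration as a one-off task before the service update: reuse the application's task definition (same image) with a command override, or register a dedicated migration task definition. Task definitions are addressed by family or family:revision, not by a latest tag; passing only the family name selects the latest ACTIVE revision:

aws ecs run-task \
  --cluster my-cluster \
  --task-definition my-app \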
  --overrides '{"containerOverrides":[{"name":"app","command":["python","manage.py","migrate"]}]}' \
  --launch-type FARGATE \
  --network-configuration "..."

Wait for this task to reach STOPPED and verify its exit code before triggering the service update. If migration fails, you abort—no new tasks launch, no traffic disrupted.
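A rough sketch of that wait-and-verify step follows. It assumes `ecs` is a boto3 ECS client (or anything exposing the same `describe_tasks` call); the function name, polling interval, and timeout are illustrative choices, not part of any AWS API.

```python
import time


def wait_for_migration(ecs, cluster, task_arn, poll_seconds=5,
                       timeout_seconds=900, sleep=time.sleep):
    """Poll until the task is STOPPED; True only if every container exited 0."""
    waited = 0
    while True:
        response = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
        if response.get("failures") or not response.get("tasks"):
            raise RuntimeError(
                f"cannot describe {task_arn}: {response.get('failures')}"
            )
        task = response["tasks"][0]
        if task["lastStatus"] == "STOPPED":
            # exitCode is absent when a container never started (for example
            # an image pull failure), so a missing value counts as failure.
            containers = task.get("containers", [])
            return bool(containers) and all(
                c.get("exitCode") == 0 for c in containers
            )
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"{task_arn} still {task['lastStatus']} after {waited}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

In a pipeline, the task ARN would come from the `run_task` response (`response["tasks"][0]["taskArn"]`), and the service update would run only when this function returns True. boto3 also ships a `tasks_stopped` waiter that can replace the polling loop; the exit-code check is still needed afterwards.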

In your CI/CD pipeline:

1. Build image → push to ECR
2. Register new task definition
3. Run migration task → wait → verify exit code
4. Update ECS service (rolling or blue/green)

Rolling vs. Blue/Green

Rolling works well for most deployments when you follow expand/contract. ECS replaces tasks gradually using minimum and maximum percent settings. Set minimumHealthyPercent: 100 and maximumPercent: 200 to maintain capacity.
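To make those two percentages concrete, the sketch below computes the task-count bounds they imply and builds the arguments for a boto3 `update_service` call. The helper names are hypothetical; the rounding follows the ECS documentation, which rounds the minimum up and the maximum down.

```python
def rolling_bounds(desired_count, minimum_healthy_percent=100, maximum_percent=200):
    """Return (min_healthy_tasks, max_total_tasks) during a rolling deployment."""
    # Ceiling division for the lower bound, floor division for the upper bound.
    min_healthy = -(-desired_count * minimum_healthy_percent // 100)
    max_total = desired_count * maximum_percent // 100
    return min_healthy, max_total


def update_service_kwargs(cluster, service, task_definition):
    """Keyword arguments for ecs.update_service with the settings named above."""
    return {
        "cluster": cluster,
        "service": service,
        "taskDefinition": task_definition,
        "deploymentConfiguration": {
            "minimumHealthyPercent": 100,
            "maximumPercent": 200,
        },
    }


# With 4 desired tasks, ECS keeps at least 4 healthy and may run up to 8
# while old and new versions overlap.
bounds_for_four = rolling_bounds(4)
```

The upper bound is also the number of tasks that can hit the database at once during a deploy, which is worth checking against connection limits.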

Blue/Green via CodeDeploy is worth the overhead for risky migrations where instant rollback or pre-cutover validation matters. The tradeoff: more infrastructure and slower deploy cycles.

Default recommendation: rolling deployments with disciplined migrations. Blue/green when the blast radius is severe.

Health Checks Matter

New ECS tasks receive traffic immediately after passing the load balancer health check. If your application starts before it's actually ready—database connections established, dependencies warmed—you'll serve errors.

Use a health check that validates actual readiness, not just process liveness. A /health endpoint that runs SELECT 1 against your database beats one that only returns 200. Set healthCheckGracePeriodSeconds high enough to avoid killing slow startups, but not so high that broken deployments take forever to surface.
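A readiness-style endpoint might look like the following framework-free WSGI sketch. The `connect` callable is an assumption standing in for however your application opens a database connection; sqlite3 is used here only so the example runs on its own, and `conn.execute` is the sqlite3 shortcut (other drivers typically go through a cursor).

```python
import sqlite3


def make_health_app(connect):
    """Build a WSGI app whose /health route succeeds only if the database answers."""
    def app(environ, start_response):
        headers = [("Content-Type", "text/plain")]
        if environ.get("PATH_INFO") != "/health":
            start_response("404 Not Found", headers)
            return [b"not found"]
        try:
            conn = connect()
            try:
                # Readiness, not liveness: prove a query round-trips.
                conn.execute("SELECT 1").fetchone()
            finally:
                conn.close()
        except Exception:
            # Any failure means "do not send me traffic yet".
            start_response("503 Service Unavailable", headers)
            return [b"database unavailable"]
        start_response("200 OK", headers)
        return [b"ok"]
    return app


# Stand-in wiring: a real service would pass its own connection factory.
health_app = make_health_app(lambda: sqlite3.connect(":memory:"))
```

Returning 503 rather than raising keeps the failure visible to the load balancer as an unhealthy target instead of a crashed worker.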

PostgreSQL Specifics

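ALTER TABLE ... ADD COLUMN takes an ACCESS EXCLUSIVE lock. Adding a nullable column with no default is a quick catalog change, but before PostgreSQL 11, adding a column with a default rewrote the whole table while holding that lock, long enough to cause timeouts on large tables. PostgreSQL 11 and later skip the rewrite for constant defaults. Adding the column with DEFAULT NULL first and setting your default in a follow-up is still a habit worth keeping.
Long-running transactions block migrations: the DDL waits for its lock, and every later query on that table queues behind it. Set a sensible lock_timeout and statement_timeout for the migration session, and ensure connection pooling (PgBouncer or RDS Proxy) doesn't hold idle transactions open.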
Back up before destructive migrations. People skip this.
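One way to apply timeouts only to the migration is to set them inside the migration's own transaction. The helper below is a hypothetical sketch that wraps a DDL statement with PostgreSQL's `SET LOCAL lock_timeout` and `SET LOCAL statement_timeout`; the default values are placeholders, and the statements would still need to be executed through your database driver or migration tool.

```python
import re

# Accept simple PostgreSQL duration literals such as 500ms, 5s, or 2min.
_DURATION = re.compile(r"^\d+(ms|s|min)$")


def guarded_migration(ddl, lock_timeout="5s", statement_timeout="30s"):
    """Return the statements for one DDL change guarded by session-local timeouts."""
    for value in (lock_timeout, statement_timeout):
        if not _DURATION.match(value):
            raise ValueError(f"unsupported duration: {value!r}")
    return [
        "BEGIN",
        # Give up quickly if a long-running transaction holds the table lock.
        f"SET LOCAL lock_timeout = '{lock_timeout}'",
        # Cap how long the DDL itself may run once it has the lock.
        f"SET LOCAL statement_timeout = '{statement_timeout}'",
        ddl,
        "COMMIT",
    ]


statements = guarded_migration("ALTER TABLE users ADD COLUMN full_name text")
```

Because `SET LOCAL` lasts only until the end of the transaction, the application's normal sessions keep their own timeout settings. A migration that times out fails cleanly and can be retried.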

Rollback Reality

Rollback doesn't rewind. If a migration ran and dropped a column, deploying the old image won't restore it. Your rollback strategy is a forward fix: the schema must remain backward-compatible. This is another reason for expand/contract: rolling back application code stays safe for as long as the schema supports both versions, which means until the contract step runs. Delay that step until you are sure you won't need the previous release.

This pattern works across Django, FastAPI, and Next.js backend APIs on ECS. The stack changes; the discipline doesn't. When debugging a failed deployment, pull logs and exec into running tasks to understand what went wrong.

Get migration sequencing right and rolling deployments become boring. That's the goal.

Damian Hodgkiss