Zero-Downtime Deployments and Database Migration Strategies in ECS
Master database migrations that survive rolling deployments. Learn the expand-contract pattern, deployment sequencing, and real failure modes—so old and new tasks coexist without errors.
If you've watched a rolling deployment kill live traffic because a migration dropped a column that old tasks still needed, you understand why this matters. Zero-downtime deployments and database migrations in ECS aren't separate concerns—they're tightly coupled, and breaking one breaks the other.
The Core Problem
ECS rolling deployments run old and new task versions simultaneously. If your migration removes a column, renames a field, or changes a constraint, old tasks will error the moment that migration runs. You've achieved partial downtime.
The solution isn't "run migrations before deployment." It's a multi-phase migration strategy paired with deliberate deployment sequencing.
Expand → Contract Migrations
Never ship a breaking schema change in one deployment.
Step 1 — Expand (additive only): Add the new column or table as nullable with a default. Deploy new code that writes to both old and new structures. Old tasks remain unaffected.
Step 2 — Backfill: Run a background job or one-off ECS task to populate the new column for existing rows. Keep this separate from application deployment.
Step 3 — Contract (removal): Once the new code is stable and all rows are populated, deploy code that reads and writes only the new column. When no running task references the old column anymore, ship a follow-up migration that removes it.
This requires more deployments. It also costs you zero downtime. That trade is correct.
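The three steps above can be sketched in a few lines. This is a minimal illustration, not a production migration: it uses Python's built-in sqlite3 as a stand-in for your real database so it runs anywhere, and the users table and the full_name and display_name columns are hypothetical names chosen for the example.

```python
import sqlite3

# Hypothetical scenario: replace users.full_name with users.display_name.


def expand(conn: sqlite3.Connection) -> None:
    """Step 1: additive change only. Old tasks never notice the new column."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")  # nullable
    conn.commit()


def dual_write(conn: sqlite3.Connection, user_id: int, name: str) -> None:
    """What new application code does while old tasks are still running."""
    conn.execute(
        "UPDATE users SET full_name = ?, display_name = ? WHERE id = ?",
        (name, name, user_id),
    )
    conn.commit()


def backfill(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Step 2: copy existing rows in small batches; returns rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = full_name "
            "WHERE id IN (SELECT id FROM users "
            "WHERE display_name IS NULL AND full_name IS NOT NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # commit per batch so no transaction stays open for long
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Step 3 ships in a later deployment, once no running task touches full_name.
CONTRACT_SQL = "ALTER TABLE users DROP COLUMN full_name"
```

Batching the backfill keeps each transaction short, which matters more on a production database than it does in this stand-in.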
Wiring Migrations Into ECS
Run Migrations as One-Off Tasks
Don't run migrations in your application entrypoint. Your Dockerfile's CMD should start your server, not your schema manager. Instead, define a separate task definition (same image, different command override) and run it before the service update.
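Starting such a one-off task could look roughly like the sketch below. It assumes boto3 and a Fargate task in awsvpc network mode; the function name and every argument value (cluster, task definition, container name, subnets, security groups) are placeholders rather than values from a real setup. The ECS client is passed in so the function can be exercised without AWS credentials.

```python
from typing import Any, Sequence


def start_migration_task(
    ecs: Any,
    cluster: str,
    task_definition: str,
    container_name: str,
    command: Sequence[str],
    subnets: Sequence[str],
    security_groups: Sequence[str],
) -> str:
    """Start a one-off task running the migration command; return its ARN."""
    response = ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        launchType="FARGATE",  # assumption: Fargate with awsvpc networking
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",
            }
        },
        # Same image as the service, but a different command.
        overrides={
            "containerOverrides": [
                {"name": container_name, "command": list(command)}
            ]
        },
    )
    # run_task reports placement problems in "failures" instead of raising.
    if response.get("failures") or not response.get("tasks"):
        raise RuntimeError(f"run_task failed: {response.get('failures')}")
    return response["tasks"][0]["taskArn"]


# Example call (needs AWS credentials; all values are placeholders):
#   import boto3
#   arn = start_migration_task(
#       boto3.client("ecs"), "my-cluster", "my-app-migrate", "app",
#       ["python", "manage.py", "migrate"], ["subnet-..."], ["sg-..."],
#   )
```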
Wait for this task to reach STOPPED and verify its exit code before triggering the service update. If migration fails, you abort—no new tasks launch, no traffic disrupted.
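In your CI/CD pipeline, make the service update conditional on that result: start the migration task, wait for it to stop, and update the service only if it exited with code 0.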
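One way a pipeline step could gate the service update on the migration result is sketched here. It assumes boto3, a migration task that has already been started (its ARN is passed in), and hypothetical function names; the tasks_stopped and services_stable waiters and the describe_tasks and update_service calls are standard ECS client operations.

```python
from typing import Any


def migration_succeeded(ecs: Any, cluster: str, task_arn: str) -> bool:
    """Block until the task reaches STOPPED, then check container exit codes."""
    ecs.get_waiter("tasks_stopped").wait(cluster=cluster, tasks=[task_arn])
    task = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])["tasks"][0]
    # exitCode is missing when a container never started (for example an
    # image pull failure), so anything other than an explicit 0 is a failure.
    return all(c.get("exitCode") == 0 for c in task["containers"])


def deploy_after_migration(
    ecs: Any,
    cluster: str,
    service: str,
    task_arn: str,
    new_task_definition: str,
) -> bool:
    """Update the service only if the migration task exited cleanly."""
    if not migration_succeeded(ecs, cluster, task_arn):
        return False  # abort: the service is left untouched
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=new_task_definition,
    )
    # Wait for the rolling deployment to settle before reporting success.
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])
    return True
```

Returning False instead of raising leaves it to the pipeline to decide how to fail the build; either choice works as long as the service update never runs after a failed migration.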
Rolling vs. Blue/Green
Rolling works well for most deployments when you follow expand/contract. ECS replaces tasks gradually using minimum and maximum percent settings. Set minimumHealthyPercent: 100 and maximumPercent: 200 to maintain capacity.
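To make those two percentages concrete, the sketch below computes the task-count window they allow during a rolling deployment. The helper name is hypothetical, and the rounding (lower bound rounded up, upper bound rounded down) reflects my reading of the ECS documentation, so verify it against your own service before relying on it.

```python
from typing import Tuple

# The shape ECS expects for a service's deploymentConfiguration.
DEPLOYMENT_CONFIGURATION = {
    "minimumHealthyPercent": 100,
    "maximumPercent": 200,
}


def rolling_bounds(
    desired_count: int,
    minimum_healthy_percent: int = 100,
    maximum_percent: int = 200,
) -> Tuple[int, int]:
    """Return (fewest healthy tasks kept, most tasks allowed) mid-deployment."""
    # Integer arithmetic avoids floating-point surprises.
    lower = -(-desired_count * minimum_healthy_percent // 100)  # round up
    upper = desired_count * maximum_percent // 100  # round down
    return lower, upper


print(rolling_bounds(4))           # (4, 8)
print(rolling_bounds(3, 50, 150))  # (2, 4)
```

With 100 and 200, a four-task service never drops below four healthy tasks and may briefly run eight, which is why old and new versions overlap.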
Blue/Green via CodeDeploy is worth the overhead for risky migrations where instant rollback or pre-cutover validation matters. The tradeoff: more infrastructure and slower deploy cycles.
Default recommendation: rolling deployments with disciplined migrations. Blue/green when the blast radius is severe.
Health Checks Matter
New ECS tasks receive traffic immediately after passing the load balancer health check. If your application starts before it's actually ready—database connections established, dependencies warmed—you'll serve errors.
Use a health check that validates actual readiness, not just process liveness. A /health endpoint that runs SELECT 1 against your database beats one that only returns 200. Set healthCheckGracePeriodSeconds high enough to avoid killing slow startups, but not so high that broken deployments take forever to surface.
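A readiness check along these lines might look like the following. It is a framework-agnostic sketch: the readiness function name is hypothetical, it accepts any DB-API style connection factory, and sqlite3 stands in for the real database so the example runs without one. In a real service you would call this from your /health route and return the status code it produces.

```python
import sqlite3
from typing import Any, Callable, Tuple


def readiness(connect: Callable[[], Any]) -> Tuple[int, str]:
    """Return (HTTP status, body): 200 only if the database answers SELECT 1."""
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")
            cur.fetchone()
        finally:
            conn.close()
    except Exception as exc:
        # Any connection or query failure means "not ready", not a crash.
        return 503, f"database unavailable: {type(exc).__name__}"
    return 200, "ok"


# Stand-in usage with an in-memory SQLite database.
print(readiness(lambda: sqlite3.connect(":memory:")))  # (200, 'ok')
```

Opening a fresh connection per check is the simplest form; with a connection pool you would borrow a connection from the pool instead.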
PostgreSQL Specifics
- ALTER TABLE ... ADD COLUMN on large tables can hold exclusive locks long enough to cause timeouts. Add the column with DEFAULT NULL first, then add your default in a follow-up. PostgreSQL 11 and later can add a column with a constant default without rewriting the table, but the habit is worth keeping.
- Long-running transactions block migrations. Set a sensible statement_timeout and ensure connection pooling (PgBouncer or RDS Proxy) doesn't hold idle transactions open.
- Back up before destructive migrations. People skip this.
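The statement_timeout advice could be applied per migration statement, as in this sketch. It assumes a DB-API connection with autocommit off (the psycopg2 default), so the SET LOCAL and the DDL run in the same transaction; the function name and the 5000 ms default are illustrative choices, not recommendations from the article.

```python
from typing import Any


def run_with_statement_timeout(conn: Any, ddl: str, timeout_ms: int = 5000) -> None:
    """Run one migration statement with a transaction-scoped statement_timeout."""
    if not isinstance(timeout_ms, int) or timeout_ms <= 0:
        raise ValueError("timeout_ms must be a positive integer")
    cur = conn.cursor()
    try:
        # SET does not accept bind parameters, so the validated integer is
        # formatted in directly. A bare number is read as milliseconds.
        cur.execute(f"SET LOCAL statement_timeout = {timeout_ms}")
        cur.execute(ddl)
        conn.commit()
    except Exception:
        # A timeout cancels the statement; roll back so the lock is released.
        conn.rollback()
        raise
    finally:
        cur.close()
```

Because SET LOCAL only lasts until the transaction ends, the timeout does not leak into other work done on the same pooled connection.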
Rollback Reality
Rollback doesn't rewind. If a migration ran and dropped a column, deploying the old image won't restore it. Once a destructive migration has run, your only option is a forward fix, so the schema must stay backward-compatible with the previous application version. This is another reason for expand/contract: until the contract step runs, rolling back application code is safe because the schema supports both versions.
This pattern works across Django, FastAPI, and Next.js backend APIs on ECS. The stack changes; the discipline doesn't. When debugging a failed deployment, pull logs and exec into running tasks to understand what went wrong.
Get migration sequencing right and rolling deployments become boring. That's the goal.
Damian Hodgkiss
Senior Staff Engineer at Sumo Group, leading development of AppSumo marketplace. Technical solopreneur with 25+ years of experience building SaaS products.