Health Checks, Graceful Shutdown, and SIGTERM Handling for FastAPI and Next.js Containers on ECS
Production patterns for health checks, SIGTERM handling, and graceful shutdown in containerised FastAPI and Next.js services on AWS ECS—edge cases included.
Running containerised services on ECS sounds straightforward until you've watched a deployment drop live traffic, or seen ECS keep routing requests to a container that's been wedged for ten minutes. Getting health checks, graceful shutdown, and SIGTERM handling right is the difference between a deployment that users never notice and one that generates a 3 a.m. incident. This guide gives you working patterns for both FastAPI and Next.js, including the edge cases that actually bite teams in production.
Why This Matters More Than You Think
ECS uses health information in two distinct places: the load balancer uses it to decide whether to route traffic to a target, and the ECS service scheduler uses it to decide whether to replace a task. If either signal is wrong — too slow, too lenient, or ignored — you end up with traffic hitting unhealthy containers or ECS killing containers before they've drained in-flight requests. Neither is acceptable in production.
There are three failure modes worth naming explicitly:
- Dropped requests during deployment — ECS deregisters a task from the ALB and sends SIGTERM, but the container exits before the load balancer has finished draining active connections.
- Zombie tasks — The application process has deadlocked or crashed internally, but the container is still running, so ECS never replaces it and the ALB keeps routing to it.
- False-positive health check failures — The health check fires before the application has finished starting, causing ECS to restart a perfectly healthy container in a restart loop.
All three are preventable with the patterns below.
FastAPI: Health Check Endpoint
Keep the health check endpoint trivially fast. It should not hit the database unless you specifically want to gate traffic on database connectivity. A dedicated /health route that returns immediately is enough for the load balancer.
Why /health and /ready Should Be Separate
If you want a richer readiness check that includes database connectivity or cache warmth, put it on a separate path such as /ready and wire it only to your internal monitoring — not the ALB health check. Here is why this separation matters in practice:
- The ALB health check is a liveness signal. It answers the question: "Is this process alive and capable of handling HTTP traffic?" A simple 200 is sufficient.
- A readiness check is a traffic-gating signal. It answers: "Has this container finished initialising and should it receive user traffic?"
Mixing liveness and readiness on a single endpoint that hammers your database is a surprisingly common cascade-failure trigger under load. If your database becomes slow, every container simultaneously starts returning 5xx on the health check, the ALB marks all targets unhealthy, and you have a full outage — caused not by your application, but by your health check strategy.
A safe pattern for readiness with database verification:
Wire /health to the ALB. Wire /ready to a CloudWatch canary or your internal Prometheus scrape — not the ALB target group.
FastAPI: Handling SIGTERM Gracefully
By default, Uvicorn handles SIGTERM, but behaviour depends on how you invoke it. If you run it via a shell script or a process manager inside the container, signals can be swallowed. The safest pattern is to run Uvicorn directly as PID 1 using exec form in your Dockerfile.
The exec form (JSON array) means Docker sends SIGTERM straight to Uvicorn rather than to a shell. That single change fixes the most common graceful-shutdown failure on containerised Python services.
Uvicorn's Graceful-Shutdown Timeout
Uvicorn supports a --timeout-graceful-shutdown flag (verified in the official Uvicorn settings documentation). Set it to at least as long as your ALB deregistration delay. If your ALB deregistration delay is 30 seconds, give Uvicorn at least 30 seconds to finish draining in-flight requests before it exits.
You can confirm the flag exists on your installed version with:
FastAPI Lifespan for Clean Startup and Shutdown
FastAPI's lifespan context manager (introduced in Starlette and available via FastAPI's lifespan parameter) is the idiomatic place to open and close resources such as database connection pools. The code before the yield runs at startup, before the application serves any requests; the code after the yield runs once SIGTERM has been received and Uvicorn has begun its shutdown sequence.
The shutdown block of lifespan is guaranteed to run before Uvicorn exits, so long as the process is not sent SIGKILL. This is why your ECS stopTimeout must be long enough to allow the shutdown sequence to complete.
The Shell-Form Trap
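Consider this common mistake: writing the start command in shell form, as a plain string such as CMD uvicorn app.main:app, rather than as a JSON array.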
In shell form, Docker runs /bin/sh -c "uvicorn ...". Uvicorn is a child process of the shell. When ECS sends SIGTERM, the shell receives it — and /bin/sh does not forward signals to its children by default. Uvicorn never sees SIGTERM; ECS waits out stopTimeout, then sends SIGKILL. Every in-flight request is terminated hard.
Next.js: Health Check Endpoint
Next.js App Router makes this trivial with a Route Handler: a route.ts file under app/api/health that exports a GET function returning a 200 response.
Wire this path to your ALB target group health check. Set the healthy threshold to 2 and the interval to 15 seconds — aggressive enough to detect problems quickly, conservative enough to avoid false positives during cold starts.
Cold-Start Detection and the startPeriod
Next.js containers, particularly those using SSR or ISR, can take 10–30 seconds to be ready to serve requests on a cold start. If your ECS container health check fires before the server is listening, ECS counts those failures immediately. Without a startPeriod, three consecutive failures (with default settings) will cause ECS to mark the task as unhealthy and restart it — before it's ever had a chance to start properly.
Set startPeriod in your task definition health check to cover your worst-case cold-start time, plus a margin. A Next.js app that typically starts in 15 seconds should have a startPeriod of at least 30 seconds.
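The container health check block can be sketched as a Python dictionary and dumped to JSON for the task definition. The numbers are illustrative assumptions, as are the presence of curl in the image and port 3000; the estimate at the end is a back-of-the-envelope figure, not a statement of exact ECS timing:

```python
import json

# Illustrative values only; adjust them to your own cold-start measurements.
health_check = {
    # Assumes curl exists in the image and Next.js listens on port 3000.
    "command": ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"],
    "interval": 15,     # seconds between checks
    "timeout": 5,       # seconds before a single check counts as failed
    "retries": 3,       # consecutive failures before the container is unhealthy
    "startPeriod": 30,  # failures inside this window do not count towards retries
}


def approx_seconds_until_unhealthy(check: dict) -> int:
    """Rough estimate for a container that never starts answering."""
    return check["startPeriod"] + check["retries"] * check["interval"]


print(json.dumps({"healthCheck": health_check}, indent=2))
print(approx_seconds_until_unhealthy(health_check))
```

With these values a container that never comes up is given roughly 75 seconds before it is marked unhealthy, instead of about 45 with no startPeriod at all.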
Next.js: Handling SIGTERM
The default next start process handles SIGTERM without any extra code from you. The critical step is again your Dockerfile's ENTRYPOINT: use exec form, for example ENTRYPOINT ["node", "server.js"], so the process receives signals directly.
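If you need custom shutdown logic (closing database connections, flushing queues, draining a background job worker), register a handler explicitly with process.on('SIGTERM', ...) in a custom server, do the cleanup inside it, and then exit the process.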
For most Next.js deployments on ECS you won't need a custom server. The default next start handles it adequately as long as your task stop timeout exceeds your ALB deregistration delay. Custom servers are primarily useful when you have background workers, WebSocket connections, or queue consumers running in the same process.
ECS Task Definition: Getting the Timeouts Right
This is where most teams get burned. Three timeout values must be consistent:
| Setting | Where | Recommended starting point |
|---|---|---|
| Deregistration delay | ALB target group | 30 seconds |
| Health check grace period | ECS service | 60 seconds |
| Task stop timeout | ECS task definition | 60 seconds |
The task stop timeout must exceed your deregistration delay. ECS sends SIGTERM, waits for stopTimeout, then sends SIGKILL. If stopTimeout is shorter than the deregistration delay, your container is killed whilst the load balancer is still draining it.
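The startPeriod on the container health check gives your container time to initialise before ECS starts counting failed health checks against it. Set it to cover your worst-case cold-start time, not just the typical one. It is separate from the service-level health check grace period in the table, which tells the ECS service scheduler how long to ignore failing health checks after a task first starts.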
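To make the "Where" column concrete, the sketch below writes each setting in the argument shape of the boto3 call that accepts it. Nothing here talks to AWS, and the container name web is a placeholder:

```python
# Each constant mirrors the argument shape of the boto3 call named above it.

# elbv2.modify_target_group_attributes(TargetGroupArn=..., Attributes=...)
TARGET_GROUP_ATTRIBUTES = [
    {"Key": "deregistration_delay.timeout_seconds", "Value": "30"},
]

# ecs.create_service(...) or ecs.update_service(...)
SERVICE_SETTINGS = {"healthCheckGracePeriodSeconds": 60}

# ecs.register_task_definition(containerDefinitions=[...])
CONTAINER_DEFINITION = {"name": "web", "stopTimeout": 60}


def deregistration_delay_seconds(attributes: list[dict]) -> int:
    """Read the delay back out of the target group attribute list."""
    for attribute in attributes:
        if attribute["Key"] == "deregistration_delay.timeout_seconds":
            return int(attribute["Value"])
    raise KeyError("deregistration_delay.timeout_seconds")


# The one relationship that must always hold between these settings.
assert CONTAINER_DEFINITION["stopTimeout"] > deregistration_delay_seconds(
    TARGET_GROUP_ATTRIBUTES
)
```

Note that target group attribute values are strings, which is an easy detail to trip over when comparing them with the integer stopTimeout.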
The Deregistration Delay Race Condition
When ECS decides to stop a task (during a deployment or scale-in), the sequence is:
- ECS begins deregistering the task from the ALB target group.
- ECS simultaneously sends SIGTERM to the container.
- The ALB marks the target as draining: it stops sending new requests to it but lets in-flight requests finish during the deregistration delay window.
- When the deregistration delay expires, the ALB closes any connections to the target that are still open.
- ECS waits until stopTimeout elapses, then sends SIGKILL.
If stopTimeout < deregistrationDelay, your container is dead before the ALB has finished draining. The requests that arrived during the gap get TCP RST errors. The fix is simple: always set stopTimeout > deregistrationDelay, with a comfortable margin for your application's shutdown work on top.
For most applications: stopTimeout = 30 (drain) + 15 (shutdown work) + 15 (margin) = 60 seconds.
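The same arithmetic can be turned into a small pre-deployment check. This is a sketch under the assumptions above; the class and field names are made up for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownTimeouts:
    deregistration_delay: int  # ALB target group, seconds
    shutdown_work: int         # time the app needs to close resources, seconds
    margin: int                # safety buffer, seconds
    stop_timeout: int          # ECS task definition stopTimeout, seconds

    def required_stop_timeout(self) -> int:
        return self.deregistration_delay + self.shutdown_work + self.margin

    def problems(self) -> list[str]:
        issues = []
        if self.stop_timeout <= self.deregistration_delay:
            issues.append("stopTimeout must exceed the deregistration delay")
        if self.stop_timeout < self.required_stop_timeout():
            issues.append(
                f"stopTimeout should be at least {self.required_stop_timeout()}s"
            )
        return issues


recommended = ShutdownTimeouts(30, 15, 15, stop_timeout=60)
too_short = ShutdownTimeouts(30, 15, 15, stop_timeout=20)
print(recommended.problems())
print(too_short.problems())
```

Running a check like this in CI against the values in your infrastructure code catches the mismatch before a deployment does.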
The One Thing Most Teams Skip
Include a container-level health check in the task definition in addition to the ALB health check. The ALB health check only affects routing; the container health check affects whether ECS replaces a stuck task. Without it, a container that's running but internally wedged — a deadlocked thread pool, an exhausted connection pool, a hung background job — keeps receiving traffic indefinitely.
The container health check is your backstop against zombie tasks. ECS uses consecutive failures to mark the task unhealthy and replace it, even if the ALB never noticed anything wrong (because the ALB only checks the TCP connection and HTTP response code, not whether your application is actually making progress).
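For Python images that do not ship curl, the container health check command can be a small standard-library probe. This is a sketch: the file name healthcheck.py, the default port 8000, and the two-second timeout are all assumptions:

```python
import sys
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: unhealthy.
        return False


def main(argv: list[str]) -> int:
    # Port 8000 is an assumption; match whatever your server listens on.
    url = argv[1] if len(argv) > 1 else "http://127.0.0.1:8000/health"
    return 0 if probe(url) else 1


# In the real healthcheck.py, finish with: raise SystemExit(main(sys.argv))
```

The task definition would then use something like ["CMD", "python", "healthcheck.py"] as the health check command; a non-zero exit code is what ECS counts as a failure.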
Quick Reference: What Goes Where
| What | FastAPI | Next.js |
|---|---|---|
| Liveness health endpoint | GET /health → 200 | GET /api/health → 200 |
| Readiness check (monitoring only) | GET /ready → checks DB | Custom Route Handler |
| SIGTERM handling | Uvicorn via exec form + lifespan shutdown | next start via exec form + optional process.on('SIGTERM') |
| Graceful-shutdown timeout | --timeout-graceful-shutdown 30 | Task stop timeout only |
| Entrypoint form | ["uvicorn", "app.main:app", ...] | ["node", "server.js"] |
Summary
Get these five things right — exec-form entrypoints, fast dedicated liveness endpoints, separate readiness checks, consistent timeout values (stopTimeout > deregistrationDelay), and container-level health checks — and your ECS deployments will be genuinely zero-downtime. The failure modes are predictable; the fixes are configuration, not code. The teams that still drop traffic at deploy time are almost always missing one of these five.
Damian Hodgkiss
Senior Staff Engineer at Sumo Group, leading development of AppSumo marketplace. Technical solopreneur with 25+ years of experience building SaaS products.