AWS ECS Timeout: Task Failed ELB Health Checks & Container Startup Timeouts — Complete Fix Guide
Fix AWS ECS timeout errors including failed ELB health checks, container startup timeouts, and deployment stalls with step-by-step CLI commands and config fixes
- ECS timeouts most often stem from misconfigured ALB target group health check paths, intervals, or grace periods — not the application itself
- Container startup timeouts occur when the app takes longer to bind to its port than the health check unhealthy threshold allows; increase the health check grace period or tune startup logic
- Security groups that block the ALB's traffic to the container port are the silent killer — always verify ingress rules on the task's security group allow the ALB security group on the container port
- stopTimeout defaults to 30 seconds; if your app needs a graceful drain window longer than that, ECS will SIGKILL it before it finishes shutting down
- Fargate tasks failing to pull images due to NAT/VPC endpoint misconfiguration produce timeout-like symptoms that are easy to confuse with health check failures
| Method | When to Use | Estimated Time | Risk |
|---|---|---|---|
| Increase ALB health check grace period | App starts slowly; tasks are healthy but deregistered before they finish booting | 5 min (console or CLI) | Low — no code change required |
| Tune health check path, interval & threshold | Health check path returns non-200 or check fires too aggressively | 10–15 min | Low — ALB change only |
| Fix security group ingress rules | ALB cannot reach container port; all tasks marked unhealthy immediately | 5–10 min | Low — additive rule change |
| Increase task stopTimeout | Graceful shutdown takes longer than 30 s; data loss on scale-in or deploy | 5 min (task definition update) | Low — requires new task def revision |
| Switch to awsvpc + VPC endpoints | Fargate pull timeouts in private subnets without NAT | 30–60 min | Medium — network change |
| Enable ECS Exec for live debugging | Need to inspect running container to find why app is not binding | 10 min setup | Low — read-only investigation |
| Optimize container startup (lazy init, readiness probe) | App genuinely takes 60+ s to become ready; cannot extend grace period further | Hours — code change | Medium — requires release cycle |
Understanding AWS ECS Timeout Errors
AWS ECS surfaces timeout failures in several distinct but related forms. Understanding which layer is timing out is the first step to fixing the problem fast. The three most common error surfaces are:
- ELB health check timeouts — the Application Load Balancer (ALB) or Network Load Balancer (NLB) marks targets unhealthy because it cannot get a successful HTTP response within the configured threshold.
- Container startup / deployment timeouts — the ECS service rolls back a deployment because new tasks never transition to a RUNNING+HEALTHY state within the healthCheckGracePeriodSeconds window.
- Container stop / drain timeouts — ECS sends SIGTERM during a scale-in or rolling deploy, then forcefully SIGKILLs the task after stopTimeout seconds if the container has not exited.
Exact error messages you will see in the ECS console or CloudWatch Logs:
(service my-service) (task abc123) failed to register targets in
(target-group arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc):
The following targets are not in a valid state for attachment to a load balancer: i-0abc
(service my-service) tasks are failing ELB health checks in
(target-group arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc)
cancel reason: Task failed to start
DockerTimeoutError: Could not transition to started; timed out after waiting 3m0s
ResourceInitializationError: unable to pull secrets or registry auth:
The task cannot pull image; max attempts exceeded. RequestCanceled: request context canceled
Step 1: Identify the Exact Timeout Layer
Before changing anything, determine which layer is generating the timeout. Run these commands in sequence:
# 1. List recent stopped tasks and their stopped reasons
aws ecs list-tasks \
--cluster my-cluster \
--service-name my-service \
--desired-status STOPPED \
--query 'taskArns' --output text
# 2. Describe stopped tasks for stoppedReason
aws ecs describe-tasks \
--cluster my-cluster \
--tasks <task-arn-1> <task-arn-2> \
--query 'tasks[*].{id:taskArn,status:lastStatus,reason:stoppedReason,containers:containers[*].{name:name,reason:reason,exitCode:exitCode}}'
# 3. Check service events for health check failure messages
aws ecs describe-services \
--cluster my-cluster \
--services my-service \
--query 'services[0].events[:10]'
# 4. Check ALB target health directly
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc \
--query 'TargetHealthDescriptions[*].{target:Target,health:TargetHealth}'
If stoppedReason contains Task failed to start, the issue is pre-health-check — usually an image pull failure, missing secret, or a container that crashes on boot. If the service events mention failing ELB health checks, proceed to Step 2.
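The triage above can be done mechanically when you have many stopped tasks. A minimal sketch — the classify_stop_reason helper and its match patterns are illustrative, not an official ECS taxonomy:

```shell
# Hypothetical helper: map a stoppedReason string onto one of the three
# timeout layers described above. Patterns are illustrative, not exhaustive.
classify_stop_reason() {
  case "$1" in
    *"failed to start"*|*DockerTimeoutError*|*ResourceInitializationError*)
      echo "pre-health-check (image pull / boot failure)" ;;
    *"ELB health check"*)
      echo "ELB health check failure" ;;
    *"Scaling activity"*|*"deployment"*)
      echo "deploy/scale-in stop" ;;
    *)
      echo "unknown: inspect container exitCode and reason" ;;
  esac
}

classify_stop_reason "DockerTimeoutError: Could not transition to started; timed out after waiting 3m0s"
# -> pre-health-check (image pull / boot failure)
```

Feed it each stoppedReason value from the describe-tasks output in command 2 above to sort tasks into buckets before touching any configuration.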
Step 2: Fix ALB Health Check Configuration
The most common cause of ECS timeout complaints is an ALB target group misconfigured for the application's actual startup behavior.
Check current health check settings:
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc \
--query 'TargetGroups[0].{Path:HealthCheckPath,Protocol:HealthCheckProtocol,Port:HealthCheckPort,IntervalSeconds:HealthCheckIntervalSeconds,TimeoutSeconds:HealthCheckTimeoutSeconds,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount}'
Common misconfiguration patterns:
| Symptom | Likely Setting | Recommended Fix |
|---|---|---|
| Tasks marked unhealthy immediately | Grace period = 0 | Set healthCheckGracePeriodSeconds to 60–120 |
| Path returns 404 | Wrong health check path | Update path to /health or /actuator/health |
| Timeout before app binds | HealthCheckTimeoutSeconds too low | Increase to 10–15 s |
| Two failed checks kill task | UnhealthyThresholdCount = 2 | Increase to 3–5 for slow starters |
Apply fixes via CLI:
# Update the health check path and thresholds on the target group
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc \
--health-check-path /health \
--health-check-interval-seconds 30 \
--health-check-timeout-seconds 10 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 5
# Increase the ECS service health check grace period
aws ecs update-service \
--cluster my-cluster \
--service my-service \
--health-check-grace-period-seconds 120
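After applying the changes, it helps to watch the target group until every target passes. A sketch, assuming TG_ARN holds your target group ARN; all_healthy is an ad-hoc helper, not an AWS CLI feature:

```shell
# all_healthy: succeeds only when every line on stdin is exactly "healthy"
all_healthy() {
  ! grep -qv '^healthy$'
}

# Poll every 15 s until all targets report healthy (requires AWS credentials):
# until aws elbv2 describe-target-health --target-group-arn "$TG_ARN" \
#       --query 'TargetHealthDescriptions[*].TargetHealth.State' --output text \
#       | tr '\t' '\n' | all_healthy; do
#   echo "waiting for targets to pass health checks..."
#   sleep 15
# done
```

Possible states are healthy, initial, unhealthy, unused, draining, and unavailable, so checking for anything other than healthy is the safe test.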
Step 3: Verify Security Group Rules
If tasks register with the target group but are immediately marked unhealthy, the ALB probe is almost certainly being blocked by security groups. With awsvpc networking (the only mode on Fargate), each task runs behind its own ENI with its own security group, separate from the ALB's — the ALB can reach the task only if the task's security group explicitly allows it.
# Find the security group attached to your ECS tasks
aws ecs describe-tasks \
--cluster my-cluster \
--tasks <task-arn> \
--query 'tasks[0].attachments[0].details'
# Get the ENI ID from output, then find its security groups
aws ec2 describe-network-interfaces \
--network-interface-ids eni-0abc123 \
--query 'NetworkInterfaces[0].Groups'
# Verify the task security group allows inbound from ALB security group
aws ec2 describe-security-groups \
--group-ids sg-task123 \
--query 'SecurityGroups[0].IpPermissions'
The task's security group must allow inbound TCP on the container port from the ALB's security group (not a CIDR range — use the ALB SG ID as the source):
# Add the correct ingress rule (replace sg-alb456 and port 8080)
aws ec2 authorize-security-group-ingress \
--group-id sg-task123 \
--protocol tcp \
--port 8080 \
--source-group sg-alb456
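Re-running that command against an existing rule fails with InvalidPermission.Duplicate, so in automation it is worth guarding it. A sketch, reusing the placeholder IDs sg-task123 / sg-alb456 and port 8080; contains_sg is an ad-hoc helper:

```shell
# contains_sg: check whether a space-separated SG list contains a given SG ID
contains_sg() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Idempotent guard (requires AWS credentials):
# ALLOWED=$(aws ec2 describe-security-groups --group-ids sg-task123 \
#   --query 'SecurityGroups[0].IpPermissions[?FromPort==`8080`].UserIdGroupPairs[].GroupId' \
#   --output text | tr '\t\n' '  ')
# if ! contains_sg "$ALLOWED" sg-alb456; then
#   aws ec2 authorize-security-group-ingress --group-id sg-task123 \
#     --protocol tcp --port 8080 --source-group sg-alb456
# fi
```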
Step 4: Fix Container Stop Timeouts
When ECS needs to stop a task (rolling deploy, scale-in, spot reclamation), it sends SIGTERM and waits stopTimeout seconds before sending SIGKILL. The default is 30 seconds. If your application needs longer for graceful shutdown (draining connections, flushing queues), override this in the task definition:
{
  "family": "my-task",
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "my-repo/my-app:latest",
      "stopTimeout": 120,
      "portMappings": [{"containerPort": 8080}]
    }
  ]
}
Register the new revision and force a service update:
aws ecs register-task-definition --cli-input-json file://task-def.json
aws ecs update-service \
--cluster my-cluster \
--service my-service \
--task-definition my-task:NEW_REVISION
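A longer stopTimeout only helps if the process actually receives and handles SIGTERM. When the container starts through a shell wrapper, the wrapper is PID 1 and must forward the signal. A minimal entrypoint sketch — my-app is a placeholder for your real server binary:

```shell
#!/bin/sh
# Entrypoint sketch: forward SIGTERM to the app and wait for a clean exit,
# so the stopTimeout window is actually used for draining.
shutdown() {
  echo "SIGTERM received, draining..."
  kill -TERM "$APP_PID" 2>/dev/null
  wait "$APP_PID"       # block until the app exits on its own
  exit 0
}
trap shutdown TERM

my-app &                # placeholder: your real server process
APP_PID=$!
wait "$APP_PID"         # keep PID 1 alive and responsive to signals
```

Without the trap, a shell PID 1 ignores SIGTERM by default and the task sits idle until ECS escalates to SIGKILL at the stopTimeout deadline.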
Step 5: Fix Fargate Image Pull Timeouts in Private Subnets
If tasks in private subnets fail with DockerTimeoutError or ResourceInitializationError, the task cannot reach ECR, Secrets Manager, or CloudWatch Logs endpoints. Solutions in order of preference:
- Add VPC Interface Endpoints for ecr.api, ecr.dkr, logs, and secretsmanager, plus an S3 Gateway endpoint (ECR stores image layers in S3).
- Add a NAT Gateway to the private subnet's route table.
- Move tasks to public subnets (assign public IPs) — acceptable for non-production.
# Create ECR API endpoint (repeat for ecr.dkr, logs, secretsmanager)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-abc123 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.ecr.api \
--subnet-ids subnet-private1 subnet-private2 \
--security-group-ids sg-vpc-endpoints \
--private-dns-enabled
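Interface endpoints alone are not sufficient for image pulls: ECR serves image layers from S3, so the private subnet's route table also needs an S3 Gateway endpoint. A sketch, where vpc-abc123 and rtb-private1 are placeholder IDs matching the example above:

```shell
# Build the region-scoped S3 service name, then create the Gateway endpoint.
REGION=us-east-1
S3_SERVICE="com.amazonaws.${REGION}.s3"

# Requires AWS credentials; vpc-abc123 / rtb-private1 are placeholders:
# aws ec2 create-vpc-endpoint \
#   --vpc-id vpc-abc123 \
#   --vpc-endpoint-type Gateway \
#   --service-name "$S3_SERVICE" \
#   --route-table-ids rtb-private1
```

Gateway endpoints are free and route-table scoped, which is why they need --route-table-ids rather than the subnet and security group arguments used by the Interface endpoints.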
Step 6: Enable ECS Exec for Live Container Debugging
If the above steps don't resolve the issue, connect directly to a running container to inspect what port the app is actually binding to:
# Enable ECS Exec on the service
aws ecs update-service \
--cluster my-cluster \
--service my-service \
--enable-execute-command
# Open a shell in the running container
aws ecs execute-command \
--cluster my-cluster \
--task <task-arn> \
--container my-app \
--interactive \
--command '/bin/sh'
# Inside the container — verify the port is actually bound
netstat -tlnp | grep LISTEN    # or, if netstat is absent: ss -tlnp
curl -v http://localhost:8080/health
This reveals whether the app is listening on the wrong interface (e.g., 127.0.0.1 instead of 0.0.0.0) or on the wrong port entirely — both of which cause ECS health checks to fail silently.
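The listen address in that output is the whole story: anything bound to loopback is unreachable from the ALB no matter how the target group is configured. A tiny helper (illustrative only, not part of any toolchain) encodes the rule:

```shell
# bound_externally: true unless the socket is bound to loopback only.
# Arg is a LISTEN address as printed by netstat/ss, e.g. "0.0.0.0:8080".
bound_externally() {
  case "$1" in
    127.*|"[::1]"*) return 1 ;;  # loopback only: the ALB can never reach it
    *)              return 0 ;;  # 0.0.0.0 / :: / a specific IP: reachable
  esac
}

bound_externally "0.0.0.0:8080"   && echo "ALB-reachable"
bound_externally "127.0.0.1:8080" || echo "loopback only: fix the bind address"
```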
Complete Diagnostic Script
The script below rolls the checks from the steps above into a single run:
#!/usr/bin/env bash
# ecs-timeout-diagnose.sh — run against a misbehaving ECS service
# Usage: CLUSTER=my-cluster SERVICE=my-service bash ecs-timeout-diagnose.sh
set -euo pipefail
CLUSTER=${CLUSTER:?Set CLUSTER env var}
SERVICE=${SERVICE:?Set SERVICE env var}
echo "=== [1] Service events (last 10) ==="
aws ecs describe-services \
--cluster "$CLUSTER" --services "$SERVICE" \
--query 'services[0].events[:10].[createdAt,message]' \
--output table
echo ""
echo "=== [2] Health check grace period ==="
aws ecs describe-services \
--cluster "$CLUSTER" --services "$SERVICE" \
--query 'services[0].healthCheckGracePeriodSeconds'
echo ""
echo "=== [3] Stopped tasks — stoppedReason ==="
STOPPED_TASKS=$(aws ecs list-tasks \
--cluster "$CLUSTER" --service-name "$SERVICE" \
  --desired-status STOPPED --max-items 5 \
--query 'taskArns' --output text)
if [ -n "$STOPPED_TASKS" ]; then
# shellcheck disable=SC2086
aws ecs describe-tasks \
--cluster "$CLUSTER" --tasks $STOPPED_TASKS \
--query 'tasks[*].{task:taskArn,reason:stoppedReason,containers:containers[*].{name:name,exit:exitCode,reason:reason}}' \
--output json
else
echo "No recently stopped tasks found."
fi
echo ""
echo "=== [4] Target group health ==="
TG_ARN=$(aws ecs describe-services \
--cluster "$CLUSTER" --services "$SERVICE" \
--query 'services[0].loadBalancers[0].targetGroupArn' --output text)
if [ "$TG_ARN" != "None" ] && [ -n "$TG_ARN" ]; then
aws elbv2 describe-target-health \
--target-group-arn "$TG_ARN" \
--query 'TargetHealthDescriptions[*].{id:Target.Id,port:Target.Port,state:TargetHealth.State,reason:TargetHealth.Reason,desc:TargetHealth.Description}' \
--output table
echo ""
echo "=== [5] Target group health check config ==="
aws elbv2 describe-target-groups \
--target-group-arns "$TG_ARN" \
--query 'TargetGroups[0].{Path:HealthCheckPath,IntervalSec:HealthCheckIntervalSeconds,TimeoutSec:HealthCheckTimeoutSeconds,HealthyN:HealthyThresholdCount,UnhealthyN:UnhealthyThresholdCount}' \
--output table
else
echo "No load balancer attached to this service."
fi
echo ""
echo "=== [6] Running task ENIs + security groups ==="
RUNNING_TASK=$(aws ecs list-tasks \
--cluster "$CLUSTER" --service-name "$SERVICE" \
  --desired-status RUNNING --max-items 1 \
--query 'taskArns[0]' --output text)
if [ "$RUNNING_TASK" != "None" ] && [ -n "$RUNNING_TASK" ]; then
ENI_ID=$(aws ecs describe-tasks \
--cluster "$CLUSTER" --tasks "$RUNNING_TASK" \
--query 'tasks[0].attachments[0].details[?name==`networkInterfaceId`].value' \
--output text)
if [ -n "$ENI_ID" ]; then
echo "ENI: $ENI_ID"
aws ec2 describe-network-interfaces \
--network-interface-ids "$ENI_ID" \
--query 'NetworkInterfaces[0].Groups[*].{id:GroupId,name:GroupName}' \
--output table
fi
else
echo "No running tasks found."
fi
echo ""
echo "=== Done. Review output above for misconfigurations. ==="
Error Medic Editorial
The Error Medic Editorial team comprises senior DevOps engineers and SREs with collective experience spanning AWS, GCP, and on-premises infrastructure. They specialize in container orchestration, distributed systems reliability, and translating cryptic cloud error messages into actionable fixes.
Sources
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-load-balancing.html
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_timeout
- https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html
- https://repost.aws/knowledge-center/ecs-fargate-tasks-health-checks