
AWS ECS Timeout: Task Failed ELB Health Checks & Container Startup Timeouts — Complete Fix Guide

Fix AWS ECS timeout errors including failed ELB health checks, container startup timeouts, and deployment stalls with step-by-step CLI commands and config fixes

Key Takeaways
  • ECS timeouts most often stem from misconfigured ALB target group health check paths, intervals, or grace periods — not the application itself
  • Container startup timeouts occur when the app takes longer to bind to its port than the health check unhealthy threshold allows; increase the health check grace period or tune startup logic
  • Security groups that block the ALB's traffic to the container port are the silent killer — always verify ingress rules on the task's security group allow the ALB security group on the container port
  • stopTimeout defaults to 30 seconds; if your app needs a graceful drain window longer than that ECS will SIGKILL before it finishes shutting down
  • Fargate tasks failing to pull images due to NAT/VPC endpoint misconfiguration produce timeout-like symptoms that are easy to confuse with health check failures
ECS Timeout Fix Approaches Compared
| Method | When to Use | Estimated Time | Risk |
| --- | --- | --- | --- |
| Increase ALB health check grace period | App starts slowly; tasks are healthy but deregistered before they finish booting | 5 min (console or CLI) | Low — no code change required |
| Tune health check path, interval & threshold | Health check path returns non-200 or check fires too aggressively | 10–15 min | Low — ALB change only |
| Fix security group ingress rules | ALB cannot reach container port; all tasks marked unhealthy immediately | 5–10 min | Low — additive rule change |
| Increase task stopTimeout | Graceful shutdown takes longer than 30 s; data loss on scale-in or deploy | 5 min (task definition update) | Low — requires new task def revision |
| Switch to awsvpc + VPC endpoints | Fargate pull timeouts in private subnets without NAT | 30–60 min | Medium — network change |
| Enable ECS Exec for live debugging | Need to inspect running container to find why app is not binding | 10 min setup | Low — read-only investigation |
| Optimize container startup (lazy init, readiness probe) | App genuinely takes 60+ s to become ready; cannot extend grace period further | Hours — code change | Medium — requires release cycle |

Understanding AWS ECS Timeout Errors

AWS ECS surfaces timeout failures in several distinct but related forms. Understanding which layer is timing out is the first step to fixing the problem fast. The three most common error surfaces are:

  1. ELB health check timeouts — the Application Load Balancer (ALB) or Network Load Balancer (NLB) marks targets unhealthy because it cannot get a successful HTTP response within the configured threshold.
  2. Container startup / deployment timeouts — the ECS service rolls back a deployment because new tasks never transition to a RUNNING+HEALTHY state within the healthCheckGracePeriodSeconds window.
  3. Container stop / drain timeouts — ECS sends SIGTERM during a scale-in or rolling deploy, then forcefully SIGKILLs the task after stopTimeout seconds if the container has not exited.

Exact error messages you will see in the ECS console or CloudWatch Logs:

(service my-service) (task abc123) failed to register targets in 
(target-group arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc): 
The following targets are not in a valid state for attachment to a load balancer: i-0abc

(service my-service) tasks are failing ELB health checks in 
(target-group arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc)

cancel reason: Task failed to start

DockerTimeoutError: Could not transition to started; timed out after waiting 3m0s

ResourceInitializationError: unable to pull secrets or registry auth: 
The task cannot pull image; max attempts exceeded. RequestCanceled: request context canceled

Step 1: Identify the Exact Timeout Layer

Before changing anything, determine which layer is generating the timeout. Run these commands in sequence:

# 1. List recent stopped tasks and their stopped reasons
aws ecs list-tasks \
  --cluster my-cluster \
  --service-name my-service \
  --desired-status STOPPED \
  --query 'taskArns' --output text

# 2. Describe stopped tasks for stoppedReason
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-arn-1> <task-arn-2> \
  --query 'tasks[*].{id:taskArn,status:lastStatus,reason:stoppedReason,containers:containers[*].{name:name,reason:reason,exitCode:exitCode}}'

# 3. Check service events for health check failure messages
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].events[:10]'

# 4. Check ALB target health directly
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc \
  --query 'TargetHealthDescriptions[*].{target:Target,health:TargetHealth}'

If stoppedReason contains Task failed to start, the issue is pre-health-check — usually an image pull failure, missing secret, or a container that crashes on boot. If the service events mention failing ELB health checks, proceed to Step 2.
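The triage above can be sketched as a tiny helper that buckets a stoppedReason string by the layer that timed out. The function name and match patterns below are illustrative, not an exhaustive list of ECS messages:

```shell
# Hypothetical triage helper: classify an ECS stoppedReason string.
classify_stop_reason() {
  case "$1" in
    *"failing ELB health checks"*|*"health checks"*) echo "health-check" ;;
    *CannotPullContainerError*|*ResourceInitializationError*) echo "image-pull" ;;
    *"Task failed to start"*|*DockerTimeoutError*) echo "startup" ;;
    *"Essential container in task exited"*) echo "app-crash" ;;
    *) echo "unknown" ;;
  esac
}

classify_stop_reason "Task failed to start"   # prints "startup"
```

Pipe the `stoppedReason` values from step 2 through this to see at a glance whether you are fighting the load balancer, the image pull, or the app itself.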


Step 2: Fix ALB Health Check Configuration

The most common cause of ECS timeout complaints is an ALB target group misconfigured for the application's actual startup behavior.

Check current health check settings:

aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc \
  --query 'TargetGroups[0].{Path:HealthCheckPath,Protocol:HealthCheckProtocol,Port:HealthCheckPort,IntervalSeconds:HealthCheckIntervalSeconds,TimeoutSeconds:HealthCheckTimeoutSeconds,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount}'

Common misconfiguration patterns:

| Symptom | Likely Setting | Recommended Fix |
| --- | --- | --- |
| Tasks marked unhealthy immediately | Grace period = 0 | Set healthCheckGracePeriodSeconds to 60–120 |
| Path returns 404 | Wrong health check path | Update path to /health or /actuator/health |
| Timeout before app binds | HealthCheckTimeoutSeconds too low | Increase to 10–15 s |
| Two failed checks kill task | UnhealthyThresholdCount = 2 | Increase to 3–5 for slow starters |
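A quick back-of-the-envelope check helps when tuning these numbers: ECS ignores unhealthy status during the grace period, and after that the target must fail roughly interval × unhealthy-threshold seconds of consecutive checks before it is deregistered. Using the values recommended below:

```shell
# Rough startup budget for a slow-starting task (approximate upper bound).
interval=30            # HealthCheckIntervalSeconds
unhealthy_threshold=5  # UnhealthyThresholdCount
grace_period=120       # healthCheckGracePeriodSeconds on the ECS service

budget=$(( grace_period + interval * unhealthy_threshold ))
echo "Worst-case time before a failing task is stopped: ~${budget}s"  # ~270s
```

If your app reliably binds its port inside that window, tune the thresholds; if not, you need the startup optimizations in the comparison table above.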

Apply fixes via CLI:

# Update the health check path and thresholds on the target group
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/my-tg/abc \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 5

# Increase the ECS service health check grace period
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --health-check-grace-period-seconds 120
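Separately from the ALB probe, ECS can run a container-level health check defined in the task definition; its startPeriod field gives slow starters breathing room before failures count against the container. A sketch of the relevant container-definition fragment (the path, port, and timings are assumptions to adapt):

```json
"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 60
}
```

Note this check runs inside the container via the agent, so it succeeds or fails independently of security groups and the ALB; comparing the two is a fast way to isolate a networking problem.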

Step 3: Verify Security Group Rules

If tasks are registered with the target group but immediately marked unhealthy, the ALB probe is almost certainly being blocked by security groups. With awsvpc networking (the only mode on Fargate), each task runs with its own ENI and its own security group, so rules attached to a container instance's security group do not apply to the task.

# Find the security group attached to your ECS tasks
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].attachments[0].details'

# Get the ENI ID from output, then find its security groups
aws ec2 describe-network-interfaces \
  --network-interface-ids eni-0abc123 \
  --query 'NetworkInterfaces[0].Groups'

# Verify the task security group allows inbound from ALB security group
aws ec2 describe-security-groups \
  --group-ids sg-task123 \
  --query 'SecurityGroups[0].IpPermissions'

The task's security group must allow inbound TCP on the container port from the ALB's security group (not a CIDR range — use the ALB SG ID as the source):

# Add the correct ingress rule (replace sg-alb456 and port 8080)
aws ec2 authorize-security-group-ingress \
  --group-id sg-task123 \
  --protocol tcp \
  --port 8080 \
  --source-group sg-alb456

Step 4: Fix Container Stop Timeouts

When ECS needs to stop a task (rolling deploy, scale-in, Spot reclamation), it sends SIGTERM and waits stopTimeout seconds before sending SIGKILL. The default is 30 seconds, and on Fargate the maximum is 120. If your application needs longer for graceful shutdown (draining connections, flushing queues), override this in the task definition:

{
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "my-repo/my-app:latest",
      "stopTimeout": 120,
      "portMappings": [{"containerPort": 8080}]
    }
  ]
}

Register the new revision and force a service update:

aws ecs register-task-definition --cli-input-json file://task-def.json
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-task:NEW_REVISION
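Raising stopTimeout only helps if the process actually reacts to SIGTERM. A minimal entrypoint sketch — the drain function is a hypothetical placeholder for your app's real shutdown hook, and the self-sent signal at the end simulates what ECS does on stop:

```shell
#!/usr/bin/env bash
# Sketch of a SIGTERM-aware entrypoint. ECS sends SIGTERM first, then
# SIGKILL after stopTimeout seconds, so cleanup must finish in that window.
drain() {
  echo "SIGTERM received, draining..."
  # stop accepting new work, flush queues, close pools, then exit cleanly
  exit 0
}
trap drain TERM

echo "app running (pid $$)"
# In a real entrypoint you would 'wait' on your server process here.
# For demonstration, send ourselves the SIGTERM that ECS would send:
kill -TERM $$
```

If your image runs the app via a shell script without a trap, PID 1 may swallow SIGTERM entirely; either trap it as above or run the app with exec so it receives signals directly.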

Step 5: Fix Fargate Image Pull Timeouts in Private Subnets

If tasks in private subnets fail with DockerTimeoutError or ResourceInitializationError, the task cannot reach ECR, Secrets Manager, or CloudWatch Logs endpoints. Solutions in order of preference:

  1. Add VPC Interface Endpoints for ecr.api, ecr.dkr, logs, and secretsmanager, plus an S3 Gateway Endpoint (ECR stores image layers in S3).
  2. Add a NAT Gateway to the private subnet's route table.
  3. Move tasks to public subnets (assign public IPs) — acceptable for non-production.

# Create ECR API endpoint (repeat for ecr.dkr, logs, secretsmanager)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-abc123 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-private1 subnet-private2 \
  --security-group-ids sg-vpc-endpoints \
  --private-dns-enabled
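ECR pulls fetch image layers from S3, so the interface endpoints alone are not sufficient; the private route tables also need an S3 Gateway Endpoint. A sketch with placeholder IDs:

```shell
# S3 Gateway endpoint for ECR layer downloads (no hourly charge;
# attaches to the private subnets' route table rather than an ENI)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-abc123 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-private1
```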

Step 6: Enable ECS Exec for Live Container Debugging

If the above steps don't resolve the issue, connect directly to a running container to inspect what port the app is actually binding to:

# Enable ECS Exec on the service
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --enable-execute-command

# Open a shell in the running container
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-arn> \
  --container my-app \
  --interactive \
  --command '/bin/sh'

# Inside the container — verify the port is actually bound
netstat -tlnp 2>/dev/null || ss -tlnp   # fall back to ss on slim images without net-tools
curl -v http://localhost:8080/health

This reveals whether the app is listening on the wrong interface (e.g., 127.0.0.1 instead of 0.0.0.0) or on the wrong port entirely — both of which cause ECS health checks to fail silently.

Bonus: All-in-One Diagnostic Script

The script below bundles the checks from Steps 1–3 into a single run:
#!/usr/bin/env bash
# ecs-timeout-diagnose.sh — run against a misbehaving ECS service
# Usage: CLUSTER=my-cluster SERVICE=my-service bash ecs-timeout-diagnose.sh

set -euo pipefail
CLUSTER=${CLUSTER:?Set CLUSTER env var}
SERVICE=${SERVICE:?Set SERVICE env var}

echo "=== [1] Service events (last 10) ==="
aws ecs describe-services \
  --cluster "$CLUSTER" --services "$SERVICE" \
  --query 'services[0].events[:10].[createdAt,message]' \
  --output table

echo ""
echo "=== [2] Health check grace period ==="
aws ecs describe-services \
  --cluster "$CLUSTER" --services "$SERVICE" \
  --query 'services[0].healthCheckGracePeriodSeconds'

echo ""
echo "=== [3] Stopped tasks — stoppedReason ==="
STOPPED_TASKS=$(aws ecs list-tasks \
  --cluster "$CLUSTER" --service-name "$SERVICE" \
  --desired-status STOPPED --max-results 5 \
  --query 'taskArns' --output text)

if [ -n "$STOPPED_TASKS" ]; then
  # shellcheck disable=SC2086
  aws ecs describe-tasks \
    --cluster "$CLUSTER" --tasks $STOPPED_TASKS \
    --query 'tasks[*].{task:taskArn,reason:stoppedReason,containers:containers[*].{name:name,exit:exitCode,reason:reason}}' \
    --output json
else
  echo "No recently stopped tasks found."
fi

echo ""
echo "=== [4] Target group health ==="
TG_ARN=$(aws ecs describe-services \
  --cluster "$CLUSTER" --services "$SERVICE" \
  --query 'services[0].loadBalancers[0].targetGroupArn' --output text)

if [ "$TG_ARN" != "None" ] && [ -n "$TG_ARN" ]; then
  aws elbv2 describe-target-health \
    --target-group-arn "$TG_ARN" \
    --query 'TargetHealthDescriptions[*].{id:Target.Id,port:Target.Port,state:TargetHealth.State,reason:TargetHealth.Reason,desc:TargetHealth.Description}' \
    --output table

  echo ""
  echo "=== [5] Target group health check config ==="
  aws elbv2 describe-target-groups \
    --target-group-arns "$TG_ARN" \
    --query 'TargetGroups[0].{Path:HealthCheckPath,IntervalSec:HealthCheckIntervalSeconds,TimeoutSec:HealthCheckTimeoutSeconds,HealthyN:HealthyThresholdCount,UnhealthyN:UnhealthyThresholdCount}' \
    --output table
else
  echo "No load balancer attached to this service."
fi

echo ""
echo "=== [6] Running task ENIs + security groups ==="
RUNNING_TASK=$(aws ecs list-tasks \
  --cluster "$CLUSTER" --service-name "$SERVICE" \
  --desired-status RUNNING --max-results 1 \
  --query 'taskArns[0]' --output text)

if [ "$RUNNING_TASK" != "None" ] && [ -n "$RUNNING_TASK" ]; then
  ENI_ID=$(aws ecs describe-tasks \
    --cluster "$CLUSTER" --tasks "$RUNNING_TASK" \
    --query 'tasks[0].attachments[0].details[?name==`networkInterfaceId`].value' \
    --output text)
  if [ -n "$ENI_ID" ]; then
    echo "ENI: $ENI_ID"
    aws ec2 describe-network-interfaces \
      --network-interface-ids "$ENI_ID" \
      --query 'NetworkInterfaces[0].Groups[*].{id:GroupId,name:GroupName}' \
      --output table
  fi
else
  echo "No running tasks found."
fi

echo ""
echo "=== Done. Review output above for misconfigurations. ==="

Error Medic Editorial

The Error Medic Editorial team comprises senior DevOps engineers and SREs with collective experience spanning AWS, GCP, and on-premises infrastructure. They specialize in container orchestration, distributed systems reliability, and translating cryptic cloud error messages into actionable fixes.
