Error Medic

Troubleshooting AWS ECS Timeout Errors: ResourceInitializationError & 504 Gateways

Comprehensive guide to fixing AWS ECS timeout errors, including ResourceInitializationError, ALB 504 Gateway Timeouts, and failing health checks.

Key Takeaways
  • Tasks stuck in PENDING usually indicate a missing NAT Gateway or VPC Endpoint preventing ECR image pulls.
  • ALB health check timeouts often happen when application startup exceeds the ECS service's healthCheckGracePeriodSeconds.
  • HTTP 504 Gateway Timeouts indicate the container took longer to respond than the ALB's configured idle timeout.
  • Missing IAM permissions (Task Execution Role) for Secrets Manager or ECR will cause container initialization timeouts.
Common ECS Timeout Signatures and Resolutions
| Error Signature | Primary Root Cause | Diagnostic Tool | Recommended Fix |
| --- | --- | --- | --- |
| ResourceInitializationError | Missing VPC route to ECR/Secrets | VPC Reachability Analyzer | Add NAT Gateway or VPC Endpoints |
| Task failed ELB health checks | App boot exceeds grace period | CloudWatch Metrics / ECS Events | Increase healthCheckGracePeriodSeconds |
| HTTP 504 Gateway Timeout | Slow backend processing | ALB Access Logs | Increase ALB idle timeout or optimize DB queries |
| CannotPullContainerError | IAM or network block | CloudTrail | Fix Task Execution Role permissions |

Understanding AWS ECS Timeout Errors

When working with Amazon Elastic Container Service (ECS), particularly on AWS Fargate, "timeout" is a symptom rather than a singular root cause. Because ECS is a deeply integrated service, a timeout can stem from networking constraints (VPC, Subnets, Security Groups), IAM permission boundaries, Load Balancer configurations, or the application layer itself.

As a DevOps engineer or SRE, your first step is categorizing the timeout. Did the task fail to start? Did it start but fail health checks? Or is it running fine, but clients are experiencing timeout errors?

Below, we break down the most common ECS timeout scenarios, the exact error messages you will encounter, and step-by-step resolution paths.


Scenario 1: The Provisioning Timeout (ResourceInitializationError)

The Symptom: Your ECS task transitions to the PENDING state and stays there for several minutes before finally transitioning to STOPPED.

Exact Error Messages:

  • ResourceInitializationError: unable to pull secrets or registry auth: execution resource missing
  • CannotPullContainerError: inspect image has been retried 1 time(s): failed to resolve reference
  • CannotPullContainerError: pull image manifest has been retried 1 time(s): error during connect: Get https://api.ecr... timeout

The Root Cause: The ECS agent (running on the underlying EC2 instance or Fargate fleet) must communicate with several AWS APIs to start your container: ECR (to pull the image), Secrets Manager/SSM (to pull environment variables), and CloudWatch (to configure logging). If it cannot reach these endpoints over the network, it will eventually time out and kill the task.

How to Fix It:

1. Check Subnet Routing (The #1 Culprit)
If your task is deployed in a Private Subnet, it has no direct route to the internet. To pull an image from ECR, it must either route through a NAT Gateway or use AWS PrivateLink (VPC Endpoints).

  • Fix A (NAT Gateway): Ensure your private subnet's route table has a route 0.0.0.0/0 pointing to a NAT Gateway located in a Public Subnet.
  • Fix B (VPC Endpoints): If you are running in a disconnected VPC (no internet access), you MUST provision the following interface VPC Endpoints:
    • com.amazonaws.<region>.ecr.api
    • com.amazonaws.<region>.ecr.dkr
    • com.amazonaws.<region>.logs (for CloudWatch)
    • com.amazonaws.<region>.secretsmanager (if using Secrets)
    • Crucial: You must also create a Gateway VPC Endpoint for S3 (com.amazonaws.<region>.s3), because ECR actually stores the image layers in S3.
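For a disconnected VPC, the endpoints above can be provisioned with a few AWS CLI calls. This is a minimal sketch: every ID below (vpc-0abc1234, the subnet, security group, and route table IDs) is a placeholder, and us-east-1 is an assumed region; substitute your own values.

```shell
# Sketch: create the VPC Endpoints a disconnected VPC needs for ECS/Fargate.
# All resource IDs are placeholders for illustration only.
REGION="us-east-1"
VPC_ID="vpc-0abc1234"

# Interface endpoints for ECR API, ECR Docker registry, CloudWatch Logs,
# and Secrets Manager (the last one only if you inject secrets).
for SVC in ecr.api ecr.dkr logs secretsmanager; do
  aws ec2 create-vpc-endpoint \
    --vpc-id "$VPC_ID" \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.${REGION}.${SVC}" \
    --subnet-ids subnet-0priv1 subnet-0priv2 \
    --security-group-ids sg-0endpoint \
    --private-dns-enabled
done

# Gateway endpoint for S3 -- required because ECR stores image layers in S3.
aws ec2 create-vpc-endpoint \
  --vpc-id "$VPC_ID" \
  --vpc-endpoint-type Gateway \
  --service-name "com.amazonaws.${REGION}.s3" \
  --route-table-ids rtb-0priv
```

The endpoint security group (sg-0endpoint here) must allow inbound HTTPS (443) from the task's security group, or the interface endpoints will be unreachable.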

2. Check Security Groups
Ensure the Security Group attached to your ECS task allows outbound traffic (Egress) to 0.0.0.0/0 on port 443 (HTTPS). The ECS agent requires HTTPS to communicate with ECR and CloudWatch.
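You can confirm the egress rule from the CLI. In this sketch, sg-0task1234 is a placeholder for your task's security group ID:

```shell
# Sketch: inspect the task security group's egress rules.
# sg-0task1234 is a placeholder ID.
aws ec2 describe-security-groups \
  --group-ids sg-0task1234 \
  --query 'SecurityGroups[0].IpPermissionsEgress'

# If HTTPS egress is missing, add it:
aws ec2 authorize-security-group-egress \
  --group-id sg-0task1234 \
  --protocol tcp --port 443 --cidr 0.0.0.0/0
```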

3. Check Public IP Assignment
If your task is in a Public Subnet (using an Internet Gateway), Fargate tasks must have the Assign public IP setting set to ENABLED. Without a public IP, the task cannot route out through the Internet Gateway, resulting in a timeout.
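The current setting can be checked and corrected from the CLI; the cluster, service, subnet, and security group names below are placeholders:

```shell
# Sketch: check whether a Fargate service assigns public IPs.
aws ecs describe-services \
  --cluster my-production-cluster --services my-service \
  --query 'services[0].networkConfiguration.awsvpcConfiguration.assignPublicIp'

# Enable it and roll the tasks (placeholder subnet/SG IDs):
aws ecs update-service \
  --cluster my-production-cluster --service my-service \
  --network-configuration \
    "awsvpcConfiguration={subnets=[subnet-0pub1],securityGroups=[sg-0task1234],assignPublicIp=ENABLED}" \
  --force-new-deployment
```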


Scenario 2: The Health Check Timeout (ALB Target Group Failure)

The Symptom: The ECS task successfully pulls the image and enters the RUNNING state. However, 30 to 60 seconds later, it is terminated, and ECS attempts to start a new one. This loop continues indefinitely.

Exact Error Message:

  • Task failed ELB health checks in (target-group arn)

The Root Cause: Your application framework (e.g., Spring Boot, Django, heavy Node.js apps) takes a significant amount of time to initialize, connect to the database, run migrations, and bind to the port. Meanwhile, the Application Load Balancer (ALB) begins sending health check pings immediately. If the container doesn't respond with a 200 OK within the configured time, the ALB marks the target as UNHEALTHY and instructs ECS to kill and replace the task.

How to Fix It:

1. Increase the Health Check Grace Period
In your ECS Service definition, locate the healthCheckGracePeriodSeconds parameter. This tells ECS to ignore failing load balancer health checks for a specified duration after the task enters the RUNNING state.

  • Action: Increase this to 120 or 300 seconds, depending on your application's bootstrap time.
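This can be done in place without touching the task definition; cluster and service names below are placeholders:

```shell
# Sketch: give a slow-booting app 300 seconds before ALB health
# checks count against it. Names are placeholders.
aws ecs update-service \
  --cluster my-production-cluster \
  --service my-service \
  --health-check-grace-period-seconds 300
```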

2. Tune the ALB Health Check Interval and Threshold
Go to the EC2 Console -> Target Groups -> Health Checks.

  • Ensure the Timeout is reasonable (e.g., 5 seconds).
  • Ensure the Interval gives the app time to breathe (e.g., 30 seconds).
  • Check the Healthy threshold (e.g., 2) and Unhealthy threshold (e.g., 3).
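The same values can be applied via the CLI. The target group ARN below is a placeholder:

```shell
# Sketch: apply the timeout/interval/threshold values above.
# The ARN is a placeholder.
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123 \
  --health-check-timeout-seconds 5 \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```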

3. Verify Port Binding
A common "silent timeout" occurs when your application binds to 127.0.0.1 (localhost) instead of 0.0.0.0. The container will start, but health check traffic arriving on the container's ENI will never reach a listener, so the check fails. Ensure your web server configuration binds to 0.0.0.0.
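If ECS Exec is enabled on the service (enableExecuteCommand), you can inspect the listening sockets inside a running task; the cluster, task ID, and container name below are placeholders:

```shell
# Sketch: check which address the app is actually bound to.
# Requires ECS Exec; names and <task-id> are placeholders.
aws ecs execute-command \
  --cluster my-production-cluster \
  --task <task-id> \
  --container app \
  --interactive \
  --command "sh -c 'ss -ltn || netstat -ltn'"
# A listener on 127.0.0.1:8080 is localhost-only; you want
# 0.0.0.0:8080 (or *:8080) so the ALB can reach it.
```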


Scenario 3: The Client-Facing Timeout (HTTP 504 Gateway Timeout)

The Symptom: The ECS tasks are stable and running. Health checks are passing. However, when users or API clients send certain requests to the Load Balancer, they receive an HTTP 504 error after exactly 60 seconds.

Exact Error Message:

  • HTTP 504 Gateway Timeout

The Root Cause: The ALB successfully forwarded the request to your ECS container, but the container failed to return an HTTP response before the ALB's configured Idle Timeout was reached. The default ALB idle timeout is 60 seconds.

How to Fix It:

1. Increase the ALB Idle Timeout
If your application legitimately requires more than 60 seconds to process a request (e.g., generating a massive PDF report, complex data processing), you must increase the ALB's idle timeout.

  • Action: Go to EC2 Console -> Load Balancers -> Select your ALB -> Attributes -> Edit -> Set Idle timeout to your desired value (up to 4000 seconds).
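The same change via the CLI looks like this (the load balancer ARN is a placeholder, and 300 is an example value):

```shell
# Sketch: raise the ALB idle timeout to 300 seconds.
# The ARN is a placeholder.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --attributes Key=idle_timeout.timeout_seconds,Value=300
```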

2. Investigate Application Bottlenecks
If the request shouldn't take 60 seconds, an infrastructure change won't fix the root problem. You need to look at application metrics:

  • Are database connection pools exhausted, causing requests to queue?
  • Are downstream third-party APIs timing out, cascading the delay to your container?
  • Implement distributed tracing (e.g., AWS X-Ray or OpenTelemetry) to pinpoint exactly where the time is being spent inside the ECS task.

Scenario 4: The Container Stop Timeout

The Symptom: When deploying a new version of your service, the old tasks hang in the DEPROVISIONING state for a long time before finally terminating.

Exact Error Message:

  • ECS Event Log: Stopped reason: Stop timeout

The Root Cause: When ECS decides to stop a task (due to a deployment, scaling in, or failing health checks), it sends a SIGTERM signal to the container. The application is supposed to catch this signal, finish in-flight requests gracefully, and exit. If the application ignores SIGTERM, ECS waits for a specific duration (default 30 seconds) before sending a hard SIGKILL.

How to Fix It:

Ensure your application framework handles SIGTERM gracefully: catch the signal, stop accepting new work, finish in-flight requests, then exit. If your application legitimately needs more time to drain long-running WebSocket connections or background jobs, configure the stopTimeout parameter in your ECS Task Definition (container definitions section) to extend the wait time, up to 120 seconds on Fargate.
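For a container whose entrypoint is a shell script, a minimal sketch of graceful SIGTERM handling looks like this (the worker loop and log messages are illustrative placeholders):

```shell
#!/usr/bin/env bash
# Sketch: a bash entrypoint that traps SIGTERM so ECS can stop the
# task gracefully instead of waiting out stopTimeout and sending SIGKILL.

graceful_shutdown() {
  echo "SIGTERM received: draining connections..."
  # Finish in-flight work here before exiting.
  exit 0
}
trap graceful_shutdown TERM

echo "app started (PID $$)"
while true; do
  # Run sleep in the background and 'wait' on it: unlike a foreground
  # child, 'wait' returns immediately when a trapped signal arrives,
  # so the handler fires without delay.
  sleep 1 &
  wait $!
done
```

The `sleep 1 & wait $!` pattern matters: bash defers trap handlers while a foreground child is running, so a plain `sleep` would delay shutdown.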

Quick Diagnostic Script

```bash
#!/usr/bin/env bash
# A bash script to quickly diagnose ECS task timeout reasons
# Requirements: AWS CLI configured with appropriate permissions
set -euo pipefail

CLUSTER_NAME="my-production-cluster"

# 1. Find the most recently stopped task
TASK_ARN=$(aws ecs list-tasks \
  --cluster "$CLUSTER_NAME" \
  --desired-status STOPPED \
  --max-results 1 \
  --query 'taskArns[0]' \
  --output text)

if [ "$TASK_ARN" = "None" ]; then
    echo "No stopped tasks found in cluster $CLUSTER_NAME."
    exit 0
fi

echo "Analyzing stopped task: $TASK_ARN"

# 2. Extract the exact stop reason and container exit codes
aws ecs describe-tasks \
  --cluster "$CLUSTER_NAME" \
  --tasks "$TASK_ARN" \
  --query 'tasks[0].{StopReason: stoppedReason, ContainerReason: containers[0].reason, ExitCode: containers[0].exitCode}' \
  --output table

# 3. Check for specific timeout keywords
STOP_REASON=$(aws ecs describe-tasks --cluster "$CLUSTER_NAME" --tasks "$TASK_ARN" --query 'tasks[0].stoppedReason' --output text)

if [[ "$STOP_REASON" == *"ELB health checks"* ]]; then
    echo "[DIAGNOSIS]: Task failed ALB health checks. Consider increasing 'healthCheckGracePeriodSeconds' in your service."
elif [[ "$STOP_REASON" == *"ResourceInitializationError"* ]]; then
    echo "[DIAGNOSIS]: Networking/IAM failure. Check NAT Gateway, VPC Endpoints, or Task Execution Role permissions."
fi
```

Error Medic Editorial

Error Medic Editorial is a collective of senior Site Reliability Engineers and DevOps practitioners dedicated to solving complex cloud infrastructure issues and maintaining high-availability architectures.

