Error Medic

Troubleshooting AWS ECS Timeout Errors: ResourceInitializationError & 504 Gateways

Comprehensive guide to fixing AWS ECS timeout errors, including ResourceInitializationError, ALB 504 Gateway Timeouts, and failing health checks.

Key Takeaways
  • Tasks stuck in PENDING usually indicate a missing NAT Gateway or VPC Endpoint preventing ECR image pulls.
  • ALB health check timeouts often happen when application startup exceeds the ECS service's healthCheckGracePeriodSeconds.
  • HTTP 504 Gateway Timeouts indicate the container took longer to respond than the ALB's configured idle timeout.
  • Missing IAM permissions (Task Execution Role) for Secrets Manager or ECR will cause container initialization timeouts.
Common ECS Timeout Signatures and Resolutions
| Error Signature | Primary Root Cause | Diagnostic Tool | Recommended Fix |
| --- | --- | --- | --- |
| ResourceInitializationError | Missing VPC route to ECR/Secrets | VPC Reachability Analyzer | Add NAT Gateway or VPC Endpoints |
| Task failed ELB health checks | App boot exceeds grace period | CloudWatch Metrics / ECS Events | Increase healthCheckGracePeriodSeconds |
| HTTP 504 Gateway Timeout | Slow backend processing | ALB Access Logs | Increase ALB idle timeout or optimize DB queries |
| CannotPullContainerError | IAM or network block | CloudTrail | Fix Task Execution Role permissions |

Understanding AWS ECS Timeout Errors

When working with Amazon Elastic Container Service (ECS), particularly on AWS Fargate, "timeout" is a symptom rather than a singular root cause. Because ECS is a deeply integrated service, a timeout can stem from networking constraints (VPC, Subnets, Security Groups), IAM permission boundaries, Load Balancer configurations, or the application layer itself.

As a DevOps engineer or SRE, your first step is categorizing the timeout. Did the task fail to start? Did it start but fail health checks? Or is it running fine, but clients are experiencing timeout errors?

Below, we break down the most common ECS timeout scenarios, the exact error messages you will encounter, and step-by-step resolution paths.


Scenario 1: The Provisioning Timeout (ResourceInitializationError)

The Symptom: Your ECS task transitions to the PENDING state and stays there for several minutes before finally transitioning to STOPPED.

Exact Error Messages:

  • ResourceInitializationError: unable to pull secrets or registry auth: execution resource missing
  • CannotPullContainerError: inspect image has been retried 1 time(s): failed to resolve reference
  • CannotPullContainerError: pull image manifest has been retried 1 time(s): error during connect: Get https://api.ecr... timeout

The Root Cause: The ECS agent (running on the underlying EC2 instance or Fargate fleet) must communicate with several AWS APIs to start your container: ECR (to pull the image), Secrets Manager/SSM (to pull environment variables), and CloudWatch (to configure logging). If it cannot reach these endpoints over the network, it will eventually time out and kill the task.

How to Fix It:

1. Check Subnet Routing (The #1 Culprit)
If your task is deployed in a Private Subnet, it has no direct route to the internet. To pull an image from ECR, it must either route through a NAT Gateway or use AWS PrivateLink (VPC Endpoints).

  • Fix A (NAT Gateway): Ensure your private subnet's route table has a route 0.0.0.0/0 pointing to a NAT Gateway located in a Public Subnet.
  • Fix B (VPC Endpoints): If you are running in a disconnected VPC (no internet access), you MUST provision the following interface VPC Endpoints:
    • com.amazonaws.<region>.ecr.api
    • com.amazonaws.<region>.ecr.dkr
    • com.amazonaws.<region>.logs (for CloudWatch)
    • com.amazonaws.<region>.secretsmanager (if using Secrets)
    • Crucial: You must also create a Gateway VPC Endpoint for S3 (com.amazonaws.<region>.s3), because ECR actually stores the image layers in S3.
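For a disconnected VPC, the endpoints above can be provisioned with a few AWS CLI calls. This is a minimal sketch: every ID below (vpc-0abc1234, the subnet, security group, and route table IDs) is a placeholder, and us-east-1 is an assumed region; substitute your own values.

```shell
# Sketch: create the VPC Endpoints a disconnected VPC needs for ECS/Fargate.
# All resource IDs are placeholders for illustration only.
REGION="us-east-1"
VPC_ID="vpc-0abc1234"

# Interface endpoints for ECR API, ECR Docker registry, CloudWatch Logs,
# and Secrets Manager (the last one only if you inject secrets).
for SVC in ecr.api ecr.dkr logs secretsmanager; do
  aws ec2 create-vpc-endpoint \
    --vpc-id "$VPC_ID" \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.${REGION}.${SVC}" \
    --subnet-ids subnet-0priv1 subnet-0priv2 \
    --security-group-ids sg-0endpoint \
    --private-dns-enabled
done

# Gateway endpoint for S3 -- required because ECR stores image layers in S3.
aws ec2 create-vpc-endpoint \
  --vpc-id "$VPC_ID" \
  --vpc-endpoint-type Gateway \
  --service-name "com.amazonaws.${REGION}.s3" \
  --route-table-ids rtb-0priv
```

The endpoint security group (sg-0endpoint here) must allow inbound HTTPS (443) from the task's security group, or the interface endpoints will be unreachable.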

2. Check Security Groups
Ensure the Security Group attached to your ECS task allows outbound traffic (Egress) to 0.0.0.0/0 on port 443 (HTTPS). The ECS agent requires HTTPS to communicate with ECR and CloudWatch.
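You can confirm the egress rule from the CLI. In this sketch, sg-0task1234 is a placeholder for your task's security group ID:

```shell
# Sketch: inspect the task security group's egress rules.
# sg-0task1234 is a placeholder ID.
aws ec2 describe-security-groups \
  --group-ids sg-0task1234 \
  --query 'SecurityGroups[0].IpPermissionsEgress'

# If HTTPS egress is missing, add it:
aws ec2 authorize-security-group-egress \
  --group-id sg-0task1234 \
  --protocol tcp --port 443 --cidr 0.0.0.0/0
```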

3. Check Public IP Assignment
If your task is in a Public Subnet (using an Internet Gateway), Fargate tasks must have the Assign public IP setting set to ENABLED. Without a public IP, the task cannot route out through the Internet Gateway, resulting in a timeout.
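The current setting can be checked and corrected from the CLI; the cluster, service, subnet, and security group names below are placeholders:

```shell
# Sketch: check whether a Fargate service assigns public IPs.
aws ecs describe-services \
  --cluster my-production-cluster --services my-service \
  --query 'services[0].networkConfiguration.awsvpcConfiguration.assignPublicIp'

# Enable it and roll the tasks (placeholder subnet/SG IDs):
aws ecs update-service \
  --cluster my-production-cluster --service my-service \
  --network-configuration \
    "awsvpcConfiguration={subnets=[subnet-0pub1],securityGroups=[sg-0task1234],assignPublicIp=ENABLED}" \
  --force-new-deployment
```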


Scenario 2: The Health Check Timeout (ALB Target Group Failure)

The Symptom: The ECS task successfully pulls the image and enters the RUNNING state. However, 30 to 60 seconds later, it is terminated, and ECS attempts to start a new one. This loop continues indefinitely.

Exact Error Message:

  • Task failed ELB health checks in (target-group arn)

The Root Cause: Your application framework (e.g., Spring Boot, Django, heavy Node.js apps) takes a significant amount of time to initialize, connect to the database, run migrations, and bind to the port. Meanwhile, the Application Load Balancer (ALB) begins sending health check pings immediately. If the container doesn't respond with a 200 OK within the configured time, the ALB marks the target as UNHEALTHY and instructs ECS to kill and replace the task.

How to Fix It:

1. Increase the Health Check Grace Period
In your ECS Service definition, locate the healthCheckGracePeriodSeconds parameter. This tells ECS to ignore failing load balancer health checks for a specified duration after the task enters the RUNNING state.

  • Action: Increase this to 120 or 300 seconds, depending on your application's bootstrap time.
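This can be done in place without touching the task definition; cluster and service names below are placeholders:

```shell
# Sketch: give a slow-booting app 300 seconds before ALB health
# checks count against it. Names are placeholders.
aws ecs update-service \
  --cluster my-production-cluster \
  --service my-service \
  --health-check-grace-period-seconds 300
```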

2. Tune the ALB Health Check Interval and Threshold
Go to the EC2 Console -> Target Groups -> Health Checks.

  • Ensure the Timeout is reasonable (e.g., 5 seconds).
  • Ensure the Interval gives the app time to breathe (e.g., 30 seconds).
  • Check the Healthy threshold (e.g., 2) and Unhealthy threshold (e.g., 3).
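The same values can be applied via the CLI. The target group ARN below is a placeholder:

```shell
# Sketch: apply the timeout/interval/threshold values above.
# The ARN is a placeholder.
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123 \
  --health-check-timeout-seconds 5 \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```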

3. Verify Port Binding
A common "silent timeout" occurs when your application binds to 127.0.0.1 (localhost) instead of 0.0.0.0. The container will start, but health check traffic arriving on the container's ENI will never reach a listener, so the check fails. Ensure your web server configuration binds to 0.0.0.0.
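If ECS Exec is enabled on the service (enableExecuteCommand), you can inspect the listening sockets inside a running task; the cluster, task ID, and container name below are placeholders:

```shell
# Sketch: check which address the app is actually bound to.
# Requires ECS Exec; names and <task-id> are placeholders.
aws ecs execute-command \
  --cluster my-production-cluster \
  --task <task-id> \
  --container app \
  --interactive \
  --command "sh -c 'ss -ltn || netstat -ltn'"
# A listener on 127.0.0.1:8080 is localhost-only; you want
# 0.0.0.0:8080 (or *:8080) so the ALB can reach it.
```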


Scenario 3: The Client-Facing Timeout (HTTP 504 Gateway Timeout)

The Symptom: The ECS tasks are stable and running. Health checks are passing. However, when users or API clients send certain requests to the Load Balancer, they receive an HTTP 504 error after exactly 60 seconds.

Exact Error Message:

  • HTTP 504 Gateway Timeout

The Root Cause: The ALB successfully forwarded the request to your ECS container, but the container failed to return an HTTP response before the ALB's configured Idle Timeout was reached. The default ALB idle timeout is 60 seconds.

How to Fix It:

1. Increase the ALB Idle Timeout
If your application legitimately requires more than 60 seconds to process a request (e.g., generating a massive PDF report, complex data processing), you must increase the ALB's idle timeout.

  • Action: Go to EC2 Console -> Load Balancers -> Select your ALB -> Attributes -> Edit -> Set Idle timeout to your desired value (up to 4000 seconds).
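The same change via the CLI looks like this (the load balancer ARN is a placeholder, and 300 is an example value):

```shell
# Sketch: raise the ALB idle timeout to 300 seconds.
# The ARN is a placeholder.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --attributes Key=idle_timeout.timeout_seconds,Value=300
```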

2. Investigate Application Bottlenecks
If the request shouldn't take 60 seconds, an infrastructure change won't fix the root problem. You need to look at application metrics:

  • Are database connection pools exhausted, causing requests to queue?
  • Are downstream third-party APIs timing out, cascading the delay to your container?
  • Implement distributed tracing (e.g., AWS X-Ray or OpenTelemetry) to pinpoint exactly where the time is being spent inside the ECS task.

Scenario 4: The Container Stop Timeout

The Symptom: When deploying a new version of your service, the old tasks hang in the DEPROVISIONING state for a long time before finally terminating.

Exact Error Message:

  • ECS Event Log: Stopped reason: Stop timeout

The Root Cause: When ECS decides to stop a task (due to a deployment, scaling in, or failing health checks), it sends a SIGTERM signal to the container. The application is supposed to catch this signal, finish in-flight requests gracefully, and exit. If the application ignores SIGTERM, ECS waits for a specific duration (default 30 seconds) before sending a hard SIGKILL.

How to Fix It:

Ensure your application framework handles SIGTERM gracefully: catch the signal, stop accepting new work, finish in-flight requests, then exit. If your application legitimately needs more time to drain long-running WebSocket connections or background jobs, configure the stopTimeout parameter in your ECS Task Definition (container definitions section) to extend the wait time, up to 120 seconds on Fargate.
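For a container whose entrypoint is a shell script, a minimal sketch of graceful SIGTERM handling looks like this (the worker loop and log messages are illustrative placeholders):

```shell
#!/usr/bin/env bash
# Sketch: a bash entrypoint that traps SIGTERM so ECS can stop the
# task gracefully instead of waiting out stopTimeout and sending SIGKILL.

graceful_shutdown() {
  echo "SIGTERM received: draining connections..."
  # Finish in-flight work here before exiting.
  exit 0
}
trap graceful_shutdown TERM

echo "app started (PID $$)"
while true; do
  # Run sleep in the background and 'wait' on it: unlike a foreground
  # child, 'wait' returns immediately when a trapped signal arrives,
  # so the handler fires without delay.
  sleep 1 &
  wait $!
done
```

The `sleep 1 & wait $!` pattern matters: bash defers trap handlers while a foreground child is running, so a plain `sleep` would delay shutdown.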

Quick Diagnostic Script

```bash
#!/usr/bin/env bash
# A bash script to quickly diagnose ECS task timeout reasons
# Requirements: AWS CLI configured with appropriate permissions
set -euo pipefail

CLUSTER_NAME="my-production-cluster"

# 1. Find the most recently stopped task
TASK_ARN=$(aws ecs list-tasks \
  --cluster "$CLUSTER_NAME" \
  --desired-status STOPPED \
  --max-results 1 \
  --query 'taskArns[0]' \
  --output text)

if [ "$TASK_ARN" = "None" ]; then
    echo "No stopped tasks found in cluster $CLUSTER_NAME."
    exit 0
fi

echo "Analyzing stopped task: $TASK_ARN"

# 2. Extract the exact stop reason and container exit codes
aws ecs describe-tasks \
  --cluster "$CLUSTER_NAME" \
  --tasks "$TASK_ARN" \
  --query 'tasks[0].{StopReason: stoppedReason, ContainerReason: containers[0].reason, ExitCode: containers[0].exitCode}' \
  --output table

# 3. Check for specific timeout keywords
STOP_REASON=$(aws ecs describe-tasks --cluster "$CLUSTER_NAME" --tasks "$TASK_ARN" --query 'tasks[0].stoppedReason' --output text)

if [[ "$STOP_REASON" == *"ELB health checks"* ]]; then
    echo "[DIAGNOSIS]: Task failed ALB health checks. Consider increasing 'healthCheckGracePeriodSeconds' in your service."
elif [[ "$STOP_REASON" == *"ResourceInitializationError"* ]]; then
    echo "[DIAGNOSIS]: Networking/IAM failure. Check NAT Gateway, VPC Endpoints, or Task Execution Role permissions."
fi
```

Error Medic Editorial

Error Medic Editorial is a collective of senior Site Reliability Engineers and DevOps practitioners dedicated to solving complex cloud infrastructure issues and maintaining high-availability architectures.

