Troubleshooting AWS ECS 502 Bad Gateway and Timeout Errors
Fix AWS ECS 502 Bad Gateway and timeout errors by diagnosing ALB to ECS connection issues, health check failures, and security group misconfigurations.
- 502 Bad Gateway typically means the Application Load Balancer (ALB) cannot communicate with your ECS tasks.
- Common root cause 1: Security Groups are blocking traffic between the ALB and the ECS container instances or Fargate ENIs.
- Common root cause 2: The container application is crashing, not listening on 0.0.0.0, or listening on the wrong port.
- Common root cause 3: ALB idle timeout is shorter than the application's processing time, leading to premature connection drops.
- Quick fix summary: Verify Target Group health status, ensure the container listens on the mapped port across all interfaces, and check Security Group ingress rules from the ALB.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Update Security Groups | Target Group shows targets as 'Unhealthy' with connection timeouts. | 5 mins | Low |
| Fix Container Port/Host | Target Group shows 'Unhealthy' but SGs are correct; app works locally. | 15 mins | Medium |
| Adjust ALB Idle Timeout | Intermittent 502s during long-running requests or large uploads. | 5 mins | Low |
| Increase Task Resources | Tasks are being OOMKilled or thrashing CPU, causing unresponsiveness. | 10 mins | Medium |
Understanding the AWS ECS 502 Bad Gateway Error
When deploying applications on Amazon Elastic Container Service (ECS), whether backed by EC2 or AWS Fargate, the architecture typically involves an Application Load Balancer (ALB) routing traffic to your containers. A 502 Bad Gateway error occurs when the ALB attempts to proxy a request to your ECS task but receives an invalid response, or no response at all, from the target container.
Unlike a 503 Service Unavailable which often points to no healthy targets being available, or a 504 Gateway Timeout which strictly indicates the target took too long, a 502 in the AWS ecosystem usually points to a fundamental communication breakdown between the load balancer and the container.
Common Symptoms and Log Indicators
When this error occurs, you will likely see the following:
- Users receive an HTTP 502 status code in their browser.
- The ALB access logs show `502` in the `elb_status_code` field, but often `-` in the `target_status_code` field, meaning the request never reached the application.
- Target Groups in the EC2 console show targets transitioning rapidly between the `Initial`, `Unhealthy`, and `Draining` states.
- ECS Service events show tasks repeatedly starting and stopping (task flapping).
Step 1: Diagnose the Target Group Health
The first step in any ECS 502 investigation is checking the ALB Target Group. The Target Group acts as the bridge between the load balancer and the ECS tasks.
- Navigate to the EC2 Console -> Target Groups.
- Select the Target Group associated with your ECS service.
- Click the Targets tab.
Look at the Status and Status details columns.
- Health checks failed with these codes: [Connection refused]: The ALB reached the container, but nothing is listening on the expected port.
- Health checks failed with these codes: [Request timed out]: The ALB cannot reach the container at all. This is almost always a Security Group or VPC routing issue.
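The same status details can be pulled from the AWS CLI, which is handy when the console is flapping faster than you can refresh. This is a sketch: the target group ARN is a placeholder for your own, and the `--query` filter simply narrows the output to targets currently marked unhealthy.

```shell
# List only unhealthy targets, with the reason the health check failed.
# The target group ARN below is an example -- substitute your own.
aws elbv2 describe-target-health \
  --target-group-arn "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/6d0ecf831eec9f09" \
  --query "TargetHealthDescriptions[?TargetHealth.State=='unhealthy'].{Target:Target.Id,Port:Target.Port,Reason:TargetHealth.Reason,Detail:TargetHealth.Description}" \
  --output table
```

A `Reason` of `Target.Timeout` lines up with the Security Group scenario below, while `Target.FailedHealthChecks` with a connection-refused description points at the container itself.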
Step 2: Fix Security Group and VPC Misconfigurations
If the health checks are timing out, the network path between the ALB and the ECS task is blocked.
For Fargate Tasks: Fargate tasks each get their own Elastic Network Interface (ENI). The Security Group attached to the ECS Service must allow inbound traffic on the container port from the ALB's Security Group.
- Source: the ALB's Security Group ID (e.g., `sg-0abcd1234`)
- Port Range: the specific port your container listens on (e.g., `8080`)
- Protocol: TCP
For EC2-backed ECS Tasks: If using bridge networking with dynamic port mapping, each task is mapped to an ephemeral host port on the instance (typically 32768 - 65535). The EC2 instance's Security Group must allow inbound traffic from the ALB across the entire ephemeral port range.
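Both ingress rules can be added with the AWS CLI. This is a sketch, not a definitive setup: all of the security group IDs (`sg-0service1234`, `sg-0alb1234`, `sg-0instance1234`) and the port `8080` are placeholders for your own values.

```shell
# Fargate: allow the ALB's SG to reach the service's SG on the container port.
# sg-0service1234 = SG attached to the ECS service's task ENIs (placeholder)
# sg-0alb1234     = SG attached to the ALB (placeholder)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0service1234 \
  --protocol tcp \
  --port 8080 \
  --source-group sg-0alb1234

# EC2 bridge networking with dynamic ports: open the whole ephemeral range
# on the container instance's SG instead of a single port.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0instance1234 \
  --protocol tcp \
  --port 32768-65535 \
  --source-group sg-0alb1234
```

Referencing the ALB's Security Group as the source (rather than a CIDR block) keeps the rule correct even as the ALB's node IPs change.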
Step 3: Container Application Issues
If the Security Groups are correct, the issue usually lies within the container itself.
The '0.0.0.0' vs '127.0.0.1' Trap
A classic mistake is configuring the application (like a Node.js Express app, Python Flask/Django, or Go server) to listen on localhost or 127.0.0.1. Inside a Docker container, 127.0.0.1 refers only to the container's internal loopback interface. The ALB cannot reach it.
Fix: Ensure your application binds to 0.0.0.0.
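One way to confirm the bind address without redeploying is to run `ss` inside the running container via ECS Exec. This assumes ECS Exec is enabled on the service and that the container image includes `ss`; the cluster name, task ID, and container name `web` are placeholders.

```shell
# Exec into a running task's container and list listening TCP sockets.
# <task-id> and "web" are placeholders; ECS Exec must be enabled on the service.
aws ecs execute-command \
  --cluster my-ecs-cluster \
  --task <task-id> \
  --container web \
  --interactive \
  --command "ss -ltn"
```

If the output shows the app listening on `127.0.0.1:8080`, the ALB cannot reach it; you want `0.0.0.0:8080` (or `*:8080`).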
Premature Task Exits (Crashing)
If the application crashes immediately upon startup, the ALB will try to route traffic to it just as it dies, resulting in a 502. Check CloudWatch Logs for your ECS task to look for stack traces, missing environment variables, or database connection failures preventing startup.
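Assuming the task definition uses the `awslogs` driver, those startup logs can be tailed directly (requires AWS CLI v2; the log group name `/ecs/my-ecs-service` is an example, use whatever `awslogs-group` your task definition sets):

```shell
# Tail the last 15 minutes of this service's logs to catch startup stack traces.
# "/ecs/my-ecs-service" is a placeholder log group name.
aws logs tail /ecs/my-ecs-service --since 15m --format short
```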
Step 4: Addressing AWS ECS Timeout Errors
Sometimes, a 502 or 504 is intermittent, specifically occurring during heavy load or long-running requests.
ALB Idle Timeout: By default, the ALB has an idle timeout of 60 seconds. If your container takes 65 seconds to process a report generation request, the ALB will close the connection to the client and return a 504 (or sometimes a 502 if the target closes the connection abruptly after the timeout). You must increase the ALB's idle timeout attribute to match your application's maximum expected response time.
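The idle timeout is a load balancer attribute, so raising it is a one-line change. A minimal sketch, assuming a maximum response time under two minutes; the load balancer ARN is a placeholder.

```shell
# Raise the ALB idle timeout from the 60-second default to 120 seconds.
# The load balancer ARN below is an example -- substitute your own.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188" \
  --attributes Key=idle_timeout.timeout_seconds,Value=120
```

Pick a value just above your slowest legitimate request rather than an arbitrarily large one, since idle connections hold ALB resources.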
Keep-Alive Headers:
Ensure your application's HTTP keep-alive timeout is greater than the ALB's idle timeout. If the application closes the TCP connection while the ALB is still trying to send data, a 502 Bad Gateway will occur. For example, in Node.js, you might need to set `server.keepAliveTimeout = 65000;` and `server.headersTimeout = 66000;`.
Diagnostic Script
```bash
#!/bin/bash
# Diagnostic script: check ALB Target Group health and inspect the most recently stopped ECS task.
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/6d0ecf831eec9f09"
CLUSTER_NAME="my-ecs-cluster"
SERVICE_NAME="my-ecs-service"

echo "=== Checking Target Health ==="
aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" \
  | jq '.TargetHealthDescriptions[] | {Id: .Target.Id, State: .TargetHealth.State, Reason: .TargetHealth.Reason, Description: .TargetHealth.Description}'

echo -e "\n=== Finding Latest Stopped Task ==="
LATEST_TASK=$(aws ecs list-tasks --cluster "$CLUSTER_NAME" --service-name "$SERVICE_NAME" --desired-status STOPPED --max-items 1 | jq -r '.taskArns[0]')

if [ "$LATEST_TASK" != "null" ] && [ -n "$LATEST_TASK" ]; then
  echo "Found stopped task: $LATEST_TASK"
  echo "Fetching container exit codes and stop reasons..."
  aws ecs describe-tasks --cluster "$CLUSTER_NAME" --tasks "$LATEST_TASK" \
    | jq '.tasks[0] | {StopReason: .stoppedReason, Containers: [.containers[] | {Name: .name, ExitCode: .exitCode, Reason: .reason}]}'
else
  echo "No recently stopped tasks found."
fi
```

Error Medic Editorial
Error Medic Editorial is a collective of senior Site Reliability Engineers and Cloud Architects dedicated to demystifying complex infrastructure issues and providing actionable, real-world solutions.