AWS ALB 502 Bad Gateway & 504 Gateway Timeout: Complete Troubleshooting Guide
Fix AWS ALB 502 Bad Gateway and 504 timeout errors fast. Covers target health, keep-alive mismatches, idle timeout tuning, and exact CLI diagnostic commands.
- HTTP 502 means ALB received an invalid or malformed response from your backend target — the most common triggers are all-unhealthy target groups, keep-alive connection race conditions, and malformed HTTP responses from the application
- HTTP 504 Gateway Timeout means the ALB idle timeout (default 60 s) expired before the target sent a complete response — increase the timeout or implement async patterns for long-running work
- Enable ALB access logs immediately to get exact error_reason codes such as Target.ResponseCodeMismatch, Target.Timeout, and Target.ConnectionError before making any configuration changes; guessing without logs wastes hours
| Method | When to Use | Time to Apply | Risk |
|---|---|---|---|
| Enable ALB access logs | Always — first step before any fix | 2 min | None — read-only diagnostic |
| Increase idle timeout | 504 errors; backend needs >60 s to respond | 1 min CLI | Low — affects all connections on the ALB |
| Fix Nginx keep-alive settings | Intermittent 502 under load; error_reason Target.InvalidResponse | 5–15 min + deploy | Low — Nginx reload is zero-downtime |
| Fix Node.js keepAliveTimeout | Node/Express backends returning 502 under concurrent traffic | 5 min + deploy | Low |
| Tighten health check path/matcher | Unhealthy targets; TargetHealth.Reason FailedHealthChecks | 2 min | Medium — wrong matcher marks all hosts unhealthy |
| Add async job pattern | 504 errors on operations that cannot be optimized below timeout | Days | Low risk to ALB; requires app redesign |
| Fix Lambda response format | 502 on Lambda target groups; missing statusCode field | 10 min + deploy | Low |
Understanding AWS ALB 502 and 504 Errors
An Application Load Balancer sits between clients and your backend targets (EC2 instances, ECS containers, Lambda functions, or bare IP addresses). When clients receive HTTP 502 or 504, the problem almost always lies between the ALB and your targets, not in the ALB infrastructure itself.
HTTP 502 Bad Gateway is returned when the ALB successfully established a TCP connection to a target but received a response it could not proxy: a malformed HTTP response, an abruptly closed connection, or no response at all from a target that immediately disconnected.
HTTP 504 Gateway Timeout is returned when the ALB connected to the target but the target failed to send a complete HTTP response within the ALB idle timeout window. The default idle timeout is 60 seconds and measures silence on the wire — not total wall-clock request time.
Both errors look identical to end users but have distinct root causes and fixes. The fastest way to distinguish them precisely is ALB access logs.
Step 1: Enable ALB Access Logs and Read error_reason
Access logs are disabled by default. Enable them before doing anything else — every 502 and 504 line includes an error_reason field that names the exact failure mode.
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
--attributes Key=access_logs.s3.enabled,Value=true \
Key=access_logs.s3.bucket,Value=my-alb-logs-bucket \
Key=access_logs.s3.prefix,Value=alb
Once logs arrive in S3, filter for 502 and 504 entries. Access logs are space-delimited; the error_reason is the last field.
Common error_reason codes for 502 Bad Gateway:
| error_reason | Root Cause |
|---|---|
| Target.ResponseCodeMismatch | Health check returned a code outside the configured matcher range |
| Target.FailedHealthChecks | All targets unhealthy; ALB has nowhere to route |
| Target.InvalidResponse | Backend returned a malformed HTTP response (bad headers, wrong HTTP version) |
| Target.ConnectionError | ALB could not establish TCP to the target (security group, process not listening) |
| Target.Timeout | Target connected but did not return response headers within timeout |
Common codes for 504 Gateway Timeout:
| error_reason | Root Cause |
|---|---|
| Target.Timeout | Target exceeded the ALB idle timeout |
| Target.ConnectionRefused | Target port closed or firewall blocking ALB health-check or data path |
Step 2: Check Target Group Health
The most common cause of sustained 502 errors in production is a fully unhealthy target group. When no healthy targets exist, the ALB fails open and routes requests across all registered targets anyway; if those targets refuse connections or return malformed responses, every request comes back as a 502.
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/def456 \
--query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Desc:TargetHealth.Description}' \
--output table
Key TargetHealth.Reason values:
- Target.FailedHealthChecks — Your /health endpoint is not returning the expected HTTP status code, or the response is arriving after the health check timeout.
- Target.NotRegistered — The target deregistered (ASG scale-in, manual removal). Re-register or verify your ASG lifecycle hooks.
- Target.NotInUse — Target is in an Availability Zone the ALB is not enabled for. Enable the AZ on the ALB or enable cross-zone load balancing.
- Elb.InternalError — AWS-side issue. Open a support case.
Verify your health check configuration matches what the application actually serves:
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/def456 \
--query 'TargetGroups[0].{Path:HealthCheckPath,Port:HealthCheckPort,Matcher:Matcher,Timeout:HealthCheckTimeoutSeconds,Interval:HealthCheckIntervalSeconds}'
If the application takes 8 seconds to start responding to health checks but HealthCheckTimeoutSeconds is 5, every target will be marked unhealthy within seconds of launch.
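As a sanity check, the time for a failing target to be pulled from rotation is roughly the check interval multiplied by the unhealthy threshold. A quick sketch of that arithmetic:

```python
def seconds_until_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate time for the ALB to mark a failing target unhealthy:
    the target must fail `unhealthy_threshold` consecutive checks,
    spaced `interval_s` seconds apart."""
    return interval_s * unhealthy_threshold

# With the ALB target group defaults (30 s interval, 2 consecutive failures),
# a broken target leaves rotation in about a minute.
print(seconds_until_unhealthy(30, 2))  # → 60
```

The same math explains slow recovery: with a healthy threshold of 5, a recovered target needs five consecutive passing checks before it takes traffic again.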
Step 3: Fix Keep-Alive Connection Race Conditions (502)
Intermittent 502 errors under load — especially errors that appear on roughly 1–5% of requests and cannot be reproduced on a single request — are usually caused by a keep-alive mismatch between ALB and the backend.
The race condition: ALB uses HTTP/1.1 persistent connections to targets and aggressively reuses them. If your backend closes the connection immediately after sending a response (sending Connection: close) at the exact moment ALB sends the next request on that connection, ALB receives a reset TCP connection mid-request and emits 502.
Nginx fix (both the ALB-facing side and the upstream side):
upstream backend {
    server 127.0.0.1:8080;
    keepalive 32;            # maintain a pool of 32 idle keep-alive connections
    keepalive_timeout 65s;   # upstream-side (nginx >= 1.15.3); must exceed the app's own keep-alive timeout
}
server {
    keepalive_timeout 75s;   # ALB-facing side; must exceed the ALB idle timeout (default 60 s)
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";  # strip hop-by-hop header to enable upstream keep-alive
    }
}
Node.js / Express fix:
const app = require('express')();
const server = app.listen(3000);
// Both values MUST exceed the ALB idle timeout (default 60 s)
server.keepAliveTimeout = 65000; // ms — time to keep idle connection open
server.headersTimeout = 66000; // ms — must be strictly greater than keepAliveTimeout
This is the most overlooked fix for Node.js services behind ALB and resolves the majority of intermittent 502 patterns.
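The rule behind both fixes generalizes: the side closer to the client (the ALB) must always close an idle connection first. A small illustrative checker (the function and margin are our own sketch, not an AWS API):

```python
def keepalive_is_safe(alb_idle_timeout_s: float, backend_keepalive_s: float,
                      margin_s: float = 5.0) -> bool:
    """True when the backend holds idle connections open longer than the ALB
    will, so the ALB always closes first and never writes a new request into
    a socket the backend is tearing down."""
    return backend_keepalive_s >= alb_idle_timeout_s + margin_s

print(keepalive_is_safe(60, 65))  # True: the 65 s settings above are safe
print(keepalive_is_safe(60, 5))   # False: Node's default 5 s keepAliveTimeout races the ALB
```

If you raise the ALB idle timeout later (Step 4), re-check every backend against the new value; a backend tuned for 65 s becomes unsafe behind a 120 s ALB.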
Step 4: Increase the ALB Idle Timeout (504)
For HTTP 504 errors, start by measuring actual backend processing time from the target_processing_time field in access logs, then compare to your idle timeout setting.
# View current idle timeout
aws elbv2 describe-load-balancer-attributes \
--load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
--query 'Attributes[?Key==`idle_timeout.timeout_seconds`]'
# Increase to 120 seconds
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
--attributes Key=idle_timeout.timeout_seconds,Value=120
The maximum ALB idle timeout is 4000 seconds. However, simply increasing the timeout is a band-aid. For operations that genuinely take minutes, implement an asynchronous pattern: return HTTP 202 Accepted with a job ID immediately, and provide a polling endpoint the client can check. This eliminates timeout risk entirely regardless of processing duration.
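A framework-free sketch of that async pattern (the in-memory dict and names are illustrative; a production service would persist jobs in a database or queue and expose these functions as HTTP endpoints):

```python
import threading
import time
import uuid

jobs = {}  # in-memory job store; stands in for a database or queue

def submit_job(payload: str) -> str:
    """Return a job ID immediately; the HTTP layer would respond 202 Accepted.
    The slow work runs on a background worker, outside the request lifecycle."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    def run():
        time.sleep(0.1)  # stand-in for minutes of real work
        jobs[job_id] = {"status": "done", "result": payload.upper()}
    threading.Thread(target=run, daemon=True).start()
    return job_id

def poll_job(job_id: str) -> dict:
    """The client polls this endpoint until status is 'done'."""
    return jobs[job_id]

jid = submit_job("report-42")
print(poll_job(jid)["status"])  # "pending" immediately after submit
time.sleep(0.3)
print(poll_job(jid)["status"])  # "done" once the worker finishes
```

Because the 202 response returns in milliseconds, the ALB idle timeout never enters the picture, no matter how long the job runs.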
For streaming responses: If your application can send partial data (chunked transfer encoding), each sent chunk resets the idle timer. This allows indefinitely long streaming responses without triggering 504.
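In generator terms, that streaming behavior looks like the sketch below (framework-agnostic; any WSGI/ASGI server can serve such an iterator as a chunked response):

```python
import time

def report_chunks():
    """Yield output incrementally. When served with chunked transfer encoding,
    every chunk that reaches the ALB resets its idle timer, so total response
    time can far exceed idle_timeout.timeout_seconds."""
    for i in range(3):
        time.sleep(0.05)  # stand-in for slow work between chunks
        yield f"part {i}\n"

for chunk in report_chunks():
    print(chunk, end="")
```

The requirement is only that the gap between consecutive chunks stays under the idle timeout; buffering proxies between the app and the ALB can defeat this, so disable response buffering on any intermediate Nginx.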
Step 5: Verify Security Groups and Port Reachability
Security group misconfigurations cause Target.ConnectionError (502) or connection refusals. The ALB must have outbound access to targets and targets must allow inbound from the ALB — using security group references, not static IPs.
# Get the ALB's security group(s)
aws elbv2 describe-load-balancers \
--load-balancer-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
--query 'LoadBalancers[0].SecurityGroups'
# Get the target's security group(s) and check inbound rules
aws ec2 describe-security-groups \
--group-ids sg-TARGET_SG_ID \
--query 'SecurityGroups[0].IpPermissions'
The target security group inbound rules must reference the ALB security group ID (e.g., sg-xxxxxxxx) as the source — not the ALB's IP addresses, which change during scale events.
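You can audit this programmatically by feeding the IpPermissions JSON from the command above into a check like the following (the security group IDs and helper are our own illustration):

```python
import json

# Sample of the shape returned by:
#   aws ec2 describe-security-groups --query 'SecurityGroups[0].IpPermissions'
# The group IDs here are made up for illustration.
permissions = json.loads("""
[{"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
  "UserIdGroupPairs": [{"GroupId": "sg-0123abcd"}], "IpRanges": []}]
""")

def allows_alb(permissions: list, alb_sg_id: str, target_port: int) -> bool:
    """True if any inbound rule references the ALB security group as its
    source and covers the target port."""
    for rule in permissions:
        ports_ok = rule.get("FromPort", 0) <= target_port <= rule.get("ToPort", 65535)
        source_groups = {p["GroupId"] for p in rule.get("UserIdGroupPairs", [])}
        if ports_ok and alb_sg_id in source_groups:
            return True
    return False

print(allows_alb(permissions, "sg-0123abcd", 8080))  # True
print(allows_alb(permissions, "sg-ffffffff", 8080))  # False: ALB SG not referenced
```

A False result here lines up with Target.ConnectionError in the access logs: the TCP handshake is being dropped before the application ever sees the request.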
Step 6: Lambda Target Group 502 Errors
When the target type is lambda, HTTP 502 has additional causes specific to Lambda:
- Invalid response format: Lambda must return a JSON object with statusCode (integer), headers (object), and body (string). A missing or null statusCode causes ALB to return 502.
- Payload size limit: ALB response payloads from Lambda are capped at 1 MB. Larger responses produce 502.
- Lambda throttling: When Lambda concurrency is exhausted, ALB cannot invoke the function and returns 502. Check the Throttles metric in CloudWatch for the function.
Verify your Lambda handler returns the correct shape:
def handler(event, context):
    return {
        "statusCode": 200,  # required — must be an integer
        "statusDescription": "200 OK",
        "isBase64Encoded": False,
        "headers": {"Content-Type": "application/json"},
        "body": '{"status": "ok"}'
    }
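To guard against regressions, you can enforce that shape in unit tests with a small validator (this helper is our own sketch, not an AWS API):

```python
def valid_alb_lambda_response(resp) -> bool:
    """Minimal shape the ALB requires from a Lambda target: an integer
    statusCode and a string body. Anything else reaches the client as 502."""
    return (isinstance(resp, dict)
            and isinstance(resp.get("statusCode"), int)
            and isinstance(resp.get("body", ""), str))

print(valid_alb_lambda_response({"statusCode": 200, "body": "ok"}))  # True
print(valid_alb_lambda_response({"body": "ok"}))                     # False: missing statusCode → 502
```

Running the check against every handler return path in CI catches the missing-statusCode class of 502 before deploy rather than in production logs.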
Step 7: Monitor with CloudWatch Alarms
Once the immediate issue is resolved, set up proactive CloudWatch alarms so you catch regressions before users do:
aws cloudwatch put-metric-alarm \
--alarm-name "ALB-502-Spike" \
--metric-name HTTPCode_Target_5XX_Count \
--namespace AWS/ApplicationELB \
--statistic Sum \
--period 60 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=LoadBalancer,Value=app/my-alb/abc123 \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
Key metrics to track: HTTPCode_Target_5XX_Count, UnHealthyHostCount, TargetResponseTime (P99), and TargetConnectionErrorCount. A rising TargetResponseTime P99 is an early warning sign for 504 errors before they begin breaching the timeout threshold.
Putting It All Together: One-Shot Diagnostic Script
The script below bundles the diagnostics from the steps above into a single run:
#!/usr/bin/env bash
# AWS ALB 502/504 Diagnostic Script
# Usage: ALB_ARN="arn:aws:..." TG_ARN="arn:aws:..." bash alb-debug.sh
set -euo pipefail
REGION="${AWS_DEFAULT_REGION:-us-east-1}"
echo "=== ALB Attributes ==="
aws elbv2 describe-load-balancer-attributes \
--load-balancer-arn "$ALB_ARN" \
--query 'Attributes[?contains(`["idle_timeout.timeout_seconds","access_logs.s3.enabled"]`, Key)]' \
--output table
echo ""
echo "=== Target Group Health ==="
aws elbv2 describe-target-health \
--target-group-arn "$TG_ARN" \
--query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason}' \
--output table
echo ""
echo "=== Health Check Configuration ==="
aws elbv2 describe-target-groups \
--target-group-arns "$TG_ARN" \
--query 'TargetGroups[0].{Path:HealthCheckPath,Port:HealthCheckPort,Matcher:Matcher.HttpCode,Timeout:HealthCheckTimeoutSeconds,Interval:HealthCheckIntervalSeconds,Threshold:HealthyThresholdCount}' \
--output table
echo ""
echo "=== CloudWatch: 5xx counts last 30 min ==="
START=$(date -u -d '30 minutes ago' '+%Y-%m-%dT%H:%M:%SZ')  # GNU date; on macOS/BSD use: date -u -v-30M '+%Y-%m-%dT%H:%M:%SZ'
END=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
LB_SUFFIX=$(echo "$ALB_ARN" | sed 's|.*:loadbalancer/||')
for METRIC in HTTPCode_Target_5XX_Count HTTPCode_ELB_5XX_Count UnHealthyHostCount TargetConnectionErrorCount; do
echo -n "$METRIC: "
aws cloudwatch get-metric-statistics \
--namespace AWS/ApplicationELB \
--metric-name "$METRIC" \
--dimensions Name=LoadBalancer,Value="$LB_SUFFIX" \
--start-time "$START" --end-time "$END" \
--period 1800 --statistics Sum \
--query 'Datapoints[0].Sum' \
--output text
done
echo ""
echo "=== Recent access log 502/504 errors (requires jq + S3 log access) ==="
echo "Run: aws s3 cp s3://YOUR-BUCKET/alb-logs/$(date +%Y/%m/%d)/ . --recursive"
echo "Then: grep ' 502 \| 504 ' *.log | awk '{print \$NF}' | sort | uniq -c | sort -rn"

Error Medic Editorial
Error Medic Editorial is a team of senior SREs and cloud architects with combined experience operating high-traffic systems on AWS, GCP, and Azure. We write from production post-mortems, not documentation summaries.
Sources
- https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html
- https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html
- https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
- https://repost.aws/knowledge-center/elb-alb-troubleshoot-502-errors
- https://stackoverflow.com/questions/56366873/aws-alb-returns-502-bad-gateway-intermittently-under-load