
How to Fix AWS ALB 503 Service Temporarily Unavailable Errors

Fix AWS ALB 503 Service Unavailable errors by diagnosing target group health, security group rules, and backend capacity limits. A complete SRE troubleshooting guide.

Key Takeaways
  • Root Cause 1: Target group has zero registered targets available to process the request.
  • Root Cause 2: All registered backend targets are failing the ALB health checks and are marked unhealthy.
  • Root Cause 3: Backend service capacity is fully exhausted (e.g., connection limits reached on instances or containers).
  • Quick Fix: Validate target health in the EC2/Target Group console, verify Security Group ingress rules allow ALB traffic, and check backend application logs for startup failures.
Fix Approaches Compared
Method | When to Use | Time | Risk
Verify Target Health & Registration | Initial diagnosis for all 503 errors | 5 mins | Low
Audit Security Groups & NACLs | Post-deployment or infrastructure changes | 10 mins | Low
Scale Backend / Increase Capacity | High traffic spikes or connection exhaustion | 15 mins | Medium
Fix Application Startup/Crash Loops | Targets stuck in 'Initial' or immediately failing | 30+ mins | High

Understanding the AWS ALB 503 Error

When an AWS Application Load Balancer (ALB) returns an HTTP 503 Service Temporarily Unavailable error, it is explicitly telling the client that the load balancer cannot find a valid, healthy backend target to which it can forward the request. It is crucial to distinguish this from a 502 Bad Gateway (where the load balancer connected to the target, but the target returned an invalid or malformed response) and a 504 Gateway Timeout (where the load balancer connected, but the target took too long to respond).

The standard response generated directly by the ALB looks like this:

HTTP/1.1 503 Service Temporarily Unavailable
Content-Type: text/html
Content-Length: 119

<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
</body>
</html>

As a DevOps engineer or SRE, seeing this error means your application layer is entirely disconnected from your ingress layer. Let us systematically break down the root causes and how to remediate them.

Step 1: Diagnose the Current Target State

The absolute first step in troubleshooting an ALB 503 error is checking the health of your Target Groups. The ALB relies on these groups to know where to send traffic.

  1. Open the Amazon EC2 Console and navigate to Target Groups under the Load Balancing section.
  2. Select the Target Group associated with your ALB listener rules.
  3. Click on the Targets tab.

You will see your targets in one of several states: Healthy, Unhealthy, Initial, Draining, or simply no targets registered at all.
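You can pull the same state information from the AWS CLI, which is faster to script and share in an incident channel. A minimal sketch, assuming you have the AWS CLI and jq installed; the target group ARN is a placeholder:

```shell
# Placeholder ARN; substitute your real target group.
TG_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123"

# Print one line per registered target: ID, state, and failure reason.
# No output at all means Scenario A below: an empty target group.
aws elbv2 describe-target-health --target-group-arn "$TG_ARN" --output json \
  | jq -r '.TargetHealthDescriptions[]
           | "\(.Target.Id)\t\(.TargetHealth.State)\t\(.TargetHealth.Reason // "-")"'
```

The `Reason` field (for example `Target.FailedHealthChecks` or `Target.Timeout`) usually points directly at which of the later steps applies.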

Scenario A: Empty Target Group

If the target group has no registered targets, the ALB has nowhere to send traffic. This commonly happens in ECS/EKS environments when the service failed to start, or when the deployment pipeline deregistered old tasks before the new ones finished registering.

Scenario B: All Targets are 'Unhealthy'

If targets exist but are failing health checks, the ALB removes them from the active routing pool. Once the pool is empty, the ALB begins throwing 503s to clients.

Step 2: Fix Failing Health Checks

If your targets are unhealthy, you must figure out why the ALB cannot get a successful response from the health check endpoint.

1. Check the Health Check Configuration: Navigate to the Health checks tab in your Target Group. Verify the following:

  • Protocol & Port: Is the ALB checking the correct port? (e.g., HTTP on port 8080).
  • Path: Does the health check path (e.g., /health or /api/status) actually exist on your application? Does it return an HTTP 200 OK?
  • Timeout & Interval: If your application takes 10 seconds to respond to a health check, but the ALB timeout is 5 seconds, it will mark the target unhealthy. Ensure the interval is longer than the timeout.
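If any of these settings are wrong, they can be corrected from the CLI without touching the console. A sketch with hypothetical values (ARN, path, and thresholds are placeholders); note the local guard mirroring the ALB's own rule that the timeout must be less than the interval:

```shell
# Hypothetical values; adjust to your target group and application.
TG_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123"
HC_PATH="/health"
INTERVAL=30   # seconds between health checks
TIMEOUT=10    # seconds to wait for a 200 OK

# The ALB rejects configurations where timeout >= interval; fail fast locally.
if [ "$TIMEOUT" -ge "$INTERVAL" ]; then
  echo "health check timeout must be less than the interval" >&2
  exit 1
fi

aws elbv2 modify-target-group \
  --target-group-arn "$TG_ARN" \
  --health-check-path "$HC_PATH" \
  --health-check-interval-seconds "$INTERVAL" \
  --health-check-timeout-seconds "$TIMEOUT" \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```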

2. Verify Security Groups: This is the #1 cause of sudden 503 errors after an infrastructure update. The Security Group attached to your backend instances (EC2, ECS ENIs, or EKS Nodes) must have an inbound rule allowing traffic from the ALB's Security Group.

  • Go to the EC2/Backend Security Group.
  • Check Inbound Rules.
  • Ensure there is a rule allowing the specific application port (e.g., TCP 80, 443, or 3000), and that the Source is the Security Group ID of the ALB (sg-xxxxxxxx).
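The check-then-fix above can be scripted. A sketch, assuming jq is available; the security group IDs and port are hypothetical:

```shell
ALB_SG="sg-0alb1234567890"          # hypothetical ALB security group ID
BACKEND_SG="sg-0backend1234567890"  # hypothetical backend security group ID
APP_PORT=8080

# Prints "true" if the backend SG already allows $APP_PORT from the ALB SG.
aws ec2 describe-security-groups --group-ids "$BACKEND_SG" --output json \
  | jq --arg sg "$ALB_SG" --argjson port "$APP_PORT" \
      '[.SecurityGroups[].IpPermissions[]
        | select(.FromPort <= $port and .ToPort >= $port)
        | .UserIdGroupPairs[]?
        | select(.GroupId == $sg)] | length > 0'

# If it printed "false", add the missing ingress rule:
aws ec2 authorize-security-group-ingress \
  --group-id "$BACKEND_SG" \
  --protocol tcp \
  --port "$APP_PORT" \
  --source-group "$ALB_SG"
```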

Step 3: Investigate Backend Application Crashes

If the network is open and the health check configuration is correct, the application itself might be crashing.

  • For EC2: SSH or use AWS Systems Manager Session Manager (SSM) to log into the instance. Run curl -v http://localhost:<port>/<health-check-path>. If the connection is refused, your application service (e.g., Nginx, Node.js, Tomcat) is down. Check /var/log/messages, syslog, or journalctl for crash logs.
  • For ECS/EKS: Check the container logs in CloudWatch. Often, a container will start, fail to connect to a database, crash, and restart. During this crash-loop, the target will never reach a Healthy state, resulting in continuous 503s at the ALB layer.
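For the EC2 case, the local check and log inspection can be combined into one snippet run inside the instance. A sketch, assuming systemd and a hypothetical unit name `my-app.service`; port and path are placeholders:

```shell
# Run from inside the instance (SSH or SSM session).
PORT=8080
HEALTH_PATH="/health"

# -f makes curl exit non-zero on HTTP 4xx/5xx; "connection refused" means
# nothing is listening on the port at all.
if curl -fsS "http://localhost:${PORT}${HEALTH_PATH}" -o /dev/null; then
  echo "local health check OK"
else
  echo "local health check FAILED; inspecting recent service logs"
  journalctl -u my-app.service --since "15 minutes ago" --no-pager | tail -n 50
fi
```

If the local check passes but the ALB still marks the target unhealthy, the problem is almost certainly network-level (security groups or NACLs), not the application.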

Step 4: Check for Resource Exhaustion and Capacity Issues

In high-traffic scenarios, your backend instances might be perfectly healthy but physically unable to accept new connections.

  • Connection Queue Limits: Web servers like Nginx, Apache, or Tomcat have maximum thread or connection limits. Once hit, they refuse new TCP connections. The ALB interprets this as a dropped connection and, if widespread, will return 503s.
  • Ephemeral Port Exhaustion: If your instances make heavy outbound connections, they may exhaust their ephemeral port range, causing new outbound connections to fail and the application to stall.
  • CPU/Memory Maxed Out: Use CloudWatch to check the CPUUtilization and MemoryUtilization of your backend targets. An overloaded instance will fail to respond to health checks in time. The fix here is to scale out (add more instances/tasks) or scale up (increase instance size).
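The CPU check and scale-out for an EC2-backed service can be sketched as follows. This assumes an Auto Scaling group (the name `my-asg` is a placeholder) and GNU coreutils `date`; on macOS the `-d '1 hour ago'` flag differs:

```shell
# Average CPU across the ASG's instances over the last hour, in 5-min buckets.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=AutoScalingGroupName,Value=my-asg \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average \
  --output table

# If CPU is sustained near its ceiling, scale out immediately:
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-asg \
  --desired-capacity 4
```

For ECS, the equivalent lever is `aws ecs update-service --desired-count`; for EKS, scaling the deployment's replica count.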

Summary of Diagnostic Flow

  1. Identify if the 503 is consistent or intermittent.
  2. Check HTTPCode_ELB_503_Count in CloudWatch.
  3. Inspect Target Group Health status.
  4. Validate Security Group rules.
  5. Confirm application process uptime and local health check responses.

Diagnostic Script

# Diagnostic script to list unhealthy targets in a specific target group

TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:region:account-id:targetgroup/my-tg/id"

# Get target health status
echo "Fetching target health status..."
aws elbv2 describe-target-health \
  --target-group-arn "$TARGET_GROUP_ARN" \
  --query 'TargetHealthDescriptions[?TargetHealth.State!=`healthy`].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Description:TargetHealth.Description}' \
  --output table

# Query CloudWatch for 503 errors over the last hour
echo "Fetching ALB 503 error metrics..."
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_503_Count \
  --dimensions Name=LoadBalancer,Value=app/my-load-balancer/id \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum \
  --output table

Error Medic Editorial

Error Medic Editorial comprises senior DevOps and Cloud Reliability Engineers dedicated to demystifying complex cloud infrastructure errors. We share practical, production-tested solutions for AWS, Kubernetes, and modern cloud-native stacks.

