Envoy 503 Service Unavailable: Root Causes and Troubleshooting Guide
Fix Envoy 503 Service Unavailable errors. Learn how to diagnose upstream connection failures, connection pool exhaustion, and TLS issues with actionable steps.
- Upstream Connection Failure (UF): The upstream service is down, unreachable, or rejecting connections on the specified port.
- Connection Pool Exhaustion: Envoy cannot open new connections to the upstream service because the connection pool limits have been reached.
- Upstream Request Timeout (UT): The upstream service took too long to respond to the request (strictly surfaced as a 504 rather than a 503).
- No Healthy Upstream (UH): Health checks have failed for all endpoints in the upstream cluster.
- Quick Fix: Check Envoy access logs for the specific response flag (e.g., UF, UH, URX) to pinpoint the exact reason for the 503.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Access Log Analysis | Initial triage to identify response flags (UF, UH, etc.) | Fast | Low |
| Admin Stats Interface | Checking circuit breaker states and connection pools | Medium | Low |
| Envoy Trace Logging | Deep diving into TLS handshakes or routing logic | Slow | High (Performance impact) |
| Upstream Application Logs | When Envoy indicates the upstream closed the connection (UC) | Medium | Low |
Understanding the Error
When Envoy Proxy returns a 503 Service Unavailable error, it is acting as a faithful messenger. It means Envoy successfully received the downstream request but was unable to forward it to or receive a valid response from the upstream service.
Unlike a generic web server error, Envoy provides highly granular telemetry indicating exactly why the 503 occurred via Response Flags in its access logs. A 503 is rarely an issue with Envoy itself; it's almost always a problem with the upstream service's health, network connectivity, or Envoy's configuration regarding how to communicate with that upstream.
Step 1: Diagnose with Response Flags
The most critical step in troubleshooting an Envoy 503 is checking the access logs. Envoy appends a short code (the response flag) to the log entry. Here are the most common flags associated with a 503:
- UF (Upstream Connection Failure): Envoy could not establish a TCP connection to the upstream. This usually means the upstream process is not running, is listening on the wrong port, or a firewall/security group is blocking the connection.
- UC (Upstream Connection Termination): The upstream accepted the connection but closed it before a complete response was received, often due to a protocol mismatch or a crash mid-request.
- UH (No Healthy Upstream): Envoy's active health checking has determined that no hosts in the upstream cluster are healthy. The request is rejected before a connection is even attempted.
- URX (Upstream Retry Limit Exceeded): The request failed, and Envoy exhausted its configured retry attempts.
- UT (Upstream Request Timeout): The upstream connection was established, but the upstream failed to respond within the configured timeout period. (Note that Envoy pairs UT with a 504 Gateway Timeout rather than a 503.)
- UO (Upstream Overflow): The circuit breakers for the upstream cluster tripped. This often happens if the connection pool is exhausted or pending request limits are hit.
To view these logs, you typically need to check stdout for the Envoy container or your centralized logging system. Look for entries like:
[2023-10-27T10:00:00.000Z] "GET /api/v1/data HTTP/1.1" 503 UF 0 0 5 - "-" "curl/7.68.0" ...
Notice the UF right after the 503 status code.
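For quick triage across many log lines, the flag can be tallied mechanically. A minimal sketch, assuming the default access log format, where the response flag is the field immediately after the status code; the function name and log path are illustrative:

```shell
# Tally 503 responses by response flag from an Envoy access log file.
# Assumes the default log format: the status code is whitespace field 5
# and the response flag is field 6.
tally_503_flags() {
  awk '$5 == 503 { counts[$6]++ } END { for (f in counts) print f, counts[f] }' "$1"
}
```

Running something like tally_503_flags /var/log/envoy/access.log then shows at a glance whether you are chasing one failure mode or several.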
Step 2: Fix Common Scenarios
Scenario A: Upstream Connection Failure (UF)
If you see UF, verify the upstream is reachable from the Envoy pod/node.
- Check Upstream Status: Ensure the target pod/VM is actually running.
- Verify Ports: Double-check that Envoy's cluster configuration matches the port the upstream application is listening on.
- Network Policies: In Kubernetes, ensure no NetworkPolicies are blocking traffic from Envoy's namespace to the target namespace.
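The first two checks can be combined into a small probe. A hedged sketch, assuming bash (with /dev/tcp support) is available inside the container; the function name is illustrative:

```shell
# Report whether a TCP connection to host:port can be opened.
# Uses bash's built-in /dev/tcp pseudo-device, so no extra tooling
# (nc, curl) is required in the container image.
check_upstream() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}
```

Running the probe from inside the Envoy pod (via kubectl exec) rather than from your workstation matters: Kubernetes NetworkPolicies are enforced per pod, so the result reflects what Envoy itself can reach.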
Scenario B: No Healthy Upstream (UH)
If you see UH, Envoy believes the upstream is dead.
- Check Health Check Config: Review Envoy's active health check configuration. Is the path correct? Is the expected status code 200?
- Inspect Upstream Logs: Look at the upstream service's logs. Is it failing its health check endpoints? Is it out of memory or deadlocking?
- Bypass Health Checks (Temporarily): For testing, you can temporarily disable active health checks to see if traffic flows, which confirms the health check configuration is the culprit.
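When auditing the health check configuration, it helps to see the relevant fields in one place. A hedged sketch of an active HTTP health check in the Envoy v3 cluster API; the cluster name and path here are illustrative, not taken from this article:

```yaml
clusters:
- name: backend_service        # illustrative cluster name
  health_checks:
  - timeout: 2s
    interval: 5s
    unhealthy_threshold: 3     # consecutive failures before a host is marked unhealthy
    healthy_threshold: 2       # consecutive passes before a host is marked healthy again
    http_health_check:
      path: /healthz           # must match the path the application actually serves
```

By default Envoy treats only a 200 response as healthy; if the endpoint legitimately returns another 2xx, the expected_statuses field can widen the accepted range.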
Scenario C: Upstream Overflow (UO) / Circuit Breaking
If you see UO, Envoy is protecting the upstream by shedding load.
- Check Admin Stats: Port-forward to Envoy's admin port (usually 15000 or 9901) and check /stats. Look for cluster.<cluster_name>.upstream_cx_overflow or cluster.<cluster_name>.upstream_rq_pending_overflow.
- Tune Circuit Breakers: If the upstream can handle more load, increase the max_connections, max_pending_requests, or max_requests values in the cluster's circuit breaker configuration.
- Scale Upstream: If the upstream genuinely cannot handle the load, scale out the upstream deployment.
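Those thresholds live under the cluster's circuit_breakers block. A hedged sketch using the v3 API; the numbers are illustrative starting points, not recommendations:

```yaml
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 1024        # exceeding this increments upstream_cx_overflow
    max_pending_requests: 1024   # exceeding this increments upstream_rq_pending_overflow
    max_requests: 1024           # concurrent request cap (HTTP/2 and HTTP/3)
    max_retries: 3               # concurrent retries allowed across the cluster
```

Raising these values only moves the pressure downstream, so pair any increase with evidence that the upstream actually has headroom.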
Step 3: TLS and Protocol Issues
Sometimes a 503 occurs due to a protocol mismatch. If Envoy expects an HTTP/2 upstream but the upstream only speaks HTTP/1.1, the connection will drop. Similarly, if Envoy is configured to use TLS to talk to the upstream (transport_socket configuration) but the upstream certificate is invalid, expired, or doesn't match the expected Subject Alternative Name (SAN), Envoy will close the connection and return a 503. Check for the UC (Upstream Connection Termination) flag in these cases.
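Certificate problems on the upstream side can be checked without Envoy in the loop. A minimal sketch, assuming openssl is available; the helper name and file paths are illustrative:

```shell
# Print "valid" if a PEM certificate has not yet expired, "expired" otherwise.
# openssl's -checkend 0 exits non-zero once the notAfter date has passed.
cert_status() {
  if openssl x509 -checkend 0 -noout -in "$1" >/dev/null 2>&1; then
    echo "valid"
  else
    echo "expired"
  fi
}
```

Against a live upstream, a command like openssl s_client -connect <host>:<port> -alpn h2 </dev/null shows the certificate chain and SANs and whether HTTP/2 was negotiated via ALPN, which covers both the TLS and protocol-mismatch failure modes described above.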
Quick Diagnostic Commands
# 1. Check Envoy Access Logs for the Response Flag (e.g., UF, UH, UO)
kubectl logs -l app=envoy-proxy -n gateway-system | grep " 503 "
# 2. Port-forward the Envoy Admin Interface to check stats
kubectl port-forward deployment/envoy-proxy 15000:15000 -n gateway-system
# 3. Check for Circuit Breaker overflows (UO flag)
curl -s http://localhost:15000/stats | grep overflow
# 4. Check cluster health status (UH flag)
curl -s http://localhost:15000/clusters | grep health_flags
# 5. Test upstream connectivity directly from the Envoy pod (UF flag)
kubectl exec -it deploy/envoy-proxy -n gateway-system -- curl -v http://<upstream-ip>:<port>

Error Medic Editorial
Error Medic Editorial is a team of veteran Site Reliability Engineers and DevOps practitioners dedicated to demystifying complex distributed systems and providing actionable, real-world solutions for production incidents.