Error Medic

Resolving Istio 504 Gateway Timeout and 503 Connection Refused Errors

Fix Istio 504 Gateway Timeout and 503 Connection Refused errors by adjusting VirtualService timeout limits, DestinationRule settings, and diagnosing Envoy logs.

Key Takeaways
  • 504 Gateway Timeout errors (response flag 'UT') usually occur because an upstream service takes longer to respond than Envoy's default 15-second timeout limit.
  • 503 Service Unavailable / Connection Refused errors (response flags 'UF' or 'URX') frequently indicate a missing DestinationRule, mismatched subset labels, or mTLS strict mode misconfigurations.
  • Quick Fix: Increase the timeout threshold in your VirtualService configuration to accommodate slow-responding endpoints, and verify PeerAuthentication mTLS settings with 'istioctl analyze'.
  • Always inspect the Envoy sidecar access logs ('kubectl logs <pod-name> -c istio-proxy') to identify the exact HTTP response flags causing the drop.
Troubleshooting Methods for Istio Timeouts & Connection Drops
Method | When to Use | Time | Risk
Increase VirtualService Timeout | When the upstream application legitimately requires more than 15 seconds to process heavy requests. | 5 mins | Low (but can mask underlying application performance degradation)
Configure Envoy Proxy Retries | For mitigating transient network blips, temporary unavailability, or intermittent 503s. | 10 mins | Medium (high retry counts can cause "retry storms" and cascading failures)
Fix mTLS PeerAuthentication | When seeing persistent 503 Connection Refused between injected and uninjected services. | 15 mins | High (misconfiguration can cause security gaps or wider outages)
Scale Up Upstream Pods (HPA) | When the upstream service is overwhelmed, causing processing delays and subsequent Envoy timeouts. | 5 mins | Low (increases cloud resource costs but stabilizes traffic safely)

Understanding the Error

When operating a service mesh like Istio, all ingress and inter-service traffic is intercepted and managed by Envoy sidecar proxies. While this architecture provides unparalleled observability, security, and routing capabilities, it also introduces a strict traffic management layer. Two of the most common and disruptive issues DevOps engineers and SREs face in this environment are Istio 504 Gateway Timeout errors and 503 Service Unavailable / Connection Refused errors.

Developers will typically see exact error messages such as:

  • HTTP/1.1 504 Gateway Timeout
  • upstream request timeout
  • HTTP/1.1 503 Service Unavailable
  • upstream connect error or disconnect/reset before headers. reset reason: connection failure
  • connection refused

The Anatomy of an Istio Timeout (504)

By default, Envoy enforces a 15-second timeout on HTTP routes (note that recent Istio releases leave the route timeout disabled unless one is set in a VirtualService, so check your version's defaults). If an upstream service (your application container) takes 15.1 seconds to process a request, Envoy forcefully terminates the connection and returns a 504 Gateway Timeout to the client. Crucially, your application is completely unaware of this termination: it continues processing the request to completion and logs a success (such as HTTP 200 OK), but the client only ever sees the 504. This discrepancy between application logs and proxy logs is a classic hallmark of an Istio timeout issue.
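
You can reproduce this log discrepancy locally with a scaled-down sketch: the coreutils `timeout` command stands in for Envoy's route timeout (2 seconds here instead of 15), and `sleep` stands in for the upstream application's processing time. All numbers are illustrative:

```shell
#!/bin/sh
# Mimic Envoy's 504 behavior: the "proxy" gives up before the "app" finishes.
simulate_request() {
  proxy_timeout=$1   # stand-in for Envoy's route timeout
  app_latency=$2     # how long the upstream actually takes
  if timeout "$proxy_timeout" sh -c "sleep $app_latency && echo '200 OK'"; then
    echo "client sees: 200 OK"
  else
    # In real Istio the application keeps running to completion and logs a
    # success; here `timeout` kills it, but the client-side outcome is the same.
    echo "client sees: 504 Gateway Timeout (response flag UT)"
  fi
}

simulate_request 2 3   # upstream slower than the proxy budget -> 504
simulate_request 2 1   # upstream fast enough -> 200
```

The key observation is that the client-visible status is decided entirely by the proxy budget, never by whether the upstream eventually succeeds.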

The Anatomy of Connection Refused (503)

A 503 Service Unavailable or connection refused error generally indicates that the Envoy sidecar cannot establish a TCP connection with the destination pod. This rarely means the target pod is down (an empty load-balancing pool typically surfaces as a 503 with the UH flag instead). More often it points to a configuration mismatch. Common culprits include:

  1. mTLS Misconfigurations: The client proxy attempts a plaintext connection, but the server sidecar enforces strict mTLS (PeerAuthentication set to STRICT).
  2. Missing DestinationRules: Istio needs a DestinationRule to know how to route traffic to specific subsets or apply TLS settings.
  3. Port Mismatches: The port defined in the Kubernetes Service does not match the targetPort of the deployment, or the port name doesn't follow Istio's <protocol>-<suffix> naming convention (e.g., http-web).
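
A Kubernetes Service manifest that satisfies the port checks above might look like the following sketch (names and port numbers are placeholders; the port-name convention matters most on older Istio versions that predate `appProtocol`-based protocol detection):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service        # hypothetical service name
  namespace: production
spec:
  selector:
    app: my-service       # must match the Deployment's pod labels
  ports:
  - name: http-web        # <protocol>-<suffix>; a bare "web" is treated as opaque TCP
    port: 8080
    targetPort: 8080      # must match the containerPort in the Deployment
```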

Step 1: Diagnose the Exact Failure Reason

The most critical step in troubleshooting Istio routing issues is examining the Envoy proxy access logs. Do not rely solely on your application logs.

Run the following command to tail the proxy logs of the failing client pod:

kubectl logs <client-pod-name> -n <namespace> -c istio-proxy --tail 100

Look for the Envoy Response Flags in the log output. These are typically two- or three-letter codes appended to the response status:

  • UT (Upstream Request Timeout): Confirms a 504 error caused by the upstream application taking longer than the configured VirtualService timeout.
  • UF (Upstream Connection Failure): Envoy failed to connect to the upstream service. This often pairs with connection refused errors.
  • URX (Upstream Retry Limit Exceeded): The proxy retried the request but exhausted its retry budget.
  • UH (No Healthy Upstream): The destination service has no healthy endpoints in its load balancing pool (often a Kubernetes readiness probe failure, not an Istio routing issue).
  • NR (No Route Configured): Istio doesn't know where to send this traffic. Check your VirtualService routes.
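
To see at a glance which of these flags dominates, you can tally them. The sketch below runs against embedded sample lines; in practice, pipe the output of `kubectl logs <pod> -c istio-proxy` into the same function. It assumes the flag appears as a standalone token, which holds for Istio's default text log format and for quoted values in JSON-formatted logs:

```shell
#!/bin/sh
# Count occurrences of the failure-related Envoy response flags.
count_flags() {
  grep -oE '\b(UT|UF|URX|UH|NR)\b' | sort | uniq -c | sort -rn
}

# Sample lines standing in for real `kubectl logs <pod> -c istio-proxy` output:
count_flags <<'EOF'
[2024-01-01T00:00:00Z] "GET /api HTTP/1.1" 504 UT response_timeout
[2024-01-01T00:00:01Z] "GET /api HTTP/1.1" 504 UT response_timeout
[2024-01-01T00:00:02Z] "GET /db HTTP/1.1" 503 UF connection_failure
EOF
```

The highest-count flag tells you which section of this guide to jump to (UT for Step 2, UF/URX for Step 3).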

You can also use the istioctl CLI to inspect the proxy configuration state and ensure the Envoy sidecars are perfectly synced with the Istio control plane (istiod):

istioctl proxy-status
istioctl analyze -n <namespace>

Step 2: Fix Istio 504 Timeout Errors

If you identified a UT flag, the resolution requires extending the timeout limit in the relevant VirtualService resource. Remember that increasing timeouts should be a deliberate decision; if an endpoint is slow, consider optimizing the application performance first.

To increase the timeout to 60 seconds, locate the VirtualService routing traffic to your application and add/modify the timeout directive under the HTTP route:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-slow-service-vs
  namespace: production
spec:
  hosts:
  - my-slow-service
  http:
  - route:
    - destination:
        host: my-slow-service
    # Increase default 15s timeout to 60 seconds
    timeout: 60s
    retries:
      attempts: 3
      perTryTimeout: 20s
      retryOn: gateway-error,connect-failure,refused-stream

Apply the updated configuration:

kubectl apply -f virtualservice.yaml

Note on Retries: In the example above, we also added a retries block. If the timeout is caused by intermittent network latency rather than a consistently slow process, configuring retries can significantly improve reliability without requiring a massive global timeout extension. Keep the budget arithmetic in mind: the route-level timeout caps the total time across all attempts, so attempts: 3 with perTryTimeout: 20s needs up to 3 × 20s = 60s, which fits exactly within the 60s timeout above.


Step 3: Fix 503 Connection Refused Errors

If your proxy logs reveal a UF (Upstream Connection Failure) or you see upstream connect error or disconnect/reset before headers, the issue is likely rooted in mTLS or DestinationRule misconfiguration.

Scenario A: mTLS Strict Mode Mismatch

If you have incrementally adopted Istio, some namespaces might enforce strict mTLS while others do not. If a client without an Envoy sidecar (or in PERMISSIVE mode) tries to communicate with a server enforcing STRICT mTLS, the connection will be refused.

Verify the PeerAuthentication policy for the destination namespace:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT

The Fix: Ensure the calling client is also part of the mesh (injected with an Envoy sidecar) so it can negotiate the TLS handshake. If the client is external or cannot be injected, you must downgrade the destination's mTLS mode to PERMISSIVE or create a specific port-level exception.
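
For the port-level exception, PeerAuthentication supports overrides via portLevelMtls, which only take effect when the policy has a workload selector. A sketch with hypothetical names:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: my-service-mtls   # hypothetical name
  namespace: production
spec:
  selector:               # port-level overrides require a workload selector
    matchLabels:
      app: my-service
  mtls:
    mode: STRICT          # keep mTLS strict for this workload overall...
  portLevelMtls:
    8080:
      mode: PERMISSIVE    # ...but accept plaintext from legacy clients on 8080
```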

Scenario B: Missing or Misconfigured DestinationRule

If a VirtualService routes traffic to specific subsets (e.g., v1 and v2 for canary deployments), you must have a corresponding DestinationRule defining those subsets. If the DestinationRule is missing, Envoy won't know the pod IP addresses associated with the subset, resulting in a 503.

Ensure your DestinationRule exists and accurately maps to Kubernetes pod labels:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-dr
  namespace: production
spec:
  host: my-service.production.svc.cluster.local
  subsets:
  - name: v1
    labels:
      version: v1
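
For completeness, here is the VirtualService side of that pairing: the subset named in the route must match a subset defined in the DestinationRule above (same hypothetical names), or Envoy has no cluster to send the traffic to:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-vs
  namespace: production
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service.production.svc.cluster.local
        subset: v1        # must match a subset name in the DestinationRule
```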

Scenario C: Headless Services and Traffic Policies

If you are routing traffic to external databases (like RDS) or headless services via a ServiceEntry, ensure the resolution is configured correctly (usually resolution: DNS). If Envoy tries to route to a headless service without a proper DestinationRule defining the load balancing algorithm, connections will drop.
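
A ServiceEntry for an external database might look like the following sketch; the hostname and port are placeholders:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-postgres   # hypothetical name
  namespace: production
spec:
  hosts:
  - mydb.example.com        # placeholder external hostname
  location: MESH_EXTERNAL
  ports:
  - number: 5432
    name: tcp-postgres      # the <protocol>-<suffix> convention applies here too
    protocol: TCP
  resolution: DNS           # resolve the hostname via DNS, not static endpoints
```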


Step 4: Advanced Validation and Tracing

After applying your fixes, validate the traffic flow to ensure the timeouts or connection refusals are resolved.

1. Use istioctl proxy-config: Dump the Envoy configuration to verify your changes have propagated to the sidecar. VirtualService request timeouts appear in the route configuration (the cluster dump only shows connection-level settings such as connectTimeout), while the cluster dump remains useful for endpoints and TLS settings:

istioctl proxy-config route <pod-name> -n <namespace> -o json
istioctl proxy-config cluster <pod-name> -n <namespace> --fqdn my-service.production.svc.cluster.local -o json

Search the route JSON for the "timeout" field to confirm it reflects your new 60s limit.

2. Trigger test requests: Exec into a container within the mesh and use curl -v to observe the headers and response timing.

kubectl exec -it <test-pod> -n <namespace> -c application -- \
  curl -v -o /dev/null -w 'status=%{http_code} total=%{time_total}s\n' http://my-service:8080/api/data

By systematically verifying proxy logs, adjusting VirtualService limits, and auditing mTLS DestinationRules, you can stabilize your Istio data plane and eliminate disruptive 504 and 503 routing errors.

Appendix: Full Diagnostic Script

#!/bin/bash
# Diagnostic script for Istio Timeouts and Connection Refused errors

NAMESPACE="production"
POD_NAME=$1

if [ -z "$POD_NAME" ]; then
  echo "Usage: ./istio_diagnose.sh <pod-name>"
  exit 1
fi

echo "=== 1. Checking Istio Proxy Logs for UT, UF, or URX flags ==="
kubectl logs "$POD_NAME" -n "$NAMESPACE" -c istio-proxy | grep -E '\b(UT|UF|URX)\b' | tail -n 10

printf '\n=== 2. Running Istio Analyzer in namespace ===\n'
istioctl analyze -n "$NAMESPACE"

printf '\n=== 3. Checking VirtualService Configurations (Timeouts) ===\n'
kubectl get virtualservice -n "$NAMESPACE" -o yaml | grep -B 2 -A 2 timeout

printf '\n=== 4. Checking PeerAuthentication (mTLS Strict Mode) ===\n'
kubectl get peerauthentication --all-namespaces

printf '\n=== 5. Dumping Envoy route config for %s ===\n' "$POD_NAME"
# Helpful for verifying whether the new timeout was propagated to this specific Envoy sidecar
istioctl proxy-config route "$POD_NAME" -n "$NAMESPACE" -o json | grep -i '"timeout"'

Error Medic Editorial

Error Medic Editorial is a team of Senior DevOps and Site Reliability Engineers dedicated to demystifying cloud-native architectures. We specialize in Kubernetes, Istio service mesh, and large-scale incident resolution.
