Fixing Traefik 502 Bad Gateway and 504 Gateway Timeout Errors
Comprehensive troubleshooting for Traefik 502 Bad Gateway, 504 Timeouts, and Connection Refused errors. Learn to diagnose Docker networks, ports, and timeouts.
- Root Cause 1: Traefik and the backend container are not sharing a common Docker network, resulting in 'Connection Refused' and a 502 error.
- Root Cause 2: Traefik is automatically routing traffic to the wrong internal port of the backend service (e.g., targeting port 80 instead of 3000).
- Root Cause 3: The backend application is legitimately taking too long to respond, exceeding Traefik's default or configured forwarding timeouts, causing a 504.
- Quick Fix Summary: Explicitly define `traefik.docker.network`, specify target ports via `loadbalancer.server.port` labels, verify backend health, and adjust `forwardingTimeouts` for long-running endpoints.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Verify Docker Networks | When seeing 'Connection Refused' logs or Traefik cannot resolve the backend IP address. | 5 mins | Low |
| Specify Target Port | When the backend exposes multiple ports or a non-standard port and Traefik guesses incorrectly. | 2 mins | Low |
| Increase Forwarding Timeouts | When facing '504 Gateway Timeout' on heavy API requests, file uploads, or long-running DB queries. | 5 mins | Low |
| Configure TLS/Scheme | When the backend application enforces HTTPS internally and rejects Traefik's default HTTP probe. | 10 mins | Medium |
Understanding Traefik Gateway Errors
When operating Traefik as a reverse proxy, ingress controller, or API gateway, encountering HTTP 502 Bad Gateway, HTTP 504 Gateway Timeout, or raw 'Connection Refused' errors is a frequent occurrence. Because Traefik dynamically discovers services via providers like Docker, Kubernetes, or HashiCorp Consul, the root cause often lies in the configuration bridging Traefik to your backend applications rather than in Traefik itself.
The 502 Bad Gateway
A 502 Bad Gateway error occurs when Traefik successfully receives a request from an external client, identifies the matching routing rule (Router), and attempts to forward the request to the backend server (Service), but receives an invalid response or no response at all. In the Traefik ecosystem, this almost always means Traefik cannot establish a TCP connection to the backend IP address and port it discovered.
The 504 Gateway Timeout
A 504 Gateway Timeout indicates that Traefik successfully resolved the backend and established a TCP connection, but the backend failed to return a complete HTTP response within the allowed timeframe. This typically happens with slow database queries, delays in upstream external APIs, or timeouts set too low in Traefik's transport configuration.
Connection Refused
When viewing Traefik debug logs, you might see the underlying network error: dial tcp <IP>:<PORT>: connect: connection refused. This is the direct network error triggering the HTTP 502. It means the target IP is reachable at the network layer, but no application process is listening on the specified port, or a host-level firewall or network isolation policy actively rejected the TCP SYN packet.
Step 1: Diagnosing 502 Bad Gateway and Connection Refused
The most frequent cause of a 502 in Traefik (especially when utilizing the Docker provider) is a network isolation issue, a port mismatch, or a TLS handshake failure.
1.1 Docker Network Isolation
By default, Docker Compose provisions a default bridge network for each distinct docker-compose.yml stack. If Traefik runs in its own infrastructure stack and your application runs in a separate project stack, they are placed on completely isolated bridge networks. Traefik will discover the container via the Docker socket, retrieve its internal IP, attempt to route traffic to it, and fail with a 502 because it cannot route packets across isolated Docker bridges.
Diagnosis:
Inspect the networks attached to both the Traefik container and your target backend container:
docker inspect <traefik_container_name> -f '{{json .NetworkSettings.Networks}}'
docker inspect <backend_container_name> -f '{{json .NetworkSettings.Networks}}'
The Fix:
Ensure both containers share at least one common network. Create an external network (e.g., traefik-public) and attach both services to it.
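If the shared network does not exist yet, it can be created once on the host. A short sketch (the network name traefik-public and container name traefik are illustrative):

```shell
# Create a shared external bridge network (run once on the Docker host)
docker network create traefik-public

# Optionally attach an already-running Traefik container without restarting it
# (container name "traefik" is an assumption; adjust to your setup)
docker network connect traefik-public traefik
```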
In your application's docker-compose.yml:
networks:
  traefik-public:
    external: true

services:
  myapp:
    networks:
      - traefik-public
Crucially, if your application container is attached to multiple networks (e.g., an internal DB network and the Traefik network), you must explicitly tell Traefik which network to use to route external traffic. Add this label to your backend service:
traefik.docker.network=traefik-public
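In docker-compose label syntax, the network hint sits alongside the other Traefik labels. A sketch for a service attached to both a private and the shared network (the service and network names are illustrative):

```yaml
services:
  myapp:
    networks:
      - traefik-public
      - internal-db   # private network Traefik must NOT route through
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=traefik-public"
```

Without the label, Traefik may pick the internal-db IP and fail with a 502 even though the containers share a network.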
1.2 Incorrect Port Discovery
Traefik attempts to intelligently auto-detect the internal port your container is listening on. If a container exposes multiple ports (e.g., a web server exposing 80 for HTTP and 8080 for Prometheus metrics) or doesn't explicitly expose any using the Dockerfile EXPOSE instruction, Traefik might arbitrarily guess the wrong one.
Diagnosis:
Enable debug mode in Traefik logs. You will see Traefik attempting to forward requests to a specific IP and port (e.g., Forwarding to 172.18.0.4:80). If your Node.js application listens on port 3000 but Traefik is trying to hit port 80, the OS will refuse the connection, yielding a 502.
The Fix:
Override the auto-discovery and explicitly define the internal load balancer port using Docker labels:
labels:
  - "traefik.http.services.my-app-service.loadbalancer.server.port=3000"
1.3 Backend Application Crash or Boot Delay
Sometimes the network plumbing is flawless, but the application simply isn't actively running. If the backend container is trapped in a crash-loop or is still executing lengthy initialization tasks (like running synchronous database migrations on startup), the internal web server won't be ready to accept TCP connections.
Diagnosis:
Check the backend container logs: docker logs -f <backend_container_name>. Look for unhandled exceptions, stack traces, or initialization progress bars.
The Fix:
Implement rigorous healthchecks. Traefik natively integrates with Docker and Kubernetes healthchecks. With proper health probes defined, Traefik excludes the unready container from the load balancer pool until it reports as 'healthy', preventing 502s during rollouts or restarts.
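A minimal Compose-level healthcheck sketch, assuming the app serves a /health endpoint on port 3000 (both the path and port are illustrative, and wget must exist in the image):

```yaml
services:
  myapp:
    healthcheck:
      # Probe the app's own health endpoint (path and port are assumptions)
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 30s   # grace period for slow startup tasks like migrations
```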
1.4 TLS/HTTPS Backend Communication Failures
Increasingly, zero-trust architectures mandate that even internal backend services enforce HTTPS. If Traefik attempts to connect via plain-text HTTP (its default behavior) to a backend port that strictly expects a TLS handshake, the connection will be dropped immediately or the backend will reject the malformed request, leading directly to a 502 Bad Gateway.
Diagnosis:
Run curl from inside the Traefik container against the backend's internal IP. If curl http://<ip>:<port> returns an error like curl: (52) Empty reply from server or mentions an SSL handshake failure, but curl -k https://<ip>:<port> successfully returns data, you have a protocol scheme mismatch.
The Fix:
You must explicitly instruct Traefik to negotiate an HTTPS scheme when communicating with this specific backend. Apply the following label to your service:
traefik.http.services.<service-name>.loadbalancer.server.scheme=https
Additionally, if the backend uses a self-signed or internal CA certificate (very common in microservices), Traefik will reject the connection because it cannot cryptographically verify the certificate authority. You must configure a specific serversTransport in a dynamic configuration file that skips TLS verification for that specific internal service, and reference it via labels:
traefik.http.services.<name>.loadbalancer.serversTransport=<transport-name>@file
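A minimal dynamic-file sketch of such a transport (the name internal-tls is illustrative; skipping verification removes authenticity checks, so scope it narrowly):

```yaml
# Dynamic configuration file loaded by the File Provider
http:
  serversTransports:
    internal-tls:
      # Accept the backend's self-signed / internal-CA certificate
      insecureSkipVerify: true
```

A safer alternative is to set rootCAs on the transport to your internal CA bundle so certificates are still verified against it, rather than disabling verification outright.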
Step 2: Diagnosing 504 Gateway Timeout Errors
When you receive a 504 Gateway Timeout, the TCP connection from Traefik to the backend was successfully established, but the response lifecycle failed to complete in time.
2.1 Backend Processing Delays
Determine if the application endpoint is genuinely designed to take a long time. For example, a heavy PDF report generation endpoint, a bulk data export, or a complex machine learning inference request might legitimately take 45 to 120 seconds to process.
The Fix:
If the delay is expected and architecturally sound, you must increase Traefik's forwarding timeouts. Out of the box, Traefik does not limit how long a backend may take to start responding (responseHeaderTimeout defaults to unlimited), but OS-level or infrastructure timeouts can still interfere. You can configure precise timeouts on a serversTransport in a dynamic configuration file loaded by the File Provider (a separate file from the static traefik.yml):
http:
  serversTransports:
    long_running_transport:
      forwardingTimeouts:
        dialTimeout: 30s
        responseHeaderTimeout: 120s
        idleConnTimeout: 90s
And attach this specialized transport to your specific service via Docker labels:
- "traefik.http.services.my-heavy-app.loadbalancer.serversTransport=long_running_transport@file"
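Note that the transport above only governs the Traefik-to-backend leg. The client-to-Traefik leg has its own limits, configurable per entrypoint in the static configuration. A sketch with illustrative values:

```yaml
# Static configuration (traefik.yml)
entryPoints:
  websecure:
    address: ":443"
    transport:
      respondingTimeouts:
        readTimeout: 60s    # max time to read the full client request
        writeTimeout: 180s  # max time to write the response back to the client
        idleTimeout: 180s   # keep-alive idle limit for client connections
```

If writeTimeout is shorter than the backend's processing time, clients can see errors even though the backend transport is configured correctly.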
2.2 Unresponsive Upstream Dependencies
If your backend application is synchronously waiting on an external API (like a payment gateway) or a database query that hangs indefinitely due to lock contention, the backend thread will block. Traefik will patiently wait until its internal timeout is reached, eventually severing the connection and returning a 504 to the end user.
Diagnosis:
Instrument your application with distributed tracing (e.g., OpenTelemetry, Jaeger) or add detailed duration logging to see exactly where the request is stalling within your backend code pipeline.
2.3 TCP Idle Connection Drop (Cloud Load Balancers)
In cloud environments like AWS (using Elastic Load Balancers in front of Traefik), Azure, or GCP, stateful firewalls or external load balancers will automatically and silently drop TCP connections if no packets traverse the wire for a certain idle period (typically 60 seconds). If an API request processes for 65 seconds without sending data, the cloud provider drops the connection. Traefik never receives a TCP FIN/RST packet, hangs waiting, and eventually throws a 504 or a 502.
Diagnosis:
Check whether requests that run for exactly a specific duration (e.g., exactly 60 seconds) consistently fail. Review your cloud provider's default idle timeout settings for your external Load Balancers or NAT Gateways.
The Fix:
Configure TCP keep-alive settings in your backend and on the Traefik host so that probe packets are sent periodically, keeping the connection state active on intermediary cloud firewalls. Also ensure Traefik's responseHeaderTimeout is strictly less than the cloud provider's hard idle timeout, so clients receive a clean error rather than a hanging socket.
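On a Linux host, the kernel's keep-alive cadence can be tightened below the cloud idle timeout. A sketch with illustrative values (persist them under /etc/sysctl.d to survive reboots):

```shell
# Send the first keep-alive probe after 50s of idle (below a typical 60s LB timeout)
sysctl -w net.ipv4.tcp_keepalive_time=50
# Re-probe every 15s and give up after 3 failed probes
sysctl -w net.ipv4.tcp_keepalive_intvl=15
sysctl -w net.ipv4.tcp_keepalive_probes=3
```

These settings only apply to sockets that enable SO_KEEPALIVE; Go's standard dialer, which Traefik builds on, enables keep-alive by default.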
Step 3: Kubernetes Specific 502/504 Diagnostics
When operating Traefik as a Kubernetes Ingress Controller, the complexity of networking increases dramatically due to CNI (Container Network Interface) plugins, kube-proxy iptables rules, and internal DNS.
3.1 Kubernetes Endpoint Missing
A Kubernetes Service acts as a network abstraction over ephemeral Pods. Traefik routes traffic directly to the IP endpoints associated with the Service. If your Pods fail their readiness probes, Kubernetes removes their IPs from the Service's endpoint list.
Diagnosis:
Verify if the Kubernetes Service actually has registered active endpoints:
kubectl get endpoints <service-name> -n <namespace>
If the output shows <none> under the ENDPOINTS column, Traefik has absolutely nowhere to send the traffic and will immediately return a 502.
The Fix:
Investigate why the backend Pods are failing their readiness probes using kubectl describe pod <pod-name> and kubectl logs <pod-name>. Fix the underlying application initialization issue.
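A minimal readiness probe sketch for the backend Deployment, assuming an HTTP /health endpoint on port 3000 (both values are illustrative):

```yaml
# Container spec fragment inside the backend Deployment
readinessProbe:
  httpGet:
    path: /health   # illustrative health endpoint
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
```

Until the probe passes, the Pod IP stays out of the Service endpoints, so Traefik never routes traffic to an unready replica.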
3.2 CoreDNS Resolution Latency
Unlike the standalone Docker provider, which reads container IPs directly from the local Docker socket, Traefik in Kubernetes normally discovers Pod IPs through the Kubernetes API. However, for Services of type ExternalName, or whenever the backend itself depends on cluster DNS, name resolution goes through the cluster's internal DNS (CoreDNS). If CoreDNS is experiencing high latency, CPU throttling, or dropped UDP packets, those lookups fail or stall, surfacing as 502 errors.
Diagnosis:
Check the CoreDNS pod logs in the kube-system namespace for errors or high latency warnings. Exec into the Traefik pod and attempt to resolve your service manually to test DNS latency:
kubectl exec -it <traefik-pod> -n traefik -- nslookup my-service.my-namespace.svc.cluster.local
The Fix:
Scale out the CoreDNS deployment to handle high DNS query volumes. Ensure each node's resolv.conf is properly configured, and consider enabling the NodeLocal DNSCache feature in your Kubernetes cluster to reduce cross-node DNS lookup latency and mitigate UDP packet drops.
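Checking CoreDNS health and scaling it out can be done directly with kubectl (the replica count is illustrative; size it for your query volume):

```shell
# Check CoreDNS pod status and restarts in kube-system
kubectl -n kube-system get pods -l k8s-app=kube-dns

# Tail CoreDNS logs for SERVFAIL or timeout entries
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100

# Scale the deployment out (replica count is illustrative)
kubectl -n kube-system scale deployment coredns --replicas=3
```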
Quick Diagnostic Command Reference
# 1. Enable DEBUG logging in Traefik via Docker Compose
# command:
# - "--log.level=DEBUG"
# - "--api.insecure=true"
# 2. View live Traefik logs and filter for gateway errors
docker logs -f traefik | grep -i -E "502|504|error|connection refused"
# 3. Verify Docker network overlap for Traefik and the backend service
docker inspect traefik -f '{{json .NetworkSettings.Networks}}' | jq .
docker inspect my-backend-app -f '{{json .NetworkSettings.Networks}}' | jq .
# 4. Test pure TCP connectivity manually from inside the Traefik container
# Replace <backend-ip> and <port> with the exact values failing in the Traefik logs
docker exec -it traefik /bin/sh -c "wget -qO- http://<backend-ip>:<port> || echo 'TCP Connection Failed'"
# 5. Check Kubernetes endpoints if using Traefik as an Ingress Controller
kubectl get endpoints my-backend-service -n my-app-namespace
kubectl describe pod -l app=my-backend-service -n my-app-namespace | grep -i readiness

Error Medic Editorial
Our SRE and DevOps engineering team breaks down complex infrastructure issues into clear, actionable guides. We specialize in Kubernetes networking, Docker orchestration, and advanced reverse proxy configurations for Traefik, Nginx, and Envoy.