Comprehensive Guide to Fixing Nginx 502 Bad Gateway, 504 Timeouts, and Core Crashes
Diagnose and resolve Nginx 502 Bad Gateway, 504 Timeouts, connection refused errors, out-of-memory crashes, and permission denied issues with this SRE guide.
- Nginx 502 Bad Gateway errors indicate a broken connection to the upstream service (e.g., PHP-FPM, Node.js), often caused by the service being down, misconfigured ports, or socket permission issues.
- Nginx 504 Gateway Timeouts happen when the upstream application takes too long to process a request; fixing this requires application profiling and tuning proxy/fastcgi timeout directives.
- Resource exhaustion, such as 'too many connections' or 'out of memory' crashes, demands kernel-level tuning (ulimit, file descriptors) and careful configuration of worker processes and buffers.
| Symptom / Error | Common Root Cause | Primary Diagnostic Tool | Typical Resolution Time |
|---|---|---|---|
| Nginx 502 / Connection Refused | Upstream service down or wrong port | systemctl status, netstat | 5-10 mins |
| Nginx 504 / Nginx Slow | Heavy backend processing | Application APM, slow logs | 30+ mins |
| Permission Denied (Socket) | Incorrect socket owner or SELinux | ls -l, getenforce, audit2allow | 10 mins |
| Too Many Connections | Traffic spike exceeding worker limits | nginx error.log, ulimit -n | 15 mins |
| Nginx Out of Memory / Crash | OOM Killer, memory leak in module | dmesg, gdb (core dump) | Hours/Days |
Understanding Nginx Proxy Architecture & The 5xx Error Family
Nginx acts as the highly efficient, event-driven gateway for modern web infrastructure. It rarely serves dynamic content itself; instead, it proxies requests to upstream backend application servers such as PHP-FPM, Node.js, Python Gunicorn, or Java Tomcat. When you encounter errors like nginx 502, nginx 504, or experience an nginx crash, the root cause almost always lies in the communication layer between Nginx and the upstream service, or in resource exhaustion at the operating system level.
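In configuration terms, that proxy relationship is just an upstream block plus a proxy_pass directive; a minimal sketch (the backend name and port here are illustrative, not from any specific deployment):

```nginx
# Minimal reverse-proxy skeleton: Nginx terminates the client connection
# and forwards requests to a backend application server.
upstream backend {
    server 127.0.0.1:3000;   # e.g. a Node.js or Gunicorn app
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;
    }
}
```

Every 502 and 504 discussed below is a failure somewhere along that proxy_pass hop.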
As a Site Reliability Engineer (SRE), debugging these issues requires a systematic approach: confirming the Nginx process health, verifying system resources, analyzing the error logs, and validating upstream connectivity.
Diagnosing "502 Bad Gateway" and "Connection Refused"
A 502 Bad Gateway error means Nginx successfully accepted the client's request but received an invalid response—or no response at all—from the upstream server.
When you check /var/log/nginx/error.log, you will typically see:
[error] 1234#0: *5678 connect() failed (111: Connection refused) while connecting to upstream
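When the error log is large, a quick tally of the distinct failure modes tells you which problem dominates before you start chasing any single entry. A minimal sketch, assuming the standard log path (override LOG if your distribution logs elsewhere):

```shell
# Count occurrences of the three classic upstream failure messages.
# LOG defaults to the common Debian/RHEL path; the file may not exist on every host.
LOG="${LOG:-/var/log/nginx/error.log}"
if [ -f "$LOG" ]; then
  grep -oE 'Connection refused|Connection timed out|Permission denied' "$LOG" \
    | sort | uniq -c | sort -rn
else
  echo "No error log found at $LOG"
fi
```

A wall of "Connection refused" points at a dead upstream; a wall of "Connection timed out" points at the 504 scenario covered later.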
Step 1: Verify Upstream Health
The nginx connection refused error literally means the operating system rejected the TCP or Unix domain socket connection. Your first step is to verify that the backend is actually running.
systemctl status php8.1-fpm
# or
systemctl status my-node-app
If the service is running, ensure it is listening on the expected port or socket. Use netstat -tulpn or ss -tulpn to verify the bindings.
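That check can be scripted; here is a small sketch where check_port is a helper name invented for this example, and 9000 is assumed as the conventional PHP-FPM TCP port:

```shell
# check_port: reads `ss -tln`-style output on stdin and reports whether
# anything is bound to the given port.
check_port() {
  if grep -qE "[:.]$1[[:space:]]"; then
    echo "upstream listening on port $1"
  else
    echo "nothing listening on port $1 -- expect 502s"
  fi
}

# Pipe the live listener table through the helper; ss may be absent on
# minimal images, hence the fallback to empty input.
(ss -tln 2>/dev/null || true) | check_port "${PORT:-9000}"
```

The same pattern works for any TCP backend: set PORT to whatever your proxy_pass or fastcgi_pass targets.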
Step 2: Addressing "Nginx Permission Denied"
If your upstream relies on Unix sockets (common for PHP-FPM or Gunicorn) instead of TCP ports, you might see:
[error] 1234#0: *5678 connect() to unix:/var/run/php-fpm.sock failed (13: Permission denied) while connecting to upstream
This is a strict file permission issue. Nginx runs under a specific user (usually nginx or www-data). If this user does not have read and write permissions to the socket file, it cannot proxy traffic.
Fix: Check the user running the upstream service. You may need to change the socket owner configuration in your PHP-FPM pool (listen.owner = www-data, listen.group = www-data). On RHEL/CentOS systems, SELinux is often the hidden culprit blocking proxy connections: check its state with getenforce, and if it reports Enforcing, run setsebool -P httpd_can_network_connect 1 to allow Nginx to connect to TCP upstreams.
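In practice the socket fix lives in the PHP-FPM pool file (the path varies by distribution, e.g. /etc/php/8.1/fpm/pool.d/www.conf on Debian-family systems); a sketch assuming Nginx runs as www-data:

```ini
; Align the socket's owner and mode with the user Nginx runs as.
listen = /var/run/php-fpm.sock
listen.owner = www-data
listen.group = www-data
listen.mode = 0660
```

After editing, restart the pool (systemctl restart php8.1-fpm) and confirm the new ownership with ls -l /var/run/php-fpm.sock.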
Solving "504 Gateway Timeout" and "Nginx Slow" Issues
Unlike a 502, an nginx 504 Gateway Timeout implies that Nginx established the connection to the upstream, sent the request, but the upstream failed to return a response before the proxy timeout limit was reached.
Log example:
[error] 1234#0: *5678 upstream timed out (110: Connection timed out) while reading response header from upstream
If users complain that your site is nginx slow and it eventually throws a 504, the problem lies in the backend: slow application code, a long-running database query, or an external API call that hangs.
Mitigation and Tuning:
While fixing the application is the true solution, you can temporarily increase Nginx's patience by tuning the timeout directives in nginx.conf or your server block:
location / {
proxy_pass http://backend;
proxy_read_timeout 300s;
proxy_connect_timeout 75s;
proxy_send_timeout 300s;
}
If you are using FastCGI (PHP), adjust the fastcgi_read_timeout directive instead. Keep in mind that infinitely increasing timeouts will eventually tie up all your Nginx worker connections, leading to complete service degradation.
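For PHP backends, the equivalent tuning sits in the FastCGI location block; a sketch reusing the socket path from the log example above (adjust to your pool):

```nginx
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_pass unix:/var/run/php-fpm.sock;
    fastcgi_read_timeout 300s;   # same caveat: raise sparingly
}
```

Run nginx -t and reload after changing either set of timeouts.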
Tackling "Nginx Too Many Connections"
During traffic spikes or DDoS attacks, your server might run out of available connection slots. The error log will clearly state:
[alert] 1234#0: *5678 1024 worker_connections are not enough
How to Fix:
- Open /etc/nginx/nginx.conf.
- In the events block, increase the limit: worker_connections 4096; or higher.
- Ensure worker_processes auto; is set so Nginx spawns one worker per CPU core.
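Put together, the relevant part of nginx.conf looks like this (4096 is an example value; size it to your traffic and file-descriptor limits):

```nginx
# /etc/nginx/nginx.conf (main context)
worker_processes auto;   # one worker per CPU core

events {
    worker_connections 4096;   # per-worker connection limit
}
```

Validate with nginx -t and reload for the change to take effect.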
Kernel Limits:
Nginx cannot open more connections than the kernel grants it file descriptors. If you increase worker_connections to 10000 but your OS limit is 1024, Nginx will still fail. Check the limit for the Nginx user by running su -s /bin/sh nginx -c 'ulimit -n' (the -s flag matters because the nginx user usually has a nologin shell). To increase this permanently, edit /etc/security/limits.conf:
nginx soft nofile 65535
nginx hard nofile 65535
Restart Nginx after making these kernel-level changes.
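Note that on systemd-managed distributions, limits.conf applies to login sessions and may not affect services started at boot; a drop-in override for the unit is the more reliable route (the file path and value shown are illustrative):

```ini
# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=65535
```

Apply it with systemctl daemon-reload followed by systemctl restart nginx. Alternatively, worker_rlimit_nofile 65535; in the main context of nginx.conf raises the limit from within Nginx itself.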
Investigating Nginx Out of Memory, High CPU, and Core Dumps
When a server suffers from nginx high cpu or an nginx out of memory event, the symptoms are severe. The service may abruptly terminate, leaving users with generic browser connection errors.
OOM Killer:
If Nginx consumes all available system RAM—perhaps due to a massive influx of traffic with large payloads, unoptimized proxy_buffers, or a memory leak in a third-party dynamic module—the Linux kernel will terminate it to protect the OS.
Check the kernel logs for OOM termination:
dmesg -T | grep -i oom-killer
If you see nginx listed here, you need to either add more physical RAM/swap, or restrict Nginx's memory footprint by tuning client_max_body_size and optimizing buffer sizes.
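A starting point for that tuning (the values are illustrative; size them to your typical request and response sizes):

```nginx
# http or server context
client_max_body_size 25m;   # reject oversized uploads early
proxy_buffers 8 16k;        # per-connection response buffering
proxy_buffer_size 16k;      # buffer for the response headers
```

Smaller buffers trade memory for more disk buffering under load, so measure before and after.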
Nginx Crash and Core Dumps:
If you encounter a segmentation fault, where Nginx abruptly exits with an nginx failed status and signal 11 (SIGSEGV), you have a deep bug, often related to OpenSSL or compiled third-party modules. (An exit on signal 9, SIGKILL, usually points back to the OOM killer above rather than a crash inside Nginx.) To trace an nginx crash log, you must enable core dumps.
Add this to the top of your nginx.conf (main context):
worker_rlimit_core 500M;
working_directory /tmp/nginx-cores;
Ensure the /tmp/nginx-cores directory exists and is writable by the nginx user. When the nginx crash happens next, a core file (e.g., core.1234) will be written. You can then use the GNU Debugger to analyze the nginx core dump:
gdb /usr/sbin/nginx /tmp/nginx-cores/core.1234
Typing bt (backtrace) in GDB will reveal the exact C function where Nginx crashed, which is invaluable for submitting bug reports or removing the offending module.
Resolving "Nginx Service Not Starting"
Often during deployments, you may find Nginx completely dead, with systemd reporting a failed unit: the classic nginx service not starting or nginx not working scenario. The usual suspects:
- Configuration Syntax: Never restart Nginx without testing the config. Run nginx -t. A simple missing semicolon can prevent the entire master process from booting.
- Port Binding Conflicts: If the error log shows bind() to 0.0.0.0:80 failed (98: Address already in use), another process is hoarding the port. This could be Apache, an orphaned Nginx master process, or another reverse proxy. Find the culprit using netstat -tulpn | grep :80 or lsof -i :80, then stop the offending service gracefully, reserving kill -9 <PID> for processes that ignore a normal termination signal.
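Finding the port's owner can be scripted; a small sketch where holder_of is a helper name invented for this example, parsing the PID/Program column of netstat output:

```shell
# holder_of: reads `netstat -tulpn`-style output on stdin and prints the
# PID/program occupying the given port (empty output means the port is free).
holder_of() {
  grep -E "[:.]$1[[:space:]]" | awk '{print $NF}' | head -n1
}

# netstat may be missing on minimal images, hence the fallback to empty input.
(netstat -tulpn 2>/dev/null || true) | holder_of 80
```

Feed it 443 or any other contested port the same way.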
By systematically verifying upstream health, tuning timeout and connection limits, and deeply analyzing system logs and core dumps, you can ensure your Nginx infrastructure remains resilient under extreme loads.
Bonus: All-in-One Nginx Diagnostic Script
#!/bin/bash
# Nginx Diagnostic Script: Checks syntax, ports, upstream health, and logs
echo "--- 1. Testing Nginx Configuration Syntax ---"
nginx -t
echo -e "\n--- 2. Checking Nginx Process Health ---"
systemctl status nginx --no-pager | grep -i active
echo -e "\n--- 3. Identifying Processes Listening on Port 80/443 ---"
netstat -tulpn | grep -E ':80|:443'
echo -e "\n--- 4. Extracting Recent 502 and 504 Errors from Nginx Logs ---"
if [ -f /var/log/nginx/error.log ]; then
tail -n 500 /var/log/nginx/error.log | grep -E 'Connection refused|timed out|Permission denied|worker_connections'
else
echo "Log file /var/log/nginx/error.log not found."
fi
echo -e "\n--- 5. Checking for Kernel OOM (Out of Memory) Kills ---"
dmesg -T | grep -i 'oom-killer' | tail -n 5
echo -e "\n--- 6. Checking Current File Descriptor Limits (ulimit) ---"
su - nginx -s /bin/bash -c 'ulimit -n'
Error Medic Editorial
Error Medic Editorial is composed of senior Site Reliability Engineers and DevOps architects dedicated to publishing actionable, deeply technical troubleshooting guides for enterprise infrastructure.
Sources
- https://nginx.org/en/docs/http/ngx_http_proxy_module.html
- https://docs.nginx.com/nginx/admin-guide/monitoring/debugging/
- https://serverfault.com/questions/338123/nginx-error-111-connection-refused-while-connecting-to-upstream
- https://www.digitalocean.com/community/tutorials/how-to-optimize-nginx-configuration