Linux System Administration Errors: Complete Troubleshooting Guide
Linux powers the vast majority of production servers, and its errors span everything from kernel panics to misconfigured web servers. As a sysadmin or DevOps engineer, you need to diagnose and fix issues across the full stack: process management, networking, storage, security, and the dozens of services running on any given server.
Linux troubleshooting follows a pattern. Start with the basics: Is the service running? What do the logs say? Is there disk space? Is there memory? Can the server reach the network? From there, drill into service-specific configuration, permissions, and dependencies. The systemd journal (journalctl) and service-specific log files are your primary diagnostic tools.
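As a rough sketch of that first pass, the basic checks can be bundled into one shell function. The function name triage and the unit name in the usage line are illustrative assumptions; the commands themselves are the standard systemd, coreutils, procps, and iproute2 utilities.

```shell
# First-pass triage for one service: state, recent logs, disk, memory, routing.
# The unit name is passed as an argument.
triage() {
    svc="$1"
    echo "== service state =="
    systemctl is-active "$svc"
    echo "== last 20 journal lines =="
    journalctl -u "$svc" -n 20 --no-pager
    echo "== disk space and inodes =="
    df -h
    df -i
    echo "== memory =="
    free -m
    echo "== default route =="
    ip route show default
}

# Usage, with nginx standing in for whichever unit is failing:
#   triage nginx
```

Each section of output maps to one of the questions above, so an empty or alarming section tells you where to drill in next.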
This section covers 64 troubleshooting articles across 18 categories, including web servers (nginx, Apache), databases (MySQL, PostgreSQL, Redis), process management (systemd, cron), security (SELinux, iptables, SSH), file systems (NFS, LVM), networking (HAProxy, Postfix), and containerization (Docker). Each guide addresses specific error messages and symptoms with commands you can run immediately.
The guides assume you have root or sudo access to the server. They focus on CentOS/RHEL and Ubuntu/Debian as the most common server distributions, but most commands and concepts apply across all Linux distributions. When distribution-specific steps differ, both paths are covered.
Common Patterns & Cross-Cutting Themes
Service Management & systemd Failures
"Job for [service] failed" is the most common systemd error. The fix starts with reading the actual error: run systemctl status <service> and journalctl -xeu <service> to see the full output. Common causes: syntax errors in config files (always run the service's config test command first — nginx -t, apachectl configtest, named-checkconf), missing dependencies, wrong file permissions on config or data directories, and port conflicts with another service.
Units that enter a "failed" state won't restart automatically unless the unit file sets a restart policy such as Restart=on-failure. After fixing the issue, run systemctl daemon-reload if you changed the unit file, then systemctl restart <service>. For services that fail repeatedly, check for resource exhaustion (memory, file descriptors, disk) that might be killing the process shortly after startup.
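One way to add a restart policy without editing the packaged unit file is a drop-in override. The sketch below is a minimal example under stated assumptions: the function name write_restart_dropin, the unit name myapp, and the RestartSec=5 delay are all hypothetical choices for illustration.

```shell
# Write a systemd drop-in that restarts a unit after a failure.
# unit: unit name without the .service suffix.
# base: directory holding overrides; /etc/systemd/system is the usual
#       location for administrator overrides.
write_restart_dropin() {
    unit="$1"
    base="${2:-/etc/systemd/system}"
    dropin_dir="$base/$unit.service.d"
    mkdir -p "$dropin_dir" || return 1
    # RestartSec=5 is an assumed delay for this sketch, not a recommendation.
    cat > "$dropin_dir/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=5
EOF
    echo "$dropin_dir/restart.conf"
}

# Usage (as root), with a hypothetical unit called myapp:
#   write_restart_dropin myapp
#   systemctl daemon-reload
#   systemctl restart myapp
```

The daemon-reload step matters here for the same reason as with any unit file change: systemd does not pick up the new drop-in until it rereads its configuration.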
Enable and start are different: enable makes the service start at boot, start runs it now. You usually want both: systemctl enable --now <service>.
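The steps in this section can be combined so that a broken config never reaches a restart. The following is a sketch, not a complete tool: the function names are made up for illustration, and the service-to-command mapping covers only the three config test commands named above.

```shell
# Run the config test for services that have one.
config_test() {
    case "$1" in
        nginx)          nginx -t ;;
        apache2|httpd)  apachectl configtest ;;
        named|bind9)    named-checkconf ;;
        *)              echo "no config test known for $1, skipping" >&2 ;;
    esac
}

# Restart only if the config test passes; if the restart fails, show why.
safe_restart() {
    svc="$1"
    if ! config_test "$svc"; then
        echo "config test failed, not restarting $svc" >&2
        return 1
    fi
    if ! systemctl restart "$svc"; then
        systemctl status "$svc" --no-pager
        journalctl -xeu "$svc" --no-pager
        return 1
    fi
    echo "$svc restarted"
}

# Usage:
#   safe_restart nginx
#   systemctl enable --now nginx   # also start it at boot
```

Testing first is the design choice that matters: a failed config test leaves the running service untouched, while a failed restart can leave it down.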
Permission & SELinux Issues
Permission denied errors on Linux have two layers: traditional Unix permissions (user/group/other with rwx bits) and mandatory access control (SELinux on RHEL/CentOS, AppArmor on Ubuntu). Even root can be blocked by SELinux.
For traditional permissions: check ownership (ls -la), ensure the service user can read configs and write to data directories, and verify execute permission on binaries and scripts. For services accessing files in non-standard locations, check every parent directory too: a user needs execute (search) permission on each directory in the path, so a 755 file inside a 700 directory is still inaccessible to everyone except the directory's owner.
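To see the whole chain of parent directories at once, a small loop over the path works. This is a sketch assuming GNU stat (for the -c format option); the function name path_modes is illustrative. On systems with util-linux, namei -l <path> prints similar information.

```shell
# Print mode, owner, and group for a path and each of its parent directories,
# from the file itself up to the filesystem root.
path_modes() {
    p="$1"
    while :; do
        stat -c '%a %U:%G %n' "$p" || return 1
        if [ "$p" = "/" ] || [ "$p" = "." ]; then
            break
        fi
        p="$(dirname "$p")"
    done
}

# Usage, with a hypothetical path:
#   path_modes /srv/app/config/app.conf
```

Any directory in the output that lacks the execute bit for the service user's class (owner, group, or other) blocks access to everything beneath it.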
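For SELinux: check whether it's the blocker by looking in /var/log/audit/audit.log for "avc: denied" messages, or pipe recent denials through audit2why (for example, ausearch -m avc -ts recent | audit2why). Common fixes: set the correct SELinux context with semanage fcontext followed by restorecon (chcon also works, but its change is lost on the next relabel), or allow a specific action with setsebool -P so the setting persists across reboots. Don't reflexively disable SELinux; it's a critical security layer. If no existing context or boolean fits, use audit2allow to generate a targeted policy module instead.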
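The check-then-relabel workflow can be sketched as two small functions. The function names are illustrative, and the type httpd_sys_content_t and directory /srv/www in the usage lines are assumed examples, not values taken from any particular guide. Note that semanage fcontext -a reports an error if a rule for the same path pattern already exists.

```shell
# List AVC denials from an audit log file.
# The regex tolerates one or more spaces between "avc:" and "denied".
avc_denials() {
    grep -E 'avc:[[:space:]]+denied' "${1:-/var/log/audit/audit.log}"
}

# Record a persistent file context rule for a directory tree, then apply it.
relabel_tree() {
    setype="$1"   # SELinux type to assign
    dir="$2"      # top of the directory tree
    semanage fcontext -a -t "$setype" "${dir}(/.*)?" || return 1
    restorecon -Rv "$dir"
}

# Usage (as root):
#   avc_denials | tail -n 5
#   ausearch -m avc -ts recent | audit2why
#   relabel_tree httpd_sys_content_t /srv/www
```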
Disk Space & Filesystem Problems
"No space left on device" can mean either disk space or inode exhaustion. Check both: df -h for space and df -i for inodes. The biggest space consumers are usually logs (/var/log), package caches (/var/cache), old kernels, and Docker images/volumes.
For immediate relief: find the largest directories with du -sh /var/* | sort -h and remove what you don't need, clean package caches (apt clean or yum clean all), rotate and compress logs, and remove unused Docker data with docker system prune (add -a to also remove unused images that are not dangling). For inodes, find directories with millions of small files using find / -xdev -type d | while IFS= read -r d; do echo "$(find "$d" -maxdepth 1 | wc -l) $d"; done | sort -n | tail.
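The inode search reads more easily as a function. This is a sketch assuming GNU find (for -mindepth and -maxdepth); the name inode_hogs and its two parameters are illustrative. Directory names containing newlines are not handled.

```shell
# Rank directories by number of direct entries, staying on one filesystem.
# root: where to start (defaults to /); top: how many rows to show.
inode_hogs() {
    root="${1:-/}"
    top="${2:-10}"
    find "$root" -xdev -type d 2>/dev/null | while IFS= read -r d; do
        # -mindepth 1 keeps the directory itself out of its own count
        n=$(find "$d" -mindepth 1 -maxdepth 1 2>/dev/null | wc -l)
        echo "$((n)) $d"
    done | sort -n | tail -n "$top"
}

# Usage:
#   inode_hogs /var 5
```

Starting from a subtree such as /var instead of / keeps the scan short on large systems, since one find runs per directory.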
For LVM-based systems, you can extend logical volumes online if the volume group has free space. For cloud instances, extend the EBS/disk volume in the cloud console, then grow the partition and filesystem. Always set up monitoring to alert before disk usage reaches 85%.
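Both the threshold check and the LVM extension can be sketched in a few lines. The function names, the +10G size, and the /dev/vg0/data volume path are hypothetical; lvextend -r (--resizefs) is the LVM option that grows the filesystem together with the logical volume.

```shell
# Print mount points at or above a usage threshold (percent, default 85).
# Parses POSIX-format df output; mount points containing spaces are not handled.
over_threshold() {
    limit="${1:-85}"
    df -P | awk -v limit="$limit" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6, use "%"
        }'
}

# Grow a logical volume and its filesystem in one step.
grow_lv() {
    size="$1"   # for example +10G
    lv="$2"     # for example /dev/vg0/data
    vgs || return 1              # review free space in the volume group first
    lvextend -r -L "$size" "$lv"
}

# Usage (as root):
#   over_threshold 85
#   grow_lv +10G /dev/vg0/data
```

A function like over_threshold could run from cron and feed whatever alerting you already use. For a cloud disk that has been enlarged, the partition and filesystem are typically grown with growpart followed by resize2fs (ext4) or xfs_growfs (XFS).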
Network & Connectivity Troubleshooting
Network issues on Linux follow a diagnostic ladder: can you ping the target (layer 3)? Can you connect to the port (layer 4)? Does the application respond correctly (layer 7)? Use ping, ss/netstat, curl, and tcpdump at each layer.
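The ladder translates directly into a function that stops at the first failing layer. This sketch assumes an nc build that supports -z and -w (OpenBSD netcat and recent ncat do) and the iputils ping, whose -W option sets the reply timeout in seconds. The function name and return codes are illustrative.

```shell
# Walk the diagnostic ladder for one target; stop at the first failing layer.
# host, port, and url are supplied by the caller.
net_ladder() {
    host="$1"; port="$2"; url="$3"
    if ! ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
        echo "layer 3: no ICMP reply from $host"
        return 3
    fi
    if ! nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
        echo "layer 4: cannot connect to $host:$port"
        return 4
    fi
    # -f makes curl fail on HTTP status codes of 400 and above
    if ! curl -fsS -m 5 -o /dev/null "$url"; then
        echo "layer 7: $url did not return a successful response"
        return 7
    fi
    echo "all layers ok"
}

# Usage, with placeholder values:
#   net_ladder app.example.com 443 https://app.example.com/health
```

A layer 3 failure is not always conclusive, because some networks drop ICMP while still passing TCP; if ping fails, it can be worth trying the port check by hand before concluding the host is down.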
For services that won't bind to a port or can't be reached on it: check if something else is using it (ss -tlnp | grep <port>), verify the service configuration specifies the right listen address (0.0.0.0 for all interfaces vs. 127.0.0.1 for loopback only), and check firewall rules (iptables -L -n, nft list ruleset, or firewall-cmd --list-all). For outbound connectivity issues, check default routes (ip route), DNS resolution (dig or nslookup), and whether a proxy is required.
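A plain grep for a port number also matches longer ports (80 matches 8080). The sketch below filters ss output on the exact port instead; the function name who_listens is illustrative.

```shell
# Show listening TCP sockets bound to an exact port.
# Run as root so that -p can show processes owned by other users.
who_listens() {
    port="$1"
    ss -tlnp | awk -v suffix=":$port" '
        # field 4 is Local Address:Port; compare only its trailing ":port"
        substr($4, length($4) - length(suffix) + 1) == suffix'
}

# Usage:
#   who_listens 80
```

The local address column in the output also answers the listen-address question: 127.0.0.1:80 is reachable only from the server itself, while 0.0.0.0:80 or [::]:80 accepts connections on all interfaces.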
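SSH connection problems deserve special mention since SSH is your lifeline to the server. "Connection refused" means the host is reachable but nothing accepted the connection on that port: sshd isn't running, it listens on a different port, or a firewall is actively rejecting the connection. "Connection timed out" usually means packets are being dropped, most often by a firewall or security group blocking the SSH port (22 by default), though an unreachable or powered-off host looks the same. "Permission denied (publickey)" means the server rejected your key: the public key isn't in the target user's authorized_keys, you're offering the wrong key or username, or the permissions are too open (use 600 for private keys and authorized_keys, 700 for ~/.ssh).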
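The permission fix is mechanical enough to script. This is a minimal sketch: the function name fix_ssh_perms is illustrative, and it assumes private keys follow the default id_* naming, so keys stored under other names would need the same chmod by hand.

```shell
# Tighten permissions on an SSH directory (defaults to ~/.ssh).
fix_ssh_perms() {
    dir="${1:-$HOME/.ssh}"
    chmod 700 "$dir" || return 1
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys"
    fi
    for key in "$dir"/id_*; do
        case "$key" in
            *.pub) continue ;;   # public keys do not need to be private
        esac
        if [ -f "$key" ]; then
            chmod 600 "$key"
        fi
    done
    return 0
}

# Usage:
#   fix_ssh_perms            # on the client or the server account
#   ssh -v user@host         # verbose output shows which keys are offered
```

If the key is still rejected after this, the verbose client output and the sshd log on the server usually show whether the key was offered and why it was refused.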
Quick Troubleshooting Guide
| Symptom | Likely Cause | First Step |
|---|---|---|
| Service failed to start | Config syntax error or port conflict | Run config test (nginx -t, apachectl configtest); check journalctl -xeu <service> |
| Permission denied (not SELinux) | Wrong file ownership or mode | Check ls -la; fix with chown/chmod; verify parent directory permissions |
| Permission denied (SELinux) | Wrong SELinux context on files | Check audit.log for 'avc: denied'; fix with semanage fcontext + restorecon |
| No space left on device | Disk full or inode exhaustion | Check df -h and df -i; clean logs, caches, old packages; extend volume if possible |
| SSH connection refused | sshd not running or wrong port | Verify sshd status; check /etc/ssh/sshd_config for Port; check firewall |
| SSH permission denied (publickey) | Key not in authorized_keys or wrong permissions | Check ~/.ssh/authorized_keys; ensure 600 on key files, 700 on .ssh dir |
| OOM killer terminated process | Server out of memory | Check dmesg for OOM messages; reduce service memory usage or add swap/RAM |
| Cron job not running | Bad crontab syntax, wrong PATH, or service disabled | Check crontab -l; verify PATH in crontab; check /var/log/cron (CentOS/RHEL), /var/log/syslog (Ubuntu/Debian), or journalctl -u crond (-u cron on Ubuntu/Debian) |
| NFS mount hanging or timing out | Firewall blocking NFS ports or server unreachable | Check NFS server status; verify firewall allows ports 111, 2049; test with showmount |
| High CPU / load average | Runaway process or resource contention | Run top/htop; identify the process; check for infinite loops or fork bombs |
Category Deep Dives
Apache
Cron
Docker
HAProxy
iptables
Memcached
MySQL
NFS
Nginx
Other
Postfix
PostgreSQL
Redis
- AWS Redis Connection Refused: Troubleshooting ECONNREFUSED and tcp 127.0.0.1:6379
- Redis 'Connection Refused': Complete Troubleshooting Guide for Crashes, OOM, High CPU, and Service Failures
- Redis Connection Refused: Complete Troubleshooting Guide (Connection Refused, Crash, OOM, Permission Denied)