Error Medic

Linux System Administration Errors: Complete Troubleshooting Guide

Linux powers the vast majority of production servers, and its errors span everything from kernel panics to misconfigured web servers. As a sysadmin or DevOps engineer, you need to diagnose and fix issues across the full stack: process management, networking, storage, security, and the dozens of services running on any given server.

Linux troubleshooting follows a pattern. Start with the basics: Is the service running? What do the logs say? Is there disk space? Is there memory? Can the server reach the network? From there, drill into service-specific configuration, permissions, and dependencies. The systemd journal (journalctl) and service-specific log files are your primary diagnostic tools.
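The checklist above can be sketched as a single shell function. This is a minimal sketch rather than a definitive tool: the names triage and have are hypothetical, it assumes a systemd-based Linux host, and each step is skipped when its command is missing.

```shell
# First-pass triage sketch (hypothetical helper names: have, triage).
# Usage: triage <service> [host-to-ping]

have() { command -v "$1" >/dev/null 2>&1; }

triage() {
    svc=$1
    target=${2:-}

    echo "== service: $svc =="
    if have systemctl; then
        # Prints active / inactive / failed; a non-zero exit is expected here.
        systemctl is-active "$svc" 2>&1 || true
    else
        echo "systemctl not found"
    fi

    echo "== recent logs =="
    if have journalctl; then
        journalctl -u "$svc" -n 20 --no-pager 2>&1 || true
    else
        echo "journalctl not found"
    fi

    echo "== disk space and inodes =="
    df -h 2>&1 || true
    df -i 2>&1 || true

    echo "== memory =="
    if have free; then
        free -h 2>&1 || true
    else
        echo "free not found"
    fi

    echo "== network =="
    if [ -n "$target" ] && have ping; then
        # One probe, two-second wait (Linux iputils semantics for -W).
        ping -c 1 -W 2 "$target" 2>&1 || echo "ping to $target failed"
    else
        echo "no target given (or ping not found); skipped"
    fi
}
```

For example, triage nginx 192.0.2.1 prints the five sections in order; the service name and address are placeholders.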

This section covers 64 troubleshooting articles across 18 categories, including web servers (nginx, Apache), databases (MySQL, PostgreSQL, Redis), process management (systemd, cron), security (SELinux, iptables, SSH), file systems (NFS, LVM), networking (HAProxy, Postfix), and containerization (Docker). Each guide addresses specific error messages and symptoms with commands you can run immediately.

The guides assume you have root or sudo access to the server. They focus on CentOS/RHEL and Ubuntu/Debian as the most common server distributions, but most commands and concepts apply across all Linux distributions. When distribution-specific steps differ, both paths are covered.

Common Patterns & Cross-Cutting Themes

Service Management & systemd Failures

"Job for [service] failed" is the most common systemd error. The fix starts with reading the actual error: run systemctl status <service> and journalctl -xeu <service> to see the full output. Common causes: syntax errors in config files (always run the service's config test command first — nginx -t, apachectl configtest, named-checkconf), missing dependencies, wrong file permissions on config or data directories, and port conflicts with another service.

Units that enter a "failed" state won't restart automatically unless you've configured Restart=on-failure in the unit file. After fixing the issue, run systemctl daemon-reload if you changed the unit file, then systemctl restart <service>. For services that fail repeatedly, check for resource exhaustion (memory, file descriptors, disk) that might be killing the process shortly after startup.

Enable and start are different: enable makes the service start at boot, start runs it now. You usually want both: systemctl enable --now <service>.
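One way to add Restart=on-failure without editing the packaged unit file is a systemd drop-in. The sketch below assumes the standard drop-in location /etc/systemd/system/<unit>.d/; the function name write_restart_dropin, the file name restart.conf, and the five-second RestartSec are illustrative choices, not values from this guide.

```shell
# write_restart_dropin: hypothetical helper that writes a drop-in enabling
# automatic restarts for one unit.
# Usage: write_restart_dropin <unit> [drop-in root]
write_restart_dropin() {
    unit=$1                          # e.g. nginx.service
    base=${2:-/etc/systemd/system}   # pass another directory for a dry run
    dir="$base/$unit.d"

    mkdir -p "$dir" || return 1
    cat > "$dir/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=5s
EOF
    # Print the path so the caller can inspect the result.
    echo "$dir/restart.conf"
}
```

Run as root without the second argument, this writes under /etc/systemd/system. Follow it with systemctl daemon-reload and systemctl restart <service> so systemd picks up the change.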

Permission & SELinux Issues

Permission denied errors on Linux have two layers: traditional Unix permissions (user/group/other with rwx bits) and mandatory access control (SELinux on RHEL/CentOS, AppArmor on Ubuntu). Even root can be blocked by SELinux.

For traditional permissions: check ownership (ls -la), ensure the service user can read configs and write to data directories, and verify execute permission on binaries and scripts. For services accessing files in non-standard locations, check the parent directory permissions too: reaching a file requires execute (search) permission on every directory in its path, so a 755 file inside a 700 directory is still inaccessible to everyone except the directory's owner and root.
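The parent-directory check can be sketched as a loop that lists every component of a path. The function name walk_perms is hypothetical; where util-linux is installed, namei -l <path> prints similar information.

```shell
# walk_perms: hypothetical helper that runs ls -ld on a path and on each of
# its parent directories, so a restrictive directory higher up stands out.
# Usage: walk_perms /srv/app/config/app.conf
walk_perms() {
    p=$1
    while :; do
        ls -ld "$p" || return 1
        # Stop at the filesystem root (or at "." for a relative path).
        case $p in
            /|.) break ;;
        esac
        p=$(dirname "$p")
    done
}
```

Read the output from the bottom up: the first directory that lacks x for the service user (as owner, group, or other) is the one blocking access.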

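For SELinux: check if it's the blocker by looking at /var/log/audit/audit.log for "avc: denied" messages, and pipe them through audit2why for an explanation (for example, ausearch -m avc -ts recent | audit2why). Common fixes: set the correct file context persistently with semanage fcontext followed by restorecon (chcon also works, but its change is lost on the next relabel), or enable the relevant SELinux boolean with setsebool -P. Don't reflexively disable SELinux; it's a critical security layer. If no context or boolean fits, use audit2allow to generate a targeted policy module instead.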
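Before picking a fix, it helps to see which processes are being denied and against which target context. The sketch below summarizes AVC denials read from standard input. The function name avc_summary is hypothetical, and it assumes the usual audit record layout with space-separated comm=, tcontext= and tclass= fields (a comm value containing spaces would be truncated).

```shell
# avc_summary: hypothetical helper that counts AVC denials per
# (command, object class, target context) from audit records on stdin.
# Usage: avc_summary < /var/log/audit/audit.log
avc_summary() {
    awk '
        /avc: +denied/ {
            comm = tctx = tclass = "?"
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^comm=/)     { comm = $i; sub(/^comm=/, "", comm); gsub(/"/, "", comm) }
                if ($i ~ /^tcontext=/) { tctx = substr($i, 10) }
                if ($i ~ /^tclass=/)   { tclass = substr($i, 8) }
            }
            count[comm " " tclass " " tctx]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -rn
}
```

A typical call is sudo ausearch -m avc -ts recent | avc_summary. If the target context turns out to be wrong for the path, semanage fcontext -a -t <type> '<path>(/.*)?' followed by restorecon -Rv <path> applies a persistent fix; the type and path are yours to fill in.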

Disk Space & Filesystem Problems

"No space left on device" can mean either disk space or inode exhaustion. Check both: df -h for space and df -i for inodes. The biggest space consumers are usually logs (/var/log), package caches (/var/cache), old kernels, and Docker images/volumes.

For immediate relief: find the largest directories with du -sh /var/* | sort -h and remove what you don't need, clean package caches (apt clean, or yum clean all / dnf clean all), rotate and compress logs, and reclaim Docker space with docker system prune (add -a to also remove unused images, not just dangling ones). For inodes, find directories with millions of small files using find / -xdev -type d | while read -r d; do echo "$(find "$d" -maxdepth 1 | wc -l) $d"; done | sort -n | tail.
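The inode search can also be done in a single pass over the filesystem, which tends to be faster than running a second find per directory. This is a sketch under stated assumptions: the function name inode_hogs is hypothetical, it counts regular files only (an approximation of inode use), and file names containing newlines would skew the counts.

```shell
# inode_hogs: hypothetical helper that lists the directories holding the most
# regular files, staying on one filesystem.
# Usage: inode_hogs [start directory] [how many results]
inode_hogs() {
    root=${1:-/}
    n=${2:-10}
    find "$root" -xdev -type f 2>/dev/null |
        # Strip the file name to keep its directory; files directly in / map to "/".
        sed -e 's|/[^/]*$||' -e 's|^$|/|' |
        sort | uniq -c | sort -rn | head -n "$n"
}
```

For example, sudo inode_hogs /var 5 prints the five directories under /var with the most files, each line being a count followed by the directory.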

For LVM-based systems, you can extend logical volumes online if the volume group has free space. For cloud instances, extend the EBS/disk volume in the cloud console, then grow the partition and filesystem. Always set up monitoring to alert before disk reaches 85%.
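As a stand-in until real monitoring is in place, the 85% threshold can be checked with a small filter over df -P output. The function name check_df is hypothetical, and the sketch assumes POSIX df -P columns and mount points without spaces.

```shell
# check_df: hypothetical helper that reads `df -P` output on stdin and prints
# every mount point whose usage is at or above a percentage threshold.
# Usage: df -P | check_df 85
check_df() {
    awk -v limit="${1:-85}" '
        NR > 1 {                      # skip the header row
            use = $5
            sub(/%$/, "", use)        # "90%" -> "90"
            if (use + 0 >= limit + 0) print $6, $5
        }'
}
```

Run from cron, df -P | check_df 85 produces output only when a filesystem crosses the threshold, which is enough to trigger a cron mail. On LVM, a command along the lines of lvextend -r -L +5G /dev/<vg>/<lv> grows the logical volume and its filesystem in one step; the size and volume names are placeholders.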

Network & Connectivity Troubleshooting

Network issues on Linux follow a diagnostic ladder: can you ping the target (layer 3)? Can you connect to the port (layer 4)? Does the application respond correctly (layer 7)? Use ping, ss/netstat, curl, and tcpdump at each layer.

For services that won't bind to a port or can't be reached on it: check if something else is using the port (ss -tlnp | grep <port>), verify the service configuration specifies the right listen address (0.0.0.0 accepts connections on all interfaces, 127.0.0.1 accepts local connections only), and check firewall rules (iptables -L -n, nft list ruleset, or firewall-cmd --list-all). For outbound connectivity issues, check default routes (ip route), DNS resolution (dig or nslookup), and whether a proxy is required.
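The listen-address check can be sketched as a filter over ss -tln output. The function name listeners_on is hypothetical, and the sketch assumes the usual ss column layout in which the fourth column is Local Address:Port.

```shell
# listeners_on: hypothetical helper that reads `ss -tln` output on stdin and
# reports which addresses listen on a given TCP port.
# Usage: ss -tln | listeners_on 5432
listeners_on() {
    awk -v port="$1" '
        NR > 1 {                                   # skip the header row
            n = split($4, parts, ":")              # port is the last ":" field
            if (parts[n] != port) next
            addr = substr($4, 1, length($4) - length(parts[n]) - 1)
            scope = (addr ~ /^127\./ || addr == "[::1]") ? "loopback-only" : "non-loopback"
            print addr, scope
        }'
}
```

A service that reports only loopback-only addresses is running but unreachable from other hosts, which points at its listen setting rather than at the firewall.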

SSH connection problems deserve special mention since SSH is your lifeline to the server. "Connection refused" usually means sshd isn't running or is listening on a different port. "Connection timed out" usually means a firewall is dropping traffic to port 22 or the host is unreachable. "Permission denied (publickey)" means the server didn't accept any key you offered: your public key isn't in authorized_keys, or the file permissions are too open (use 700 for ~/.ssh and 600 for authorized_keys and private keys).
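The permission fix for "Permission denied (publickey)" can be sketched as follows. The function name fix_ssh_perms is hypothetical, and the id_* pattern assumes default key file names; keys stored under other names need the same chmod 600 by hand.

```shell
# fix_ssh_perms: hypothetical helper that applies the modes described above
# to an SSH directory and then lists the result.
# Usage: fix_ssh_perms [ssh directory, default ~/.ssh]
fix_ssh_perms() {
    dir=${1:-$HOME/.ssh}
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi

    chmod 700 "$dir"
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys"
    fi

    # Private keys only; public keys (*.pub) are left untouched.
    for key in "$dir"/id_*; do
        case $key in
            *.pub) continue ;;
        esac
        if [ -f "$key" ]; then
            chmod 600 "$key"
        fi
    done

    ls -la "$dir"
}
```

Run it as the account that owns the keys, on the server for authorized_keys and on the client for private keys. If the error persists, ssh -v <host> shows which keys the client actually offers.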

Quick Troubleshooting Guide

| Symptom | Likely Cause | First Step |
| --- | --- | --- |
| Service failed to start | Config syntax error or port conflict | Run config test (nginx -t, apachectl configtest); check journalctl -xeu <service> |
| Permission denied (not SELinux) | Wrong file ownership or mode | Check ls -la; fix with chown/chmod; verify parent directory permissions |
| Permission denied (SELinux) | Wrong SELinux context on files | Check audit.log for 'avc: denied'; fix with semanage fcontext + restorecon |
| No space left on device | Disk full or inode exhaustion | Check df -h and df -i; clean logs, caches, old packages; extend volume if possible |
| SSH connection refused | sshd not running or wrong port | Verify sshd status; check /etc/ssh/sshd_config for Port; check firewall |
| SSH permission denied (publickey) | Key not in authorized_keys or wrong permissions | Check ~/.ssh/authorized_keys; ensure 600 on key files, 700 on .ssh dir |
| OOM killer terminated process | Server out of memory | Check dmesg for OOM messages; reduce service memory usage or add swap/RAM |
| Cron job not running | Bad crontab syntax, wrong PATH, or service disabled | Check crontab -l; verify PATH in crontab; check /var/log/cron or journalctl |
| NFS mount hanging or timing out | Firewall blocking NFS ports or server unreachable | Check NFS server status; verify firewall allows ports 111, 2049; test with showmount |
| High CPU / load average | Runaway process or resource contention | Run top/htop; identify the process; check for infinite loops or fork bombs |

Frequently Asked Questions