Error Medic

Kubernetes CrashLoopBackOff: How to Fix 'back-off restarting failed container'

Resolve Kubernetes CrashLoopBackOff errors fast. Learn to diagnose 'back-off restarting failed container', fix OOMKilled, and debug CoreDNS or Alertmanager.

Key Takeaways
  • CrashLoopBackOff means a pod keeps crashing immediately after starting, triggering an exponential restart delay by the Kubelet.
  • Common root causes include application panics, OOMKilled (Exit Code 137), missing Secrets/ConfigMaps, and failing Liveness Probes.
  • Use 'kubectl describe pod <pod-name>' to identify the specific exit code and view recent pod events.
  • If the current logs are empty, use 'kubectl logs <pod-name> --previous' to view the crashed container's output before the restart.
Diagnostic Approaches Compared
| Method | When to Use | Time | Risk |
| --- | --- | --- | --- |
| kubectl describe pod | First step to check pod events, state, and container exit codes. | < 1 min | None |
| kubectl logs --previous | To see the application error output right before the crash occurred. | < 1 min | None |
| kubectl get events | For cluster-wide context, such as node resource starvation or image pull issues. | 1-2 mins | None |
| kubectl debug / Sleep Override | When logs are missing or an interactive shell is needed in the failing environment. | 5-10 mins | Low (modifies deployment) |

Understanding the CrashLoopBackOff Error

When working with Kubernetes, encountering a pod in the CrashLoopBackOff state is a rite of passage for any DevOps engineer or SRE. You deploy your application, check the kubernetes pod status, and instead of the reassuring Running state, you see CrashLoopBackOff. When you inspect the events, you are greeted with the infamous message: back-off restarting failed container.

But what does CrashLoopBackOff actually mean?

CrashLoopBackOff is not an error in itself. It is a state that indicates a container in your pod is starting, failing (exiting with a non-zero exit code, or exiting when it shouldn't), and then being restarted by the kubelet. Kubernetes employs an exponential back-off delay for these restarts—10 seconds, 20 seconds, 40 seconds, up to a maximum of 5 minutes. This prevents a failing application from consuming excessive node CPU and memory by constantly spinning up and dying.

Step 1: Diagnose the CrashLoopBackOff Status

When a pod is stuck in CrashLoopBackOff, guessing the root cause is a waste of time. You need to gather empirical data from the cluster.

1. Check Pod Status and Events

The first command to run is kubectl describe pod <pod-name>. Scroll down to the Containers section and look for the State and Last State.

Pay close attention to the Exit Code:

  • Exit Code 1: General application error. The code threw an exception, panicked, or encountered a fatal logic error.
  • Exit Code 137: OOMKilled. The container exceeded its memory limit and was terminated forcefully by the Linux kernel.
  • Exit Code 126 or 127: The container command was found but is not executable (126), or was not found in the image at all (127). Usually a typo in command or a missing binary.
  • Exit Code 255: Exit status out of range, often produced when an entrypoint script returns -1 or fails before the application itself starts.
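If you want just the exit code without scrolling through the full describe output, a jsonpath query against the pod status works. This sketch assumes the failing container is the first one listed in the pod; adjust the index otherwise:

```shell
# Print the exit code of the last terminated instance of the first container
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```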

2. Retrieve the Logs (Even When There Are 'No Logs')

A common frustration is running kubectl logs <pod-name> and getting nothing back. This happens because the current container instance just started and hasn't logged anything yet, or because it crashed before it could flush stdout to the logging driver.

To view the logs of the container that actually crashed, append the --previous flag:

kubectl logs <pod-name> --previous

This is the single most powerful command in a CrashLoopBackOff debugging session.

Step 2: Common Root Causes and Fixes

Application Misconfigurations and Panics

The most frequent cause of a deployment stuck in CrashLoopBackOff is a simple application error. This could be a missing environment variable, a typo in a configuration file, or an unhandled exception in the code. Fix: Analyze the --previous logs. Ensure all required ConfigMaps and Secrets are mounted and populated correctly. Double-check the command and args in your deployment YAML.
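As a sketch, a container that loads its environment from a ConfigMap and a Secret looks like the fragment below. The object names are illustrative; if either object does not exist in the namespace, the pod cannot start:

```yaml
containers:
  - name: my-app
    image: my-image:latest
    envFrom:
      - configMapRef:
          name: app-config        # must exist in the same namespace
      - secretRef:
          name: app-credentials   # a missing Secret blocks container startup
```

Note that a missing ConfigMap or Secret referenced this way typically surfaces as CreateContainerConfigError rather than CrashLoopBackOff, while a variable that exists but holds a wrong value crashes the app at runtime.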

Resource Constraints (OOMKilled)

If the reason for the crash loop is OOMKilled (Exit Code 137), your container tried to use more RAM than its defined resources.limits.memory allows. Fix: Increase the memory limit in the deployment manifest, or profile your application locally to identify and fix the memory leak.
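A hedged example of raising the limit; the numbers are illustrative and should be sized from real usage (kubectl top pod is a quick first measurement):

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"   # raise this if the container is OOMKilled at the old ceiling
```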

Failing Liveness Probes

Sometimes the application is fine, but it takes too long to start. If the livenessProbe triggers and fails before the app is ready to accept connections, Kubernetes will kill it, producing a crash loop even though nothing is wrong with the code. Fix: Increase the initialDelaySeconds on the probe, or better yet, configure a startupProbe to give the application dedicated time to initialize before liveness checks begin.
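A minimal sketch of the startupProbe approach; the /healthz path and port 8080 are assumptions about your application's health endpoint:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # 30 checks x 10s = up to 5 minutes to finish starting
  periodSeconds: 10
```

While the startupProbe is running, the liveness probe is disabled, so a slow-starting app is no longer killed mid-initialization.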

Init Container Failures

A failing init container is another variant. Init containers run to completion sequentially before the main app containers start. If an init container fails (e.g., waiting for a database to become ready, running database migrations, or changing volume permissions), the entire pod will be stuck. Fix: Check init container logs specifically using: kubectl logs <pod-name> -c <init-container-name>.
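A common wait-for-dependency pattern looks like this sketch; the service name postgres and port 5432 are placeholders for your actual database Service:

```yaml
initContainers:
  - name: wait-for-db
    image: busybox:1.36
    # Loop until the database Service answers on its port, then exit 0
    command: ['sh', '-c', 'until nc -z postgres 5432; do echo waiting for db; sleep 2; done']
```

If the dependency never comes up, this init container loops forever and the pod stays in Init, so pair it with monitoring rather than treating it as a fix for a down database.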

Specific Component Troubleshooting

CoreDNS and CNI (Calico/Flannel)

Infrastructure pods can also enter this state. A CoreDNS crash loop is frequently caused by a forwarding loop in the host node's resolv.conf. If Calico or Flannel node pods are crash-looping, check the node's network interfaces, ensure IP forwarding is enabled at the OS level, and verify that there are no overlapping CIDR blocks between your host VPC and your internal pod network.
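CoreDNS ships a loop plugin that detects a forwarding loop and deliberately shuts the process down, which is exactly what produces the crash loop. A sketch of the relevant stanza in the Corefile (stored in the coredns ConfigMap in kube-system on most distributions):

```text
.:53 {
    errors
    health
    loop                          # halts CoreDNS when a forward loop is detected
    forward . /etc/resolv.conf    # if resolv.conf points back at CoreDNS, the loop fires
    cache 30
}
```

The usual fix is to point forward at a real upstream resolver, or to configure the kubelet's resolvConf setting to a file that does not reference the cluster DNS address.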

Prometheus and Alertmanager

When an Alertmanager pod fails, it's almost always a configuration syntax error. Alertmanager is notoriously strict about its alertmanager.yaml file. If there is a YAML indentation error or an invalid receiver configuration, it will crash immediately on startup. Fix: Use the amtool check-config command locally to validate the syntax before applying your ConfigMap. Check the --previous logs for the exact line number of the parsing error.
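The validation step can be run on your workstation before anything touches the cluster (amtool ships with the Alertmanager release binaries):

```shell
# Validate the Alertmanager config locally before applying the ConfigMap
amtool check-config alertmanager.yaml
```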

Databases (PostgreSQL)

A PostgreSQL crash loop usually stems from file permission issues on the PersistentVolume (PV). If the PostgreSQL container runs as a non-root user (which is best practice) but the underlying volume is owned by root, it cannot write to its data directory and will crash. Fix: Use an initContainer running as root to chown -R the volume directory to the correct postgres user ID before the main database container starts.
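A sketch of that initContainer; UID 999 matches the postgres user in the official Debian-based image, but verify the UID for the image you actually run:

```yaml
initContainers:
  - name: fix-permissions
    image: busybox:1.36
    # Runs as root so it can reassign ownership of the mounted volume
    command: ['sh', '-c', 'chown -R 999:999 /var/lib/postgresql/data']
    securityContext:
      runAsUser: 0
    volumeMounts:
      - name: pgdata
        mountPath: /var/lib/postgresql/data
```

On volume types that support it, setting securityContext.fsGroup on the pod achieves the same result without a root init container.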

Metrics Server

The metrics server is critical for the Horizontal Pod Autoscaler (HPA) and commands like kubectl top. If the metrics server itself is in CrashLoopBackOff, the cause is frequently a TLS certificate validation failure against the kubelets. Fix: In local or test environments, you often need to add the --kubelet-insecure-tls flag to the metrics server deployment arguments to bypass strict certificate checks against the kubelet API.
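A sketch of the container args after the change; the image tag is illustrative, and the flag should only be used where kubelet certificates are self-signed, such as kind or minikube:

```yaml
containers:
  - name: metrics-server
    image: registry.k8s.io/metrics-server/metrics-server:v0.7.1
    args:
      - --kubelet-insecure-tls                           # test/local clusters only
      - --kubelet-preferred-address-types=InternalIP     # avoids unresolvable node hostnames
```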

Step 3: Advanced Debugging with Sleep

If the container exits so fast that you can't get logs, and you need to inspect the environment (like checking if a volume mounted properly, or testing network connectivity from inside the pod), you can temporarily override the container's command to keep it alive:

containers:
  - name: my-app
    image: my-image:latest
    command: ["sleep", "3600"]

Apply this change, wait for the pod to enter the Running state, and then use kubectl exec -it <pod-name> -- /bin/sh to poke around the filesystem, read mounted secrets, and manually run your application binary to see exactly how and why it fails.

Quick Reference: Diagnostic Commands

# 1. Get the pod's status, exit code, and recent events
kubectl describe pod <pod-name> -n <namespace>

# 2. Check the logs of the PREVIOUSLY crashed container instance
kubectl logs <pod-name> -n <namespace> --previous

# 3. View cluster events sorted by time to identify broader node issues
kubectl get events --sort-by='.metadata.creationTimestamp' -n <namespace>

# 4. (Advanced) Spin up an ephemeral debug container attached to the failing pod
kubectl debug -it <pod-name> --image=busybox:1.28 --target=<container-name>

Error Medic Editorial

Error Medic Editorial is a team of seasoned Site Reliability Engineers and DevOps professionals dedicated to demystifying complex cloud-native infrastructure issues, empowering developers to build and maintain resilient systems.
