How to Fix Kubernetes ImagePullBackOff, CrashLoopBackOff, and OOMKilled Errors
Comprehensive troubleshooting guide for Kubernetes ImagePullBackOff, CrashLoopBackOff, and OOMKilled errors. Learn to fix permissions, timeouts, and crashes.
- ImagePullBackOff usually indicates a typo in the image name, missing registry credentials, or network connectivity issues.
- Network-related pull failures often present as 'kubernetes connection refused', 'kubernetes timeout', or 'kubernetes certificate expired' errors in the pod events.
- If the image pulls successfully but fails to run, you will likely see a CrashLoopBackOff, which requires inspecting container logs.
- Containers terminating with Exit Code 137 are experiencing a 'kubernetes oom killed' event due to exceeding memory limits.
- Quick fix: Run 'kubectl describe pod <pod-name>' and check the 'Events' section at the bottom for the exact 'ErrImagePull' reason.
| Failure Mode | When to Use | Estimated Time | Risk Level |
|---|---|---|---|
| Correct Image/Tag Typo | When 'describe pod' shows 'manifest unknown' or 'not found' | < 5 mins | Low |
| Add imagePullSecrets | When encountering 'kubernetes permission denied' or 'pull access denied' | 10 mins | Low |
| Fix Registry Networking | When seeing 'kubernetes connection refused' or 'kubernetes timeout' | 30+ mins | Medium |
| Adjust Resource Limits | When facing 'kubernetes oom killed' or 'kubernetes out of memory' | 15 mins | Medium |
| Update CA Certificates | When 'kubernetes certificate expired' prevents secure registry access | 20 mins | Medium |
Understanding Kubernetes Pod Failures
When deploying applications to Kubernetes, ensuring that your pods transition smoothly from Pending to Running is the primary goal. However, DevOps engineers frequently encounter roadblocks that leave pods in a failing state. The most notorious of these are ImagePullBackOff and CrashLoopBackOff.
This guide provides a comprehensive framework for diagnosing and resolving these errors, along with related issues like kubernetes oom killed, kubernetes connection refused, and kubernetes permission denied.
Phase 1: Troubleshooting ImagePullBackOff and ErrImagePull
An ImagePullBackOff state means that the Kubelet on the worker node has tried to pull the specified container image, failed, and is now backing off (waiting) before trying again. The actual error causing the failure is typically logged as ErrImagePull.
1. The Typo or Missing Image
The most common and easily fixed cause is a simple typo in the deployment manifest.
Exact Error Message:
Failed to pull image "nginx:1.999": rpc error: code = Unknown desc = Error response from daemon: manifest for nginx:1.999 not found: manifest unknown: manifest unknown
The Fix: Verify the image repository, name, and tag. Ensure the tag exists in your container registry.
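As a concrete illustration, here is a minimal deployment fragment with the image field spelled out. The names (`web`, `nginx:1.25`) are placeholders; the point is that the registry host, repository name, and tag must all be exactly right:

```yaml
# Hypothetical deployment fragment; name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          # Verify all three parts: registry host (if any), repository, and tag.
          # "nginx:1.25" exists; "nginx:1.999" from the error above does not.
          image: nginx:1.25
```

Once corrected, you can roll it out without editing the manifest by hand, e.g. `kubectl set image deployment/web web=nginx:1.25`.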
2. Authentication and Permissions (kubernetes permission denied)
If you are pulling from a private registry (like AWS ECR, Google GCR, or a private Docker Hub repo), the worker node needs credentials.
Exact Error Message:
Failed to pull image "my-private-registry.com/app:latest": rpc error: code = Unknown desc = Error response from daemon: pull access denied for my-private-registry.com/app, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
The Fix: You need to create a Kubernetes Secret containing your registry credentials and link it to your Pod or ServiceAccount using imagePullSecrets.
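A minimal sketch of how the secret is referenced from a pod spec (the secret name `private-reg-cred` and the registry path are placeholders; the secret itself is created with `kubectl create secret docker-registry`, as shown in the command appendix at the end of this guide):

```yaml
# Hypothetical pod spec; the secret must already exist in the same namespace.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
    - name: private-reg-cred   # references the docker-registry secret
  containers:
    - name: app
      image: my-private-registry.com/app:latest
```

Alternatively, patch the ServiceAccount so every pod using it inherits the secret automatically; that approach is also shown in the command appendix.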
3. Network Connectivity (kubernetes connection refused & timeout)
Sometimes the Kubelet cannot reach the registry due to firewall rules, DNS resolution failures, or egress gateway misconfigurations.
Exact Error Messages:
ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.com/v2/: dial tcp 10.0.0.5:443: connect: connection refused
ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded)
The Fix: SSH into the worker node experiencing the issue. Run curl -v https://registry.example.com/v2/. If it hangs or refuses the connection, you must engage your networking team to verify egress security groups, NAT gateway configurations, or proxy settings.
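If the node can only reach the registry through an HTTP proxy, the container runtime needs the proxy settings too; a shell environment variable alone is not enough, because the runtime is a systemd service. A sketch of a systemd drop-in for containerd, with hypothetical proxy endpoints:

```ini
# /etc/systemd/system/containerd.service.d/http-proxy.conf
# Hypothetical proxy address; adjust to your environment. Apply with:
#   systemctl daemon-reload && systemctl restart containerd
[Service]
Environment="HTTP_PROXY=http://proxy.internal:3128"
Environment="HTTPS_PROXY=http://proxy.internal:3128"
# Exclude in-cluster and internal addresses so node-local traffic stays direct.
Environment="NO_PROXY=localhost,127.0.0.1,10.0.0.0/8,.cluster.local"
```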
4. TLS and Certificates (kubernetes certificate expired)
If you run a self-hosted registry or if the registry's SSL certificate has lapsed, Kubelet will refuse to pull the image for security reasons.
Exact Error Message:
ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.com/v2/: x509: certificate has expired or is not yet valid
The Fix: You must renew the TLS certificate on the registry server. If using a self-signed certificate, ensure the CA certificate is distributed to all Kubernetes worker nodes (typically placed in /etc/docker/certs.d/registry.example.com/ca.crt or the containerd equivalent) and restart the container runtime.
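For containerd specifically, a common way to distribute a private CA is a per-registry `hosts.toml`. This sketch assumes containerd 1.5+ with `config_path = "/etc/containerd/certs.d"` set in its registry configuration; the registry hostname is a placeholder:

```toml
# /etc/containerd/certs.d/registry.example.com/hosts.toml
server = "https://registry.example.com"

[host."https://registry.example.com"]
  # Path to the CA certificate that signed the registry's TLS certificate.
  ca = "/etc/containerd/certs.d/registry.example.com/ca.crt"
```

Restart containerd after placing the file so the new trust settings take effect.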
Phase 2: Progressing to CrashLoopBackOff
If the image pulls successfully, you have moved past ImagePullBackOff. However, if the container starts and immediately crashes, Kubernetes restarts it; if the crashes continue, the pod enters the CrashLoopBackOff state, with exponentially increasing delays between restart attempts.
1. Application Misconfiguration (kubernetes crash)
A standard kubernetes crash occurs when the application process exits with a non-zero status code. This could be due to missing environment variables, invalid configuration files, or the application failing to connect to a database.
Diagnostic Step:
You must check the logs of the previous container instance:
kubectl logs <pod-name> --previous
Look for application-level stack traces, such as missing database connection strings or panic events in the code.
2. Memory Limits Exceeded (kubernetes oom killed)
If your pod is terminated abruptly without any application-level error logs, it is likely a kubernetes out of memory event: the Linux kernel's Out-Of-Memory (OOM) killer terminates the container when its memory usage exceeds the limit set in the pod's resources.limits.memory.
Exact Symptom:
When you run kubectl describe pod <pod-name>, look at the State of the container. You will see:
State: Terminated
Reason: OOMKilled
Exit Code: 137
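Exit code 137 is not arbitrary: by Unix convention, an exit code above 128 means the process was killed by a signal, and the signal number is the code minus 128. A quick sketch of the arithmetic:

```shell
#!/bin/sh
# Decode a container exit code: values above 128 mean "killed by signal (code - 128)".
exit_code=137
signal=$((exit_code - 128))
# Signal 9 is SIGKILL, which the kernel OOM killer delivers. SIGKILL cannot be
# caught, so the application gets no chance to log anything -- which is why
# OOMKilled pods often appear to crash "silently".
echo "exit code ${exit_code} => signal ${signal} (SIGKILL)"
```

The same logic explains other codes you may see, such as 143 (128 + 15, SIGTERM from a normal pod shutdown).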
The Fix:
- Short term: Increase the memory limit in the deployment manifest (resources.limits.memory).
- Long term: Profile your application to identify memory leaks. For Java applications, ensure JVM heap sizes (-Xmx) are set lower than the container's memory limit so the JVM can manage its own garbage collection before the kernel kills the entire container.
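A minimal sketch of the relevant manifest fragment; the sizes are illustrative, not recommendations:

```yaml
# Hypothetical container fragment; tune requests/limits to measured usage.
containers:
  - name: app
    image: my-private-registry.com/app:latest
    resources:
      requests:
        memory: "512Mi"   # what the scheduler reserves for the pod
        cpu: "250m"
      limits:
        memory: "1Gi"     # the OOM killer fires when usage exceeds this
```

For the Java case above, a heap flag such as -Xmx768m under a 1Gi limit leaves headroom for non-heap memory (metaspace, threads, native buffers) so the JVM hits its own garbage collector before the kernel hits SIGKILL.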
3. Liveness Probe Failures
Sometimes the application is running fine, but a misconfigured Liveness Probe causes Kubernetes to think it has deadlocked. If the probe fails consecutively (e.g., due to a kubernetes timeout when hitting a /healthz endpoint), the Kubelet will kill and restart the container, leading to CrashLoopBackOff.
The Fix: Review kubectl describe pod events for Liveness probe failed: HTTP probe failed with statuscode: 500 or timeout. Adjust the initialDelaySeconds, timeoutSeconds, or fix the health check endpoint in your application code.
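A sketch of a more forgiving probe configuration; the path, port, and timings are hypothetical and should be tuned to your application's real startup and response times:

```yaml
# Hypothetical probe settings for a container in a pod spec.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to boot before the first probe
  periodSeconds: 10
  timeoutSeconds: 5         # avoid spurious timeout failures on a slow endpoint
  failureThreshold: 3       # restart only after 3 consecutive failures
```

For slow-starting applications, a separate startupProbe is often a cleaner fix than inflating initialDelaySeconds, since it disables the liveness probe until startup has succeeded once.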
Frequently Asked Questions
# 1. Identify the failing pod and get detailed events to spot the exact error
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 15 Events:
# 2. If 'Permission Denied', create a docker-registry secret
kubectl create secret docker-registry private-reg-cred \
--docker-server=https://my-private-registry.com \
--docker-username=my-service-account \
--docker-password=my-super-secret-token \
--docker-email=ops@example.com -n <namespace>
# 3. Patch the default service account to automatically use this secret
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "private-reg-cred"}]}' -n <namespace>
# 4. If CrashLoopBackOff, check logs of the crashed container instance
kubectl logs <pod-name> --previous -n <namespace>
# 5. Check if the pod was OOMKilled (Look for Exit Code 137)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
Error Medic Editorial
The Error Medic Editorial team consists of senior SREs, DevOps engineers, and cloud architects dedicated to providing actionable, real-world solutions for complex infrastructure challenges.