Error Medic

How to Fix Kubernetes ImagePullBackOff, CrashLoopBackOff, and OOMKilled Errors

Comprehensive troubleshooting guide for Kubernetes ImagePullBackOff, CrashLoopBackOff, and OOM killed errors. Learn to fix permissions, timeouts, and crashes.

Key Takeaways
  • ImagePullBackOff usually indicates a typo in the image name, missing registry credentials, or network connectivity issues.
  • Network-related pull failures often present as 'kubernetes connection refused', 'kubernetes timeout', or 'kubernetes certificate expired' errors in the pod events.
  • If the image pulls successfully but fails to run, you will likely see a CrashLoopBackOff, which requires inspecting container logs.
  • Containers terminating with Exit Code 137 are experiencing a 'kubernetes oom killed' event due to exceeding memory limits.
  • Quick fix: Run 'kubectl describe pod <pod-name>' and check the 'Events' section at the bottom for the exact 'ErrImagePull' reason.
Pod Failure Fix Approaches Compared
| Failure Mode | When to Use | Estimated Time | Risk Level |
|---|---|---|---|
| Correct Image/Tag Typo | When 'describe pod' shows 'manifest unknown' or 'not found' | < 5 mins | Low |
| Add imagePullSecrets | When encountering 'kubernetes permission denied' or 'pull access denied' | 10 mins | Low |
| Fix Registry Networking | When seeing 'kubernetes connection refused' or 'kubernetes timeout' | 30+ mins | Medium |
| Adjust Resource Limits | When facing 'kubernetes oom killed' or 'kubernetes out of memory' | 15 mins | Medium |
| Update CA Certificates | When 'kubernetes certificate expired' prevents secure registry access | 20 mins | Medium |

Understanding Kubernetes Pod Failures

When deploying applications to Kubernetes, ensuring that your pods transition smoothly from Pending to Running is the primary goal. However, DevOps engineers frequently encounter roadblocks that leave pods in a failing state. The most notorious of these are ImagePullBackOff and CrashLoopBackOff.

This guide provides a comprehensive framework for diagnosing and resolving these errors, along with related issues like kubernetes oom killed, kubernetes connection refused, and kubernetes permission denied.


Phase 1: Troubleshooting ImagePullBackOff and ErrImagePull

An ImagePullBackOff state means that the Kubelet on the worker node has tried to pull the specified container image, failed, and is now backing off (waiting) before trying again. The actual error causing the failure is typically logged as ErrImagePull.

1. The Typo or Missing Image

The most common and easily fixed cause is a simple typo in the deployment manifest.

Exact Error Message: Failed to pull image "nginx:1.999": rpc error: code = Unknown desc = Error response from daemon: manifest for nginx:1.999 not found: manifest unknown: manifest unknown

The Fix: Verify the image repository, name, and tag. Ensure the tag exists in your container registry.
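Before redeploying, it can save a round trip to confirm the tag actually exists in the registry and then roll the corrected tag out directly. A minimal sketch (the deployment name, container name, and the corrected tag `nginx:1.27` are placeholders for your own values):

```bash
# Confirm the manifest exists in the registry; exits non-zero if the tag is missing
docker manifest inspect nginx:1.27 > /dev/null && echo "tag exists"

# Roll out the corrected image tag without hand-editing the manifest
kubectl set image deployment/<deployment-name> <container-name>=nginx:1.27

# Watch the rollout recover
kubectl rollout status deployment/<deployment-name>
```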

2. Authentication and Permissions (kubernetes permission denied)

If you are pulling from a private registry (like AWS ECR, Google GCR, or a private Docker Hub repo), the worker node needs credentials.

Exact Error Message: Failed to pull image "my-private-registry.com/app:latest": rpc error: code = Unknown desc = Error response from daemon: pull access denied for my-private-registry.com/app, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

The Fix: You need to create a Kubernetes Secret containing your registry credentials and link it to your Pod or ServiceAccount using imagePullSecrets.
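As a sketch of what the wired-up result looks like, assuming the secret is named `private-reg-cred` and already exists in the pod's namespace (the pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
    - name: private-reg-cred   # must live in the same namespace as the pod
  containers:
    - name: app
      image: my-private-registry.com/app:latest
```

Alternatively, patching the namespace's default ServiceAccount with the same `imagePullSecrets` entry lets every pod in that namespace inherit the credential without per-pod changes.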

3. Network Connectivity (kubernetes connection refused & timeout)

Sometimes the Kubelet cannot reach the registry due to firewall rules, DNS resolution failures, or egress gateway misconfigurations.

Exact Error Messages:

  • ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.com/v2/: dial tcp 10.0.0.5:443: connect: connection refused
  • ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded)

The Fix: SSH into the worker node experiencing the issue. Run curl -v https://registry.example.com/v2/. If it hangs or refuses the connection, you must engage your networking team to verify egress security groups, NAT gateway configurations, or proxy settings.
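It is also worth testing from inside the cluster, since node-level and pod-level networking can differ (for example, a cluster egress proxy that only applies to pod traffic). A quick sketch using a throwaway curl pod (`registry.example.com` stands in for your registry):

```bash
# From the affected worker node (via SSH): does the node reach the registry?
curl -v https://registry.example.com/v2/

# From inside the cluster: spin up a disposable pod and repeat the check
kubectl run netcheck --rm -it --restart=Never \
  --image=curlimages/curl -- curl -v https://registry.example.com/v2/
```

If the node succeeds but the in-cluster check fails (or vice versa), that narrows the fault to either the CNI/egress path or the node's own routing and proxy configuration.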

4. TLS and Certificates (kubernetes certificate expired)

If you run a self-hosted registry or if the registry's SSL certificate has lapsed, Kubelet will refuse to pull the image for security reasons.

Exact Error Message: ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.com/v2/: x509: certificate has expired or is not yet valid

The Fix: You must renew the TLS certificate on the registry server. If using a self-signed certificate, ensure the CA certificate is distributed to all Kubernetes worker nodes (typically placed in /etc/docker/certs.d/registry.example.com/ca.crt or the containerd equivalent) and restart the container runtime.
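To confirm the expiry before filing a renewal, you can read the certificate's validity window directly. A sketch, with `registry.example.com` as a placeholder; note that the containerd certificate path shown here assumes containerd's registry `config_path` is enabled, which varies by runtime version and distribution:

```bash
# Print the certificate's notBefore/notAfter dates to confirm it has expired
openssl s_client -connect registry.example.com:443 -servername registry.example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -dates

# On containerd-based nodes, distribute the CA here instead of the docker path,
# then restart the runtime (verify the exact layout for your containerd version)
sudo mkdir -p /etc/containerd/certs.d/registry.example.com
sudo cp ca.crt /etc/containerd/certs.d/registry.example.com/ca.crt
sudo systemctl restart containerd
```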


Phase 2: Progressing to CrashLoopBackOff

If the image pulls successfully, you have resolved ImagePullBackOff. However, if the container starts and then immediately crashes, Kubernetes will restart it. If the crashes continue, Kubernetes puts the pod into a CrashLoopBackOff state, waiting progressively longer between restart attempts.

1. Application Misconfiguration (kubernetes crash)

A standard kubernetes crash occurs when the application process exits with a non-zero status code. This could be due to missing environment variables, invalid configuration files, or the application failing to connect to a database.

Diagnostic Step: You must check the logs of the previous container instance: kubectl logs <pod-name> --previous

Look for application-level stack traces, such as missing database connection strings or panic events in the code.
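The commands below sketch the usual triage order, from current logs to the crashed instance's logs to the raw exit code (pod names are placeholders):

```bash
# Logs from the currently restarting container
kubectl logs <pod-name>

# Logs from the instance that just crashed -- usually where the real error is
kubectl logs <pod-name> --previous

# Numeric exit code of the last termination (e.g. 1 = app error, 137 = OOMKilled)
kubectl get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```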

2. Memory Limits Exceeded (kubernetes oom killed)

If your pod is terminated abruptly without any application-level error logs, it is likely experiencing a kubernetes out of memory event. The Linux kernel's Out-Of-Memory (OOM) killer will terminate processes that consume more memory than they are allocated via the pod's resources.limits.memory.

Exact Symptom: When you run kubectl describe pod <pod-name>, look at the State of the container. You will see: State: Terminated Reason: OOMKilled Exit Code: 137

The Fix:

  1. Short term: Increase the memory limit in the deployment manifest (resources.limits.memory).
  2. Long term: Profile your application to identify memory leaks. For Java applications, ensure JVM heap sizes (-Xmx) are set lower than the container's memory limit so the JVM can manage its own garbage collection before the kernel kills the entire container.
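A minimal sketch of the relevant manifest fragment; the sizes here are illustrative placeholders, not recommendations:

```yaml
resources:
  requests:
    memory: "512Mi"   # what the scheduler reserves on the node for this container
  limits:
    memory: "1Gi"     # the OOM-kill threshold; exceeding it produces Exit Code 137
```

For the Java case, that would mean pairing a `1Gi` limit with something like `-Xmx768m`, leaving headroom for the JVM's non-heap memory (metaspace, threads, direct buffers) so the kernel never has to step in.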

3. Liveness Probe Failures

Sometimes the application is running fine, but a misconfigured Liveness Probe causes Kubernetes to think it has deadlocked. If the probe fails consecutively (e.g., due to a kubernetes timeout when hitting a /healthz endpoint), the Kubelet will kill and restart the container, leading to CrashLoopBackOff.

The Fix: Review kubectl describe pod events for Liveness probe failed: HTTP probe failed with statuscode: 500 or timeout. Adjust the initialDelaySeconds, timeoutSeconds, or fix the health check endpoint in your application code.
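A sketch of a more forgiving probe configuration, assuming the application serves `/healthz` on port 8080 (the timings are illustrative starting points, not universal values):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to boot before the first check
  periodSeconds: 10
  timeoutSeconds: 5         # raise this if the endpoint is slow under load
  failureThreshold: 3       # consecutive failures before the container is killed
```

For slow-starting applications, a dedicated startupProbe is often a cleaner fix than inflating `initialDelaySeconds`, since it gates the liveness probe until the app is actually up.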

Quick Fix Command Reference

```bash
# 1. Identify the failing pod and get detailed events to spot the exact error
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 15 Events:

# 2. If 'Permission Denied', create a docker-registry secret
kubectl create secret docker-registry private-reg-cred \
  --docker-server=https://my-private-registry.com \
  --docker-username=my-service-account \
  --docker-password=my-super-secret-token \
  --docker-email=ops@example.com -n <namespace>

# 3. Patch the default service account to automatically use this secret
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "private-reg-cred"}]}' -n <namespace>

# 4. If CrashLoopBackOff, check logs of the crashed container instance
kubectl logs <pod-name> --previous -n <namespace>

# 5. Check if the pod was OOMKilled (Look for Exit Code 137)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```

Error Medic Editorial

The Error Medic Editorial team consists of senior SREs, DevOps engineers, and cloud architects dedicated to providing actionable, real-world solutions for complex infrastructure challenges.
