How to Fix Kubernetes ImagePullBackOff, CrashLoopBackOff, and OOMKilled Errors
Comprehensive troubleshooting guide for Kubernetes ImagePullBackOff, CrashLoopBackOff, and OOMKilled errors. Learn to fix permissions, timeouts, and crashes.
- ImagePullBackOff usually indicates a typo in the image name, missing registry credentials, or network connectivity issues.
- Network-related pull failures often present as 'kubernetes connection refused', 'kubernetes timeout', or 'kubernetes certificate expired' errors in the pod events.
- If the image pulls successfully but fails to run, you will likely see a CrashLoopBackOff, which requires inspecting container logs.
- Containers terminating with Exit Code 137 are experiencing a 'kubernetes oom killed' event due to exceeding memory limits.
- Quick fix: Run 'kubectl describe pod <pod-name>' and check the 'Events' section at the bottom for the exact 'ErrImagePull' reason.
| Failure Mode | When to Use | Estimated Time | Risk Level |
|---|---|---|---|
| Correct Image/Tag Typo | When 'describe pod' shows 'manifest unknown' or 'not found' | < 5 mins | Low |
| Add imagePullSecrets | When encountering 'kubernetes permission denied' or 'pull access denied' | 10 mins | Low |
| Fix Registry Networking | When seeing 'kubernetes connection refused' or 'kubernetes timeout' | 30+ mins | Medium |
| Adjust Resource Limits | When facing 'kubernetes oom killed' or 'kubernetes out of memory' | 15 mins | Medium |
| Update CA Certificates | When 'kubernetes certificate expired' prevents secure registry access | 20 mins | Medium |
Understanding Kubernetes Pod Failures
When deploying applications to Kubernetes, ensuring that your pods transition smoothly from Pending to Running is the primary goal. However, DevOps engineers frequently encounter roadblocks that leave pods in a failing state. The most notorious of these are ImagePullBackOff and CrashLoopBackOff.
This guide provides a comprehensive framework for diagnosing and resolving these errors, along with related issues like kubernetes oom killed, kubernetes connection refused, and kubernetes permission denied.
Phase 1: Troubleshooting ImagePullBackOff and ErrImagePull
An ImagePullBackOff state means that the Kubelet on the worker node has tried to pull the specified container image, failed, and is now backing off (waiting) before trying again. The actual error causing the failure is typically logged as ErrImagePull.
1. The Typo or Missing Image
The most common and easily fixed cause is a simple typo in the deployment manifest.
Exact Error Message:
Failed to pull image "nginx:1.999": rpc error: code = Unknown desc = Error response from daemon: manifest for nginx:1.999 not found: manifest unknown: manifest unknown
The Fix: Verify the image repository, name, and tag. Ensure the tag exists in your container registry.
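As a concrete illustration, here is a minimal deployment fragment with the image field spelled out. The names (`web`, `nginx:1.25`) are placeholders; the point is that the registry host, repository name, and tag must all be exactly right:

```yaml
# Hypothetical deployment fragment; name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          # Verify all three parts: registry host (if any), repository, and tag.
          # "nginx:1.25" exists; "nginx:1.999" from the error above does not.
          image: nginx:1.25
```

Once corrected, you can roll it out without editing the manifest by hand, e.g. `kubectl set image deployment/web web=nginx:1.25`.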
2. Authentication and Permissions (kubernetes permission denied)
If you are pulling from a private registry (like AWS ECR, Google GCR, or a private Docker Hub repo), the worker node needs credentials.
Exact Error Message:
Failed to pull image "my-private-registry.com/app:latest": rpc error: code = Unknown desc = Error response from daemon: pull access denied for my-private-registry.com/app, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
The Fix: You need to create a Kubernetes Secret containing your registry credentials and link it to your Pod or ServiceAccount using imagePullSecrets.
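A minimal sketch of how the secret is referenced from a pod spec (the secret name `private-reg-cred` and the registry path are placeholders; the secret itself is created with `kubectl create secret docker-registry`, as shown in the command appendix at the end of this guide):

```yaml
# Hypothetical pod spec; the secret must already exist in the same namespace.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  imagePullSecrets:
    - name: private-reg-cred   # references the docker-registry secret
  containers:
    - name: app
      image: my-private-registry.com/app:latest
```

Alternatively, patch the ServiceAccount so every pod using it inherits the secret automatically; that approach is also shown in the command appendix.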
3. Network Connectivity (kubernetes connection refused & timeout)
Sometimes the Kubelet cannot reach the registry due to firewall rules, DNS resolution failures, or egress gateway misconfigurations.
Exact Error Messages:
ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.com/v2/: dial tcp 10.0.0.5:443: connect: connection refused
ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded)
The Fix: SSH into the worker node experiencing the issue. Run curl -v https://registry.example.com/v2/. If it hangs or refuses the connection, you must engage your networking team to verify egress security groups, NAT gateway configurations, or proxy settings.
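If the node can only reach the registry through an HTTP proxy, the container runtime needs the proxy settings too; a shell environment variable alone is not enough, because the runtime is a systemd service. A sketch of a systemd drop-in for containerd, with hypothetical proxy endpoints:

```ini
# /etc/systemd/system/containerd.service.d/http-proxy.conf
# Hypothetical proxy address; adjust to your environment. Apply with:
#   systemctl daemon-reload && systemctl restart containerd
[Service]
Environment="HTTP_PROXY=http://proxy.internal:3128"
Environment="HTTPS_PROXY=http://proxy.internal:3128"
# Exclude in-cluster and internal addresses so node-local traffic stays direct.
Environment="NO_PROXY=localhost,127.0.0.1,10.0.0.0/8,.cluster.local"
```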
4. TLS and Certificates (kubernetes certificate expired)
If you run a self-hosted registry or if the registry's SSL certificate has lapsed, Kubelet will refuse to pull the image for security reasons.
Exact Error Message:
ErrImagePull: rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.com/v2/: x509: certificate has expired or is not yet valid
The Fix: You must renew the TLS certificate on the registry server. If using a self-signed certificate, ensure the CA certificate is distributed to all Kubernetes worker nodes (typically placed in /etc/docker/certs.d/registry.example.com/ca.crt or the containerd equivalent) and restart the container runtime.
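For containerd specifically, a common way to distribute a private CA is a per-registry `hosts.toml`. This sketch assumes containerd 1.5+ with `config_path = "/etc/containerd/certs.d"` set in its registry configuration; the registry hostname is a placeholder:

```toml
# /etc/containerd/certs.d/registry.example.com/hosts.toml
server = "https://registry.example.com"

[host."https://registry.example.com"]
  # Path to the CA certificate that signed the registry's TLS certificate.
  ca = "/etc/containerd/certs.d/registry.example.com/ca.crt"
```

Restart containerd after placing the file so the new trust settings take effect.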
Phase 2: Progressing to CrashLoopBackOff
If the image pulls successfully, you have moved past ImagePullBackOff. However, if the container starts and immediately crashes, Kubernetes restarts it; if the crashes continue, the pod enters the CrashLoopBackOff state, with exponentially increasing delays between restart attempts.
1. Application Misconfiguration (kubernetes crash)
A standard kubernetes crash occurs when the application process exits with a non-zero status code. This could be due to missing environment variables, invalid configuration files, or the application failing to connect to a database.
Diagnostic Step:
You must check the logs of the previous container instance:
kubectl logs <pod-name> --previous
Look for application-level stack traces, such as missing database connection strings or panic events in the code.
2. Memory Limits Exceeded (kubernetes oom killed)
If your pod is terminated abruptly without any application-level error logs, it is likely a kubernetes out of memory event: the Linux kernel's Out-Of-Memory (OOM) killer terminates the container when its memory usage exceeds the limit set in the pod's resources.limits.memory.
Exact Symptom:
When you run kubectl describe pod <pod-name>, look at the State of the container. You will see:
State: Terminated
Reason: OOMKilled
Exit Code: 137
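Exit code 137 is not arbitrary: by Unix convention, an exit code above 128 means the process was killed by a signal, and the signal number is the code minus 128. A quick sketch of the arithmetic:

```shell
#!/bin/sh
# Decode a container exit code: values above 128 mean "killed by signal (code - 128)".
exit_code=137
signal=$((exit_code - 128))
# Signal 9 is SIGKILL, which the kernel OOM killer delivers. SIGKILL cannot be
# caught, so the application gets no chance to log anything -- which is why
# OOMKilled pods often appear to crash "silently".
echo "exit code ${exit_code} => signal ${signal} (SIGKILL)"
```

The same logic explains other codes you may see, such as 143 (128 + 15, SIGTERM from a normal pod shutdown).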
The Fix:
- Short term: Increase the memory limit in the deployment manifest (resources.limits.memory).
- Long term: Profile your application to identify memory leaks. For Java applications, ensure JVM heap sizes (-Xmx) are set lower than the container's memory limit so the JVM can manage its own garbage collection before the kernel kills the entire container.
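A minimal sketch of the relevant manifest fragment; the sizes are illustrative, not recommendations:

```yaml
# Hypothetical container fragment; tune requests/limits to measured usage.
containers:
  - name: app
    image: my-private-registry.com/app:latest
    resources:
      requests:
        memory: "512Mi"   # what the scheduler reserves for the pod
        cpu: "250m"
      limits:
        memory: "1Gi"     # the OOM killer fires when usage exceeds this
```

For the Java case above, a heap flag such as -Xmx768m under a 1Gi limit leaves headroom for non-heap memory (metaspace, threads, native buffers) so the JVM hits its own garbage collector before the kernel hits SIGKILL.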
3. Liveness Probe Failures
Sometimes the application is running fine, but a misconfigured Liveness Probe causes Kubernetes to think it has deadlocked. If the probe fails consecutively (e.g., due to a kubernetes timeout when hitting a /healthz endpoint), the Kubelet will kill and restart the container, leading to CrashLoopBackOff.
The Fix: Review kubectl describe pod events for Liveness probe failed: HTTP probe failed with statuscode: 500 or timeout. Adjust the initialDelaySeconds, timeoutSeconds, or fix the health check endpoint in your application code.
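A sketch of a more forgiving probe configuration; the path, port, and timings are hypothetical and should be tuned to your application's real startup and response times:

```yaml
# Hypothetical probe settings for a container in a pod spec.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to boot before the first probe
  periodSeconds: 10
  timeoutSeconds: 5         # avoid spurious timeout failures on a slow endpoint
  failureThreshold: 3       # restart only after 3 consecutive failures
```

For slow-starting applications, a separate startupProbe is often a cleaner fix than inflating initialDelaySeconds, since it disables the liveness probe until startup has succeeded once.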
Frequently Asked Questions
# 1. Identify the failing pod and get detailed events to spot the exact error
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 15 Events:
# 2. If 'Permission Denied', create a docker-registry secret
kubectl create secret docker-registry private-reg-cred \
--docker-server=https://my-private-registry.com \
--docker-username=my-service-account \
--docker-password=my-super-secret-token \
--docker-email=ops@example.com -n <namespace>
# 3. Patch the default service account to automatically use this secret
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "private-reg-cred"}]}' -n <namespace>
# 4. If CrashLoopBackOff, check logs of the crashed container instance
kubectl logs <pod-name> --previous -n <namespace>
# 5. Check if the pod was OOMKilled (Look for Exit Code 137)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
Error Medic Editorial
The Error Medic Editorial team consists of senior SREs, DevOps engineers, and cloud architects dedicated to providing actionable, real-world solutions for complex infrastructure challenges.