Error Medic

DevOps & CI/CD Error Guide: Kubernetes, Docker, Terraform & More

DevOps errors hit at the worst possible times — during deployments, scaling events, and incident response. When your CI/CD pipeline fails, containers won't start, or Terraform destroys a resource it shouldn't have, you need answers immediately because the blast radius grows with every minute of delay.

The DevOps toolchain is deep and interconnected. A single deployment might touch GitHub Actions, Docker, Helm, Kubernetes, and a service mesh like Istio. An error at any layer cascades through the rest. This makes root cause analysis challenging because the symptom (a pod in CrashLoopBackOff) is often far removed from the cause (a misconfigured ConfigMap pushed three commits ago).

This section covers 69 troubleshooting articles across 20 tools spanning container orchestration, CI/CD pipelines, infrastructure as code, service meshes, monitoring, and secret management. From Kubernetes pod scheduling failures to Terraform state lock conflicts, each guide targets the specific error messages you'll see in logs and dashboards.

The common thread across all DevOps errors is configuration. Most failures aren't code bugs — they're YAML indentation errors, environment variable mismatches, resource limit miscalculations, or permission gaps between service accounts. These guides teach you to read the signals, find the misconfiguration, and fix it without making things worse.

Browse by Category

Common Patterns & Cross-Cutting Themes

Container Startup Failures

CrashLoopBackOff, ImagePullBackOff, and OOMKilled are the three container errors you'll see most in Kubernetes. CrashLoopBackOff means the container keeps exiting and Kubernetes is backing off between restart attempts — check the logs with kubectl logs <pod> --previous to see why the last run died. Common causes: missing environment variables, a wrong entrypoint command, the application crashing on startup, or a failing liveness probe forcing restarts.

ImagePullBackOff means Kubernetes can't pull the container image. Check image name and tag spelling, verify the image exists in the registry, and ensure pull secrets are configured if it's a private registry. OOMKilled means the container exceeded its memory limit — increase the limit or fix the memory leak in your application.
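For the private-registry case, the pull secret has to be referenced by the pod (or its ServiceAccount). A minimal sketch — the secret name "regcred", the pod name, and the image are all assumptions for illustration:

```yaml
# Hypothetical pod pulling from a private registry.
# "regcred" is an assumed docker-registry secret that must already
# exist in the same namespace as the pod.
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo
spec:
  imagePullSecrets:
    - name: regcred                                # referenced, not inlined
  containers:
    - name: app
      image: registry.example.com/team/app:1.4.2   # assumed private image
```

If the secret is attached to the pod's ServiceAccount instead, every pod using that ServiceAccount inherits it, which is usually less repetitive than listing it per pod.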

Always set both resource requests and limits. Requests determine scheduling (which node has room), limits enforce ceilings (the container gets killed if it exceeds them). Start with requests equal to your baseline usage and limits at 2× requests, then tune based on actual metrics.
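As a sketch of that starting point, a container spec fragment with requests at baseline and limits at roughly 2× (the image name and numbers are illustrative assumptions, not recommendations for your workload):

```yaml
# Fragment of a Deployment/Pod container spec.
containers:
  - name: app
    image: registry.example.com/team/app:1.4.2   # assumed image
    resources:
      requests:
        cpu: 250m        # baseline usage — drives scheduling decisions
        memory: 256Mi
      limits:
        cpu: 500m        # ~2x the request as a first guess, then tune
        memory: 512Mi    # container is OOMKilled if it exceeds this
```

Note that exceeding the CPU limit throttles the container, while exceeding the memory limit kills it — which is why memory limits deserve the most careful tuning.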

CI/CD Pipeline Failures

Pipeline failures break the development feedback loop. The most common causes are: dependency installation failures (npm, pip, or apt packages unavailable or version-conflicted), test environment differences from local (missing env vars, services, or file paths), Docker build failures (layer cache invalidation, multi-stage build issues), and authentication problems with registries, cloud providers, or deployment targets.

Debug by reproducing locally first. Run the pipeline's commands in the same container image it uses. Check for environment-specific assumptions: hardcoded paths, localhost references that should be service names, or secrets that exist in your shell but not in the CI environment.
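One way to keep "works locally, fails in CI" to a minimum is to pin the job to an explicit container image and declare its environment explicitly, so you can run the same image locally. A hedged GitHub Actions sketch — the image tag, env var, and commands are assumptions:

```yaml
# Sketch of a CI job pinned to a specific image.
# Reproduce locally with the same image, e.g.:
#   docker run --rm -it -v "$PWD":/work -w /work node:20-bullseye bash
jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: node:20-bullseye                      # assumed toolchain image
    env:
      DATABASE_URL: postgres://db:5432/app         # assumed; declared, not implicit
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
```

Declaring env vars in the workflow file (rather than relying on what happens to exist in your shell) is what makes the local reproduction faithful.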

Flaky tests are a special category — tests that pass locally but fail intermittently in CI. Root causes include: test ordering dependencies, race conditions in async code, reliance on wall-clock time, and shared state between test cases. Quarantine flaky tests rather than retrying them blindly.

Infrastructure as Code State Issues

Terraform state conflicts, drift detection, and resource replacement surprises are common IaC headaches. "Error acquiring the state lock" means another operation is in progress or a previous one crashed without releasing the lock — check who holds the lock before force-unlocking.

State drift occurs when someone modifies infrastructure manually (via console or CLI) without updating the Terraform code. Run terraform plan regularly to detect drift early. Import manually-created resources with terraform import rather than recreating them.
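Importing can be done with the terraform import CLI command, or declaratively in Terraform 1.5+ with an import block that is resolved at plan time. A hypothetical sketch — the resource and bucket name are assumptions:

```hcl
# Terraform 1.5+ declarative import (hypothetical S3 bucket).
# `terraform plan` will show the import; older versions use the CLI form:
#   terraform import aws_s3_bucket.logs my-team-logs-bucket
import {
  to = aws_s3_bucket.logs
  id = "my-team-logs-bucket"
}

resource "aws_s3_bucket" "logs" {
  bucket = "my-team-logs-bucket"
}
```

After the import is applied, run terraform plan again — a clean plan confirms the code matches the real resource and the drift is resolved.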

The scariest Terraform error is an unexpected "destroy and recreate" in a plan. This happens when you change an attribute the provider treats as immutable (like an EC2 instance's AMI or a subnet's CIDR block), which forces replacement. Always read the full plan output before applying. Use lifecycle { prevent_destroy = true } on critical resources, and use create_before_destroy when replacement is necessary but downtime isn't acceptable.
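The two lifecycle settings serve different goals and usually belong on different resources — prevent_destroy blocks replacement outright, while create_before_destroy reorders it. A sketch with hypothetical resources (attributes omitted):

```hcl
# Critical stateful resource: any plan that would destroy it errors out.
resource "aws_db_instance" "primary" {
  # ...attributes omitted (hypothetical resource)

  lifecycle {
    prevent_destroy = true
  }
}

# Replaceable resource: build the new one before tearing down the old.
resource "aws_launch_template" "web" {
  # ...attributes omitted (hypothetical resource)

  lifecycle {
    create_before_destroy = true
  }
}
```

Don't combine the two on one resource: prevent_destroy would fail the plan before create_before_destroy ever gets a chance to apply.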

Monitoring & Observability Gaps

Prometheus scrape failures, Grafana dashboard errors, and Datadog agent issues can leave you flying blind during incidents. "Context deadline exceeded" in Prometheus usually means the scrape target is too slow — increase the scrape timeout or reduce the number of metrics exposed.
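The timeout fix lives in the scrape config; note that scrape_timeout must not exceed scrape_interval. A sketch with an assumed job name and target:

```yaml
# prometheus.yml fragment — job name and target are assumptions.
scrape_configs:
  - job_name: "slow-exporter"
    scrape_interval: 30s
    scrape_timeout: 25s                          # must be <= scrape_interval
    static_configs:
      - targets: ["exporter.internal:9100"]      # assumed target address
```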

Grafana query errors often stem from datasource misconfiguration, PromQL syntax issues, or queries that return too many time series (cardinality explosion). Limit your label values and use recording rules to pre-aggregate expensive queries.
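A recording rule evaluates the expensive aggregation on a schedule and stores the result as a new, cheap-to-query series. A minimal sketch, assuming a standard http_requests_total counter:

```yaml
# Prometheus rules file fragment — group name and metric are assumptions.
groups:
  - name: http_aggregations
    rules:
      - record: job:http_requests:rate5m          # level:metric:operation convention
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query job:http_requests:rate5m directly instead of re-aggregating thousands of raw series on every refresh.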

For logging pipelines, the most common failures are: log agents running out of disk buffer, log parsing failures silently dropping entries, and timestamp mismatches causing logs to appear out of order. Always monitor your monitoring — set up alerts on scrape failures, agent health, and log pipeline throughput.
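"Monitor your monitoring" can start with a single alert on the built-in up metric, which Prometheus sets to 0 whenever a scrape fails. A sketch — group name, severity label, and the 5m window are assumptions to tune:

```yaml
# Prometheus alerting rule fragment.
groups:
  - name: meta-monitoring
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m                                  # avoid paging on a single blip
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is not being scraped"
```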

Quick Troubleshooting Guide

| Symptom | Likely Cause | First Step |
| --- | --- | --- |
| Pod in CrashLoopBackOff | Application crashing on startup | Check logs: kubectl logs <pod> --previous; verify env vars and config |
| ImagePullBackOff | Wrong image name/tag or missing pull secret | Verify image exists in registry; check imagePullSecrets on ServiceAccount |
| Pod stuck in Pending | Insufficient cluster resources or node affinity mismatch | Run kubectl describe pod; check resource requests vs. node capacity |
| OOMKilled | Container exceeded memory limit | Increase memory limit; profile application memory usage; check for leaks |
| Terraform state lock error | Previous operation crashed or concurrent run | Verify no other operations running; force-unlock with terraform force-unlock |
| Terraform wants to destroy a resource | Immutable attribute changed (forces replacement) | Read plan carefully; use create_before_destroy or lifecycle rules |
| Helm upgrade failed / rollback | Invalid values, template error, or failed hooks | Check helm status; review rendered templates with helm template; fix values |
| CI pipeline auth failure | Expired token or missing secret in CI environment | Rotate credentials; verify secrets are set in CI settings; check token scopes |
| Service mesh 503 errors | Istio/Envoy sidecar misconfiguration | Check VirtualService and DestinationRule; verify mTLS settings match |
| Prometheus scrape failures | Target down or scrape timeout exceeded | Check target health in Prometheus UI; increase scrape_timeout; verify network |

Category Deep Dives

Frequently Asked Questions