Error Medic

DevOps & CI/CD Error Guide: Kubernetes, Docker, Terraform & More

DevOps errors hit at the worst possible times — during deployments, scaling events, and incident response. When your CI/CD pipeline fails, containers won't start, or Terraform destroys a resource it shouldn't have, you need answers immediately because the blast radius grows with every minute of delay.

The DevOps toolchain is deep and interconnected. A single deployment might touch GitHub Actions, Docker, Helm, Kubernetes, and a service mesh like Istio. An error at any layer cascades through the rest. This makes root cause analysis challenging because the symptom (a pod in CrashLoopBackOff) is often far removed from the cause (a misconfigured ConfigMap pushed three commits ago).

This section covers 69 troubleshooting articles across 20 tools spanning container orchestration, CI/CD pipelines, infrastructure as code, service meshes, monitoring, and secret management. From Kubernetes pod scheduling failures to Terraform state lock conflicts, each guide targets the specific error messages you'll see in logs and dashboards.

The common thread across all DevOps errors is configuration. Most failures aren't code bugs — they're YAML indentation errors, environment variable mismatches, resource limit miscalculations, or permission gaps between service accounts. These guides teach you to read the signals, find the misconfiguration, and fix it without making things worse.

Browse by Category

Common Patterns & Cross-Cutting Themes

Container Startup Failures

CrashLoopBackOff, ImagePullBackOff, and OOMKilled are the three container errors you'll see most in Kubernetes. CrashLoopBackOff means the container keeps exiting and Kubernetes is backing off between restart attempts — check the logs with kubectl logs <pod> --previous to see why the last run died. Common causes: missing environment variables, a wrong entrypoint command, the application crashing on startup, or a failing liveness probe forcing restarts.

ImagePullBackOff means Kubernetes can't pull the container image. Check image name and tag spelling, verify the image exists in the registry, and ensure pull secrets are configured if it's a private registry. OOMKilled means the container exceeded its memory limit — increase the limit or fix the memory leak in your application.
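For the private-registry case, the pull secret has to be referenced by the pod (or its ServiceAccount). A minimal sketch — the secret name "regcred", the pod name, and the image are all assumptions for illustration:

```yaml
# Hypothetical pod pulling from a private registry.
# "regcred" is an assumed docker-registry secret that must already
# exist in the same namespace as the pod.
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo
spec:
  imagePullSecrets:
    - name: regcred                                # referenced, not inlined
  containers:
    - name: app
      image: registry.example.com/team/app:1.4.2   # assumed private image
```

If the secret is attached to the pod's ServiceAccount instead, every pod using that ServiceAccount inherits it, which is usually less repetitive than listing it per pod.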

Always set both resource requests and limits. Requests determine scheduling (which node has room), limits enforce ceilings (the container gets killed if it exceeds them). Start with requests equal to your baseline usage and limits at 2× requests, then tune based on actual metrics.
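As a sketch of that starting point, a container spec fragment with requests at baseline and limits at roughly 2× (the image name and numbers are illustrative assumptions, not recommendations for your workload):

```yaml
# Fragment of a Deployment/Pod container spec.
containers:
  - name: app
    image: registry.example.com/team/app:1.4.2   # assumed image
    resources:
      requests:
        cpu: 250m        # baseline usage — drives scheduling decisions
        memory: 256Mi
      limits:
        cpu: 500m        # ~2x the request as a first guess, then tune
        memory: 512Mi    # container is OOMKilled if it exceeds this
```

Note that exceeding the CPU limit throttles the container, while exceeding the memory limit kills it — which is why memory limits deserve the most careful tuning.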

CI/CD Pipeline Failures

Pipeline failures break the development feedback loop. The most common causes are: dependency installation failures (npm, pip, or apt packages unavailable or version-conflicted), test environment differences from local (missing env vars, services, or file paths), Docker build failures (layer cache invalidation, multi-stage build issues), and authentication problems with registries, cloud providers, or deployment targets.

Debug by reproducing locally first. Run the pipeline's commands in the same container image it uses. Check for environment-specific assumptions: hardcoded paths, localhost references that should be service names, or secrets that exist in your shell but not in the CI environment.
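One way to keep "works locally, fails in CI" to a minimum is to pin the job to an explicit container image and declare its environment explicitly, so you can run the same image locally. A hedged GitHub Actions sketch — the image tag, env var, and commands are assumptions:

```yaml
# Sketch of a CI job pinned to a specific image.
# Reproduce locally with the same image, e.g.:
#   docker run --rm -it -v "$PWD":/work -w /work node:20-bullseye bash
jobs:
  test:
    runs-on: ubuntu-latest
    container:
      image: node:20-bullseye                      # assumed toolchain image
    env:
      DATABASE_URL: postgres://db:5432/app         # assumed; declared, not implicit
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
```

Declaring env vars in the workflow file (rather than relying on what happens to exist in your shell) is what makes the local reproduction faithful.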

Flaky tests are a special category — tests that pass locally but fail intermittently in CI. Root causes include: test ordering dependencies, race conditions in async code, reliance on wall-clock time, and shared state between test cases. Quarantine flaky tests rather than retrying them blindly.

Infrastructure as Code State Issues

Terraform state conflicts, drift detection, and resource replacement surprises are common IaC headaches. "Error acquiring the state lock" means another operation is in progress or a previous one crashed without releasing the lock — check who holds the lock before force-unlocking.

State drift occurs when someone modifies infrastructure manually (via console or CLI) without updating the Terraform code. Run terraform plan regularly to detect drift early. Import manually-created resources with terraform import rather than recreating them.
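Importing can be done with the terraform import CLI command, or declaratively in Terraform 1.5+ with an import block that is resolved at plan time. A hypothetical sketch — the resource and bucket name are assumptions:

```hcl
# Terraform 1.5+ declarative import (hypothetical S3 bucket).
# `terraform plan` will show the import; older versions use the CLI form:
#   terraform import aws_s3_bucket.logs my-team-logs-bucket
import {
  to = aws_s3_bucket.logs
  id = "my-team-logs-bucket"
}

resource "aws_s3_bucket" "logs" {
  bucket = "my-team-logs-bucket"
}
```

After the import is applied, run terraform plan again — a clean plan confirms the code matches the real resource and the drift is resolved.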

The scariest Terraform error is an unexpected "destroy and recreate" in a plan. This happens when you change an attribute the provider treats as immutable (like an EC2 instance's AMI or a subnet's CIDR block), which forces replacement. Always read the full plan output before applying. Use lifecycle { prevent_destroy = true } on critical resources, and use create_before_destroy when replacement is necessary but downtime isn't acceptable.
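The two lifecycle settings serve different goals and usually belong on different resources — prevent_destroy blocks replacement outright, while create_before_destroy reorders it. A sketch with hypothetical resources (attributes omitted):

```hcl
# Critical stateful resource: any plan that would destroy it errors out.
resource "aws_db_instance" "primary" {
  # ...attributes omitted (hypothetical resource)

  lifecycle {
    prevent_destroy = true
  }
}

# Replaceable resource: build the new one before tearing down the old.
resource "aws_launch_template" "web" {
  # ...attributes omitted (hypothetical resource)

  lifecycle {
    create_before_destroy = true
  }
}
```

Don't combine the two on one resource: prevent_destroy would fail the plan before create_before_destroy ever gets a chance to apply.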

Monitoring & Observability Gaps

Prometheus scrape failures, Grafana dashboard errors, and Datadog agent issues can leave you flying blind during incidents. "Context deadline exceeded" in Prometheus usually means the scrape target is too slow — increase the scrape timeout or reduce the number of metrics exposed.
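The timeout fix lives in the scrape config; note that scrape_timeout must not exceed scrape_interval. A sketch with an assumed job name and target:

```yaml
# prometheus.yml fragment — job name and target are assumptions.
scrape_configs:
  - job_name: "slow-exporter"
    scrape_interval: 30s
    scrape_timeout: 25s                          # must be <= scrape_interval
    static_configs:
      - targets: ["exporter.internal:9100"]      # assumed target address
```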

Grafana query errors often stem from datasource misconfiguration, PromQL syntax issues, or queries that return too many time series (cardinality explosion). Limit your label values and use recording rules to pre-aggregate expensive queries.
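A recording rule evaluates the expensive aggregation on a schedule and stores the result as a new, cheap-to-query series. A minimal sketch, assuming a standard http_requests_total counter:

```yaml
# Prometheus rules file fragment — group name and metric are assumptions.
groups:
  - name: http_aggregations
    rules:
      - record: job:http_requests:rate5m          # level:metric:operation convention
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards then query job:http_requests:rate5m directly instead of re-aggregating thousands of raw series on every refresh.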

For logging pipelines, the most common failures are: log agents running out of disk buffer, log parsing failures silently dropping entries, and timestamp mismatches causing logs to appear out of order. Always monitor your monitoring — set up alerts on scrape failures, agent health, and log pipeline throughput.
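"Monitor your monitoring" can start with a single alert on the built-in up metric, which Prometheus sets to 0 whenever a scrape fails. A sketch — group name, severity label, and the 5m window are assumptions to tune:

```yaml
# Prometheus alerting rule fragment.
groups:
  - name: meta-monitoring
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m                                  # avoid paging on a single blip
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is not being scraped"
```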

Quick Troubleshooting Guide

| Symptom | Likely Cause | First Step |
| --- | --- | --- |
| Pod in CrashLoopBackOff | Application crashing on startup | Check logs: kubectl logs <pod> --previous; verify env vars and config |
| ImagePullBackOff | Wrong image name/tag or missing pull secret | Verify image exists in registry; check imagePullSecrets on ServiceAccount |
| Pod stuck in Pending | Insufficient cluster resources or node affinity mismatch | Run kubectl describe pod; check resource requests vs. node capacity |
| OOMKilled | Container exceeded memory limit | Increase memory limit; profile application memory usage; check for leaks |
| Terraform state lock error | Previous operation crashed or concurrent run | Verify no other operations running; force-unlock with terraform force-unlock |
| Terraform wants to destroy a resource | Immutable attribute changed (forces replacement) | Read plan carefully; use create_before_destroy or lifecycle rules |
| Helm upgrade failed / rollback | Invalid values, template error, or failed hooks | Check helm status; review rendered templates with helm template; fix values |
| CI pipeline auth failure | Expired token or missing secret in CI environment | Rotate credentials; verify secrets are set in CI settings; check token scopes |
| Service mesh 503 errors | Istio/Envoy sidecar misconfiguration | Check VirtualService and DestinationRule; verify mTLS settings match |
| Prometheus scrape failures | Target down or scrape timeout exceeded | Check target health in Prometheus UI; increase scrape_timeout; verify network |

Category Deep Dives

Frequently Asked Questions