Prometheus Connection Refused: Complete Troubleshooting Guide (CrashLoopBackOff, OOM, Permission Denied)
Fix Prometheus 'connection refused', CrashLoopBackOff, OOM kills, and permission denied errors with step-by-step commands and config examples.
- Connection refused usually means Prometheus is not running, bound to the wrong address, or blocked by a firewall/NetworkPolicy — check `kubectl get pods` and `netstat -tlnp` first
- CrashLoopBackOff and OOM kills are almost always caused by insufficient memory limits, a misconfigured `--storage.tsdb.retention` flag, or a cardinality explosion from high-churn label sets
- Permission denied errors on startup point to a volume mount with wrong UID/GID ownership — the Prometheus binary runs as UID 65534 (nobody) by default and cannot write to root-owned directories
- Quick fix checklist: verify the process is up → check bind address → inspect resource limits → fix storage permissions → review scrape configs for label cardinality
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Restart pod / process | Transient crash, OOM after memory limit raised | < 2 min | Low — no config change |
| Increase memory limit | Repeated OOM kills shown in `kubectl describe pod` | 5–10 min | Low — rolling restart required |
| Fix --web.listen-address flag | Prometheus not binding to 0.0.0.0 or correct port | 5 min | Low |
| Fix storage volume permissions (chown/securityContext) | Permission denied on /prometheus data dir at startup | 5–10 min | Low — requires pod restart |
| Reduce label cardinality / add metric_relabel_configs | Cardinality explosion causing OOM or slow queries | 30–60 min | Medium — may drop series |
| Tune --storage.tsdb.retention.size | Disk full causing crashes or write errors | 5 min | Low |
| Add NetworkPolicy / firewall rule | Connection refused from external client or other pod | 10–20 min | Medium — affects network topology |
| Upgrade Prometheus version | Bug in older release causing crashes or timeouts | 20–40 min | Medium — test in staging first |
Understanding Prometheus Connection Errors
Prometheus exposes an HTTP API and UI on port 9090 by default. A connection refused error means the TCP handshake never completed — the target host sent back a RST packet because nothing was listening on that port. This is distinct from a timeout (no response at all, typically a firewall silently dropping packets) and from a 401/403 (process is up but rejects the request).
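The three failure modes can be told apart from a shell. A minimal sketch using bash's `/dev/tcp` (the host and port arguments are placeholders — point them at your Prometheus):

```shell
# Classify a TCP probe: open, refused (RST received), or filtered (timeout).
# probe <host> <port> — prints one word describing the outcome.
probe() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open"        # something is listening
  elif [ $? -eq 124 ]; then
    echo "filtered"    # timed out — suspect firewall/NetworkPolicy DROP
  else
    echo "refused"     # RST came back — nothing bound to that port
  fi
}

probe 127.0.0.1 9090   # "open" if Prometheus is up, "refused" if not
```

A "filtered" result means the diagnosis continues in Step 7; "refused" means it continues in Steps 1–3.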
Common exact errors you will see:
Get "http://prometheus:9090/api/v1/query": dial tcp 10.96.0.1:9090: connect: connection refused
ts=2024-01-15T10:23:45Z level=error msg="Opening storage failed" err="open /prometheus/queries.active: permission denied"
OOMKilled
Back-off restarting failed container prometheus in pod prometheus-0
Step 1: Determine Whether Prometheus Is Running
Kubernetes environments:
kubectl get pods -n monitoring -l app=prometheus
kubectl describe pod prometheus-0 -n monitoring # look at Events and Last State
kubectl logs prometheus-0 -n monitoring --previous # logs from crashed container
Look for these fields in kubectl describe pod:
- `Last State: Terminated`, `Reason: OOMKilled` → memory problem
- `Last State: Terminated`, `Reason: Error` with exit code 1 or 2 → config or permission error
- `Restart Count: N` where N > 3 → CrashLoopBackOff pattern
Bare-metal / VM environments:
systemctl status prometheus
journalctl -u prometheus -n 100 --no-pager
ps aux | grep prometheus
Step 2: Diagnose Connection Refused
Once you confirm the process state, narrow down the cause:
# Is Prometheus actually listening on port 9090?
ss -tlnp | grep 9090
# or on older systems:
netstat -tlnp | grep 9090
# Can you reach it locally?
curl -v http://localhost:9090/-/healthy
# In Kubernetes — check Service selector matches pod labels:
kubectl get svc prometheus -n monitoring -o yaml
kubectl get endpoints prometheus -n monitoring
# If Endpoints shows <none>, the Service selector is wrong
If ss shows nothing on port 9090 but the process is running, check the --web.listen-address flag:
kubectl exec -it prometheus-0 -n monitoring -- /bin/prometheus --help 2>&1 | grep listen
# Then check actual flags the process was started with:
kubectl exec -it prometheus-0 -n monitoring -- cat /proc/1/cmdline | tr '\0' ' '
The flag must be --web.listen-address=0.0.0.0:9090 (or just :9090, which also binds all interfaces) for the service to be reachable from other pods. If it is set to 127.0.0.1:9090, only loopback traffic works.
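In a Deployment or StatefulSet, the flag lives in the container args. A sketch of the relevant fragment (the image tag and surrounding flags are illustrative — keep your existing ones):

```yaml
containers:
  - name: prometheus
    image: prom/prometheus:v2.51.0            # illustrative tag
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.listen-address=0.0.0.0:9090     # bind all interfaces, not loopback
```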
Step 3: Fix CrashLoopBackOff
CrashLoopBackOff is not a root cause — it is Kubernetes backing off restarts of a container that keeps failing. Identify why it crashes:
# Get the last 200 lines from the crashed container
kubectl logs prometheus-0 -n monitoring --previous --tail=200
Scenario A — Bad configuration file:
ts=2024-01-15T10:23:45Z level=error msg="Error loading config" file=/etc/prometheus/prometheus.yml err="yaml: line 42: mapping values are not allowed in this context"
Validate the config before applying:
promtool check config /etc/prometheus/prometheus.yml
In Kubernetes, the ConfigMap is often the source. Edit it with kubectl edit configmap prometheus-config -n monitoring and look for YAML indentation errors.
Scenario B — Storage corruption:
level=error msg="Failed to open db" err="unexpected end of JSON input"
This usually means a block's meta.json or the WAL was truncated by a crash or a full disk. If the WAL is at fault, remove it (losing only in-flight samples that had not yet been compacted into a block):
# List TSDB blocks
ls -lah /prometheus/
# Remove WAL if corrupted (data loss for in-flight samples only)
rm -rf /prometheus/wal
# Then restart Prometheus
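A slightly safer variant moves the WAL aside instead of deleting it, so it can be inspected or restored later. The data directory is passed as an argument — use your TSDB path (`/prometheus`, `/var/lib/prometheus`, ...):

```shell
# Sideline a corrupted WAL rather than deleting it outright.
# sideline_wal <data_dir> — renames <data_dir>/wal with a timestamp suffix.
sideline_wal() {
  local data_dir=$1
  if [ -d "$data_dir/wal" ]; then
    mv "$data_dir/wal" "$data_dir/wal.corrupt.$(date +%s)"
    echo "WAL moved aside — restart Prometheus to rebuild it"
  else
    echo "no WAL directory found at $data_dir/wal"
  fi
}

# Example: sideline_wal /prometheus
```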
Step 4: Fix OOM Killed
kubectl describe pod prometheus-0 -n monitoring | grep -A5 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
Exit code 137 = 128 + 9 (SIGKILL). The Linux OOM killer terminated the process.
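The arithmetic generalizes: any exit code above 128 encodes a fatal signal. A small helper function (hypothetical, not part of any tool) makes the decoding explicit:

```shell
# Decode a container exit code: codes > 128 mean "killed by signal (code - 128)".
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128)) ($(kill -l $((code - 128))))"
  else
    echo "exited normally with status $code"
  fi
}

decode_exit 137   # signal 9 = SIGKILL, the OOM killer's signal
decode_exit 2     # ordinary error exit (bad config, permissions, etc.)
```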
Option 1 — Raise the memory limit (fast):
# In your Deployment or StatefulSet spec:
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "4Gi"
Apply and wait for rollout: kubectl rollout status statefulset/prometheus -n monitoring
Option 2 — Reduce ingestion cardinality (sustainable):
High cardinality (millions of unique time series) is the most common cause of Prometheus OOM. Use the built-in TSDB status endpoint to find offenders:
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool | head -80
Look at headStats.numSeries. If it exceeds 1–2 million on a single Prometheus instance with 4 GiB RAM, you have a cardinality problem.
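To make this check scriptable, parse headStats out of the JSON response. The payload below is a trimmed, illustrative sample, not real API output — in production, pipe `curl -s http://localhost:9090/api/v1/status/tsdb` in instead:

```shell
# Extract the head-series count from /api/v1/status/tsdb output and flag
# a likely cardinality problem. SAMPLE stands in for the real response.
SAMPLE='{"status":"success","data":{"headStats":{"numSeries":2400000,"numChunks":9600000}}}'

echo "$SAMPLE" | python3 -c '
import json, sys
head = json.load(sys.stdin)["data"]["headStats"]
print("series:", head["numSeries"])
if head["numSeries"] > 2_000_000:
    print("WARNING: probable cardinality problem for a 4 GiB instance")
'
```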
Drop high-cardinality labels with metric_relabel_configs:
# prometheus.yml scrape config
scrape_configs:
  - job_name: 'my-app'
    metric_relabel_configs:
      - regex: 'request_id'        # drop unique per-request labels by name
        action: labeldrop
      - source_labels: [__name__]
        regex: 'go_gc_.*'          # drop noisy Go runtime metrics entirely
        action: drop
Note that labeldrop matches its regex against label names and does not accept source_labels — Prometheus fails config validation if both are set.
Option 3 — Tune retention:
# Limit retention by size instead of time
--storage.tsdb.retention.size=10GB
--storage.tsdb.retention.time=15d
Step 5: Fix Permission Denied
Prometheus runs as UID 65534 (nobody) by default. If the /prometheus data directory was created by root or another user, Prometheus cannot write to it.
ts=2024-01-15T10:23:45Z level=error caller=main.go:174 msg="Opening storage failed" err="open /prometheus/queries.active: permission denied"
Fix on bare metal:
chown -R 65534:65534 /var/lib/prometheus
# or if running as a dedicated user:
chown -R prometheus:prometheus /var/lib/prometheus
systemctl restart prometheus
Fix in Kubernetes (preferred — use securityContext):
spec:
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    fsGroup: 65534   # ensures mounted volumes are writable by this GID
  containers:
    - name: prometheus
      # ...
The fsGroup field is the key — on pod startup Kubernetes recursively sets the group ownership of the volume to that GID (and the setgid bit on directories), so Prometheus can write to it without running as root.
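Some storage drivers (certain NFS and hostPath setups) do not honor fsGroup. A common workaround is a root initContainer that fixes ownership before Prometheus starts — a sketch, with the image tag and volume name as assumptions:

```yaml
initContainers:
  - name: fix-permissions
    image: busybox:1.36            # illustrative tag
    command: ["sh", "-c", "chown -R 65534:65534 /prometheus"]
    securityContext:
      runAsUser: 0                 # root is needed only for the chown
    volumeMounts:
      - name: prometheus-data      # must match your data volume's name
        mountPath: /prometheus
```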
Step 6: Fix Scrape Timeouts
If Prometheus is running but you see timeout errors in the UI or logs:
level=warn msg="Scrape failed" scrape_url="http://my-app:8080/metrics" err="context deadline exceeded"
Increase the per-job scrape timeout in prometheus.yml:
global:
  scrape_timeout: 10s      # default is 10s; increase if targets are slow
scrape_configs:
  - job_name: 'slow-app'
    scrape_timeout: 30s    # job-level override
    scrape_interval: 60s
Note: scrape_timeout must always be less than or equal to scrape_interval.
Step 7: Check NetworkPolicies and Firewalls
In hardened Kubernetes clusters, NetworkPolicies can silently block traffic:
kubectl get networkpolicies -n monitoring
# Test connectivity from another pod:
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -n default -- \
curl -v http://prometheus.monitoring.svc.cluster.local:9090/-/healthy
If the curl pod gets connection refused but Prometheus is running, check if a NetworkPolicy is blocking ingress to port 9090 from the default namespace.
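Under a default-deny policy, port 9090 needs an explicit ingress allowance. A sketch (the pod labels and namespace are assumptions — match them to your deployment):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-ingress
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus           # must match the Prometheus pod labels
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector: {} # any namespace; tighten for production
      ports:
        - protocol: TCP
          port: 9090
```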
One-Shot Diagnostic Script
#!/usr/bin/env bash
# Prometheus Diagnostic Script
# Usage: Run this on the node or via kubectl exec
set -euo pipefail
NAMESPACE="monitoring"
POD=$(kubectl get pods -n "$NAMESPACE" -l app=prometheus -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
echo "=== Pod Status ==="
if [ -n "$POD" ]; then
kubectl get pod "$POD" -n "$NAMESPACE" -o wide
kubectl describe pod "$POD" -n "$NAMESPACE" | grep -A10 "Conditions:\|Last State:\|Limits:\|Requests:\|Events:" || true
else
# Bare-metal fallback
systemctl status prometheus --no-pager || true
ps aux | grep '[p]rometheus' || true
fi
echo ""
echo "=== Port Binding ==="
if [ -n "$POD" ]; then
kubectl exec "$POD" -n "$NAMESPACE" -- ss -tlnp 2>/dev/null || \
kubectl exec "$POD" -n "$NAMESPACE" -- netstat -tlnp 2>/dev/null || true
else
ss -tlnp | grep 9090 || echo "Nothing listening on 9090"
fi
echo ""
echo "=== Health Check ==="
if [ -n "$POD" ]; then
kubectl exec "$POD" -n "$NAMESPACE" -- wget -qO- http://localhost:9090/-/healthy 2>&1 || \
echo "Health check FAILED"
else
curl -sf http://localhost:9090/-/healthy && echo "OK" || echo "FAILED"
fi
echo ""
echo "=== Recent Logs ==="
if [ -n "$POD" ]; then
kubectl logs "$POD" -n "$NAMESPACE" --previous --tail=50 2>/dev/null || \
kubectl logs "$POD" -n "$NAMESPACE" --tail=50
else
journalctl -u prometheus -n 50 --no-pager || true
fi
echo ""
echo "=== TSDB Status (cardinality) ==="
TSDB_URL="http://localhost:9090/api/v1/status/tsdb"
if [ -n "$POD" ]; then
kubectl exec "$POD" -n "$NAMESPACE" -- wget -qO- "$TSDB_URL" 2>/dev/null | \
python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print('Series:', d['headStats']['numSeries'], '| Chunks:', d['headStats']['numChunks'])" 2>/dev/null || true
else
curl -sf "$TSDB_URL" | python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print('Series:', d['headStats']['numSeries'], '| Chunks:', d['headStats']['numChunks'])" 2>/dev/null || true
fi
echo ""
echo "=== Endpoints (Service wiring) ==="
if [ -n "$POD" ]; then
kubectl get endpoints -n "$NAMESPACE" | grep -i prom || echo "No endpoints found"
fi
echo ""
echo "=== Storage Permissions ==="
if [ -n "$POD" ]; then
kubectl exec "$POD" -n "$NAMESPACE" -- ls -lah /prometheus/ 2>/dev/null || true
else
ls -lah /var/lib/prometheus/ 2>/dev/null || ls -lah /prometheus/ 2>/dev/null || true
fi
echo ""
echo "Diagnostic complete."
Error Medic Editorial
The Error Medic Editorial team consists of senior SREs and platform engineers with experience running Prometheus at scale across bare-metal, AWS EKS, GKE, and on-prem Kubernetes clusters. We write from production incidents, not documentation.
Sources
- https://prometheus.io/docs/prometheus/latest/storage/
- https://prometheus.io/docs/prometheus/latest/configuration/configuration/
- https://github.com/prometheus/prometheus/blob/main/docs/storage.md
- https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
- https://stackoverflow.com/questions/53752395/prometheus-connection-refused
- https://github.com/prometheus/prometheus/issues/8741
- https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion