Troubleshooting 'Datadog Not Working': Fixing Agent Offline, Missing Metrics, and APM Connection Errors
Datadog not working? Learn how to diagnose and fix offline Datadog Agents, missing APM traces, connection timeouts, and API key errors with this guide.
- Agent offline issues are most frequently caused by network or firewall blocks preventing outbound traffic on port 443 to Datadog endpoints.
- Invalid, expired, or incorrectly configured API keys in 'datadog.yaml' will result in 403 Forbidden errors and dropped metrics.
- Missing APM traces usually stem from the Trace Agent not running or applications failing to reach localhost:8126.
- Resource exhaustion (CPU/Memory limits) on the host or container can cause the Datadog Agent to crash or be terminated by the OOM killer.
- Quick fix: Run 'sudo datadog-agent status' to identify the failing component, and check '/var/log/datadog/agent.log' for explicit error messages.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Restart Datadog Agent | When the agent process is hung, consuming too much memory, or after a configuration change. | 1-2 mins | Low |
| Check Agent Logs | When the host appears offline in the UI or specific integrations are missing data. | 5-10 mins | Low |
| Verify Network Connectivity | When logs show 'timeout' or 'connection refused' errors reaching Datadog APIs. | 10-15 mins | Low |
| Run Datadog Flare | When complex issues persist and you need to send comprehensive diagnostics to Datadog Support. | 5 mins | Low |
| Reinstall/Upgrade Agent | When dealing with an outdated, deprecated agent version or severe binary corruption. | 15-30 mins | Medium (brief metric gap) |
Understanding the Error: Why is Datadog Not Working?
When engineers report that 'Datadog is not working,' the symptom can manifest in several different ways. The Datadog ecosystem is vast, relying on a locally installed Agent (a Go-based daemon) that collects, aggregates, and forwards metrics, logs, and traces to Datadog's cloud infrastructure. A failure at any point in this pipeline can mean missing observability data.
The most common manifestations of this problem include:
- The Host is Offline: The infrastructure list in the Datadog UI shows the host as '?' or 'Offline'.
- Missing Infrastructure Metrics: CPU, memory, and disk usage stop updating.
- Missing Integration Data: A specific service (like PostgreSQL, Redis, or NGINX) stops reporting metrics, while base host metrics continue.
- Missing APM Traces: Distributed tracing data is absent, often accompanied by connection errors in the application logs.
- Log Collection Failures: Application or system logs are not appearing in the Log Explorer.
Step 1: Diagnose the Datadog Agent
The absolute first step in troubleshooting any Datadog issue is to query the Agent's status. The Agent comes with a built-in CLI that performs a comprehensive health check.
Run the following command on the affected Linux host:
sudo datadog-agent status
This command outputs a wealth of information. You need to look for specific sections:
1. The Forwarder Section
Look at the 'Forwarder' section to ensure payloads are successfully reaching Datadog. If you see an error like:
Error: error connecting to the agent: dial tcp 127.0.0.1:5001: connect: connection refused
This indicates the core Agent process is not running. Check if the service is active using systemctl status datadog-agent.
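As a quick sketch, the two states (service down vs. service up but unhealthy) can be distinguished in one pass with systemctl's exit code; this is illustrative shell, not part of the Agent's own tooling:

```shell
#!/usr/bin/env bash
# Sketch: distinguish "Agent service down" from "service up but a component failing".
if systemctl is-active --quiet datadog-agent 2>/dev/null; then
  echo "service active - inspect 'sudo datadog-agent status' output for failing components"
else
  echo "service inactive - start it with: sudo systemctl start datadog-agent"
fi
```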
If you see errors related to the API, such as:
[ERROR] Error starting the agent: no API key configured
or
WARN | Forwarder | Error while sending payload to https://api.datadoghq.com/api/v1/series: 403 Forbidden - Invalid API Key
This points squarely to an authentication issue. Your datadog.yaml file either lacks an API key, contains a typo, or the key has been revoked in the Datadog UI.
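Before regenerating keys, a format sanity check on the configured key often catches the typo: Datadog API keys are 32 lowercase hex characters. A minimal sketch, assuming the default Linux config path (it falls back to a throwaway sample file so it can run anywhere; the sample key is a placeholder, not a real credential):

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the api_key in datadog.yaml before blaming the backend.
CONF="/etc/datadog-agent/datadog.yaml"

# Fall back to a throwaway sample config so the sketch runs anywhere.
# The key below is a placeholder, not a real credential.
if [ ! -r "$CONF" ]; then
  CONF="$(mktemp)"
  printf 'api_key: 0123456789abcdef0123456789abcdef\nsite: datadoghq.com\n' > "$CONF"
fi

# Extract the key and verify the expected shape: 32 lowercase hex characters.
KEY="$(awk -F': *' '$1 == "api_key" {print $2}' "$CONF")"

if printf '%s' "$KEY" | grep -Eq '^[0-9a-f]{32}$'; then
  echo "api_key format OK"
else
  echo "api_key missing or malformed - check for whitespace, quotes, or truncation"
fi
```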
2. The Collector Section
If specific integrations are failing, check the 'Collector' section. It lists every integration (check) configured. If an integration is misconfigured, you will see a traceback or an error like:
[ERROR] postgres: Unresolved missing requirement
or
Check 'nginx' failed: dial tcp 127.0.0.1:80: connect: connection refused
This indicates the Agent is running perfectly, but it cannot access the service it is trying to monitor. This is usually due to incorrect credentials in the conf.d/ YAML file, or the target service being down.
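A quick way to separate "the Agent is broken" from "the target service is unreachable" is to probe the integration's endpoint from the Agent host itself. A hedged sketch using bash's built-in /dev/tcp; the host and port are illustrative defaults, so substitute the real values from your conf.d YAML:

```shell
#!/usr/bin/env bash
# Sketch: probe the endpoint an integration is configured to scrape.
# HOST/PORT are examples - take the real values from conf.d/<check>.d/conf.yaml.
HOST="${1:-127.0.0.1}"
PORT="${2:-5432}"   # PostgreSQL default

# bash's /dev/tcp opens a raw TCP connection without needing nc or telnet
if timeout 3 bash -c "exec 3<>/dev/tcp/${HOST}/${PORT}" 2>/dev/null; then
  echo "reachable: ${HOST}:${PORT} - check credentials in the integration config instead"
else
  echo "unreachable: ${HOST}:${PORT} - the check will report 'connection refused' or time out"
fi
```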
Step 2: Fix Network and Connectivity Issues
The Datadog Agent is a push-based system. It needs outbound network access to Datadog's infrastructure. If the Agent is running but no data appears, it is almost always a network issue.
Look at the main Agent log file: /var/log/datadog/agent.log. If you see:
WARN | Forwarder | Error while sending payload: dial tcp: i/o timeout
This means the Agent's outbound traffic is being blocked by a firewall, Security Group (AWS), or network proxy.
How to fix:
- Verify your Datadog Site: Datadog has multiple regions (US1, US3, US5, EU1, US1-FED). If your account is in EU1 but your Agent is sending data to US1 (the default), it will fail. Ensure site: datadoghq.eu (or your appropriate site) is set in /etc/datadog-agent/datadog.yaml.
- Check Firewall Rules: The Agent requires outbound access on TCP port 443 to Datadog's domains. You can test this manually using cURL:
curl -v "https://api.datadoghq.com/api/v1/validate?api_key=YOUR_API_KEY"
If this command hangs, you have a firewall issue. If you use a proxy, ensure the Agent is configured to use it by setting the proxy block in datadog.yaml.
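To see at a glance which proxy settings could be interfering, you can dump both the environment variables curl honors and any proxy block in the Agent's own config. A sketch assuming the default Linux config path:

```shell
#!/usr/bin/env bash
# Sketch: surface proxy settings that affect both the Agent and manual curl tests.
echo "HTTP_PROXY=${HTTP_PROXY:-unset}"
echo "HTTPS_PROXY=${HTTPS_PROXY:-unset}"
echo "NO_PROXY=${NO_PROXY:-unset}"

# Also check for a proxy block in the Agent's own configuration
grep -A3 '^proxy:' /etc/datadog-agent/datadog.yaml 2>/dev/null \
  || echo "no proxy block found in datadog.yaml"
```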
Step 3: Fix APM (Trace Agent) Failures
If host metrics are working but APM traces are missing, the issue lies with the Trace Agent. The Datadog APM architecture requires your application (instrumented with a Datadog library, e.g., dd-trace-js or dd-trace-py) to send traces to the Agent over localhost on port 8126.
Common APM Errors:
Application logs showing: Error: connect ECONNREFUSED 127.0.0.1:8126.
How to fix:
- Enable APM in the Agent: By default, APM is often enabled, but it can be explicitly disabled. Ensure your datadog.yaml has:
  apm_config:
    enabled: true
- Verify the Port: Ensure the Trace Agent is listening. Run sudo netstat -tlnp | grep 8126 or sudo ss -tlnp | grep 8126. You should see the trace-agent process bound to 0.0.0.0:8126 or 127.0.0.1:8126.
- Containerized Environments: If your application is in a Docker container and the Agent is on the host (or in another container), 127.0.0.1 resolves to the application's container, NOT the host. You must configure your tracer to send data to the host IP. In Docker, you can pass the host IP via an environment variable, or use Datadog's Unix Domain Socket (UDS) feature for APM, which is highly recommended for Kubernetes and Docker.
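One common pattern for the container case is to derive the host's address from the container's default route and hand it to the tracer via DD_AGENT_HOST, which Datadog tracing libraries read automatically. A sketch, with the typical docker0 bridge address as a hypothetical fallback:

```shell
#!/usr/bin/env bash
# Sketch: compute the address a containerized tracer should target instead of 127.0.0.1.
# Inside a container, the default gateway normally points back at the Docker host.
HOST_IP="$(ip route 2>/dev/null | awk '/^default via/ {print $3; exit}')"
HOST_IP="${HOST_IP:-172.17.0.1}"   # typical docker0 bridge address, used as a fallback

# DD_AGENT_HOST and DD_TRACE_AGENT_PORT are standard variables read by Datadog tracers
echo "export DD_AGENT_HOST=${HOST_IP}"
echo "export DD_TRACE_AGENT_PORT=8126"
```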
Step 4: Fix Log Collection Issues
Log collection is disabled by default in the Datadog Agent. If logs are not working:
- Enable Logs: Ensure logs_enabled: true is present in datadog.yaml.
- Check Permissions: The Agent runs as the dd-agent user. If you configure it to read /var/log/nginx/access.log, the dd-agent user MUST have read permissions for that file and execute permissions for the directory. If it doesn't, you will see a 'Permission denied' error in agent.log. Fix: Add the dd-agent user to the appropriate group (e.g., adm or nginx), or adjust file ACLs.
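You can test readability exactly as the Agent sees it by running the check as the dd-agent user. A sketch, assuming the dd-agent user exists and using an example log path; on machines without the Agent installed it simply reports the failure:

```shell
#!/usr/bin/env bash
# Sketch: verify a log file is readable by the dd-agent service user.
# The path is an example - substitute the file from your logs config.
LOGFILE="${1:-/var/log/nginx/access.log}"

# -n: never prompt for a password; the check fails fast instead of hanging
if sudo -n -u dd-agent test -r "$LOGFILE" 2>/dev/null; then
  echo "dd-agent can read ${LOGFILE}"
else
  echo "dd-agent cannot read ${LOGFILE} - add it to the file's group or set an ACL"
fi
```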
Step 5: Advanced Diagnostics and The Flare
When you have verified the network, checked the configuration, and restarted the Agent (sudo systemctl restart datadog-agent), but Datadog is still not working, it's time to investigate resource constraints or contact support.
Check OOM Kills: If the Agent process suddenly disappears, it might have been killed by the Linux Out-Of-Memory (OOM) killer. Run dmesg -T | grep -i oom or grep -i oom /var/log/syslog. If you see the agent process listed, you need to increase the memory limit on the machine/container, or adjust the Agent's configuration to use less memory (e.g., lower the log collection rate or disable unused integrations).
Generate a Flare: Datadog Support will require a 'Flare'. This is a comprehensive archive of all your Agent logs, configuration files (with secrets scrubbed), and diagnostic commands. Generate it by running:
sudo datadog-agent flare
The command will ask for a ticket number (you can leave it blank) and an email address. It will securely upload the diagnostics directly to Datadog's support team, allowing them to see exactly why the Agent is failing.
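The OOM check described above can also be scripted as a quick sketch. Reading the kernel log may require root, so journalctl is used as a fallback where syslog paths differ by distro; failures simply yield a zero count:

```shell
#!/usr/bin/env bash
# Sketch: count OOM-killer events that mention the Agent process.
# Both log sources are tried; permission errors just produce an empty stream.
OOM_HITS="$( { dmesg 2>/dev/null; journalctl -k --no-pager 2>/dev/null; } \
  | grep -iE 'out of memory|oom-kill|killed process' \
  | grep -ic 'agent' )"

if [ "${OOM_HITS:-0}" -gt 0 ]; then
  echo "Agent OOM-killed ${OOM_HITS} time(s) - raise memory limits or disable unused integrations"
else
  echo "no OOM kills mentioning the Agent found"
fi
```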
# --- Datadog Agent Diagnostic Script ---
# 1. Check the overall status of the Datadog Agent
echo "=== Checking Agent Status ==="
sudo datadog-agent status
# 2. Check if the Agent service is running at the OS level
echo "=== Checking Systemd Service ==="
systemctl status datadog-agent --no-pager
# 3. Test outbound network connectivity to Datadog API (Replace with your site if not US1)
echo "=== Testing API Connectivity ==="
curl -v "https://api.datadoghq.com/api/v1/validate?api_key=${DD_API_KEY}"
# 4. Check the agent log for immediate errors, filtering for warnings or errors
echo "=== Checking Agent Logs ==="
grep -i "error\|warn" /var/log/datadog/agent.log | tail -n 20
# 5. Validate integration configurations for syntax errors
echo "=== Validating YAML Configurations ==="
sudo datadog-agent configcheck
# 6. Restart the agent to apply any changes
# sudo systemctl restart datadog-agent
# 7. Generate a support flare (only if required)
# sudo datadog-agent flare

Error Medic Editorial
Error Medic Editorial is a dedicated team of senior Site Reliability Engineers and DevOps practitioners. We specialize in demystifying complex cloud infrastructure issues and providing actionable, real-world solutions for modern engineering teams.