Troubleshooting 'Datadog Not Working': Fixing Agent Offline, Missing Metrics, and APM Connection Errors
Datadog not working? Learn how to diagnose and fix offline Datadog Agents, missing APM traces, connection timeouts, and API key errors with this guide.
- Agent offline issues are most frequently caused by network or firewall blocks preventing outbound traffic on port 443 to Datadog endpoints.
- Invalid, expired, or incorrectly configured API keys in 'datadog.yaml' will result in 403 Forbidden errors and dropped metrics.
- Missing APM traces usually stem from the Trace Agent not running or applications failing to reach localhost:8126.
- Resource exhaustion (CPU/Memory limits) on the host or container can cause the Datadog Agent to crash or be terminated by the OOM killer.
- Quick fix: Run 'sudo datadog-agent status' to identify the failing component, and check '/var/log/datadog/agent.log' for explicit error messages.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Restart Datadog Agent | When the agent process is hung, consuming too much memory, or after a configuration change. | 1-2 mins | Low |
| Check Agent Logs | When the host appears offline in the UI or specific integrations are missing data. | 5-10 mins | Low |
| Verify Network Connectivity | When logs show 'timeout' or 'connection refused' errors reaching Datadog APIs. | 10-15 mins | Low |
| Run Datadog Flare | When complex issues persist and you need to send comprehensive diagnostics to Datadog Support. | 5 mins | Low |
| Reinstall/Upgrade Agent | When dealing with an outdated, deprecated agent version or severe binary corruption. | 15-30 mins | Medium (brief metric gap) |
Understanding the Error: Why is Datadog Not Working?
When engineers report that 'Datadog is not working,' the symptom can manifest in several different ways. The Datadog ecosystem is vast, relying on a locally installed Agent (a Go-based daemon) that collects, aggregates, and forwards metrics, logs, and traces to Datadog's cloud infrastructure. A failure at any point in this pipeline can mean missing observability data.
The most common manifestations of this problem include:
- The Host is Offline: The infrastructure list in the Datadog UI shows the host as '?' or 'Offline'.
- Missing Infrastructure Metrics: CPU, memory, and disk usage stop updating.
- Missing Integration Data: A specific service (like PostgreSQL, Redis, or NGINX) stops reporting metrics, while base host metrics continue.
- Missing APM Traces: Distributed tracing data is absent, often accompanied by connection errors in the application logs.
- Log Collection Failures: Application or system logs are not appearing in the Log Explorer.
Step 1: Diagnose the Datadog Agent
The absolute first step in troubleshooting any Datadog issue is to query the Agent's status. The Agent comes with a built-in CLI that performs a comprehensive health check.
Run the following command on the affected Linux host:
sudo datadog-agent status
This command outputs a wealth of information. You need to look for specific sections:
1. The Forwarder Section
Look at the 'Forwarder' section to ensure payloads are successfully reaching Datadog. If you see an error like:
Error: error connecting to the agent: dial tcp 127.0.0.1:5001: connect: connection refused
This indicates the core Agent process is not running. Check if the service is active using systemctl status datadog-agent.
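As a quick sketch, the two states (service down vs. service up but unhealthy) can be distinguished in one pass with systemctl's exit code; this is illustrative shell, not part of the Agent's own tooling:

```shell
#!/usr/bin/env bash
# Sketch: distinguish "Agent service down" from "service up but a component failing".
if systemctl is-active --quiet datadog-agent 2>/dev/null; then
  echo "service active - inspect 'sudo datadog-agent status' output for failing components"
else
  echo "service inactive - start it with: sudo systemctl start datadog-agent"
fi
```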
If you see errors related to the API, such as:
[ERROR] Error starting the agent: no API key configured
or
WARN | Forwarder | Error while sending payload to https://api.datadoghq.com/api/v1/series: 403 Forbidden - Invalid API Key
This points squarely to an authentication issue. Your datadog.yaml file either lacks an API key, contains a typo, or the key has been revoked in the Datadog UI.
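Before regenerating keys, a format sanity check on the configured key often catches the typo: Datadog API keys are 32 lowercase hex characters. A minimal sketch, assuming the default Linux config path (it falls back to a throwaway sample file so it can run anywhere; the sample key is a placeholder, not a real credential):

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the api_key in datadog.yaml before blaming the backend.
CONF="/etc/datadog-agent/datadog.yaml"

# Fall back to a throwaway sample config so the sketch runs anywhere.
# The key below is a placeholder, not a real credential.
if [ ! -r "$CONF" ]; then
  CONF="$(mktemp)"
  printf 'api_key: 0123456789abcdef0123456789abcdef\nsite: datadoghq.com\n' > "$CONF"
fi

# Extract the key and verify the expected shape: 32 lowercase hex characters.
KEY="$(awk -F': *' '$1 == "api_key" {print $2}' "$CONF")"

if printf '%s' "$KEY" | grep -Eq '^[0-9a-f]{32}$'; then
  echo "api_key format OK"
else
  echo "api_key missing or malformed - check for whitespace, quotes, or truncation"
fi
```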
2. The Collector Section
If specific integrations are failing, check the 'Collector' section. It lists every integration (check) configured. If an integration is misconfigured, you will see a traceback or an error like:
[ERROR] postgres: Unresolved missing requirement
or
Check 'nginx' failed: dial tcp 127.0.0.1:80: connect: connection refused
This indicates the Agent is running perfectly, but it cannot access the service it is trying to monitor. This is usually due to incorrect credentials in the conf.d/ YAML file, or the target service being down.
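A quick way to separate "the Agent is broken" from "the target service is unreachable" is to probe the integration's endpoint from the Agent host itself. A hedged sketch using bash's built-in /dev/tcp; the host and port are illustrative defaults, so substitute the real values from your conf.d YAML:

```shell
#!/usr/bin/env bash
# Sketch: probe the endpoint an integration is configured to scrape.
# HOST/PORT are examples - take the real values from conf.d/<check>.d/conf.yaml.
HOST="${1:-127.0.0.1}"
PORT="${2:-5432}"   # PostgreSQL default

# bash's /dev/tcp opens a raw TCP connection without needing nc or telnet
if timeout 3 bash -c "exec 3<>/dev/tcp/${HOST}/${PORT}" 2>/dev/null; then
  echo "reachable: ${HOST}:${PORT} - check credentials in the integration config instead"
else
  echo "unreachable: ${HOST}:${PORT} - the check will report 'connection refused' or time out"
fi
```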
Step 2: Fix Network and Connectivity Issues
The Datadog Agent is a push-based system. It needs outbound network access to Datadog's infrastructure. If the Agent is running but no data appears, it is almost always a network issue.
Look at the main Agent log file: /var/log/datadog/agent.log. If you see:
WARN | Forwarder | Error while sending payload: dial tcp: i/o timeout
This means the Agent's outbound traffic is being blocked by a firewall, Security Group (AWS), or network proxy.
How to fix:
- Verify your Datadog Site: Datadog has multiple regions (US1, US3, US5, EU1, US1-FED). If your account is in EU1 but your Agent is sending data to US1 (the default), it will fail. Ensure site: datadoghq.eu (or your appropriate site) is set in /etc/datadog-agent/datadog.yaml.
- Check Firewall Rules: The Agent requires outbound access on TCP port 443 to Datadog's domains. You can test this manually using cURL:
curl -v "https://api.datadoghq.com/api/v1/validate?api_key=YOUR_API_KEY"
If this command hangs, you have a firewall issue. If you use a proxy, ensure the Agent is configured to use it by setting the proxy block in datadog.yaml.
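To see at a glance which proxy settings could be interfering, you can dump both the environment variables curl honors and any proxy block in the Agent's own config. A sketch assuming the default Linux config path:

```shell
#!/usr/bin/env bash
# Sketch: surface proxy settings that affect both the Agent and manual curl tests.
echo "HTTP_PROXY=${HTTP_PROXY:-unset}"
echo "HTTPS_PROXY=${HTTPS_PROXY:-unset}"
echo "NO_PROXY=${NO_PROXY:-unset}"

# Also check for a proxy block in the Agent's own configuration
grep -A3 '^proxy:' /etc/datadog-agent/datadog.yaml 2>/dev/null \
  || echo "no proxy block found in datadog.yaml"
```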
Step 3: Fix APM (Trace Agent) Failures
If host metrics are working but APM traces are missing, the issue lies with the Trace Agent. The Datadog APM architecture requires your application (instrumented with a Datadog library, e.g., dd-trace-js or dd-trace-py) to send traces to the Agent over localhost on port 8126.
Common APM Errors:
Application logs showing: Error: connect ECONNREFUSED 127.0.0.1:8126.
How to fix:
- Enable APM in the Agent: By default, APM is often enabled, but it can be explicitly disabled. Ensure your datadog.yaml has:
  apm_config:
    enabled: true
- Verify the Port: Ensure the Trace Agent is listening. Run sudo netstat -tlnp | grep 8126 or sudo ss -tlnp | grep 8126. You should see the trace-agent process bound to 0.0.0.0:8126 or 127.0.0.1:8126.
- Containerized Environments: If your application is in a Docker container and the Agent is on the host (or in another container), 127.0.0.1 resolves to the application's container, NOT the host. You must configure your tracer to send data to the host IP. In Docker, you can pass the host IP via an environment variable, or use Datadog's Unix Domain Socket (UDS) feature for APM, which is highly recommended for Kubernetes and Docker.
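One common pattern for the container case is to derive the host's address from the container's default route and hand it to the tracer via DD_AGENT_HOST, which Datadog tracing libraries read automatically. A sketch, with the typical docker0 bridge address as a hypothetical fallback:

```shell
#!/usr/bin/env bash
# Sketch: compute the address a containerized tracer should target instead of 127.0.0.1.
# Inside a container, the default gateway normally points back at the Docker host.
HOST_IP="$(ip route 2>/dev/null | awk '/^default via/ {print $3; exit}')"
HOST_IP="${HOST_IP:-172.17.0.1}"   # typical docker0 bridge address, used as a fallback

# DD_AGENT_HOST and DD_TRACE_AGENT_PORT are standard variables read by Datadog tracers
echo "export DD_AGENT_HOST=${HOST_IP}"
echo "export DD_TRACE_AGENT_PORT=8126"
```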
Step 4: Fix Log Collection Issues
Log collection is disabled by default in the Datadog Agent. If logs are not working:
- Enable Logs: Ensure logs_enabled: true is present in datadog.yaml.
- Check Permissions: The Agent runs as the dd-agent user. If you configure it to read /var/log/nginx/access.log, the dd-agent user MUST have read permissions for that file and execute permissions for the directory. If it doesn't, you will see a 'Permission denied' error in agent.log. Fix: Add the dd-agent user to the appropriate group (e.g., adm or nginx), or adjust file ACLs.
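You can test readability exactly as the Agent sees it by running the check as the dd-agent user. A sketch, assuming the dd-agent user exists and using an example log path; on machines without the Agent installed it simply reports the failure:

```shell
#!/usr/bin/env bash
# Sketch: verify a log file is readable by the dd-agent service user.
# The path is an example - substitute the file from your logs config.
LOGFILE="${1:-/var/log/nginx/access.log}"

# -n: never prompt for a password; the check fails fast instead of hanging
if sudo -n -u dd-agent test -r "$LOGFILE" 2>/dev/null; then
  echo "dd-agent can read ${LOGFILE}"
else
  echo "dd-agent cannot read ${LOGFILE} - add it to the file's group or set an ACL"
fi
```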
Step 5: Advanced Diagnostics and The Flare
When you have verified the network, checked the configuration, and restarted the Agent (sudo systemctl restart datadog-agent), but Datadog is still not working, it's time to investigate resource constraints or contact support.
Check OOM Kills: If the Agent process suddenly disappears, it might have been killed by the Linux Out-Of-Memory (OOM) killer. Run dmesg -T | grep -i oom or grep -i oom /var/log/syslog. If you see the agent process listed, you need to increase the memory limit on the machine/container, or adjust the Agent's configuration to use less memory (e.g., lower the log collection rate or disable unused integrations).
Generate a Flare: Datadog Support will require a 'Flare'. This is a comprehensive archive of all your Agent logs, configuration files (with secrets scrubbed), and diagnostic commands. Generate it by running:
sudo datadog-agent flare
The command will ask for a ticket number (you can leave it blank) and an email address. It will securely upload the diagnostics directly to Datadog's support team, allowing them to see exactly why the Agent is failing.
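The OOM check described above can also be scripted as a quick sketch. Reading the kernel log may require root, so journalctl is used as a fallback where syslog paths differ by distro; failures simply yield a zero count:

```shell
#!/usr/bin/env bash
# Sketch: count OOM-killer events that mention the Agent process.
# Both log sources are tried; permission errors just produce an empty stream.
OOM_HITS="$( { dmesg 2>/dev/null; journalctl -k --no-pager 2>/dev/null; } \
  | grep -iE 'out of memory|oom-kill|killed process' \
  | grep -ic 'agent' )"

if [ "${OOM_HITS:-0}" -gt 0 ]; then
  echo "Agent OOM-killed ${OOM_HITS} time(s) - raise memory limits or disable unused integrations"
else
  echo "no OOM kills mentioning the Agent found"
fi
```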
# --- Datadog Agent Diagnostic Script ---
# 1. Check the overall status of the Datadog Agent
echo "=== Checking Agent Status ==="
sudo datadog-agent status
# 2. Check if the Agent service is running at the OS level
echo "=== Checking Systemd Service ==="
systemctl status datadog-agent --no-pager
# 3. Test outbound network connectivity to Datadog API (Replace with your site if not US1)
echo "=== Testing API Connectivity ==="
curl -v "https://api.datadoghq.com/api/v1/validate?api_key=${DD_API_KEY}"
# 4. Check the agent log for immediate errors, filtering for warnings or errors
echo "=== Checking Agent Logs ==="
grep -i "error\|warn" /var/log/datadog/agent.log | tail -n 20
# 5. Validate integration configurations for syntax errors
echo "=== Validating YAML Configurations ==="
sudo datadog-agent configcheck
# 6. Restart the agent to apply any changes
# sudo systemctl restart datadog-agent
# 7. Generate a support flare (only if required)
# sudo datadog-agent flare

Error Medic Editorial
Error Medic Editorial is a dedicated team of senior Site Reliability Engineers and DevOps practitioners. We specialize in demystifying complex cloud infrastructure issues and providing actionable, real-world solutions for modern engineering teams.