Error Medic

Fixing 'Elasticsearch Cluster Health Red': Comprehensive Troubleshooting Guide

Resolve an Elasticsearch cluster health status red error. Step-by-step diagnostic commands, unassigned shard recovery, and data restoration strategies.

Key Takeaways
  • A 'red' cluster status means at least one primary shard (and its replicas) is missing or unassigned, leading to data unavailability.
  • The first step is always diagnosing *why* shards are unassigned using the Cluster Allocation Explain API.
  • Common root causes include node failures, disk space exhaustion (watermark breaches), cluster network partitions, or corrupted indices.
  • Fixes range from simple cluster rerouting and node restarts to restoring from a snapshot if data loss has occurred.
Fix Approaches Compared
Method                            | When to Use                                               | Time       | Risk
Cluster Allocation Explain API    | Initial diagnosis for any unassigned shard.               | < 5 mins   | Low
Clear Disk Space / Watermarks     | Nodes hit low disk watermarks blocking allocation.        | 10-30 mins | Low
Manual Shard Allocation (Reroute) | Cluster is stable but failed to auto-recover shards.      | 10 mins    | Medium
Restart Failed Nodes              | Hardware/JVM crash caused nodes to drop from the cluster. | 15-60 mins | Medium
Restore from Snapshot             | Primary shards are permanently lost or corrupted.         | Hours      | High (Data Loss)

Understanding the Error

When working with Elasticsearch, monitoring the cluster health status is a daily operation. The health status can be one of three colors: Green, Yellow, or Red.

  • Green: All primary and replica shards are active and assigned to nodes.
  • Yellow: All primary shards are active, but one or more replica shards are unassigned. The cluster is fully functional, but high availability is degraded.
  • Red: One or more primary shards are unassigned.

When your Elasticsearch cluster status is red, it is a critical incident: some portion of your data is currently unavailable for indexing and searching. Searches that touch a red index return partial results by default (allow_partial_search_results defaults to true), while indexing requests targeting the missing shards fail outright.
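The three colors follow directly from shard assignment. A minimal sketch of that mapping (the function and counts are illustrative, not Elasticsearch source):

```python
def cluster_status(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Map shard assignment onto the health color, per the rules above."""
    if unassigned_primaries > 0:
        return "red"      # some data is unavailable
    if unassigned_replicas > 0:
        return "yellow"   # fully functional, but high availability degraded
    return "green"        # every shard copy is assigned
```

With 4 unassigned primaries, as in the sample health output below, this yields "red".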

The Anatomy of a Red Cluster

Elasticsearch divides indices into shards to distribute data across nodes. Every document belongs to a single primary shard. If the node holding that primary shard crashes, and no replica exists (or the replicas also crashed), that primary shard becomes unassigned. Until a primary shard is promoted from a replica or recovered from disk, the cluster remains red.

You might see this status in your monitoring tools (like Kibana or Datadog) or when hitting the _cluster/health API:

{
  "cluster_name" : "prod-logs-cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 142,
  "active_shards" : 284,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.6
}

In the output above, "status" : "red" and "unassigned_shards" : 4 are the immediate indicators of trouble.

Step 1: Diagnose the Unassigned Shards

Do not guess why the cluster is red. Elasticsearch will tell you exactly why it refuses to allocate a shard.

First, identify which indices are red:

curl -X GET "localhost:9200/_cat/indices?v&health=red"

Next, list the specific unassigned shards and their prior node locations:

curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason" | grep UNASSIGNED

This will output something like:

app-logs-2023.10.25 2 p UNASSIGNED NODE_LEFT

The unassigned.reason provides a crucial hint. Common reasons include:

  • NODE_LEFT: The node holding the shard disconnected.
  • CLUSTER_RECOVERED: Full cluster restart, waiting for nodes to join.
  • ALLOCATION_FAILED: Shard allocation failed (often due to disk space or corrupted data).
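The _cat/shards output is whitespace-delimited in the column order given by the h= parameter, which makes it easy to script against. A small sketch (the node column is empty for an unassigned shard, so only five fields appear):

```python
# Sample line from the _cat/shards command above.
line = "app-logs-2023.10.25 2 p UNASSIGNED NODE_LEFT"

# Columns follow h=index,shard,prirep,state,node,unassigned.reason;
# node is blank for unassigned shards and drops out of the split.
index_name, shard, prirep, state, reason = line.split()
```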

To get the definitive reason for the allocation failure, use the Cluster Allocation Explain API:

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

(If you know the specific index and shard, you can pass them in the request body for targeted output).

The output will contain an allocate_explanation and a deciders array. Look for deciders that return "decision": "NO".
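To see how that filtering looks in practice, here is a sketch that pulls the blocking deciders out of an allocation-explain response. The response document is hypothetical and trimmed to the fields discussed above:

```python
import json

# Hypothetical, trimmed allocation-explain response.
explain = json.loads("""
{
  "index": "app-logs-2023.10.25",
  "shard": 2,
  "primary": true,
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {"node_name": "node-1",
     "deciders": [{"decider": "disk_threshold", "decision": "NO",
                   "explanation": "the node is above the low watermark"}]},
    {"node_name": "node-2",
     "deciders": [{"decider": "same_shard", "decision": "YES",
                   "explanation": "ok"}]}
  ]
}
""")

# Keep only the deciders that block allocation.
blocking = [
    (node["node_name"], d["decider"], d["explanation"])
    for node in explain["node_allocation_decisions"]
    for d in node.get("deciders", [])
    if d["decision"] == "NO"
]
```

Here blocking surfaces node-1's disk_threshold decider, pointing straight at Scenario A below.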

Step 2: Address Common Root Causes

Based on the diagnosis, proceed with the appropriate fix.

Scenario A: Disk Space Exhaustion (Disk Watermarks)

Elasticsearch protects nodes from running out of disk space by enforcing disk watermarks. If a node breaches the cluster.routing.allocation.disk.watermark.low (default 85%), it won't allocate new shards to that node. If it breaches the high watermark (90%), it will attempt to relocate shards away. If it hits the flood_stage (95%), indices are set to read-only.
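The three thresholds and their effects can be summarized in a small sketch (default percentages from above; the function is illustrative, not the actual decider logic):

```python
# Default disk watermarks described above.
LOW, HIGH, FLOOD = 85.0, 90.0, 95.0

def watermark_state(disk_used_pct: float) -> str:
    """Map a node's disk usage onto the allocation behavior it triggers."""
    if disk_used_pct >= FLOOD:
        return "flood_stage"   # affected indices forced read-only
    if disk_used_pct >= HIGH:
        return "high"          # shards relocated away from this node
    if disk_used_pct >= LOW:
        return "low"           # no new shards allocated to this node
    return "ok"
```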

If the Allocation Explain API shows a decider blocking allocation due to disk space:

  1. Free up space: Delete old indices, clear application logs on the host, or expand the underlying EBS volume/disk.
  2. Adjust watermarks temporarily (if absolutely necessary):
    PUT _cluster/settings
    {
      "transient": {
        "cluster.routing.allocation.disk.watermark.low": "90%",
        "cluster.routing.allocation.disk.watermark.high": "95%"
      }
    }
    
  3. Once space is freed, you may need to manually trigger allocation if the cluster gave up:
    POST /_cluster/reroute?retry_failed=true
    
Scenario B: Node Disconnection (NODE_LEFT)

If the primary shard was on a node that crashed, was terminated, or lost network connectivity:

  1. Check Node Status: curl -X GET "localhost:9200/_cat/nodes?v". Is the expected number of nodes present?
  2. Investigate the missing node: Check systemic logs (e.g., /var/log/syslog, dmesg, or the hypervisor console) and the Elasticsearch logs (/var/log/elasticsearch/*.log) on the disconnected node.
    • Did the JVM OOM (Out of Memory)?
    • Is there a network partition?
  3. Restart the Node: Often, simply starting the Elasticsearch service on the failed node will allow it to rejoin the cluster. Once it rejoins, Elasticsearch will recognize the local data and promote the shards, turning the cluster from red to yellow, and eventually green as replicas sync.
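The red-to-yellow-to-green progression in step 3 is worth encoding in monitoring. A sketch of interpreting successive _cluster/health responses (the dicts are hypothetical):

```python
def recovery_phase(health: dict) -> str:
    """Translate a _cluster/health status into the recovery stage above."""
    return {
        "red": "primaries still missing",
        "yellow": "primaries recovered; replicas still syncing",
        "green": "fully recovered",
    }[health["status"]]
```

Polling this after a node rejoin tells you whether the local shard data was recognized (red clears quickly) or a deeper problem remains.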
Scenario C: Shard Allocation Delay

By default, when a node leaves the cluster, Elasticsearch immediately promotes a surviving replica to primary where one exists, but waits 1 minute (index.unassigned.node_left.delayed_timeout) before reallocating the replica shards that lived on the departed node, to avoid unnecessary network I/O during brief node restarts.

If the node is gone permanently and replicas exist elsewhere, you can trigger reallocation immediately by lowering this setting:

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "0"
  }
}

Note: If no replicas exist, changing the delay timeout won't help; you must recover the original node or restore from a snapshot.

Scenario D: Corrupted Shards or Permanent Data Loss

If the node containing the primary and only copy of a shard is permanently destroyed (e.g., disk failure), you have lost data.

The Allocation Explain API will show that no valid copies of the shard exist in the cluster.

Your options are:

  1. Restore from Snapshot: This is the safest and correct approach if you have backups configured (e.g., AWS S3 repository).
    POST /_snapshot/my_repository/snapshot_1/_restore
    {
      "indices": "app-logs-2023.10.25",
      "ignore_unavailable": true,
      "include_global_state": false
    }
    
  2. Allocate Empty Primary (Accept Data Loss): If the index is highly transient (e.g., metrics from 5 minutes ago) and you don't have snapshots, you can force the cluster to allocate an empty primary shard. This will permanently delete any data that was in that shard. The cluster will turn green, but the data is gone.
    POST /_cluster/reroute
    {
      "commands": [
        {
          "allocate_empty_primary": {
            "index": "app-logs-2023.10.25",
            "shard": 2,
            "node": "node-1",
            "accept_data_loss": true
          }
        }
      ]
    }
    

Step 3: Prevention

To prevent the Elasticsearch health status from changing to red in the future:

  • Always use replicas: Never run production indices with number_of_replicas: 0. A minimum of 1 ensures high availability if a single node fails.
  • Monitor disk space: Alert heavily on 80% disk utilization to proactively scale storage before hitting watermarks.
  • Ensure node stability: Tune JVM heap size (typically 50% of total RAM, maxing out at ~31GB) and monitor garbage collection times to prevent OOM crashes.
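The heap-sizing rule of thumb above can be written down explicitly: half of physical RAM, capped near 31 GB to stay under the JVM's compressed-oops threshold (a sketch; RAM figures are hypothetical):

```python
def recommended_heap_gb(total_ram_gb: float) -> float:
    """Rule of thumb: 50% of RAM, capped at ~31 GB (compressed oops)."""
    return min(total_ram_gb * 0.5, 31.0)
```

So a 16 GB node gets an 8 GB heap, while a 128 GB node still stops at 31 GB, leaving the remainder to the filesystem cache.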

Quick Reference: Diagnostic Commands

# 1. Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# 2. Identify which indices are red
curl -X GET "localhost:9200/_cat/indices?v&health=red"

# 3. List all unassigned shards and their prior nodes
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason" | grep UNASSIGNED

# 4. The most important command: Ask Elasticsearch WHY the shard is unassigned
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

# 5. Retry failed allocations (useful if disk space was just cleared)
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"

# 6. Check node disk space (to check for watermark breaches)
curl -X GET "localhost:9200/_cat/allocation?v"

Error Medic Editorial

Error Medic Editorial is composed of senior Site Reliability Engineers and database administrators specializing in distributed systems, search infrastructure, and large-scale incident response.
