Error Medic

Resolving Elasticsearch API Timeout Errors: Comprehensive Troubleshooting Guide

Diagnose and fix Elasticsearch API timeouts. Learn how to optimize slow queries, configure client timeouts, analyze slow logs, and manage cluster resources.

Key Takeaways
  • Client-side connection or read timeouts often trigger before the Elasticsearch server completes processing the request.
  • Unoptimized queries, excessive aggregations, or querying too many shards simultaneously are primary culprits for slow responses.
  • Cluster resource exhaustion, such as high CPU usage, GC pauses, or insufficient JVM heap, can lead to cascading timeout failures.
  • Quick Fix: Temporarily increase the client-side timeout parameter, then use the Task Management API and Slow Logs to identify and optimize the offending queries.
Fix Approaches Compared
Method | When to Use | Time | Risk
Increase Client Timeout | Immediate mitigation for occasional, expectedly slow queries. | Minutes | Low (may mask underlying performance issues)
Query Optimization | Consistent timeouts on specific searches; high latency. | Hours | Low
Use Async Search API | Heavy, long-running analytical queries and aggregations. | Hours | Low (requires application logic changes)
Scale Resources/Tune JVM | Cluster-wide performance degradation, frequent long GC pauses. | Days | Medium (involves cost and potential downtime)

Understanding Elasticsearch API Timeouts

When interacting with an Elasticsearch cluster via its REST API or through client libraries (like the Python elasticsearch-py client, Java High Level REST Client, or Node.js client), you may frequently encounter timeout errors. These errors typically manifest as ConnectionTimeout, ReadTimeoutError, or simple HTTP 504 Gateway Timeout responses if there is a proxy (like Nginx or an AWS ALB) sitting in front of your cluster.

A timeout fundamentally means that the client application waited for a response for a predefined period, and the Elasticsearch server failed to deliver the complete response within that window. It is critical to differentiate between a client-side timeout (where the client gives up, but Elasticsearch might still be processing the query in the background) and a server-side timeout (where Elasticsearch itself terminates the query execution).

Common Error Signatures

Depending on your client and environment, the error might look like:

  • Python: elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=10))
  • Node.js: TimeoutError: Request timed out
  • cURL: curl: (28) Operation timed out after 30001 milliseconds with 0 bytes received

Step 1: Diagnose the Root Cause

Before indiscriminately increasing timeout values, you must determine why the requests are taking so long. Is it an isolated heavy query, or is the entire cluster struggling?

1. Check Cluster Health and Load

The first step is always to check the overall health of the cluster. A "red" or "yellow" cluster state, or a cluster with a massive number of pending tasks, is highly susceptible to timeouts.

Use the _cluster/health and _nodes/stats APIs to check for high CPU utilization, high system load, or JVM heap pressure. If your JVM heap usage consistently hovers above 85-90%, the nodes will spend excessive time in Garbage Collection (GC) pauses. During a "Stop-the-World" GC pause, the node cannot process any requests, leading directly to timeouts.
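As a minimal sketch, the heap-pressure check can be automated against the `_nodes/stats` response. The `jvm.mem.heap_used_percent` field is part of the real API response; the trimmed-down sample dict below stands in for a live cluster.

```python
# Sketch: flag nodes under JVM heap pressure from a _nodes/stats response.
# In practice, fetch the stats with something like
# requests.get("http://localhost:9200/_nodes/stats/jvm").json()

def nodes_under_heap_pressure(stats, threshold=85):
    """Return (node_name, heap_pct) for nodes at or above `threshold` percent."""
    pressured = []
    for node in stats.get("nodes", {}).values():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        if heap_pct >= threshold:
            pressured.append((node["name"], heap_pct))
    return pressured

# Trimmed-down sample response:
sample = {
    "nodes": {
        "abc123": {"name": "node-1", "jvm": {"mem": {"heap_used_percent": 91}}},
        "def456": {"name": "node-2", "jvm": {"mem": {"heap_used_percent": 62}}},
    }
}
print(nodes_under_heap_pressure(sample))  # [('node-1', 91)]
```

Wiring this into an alerting loop gives you an early warning well before GC pauses start producing client timeouts.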

2. Analyze Pending Tasks

If the cluster is overwhelmed, tasks will queue up. Check the pending tasks API:

GET /_cluster/pending_tasks

If you see a long list of tasks (e.g., shard allocations, mapping updates) taking a long time, the master node might be the bottleneck.

3. Enable and Inspect Slow Logs

Elasticsearch provides Search Slow Logs and Index Slow Logs. These are invaluable for identifying the specific queries that are exceeding acceptable execution times. You can dynamically enable slow logs on an index without restarting the cluster.

Once enabled, monitor the logs (usually located in /var/log/elasticsearch/ or your container's stdout) to find queries that consistently exceed your timeout thresholds. Look for queries with massive aggregations, deep pagination, or wildcard searches leading to high memory consumption.
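The slow-log thresholds can also be expressed as a settings body in Python. This is a sketch mirroring the curl reference at the end of this guide; applying it would look something like `es.indices.put_settings(index="my-index", settings=slowlog_settings)` in the 8.x Python client (note that `index.search.slowlog.level` applies to 7.x and was removed in 8.0).

```python
# Sketch: slow-log thresholds as a dynamic index-settings body.
# Queries over 5s are logged at info level, over 10s at warn level.
slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.level": "info",  # 7.x only
}
```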

4. Use the Task Management API

For queries that are currently running and potentially causing timeouts, you can view them in real-time using the Task Management API. This allows you to find long-running tasks and even cancel them if they are threatening cluster stability.
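A small helper can sift a `_tasks?detailed=true&actions=*search` response for candidates worth inspecting or cancelling. The `running_time_in_nanos` and `action` fields are part of the real response; the sample dict is illustrative.

```python
# Sketch: pick out long-running search tasks from a _tasks?detailed=true
# response, so they can be reviewed or cancelled via POST /_tasks/<id>/_cancel.

def long_running_searches(tasks_response, min_seconds=30):
    """Return (task_id, seconds, description) for search tasks over min_seconds."""
    found = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            seconds = task["running_time_in_nanos"] / 1e9
            if "search" in task["action"] and seconds >= min_seconds:
                found.append((task_id, round(seconds, 1), task.get("description", "")))
    return found

# Trimmed-down sample response:
sample = {
    "nodes": {
        "n1": {
            "tasks": {
                "n1:42": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 95_000_000_000,
                    "description": "indices[logs-*], search_type[QUERY_THEN_FETCH]",
                }
            }
        }
    }
}
print(long_running_searches(sample))
# [('n1:42', 95.0, 'indices[logs-*], search_type[QUERY_THEN_FETCH]')]
```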

Step 2: Implement Solutions

Once you have identified the bottleneck, you can apply the appropriate fix.

Solution 1: Adjusting Timeout Parameters

If the queries are legitimate and simply take a long time (and you cannot optimize them further), you may need to increase the timeout settings.

Client-Side: Increase the timeout in your Elasticsearch client configuration. For example, in Python:

from elasticsearch import Elasticsearch

# Increase the per-request timeout to 60 seconds.
# elasticsearch-py 8.x uses `request_timeout`; older 7.x clients use `timeout`.
es = Elasticsearch(["http://localhost:9200"], request_timeout=60)

Server-Side (Query Timeout): You can also pass a timeout parameter directly in your search request body to tell Elasticsearch to terminate the query if it takes too long, preventing it from consuming resources indefinitely. Note that this is best-effort: the response sets "timed_out": true and may contain partial results collected before the cutoff.

GET /my-index/_search
{
  "timeout": "10s",
  "query": { ... }
}
Solution 2: Query Optimization

Often, the query itself is the problem.

  • Avoid Deep Pagination: Using from and size for deep pagination is highly inefficient. Use the search_after parameter or the Scroll API for iterating over large datasets.
  • Use Filters Instead of Queries: If you do not need relevance scoring (BM25), wrap your queries in a filter context (e.g., inside a bool query). Filters are faster and their results can be cached.
  • Optimize Aggregations: Avoid high-cardinality terms aggregations on large datasets. Consider using the composite aggregation to paginate through buckets.
  • Routing: If your data is routed based on a specific key (like a user ID), ensure you include the routing parameter in your search request. This allows Elasticsearch to query only the specific shard holding the data, rather than broadcasting the request to all shards.
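Two of the optimizations above can be combined in a single request body: a filter context (no scoring, cacheable) with search_after pagination in place of deep from/size. This is a sketch; the index fields (status, created_at, id) are illustrative placeholders.

```python
# Sketch: filter context + search_after pagination instead of deep from/size.
page_one = {
    "size": 100,
    "query": {
        "bool": {
            "filter": [  # filter context: no BM25 scoring, results cacheable
                {"term": {"status": "active"}},
                {"range": {"created_at": {"gte": "now-7d/d"}}},
            ]
        }
    },
    # search_after requires a deterministic sort, including a unique tiebreaker.
    "sort": [{"created_at": "asc"}, {"id": "asc"}],
}

# For the next page, copy the `sort` values of the last hit into
# `search_after` instead of increasing `from`:
next_page = dict(page_one, search_after=["2024-05-01T00:00:00Z", "doc-100"])
```

Because the filter clauses carry no `from` offset, each page costs roughly the same regardless of how deep into the result set you are.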
Solution 3: The Async Search API

If you have genuinely heavy analytical queries that are expected to take minutes to complete (and thus routinely trigger timeouts on standard HTTP connections), you should migrate to the Async Search API (_async_search).

This API allows you to submit a query and immediately receive an ID. You can then poll this ID periodically to check the status of the query and retrieve the results once they are ready. This completely bypasses the limitations of standard HTTP timeouts.
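The submit-then-poll pattern can be sketched as below. To keep the example self-contained, the `submit` and `get_status` callables stand in for the real client calls (in the modern Python client these would be along the lines of `es.async_search.submit(...)` and `es.async_search.get(id=...)`); the stubbed states mimic the `id`/`is_running` fields of the async search response.

```python
import time

# Sketch of the async-search polling pattern: submit returns an ID immediately,
# then the caller polls until `is_running` is false.

def run_async_search(submit, get_status, poll_interval=2.0, max_polls=100):
    response = submit()                  # e.g. {"id": ..., "is_running": True}
    for _ in range(max_polls):
        if not response["is_running"]:
            return response["response"]  # the completed search results
        time.sleep(poll_interval)
        response = get_status(response["id"])
    raise TimeoutError("async search did not finish in time")

# Stubbed example that finishes on the second poll:
states = iter([
    {"id": "abc", "is_running": True},
    {"id": "abc", "is_running": False, "response": {"hits": {"total": 42}}},
])
result = run_async_search(lambda: next(states), lambda _id: next(states),
                          poll_interval=0)
print(result)  # {'hits': {'total': 42}}
```

Because each poll is a fast, independent HTTP request, no single connection ever has to stay open for the full duration of the query.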

Solution 4: Indexing Strategy and Shard Sizing

An excessive number of small shards (the "oversharding" problem) forces Elasticsearch to execute a query across many shards and merge the results, which creates significant overhead and can lead to timeouts. Conversely, shards that are too large can cause slow recovery and heavy queries. Aim for a shard size between 10GB and 50GB. Use the Rollover API and Index Lifecycle Management (ILM) to manage time-series data effectively.
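As a sketch, the 10-50GB guideline can be checked mechanically against `GET /_cat/shards?h=index,shard,prirep,store&bytes=b` output; the index names and sizes below are illustrative.

```python
# Sketch: flag primary shards outside the suggested 10-50 GB range,
# from _cat/shards plain-text output (columns: index shard prirep store-bytes).
GB = 1024 ** 3

def missized_shards(cat_output, low_gb=10, high_gb=50):
    flagged = []
    for line in cat_output.strip().splitlines():
        index, shard, prirep, store = line.split()
        size_gb = int(store) / GB
        if prirep == "p" and not (low_gb <= size_gb <= high_gb):
            flagged.append((index, shard, round(size_gb, 1)))
    return flagged

sample = """\
logs-2024.01 0 p 214748364800
logs-2024.02 0 p 32212254720
logs-2024.03 0 p 524288000
"""
print(missized_shards(sample))
# [('logs-2024.01', '0', 200.0), ('logs-2024.03', '0', 0.5)]
```

Oversized shards are candidates for a tighter ILM rollover policy; the tiny ones are candidates for merging via shrink or reindex.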

Step 3: Preventative Monitoring

To prevent future timeouts, establish robust monitoring. Track metrics such as:

  • 99th percentile search latency.
  • JVM Heap Usage and GC frequency/duration.
  • Thread pool rejections (especially the search and write thread pools).

When thread pool rejections start spiking, it is a clear leading indicator that the cluster is saturated and timeouts are imminent.
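A rejection check can be scripted against `GET /_cat/thread_pool/search,write?h=node_name,name,active,queue,rejected` plain-text output; the sample below is illustrative.

```python
# Sketch: detect thread-pool rejections from _cat/thread_pool output
# (columns: node_name name active queue rejected).

def pools_with_rejections(cat_output):
    """Return (node, pool, rejected) for thread pools with any rejections."""
    rejected = []
    for line in cat_output.strip().splitlines():
        node, pool, active, queue, rej = line.split()
        if int(rej) > 0:
            rejected.append((node, pool, int(rej)))
    return rejected

sample = """\
node-1 search 5 0 0
node-1 write  2 0 0
node-2 search 8 47 312
"""
print(pools_with_rejections(sample))  # [('node-2', 'search', 312)]
```

Note that the `rejected` counter is cumulative since node start, so alert on its rate of change rather than its absolute value.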

Diagnostic Command Quick Reference

# 1. Check cluster health and active nodes
curl -X GET "localhost:9200/_cluster/health?v&pretty"
curl -X GET "localhost:9200/_nodes/stats/jvm,process,os?pretty"

# 2. Check for pending tasks queuing up
curl -X GET "localhost:9200/_cluster/pending_tasks?pretty"

# 3. Dynamically enable Search Slow Logs for an index
# Queries over 5 seconds are logged at info level, over 10 seconds at warn level
curl -X PUT "localhost:9200/my-index/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.level": "info"
}
'

# 4. View currently running search tasks
curl -X GET "localhost:9200/_tasks?detailed=true&actions=*search&pretty"

# 5. Cancel a specific long-running task (replace node_id:task_number)
# curl -X POST "localhost:9200/_tasks/node_id:task_number/_cancel"

Error Medic Editorial

The Error Medic Editorial team consists of senior Site Reliability Engineers and DevOps practitioners dedicated to providing actionable, real-world solutions for complex distributed systems.
