How to Fix Elasticsearch API Timeout Errors (Request Timeout after 30000ms)
Resolve Elasticsearch API timeouts. Diagnose slow queries, GC pauses, and thread pool exhaustion. Learn to optimize queries and adjust client timeout settings.
- Client-side timeouts (e.g., RequestTimeoutError) occur when the client gives up before Elasticsearch finishes processing.
- Server-side timeouts often result from heavy queries, deep pagination, unoptimized mappings, or JVM Garbage Collection (GC) pauses.
- Increasing timeouts is a temporary band-aid; long-term fixes require query optimization, using Async Search, or scaling the cluster.
- Thread pool rejections and high CPU/Heap usage are primary indicators of an under-resourced cluster causing API timeouts.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Increase Client Timeout | Immediate mitigation for occasional spikes | 1 min | High (Can mask underlying issues and exhaust cluster resources) |
| Use Async Search API | For known long-running aggregations or reports | Hours | Low (Designed specifically for heavy workloads) |
| Optimize Queries | Permanent fix for inefficient searches (e.g., removing leading wildcards) | Days | Low (Improves overall cluster stability) |
| Scale Up/Out Cluster | When CPU, RAM, or JVM heap are consistently maxed out | Days-Weeks | Medium (Requires downtime or careful rolling restarts) |
Understanding the Error
When working with Elasticsearch at scale, one of the most common and frustrating roadblocks developers and operations teams encounter is the Elasticsearch API timeout error. Depending on where the timeout occurs—on the client, at an intermediary proxy, or within the Elasticsearch cluster itself—the exact error message you see can vary wildly.
Common error manifestations include:
- Node.js/JavaScript Client: RequestTimeoutError: Request Timeout after 30000ms
- Python Client: elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=10))
- Direct cURL or Kibana Dev Tools: {"error":{"root_cause":[{"type":"timeout_exception","reason":"java.util.concurrent.TimeoutException"}],"type":"timeout_exception","reason":"java.util.concurrent.TimeoutException"},"status":500}
- Reverse Proxy (Nginx/HAProxy/AWS API Gateway): 504 Gateway Timeout
These errors almost invariably mean a mismatch between the time your client is willing to wait and the time Elasticsearch requires to gather, compute, and return the data. In distributed systems, this isn't just an annoyance; it's a symptom of resource contention, unoptimized data models, or poorly constructed queries.
Step 1: Diagnose the Root Cause
Before you immediately reach for the timeout dial to increase it to 5 minutes, you must diagnose why the timeout is happening. Increasing the timeout without understanding the underlying cause often leads to cascading failures, where long-running queries pile up, consume all available search threads, trigger massive Garbage Collection (GC) pauses, and eventually crash nodes.
1. Check Cluster Health and Pending Tasks
If your cluster is constantly red or yellow, or if it has a massive backlog of pending tasks, API requests will queue up and eventually time out.
Run GET /_cluster/health?pretty. Look at the status, number_of_pending_tasks, and active_shards_percent.
Next, run GET /_cluster/pending_tasks?pretty. If you see a massive list of cluster state updates, mapping updates, or shard allocations, your cluster is busy doing internal bookkeeping, starving your search requests.
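The two responses above can also be triaged programmatically. Here is a minimal Python sketch (the field names come from the cluster health and pending tasks APIs; the backlog threshold of 50 is an arbitrary assumption, not an Elasticsearch default):

```python
def assess_cluster(health, pending_tasks):
    """Rough triage of GET /_cluster/health and GET /_cluster/pending_tasks
    responses, both passed in as parsed JSON dicts."""
    issues = []
    if health.get("status") != "green":
        issues.append("cluster status is %s" % health.get("status"))
    if health.get("number_of_pending_tasks", 0) > 50:  # arbitrary threshold
        issues.append("large pending-task backlog")
    if health.get("active_shards_percent_as_number", 100.0) < 100.0:
        issues.append("shards still initializing or unassigned")
    if pending_tasks.get("tasks"):
        issues.append("%d pending cluster state updates" % len(pending_tasks["tasks"]))
    return issues
```

Any non-empty result here means search requests are competing with internal bookkeeping and timeouts are likely a symptom, not the disease.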
2. Analyze Node Hot Threads and JVM Pressure
When a query times out, the Elasticsearch node might be actively grinding through CPU cycles, or it might be paused entirely due to JVM garbage collection.
Run GET /_nodes/hot_threads during a timeout event. This endpoint is pure gold for SREs. It returns a stack trace of the threads consuming the most CPU. If you see deep stack traces involving org.apache.lucene.search, you have an expensive query. If you see lots of GC threads, your JVM heap is likely maxed out.
3. Identify Expensive Queries via Slow Logs
Elasticsearch has built-in slow logs that you can enable dynamically. If you suspect a specific index is causing the timeouts, enable the search slow log:
PUT /my-index/_settings
{
"index.search.slowlog.threshold.query.warn": "10s",
"index.search.slowlog.threshold.fetch.debug": "500ms"
}
Monitor your Elasticsearch log files. You will likely find queries using wildcards at the beginning of terms (e.g., *keyword), heavy regex queries, massive terms aggregations, or deep pagination (from: 100000, size: 100).
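If you manage index settings from code rather than Dev Tools, the same thresholds can be applied through the Python client. A hedged sketch (the helper name is ours; es.indices.put_settings is the client call, taking settings= in elasticsearch-py 8.x and body= in 7.x):

```python
def slowlog_settings(query_warn="10s", fetch_debug="500ms"):
    """Build the dynamic index settings body that enables search slow logs.
    Apply with, e.g.:
        es.indices.put_settings(index="my-index", settings=slowlog_settings())
    (use body= instead of settings= on elasticsearch-py 7.x)."""
    return {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.fetch.debug": fetch_debug,
    }
```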
Step 2: Implement Fixes
Fixing Elasticsearch timeouts requires a layered approach: short-term mitigations to restore service, and long-term architectural changes to ensure stability.
Short-Term Fix: Adjusting Timeouts Wisely
If your cluster is healthy but a specific API endpoint requires more time, you can increase the timeout. However, you must differentiate between the client-side timeout and the server-side timeout.
Client-Side: When using the official clients, you must explicitly tell the client to wait longer. For example, in Python (note that the timeout parameter was renamed request_timeout in elasticsearch-py 8.x):
from elasticsearch import Elasticsearch
es = Elasticsearch(["http://localhost:9200"], timeout=60, max_retries=3, retry_on_timeout=True)
In Node.js:
const { Client } = require('@elastic/elasticsearch')
const client = new Client({ node: 'http://localhost:9200', requestTimeout: 60000 })
Server-Side: You can also pass a timeout parameter to your search requests to tell Elasticsearch to return whatever partial results it has gathered before the time expires:
GET /my-index/_search?timeout=10s
This is highly recommended for user-facing applications where partial data is better than an error page.
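When you rely on a server-side timeout, always check the response's timed_out flag so your application knows it is rendering partial data. A minimal sketch (the function name and return shape are ours; the timed_out and _shards fields are standard parts of the search response):

```python
def classify_search_response(resp):
    """Decide whether a search response (parsed JSON dict) is complete or
    partial. With ?timeout=10s Elasticsearch sets "timed_out": true and
    returns whatever hits each shard managed to gather in time."""
    shards = resp.get("_shards", {})
    partial = resp.get("timed_out", False) or shards.get("failed", 0) > 0
    return {
        "partial": partial,
        "hits": len(resp.get("hits", {}).get("hits", [])),
    }
```

A user-facing app can then show a "results may be incomplete" banner instead of silently presenting truncated data as complete.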
Long-Term Fix 1: Adopt the Async Search API
If you are running heavy aggregations, reporting queries, or extracting massive amounts of data, you should not be using synchronous HTTP requests. A standard HTTP connection will likely be dropped by intermediate load balancers (like AWS ALB or Nginx) after 60 seconds anyway.
Instead, use the Elasticsearch _async_search API. This API allows you to submit a query and get an ID back immediately. You can then poll this ID to check the status and retrieve the results when they are ready, completely bypassing HTTP timeout limitations.
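In Python, the submit-then-poll flow looks roughly like this. A sketch assuming elasticsearch-py 8.x, where the async search endpoints are exposed as client.async_search.submit / .get; the id, is_running, and response fields follow the async search response format:

```python
import time

def run_async_search(client, index, body, poll_interval=2.0, max_wait=600.0):
    """Submit a heavy query via _async_search and poll until it finishes."""
    submitted = client.async_search.submit(
        index=index, body=body, wait_for_completion_timeout="1s"
    )
    if not submitted.get("is_running", False):
        return submitted["response"]  # finished within the initial 1s wait
    search_id = submitted["id"]
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = client.async_search.get(id=search_id)
        if not status.get("is_running", False):
            return status["response"]
        time.sleep(poll_interval)
    raise TimeoutError("async search %s still running after %ss" % (search_id, max_wait))
```

Because the ID can be fetched later from any process, this pattern also survives load-balancer idle timeouts that would kill a long-lived synchronous connection.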
Long-Term Fix 2: Optimize Queries and Mappings
- Stop Deep Pagination: If you are using from and size to page through thousands of results, stop. Elasticsearch must sort and rank all results up to from + size for every page request. Use search_after or the Point in Time (PIT) API for deep scrolling.
- Avoid Leading Wildcards: Queries like *error* force Lucene to scan the entire inverted index. Use match_phrase, n-grams, or standard text analysis instead.
- Use keyword types for exact matches: Do not run terms aggregations on text fields. Ensure your mappings specify keyword for fields you intend to aggregate or filter on exactly.
- Pre-calculate data: If you run the same heavy aggregation constantly, use Transforms to pivot the data into a summarized index in the background.
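To make the deep-pagination fix concrete, here is a hedged Python sketch of search_after paging. It assumes documents carry a sortable timestamp field and uses _id as a tiebreaker purely for illustration; for production-depth scrolling, combine search_after with a Point in Time and the _shard_doc tiebreaker:

```python
def iterate_hits(client, index, query, page_size=500):
    """Generator that pages through all matches with search_after,
    keeping per-request cost flat instead of growing with page depth."""
    search_after = None
    while True:
        body = {
            "query": query,
            "size": page_size,
            # Deterministic sort is required; the _id tiebreak is illustrative.
            "sort": [{"timestamp": "asc"}, {"_id": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        resp = client.search(index=index, body=body)
        hits = resp["hits"]["hits"]
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]  # cursor for the next page
```

Unlike from/size, each request here only sorts one page plus the cursor, so page 1,000 costs the same as page 1.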
Long-Term Fix 3: Cluster Sizing and Circuit Breakers
If queries are optimized but timeouts persist, your cluster simply lacks horsepower.
- Set your JVM heap to no more than 50% of available RAM, and keep it below ~32GB so the JVM can still use compressed ordinary object pointers (oops).
- Check your thread pools (GET /_cat/thread_pool?v). If you see a high number of rejected tasks in the search or write queues, your nodes are saturated. You need to add more data nodes to the cluster to distribute the load.
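Rejections are easiest to spot from GET /_cat/thread_pool?format=json, which returns one JSON row per node and pool. A small sketch for filtering them (the column names follow the cat API's JSON output; note that the values arrive as strings):

```python
def saturated_pools(rows, pools=("search", "write")):
    """Return (node, pool, rejected) tuples for pools with any rejections,
    given the parsed JSON from GET /_cat/thread_pool?format=json."""
    return [
        (row.get("node_name"), row["name"], int(row.get("rejected", "0")))
        for row in rows
        if row.get("name") in pools and int(row.get("rejected", "0")) > 0
    ]
```

Any nonzero rejected count means requests were turned away because the queue was full, which surfaces to clients as timeouts or 429s.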
Quick Reference: Diagnostic and Remediation Commands
# --- Diagnostic Commands ---
# 1. Check overall cluster health and unassigned shards
curl -X GET "localhost:9200/_cluster/health?pretty"
# 2. Check for operations blocking the cluster
curl -X GET "localhost:9200/_cluster/pending_tasks?pretty"
# 3. Identify what the CPU is currently doing (run during a timeout event)
curl -X GET "localhost:9200/_nodes/hot_threads?pretty"
# 4. Check thread pool rejections (look for the 'search' thread pool)
curl -X GET "localhost:9200/_cat/thread_pool?v&s=name"
# --- Remediation & Workarounds ---
# Execute a search with a strict server-side timeout to return partial data
curl -X GET "localhost:9200/my-index/_search?timeout=5s" -H 'Content-Type: application/json' -d'
{
"query": {
"match_all": {}
}
}'
# Example of using the Async Search API for long-running queries
# This returns an ID immediately instead of timing out
curl -X POST "localhost:9200/my-index/_async_search?size=0" -H 'Content-Type: application/json' -d'
{
"aggs": {
"daily_sales": {
"date_histogram": {
"field": "timestamp",
"calendar_interval": "1d"
}
}
}
}'
# To retrieve the results later using the returned ID:
# curl -X GET "localhost:9200/_async_search/<YOUR_ASYNC_SEARCH_ID>"
Error Medic Editorial
Error Medic Editorial comprises senior DevOps engineers, SREs, and database administrators dedicated to solving complex infrastructure bottlenecks and distributed system failures.