Error Medic

How to Fix Elasticsearch API Timeout Errors (Request Timeout after 30000ms)

Resolve Elasticsearch API timeouts. Diagnose slow queries, GC pauses, and thread pool exhaustion. Learn to optimize queries and adjust client timeout settings.

Key Takeaways
  • Client-side timeouts (e.g., RequestTimeoutError) occur when the client gives up before Elasticsearch finishes processing.
  • Server-side timeouts often result from heavy queries, deep pagination, unoptimized mappings, or JVM Garbage Collection (GC) pauses.
  • Increasing timeouts is a temporary band-aid; long-term fixes require query optimization, using Async Search, or scaling the cluster.
  • Thread pool rejections and high CPU/Heap usage are primary indicators of an under-resourced cluster causing API timeouts.
Fix Approaches Compared
Method                  | When to Use                                                               | Time       | Risk
Increase Client Timeout | Immediate mitigation for occasional spikes                                | 1 min      | High (can mask underlying issues and exhaust cluster resources)
Use Async Search API    | For known long-running aggregations or reports                            | Hours      | Low (designed specifically for heavy workloads)
Optimize Queries        | Permanent fix for inefficient searches (e.g., removing leading wildcards) | Days       | Low (improves overall cluster stability)
Scale Up/Out Cluster    | When CPU, RAM, or JVM heap are consistently maxed out                     | Days-Weeks | Medium (requires downtime or careful rolling restarts)

Understanding the Error

When working with Elasticsearch at scale, one of the most common and frustrating roadblocks developers and operations teams encounter is the Elasticsearch API timeout error. Depending on where the timeout occurs—on the client, at an intermediary proxy, or within the Elasticsearch cluster itself—the exact error message you see can vary wildly.

Common error manifestations include:

  • Node.js/JavaScript Client: RequestTimeoutError: Request Timeout after 30000ms
  • Python Client: elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=10))
  • Direct cURL or Kibana Dev Tools: {"error":{"root_cause":[{"type":"timeout_exception","reason":"java.util.concurrent.TimeoutException"}],"type":"timeout_exception","reason":"java.util.concurrent.TimeoutException"},"status":500}
  • Reverse Proxy (Nginx/HAProxy/AWS API Gateway): 504 Gateway Timeout

These errors almost invariably mean a mismatch between the time your client is willing to wait and the time Elasticsearch requires to gather, compute, and return the data. In distributed systems, this isn't just an annoyance; it's a symptom of resource contention, unoptimized data models, or poorly constructed queries.
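Because these timeouts are often transient spikes rather than hard failures, a bounded retry with exponential backoff on the client side prevents a single slow request from becoming a user-facing error, while the backoff keeps the client from hammering an already-struggling cluster. This is a minimal, client-agnostic sketch; `with_retries` is a hypothetical helper (the official clients also ship their own retry settings), and `call` is assumed to be any zero-argument function wrapping your request:

```python
import time

def with_retries(call, max_retries=3, base_delay=1.0, timeout_exc=(TimeoutError,)):
    """Retry a zero-argument callable on timeout with exponential backoff.

    Hypothetical helper for illustration; `call` might wrap something
    like `lambda: es.search(...)`. Not part of any official client API.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except timeout_exc:
            if attempt == max_retries:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

In practice you would pass the client library's own timeout exception types via `timeout_exc` instead of the built-in `TimeoutError`.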

Step 1: Diagnose the Root Cause

Before you immediately reach for the timeout dial to increase it to 5 minutes, you must diagnose why the timeout is happening. Increasing the timeout without understanding the underlying cause often leads to cascading failures, where long-running queries pile up, consume all available search threads, trigger massive Garbage Collection (GC) pauses, and eventually crash nodes.

1. Check Cluster Health and Pending Tasks

If your cluster is constantly red or yellow, or if it has a massive backlog of pending tasks, API requests will queue up and eventually time out. Run GET /_cluster/health?pretty. Look at the status, number_of_pending_tasks, and active_shards_percent. Next, run GET /_cluster/pending_tasks?pretty. If you see a massive list of cluster state updates, mapping updates, or shard allocations, your cluster is busy doing internal bookkeeping, starving your search requests.
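The checks above can be automated against the JSON bodies those two endpoints return. This sketch uses the real response field names (`status`, `number_of_pending_tasks`, `active_shards_percent_as_number`, `tasks`), but the thresholds are illustrative assumptions, not official limits:

```python
def triage_cluster_health(health: dict, pending_tasks: dict) -> list:
    """Flag conditions from GET /_cluster/health and GET /_cluster/pending_tasks
    that commonly precede API timeouts. Thresholds are illustrative."""
    findings = []
    if health.get("status") != "green":
        findings.append(f"cluster status is {health.get('status')}")
    if health.get("number_of_pending_tasks", 0) > 100:
        findings.append("large pending-task backlog")
    if health.get("active_shards_percent_as_number", 100.0) < 100.0:
        findings.append("unassigned shards present")
    if len(pending_tasks.get("tasks", [])) > 100:
        findings.append("cluster busy with internal bookkeeping")
    return findings
```

An empty list means none of these red flags are present and the investigation should move on to query-level causes.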

2. Analyze Node Hot Threads and JVM Pressure

When a query times out, the Elasticsearch node might be actively grinding through CPU cycles, or it might be paused entirely due to JVM garbage collection. Run GET /_nodes/hot_threads during a timeout event. This endpoint is pure gold for SREs. It returns a stack trace of the threads consuming the most CPU. If you see deep stack traces involving org.apache.lucene.search, you have an expensive query. If you see lots of GC threads, your JVM heap is likely maxed out.
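The two signatures described above can be spotted mechanically in the plain-text report `hot_threads` returns. This is a rough heuristic sketch, assuming only the two stack-frame markers mentioned in this section; real triage should read the full report:

```python
def classify_hot_threads(report: str) -> str:
    """Rough triage of GET /_nodes/hot_threads plain-text output.
    The markers are heuristics, not an exhaustive list."""
    if "org.apache.lucene.search" in report:
        return "expensive-query"  # CPU burned inside Lucene search/scoring
    if "GC" in report or "Garbage" in report:
        return "gc-pressure"      # heap pressure / collection pauses
    return "unknown"
```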

3. Identify Expensive Queries via Slow Logs

Elasticsearch has built-in slow logs that you can enable dynamically. If you suspect a specific index is causing the timeouts, enable the search slow log:

PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.fetch.debug": "500ms"
}

Monitor your Elasticsearch log files. You will likely find queries using wildcards at the beginning of terms (e.g., *keyword), heavy regex queries, massive terms aggregations, or deep pagination (from: 100000, size: 100).
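Once the slow logs surface the offending request bodies, the worst patterns can be flagged programmatically before a query is ever sent. A simplified sketch, assuming only top-level `from`/`size` and `wildcard` clauses are checked (the 10,000 cutoff mirrors the default `index.max_result_window`):

```python
def flag_expensive_patterns(query_body: dict) -> list:
    """Flag the slow-log offenders named above in a search request body.
    Simplified: only inspects top-level clauses."""
    flags = []
    # Deep pagination: ES must rank all results up to from + size
    if query_body.get("from", 0) + query_body.get("size", 10) > 10000:
        flags.append("deep pagination (from + size > 10000)")
    # Leading wildcard: forces a full scan of the inverted index
    for field, pattern in query_body.get("query", {}).get("wildcard", {}).items():
        value = pattern["value"] if isinstance(pattern, dict) else pattern
        if str(value).startswith("*"):
            flags.append(f"leading wildcard on '{field}'")
    return flags
```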

Step 2: Implement Fixes

Fixing Elasticsearch timeouts requires a layered approach: short-term mitigations to restore service, and long-term architectural changes to ensure stability.

Short-Term Fix: Adjusting Timeouts Wisely

If your cluster is healthy but a specific API endpoint requires more time, you can increase the timeout. However, you must differentiate between the client-side timeout and the server-side timeout.

Client-Side: If using the official clients, you must explicitly tell the client to wait longer. For example, in Python:

from elasticsearch import Elasticsearch
es = Elasticsearch(["http://localhost:9200"], timeout=60, max_retries=3, retry_on_timeout=True)

In Node.js:

const { Client } = require('@elastic/elasticsearch')
const client = new Client({ node: 'http://localhost:9200', requestTimeout: 60000 })

Server-Side: You can also pass a timeout parameter to your search requests to tell Elasticsearch to return whatever partial results it has gathered before the time expires: GET /my-index/_search?timeout=10s. This is highly recommended for user-facing applications where partial data is better than an error page.
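When the server-side timeout fires, Elasticsearch does not return an error; it sets "timed_out": true in the response and includes the hits from the shards that finished. A minimal sketch of reading such a response, using the real search response field names (`timed_out`, `hits.hits`, `_source`):

```python
def extract_partial_hits(response: dict):
    """Read a search response that may have been cut short by ?timeout=.
    Returns the hit documents plus a flag indicating partial results."""
    partial = response.get("timed_out", False)
    hits = [h["_source"] for h in response["hits"]["hits"]]
    return hits, partial
```

A user-facing application can render the hits normally and, when the flag is set, show a "showing partial results" notice instead of an error page.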

Long-Term Fix 1: Adopt the Async Search API

If you are running heavy aggregations, reporting queries, or extracting massive amounts of data, you should not be using synchronous HTTP requests. A standard HTTP connection will likely be dropped by intermediate load balancers (like AWS ALB or Nginx) after 60 seconds anyway.

Instead, use the Elasticsearch _async_search API. This API allows you to submit a query and get an id back immediately. You can then poll this id to check the status and retrieve the results when they are ready, completely bypassing HTTP timeout limitations.
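The submit-then-poll workflow looks like this in outline. `poll_async_search` is a hypothetical helper; `fetch_status` stands in for a GET /_async_search/&lt;id&gt; call (e.g. a lambda wrapping your client), and the response field `is_running` is the real flag the API uses to signal an in-flight search:

```python
import time

def poll_async_search(fetch_status, interval=1.0, max_wait=300.0):
    """Poll an async search id until it finishes or max_wait elapses.

    Hypothetical sketch: `fetch_status` is any zero-argument function
    returning the JSON body of GET /_async_search/<id>.
    """
    waited = 0.0
    while waited <= max_wait:
        status = fetch_status()
        if not status.get("is_running", False):
            return status.get("response")  # final (or partial) results
        time.sleep(interval)
        waited += interval
    raise TimeoutError("async search still running after max_wait")
```

Because the HTTP round-trips are short polls, no load balancer ever has to hold a connection open for the full duration of the query.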

Long-Term Fix 2: Optimize Queries and Mappings
  • Stop Deep Pagination: If you are using from and size to page through thousands of results, stop. Elasticsearch must sort and rank all results up to from + size for every page request. Use search_after or the Point in Time (PIT) API for deep scrolling.
  • Avoid Leading Wildcards: Queries like *error* force Lucene to scan the entire inverted index. Use match_phrase, n-grams, or standard text analysis instead.
  • Use Keyword types for exact matches: Do not run terms aggregations on text fields. Ensure your mappings specify keyword for fields you intend to aggregate or filter on exactly.
  • Pre-calculate data: If you run the same heavy aggregation constantly, use Transforms to pivot the data into a summarized index in the background.
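The search_after pattern from the first bullet replaces the growing `from` offset with the `sort` values of the last hit on the previous page. A minimal request-builder sketch, assuming the base query already carries a deterministic sort with a unique tiebreaker (field names here are illustrative):

```python
def next_page_request(base_query: dict, last_hit_sort=None, size=100):
    """Build the next page of a search_after scroll. Pass the "sort"
    array of the previous page's last hit as last_hit_sort."""
    body = dict(base_query, size=size)
    if last_hit_sort is not None:
        body["search_after"] = last_hit_sort
    body.pop("from", None)  # from must never be combined with search_after
    return body
```

Each response hit includes a "sort" array; feeding the last one back keeps paging cost constant no matter how deep you go, instead of re-ranking from + size documents per page.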
Long-Term Fix 3: Cluster Sizing and Circuit Breakers

If queries are optimized but timeouts persist, your cluster simply lacks horsepower.

  • Set your JVM heap to no more than 50% of available RAM, and keep it below 32GB so the JVM can still use compressed ordinary object pointers (oops).
  • Check your thread pools (GET /_cat/thread_pool?v). If you see a high number of rejected tasks in the search or write queues, your nodes are saturated. You need to add more data nodes to the cluster to distribute the load.
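Rejections can be extracted from the tabular output of GET /_cat/thread_pool?v with a few lines of parsing. A sketch assuming the `?v` header row is present and includes `name` and `rejected` columns (as the endpoint prints by default):

```python
def rejected_pools(cat_output: str) -> dict:
    """Parse GET /_cat/thread_pool?v output and return pools with
    non-zero rejections. Assumes the ?v header row is present."""
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    name_i, rej_i = header.index("name"), header.index("rejected")
    out = {}
    for line in lines[1:]:
        cols = line.split()
        if int(cols[rej_i]) > 0:
            out[cols[name_i]] = int(cols[rej_i])
    return out
```

A consistently non-zero `search` entry here is the clearest signal that the cluster needs more data nodes rather than longer timeouts.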

Command Reference

# --- Diagnostic Commands ---

# 1. Check overall cluster health and unassigned shards
curl -X GET "localhost:9200/_cluster/health?pretty"

# 2. Check for operations blocking the cluster
curl -X GET "localhost:9200/_cluster/pending_tasks?pretty"

# 3. Identify what the CPU is currently doing (run during a timeout event)
curl -X GET "localhost:9200/_nodes/hot_threads?pretty"

# 4. Check thread pool rejections (look for the 'search' thread pool)
curl -X GET "localhost:9200/_cat/thread_pool?v&s=name"

# --- Remediation & Workarounds ---

# Execute a search with a strict server-side timeout to return partial data
curl -X GET "localhost:9200/my-index/_search?timeout=5s" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}'

# Example of using the Async Search API for long-running queries
# This returns an ID immediately instead of timing out
curl -X POST "localhost:9200/my-index/_async_search?size=0" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "daily_sales": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1d"
      }
    }
  }
}'

# To retrieve the results later using the returned ID:
# curl -X GET "localhost:9200/_async_search/<YOUR_ASYNC_SEARCH_ID>"

Error Medic Editorial

Error Medic Editorial comprises senior DevOps engineers, SREs, and database administrators dedicated to solving complex infrastructure bottlenecks and distributed system failures.
