Error Medic

Resolving Elasticsearch API Timeout Errors: 408 Request Timeout & Read timed out

Fix Elasticsearch API timeouts (408 Request Timeout, Read timed out) by optimizing heavy queries, tuning thread pools, and adjusting client-side limits.

Key Takeaways
  • Client-side connection limits or read timeouts are too low for the query complexity.
  • High JVM Garbage Collection (GC) pauses are freezing node responsiveness.
  • Search or write thread pools are exhausted, leading to queued and ultimately rejected requests (EsRejectedExecutionException).
  • Unoptimized queries (e.g., leading wildcards, deep pagination, massive aggregations) are consuming excessive CPU and memory.
  • Quick Fix: Temporarily increase client timeout parameters, but structurally resolve by optimizing slow queries and scaling node resources.
Fix Approaches Compared
| Method | When to Use | Time to Implement | Risk Level |
|---|---|---|---|
| Increase Client Timeout | Client drops the connection before ES finishes processing a valid, complex request | 5 mins | Low |
| Optimize Search Queries | Heavy queries causing high CPU/memory load across the cluster | 1-2 hours | Medium |
| Tune Node Thread Pools | High queued/rejected executions in node stats during traffic spikes | 15 mins | Medium |
| Increase JVM Heap / Scale Out | Frequent long GC pauses or continuous OutOfMemory (OOM) conditions | 1 hour | High |

Understanding the Error

Elasticsearch API timeout errors are a common and critical symptom of underlying performance bottlenecks or misconfigurations within your search infrastructure. When interacting with the Elasticsearch REST API, clients expect a response within a specific time window. If the Elasticsearch cluster fails to return the data before this window expires, the client severs the connection and throws a timeout exception.

These errors typically manifest in logs as:

  • elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=10))
  • 408 Request Timeout
  • java.net.SocketTimeoutException: Read timed out
  • EsRejectedExecutionException: rejected execution of org.elasticsearch.transport.TransportService

The root cause rarely lies in the network itself. Instead, it is usually a symptom of the Elasticsearch cluster being overwhelmed, JVM memory pressure, exhausted thread pools, or unoptimized data access patterns.

Step 1: Diagnose the Bottleneck

Before implementing a fix, you must identify whether the timeout is a client-side misconfiguration or a server-side resource exhaustion issue.

1. Check the Elasticsearch Slow Logs

Slow logs are your first line of defense. If queries are taking longer than the client's timeout threshold, they will appear here. Slow logs are disabled by default, so you must enable them dynamically on the indices experiencing issues. Look for queries that take longer than 5-10 seconds, and identify patterns: are they using leading wildcards (*term), massive aggregations, or deep pagination (from: 10000)?

2. Monitor JVM Garbage Collection (GC)

Elasticsearch runs on the JVM. If the heap is undersized for the data volume, the JVM will frequently trigger 'stop-the-world' garbage collection pauses. During a major GC pause, the node stops processing requests entirely. If a GC pause lasts 15 seconds, any client with a 10-second read timeout will fail. Check your Elasticsearch logs for lines containing [gc][young] or [gc][old] accompanied by long durations.
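
A quick way to correlate pauses with client timeouts is to scan the log for GC durations above your read timeout. This is a rough sketch assuming the typical `duration [15.8s]` fragment printed by the JVM GC monitor; the exact log layout varies between Elasticsearch versions, so treat the pattern as a starting point:

```python
import re

# Matches the "duration [15.8s]" fragment in GC monitor log lines such as:
#   "... [gc][old][2362][7] duration [15.8s], collections [1]/[16.1s] ..."
# NOTE: assumed format -- verify against your own Elasticsearch version.
GC_LINE = re.compile(r"\[gc\]\[(young|old)\].*?duration \[(\d+(?:\.\d+)?)(m?s)\]")

def long_gc_pauses(log_lines, threshold_seconds=10.0):
    """Return (generation, seconds) tuples for GC pauses at or over the threshold."""
    pauses = []
    for line in log_lines:
        m = GC_LINE.search(line)
        if not m:
            continue
        generation, value, unit = m.group(1), float(m.group(2)), m.group(3)
        seconds = value / 1000 if unit == "ms" else value
        if seconds >= threshold_seconds:
            pauses.append((generation, seconds))
    return pauses
```

Any pause this reports that is at or above your client's read timeout is a direct explanation for the corresponding timeout errors.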

3. Inspect Thread Pools and Rejections

Elasticsearch uses distinct thread pools for different operations (search, write, get). When a node receives a request, it hands it to the appropriate thread pool. If all threads are busy, the request goes into a queue; if the queue fills up, Elasticsearch rejects the request. You can check this by querying the _cat/thread_pool API. High numbers in the rejected column indicate that your cluster is under-provisioned for the current workload concurrency.
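
If you want to flag rejections programmatically, the plain-text _cat output can be scanned with a few lines of code. This sketch assumes the whitespace-separated column order you request via `h=id,name,active,queue,rejected`; adjust the indices if you request different columns:

```python
def rejected_pools(cat_output):
    """Parse the plain-text output of
    GET _cat/thread_pool/search,write?v&h=id,name,active,queue,rejected
    and return pools with a non-zero rejected count.

    Assumes whitespace-separated columns in the order requested via h=.
    """
    flagged = []
    for row in cat_output.strip().splitlines():
        cols = row.split()
        if len(cols) < 5 or cols[0] == "id":  # skip the ?v header row
            continue
        node_id, name, _active, queue, rejected = cols[:5]
        if int(rejected) > 0:
            flagged.append({"node": node_id, "pool": name,
                            "queue": int(queue), "rejected": int(rejected)})
    return flagged
```

Anything this returns during a traffic spike points at the thread-pool tuning or scale-out fixes below.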

Step 2: Implement the Fix

Depending on your diagnostic findings, apply one or more of the following solutions.

Fix A: Adjusting Client-Side Timeouts

If you are running intentionally long operations (e.g., bulk reindexing, snapshotting, or heavy analytical aggregations), the default client timeout (often 10 seconds) is simply too aggressive. You need to explicitly configure the client to wait longer.

Python Client (elasticsearch-py):

from elasticsearch import Elasticsearch

# Increase the read timeout to 60 seconds and retry on timeout.
# Note: elasticsearch-py 8.x renamed this parameter to `request_timeout`;
# `timeout=60` applies to 7.x clients.
es = Elasticsearch(
    ["http://localhost:9200"],
    timeout=60,
    max_retries=3,
    retry_on_timeout=True
)

Node.js Client (@elastic/elasticsearch):

const { Client } = require('@elastic/elasticsearch')
const client = new Client({
  node: 'http://localhost:9200',
  requestTimeout: 60000 // 60 seconds in milliseconds
})
Fix B: Query Optimization

Throwing hardware at bad queries is an expensive and temporary fix. Optimize your search requests to reduce cluster load.

  • Avoid Leading Wildcards: Queries like *searchterm force Elasticsearch to scan the entire inverted index. This is extremely slow. Use edge n-grams during indexing instead.
  • Eliminate Deep Pagination: Fetching page 1,000 with from and size forces Elasticsearch to retrieve and sort 10,000 documents just to discard 9,990 of them. Switch to the search_after API (ideally combined with a point-in-time) for deep pagination.
  • Use Filter Context: If you are filtering data (e.g., status: active) and don't care about scoring, put the clause in a filter block instead of must. Filter clauses are cached and bypass the scoring algorithm entirely.
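
To illustrate the last two points together, here is a minimal sketch of a request body that combines filter context with search_after pagination. The `status` filter and `created_at` sort field are hypothetical placeholders for your own schema:

```python
def build_page_query(size, sort_fields, search_after=None, status="active"):
    """Build a search body combining filter context (cached, unscored)
    with search_after pagination instead of from/size deep paging.

    `status` and the sort fields are illustrative placeholders.
    """
    body = {
        "size": size,
        "query": {
            "bool": {
                # filter context: cached, bypasses scoring
                "filter": [{"term": {"status": status}}]
            }
        },
        # sort must include a unique tiebreaker (e.g. _id) for search_after
        "sort": sort_fields,
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body

# First page, then feed the last hit's sort values back in for the next page:
first = build_page_query(100, [{"created_at": "asc"}, {"_id": "asc"}])
next_page = build_page_query(
    100, [{"created_at": "asc"}, {"_id": "asc"}],
    search_after=["2024-01-01T00:00:00Z", "doc-42"],
)
```

Each page costs the same regardless of depth, unlike from/size, where cost grows linearly with the offset.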
Fix C: Addressing JVM and Hardware Limits

If your queries are optimized but timeouts persist due to GC pauses or high CPU, you must scale your infrastructure.

  • Increase Heap Size: Ensure your JVM heap (-Xms and -Xmx in jvm.options) is set to 50% of the available physical RAM, but never more than 32GB (to maintain zero-based compressed oops).
  • Scale Out: Add more data nodes to the cluster. Elasticsearch scales horizontally very well. Distributing primary shards across more nodes parallelizes the workload and reduces the per-node CPU and memory pressure.
  • Storage Speed: Ensure your Elasticsearch data directories are backed by high-IOPS NVMe SSDs. Spinning disks (HDDs) or low-tier network-attached storage will cause massive IO wait times, indirectly leading to API timeouts.
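
For example, on a host with 64 GB of RAM, the relevant heap lines in jvm.options would look like this (the sizes are illustrative; always set -Xms and -Xmx to the same value):

```
# jvm.options -- illustrative values for a 64 GB host:
# ~50% of physical RAM, kept under 32 GB to preserve compressed oops
-Xms31g
-Xmx31g
```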

Step 3: Preventative Monitoring

To prevent recurrences, establish robust alerting. Monitor the _nodes/stats endpoint. Alert if JVM heap usage consistently exceeds 85%. Alert on any increase in thread pool rejections. Ensure your APM or logging infrastructure tracks 99th percentile query latency, allowing you to catch degrading performance long before it turns into a hard timeout.
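
As a starting point, the heap alert can be as simple as evaluating the parsed `_nodes/stats/jvm` response against the 85% threshold. The `heap_used_percent` field path below follows the documented stats layout, but verify it against your Elasticsearch version:

```python
def heap_alerts(nodes_stats, threshold_percent=85):
    """Given a parsed GET _nodes/stats/jvm response, return
    (node name, heap_used_percent) for nodes at or over the threshold."""
    alerts = []
    for node in nodes_stats.get("nodes", {}).values():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold_percent:
            alerts.append((node["name"], used))
    return alerts
```

Run this on a schedule against the stats endpoint and page when it returns anything.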

Diagnostic Command Reference
# 1. Check cluster health and active long-running tasks
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:9200/_tasks?detailed=true&actions=*search*&pretty"

# 2. Check for thread pool rejections (Look at the 'rejected' column)
curl -X GET "localhost:9200/_cat/thread_pool/search,write?v&h=id,name,active,queue,rejected,completed"

# 3. View node memory, JVM heap stats, and GC pause times
curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

# 4. Enable Slow Logs dynamically for a specific index to catch timeout culprits
curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.fetch.warn": "2s",
  "index.search.slowlog.threshold.query.info": "1s",
  "index.search.slowlog.level": "info"
}'

Error Medic Editorial

Error Medic Editorial consists of senior DevOps engineers and Site Reliability Experts dedicated to untangling complex infrastructure bottlenecks. With decades of combined experience managing petabyte-scale Elasticsearch clusters, we provide actionable, production-ready solutions.
