Elasticsearch API Timeout: Diagnosing and Fixing ReadTimeoutError, SocketTimeoutException, and Request Timeout Errors
Fix Elasticsearch API timeout errors—ReadTimeoutError, SocketTimeoutException, 408/504 responses—with step-by-step diagnosis and config tuning.
- The most common root causes are undersized thread pools, JVM GC pauses causing stop-the-world events, and queries hitting unoptimized large indices without shard routing
- Network-level timeouts (TCP keepalive, load balancer idle timeout) frequently mask themselves as Elasticsearch client timeouts—always check both layers
- Quick fix: increase client request_timeout to 60s as a stopgap, but the permanent fix requires identifying whether the bottleneck is at the query, index, JVM, or network layer
| Method | When to Use | Time to Apply | Risk |
|---|---|---|---|
| Increase client request_timeout | Immediate relief while diagnosing root cause | < 5 min | Low — client-side only |
| Tune search.default_search_timeout | Queries consistently slow on large indices | 5–10 min, rolling restart not required | Low — cluster setting |
| Add index routing / filter context | Queries scan too many shards | 30–60 min (index rebuild may be needed) | Medium — schema change |
| Scale JVM heap and GC tuning | Full GC pauses > 10s visible in logs | 15–30 min, requires node restart | Medium — node restart |
| Add replica shards / hot-warm tiering | Sustained high read throughput | 1–4 hours | Low-Medium — cluster rebalance |
| Fix load balancer idle timeout | Connections silently dropped mid-request | 5–15 min (infra change) | Low |
Understanding Elasticsearch API Timeouts
An Elasticsearch API timeout surfaces differently depending on where in the request chain the clock runs out. Developers typically encounter one of these exact error strings:
ReadTimeoutError: HTTPSConnectionPool(host='es-host', port=9200): Read timed out. (read timeout=10)
SocketTimeoutException: 30,000 milliseconds timeout on connection http://es-host:9200
ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError
[408 Request Timeout] or [504 Gateway Timeout] from a proxy
TransportException[failed to get node response]
The timeout can originate at three distinct layers: the client library (Python elasticsearch-py, Java High-Level REST Client, etc.), the network/proxy layer (ALB, NGINX, HAProxy), or inside Elasticsearch itself (search timeout, bulk timeout). Conflating these layers is the single most common debugging mistake.
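A quick way to keep the layers straight: a client-layer timeout raises an exception before any response arrives, while Elasticsearch's own search timeout does not fail the request at all; it returns HTTP 200 with partial results and "timed_out": true in the body. The sketch below encodes that distinction as plain logic (the function name and return strings are illustrative, not from any client library):

```python
# Sketch: classify which layer a timeout came from, given either a raised
# exception or a parsed _search response body. Illustrative names only.

def classify_timeout(exception=None, response=None):
    """Return which layer most likely timed out."""
    if exception is not None:
        name = type(exception).__name__
        if "ReadTimeout" in name or "ConnectionTimeout" in name:
            return "client layer: request_timeout expired before any response"
        return "unknown exception: " + name
    if response is not None:
        # An ES-side search timeout looks like a *successful* response:
        # HTTP 200, partial hits, and "timed_out": true.
        if response.get("timed_out"):
            return "elasticsearch layer: search timeout, partial results returned"
        return "no timeout: query completed"
    return "no data"

partial = {"timed_out": True, "took": 30000, "hits": {"total": {"value": 12}}}
print(classify_timeout(response=partial))
```

If you see "timed_out": true in otherwise healthy responses, the problem is query cost inside Elasticsearch, not the client or network.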
Step 1: Identify Which Layer Is Timing Out
Check cluster health and slow-log first:
# Cluster health — are you yellow/red?
curl -s 'http://localhost:9200/_cluster/health?pretty'
# Node stats — check thread pool rejection counters
curl -s 'http://localhost:9200/_nodes/stats/thread_pool?pretty' | \
python3 -c "import sys,json; d=json.load(sys.stdin); \
[print(n, tp, v) for n,nd in d['nodes'].items() \
for tp,tv in nd['thread_pool'].items() \
for k,v in tv.items() if k in ('rejected','queue') and v>0]"
# Pending tasks
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'
# Hot threads — what is the JVM actually doing?
curl -s 'http://localhost:9200/_nodes/hot_threads'
Enable slow logs on the problematic index:
curl -X PUT 'http://localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '{
"index.search.slowlog.threshold.query.warn": "5s",
"index.search.slowlog.threshold.query.info": "2s",
"index.search.slowlog.threshold.fetch.warn": "1s",
"index.indexing.slowlog.threshold.index.warn": "5s"
}'
Then tail /var/log/elasticsearch/*_index_search_slowlog.log on each data node.
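When the slow log gets noisy, a small parser helps rank offenders by duration. This sketch assumes the classic plaintext slow-log format with `took_millis[...]` and `source[...]` fields; newer versions may emit JSON lines instead, in which case `json.loads` per line is simpler:

```python
import re

# Sketch: extract duration and query source from a plaintext search slow-log
# line. Patterns assume the classic "took_millis[...] ... source[...]" format.
TOOK_RE = re.compile(r"took_millis\[(\d+)\]")
SOURCE_RE = re.compile(r"source\[(.*?)\](?:,|$)")

def parse_slowlog_line(line):
    took = TOOK_RE.search(line)
    source = SOURCE_RE.search(line)
    return {
        "took_millis": int(took.group(1)) if took else None,
        "source": source.group(1) if source else None,
    }

sample = ('[2024-01-15T10:00:00,123][WARN ][i.s.s.query] [node-1] [my-index][0] '
          'took[12.3s], took_millis[12345], total_hits[10000+ hits], '
          'source[{"query":{"match_all":{}}}]')
parsed = parse_slowlog_line(sample)
print(parsed["took_millis"], parsed["source"])
```

Sort parsed entries by `took_millis` descending; anything near your client timeout is a direct timeout suspect.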
Check if a load balancer is killing the connection:
AWS ALB default idle timeout is 60 seconds. If your queries consistently time out at exactly 60s from the client side but Elasticsearch is still running the query, the ALB is dropping the TCP connection. Verify:
# Time an actual request end-to-end
time curl -s -w "\n\nHTTP %{http_code} | Total: %{time_total}s\n" \
'http://localhost:9200/my-index/_search?pretty' \
-H 'Content-Type: application/json' \
-d '{"query":{"match_all":{}},"size":1}'
# Compare against direct node access (bypassing LB)
time curl -s 'http://ES-NODE-IP:9200/my-index/_search' \
-H 'Content-Type: application/json' \
-d '{"query":{"match_all":{}},"size":1}'
Step 2: Diagnose JVM GC Pauses
GC stop-the-world events pause the entire JVM, making every in-flight request appear to time out simultaneously. This is a classic pattern: you see a burst of timeouts across multiple unrelated queries at the exact same second.
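That burst pattern is easy to check for mechanically: bucket your application's timeout timestamps by second and flag seconds where several unrelated requests failed together. A minimal sketch (timestamps and the burst threshold are illustrative):

```python
from collections import Counter
from datetime import datetime

# Sketch: detect the GC-pause signature -- multiple unrelated requests
# timing out within the same second. min_burst=3 is an arbitrary threshold.

def gc_burst_suspects(timeout_timestamps, min_burst=3):
    """Return the seconds in which >= min_burst timeouts fired together."""
    by_second = Counter(ts.replace(microsecond=0) for ts in timeout_timestamps)
    return [second for second, count in by_second.items() if count >= min_burst]

events = [
    datetime(2024, 1, 15, 10, 0, 5, 120000),   # three different queries
    datetime(2024, 1, 15, 10, 0, 5, 480000),   # all timing out in the
    datetime(2024, 1, 15, 10, 0, 5, 900000),   # same second -> GC suspect
    datetime(2024, 1, 15, 10, 7, 41, 0),       # isolated -> likely a slow query
]
print(gc_burst_suspects(events))  # one burst second flagged
```

Cross-reference any flagged second against the gc.log timestamps: a matching full-GC pause confirms the diagnosis.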
# Look for GC log evidence
grep -E 'Pause Full|Pause Young' /var/log/elasticsearch/gc.log | tail -50
# Node JVM stats
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | \
python3 -c "
import sys, json
d = json.load(sys.stdin)
for nid, n in d['nodes'].items():
jvm = n['jvm']
print(n['name'])
print(' heap_used_percent:', jvm['mem']['heap_used_percent'])
print(' old_gc_count:', jvm['gc']['collectors']['old']['collection_count'])
print(' old_gc_time_ms:', jvm['gc']['collectors']['old']['collection_time_in_millis'])
"
If heap_used_percent is consistently above 75%, or the old-generation collection_time_in_millis is growing rapidly between samples, you have a heap pressure problem.
Fix for JVM heap pressure:
- Set heap to no more than 50% of RAM, max 31GB (stays below compressed OOP threshold):
# In /etc/elasticsearch/jvm.options
-Xms16g
-Xmx16g
- Switch to G1GC if on JDK 9+ (the default collector since ES 7.x):
-XX:+UseG1GC
-XX:G1HeapRegionSize=4m
-XX:InitiatingHeapOccupancyPercent=30
- Identify field data cache bloat:
curl -s 'http://localhost:9200/_nodes/stats/indices/fielddata?pretty' | grep -E 'memory_size|evictions'
Step 3: Diagnose Query-Level Timeouts
A query scanning too many shards or too many documents will exhaust the search thread pool, causing queuing and eventual timeout.
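You can estimate the pressure with back-of-envelope arithmetic: each query fans out roughly one search task per shard it touches, against a pool whose default size is `cores * 3 / 2 + 1` with a queue of 1000. Those defaults match recent Elasticsearch versions but should be verified against `_nodes/stats/thread_pool` for yours; the function below is a sketch under that assumption:

```python
# Back-of-envelope sketch: will this load saturate the search thread pool?
# pool size = cores * 3 / 2 + 1 and queue = 1000 are assumed defaults --
# verify against _nodes/stats/thread_pool for your version.

def search_pool_pressure(cores, shards_per_query, concurrent_queries):
    pool_size = cores * 3 // 2 + 1
    queue_capacity = 1000
    # Each query produces one search task per shard it must touch
    tasks_in_flight = shards_per_query * concurrent_queries
    queued = max(0, tasks_in_flight - pool_size)
    return {
        "pool_size": pool_size,
        "tasks_in_flight": tasks_in_flight,
        "queued": queued,
        "rejecting": queued > queue_capacity,
    }

# 50 shards x 30 concurrent searches = 1500 tasks on a 16-core node:
print(search_pool_pressure(cores=16, shards_per_query=50, concurrent_queries=30))
```

Here a 16-core node has a pool of 25 threads, so 1500 in-flight tasks overflow the 1000-slot queue and requests start rejecting, which the client sees as timeouts or 429s.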
# How many shards does this index have?
curl -s 'http://localhost:9200/_cat/shards/my-index?v&h=index,shard,prirep,state,docs,store,node'
# How expensive is the query? Use profile API
curl -X POST 'http://localhost:9200/my-index/_search?pretty' \
-H 'Content-Type: application/json' -d '{
"profile": true,
"query": { "YOUR_QUERY_HERE": {} }
}'
# Explain a specific document match
curl -X GET 'http://localhost:9200/my-index/_explain/DOC_ID?pretty' \
-H 'Content-Type: application/json' -d '{
"query": { "YOUR_QUERY_HERE": {} }
}'
Common query fixes:
- Replace wildcard: {"field": "*term*"} with an ngram analyzer at index time
- Move date range filters into filter context (cached) instead of query context (scored)
- Add a routing parameter to limit shard fan-out: POST /my-index/_search?routing=tenant_id
- Set a per-request search timeout so slow queries return partial results instead of hanging: { "timeout": "30s", "query": { ... } }
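The filter-context fix above can be sketched as two request bodies side by side; the field names ("status", "@timestamp") are illustrative, not from any particular mapping:

```python
import json

# Sketch: the same date constraint in scoring (query) context vs. cached,
# non-scoring filter context. Field names are illustrative.

scored = {  # range clause participates in scoring -> not cached
    "query": {"bool": {"must": [
        {"match": {"status": "error"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}},
    ]}}
}

filtered = {  # range clause in filter context -> cacheable, no scoring cost
    "timeout": "30s",  # per-request safety net
    "query": {"bool": {
        "must": [{"match": {"status": "error"}}],
        "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
    }}
}

print(json.dumps(filtered, indent=2))
```

The two bodies return the same documents; the filtered form just skips scoring the range clause and lets Elasticsearch cache it across requests.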
Set a cluster-wide search timeout as a safety net (note that when it trips, the request still returns HTTP 200 with partial results and "timed_out": true rather than an error):
curl -X PUT 'http://localhost:9200/_cluster/settings' \
-H 'Content-Type: application/json' -d '{
"persistent": {
"search.default_search_timeout": "30s"
}
}'
Step 4: Fix Client-Side Timeout Configuration
Python (elasticsearch-py 8.x):
from elasticsearch import Elasticsearch
es = Elasticsearch(
["https://es-host:9200"],
request_timeout=60, # seconds before ReadTimeoutError
retry_on_timeout=True,
max_retries=3,
http_compress=True,
)
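The client's retry_on_timeout retries on another node without any delay between attempts. If your timeouts come from GC pauses, adding exponential backoff avoids hammering a cluster mid-pause; here is a pure-stdlib sketch not tied to elasticsearch-py, where the wrapped callable stands in for a search call:

```python
import random
import time

# Sketch: exponential backoff with full jitter around any callable that may
# raise a timeout. Wrap your es.search(...) call in a lambda to use it.

def with_backoff(fn, retries=3, base_delay=0.5, sleep=time.sleep):
    for attempt in range(retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries:
                raise
            # Full jitter: uniform in [0, base * 2^attempt], capped at 30s
            delay = min(30.0, base_delay * (2 ** attempt)) * random.random()
            sleep(delay)

# Usage with a flaky callable standing in for a search request:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("read timed out")
    return {"timed_out": False, "hits": []}

print(with_backoff(flaky, sleep=lambda d: None))  # succeeds on third attempt
```

In production you would catch your client library's specific timeout exception class instead of the bare TimeoutError used here.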
Node.js (@elastic/elasticsearch):
const { Client } = require('@elastic/elasticsearch')
const client = new Client({
node: 'https://es-host:9200',
requestTimeout: 60000, // milliseconds
sniffOnStart: false, // avoid sniff timeouts in prod
})
Java (RestHighLevelClient / 8.x JavaClient):
RestClientBuilder builder = RestClient.builder(
new HttpHost("es-host", 9200, "https"))
.setRequestConfigCallback(config -> config
.setConnectTimeout(5000)
.setSocketTimeout(60000)) // ms
.setHttpClientConfigCallback(httpClient -> httpClient
.setKeepAliveStrategy((response, context) -> 60_000));
Step 5: Fix Network / Proxy Timeouts
AWS ALB: Set idle timeout to 120s+ in the console or:
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn arn:aws:elasticloadbalancing:... \
--attributes Key=idle_timeout.timeout_seconds,Value=120
NGINX upstream:
upstream elasticsearch {
server es-host:9200;
keepalive 32;
}
server {
location / {
proxy_pass http://elasticsearch;
proxy_read_timeout 120s;
proxy_send_timeout 120s;
proxy_connect_timeout 10s;
}
}
HAProxy:
timeout connect 10s
timeout client 120s
timeout server 120s
Step 6: Validate the Fix
# Watch thread pool rejection counters in real time
watch -n 5 "curl -s 'http://localhost:9200/_cat/thread_pool?v&h=name,active,queue,rejected&s=rejected:desc' | head -15"
# Confirm GC pressure reduced
curl -s 'http://localhost:9200/_nodes/stats/jvm' | \
python3 -c "import sys,json; d=json.load(sys.stdin); \
[print(n['name'], n['jvm']['mem']['heap_used_percent'], '%') \
for n in d['nodes'].values()]"
# Run a representative query with timing
for i in $(seq 1 10); do
curl -s -w "%{time_total}\n" -o /dev/null \
-X POST 'http://localhost:9200/my-index/_search' \
-H 'Content-Type: application/json' \
-d '{"query":{"match_all":{}},"size":10}'
done
If median response times drop below your SLA threshold and thread pool rejections return to zero, the fix is confirmed.
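That SLA check can be automated from the loop's output. The sketch below takes the ten time_total values from curl and reports median, worst case, and a pass/fail against an illustrative 1-second threshold:

```python
import statistics

# Sketch: turn the timing loop's output into a pass/fail check.
# sla_seconds=1.0 is an illustrative threshold; use your own SLA.

def sla_report(timings_s, sla_seconds=1.0):
    med = statistics.median(timings_s)
    worst = max(timings_s)
    return {
        "median_s": round(med, 3),
        "max_s": round(worst, 3),
        "within_sla": med < sla_seconds,
    }

# Illustrative timings: one GC-pause outlier, otherwise fast
sample_runs = [0.21, 0.19, 0.24, 0.22, 0.95, 0.20, 0.23, 0.18, 0.25, 0.22]
print(sla_report(sample_runs))
```

Judging on the median keeps one outlier from failing the check, but a high max_s alongside a healthy median is itself a signal: it is the intermittent-pause pattern from Step 2, not a uniformly slow query.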
One-Shot Diagnostic Script
The script below bundles the checks from Steps 1–3 into a single pass:
#!/usr/bin/env bash
# elasticsearch-timeout-diagnose.sh
# Run on any host with curl access to your ES cluster
ES="http://localhost:9200"
echo "=== 1. Cluster Health ==="
curl -s "$ES/_cluster/health?pretty"
echo ""
echo "=== 2. Thread Pool Rejections (non-zero only) ==="
curl -s "$ES/_cat/thread_pool?v&h=name,active,queue,rejected,completed&s=rejected:desc" | \
awk 'NR==1 || $4 > 0'
echo ""
echo "=== 3. JVM Heap + GC Per Node ==="
curl -s "$ES/_nodes/stats/jvm?pretty" | python3 -c "
import sys, json
d = json.load(sys.stdin)
for nid, n in d['nodes'].items():
jvm = n['jvm']
print('Node:', n['name'])
print(' heap_used_percent:', jvm['mem']['heap_used_percent'])
old = jvm['gc']['collectors'].get('old', {})
print(' old_gc_count:', old.get('collection_count', 'N/A'))
print(' old_gc_time_ms:', old.get('collection_time_in_millis', 'N/A'))
"
echo ""
echo "=== 4. Pending Cluster Tasks ==="
curl -s "$ES/_cluster/pending_tasks?pretty"
echo ""
echo "=== 5. Hot Threads (top CPU consumers) ==="
curl -s "$ES/_nodes/hot_threads?threads=3&interval=2s"
echo ""
echo "=== 6. Shard Distribution ==="
curl -s "$ES/_cat/shards?v&h=index,shard,prirep,state,docs,store,node&s=store:desc" | head -30
echo ""
echo "=== 7. FieldData Cache Memory ==="
curl -s "$ES/_nodes/stats/indices/fielddata?pretty" | python3 -c "
import sys, json
d = json.load(sys.stdin)
for nid, n in d['nodes'].items():
fd = n['indices']['fielddata']
print(n['name'], '|', 'fielddata_memory:', fd['memory_size'], '| evictions:', fd['evictions'])
"
echo ""
echo "=== 8. Timed Query Sample (5 requests) ==="
for i in $(seq 1 5); do
RESULT=$(curl -s -w "HTTP:%{http_code} TIME:%{time_total}s" \
-X POST "$ES/_search" \
-H 'Content-Type: application/json' \
-d '{"query":{"match_all":{}},"size":10,"timeout":"10s"}')
echo " Run $i: $(echo $RESULT | grep -o 'HTTP:[^ ]*') $(echo $RESULT | grep -o 'TIME:[^ ]*')"
done
echo ""
echo "=== Done. Review thread pool rejections, heap %, and GC time ==="
# Remediation quick reference:
# High rejections -> increase search threadpool OR fix slow queries
# heap_used > 75% -> reduce fielddata, lower doc fetch size, tune JVM heap
# GC time growing -> check for cartesian products in aggregations, large sorts
# Pending tasks > 0 -> cluster rebalancing; wait or investigate shard assignment
Error Medic Editorial
Error Medic Editorial is written by senior DevOps and SRE engineers with production experience running Elasticsearch clusters at scale across AWS, GCP, and on-premise environments. Articles are peer-reviewed for technical accuracy and tested against current Elasticsearch 7.x and 8.x releases.