Troubleshooting Elasticsearch OOM: Fixing OutOfMemoryError and Killed Processes
Fix Elasticsearch OutOfMemoryError and kernel OOM killed process crashes. Learn how to tune JVM heap, optimize queries, and configure circuit breakers.
- Improper JVM heap sizing is the primary cause; heap should be set to 50% of available RAM, but never exceed 31GB to ensure compressed Ordinary Object Pointers (OOPs) are used.
- The Linux OOM Killer terminating the process (out of memory killed process) indicates the OS lacks memory, often because heap + off-heap usage exceeds system RAM or swap is enabled.
- Unbounded aggregations, deeply nested queries, or sorting on unoptimized text fields can rapidly exhaust the JVM heap, triggering a java.lang.OutOfMemoryError.
- Quick Fix: Clear the fielddata cache, adjust your jvm.options (-Xms and -Xmx), and implement stricter circuit breaker limits to prevent rogue queries from crashing the node.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Increase JVM Heap Size | Node is chronically under-provisioned, heap is constantly > 85% | Fast | Low (Requires node restart) |
| Enable bootstrap.memory_lock | Elasticsearch memory is being swapped to disk, causing OS-level OOMs | Medium | Low (Requires OS config changes) |
| Tune Circuit Breakers | Preventing rogue, heavy queries from exhausting all available heap | Medium | Medium (May reject legitimate heavy queries) |
| Clear Fielddata Cache | Immediate relief needed for a node actively throwing OutOfMemoryError | Immediate | Low (Temporary latency increase for cached queries) |
Understanding the Error
When managing an Elasticsearch cluster, one of the most critical and catastrophic failures you can encounter is an Out of Memory (OOM) event. Because Elasticsearch is a memory-intensive application built on top of the Java Virtual Machine (JVM) and relies heavily on the operating system's filesystem cache (Lucene), memory management is paramount.
There are two distinct types of OOM scenarios that engineers often conflate:
- The JVM java.lang.OutOfMemoryError: This occurs when the Elasticsearch JVM process exhausts its allocated heap space. The JVM attempts garbage collection (GC), but if it cannot free enough memory to accommodate new object allocations, it throws an OutOfMemoryError. The process might stay alive but remain completely unresponsive, or it may crash entirely, depending on your ExitOnOutOfMemoryError JVM flag.
- The Linux OS OOM Killer (out of memory killed process): This is an operating-system-level intervention. When the host Linux kernel runs out of physical memory and swap space, it invokes the OOM Killer to sacrifice a process and save the system. Because Elasticsearch is usually the largest memory consumer on the node, it becomes the primary target. You will find "Out of memory: Killed process" in your dmesg output or /var/log/messages.
Determining which type of out-of-memory event you are facing is the crucial first step in troubleshooting.
Step 1: Diagnose the Exact Failure Domain
Before making any configuration changes, you must determine if the OS killed the process or if the JVM exhausted its heap.
Checking for OS-level OOM Kills
If your Elasticsearch node suddenly disappears from the cluster and the process is no longer running, check the Linux kernel ring buffer. Run dmesg -T | grep -i oom or inspect /var/log/syslog (or /var/log/messages depending on your distro). You will typically see a message like:
[Tue Feb 24 10:14:22 2026] Out of memory: Killed process 14322 (java) total-vm:64532100kB, anon-rss:32145600kB, file-rss:0kB
If you see this, the OS killed Elasticsearch. This means you have likely overallocated the JVM heap (leaving too little RAM for the OS and Lucene filesystem cache), or another process on the host is consuming RAM.
Checking for JVM OutOfMemoryError
If the process is still running but unresponsive, or if it crashed and left a heap dump, check the Elasticsearch application logs located in /var/log/elasticsearch/<cluster-name>.log. You are looking for stack traces containing:
java.lang.OutOfMemoryError: Java heap space
or
java.lang.OutOfMemoryError: GC overhead limit exceeded
This indicates that the queries, aggregations, or indexing operations required more memory than the -Xmx limit defined in jvm.options.
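A quick grep of the application log surfaces these stack traces. The sketch below is a minimal helper; the find_oom function name is illustrative, and the log path assumes the default package install layout (adjust for your cluster name):

```shell
#!/bin/bash
# find_oom: print any JVM OutOfMemoryError lines from a given log file.
# The helper name is illustrative; adjust the log path to match your
# cluster name under /var/log/elasticsearch/.
find_oom() {
  grep -n "java.lang.OutOfMemoryError" "$1" 2>/dev/null \
    || echo "no OutOfMemoryError entries found in $1"
}

find_oom /var/log/elasticsearch/elasticsearch.log
```

If the grep matches, the line numbers let you jump straight to the stack trace and see whether it was heap space or GC overhead that ran out.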
Step 2: Immediate Remediation and Stabilization
If the cluster is currently unstable, you need immediate mitigation tactics.
1. Clear the Caches:
If the node is accessible but struggling, clear the caches. The fielddata cache is a notorious memory hog, especially if you are mistakenly sorting or aggregating on text fields instead of keyword fields.
curl -X POST "localhost:9200/_cache/clear?fielddata=true&pretty"
2. Identify Expensive Tasks: Use the Task Management API to find and cancel long-running queries that might be hoarding memory.
curl -X GET "localhost:9200/_tasks?detailed=true&actions=*data/read/search*"
curl -X POST "localhost:9200/_tasks/<task_id>/_cancel"
Step 3: Root Cause Fixes and Configuration
Fix 1: Properly Sizing the JVM Heap
The golden rule of Elasticsearch memory allocation is: Allocate 50% of your total physical RAM to the JVM heap, but never exceed 31GB.
Why 50%? Elasticsearch relies heavily on Apache Lucene for underlying search segments. Lucene leverages the OS filesystem cache to keep data structures in memory. If you give the JVM 100% of the RAM, Lucene will have nothing left, leading to severe performance degradation and OS-level OOM kills.
Why a maximum of 31GB? The JVM uses a feature called Compressed Ordinary Object Pointers (Compressed OOPs). When the heap is below approximately 32GB (the exact cutoff varies by JVM build, but is typically just under 32GB), the JVM can use 32-bit object references instead of 64-bit pointers, which drastically reduces memory overhead. A 31GB heap with compressed OOPs is often more effective than a 40GB heap without them.
Edit /etc/elasticsearch/jvm.options (or jvm.options.d/heap.options):
-Xms30g
-Xmx30g
Always set -Xms (minimum) and -Xmx (maximum) to the exact same value to prevent heap resizing during runtime, which is an expensive operation.
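After restarting the node, you can confirm the JVM actually kept compressed OOPs via the nodes info API. This is a sketch assuming the node answers on localhost:9200; the check_compressed_oops wrapper name is illustrative, while the response field itself is part of the standard nodes info output:

```shell
#!/bin/bash
# check_compressed_oops: ask each node whether its JVM is running with
# compressed ordinary object pointers. The wrapper function name is
# illustrative; the API field is part of the nodes info response.
check_compressed_oops() {
  curl -s "$1/_nodes/jvm?filter_path=nodes.*.jvm.using_compressed_ordinary_object_pointers&pretty"
}

# Typical usage against a live node:
# check_compressed_oops "http://localhost:9200"
```

If the flag comes back false after a heap increase, you have crossed the threshold and should drop the heap back below it.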
Fix 2: Disabling Swap
Swapping is the death of performance for any Java application, especially Elasticsearch. If the OS swaps parts of the JVM heap to disk, garbage collection pauses will spike from milliseconds to minutes, causing nodes to drop out of the cluster. Furthermore, aggressive swapping can trigger kernel panics or the OOM Killer.
Ensure bootstrap.memory_lock: true is set in your elasticsearch.yml. This tells Elasticsearch to use mlockall on startup, pinning its memory to RAM and preventing the OS from swapping it out. You must also configure the OS to allow this by editing /etc/security/limits.conf:
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
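You can verify that the lock took effect after a restart. A sketch assuming localhost:9200; note that if Elasticsearch runs under systemd, limits.conf is not consulted, so you may instead need LimitMEMLOCK=infinity in a systemd unit override:

```shell
#!/bin/bash
# check_mlockall: report whether each node successfully locked its memory.
# If this still shows "mlockall": false under systemd, add
# LimitMEMLOCK=infinity via "systemctl edit elasticsearch" rather than
# relying on /etc/security/limits.conf.
check_mlockall() {
  curl -s "$1/_nodes?filter_path=**.mlockall&pretty"
}

# Typical usage against a live node:
# check_mlockall "http://localhost:9200"
```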
Fix 3: Tuning Circuit Breakers
Elasticsearch has built-in circuit breakers to prevent operations from triggering an OutOfMemoryError. These breakers estimate the memory a request will need before executing it. If a threshold would be exceeded, the request is aborted with a 429 Too Many Requests response carrying a circuit_breaking_exception, which is vastly preferable to an OOM crash.
You can dynamically update cluster settings to restrict memory usage. The parent circuit breaker (indices.breaker.total.limit) defaults to 95% of the heap when real-memory tracking (indices.breaker.total.use_real_memory) is enabled, which it is by default, and 70% otherwise. If you are experiencing OOMs, you might want to lower the fielddata or request breaker limits.
PUT /_cluster/settings
{
"persistent": {
"indices.breaker.fielddata.limit": "40%",
"indices.breaker.request.limit": "40%",
"indices.breaker.total.limit": "70%"
}
}
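To confirm the new limits are active, and to see whether breakers are actually tripping, you can query the cluster settings and node stats. A sketch assuming localhost:9200; the show_breakers helper name is illustrative, while both endpoints are standard APIs:

```shell
#!/bin/bash
# show_breakers: print the effective breaker limits plus the per-node
# trip counters. A rising "tripped" count means the breaker is doing its
# job of rejecting requests before they exhaust the heap.
show_breakers() {
  # Effective breaker limits, including unmodified defaults:
  curl -s "$1/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep -i breaker
  # Per-node breaker stats (estimated sizes and trip counts):
  curl -s "$1/_nodes/stats/breaker?pretty"
}

# Typical usage against a live node:
# show_breakers "http://localhost:9200"
```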
Fix 4: Query and Mapping Optimization
Hardware and configuration tuning can only go so far. Ultimately, an Elasticsearch OOM is often an application-layer problem.
- Do not use fielddata on text fields: If you attempt an aggregation on an analyzed text field, Elasticsearch must load all terms into memory. This is the fastest way to crash a node. Aggregate on keyword fields instead.
- Limit bucket sizes: Deeply nested aggregations (e.g., aggregating by country, then state, then city, then user) generate exponential numbers of buckets. Use the size parameter in aggregations to limit the response.
- Paginate responsibly: Deep pagination with from and size is highly inefficient because Elasticsearch must load the entire result set up to from + size into memory. Use search_after or the Point in Time (PIT) API for deep scrolling.
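As an example of the search_after pattern, the request below fetches the next page by replaying the sort values of the previous page's last hit. This is a sketch: the index name logs-2026.02, the event.id tiebreaker field, and the sort values are all hypothetical.

```shell
#!/bin/bash
# Next-page request using search_after instead of from/size. The sort
# must be identical on every page, and "search_after" carries the sort
# values of the last hit returned by the previous page. The index name,
# field names, and values here are hypothetical.
NEXT_PAGE='{
  "size": 100,
  "sort": [ { "@timestamp": "asc" }, { "event.id": "asc" } ],
  "search_after": [ "2026-02-24T10:14:22Z", "evt-000123" ],
  "query": { "match_all": {} }
}'

# Send it (assumes a node listening on localhost:9200):
# curl -s -X GET "http://localhost:9200/logs-2026.02/_search?pretty" \
#   -H 'Content-Type: application/json' -d "$NEXT_PAGE"
```

Because each page only ever materializes size hits, memory use stays flat no matter how deep you paginate, unlike from + size, which grows with depth.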
Conclusion
Resolving an out of memory elasticsearch situation requires a holistic approach. First, protect the node from the OS OOM killer by properly balancing JVM heap and OS file cache allocations. Second, lock the memory to prevent swapping. Finally, utilize circuit breakers and query optimization to protect the JVM heap from runaway application requests. By strictly enforcing these SRE best practices, your Elasticsearch cluster will remain resilient and highly available.
Appendix: OOM Diagnostic Script
The following script combines the diagnostic checks from this article into a single pass you can run on an affected node.
#!/bin/bash
# Diagnostic script for Elasticsearch OOM troubleshooting
# 1. Check system logs for Linux OOM Killer events targeting java/elasticsearch
echo "--- Checking dmesg for OOM Killer events ---"
dmesg -T | grep -i oom | grep -i java
# 2. Check current Elasticsearch JVM heap utilization (requires curl and jq)
echo -e "\n--- Checking current JVM Heap Usage ---"
curl -s -X GET "http://localhost:9200/_nodes/stats/jvm?pretty" | grep -E "name|heap_used_percent|heap_used_in_bytes|heap_max_in_bytes"
# 3. Check circuit breaker stats to see if they are tripping frequently
echo -e "\n--- Checking Circuit Breaker Stats ---"
curl -s -X GET "http://localhost:9200/_nodes/stats/breaker?pretty" | grep -E "name|estimated_size_in_bytes|tripped"
# 4. Emergency: Clear fielddata cache to free heap space immediately
# Uncomment the line below to execute during an active memory crisis
# curl -X POST "http://localhost:9200/_cache/clear?fielddata=true"

Error Medic Editorial
Our SRE and DevOps editorial team consists of veteran infrastructure engineers specializing in distributed systems, database scaling, and high-availability architecture.
Sources
- https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-heap-size
- https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/setup-configuration-memory.html