A Heap of Trouble: Managing Elasticsearch's Managed Heap | Elastic Blog
Engineering

A Heap of Trouble

Engineers can resist anything except giving their processes more resources: bigger, better, faster, more of cycles, cores, RAM, disks and interconnects! When these resources are not a bottleneck, this is wasteful but harmless. For processes like Elasticsearch that run on the JVM, the luring temptation is to turn the heap up; what harm could possibly come from having more heap? Alas, the story isn't simple.

Java is a garbage-collected language. Java objects reside in a runtime area of memory called the heap. When the heap fills up, objects that are no longer referenced by the application (affectionately known as garbage) are automatically released from the heap (such objects are said to have been collected). The maximum size of the heap is specified at application startup and fixed for the life the application; this size impacts allocation speed, garbage collection frequency, and garbage collection duration (most notably the dreaded stop-the-world phase which pauses all application threads). Applications have to strike a balance between small heaps and large heaps; the heap can be too rich or too thin.

Too Small

If the heap is too small, applications will be prone to the danger of out of memory errors. While that is the most serious risk from an undersized heap, there are additional problems that can arise from a heap that is too small. A heap that is too small relative to the application's allocation rate leads to frequent small latency spikes and reduced throughput from constant garbage collection pauses. Frequent short pauses impact end-user experience as these pauses effectively shift the latency distribution and reduce the number of operations the application can handle. For Elasticsearch, constant short pauses reduce the number of indexing operations and queries per second that can be handled. A small heap also reduces the memory available for indexing buffers, caches, and memory-hungry features like aggregations and suggesters.

Too Large

If the heap is too large, the application will be prone to infrequent long latency spikes from full-heap garbage collections. Infrequent long pauses impact end-user experience as these pauses increase the tail of the latency distribution; user requests will sometimes see unacceptably-long response times. Long pauses are especially detrimental to a distributed system like Elasticsearch because a long pause is indistinguishable from a node that is unreachable because it is hung, or otherwise isolated from the cluster. During a stop-the-world pause, no Elasticsearch server code is executing: it doesn't call, it doesn't write, and it doesn't send flowers. In the case of an elected master, a long garbage collection pause can cause other nodes to stop following the master and elect a new one. In the case of a data node, a long garbage collection pause can lead to the master removing the node from the cluster and reallocating the paused node's assigned shards. This increases network traffic and disk I/O across the cluster, which hampers normal load. Long garbage collection pauses are a top issue for cluster instability.

Just Right

The crux of the matter is that undersized heaps are bad, oversized heaps are bad and so it needs to be just right.

Oops!...I Did It Again

The engineers behind Elasticsearch have long advised keeping the heap size below some threshold near 32 GB1 (some docs referred to a 30.5 GB threshold). The reasoning behind this advice arises from the notion of compressed ordinary object pointers (or compressed oops).

An ordinary object pointer (or oops) is a managed pointer to an object and it has the same size as a native pointer. This means that on a 32-bit JVM an oop is 32-bits in size and on a 64-bit JVM an oop is 64-bits in size. Comparing an application that runs on a 32-bit JVM to an application that runs on a 64-bit JVM, the former will usually2 perform faster. This is because 32-bit pointers require half of the memory space compared to 64-bit pointers; this is friendlier to limited memory bandwidth, precious CPU caches, and leads to fewer garbage collection cycles as there is more room available on the heap.

Applications that run on a 32-bit JVM are limited to a maximum heap size of slightly less than 4 GB. For modern distributed server applications serving large volumes of data, this is usually too small. But there's a neat trick that can be employed: limit the heap to slightly less than 32 GB and then the JVM can get away with 35-bit oops (since 235 = 32 GB). Using thirty-five bits is not friendly to modern CPU architectures, though, so another trick is employed: keep all objects aligned on 8-byte boundaries and then we can assume the last three bits of 35-bit oops are zeros3. Now the JVM can get away with 32-bit object pointers yet still reference 32 GB of heap. These are compressed oops.

Then, exactly like the situation with going from a 32-bit JVM to a 64-bit JVM, comparing an application with a heap size just less than the compressed oops threshold to one with a heap size just more than the compressed oops threshold, the latter will perform worse. What is more, the heap useable to the application will be significantly smaller because of the additional space taken up by the 64-bit oops. Increasing the size of the heap to overcome this loss, however, leads to a larger heap that is subject to the long-pause problem already discussed. For Elasticsearch, our advice is to always stay below the compressed oops threshold.

It's Complicated

It turns out that the true story is more complicated than this as there are two additional cutoffs.

The first is natural and easy to understand. If the heap is smaller than 4 GB, the JVM can just use 32-bit pointers.

The second cutoff is less obvious. If the heap will not fit in the first 4 GB of address space, the JVM will next try to reserve memory for the heap within the first 32 GB of address space and then use a zero base for the heap; this is known as zero-based compressed oops. When this reservation can not be granted, the JVM has to fall back to using a non-zero base for the heap. If a zero base can be used, a simple 3-bit shift is all that is needed for encoding and decoding between native 64-bit pointers and compressed oops.

native oop = (compressed oop << 3)


But when the base is non-zero, a null check is needed and that additional base must be added and subtracted when encoding and decoding compressed oops.

if (compressed oop is null)
native oop = null
else
native oop = base + (compressed oop << 3)


This causes a significant drop in performance4. The cutoff for using a zero base varies across operating systems5 but 26 GB is a conservative cutoff across a variety of operating systems.

Less is More

What frequently happens though is that our advice surrounding compressed oops is interpreted as advice to set the heap as high as it can go while staying under the compressed oops threshold. Instead though, it's better to set the heap as low as possible while satisfying your requirements for indexing and query throughput, end-user query response times, yet large enough to have adequate heap space for indexing buffers, and large consumers of heap space like aggregations, and suggesters. The smaller that you can set the heap, the less likely you'll be subject to detrimental long garbage collection pause, and the more physical memory that will be available for the filesystem cache which continues to be used more and more to great effect by Lucene and Elasticsearch.

Straight Cache Homie

Modern operating systems maintain a filesystem cache of pages accessed from disk. This cache only uses free memory and is handled transparently by the operating system. Once a page is read from the file system and placed in the cache, accessing it is as fast as reading from memory. This means that index segments, term dictionaries, and doc values can be accessed as if they are sitting in memory once they've been placed into the cache. What is more, this cache is not managed by the JVM so we get the benefits of blazingly fast memory speeds without the consequences of being on heap. This is why we continue to recommend having as much memory as possible for the filesystem cache.

Garbage First

The JVM engineers have developed a concurrent garbage collector known as G1 GC that was first supported starting in JDK 7u4 and is set to be the default collector starting in JDK 96. This collector divides the heap into regions and is designed to first collect regions that are mostly garbage (hence G1: garbage first). This collector still pauses application threads when collecting, but the idea is that by focusing on regions with the most garbage, these collections will be highly efficient so that application threads need to be paused only briefly. This enables G1 GC to operate on large heaps with predictable pause times. This is exactly what we want! Unfortunately, G1 GC has exhibited bugs that lead to corruption in Lucene indices. While the older builds appear to be the ones most impacted by these issues, scary bugs are still being found even in the latest builds. We test Elasticsearch against G1 GC continuously on our CI infrastructure but at this time we recommend against and do not support running Elasticsearch with the G1 collector.

Together We Can Prevent Forest Fires

The Elasticsearch heap can be specified at startup through the ES_HEAP_SIZE environment variable. The ideal scenario, if you can, is to size your heap below 4 GB. If you have to go above 4 GB, try to stay below the zero-based threshold for your system. You can check if you're under the zero-based threshold by starting Elasticsearch with the JVM options -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode and looking for output similar to

heap address: 0x000000011be00000, size: 27648 MB, zero based Compressed Oops


showing that zero-based compressed oops are enabled instead of

heap address: 0x0000000118400000, size: 28672 MB, Compressed Oops with base: 0x00000001183ff000


showing that zero-based compressed oops are not enabled. If you have to go above the zero-based threshold, stay below the compressed oops threshold. Starting with Elasticsearch 2.2.0, Elasticsearch logs at startup whether or not it is using compressed oops, and the same information is also available in the nodes info API.

Here are some points-of-consideration for reducing the need for large heaps:

• Reduce the use of field data and take advantage of doc values where possible (the default for every possible field starting in Elasticsearch 2.0.0)7.
• Disk-based norms are available starting in Elasticsearch 2.1.08.
• Doc values consume less memory for multi-fields starting in Elasticsearch 2.2.0.
• Do not over-shard (some advantages among many: a search request across N shards has to collect results from all N shards so fewer shards means smaller result sets to sift through and better request cache utilization, less terms dictionaries, and fewer shards leads to a smaller cluster state).
• Do not use overly-large bulk indexing batch sizes (32 MB is okay, 256 MB is probably not).
• Do not use large bulk indexing queues (to keep the total bytes across all in-flight requests reasonable; a circuit breaker will limit this starting in Elasticsearch 5.0.0).
• Do not request too many hits in a single request, use scrolling instead.
• Do not request too many aggregation buckets or use deeply-nested aggregations.
• Consider trading performance for memory and use breadth_first collection mode for deep aggregations.
• Use Marvel to monitor the JVM heap over time.

The engineers behind Lucene and Elasticsearch continue to investigate ways to reduce the need for a large heap. Stay tuned as we push more components of indices off heap, and find ways within Elasticsearch to reduce the dependency on the heap for executing requests.

1. Throughout this post, MB and GB refer to 220 = 1,048,576 and 230 = 1,073,741,824 bytes, respectively.
2. An application that makes extensive use of 64-bit numerical types might be slower on a 32-bit JVM because it can not take advantage of 64-bit registers and instructions.
3. Aligned objects do lead to a small amount of slop in the heap, but that's okay because modern CPUs prefer 8-byte aligned addresses.
4. Extra CPU instructions are not free, and the branch/predicated instructions that arise from decoding/encoding a non-zero based oop can be especially expensive.
5. On my laptop running OS X 10.11.4 using Oracle JDK 8u74, I can get up to around a 28250 MB heap before the JVM does not use zero-based oops and on my workstation running Fedora 23 using Oracle JDK 8u74, I can get up to around a 30500 MB heap before the JVM does not use zero-based oops.
6. It is interesting to note that G1 GC was initially proposed as a replacement for the CMS collector but is now being touted as a replacement for the throughput collector.
7. Field data and doc values are used for aggregations, sorting and script field access.
8. Norms are an index-time component of relevance scoring; norms can be disabled if you're not using relevance scoring.