September 3, 2014

Performance Considerations for Elasticsearch Indexing

Running Elasticsearch 2.0? Check out this updated post about performance considerations for Elasticsearch 2.0 indexing.

Elasticsearch users have delightfully diverse use cases, ranging from appending tiny log-line documents to indexing Web-scale collections of large documents, and maximizing indexing throughput is often a common and important goal. While we try hard to set good general defaults for "typical" applications, you can quickly improve your indexing performance by following a few simple best practices, described here.

To begin with, do not use a very large java heap if you can help it: set it only as large as is necessary (ideally no more than half of the machine's RAM) to hold the overall maximum working set size for your usage of Elasticsearch. This leaves the remaining (hopefully sizable) RAM for the OS to manage for IO caching. Make sure the OS is not swapping out the java process.

Upgrade to the most recent Elasticsearch release (1.3.2 at this time): numerous indexing related issues have been fixed in recent releases.

Before delving into the details, a caveat: remember that all the information here is up-to-date as of today (1.3.2), but as Elasticsearch is a fast moving target, this information may no longer be accurate when you, future Googler, come across it. If you are unsure, just come ask on the user's list.

Marvel is especially useful when tuning your cluster for indexing throughput: as you iterate on each setting described here you can easily visualize the impact of each change on your cluster's behavior.

Client side

Always use the bulk api, which indexes multiple documents in one request, and experiment with the right number of documents to send with each bulk request. The optimal size depends on many factors, but try to err in the direction of too few rather than too many documents. Use concurrent bulk requests with client-side threads or separate asynchronous requests.

Before you conclude indexing is too slow, be sure you are really making full use of your cluster's hardware: use tools like iostat, top and ps to confirm you are saturating either CPU or IO across all nodes. If you are not then you need more concurrent requests, but if you hit EsRejectedExecutionException from the java client, or TOO_MANY_REQUESTS (429) HTTP response from REST requests, then you are sending too many concurrent requests. If you are using Marvel, you can see the rejection counts under the THREAD POOLS - BULK section of the Node Statistics Dashboard. It is usually not a good idea to increase the bulk thread pool size (defaults to the number of cores) as that will likely decrease overall indexing throughput; it is better to decrease client-side concurrency or add more nodes instead.

Since the settings we discuss here are focused on maximizing indexing throughput for a single shard, it is best to first test just a single node, with a single shard and no replicas, to measure what a single Lucene index is capable of on your documents, and iterate on tuning that, before scaling it out to the entire cluster. This can also give you a baseline to roughly estimate how many nodes you will need in the full cluster to meet your indexing throughput requirements.

Once you have a single shard working well, you can take full advantage of Elasticsearch's scalability and multiple nodes in your cluster by increasing the shard count and replica count.

Before drawing any conclusions, be sure to measure performance of the full cluster over a fairly long time (say 60 minutes), so your test covers the full lifecycle including events like large merges, GC cycles, shard movements, exceeding the OS's IO cache, possibly unexpected swapping, etc.

Storage devices

Unsurprisingly, the storage devices that hold the index have a huge impact on indexing performance:

Use modern solid-state disks (SSDs): they are far faster than even the fastest spinning disks. Not only do they have lower latency for random access and higher sequential IO, they are also better at the highly concurrent IO that is required for simultaneous indexing, merging and searching.
Do not place the index on a remotely mounted filesystem (e.g. NFS or SMB/CIFS); use storage local to the machine instead.
Beware virtualized storage, such as Amazon's Elastic Block Storage. Virtualized storage works very well with Elasticsearch, and it is appealing since it is so fast and simple to set up, but it is also unfortunately inherently slower on an ongoing basis when compared to dedicated local storage. In a recent informal test, even the highest performance provisioned IOPS SSD-backed EBS option was still substantially slower than the local instance-attached SSD, and remember that the local instance-attached SSD is still shared across all virtual machines on that physical machine so you will see otherwise inexplicable slowdowns if the other virtual machines on that physical machine suddenly become IO intensive.
Stripe your index across multiple SSDs by setting multiple path.data directories, or just configure a RAID 0 array. The two are similar, except instead of striping at the file block level, Elasticsearch "stripes" at the individual index files level. Just remember that either approach increases the risk of failure for a single shard (to trade off for faster IO performance) since the failure of any one SSD destroys the index. This is typically the right tradeoff to make: optimize single shards for maximum performance, and then add replicas across different nodes so there's redundancy for any node failures. You can also use snapshot and restore to backup the index for further insurance.

Segments and merging

Under the hood, newly indexed documents are first held in RAM by Lucene's IndexWriter. Periodically, when the RAM buffer is full, or when Elasticsearch triggers a flush or refresh, these documents are written to new on-disk segments. Eventually there are too many segments, and they are merged according to the merge policy and scheduler. This process cascades: the merged segments produce a larger segment, and after enough small merges, those larger segments are also merged. Here is a nice visualization of how this works.

Merges, especially large ones, can take a very long time to run. This is normally fine, because such merges are also rare, so the amortized cost remains low. But if merging cannot keep up with indexing then Elasticsearch will throttle incoming indexing requests to a single thread (as of 1.2) to prevent serious problems when there are far too many segments in the index.

If you see INFO level log messages saying now throttling indexing or you see segment counts growing and growing in Marvel then you know merges are falling behind. Marvel plots the segment count under the MANAGEMENT EXTENDED section of the Index Statistics dashboard, and it should grow at a very slow logarithmic rate, perhaps showing a saw-tooth pattern as large merges complete:

Why would merges fall behind? By default, Elasticsearch limits the allowed aggregate bytes written across all merges to a paltry 20 MB/sec. For spinning disks, this ensures that merging will not saturate the typical drive's IO capacity, allowing concurrent searching to still perform well. But if you are not searching during your indexing, search performance is less important to you than indexing throughput or your index is on SSDs, you should disable merge throttling entirely by setting index.store.throttle.type to none; see store for details. Note that prior to 1.2, there was a nasty bug that caused merge IO throttling to be far more restrictive than you asked for. Upgrade!

If you are unfortunately still using spinning disks, which do not handle concurrent IO nearly as well as SSDs, then you should set index.merge.scheduler.max_thread_count to 1. Otherwise, the default value (which favors SSDs) will allow too many merges to run at once.

Do not call optimize on an index that is still being actively updated, since it is a very costly operation (it merges all segments). However, if you are done adding documents to a given index, it is a good idea to optimize it at that point, since that will reduce resources required during searching. For example, if you are using time-based indices where each day's worth of logs is added to a new index, once that day has passed, it is a good idea to optimize the index, especially if nodes will hold many days worth of indices.

Here are some further settings to tune:

Tune your mappings to turn off any fields you do not actually need, such as disabling the _all field. For fields you would like to keep, you can also tune whether and how they are indexed or stored.
You may be tempted to disable the _source field, but its indexing cost is likely small (just stored, not indexed), and it has substantial value for future updates or for fully re-indexing a previous index, so it is typically not worth disabling unless disk usage is a concern, which it should not be since disk is relatively cheap.
If you can accept some delay in searching recently indexed documents, increase index.refresh_interval to 30s, or disable it entirely by setting it to -1. This allows larger segments to flush and decreases future merge pressure.
As long as you have upgraded to at least Elasticsearch 1.3.2, which fixes issues that could cause excessive RAM usage when flushes are infrequent, increase index.translog.flush_threshold_size from the default (currently 200 mb) to 1 gb, to decrease how frequently fsync is called on the index files.
Marvel plots the flush rate under the MANAGEMENT section of the Index Statistics dashboard.
Use 0 replicas while building up your initial large index, and then enable replicas later on and let them catch up. Just beware that a node failure when you have 0 replicas means you have lost data (your cluster is red) since there is no redundancy. If you plan to call optimize (because no more documents will be added), it is a good idea to do that after finishing indexing and before increasing the replica count so replication can just copy the optimized segment(s). See update index settings for details.

Index buffer size

If your node is doing only heavy indexing, be sure indices.memory.index_buffer_size is large enough to give at most ~512 MB indexing buffer per active shard (beyond that indexing performance does not typically improve). Elasticsearch takes that setting (a percentage of the java heap or an absolute byte-size), and divides it equally among the currently active shards on the node subject to min_index_buffer_size and max_index_buffer_size values; larger values means Lucene writes larger initial segments which reduces future merge pressure.

The default is 10% which is often plenty: for example, if you have 5 active shards on a node, and your heap is 25 GB, then each shard gets 1/5th of 10% of 25 GB = 512 MB (already the maximum). After dedicated heavy indexing, lower this setting back to its default (currently 10%) so search-time data structures have plenty of RAM to use. Note that this is not yet a dynamic setting; there is an issue open to fix that.

The number of bytes currently in use by the index buffer was added to the indices stats API in 1.3.0. You can see it by looking at the indices.segments.index_writer_memory value. This is not yet plotted in Marvel and will be added in the coming version, but you can add a chart yourself (Marvel still collects the data).

Coming in 1.4.0, the indices stats API also shows exactly how much RAM buffer was allocated to each active shard as indices.segments.index_writer_max_memory. To see these values per-shard for a given index, use the http://host:9200/<indexName>/_stats?level=shards; this will return the stats per shard as well as the totals across all shards.

Use auto id or pick a good id

If you do not care what id your documents have, let Elasticsearch automatically assign them: this case is optimized (as of 1.2) to save an ID and version lookup per document, and you can see the performance difference in Elasticsearch's nightly indexing benchmarks (compare the Fast and FastUpdate lines).

If you do have your own ids, try to pick one that is fast for Lucene if that is under your control, and upgrade to at least 1.3.2 since there were further optimizations to id lookup. Just remember that java's UUID.randomUUID() is the worst choice for an id because it has no predictability or pattern on how ids are assigned to segments, causing a seek per segment in the worst case.

You can see the difference in indexing rate as reported by Marvel when using Flake IDs:

versus using fully random UUIDs:

Coming in 1.4.0, we have switched Elasticsearch's auto-generated IDs from random UUIDs to Flake IDs.

If you are curious about the low-level operations Lucene is doing on your index, try enabling lucene.iw TRACE logging (available in 1.2). This produces very verbose output but can be helpful to understand what is happening at the Lucene IndexWriter level. The output is very low-level; Marvel provides a much better real-time graphical view on what is happening to the index.

Scaling out

Remember, we focused here on tuning performance for a single shard (Lucene index) but once you are happy with that, where Elasticsearch really shines is in easily scaling out your indexing and searching across a full cluster of machines. So be sure to increase your shard count again (the default is currently 5), which buys you concurrency across machines, a larger maximum index size, and lower latency when searching. Also remember to increase your replicas to at least 1 so you have redundancy to hardware failures.

Finally, if you are still having trouble, get in touch, e.g. through the Elasticsearch user list. Maybe there is an exciting bug to fix (patches are always welcome!).