Elasticsearch 6.1.0 released

Today we are pleased to announce the release of Elasticsearch 6.1.0, based on Lucene 7.1.0. This is the latest stable release, and is already available for deployment on Elastic Cloud, our Elasticsearch-as-a-service platform.

Latest stable release in 6.x:

You can read about all the changes in the release notes linked above, but there are a few changes which are worth highlighting:

Index Splitting

As a companion to the Shrink Index API, we now have a Split Index API that allows you to split an existing index into a new index, where each original primary shard is split into two or more primary shards in the new index.

The split is done efficiently by hard-linking the data in the source primary shard into multiple primary shards in the new index, then running a fast Lucene Delete-By-Query to mark documents which should belong to a different shard as deleted. These deleted documents will be physically removed over time by the background merge process.
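Why deleting is enough can be seen in a toy simulation. This is a sketch, not Elasticsearch code: it assumes documents are routed as (hash % number_of_routing_shards) / routing_factor, and it swaps in zlib.crc32 for the real hash function, so the shard numbers will not match a real cluster. The index sizes and document IDs are made up.

```python
import zlib


def shard_for(doc_id: str, num_shards: int, num_routing_shards: int) -> int:
    """Toy routing: hash into the routing-shard space, then scale down."""
    routing_factor = num_routing_shards // num_shards
    # crc32 is a stand-in for the real hash function (an assumption).
    hashed = zlib.crc32(doc_id.encode("utf-8"))
    return (hashed % num_routing_shards) // routing_factor


# Hypothetical index: 2 primaries, created with 8 routing shards,
# then split into 4 primaries (each source shard becomes 2 target shards).
ROUTING_SHARDS, SOURCE_SHARDS, TARGET_SHARDS = 8, 2, 4
SPLIT_FACTOR = TARGET_SHARDS // SOURCE_SHARDS

doc_ids = [f"doc-{i}" for i in range(1000)]
for doc_id in doc_ids:
    source = shard_for(doc_id, SOURCE_SHARDS, ROUTING_SHARDS)
    target = shard_for(doc_id, TARGET_SHARDS, ROUTING_SHARDS)
    # A document only ever moves to a target shard derived from its own
    # source shard, never to one derived from a different source shard.
    assert target // SPLIT_FACTOR == source
```

Under this routing rule, every target shard can start life as a hard-linked copy of exactly one source shard and then drop the documents that route to its sibling shards.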

The split API can only be used on indices that had the index.number_of_routing_shards setting specified at index creation time. From 7.0, we plan to set this automatically. Until then, this feature will only be available to new indices created on or after Elasticsearch 6.1.0 with that setting.
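The request bodies involved might look like the following sketch. The index names are hypothetical, and two things are assumptions to check against the Split Index API docs for your version: that the source index must be write-blocked before splitting, and that the target shard count must be a multiple of the source shard count and a factor of index.number_of_routing_shards.

```python
import json


def valid_split_targets(source_shards: int, routing_shards: int) -> list:
    """Shard counts a source index could be split into (assumed rule:
    a multiple of the source count that also divides the routing shards)."""
    return [
        n
        for n in range(source_shards + 1, routing_shards + 1)
        if n % source_shards == 0 and routing_shards % n == 0
    ]


# PUT /my_source_index  (hypothetical index name)
create_source = {
    "settings": {
        "index.number_of_shards": 5,
        "index.number_of_routing_shards": 30,
    }
}

# PUT /my_source_index/_settings  (block writes before splitting)
block_writes = {"settings": {"index.blocks.write": True}}

# POST /my_source_index/_split/my_target_index
split_request = {"settings": {"index.number_of_shards": 10}}

print(valid_split_targets(5, 30))
print(json.dumps(split_request, indent=2))
```

With 5 primaries and 30 routing shards, the helper lists 10, 15 and 30 as possible targets, so the routing-shards value chosen at creation time decides how far an index can later grow.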

Composite Aggregation with Paging

Elasticsearch is designed to return the top-10 best search results or the top-50 most accessed destination pages in your web logs as fast as possible. This speed is part of the reason why Elasticsearch is so popular for analytics. However, sometimes you need to get back ALL terms, and the top-N design of aggregations doesn’t allow this to happen efficiently on high-cardinality fields.

The new composite aggregation is designed to make this possible. The composite agg allows you to create terms, histogram, or date_histogram composite buckets on one or more fields, sorted in "natural order", i.e. alphabetically for terms, and numerically or by date for the histograms.

Because these composite buckets are returned in sorted order, results can be paged through efficiently in a similar manner to a scroll request. The first search request could return the first 100 or 1000 buckets, then the next tranche can be requested by passing the values of the last composite bucket in the after parameter, and so on until all buckets have been retrieved.
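The paging loop can be sketched as follows. The search callable, the aggregation name all_pages, the source name page and the field url.keyword are all hypothetical, and fake_search is an in-memory stand-in so the example runs without a cluster. With a real client, search would send the body to the _search endpoint and return the parsed response.

```python
from typing import Callable, Iterator, List, Optional


def composite_body(sources: List[dict], size: int, after: Optional[dict] = None) -> dict:
    """Build a search body holding one composite aggregation."""
    composite = {"size": size, "sources": sources}
    if after is not None:
        composite["after"] = after  # key of the last bucket already seen
    return {"size": 0, "aggs": {"all_pages": {"composite": composite}}}


def iter_buckets(search: Callable[[dict], dict], sources: List[dict], size: int) -> Iterator[dict]:
    """Yield every composite bucket, one page of `size` buckets at a time."""
    after = None
    while True:
        response = search(composite_body(sources, size, after))
        buckets = response["aggregations"]["all_pages"]["buckets"]
        if not buckets:
            return  # an empty page means everything has been retrieved
        yield from buckets
        after = buckets[-1]["key"]


# Made-up data and an in-memory imitation of the response shape.
TERM_COUNTS = {"/about": 3, "/cart": 7, "/home": 12, "/login": 5, "/search": 9}


def fake_search(body: dict) -> dict:
    composite = body["aggs"]["all_pages"]["composite"]
    after = composite.get("after")
    keys = sorted(k for k in TERM_COUNTS if after is None or k > after["page"])
    page = keys[: composite["size"]]
    buckets = [{"key": {"page": k}, "doc_count": TERM_COUNTS[k]} for k in page]
    return {"aggregations": {"all_pages": {"buckets": buckets}}}


SOURCES = [{"page": {"terms": {"field": "url.keyword"}}}]
all_buckets = list(iter_buckets(fake_search, SOURCES, size=2))
```

Unlike a scroll, nothing is kept open on the server between pages in this sketch: each request carries the position to resume from in after.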

An additional benefit to the composite aggregation is that doc counts and metric aggs directly under the composite aggregation are accurate for the cases where you need non-approximated counts, as we can be sure that we have seen all documents for a particular composite bucket (unlike the top-N model). While you can specify a further terms agg under the composite agg, it will use the standard top-N model and return approximate counts.

Adaptive Replica Selection

Today in Elasticsearch, a series of search requests to the same shard will be forwarded to the primary and each replica in round-robin fashion. This can prove problematic if one node starts a long garbage collection: search requests will still be forwarded to the slow node regardless, which increases search latency.

In 6.1, we have added an experimental feature called Adaptive Replica Selection. Each node tracks and compares how long search requests to other nodes take, and uses this information to adjust how frequently to send requests to shards on particular nodes. In our benchmarks, this results in an overall improvement in search throughput and reduced 99th percentile latencies.
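The general idea can be illustrated with a deliberately simplified toy. This is not the formula Elasticsearch uses: it only keeps an exponentially weighted moving average of observed response times per node and prefers the fastest one. The node names, latencies and smoothing factor are invented.

```python
from collections import Counter


class ToyReplicaSelector:
    """Prefer the shard copy on the node with the lowest smoothed latency."""

    def __init__(self, nodes, alpha=0.3):
        self.alpha = alpha  # weight given to the newest measurement
        self.ewma_ms = {node: None for node in nodes}

    def record(self, node, took_ms):
        previous = self.ewma_ms[node]
        if previous is None:
            self.ewma_ms[node] = float(took_ms)
        else:
            self.ewma_ms[node] = self.alpha * took_ms + (1 - self.alpha) * previous

    def pick(self):
        # Try unmeasured nodes first so every copy gets at least one sample.
        for node, value in self.ewma_ms.items():
            if value is None:
                return node
        return min(self.ewma_ms, key=self.ewma_ms.get)


# node-c is stuck in a long garbage collection and answers slowly.
LATENCY_MS = {"node-a": 20, "node-b": 20, "node-c": 500}

selector = ToyReplicaSelector(LATENCY_MS)
sent = Counter()
for _ in range(30):
    node = selector.pick()
    sent[node] += 1
    selector.record(node, LATENCY_MS[node])

print(sent)
```

Round robin would have sent 10 of the 30 requests to the slow node. The toy sends it one. It also never retries a node once it looks slow and ignores load on the fast nodes, which is the kind of detail a production algorithm has to handle.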

This option is disabled by default as we are still fine-tuning how to compare different search requests and how to account for differences due to caching, but the results we are seeing are very promising. You can enable or disable this feature at runtime by updating a dynamic cluster setting, so it is worth trying this out in your environment. If you do so, we would love to hear about your results.
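As a sketch, the cluster settings update could be built like this. The setting name cluster.routing.use_adaptive_replica_selection is an assumption to verify against the reference docs for your version, and the helper name is hypothetical. The body would be sent with PUT /_cluster/settings.

```python
import json


def ars_settings(enabled: bool, persistent: bool = False) -> dict:
    """Body for PUT /_cluster/settings toggling adaptive replica selection."""
    scope = "persistent" if persistent else "transient"
    # Setting name assumed; check it against the docs before relying on it.
    return {scope: {"cluster.routing.use_adaptive_replica_selection": enabled}}


print(json.dumps(ars_settings(True), indent=2))
```

A transient setting is a reasonable choice for an experiment, since it is dropped on a full cluster restart.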

Improved Indexing Throughput

Each document indexed in Elasticsearch includes a _field_names metafield, which lists the fields contained in that document. This is needed to support the exists query. It turns out that this simple feature is surprisingly costly. We have since reworked the exists query to use doc-values or norms as a proxy for _field_names, which limits the need for the _field_names metafield to only those fields that have neither doc-values nor norms. This simple change has resulted in a massive 15% increase in indexing throughput in our benchmarks, with no loss of functionality.

Scripted Similarities

Elasticsearch now uses BM25 scoring instead of TF/IDF, and TF/IDF is going to be removed. That said, some people still want to use TF/IDF, and some people would like to have more control over scoring, such as disabling term frequency or inverse document frequency. Previously, the only way to have such control was to write an Elasticsearch plugin. This has become much easier thanks to scripted similarities. Now, you can write your own custom similarity using Painless. The linked docs demonstrate how to recreate TF/IDF with two simple scripts.
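A sketch of what such an index definition might look like follows, alongside a plain Python mirror of the same arithmetic. The similarity name, mapping type and field name are hypothetical. The Painless variable names (doc.freq, doc.length, field.docCount, term.docFreq, query.boost, weight) and the weight_script/script split are assumptions to check against the scripted similarity docs.

```python
import math

# Painless sources, held as strings. Variable names are assumptions.
WEIGHT_SOURCE = (
    "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; "
    "return query.boost * idf;"
)
SCORE_SOURCE = (
    "double tf = Math.sqrt(doc.freq); "
    "double norm = 1/Math.sqrt(doc.length); "
    "return weight * tf * norm;"
)

# Body for creating an index whose "body" field uses the custom similarity.
index_body = {
    "settings": {
        "similarity": {
            "scripted_tfidf": {
                "type": "scripted",
                "weight_script": {"source": WEIGHT_SOURCE},
                "script": {"source": SCORE_SOURCE},
            }
        }
    },
    "mappings": {
        "doc": {"properties": {"body": {"type": "text", "similarity": "scripted_tfidf"}}}
    },
}


def weight(doc_count: int, doc_freq: int, boost: float = 1.0) -> float:
    """Python mirror of the weight script: the document-independent part."""
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    return boost * idf


def score(term_weight: float, freq: int, doc_length: int) -> float:
    """Python mirror of the score script: the per-document part."""
    tf = math.sqrt(freq)
    norm = 1 / math.sqrt(doc_length)
    return term_weight * tf * norm
```

Splitting the work this way means the part that does not depend on the document is computed once per term rather than once per matching document. Removing the tf or idf factor from the corresponding script is how one would disable term frequency or inverse document frequency.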

Watcher run_as support

Up until now, watches have been executed as an internal X-Pack user when security is enabled, which allowed the watch to access any index that the X-Pack user has access to. Starting in 6.1.0, search inputs, search transforms, and index actions will instead be run as the user who created (or last updated) the watch. This will limit the watch's privileges to those of the user: if the user can't read index foo, then neither can the watch. Elevated permissions can still be requested in a watch by using the run_as privilege. Existing watches will continue to run as the X-Pack user until they are updated.

Conclusion

Please download Elasticsearch 6.1.0, try it out, and let us know what you think on Twitter (@elastic) or in our forum. You can report any problems on the GitHub issues page.