Product release

Elasticsearch 6.0.0-alpha1 released

We are excited to announce the release of Elasticsearch 6.0.0-alpha1, based on Lucene 7-SNAPSHOT. This is the first in a series of pre-6.0.0 releases designed to let you test out your application with the features and changes coming in 6.0.0, and to give us feedback about any problems that you encounter. Open a bug report today and become an Elastic Pioneer.

IMPORTANT: This is an alpha release and is intended for testing purposes only. Indices created in this version will not be compatible with Elasticsearch 6.0.0 GA. Upgrading 6.0.0-alpha1 to any other version is not supported.

DO NOT DEPLOY IN PRODUCTION

Elasticsearch 6.0.0-alpha1 is a departure from our previous major version upgrades. Elasticsearch 1, 2, and 5 were huge releases containing many new features and changes, and it required significant effort to upgrade between major versions. In Elasticsearch 6, we are trying to make this process much, much easier.

Throughout the 5.x series, we have added new features and breaking changes to the master (6.0) branch, then we have added backwards compatibility and backported these changes to 5.x minor versions. This means that we have been able to get more improvements into the hands of our users earlier, and that the differences between 6.0 and 5.x are smaller than in previous major versions.

So, if the differences are small, why upgrade at all? Elasticsearch 6.0.0 will use Lucene 7, which brings benefits like sparse doc values and index sorting. Elasticsearch and X-Pack have a number of exciting new features lined up that will evolve during the 6.x series, such as fast replica recovery, secure settings, distributed Watch execution, and support for OAuth, SAML, and Kerberos. Plus, there are the incremental improvements that major releases bring: cleaning out cruft, improving APIs, and improving performance and resilience.

Making it Easier to Upgrade

Here is how we have set about making it easier to upgrade to 6.0:

Upgrading to 6.0 with Rolling Restarts

The hardest part about upgrading to a new major version has been the fact that you have to do a full cluster restart to get there. No more! You will be able to upgrade from the latest 5.x version to 6.0 using rolling restarts, without any cluster downtime. The only exception to this is if you use X-Pack Security without SSL/TLS enabled. TLS between nodes is required in X-Pack Security in 6.0 and the only way to enable it if you aren’t already using it is to do a full cluster restart, which you can choose to do either in 5.x or in 6.0.

Cross Cluster Search Across Major Versions

As with previous major version upgrades, Elasticsearch 6.0 will be able to read indices created in 5.x, but not those created in 2.x. However, instead of needing to reindex all of your old indices, you can choose to leave them in a 5.x cluster and to use Cross Cluster Search to search across both your 6.x and 5.x clusters at the same time.
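As a rough sketch of what such a search targets: Cross Cluster Search addresses a remote index as `cluster_alias:index_name` in the index list of a search request. The helper below only builds that request path. The alias `old_5x` and the index names are hypothetical, and the remote cluster would first need to be registered under that alias in the settings of the cluster that receives the search.

```python
def cross_cluster_search_path(local_indices, remote_indices):
    """Build a _search path covering local and remote indices.

    remote_indices is a list of (cluster_alias, index_name) pairs; each is
    rendered in the alias:index form used by Cross Cluster Search.
    """
    targets = list(local_indices)
    targets += [f"{alias}:{index}" for alias, index in remote_indices]
    return "/" + ",".join(targets) + "/_search"


# Hypothetical names: a 6.x index searched together with an index
# that was left behind in a 5.x cluster.
path = cross_cluster_search_path(["logs-2017"], [("old_5x", "logs-2016")])
print(path)  # /logs-2017,old_5x:logs-2016/_search
```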

Check Your Deprecation Logs

Major versions are a time to make breaking changes, to clean out the cruft. With previous versions, you had to do a lot of research in the release notes to figure out what changes to make to your application to work with the new version. In 5.x, we have made this much easier. Wherever possible, we have added deprecation logging to warn you about functionality that is either going away or changing, and (wherever possible) we have added a backwards compatibility layer which allows you to migrate your application to the new functionality before moving to 6.0.

Upgrade to the latest version of 5.x and consult the deprecation logs before upgrading to 6.0.

Changes in 6.0

Sparse Doc Values

Doc values (the columnar data store in Elasticsearch) have allowed us to escape the limitations of JVM heap size to support scalable analytics on larger amounts of data. Doc values are a very good fit for dense values, where every document has a value for every field:

Field   Doc_1   Doc_2   Doc_3   Doc_N
one     10      20      15      17
two     1.5     6.2     9.8     8.7

But they have been a poor fit for sparse values (many fields, with few documents having a value for each field), where the matrix structure ends up wasting a lot of space:

Field   Doc_1   Doc_2   Doc_3   Doc_N
one     10      -       -       -
two     -       6.2     -       -
three   -       -       abc     -
four    -       -       -       2017-04-01

Lucene 7 brings support for sparse doc values: an alternative encoding format for the sparse case that can save a lot of disk space and file-system cache.
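To make the space argument concrete, here is a toy Python sketch that counts storage slots for the sparse example above. It is only an illustration of the dense-versus-sparse idea and is not Lucene's actual encoding. The function names are made up for this example, and document IDs are zero-based.

```python
# Toy illustration only: this is NOT Lucene's on-disk format.
# Columns from the sparse example; None marks a missing value.
columns = {
    "one":   [10, None, None, None],
    "two":   [None, 6.2, None, None],
    "three": [None, None, "abc", None],
    "four":  [None, None, None, "2017-04-01"],
}


def dense_slots(cols):
    """A dense layout reserves one slot per (field, document) pair."""
    return sum(len(values) for values in cols.values())


def sparse_encode(cols):
    """A sparse layout keeps (doc_id, value) pairs only where a value exists."""
    return {
        field: [(doc_id, v) for doc_id, v in enumerate(values) if v is not None]
        for field, values in cols.items()
    }


def sparse_slots(encoded):
    """Count the values actually stored by the sparse layout."""
    return sum(len(pairs) for pairs in encoded.values())


encoded = sparse_encode(columns)
print(dense_slots(columns))   # 16
print(sparse_slots(encoded))  # 4
```

The gap between the two counts grows with the number of fields and documents, which is why the sparse case benefits most from a dedicated encoding.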

Index Sorting

Imagine that you have a large search-heavy index. Searches should be super-fast, but a significant part of every search request is spent sorting the results in order to return just the top 10 hits. With index sorting, you can pay the price of sorting at index time (30-40% of indexing throughput) instead of at search time. That way, a search can terminate as soon as it has gathered sufficient hits.

To take advantage of this, your documents need to be sorted at index time in the same order as will be used for your primary sort criterion at search time, e.g. by price or timestamp. This means that it won’t work well where your primary sort is on the relevance _score. It also isn’t suitable for searches with aggregations, as aggregations have to examine all documents regardless and can’t terminate early.
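The early-termination idea can be sketched in a few lines of Python. This is a simplified model and not how Elasticsearch or Lucene implement it: the documents, the `price` and `in_stock` fields, and both function names are hypothetical. It only shows why a pre-sorted index lets a search stop once it has enough hits.

```python
def top_k_search_time_sort(docs, matches, k):
    """Without index sorting: visit every document, then sort the hits."""
    visited = 0
    hits = []
    for doc in docs:
        visited += 1
        if matches(doc):
            hits.append(doc)
    hits.sort(key=lambda d: d["price"])
    return hits[:k], visited


def top_k_index_sorted(sorted_docs, matches, k):
    """With index sorting: documents are already in sort order, so stop at k hits."""
    visited = 0
    hits = []
    for doc in sorted_docs:
        visited += 1
        if matches(doc):
            hits.append(doc)
            if len(hits) == k:
                break  # early termination
    return hits, visited


# Hypothetical data: 1000 documents with unique prices in scrambled order.
docs = [{"id": i, "price": (i * 37) % 1000, "in_stock": i % 2 == 0}
        for i in range(1000)]
index_sorted = sorted(docs, key=lambda d: d["price"])  # paid once, at index time


def in_stock(doc):
    return doc["in_stock"]


slow_hits, slow_visited = top_k_search_time_sort(docs, in_stock, 10)
fast_hits, fast_visited = top_k_index_sorted(index_sorted, in_stock, 10)
print(slow_hits == fast_hits)       # True
print(slow_visited, fast_visited)   # the sorted variant visits far fewer documents
```

Both functions return the same top 10, but only the second can stop early. A sort on `_score` offers no such shortcut, because the score is not known until query time.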

However, there is another non-obvious benefit of index sorting. Sorting on low-cardinality fields such as age, gender, is_published, which are commonly used as filters, can result in more efficient searches as all potential matching documents are grouped together.

Elasticsearch 6.0.0-alpha1 supports index sorting (and so would already benefit the low-cardinality use case), but doesn’t yet expose the early termination of searches. See Index Sorting for more details.
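As a hedged sketch, a sorted index is declared when the index is created, using the `index.sort.field` and `index.sort.order` settings. The snippet below only builds the JSON body for such a create-index request. The `timestamp` field and the `doc` type name are hypothetical, and the Index Sorting documentation remains the reference for what a given version supports.

```python
import json

# Body for a create-index request (for example, PUT /my_index).
# Field and type names below are hypothetical.
create_index_body = {
    "settings": {
        "index": {
            "sort.field": "timestamp",  # field to sort segments by
            "sort.order": "desc",       # newest documents first
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "timestamp": {"type": "date"}
            }
        }
    },
}

body = json.dumps(create_index_body, indent=2)
print(body)
```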

Sequence Numbers and Ops Based Recoveries

While synced flush has greatly improved shard recovery times for indices that are not being written to, recovery of active indices is still a slow and heavy operation. An active replica on a node that leaves the cluster for a brief period still needs to copy over all or most of the files in the primary shard in order to bring itself up to date.

The new Sequence Numbers infrastructure assigns an incremental operation ID to every index, update, or delete operation. This allows a replica to ask the primary for all operations from X onwards. If these operations are found in the primary’s translog, an out-of-date replica can bring itself up to date just by replaying the transaction log, avoiding the need to copy files.
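The recovery decision described above can be modelled with a small Python sketch. It is a deliberately simplified toy, not Elasticsearch's implementation: the class and method names are invented for illustration, operations are assumed to arrive in order without gaps, and copying the primary's document map stands in for the expensive file-based recovery.

```python
def apply_op(docs, op):
    """Apply one (seq_no, op_type, doc_id, source) operation to a doc store."""
    _, op_type, doc_id, source = op
    if op_type == "delete":
        docs.pop(doc_id, None)
    else:  # "index" or "update"
        docs[doc_id] = source


class Primary:
    def __init__(self):
        self.docs = {}
        self.translog = []    # retained operations, oldest first
        self.next_seq_no = 0

    def write(self, op_type, doc_id, source=None):
        """Assign the next sequence number, log the operation, and apply it."""
        op = (self.next_seq_no, op_type, doc_id, source)
        self.next_seq_no += 1
        self.translog.append(op)
        apply_op(self.docs, op)
        return op

    def trim_translog(self, keep_from):
        """Discard retained operations with a sequence number below keep_from."""
        self.translog = [op for op in self.translog if op[0] >= keep_from]

    def ops_from(self, seq_no):
        """Return all operations from seq_no onwards, or None if the
        translog no longer reaches back that far."""
        if seq_no >= self.next_seq_no:
            return []  # the replica has missed nothing
        if not self.translog or self.translog[0][0] > seq_no:
            return None
        return [op for op in self.translog if op[0] >= seq_no]


class Replica:
    def __init__(self):
        self.docs = {}
        self.last_seq_no = -1  # highest operation applied so far

    def replicate(self, op):
        apply_op(self.docs, op)
        self.last_seq_no = op[0]

    def recover(self, primary):
        ops = primary.ops_from(self.last_seq_no + 1)
        if ops is None:
            # Stand-in for copying the primary's files.
            self.docs = dict(primary.docs)
            self.last_seq_no = primary.next_seq_no - 1
            return "file_copy"
        for op in ops:
            self.replicate(op)
        return "ops_replay"


primary, replica = Primary(), Replica()
for i in range(3):
    replica.replicate(primary.write("index", f"doc{i}", {"n": i}))

# The replica's node drops out briefly and misses two operations.
primary.write("update", "doc0", {"n": 100})
primary.write("delete", "doc1")

print(replica.recover(primary))      # ops_replay
print(replica.docs == primary.docs)  # True
```

A replica that has fallen behind further than the retained translog reaches gets `None` from `ops_from` and falls back to the file copy, which is why keeping transaction logs around for longer raises the chance of a cheap recovery.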

This is a feature that is partially present in 6.0.0-alpha1 and will continue to evolve towards 6.0 and during the 6.x series. The main areas of improvements are:

  • Increase the chance of a successful ops based recovery by keeping transaction logs around for longer than strictly necessary. It will be possible to configure how long to keep old transaction logs, allowing you to balance disk usage with longer outage periods.
  • If a primary shard fails and it is configured to have multiple replicas, it is possible for each replica to have different operations in flight — operations which have not yet been acknowledged to the user. Sequence numbers allow the replicas to sync with the newly elected primary immediately rather than wait for the next recovery to ensure that all shards hold the same data.
  • Sequence numbers can improve optimistic locking, which is currently implemented using internal versioning.
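To illustrate the optimistic locking point in the last item, here is a minimal sketch of a conditional write keyed on a sequence number. It shows the general compare-and-set technique only. The `DocStore` class and the `expected_seq_no` parameter are hypothetical and do not describe an Elasticsearch API.

```python
class VersionConflict(Exception):
    pass


class DocStore:
    def __init__(self):
        self._next_seq_no = 0
        self._docs = {}  # doc_id -> (seq_no of last write, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def put(self, doc_id, source, expected_seq_no=None):
        """Write a document; if expected_seq_no is given, the write succeeds
        only when the document was last modified by that operation."""
        current = self._docs.get(doc_id)
        if expected_seq_no is not None:
            current_seq_no = current[0] if current else None
            if current_seq_no != expected_seq_no:
                raise VersionConflict(
                    f"expected seq_no {expected_seq_no}, found {current_seq_no}"
                )
        seq_no = self._next_seq_no
        self._next_seq_no += 1
        self._docs[doc_id] = (seq_no, source)
        return seq_no


store = DocStore()
store.put("cart", {"items": 1})
seq_no, source = store.get("cart")

# Two clients read the same state; the first conditional write wins.
store.put("cart", {"items": 2}, expected_seq_no=seq_no)
try:
    store.put("cart", {"items": 5}, expected_seq_no=seq_no)
    conflict = False
except VersionConflict:
    conflict = True

print(conflict)  # True
```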

See Consensus and Replication in Elasticsearch for more on the topic.

Long term we are planning to use Sequence Numbers to power new features like a Changes API and Cross Data-Centre Disaster Recovery.

Removal of Mapping Types

Early on in the history of Elasticsearch, we used to describe an index as being like a database, and a mapping type as being like a table in the database. This turned out to be a bad description, as tables are supposed to be independent of each other. Fields in mapping types are not independent: fields with the same name in different types of the same index are backed by the same Lucene field internally.

In 5.0 we enforced this requirement: fields with the same name in the same index must have the same mapping. In 6.0 we are taking this further: an index created in 6.0 can only have one type. (Your existing 5.x indices will continue to function as before.) This also means a change in how we specify parent-child mappings, as currently these depend on having different parent and child types. In 6.0, we will also add APIs that support type-less operations, such as indexing a document by specifying only the index and ID, instead of also having to specify the type. Eventually, in 7.0, we will remove all support for types.
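The shared-field constraint can be pictured with a toy model in Python. This is a sketch of the idea only, not of Elasticsearch internals: the `Index` class, the type names, and the field names are all hypothetical. Every type in an index writes into one namespace of fields, so a field name can carry only one mapping.

```python
class MappingConflict(Exception):
    pass


class Index:
    def __init__(self):
        self.fields = {}  # field name -> datatype, shared by every type
        self.types = {}   # type name -> properties declared for that type

    def put_mapping(self, type_name, properties):
        """Add a mapping, rejecting any field that clashes with an existing one."""
        for field, datatype in properties.items():
            existing = self.fields.get(field)
            if existing is not None and existing != datatype:
                raise MappingConflict(
                    f"field [{field}] is already mapped as [{existing}]"
                )
        self.fields.update(properties)
        self.types.setdefault(type_name, {}).update(properties)


index = Index()
index.put_mapping("user", {"name": "text", "age": "integer"})
index.put_mapping("tweet", {"name": "text", "body": "text"})  # same mapping: accepted

try:
    # "age" already exists as an integer via the "user" type.
    index.put_mapping("tweet", {"age": "keyword"})
    rejected = False
except MappingConflict:
    rejected = True

print(rejected)  # True
```

Two real database tables could each define their own `age` column with different types, which is exactly where the table analogy breaks down.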

In 6.0.0-alpha1, only a single type is allowed in new indices and the _type field has been removed. This will be followed up with the other changes mentioned above in later pre-6.0.0 releases.

Conclusion

Please download Elasticsearch 6.0.0-alpha1, try it out, and let us know what you think on Twitter (@elastic) or in our forum. You can report any problems on the GitHub issues page.