26 October 2016 Releases

Elasticsearch 5.0.0 released

By Clinton Gormley

With 2,674 pull requests by 310 committers added since the release of Elasticsearch 2.0.0, we are proud to announce the release of Elasticsearch 5.0.0 GA, based on Lucene 6.2.0. This is the latest stable release and is already available for deployment on Elastic Cloud, our Elasticsearch-as-a-service platform.

Go and download it today! You know you want to.

Elasticsearch 5.0.0 has something for everyone. This is the fastest, safest, most resilient, easiest to use version of Elasticsearch ever, and it comes with a boatload of enhancements and new features.

Indexing performance

Indexing throughput has improved dramatically in 5.0.0 thanks to a number of changes including better numeric data structures (see New data structures), reduced contention in the lock that prevents concurrent updates to the same document, and reduced locking requirements when fsyncing the transaction log. Asynchronous translog fsyncing will particularly benefit users of spinning disks, and the append-only use case (think time-based events) has a big throughput improvement when relying on Elasticsearch to auto-generate document IDs. An internal change to how real-time document GET is supported means there is more memory available for the indexing buffer and much less time spent in garbage collection.

Depending on your use case, you are likely to see an improvement in indexing throughput of somewhere between 25% and 80%.

Ingest node

Getting your data into Elasticsearch just got a whole lot easier. Logstash is a powerful tool but many smaller users just need the filters it provides without the extensive routing options. We have taken the most popular Logstash filters such as grok, split, convert, and date, and implemented them directly in Elasticsearch as processors. Multiple processors are combined into a pipeline which can be applied to documents at index time via the index or bulk APIs.

Now, ingesting log messages is as easy as configuring a pipeline with the Ingest API, and setting up Filebeat to forward a log file to Elasticsearch. You can even run dedicated ingest nodes to separate the extraction and enrichment workload from search, aggregations, and indexing. You can read more in A New Way to Ingest - Part 1. If you have ideas for other processors or want to submit your own, let us know!
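As a rough illustration of what such a pipeline can look like, the sketch below builds a pipeline definition as a Python dictionary and serialises it to JSON. The pipeline name `apache-logs`, the index and type names, and the choice of fields are assumptions made up for this example; sending the requests is left to whichever HTTP client you prefer.

```python
import json

# Hypothetical pipeline: parse an Apache access log line held in "message",
# use the extracted timestamp as the event date, and store the HTTP
# response code as a number instead of a string.
pipeline = {
    "description": "Parse Apache access logs (illustrative only)",
    "processors": [
        {"grok": {"field": "message", "patterns": ["%{COMBINEDAPACHELOG}"]}},
        {"date": {"field": "timestamp", "formats": ["dd/MMM/yyyy:HH:mm:ss Z"]}},
        {"convert": {"field": "response", "type": "integer"}},
    ],
}

# Request body for: PUT _ingest/pipeline/apache-logs
pipeline_body = json.dumps(pipeline, indent=2)

# A document is routed through the pipeline with the "pipeline" parameter
# on the index (or bulk) request.
index_path = "/{index}/{doc_type}/{doc_id}?pipeline={pipeline}".format(
    index="my-index", doc_type="log", doc_id="1", pipeline="apache-logs"
)

print(pipeline_body)
print(index_path)
```

Processors run in the order listed, so the grok step has to come first here: the date and convert steps operate on fields that grok extracts.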

Painless scripting

Scripting is everywhere in Elasticsearch, and it was very frustrating to not be able to enable scripting by default for security reasons. We are thrilled to announce that we have developed a new scripting language called Painless, which is fast, safe, secure, and enabled by default. Not only that, Painless is 4 times as fast as Groovy, and getting faster. We like Painless so much that we have made it the new default scripting language and deprecated Groovy, JavaScript, and Python. Of course, you can use Painless in a script query or alert on Painless scripted conditions using X-Pack, but you can also combine a Painless script with the reindex API or an Ingest node for a powerful way to manipulate documents.

You can read more about Painless in the blog Painless: A New Scripting Language.
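To give a feel for combining Painless with the reindex API, here is a minimal sketch of a reindex request body, built in Python. The index names and the field being renamed are hypothetical; only the overall shape of the request is the point.

```python
import json

# Hypothetical reindex request: copy "old-index" into "new-index" and
# rename the field "user" to "username" along the way with a Painless script.
reindex_body = {
    "source": {"index": "old-index"},
    "dest": {"index": "new-index"},
    "script": {
        "lang": "painless",
        # remove() returns the removed value, so this moves it to the new key
        "inline": "ctx._source.username = ctx._source.remove('user')",
    },
}

# Request body for: POST _reindex
print(json.dumps(reindex_body, indent=2))
```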

New data structures

Lucene 6 brought a new Points data structure for numeric and geo-point fields called Block K-D trees, which has revolutionised how numeric values are indexed and searched. In our benchmark, Points are 36% faster at query time, 71% faster at index time, and use 66% less disk and 85% less memory (see Searching numb3rs in 5.0). The addition of Points means that the ip field can now support IPv6 as well as IPv4 addresses. On top of that, we have changed geo_point fields to use Lucene’s new LatLonPoint, which doubles geo-point query performance.

We have also added a half_float field type for half-precision (16-bit) floating points, and a scaled_float type which is implemented internally as a long field and so can benefit from the compression techniques used for longs. These new types can mean significantly reduced disk space in many cases, especially in metrics data.
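The idea behind scaled_float can be sketched in a few lines: the value is multiplied by a fixed scaling factor and rounded to a whole number, which is what gets stored. The helper functions below are illustrative only and are not how Elasticsearch implements the type; the field names in the mapping are likewise made up.

```python
# Illustrative mapping fragment using the two new field types.
mapping = {
    "properties": {
        "price": {"type": "scaled_float", "scaling_factor": 100},
        "temperature": {"type": "half_float"},
    }
}


def encode_scaled(value, scaling_factor):
    """Store a float as a whole number by scaling and rounding."""
    return int(round(value * scaling_factor))


def decode_scaled(stored, scaling_factor):
    """Recover the (rounded) float from the stored whole number."""
    return stored / float(scaling_factor)


stored = encode_scaled(19.99, 100)
print(stored, decode_scaled(stored, 100))
```

With a scaling factor of 100, anything beyond two decimal places is rounded away, which is usually an acceptable trade for prices or percentages.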

Analyzed and not-analyzed string fields have been replaced by dedicated text fields for full text search, and keyword fields for string identifier search, sorting, and aggregations. See Strings are dead, long live strings!

Finally, dots-in-fieldnames are back, but properly supported this time: a dotted field like foo.bar is the equivalent of a bar field inside an object field foo. For those of you stuck on 1.7 because of the lack of support for dots-in-fieldnames in 2.x, see Elasticsearch 2.4.0 which provides a migration path.
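The equivalence between a dotted field name and a nested object can be illustrated with a small helper. This is a conceptual sketch with a made-up function name, not Elasticsearch's own mapping code, and it assumes that no dotted path collides with a plain value of the same name.

```python
def expand_dots(doc):
    """Expand dotted keys into nested dicts: {"foo.bar": 1} -> {"foo": {"bar": 1}}."""
    result = {}
    for key, value in doc.items():
        parts = key.split(".")
        node = result
        # Walk (or create) the intermediate objects for every part but the last.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result


print(expand_dots({"foo.bar": 1, "foo.baz": 2, "qux": 3}))
```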

Search and Aggregations

Those beautiful Kibana charts which graph some metric over the previous day, month, or year, are going to get a serious speed boost with Instant Aggregations. Most of the data in those charts comes from indices which are no longer being updated, but Elasticsearch had to recalculate the aggregation from scratch on every request because it wasn’t possible to cache a range like { "gte": "now-30d", "lte": "now" }. A year’s worth of refactoring of the search API later, and Elasticsearch can be much cleverer about how a range query is executed, and will now only recalculate the aggregation for indices that have changed.
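One way to picture why this helps caching is to rewrite the range against the minimum and maximum values each index (or shard) actually holds. The sketch below is a simplified model of that idea rather than Elasticsearch's actual code; the function name, the `timestamp` field, and the use of plain day numbers are all assumptions.

```python
def rewrite_range(shard_min, shard_max, gte, lte):
    """Rewrite a range query against the values a shard actually contains."""
    if shard_max < gte or shard_min > lte:
        # No value can fall inside the range.
        return {"match_none": {}}
    if gte <= shard_min and shard_max <= lte:
        # Every value falls inside the range.
        return {"match_all": {}}
    # The range cuts through the shard's data, so it must be evaluated.
    return {"range": {"timestamp": {"gte": gte, "lte": lte}}}


# "now-30d" to "now", evaluated on two consecutive days (days as integers).
for now in (200, 201):
    print(now, rewrite_range(180, 190, now - 30, now))
```

For an index covering days 180 to 190, both requests rewrite to the same `match_all` query even though "now" has moved, so a cached result for that index stays valid.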

Besides this, aggregations have seen more improvements: histogram aggregations now support fractional buckets and handle the rounding of negative buckets correctly, terms aggregations are calculated more efficiently to reduce the risk of combinatorial explosion, and aggregations are now supported by the Profile API.

On the search side, the default relevance calculation has been changed from TF/IDF to the more modern BM25. Deep pagination of search results is now possible with the search_after feature, which efficiently skips over previously returned results to return just the next page. Scrolled searches can now be sliced and executed in parallel; this functionality will soon be added to reindex, update-by-query, and delete-by-query.
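A minimal sketch of paging with search_after: the sort values of the last hit on one page become the starting point of the next request. The helper name, the query, and the sample hit are hypothetical; the example assumes the sort includes a unique tiebreaker so that the ordering is deterministic.

```python
import json


def next_page_body(base_body, last_hit):
    """Return a copy of base_body that resumes after last_hit's sort values."""
    body = dict(base_body)
    body["search_after"] = last_hit["sort"]
    return body


base_body = {
    "size": 10,
    "query": {"match": {"title": "elasticsearch"}},
    # The second sort key acts as a unique tiebreaker.
    "sort": [{"date": "asc"}, {"_uid": "desc"}],
}

# The last hit of the previous page, as it might appear in a search response.
last_hit = {"_id": "654323", "sort": [1463538857, "tweet#654323"]}

page_two = next_page_body(base_body, last_hit)
print(json.dumps(page_two, indent=2))
```

Unlike a large `from` offset, each request only carries the position to resume from, so the cost of fetching a page does not grow with how deep into the results you are.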

The completion suggester has been completely rewritten to take deleted documents into account, and it now returns documents instead of just payloads. Contexts are more flexible and suggestions can be requested for a context, many contexts, or all contexts. Suggestions are no longer weighted only at index time: scores can also be adjusted based on prefix length, contexts, or geolocation.

The percolator has also been rewritten to be faster and more memory efficient. The terms in percolator queries are now indexed so that only the queries that could possibly match need to be checked, instead of the brute force approach that was used before. The Percolator API has been replaced by the percolate query, which opens up all the features of the search API to percolation: highlighting, scoring, pagination, etc.
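The sketch below shows roughly how the pieces fit together: a field of type percolator holds the stored query, and a regular search request with a percolate query matches a document against the stored queries. The index, type, and field names are made up for the example, and the highlight section is there only to show that ordinary search features can be mixed in.

```python
import json

# PUT /my-index : one type for the documents, one for the stored queries.
index_body = {
    "mappings": {
        "doctype": {"properties": {"message": {"type": "text"}}},
        "queries": {"properties": {"query": {"type": "percolator"}}},
    }
}

# PUT /my-index/queries/1 : register a query as a normal document.
stored_query = {"query": {"match": {"message": "bonsai tree"}}}

# GET /my-index/_search : find the stored queries that match this document.
percolate_search = {
    "query": {
        "percolate": {
            "field": "query",
            "document_type": "doctype",
            "document": {"message": "A new bonsai tree in the office"},
        }
    },
    "highlight": {"fields": {"message": {}}},
}

print(json.dumps(percolate_search, indent=2))
```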

User friendly

A major theme of the 5.0.0 release has been making Elasticsearch safer and easier to use. We have adopted the approach of "Shout loud, shout early": if something is wrong, we should tell you about it immediately instead of leaving you guessing and possibly at risk of a 3 a.m. page to debug a production problem. An example of this attitude is the improvements made to index and cluster settings.

Settings are now validated strictly. If Elasticsearch doesn’t understand the value of a setting, it complains. If it doesn’t recognise a setting, it complains, and offers you did-you-mean suggestions. Not only that, cluster and index settings can now be unset (with null), and you can see the default settings by specifying the include_defaults parameter. Similarly, body and query string parameters are parsed strictly and invalid or unrecognised parameters result in exceptions.
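As a small example of unsetting a setting, the sketch below builds one request body that sets a transient cluster setting and another that resets it by sending null. The setting and value were picked only for illustration.

```python
import json

# PUT _cluster/settings : override a setting...
set_body = {"transient": {"indices.recovery.max_bytes_per_sec": "50mb"}}

# ...and later unset it again. Python's None serialises to JSON null.
unset_body = {"transient": {"indices.recovery.max_bytes_per_sec": None}}
unset_json = json.dumps(unset_body)

# Defaults are included in the response when asked for explicitly.
settings_path = "/_cluster/settings?include_defaults=true"

print(unset_json)
print(settings_path)
```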

The rollover and shrink APIs enable a new pattern for managing time based indices efficiently: using multiple shards to make the most of your hardware resources during indexing, then shrinking indices down to a single shard (or a few shards) for efficient storage and search.
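The pattern might look roughly like the following sequence of requests, written here as (method, path, body) tuples. The alias `logs_write`, the index names, the node name, and the rollover conditions are all hypothetical.

```python
import json

requests = [
    # Roll the write alias over to a fresh index once the current one is
    # a week old or holds a thousand documents (thresholds are examples).
    ("POST", "/logs_write/_rollover",
     {"conditions": {"max_age": "7d", "max_docs": 1000}}),
    # Before shrinking: block writes on the old index and move a copy of
    # every shard onto a single node.
    ("PUT", "/logs-000001/_settings",
     {"settings": {
         "index.routing.allocation.require._name": "shrink_node_name",
         "index.blocks.write": True,
     }}),
    # Shrink the old index down to a single primary shard.
    ("POST", "/logs-000001/_shrink/logs-000001-shrunk",
     {"settings": {"index.number_of_shards": 1, "index.number_of_replicas": 1}}),
]

for method, path, body in requests:
    print(method, path)
    print(json.dumps(body, indent=2))
```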

There have been a number of improvements to index management as well. Creating an index is simpler than before — instead of having to wait_for_status=green, the create index request now only returns when an index is ready to be used. And a newly created index no longer turns the cluster red, allowing sysadmins to sleep peacefully at night.

If something does go wrong with shard allocation, Elasticsearch no longer keeps trying to allocate the shard indefinitely (while filling up your logs and possibly your disks). Instead, it will give up after 5 attempts and wait for you to resolve the problem. To help resolve the problem, the new cluster-allocation-explain API is an essential tool for figuring out why a shard isn’t allocated.
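A sketch of the two requests involved: asking why a particular shard is unassigned and, once the underlying problem is fixed, asking the cluster to retry the failed allocations. The index name and shard number are placeholders.

```python
import json

# GET /_cluster/allocation/explain : explain the state of one shard copy.
explain_request = (
    "GET",
    "/_cluster/allocation/explain",
    {"index": "my-index", "shard": 0, "primary": True},
)

# POST /_cluster/reroute?retry_failed : retry allocations that were given up on.
retry_request = ("POST", "/_cluster/reroute?retry_failed", None)

for method, path, body in (explain_request, retry_request):
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```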

Finally, deprecation logging is now turned on by default to give you ample warning about any deprecated functionality you are using. This should make the transition to new major versions in the future a much less stressful process.

Resiliency

There are a host of changes that have gone into this release to make Elasticsearch safer than ever before. Every part of the distributed model has been picked apart, refactored, simplified, and made more reliable. Cluster state updates now wait for acknowledgement from all the nodes in the cluster. When a replica shard is marked as failed by the primary, the primary now waits for a response from the master. Indices now use their UUID in the data path, instead of the index name, to avoid naming clashes.

Your system must be properly configured (such as having sufficient available file descriptors) otherwise you are putting yourself at risk of losing data later on. Elasticsearch now has bootstrap checks which ensure that you are not in for a nasty surprise down the road. But proper system configuration can be onerous, especially if you’re just wanting to experiment on your laptop. Elasticsearch starts in a localhost-only lenient developer mode which just warns you about poor configuration. As soon as you configure the network to form a cluster, it switches to production mode, which enforces all of the system checks. See Bootstrap checks: Annoying you now instead of devastating you later!

We have added a circuit breaker which limits the amount of memory which can be used by in-flight requests, and expanded the request circuit breaker to track the memory used by aggregation buckets and to abort pathological requests which request trillions of buckets. While an out-of-memory exception is much less likely than before, if one does occur, the node will now die-with-dignity instead of limping along in some undefined state.

For sysadmins, especially those in multitenant environments, we have added numerous soft limits to protect your cluster from malicious users and to alert the naive when they are straying into dangerous territory. Examples include default search timeouts, fielddata loading disabled on text fields, a limit on the number of shards a search request can target, and a limit on the number of fields in a mapping.

Java REST client

After years of waiting, we have finally released a low-level Java HTTP/REST client. It provides a simple HTTP client with minimal dependencies, which handles sniffing, logging, round-robin distribution of requests, and retry on node failure. It uses the REST layer which has historically been much more stable than the Java API, which means that it can be used across upgrades, possibly even upgrades across major versions. It works with Java 7 and has minimal dependencies, resulting in fewer dependency conflicts than the Transport client. It is just HTTP and so can be firewalled/proxied like all of the other HTTP clients. In our benchmarks, the Java REST client performs similarly to the Transport client.

Be aware that this is a low-level client. At this stage we don’t provide any query builders or helpers that will allow for autocompletion in your IDE. It is JSON-in, JSON-out, and it is up to you how you build the JSON. Development won’t stop here — we will be adding an API which will help you to construct queries and to parse responses. You can follow the changes in issue #19055.

Migration Helper

The Elasticsearch Migration Helper is a site plugin designed to help you to prepare for your migration from Elasticsearch 2.3.x/2.4.x to Elasticsearch 5.0. It comes with three tools:

Cluster Checkup
Runs a series of checks on your cluster, nodes, and indices and alerts you to any known problems that need to be rectified before upgrading.
Reindex Helper
Indices created before v2.0.0 need to be reindexed before they can be used in Elasticsearch 5.x. The reindex helper upgrades old indices at the click of a button.
Deprecation Logging
Elasticsearch comes with a deprecation logger which will log a message whenever deprecated functionality is used. This tool enables or disables deprecation logging on your cluster.

Instructions for installing the Elasticsearch Migration Helper

For instructions when migrating from earlier versions of Elasticsearch, see the Upgrade documentation.

Conclusion

Please download Elasticsearch 5.0.0, try it out, and let us know what you think on Twitter (@elastic) or in our forum. You can report any problems on the GitHub issues page.