April 5, 2016

Elasticsearch 5.0.0-alpha1 released

Today we are excited to announce the release of Elasticsearch 5.0.0-alpha1 based on Lucene 6-SNAPSHOT. This is the first in a series of pre-5.0.0 releases designed to let you test out your application with the features and changes coming in 5.0.0, and to give us feedback about any problems that you encounter.

IMPORTANT: This is an alpha release and is intended for testing purposes only. Indices created in this version will not be compatible with Elasticsearch 5.0.0 GA. Upgrading 5.0.0-alpha1 to any other version is not supported.

DO NOT DEPLOY IN PRODUCTION

Elasticsearch 5.0.0-alpha1 is jam-packed with awesome new features, with more to be added before we release 5.0.0 GA.

Lucene 6

Lucene 6 brings a major new feature (not yet implemented in Elasticsearch) called Dimensional Points: a new tree-based data structure which will be used for numeric, date, and geospatial fields. Compared to the existing field format, this new structure uses half the disk space, is twice as fast to index, and increases search performance by 25%. It will be of particular benefit to the logging and metrics use cases. You can read more in this blog: Multi-dimensional points, coming in Apache Lucene 6.0.

Because we're using a snapshot of Lucene, Java users of Elasticsearch will need to update their POM files temporarily for this release.

Ingest Node

Getting your data into Elasticsearch just got a whole lot easier. Logstash is a powerful tool but many smaller users just need the filters it provides without the extensive routing options. We have taken the most popular Logstash filters such as grok, split, convert, and date, and implemented them directly in Elasticsearch as processors. Multiple processors are combined into a pipeline which can be applied to documents at index time via the index or bulk APIs.

Now, ingesting log messages is as easy as configuring a pipeline with the Ingest API, and setting up Filebeat to forward a log file to Elasticsearch. You can even run dedicated ingest nodes to separate the extraction and enrichment work load from search, aggregations, and indexing.
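To make the idea of chaining processors concrete, here is a toy model in Python. It is only a sketch of the concept, not how Elasticsearch implements ingest: the helper functions, their signatures, and the sample document are all hypothetical, and only the processor names (split, convert, date) are borrowed from the list above.

```python
from datetime import datetime

# Toy stand-ins for ingest processors. Each factory returns a function that
# takes a document (dict) and returns a modified copy. Illustrative only.
def split(field, separator):
    def processor(doc):
        out = dict(doc)
        out[field] = doc[field].split(separator)
        return out
    return processor

def convert(field, to_type):
    def processor(doc):
        out = dict(doc)
        out[field] = to_type(doc[field])
        return out
    return processor

def date(field, fmt, target="@timestamp"):
    def processor(doc):
        out = dict(doc)
        out[target] = datetime.strptime(doc[field], fmt).isoformat()
        return out
    return processor

def run_pipeline(processors, doc):
    # Processors run in order; each one sees the output of the previous one.
    for processor in processors:
        doc = processor(doc)
    return doc

# A pipeline is just an ordered list of processors.
pipeline = [
    split("tags", ","),
    convert("status", int),
    date("time", "%Y-%m-%d %H:%M:%S"),
]

raw = {"tags": "web,prod", "status": "200", "time": "2016-04-05 12:30:00"}
indexed = run_pipeline(pipeline, raw)
print(indexed)
```

The point of the sketch is the ordering: the document that is finally indexed is whatever comes out of the last processor, while the raw input is left untouched.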

Painless Scripting

It was a real disappointment when we had to disable dynamic or inline Groovy scripting for security reasons. Convenient scripting suddenly became inflexible and painful. We are thrilled to announce that we have a new scripting language called Painless, which is fast, safe, secure, and enabled by default. Painless was written specifically for Elasticsearch, so we were able to ensure that it was designed from the ground up to be safe. Its syntax is similar to other dynamic scripting languages like JavaScript or Groovy:

def first = input.doc.first_name.0;
def last  = input.doc.last_name.0;
return first + " " + last;

Painless offers gradual typing. Scripts that use dynamic typing, such as the example above, will perform similarly to Groovy. The same script can also be written using strong typing:

String first = (String)((List)((Map)input.get("doc")).get("first_name")).get(0);
String last  = (String)((List)((Map)input.get("doc")).get("last_name")).get(0);
return first + " " + last;

Clearly, this is much harder to read, but the good news is this: it is ten times faster! So you can prototype using dynamic typing, then convert to strong typing to get a performance boost in production.

Painless is enabled by default but is still experimental. Please try it out and let us know about anything that doesn’t work as it should, or any features which are missing. You can read more in the Painless documentation.

Instant aggregations

Over a year ago we added a feature called the query cache (now known as the shard request cache) to cache aggregation results so that shards that haven’t changed since the previous search can return their results almost instantly. Unfortunately, it didn’t work for the most common use case: calculating a date histogram with a timestamp range like from:now-30d to:now or 30 daily indices. The reason is that now changes every millisecond, thus invalidating the cached search.

During the last year we have completely refactored how and where search requests are parsed. Parsing is now stricter so you know immediately when an invalid parameter is used, and this is backed up by an extensive test suite. This refactoring means that we are now able to rewrite queries more flexibly than we were able to do before.

First, now is replaced with the current value for now. Second, each shard can compare the actual values it contains to the to/from range and potentially rewrite the range query to a match_all or a match_none. The end result is that we are now able to cache queries like these so that we only need to recalculate aggregations on shards that have actually changed, greatly improving performance.
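The per-shard rewrite can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual Elasticsearch code: the function name, the field name timestamp, and the integer time units are all hypothetical, and each shard is assumed to know the minimum and maximum value it holds.

```python
def rewrite_range(shard_min, shard_max, gte, lte):
    """Rewrite the range [gte, lte] against the values one shard holds."""
    if shard_max < gte or shard_min > lte:
        # Nothing on this shard can match.
        return {"match_none": {}}
    if gte <= shard_min and shard_max <= lte:
        # Everything on this shard matches.
        return {"match_all": {}}
    # The range cuts through the shard's data, so keep the range query.
    return {"range": {"timestamp": {"gte": gte, "lte": lte}}}

now = 1000                      # "now", already resolved to a concrete value
gte, lte = now - 30, now        # the equivalent of from:now-30d to:now

old_shard = rewrite_range(900, 950, gte, lte)    # entirely before the range
full_shard = rewrite_range(980, 990, gte, lte)   # entirely inside the range
edge_shard = rewrite_range(960, 975, gte, lte)   # straddles the lower bound

print(old_shard, full_shard, edge_shard)
```

In this sketch the rewritten match_all and match_none queries no longer mention now at all, which is presumably what lets an unchanged shard keep serving its cached result while time moves on. Only a shard that straddles a range boundary still sees a changing query.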

This approach is so beneficial that result caching is now enabled by default.

Text/Keyword to Replace Strings

String fields are currently used both for full text content, such as the body of the email, and for keyword identifiers like email addresses, post codes, HTTP status codes, and domain names. The use cases for these two types of content are quite different: full text is all about relevance, while keywords are used for sorting, filtering, and for aggregations.

We have added two new field datatypes: text for full text, and keyword for keyword identifiers. Text fields support the full analysis chain, while keyword fields will (in a later release) support limited analysis only — just enough to normalise values with lower casing and similar transformations. Keyword fields support doc values for memory-friendly sorting and aggregations, while text fields have fielddata disabled by default, to prevent loading masses of data into memory by mistake.
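As a rough illustration, a mapping fragment that uses the two new datatypes might look like the following, written here as a Python dict and printed as JSON. The field names are hypothetical. The use of a multi-field (fields) to index one value both ways is an assumption not stated in this announcement.

```python
import json

# Hypothetical field names; only the "text" and "keyword" type names come
# from the announcement.
properties = {
    # Full text: analysed and scored for relevance.
    "body": {"type": "text"},
    # Identifier: used for filtering, sorting, and aggregations.
    "status_code": {"type": "keyword"},
    # Assumption: a multi-field indexes the same value as text and keyword.
    "subject": {
        "type": "text",
        "fields": {"raw": {"type": "keyword"}},
    },
}

mapping_json = json.dumps({"properties": properties}, indent=2)
print(mapping_json)
```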

The string field type will continue to work during the 5.x series, but will be removed in 6.0.

Completion Suggester v2

The completion suggester, added in Elasticsearch 1.0.0, has been completely rewritten to address some of the problems with the original implementation. The new suggester takes deleted documents into account, one of the most requested features in Elasticsearch. On top of that, it no longer returns just the suggestion or a stored payload. Instead it returns a document (or fields within a document). Contexts have been updated so that multiple or even all contexts can be queried at the same time. Suggestions are no longer weighted only at index time: scores can also be adjusted based on prefix length, contexts, or geolocation.

Settings Validation

How many hours have you lost to a settings typo which was just silently ignored? No more! Almost all settings are now strictly validated, either when starting a node, or when updating a dynamic cluster or index setting. Setting updates are now atomic: if one setting is invalid, the whole request will be rejected, so you won’t be left in an impossible situation where only half your settings have been applied.

The settings rewrite brings two new long-requested features:

Settings can be unset (or reset to the default value) simply by setting them to null.
The default value for all settings can be retrieved by adding ?include_defaults to the GET-settings request.
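The behaviour described above (strict validation, all-or-nothing updates, null to reset, and optional inclusion of defaults) can be modelled with a small Python sketch. It is a toy model, not the Elasticsearch implementation: the setting names, validators, and function names are chosen purely for illustration, and the real API may present defaults differently.

```python
# Toy registry of known settings with defaults and validators (illustrative).
DEFAULTS = {"refresh_interval": "1s", "number_of_replicas": 1}
VALIDATORS = {
    "refresh_interval": lambda v: isinstance(v, str),
    "number_of_replicas": lambda v: isinstance(v, int) and v >= 0,
}

class SettingsError(ValueError):
    pass

def update_settings(current, changes):
    """Return the new explicit settings, or raise without changing anything."""
    # Validate every change first so that the update is all-or-nothing.
    for key, value in changes.items():
        if key not in VALIDATORS:
            raise SettingsError(f"unknown setting [{key}]")
        if value is not None and not VALIDATORS[key](value):
            raise SettingsError(f"invalid value for [{key}]: {value!r}")
    updated = dict(current)
    for key, value in changes.items():
        if value is None:
            updated.pop(key, None)   # null unsets: fall back to the default
        else:
            updated[key] = value
    return updated

def get_settings(explicit, include_defaults=False):
    # With include_defaults, explicit values override the defaults.
    if include_defaults:
        return {**DEFAULTS, **explicit}
    return dict(explicit)

settings = {"number_of_replicas": 2}

# One typo rejects the whole request, including the valid refresh_interval.
try:
    settings = update_settings(
        settings, {"refresh_interval": "5s", "number_of_replcas": 3}
    )
except SettingsError as err:
    print(err)

reset = update_settings(settings, {"number_of_replicas": None})
print(get_settings(reset, include_defaults=True))
```

Validating everything before applying anything is what makes the update atomic in this sketch: a rejected request leaves the existing settings exactly as they were.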

Safety in production

We have seen many users suffer unexpected problems and even data loss because they have forgotten a simple setting like ensuring enough file handles are available, or giving Elasticsearch permission to mlockall the entire heap into memory. While we log warnings about these issues, not many people look at the logs until after disaster has struck. To make users aware of these problems early on, Elasticsearch should refuse to start.

This is a difficult problem to solve because it interferes with the out-of-the-box experience. Who wants to configure their laptop as if it were a production system before starting to develop on Elasticsearch? We needed to find some way to distinguish between Elasticsearch being used in development mode and in production mode. Since Elasticsearch 2.0.0, nodes bind only to localhost by default. This is fine for development but won’t allow you to form a cluster across multiple servers without configuring the networking.

This is the signal we are using: if network settings use the defaults, we warn about potential problems in the logs. As soon as networking is configured, those warnings are upgraded to exceptions which will prevent Elasticsearch from starting.
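The warn-or-refuse decision can be sketched as follows. This is a minimal illustration under assumptions, not the real startup code: the function name, the exception class, the set of hosts treated as "default", and the sample problem messages are all hypothetical.

```python
# Assumption: an unset or loopback host stands in for "default networking".
LOCAL_HOSTS = {None, "localhost", "127.0.0.1", "::1"}

class BootstrapError(RuntimeError):
    pass

def enforce_checks(network_host, problems):
    """problems: human-readable descriptions of failed checks (may be empty)."""
    if not problems:
        return []
    if network_host in LOCAL_HOSTS:
        # Development mode: report the problems as warnings and carry on.
        return [f"WARN: {problem}" for problem in problems]
    # Networking is configured, so treat the node as production: refuse to start.
    raise BootstrapError("; ".join(problems))

problems = ["too few file handles", "unable to lock heap in memory"]

dev_warnings = enforce_checks(None, problems)
print(dev_warnings)

try:
    enforce_checks("192.168.1.10", problems)
    started = True
except BootstrapError as err:
    started = False
    print(f"refusing to start: {err}")
```

The same list of problems produces two different outcomes, and the only input that changes is whether networking has been configured.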

Resilience

Besides the more obvious features listed above, there are a host of changes that have gone into this release to make Elasticsearch safer than ever before. Every part of the distributed model has been picked apart, refactored, simplified, and made more reliable. Cluster state updates now wait for acknowledgement from all the nodes in the cluster. When a replica shard is marked as failed by the primary, the primary now waits for a response from the master. Indices now use their UUID in the data path, instead of the index name, to avoid naming clashes. There are many more incremental improvements like these which mean that your data is safer.

Conclusion

Please download Elasticsearch 5.0.0-alpha1, try it out, and let us know what you think on Twitter (@elastic) or in our forum. You can report any problems on the GitHub issues page.