This Week in Elasticsearch and Apache Lucene - 2017-07-03

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Configurable translog retention for fast ops-based shard recovery

Up to 6.0 the primary purpose of the translog was to persist all operations that hadn't yet been persisted to Lucene. Once Lucene files were committed (i.e., fsynced) we would clear the translog and start writing to it again (a.k.a. _flush). While this behaviour is optimal from a storage perspective, it is not ideal for the new ops-based fast recovery introduced in 6.0. In order to bring a replica up to speed without copying segments, the primary needs to have all the required operations available in its translog. That meant that fast ops-based recoveries weren't possible during full restarts, as we flush when we recover the primary, or if a node left the cluster long enough for the primary to flush while it was away. This now changes with the introduction of a translog retention policy: we will now keep translog files (up to 512MB and for up to 12 hours by default) even if the data they contain is already committed to Lucene. This is also very important for the future xDCR feature where we need the ops to be streamed off cluster.
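To make the retention idea concrete, here is a minimal Python sketch of how such a policy could pick which translog files to keep. It is an illustration only, not the Elasticsearch implementation: the `TranslogFile` class and `files_to_retain` function are hypothetical names, both defaults are treated as upper bounds (which is how the `index.translog.retention.size` and `index.translog.retention.age` index settings are documented), and the sketch ignores uncommitted operations, which always have to be kept regardless of the limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TranslogFile:
    """Hypothetical stand-in for one translog generation on disk."""
    generation: int
    size_bytes: int
    age_seconds: float


def files_to_retain(
    files: List[TranslogFile],
    max_bytes: int = 512 * 1024 ** 2,      # assumed default size limit: 512MB
    max_age_seconds: float = 12 * 3600,    # assumed default age limit: 12 hours
) -> List[int]:
    """Return the generations to keep, newest first.

    Walks from the newest generation backwards and stops at the first
    file that is too old or that would push the total past the size limit.
    """
    retained = []
    total = 0
    for f in sorted(files, key=lambda f: f.generation, reverse=True):
        if f.age_seconds > max_age_seconds:
            break
        if total + f.size_bytes > max_bytes:
            break
        total += f.size_bytes
        retained.append(f.generation)
    return retained
```

Under these assumptions, a replica that was offline for an hour would most likely find every operation it missed still present in the retained files, so it could recover by replaying operations instead of copying segments.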

Augmenting Painless functionality

Scripts in Elasticsearch are now run in a particular "context", which defines the return value type and the variables and functions that are available to a script in that context, e.g. an update script should have access to _source, while a fast search script shouldn't. Now, other classes can extend the functionality available to Painless scripts, instead of scripts being restricted to a hardcoded list of functions. This will allow utility functions that are only useful in certain contexts to be exposed to Painless scripts running in those contexts.
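The idea of a context can be pictured with a small Python model. This is a conceptual sketch, not the Painless API: `ScriptContext`, `extend`, `can_access`, the variable lists and the `hypothetical_helper` function are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class ScriptContext:
    """Toy model of a script context: a return type plus what is visible."""
    name: str
    return_type: type
    variables: Tuple[str, ...]
    functions: Dict[str, Callable] = field(default_factory=dict)

    def extend(self, fn_name: str, fn: Callable) -> None:
        # Register an extra function that is visible only in this context.
        self.functions[fn_name] = fn

    def can_access(self, name: str) -> bool:
        return name in self.variables or name in self.functions


# Illustrative contexts; the variable names are assumptions for the example.
update = ScriptContext("update", type(None), ("ctx", "_source", "params"))
search = ScriptContext("search", float, ("doc", "params"))

# An extension adds a helper to the update context only.
update.extend("hypothetical_helper", lambda s: s.strip())
```

In this model the update context can see `_source` and the added helper, while the search context sees neither, which is the kind of per-context separation described above.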

Removal of default passwords

We will no longer be using the default changeme password for internal users in Elasticsearch 6.0 with security. Instead, a cluster in production mode will start with no passwords, but will only accept requests to change the password for the elastic user from localhost. This can be done via the API, but we provide a command line tool which will generate strong passwords for all internal users and output them to the console. Alternatively, the user can specify their own passwords using the same tool. For Docker, where this CLI tool is not accessible, we allow a temporary password to be set via an environment variable, which will cease to work as soon as the user changes the password via the API.
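As a rough illustration of the API route, the Python sketch below builds (but does not send) a password change request for the elastic user. It assumes the X-Pack change password endpoint `/_xpack/security/user/{username}/_password` and a node listening on `http://localhost:9200`; the function name is hypothetical, and authentication is left out because the exact bootstrap mechanism may differ between versions.

```python
import json
import urllib.request


def build_password_change_request(
    new_password: str,
    host: str = "http://localhost:9200",  # assumed local node address
    user: str = "elastic",
) -> urllib.request.Request:
    """Build a PUT request that would set a new password for `user`."""
    body = json.dumps({"password": new_password}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/_xpack/security/user/{user}/_password",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_password_change_request("a-long-random-passphrase")
# Sending it would be urllib.request.urlopen(request), run on the node itself,
# since only requests from localhost are accepted in this state.
```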

Longer user and role names

Some users have long requested longer user and role names in security, which were previously limited to just 30 characters. The new limits are as follows:

A valid username's length must be at least 1 and no more than 1024 characters. It may not contain leading or trailing whitespace. All characters in the name must be alphanumeric (a-z, A-Z, 0-9), printable punctuation or symbols in the Basic Latin (ASCII) block, or the space character.
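The rules above can be expressed as a short check. The following Python function is a sketch of the stated rules, not the validation code used by security; the name `is_valid_name` is hypothetical, and "printable Basic Latin" is taken to mean code points 0x20 through 0x7E.

```python
def is_valid_name(name: str) -> bool:
    """Check a user or role name against the limits described above."""
    # Length must be between 1 and 1024 characters.
    if not 1 <= len(name) <= 1024:
        return False
    # No leading or trailing whitespace.
    if name != name.strip():
        return False
    # Only printable ASCII: letters, digits, punctuation, symbols, space.
    return all(0x20 <= ord(ch) <= 0x7E for ch in name)
```

For example, `is_valid_name("john doe")` is accepted because an inner space is allowed, while `" john"` is rejected for its leading space and a 1025-character name is rejected for its length.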

Changes in 5.5:

Changes in 5.x:

Changes in master:

Apache Lucene

We're moving towards getting Lucene 7 released. The 7.0 and 7.x branches are expected to be cut very soon. A change in the version string format of Java 9 affects Hadoop, which Solr relies on, and we are looking at the best way to get things fixed since it will be important for Lucene 7 to support Java 9.
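For context on the version string change: Java 8 and earlier report versions such as `1.8.0_131`, while Java 9 reports `9` or `9.0.1`, so code that assumes the old `1.x` layout can misread or fail on the new format. The Python sketch below shows one way to extract the major version from either scheme. It is an illustration with a hypothetical function name, not the code used by Hadoop, Solr or Lucene.

```python
def java_major_version(version: str) -> int:
    """Return the major Java version from an old or new style version string."""
    # Drop pre-release and build suffixes such as "-ea" or "+181".
    core = version.split("-")[0].split("+")[0]
    parts = core.split(".")
    # Old scheme: "1.8.0_131" means Java 8, so the major is the second field.
    if parts[0] == "1" and len(parts) > 1:
        return int(parts[1])
    # New scheme: "9", "9.0.1" and later put the major version first.
    return int(parts[0])
```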

Videos

Want to watch Lucene-related content? There were 3 Lucene-related talks at Berlin Buzzwords this year:

Faster deletes/updates

Mike has been exploring using multiple threads to resolve doc ids for the deleted terms, which results in a 50% update throughput increase in some cases.

Other

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!