This Week in Elasticsearch and Apache Lucene - 2020-02-18

Elasticsearch

EQL

We continue to make progress towards our first milestone of having end-to-end execution of simple EQL queries.

We have been working on the parsing pipeline, which is now implemented for EQL and has allowed us to start on more specific aspects of parsing. We are now working on query translation, which converts the AST built from the query text into an Elasticsearch search request that can be executed to fetch results. Once this is merged, we will work on translating the search results into the EQL response.
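
To make the end-to-end goal concrete, here is a hedged sketch of the kind of translation the pipeline is aiming for; both the example EQL expression and the exact query-DSL shape are illustrative assumptions, not the implementation's actual output.

# An EQL expression filtering process events by name...
eql_query = 'process where process.name == "cmd.exe"'

# ...and one plausible Elasticsearch search request it could translate
# into: the event category from the query combined with the condition
# from the where clause. (Illustrative only.)
translated_search_request = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"event.category": "process"}},
                {"term": {"process.name": "cmd.exe"}},
            ]
        }
    }
}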

We have been extracting optimiser rules from SQL that are shared with EQL and plugging them into the EQL optimiser. We have also been working on support for wildcard matching, which is important for a lot of EQL rules. We have written a test harness so we can ensure feature parity with the original Python implementation of EQL. This is important so that rules and queries that run against endpoints can also run against the Elasticsearch implementation.

We also have the start of documentation for EQL.

Progress of the EQL implementation can be tracked on the meta issue.

Disallowing expensive queries

For a long time, Elasticsearch cluster administrators have wanted control over which types of queries users are allowed to run, in order to keep their clusters stable. Now it is possible. We merged a PR that adds a new cluster setting, search.allow_expensive_queries. If set to false, expensive queries are not allowed to run and an exception is thrown if a user attempts to run one. The disallowed query types are: script and script_score, fuzzy, regexp, prefix, wildcard, range on text and keyword, percolate, joining queries, and geo prefix tree queries. Currently, the level of control is crude, as it completely disallows all of these queries, but we are planning to make it more fine-grained, for example to disallow only expensive queries that involve many documents.
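
As a hedged sketch of how the new setting might be used, the following Python snippet (using the requests library against a local cluster; the host, index name, and field name are illustrative assumptions) disables expensive queries and shows a wildcard query being rejected:

import requests

ES = "http://localhost:9200"

# Disallow expensive queries cluster-wide via the new setting.
requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"search.allow_expensive_queries": False}},
).raise_for_status()

# With the setting off, a disallowed query type such as wildcard should
# now return an error instead of executing.
resp = requests.post(
    f"{ES}/my-index/_search",
    json={"query": {"wildcard": {"message": {"value": "*error*"}}}},
)
print(resp.status_code, resp.json())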

System indices

Now that the foundations are in place for hidden indices and system indices, it is time to transition each dot index to the appropriate new index type. History indices are exactly the type of index we want to remain available for users to query, but not show up by default when wildcard index expressions are expanded. This week we converted the ILM and SLM history indices into hidden indices.
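
As a hedged illustration of what "hidden" means in practice (using the requests library; the host and index name are assumptions, not the actual ILM/SLM history indices), a hidden index stays fully queryable but is skipped by wildcard expansion unless hidden indices are explicitly included:

import requests

ES = "http://localhost:9200"

# Create an index marked as hidden and put one document in it.
requests.put(
    f"{ES}/history-000001",
    json={"settings": {"index.hidden": True}},
).raise_for_status()
requests.post(
    f"{ES}/history-000001/_doc?refresh=true", json={"msg": "example"}
).raise_for_status()

# By default, wildcard expansion does not match hidden indices...
default_hits = requests.get(f"{ES}/history-*/_search").json()["hits"]["total"]["value"]
print("default expansion:", default_hits)

# ...but the data is still there when hidden indices are included.
hidden_hits = requests.get(
    f"{ES}/history-*/_search", params={"expand_wildcards": "open,hidden"}
).json()["hits"]["total"]["value"]
print("including hidden:", hidden_hits)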

Modernizing the es-hadoop build

One of the big problems we face in the es-hadoop build is how we cross-compile for different versions of Scala. Our previous approach worked decently for about a year, but has become a maintenance burden. We are in the process of replacing our old cross-compiling logic with Gradle Variants. The work has been promising so far, and has even highlighted some better patterns we could follow in other parts of the es-hadoop build.

ILM

We updated the forcemerge ILM action so that it is now available in the hot phase (#52073). Since forcemerge should only be used after a rollover action, ILM validates that the policy definition contains a rollover when a policy is created or updated. This lets users take advantage of the physical resources of the hot nodes (which generally have faster disks) to perform the forcemerge before the index is moved to the warm nodes.
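
A hedged sketch of a policy that takes advantage of this follows (the policy name, rollover thresholds, and allocation attribute are illustrative assumptions): rollover and forcemerge sit together in the hot phase, and the index only moves to warm afterwards.

import requests

ES = "http://localhost:9200"

policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # ILM validates that a rollover action is present when
                    # forcemerge is used in the hot phase.
                    "rollover": {"max_size": "50gb", "max_age": "30d"},
                    # The forcemerge runs on the hot nodes, right after
                    # rollover and before the index moves to warm.
                    "forcemerge": {"max_num_segments": 1},
                }
            },
            "warm": {
                "min_age": "1d",
                "actions": {"allocate": {"require": {"data": "warm"}}},
            },
        }
    }
}

requests.put(f"{ES}/_ilm/policy/hot-forcemerge-policy", json=policy).raise_for_status()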

Apache Lucene

Elasticsearch 7.7 is on a Lucene 8.5 snapshot

We upgraded the master and 8.x branches of Elasticsearch to a new snapshot of Lucene 8.5. This triggered interesting changes in the nightly benchmarks. Indices are slightly more space-efficient thanks to better compression of the terms dictionary, e.g. a ~5% reduction on the geonames track. Memory usage is also slightly lower thanks to the fact that the index of stored fields has moved off-heap, e.g. ~15% less memory usage on the geonames track.

Compression of binary doc values

We continue iterating on compression for binary doc values. We are interested in this because we plan to use binary doc values for the upcoming wildcard_keyword field, which will provide efficient wildcard search on arbitrary strings. Today, the complete lack of compression makes binary doc values very space-inefficient for this use case; this should hopefully be addressed soon, as the change looks ready to merge.