This Week in Elasticsearch and Apache Lucene - 2017-02-13

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Faster range and geo-distance queries

Today, range and geo-distance queries always execute using Lucene's BKD trees. This gives the best performance when a range runs on its own, but it can be slow when the range has to be intersected with a selective query: the range query still needs to visit every document that matches the range, and that number might be much higher than the number of matches of the other clauses. As of 5.4, if the range query is not the most selective part of the query, it will execute using doc values, which allows for much faster execution when only a minority of matches need to be verified. This change has the potential to make some queries tens of times faster.
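To make the trade-off concrete, here is a minimal toy model in Python. It is not Lucene's implementation: the function name, the sorted list standing in for the BKD tree, and the simple cost comparison are all assumptions made for illustration. It only shows why verifying a handful of candidates against a per-document lookup can touch far fewer documents than enumerating everything the range matches.

```python
from bisect import bisect_left, bisect_right


def intersect_with_range(lead_docs, doc_values, sorted_points, lo, hi):
    """Intersect a lead clause with the range lo <= value <= hi.

    lead_docs:     sorted doc ids matched by the other (selective) clause
    doc_values:    doc_values[doc_id] -> value, a per-document lookup
    sorted_points: (value, doc_id) pairs sorted by value, a stand-in for the tree
    Returns (matching doc ids, strategy name, docs touched by the range clause).
    """
    values = [value for value, _ in sorted_points]
    start, end = bisect_left(values, lo), bisect_right(values, hi)
    range_cost = end - start  # how many docs the range matches on its own

    if range_cost <= len(lead_docs):
        # The range is the most selective clause: enumerate its matches.
        range_docs = {doc for _, doc in sorted_points[start:end]}
        return [d for d in lead_docs if d in range_docs], "index", range_cost

    # Otherwise let the other clause lead and verify each candidate by lookup.
    matches = [d for d in lead_docs if lo <= doc_values[d] <= hi]
    return matches, "doc_values", len(lead_docs)


# Toy index (hypothetical data): document i holds the value i.
doc_values = list(range(1000))
sorted_points = [(value, doc) for doc, value in enumerate(doc_values)]
wide = intersect_with_range([5, 500, 900], doc_values, sorted_points, 100, 950)
narrow = intersect_with_range([5, 500, 900], doc_values, sorted_points, 500, 500)
```

With the wide range, the range clause alone matches 851 documents, but only the 3 candidates from the selective clause are checked. With the narrow range, the range itself is the cheapest clause, so enumerating its matches remains the better choice.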

Faster nested queries

The relational structure created by nested fields is totally opaque at the low level: to Lucene, nested documents are just regular documents. As a consequence, Elasticsearch automatically applies a filter that excludes nested docs to every query it executes. This logic is being improved in two ways: the filter is applied using a FILTER clause rather than a MUST_NOT clause (positive clauses are faster because they can skip more efficiently), and the filter is only added when needed. For instance, if the query is a term query on a field that may only occur in root documents, there is no need to exclude nested documents since they cannot match anyway.
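The "only when needed" decision can be sketched as follows. This is a simplified illustration, not Elasticsearch code: the tuple-based query representation, the ROOT_DOCS_ONLY marker and both function names are hypothetical, and the check assumes that a field belongs to nested documents exactly when its name sits under a nested path.

```python
# Hypothetical marker for "match root documents only", expressed as a positive clause.
ROOT_DOCS_ONLY = ("ROOT_DOCS_ONLY",)


def may_match_nested_docs(field, nested_paths):
    """True if the field lives under a nested path and so can occur in nested docs."""
    return any(field.startswith(path + ".") for path in nested_paths)


def wrap_term_query(field, value, nested_paths):
    """Add the root-documents filter to a term query only when it is required."""
    term = ("TERM", field, value)
    if not may_match_nested_docs(field, nested_paths):
        # The field only occurs in root documents: nested docs cannot match.
        return term
    # A positive FILTER clause rather than a MUST_NOT on nested documents.
    return ("BOOL", [("MUST", term), ("FILTER", ROOT_DOCS_ONLY)])


root_only = wrap_term_query("title", "lucene", ["comments"])
needs_filter = wrap_term_query("comments.author", "kim", ["comments"])
```

Here `root_only` stays a bare term query, while `needs_filter` is wrapped in a boolean query that carries the extra filter clause.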

Flexible search phases

Until now, search phase execution has always been hardcoded. You might, for instance, know about QUERY_THEN_FETCH or DFS_QUERY_THEN_FETCH. Phases are currently being decoupled to make them easier to unit test, which also means it will be easier to add new phases to the execution of search requests in the future.
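As a rough illustration of why detached phases help, the sketch below models each phase as a plain Python function over a context dictionary. None of these names come from Elasticsearch; the context keys, the toy shard data and the phase functions are assumptions made for the example. The point is only that a phase can be tested on its own and that a new phase can be added to the list without touching the others.

```python
def run_phases(phases, context):
    """Run each phase in order, feeding its output to the next one."""
    for phase in phases:
        context = phase(context)
    return context


def query_phase(ctx):
    """Collect (doc_id, score) hits from every shard and keep the global top ones."""
    hits = [hit for shard in ctx["shards"] for hit in shard]
    top = sorted(hits, key=lambda hit: hit[1], reverse=True)[: ctx["size"]]
    return {**ctx, "top_docs": top}


def fetch_phase(ctx):
    """Load the stored source for the documents selected by the query phase."""
    return {**ctx, "hits": [ctx["source"][doc_id] for doc_id, _ in ctx["top_docs"]]}


# Toy request (hypothetical data): two shards, three documents, top 2 wanted.
request = {
    "shards": [[("a", 1.0), ("b", 3.0)], [("c", 2.0)]],
    "size": 2,
    "source": {"a": "doc a", "b": "doc b", "c": "doc c"},
}
result = run_phases([query_phase, fetch_phase], request)

# A single phase can be exercised in isolation with a hand-built context.
fetched_alone = fetch_phase({"top_docs": [("c", 2.0)], "source": {"c": "doc c"}})
```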

Better query generation with multi-term synonyms

The progress in graph queries, such as multi-token synonyms, continues! Today, in Lucene 6.4.x and Elasticsearch 5.2.0, when the query parser sees that the search-time analyzer created a token graph for a given query, it enumerates all unique paths through the graph, and then creates a big BooleanQuery with each full path analyzed as a sub-query. But this is dangerous: the number of unique paths can grow exponentially in the number of tokens, and while that is unlikely to happen in actual queries, it is still possible. In Lucene we generally try hard to prevent any "adversarial" cases that could lead to denial of service, and so this exciting and tricky change makes graph queries much faster (and safer) by pre-analyzing the graph for its cut vertices, and then directly creating a BooleanQuery, possibly with nested clauses. Beyond just a fun optimization, the change also alters hit scoring, how settings like minimumShouldMatch interact with synonyms, and what boolean operator is used for the tokens inserted by a synonym when the user's query is not a phrase query.
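The idea of cut vertices can be illustrated with a small example. The code below is a standard depth-first articulation-point search written for this post, not Lucene's code, and the token graph is a made-up one: the query "fast wi fi network" with a hypothetical synonym rule that also emits "wifi" across the "wi fi" span. A cut vertex is a state that every path through the graph must cross, so the query can be split there instead of enumerating each full path.

```python
def articulation_points(num_states, edges):
    """Return the cut vertices of a token graph, treated as an undirected graph."""
    adjacent = {state: set() for state in range(num_states)}
    for src, dst in edges:
        adjacent[src].add(dst)
        adjacent[dst].add(src)

    discovered, lowest, points = {}, {}, set()
    counter = 0

    def visit(node, parent):
        nonlocal counter
        discovered[node] = lowest[node] = counter
        counter += 1
        children = 0
        for nxt in adjacent[node]:
            if nxt == parent:
                continue
            if nxt in discovered:
                # Back edge: node can reach an earlier state without its parent.
                lowest[node] = min(lowest[node], discovered[nxt])
            else:
                children += 1
                visit(nxt, node)
                lowest[node] = min(lowest[node], lowest[nxt])
                # nxt's subtree cannot reach above node: node is a cut vertex.
                if parent is not None and lowest[nxt] >= discovered[node]:
                    points.add(node)
        if parent is None and children > 1:
            points.add(node)

    for state in range(num_states):
        if state not in discovered:
            visit(state, None)
    return points


# States 0..4; edges are tokens: fast(0->1), wi(1->2), fi(2->3), wifi(1->3), network(3->4).
token_edges = [(0, 1), (1, 2), (2, 3), (1, 3), (3, 4)]
cuts = articulation_points(5, token_edges)
```

In this graph the cut vertices are states 1 and 3, which bracket the synonym span. A query built along those points would treat "fast" and "network" as ordinary clauses and only nest the alternatives ("wi fi" or "wifi") between them, rather than producing one sub-query per complete path.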

New field capabilities API

Kibana currently has to embed a lot of client-side logic that uses the field-stats API to select which indices to query. We are working on moving that burden to Elasticsearch by providing a simpler API that only reports which fields may be searched or aggregated, and by making it cheap to query indices that have no matches, so that Kibana can always send queries to all indices without having to worry about the timestamp range.

Apache Lucene

Watch This Space

Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!