13 February 2017

This Week in Elasticsearch and Apache Lucene - 2017-02-13

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Molly, Elasticsearch and Data Replication https://t.co/zQ49fAzewk . Thank you Kamala and @palvaro . It was fun working with you.
— Boaz Leskes (@bleskes) February 7, 2017

Faster range and geo-distance queries

Range and geo-distance queries today always execute using Lucene's BKD trees. This gives the best performance when ranges are run on their own, but it can be slow when the result set of the range needs to be intersected with a selective query: the range query still has to visit all documents that match the range, and this number might be much higher than the number of matches of the other clauses. As of 5.4, if the range query is not the most selective part of the query, it will execute using doc values, which allows for much faster query execution when only a minority of matches need to be verified. This change has the potential to make some queries tens of times faster.
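To make the trade-off concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not Lucene's actual implementation: a sorted list of (value, doc_id) pairs stands in for the BKD tree, a plain list indexed by doc id stands in for doc values, and the function name `intersect_with_range` and the simple cost comparison are invented for illustration.

```python
from bisect import bisect_left, bisect_right


def intersect_with_range(lead_docs, doc_values, sorted_points, lo, hi):
    """Intersect the matches of another clause with a range [lo, hi].

    lead_docs:     doc ids matched by the other clause (toy stand-in).
    doc_values:    list where doc_values[doc_id] is that doc's value.
    sorted_points: (value, doc_id) pairs sorted by value; a toy stand-in
                   for a BKD tree, not the real data structure.
    Returns (strategy, hits) so the chosen strategy is visible.
    """
    start = bisect_left(sorted_points, (lo, -1))
    end = bisect_right(sorted_points, (hi, float("inf")))
    range_cost = end - start  # number of docs the range alone would visit

    if range_cost <= len(lead_docs):
        # The range is the most selective clause: walk the index structure.
        in_range = {doc for _, doc in sorted_points[start:end]}
        return "points", [d for d in lead_docs if d in in_range]

    # The other clause is more selective: verify only its few candidates
    # against per-document values instead of visiting every range match.
    return "doc_values", [d for d in lead_docs if lo <= doc_values[d] <= hi]


# 1000 docs whose values cycle through 0..99.
doc_values = [d % 100 for d in range(1000)]
sorted_points = sorted((v, d) for d, v in enumerate(doc_values))

# A wide range (510 matching docs) intersected with only three candidates.
print(intersect_with_range([5, 150, 799], doc_values, sorted_points, 0, 50))
```

In the wide-range case, only three per-document lookups are needed rather than a visit to every document in the range, which is where the speedup described above would come from.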

Faster nested queries

The relational structure that is created by nested fields is totally opaque at the low level: nested documents are just regular documents to Lucene. As a consequence, Elasticsearch automatically applies a filter that excludes nested docs to all queries that are executed. This logic is being improved in two ways: the filter is applied using a FILTER clause rather than a MUST_NOT clause (positive clauses are faster because they can skip more efficiently), and the filter is only added when needed. For instance, if the query is a term query on a field that may only occur in root documents, there is no need to exclude nested documents since they cannot match anyway.
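The "only add the filter when needed" idea can be sketched as follows. This is a simplified illustration, not the Elasticsearch code: queries are plain dicts, `ROOT_DOCS_FILTER` is a hypothetical placeholder for the internal root-documents filter, and the rule that a field lives in nested documents when its name starts with a nested path is an assumption made for the example.

```python
# Hypothetical placeholder for the internal "root documents only" filter.
ROOT_DOCS_FILTER = {"root_docs_only": {}}


def may_match_nested_docs(query, nested_paths):
    """Return False only when the query provably targets root documents."""
    if "term" in query:
        field = next(iter(query["term"]))
        return any(field.startswith(path + ".") for path in nested_paths)
    # Be conservative for every query type this sketch does not understand.
    return True


def with_nested_filter(query, nested_paths):
    """Wrap the query with a positive FILTER clause only when required."""
    if not may_match_nested_docs(query, nested_paths):
        return query
    return {"bool": {"must": [query], "filter": [ROOT_DOCS_FILTER]}}


nested_paths = {"comments"}
print(with_nested_filter({"term": {"title": "lucene"}}, nested_paths))
print(with_nested_filter({"term": {"comments.author": "boaz"}}, nested_paths))
```

The first query is returned untouched because `title` can only occur in root documents; the second is wrapped because it targets a field under a nested path.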

Flexible search phases

Until now, search phase execution has always been hardcoded. You might for instance know about QUERY_THEN_FETCH or DFS_QUERY_THEN_FETCH. Phases are currently being detached in order to make things easier to unit test, but this also means it will be easier to add new phases to the execution of search requests in the future.
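As a rough illustration of why detached phases are easier to test and extend, the sketch below models each phase as a plain function over a context dict. The function names, the context layout, and the in-memory "shards" are all hypothetical and do not mirror the Elasticsearch classes.

```python
def query_phase(ctx):
    """Collect (score, shard, doc_id) from every shard and keep the top hits."""
    candidates = [
        (score, shard, doc_id)
        for shard, docs in ctx["shards"].items()
        for doc_id, (score, _source) in docs.items()
    ]
    ctx["top"] = sorted(candidates, reverse=True)[: ctx["size"]]
    return ctx


def fetch_phase(ctx):
    """Load the stored source only for the hits that survived the query phase."""
    ctx["hits"] = [ctx["shards"][shard][doc_id][1] for _score, shard, doc_id in ctx["top"]]
    return ctx


def run_phases(phases, ctx):
    """Run phases in order; adding a phase is just adding a list element."""
    for phase in phases:
        ctx = phase(ctx)
    return ctx


shards = {
    "s0": {1: (0.9, "doc a"), 2: (0.1, "doc b")},
    "s1": {1: (0.5, "doc c")},
}
result = run_phases([query_phase, fetch_phase], {"shards": shards, "size": 2})
print(result["hits"])
```

Because each phase only reads and writes the context, it can be unit tested on its own with a hand-built context, and a new phase could be slotted into the list without touching the others.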

Better query generation with multi-term synonyms

The progress in graph queries, such as multi-token synonyms, continues! Today, in Lucene 6.4.x and Elasticsearch 5.2.0, when the query parser sees that the search-time analyzer created a token graph for a given query, it enumerates all unique paths through the graph, and then creates a big BooleanQuery with each full path analyzed as a sub-query. But this is dangerous: the number of unique paths can grow exponentially in the number of tokens, and while that is unlikely to happen in actual queries, it is still possible. In Lucene we generally try hard to prevent any "adversarial" cases that could lead to denial of service, and so this exciting and tricky change makes graph queries much faster (and safer) by pre-analyzing the graph for its cut vertices, and then directly creating a BooleanQuery, possibly with nested clauses. Beyond just a fun optimization, the change also alters hit scoring, how settings like minimumShouldMatch interact with synonyms, and what boolean operator is used for the tokens inserted by a synonym when the user's query is not a phrase query.
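The difference between enumerating every path and splitting the graph at its cut vertices can be shown with a small model. This is a sketch under assumptions, not Lucene's algorithm: the token graph is a list of (from_state, to_state, token) edges whose states are consecutive integers increasing along every edge, `repeated_synonym_graph` is a made-up helper, and the cut-state test ("no edge jumps over this state") is a shortcut that only works for graphs of that shape.

```python
from collections import defaultdict


def count_paths(edges, start, end):
    """Count distinct start-to-end paths; edges must go from lower to higher state."""
    out = defaultdict(list)
    for src, dst, _token in edges:
        out[src].append(dst)
    paths = defaultdict(int)
    paths[end] = 1
    for state in range(end - 1, start - 1, -1):
        paths[state] = sum(paths[dst] for dst in out[state])
    return paths[start]


def cut_states(edges, start, end):
    """States strictly between start and end that every path must pass through."""
    return [
        s for s in range(start + 1, end)
        if not any(src < s < dst for src, dst, _token in edges)
    ]


def repeated_synonym_graph(k):
    """k query terms in a row, each with a two-token synonym ('ny' vs 'new york')."""
    edges = []
    for i in range(k):
        base = 2 * i
        edges += [(base, base + 2, "ny"), (base, base + 1, "new"), (base + 1, base + 2, "york")]
    return edges, 0, 2 * k


edges, start, end = repeated_synonym_graph(10)
total_paths = count_paths(edges, start, end)

# Split at the cut states and count the paths inside each segment instead.
bounds = [start] + cut_states(edges, start, end) + [end]
per_segment = [count_paths(edges, a, b) for a, b in zip(bounds, bounds[1:])]

print(total_paths, sum(per_segment))
```

With ten such terms, enumerating full paths yields 1024 sub-queries, while splitting at the cut states leaves ten segments with two alternatives each, so the work grows linearly rather than exponentially with the number of tokens.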

New field capabilities API

Kibana has to embed a lot of client-side logic in order to select which indices to query using the field-stats API. We are working on moving the burden to Elasticsearch by providing a simpler API that only says which fields may be searched or aggregated, as well as making it cheap to query indices that do not have matches so that Kibana could always send queries to all indices without having to worry about the timestamp range.

Changes in 5.3:

Cluster state appliers (which make changes based on the cluster state) can no longer sample not-yet-applied cluster states, but appliers can create an observer when they need to wait for specific changes in a new cluster state.
Connections to new nodes can now happen in parallel.
Upgraded to Lucene 6.4.1, which fixes two memory leaks.
Secure settings are now validated at startup.
The backport of the change to execute index/delete/update requests via the bulk API introduced a bug in versioning.
The bulk processor now requires a content-type when accepting bytes to avoid content detection.
Cluster allocation explain should not return an empty body when an exception is thrown.
The field-collapsing feature no longer uses blocking calls to populate its inner hits.
Bulk and msearch APIs support the ndjson content-type (and have deprecated support for ldjson).
The cross-cluster client can now be disabled on a per-node basis.
The search.highlight.term_vector_multi_value setting was not exposed.
The internal-only QUERY_AND_FETCH search type has been removed.
The include_in_all parameter has been deprecated and will not be allowed on new indices in 6.0.0.
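For readers unfamiliar with the ndjson format mentioned in the list above, the following sketch builds a bulk-style body by hand. The helper names and the sample index, type, and document are invented for illustration, and the `application/x-ndjson` media type is stated here as an assumption about what a client would send; no request is actually made.

```python
import json

# Assumed media type for newline-delimited JSON bodies.
NDJSON_CONTENT_TYPE = "application/x-ndjson"


def to_ndjson(items):
    """Serialize each item as one JSON object per line, each line newline-terminated."""
    return "".join(json.dumps(item, separators=(",", ":")) + "\n" for item in items)


def bulk_index_body(index, doc_type, docs):
    """Build an action line followed by a source line for every (id, source) pair."""
    lines = []
    for doc_id, source in docs:
        lines.append({"index": {"_index": index, "_type": doc_type, "_id": doc_id}})
        lines.append(source)
    return to_ndjson(lines)


body = bulk_index_body("blog", "post", [("1", {"title": "This Week in Elasticsearch"})])
headers = {"Content-Type": NDJSON_CONTENT_TYPE}
print(body, end="")
```

Declaring the content type explicitly, rather than relying on detection, is the same idea as the bulk processor change listed above.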

Changes in 5.4:

The reduce phase of aggs, suggesters, and profiles has been split from the merge phase of fetched hits.
HEAD requests should return the correct content-length header, fixed template, index, alias,
Upgrade Lucene to 6.5.0-SNAPSHOT.
The typed_keys parameter to search requests will prefix agg and suggester names with the agg/suggester type for easier parsing by clients.
MasterFaultDetection can start as soon as the first cluster state has been processed as connection to the master is guaranteed.
The Java High Level REST client continues to make progress, with xcontent parsing added for suggesters and the main response.
The task description for indexing requests only includes the request body if it is smaller than 2kB.
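To show why the typed_keys parameter from the list above helps clients, here is a small parsing sketch. It assumes the type and the name are joined by a `#` delimiter (for example `sterms#top_tags`); that delimiter, the helper names, and the sample response fragment are assumptions for illustration rather than something stated in this post.

```python
def split_typed_key(key, delimiter="#"):
    """Split 'type#name' into (type, name); keys without a prefix get type None."""
    type_name, sep, name = key.partition(delimiter)
    if not sep:
        return None, key
    return type_name, name


def index_by_name(aggregations):
    """Map each user-chosen name to (type, body) so a client can pick a parser."""
    result = {}
    for key, body in aggregations.items():
        type_name, name = split_typed_key(key)
        result[name] = (type_name, body)
    return result


# Hypothetical response fragment.
aggregations = {
    "sterms#top_tags": {"buckets": []},
    "avg#rating": {"value": 4.2},
    "untyped": {},
}
print(index_by_name(aggregations))
```

With the type carried in the key, a client no longer has to guess the shape of each aggregation body from its contents.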

Changes in 6.0:

SocketPermissions can be restricted to just Netty, instead of granting to the whole transport module.
InternalSearchHits (the only implementation of SearchHits) has been folded into its interface.
Methods requiring connect permissions are now forbidden.
Doc values for date fields in scripts will now return ReadableDateTime objects instead of long values.

Coming up:

The translog is changing from being file oriented to becoming sequence-number oriented.

Apache Lucene

The Lucene 6.4.1 release bits were set free and the 5.5.4 bug fix release vote is now underway
Term filters are so cheap that we should never cache them, leaving space for more complex queries, and we should also cache compound filters earlier than their sub-clauses to improve cache efficiency
The near-real-time document suggester should allow for optionally filtering out duplicates
Unmapping files opened with MMapDirectory on close will also work for Java 9 and Lucene 5.5.4
An intermittent NullPointerException caused by graduating the new IndexOrDocValuesQuery from sandbox to core has been fixed
The APIs to track external data structures along with Lucene's LeafReaders can be greatly improved for Lucene 7.0
When index files go missing we should throw a CorruptIndexException, not a NoSuchFileException
SortedNumericDocValues can soon be wrapped as a DoubleValueSource or LongValueSource using a selector, making their values accessible to things like function queries and expressions
MemoryIndex is being modernized: it now respects omitNorms, directly implements the new doc values iterators and properly implements per-field postings payloads
ComplexPhraseQueryParser and AsciiFoldingFilter struggle to work together
TermGroupFacetCollector is tricky to use if you also want to sort
It is far too easy to accidentally create an index that messes up block joins
UnifiedHighlighter fails to highlight all terms from SpanNearQuery
Block join queries should not try to track their original un-rewritten forms
Java 8's UnaryOperator confuses our javadocs checker
OneMergeWrappingMergePolicy lets you change each merge the merge policy chose before it's executed
Images referenced by the javadocs of the legacy numerics range filter are now moved to the right place
BKDReader, used to visit matching dimensional points for a query, now more efficiently pre-allocates up front the expected number of hits
FilterCodecReader was failing to override and delegate all of super's methods
FilterScorer is also failing to override and delegate all of super's methods
Query parsers now create much more efficient queries when they encounter a graph token stream to avoid combinatoric explosion
A tricky test failure was due to the test accidentally creating single-document segments

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!
