Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Molly, Elasticsearch and Data Replication https://t.co/zQ49fAzewk . Thank you Kamala and @palvaro . It was fun working with you.
— Boaz Leskes (@bleskes) February 7, 2017
Faster range and geo-distance queries
Range and geo-distance queries today always execute using Lucene's BKD trees. This gives the best performance when ranges are run on their own, but it can be slow when the result set of the range needs to be intersected with a selective query: the range query still has to visit all documents that match the range, and this number might be much higher than the number of matches of the other clauses. As of 5.4, if the range query is not the most selective part of the query, it will execute using doc values, which allows for much faster query execution when only a minority of matches need to be verified. This change has the potential to make some queries tens of times faster.
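To make the trade-off concrete, here is a minimal sketch in Python. It is a toy model, not how Lucene implements it: the NumericField class, its sorted list standing in for the BKD tree, and the conjunction helper are all hypothetical names invented for illustration. The idea it shows is the cost-based choice: if the range matches fewer documents than the other clause, walk the index; otherwise check only the other clause's candidates against per-document values.

```python
import bisect


class NumericField:
    """Toy stand-in for one numeric field (hypothetical, for illustration only).

    A sorted (value, doc) list plays the role of the points index, and a
    per-document list plays the role of doc values.
    """

    def __init__(self, values_by_doc):
        self.doc_values = list(values_by_doc)
        self.index = sorted((v, d) for d, v in enumerate(self.doc_values))
        self._keys = [v for v, _ in self.index]

    def range_cost(self, lo, hi):
        # Number of documents in [lo, hi], computed without visiting them.
        return bisect.bisect_right(self._keys, hi) - bisect.bisect_left(self._keys, lo)

    def docs_from_index(self, lo, hi):
        # Visits every document that matches the range.
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        return {d for _, d in self.index[i:j]}

    def matches(self, doc, lo, hi):
        # Verifies a single candidate document using its stored value.
        return lo <= self.doc_values[doc] <= hi


def conjunction(field, lead_docs, lo, hi):
    """Intersect a range on `field` with the matches of another clause."""
    if field.range_cost(lo, hi) <= len(lead_docs):
        # The range is the most selective clause: walk the index.
        hits = sorted(field.docs_from_index(lo, hi) & set(lead_docs))
        return "index", hits
    # Another clause is more selective: only verify its candidates.
    hits = sorted(d for d in lead_docs if field.matches(d, lo, hi))
    return "doc_values", hits
```

With 1000 documents and a selective clause that matches only three of them, a wide range is verified through doc values (three checks), while a range that matches a single document is still answered from the index.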
Faster nested queries
The relational structure that is created by nested fields is totally opaque on the low level: nested documents are just regular documents to Lucene. As a consequence, Elasticsearch automatically applies a filter that excludes nested docs to all queries that are executed. This logic is being improved by applying the filter using a FILTER clause rather than a MUST_NOT clause (since positive clauses are faster as they can skip more efficiently) as well as only adding the filter when needed. For instance if the query is a term query on a field that may only occur in root documents, there is no need to exclude nested documents since they cannot match anyway.
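The "only when needed" rule can be sketched roughly as follows. This is a simplified illustration in Python, not Elasticsearch's actual logic or query DSL: the helper names, the dict-based query representation and the ROOT_DOCS placeholder are all assumptions made for the example.

```python
# Placeholder for "matches root documents only" (hypothetical, not real DSL).
ROOT_DOCS = "ROOT_DOCS"


def may_occur_in_nested_docs(field, nested_paths):
    """True if `field` lives at or under one of the nested object paths."""
    return any(field == path or field.startswith(path + ".") for path in nested_paths)


def wrap_term_query(field, value, nested_paths):
    """Add the root-documents filter to a term query only when it is needed."""
    term = {"term": (field, value)}
    if not may_occur_in_nested_docs(field, nested_paths):
        # The field only exists in root documents, so nested documents
        # cannot match and no extra clause is required.
        return term
    # Positive FILTER clause instead of a MUST_NOT on nested documents.
    return {"must": [term], "filter": [ROOT_DOCS]}
```

For a mapping with a single nested path "comments", a term query on "title" is returned unchanged, while a term query on "comments.author" gets the extra filter clause.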
Until now, search phase execution has always been hardcoded. You might for instance know about QUERY_THEN_FETCH or DFS_QUERY_THEN_FETCH. Phases are currently being detached in order to make things easier to unit test, but this also means it will be easier to add new phases to the execution of search requests in the future.
Better query generation with multi-term synonyms
The progress in graph queries, such as multi-token synonyms, continues! Today, in Lucene 6.4.x and Elasticsearch 5.2.0, when the query parser sees that the search-time analyzer created a token graph for a given query, it enumerates all unique paths through the graph, and then creates a big BooleanQuery with each full path analyzed as a sub-query. But this is dangerous: the number of unique paths can grow exponentially in the number of tokens, and while that is unlikely to happen in actual queries, it is still possible. In Lucene we generally try hard to prevent any "adversarial" cases that could lead to denial of service, and so this exciting and tricky change makes graph queries much faster (and safer) by pre-analyzing the graph for its cut vertices, and then directly creating a BooleanQuery, possibly with nested clauses. Beyond just a fun optimization, the change also alters hit scoring, how settings like minimumShouldMatch interact with synonyms, and what boolean operator is used for the tokens inserted by a synonym when the user's query is not a phrase query.
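The difference between the two strategies can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not Lucene's implementation: the token graph is modeled as a dict from an integer state to (token, next_state) edges, every edge goes forward, every state lies on some path from start to end, and the function names are made up for this example. A state is treated as a cut vertex when no edge jumps over it, and paths are then enumerated only between consecutive cut vertices.

```python
def enumerate_paths(edges, start, end):
    """Return every token sequence from start to end (can grow exponentially)."""
    if start == end:
        return [[]]
    paths = []
    for token, nxt in edges.get(start, []):
        for rest in enumerate_paths(edges, nxt, end):
            paths.append([token] + rest)
    return paths


def cut_states(edges, start, end):
    """States that every start-to-end path must pass through."""
    jumped_over = set()
    for src, outs in edges.items():
        for _, dst in outs:
            jumped_over.update(range(src + 1, dst))
    return [s for s in range(start, end + 1) if s not in jumped_over]


def build_clauses(edges, start, end):
    """One clause per segment between cut states, each a list of alternatives."""
    cuts = cut_states(edges, start, end)
    return [enumerate_paths(edges, a, b) for a, b in zip(cuts, cuts[1:])]


def synonym_chain(k):
    """k consecutive words, each with a two-token synonym (made-up data)."""
    edges = {}
    for i in range(k):
        s = 2 * i
        edges[s] = [(f"w{i}", s + 2), (f"a{i}", s + 1)]
        edges[s + 1] = [(f"b{i}", s + 2)]
    return edges, 0, 2 * k
```

For a chain of 10 such words, full enumeration produces 1024 paths, whereas splitting at the cut states produces 10 clauses with 2 alternatives each, so the query size grows linearly rather than exponentially with the number of synonym sites.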
Kibana has to embed a lot of client-side logic in order to select which indices to query using the field-stats API. We are working on moving the burden to Elasticsearch by providing a simpler API that only says which fields may be searched or aggregated, as well as making it cheap to query indices that do not have matches so that Kibana could always send queries to all indices without having to worry about the timestamp range.
Changes in 5.3:
- Cluster state appliers (which make changes based on the cluster state) can no longer sample not-yet-applied cluster states, but appliers can create an observer when they need to wait for specific changes in a new cluster state.
- Connections to new nodes can now happen in parallel.
- Upgraded to Lucene 6.4.1, which fixes two memory leaks.
- Secure settings are now validated at startup.
- The backport of the change to execute index/delete/update requests via the bulk API introduced a bug in versioning.
- The bulk processor now requires a content-type when accepting bytes to avoid content detection.
- Cluster allocation explain should not return an empty body when an exception is thrown.
- The field-collapsing feature no longer uses blocking calls to populate its inner hits.
- Bulk and msearch APIs support the ndjson content-type (and have deprecated support for ldjson).
- The cross-cluster client can now be disabled on a per-node basis.
- The search.highlight.term_vector_multi_value setting was not exposed.
- The internal-only QUERY_AND_FETCH search type has been removed.
- The include_in_all parameter has been deprecated and will not be allowed on new indices in 6.0.0.
- The reduce phase of aggs, suggesters, and profiles has been split from the merge phase of fetched hits.
- HEAD requests should return the correct content-length header, fixed template, index, alias,
- Upgrade Lucene to 6.5.0-SNAPSHOT.
- The typed_keys parameter to search requests will prefix agg and suggester names with the agg/suggester type for easier parsing by clients.
- MasterFaultDetection can start as soon as the first cluster state has been processed as connection to the master is guaranteed.
- The Java High Level REST client continues to make progress with: xcontent parsing for suggesters and the main response.
- The task description for indexing requests only includes the request body if it is smaller than 2kB.
- SocketPermissions can be restricted to just Netty, instead of granting to the whole transport module.
- InternalSearchHits (the only implementation of SearchHits) has been folded into its interface.
- Methods requiring connect permissions are now forbidden.
- Doc values for date fields in scripts will now return ReadableDateTime objects instead of long values.
- The translog is changing from being file oriented to becoming sequence-number oriented.
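
As an illustration of why the typed_keys prefix in the list above helps clients, here is a small Python sketch of how a client might split such keys. It assumes the prefix and the name are separated by "#" (for example "sterms#genres"); the helper names and the sample response are made up for this example and are not part of any client library.

```python
def split_typed_key(key, delimiter="#"):
    """Split 'type#name' into (type, name); the type is None if no prefix is present."""
    kind, sep, name = key.partition(delimiter)
    if not sep:
        return None, key
    return kind, name


def index_by_name(section):
    """Map each agg or suggester name to its (type, body) pair."""
    result = {}
    for key, body in section.items():
        kind, name = split_typed_key(key)
        result[name] = (kind, body)
    return result


# Made-up response fragment, only to show the shape of the keys.
sample_aggregations = {
    "sterms#genres": {"buckets": []},
    "avg#price": {"value": 3.5},
}
parsed = index_by_name(sample_aggregations)
```

Knowing the type up front lets a client pick the right parser for each body instead of guessing from its structure.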
Apache Lucene
- The Lucene 6.4.1 release bits were set free and the 5.5.4 bug fix release vote is now underway
- Term filters are so cheap that we should never cache them, leaving space for more complex queries, and we should also cache compound filters earlier than their sub-clauses to improve cache efficiency
- The near-real-time document suggester should allow for optionally filtering out duplicates
- Unmapping files opened with MMapDirectory on close will also work for Java 9 and Lucene 5.5.4
- An intermittent NullPointerException caused by graduating the new IndexOrDocValuesQuery from sandbox to core has been fixed
- The APIs to track external data structures along with Lucene's LeafReaders can be greatly improved for Lucene 7.0
- When index files go missing we should throw a CorruptIndexException, not a NoSuchFileException
- SortedNumericDocValues can soon be wrapped as a DoubleValueSource or LongValueSource using a selector, making their values accessible to things like function queries and expressions
- MemoryIndex is being modernized: it now respects omitNorms, directly implements the new doc values iterators and properly implements per-field postings payloads
- ComplexPhraseQueryParser and AsciiFoldingFilter struggle to work together
- TermGroupFacetCollector is tricky to use if you also want to sort
- It is far too easy to accidentally create an index that messes up block joins
- UnifiedHighlighter fails to highlight all terms from SpanNearQuery
- Block join queries should not try to track their original un-rewritten forms
- Java 8's UnaryOperator confuses our javadocs checker
- OneMergeWrappingMergePolicy lets you change each merge the merge policy chose before it's executed
- Images referenced by the javadocs of the legacy numerics range filter are now moved to the right place
- BKDReader, used to visit matching dimensional points for a query, now more efficiently pre-allocates up front the expected number of hits
- FilterCodecReader was failing to override and delegate all of super's methods
- FilterScorer is also failing to override and delegate all of super's methods
- Query parsers now create much more efficient queries when they encounter a graph token stream to avoid combinatoric explosion
- A tricky test failure was due to the test accidentally creating single-document segments
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!