This Week in Elasticsearch and Apache Lucene - 2017-01-30
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
On Immigration and Diversity https://t.co/xjzJvV9vej
— Shay Banon (@kimchy) January 29, 2017
Search-time field collapsing with paging
When you want to group results by a particular field, it is easy to use the power of a terms aggregation coupled with a top_hits aggregation underneath. This common feature is called field collapsing and we’ve decided to give it a boost!
As an aggregation, this feature is widely used but suffers from at least two limitations: it is impossible to page the results (one of the most discussed issues in ES), and the result is an approximation (the top group and the top hits can be inaccurate - a known limitation of the aggregation framework as we trade precision for speed).
To solve these two issues we’ve added a field collapsing feature targeted for search only. Now it is possible to group results by a particular field and to retrieve the top hits for each group in any search request. Similarly, it is possible to page through the results of a field collapsed search request like you would do for any search. This approach can be much faster than the top_hits aggregation solution because we apply the collapsing to the top search hits only. It’s less powerful than top_hits because the sorting of the group cannot be based on a separate computation for that group, but it’s also more precise.
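To make the search-time approach concrete, here is a minimal sketch of a collapsed, paged search request body built in Python. The collapse section with its field and inner_hits keys follows the request shape documented for the released feature; the field names user and likes, the inner hit name, and the helper function itself are assumptions made up for illustration, so check the reference documentation for your version before relying on it.

```python
import json


def collapsed_search_body(group_field, sort_field, page, page_size, hits_per_group):
    """Build a search body that collapses hits on group_field and pages the groups.

    Hypothetical helper: only the JSON shape is meant to mirror the search API.
    """
    return {
        "query": {"match_all": {}},
        # One top hit is kept per distinct value of group_field.
        "collapse": {
            "field": group_field,
            # Optionally expand each group with its own top hits.
            "inner_hits": {
                "name": "top_per_group",
                "size": hits_per_group,
                "sort": [{sort_field: "desc"}],
            },
        },
        "sort": [{sort_field: "desc"}],
        # Paging works on the collapsed groups, as for any other search.
        "from": page * page_size,
        "size": page_size,
    }


if __name__ == "__main__":
    body = collapsed_search_body("user", "likes", page=2, page_size=10, hits_per_group=3)
    print(json.dumps(body, indent=2))
```

The body would be sent to the regular search endpoint; unlike the terms plus top_hits aggregation, the third page of groups is requested simply by changing from.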
New simplified analysis chain in Lucene
This ambitious Lucene issue is exploring an alternative analysis architecture to replace Lucene's current analysis API components (Tokenizer, CharFilter, TokenFilter). The new Stage API is simpler to consume, with just reset and next methods, versus 5 methods today. Each analysis stage uses a write-once binding to define attributes, instead of the global AttributeFactory Lucene now uses, giving each stage full control over exactly what attributes the next stage can see. This also fixes the long-standing trap of failing to call clearAttributes in your tokenizer. Graph token filters are much easier to create, since position increment and length are replaced with an explicit to/from arc, and the synonym filter on this branch (finally!) can consume a graph, so you could run WordDelimiterFilter followed by SynonymFilter. Tokens are never removed by stages, but instead marked deleted using a new DeletedAttribute. The changes are being pushed to this branch, but plenty of work remains before this is committable!
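The shape of such a pipeline can be illustrated with a small Python sketch. This is not Lucene's Stage API (which is Java and still changing on the branch); the class names, the Token fields, and the analyze helper are all hypothetical. It only mirrors three ideas from the description above: each stage exposes just reset and next, tokens carry explicit from/to nodes instead of position increment and length, and a filtering stage marks tokens deleted rather than removing them.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Token:
    text: str
    from_node: int          # explicit graph arc: where the token starts
    to_node: int            # explicit graph arc: where the token ends
    deleted: bool = False   # marked instead of removed


class Stage:
    """Hypothetical stage interface with only two methods."""

    def reset(self, text: str) -> None:
        raise NotImplementedError

    def next(self) -> Optional[Token]:
        raise NotImplementedError


class WhitespaceStage(Stage):
    """Source stage: splits on whitespace, one arc per word."""

    def reset(self, text: str) -> None:
        self._words = text.split()
        self._pos = 0

    def next(self) -> Optional[Token]:
        if self._pos >= len(self._words):
            return None
        token = Token(self._words[self._pos], self._pos, self._pos + 1)
        self._pos += 1
        return token


class StopWordStage(Stage):
    """Filtering stage: marks stop words deleted but still passes them on."""

    def __init__(self, upstream: Stage, stop_words: Set[str]) -> None:
        self._upstream = upstream
        self._stop_words = stop_words

    def reset(self, text: str) -> None:
        self._upstream.reset(text)

    def next(self) -> Optional[Token]:
        token = self._upstream.next()
        if token is not None and token.text.lower() in self._stop_words:
            token.deleted = True
        return token


def analyze(stage: Stage, text: str) -> List[Token]:
    """Drain a stage chain into a list of tokens."""
    stage.reset(text)
    tokens = []
    token = stage.next()
    while token is not None:
        tokens.append(token)
        token = stage.next()
    return tokens


if __name__ == "__main__":
    for tok in analyze(StopWordStage(WhitespaceStage(), {"the"}), "the quick fox"):
        print(tok)
```

Because deleted tokens stay in the stream with their arcs intact, a later stage still sees the full graph and can decide for itself whether to skip them.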
Elasticsearch Core
Changes in 5.2:
- Queries which timed out should not be stored in the query cache.
- Plugins need a flag to know when certain transport actions should be executed even when the thread pool queue is full.
Changes in 5.x:
- New more efficient search-time field collapsing.
- Lenient parsing of booleans is deprecated. Booleans will only accept true, false, "true", or "false" in 6.0.
- The S3 plugin is seeing some cleanup: auto-bucket creation is deprecated, and the parameter is deprecated in favour of the more flexible parameter. Configuration now uses named configs only.
- The EC2 discovery plugin can now read hostnames from AWS instance tags, making it easier to handle situations where no public DNS is available.
- The Java High Level REST client continues to make progress: it can now parse ElasticsearchExceptions.
- The delete-by-query helpers have been moved into core, which allows other plugins to use DBQ, as well as bulk-scroll style operations.
- We have taken the first step in moving from predefined DFS, QUERY, and FETCH search phases to a more flexible N-phases approach. The deprecated DFS_QUERY_AND_FETCH phase has been removed.
- Cancelling tasks with no cancellable children can cause the cancel request to hang.
- The docs now include a complete reference for methods and classes exposed in the painless API.
- Docker cgroups are mounted in the wrong place and require a hack to make them readable.
- The ingest date processor now resets its default year every time the pipeline is run, for dates without a year.
- Range queries which are rewritten to match-all now normalize the time zone and date format to increase cache hits.
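The strict boolean rule from the list above (only true, false, "true", or "false" accepted in 6.0) can be sketched as a small Python function. This is an illustration of the rule as stated, not Elasticsearch's parsing code; the function name and the rejected sample inputs are assumptions.

```python
def parse_strict_boolean(value):
    """Accept only true, false, "true" or "false"; reject everything else.

    Hypothetical helper mirroring the rule described for 6.0.
    """
    # Identity checks so that 1 and 0 are not mistaken for booleans.
    if value is True or value is False:
        return value
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"cannot parse {value!r} as a boolean")


if __name__ == "__main__":
    for candidate in (True, "false", "yes", 1):
        try:
            print(repr(candidate), "->", parse_strict_boolean(candidate))
        except ValueError as err:
            print(repr(candidate), "-> rejected:", err)
```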
Changes in master:
- The HDFS repository plugin now wraps socket connect ops in doPrivileged blocks, and the URL repository has moved to the new repository-url module.
- Sequence-ID based recovery now allows an out-of-sync replica to ask the primary for all sequence IDs from the replica's local checkpoint (if the sequence IDs are available); otherwise file-based recovery is used instead. Further changes will strengthen the guarantees.
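The recovery decision described in the last item can be sketched roughly as follows. This is a simplified model, not Elasticsearch's implementation: the function name, the dictionary of retained operations, and the assumption that the replica needs every operation strictly above its local checkpoint are all made up for illustration.

```python
def plan_recovery(replica_local_checkpoint, primary_max_seq_no, primary_retained_ops):
    """Choose between replaying operations and copying files.

    primary_retained_ops maps sequence number -> operation still held by the
    primary (hypothetical structure). Returns ("ops", [...]) when every
    operation above the replica's checkpoint is available, else ("files", []).
    """
    needed = range(replica_local_checkpoint + 1, primary_max_seq_no + 1)
    if all(seq_no in primary_retained_ops for seq_no in needed):
        # Cheap path: replay only the missing operations, in order.
        return "ops", [primary_retained_ops[seq_no] for seq_no in needed]
    # Fallback: at least one operation is gone, so copy segment files instead.
    return "files", []


if __name__ == "__main__":
    retained = {5: "index doc A", 6: "delete doc B", 7: "index doc C"}
    print(plan_recovery(4, 7, retained))  # replica is only slightly behind
    print(plan_recovery(2, 7, retained))  # operations 3 and 4 are no longer held
```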
Coming up:
- A new Unified Highlighter fixes almost all the problems with previous highlighters, and will work with indexed term vectors or posting lists with offsets, or will reanalyze on the fly if needed.
- The translog will become "sequence-number aware", so that multiple translog files can be kept around to ensure that all documents either exist in all shards in Lucene, or can be replayed from the translog.
Apache Lucene
- Lucene 6.4.0 has been released, but a 6.4.1 bug-fix release is likely coming soon to fix two memory leak issues
- An entirely new analysis API will make it easier to build graph token filters
- SimplePatternTokenizer uses Lucene's fast determinized automaton implementation to locate tokens, but requires that the regular expression can be compactly determinized (not all can!)
- Doc values, in addition to dimensional points, can now be used for distance filtering giving possibly very large speedups for restrictive queries that apply a non-restrictive distance filter
- LatLonPoint's distance filter is optimized to perform fewer distance checks showing a nice ~10% speedup on Lucene's nightly geo benchmarks, and can be further optimized to take longitude wrapping at the dateline into account
- LatLonPoint's polygon filter is also optimized using the same approach
- ToParentBlockJoinCollector, used to collect parent and child documents with joins, will be removed and replaced with a new query wrapper to collect children separately after parents
- ToParentBlockJoinQuery should implement two-phased iteration so boolean queries with MUST clauses can be faster
- Randomizedtesting is upgraded to 2.5.0 to address problems deleting temporary directories after tests finish
- Our geo distance queries could get faster with a better first pass approximation than a bounding box
- We now have two possible approaches to speed up ASCIIFoldingFilter by simplifying its overly massive method
- If we expect too many points will match a geo distance filter then we should speed it up by inverting its logic to find points beyond the requested distance, like we did for 1D ranges
- We will deprecate the postings-based GeoPointField since the dimensional points implementation is faster and smaller
- When index files go missing we should throw a CorruptIndexException, not a NoSuchFileException
- The XML query parser should implement the SpanQueryBuilder interface
- The download link on Lucene's web site is annoying
- A new expert IndexWriter API tells you all unique field names in the index
- The changes-to-html generator (necessary when releasing Lucene) should not rely on Jira being up
- The NullPointerException caused by graduating IndexOrDocValuesQuery to core has been fixed and its single-valued optimization re-enabled
- The recent optimization to estimating how many points a given shape will visit caused some fun test failures
- The Kuromoji (Japanese) tokenizer fails if the user tries to include # in their dictionary
- We had to tweak our javadocs build process to sidestep a new bug in Java 8u121
- Lucene may soon have an implementation of a logistic regression classifier
- ComplexPhraseQueryParser is confused if you pass a phrase with a trailing escaped colon
- We tried to make it easier to get the per-hit matching scorers in DisjunctionScorer, but it led to problems and had to be reverted for now
- CompressingStoredFieldsFormat, the default format for Lucene indices, reclaims memory more aggressively
- Queries were accidentally holding onto way too much memory when they were cached in some cases, causing us to kick off discussions for a 6.4.1 bugfix release
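Several of the geo items above revolve around the same pattern: a cheap first-pass approximation (a bounding box) followed by an exact distance check only for the candidates that survive. The Python sketch below illustrates that generic pattern; it is not Lucene's LatLonPoint code. All function names are hypothetical, and it deliberately ignores the dateline wrapping and pole cases that the real implementation has to handle.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Exact (spherical) great-circle distance: the expensive check."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(lat, lon, radius_km):
    """Box enclosing the circle; assumes it stays clear of poles and dateline."""
    angular = radius_km / EARTH_RADIUS_KM
    dlat = math.degrees(angular)
    dlon = math.degrees(math.asin(math.sin(angular) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def distance_filter(points, lat, lon, radius_km):
    """Return (matching points, number of exact distance checks performed)."""
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_km)
    matches, exact_checks = [], 0
    for p_lat, p_lon in points:
        # Phase 1: cheap comparison against the box.
        if not (min_lat <= p_lat <= max_lat and min_lon <= p_lon <= max_lon):
            continue
        # Phase 2: exact check, only for candidates inside the box.
        exact_checks += 1
        if haversine_km(lat, lon, p_lat, p_lon) <= radius_km:
            matches.append((p_lat, p_lon))
    return matches, exact_checks


if __name__ == "__main__":
    candidates = [(0.5, 0.5), (0.85, 0.85), (2.0, 0.0)]
    print(distance_filter(candidates, 0.0, 0.0, 100.0))
```

The second candidate sits in a corner of the box but outside the circle, which is exactly the kind of wasted exact check that a tighter first-pass approximation would avoid.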
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!