30 January 2017

This Week in Elasticsearch and Apache Lucene - 2017-01-30


Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

On Immigration and Diversity https://t.co/xjzJvV9vej
— Shay Banon (@kimchy) January 29, 2017

Search-time field collapsing with paging

When you want to group results by a particular field, it is easy to use the power of a terms aggregation coupled with a top_hits aggregation underneath. This common feature is called field collapsing and we’ve decided to give it a boost!

As an aggregation, this feature is widely used but suffers from at least two limitations: it is impossible to page the results (one of the most discussed issues in ES), and the result is an approximation (the top group and the top hits can be inaccurate - a known limitation of the aggregation framework as we trade precision for speed).

To solve these two issues we’ve added a field collapsing feature that targets search only. It is now possible to group results by a particular field and to retrieve the top hits for each group in any search request. Similarly, it is possible to page through the results of a field-collapsed search request as you would for any other search. This approach can be much faster than the top_hits aggregation solution because we apply the collapsing to the top search hits only. It’s less powerful than top_hits because the sorting of the groups cannot be based on a separate computation for each group, but it’s also more precise.
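As a rough illustration of the idea, here is a minimal Python sketch that collapses an already-sorted list of hits on one field and then pages through the groups. It is a toy model, not Elasticsearch's implementation: the function names, the `user` field and the sample hits are assumptions made for the example. The `collapse` key in the request body follows the search option as it was released; everything else in that body is hypothetical.

```python
from typing import Any, Dict, List


def collapse_hits(sorted_hits: List[Dict[str, Any]], field: str,
                  from_: int = 0, size: int = 10) -> List[Dict[str, Any]]:
    """Keep the best-ranked hit per distinct value of `field`, then page.

    `sorted_hits` must already be ordered by relevance (best first).
    Hits that lack the field all fall into a single `None` group.
    """
    seen = set()
    collapsed = []
    for hit in sorted_hits:
        key = hit.get(field)
        if key in seen:
            continue  # a better-ranked hit already represents this group
        seen.add(key)
        collapsed.append(hit)
        if len(collapsed) >= from_ + size:
            break  # enough groups collected to serve the requested page
    return collapsed[from_:from_ + size]


def build_collapse_request(field: str, from_: int = 0, size: int = 10) -> Dict[str, Any]:
    """Build a search request body that collapses on `field` (query is a placeholder)."""
    return {
        "query": {"match_all": {}},
        "collapse": {"field": field},
        "from": from_,
        "size": size,
    }


if __name__ == "__main__":
    hits = [
        {"id": 1, "user": "a", "score": 9.0},
        {"id": 2, "user": "b", "score": 8.0},
        {"id": 3, "user": "a", "score": 7.0},
        {"id": 4, "user": "c", "score": 6.0},
    ]
    print(collapse_hits(hits, "user", from_=0, size=2))
    print(collapse_hits(hits, "user", from_=2, size=2))
    print(build_collapse_request("user", from_=2, size=2))
```

Because each group is represented by its best-ranked hit, paging is simply a slice over the collapsed list, which is what makes `from` and `size` behave as they do for an ordinary search.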

New simplified analysis chain in Lucene

This ambitious Lucene issue is exploring an alternative analysis architecture to replace Lucene's current analysis API components (Tokenizer, CharFilter, TokenFilter). The new Stage API is simpler to consume, with just reset and next methods, versus 5 methods today. Each analysis stage uses a write-once binding to define attributes, instead of the global AttributeFactory Lucene now uses, giving each stage full control over exactly what attributes the next stage can see. This also fixes the long-standing trap of failing to call clearAttributes in your tokenizer. Graph token filters are much easier to create, since position increment and length are replaced with an explicit to/from arc, and the synonym filter on this branch (finally!) can consume a graph, so you could run WordDelimiterFilter followed by SynonymFilter. Tokens are never removed by stages, but instead marked deleted using a new DeletedAttribute. The changes are being pushed to this branch, but plenty of work remains before this is committable!
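To make the shape of such a pipeline concrete, here is a toy Python model of stages that expose only reset and next, tokens that carry explicit from/to arcs, and a stop-word stage that marks tokens as deleted instead of dropping them. This is not Lucene's Stage API, which is Java and still in flux on the branch; every class and field name below is a hypothetical stand-in chosen for the illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Token:
    text: str
    from_node: int   # node the arc leaves (hypothetical stand-in for the "from" arc end)
    to_node: int     # node the arc enters (hypothetical stand-in for the "to" arc end)
    deleted: bool = False


class WhitespaceStage:
    """Source stage: splits text on whitespace, one arc per word."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._words: List[str] = []
        self._pos = 0

    def reset(self) -> None:
        self._words = self._text.split()
        self._pos = 0

    def next(self) -> Optional[Token]:
        if self._pos >= len(self._words):
            return None  # end of stream
        token = Token(self._words[self._pos], self._pos, self._pos + 1)
        self._pos += 1
        return token


class StopwordStage:
    """Marks stop words as deleted; it never removes a token from the stream."""

    def __init__(self, upstream, stopwords: Iterable[str]) -> None:
        self._upstream = upstream
        self._stopwords = {word.lower() for word in stopwords}

    def reset(self) -> None:
        self._upstream.reset()

    def next(self) -> Optional[Token]:
        token = self._upstream.next()
        if token is not None and token.text.lower() in self._stopwords:
            token.deleted = True
        return token


def run(stage) -> List[Token]:
    """Drain a stage chain into a list."""
    stage.reset()
    tokens = []
    while True:
        token = stage.next()
        if token is None:
            return tokens
        tokens.append(token)


if __name__ == "__main__":
    for tok in run(StopwordStage(WhitespaceStage("the quick fox"), ["the"])):
        print(tok)
```

Keeping deleted tokens in the stream means a downstream stage still sees every arc of the graph and can decide for itself whether to skip the deleted ones.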

Elasticsearch Core

Changes in 5.2:

Queries which timed out should not be stored in the query cache.
Plugins need a flag to know when certain transport actions should be executed even when the thread pool queue is full.

Changes in 5.x:

New more efficient search-time field collapsing.
Lenient parsing of booleans is deprecated. Booleans will only accept true, false, "true", or "false" in 6.0.
The S3 plugin is seeing some cleanup: auto-bucket creation is deprecated, and the parameter is deprecated in favour of the more flexible parameter. Configuration now uses named configs only.
The EC2 discovery plugin can now read hostnames from AWS instance tags, making it easier to handle situations where no public DNS is available.
The Java High Level REST client continues to make progress: it can now parse ElasticsearchExceptions.
The delete-by-query helpers have been moved into core, which allows other plugins to use DBQ, as well as bulk-scroll style operations.
We have taken the first step in moving from predefined DFS, QUERY, and FETCH search phases to a more flexible N-phases approach. The deprecated DFS_QUERY_AND_FETCH phase has been removed.
Cancelling tasks with no cancellable children can cause the cancel request to hang.
The docs now include a complete reference for methods and classes exposed in the painless API.
Docker cgroups are mounted in the wrong place and require a hack to make them readable.
The ingest date processor now resets its default year every time the pipeline is run, for dates without a year.
Range queries which are rewritten to match-all now normalize the time zone and date format to increase cache hits.
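The stricter boolean rule from the list above can be sketched in a few lines of Python. This only illustrates the accepted values named there (true, false, "true", "false"); the function name is hypothetical and the code is not taken from Elasticsearch.

```python
def parse_strict_boolean(value):
    """Accept only true, false, "true" or "false"; reject everything else.

    Hypothetical helper that mirrors the strict rule described for 6.0.
    """
    if value is True or value is False:
        return value
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"cannot parse {value!r} as a boolean; use true or false")


if __name__ == "__main__":
    print(parse_strict_boolean("true"))
    print(parse_strict_boolean(False))
    try:
        parse_strict_boolean("yes")
    except ValueError as err:
        print(err)
```

Values that lenient parsing would typically tolerate, such as "yes", "on", 0 or 1, raise an error here instead of being coerced silently.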

Changes in master:

The HDFS repository plugin now wraps socket connect ops in doPrivileged blocks, and the URL repository has moved to the new repository-url module.
Sequence-ID based recovery now allows an out-of-sync replica to ask the primary for all sequence IDs from the replica's local checkpoint (if the sequence IDs are available); otherwise, file-based recovery is used instead. Further changes will strengthen the guarantees.
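The choice between the two recovery paths described above can be sketched as a small decision function. This is a simplified model under stated assumptions, not the Elasticsearch implementation: it assumes the replica needs every operation above its local checkpoint up to the primary's highest sequence ID, and the function and parameter names are hypothetical.

```python
from typing import List, Set, Tuple


def plan_recovery(replica_local_checkpoint: int,
                  primary_max_seq_no: int,
                  primary_available: Set[int]) -> Tuple[str, List[int]]:
    """Pick operation-based recovery when the primary still holds every needed op.

    Returns ("operations", [seq_nos to replay]) or ("files", []).
    """
    # Assumption: everything above the replica's local checkpoint must be replayed.
    needed = list(range(replica_local_checkpoint + 1, primary_max_seq_no + 1))
    if all(seq_no in primary_available for seq_no in needed):
        return "operations", needed
    # At least one needed operation is gone, so fall back to copying files.
    return "files", []


if __name__ == "__main__":
    print(plan_recovery(5, 8, {4, 5, 6, 7, 8}))
    print(plan_recovery(5, 8, {7, 8}))
```

A single missing operation is enough to force the file-based path in this sketch, since replaying an incomplete history would leave the replica out of sync.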

Coming up:

A new Unified Highlighter fixes almost all the problems with previous highlighters, and will work with indexed term vectors or posting lists with offsets, or will reanalyze on the fly if needed.
The translog will become "sequence-number aware", so that multiple translog files can be kept around to ensure that all documents either exist in all shards in Lucene, or can be replayed from the translog.

Apache Lucene

Lucene 6.4.0 has been released, but a 6.4.1 bug-fix release is likely coming soon to fix two memory leak issues
An entirely new analysis API will make it easier to build graph token filters
SimplePatternTokenizer uses Lucene's fast determinized automaton implementation to locate tokens, but requires that the regular expression can be compactly determinized (not all can!)
Doc values, in addition to dimensional points, can now be used for distance filtering giving possibly very large speedups for restrictive queries that apply a non-restrictive distance filter
LatLonPoint's distance filter is optimized to perform fewer distance checks showing a nice ~10% speedup on Lucene's nightly geo benchmarks, and can be further optimized to take longitude wrapping at the dateline into account
LatLonPoint's polygon filter is also optimized using the same approach
ToParentBlockJoinCollector, used to collect parent and child documents with joins, will be removed and replaced with a new query wrapper to collect children separately after parents
ToParentBlockJoinQuery should implement two-phased iteration so boolean queries with MUST clauses can be faster
Randomizedtesting is upgraded to 2.5.0 to address problems deleting temporary directories after tests finish
Our geo distance queries could get faster with a better first pass approximation than a bounding box
We now have two possible approaches to speed up ASCIIFoldingFilter by simplifying its overly massive method
If we expect too many points will match a geo distance filter then we should speed it up by inverting its logic to find points beyond the requested distance, like we did for 1D ranges
We will deprecate the postings-based GeoPointField since the dimensional points implementation is faster and smaller
When index files go missing we should throw a CorruptIndexException, not a NoSuchFileException
The XML query parser should implement the SpanQueryBuilder interface
The download link on Lucene's web site is annoying
A new expert IndexWriter API tells you all unique field names in the index
The changes-to-html generator (necessary when releasing Lucene) should not rely on Jira being up
The NullPointerException caused by graduating IndexOrDocValuesQuery to core has been fixed and its single-valued optimization re-enabled
The recent optimization to estimating how many points a given shape will visit caused some fun test failures
The Kuromoji (Japanese) tokenizer fails if the user tries to include # in their dictionary
We had to tweak our javadocs build process to sidestep a new bug in Java 8u121
Lucene may soon have an implementation of a logistic regression classifier
ComplexPhraseQueryParser is confused if you pass a phrase with a trailing escaped colon
We tried to make it easier to get the per-hit matching scorers in DisjunctionScorer, but it led to problems and had to be reverted for now
CompressingStoredFieldsFormat, the default format for Lucene indices, reclaims memory more aggressively
Queries were accidentally holding onto way too much memory when they were cached in some cases, causing us to kick off discussions for a 6.4.1 bugfix release
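Several of the geo items above revolve around a cheap first-pass approximation followed by exact distance checks. The Python sketch below shows that general two-step pattern with a bounding box and the haversine formula. It is a generic illustration rather than Lucene's code: the function names and the mean Earth radius constant are assumptions, and the box falls back to the full longitude range near the poles or the dateline instead of handling wrapping precisely.

```python
import math
from typing import Iterable, List, Tuple

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def bounding_box(lat: float, lon: float, radius_m: float) -> Tuple[float, float, float, float]:
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing the distance circle."""
    ang = radius_m / EARTH_RADIUS_M
    min_lat = lat - math.degrees(ang)
    max_lat = lat + math.degrees(ang)
    if min_lat <= -90.0 or max_lat >= 90.0:
        # The circle touches a pole: every longitude can match.
        return max(min_lat, -90.0), min(max_lat, 90.0), -180.0, 180.0
    dlon = math.degrees(math.asin(math.sin(ang) / math.cos(math.radians(lat))))
    min_lon, max_lon = lon - dlon, lon + dlon
    if min_lon < -180.0 or max_lon > 180.0:
        # Crossing the dateline: be conservative rather than wrap.
        min_lon, max_lon = -180.0, 180.0
    return min_lat, max_lat, min_lon, max_lon


def within_distance(points: Iterable[Tuple[float, float]], lat: float, lon: float,
                    radius_m: float) -> List[Tuple[float, float]]:
    """Cheap box test first, exact haversine check only for the survivors."""
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_m)
    matches = []
    for p_lat, p_lon in points:
        if not (min_lat <= p_lat <= max_lat and min_lon <= p_lon <= max_lon):
            continue  # rejected by the first-pass approximation
        if haversine_m(lat, lon, p_lat, p_lon) <= radius_m:
            matches.append((p_lat, p_lon))
    return matches


if __name__ == "__main__":
    candidates = [(48.8049, 2.1204), (51.5074, -0.1278), (49.0566, 2.6522)]
    print(within_distance(candidates, 48.8566, 2.3522, 25_000.0))
```

Points near the corners of the box pass the first test but fail the exact check, which is the cost a tighter first-pass shape would reduce.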

Watch This Space

Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!
