This Week in Elasticsearch and Apache Lucene - 2017-01-30

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Search-time field collapsing with paging

When you want to group results by a particular field, it is easy to use the power of a terms aggregation coupled with a top_hits aggregation underneath. This common feature is called field collapsing, and we’ve decided to give it a boost!
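As a rough illustration of the aggregation-based approach, the sketch below builds a search request body that nests a top_hits aggregation under a terms aggregation. The helper name, the aggregation names ("groups", "top") and the "user" field are hypothetical and chosen only for this example; the body would be sent to the _search endpoint of an index of your choice.

```python
import json


def terms_top_hits_body(group_field, groups=10, hits_per_group=3):
    """Build a search body that groups documents by `group_field`.

    The terms aggregation creates one bucket per distinct value, and the
    nested top_hits aggregation returns the best hits inside each bucket.
    Aggregation names ("groups", "top") are arbitrary labels for this sketch.
    """
    return {
        "size": 0,  # skip the regular hits; only the aggregation is needed
        "aggs": {
            "groups": {
                "terms": {"field": group_field, "size": groups},
                "aggs": {
                    "top": {"top_hits": {"size": hits_per_group}},
                },
            }
        },
    }


# "user" is a hypothetical field name.
print(json.dumps(terms_top_hits_body("user"), indent=2))
```

Note that nothing in this body lets you ask for "the next page of groups", which is the paging limitation discussed in this section.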

As an aggregation, this feature is widely used but suffers from at least two limitations: it is impossible to page through the results (one of the most discussed issues in ES), and the result is an approximation (the top groups and their top hits can be inaccurate, which is a known limitation of the aggregation framework as we trade precision for speed).

To solve these two issues, we’ve added a field collapsing feature that targets search only. It is now possible to group results by a particular field and to retrieve the top hits for each group in any search request. Similarly, it is possible to page through the results of a field-collapsed search request as you would for any other search. This approach can be much faster than the top_hits aggregation solution because we apply the collapsing to the top search hits only. It’s less powerful than top_hits because the sorting of the groups cannot be based on a separate computation for each group, but it’s also more precise.
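A minimal sketch of what a collapsed, paged search request might look like follows. It assumes the request-body shape of the collapse feature as it was later documented ("collapse" with a "field", optional "inner_hits", and the usual "from"/"size" paging); the exact parameter names available at the time of this post may differ. The helper name, the "user" field and the "top_in_group" label are hypothetical.

```python
import json


def collapsed_search_body(query, collapse_field, page=0, page_size=10,
                          hits_per_group=0):
    """Build a search body that collapses hits on `collapse_field`.

    Paging uses the ordinary from/size parameters, so each page holds
    `page_size` groups. When `hits_per_group` is positive, inner_hits asks
    for that many top hits inside each group.
    """
    body = {
        "query": query,
        "collapse": {"field": collapse_field},
        "from": page * page_size,
        "size": page_size,
    }
    if hits_per_group > 0:
        # "top_in_group" is an arbitrary label for this sketch.
        body["collapse"]["inner_hits"] = {
            "name": "top_in_group",
            "size": hits_per_group,
        }
    return body


# Third page (zero-based page=2) of groups, three hits per group.
example = collapsed_search_body(
    {"match": {"message": "elasticsearch"}}, "user",
    page=2, page_size=10, hits_per_group=3,
)
print(json.dumps(example, indent=2))
```

Because the collapsing is expressed on the search request itself, moving to the next page is just a matter of increasing "from", exactly as for a regular search.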

New simplified analysis chain in Lucene

This ambitious Lucene issue is exploring an alternative analysis architecture to replace Lucene's current analysis API components (Tokenizer, CharFilter, TokenFilter). The new Stage API is simpler to consume, with just reset and next methods, versus 5 methods today. Each analysis stage uses a write-once binding to define attributes, instead of the global AttributeFactory Lucene now uses, giving each stage full control over exactly what attributes the next stage can see. This also fixes the long-standing trap of failing to call clearAttributes in your tokenizer. Graph token filters are much easier to create, since position increment and length are replaced with an explicit to/from arc, and the synonym filter on this branch (finally!) can consume a graph, so you could run WordDelimiterFilter followed by SynonymFilter. Tokens are never removed by stages, but instead marked deleted using a new DeletedAttribute. The changes are being pushed to this branch, but plenty of work remains before this is committable!
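To make the shape of these ideas more concrete, here is a loose Python analogy. It is not the Lucene branch's actual API (which is Java and still changing): every class, method signature and field name below is hypothetical. It only illustrates three points from the description above: stages expose just reset and next, tokens carry explicit from/to nodes rather than position increments, and a filtering stage marks tokens as deleted instead of removing them.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    from_node: int         # node where this token's arc starts
    to_node: int           # node where this token's arc ends
    deleted: bool = False  # set by a stage instead of dropping the token


class WhitespaceStage:
    """Source stage: splits text on whitespace into a linear chain of arcs."""

    def __init__(self):
        self._tokens = iter(())

    def reset(self, text):
        words = text.split()
        self._tokens = iter(
            [Token(word, i, i + 1) for i, word in enumerate(words)]
        )

    def next(self):
        # Returns the next token, or None when the input is exhausted.
        return next(self._tokens, None)


class StopStage:
    """Filtering stage: marks stop words as deleted but still passes them on."""

    def __init__(self, upstream, stopwords):
        self._upstream = upstream
        self._stopwords = {w.lower() for w in stopwords}

    def reset(self, text):
        self._upstream.reset(text)

    def next(self):
        token = self._upstream.next()
        if token is not None and token.term.lower() in self._stopwords:
            token.deleted = True
        return token


def analyze(stage, text):
    """Drive a stage chain to exhaustion and collect every token it emits."""
    stage.reset(text)
    tokens = []
    token = stage.next()
    while token is not None:
        tokens.append(token)
        token = stage.next()
    return tokens


for tok in analyze(StopStage(WhitespaceStage(), {"the"}), "the quick fox"):
    print(tok)
```

Since deleted tokens stay in the stream with their arcs intact, a later stage still sees the full graph and can decide for itself whether to skip them.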

Elasticsearch Core

Changes in 5.2:

Changes in 5.x:

Changes in master:

Coming up:

  • A new Unified Highlighter fixes almost all the problems with previous highlighters, and will work with indexed term vectors or posting lists with offsets, or will reanalyze on the fly if needed.
  • The translog will become "sequence-number aware", so that multiple translog files can be kept around to ensure that all documents either exist in all shards in Lucene, or can be replayed from the translog.

Apache Lucene

Watch This Space

Stay tuned to this blog, where we'll share more on the whole Elastic ecosystem, including news, learning resources and cool use cases!