08 January 2018

This Week in Elasticsearch and Apache Lucene - 2018-01-08

By Clinton Gormley and Adrien Grand

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Changes in 5.6:

  • Only bind loopback addresses when binding to local #28029
  • Use _refresh to shrink the version map on inactivity #27918
  • X-Pack:
    • Set processors on audit remote client #3469
    • AD authn: never clear passwords on Bind connections #3351

Changes in 6.1:

  • Allow shrinking of indices from a previous major #28076
  • Carry forward weights, etc on rescore rewrite #27981
  • Fix composite aggregation when after term is missing in the shard #27936
  • Do not start snapshots that are deleted during initialization #27931
  • Fix incorrect results for aggregations nested under a nested aggregation #27946
  • Make AbstractQueryBuilder.declareStandardFields to be protected (#27865) #27894
  • Fix preserving FiltersAggregationBuilder#keyed field on rewrite #27900
  • Recovery from snapshot may leave seq# gaps #27850
  • X-Pack:
    • Cleanup the handling for bootstrap passwords #3470
    • [Security] has_privileges.has_all_requested should respect cluster privileges #3379

Changes in 6.2:

  • BREAKING: Introduce limit to the number of terms in Terms Query #27968
  • BREAKING: Limit the analyzed text for highlighting #27934
  • Painless: Modify Loader to Load Classes Directly from Definition #28088
  • Add missing delegate methods to NodeIndicesStats #28092
  • Enable ASN support for Ingest GeoIP plugin. #27958
  • Fix global aggregation that requires breadth first and scores #27942
  • Pass java.locale.providers=COMPAT to Java 9 onwards #28080
  • Add Writeable.Reader support to TransportResponseHandler #28010
  • Fix cluster.routing.allocation.enable and cluster.routing.rebalance.enable case #28037
  • Plugins: Add plugin extension capabilities #27881
  • Add node id to shard failure message #28024
  • Set global checkpoint before open engine from store #27972
  • Rollback a primary before recovering from translog #27804
  • Non-peer recovery should set the global checkpoint #27965
  • Plugins: Add validation to plugin descriptor parsing #27951
  • Persist global checkpoint when finalizing a peer recovery #27947
  • Remove operation_threading from the rest specs #27940
  • Check and repair index under the store metadata lock #27768
  • Fixes DocStats to properly deal with shards that report -1 index size #27863
  • Make KeyedLock reentrant #27920
  • Move uid lock into LiveVersionMap #27905
  • Backport for using lastSyncedGlobalCheckpoint in deletion policy #27866
  • Retain originalIndex info when rewriting FieldCapabilities requests #27761
  • Backport of ranking evaluation API (#27478) #27844
  • Using DocValueFormat::parseBytesRef for parsing missing value parameter #27855
  • Reject scroll query if size is 0 (#22552) #27842
  • Handle case where the hole vertex is south of the containing polygon(s) #27685
  • Allow TrimFilter to be used in custom normalizers #27758
  • X-Pack:
    • Watcher: Add refresh parameter to index action #3350
    • [Watcher] Use auto_expand_replicas on triggered_watches index too #3371

Changes in 7.0:

  • Create nio-transport plugin for NioTransport #27949
  • Remove deprecated exceptions #28059
  • Enable grok processor to support long, double and boolean #27896
  • Add elasticsearch-nio jar for base nio classes #27801

Apache Lucene

Lucene 7.2.1

A bug has been discovered in the codec for Lucene 6.x indices regarding multi-valued doc values fields. This potentially affects all users who run Elasticsearch 6.x and have indices created with Elasticsearch 5.x that contain multi-valued fields with doc values enabled.

A fix has been merged, and we are already discussing a Lucene 7.2.1 release. If you are affected by this bug, one workaround is to merge the affected index so that its segments are rewritten with the Lucene 7.0 codec, which does not have the bug.

Indexing terms impacts

We are resuming work on an old issue whose goal is to include information about produced scores in postings lists, so that collectors can skip blocks of documents that cannot produce competitive scores. This only works when you are only interested in the top-scoring documents. It makes several queries in the benchmark more than 10x faster. The open question is what the API should look like so that it benefits not only term queries but can also be propagated through boolean queries.
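The skipping idea can be illustrated with a toy model. The following is a minimal sketch, not Lucene's implementation or API: the `Block` layout, `build_blocks` and `top_k` are hypothetical names, and it assumes each block of postings stores the maximum score that any of its documents can produce.

```python
import heapq
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A block of postings plus the best score any document in it can produce."""
    postings: List[Tuple[int, float]]  # (doc_id, score) pairs; hypothetical layout
    max_score: float


def build_blocks(postings: List[Tuple[int, float]], block_size: int) -> List[Block]:
    """Split postings into fixed-size blocks and record each block's maximum score."""
    blocks = []
    for i in range(0, len(postings), block_size):
        chunk = postings[i:i + block_size]
        blocks.append(Block(chunk, max(score for _, score in chunk)))
    return blocks


def top_k(blocks: List[Block], k: int):
    """Collect the k best hits; return them with the number of documents scored."""
    heap: List[Tuple[float, int]] = []  # min-heap of (score, doc_id)
    scored = 0
    for block in blocks:
        # Once k hits are collected, a block whose best possible score cannot
        # beat the current k-th best score is skipped without scoring anything.
        if len(heap) == k and block.max_score <= heap[0][0]:
            continue
        for doc_id, score in block.postings:
            scored += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    hits = sorted(heap, key=lambda hit: (-hit[0], hit[1]))
    return [(doc_id, score) for score, doc_id in hits], scored
```

In this toy model the skip is only valid because the collector keeps nothing but the top k hits; a collector that needs every match, or an exact hit count, would have to visit every block.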

FunctionScoreQuery to replace CustomScoreQuery, BoostedQuery and BoostingQuery

We made it easier to use FunctionScoreQuery for use cases that used to rely on CustomScoreQuery, BoostedQuery or BoostingQuery, making FunctionScoreQuery the go-to query for customizing how scores are computed. Note that this query is currently different from Elasticsearch's function_score, although you can expect Elasticsearch's function_score to leverage Lucene's FunctionScoreQuery in the future.
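As a rough illustration of why a single abstraction can cover these use cases, here is a toy sketch in which both a custom score and a boost are expressed as one function applied to each hit. This is not Lucene's API: `function_score` and `boost_by_value` are hypothetical helpers, and hits are modeled as plain `(doc_id, score)` pairs.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[int, float]  # (doc_id, score); hypothetical representation of a hit


def function_score(hits: List[Hit], combine: Callable[[int, float], float]) -> List[Hit]:
    """Replace each hit's score with combine(doc_id, original_score)."""
    return [(doc_id, combine(doc_id, score)) for doc_id, score in hits]


def boost_by_value(hits: List[Hit], values: Dict[int, float]) -> List[Hit]:
    """Multiply each score by a per-document value (1.0 when the value is missing)."""
    return function_score(hits, lambda doc_id, score: score * values.get(doc_id, 1.0))


hits = [(1, 2.0), (2, 1.0)]
popularity = {1: 1.5, 2: 4.0}  # made-up per-document values
boosted = boost_by_value(hits, popularity)
```

Here the boost is just a special case of the general score function, which is the sense in which one query type can stand in for several specialized ones.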

Lazy term lookups

For queries that match few documents, term lookup is often a bottleneck. In LUCENE-7311 we optimized term queries to not look up terms eagerly when scores are not needed, so that term lookups can potentially be skipped on segments for which the cache has an entry. LUCENE-8113 generalizes this to all queries that perform term lookups, such as phrase queries or span term queries.
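The difference between eager and lazy lookups can be sketched with a small model. This is a simplified illustration under stated assumptions, not Lucene code: `Segment`, `match_term` and the cache keyed by `(segment name, term)` are hypothetical, and the scoring path is assumed to need per-segment lookups up front in order to gather term statistics.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple


class Segment:
    """Toy segment that counts how many term lookups it has performed."""

    def __init__(self, name: str, terms: Dict[str, Set[int]]):
        self.name = name
        self._terms = terms  # term -> matching doc IDs
        self.lookups = 0

    def lookup(self, term: str) -> FrozenSet[int]:
        self.lookups += 1
        return frozenset(self._terms.get(term, ()))


def match_term(
    segments: List[Segment],
    term: str,
    cache: Dict[Tuple[str, str], FrozenSet[int]],
    needs_scores: bool,
) -> Tuple[Dict[str, FrozenSet[int]], Optional[int]]:
    """Return matching docs per segment, plus a document frequency when scoring."""
    if needs_scores:
        # Assumption: scoring needs index-wide statistics, so every segment
        # is looked up eagerly and the cache cannot help.
        per_segment = {seg.name: seg.lookup(term) for seg in segments}
        doc_freq = sum(len(docs) for docs in per_segment.values())
        return per_segment, doc_freq
    result = {}
    for seg in segments:
        key = (seg.name, term)
        if key not in cache:
            # The lookup only happens on a cache miss.
            cache[key] = seg.lookup(term)
        result[seg.name] = cache[key]
    return result, None
```

With a cache entry already present for one segment, the non-scoring path performs no term lookup on that segment at all, which is the saving described above.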

Other