08 January 2018

This Week in Elasticsearch and Apache Lucene - 2018-01-08

Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.

Changes in 5.6:

Only bind loopback addresses when binding to local #28029
Use _refresh to shrink the version map on inactivity #27918
X-Pack:
- Set processors on audit remote client #3469
- AD authn: never clear passwords on Bind connections #3351

Changes in 6.1:

Allow shrinking of indices from a previous major #28076
Carry forward weights, etc on rescore rewrite #27981
Fix composite aggregation when after term is missing in the shard #27936
Do not start snapshots that are deleted during initialization #27931
Fix incorrect results for aggregations nested under a nested aggregation #27946
Make AbstractQueryBuilder.declareStandardFields to be protected (#27865) #27894
Fix preserving FiltersAggregationBuilder#keyed field on rewrite #27900
Recovery from snapshot may leave seq# gaps #27850
X-Pack:
- Cleanup the handling for bootstrap passwords #3470
- [Security] has_privileges.has_all_requested should respect cluster privileges #3379

Changes in 6.2:

BREAKING: Introduce limit to the number of terms in Terms Query #27968
BREAKING: Limit the analyzed text for highlighting #27934
Painless: Modify Loader to Load Classes Directly from Definition #28088
Add missing delegate methods to NodeIndicesStats #28092
Enable ASN support for Ingest GeoIP plugin. #27958
Fix global aggregation that requires breadth first and scores #27942
Pass java.locale.providers=COMPAT to Java 9 onwards #28080
Add Writeable.Reader support to TransportResponseHandler #28010
Fix cluster.routing.allocation.enable and cluster.routing.rebalance.enable case #28037
Plugins: Add plugin extension capabilities #27881
Add node id to shard failure message #28024
Set global checkpoint before open engine from store #27972
Rollback a primary before recovering from translog #27804
Non-peer recovery should set the global checkpoint #27965
Plugins: Add validation to plugin descriptor parsing #27951
Persist global checkpoint when finalizing a peer recovery #27947
Remove operation_threading from the rest specs #27940
Check and repair index under the store metadata lock #27768
Fixes DocStats to properly deal with shards that report -1 index size #27863
Make KeyedLock reentrant #27920
Move uid lock into LiveVersionMap #27905
Backport for using lastSyncedGlobalCheckpoint in deletion policy #27866
Retain originalIndex info when rewriting FieldCapabilities requests #27761
Backport of ranking evaluation API (#27478) #27844
Using DocValueFormat::parseBytesRef for parsing missing value parameter #27855
Reject scroll query if size is 0 (#22552) #27842
Handle case where the hole vertex is south of the containing polygon(s) #27685
Allow TrimFilter to be used in custom normalizers #27758
X-Pack:
- Watcher: Add refresh parameter to index action #3350
- [Watcher] Use auto_expand_replicas on triggered_watches index too #3371

Changes in 7.0:

Create nio-transport plugin for NioTransport #27949
Remove deprecated exceptions #28059
Enable grok processor to support long, double and boolean #27896
Add elasticsearch-nio jar for base nio classes #27801

Apache Lucene

Lucene 7.2.1

A bug has been discovered in the codec for Lucene 6.x indices regarding multi-valued doc values fields. This potentially affects all users who run Elasticsearch 6.x and have indices created with Elasticsearch 5.x that have multi-valued fields with doc-values enabled.

A fix has been merged and we are already discussing a Lucene 7.2.1 release. If you are affected by this bug, one workaround is to merge the affected index so that its segments are rewritten with the Lucene 7.0 codec, which does not have the bug.

Indexing terms impacts

We are resuming work on an old issue whose goal is to include information about produced scores in postings lists, so that collectors can skip blocks of documents that do not produce competitive scores (this only helps when you are interested solely in the top-scoring documents). It makes several queries of the benchmark more than 10x faster, but the open question is what the API should look like so that it benefits not only term queries but can also be propagated through boolean queries.
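To make the idea concrete, here is a minimal toy sketch of block skipping during top-k collection. It is not Lucene's implementation or API: the function name `top_k_with_block_skipping`, the block layout, and the scores are all hypothetical, and we assume each block of postings carries the maximum score any of its documents can produce.

```python
import heapq


def top_k_with_block_skipping(blocks, k):
    """Collect the top-k hits, skipping blocks that cannot be competitive.

    `blocks` is a list of (max_score, postings) pairs, where postings is a
    list of (doc_id, score). This layout is an assumption for illustration.
    Returns (hits sorted by descending score, number of blocks skipped).
    """
    heap = []  # min-heap of (score, doc_id) holding the current top k
    skipped = 0
    for max_score, postings in blocks:
        # Once k hits are collected, a block whose best possible score cannot
        # beat the current k-th best score holds no competitive document.
        if len(heap) == k and max_score <= heap[0][0]:
            skipped += 1
            continue
        for doc_id, score in postings:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    hits = sorted(heap, key=lambda h: (-h[0], h[1]))
    return [(doc_id, score) for score, doc_id in hits], skipped


# Hypothetical postings: four blocks, each annotated with its maximum score.
example_blocks = [
    (3.0, [(0, 1.0), (1, 3.0), (2, 2.5)]),
    (2.0, [(3, 2.0), (4, 0.5)]),
    (4.0, [(5, 4.0), (6, 1.5)]),
    (1.0, [(7, 1.0), (8, 0.2)]),
]
hits, skipped = top_k_with_block_skipping(example_blocks, k=2)
```

In this toy data, two of the four blocks are never scored because their maximum score is already below the second-best hit. The skipping is only valid because the caller wants nothing but the top k documents, which matches the restriction described above.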

FunctionScoreQuery to replace CustomScoreQuery, BoostedQuery and BoostingQuery

We made it easier to use FunctionScoreQuery for use cases that used to rely on CustomScoreQuery, BoostedQuery or BoostingQuery, making FunctionScoreQuery the go-to query for customizing how scores are computed. Note that this query is different from Elasticsearch's function_score for now, although you can expect Elasticsearch's function_score to leverage Lucene's FunctionScoreQuery in the future.

Lazy term lookups

For queries that match few documents, it is not uncommon for term lookups to be a bottleneck. In LUCENE-7311 we optimized term queries to not look up terms eagerly when scores are not needed, so that we can potentially skip term lookups if the cache has an entry for some segments. LUCENE-8113 generalizes this to all queries that perform term lookups, such as phrase queries or span term queries.
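The effect of deferring the lookup can be sketched with a toy model. Everything here is hypothetical (the `Segment` class, the `matching_docs` helper, and the dictionary used as a cache) and stands in for Lucene's terms dictionary and query cache only loosely: the point is that the cache is consulted first, and the terms dictionary is touched only on a miss.

```python
class Segment:
    """Toy segment: a terms dictionary mapping term -> list of doc ids."""

    def __init__(self, name, postings):
        self.name = name
        self.postings = postings
        self.lookups = 0  # how many times the terms dictionary was consulted

    def lookup(self, term):
        self.lookups += 1
        return self.postings.get(term, [])


def matching_docs(segment, term, cache):
    """Return doc ids matching `term`, looking the term up only on a cache miss."""
    key = (segment.name, term)
    if key in cache:
        # Cache hit: the terms dictionary is never touched for this segment.
        return cache[key]
    docs = segment.lookup(term)  # cache miss: pay for the lookup now
    cache[key] = docs
    return docs


# Hypothetical data: one segment is already cached, the other is not.
cached_segment = Segment("seg_a", {"lucene": [1, 4]})
uncached_segment = Segment("seg_b", {"lucene": [2]})
cache = {("seg_a", "lucene"): [1, 4]}

docs_a = matching_docs(cached_segment, "lucene", cache)
docs_b = matching_docs(uncached_segment, "lucene", cache)
```

An eager design would call `lookup` for every segment while preparing the query, before knowing whether the cache could answer. Here the cached segment performs zero lookups.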

Other

HyphenationCompoundWordTokenFilter doesn't work when the hyphenation indicator is >= 7 due to a bit twiddling bug.
Lucene should protect against calls to IndexWriter.addDocuments with loads of documents. Currently you could end up with an ArrayIndexOutOfBoundsException.
Static analysis found an unnecessary null check.
We are upgrading ICU to version 60.2.
UnifiedHighlighter might highlight terms at unmatched positions when using span_near.
LatLonBoundingBox has a buggy toString representation.
Similarity may now only take the doc-term frequency and the field's norm as per-document inputs to compute the score.
Maybe we are not seeing speedups when using Java 9's Arrays.mismatch because Java 9 also changes the default GC?
Explanation now takes a Number instead of a float, allowing for better explanations.
Force-merge on read-write indexes can cause trouble when it comes to reclaiming deleted documents. One way to avoid this issue would be to make force-merge honor the maximum-segment-size parameter.
If you ask for the top scores on a query that has both MUST and SHOULD clauses, the SHOULD clauses will turn into MUST clauses if the MUST clauses alone are not enough to produce competitive scores.
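The last item above, about MUST and SHOULD clauses, can be illustrated with a small toy sketch. This is not Lucene code: the function name `competitive_candidates`, the doc id lists, and the score values are hypothetical, and we assume the maximum score of the MUST clauses and the current minimum competitive score are both known.

```python
def competitive_candidates(must_docs, should_docs, must_max_score,
                           min_competitive_score):
    """Return the doc ids that are still worth scoring.

    If the MUST clauses alone can never reach the minimum competitive score,
    a document also has to match the SHOULD clause to be competitive, so the
    SHOULD clause is treated as required (an intersection).
    """
    if must_max_score < min_competitive_score:
        return sorted(set(must_docs) & set(should_docs))
    # Otherwise SHOULD stays optional: every MUST match is a candidate.
    return sorted(must_docs)


# Hypothetical postings and scores.
must_docs = [1, 2, 3, 5]
should_docs = [2, 5, 9]

early = competitive_candidates(must_docs, should_docs,
                               must_max_score=1.2, min_competitive_score=0.0)
late = competitive_candidates(must_docs, should_docs,
                              must_max_score=1.2, min_competitive_score=1.5)
```

Early in collection, when the threshold is low, all four MUST matches are candidates. Once the threshold rises above what MUST alone can score, only the documents that also match SHOULD remain.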
