This Week in Elasticsearch and Apache Lucene - 2018-01-08
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Changes in 5.6:
- Only bind loopback addresses when binding to local #28029
- _refresh to shrink the version map on inactivity #27918
Changes in 6.1:
- Allow shrinking of indices from a previous major #28076
- Carry forward weights, etc on rescore rewrite #27981
- Fix composite aggregation when after term is missing in the shard #27936
- Do not start snapshots that are deleted during initialization #27931
- Fix incorrect results for aggregations nested under a nested aggregation #27946
- Make AbstractQueryBuilder.declareStandardFields to be protected (#27865) #27894
- Fix preserving FiltersAggregationBuilder#keyed field on rewrite #27900
- Recovery from snapshot may leave seq# gaps #27850
Changes in 6.2:
- BREAKING: Introduce limit to the number of terms in Terms Query #27968
- BREAKING: Limit the analyzed text for highlighting #27934
- Painless: Modify Loader to Load Classes Directly from Definition #28088
- Add missing delegate methods to NodeIndicesStats #28092
- Enable ASN support for Ingest GeoIP plugin. #27958
- Fix global aggregation that requires breadth first and scores #27942
- java.locale.providers=COMPAT to Java 9 onwards #28080
- Add Writeable.Reader support to TransportResponseHandler #28010
- Fix cluster.routing.allocation.enable and cluster.routing.rebalance.enable case #28037
- Plugins: Add plugin extension capabilities #27881
- Add node id to shard failure message #28024
- Set global checkpoint before open engine from store #27972
- Rollback a primary before recovering from translog #27804
- Non-peer recovery should set the global checkpoint #27965
- Plugins: Add validation to plugin descriptor parsing #27951
- Persist global checkpoint when finalizing a peer recovery #27947
- operation_threading from the rest specs #27940
- Check and repair index under the store metadata lock #27768
- Fixes DocStats to properly deal with shards that report -1 index size #27863
- Make KeyedLock reentrant #27920
- Move uid lock into LiveVersionMap #27905
- Backport for using lastSyncedGlobalCheckpoint in deletion policy #27866
- Retain originalIndex info when rewriting FieldCapabilities requests #27761
- Backport of ranking evaluation API (#27478) #27844
- Using DocValueFormat::parseBytesRef for parsing missing value parameter #27855
- Reject scroll query if size is 0 (#22552) #27842
- Handle case where the hole vertex is south of the containing polygon(s) #27685
- Allow TrimFilter to be used in custom normalizers #27758
Changes in 7.0:
- Create nio-transport plugin for NioTransport #27949
- Remove deprecated exceptions #28059
- Enable grok processor to support long, double and boolean #27896
- Add elasticsearch-nio jar for base nio classes #27801
A bug has been discovered in the codec for Lucene 6.x indices regarding multi-valued doc values fields. This potentially affects all users who run Elasticsearch 6.x and have indices created with Elasticsearch 5.x that have multi-valued fields with doc-values enabled.
A fix has been merged and we are already discussing releasing Lucene 7.2.1. If you are affected by this bug, one workaround is to merge the affected index so that its segments are rewritten with the Lucene 7.0 codec, which doesn't have the bug.
Indexing terms impacts
We are resuming work on an old issue whose goal is to include information about produced scores in postings lists, so that collectors can skip blocks of documents that cannot produce competitive scores (this only works when you are only interested in the top-scoring documents). It makes several benchmark queries more than 10x faster. The open question is what the API should look like so that the benefit is not limited to term queries but can also be propagated through boolean queries.
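To make the idea of skipping non-competitive blocks concrete, here is a minimal sketch in plain Java. It does not use Lucene's API: the `Block` class, its precomputed `maxScore` and the `collectTopK` helper are all hypothetical names invented for illustration, and real postings would store compact impact data rather than final scores.

```java
import java.util.PriorityQueue;

/** Toy illustration of skipping posting blocks by their maximum possible score. */
public class ImpactSkippingSketch {

    /** A block of postings with per-document scores and the block maximum (hypothetical layout). */
    static final class Block {
        final float[] scores;  // assumed: the score each document in the block would produce
        final float maxScore;  // assumed: the "impact" stored alongside the block

        Block(float... scores) {
            this.scores = scores;
            float max = Float.NEGATIVE_INFINITY;
            for (float s : scores) {
                max = Math.max(max, s);
            }
            this.maxScore = max;
        }
    }

    /** Collects the top k scores into a min-heap and returns how many blocks were actually scored. */
    static int collectTopK(Block[] blocks, int k, PriorityQueue<Float> topK) {
        int scoredBlocks = 0;
        for (Block block : blocks) {
            // Once the heap is full, a block whose best score cannot beat the
            // current k-th best score is skipped without visiting its documents.
            if (topK.size() == k && block.maxScore <= topK.peek()) {
                continue;
            }
            scoredBlocks++;
            for (float score : block.scores) {
                if (topK.size() < k) {
                    topK.add(score);
                } else if (score > topK.peek()) {
                    topK.poll();
                    topK.add(score);
                }
            }
        }
        return scoredBlocks;
    }

    public static void main(String[] args) {
        Block[] blocks = {
            new Block(5.0f, 4.0f),
            new Block(1.0f, 0.5f),
            new Block(6.0f, 0.1f),
            new Block(2.0f, 3.0f)
        };
        PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap: peek() is the k-th best score
        int scored = collectTopK(blocks, 2, topK);
        System.out.println("blocks scored: " + scored + " of " + blocks.length);
    }
}
```

In this toy run only two of the four blocks are scored, because the other two cannot contain a score that beats the current second-best hit. The skipping is only valid because the caller wants the top hits and nothing else, which matches the restriction mentioned above.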
FunctionScoreQuery to replace CustomScoreQuery, BoostedQuery and BoostingQuery
We made it easier to use FunctionScoreQuery for use-cases that used to rely on CustomScoreQuery, BoostedQuery or BoostingQuery in the past, making FunctionScoreQuery the go-to query when it comes to customizing the way that scores are computed. Note that this query is different from Elasticsearch's function_score for now, even though you can expect to see Elasticsearch's function_score leverage Lucene's FunctionScoreQuery in the future.
Lazy term lookups
For queries that match few documents, it is not uncommon for term lookup to be the bottleneck. In LUCENE-7311 we optimized term queries to not look up terms eagerly when scores are not needed, so that we can potentially skip term lookups if the cache has an entry for some segments. LUCENE-8113 generalizes this to all queries that perform term lookups, such as phrase queries or span term queries.
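The effect of deferring the term lookup can be sketched in a few lines of plain Java. This is not Lucene code: the per-segment cache, the `lookupTerm` stand-in for a terms-dictionary seek and the lookup counter are all assumptions made up for the illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Toy illustration of looking a term up only when the cache cannot answer. */
public class LazyTermLookupSketch {

    private final Map<String, String> cache = new HashMap<>(); // hypothetical per-segment query cache
    private int lookups = 0;                                   // counts the expensive lookups performed

    /** Stand-in for an expensive terms-dictionary seek (hypothetical). */
    private String lookupTerm(String segment, String term) {
        lookups++;
        return "postings(" + segment + "," + term + ")";
    }

    void putInCache(String segment, String cachedMatches) {
        cache.put(segment, cachedMatches);
    }

    /** Returns the matches for one segment, resolving the term only on a cache miss. */
    String matches(String segment, String term) {
        // The supplier captures what to look up but does no work until get() is called.
        Supplier<String> lazyLookup = () -> lookupTerm(segment, term);
        String cached = cache.get(segment);
        if (cached != null) {
            return cached; // cache hit: the term is never looked up for this segment
        }
        return lazyLookup.get();
    }

    int lookups() {
        return lookups;
    }

    public static void main(String[] args) {
        LazyTermLookupSketch searcher = new LazyTermLookupSketch();
        searcher.putInCache("segment_0", "cached matches");
        searcher.matches("segment_0", "lucene"); // served from the cache
        searcher.matches("segment_1", "lucene"); // requires a real lookup
        System.out.println("term lookups performed: " + searcher.lookups());
    }
}
```

With an eager design both segments would pay for a lookup. Here only the segment without a cache entry does, which is the saving described above for the case where scores are not needed.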
- HyphenationCompoundWordTokenFilter doesn't work when the hyphenation indicator is >= 7 due to a bit twiddling bug.
- Lucene should protect against calls to IndexWriter.addDocuments with loads of documents. Currently you could end up with an ArrayIndexOutOfBoundsException.
- Static analysis found an unnecessary null check.
- We are upgrading ICU to version 60.2.
- UnifiedHighlighter might highlight terms at unmatched positions when using span_near.
- LatLonBoundingBox has a buggy toString representation.
- Similarity may now only take the doc-term frequency and the field's norm as per-document inputs to compute the score.
- Maybe the fact that we are not seeing speedups when using Java 9's Arrays.mismatch is due to the fact that Java 9 also changes the default GC?
- Explanation now takes a Number instead of a float, allowing for better explanations.
- Force-merge on read-write indexes can cause trouble when it comes to reclaiming deleted documents. One way to avoid this issue would be to make force-merge honor the maximum-segment-size parameter.
- If you ask for the top scores on a query that has both MUST and SHOULD clauses, the SHOULD clauses will turn into MUST clauses if the MUST clauses alone are not enough to produce competitive scores.
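The last item in the list above, about SHOULD clauses turning into MUST clauses, boils down to a comparison between two numbers. The sketch below assumes an upper bound on the score of the MUST clauses and a current minimum competitive score are both available. The method and parameter names are hypothetical and are not Lucene's API.

```java
/** Toy illustration of when an optional clause becomes required for top-hits collection. */
public class ReqOptSketch {

    /**
     * Returns true when the SHOULD clause must match for a hit to be competitive.
     * Both parameters are assumptions of this sketch: an upper bound on the score
     * of the MUST clauses alone, and the lowest score that can still enter the top hits.
     */
    static boolean shouldClauseIsRequired(float maxMustScore, float minCompetitiveScore) {
        // If even the best possible MUST-only score is not competitive,
        // a hit needs a contribution from the SHOULD clause to make the top hits.
        return maxMustScore < minCompetitiveScore;
    }

    public static void main(String[] args) {
        float maxMustScore = 2.0f;
        System.out.println(shouldClauseIsRequired(maxMustScore, 1.5f)); // prints false
        System.out.println(shouldClauseIsRequired(maxMustScore, 2.5f)); // prints true
    }
}
```

Once the check returns true, documents that match only the MUST clauses can be skipped, so the query can be evaluated as a conjunction from that point on.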