23 November 2018

This Week in Elasticsearch and Apache Lucene - 2018-11-23

By Bill McConaghy, Adrien Grand, Daniel Mitterdorfer, Tim Vernum, Yannick Welsch, Colin Goodheart-Smithe, Jim Ferenczi

Elasticsearch

Remote clusters UI

Work continues on the remote clusters UI. We have finished the UI layer, and have been working on the app layer to integrate the UI with endpoints. So far the work has been done based on the assumption that add/edit/delete are all persistent settings. Next, we will work on handling transient settings as well as settings that come from elasticsearch.yml.

Index lifecycle management UI

Work on this UI is progressing nicely. The flows from the index management app to add a policy to and remove a policy from an index are completed. We also added the ability to see the error stack trace for lifecycle errors and to see the phase definition. Remaining tasks include adding the ability to disable the UI, adding a button to add a policy to an index template from the policy CRUD UI, adding filter buttons in index management to filter on managed/unmanaged and lifecycle phase, and updating autocomplete rules for console to include new ILM APIs.

Script Score Query

There is a new query called script_score that allows users to manipulate the score of documents through a Painless script. The query also includes a script context which exposes a number of useful methods, including the decay functions of the function_score query. This new query is the first part of the work to deprecate and remove the function_score query in favour of easier-to-use alternatives.
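As a rough illustration of what such a request could look like, the sketch below builds a script_score search body in Python. The field names (title, price), the parameter values and the helper name build_script_score_request are hypothetical, and the decayNumericGauss call is our assumption of how one of the decay helpers is invoked; the exact signature in the released version may differ.

```python
import json


def build_script_score_request(text, origin, scale):
    """Build a search body that rescores matching documents with a Painless script.

    Hypothetical helper: field names and parameter values are for illustration only.
    """
    return {
        "query": {
            "script_score": {
                # Inner query selects the documents and provides the base _score.
                "query": {"match": {"title": text}},
                "script": {
                    # Assumed decay helper: documents whose price is close to
                    # params.origin keep most of their original score.
                    "source": (
                        "_score * decayNumericGauss("
                        "params.origin, params.scale, params.offset, "
                        "params.decay, doc['price'].value)"
                    ),
                    "params": {
                        "origin": origin,
                        "scale": scale,
                        "offset": 0,
                        "decay": 0.5,
                    },
                },
            }
        }
    }


body = build_script_score_request("laptop", origin=500, scale=200)
print(json.dumps(body, indent=2))
```

The resulting JSON could then be sent to the _search endpoint of an index that has the assumed fields.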

Rare terms aggregation

We have a PR for a new aggregation that is designed to surface rare terms. It is intended to replace ordering a terms agg by ascending count, which has an unbounded error. The rare terms aggregation uses a different strategy to merge the results from different shards, which gives better precision.
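To see why ordering by ascending count can be arbitrarily wrong, here is a small simulation in Python. It is only a sketch of the failure mode, not of the new aggregation: the shard contents are made up, and shard_rarest and merge_rarest are hypothetical stand-ins for the per-shard collection and the coordinator-side merge.

```python
from collections import Counter


def shard_rarest(counts, size):
    """Return the `size` terms with the lowest count on a single shard."""
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return dict(ordered[:size])


def merge_rarest(shards, size):
    """Sum the partial counts returned by each shard and keep the lowest `size`."""
    merged = Counter()
    for counts in shards:
        merged.update(shard_rarest(counts, size))
    ordered = sorted(merged.items(), key=lambda kv: (kv[1], kv[0]))
    return dict(ordered[:size])


# "apple" is rare on shard 1 but very common on shard 2.
shard_1 = {"apple": 1, "kiwi": 40, "pear": 50}
shard_2 = {"apple": 1000, "kiwi": 2, "pear": 3}

approx = merge_rarest([shard_1, shard_2], size=1)
exact = Counter(shard_1) + Counter(shard_2)

print("merged from shards:", approx)
print("true counts:", dict(exact))
```

Shard 2 never reports "apple" because the term is common there, so the merged result presents it as the rarest term with a count far below its true total. The size of that gap depends only on how skewed the shards are, which is why the error has no bound.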

Build

As every PR against our repo is built on our CI infrastructure as part of the review process, it is very important to have fast feedback cycles. By splitting the build and running it on multiple worker nodes we were able to reduce the total PR build time by 50%, down to 90 minutes. We have also enabled Gradle build scans so we can find more room for improvement.

Backup of Security

It’s currently possible to back up the .security index, but it’s not an ideal experience. We have a list of tasks we want to complete to improve this experience, and we have started working through it.

We have opened a PR for a “snapshot” privilege and have been thinking about how to allow some low-privileged users to list the .security index, and potentially see some metadata about it, without necessarily being able to search it or get documents from it.

Zen 2

We now have a storage layer that can atomically write cluster states, and the remaining follow-up work will focus on integrating the new storage layer with the existing Zen2 components.

We also fixed a long-standing issue where nodes miss applying a cluster state update and then lag indefinitely behind the master in terms of applied cluster states. This can potentially impact other operations in the cluster; for example, it can block indexing operations that are waiting on mapping updates. The solution is for the master to track the applied cluster state version of each node and to kick a node out of the cluster if it doesn't apply a published cluster state, or any of its follow-up published states, within some timeout. This gives us a strong bound on how stale the cluster state applied on every node in the cluster can be.
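The bookkeeping described above can be sketched in a few lines of Python. This is a simplified model and not the Zen2 code: the LagDetector class, its method names, the explicit `now` arguments and the timeout of 90 time units are all assumptions made for illustration.

```python
class LagDetector:
    """Master-side tracking of cluster state versions that nodes still have to apply."""

    def __init__(self, nodes, timeout):
        self.timeout = timeout
        # node -> {published version: deadline by which it (or a later one) must be applied}
        self.pending = {node: {} for node in nodes}

    def on_publish(self, version, now):
        # Every node gets a deadline for the newly published state.
        for deadlines in self.pending.values():
            deadlines[version] = now + self.timeout

    def on_applied(self, node, version):
        deadlines = self.pending.get(node)
        if deadlines is None:
            return  # the node was already removed from the cluster
        # Applying a state also satisfies the deadlines of all earlier states.
        for published in [v for v in deadlines if v <= version]:
            del deadlines[published]

    def remove_lagging(self, now):
        """Drop and return the nodes that missed at least one deadline."""
        lagging = [
            node
            for node, deadlines in self.pending.items()
            if any(now >= deadline for deadline in deadlines.values())
        ]
        for node in lagging:
            del self.pending[node]
        return lagging


detector = LagDetector(["node-a", "node-b"], timeout=90)
detector.on_publish(version=7, now=0)
detector.on_publish(version=8, now=10)
detector.on_applied("node-a", version=8)  # satisfies versions 7 and 8
detector.on_applied("node-b", version=6)  # still behind both published states
removed = detector.remove_lagging(now=95)
print(removed)
```

In this model a node can never be more than one timeout behind the last published state without being removed, which is the staleness bound mentioned above.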

Cross Cluster Replication

We are also continuing to investigate and benchmark update- and refresh-heavy use cases, to see if the number of refreshes (and merges) can be reduced without hurting indexing throughput. This work has led to two Lucene changes that reduce lock contention when indexing happens concurrently with refresh / open reader calls: 1) improve merge performance in the common case where soft- and hard-deletes are not mixed, and 2) prevent blocking on resolving deletes.

The work to implement "recovery from remote" by bootstrapping the creation of the follower index through the Snapshot/Restore functionality has started by prototyping a CCR Snapshot/Restore repository that gets registered based on available remote clusters.

Changes

Changes in 6.4:

  • [CI] Reactivate 3rd party tests on CI [branch 6.4] #35733

Changes in 6.5:

  • Avoid NPE in follower stats when no tasks metadata #35802
  • [Scripting] Use Number as a return value for BucketAggregationScript #35653
  • SQL: Improve validation of unsupported fields #35675

Changes in 6.6:

  • BREAKING: Forbid negative scores in function_score query #35709
  • SQL: Implement NVL(expr1, expr2) #35794
  • TransportResyncReplicationAction should not honour blocks #35795
  • GEO: More robust handling of ignore_malformed in geoshape parsing #35603
  • SQL: Implement ISNULL(expr1, expr2) #35793
  • HLRC: ML Delete event from Calendar #35760
  • [HLRC] Fix issue in equals impl for GlobalOperationPrivileges #35721
  • [GEO] Add support to ShapeBuilders for building Lucene geometry #35707
  • SQL: Implement IFNULL variant of COALESCE #35762
  • SQL: Perform lazy evaluation of mismatched mappings #35676
  • [HLRC] Add support for get application privileges API #35556
  • [Tests] Fix SimpleQueryStringBuilderTests #35784
  • manage_token privilege for kibana_system #35751
  • SQL: Introduce INTERVAL support #35521
  • [HLRC] Added support for CCR Resume Follow API #35638
  • Clean up PutLicenseResponse #35689
  • Clean up StartBasicResponse #35688
  • Add a prebuilt ICU Analyzer #34958
  • [HLRC] Added support for CCR Unfollow API #35693
  • Fix problem with MatchNoDocsQuery in disjunction queries #35726
  • HLRC: ML Get Calendar Events #35747
  • HLRC ML Add Event To Calendar API #35704
  • HLRC: ML Delete job from calendar #35713
  • Add a _freeze / _unfreeze API #35592
  • HLRC: Add "_has_privileges" API to Security Client #35479
  • HLRC for _mtermvectors #35266
  • HLRC: ML Add Job to Calendar API #35666
  • Build: Fix official plugins list #35661
  • HLRC: Add ML delete filter action #35382
  • SQL: Move internals from Joda to java.time #35649
  • Fix phrase_slop in query_string query #35533

Changes in 7.0:

  • Add a new query type - ScriptScoreQuery #34533
  • Deprecate types in validate query requests. #35575
  • Deprecate types in count and msearch. #35421

Lucene

BM25F

A first implementation of BM25F has been pushed to sandbox. BM25F is an extension to BM25 that supports scoring across multiple fields with different degrees of importance. The way it works is quite simple. For instance, if you want to give title a weight of 5 and body a weight of 1, then BM25F will score across these two fields as if there were a single field consisting of the concatenation of the body field and 5 times the title field. While this first implementation is already helpful, it is currently implemented via a custom query that ignores the similarity configured on IndexSearcher. There is still work needed to make it more flexible, e.g. by accepting different values of b (the BM25 parameter that controls length normalization) per field, and to make it easier to use. We hope to eventually leverage this functionality to improve scoring in Elasticsearch when using the cross_fields way of scoring across multiple fields.
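The "concatenated field" intuition can be checked with a small Python sketch. It uses the textbook BM25 formula rather than Lucene's implementation, and the document, the avg_len and idf values and the function names are made up for the example.

```python
import math


def bm25(tf, doc_len, avg_len, idf, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one field."""
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)


def bm25f(term, fields, weights, avg_len, idf):
    """Score `term` by combining weighted per-field statistics before applying BM25."""
    tf = sum(weights[name] * tokens.count(term) for name, tokens in fields.items())
    doc_len = sum(weights[name] * len(tokens) for name, tokens in fields.items())
    return bm25(tf, doc_len, avg_len, idf)


doc = {
    "title": ["lucene", "scoring"],
    "body": ["scoring", "with", "lucene", "and", "bm25"],
}
weights = {"title": 5, "body": 1}

# The equivalent single field: the body followed by five copies of the title.
combined = doc["body"] + 5 * doc["title"]

score_weighted = bm25f("lucene", doc, weights, avg_len=12.0, idf=1.5)
score_concat = bm25(combined.count("lucene"), len(combined), avg_len=12.0, idf=1.5)
print(math.isclose(score_weighted, score_concat))
```

Both paths see the same term frequency and the same field length, so they produce the same score. Note that avg_len would also have to be computed over the combined field, which the sketch simply takes as a given.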

Benchmarks for LatLonShape

While we would eventually want to benchmark LatLonShape - our new BKD-based support for shapes - with realistic datasets of shapes (polygons, lines, etc), we started simple by running our current benchmark for points with LatLonShape. This already gives us a good indication of how much overhead LatLonShape adds compared with LatLonPoint when only indexing points. See the purple line at http://people.apache.org/~mikemccand/geobench.html if you are curious. LatLonShape is still at an early stage of its development and we hope to bridge some of the gap with LatLonPoint in the near future.

Faster merging when indexing geo shapes

We recently enhanced BKD trees so that only a subset of dimensions is indexed, which in turn helps use BKD trees as an R-tree. We noticed that we are still doing unnecessary work on non-indexed dimensions. Fixing this is expected to make merging faster for fields that index shapes. We will hopefully soon see a bump in the indexing speed charts (see above paragraph).

Less contention when applying deletes

Deletes may be applied either while indexing or refreshing. Today this is a blocking call, which might block some indexing threads. However, blocking is only required when refreshing, in order to make sure that all deletes are reflected in the new point-in-time view of the index. An optimization that only makes it blocking for refreshes has been implemented to solve this.
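The idea can be illustrated with a toy model in Python. This is not how Lucene implements it: the DeleteQueue class and its method names are hypothetical, and a single lock stands in for whatever coordination the real code uses.

```python
import threading
from collections import deque


class DeleteQueue:
    """Toy model: pending deletes applied either opportunistically or on refresh."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()
        self.applied = []

    def add(self, term):
        self._pending.append(term)

    def _apply(self, blocking):
        # With blocking=False this returns False at once if another thread holds the lock.
        if not self._lock.acquire(blocking=blocking):
            return False
        try:
            while self._pending:
                self.applied.append(self._pending.popleft())
            return True
        finally:
            self._lock.release()

    def on_index(self):
        """Indexing threads help out only if nobody else is applying deletes."""
        return self._apply(blocking=False)

    def on_refresh(self):
        """A refresh must wait so the new point-in-time view sees every delete."""
        return self._apply(blocking=True)


queue = DeleteQueue()
queue.add("id:1")
with queue._lock:  # simulate another thread that is busy applying deletes
    applied_while_indexing = queue.on_index()  # returns immediately, nothing applied
applied_on_refresh = queue.on_refresh()  # waits if needed, then applies
print(applied_while_indexing, applied_on_refresh, queue.applied)
```

The indexing path gives up immediately when the lock is taken and goes back to indexing, while the refresh path still waits so that no delete is missing from the refreshed view.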

Normalization as a first-class citizen

The ability for analysis chains to only normalize content, skipping tokenization, stemming and all other components that don't work with partial terms (think of wildcard queries), always felt a bit like a second-class citizen: you had to check whether a component implemented a special marker interface to know whether it should be applied for normalization. This was fixed by moving normalization directly to the CharFilter and TokenFilter factory classes.
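A rough Python analogy of the new design is sketched below. The class names are stand-ins rather than the Lucene classes, and the toy stemmer is made up. The point is only that each factory decides for itself what it does during normalization, with a no-op as the default.

```python
class FilterFactory:
    """Hypothetical base factory: by default a component is skipped for normalization."""

    def create(self, tokens):
        raise NotImplementedError

    def normalize(self, tokens):
        return tokens


class LowerCaseFactory(FilterFactory):
    """Lowercasing is safe on partial terms, so it takes part in normalization."""

    def create(self, tokens):
        return [token.lower() for token in tokens]

    def normalize(self, tokens):
        return self.create(tokens)


class SuffixStemFactory(FilterFactory):
    """Toy stemmer: unsafe on partial terms, so it keeps the no-op normalize()."""

    def create(self, tokens):
        return [token[:-1] if token.endswith("s") else token for token in tokens]


def analyze(text, factories):
    """Full analysis: tokenize on whitespace, then run every filter."""
    tokens = text.split()
    for factory in factories:
        tokens = factory.create(tokens)
    return tokens


def normalize(text, factories):
    """Normalization: no tokenization, only the filters that opt in."""
    tokens = [text]
    for factory in factories:
        tokens = factory.normalize(tokens)
    return tokens[0]


chain = [LowerCaseFactory(), SuffixStemFactory()]
analyzed = analyze("Quick Foxes", chain)
normalized = normalize("Foxes*", chain)
print(analyzed, normalized)
```

The caller no longer needs to inspect each component for a marker interface: it just calls normalize() on every factory in the chain.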

Resources around Lucene

As highlighted in a recent Twitter conversation, the Lucene wiki has a useful page that collects various resources that either describe Lucene's internals or have been influential.

Other

- The Lucene 7.6 release process is on pause during Thanksgiving.

- Nori, the Korean tokenizer, could hit a NullPointerException if the part-of-speech attribute is not set.

- Nori doesn't do a good job with non-CJK tokens.

- The "flexible" query parser can fail with index-out-of-bounds exceptions with non-default locales.

- In the general case, counting soft deletes upon merge requires a linear scan. Simon optimized the common special case that a segment consists only of soft deletes, as in that case the number of soft deletes is already known via IndexReader#numDocs.