This Week in Elasticsearch and Apache Lucene - 2019-07-12

Elasticsearch

Geo

We merged a PR to consolidate geo return values under a single GeoValue type.  This is so that aggregations that just want a "geo thing" can use a generic view of both points and shapes to get common attributes like bounding boxes.

We merged a bug fix for overly strict validation of geo shapes, and opened a PR to move dateline handling logic out of ShapeBuilders in order to simplify the addition of XY shapes.  As a reminder, this is part of a series of refactorings we want to do to ShapeBuilders before moving forward with XYShape, since it'll be easier to unpack it now than later.

We also added an XContentParser-compatible map iterator that should allow us to avoid serializing and deserializing large geo shapes.  This is something the Ingest team was apparently running into, and Tal is going to make use of it for the Circle ingest processor.

We've been working on a new xpack geo plugin to hold the new geometry field type that exposes Lucene's XYShape.

Index Templates List View

We started working on the UI for index templates. As shown below, it supports listing and deleting index templates, but prevents deleting system templates.

[Screenshot: Index Templates List View]

UI Tests

We landed app load tests for Index Lifecycle Management, Cross Cluster Replication, Index Management, and Snapshot Restore.

Snapshot resiliency

We are investigating ways to reduce the number of RPC calls when creating or deleting snapshots. We have optimized the interaction with blob store repositories a little more by removing old 1.x compatibility logic and removing a few redundant listing operations during both snapshot creation and deletion.

We are also investigating what kind of corruptions to expect when multiple clusters write concurrently to the same repository, and are hardening the snapshot deletion functionality to reduce the risk of leaving a repository in a broken state. We are also adding more checks to the test infrastructure to verify the integrity of repositories, i.e., that they are not missing any files.

Distance functions on vector fields

We added Manhattan distance and Euclidean distance functions to vector fields.
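
For reference, the two distances boil down to simple arithmetic over the vector components. The sketch below is plain Java for illustration only, not the actual scripting implementation; in the API the functions surface as l1norm and l2norm.

    // Illustration only, not the actual script_score implementation:
    // Manhattan (L1) and Euclidean (L2) distance between a query vector
    // and a document vector of the same dimension.
    static double l1norm(float[] query, float[] doc) {
        double sum = 0;
        for (int i = 0; i < query.length; i++) {
            sum += Math.abs(query[i] - doc[i]);
        }
        return sum;
    }

    static double l2norm(float[] query, float[] doc) {
        double sum = 0;
        for (int i = 0; i < query.length; i++) {
            double diff = query[i] - doc[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }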

Aggregation tests

We completed the migration of all but one of the integration tests from AvgIT to AvgAggregatorTests.  This is pretty exciting, since ~15-20 tests moved from integration tests to unit tests, which should help both speed and debuggability in the future. The one remaining test is for caching and can be consolidated later once other tests have been moved.

Password-protected keystore

We are finishing up work on two components of adding password protection to the Elasticsearch keystore: adding password support to the CLI tools, and adding password support to the reload secure settings API.

ODBC Driver

A new PR has been opened to make all existing configuration parameters settable through the DSN editor; this will make configuring connections easier for users, who will no longer need to fiddle with error-prone connection string editing.

A documentation PR has been opened to cover all available parameters, as well as to adjust some slightly misleading phrasing.

Enrich

Enrich is a project to let an ingest pipeline add data to a document being ingested, based on a search against a different index. The enrich processor uses the multi search API to execute buffered search requests that perform the lookups. Buffering the search requests is a great way of optimizing the lookups that the enrich processor does, especially when the enrich index isn't local. However, the multi search API doesn't execute the searches in an optimal way when all of them target the same shard. This PR handles search requests for enrich indices more efficiently: only the search features that the enrich processor requires are used, and multiple search requests reuse the same engine and are executed by the same search thread. (#43965)

The enrich indices need to be cleaned up by a background process.  This cleanup process is quite important to get right, as it needs to behave nicely with enrich policies that may currently be executing, but it is also expected to repair enrich state that may have deteriorated for any reason. For instance, if a master node fails while executing a policy, the new master node must be able to recover and clean up the broken pieces correctly. We made some adjustments to the background enrich cleanup process and added some test code this week. (#43746)

Selecting keys in buckets_path

In a recent chat one of our engineers had with a user, the user wanted to select specific buckets for use in a bucket_script, which isn't possible today.  There is a workaround using two filter aggregations, but it's hacky at best.  Zach opened a PR to add support for selecting keys in the buckets_path syntax (e.g. categories['green']>the_avg.value). Terms aggregations have always been a bit problematic for pipelines, so this should make them a little easier to work with.

A tale of two append-only optimizations

Elasticsearch provides two optimizations to speed up indexing with an append-only workload.

When using auto-generated IDs, it can avoid a lookup in Lucene to check whether a document with the same ID already exists, and use Lucene's addDocument method instead of updateDocument for indexing. The optimization is kept safe against internal retries and out-of-order arrival of writes on both primary and replica shards through a few extra techniques: the coordinating node adds a timestamp to these index requests and flags retries, allowing the shards to detect out-of-order and duplicate arrivals. For replicas, additional measures have been put in place to account for a different type of out-of-order arrival, between a possible follow-up delete of the document and the original indexing operation.
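
Conceptually, the primary's decision looks roughly like the simplified sketch below; this is not the actual InternalEngine code, which has more cases.

    // Simplified sketch of the auto-generated-ID optimization on the primary;
    // the real logic lives in InternalEngine and handles more situations.
    static boolean canUseAddDocument(boolean idIsAutoGenerated,
                                     boolean isRetry,
                                     long requestAutoIdTimestamp,
                                     long maxSeenAutoIdTimestamp) {
        // Only safe when the ID was auto-generated, the request is not an
        // internal retry, and no newer auto-ID request has been seen yet;
        // otherwise fall back to the ID lookup + updateDocument path.
        return idIsAutoGenerated
            && isRetry == false
            && requestAutoIdTimestamp > maxSeenAutoIdTimestamp;
    }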

Elasticsearch also uses another optimization, which was introduced to speed up CCR follower indices, where the auto-generated ID optimization is not applicable. This newer optimization, which also works for non-auto-generated IDs, has the primary on the leader shard determine whether an indexing operation requires an update in Lucene, and has it record the maximum sequence number of operations that have resulted in updates or deletes (maxSeqNoOfUpdatesOrDeletes). This information is then relayed with the actual operations to the follower index, allowing its shards to skip lookups in cases where the sequence number of the operation to index is higher than maxSeqNoOfUpdatesOrDeletes (plus some extra conditions for out-of-order arrivals).
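
Again as a simplified sketch (the real check in the engine has a few more conditions for out-of-order arrivals):

    // Simplified sketch of the sequence-number based optimization: if an
    // operation's seqNo is above the highest seqNo that ever produced an
    // update or delete, no older copy of that document can be in Lucene,
    // so the engine can use addDocument without an ID lookup.
    static boolean canSkipIdLookup(long operationSeqNo, long maxSeqNoOfUpdatesOrDeletes) {
        return operationSeqNo > maxSeqNoOfUpdatesOrDeletes;
    }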

These two optimizations, which have slightly different fields of application, contribute significantly to the complexity of the code in InternalEngine. To simplify things at the conceptual as well as the code level, we adjusted the scope of these optimizations. The scope of the first optimization has been reduced to primaries (of non-follower indices). The second optimization has been extended from follower indices to replicas of regular indices as well.

There are several benefits to this approach. The second optimization now serves as a full replacement for the first optimization on replicas. The extra code guaranteeing safety for the first optimization on replicas is no longer needed and has been removed. Benchmarks show that performance is on par for a workload with auto-generated IDs. The new optimization also allows replicas to index faster in the case of an append-only workload with non-auto-generated IDs; we measured an indexing throughput increase of around 13 percent. There isn't currently a nightly benchmark for this scenario (non-auto-generated IDs + replicas) on our official benchmarking page (most of them use auto-generated IDs), but we contributed a PR to Rally that will make it easier to run such benchmarks in the future.

Lucene

Lucene 8.2.0

We had started discussing an 8.1.2 release one month ago. However, it got delayed several times and we eventually decided to cancel it and focus on 8.2.0 instead, which we had also started talking about a couple of weeks ago. Ignacio will be the release manager.

First iteration of XYShape is in

A first iteration of XYShape was merged, which enables indexing shapes in a cartesian space. This will help index non-geo data such as locations on a soccer field or in a video game, and will be a building block for supporting projections.
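
As a rough usage sketch, indexing a cartesian polygon could look like the snippet below. The class and method names (XYShape, XYPolygon, createIndexableFields) follow the current shape of the API and may not match the first iteration exactly.

    // Rough sketch, API names may differ slightly from the first iteration.
    // Uses org.apache.lucene.document.{Document, XYShape},
    // org.apache.lucene.geo.XYPolygon and org.apache.lucene.index.IndexableField.
    static Document cartesianDoc() {
        float[] xs = { 0f, 105f, 105f, 0f, 0f };   // e.g. a 105x68 soccer pitch
        float[] ys = { 0f, 0f, 68f, 68f, 0f };
        XYPolygon pitch = new XYPolygon(xs, ys);

        Document doc = new Document();
        for (IndexableField field : XYShape.createIndexableFields("pitch", pitch)) {
            doc.add(field);
        }
        return doc;   // ready to pass to IndexWriter#addDocument
    }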

Should we index position lengths?

The fact that we don't index position lengths means that some queries are incorrect. For instance, if a token stream has A with a position length of 2 followed by B and it gets indexed, a phrase query on "A B" would not find it, because it would expect B to occur at the position of A plus 1. And the query has no way to know that it is doing the wrong thing, since position lengths are not stored in the index: it only sees that B is at the position of A plus 2, not that A also has a position length of 2.
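
A tiny worked example of the position arithmetic:

    // Token graph: A spans two positions (position length 2), then B follows.
    // Only positions are indexed; position lengths are not.
    int posA = 0;
    int posLenA = 2;                   // not stored in the index
    int posB = posA + posLenA;         // B is indexed at position 2
    int phraseExpectation = posA + 1;  // PhraseQuery "A B" looks for B at position 1
    // posB != phraseExpectation, so the phrase does not match even though
    // B directly follows A in the token graph.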

However, storing position lengths in the index also introduces problems. First, it would make the implementation of phrase queries more complicated, in particular because position ends (the sum of the position and the position length) would not come in order, and because a term could occur multiple times at the same position with different lengths. Furthermore, this would possibly make the notion of phrase slop more confusing.

We agreed not to index position lengths for now, and to start by storing position lengths in payloads instead, in order to evaluate the impact this would have on queries.
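
One way such an experiment could look (a minimal sketch, not an agreed-upon design) is a token filter that copies each token's position length into its payload, so that a query could later read it from the postings:

    // Minimal sketch of the payload experiment, not an agreed-upon design.
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
    import org.apache.lucene.util.BytesRef;

    final class PositionLengthPayloadFilter extends TokenFilter {
        private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

        PositionLengthPayloadFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws java.io.IOException {
            if (input.incrementToken() == false) {
                return false;
            }
            // Store the position length as a single byte (assumes it fits).
            payloadAtt.setPayload(new BytesRef(new byte[] { (byte) posLenAtt.getPositionLength() }));
            return true;
        }
    }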

Faster phrase queries when hit counts are not needed

Phrase queries already had an optimization that consists of not reading positions when the minimum term frequency across query terms is not high enough for a hit to be competitive (annotation CJ). We introduced a new optimization that can skip entire blocks of documents by merging impacts across query terms to derive the maximum score that the phrase query could have over these blocks. The efficiency depends a lot on how likely query terms are to occur next to one another, but it helped a lot on the phrase queries that we use for nightly benchmarks (annotation CQ).
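
The key observation, sketched below in simplified form (this is not the actual Lucene code, and ScoreUpperBound is a hypothetical hook standing in for the Similarity), is that the frequency of a phrase in a document can never exceed the frequency of its rarest term, so the per-block maximum term frequencies recorded in the impacts bound the best possible phrase score for the whole block:

    // Simplified sketch: decide whether a block of documents can be skipped
    // entirely without reading positions.
    interface ScoreUpperBound {               // hypothetical wrapper around the Similarity
        float scoreUpperBound(int maxFreq);
    }

    static boolean blockMightBeCompetitive(int[] maxTermFreqPerTermInBlock,
                                           float minCompetitiveScore,
                                           ScoreUpperBound similarity) {
        int maxPossiblePhraseFreq = Integer.MAX_VALUE;
        for (int maxTermFreq : maxTermFreqPerTermInBlock) {
            // The phrase cannot occur more often than its least frequent term.
            maxPossiblePhraseFreq = Math.min(maxPossiblePhraseFreq, maxTermFreq);
        }
        // If even the most optimistic score is below the current threshold,
        // the whole block can be skipped.
        return similarity.scoreUpperBound(maxPossiblePhraseFreq) >= minCompetitiveScore;
    }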

Other

 - Lucene got a specialized collector for large numbers of hits.

 - The change to our SPI mechanism for analysis factories would be a breaking change in 8.2, and we are considering reverting it from 8.x.

 - Specializing point-in-polygon queries yields a 10% speedup.

Changes

Changes in Elasticsearch

Changes in 8.0:

  • Enable Kotlin compiler to recognise @org.elasticsearch.common.Nullable annotation for null checks. #43912
  • BREAKING: Do not set a NameID format in Policy by default #44090

Changes in 7.4:

  • Fix X509AuthenticationToken principal #43932
  • Pass tests.jvms system property to test tasks for maxParallelForks #44237
  • Ignore test seed when flag is passed #44234
  • Add l1norm and l2norm distances for vectors #44116
  • Geo: add validator that only checks altitude #43893
  • JSON logging refactoring and X-Opaque-ID support #41354
  • Fix decimal point parsing for date_optional_time #43859

Changes in 7.3:

  • [ML][Data Frame] adds index validations to _start data frame transform #44191
  • Fix eclipse project file generation #44080
  • [ML-DataFrame] Add a frequency option to transform config, default 1m #44120
  • Check again on-going snapshots/restores of indices before closing #43873
  • [ML] Data frame task failure do not make a 500 response #44058

Changes in 7.2:

  • SQL: add pretty printing to JSON format #43756
  • SQL: handle double quotes escaping #43829
  • SQL: handle SQL not being available in a more graceful way #43665
  • SQL: change the size of the list of concrete indices when resolving multiple indices #43878

Changes in 6.8:

  • SQL: Handle the edge case of an empty array of values to return from source #43868
  • Do not copy initial recovery filter during split #44053

Changes in Elasticsearch Hadoop Plugin

Changes in NO VERSION LABELS:

  • [DOCS] Updates documentation version #1313

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.4:

  • Logging: descriptors setting logged back on debug level #165

Changes in Rally

Changes in 1.3.0:

  • BREAKING: Drop support for Elasticsearch 1.x #716

Changes in Rally Tracks

  • Update target throughput for nested randomized-sorted-term-queries #82

  • Allow configure conflicts for updates #81