This Week in Elasticsearch and Apache Lucene - 2019-09-09

Elasticsearch

Snapshot Resilience

We've been working on a change to limit the effects of AWS S3's eventual consistency model on our snapshot implementation. The idea is to change the snapshot metadata format in the repository slightly so that the metadata at the root of the repository includes references to the state of each shard. This will speed up a number of snapshot operations because it avoids expensive LIST API calls on the shard directories.
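
To make the idea concrete, here is a minimal Java sketch (names and blob layout invented for illustration; this is not the actual repository format): if the root-level metadata records which generation of each shard's metadata blob is current, the repository can read that blob directly instead of issuing an eventually-consistent LIST call against the shard directory to discover it.

```java
import java.util.Map;

// Illustration only: root-level metadata that records, per shard, the generation of
// its shard-level metadata blob, so the blob name can be computed directly instead of
// being discovered via a LIST call on the shard directory.
public class ShardGenerationsSketch {

    static String shardMetadataBlob(Map<String, Long> shardGenerations, String indexId, int shardId) {
        long generation = shardGenerations.get(indexId + "/" + shardId);
        // The blob is directly addressable; no LIST of the shard directory is needed.
        return "indices/" + indexId + "/" + shardId + "/index-" + generation;
    }

    public static void main(String[] args) {
        Map<String, Long> shardGenerations = Map.of("my-index-uuid/0", 42L);
        System.out.println(shardMetadataBlob(shardGenerations, "my-index-uuid", 0));
        // -> indices/my-index-uuid/0/index-42
    }
}
```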

After moving the S3 repository integration tests to an HTTP fixture last week, which lets us simulate unreliable connections to S3 repositories, we did the same for Azure and GCS this week. This work allows us to verify the SDKs' retry behaviors, on which we rely heavily for a resilient snapshot implementation.

Vector Search

We merged the feature branch that optimizes the vector similarity functions. We ran some benchmarks to evaluate the gain, which showed close to 3x speedups with these optimizations. We plan to continue exploring other ways to optimize brute-force vector search, possibly by introducing API changes.
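
For context on what brute-force vector search means here, below is a minimal, self-contained Java sketch (an illustration, not the Elasticsearch implementation) that scores every document vector against the query vector with cosine similarity. Since this loop runs once per document, the cost of the similarity function dominates, which is exactly what the optimizations target.

```java
// Brute-force cosine similarity scoring: every document vector is compared against
// the query vector, so the per-comparison arithmetic is the hot path.
public class BruteForceCosine {

    static float cosine(float[] query, float[] doc) {
        float dot = 0f, queryNorm = 0f, docNorm = 0f;
        for (int i = 0; i < query.length; i++) {
            dot += query[i] * doc[i];
            queryNorm += query[i] * query[i];
            docNorm += doc[i] * doc[i];
        }
        return dot / (float) Math.sqrt(queryNorm * docNorm);
    }

    public static void main(String[] args) {
        float[] query = {0.1f, 0.9f, 0.4f};
        float[][] docs = {{0.2f, 0.8f, 0.5f}, {0.9f, 0.1f, 0.0f}};
        for (float[] doc : docs) {
            System.out.println(cosine(query, doc));
        }
    }
}
```

A common optimization for code like this is to precompute and cache vector norms so that only the dot product remains in the inner loop.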

Geo Shapes

In case you missed it, the new cartesian geometry was recently merged. This is a field that lets a user index arbitrary x/y coordinates without any kind of geospatial projection. At Elastic{On} 2017 we showcased a custom tilemap demo of the venue (Pier 48) overlaid with heatmaps of various data collected during the conference. We also demoed a custom tilemap for plotting NHL data (shots on goal, etc.). Both of these demos used geospatial latitude/longitude points, and while that worked, it wasn't ideal: in small, enclosed environments it is often more convenient to use an arbitrary coordinate system that is relative to the venue itself. The new cartesian geometry field is just a flat x/y plane without any geospatial baggage, and would have made those demos a lot easier to build. It should prove useful for indoor maps, video game coordinates, and similar use cases.

With the shape field now available, work is resuming in Lucene to add support for CONTAINS queries on BKD-backed shapes. We are also investigating an approach for adding spatial projection support to the geo_shape field type.

Benchmarking

To simulate realistic data volumes, we've improved the data generator in the rally-eventdata-track and added a new challenge, so it can now generate a user-defined logging volume per (simulated) day. We also added a new scheduler that lets us control throughput based on a specified target utilization; this is non-trivial because utilization depends on the specific characteristics of the benchmark candidate and cannot be determined upfront.
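
As a rough illustration of the scheduling idea (a sketch under the assumption that utilization is defined as service time divided by total elapsed time; Rally itself is written in Python and this is not its implementation):

```java
// Sketch of a utilization-based scheduler. Given a target utilization u and an
// observed service time s for the last request, waiting s * (1 - u) / u before
// issuing the next request keeps the client busy roughly a fraction u of the time.
public class TargetUtilizationScheduler {

    private final double targetUtilization; // e.g. 0.6 means "keep the system 60% busy"

    TargetUtilizationScheduler(double targetUtilization) {
        this.targetUtilization = targetUtilization;
    }

    long nextWaitMillis(long observedServiceTimeMillis) {
        // utilization = serviceTime / (serviceTime + waitTime)
        // => waitTime = serviceTime * (1 - utilization) / utilization
        return (long) (observedServiceTimeMillis * (1.0 - targetUtilization) / targetUtilization);
    }

    public static void main(String[] args) {
        TargetUtilizationScheduler scheduler = new TargetUtilizationScheduler(0.6);
        // If the last request took 300 ms, wait ~200 ms: 300 / (300 + 200) = 0.6 utilization.
        System.out.println(scheduler.nextWaitMillis(300));
    }
}
```

The point is that the scheduler adapts to whatever service time the benchmark candidate actually exhibits instead of requiring a throughput target to be fixed upfront.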

date_histogram performance

We are looking into a performance regression in the date_histogram aggregation. The user ran some more tests on their side, and it appears that 6.8 is twice as fast as 7.3 when the dates are close to each other, but 7.3 is faster when the dates are spread more evenly over a larger range. So this might not be a regression per se, but a (perhaps unanticipated) change in behavior. More investigation to come.

Point in Time Reader

We started to work on a replacement for scroll queries. The goal is to allow users to create a point in time reader that can be used for multiple requests. For instance, this will allow users to implement algorithms that require multiple passes over the data while ensuring that each pass returns the same set of documents, even if the indices are updated in the meantime. A hidden goal of this project is also to simplify the handling of the different phases in search. Today we have to cache a complex context object on the data node for the entire lifecycle of the search; with this change we plan to recreate that context on each search phase and keep only the index reader as state. This means that the state of a search will be kept on the coordinating node rather than on each individual node that runs a shard search.
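
At the Lucene level, the consistency guarantee comes from holding on to the same index reader, which is already a point-in-time view of the index. Here is a small plain-Lucene sketch of that idea (not the planned Elasticsearch API):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.ByteBuffersDirectory;

// A reader opened from the writer is a point-in-time view: repeated searches over it
// return the same documents even while the index keeps changing underneath.
public class PointInTimeSketch {
    public static void main(String[] args) throws Exception {
        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            doc.add(new StringField("id", "1", Field.Store.YES));
            writer.addDocument(doc);

            // Open a point-in-time view of the index.
            try (DirectoryReader pointInTime = DirectoryReader.open(writer)) {
                IndexSearcher searcher = new IndexSearcher(pointInTime);

                // The index keeps changing...
                Document doc2 = new Document();
                doc2.add(new StringField("id", "2", Field.Store.YES));
                writer.addDocument(doc2);

                // ...but every pass over the same reader sees exactly the same documents.
                System.out.println(searcher.search(new MatchAllDocsQuery(), 10).totalHits.value); // 1
                System.out.println(searcher.search(new MatchAllDocsQuery(), 10).totalHits.value); // 1
            }
        }
    }
}
```

The work described above is about exposing this kind of point-in-time view across multiple requests while recreating the rest of the search context on each phase, instead of caching it on the data nodes.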

Enrich

We added a tamper warning to the mapping metadata of enrich indices, as enrich indices are managed by Elasticsearch itself and should not be directly modified by users. #45996

We also uncovered an issue with the ingest framework where freshly ingested documents were being saved with the default request XContent type instead of the XContent type they were originally submitted with. This wreaked havoc on some new enrich code, which assumes that the carefully managed internal enrich indices store documents in SMILE format. #45799

Now that the above issue is resolved, Jimmy opened a PR to make sure that enrich documents from different source indices that share the same id do not clobber one another when being copied to an enrich index. #46348

Snapshot Lifecycle Management

We previously checked during SLM retention operations that no snapshots were in progress, waiting until they were done before proceeding. It turns out that more than just taking a snapshot can block deletions, so we added a commit that checks for all of the operations that can block snapshot deletion, such as repository cleanups, restores, and other deletions. #45992

Previous work added stats that are collected about SLM in general and on a per-policy basis. #45989 adds these stats to the output when retrieving a policy, so you can see at a glance whether the policy has been successfully taking and deleting snapshots.

SQL

We merged a small bug-fix PR for the new CBOR implementation in our ODBC driver, and another change that makes the driver always return row counts on result sets with cursors.

We also provided two bug fixes for the conditional functions/statements, one for IIF and one for CASE.

Role Validation

For a long time we have allowed roles to be created with invalid values, such as a typo in the name of a privilege. However, if a user who was assigned that role attempted to authenticate to Elasticsearch, they would get an error and their request would fail. This is the wrong experience, because the feedback is given to the wrong person at the wrong time.

We have two open PRs to validate privilege names and document level security queries at role creation time.

Lucene

Merge at refresh time?

When doing near-real-time search, frequent refreshes might create a lot of tiny segments that add overhead to searches. Might we want to merge these segments before opening a new NRT reader, in order to get a more search-efficient point-in-time view of the index? If these segments are small, the merge would be very cheap.
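
As a small illustration of the problem (plain Lucene, using an in-memory index): opening an NRT reader after every small batch of documents leaves each batch in its own tiny segment, and every search has to visit all of them.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

// Illustration of how frequent NRT refreshes pile up tiny segments: each refresh after
// a single added document leaves that document in its own segment until a merge runs.
public class TinySegmentsSketch {
    public static void main(String[] args) throws Exception {
        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int i = 0; i < 5; i++) {
                Document doc = new Document();
                doc.add(new StringField("id", Integer.toString(i), Field.Store.NO));
                writer.addDocument(doc);
                // "Refresh": open a new NRT reader and count the segments a search would visit.
                try (DirectoryReader reader = DirectoryReader.open(writer)) {
                    System.out.println("segments after refresh " + i + ": " + reader.leaves().size());
                }
            }
        }
    }
}
```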

Faster parallel search

Lucene has long had the ability to search indices concurrently by assigning sets of segments to different threads and merging the results at the end. However, this way of searching doesn't leverage recent optimizations like index sorting and block-max WAND as much as it could. We are looking at how we can share information between threads, so that information collected on one thread can help skip non-competitive matches on another. This sounds like an obvious win, but we need to be careful not to add too much contention, so as not to penalize the worst-case scenario in which few hits can be skipped.

A first step is to share a counter between all threads so that we can stop counting hits as soon as we reach the threshold across all threads. We are now looking at sharing information about the minimum requirements for a hit to be competitive, which is more challenging.
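
Here is a rough sketch of the shared-counter idea (an illustration, not Lucene's collector code): each slice of segments is collected on its own thread, but the hit count is tracked globally, so every thread can stop counting as soon as the threshold is reached overall.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// All collection threads share one counter; once the global threshold is reached,
// every thread stops counting instead of exhausting its own slice.
public class SharedHitCountSketch {

    static final int THRESHOLD = 1_000;
    static final AtomicLong globalHitCount = new AtomicLong();

    static void collectSlice(int hitsInSlice) {
        for (int i = 0; i < hitsInSlice; i++) {
            if (globalHitCount.get() >= THRESHOLD) {
                return; // the shared threshold is already reached: stop counting on this thread
            }
            globalHitCount.incrementAndGet();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int hits : List.of(600, 700, 800, 900)) {
            pool.submit(() -> collectSlice(hits));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // Roughly THRESHOLD rather than 3000: threads stopped once the shared threshold was hit.
        System.out.println(globalHitCount.get());
    }
}
```

The count may slightly overshoot the threshold, which is fine for "at least this many hits" semantics; the harder part mentioned above is sharing score bounds between threads rather than just counts.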

Other

 - We are speeding up INTERSECTS queries of shapes against boxes and polygons by making smarter decisions on the inner nodes of the KD tree.

 - FSTs could speed up lookups by encoding dense nodes as a hash table instead of sorting nodes at index time and using binary search at search time.

 - The "Direct" doc-value format, which stored everything in memory and did not support sparse values, was costly to maintain and brought little value, so we removed it.

 - We optimized rescorers in the case that only a small subset of the hits need to be rescored.

 - We are removing duplicate work from WITHIN and CONTAINS queries.

 - We're fixing the Korean analyzer to split tokens on boundaries between digits and alphabetic characters.

 - The upgrade to ICU 62.1 was incomplete.

 - oal.store.Directory is soon going to get a wrapper that keeps track of I/O activity.

 - We are working on getting Lucene's GeoJSON parser to better handle string arrays.

 - We added a way to find all points that are inside a polygon using doc values, so that such queries can run faster when combined with a selective filter thanks to IndexOrDocValuesQuery.

Changes

Changes in Elasticsearch

Changes in 8.0:

  • BREAKING: Decouple shard allocation awareness from search and get requests #45735
  • BREAKING: Fixed synchronizing inflight breaker with internal variable #40878

Changes in 7.5:

  • Adjacency_matrix aggregation memory usage optimisation. #46257
  • Clarify error message on keystore write permissions #46321
  • Support geotile_grid aggregation in composite agg sources #45810
  • [ML][Transforms] allow executor to call start on started task #46347
  • Do not send recovery requests with CancellableThreads #46287
  • Bwc testclusters all #46265
  • First round of optimizations for vector functions. #46294
  • Reset queryGeometry in ShapeQueryTests #45974

Changes in 7.4:

  • Initialize document subset bit set cache used for DLS #46211
  • Build: Enable testing without magic comments #46180

Changes in 7.3:

  • SQL: Fix issue with IIF function when condition folds #46290
  • Multi-get requests should wait for search active #46283
  • [ML-DataFrame] Fix off-by-one error in checkpoint operations_behind #46235
  • SQL: Fix issue with DataType for CASE with NULL #46173

Changes in 6.8:

  • Suppress warning logs from background sync on relocated primary #46247
  • Do not use ifSeqNo if doc does not have seq_no #46198

Changes in Elasticsearch Hadoop Plugin

Changes in NO VERSION LABELS:

  • [DOCS] Updates version attributes filename #1339

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.4:

  • SQLRowCount: always return row set size #182

Changes in Rally

Changes in 1.3.0:

  • Allow to retry until success #761
  • Show track and team revision when listing races #759
  • Improve error message on SSL errors #758
  • Improve logging of schedules #760