This Week in Elasticsearch and Apache Lucene - 2019-07-05

Elasticsearch

Snapshot Lifecycle Management (SLM) merged

We have opened a PR (#43934) to merge SLM into master.

Enrich processor

We improved the enrich processor so that it now executes searches asynchronously and in parallel. Since the enrich processor could overwhelm nodes with many searches, we have introduced an internal search proxy API to control the number of concurrent searches that the enrich processor can execute. The enrich processor will use this instead of the search API directly. This internal API coordinates searches by initially queueing the search requests and then consuming them from the queue in parallel, but with a fixed number of outgoing requests, in order to avoid overwhelming nodes. The internal proxy API also tries to include as many search requests as possible (up to a defined maximum) inside a multi search request in order to reduce the number of remote API calls made from the node that performs ingestion. This specifically helps ingest-only nodes optimize enriching documents, since the reference data in those nodes is never locally available. (#43801)
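The queue-then-batch behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual internal API: the class name `SearchCoordinator`, the `send_multi_search` hook, and both limits are hypothetical names chosen for the example.

```python
from collections import deque


class SearchCoordinator:
    """Queues search requests and drains them as bounded multi-search batches.

    Hypothetical sketch: `send_multi_search` stands in for whatever sends a
    multi search request; the caller reports completion via on_batch_done().
    """

    def __init__(self, send_multi_search, max_in_flight=2, max_batch_size=3):
        self._send = send_multi_search
        self._queue = deque()
        self._in_flight = 0
        self._max_in_flight = max_in_flight    # fixed number of outgoing requests
        self._max_batch_size = max_batch_size  # max searches per multi search

    def submit(self, request):
        # Always queue first, then try to send.
        self._queue.append(request)
        self._drain()

    def on_batch_done(self):
        # A multi search finished, so one outgoing slot is free again.
        self._in_flight -= 1
        self._drain()

    def _drain(self):
        while self._queue and self._in_flight < self._max_in_flight:
            size = min(self._max_batch_size, len(self._queue))
            batch = [self._queue.popleft() for _ in range(size)]
            self._in_flight += 1
            self._send(batch)


# Usage: one outgoing slot, up to three searches per multi search request.
sent = []
coordinator = SearchCoordinator(sent.append, max_in_flight=1, max_batch_size=3)
for request_id in range(5):
    coordinator.submit(request_id)
coordinator.on_batch_done()
coordinator.on_batch_done()
```

In this sketch the first request goes out alone, and the requests that queue up behind it are then folded into larger batches, which is the effect that reduces remote calls from the ingest node.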

We added a background process for cleaning up abandoned enrich indices, as well as some updates to the runner synchronization to ensure that cleaning up enrich indices doesn't clobber currently executing enrich policies. (#43746) We also added a fix to the runner process that expands the enrich index replicas after the force merge is completed. (#43600)

Watcher

We implemented a long-standing request for Watcher to be able to execute actions for each item in an array. For example, you can send a Slack message for each item returned from a search. (#41997)
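As a rough illustration, an action that runs once per search hit might be defined along these lines. The `foreach` field and the `ctx.payload._id` template variable reflect our reading of the Watcher documentation and should be checked against the docs for your version; the action name `log_each_hit` is made up, and a logging action is used here instead of Slack only to keep the example free of account configuration.

```python
import json

# Sketch of the "actions" part of a watch body (assumed field names, see above).
watch_actions = {
    "log_each_hit": {
        # Path to the array in the watch payload; the action runs once per element.
        "foreach": "ctx.payload.hits.hits",
        "logging": {
            # Inside the action, ctx.payload refers to the current array element.
            "text": "Found document {{ctx.payload._id}}"
        },
    }
}

watch_body_fragment = json.dumps({"actions": watch_actions}, indent=2)
print(watch_body_fragment)
```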

Cluster coordination

We are working on a possible safety issue where we are incorrectly counting votes in our coordination subsystem from nodes that have previously been master-eligible, but have subsequently been repurposed and are not master-eligible anymore.

We fixed a corner case where a voting-only master node, when being the only bootstrapped node, would not step down as master after state transfer because publishing to self would succeed. The reason for this is that cluster state publications to the master itself don't go through the transport layer and can therefore not be intercepted by the voting-only functionality.

Geo

We are wrapping up the ShapeBuilder refactoring effort, and are fixing an issue with libs/geo parser being too restrictive.

We opened a PR to add LineString/MultiLineString support to geoshape doc-values, and pushed a draft WIP of the Circle Processor. This is an ingest processor that transforms a WKT/ES-GEOJSON circle into an n-gon polygon approximation, which is needed to work with the new BKD shapes. We refactored the libs/geo parser to return a geometry format object that can perform both serialization and deserialization, which makes life easier for the processor.
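To give a feel for what an n-gon approximation of a circle involves, here is a small sketch that places vertices at equal bearings around the center using the standard spherical destination-point formula. It is not the Circle Processor's algorithm; the function name, the spherical Earth model, and the (lon, lat) output order are assumptions made for the example.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius, spherical model (assumption)


def circle_to_polygon(lat, lon, radius_m, sides):
    """Approximate a circle by a closed ring of `sides` vertices as (lon, lat)."""
    if sides < 3:
        raise ValueError("a polygon needs at least 3 sides")
    lat1, lon1 = math.radians(lat), math.radians(lon)
    d = radius_m / EARTH_RADIUS_M  # angular distance from center to each vertex
    ring = []
    for i in range(sides):
        bearing = 2 * math.pi * i / sides  # 0 is due north, increasing clockwise
        lat2 = math.asin(
            math.sin(lat1) * math.cos(d)
            + math.cos(lat1) * math.sin(d) * math.cos(bearing)
        )
        lon2 = lon1 + math.atan2(
            math.sin(bearing) * math.sin(d) * math.cos(lat1),
            math.cos(d) - math.sin(lat1) * math.sin(lat2),
        )
        ring.append((math.degrees(lon2), math.degrees(lat2)))
    ring.append(ring[0])  # close the ring, as polygon formats expect
    return ring


ring = circle_to_polygon(lat=0.0, lon=0.0, radius_m=1000.0, sides=8)
```

The sketch does not normalize longitudes near the antimeridian or handle circles that cover a pole, both of which a real implementation would need to consider.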

Rollups

We opened a bugfix for an issue where rollup doesn't create the right mapping/metadata when index templates are used. Rollup was accidentally using an "expert" API on the CreateIndexRequest that is semi-lenient about syntax. It accepts typeless mappings, except when templates are also used, in which case the templates and the mappings don't merge correctly.

Analytics

We merged a new rare terms aggregation. Documentation is here.

We merged a PR to make "missing" parsing more flexible when fields are unmapped. Aggregations accept a "missing" parameter to fill in empty field values. But if the field is also unmapped, there is no real type information to inform how the agg should run (e.g. string or long terms? numeric or ip range? etc). The prior code would just fall through to "bytes" which can be problematic. The PR adds a hook so aggs can parse the missing object and determine their own ValuesSource.

Watcher UI merged

We have merged the Watcher UI. This effort contributes to our wider migration from Angular to React, adds threshold alert support for 4 new action types (index, webhook, jira, pagerduty) as well as new comparators, and fixes several long-standing bugs.

Snapshot/Restore UI merged

We have merged the Snapshot and Restore UI. We also adjusted SR permissions logic, added a link to index settings docs from SR restore wizard fields, and prevented users from deleting snapshots stored in a Cloud-managed repository from the UI.

Search profiler now accepts triple-quoted input

We improved the Search Profiler to accept triple-quoted input.

OpenID Connect

We fixed a bug in our OpenID Connect implementation regarding how we encode credentials for Basic Authentication.

Lucene

Lazy-loaded frequencies

We addressed a long-standing TODO in the lower levels of Lucene, which provided a nice speedup for certain queries. Term frequencies that are used for scoring are now only loaded if they are really needed by the iterating docs or the impacts enum.

Impacts on phrases

The fact that Lucene stores raw term frequencies and norms as impacts means it's possible to leverage them for phrases. For instance, an impact that consists of (termFreq=5,norm=10) means that documents that have 5 occurrences or more of this term also have a norm that is greater than or equal to 10. Because the frequency of the phrase can never be greater than the frequency of any of its terms, these impacts can then be merged across all terms of the phrase query.
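One way to picture the merge is the sketch below. It is not Lucene's implementation, and it assumes a particular reading of impacts: every document in a block has, for each term, at least one listed impact (freq, norm) such that the document's term frequency is at most freq and its norm is at least norm. Under that assumption, the phrase frequency of a document with a given norm is bounded by the smallest of the per-term frequency bounds. The function names are hypothetical.

```python
def freq_bound(impacts, norm):
    """Largest freq among impacts whose norm is <= `norm`, or None if there is none."""
    best = None
    for freq, impact_norm in impacts:
        if impact_norm <= norm and (best is None or freq > best):
            best = freq
    return best


def merge_phrase_impacts(per_term_impacts):
    """Merge per-term impact lists into impacts that bound the phrase frequency."""
    norms = sorted({n for impacts in per_term_impacts for _, n in impacts})
    merged = []
    for norm in norms:
        bounds = [freq_bound(impacts, norm) for impacts in per_term_impacts]
        if any(b is None for b in bounds):
            continue  # some term has no impact covering this norm
        freq = min(bounds)  # phrase freq cannot exceed any single term's freq
        # Keep only impacts that are not dominated by an earlier one
        # (an earlier entry with the same freq and a smaller norm scores higher).
        if not merged or freq > merged[-1][0]:
            merged.append((freq, norm))
    return merged


term_a = [(2, 4), (5, 10)]
term_b = [(3, 6), (4, 12)]
phrase_impacts = merge_phrase_impacts([term_a, term_b])
```

The merged list can then be fed to the same score upper-bound logic that is used for single terms.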

Can index sorting help block-max WAND?

Block-max WAND relies on a feedback loop between the collector and the scorer: the collector regularly tells the scorer the minimum score that a hit needs to produce in order to make it to the top-k hits, and the scorer uses this information to skip documents that can't possibly produce competitive scores. The higher the required minimum score, the more hits can be skipped. This raises the question of whether we can organize the index order in such a way that scores converge more quickly, which should then help the block-max WAND logic. Scores have two components per document: term frequency, and a length normalization factor. Term frequencies are not practical since they are different for every term, but what if we sorted the index by increasing norm? We wrote a patch that allows index sorting to be configured to sort by norm. We compared query speed against an index that was built in no particular order and the results were encouraging: queries ran up to twice as fast.
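The feedback loop can be illustrated with a toy model in which documents are grouped into blocks, each with a known maximum score. This is a deliberately simplified sketch, not Lucene's scorer: scores are given directly, the block maximum is computed on the fly rather than stored in the index, and the function name is made up.

```python
import heapq


def top_k_with_block_skipping(blocks, k):
    """Return (top-k hits as (score, doc_id), number of documents actually scored).

    `blocks` is a list of non-empty lists of (doc_id, score) pairs.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    heap = []  # min-heap of (score, doc_id); heap[0] holds the minimum competitive score
    scored = 0
    for block in blocks:
        block_max = max(score for _, score in block)  # stored per block in a real index
        if len(heap) == k and block_max <= heap[0][0]:
            continue  # no hit in this block can enter the top k, so skip it
        for doc_id, score in block:
            scored += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), scored


# Same documents, two index orders: low scores first vs. high scores first.
low_first = [[(0, 1.0), (1, 2.0)], [(2, 3.0), (3, 4.0)], [(4, 9.0), (5, 8.0)]]
high_first = list(reversed(low_first))
hits_low, scored_low = top_k_with_block_skipping(low_first, k=2)
hits_high, scored_high = top_k_with_block_skipping(high_first, k=2)
```

Both orders return the same top hits, but when the highest-scoring documents come first the threshold rises immediately and the remaining blocks are skipped, which is the intuition behind sorting the index so that scores converge quickly.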

BKD-Tree improvements

We committed several improvements this week. We now use data dimensions as a tie-breaker when partitioning. For duplicated values, we use run-length compression. Along the same lines, the Points API changed to allow queries to leverage the fact that there are duplicates on a leaf. We also now use the new Points API in certain queries.
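The run-length idea for duplicated values can be sketched as follows. This only illustrates the general technique on a sorted list of values; it says nothing about the actual on-disk encoding used for BKD leaves, and the function names are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive duplicates into (value, count) runs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((value, 1))  # start a new run
    return runs


def run_length_decode(runs):
    """Expand (value, count) runs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


leaf_values = [7, 7, 7, 9, 9, 12]
encoded = run_length_encode(leaf_values)
```

A leaf with many duplicates shrinks to a handful of runs, and a query that knows a value repeats `count` times can decide once for the whole run instead of once per point.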

New issues are already open to optimize point queries in BKD-backed geo shapes. Currently, we use bounding box queries in these cases, but using a specialized query yields a performance boost of around 10%.

The XYShape feature is now in the review phase.

Other

  • BooleanQuery.maxBooleanClauses() now uses IndexSearcher instead of a global boolean query limit, and uses query visitors to get a better count of the number of leaves. The next step is to investigate whether we can make the limit per-searcher rather than static.
  • Luwak has been merged into the main line of Lucene as the monitor submodule.
  • Intervals queries have moved from the sandbox into the queries module. We also changed wildcard() and prefix() intervals to take a BytesRef rather than a String, which integrates better with Analyzer.normalize().

Changes

Changes in Elasticsearch

Changes in 8.0:

  • Ensure test cluster classpath inputs have predictable ordering #43938
  • BREAKING: Remove the client transport profile filter #43236
  • Enable caching of rest tests which use integ-test distribution #43782

Changes in 7.4:

  • SmokeTestWatcherWithSecurityIT: Retry if failures searching .watcher-history #43781
  • Refactor index engines to manage readers instead of searchers #43860
  • BREAKING: Provide an Option to Use Path-Style-Access with S3 Repo #41966
  • Async IO Processor release before notify #43682

Changes in 7.3:

  • [ML-DataFrame] audit message missing for autostop #43984
  • Ensure to access RecoveryState#fileDetails under lock #43839
  • Actually close IndexAnalyzers contents #43914
  • Shortcut simple patterns ending in * #43904
  • Deprecate transport profile security type setting #43237
  • Add _reload_search_analyzers endpoint to HLRC #43733
  • Watcher: Allow to execute actions for each element in array #41997
  • Refresh translog stats after translog trimming in NoOpEngine #43825
  • Support builtin privileges in get privileges API #42134
  • Use separate BitSet cache in Doc Level Security #43669
  • Always attach system user to internal actions #43468
  • Add dims parameter to dense_vector mapping #43444
  • [ML][Data Frame] add node attr to GET _stats #43842
  • [ML-Data Frame] Add data frame transform cluster privileges to HLRC #43879
  • [ML][Data Frame] fix progress measurement for continuous transforms #43838
  • Return reloaded analyzers in _reload_search_analyzers response #43813
  • Remove sort by primary term when reading soft-deletes #43845
  • [ML][Data Frame] Add deduced mappings to _preview response payload #43742
  • Upgrade to Gradle 5.5 #43788
  • Add RareTerms aggregation #35718
  • Expose translog stats in ReadOnlyEngine #43752
  • Add "manage_api_key" cluster privilege #43728
  • Make peer recovery send file info step async #43792
  • Avoid parallel reroutes in DiskThresholdMonitor #43381
  • Make peer recovery clean files step async #43787
  • Consistent Secure Settings #40416
  • Trim translog for closed indices #43156
  • Refactor IndexSearcherWrapper to disallow the wrapping of IndexSearcher #43645
  • [ML][Data Frame] removing format support in date_histogram group_by #43659
  • Wildcard intervals #43691
  • Add support for 'flattened object' fields. #42541

Changes in 7.2:

  • Fix index_prefix sub field name on nested text fields #43862
  • Fix credential encoding for OIDC token request #43808

Changes in 6.8:

  • Prevent types deprecation warning for indices.exists requests #43963
  • Fix wrong logic in match_phrase query with multi-word synonyms #43941
  • AsyncIOProcessor preserve thread context #43729

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.4:

  • Add timing logging for the REST request #162

Changes in 7.2:

  • Remove support for the frozen indices param #161

Changes in Rally Tracks

  • Change default store_type to fs #80