05 July 2019

This Week in Elasticsearch and Apache Lucene - 2019-07-05

Mike Baamonde

Elasticsearch

Snapshot Lifecycle Management (SLM) merged

We have opened a PR (#43934) to merge SLM into master.

Enrich processor

We improved the enrich processor so that it executes searches asynchronously and in parallel. Because the enrich processor could overwhelm nodes with many searches, we have introduced an internal search proxy API that limits how many searches the enrich processor can execute concurrently. The enrich processor uses this API instead of calling the search API directly. The internal API coordinates searches by first queueing the search requests and then consuming them from the queue in parallel, but with a fixed number of outgoing requests, in order to avoid overwhelming nodes. It also tries to include as many search requests as possible (up to a defined maximum) inside a single multi search request in order to reduce the number of remote API calls made from the node that performs ingestion. This specifically helps ingest-only nodes optimize enriching documents, since the reference data is never locally available on those nodes. (#43801)
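The queue-then-batch behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python and not the actual Elasticsearch code: the `SearchCoordinator` class, its parameters, and the `fake_msearch` stand-in are all hypothetical names chosen for this example.

```python
import asyncio
from collections import deque


class SearchCoordinator:
    """Hypothetical stand-in for the internal search proxy (illustrative only)."""

    def __init__(self, msearch, max_batch_size=3, max_in_flight=2):
        self._msearch = msearch            # async callable: list of requests -> list of responses
        self._max_batch_size = max_batch_size
        self._max_in_flight = max_in_flight
        self._queue = deque()              # pending (request, future) pairs
        self._in_flight = 0                # outgoing multi search requests right now
        self.peak_in_flight = 0
        self.batches_sent = []

    async def search(self, request):
        # Callers never hit the search API directly; they enqueue and wait.
        future = asyncio.get_running_loop().create_future()
        self._queue.append((request, future))
        self._drain()
        return await future

    def _drain(self):
        # Send batches while there is queued work and a free outgoing slot.
        while self._queue and self._in_flight < self._max_in_flight:
            size = min(self._max_batch_size, len(self._queue))
            batch = [self._queue.popleft() for _ in range(size)]
            self._in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
            asyncio.get_running_loop().create_task(self._send(batch))

    async def _send(self, batch):
        requests = [request for request, _ in batch]
        self.batches_sent.append(requests)
        try:
            responses = await self._msearch(requests)
            for (_, future), response in zip(batch, responses):
                future.set_result(response)
        except Exception as exc:
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)
        finally:
            # Free the slot and pick up whatever queued in the meantime.
            self._in_flight -= 1
            self._drain()


async def fake_msearch(requests):
    # Stand-in for a remote multi search call.
    await asyncio.sleep(0)
    return [f"result:{request}" for request in requests]


async def demo():
    coordinator = SearchCoordinator(fake_msearch)
    results = await asyncio.gather(*(coordinator.search(f"q{i}") for i in range(7)))
    return results, coordinator
```

In this sketch, batching only kicks in under load: requests that arrive while all outgoing slots are busy accumulate in the queue and are sent together once a slot frees up.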

We added a background process for cleaning up abandoned enrich indices, and updated the runner synchronization to ensure that cleaning up enrich indices doesn't clobber currently executing enrich policies. (#43746) We also fixed the runner process so that it expands the enrich index replicas after the force merge has completed. (#43600)

Watcher

We implemented a long-standing request for Watcher to be able to execute actions for each item in an array. For example, you can send a Slack message for each item returned from a search. (#41997)
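As a rough illustration, a watch body using this feature might look like the Python dictionary below. It is a sketch under assumptions: the `foreach` field and the `ctx.payload._id` template reflect our understanding of #41997, the index pattern, query, and action name are hypothetical, and a logging action stands in for Slack so that no account configuration is needed.

```python
import json

# Hypothetical watch: run one logging action per search hit.
watch_body = {
    "trigger": {"schedule": {"interval": "5m"}},
    "input": {
        "search": {
            "request": {
                "indices": ["logs-*"],  # hypothetical index pattern
                "body": {"query": {"match": {"level": "error"}}},
            }
        }
    },
    "actions": {
        "log_each_hit": {
            # Path to the array in the payload; the action runs once per element
            # (assumed semantics, see the lead-in).
            "foreach": "ctx.payload.hits.hits",
            "logging": {"text": "Matched document {{ctx.payload._id}}"},
        }
    },
}

if __name__ == "__main__":
    print(json.dumps(watch_body, indent=2))
```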

Cluster coordination

We are working on a possible safety issue where we are incorrectly counting votes in our coordination subsystem from nodes that have previously been master-eligible, but have subsequently been repurposed and are not master-eligible anymore.

We fixed a corner case where a voting-only master node, when it was the only bootstrapped node, would not step down as master after state transfer because publishing to itself would succeed. The reason for this is that cluster state publications to the master itself don't go through the transport layer and therefore cannot be intercepted by the voting-only functionality.

Geo

We are wrapping up the ShapeBuilder refactoring effort, and are fixing an issue with libs/geo parser being too restrictive.

We opened a PR to add LineString/MultiLineString support to geoshape doc-values, and pushed a draft (WIP) of the Circle Processor. This is an ingest processor that transforms a WKT/ES-GEOJSON circle into an n-gon polygon approximation, which is needed to work with the new BKD shapes. We refactored the libs/geo parser to return a geometry format object that can perform both serialization and deserialization, which makes life easier for the processor.
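To give a feel for the circle-to-polygon idea, here is a minimal Python sketch. It is not the Circle Processor's algorithm: it assumes a small circle away from the poles and the dateline, uses a simple flat-earth approximation, and the function names and the mean Earth radius constant are choices made for this example.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (assumption for this sketch)


def circle_to_polygon(lon, lat, radius_m, sides=32):
    """Approximate a circle by a closed ring of `sides` vertices (lon, lat in degrees)."""
    if sides < 3:
        raise ValueError("a polygon needs at least 3 sides")
    d_lat = math.degrees(radius_m / EARTH_RADIUS_M)
    if abs(lat) + d_lat >= 90.0:
        raise ValueError("this flat approximation does not handle circles near the poles")
    # One degree of longitude shrinks by cos(latitude).
    d_lon = d_lat / math.cos(math.radians(lat))
    ring = []
    for i in range(sides):
        theta = 2.0 * math.pi * i / sides
        ring.append((lon + d_lon * math.cos(theta), lat + d_lat * math.sin(theta)))
    ring.append(ring[0])  # close the ring, as WKT and GeoJSON polygons require
    return ring


def to_wkt(ring):
    """Render a closed ring as a WKT POLYGON string."""
    coords = ", ".join(f"{x:.6f} {y:.6f}" for x, y in ring)
    return f"POLYGON (({coords}))"
```

Because the vertices sit on the circle, the polygon is inscribed and slightly underestimates the circle's area; increasing `sides` tightens the approximation at the cost of more vertices to index.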

Rollups

We opened a bugfix for an issue where rollup doesn't create the right mapping/metadata when index templates are used. Rollup was accidentally using an "expert" API on the CreateIndexRequest that is semi-lenient about syntax. It accepts typeless mappings, except when templates are also used, in which case the templates and the mappings don't merge correctly.

Analytics

We merged a new rare terms aggregation. Documentation is here.
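Conceptually, a rare terms aggregation returns the terms that appear in at most a given number of documents. The Python sketch below shows that idea with exact in-memory counting; the merged implementation may well use different, approximate data structures, and the `rare_terms` function here is purely illustrative.

```python
from collections import Counter


def rare_terms(docs, field_name, max_doc_count=1):
    """Return (term, doc_count) pairs for terms found in at most `max_doc_count` docs."""
    doc_counts = Counter()
    for doc in docs:
        values = doc.get(field_name, [])
        if not isinstance(values, (list, tuple, set)):
            values = [values]
        # Count each term once per document, even if it repeats within the doc.
        doc_counts.update(set(values))
    return sorted(
        (term, count) for term, count in doc_counts.items() if count <= max_doc_count
    )
```

For example, given documents with genres `rock`, `rock` and `jazz`, and `swing`, the call `rare_terms(docs, "genre")` keeps `jazz` and `swing` and drops `rock`, which occurs in two documents.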

We merged a PR to make "missing" parsing more flexible when fields are unmapped. Aggregations accept a "missing" parameter to fill in empty field values. But if the field is also unmapped, there is no real type information to inform how the agg should run (e.g. string or long terms? numeric or ip range? etc). The prior code would just fall through to "bytes" which can be problematic. The PR adds a hook so aggs can parse the missing object and determine their own ValuesSource.
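The idea of letting each aggregation decide how to interpret "missing" on an unmapped field can be sketched as below. Everything here is hypothetical (function names, the type labels, and the inference rules); it only illustrates the shape of such a hook, not the code in the PR.

```python
import ipaddress


def infer_type_from_missing(missing):
    """Hypothetical default hook: guess a value type from the `missing` value."""
    if isinstance(missing, bool):
        return "boolean"
    if isinstance(missing, (int, float)):
        return "numeric"
    if isinstance(missing, str):
        try:
            ipaddress.ip_address(missing)
            return "ip"
        except ValueError:
            return "bytes"
    raise TypeError(f"unsupported missing value: {missing!r}")


def resolve_value_type(mapped_type, missing, agg_hook=infer_type_from_missing):
    """Pick the value type an aggregation should run with."""
    if mapped_type is not None:
        return mapped_type          # the mapping always wins when it exists
    if missing is None:
        return "bytes"              # nothing to go on: the old fall-through behaviour
    return agg_hook(missing)        # unmapped field: let the aggregation decide


def range_agg_hook(missing):
    """Hypothetical hook for a numeric range aggregation: always numeric."""
    return "numeric"
```

With this shape, a terms aggregation could treat `"missing": 0` as numeric and `"missing": "N/A"` as bytes, while a range aggregation would pass its own hook and always run as numeric.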

Watcher UI merged

We have merged Watcher UI. This effort contributes to our wider migration from Angular to React, adds threshold alert support for four new action types (index, webhook, jira, pagerduty) as well as new comparators, and fixes several long-standing bugs.

Snapshot/Restore UI merged

We have merged the Snapshot and Restore UI. We also adjusted SR permissions logic, added a link to index settings docs from SR restore wizard fields, and prevented users from deleting snapshots stored in a Cloud-managed repository from the UI.

Search profiler now accepts triple-quoted input

We improved the Search Profiler to accept triple-quoted input.

OpenID Connect

We fixed a bug in our OpenID Connect implementation regarding how we encode credentials for Basic Authentication.

Lucene

Lazy-loaded frequencies

We addressed a long-standing TODO in the lower levels of Lucene, which provided a nice speedup for certain queries. Term frequencies that are used for scoring are now only loaded if they are really needed by the iterating docs or the impacts enum.

Impacts on phrases

The fact that Lucene stores raw term frequencies and norms as impacts means it's possible to leverage them for phrases. For instance, an impact that consists of (termFreq=5,norm=10) means that documents that have 5 occurrences or more of this term also have a norm that is greater than or equal to 10. Because the frequency of the phrase is necessarily at most the frequency of each of its terms, these impacts can then be merged across all terms of the phrase query.
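One way to picture the merge is the Python sketch below. It assumes a particular reading of impacts: every document in a block is covered by some impact `(freq, norm)` of each term, with the document's term frequency at most `freq` and its norm at least `norm`. The function name and the list-of-tuples representation are ours, and Lucene's actual merging code may differ.

```python
def merge_phrase_impacts(per_term_impacts):
    """Merge per-term impacts [(freq, norm), ...] into upper bounds for a phrase.

    For a document with a given norm, each term's frequency is bounded by the
    largest freq among that term's impacts with norm <= the document's norm.
    The phrase frequency is bounded by the smallest of those per-term bounds.
    """
    norms = sorted({norm for impacts in per_term_impacts for _, norm in impacts})
    merged = []
    for norm in norms:
        bounds = []
        for impacts in per_term_impacts:
            freqs = [f for f, n in impacts if n <= norm]
            if not freqs:
                break  # no document with this norm can contain this term
            bounds.append(max(freqs))
        else:
            merged.append((min(bounds), norm))
    # Drop dominated entries: the same frequency bound with a larger norm
    # can never produce a better score than the earlier entry.
    pruned = []
    for freq, norm in merged:
        if not pruned or freq > pruned[-1][0]:
            pruned.append((freq, norm))
    return pruned
```

For two terms with impacts `[(2, 5), (5, 10)]` and `[(3, 8), (4, 12)]`, this sketch yields `[(2, 8), (3, 10), (4, 12)]`: at each norm, the phrase can occur no more often than its rarest term.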

Can index sorting help block-max WAND?

Block-max WAND relies on a feedback loop between the collector and the scorer: the collector regularly tells the scorer the minimum required score that a hit needs to produce in order to make it to the top-k hits, and the scorer uses this information to skip documents that can't possibly produce competitive scores. The higher the required minimum score, the more hits can be skipped. This raises the question of whether we can organize the index order in such a way that scores converge more quickly, which should then help the block-max WAND logic. Scores have two components per document: term frequency, and a length normalization factor. Term frequencies are not practical since they are different for every term, but what if we sorted the index by increasing norm? We wrote a patch that allows index sorting to be configured to sort by norm. We compared query speed with an index that was built in no particular order and the results were encouraging: queries ran up to twice as fast.
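The effect of index order on this feedback loop can be illustrated with a toy Python model. It is a deliberate simplification: real block-max WAND works across the postings of several terms, whereas this sketch has a single stream of precomputed scores split into blocks, and the function name is made up for the example.

```python
import heapq


def top_k_with_block_skipping(blocks, k):
    """Collect the top-k scores, skipping blocks that cannot be competitive.

    `blocks` is a list of non-empty lists of scores in index order.
    Returns (top-k scores in descending order, number of documents scored).
    """
    heap = []          # min-heap holding the k best scores seen so far
    docs_scored = 0
    for block in blocks:
        # Assumption: a real index stores this bound per block; here we compute it.
        block_max = max(block)
        # The collector's feedback: the score a hit must beat to enter the top-k.
        min_competitive = heap[0] if len(heap) == k else float("-inf")
        if block_max <= min_competitive:
            continue   # nothing in this block can make it into the top-k
        for score in block:
            docs_scored += 1
            if len(heap) < k:
                heapq.heappush(heap, score)
            elif score > heap[0]:
                heapq.heapreplace(heap, score)
    return sorted(heap, reverse=True), docs_scored
```

When high-scoring documents come first, as in `[[9, 8, 7], [6, 5, 4], [3, 2, 1]]` with `k=2`, the threshold rises immediately and only 3 documents are scored. With the same scores in the opposite order, all 9 are scored before the same top-2 is found.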

BKD-Tree improvements

We committed several improvements this week. We now use data dimensions as a tie-breaker when partitioning. For duplicated values, we use run-length compression. Along the same lines, the Points API changed to allow queries to leverage the fact that there are duplicates on a leaf. We also now use the new Points API in certain queries.
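As a small illustration of why run-length compression helps with duplicated values, consider the Python sketch below. It is not Lucene's on-disk format; the function names are ours, and the last function only hints at how a query could exploit runs of duplicates on a leaf.

```python
from itertools import groupby


def rle_encode(sorted_values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(sorted_values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


def count_in_range(runs, low, high):
    """Count values in [low, high] with one comparison per run, not per value."""
    return sum(length for value, length in runs if low <= value <= high)
```

A leaf holding `[3, 3, 3, 7, 7, 9]` is stored as three runs, and a range check over `[3, 7]` touches three runs instead of six values.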

New issues are already open to optimize point queries in BKD-backed geo shapes. Currently, we use bounding box queries in these cases, but using a specialized query yields a performance boost of around 10%.

The XYShape feature is now in the review phase.

Other

BooleanQuery.maxBooleanClauses() now uses IndexSearcher instead of a global boolean query limit, and uses query visitors to get a better count of the number of leaves. The next step is to investigate whether we can make the limit per-searcher rather than static.
Luwak has been merged into the main line of Lucene as the monitor submodule.
Intervals queries have moved from the sandbox into the queries module. We also changed wildcard() and prefix() intervals to take a BytesRef rather than a String, which integrates better with Analyzer.normalize().
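The leaf-counting idea behind the clause limit can be sketched with a toy query tree in Python. The classes and functions below are hypothetical and much simpler than Lucene's query visitors; they only show why counting leaves across nested boolean queries gives a better measure than counting top-level clauses.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TermQuery:
    field_name: str
    text: str


@dataclass
class BoolQuery:
    clauses: List["Query"] = field(default_factory=list)


Query = Union[TermQuery, BoolQuery]


class TooManyClauses(Exception):
    pass


def count_leaves(query):
    """Visit the query tree and count leaf queries, however deeply nested."""
    if isinstance(query, BoolQuery):
        return sum(count_leaves(clause) for clause in query.clauses)
    return 1


def check_clause_limit(query, max_clauses):
    """Raise if the whole tree has more leaves than the configured limit."""
    leaves = count_leaves(query)
    if leaves > max_clauses:
        raise TooManyClauses(f"{leaves} leaf clauses exceed the limit of {max_clauses}")
    return leaves
```

A boolean query with two top-level clauses can still hide many leaves inside nested boolean queries, which a per-query clause count would miss.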

Changes

Changes in Elasticsearch

Changes in 8.0:

Ensure test cluster classpath inputs have predictable ordering #43938
BREAKING: Remove the client transport profile filter #43236
Enable caching of rest tests which use integ-test distribution #43782

Changes in 7.4:

SmokeTestWatcherWithSecurityIT: Retry if failures searching .watcher-history #43781
Refactor index engines to manage readers instead of searchers #43860
BREAKING: Provide an Option to Use Path-Style-Access with S3 Repo #41966
Async IO Processor release before notify #43682

Changes in 7.3:

[ML-DataFrame] audit message missing for autostop #43984
Ensure to access RecoveryState#fileDetails under lock #43839
Actually close IndexAnalyzers contents #43914
Shortcut simple patterns ending in * #43904
Deprecate transport profile security type setting #43237
Add _reload_search_analyzers endpoint to HLRC #43733
Watcher: Allow to execute actions for each element in array #41997
Refresh translog stats after translog trimming in NoOpEngine #43825
Support builtin privileges in get privileges API #42134
Use separate BitSet cache in Doc Level Security #43669
Always attach system user to internal actions #43468
Add dims parameter to dense_vector mapping #43444
[ML][Data Frame] add node attr to GET _stats #43842
[ML-Data Frame] Add data frame transform cluster privileges to HLRC #43879
[ML][Data Frame] fix progress measurement for continuous transforms #43838
Return reloaded analyzers in _reload_search_ananlyzer response #43813
Remove sort by primary term when reading soft-deletes #43845
[ML][Data Frame] Add deduced mappings to _preview response payload #43742
Upgrade to Gradle 5.5 #43788
Add RareTerms aggregation #35718
Expose translog stats in ReadOnlyEngine #43752
Add "manage_api_key" cluster privilege #43728
Make peer recovery send file info step async #43792
Avoid parallel reroutes in DiskThresholdMonitor #43381
Make peer recovery clean files step async #43787
Consistent Secure Settings #40416
Trim translog for closed indices #43156
Refactor IndexSearcherWrapper to disallow the wrapping of IndexSearcher #43645
[ML][Data Frame] removing format support in date_histogram group_by #43659
Wildcard intervals #43691
Add support for 'flattened object' fields. #42541

Changes in 7.2:

Fix index_prefix sub field name on nested text fields #43862
Fix credential encoding for OIDC token request #43808

Changes in 6.8:

Prevent types deprecation warning for indices.exists requests #43963
Fix wrong logic in match_phrase query with multi-word synonyms #43941
AsyncIOProcessor preserve thread context #43729

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.4:

Add timing logging for the REST request #162

Changes in 7.2:

Remove support for the frozen indices param #161

Changes in Rally Tracks

Change default store_type to fs #80
