19 July 2019

This Week in Elasticsearch and Apache Lucene - 2019-07-19

Yannick Welsch

Elasticsearch Highlights

Snapshot Lifecycle Management

Snapshot Lifecycle Management (SLM) is a set of Elasticsearch APIs that allow users to automatically create backups of their data on a predefined schedule. Historically, new Elasticsearch APIs have tended to ship a release or two ahead of their UI counterparts. We’ve made an effort to develop the UI in tandem with the API so they can ship together in 7.4. By shipping UI and API together, the feature as a whole becomes accessible to a broader set of users upon launch, and our product story becomes more comprehensive. We’ll continue this workflow in the future!

We merged the Snapshot Lifecycle Management feature branch into master and completed the backport to 7.x. Feel free to take it for a spin! Documentation is here and here. Work continues on retention.
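As a rough illustration of taking it for a spin, the sketch below builds the request that would register a policy. It only assembles the method, path, and JSON body and sends nothing. The policy id, repository name, schedule, and snapshot name are placeholder values, and the body fields reflect our reading of the SLM documentation, so check them against the docs for your version.

```python
import json


def build_slm_policy_request(policy_id, repository, schedule, snapshot_name, indices):
    """Assemble (method, path, body) for registering an SLM policy.

    Nothing is sent over the network; all argument values are illustrative.
    """
    path = f"/_slm/policy/{policy_id}"
    body = {
        "schedule": schedule,        # cron expression: when to take snapshots
        "name": snapshot_name,       # snapshot name, may use date math
        "repository": repository,    # an already registered snapshot repository
        "config": {"indices": indices},
    }
    return "PUT", path, json.dumps(body, indent=2)


method, path, body = build_slm_policy_request(
    policy_id="nightly-snapshots",          # placeholder id
    repository="my_repository",             # placeholder repository
    schedule="0 30 1 * * ?",                # placeholder: daily at 01:30
    snapshot_name="<nightly-snap-{now/d}>",
    indices=["*"],
)
print(method, path)
print(body)
```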

We submitted a PR that adds the list of Snapshot Lifecycle Management policies and a details panel to the Snapshot and Restore app in Kibana.

Search: Stemmers, Pinners, Cancellers, and their neighbors

We continued work on cancelling searches when the underlying client connection is closed. Complications around multi-search requests have spawned additional changes: Associate sub-requests to their parent task in multi search API and Make multi search tasks cancellable.

We have a draft implementation of approximate nearest neighbours search based on locality-sensitive hashing.
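For readers unfamiliar with locality-sensitive hashing, here is a minimal sketch of one common LSH family, random-hyperplane hashing for cosine similarity. This is an assumption chosen for illustration and not necessarily the scheme used in the draft; the function names are made up. Vectors pointing in similar directions tend to share most signature bits, so candidates can be looked up by signature instead of comparing against every vector.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplanes (as normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


planes = make_hyperplanes(dim=3, num_bits=8)
sig_a = lsh_signature([1.0, 2.0, 3.0], planes)
sig_b = lsh_signature([2.0, 4.0, 6.0], planes)  # same direction, twice the length
```

Because the signature depends only on direction, `sig_a` and `sig_b` are identical here.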

We have continued work on the result pinning PR to support Swiftype-style promotion of results. While reviewing it, we uncovered some inefficiencies in Lucene's DisMax query.

We have a PR for a new English plural stemmer and published a comparison of results with the existing Lucene implementation.

Geo

We merged a PR that introduces the new Spatial Plugin into master. This is the first building block for the new spatial features (including, but not limited to, geo) we are working on. We are working on an upcoming feature for this plugin, a new XYShape field, which is not strictly Geo.

We are refactoring the spatial components in Elasticsearch so that the new XYShape field can be integrated easily. We have merged a PR that moves the dateline handling logic, which is specific to Geo, out of ShapeBuilders. The next step is to extract this logic from the query builders.

Pipeline Aggregations

We are working on a community PR that adds a shift parameter to the MovingFunction pipeline agg. This allows the user to adjust where the window is positioned, instead of always trailing the current bucket as is the case today. This is useful if you want to include or exclude the current bucket, "center" the window, etc.
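To make the effect of shift concrete, here is a small sketch of which buckets end up in the window for each position. It is a simplified model, not the Elasticsearch implementation: the function name is made up, and it assumes that with shift = 0 the window holds the `window` buckets before the current one, and that each increment of shift moves the window one bucket to the right.

```python
def windows_for_buckets(values, window, shift=0):
    """For each bucket index, return the values the moving function would see.

    Assumed model: the window covers indices [i - window + shift, i + shift),
    clipped to the available buckets.
    """
    result = []
    for i in range(len(values)):
        start = max(0, i - window + shift)
        end = min(len(values), i + shift)
        result.append(values[start:end] if start < end else [])
    return result


values = [1, 2, 3, 4, 5]
trailing = windows_for_buckets(values, window=2)             # excludes current bucket
inclusive = windows_for_buckets(values, window=2, shift=1)   # includes current bucket
```

Under this model, `trailing` ends with `[3, 4]` for the last bucket while `inclusive` ends with `[4, 5]`.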

We are also working on adding a "none" gap policy to pipeline aggs. Users often want to execute a bucket_script on all buckets, whether they have a value or not. The existing gap policies make this impossible. The "none" policy does nothing and lets the aggregation evaluate the null/missing/NaN value for itself. The change also adds a params.doc_count parameter to the script context so that the user can inspect the document count and determine whether the value is NaN because it is missing, or NaN because it was just NaN.
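The difference between the policies can be sketched with a toy model. This is not Elasticsearch code: the function, the bucket layout, and the way the script receives its arguments are all assumptions made for illustration, and "skip" and "insert_zeros" stand in for the existing policies. Missing values are represented as NaN under "none".

```python
import math


def apply_bucket_script(buckets, script, gap_policy="skip"):
    """Run script(value, params) over buckets under a given gap policy (toy model)."""
    results = []
    for bucket in buckets:
        value = bucket.get("value")
        missing = value is None or math.isnan(value)
        if missing and gap_policy == "skip":
            results.append(None)      # the script never sees this bucket
            continue
        if missing and gap_policy == "insert_zeros":
            value = 0.0               # the gap is papered over with a zero
        elif missing:                 # "none": hand the gap to the script as NaN
            value = math.nan
        results.append(script(value, {"doc_count": bucket["doc_count"]}))
    return results


def doubled_or_flag(value, params):
    """Hypothetical script: flag empty buckets with -1.0, otherwise double the value."""
    if math.isnan(value) and params["doc_count"] == 0:
        return -1.0
    return value * 2


buckets = [{"doc_count": 3, "value": 6.0}, {"doc_count": 0, "value": None}]
with_skip = apply_bucket_script(buckets, doubled_or_flag, "skip")
with_zeros = apply_bucket_script(buckets, doubled_or_flag, "insert_zeros")
with_none = apply_bucket_script(buckets, doubled_or_flag, "none")
```

Only under "none" does the script get to decide what an empty bucket should produce.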

Async peer recoveries

We have been working over the past months on making peer recoveries non-blocking. We merged the most complex work item this week: file chunks are now sent asynchronously, which fixes an issue where setting node_concurrent_recoveries to a large value could lead to deadlocks. The only remaining item is the relocation handoff. Moving more code to async is a massive undertaking but will also help tremendously in our quest to reduce the default maximum size of the generic threadpool. We are also applying the same technique to CCR's recovery from remote, allowing us to share the complex file chunk coordination logic between peer recovery and recovery from remote.
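The general idea of sending chunks without parking a thread per outstanding request can be sketched as follows. This is only an analogy in Python's asyncio, not the Elasticsearch code: the function name, the `max_in_flight` limit, and the simulated network call are assumptions for illustration.

```python
import asyncio


async def send_file_chunks(chunks, max_in_flight=2):
    """Send all chunks with at most max_in_flight requests outstanding.

    No thread blocks while waiting for a response; waiting tasks simply yield.
    Returns the delivered chunks and the peak number of concurrent sends.
    """
    limiter = asyncio.Semaphore(max_in_flight)
    delivered = []
    stats = {"in_flight": 0, "peak": 0}

    async def send_one(chunk):
        async with limiter:
            stats["in_flight"] += 1
            stats["peak"] = max(stats["peak"], stats["in_flight"])
            await asyncio.sleep(0)    # stand-in for the network round trip
            delivered.append(chunk)
            stats["in_flight"] -= 1

    await asyncio.gather(*(send_one(chunk) for chunk in chunks))
    return delivered, stats["peak"]


delivered, peak = asyncio.run(send_file_chunks([b"a", b"b", b"c", b"d"]))
```

Raising the concurrency limit in a model like this increases the number of pending requests, but not the number of threads tied up waiting for them.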

PKI for Kibana

We believe we have reached a consensus on all of the outstanding questions we had about Proxied PKI for Kibana, and are working through the series of pull requests to implement this feature.

Enrich Processor

We merged a PR to ensure that an enrich policy is immutable. Since enrich index names and enriching behaviors are directly associated with a policy, we can eliminate an entire category of potential issues by enforcing immutable policies. We are still working through the details of ensuring that deleting a policy and immediately re-adding it won't re-introduce the same category of problems we are avoiding with immutable policies. (#43604)

We merged the background cleanup process (#43746) and are in the process of adding the ES version that an enrich policy was created with to the metadata, so that we have it if anything major needs to change across versions and we need compatibility logic.

Apache Lucene

Lucene 8.2

The Elasticsearch benchmarks helped catch a regression in the memory usage of the terms dictionary. This was due to a change to FSTs that enabled direct arc addressing on dense nodes, so that lookups can run with a single random access instead of a binary search. This change only triggered minor size increases when tested against text data, but it has a worst-case scenario of ~4x more memory usage. In our case, it made the terms dictionary use 50% more memory, likely because of the _id field, which is binary and has denser nodes than FSTs made of English text. This change has been reverted from 8.2 and we will work on re-enabling it for 8.3 in a way that avoids worst-case scenarios memory-wise.
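The trade-off between the two lookup strategies can be illustrated with a toy model of a single node's outgoing arcs. This is not Lucene's encoding, and the heuristics Lucene uses to decide when to apply direct addressing are not shown; the function names and data layout are made up. Binary search stores only the arcs that exist, while direct addressing reserves one slot per label in the node's label range, which is where the extra memory comes from.

```python
from bisect import bisect_left


def find_arc_binary(labels, targets, label):
    """Look up an arc by binary search over the sorted labels: O(log n) probes."""
    i = bisect_left(labels, label)
    if i < len(labels) and labels[i] == label:
        return targets[i]
    return None


def build_direct_table(labels, targets):
    """Lay arcs out in a table indexed by (label - first_label)."""
    first = labels[0]
    table = [None] * (labels[-1] - first + 1)   # one slot per label in the range
    for label, target in zip(labels, targets):
        table[label - first] = target
    return first, table


def find_arc_direct(first, table, label):
    """Look up an arc with a single indexed access."""
    index = label - first
    if 0 <= index < len(table):
        return table[index]
    return None


labels = [97, 98, 100, 120]    # sorted arc labels (byte values)
targets = [1, 2, 3, 4]         # opaque target node ids
first, table = build_direct_table(labels, targets)
unused_slots = table.count(None)
```

In this example four arcs occupy a 24-slot table, so 20 slots are unused; how much space is wasted depends entirely on how the labels are spread over their range.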

Other

Can we leverage top-hits retrieval optimizations across multiple slices?
BKDWriter could make better splitting decisions by recomputing the range of each dimension on each recursion level.
We could clean things up by leveraging Set.copyOf and Set.of.
Per-query I/O counters can be useful for benchmarking.
Nearest neighbor search was missing an optimization we added a couple months ago that consists of shrink wrapping leaf cells to have better bounds.
DisjunctionMaxQuery can be optimized for top-hits retrieval.
IndexSearcher#termStatistics should not require creating a TermStates, which is expensive as it requires seeking in the terms dictionary.
We are considering removing the "Direct" doc-value format.

Changes in Elasticsearch

Changes in 8.0:

BREAKING: Fail node containing ancient closed index #44264

Changes in 7.4:

Expose index age in ILM explain output #44457
add disable_chunked_encoding configuration #44052
Associate sub-requests to their parent task in multi search API #44492
Defer reroute when starting shards #44433
Introduce test issue logging #44477
Add Snapshot Lifecycle Management #43934
Make peer recovery send file chunks async #44468
Do not allow version in Rest Update API #43516
Improve build scan metadata #44247
Cluster health should await events plus other things #44348
Log write failures for watcher history document. #44129
Allow RerouteService to reroute at lower priority #44338
Throw TranslogCorruptedException in more cases #44217
add clarification around TESTSETUP docu and error message #43306
HLRC: Fix '+' Not Correctly Encoded in GET Req. #33164
Fail engine if hit document failure on replicas #43523
[ML][Data Frame] prevent task from attempting to run when failed #44239
Support WKT point conversion to geo_point type #44107
Make plugin verification FIPS 140 compliant #44224
Avoid counting votes from master-ineligible nodes #43688

Changes in 7.3:

Fix incorrect calculation of how many buckets will result from a merge #44461
Don't use index_phrases on graph queries #44340
Fix broken short-circuit in getUnlicensedRealms #44399
Ensure field caps doesn't error on rank feature fields. #44370
[ML][Data Frame] treat bulk index failures as an indexing failure #44351
Improve CryptoService error message on missing secure file #43623
Fix AnalyzeAction response serialization #44284
[ML][Data Frame] responding with 409 status code when failing _stop #44231
Fix port range allocation with large worker IDs #44213

Changes in 6.8:

Skip update if leader and follower settings identical #44535
Fix parameter value for calling data.advanceExact #44205
Avoid stack overflow in auto-follow coordinator #44421
Avoid NPE when checking for CCR index privileges #44397
Fix varying responses for /_analyze request #44342
Do not swallow I/O exception getting authentication #44398
Fix swapped variables in error message #44300

Changes in Elasticsearch Hadoop Plugin

Changes in 7.3:

[DOCS] Fix broken links for ES API docs move #1317

Changes in Rally

Changes in 1.3.0:

Be resilient upon startup #730
BREAKING: Drop 1.x support for cluster metadata #729
Allow to set distribution version as parameter #728
