The goal of the Enrich project is to provide a means to enrich incoming data with reference data prior to indexing. Internally we are working through many concerns, one of which is user friendliness, and we have recently made many adjustments to the original design to make these workflows more user friendly. The most recent change allows the same entity to be enriched from multiple sources. For example, suppose you want to add information about a user (e.g. email, address) prior to indexing. This information comes from a variety of sources, but conceptually it is all reference data for the same user. Pre-index-time enrichment will now support data from multiple sources for the same entity from a single processor. Prior to this change, multiple policies and processors would have been needed to serve this use case, so once Enrich is released, enriching a single entity from multiple sources will require less configuration. As part of this change we renamed the "exact_match" policy to "match" and gave it a configurable maximum number of enrichments. #46041

We've also enhanced the delete policy API to delete all of the enrich indices that are associated with the policy. This work also exposed some slightly flawed logic in how the delete call unlocked policies, which has been fixed as well in #45870.

Painless

In addition to continuing to decouple ANTLR from the Painless AST, we began a prototype of a Painless AST Builder. This API will allow constructing an AST from pure Java code, which should make room for alternate grammars, such as one we might want for representing a parsed script in a debugger.

Vector Search

Vector fields let you encode numeric multidimensional vectors and use them for computations. Typical use cases include finding similar images, audio, or texts encoded through embeddings. We have published a new blog post that explains how vector fields can be used for text similarity. Currently all vector functions are based on the brute-force approach, where a vector function is applied to the vectors of all documents in the index. We found several areas where the brute-force approach can be optimized; the PRs below target these optimizations:

Switch to ByteBuffer for vector encoding. (https://github.com/elastic/elasticsearch/pull/45936)
Use float instead of double for query vectors. (https://github.com/elastic/elasticsearch/pull/46004)
Combine vector decoding and function computation. (https://github.com/elastic/elasticsearch/pull/46103)
Use an array instead of a List for the query vector. (https://github.com/elastic/elasticsearch/pull/46155)
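
To make the brute-force approach and two of these optimizations concrete, here is a minimal Python sketch. It is not the Elasticsearch implementation: the byte layout (consecutive big-endian 32-bit floats) and the function names are assumptions for illustration. It shows a vector stored as raw bytes, and a dot product that decodes each component and accumulates the result in a single pass instead of first building a full decoded vector.

```python
import struct


def encode_vector(values):
    """Pack a vector as consecutive big-endian 32-bit floats (assumed layout)."""
    return struct.pack(f">{len(values)}f", *values)


def dot_product_from_bytes(encoded, query):
    """Decode each stored component and accumulate the dot product in one pass,
    without materializing the whole document vector first."""
    total = 0.0
    for (component,), q in zip(struct.iter_unpack(">f", encoded), query):
        total += component * q
    return total


def brute_force_scores(encoded_docs, query):
    """Apply the function to every document's vector (the brute-force approach)."""
    return [dot_product_from_bytes(doc, query) for doc in encoded_docs]


docs = [encode_vector([1.0, 0.0, 2.0]), encode_vector([0.5, 0.5, 0.5])]
scores = brute_force_scores(docs, [2.0, 4.0, 1.0])
print(scores)  # [4.0, 3.5]
```

The cost stays linear in the number of documents; the optimizations only reduce the per-document constant.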

Reindex

We merged a change that fixes the behavior of the allow_partial_search_results=false option when a search hits a shard that is recovering. This is important for Reindex, which executes scroll searches to retrieve the documents to reindex.

We have also changed the Reindex, Update-by-query and Delete-by-query APIs so that they now use the allow_partial_search_results flag and no longer silently ignore red/unavailable shards.

Elasticsearch now ensures that only a single reindex persistent task can write to the .reindex index at a time by using conditional updates based on primary terms and sequence numbers.
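
The idea behind such conditional updates is optimistic concurrency control: a write only succeeds if the document is still at the sequence number and primary term the writer last read. The following Python sketch illustrates this with a hypothetical in-memory store; the class and method names are invented for the example and are not the Elasticsearch API.

```python
class VersionConflict(Exception):
    """Raised when the document changed after the writer read it."""


class TaskStateStore:
    """Hypothetical in-memory stand-in for one document in the .reindex index."""

    def __init__(self, state):
        self.state = state
        self.seq_no = 0
        self.primary_term = 1

    def read(self):
        return self.state, self.seq_no, self.primary_term

    def conditional_update(self, new_state, if_seq_no, if_primary_term):
        # Reject the write unless the caller saw the latest version.
        if (if_seq_no, if_primary_term) != (self.seq_no, self.primary_term):
            raise VersionConflict("document changed since it was read")
        self.state = new_state
        self.seq_no += 1
        return self.seq_no


store = TaskStateStore("created")
_, seq_a, term_a = store.read()  # task A reads the document
_, seq_b, term_b = store.read()  # task B reads the same version

store.conditional_update("running on A", seq_a, term_a)  # A wins
try:
    store.conditional_update("running on B", seq_b, term_b)
    b_succeeded = True
except VersionConflict:
    b_succeeded = False  # B's view is stale, so its write is rejected
```

Only one of two concurrent writers can succeed; the loser has to re-read the document before trying again.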

Shard allocation

We have merged a community PR that automatically removes the read-only-allow-delete block when it's no longer necessary. This block is automatically added when a node breaches the flood-stage disk watermark (95% by default), and in current versions we leave it in place until it is manually removed. We see many cases where this block is in place because of a temporary disk shortage of which the user is unaware, so from 7.4 onwards we will automatically remove any read-only-allow-delete block when nodes gain enough free space to continue indexing, i.e. when their disk usage drops below the high watermark (90% by default).
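
The new behavior amounts to a simple hysteresis between the two watermarks. Below is a minimal Python sketch of that rule, using the default thresholds quoted above; the function name and the idea of tracking a single boolean per node are simplifying assumptions, not the actual implementation.

```python
FLOOD_STAGE = 0.95     # default flood-stage watermark from the text
HIGH_WATERMARK = 0.90  # default high watermark from the text


def next_block_state(blocked, disk_usage):
    """Return whether the read-only-allow-delete block should be in place."""
    if disk_usage > FLOOD_STAGE:
        return True   # breaching flood stage adds the block
    if blocked and disk_usage < HIGH_WATERMARK:
        return False  # enough space was freed, so release the block
    return blocked    # between the watermarks, keep the current state


blocked = False
history = []
for usage in [0.80, 0.96, 0.93, 0.89]:
    blocked = next_block_state(blocked, usage)
    history.append(blocked)
print(history)  # [False, True, True, False]
```

Note that at 93% usage the block stays in place: dropping below flood stage is not enough, usage has to fall below the high watermark.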

We found a bug in the disk-based shard allocator. Disk-based shard allocation is in charge of monitoring the disk usage of data nodes so that shards don't get allocated to a node that does not have enough free disk space to hold them. To compute the disk usage, the allocator takes into account the size of the shards that are already allocated to a node plus the size of the shards that are actively relocating to or away from the node. However, we noticed that newly relocating shards (those for which the decision to relocate has been made but the relocation has not started yet) were ignored. As a result, the disk-based shard allocator could move too many shards onto or off nodes and possibly overshoot watermarks during relocation. This bug is particularly problematic if cluster.routing.allocation.node_concurrent_recoveries is set to a high value, and we now have a fix.
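
The effect of ignoring not-yet-started relocations can be sketched in a few lines of Python. The data model here (a Relocation record with a started flag, sizes in arbitrary units) is hypothetical and only meant to show how the projected usage differs with and without the pending relocations.

```python
from dataclasses import dataclass


@dataclass
class Relocation:
    size: int
    source: str
    target: str
    started: bool


def projected_usage(node, allocated, relocations, include_pending):
    """Estimate a node's disk usage once relocations complete."""
    usage = allocated
    for r in relocations:
        if not r.started and not include_pending:
            continue  # the buggy behavior: pending relocations are ignored
        if r.target == node:
            usage += r.size  # incoming shard will consume space
        elif r.source == node:
            usage -= r.size  # outgoing shard will free space
    return usage


relocations = [
    Relocation(size=10, source="n2", target="n1", started=True),
    Relocation(size=10, source="n2", target="n1", started=False),
    Relocation(size=10, source="n3", target="n1", started=False),
]
buggy = projected_usage("n1", 70, relocations, include_pending=False)
fixed = projected_usage("n1", 70, relocations, include_pending=True)
print(buggy, fixed)  # 80 100
```

With a capacity of 100 units, the buggy estimate of 80 looks safe, so further shards could still be sent to n1 even though the node is already fully committed.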

Analytics

We merged support for aggregations on Range fields in time for feature freeze, so it will be part of 7.4. The PR includes agg support for Histogram, Date Histogram and Terms bucketing aggregations, as well as Cardinality, Value Count and Missing metric aggs. We're investigating support for other aggs, although some don't have entirely clear semantics (what's a "maximum" range? Measured from the start or the end of the range? Maybe the middle?). More to come in the future as we decide on behaviors and hear back from users.
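
As an illustration of why ranges need their own bucketing semantics, here is a small Python sketch of a histogram over ranges. It assumes that a range counts toward every bucket it overlaps and that range ends are inclusive; both are assumptions made for this example, and the function name is invented.

```python
import math


def range_histogram(ranges, interval):
    """Count each (start, end) range in every bucket it overlaps."""
    counts = {}
    for start, end in ranges:
        first = math.floor(start / interval)
        last = math.floor(end / interval)
        for key in range(first, last + 1):
            bucket = key * interval
            counts[bucket] = counts.get(bucket, 0) + 1
    return dict(sorted(counts.items()))


buckets = range_histogram([(1, 12), (7, 8), (20, 24)], interval=10)
print(buckets)  # {0: 2, 10: 1, 20: 1}
```

Unlike a histogram over single values, the bucket counts can add up to more than the number of documents, because one range may span several buckets.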

Adding support for ranges was a large and complicated feature addition, since it touched many of the deep, dark places of the aggregation framework. We've been wanting to refactor the agg framework for a while, and the range project incidentally showed us many places that need refactoring (how do we resolve the "value type" when the field is unmapped but we have a script, and the agg supports two different incompatible types? How can licensed features enhance existing aggs? And so on). We took a lot of notes along the way and, as a side effect, now have a clearer idea of what needs to be done.

We also merged a cumulative_cardinality pipeline aggregation. This is a Basic+ agg that uses a cardinality agg in your request to calculate the "total" cardinality over time. The canonical example is differentiating "totally new" visitors from repeat visitors. A regular cardinality agg will show you the distinct visitors each day (some of whom might be repeats from yesterday), while cumulative_cardinality will only increment for each distinctly new user over the entire time period. This agg was added to a new "analytics" xpack plugin.
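
The difference between the two numbers is easy to see in a small Python sketch. It uses exact sets for clarity, whereas the cardinality agg works with approximate counts; the function name and the sample data are made up for the example.

```python
def daily_and_cumulative_cardinality(visits_by_day):
    """Return (day, distinct visitors that day, distinct visitors so far)."""
    seen = set()
    rows = []
    for day, visitors in visits_by_day:
        daily = len(set(visitors))
        seen.update(visitors)
        rows.append((day, daily, len(seen)))
    return rows


visits = [
    ("mon", ["ann", "bob"]),
    ("tue", ["bob", "cat"]),
    ("wed", ["ann", "cat", "dan"]),
]
rows = daily_and_cumulative_cardinality(visits)
for row in rows:
    print(row)
# ('mon', 2, 2)
# ('tue', 2, 3)
# ('wed', 3, 4)
```

The number of totally new visitors per day is the difference between consecutive cumulative values: 2, 1 and 1 in this example.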

Lucene

Faster Bounding Box queries for shapes

Work is in progress to relax the logic that checks the relation between inner nodes and a given bounding box query. Doing so increases the chance that an inner node can be skipped entirely or included in the result as a whole, which eliminates the costly point-in-rectangle checks that otherwise need to be done at leaf level. Similar performance optimisations are already included in range queries over range fields.
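
The general idea of classifying a tree node against a query box can be sketched as follows. This is a generic illustration in Python, not the Lucene change itself: the Relation names and the tuple layout (min_x, min_y, max_x, max_y) are assumptions for the example.

```python
from enum import Enum


class Relation(Enum):
    INSIDE = "inside"    # every entry under the node matches, no leaf checks
    OUTSIDE = "outside"  # no entry under the node can match, skip the node
    CROSSES = "crosses"  # undecided, descend and check at leaf level


def relate(node, query):
    """Relate a node's bounding box to the query bounding box."""
    n_min_x, n_min_y, n_max_x, n_max_y = node
    q_min_x, q_min_y, q_max_x, q_max_y = query
    if (n_max_x < q_min_x or n_min_x > q_max_x
            or n_max_y < q_min_y or n_min_y > q_max_y):
        return Relation.OUTSIDE
    if (n_min_x >= q_min_x and n_max_x <= q_max_x
            and n_min_y >= q_min_y and n_max_y <= q_max_y):
        return Relation.INSIDE
    return Relation.CROSSES


query_box = (0, 0, 10, 10)
print(relate((2, 2, 4, 4), query_box))      # Relation.INSIDE
print(relate((20, 20, 30, 30), query_box))  # Relation.OUTSIDE
print(relate((8, 8, 12, 12), query_box))    # Relation.CROSSES
```

The more nodes that resolve to INSIDE or OUTSIDE, the fewer per-entry checks are needed at the leaves.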

QueryRescorer optimisation

A TODO in QueryRescorer notes that it currently sorts the full results array coming from the rescoring algorithm. This can be improved by sorting only the top N results when N is less than the total hits.
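
The following Python sketch contrasts the two strategies. It is an illustration of the general technique, not the Lucene code: hits are modeled as hypothetical (doc_id, score) tuples, and a heap-based selection from the standard library stands in for whatever partial sort the eventual change uses.

```python
import heapq


def score_of(hit):
    return hit[1]


def top_n_full_sort(hits, top_n):
    """Current behavior as described: sort everything, then keep the top N."""
    return sorted(hits, key=score_of, reverse=True)[:top_n]


def top_n_partial(hits, top_n):
    """Select and order only the top N when N is less than the total hits."""
    if top_n >= len(hits):
        return sorted(hits, key=score_of, reverse=True)
    return heapq.nlargest(top_n, hits, key=score_of)


hits = [(1, 0.3), (2, 0.9), (3, 0.1), (4, 0.7), (5, 0.5)]
best = top_n_partial(hits, 2)
print(best)  # [(2, 0.9), (4, 0.7)]
```

For M hits, the full sort costs O(M log M), while selecting the top N with a heap costs roughly O(M log N), which matters when N is much smaller than M.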

Other

There is a PR that introduces a shared counter between TopDocsCollectors when computing the total hits threshold.
Last week we reported some improvements to nearest neighbour search for LatLonPoint. Since those changes were merged, the Lucene geo benchmarks show a query performance improvement of around 60% for such queries.

Changes

Changes in Elasticsearch

Changes in 8.0:

BREAKING: Remove the pidfile setting #45940
BREAKING: Remove processors setting #45905

Changes in 7.5:

More Efficient Ordering of Shard Upload Execution #42791
Upgrade to Azure SDK 8.4.0 #46094
return Cancellable in RestHighLevelClient #45688

Changes in 7.4:

Replace MockAmazonS3 usage in S3BlobStoreRepositoryTests by a HTTP server #46081
Flush engine after big merge #46066
Add Circle Processor #43851
BREAKING: Use float instead of double for query vectors. #46004
Add max_iterations configuration to watcher action with foreach execution #45715
Support Range Fields in Histogram and Date Histogram #45395
Add XContentType as parameter to HLRC ART#createServerTestInstance #46036
Disallow partial results when shard unavailable #45739
Add manage_own_api_key cluster privilege #45897
Do not create engine under IndexShard#mutex #45263
PKI realm authentication delegation #45906
Add Cumulative Cardinality agg (and Data Science plugin) #43661
Build: Support console-result language #45937
Better logging for TLS message on non-secure transport channel #45835
Fix Broken HTTP Request Breaking Channel Closing #45958
Fix plaintext on TLS port logging #45852
Allow Transport Actions to indicate authN realm #45767
Add deprecation check for pidfile setting #45939
Deprecate the pidfile setting #45938
Add deprecation check for processors #45925
Fix IngestService to respect original document content type #45799
Update joda to 2.10.3 #45495
Remove redundant Java check from Sys V init #45793

Changes in 7.3:

Ensure top docs optimization is fully disabled for queries with unbounded max scores. #46105
Fix rest-api-spec dep for external plugins #45949
Don't use assemble task on root project #45999
Update translog checkpoint after marking operations as persisted #45634
Fix bugs in Painless SCatch node #45880

Changes in 7.2:

[DOCS] Correct IIF conditional section title #45979

Changes in 6.8:

Start testing against AdoptOpenJDK #45666
Avoid overshooting watermarks during relocation #46079
Add rest_total_hits_as_int in HLRC's search requests #46076
Handle delete document level failures #46100
Handle no-op document level failures #46083
Consider artifact repositories backed by S3 secure #45950

Changes in Elasticsearch Hadoop Plugin

Changes in 7.4:

use File type for javaHome and bump build-tools version #1338

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.4:

Add CBOR integration testing #177
SQLNumParams now counts parameter markers #174
CBOR support for parameters #175
Use default float representation for parameters #179

Changes in 7.3:

Fix handling of size with DATE parameters #178
fix length read from app ptr when param binding #173

Changes in Rally

Changes in 1.3.0:

Option track-revision should work with track-repository #751


This Week In Elasticsearch, September 2nd
