This Week In Elasticsearch, September 2nd
The Enrich project's goal is to provide a means to enrich incoming data with reference data prior to indexing. Internally we are working through many concerns, one of which is user friendliness, and we have recently made many adjustments to the original design to make these workflows more user friendly. The most recent change allows the same entity to be enriched from multiple sources. For example, suppose you are trying to add additional information about a user (e.g. email, address, etc.) prior to indexing. This additional information comes from a variety of sources, but conceptually it is all reference data for the same user. Pre-index-time enrichment will now support data from multiple sources for the same entity from a single processor. Prior to this change, multiple policies and processors would have been needed to serve this use case. Once Enrich is released, this will mean less configuration is needed to enrich a single entity from multiple sources. As part of this change we renamed the "exact_match" policy to "match", with a configurable maximum number of enrichments. #46041
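As a rough illustration of the idea only (this is not the Enrich implementation or its API), the sketch below models a "match" lookup that attaches up to a configurable number of reference records to one incoming document. The function name, the `max_matches` parameter, the `enrichments` field and the sample data are all hypothetical.

```python
def enrich(doc, match_field, reference, max_matches=1):
    """Return a copy of doc with up to max_matches matching reference records attached."""
    key = doc.get(match_field)
    # Collect reference records for the same entity, regardless of which source they came from.
    matches = [record for record in reference if record.get(match_field) == key]
    enriched = dict(doc)
    enriched["enrichments"] = matches[:max_matches]  # hypothetical target field
    return enriched


# Hypothetical reference data for the same user, originating from two different sources.
reference_data = [
    {"user": "alice", "email": "alice@example.com"},
    {"user": "alice", "address": "1 Main St"},
    {"user": "bob", "email": "bob@example.com"},
]

result = enrich({"user": "alice", "action": "login"}, "user", reference_data, max_matches=2)
```

With `max_matches=2`, both reference records for the user end up on the document in a single pass, which is the effect that previously required several policies and processors.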
We've also enhanced the delete policy API to delete all of the enrich indices that are associated with the policy. This work also exposed some slightly flawed logic in how the delete call unlocked policies, which was fixed as well in #45870
In addition to continuing to decouple ANTLR from the Painless AST, we began a prototype of a Painless AST Builder. This API will allow constructing an AST from pure Java code, which should allow alternate grammars, such as one we might want for representing a parsed script in a debugger.
Vector fields allow you to encode numeric multidimensional vectors and use them in computations. Typical use cases include finding similar images, audio clips, or texts encoded through embeddings. We have published a new blog post that explains how vector fields can be used for text similarity. Currently all vector functions are based on a brute-force approach, where a vector function is applied to the vectors of all documents in the index. We found several ways to optimize the brute-force approach, and the following PRs target these optimizations:
- Switch to ByteBuffer for vector encoding. (https://github.com/elastic/elasticsearch/pull/45936)
- Use float instead of double for query vectors. (https://github.com/elastic/elasticsearch/pull/46004)
- Combine vector decoding and function computation. (https://github.com/elastic/elasticsearch/pull/46103)
- Use an array instead of a List for the query vector. (https://github.com/elastic/elasticsearch/pull/46155)
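To make the brute-force approach concrete, here is a minimal Python sketch, assuming a hypothetical encoding of each stored vector as big-endian 32-bit floats. It decodes each component and accumulates the dot product in the same loop, loosely mirroring the "combine decoding and computation" idea. It is not the Elasticsearch encoding or code.

```python
import struct


def encode_vector(values):
    """Pack floats as big-endian 32-bit values (an assumed encoding, for illustration)."""
    return struct.pack(f">{len(values)}f", *values)


def dot_product_encoded(encoded, query):
    """Decode each stored component and accumulate the dot product in a single pass."""
    total = 0.0
    for (component,), q in zip(struct.iter_unpack(">f", encoded), query):
        total += component * q
    return total


def brute_force_scores(encoded_docs, query):
    """Apply the vector function to every document's vector, as the brute-force approach does."""
    return {doc_id: dot_product_encoded(enc, query) for doc_id, enc in encoded_docs.items()}


docs = {
    "a": encode_vector([1.0, 0.0, 2.0]),
    "b": encode_vector([0.5, 0.5, 0.5]),
}
scores = brute_force_scores(docs, (1.0, 2.0, 3.0))
```

The cost grows linearly with the number of documents, which is why shaving work off the per-document decode and compute step matters.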
We merged a change that fixes the behavior of the allow_partial_search_results=false option when a search hits a shard that is recovering. This is important for Reindex, which executes scroll searches to retrieve the documents to reindex.
We have also changed the Reindex, Update-by-query and Delete-by-query APIs so that they now use the allow_partial_search_results flag and no longer silently ignore red/unavailable shards.
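The semantics of the flag can be sketched with a toy in-memory model. This is an illustration under simplifying assumptions, not Elasticsearch code: shards are a dict, and a value of `None` stands in for a red or recovering shard. The exception and function names are hypothetical.

```python
class ShardUnavailableError(Exception):
    """Raised when a shard is unavailable and partial results are not allowed."""


def search(shards, allow_partial_search_results=True):
    hits, failed = [], []
    for name, docs in shards.items():
        if docs is None:  # None models a red or recovering shard
            failed.append(name)
        else:
            hits.extend(docs)
    if failed and not allow_partial_search_results:
        # Fail loudly instead of silently returning an incomplete result set.
        raise ShardUnavailableError(f"shards unavailable: {failed}")
    return hits
```

For an operation like reindex, silently skipping a shard would mean silently skipping documents, so failing the request is the safer behaviour.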
Elasticsearch now ensures that only a single reindex persistent task can write to the .reindex index at a time by using conditional updates based on primary terms and sequence numbers.
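The general technique here is optimistic concurrency control: a write only succeeds if the document is still at the sequence number and primary term the writer last saw. The toy store below is a hypothetical in-memory model of that check, not the reindex implementation. The `if_seq_no` and `if_primary_term` parameter names follow the Elasticsearch index API.

```python
class VersionConflict(Exception):
    """Raised when the document changed since the caller last read it."""


class Store:
    def __init__(self):
        self._docs = {}  # doc_id -> (seq_no, primary_term, source)
        self._next_seq_no = 0
        self.primary_term = 1

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self._docs.get(doc_id)
        if if_seq_no is not None:
            # Conditional update: reject unless the stored version matches exactly.
            if current is None or current[:2] != (if_seq_no, if_primary_term):
                raise VersionConflict(doc_id)
        seq_no = self._next_seq_no
        self._next_seq_no += 1
        self._docs[doc_id] = (seq_no, self.primary_term, source)
        return seq_no, self.primary_term


store = Store()
seq_no, term = store.index("task-1", {"status": "created"})
# The first writer holding (seq_no, term) succeeds and advances the sequence number.
store.index("task-1", {"status": "running"}, if_seq_no=seq_no, if_primary_term=term)
```

A second writer that still holds the old `(seq_no, term)` pair would get a `VersionConflict`, which is how only one task wins.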
We have merged a community PR that automatically removes the read-only-allow-delete block when it's no longer necessary. This block is automatically added when the node breaches the flood-stage disk watermark (95% by default), and in current versions we leave it in place until it is manually removed. We see many cases where this block is in place because of a temporary disk shortage of which the user is unaware, so from 7.4 onwards we will automatically remove any read-only-allow-delete block when nodes gain enough free space to continue indexing, i.e. when their disk usage drops below the high watermark (90% by default).
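The add and remove thresholds form a simple hysteresis, sketched below using the default watermarks quoted above. The function and its inputs (the disk usage fractions of the nodes holding an index's shards) are a hypothetical simplification, not the actual cluster logic.

```python
FLOOD_STAGE = 0.95     # default flood-stage watermark
HIGH_WATERMARK = 0.90  # default high watermark


def update_block(blocked, disk_usage_by_node):
    """Return whether the read-only-allow-delete block should be in place."""
    if any(usage > FLOOD_STAGE for usage in disk_usage_by_node):
        return True   # a node breached flood stage: add (or keep) the block
    if blocked and all(usage < HIGH_WATERMARK for usage in disk_usage_by_node):
        return False  # every node is back below the high watermark: release
    return blocked    # in between: leave the current state untouched
```

Releasing at a lower threshold than the one that triggers the block avoids flapping when disk usage hovers around 95%.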
We found a bug in the disk-based shard allocator. Disk-based shard allocation is in charge of monitoring the disk usage of data nodes so that shards don't get allocated to a node if there is not enough free disk space on that node to hold them. To compute the disk usage, the allocator takes into account the size of the shards that are already allocated to a node plus the size of the shards that are actively relocating to or away from the node. But we noticed that newly relocating shards, for which the decision to relocate has been made but the relocation has not yet started, were ignored. This caused the disk-based shard allocator to allocate too many shards to nodes, or move too many away, and possibly overshoot watermarks during relocation. This bug is particularly problematic if cluster.routing.allocation.node_concurrent_recoveries is set to a high value, and we now have a fix.
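The effect of ignoring not-yet-started relocations can be shown with a small arithmetic sketch. The accounting below (current usage plus incoming relocations minus outgoing ones) is an assumed simplification for illustration, and all names and numbers are hypothetical. It is not the allocator's actual formula.

```python
def projected_used(used, incoming_started, incoming_pending, outgoing):
    """Disk usage a node will have once all known relocations complete."""
    return used + sum(incoming_started) + sum(incoming_pending) - sum(outgoing)


def can_allocate(shard_size, capacity, high_watermark_pct, used,
                 incoming_started=(), incoming_pending=(), outgoing=()):
    projected = projected_used(used, incoming_started, incoming_pending, outgoing) + shard_size
    # Integer comparison avoids floating-point rounding at the threshold.
    return projected * 100 <= capacity * high_watermark_pct


# Node with 100 units of disk, 70 used, one relocation in flight and one decided but not started.
with_pending = can_allocate(10, 100, 90, used=70, incoming_started=[5], incoming_pending=[10])
# Leaving out the pending relocation reproduces the kind of over-allocation described above.
ignoring_pending = can_allocate(10, 100, 90, used=70, incoming_started=[5])
```

Here the pending relocation tips the projection from 85 to 95 units, so the allocation is only refused when it is counted.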
We merged in support for aggregations on Range fields in time for feature freeze, so it will be part of 7.4. The PR includes agg support for Histogram, Date Histogram and Terms bucketing aggregations, as well as Cardinality, Value Count and Missing metric aggs. We're investigating support for other aggs, although some don't have entirely clear semantics (what's a "maximum" range? From the start or end of the range? Maybe the middle?). More to come in the future as we decide on behaviors and hear back from users.
Adding support for ranges was a large and complicated feature addition, since it touched many of the deep, dark places of the aggregation framework. We've been wanting to refactor the agg framework for a while, and the range project accidentally showed us many places that need refactoring (how do we resolve the "value type" when the field is unmapped but we have a script, and the agg supports two different incompatible types? How can licensed features enhance existing aggs? And so on). We took a lot of notes along the way and, as a side effect, now have a clearer idea of what needs to be done.
We also merged a cumulative_cardinality pipeline aggregation. This is a Basic+ agg that uses a cardinality agg in your request to calculate the "total" cardinality over time. The canonical example is differentiating "totally new" visitors from repeat visitors. A regular cardinality agg shows you the distinct visitors each day (some of whom may be repeats from yesterday), while cumulative_cardinality only increments for each distinctly new user over the entire time period. This agg was added to a new "analytics" xpack plugin.
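The difference between the two numbers is easy to see on a tiny data set. The sketch below uses exact Python sets purely for illustration. The real cardinality aggregation is approximate, and the function name and sample visitors here are made up.

```python
def daily_and_cumulative_cardinality(visits_by_day):
    """Return (day, distinct visitors that day, distinct visitors so far) per bucket."""
    seen = set()
    rows = []
    for day, users in visits_by_day:
        daily = len(set(users))  # what a regular per-bucket cardinality reports
        seen.update(users)       # running set of everyone seen so far
        rows.append((day, daily, len(seen)))
    return rows


visits = [
    ("mon", ["a", "b"]),
    ("tue", ["b", "c"]),
    ("wed", ["a", "c"]),
]
rows = daily_and_cumulative_cardinality(visits)
# rows == [("mon", 2, 2), ("tue", 2, 3), ("wed", 2, 3)]
```

Every day has two distinct visitors, but the cumulative value only grows on Tuesday, when "c" shows up for the first time. The day-over-day difference of the cumulative value gives the count of totally new visitors.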
Faster Bounding Box queries for shapes
Work is in progress to relax the logic that checks the relation between inner nodes and a given bounding box query. Doing so increases the chance that an inner node can be skipped entirely or included in the result wholesale, eliminating costly point-in-rectangle checks that would otherwise be needed at the leaf level. Similar performance optimisations are already used in range queries over range fields.
A TODO in QueryRescorer notes that it currently sorts the full results array coming from the rescoring algorithm. This can be improved by sorting only the top N results when N is less than the total hits.
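The general idea of classifying an inner node against the query box can be sketched as follows. This is a generic illustration over axis-aligned boxes with hypothetical names, not the Lucene shape code: a node that is disjoint from the query is skipped, a node fully within it is accepted without visiting its leaves, and only a crossing node requires descending further.

```python
from enum import Enum


class Relation(Enum):
    DISJOINT = 0  # skip the whole subtree
    WITHIN = 1    # accept the whole subtree, no per-point checks
    CROSSES = 2   # must descend and check at the leaf level


def relate(node, query):
    """Relate a node's bounding box to the query box; boxes are (min_x, min_y, max_x, max_y)."""
    n_min_x, n_min_y, n_max_x, n_max_y = node
    q_min_x, q_min_y, q_max_x, q_max_y = query
    if n_max_x < q_min_x or n_min_x > q_max_x or n_max_y < q_min_y or n_min_y > q_max_y:
        return Relation.DISJOINT
    if q_min_x <= n_min_x and n_max_x <= q_max_x and q_min_y <= n_min_y and n_max_y <= q_max_y:
        return Relation.WITHIN
    return Relation.CROSSES
```

The more often a node can be classified as disjoint or within, the fewer leaf-level checks the query has to run.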
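A minimal sketch of the partial-sort idea, in Python rather than Lucene's Java and with hypothetical names: when only the top N of the rescored hits are needed, a heap-based selection avoids sorting the entire array.

```python
import heapq


def top_n_by_score(hits, n):
    """Return the n highest-scoring (doc_id, score) pairs, best first."""
    if n >= len(hits):
        # Nothing to save: the caller wants everything, so a full sort is needed anyway.
        return sorted(hits, key=lambda hit: hit[1], reverse=True)
    # Selects and orders only the top n, roughly O(len(hits) * log n).
    return heapq.nlargest(n, hits, key=lambda hit: hit[1])


rescored = [("a", 1.0), ("b", 3.0), ("c", 2.0), ("d", 0.5)]
top_two = top_n_by_score(rescored, 2)
```

The saving is largest when N is small relative to the number of rescored hits.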
- There is a PR that introduces a shared counter between TopDocsCollectors when computing the total hits threshold.
- Last week we reported some improvements to nearest neighbour search for LatLonPoint. After those changes were merged, the Lucene geo benchmarks show an improvement in query performance of around 60% for such queries.
Changes in Elasticsearch
Changes in 8.0:
Changes in 7.5:
- More Efficient Ordering of Shard Upload Execution #42791
- Upgrade to Azure SDK 8.4.0 #46094
- return Cancellable in RestHighLevelClient #45688
Changes in 7.4:
- Replace MockAmazonS3 usage in S3BlobStoreRepositoryTests by a HTTP server #46081
- Flush engine after big merge #46066
- Add Circle Processor #43851
- BREAKING: Use float instead of double for query vectors. #46004
- Add max_iterations configuration to watcher action with foreach execution #45715
- Support Range Fields in Histogram and Date Histogram #45395
- Add XContentType as parameter to HLRC ART#createServerTestInstance #46036
- Disallow partial results when shard unavailable #45739
- manage_own_api_key cluster privilege #45897
- Do not create engine under IndexShard#mutex #45263
- PKI realm authentication delegation #45906
- Add Cumulative Cardinality agg (and Data Science plugin) #43661
- Build: Support
- Better logging for TLS message on non-secure transport channel #45835
- Fix Broken HTTP Request Breaking Channel Closing #45958
- Fix plaintext on TLS port logging #45852
- Allow Transport Actions to indicate authN realm #45767
- Add deprecation check for pidfile setting #45939
- Deprecate the pidfile setting #45938
- Add deprecation check for processors #45925
- Fix IngestService to respect original document content type #45799
- Update joda to 2.10.3 #45495
- Remove redundant Java check from Sys V init #45793
Changes in 7.3:
- Ensure top docs optimization is fully disabled for queries with unbounded max scores. #46105
- Fix rest-api-spec dep for external plugins #45949
- Don't use assemble task on root project #45999
- Update translog checkpoint after marking operations as persisted #45634
- Fix bugs in Painless SCatch node #45880
Changes in 7.2:
- [DOCS] Correct IIF conditional section title #45979
Changes in 6.8:
- Start testing against AdoptOpenJDK #45666
- Avoid overshooting watermarks during relocation #46079
- Add rest_total_hits_as_int in HLRC's search requests #46076
- Handle delete document level failures #46100
- Handle no-op document level failures #46083
- Consider artifact repositories backed by S3 secure #45950
Changes in Elasticsearch Hadoop Plugin
Changes in 7.4:
- use File type for javaHome and bump build-tools version #1338
Changes in Elasticsearch SQL ODBC Driver
Changes in 7.4:
- Add CBOR integration testing #177
- SQLNumParams now counts parameter markers #174
- CBOR support for parameters #175
- Use default float representation for parameters #179
Changes in 7.3:
Changes in Rally
Changes in 1.3.0:
- Option track-revision should work with track-repository #751