22 March 2019

This Week in Elasticsearch and Apache Lucene - 2019-03-22

Daniel Mitterdorfer

Highlights

Scripting

We have added a REST API that returns information about script contexts and their associated whitelists. This is the first step toward automatically documenting the whitelisted APIs, and it may be used for other internal features as well.

Snapshot and Restore UI

We merged a PR for the repositories list and details UI. Next for the app is the create repository form.

Analytics

Pipeline aggs can only reference a single numeric value or a specific value of a multi-value agg (like a stats or percentiles agg). Anything else throws an ugly error that isn't very helpful. We tried to fix this before, but the implementation wasn't great and we closed the PR without merging. Recently, a user ran into this issue again and was confused because the error was thrown by an "intermediate" aggregation in the path. We rebooted the fix with a less invasive PR. Its errors aren't quite as nice as those of the first PR, but it touches a lot less code and should be more maintainable.

The MovingAverage pipeline agg was just removed from 8.0. Users should migrate to MovingFunction instead.
Sampler/DiversifiedSampler aggs should now be more OOM resistant: internal data structure sizes are limited to the maxDoc of the shard, and circuit breaker accounting is better.
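For readers planning the MovingAverage migration, here is a minimal Python sketch of how a request body might be translated. The helper `moving_avg_to_moving_fn` and the example `buckets_path` are hypothetical, and the sketch assumes the documented `moving_avg` defaults (`simple` model, window of 5). It only covers the two unweighted/linear models and only translates the request shape; please check the MovingFunction documentation for window semantics before relying on identical results.

```python
from typing import Any, Dict

# Assumed mapping from moving_avg models to MovingFunctions script calls.
# Only the parameter-free models are covered in this sketch.
_MODEL_TO_SCRIPT = {
    "simple": "MovingFunctions.unweightedAvg(values)",
    "linear": "MovingFunctions.linearWeightedAvg(values)",
}


def moving_avg_to_moving_fn(moving_avg: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a moving_avg aggregation body into a moving_fn body.

    Hypothetical helper for illustration; not part of any Elasticsearch client.
    """
    model = moving_avg.get("model", "simple")  # assumed default model
    if model not in _MODEL_TO_SCRIPT:
        raise ValueError(f"model {model!r} needs a hand-written script")
    return {
        "moving_fn": {
            "buckets_path": moving_avg["buckets_path"],
            "window": moving_avg.get("window", 5),  # assumed default window
            "script": _MODEL_TO_SCRIPT[model],
        }
    }


# Example: an old-style definition and its translated counterpart.
old_agg = {"moving_avg": {"buckets_path": "the_sum", "window": 10}}
new_agg = moving_avg_to_moving_fn(old_agg["moving_avg"])
```

Models that take tuning parameters (such as the exponentially weighted ones) are deliberately rejected here, since their settings have to be passed as arguments in the script.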

Search alias resolution

We opened a PR to speed up search on index patterns that match lots of indices. Currently we resolve the index pattern to a list of concrete indices, and then for each concrete index we check whether it was matched through an alias, meaning we might have to apply alias filters. Unfortunately this second per-index operation runs in linear time with the number of matched concrete indices, which means that alias resolution runs in O(num_indices^2) overall. So queries get quadratically slower as an index pattern matches more indices. The PR turns alias resolution into a one-step operation that runs in linear time with the number of matched indices, followed by a per-index operation that runs in linear time with the number of aliases of that index. This makes alias resolution run in O(num_indices * num_aliases_per_index) overall instead.
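The difference between the two shapes can be illustrated with a small Python sketch. This is not the Elasticsearch implementation: the function names, the `aliases_of` mapping and the step counters are all hypothetical, and `resolve` merely stands in for wildcard expansion.

```python
from typing import Dict, List, Set, Tuple

Result = Tuple[Dict[str, List[str]], int]


def resolve(expression_names: List[str]) -> Set[str]:
    """Stand-in for expression resolution: linear in the number of names."""
    return set(expression_names)


def matched_aliases_old(indices: List[str],
                        aliases_of: Dict[str, List[str]],
                        expression_names: List[str]) -> Result:
    """Old shape: resolution is repeated for every concrete index."""
    steps = 0
    result: Dict[str, List[str]] = {}
    for index in indices:
        resolved = resolve(expression_names)  # O(num_indices), done per index
        index_aliases = aliases_of.get(index, [])
        steps += len(expression_names) + len(index_aliases)
        result[index] = [a for a in index_aliases if a in resolved]
    return result, steps


def matched_aliases_new(indices: List[str],
                        aliases_of: Dict[str, List[str]],
                        expression_names: List[str]) -> Result:
    """New shape: resolve once, then only walk each index's own aliases."""
    resolved = resolve(expression_names)  # done a single time
    steps = len(expression_names)
    result: Dict[str, List[str]] = {}
    for index in indices:
        index_aliases = aliases_of.get(index, [])
        steps += len(index_aliases)
        result[index] = [a for a in index_aliases if a in resolved]
    return result, steps


# Example: 100 indices that all sit behind one hypothetical alias "logs".
indices = [f"logs-{i}" for i in range(100)]
aliases_of = {name: ["logs"] for name in indices}
expression_names = indices + ["logs"]

old_result, old_steps = matched_aliases_old(indices, aliases_of, expression_names)
new_result, new_steps = matched_aliases_new(indices, aliases_of, expression_names)
```

Both functions return the same aliases per index; only the step count differs, growing with the square of the number of indices in the first case and roughly linearly in the second.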

Proximity boost

We merged a PR to expose proximity boosting in the query DSL. This new query for date and geo_point fields can efficiently boost documents' scores based on proximity to a given origin. It can skip non-competitive documents when the total number of hits matching the query is not requested. Give it a try:

https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-distance-feature-query.html
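As a rough illustration, the sketch below builds a request body that uses the query as a `should` clause and computes the boost the documentation describes, `boost * pivot / (pivot + distance)`. The field names (`name`, `production_date`), the search term and the helper `distance_feature_score` are assumptions made for this example, not something taken from the PR.

```python
def distance_feature_score(distance: float, pivot: float, boost: float = 1.0) -> float:
    """Score contribution for a document at `distance` from the origin.

    `pivot` is the distance at which the contribution drops to half of `boost`.
    Hypothetical helper mirroring the documented formula.
    """
    if pivot <= 0:
        raise ValueError("pivot must be positive")
    if distance < 0:
        raise ValueError("distance must not be negative")
    return boost * pivot / (pivot + distance)


# Request body sketch: match on a text field, and boost documents whose
# (hypothetical) production_date is close to "now".
search_body = {
    "query": {
        "bool": {
            "must": {"match": {"name": "chocolate"}},
            "should": {
                "distance_feature": {
                    "field": "production_date",
                    "pivot": "7d",
                    "origin": "now",
                }
            },
        }
    }
}

# A document exactly at the origin receives the full boost, and one that is
# a full pivot away receives half of it.
at_origin = distance_feature_score(distance=0.0, pivot=7.0)
at_pivot = distance_feature_score(distance=7.0, pivot=7.0)
```

Because the contribution decreases monotonically with distance, an upper bound on the score of remaining documents is available, which is what makes skipping non-competitive documents possible.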

Resiliency

We fixed a bug that made an optimization unsound in the presence of cascading primary failures. This optimization is currently only used by CCR and allows follower shards under certain conditions to forego doing ID lookups in Lucene during indexing, thereby indexing faster and catching up more quickly to their leader index. This is a different optimization than the one introduced in ES 5.0 that allowed faster indexing by avoiding ID lookups in case of using auto-generated IDs. The newer optimization works by having the primary establish whether write operations are just adding data or updating / deleting existing data, and transfer this information to CCR follower shards that can then, based on the information established by the leader’s primary, avoid ID lookups, even in the case of non-auto-generated IDs. In the future, we will apply this optimization to regular replicas in the same cluster as well, providing them with the same speedup as follower shards.

Shard-level APIs, stats and settings

We merged a community PR that exposes external refresh stats through the stats API. Elasticsearch distinguishes between external and internal refreshes. External refreshes are typically triggered through user requests or by the "index.refresh_interval" setting, in order to make changes visible to search. Internal refreshes, which are not exposed to searches, are triggered by the system. They are used to free heap memory, for example when a shard is using too much heap and should move buffered indexed/deleted documents to disk. Internal refreshes also allow internally accessing indexed data that is not available yet to searches, for example in the case of a realtime get.

We merged a community PR that allows the index.translog.sync_interval setting to be dynamically updated.

We adapted the _flush API to reject a certain combination of parameters as it would lead to unexpected behavior: If the wait_if_ongoing parameter was set to false while the force parameter was set to true and there was another ongoing flush, then the force flush would be ignored.
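The rejected combination can be expressed as a small validation step. The Python sketch below is only a model of the rule described above; the function name, the error message and the default values shown are assumptions, not the actual Elasticsearch code.

```python
def validate_flush_params(force: bool = False, wait_if_ongoing: bool = True) -> None:
    """Reject a force flush that would not wait for an ongoing flush.

    Hypothetical helper: with force=True and wait_if_ongoing=False, a
    concurrent flush would cause the forced flush to be silently skipped,
    so the combination is refused up front.
    """
    if force and not wait_if_ongoing:
        raise ValueError("wait_if_ongoing must be true for a force flush")


# Accepted combinations pass silently.
validate_flush_params(force=True, wait_if_ongoing=True)
validate_flush_params(force=False, wait_if_ongoing=False)

# The problematic combination is rejected.
try:
    validate_flush_params(force=True, wait_if_ongoing=False)
    rejected = False
except ValueError:
    rejected = True
```

Failing fast here is preferable to the previous behavior, because the caller learns immediately that the request could not have done what was asked.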

Apache Lucene

Lucene 9 may require Java 11+

A vote was initiated in order to decide whether to bump the minimum Java version requirement from 8 to 11 for the next major version of Lucene.

Forcemerge no longer merges more than necessary

We fixed a bug introduced in Lucene 7.5 that caused calls to the forcemerge API to merge indices down to one segment regardless of the maxSegmentCount parameter.

Other

We discovered that because SegmentInfos doesn't preserve the next generation that should be written, some files might be written twice if the IndexWriter is not closed gracefully or is opened from an old commit.
The fact that OR intervals only return minimal intervals might be confusing to users.
We are making ConstantScoreQuery able to optimize collection of top hits.
We are thinking about better defaults regarding when to load the terms index off-heap, and looking into making it easier to configure on a per-field basis.
Can we optimize doc values for boolean values?

Changes in Elasticsearch

Changes in 8.0:

BREAKING: Remove MovingAverage pipeline aggregation #39328
BREAKING: Blob Store compress default to true #40033

Changes in 7.1:

Add use_field option to intervals query #40157
Make setting index.translog.sync_interval be dynamic #37382
Remove throws IOException from PipelineAggregationBuilder#create #40222
[ML] Data Frame HLRC Preview API #40206
Return cached segments stats if include_unloaded_segments is true #39698
Ensure flush happen before closing an index #40184
Reject illegal flush parameters #40213
Add date and date_nanos conversion to the numeric_type sort option #40199
Always fail engine if delete operation fails #40117
Expose proximity boosting #39385
[ML] Data Frame HLRC start & stop APIs #40154
Replace java mail with jakarta mail #40088
Expose external refreshes through the stats API #38643
Fix IndexSearcherWrapper visibility #39071
Do not allow Sampler to allocate more than maxDoc size, better CB accounting #39381
Add an option to force the numeric type of a field sort #38095
SQL: Introduce MAD (MedianAbsoluteDeviation) aggregation #40048
Handle empty input in AddStringKeyStoreCommand #39490

Changes in 7.0:

Use bundled JDK in Docker images #40238
CCS: skip empty search hits when minimizing round-trips #40098
Node repurpose tool #39403
BREAKING: Remove Migration Upgrade and Assistance APIs #40075
BREAKING: Remove cluster state size #40061
Add no-jdk distributions #39882

Changes in 6.7:

SQL: unwrap the first value in an array in case of array leniency #40318
SQL: fix LIKE function equality by considering its pattern as well #40260
SQL: Preserve original source for cast/convert function #40271
Reduce retention lease sync intervals #40302
SQL: fix incorrect ordering of groupings (GROUP BY) based on orderings (ORDER BY) #40087
Allow non super users to create API keys #40028
SQL: Fix issue with getting DATE type in JDBC #40207
Only count some fields types for deprecation check #40166
Cascading primary failure lead to MSU too low #40249
Upgrade bundled JDK and Docker images to JDK 12 #40229
Serialize top-level pipeline aggs as part of InternalAggregations #40177
Skip sibling pipeline aggregators reduction during non-final reduce #40101
SQL: Add multi_value_field_leniency inside FieldHitExtractor #40113
LoggingAuditTrail correctly handle ReplicatedWriteRequest #39925
Deprecate Migration Assistance and Upgrade APIs #40072
Create retention leases file during recovery #39359

Changes in 6.6:

SQL: rewrite ROUND and TRUNCATE functions with a different optional parameter handling method #40242
SQL: Fix issue with optimization on queries with ORDER BY/LIMIT #40256
SQL: Fix issue with date columns returned always in UTC #40163
Enable reading auto-follow patterns from x-content #40130
Safe publication of AutoFollowCoordinator #40153
Stop auto-followers on shutdown #40124

Changes in Elasticsearch Hadoop Plugin

Changes in 6.7:

Fix missing service files in ES-Hadoop jars #1265

Changes in Elasticsearch Management UI

Changes in 7.1:

New platform dropdown #33520
refetch autocomplete info after updating dev console settings #32587
moved btn-radio directive to vis editor #33373

Changes in Elasticsearch SQL ODBC Driver

Changes in 7.1:

Configurable floats format on conversions #130

Changes in 6.6:

Lower ujson4c's stack requirements on x86 #134
Add missing Paket license. Remove replaced bash dependencies generation script #127

Changes in Rally

Changes in 1.1.0:

Allow to override request timeout for force-merge #669
Add sleep operation #667
Introduce new command line parameter --track-revision #666

Changes in Rally Tracks

Adapt geopointshape track to track conventions #69
Add a track for Metricbeat data #56
