This Week in Elasticsearch and Apache Lucene - 2019-03-22
Highlights
Scripting
We have added a REST API that returns information about script contexts and their associated whitelists. This is the first step toward automating the whitelist API pages in the documentation, and it may be used for other internal features as well.
Snapshot and Restore UI
We merged a PR for the repositories list and details UI. Next for the app is the create repository form.
Analytics
Pipeline aggs can only reference a single numeric value or a specific value of a multi-value agg (like a stats or percentiles agg). Anything else throws an ugly error that isn't very helpful. We tried to fix this before, but the implementation wasn't great and we closed the PR without merging. Recently a user ran into the issue again and was confused because the error was thrown by an "intermediate" aggregation in the path. We rebooted the fix with a less invasive PR. Its errors are not quite as nice as those of the first PR, but it touches a lot less code and should be more maintainable.
- The MovingAverage pipeline agg was just removed from 8.0. Users should migrate to MovingFunction instead.
- Sampler/DiversifiedSampler aggs should now be more OOM resistant: internal data structure sizes are limited to the maxDoc of the shard, and circuit breaker accounting is better.
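As a rough illustration of the sampler change, the sketch below caps a requested sample size at a shard's maxDoc. The function name `capped_sample_size` and its parameters are hypothetical and are not taken from the Elasticsearch code base; the real change also touches circuit breaker accounting, which is not modeled here.

```python
def capped_sample_size(requested_size: int, max_doc: int) -> int:
    """Return how many slots to allocate for a shard-level sample.

    Hypothetical helper: a shard can never contribute more documents than
    it holds, so allocating more than max_doc slots only wastes heap.
    """
    if requested_size < 0 or max_doc < 0:
        raise ValueError("sizes must be non-negative")
    return min(requested_size, max_doc)


if __name__ == "__main__":
    # A very large requested size on a small shard is cut down to the shard's maxDoc.
    print(capped_sample_size(1_000_000_000, 5_000))
```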
Search alias resolution
We opened a PR to speed up search on index patterns that match lots of indices. Currently we resolve the index pattern to a list of concrete indices and then, for each concrete index, check whether it was matched through an alias, meaning we might have to apply alias filters. Unfortunately this second per-index operation runs in linear time with the number of matched concrete indices, which means that alias resolution runs in O(num_indices^2) overall. So alias resolution gets quadratically slower as an index pattern matches more indices. The linked PR turns alias resolution into a one-step operation that runs in linear time with the number of matched indices, followed by a per-index operation that runs in linear time with the number of aliases of that index. This makes alias resolution run in O(num_indices * num_aliases_per_index) overall instead.
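The difference between the two shapes can be seen in a toy model. The sketch below is not the Elasticsearch implementation: the function names, the `(index, alias)` pair representation and the step counters are all assumptions made for illustration. The first function rescans the whole resolved list for every index, while the second only walks the aliases of the index at hand.

```python
from typing import Dict, List, Optional, Set, Tuple

# (concrete index, alias it was matched through, or None if matched by name)
Resolved = List[Tuple[str, Optional[str]]]


def filters_by_rescanning(resolved: Resolved) -> Tuple[Dict[str, List[str]], int]:
    """Toy model of the old shape: O(num_indices^2) steps."""
    indices = list(dict.fromkeys(index for index, _ in resolved))
    steps = 0
    out: Dict[str, List[str]] = {}
    for index in indices:
        via: List[str] = []
        for other, alias in resolved:  # full rescan for every index
            steps += 1
            if other == index and alias is not None:
                via.append(alias)
        out[index] = via
    return out, steps


def filters_per_index(
    indices: List[str],
    aliases_of: Dict[str, List[str]],
    matched_aliases: Set[str],
) -> Tuple[Dict[str, List[str]], int]:
    """Toy model of the new shape: O(num_indices * num_aliases_per_index) steps."""
    steps = 0
    out: Dict[str, List[str]] = {}
    for index in indices:
        via: List[str] = []
        for alias in aliases_of.get(index, []):  # only this index's aliases
            steps += 1
            if alias in matched_aliases:
                via.append(alias)
        out[index] = via
    return out, steps


if __name__ == "__main__":
    names = [f"logs-{i}" for i in range(100)]
    old, old_steps = filters_by_rescanning([(n, "logs-alias") for n in names])
    new, new_steps = filters_per_index(
        names, {n: ["logs-alias"] for n in names}, {"logs-alias"}
    )
    print(old == new, old_steps, new_steps)
```

With 100 indices that each have one alias, both functions return the same mapping, but the rescanning version does 100 times as many steps.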
Proximity boost
We merged a PR to expose proximity boosting in the query DSL. This new query for date and geo_point fields can efficiently boost documents' scores based on their proximity to a given origin. It can also skip non-competitive documents when the total number of hits matching the query is not requested. Give it a try:
https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-distance-feature-query.html
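For readers who want to experiment, here is a minimal sketch that builds a search body around a `distance_feature` clause. The field name `production_date`, the `7d` pivot and the helper names are made up for illustration, and the scoring function reflects our reading of the linked reference page (`boost * pivot / (pivot + distance)`), so check it against the docs before relying on it. Setting `track_total_hits` to `false` is what tells the search that the total hit count is not needed.

```python
import json
from typing import Any, Dict, Optional


def distance_feature_body(
    field: str,
    origin: str,
    pivot: str,
    must: Optional[Dict[str, Any]] = None,
    boost: float = 1.0,
) -> Dict[str, Any]:
    """Build a search body that boosts documents close to `origin`.

    Hypothetical helper: it only assembles the JSON body, it does not
    talk to a cluster.
    """
    return {
        # The total hit count is not requested, which allows skipping
        # non-competitive documents.
        "track_total_hits": False,
        "query": {
            "bool": {
                "must": must if must is not None else {"match_all": {}},
                "should": {
                    "distance_feature": {
                        "field": field,
                        "origin": origin,
                        "pivot": pivot,
                        "boost": boost,
                    }
                },
            }
        },
    }


def distance_feature_score(distance: float, pivot: float, boost: float = 1.0) -> float:
    """Score contribution as described in the reference docs (our reading)."""
    return boost * pivot / (pivot + distance)


if __name__ == "__main__":
    body = distance_feature_body("production_date", origin="now", pivot="7d")
    print(json.dumps(body, indent=2))
    # A document exactly one pivot away from the origin gets half the boost.
    print(distance_feature_score(distance=7.0, pivot=7.0))
```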
Resiliency
We fixed a bug that made an optimization unsound in the presence of cascading primary failures. This optimization is currently only used by CCR and allows follower shards, under certain conditions, to forgo ID lookups in Lucene during indexing, thereby indexing faster and catching up more quickly with their leader index. It is a different optimization from the one introduced in ES 5.0, which sped up indexing by avoiding ID lookups when auto-generated IDs are used. The newer optimization works by having the primary establish whether write operations are just adding data or updating / deleting existing data, and transferring this information to CCR follower shards. Based on the information established by the leader's primary, the follower shards can then avoid ID lookups even for non-auto-generated IDs. In the future, we will apply this optimization to regular replicas in the same cluster as well, giving them the same speedup as follower shards.
Shard-level APIs, stats and settings
We merged a community PR that exposes external refresh stats through the stats API. Elasticsearch distinguishes between external and internal refreshes. External refreshes are typically triggered through user requests or by the "index.refresh_interval" setting, in order to make changes visible to search. Internal refreshes, which are not exposed to searches, are triggered by the system. They are used to free heap memory, for example when a shard is using too much heap and should move buffered indexed/deleted documents to disk. Internal refreshes also allow internally accessing indexed data that is not available yet to searches, for example in the case of a realtime get.
We merged a community PR that allows the index.translog.sync_interval setting to be dynamically updated.
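Because the setting is now dynamic, it can be changed on a live index through the update index settings API. The sketch below only assembles such a request; the index name `my-index` and the value `10s` are placeholders, and the helper name is hypothetical.

```python
import json
from typing import Any, Dict, Tuple


def translog_sync_interval_update(index: str, interval: str) -> Tuple[str, str, Dict[str, Any]]:
    """Return (method, path, body) for updating index.translog.sync_interval.

    Hypothetical helper: it builds the request but does not send it.
    """
    body = {"index": {"translog": {"sync_interval": interval}}}
    return "PUT", f"/{index}/_settings", body


if __name__ == "__main__":
    method, path, body = translog_sync_interval_update("my-index", "10s")
    print(method, path)
    print(json.dumps(body))
```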
We adapted the _flush API to reject a certain combination of parameters as it would lead to unexpected behavior: if the wait_if_ongoing parameter was set to false while the force parameter was set to true and there was another ongoing flush, then the force flush would be ignored.
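The rejected combination can be expressed as a small validation check. This is a sketch of the rule as described above, not the actual Elasticsearch validation code; the function name and the error message are assumptions.

```python
def validate_flush_params(force: bool, wait_if_ongoing: bool) -> None:
    """Reject force=true combined with wait_if_ongoing=false.

    Hypothetical check: with that combination, a forced flush could be
    silently skipped if another flush is already running.
    """
    if force and not wait_if_ongoing:
        raise ValueError("wait_if_ongoing must be true for a force flush")


if __name__ == "__main__":
    validate_flush_params(force=True, wait_if_ongoing=True)  # accepted
    try:
        validate_flush_params(force=True, wait_if_ongoing=False)
    except ValueError as err:
        print(f"rejected: {err}")
```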
Apache Lucene
Lucene 9 may require Java 11+
A vote was initiated in order to decide whether to bump the minimum Java version requirement from 8 to 11 for the next major version of Lucene.
Forcemerge no longer merges more than necessary
We fixed a bug introduced in Lucene 7.5 that caused calls to the forcemerge API to merge indices down to one segment regardless of the maxSegmentCount parameter.
Other
- We discovered that the fact that SegmentInfos doesn't preserve the next generation that should be written means that some files might be written twice in case the IndexWriter is not closed gracefully or is open from an old commit.
- The fact that OR intervals only return minimal intervals might be confusing to users.
- We are making ConstantScoreQuery able to optimize collection of top hits.
- We are thinking about better defaults regarding when to load the terms index off-heap, and looking into making it easier to configure on a per-field basis.
- Can we optimize doc values for boolean values?
Changes in Elasticsearch
Changes in 8.0:
- BREAKING: Remove MovingAverage pipeline aggregation #39328
- BREAKING: Blob Store compress default to true #40033
Changes in 7.1:
- Add use_field option to intervals query #40157
- Make setting index.translog.sync_interval be dynamic #37382
- Remove throws IOException from PipelineAggregationBuilder#create #40222
- [ML] Data Frame HLRC Preview API #40206
- Return cached segments stats if include_unloaded_segments is true #39698
- Ensure flush happen before closing an index #40184
- Reject illegal flush parameters #40213
- Add date and date_nanos conversion to the numeric_type sort option #40199
- Always fail engine if delete operation fails #40117
- Expose proximity boosting #39385
- [ML] Data Frame HLRC start & stop APIs #40154
- Replace java mail with jakarta mail #40088
- Expose external refreshes through the stats API #38643
- Fix IndexSearcherWrapper visibility #39071
- Do not allow Sampler to allocate more than maxDoc size, better CB accounting #39381
- Add an option to force the numeric type of a field sort #38095
- SQL: Introduce MAD (MedianAbsoluteDeviation) aggregation #40048
- Handle empty input in AddStringKeyStoreCommand #39490
Changes in 7.0:
- Use bundled JDK in Docker images #40238
- CCS: skip empty search hits when minimizing round-trips #40098
- Node repurpose tool #39403
- BREAKING: Remove Migration Upgrade and Assistance APIs #40075
- BREAKING: Remove cluster state size #40061
- Add no-jdk distributions #39882
Changes in 6.7:
- SQL: unwrap the first value in an array in case of array leniency #40318
- SQL: fix LIKE function equality by considering its pattern as well #40260
- SQL: Preserve original source for cast/convert function #40271
- Reduce retention lease sync intervals #40302
- SQL: fix incorrect ordering of groupings (GROUP BY) based on orderings (ORDER BY) #40087
- Allow non super users to create API keys #40028
- SQL: Fix issue with getting DATE type in JDBC #40207
- Only count some fields types for deprecation check #40166
- Cascading primary failure lead to MSU too low #40249
- Upgrade bundled JDK and Docker images to JDK 12 #40229
- Serialize top-level pipeline aggs as part of InternalAggregations #40177
- Skip sibling pipeline aggregators reduction during non-final reduce #40101
- SQL: Add multi_value_field_leniency inside FieldHitExtractor #40113
- LoggingAuditTrail correctly handle ReplicatedWriteRequest #39925
- Deprecate Migration Assistance and Upgrade APIs #40072
- Create retention leases file during recovery #39359
Changes in 6.6:
- SQL: rewrite ROUND and TRUNCATE functions with a different optional parameter handling method #40242
- SQL: Fix issue with optimization on queries with ORDER BY/LIMIT #40256
- SQL: Fix issue with date columns returned always in UTC #40163
- Enable reading auto-follow patterns from x-content #40130
- Safe publication of AutoFollowCoordinator #40153
- Stop auto-followers on shutdown #40124
Changes in Elasticsearch Hadoop Plugin
Changes in 6.7:
- Fix missing service files in ES-Hadoop jars #1265
Changes in Elasticsearch Management UI
Changes in 7.1:
- New platform dropdown #33520
- refetch autocomplete info after updating dev console settings #32587
- moved btn-radio directive to vis editor #33373
Changes in Elasticsearch SQL ODBC Driver
Changes in 7.1:
- Configurable floats format on conversions #130
Changes in 6.6:
- Lower ujson4c's stack requirements on x86 #134
- Add missing Paket license. Remove replaced bash dependencies generation script #127
Changes in Rally
Changes in 1.1.0:
- Allow to override request timeout for force-merge #669
- Add sleep operation #667
- Introduce new command line parameter --track-revision #666