This Week in Elasticsearch and Apache Lucene - 2019-12-02
Elasticsearch
Rollup Refactor and GA
The rollup feature is going through some changes with the aim of improving the user experience so that rollup indexes can be used much more like regular indexes. The principal changes that we are planning to make are:
- Remove rollup jobs in favour of an ILM action to roll up an index. This will mean that rolling up an index will work similarly to shrinking an index. The rollup will be done when indexing is complete, and the action will roll up the entire index at the same time.
- Add new field types specifically for rollup data. There will be two new field types:
  - Grouping tuple - The current thinking is that this would be analogous to the groups in the current rollups and defines the dimensions you want to use in your rollup documents
  - Rollup metric - This field type will be used for the metrics you want to roll up within each group, i.e. the metrics used to calculate the avg, min, max, etc.
- Remove the _rollup_search endpoint in favour of implementing searching on rollup indexes within the _search API. This will mean we'll need to modify the aggregations so that they can work on rollup fields as well as the currently existing field types.
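As a rough illustration of the two proposed field types, the sketch below rolls up raw documents in plain Python. The grouping tuple is modelled as a tuple of dimension values and the rollup metric as a small dict of pre-computed statistics. The names roll_up, avg, and value_count are hypothetical and chosen for illustration only; they are not the planned Elasticsearch mapping or API.

```python
from collections import defaultdict


def roll_up(docs, dimensions, metric_field):
    """Group docs by a tuple of dimension values and pre-compute metrics.

    Hypothetical helper for illustration only; not an Elasticsearch API.
    """
    groups = defaultdict(list)
    for doc in docs:
        key = tuple(doc[d] for d in dimensions)  # the "grouping tuple"
        groups[key].append(doc[metric_field])

    rolled = {}
    for key, values in groups.items():
        # The "rollup metric": enough state to answer min, max and avg later.
        rolled[key] = {
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "value_count": len(values),
        }
    return rolled


def avg(metric):
    # avg is derived from sum and value_count so that groups stay mergeable.
    return metric["sum"] / metric["value_count"]


docs = [
    {"host": "a", "dc": "eu", "cpu": 10},
    {"host": "a", "dc": "eu", "cpu": 30},
    {"host": "b", "dc": "us", "cpu": 50},
]
rolled = roll_up(docs, ["host", "dc"], "cpu")
```

Storing sum and value_count rather than a finished average is one possible design choice: it lets an aggregation combine several rolled-up groups without access to the raw documents.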
Scripting Languages and Contexts API
We have opened a PR adding a new API which exposes the types of scripts allowed (inline/stored), the available languages, and the contexts that each language may be used in. This API will allow Kibana to stop hard-coding scripting languages in order to provide a selector when creating scripted fields.
Reindex
Resilient reindex sorts by _seq_no in order to be able to resume on failure. We are adding a Rally challenge for reindex to determine the performance impact of the sort. Current results show that resilient reindex is slower than non-resilient reindex, and we are looking into the root causes.
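The reason for the sort can be sketched in a few lines of Python: if documents are always processed in _seq_no order, the last processed _seq_no is a sufficient checkpoint to resume from. This is a simplified model with hypothetical names (reindex_batch, checkpoint, fail_after), not the actual reindex implementation.

```python
def reindex_batch(source_docs, dest, checkpoint, fail_after=None):
    """Copy docs with _seq_no > checkpoint into dest, in _seq_no order.

    Returns the new checkpoint. fail_after simulates a crash after N docs.
    Hypothetical model for illustration; not Elasticsearch code.
    """
    pending = sorted(
        (d for d in source_docs if d["_seq_no"] > checkpoint),
        key=lambda d: d["_seq_no"],
    )
    for copied, doc in enumerate(pending):
        if fail_after is not None and copied == fail_after:
            return checkpoint  # simulated failure; the checkpoint is still valid
        dest[doc["_id"]] = doc["value"]
        checkpoint = doc["_seq_no"]
    return checkpoint


source = [{"_id": str(i), "_seq_no": i, "value": i * 10} for i in range(5)]
dest = {}

# First attempt stops after two documents.
checkpoint = reindex_batch(source, dest, checkpoint=-1, fail_after=2)
copied_before_resume = len(dest)

# The retry only needs the checkpoint to pick up where it left off.
checkpoint = reindex_batch(source, dest, checkpoint=checkpoint)
```

Without a stable sort order, a single checkpoint value would not be enough to tell which documents had already been copied.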
We amended the reindex documentation to clarify that source types are disregarded in reindex in 7.0+.
Also, there is a new setting that allows X-Pack to override the security headers that are required for reindexing, the code for which is OSS.
Faster sorted queries
In Elasticsearch 7.0 we introduced an optimization that allows Elasticsearch to skip non-competitive documents when sorting documents by relevance. This optimization is enabled by default and doesn't require any configuration. We've also added new queries that take advantage of this optimization to give more weight to documents closer to a certain date or location. However, this optimization only works if documents are sorted by score (relevance), so if you need to sort your documents by date or by a numeric field we have to switch to the slow execution path, which has to visit all documents even if you don't need the total hit count of documents that match the query.
Today, we are happy to announce that we have merged a change which allows Elasticsearch to apply the optimization to queries sorted by an indexed numeric field. The idea is to automatically translate the numeric sort into a distance feature query that is able to efficiently prune the documents that are too far from the current top N. Unlike index sorting, this optimization works when sorting in both ascending and descending order, and it can be very efficient, as shown in our nightly benchmark where sorting by timestamp is now up to 35x faster than before.
We're now working on applying this optimization when using search_after or during a scroll. We hope that this will enable more use cases that need to retrieve a stream of documents from Elasticsearch in sorted order.
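For reference, a request that sorts by an indexed numeric or date field might be built as follows, shown here as a Python dict. The field name @timestamp and the helper sorted_search_body are assumptions for illustration. track_total_hits is set to false because pruning only pays off when the exact total hit count is not needed; whether the optimization actually applies to a given request depends on details not covered in this post.

```python
import json


def sorted_search_body(field, order="desc", size=10):
    """Build a search body that sorts on a single indexed field.

    Hypothetical helper; only the body layout follows the _search API.
    """
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    return {
        "size": size,
        "track_total_hits": False,  # exact total hit count is not requested
        "query": {"match_all": {}},
        "sort": [{field: {"order": order}}],
    }


body = sorted_search_body("@timestamp")
payload = json.dumps(body)  # JSON that would be sent to the _search API
```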
We also want to work on another related change which will take into account the min and max time range on each shard and compare them to the sort value of the last competitive hit. If the shard cannot possibly contain a competitive hit, it can then be skipped, further improving the performance of searches sorted by date fields. This should mean that even when frozen indices are included in the search request, we can efficiently discount them, both improving the performance of the search and allowing Elasticsearch to release resources, since we know these shards are unnecessary for the search.
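The shard-skipping check described above could look roughly like this. It is a minimal sketch with a hypothetical helper (can_skip_shard) and made-up shard ranges, not the planned implementation.

```python
def can_skip_shard(shard_min, shard_max, last_competitive, order):
    """Return True if no value on the shard can beat the last competitive hit.

    Hypothetical helper. For a descending sort a shard cannot contribute when
    its maximum is below the worst value currently in the top N; for an
    ascending sort, when its minimum is above it. Ties are kept (not skipped).
    """
    if last_competitive is None:  # top N is not full yet, every shard may matter
        return False
    if order == "desc":
        return shard_max < last_competitive
    if order == "asc":
        return shard_min > last_competitive
    raise ValueError("order must be 'asc' or 'desc'")


# Made-up (min, max) timestamp ranges per index.
shards = {
    "logs-2019-10": (100, 199),
    "logs-2019-11": (200, 299),
    "logs-2019-12": (300, 399),
}

# Newest-first search whose top N currently ends at timestamp 350.
skipped = sorted(
    name
    for name, (lo, hi) in shards.items()
    if can_skip_shard(lo, hi, last_competitive=350, order="desc")
)
```

In this example only the most recent index can still contain a competitive hit, so the two older ones are skipped without being searched.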
More detail for the curious
The origin of the query is the minimum indexed value in the field for an ascending sort and the maximum when the sort is descending. We then take half of the field's value range ((max-min)/2) as the pivot to compute a score based on how close the document is to the origin. We cannot use the numeric values directly, since the framework that allows documents to be skipped during a query is exposed only for float values, so the distance is an approximation of the real distance to the origin, using the pivot value as a decay. When the queue of top documents is full, the distance feature query checks how many documents can be pruned by looking at the indexed values directly, in order to eliminate all documents that come after the current worst top doc.
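To make the scoring concrete, the sketch below uses the distance feature formula, score = boost * pivot / (pivot + distance), with the origin and pivot chosen as described above. The function names are hypothetical, and Python floats stand in for the float scores used internally, so the precision loss mentioned above is not reproduced here.

```python
def distance_feature_score(value, origin, pivot, boost=1.0):
    """Score a value by its distance to origin: closer means higher."""
    distance = abs(value - origin)
    return boost * pivot / (pivot + distance)


def rank_by_distance(values, order):
    """Rank values via distance scores instead of comparing them directly.

    Hypothetical helper for illustration; not Elasticsearch or Lucene code.
    """
    lo, hi = min(values), max(values)
    origin = lo if order == "asc" else hi
    pivot = (hi - lo) / 2 or 1.0  # avoid a zero pivot when all values are equal
    return sorted(
        values,
        key=lambda v: distance_feature_score(v, origin, pivot),
        reverse=True,  # highest score first
    )


timestamps = [1500, 1000, 1750, 2000, 1250]
ascending = rank_by_distance(timestamps, "asc")
descending = rank_by_distance(timestamps, "desc")
```

Because the score decreases strictly with the distance to the origin, ranking by score reproduces the requested sort order, and a document exactly one pivot away from the origin scores half of the boost.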
Lucene
- Interval queries will now include the field in their toString representation.
- An UnsupportedOperationException with the unified highlighter and interval queries has been addressed.
- Interval queries are now capable of telling which terms matched.
- Can we speed up the way BM25 scores are computed?
- Should concurrent search take the size of the queue into account to decide on the number of slices?
- A concurrency bug in polygon queries was fixed.
Changes
Changes in Elasticsearch
Changes in 8.0:
- Update randomizedrunner to 2.7.4 #49345
- IDs for doc snippets #49008
- Adjustments for FIPS 140 testing #49319
Changes in 7.6:
- Add a listener to track the progress of a search request locally #49471
- Slash missed in indices.put_mapping url #49468
- Enable LicenceServiceTests for all jdks #49440
- New Histogram field mapper that supports percentiles aggregations. #48580
- #48475 Pure disjunctions should rewrite to a MatchNoneQueryBuilder #48557
- Make docker build task incremental #49613
- Fix typo when assigning null_value in GeoPointFieldMapper #49645
- Add templating support to pipeline processor. #49030
- BREAKING: Add a cluster setting to disallow loading fielddata on _id field #49166
- Add templating support to enrich processor #49093
- Add Debug/Trace logging for authentication #49575
- Annotated text type should extend TextFieldType #49555
- Optimize sort on long field #48804
- Introduce on_failure_pipeline ingest metadata inside on_failure block #49076
- SQL: Add TRUNC alias for TRUNCATE #49571
- Return 400 when handling invalid JSON #49552
- print id detail when id is too long. #49433
- Fix HLRC parsing of CancelTasks response #47017
- Add the simple strategy to cluster settings #49414
- Flush instead of synced-flush inactive shards #49126
- Deprecate misconfigured SSL server config #49280
Changes in 7.5:
- Do not mutate request on scripted upsert #49578
- SQL: Fix issue with GROUP BY YEAR() #49559
- Fix extraction of notarized Elasticsearch release distribution #49511
- Replace required pipeline with final pipeline #49470
- Netty4: switch to composite cumulator #49478
Changes in 7.4:
- SQL: Fix issue with CASE/IIF pre-calculating results #49553
- SQL: Fix issue with folding of CASE/IIF #49449
Changes in 6.8:
- Fix iterate-from-1 bug in smart realm order #49473
- [Java.time] Retain prefixed date pattern in formatter #48703
Changes in Elasticsearch Hadoop Plugin
Changes in 7.6:
- Do not build scala docs as javadocs for scala 2.10 #1394
Changes in Elasticsearch SQL ODBC Driver
Changes in 6.8: