This Week in Elasticsearch and Apache Lucene - 2018-07-27
We have recently undertaken an effort to add significant functionality to the ingest node feature of Elasticsearch. This week’s work included PRs to add a conditional to any processor (so that a processor executes only if some per-document condition holds), a drop processor (to drop documents in the ingest pipeline), support for default pipelines, dissect functionality (less powerful than Grok but faster and simpler for certain use cases), and the ability within the convert processor to parse strings representing hex integers.
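As a rough illustration of how a conditional, a drop step and hex conversion fit together, here is a toy model in Python. It is not Elasticsearch's ingest API: `run_pipeline`, `drop`, `convert_hex` and the field names `level` and `status` are all hypothetical names chosen for this sketch.

```python
DROP = object()  # hypothetical sentinel meaning "discard this document"


def drop(doc):
    """Toy drop step: signals that the document should be discarded."""
    return DROP


def convert_hex(field):
    """Toy convert step: parses a hex string such as '0x1F' into an int."""
    def processor(doc):
        updated = dict(doc)  # leave the caller's document untouched
        updated[field] = int(doc[field], 16)
        return updated
    return processor


def run_pipeline(doc, steps):
    """Apply each (condition, processor) pair; a step runs only if its condition holds."""
    for condition, processor in steps:
        if condition(doc):
            doc = processor(doc)
            if doc is DROP:
                return None  # the document never reaches the index
    return doc


steps = [
    (lambda d: d.get("level") == "debug", drop),
    (lambda d: "status" in d, convert_hex("status")),
]

kept = run_pipeline({"level": "info", "status": "0x1F"}, steps)
dropped = run_pipeline({"level": "debug", "status": "0x1F"}, steps)
```

In this sketch `kept` ends up with an integer `status`, while `dropped` is `None` because the drop step fired before any conversion took place.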
Weighted Average Aggregation
We have added a new weighted average metric aggregation, which is similar to the standard average aggregation but multiplies each value by a weight taken from the document. This allows users to produce averages where each document's contribution to the denominator is not necessarily 1 and is instead determined by another field in the document: the result is the sum of the weighted values divided by the sum of the weights.
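The arithmetic behind a weighted average can be sketched in a few lines of Python. This is only the formula, not the aggregation's implementation; the function name and the `grade` and `weight` fields are made up for the example.

```python
def weighted_avg(docs, value_field, weight_field):
    """Return sum(value * weight) / sum(weight), or None if the weights sum to zero."""
    weighted_sum = 0.0
    total_weight = 0.0
    for doc in docs:
        weighted_sum += doc[value_field] * doc[weight_field]
        total_weight += doc[weight_field]
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


docs = [
    {"grade": 1, "weight": 3},
    {"grade": 2, "weight": 1},
    {"grade": 3, "weight": 1},
]

result = weighted_avg(docs, "grade", "weight")          # (1*3 + 2*1 + 3*1) / 5
plain = sum(d["grade"] for d in docs) / len(docs)       # every document counts as 1
```

Here the heavily weighted first document pulls the weighted result below the plain average.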
Document ID bug in rollups
We have found a bug in our experimental rollups feature: because we use a 32-bit hash to generate document IDs, there is a reasonable chance of collisions once the number of rollup documents reaches the hundreds of thousands. We have a plan to fix this and migrate current rollup jobs over to the fix.
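To get a feel for why a 32-bit ID space runs out so quickly, the standard birthday-bound approximation can be evaluated directly. The sketch below assumes the hash is uniformly distributed, which is an idealization and not a statement about the actual hash used by rollups.

```python
import math


def collision_probability(num_ids, hash_bits=32):
    """Approximate chance that at least two of num_ids uniform hashes collide.

    Uses the birthday approximation 1 - exp(-n * (n - 1) / (2 * N)),
    where N is the size of the hash space.
    """
    space = 2 ** hash_bits
    return 1.0 - math.exp(-num_ids * (num_ids - 1) / (2.0 * space))


for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} ids -> collision probability ~ {collision_probability(n):.3f}")
```

Under this assumption the probability is already well above one half at 100,000 IDs, and a wider hash (for example `hash_bits=128`) makes it negligible at the same scale.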
A variety of new security features have been merged and will be available in an upcoming release. These features include Kerberos, FIPS-compliance, and application privileges, the latter of which will clear the way for Kibana to store detailed authorization information in Elasticsearch for more granular security.
Working towards enabling nanosecond timestamps
Nanosecond timestamp resolution is a heavily requested feature, and it matters for logging use cases where high-speed events must be ordered correctly (e.g., 10GbE network events; even a gigabit network can render millisecond resolution insufficient). Today we rely on the Joda library for interacting with dates and times. However, this library only supports milliseconds. Joda was for many years a highly respected library but is now deprecated in favor of the Java date/time API added in JDK 8, and so will never see support for such resolutions. The problem is that Joda time is everywhere in the codebase, from aggregations to the clients to ingest to the mapping layer to scripting, so it is a massive effort to cut over. We are currently executing on a plan to migrate to the new Java date/time API.
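The ordering problem is easy to reproduce with plain integers. In the sketch below, three made-up events fall within the same millisecond; the event names and timestamps are illustrative only.

```python
# (event name, nanoseconds since the epoch), listed in arrival order
events = [
    ("packet-b", 1_532_649_600_000_450_000),
    ("packet-a", 1_532_649_600_000_120_000),
    ("packet-c", 1_532_649_600_000_900_000),
]


def to_millis(nanos):
    """Truncate a nanosecond timestamp to millisecond resolution."""
    return nanos // 1_000_000


# Sorting on truncated timestamps cannot separate the events: Python's sort
# is stable, so they simply stay in arrival order.
by_millis = [name for name, ts in sorted(events, key=lambda e: to_millis(e[1]))]

# Sorting on the full nanosecond value recovers the true order.
by_nanos = [name for name, ts in sorted(events, key=lambda e: e[1])]
```

With millisecond resolution all three events share one timestamp, so any ordering among them is arbitrary; only the nanosecond values distinguish them.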
Zen2 Node Discovery
We opened a PR which will add to Zen2 the ability to discover master-eligible nodes via a gossip-like mechanism. This work is an important piece of the usability story for our new cluster coordination layer.
More String functions for SQL
We have added support for a number of string manipulation functions for SQL. This is part of a larger effort to support SQL scalar functions so we will be adding more scalar function support for manipulating other data types including dates and numerics.
Changes in 6.4:
- Copy missing segment attributes in getSegmentInfo #32396
- Introduce fips_mode setting and associated checks #32326
- [Kerberos] Add Kerberos authentication support #32263
- Security: revert to old way of merging automata #32254
- Tribe: Add error with secure settings copied to tribe #32298
- Add ERR to ranking evaluation documentation #32314
- Introduce Application Privileges with support for Kibana RBAC #32309
- Backport to 6.x - Add Snapshots Status API to High Level Rest Client #32295
- Register ERR metric with NamedXContentRegistry #32320
- [CI] Reactivate 3rd party tests on CI #32315
- Rest HL client: Add put watch action (#32026) #32191
- Consistent encoder names #29492
- Add WeightedAvg metric aggregation #31037
- Rename ranking evaluation quality_level to metric_score #32168
- Fail shard if IndexShard#storeStats runs into an IOException #32241
- Fix range queries on _type field for singe type indices (#31756) #32161
- CCE when re-throwing "shard not available" exception in TransportShardMultiGetAction #32185
Changes in 6.5:
- Release requests in cors handle #32410
- Rest HL client: Add put license action #32214
- Release requests in cors handler #32364
- Add Restore Snapshot High Level REST API #32155
- BREAKING: Introduce index store plugins #32375
- Add opaque_id to index audit logging #32260
- Ingest: Support integer and long hex values in convert #32213
Changes in 7.0:
Disabling hit counts by default
The major release highlight of Lucene 8 is going to be a set of optimizations that make it possible to compute top matches sorted by score more efficiently by not having to visit all matches. Unfortunately, our users are not going to notice these improvements if we keep computing total hit counts by default, which requires visiting all matches.
As a consequence, we have been discussing what it would take to disable hit counts by default. Some UIs don't need hit counts at all: many mobile search UIs, for instance, implement infinite scrolling and give no information about the hit count. Traditional search UIs, however, still display a hit count (often approximated) and provide pagination support. For instance, if you want to allow users to paginate up to page 10 and display 20 hits per page, it is useful to count hits accurately up to 200 in order to know how many pages you need to make available to your users. In order to keep Lucene practical, we have been discussing changing our top collectors so that the computation of the hit count is not a yes/no choice, but rather a number of hits to count accurately. If the number of matches of a query is below this number then the hit count will be accurate; otherwise it will be a lower bound of the actual hit count.
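The proposed contract (accurate below a threshold, a lower bound otherwise) can be sketched as follows. This models only the contract, not Lucene's collectors; `count_hits` and the `"exact"` / `"lower_bound"` labels are hypothetical names, and the threshold is assumed to be at least 1.

```python
import math


def count_hits(matches, threshold):
    """Count matches, but stop as soon as `threshold` hits have been seen.

    Returns (count, relation): relation is "exact" when the matches ran out
    before the threshold, and "lower_bound" when counting stopped early.
    """
    count = 0
    for _ in matches:
        count += 1
        if count >= threshold:
            return count, "lower_bound"
    return count, "exact"


pages, hits_per_page = 10, 20
threshold = pages * hits_per_page  # 200, as in the pagination example

small = count_hits(range(57), threshold)         # fewer matches than the threshold
large = count_hits(range(1_000_000), threshold)  # counting stops after 200 hits

pages_for_small = math.ceil(small[0] / hits_per_page)
pages_for_large = math.ceil(large[0] / hits_per_page)
```

The point of the design is visible in the second call: a million matches are available, but only 200 are visited, which is still enough to know that all 10 pages should be offered.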
ReqOptSumScorer to optimize query processing based on impacts
One way that queries now optimize collection of top documents is by adding information about the produced scores directly into the skip lists of the inverted index so that documents that don't yield competitive scores can be skipped. This is currently leveraged by term queries, disjunctions and conjunctions. We are now optimizing ReqOptSumScorer, which is used for queries that mix MUST/FILTER and SHOULD clauses. These queries are typically used in order to boost the score of some documents by the value of another field such as "pagerank" or "popularity" by putting the regular query in a MUST clause and the boosting query in a SHOULD clause: this will return the same matches as the regular query, but scores will be the sum of the regular query score and the boosting query score. Having those queries optimized is exciting as it means that one could incorporate features into the score and still benefit from optimized computation of the top hits.
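The scoring semantics of a required clause combined with an optional boosting clause can be sketched with two dictionaries. This shows only how matches and scores combine, not how ReqOptSumScorer skips non-competitive documents; the function name, document IDs and scores are invented for the example.

```python
def req_opt_scores(required, optional):
    """Combine a required (MUST) clause with an optional (SHOULD) clause.

    Both arguments map doc id -> score. Only documents matching the required
    clause are returned; the optional score is added when that clause matches too.
    """
    return {
        doc: score + optional.get(doc, 0.0)
        for doc, score in required.items()
    }


text_query = {1: 2.0, 2: 1.5, 3: 0.5}   # regular query, in the MUST clause
popularity = {2: 1.0, 4: 3.0}           # boosting query, in the SHOULD clause

scores = req_opt_scores(text_query, popularity)
top_doc = max(scores, key=scores.get)
```

Document 4 matches only the boosting clause and is therefore absent from the result, while document 2 overtakes document 1 thanks to its boost.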
LatLonPoint has moved to core
After having lived in sandbox for years, we promoted LatLonPoint to lucene-core. This is the best available option to index geo-points with Lucene, and the one that is used by Elasticsearch's geo_point field.
SegmentReader now exposes both soft and hard deletes
Until now SegmentReader would expose either hard deletes, or soft deletes if you configured a soft-deletes field with your IndexWriter. This proved problematic since you sometimes need to ignore soft deletes, yet in that case not all deleted documents have been soft-deleted: Lucene also creates born-deleted documents when it hits an exception during the indexing process, for instance if the analyzer throws an exception. In order to make it possible to distinguish legitimate documents that have been soft-deleted from documents that failed indexing, we introduced the ability to fetch hard deletes on a SegmentReader.
We introduced a new getSubMatches() method in the matches API, which makes it possible to iterate over the sub-matches of a top-level match. For instance, if you think of a phrase query "quick fox" with a slop of 2 and the text "the quick yellow fox", then the query would return "quick yellow fox" as a top-level match, and "quick" and "fox" as sub-matches.
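The relationship between a top-level match and its sub-matches can be illustrated with a deliberately simplified model: a two-term, in-order phrase where slop is treated as the maximum number of tokens allowed between the terms. Lucene's sloppy phrase matching is more general than this, and `phrase_match` is a hypothetical helper, not part of the matches API.

```python
def phrase_match(tokens, first, second, slop):
    """Find `first` followed by `second` with at most `slop` tokens in between.

    Returns the top-level match (the whole span) together with its
    sub-matches (the individual query terms), or None if nothing matches.
    """
    for start, token in enumerate(tokens):
        if token != first:
            continue
        last_allowed = min(start + slop + 2, len(tokens))
        for end in range(start + 1, last_allowed):
            if tokens[end] == second:
                return {
                    "match": tokens[start:end + 1],
                    "sub_matches": [tokens[start], tokens[end]],
                }
    return None


tokens = "the quick yellow fox".split()
found = phrase_match(tokens, "quick", "fox", slop=2)
not_found = phrase_match(tokens, "quick", "fox", slop=0)
```

With a slop of 2 the span "quick yellow fox" is the top-level match and the two query terms are its sub-matches; with a slop of 0 the intervening "yellow" prevents any match.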
We are now adding support for matches to interval queries.
- DaciukMihovAutomatonBuilder, a helper class that implements an algorithm that can build an automaton in linear time from a sorted set of strings, now has protection against stack overflows.
- A soft deletes optimization to copy live docs efficiently got broken because of an unrelated change.
- The way that Lucene allows pre-filling PriorityQueue objects with sentinels could be hard to use.
- We found some leniency in the way that Lucene carries over previous generations when overriding an existing index.
- Could we optimize TopFieldCollector when the index is sorted but hit counts need to be computed?
- Null payloads shouldn't count as non-null payloads for scoring.
- Clarified the contract that Directory implementations must obey.
- TopFieldCollector no longer computes scores at collection time; this is now done after collection, only on top hits.
- The stempel stemmer is sometimes way too aggressive.