This Week in Elasticsearch and Apache Lucene - 2018-07-27

Elasticsearch

Ingest Processors

We have recently undertaken an effort to add significant functionality to the ingest node feature of Elasticsearch. This week’s work included PRs to add a conditional to any processor (so that a processor executes only if some per-document condition holds), a drop processor (to drop documents in the ingest pipeline), support for default pipelines, dissect functionality (less powerful than Grok but faster and simpler for certain use-cases), and the ability within the convert processor to parse strings representing hex integers.
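
As a rough illustration, here is what a pipeline combining the conditional and the drop processor could look like, created through the low-level Java REST client. This is a minimal sketch: the pipeline id and field name are made up, and the exact "if"/drop syntax follows the open PRs and may still shift before release.

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.RestClient;

    public class DropDebugPipeline {
        public static void main(String[] args) throws Exception {
            try (RestClient client = RestClient.builder(
                    new HttpHost("localhost", 9200, "http")).build()) {
                // A drop processor that only fires when the per-document
                // condition (a Painless expression) evaluates to true.
                Request request = new Request("PUT", "/_ingest/pipeline/drop-debug-logs");
                request.setJsonEntity(
                    "{ \"processors\": [ { \"drop\": { \"if\": \"ctx.level == 'debug'\" } } ] }");
                client.performRequest(request);
            }
        }
    }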

Weighted Average Aggregation

We have added a new weighted average metric aggregation, which is similar to the standard average aggregation but weights each value by a second value taken from the same document. This allows users to produce averages where each document's contribution to the denominator (the count) is not necessarily 1 but is instead determined by another field in the document.
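
Concretely, the aggregation computes sum(value × weight) / sum(weight) instead of sum(value) / count. A minimal sketch of that arithmetic (the numbers are made up):

    public class WeightedAverage {
        public static void main(String[] args) {
            // e.g. per-document grades, each weighted by a second field
            double[] values  = {1.0, 2.0, 3.0};
            double[] weights = {3.0, 1.0, 1.0};

            double weightedSum = 0, weightTotal = 0;
            for (int i = 0; i < values.length; i++) {
                weightedSum += values[i] * weights[i];
                weightTotal += weights[i];
            }
            // (1*3 + 2*1 + 3*1) / (3 + 1 + 1) = 8 / 5 = 1.6,
            // versus the plain average (1 + 2 + 3) / 3 = 2.0
            System.out.println(weightedSum / weightTotal);
        }
    }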

Document ID bug in rollups

We have found a bug in our experimental rollups feature: our use of a 32-bit hash to generate document IDs means there is a reasonable chance of collisions once the number of rollup documents reaches the hundreds of thousands. We have a plan to fix this and to migrate current rollup jobs over to the fix.
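
This is the classic birthday problem: a 32-bit hash offers only 2^32 distinct IDs, so collisions become likely once the document count approaches the square root of that space. A back-of-the-envelope estimate:

    public class HashCollisionOdds {
        public static void main(String[] args) {
            double space = Math.pow(2, 32); // distinct 32-bit hash values
            for (long n : new long[] {10_000, 100_000, 1_000_000}) {
                // birthday approximation: p ≈ 1 - e^(-n^2 / (2 * space))
                double p = 1 - Math.exp(-((double) n * n) / (2 * space));
                System.out.printf("%,d docs -> %.1f%% chance of a collision%n", n, p * 100);
            }
        }
    }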

Security Features

A variety of new security features have been merged and will be available in an upcoming release. These features include Kerberos authentication, FIPS 140-2 compliance, and application privileges, the latter of which will clear the way for Kibana to store detailed authorization information in Elasticsearch for more granular security.

Working towards enabling nanosecond timestamps

Nanosecond timestamp resolution is a heavily requested feature, particularly for logging use cases where nanosecond timestamps are needed to correctly order high-speed events (e.g., 10GbE network events; even a gigabit network can render millisecond resolution insufficient). Today we rely on the Joda-Time library for interacting with dates and times, but it only supports millisecond resolution. Joda-Time was for many years a highly-respected library, but it is now deprecated in favor of the Java date/time API added in JDK 8 and so will never gain support for finer resolutions. The problem is that Joda-Time is everywhere in the codebase, from aggregations to the clients to ingest to the mapping layer to scripting, so the cutover is a massive effort. We are currently executing on a plan to migrate to the new Java date/time API.
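
For contrast, here is the resolution gap in plain JDK code (this is java.time itself, not Elasticsearch APIs):

    import java.time.Instant;

    public class NanoResolution {
        public static void main(String[] args) {
            // java.time parses and preserves nanosecond precision...
            Instant ts = Instant.parse("2018-07-27T12:00:00.123456789Z");
            System.out.println(ts.getNano()); // 123456789

            // ...while a millisecond representation, as in Joda-Time,
            // truncates everything below the millisecond.
            long millis = ts.toEpochMilli();
            System.out.println(Instant.ofEpochMilli(millis)); // 2018-07-27T12:00:00.123Z
        }
    }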

Zen2 Node Discovery

We opened a PR which will add to Zen2 the ability to discover master-eligible nodes via a gossip-like mechanism. This work is an important piece of the usability story for our new cluster coordination layer.

More String functions for SQL

We have added support for a number of string manipulation functions in SQL. This is part of a larger effort to support SQL scalar functions, so we will be adding more scalar functions for manipulating other data types, including dates and numerics.
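
For example, through the 6.x SQL endpoint (the index and fields are made up, and we are assuming UCASE and LENGTH are among the supported functions):

    import org.apache.http.HttpHost;
    import org.apache.http.util.EntityUtils;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.Response;
    import org.elasticsearch.client.RestClient;

    public class SqlStringFunctions {
        public static void main(String[] args) throws Exception {
            try (RestClient client = RestClient.builder(
                    new HttpHost("localhost", 9200, "http")).build()) {
                Request request = new Request("POST", "/_xpack/sql");
                request.addParameter("format", "txt");
                request.setJsonEntity(
                    "{ \"query\": \"SELECT UCASE(author), LENGTH(name) FROM library\" }");
                Response response = client.performRequest(request);
                System.out.println(EntityUtils.toString(response.getEntity()));
            }
        }
    }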

Changes

Changes in 6.4:

  • Copy missing segment attributes in getSegmentInfo #32396
  • Introduce fips_mode setting and associated checks #32326
  • [Kerberos] Add Kerberos authentication support #32263
  • Security: revert to old way of merging automata #32254
  • Tribe: Add error with secure settings copied to tribe #32298
  • Add ERR to ranking evaluation documentation #32314
  • Introduce Application Privileges with support for Kibana RBAC #32309
  • Backport to 6.x - Add Snapshots Status API to High Level Rest Client #32295
  • Register ERR metric with NamedXContentRegistry #32320
  • [CI] Reactivate 3rd party tests on CI #32315
  • Rest HL client: Add put watch action (#32026) #32191
  • Consistent encoder names #29492
  • Add WeightedAvg metric aggregation #31037
  • Rename ranking evaluation quality_level to metric_score #32168
  • Fail shard if IndexShard#storeStats runs into an IOException #32241
  • Fix range queries on _type field for single type indices (#31756) #32161
  • CCE when re-throwing "shard not available" exception in TransportShardMultiGetAction #32185

Changes in 6.5:

  • Release requests in cors handler #32410
  • Rest HL client: Add put license action #32214
  • Release requests in cors handler #32364
  • Add Restore Snapshot High Level REST API #32155
  • BREAKING: Introduce index store plugins #32375
  • Add opaque_id to index audit logging #32260
  • Ingest: Support integer and long hex values in convert #32213

Changes in 7.0:

  • INGEST: Fix Deprecation Warning in Script Proc. #32407
  • Networking: Fix test leaking buffer #32296

Lucene

Disabling hit counts by default

The major release highlight of Lucene 8 is going to be a set of optimizations that make it possible to compute top matches sorted by score more efficiently, by not visiting all matches. Unfortunately, these improvements will go unnoticed by our users if we keep computing total hit counts by default, since that requires visiting every match.

As a consequence, we have been discussing what it would take to disable hit counts by default. Some UIs don't need hit counts at all; for instance, many mobile search UIs implement infinite scrolling and never display a count. Traditional search UIs, however, still display a hit count (often approximated) and provide pagination support. For instance, if you want to allow users to paginate up to page 10 with 20 hits per page, it is useful to count hits accurately up to 200 in order to know how many pages to make available. To keep Lucene practical, we have been discussing changing our top collectors so that hit counting is not a yes/no choice but rather a threshold: the number of hits to count accurately. If a query has fewer matches than this threshold then the hit count will be accurate; otherwise it will be a lower bound on the actual hit count.
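
To make the idea concrete, here is a sketch of what this could look like from the caller's side. The two-argument create(numHits, totalHitsThreshold) variant is the proposal under discussion, not a released API:

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.TopScoreDocCollector;
    import org.apache.lucene.store.Directory;

    public class CountUpToThreshold {
        static TopDocs searchPage(Directory dir, Query query) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Hypothetical threshold-taking factory: counts are accurate
                // below 200 matches (10 pages of 20) and a lower bound above.
                TopScoreDocCollector collector = TopScoreDocCollector.create(20, 200);
                searcher.search(query, collector);
                return collector.topDocs();
            }
        }
    }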

ReqOptSumScorer to optimize query processing based on impacts

One way that queries now optimize collection of top documents is by adding information about the produced scores directly into the skip lists of the inverted index, so that documents that can't yield competitive scores are skipped. This is currently leveraged by term queries, disjunctions, and conjunctions. We are now optimizing ReqOptSumScorer, which is used for queries that mix MUST/FILTER and SHOULD clauses. These queries are typically used to boost the score of some documents by the value of another field, such as "pagerank" or "popularity", by putting the regular query in a MUST clause and the boosting query in a SHOULD clause: this returns the same matches as the regular query, but scores are the sum of the regular query score and the boosting query score. Having these queries optimized is exciting, as it means one can incorporate such features into the score and still benefit from optimized computation of the top hits.
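
For instance, a minimal sketch of such a query (field names are made up; FeatureField, which landed in Lucene 7.4, is one way to index a per-document "pagerank" signal, and the weight/pivot values here are arbitrary tuning parameters):

    import org.apache.lucene.document.FeatureField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class BoostByPagerank {
        // At index time: doc.add(new FeatureField("features", "pagerank", value));
        public static Query build() {
            Query regular = new TermQuery(new Term("body", "lucene"));
            Query boost = FeatureField.newSaturationQuery("features", "pagerank", 1f, 0.5f);
            return new BooleanQuery.Builder()
                .add(regular, Occur.MUST)  // determines which documents match
                .add(boost, Occur.SHOULD)  // only contributes to the score
                .build();
        }
    }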

LatLonPoint has moved to core

After years in the sandbox module, LatLonPoint has been promoted to lucene-core. It is the best available option for indexing geo-points with Lucene, and the one used by Elasticsearch's geo_point field.
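
Typical usage looks like this (coordinates are arbitrary):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.LatLonPoint;
    import org.apache.lucene.search.Query;

    public class GeoPointExample {
        public static void main(String[] args) {
            // Index one point per document...
            Document doc = new Document();
            doc.add(new LatLonPoint("location", 48.8584, 2.2945));

            // ...then query, e.g. for everything within 1km of a point.
            Query nearby = LatLonPoint.newDistanceQuery("location", 48.8584, 2.2945, 1000.0);
        }
    }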

SegmentReader now exposes both soft and hard deletes

Until now, SegmentReader would expose either hard deletes or, if you configured a soft-deletes field on your IndexWriter, soft deletes. This proved problematic: you sometimes need to look past soft deletes, yet not every deleted document is one that was soft-deleted. Lucene also creates born-deleted documents when it hits an exception during indexing, for instance if the analyzer throws an exception. To make it possible to distinguish legitimately soft-deleted documents from documents that failed indexing, we introduced the ability to fetch hard deletes on a SegmentReader.
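
A sketch of how the two can now be told apart, assuming the new accessor is named getHardLiveDocs as in the change:

    import org.apache.lucene.index.SegmentReader;
    import org.apache.lucene.util.Bits;

    public class DeleteKinds {
        // A document that is deleted overall but still live in the hard
        // live docs must have been soft-deleted, rather than, say, born
        // deleted because of an indexing failure.
        static boolean isSoftDeletedOnly(SegmentReader reader, int docId) {
            Bits liveDocs = reader.getLiveDocs();         // hard + soft deletes applied
            Bits hardLiveDocs = reader.getHardLiveDocs(); // hard deletes only; may be null
            boolean deleted = liveDocs != null && !liveDocs.get(docId);
            boolean hardDeleted = hardLiveDocs != null && !hardLiveDocs.get(docId);
            return deleted && !hardDeleted;
        }
    }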

Match iterators

We introduced a new getSubMatches() method in the matches API, which makes it possible to iterate over the sub-matches of a top-level match. For instance, given a phrase query "quick fox" with a slop of 2 and the text "the quick yellow fox", the query would return "quick yellow fox" as a top-level match, and "quick" and "fox" as sub-matches.
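
A sketch of walking that hierarchy, assuming a MatchesIterator positioned on a field of a matching document (whether getSubMatches() may return null is still a detail of the in-progress API):

    import java.io.IOException;
    import org.apache.lucene.search.MatchesIterator;

    public class SubMatches {
        // Print each top-level match and, nested underneath it, the
        // individual term matches (e.g. "quick" and "fox" inside the
        // sloppy phrase match "quick yellow fox").
        static void printMatches(MatchesIterator it) throws IOException {
            while (it.next()) {
                System.out.println("match: " + it.startPosition() + "-" + it.endPosition());
                MatchesIterator subs = it.getSubMatches();
                while (subs != null && subs.next()) {
                    System.out.println("  sub: " + subs.startPosition() + "-" + subs.endPosition());
                }
            }
        }
    }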

We are now adding support for matches to interval queries.

Other