This Week in Elasticsearch and Apache Lucene - 2020-02-07

Elasticsearch

Slow (completion) stats

We looked into a couple of cases of struggling clusters and saw that the hot threads dominated by computing Engine#completionStats . There's a bit of a pattern here, looking back over past issues: there are some overenthusiastic third-party monitoring systems that hit GET _stats or GET _cluster/stats far too frequently, and these stats can be expensive to compute. The unbounded queue in the management threadpool doesn't help matters either. We opened a PR to suggest streamlining these stats in particular, and an issue to discuss whether we should introduce backpressure for stats requests.

Security User Ingest Processor

We have updated the Set Security User Ingest Processor to support additional authentication information.

This processor adds a new field to a document to record information about the user that ingested it. It has existed for a long time, but it was of limited use because it was not possible to enforce the use of a specific pipeline or processor, so a malicious user could just write their own fake “user” object into a document and then skip the set security user processor to make it look like the document was ingested by someone else.

Elasticsearch 7.5.0 added support for enforced pipelines, which makes the Set Security User Processor much more useful as a true record regarding the origin of documents.

We’ve taken the opportunity to give that processor a little bit more love, and it now supports adding fields related to:

  • The realm from which the user was resolved/authenticated
  • The type of authentication that was used (realm, bearer token, api key, anonymous)
  • The name & ID of the API key that was used, if any.

The last point was the key driver for scheduling this work now. There is work underway to make heavier use of API keys within beats, and recording API key information can tell us exactly which beat, running on which host ingested a particular document, even if all of those API keys are owned by the same logical user.

Analytics

We opened a PR to add a new boxplot aggregation. A boxplot is a fairly common descriptive statistic used to visualize numerical quantities. It consists of a vertical box, where the upper bound of the box is the 75th percentile, the lower bound is the 25th percentile and a line somewhere inside for the median (50th percentile). Two "whiskers" extend from the box to touch the min/max. This visualization gives a huge amount of information in a compact form, and is often used in financial analysis for that reason. The boxplot agg will live in our licensed Analytics plugin.

blog-this-week-2020-02-07.png

Sorted Queries

In Elasticsearch 7.0 we improved queries sorted by relevance (score), by a large margin, for users that don't need to know the exact number of documents that match the query. This new infrastructure built in Lucene specifically for queries sorted by score has been used since then to also improve queries sorted by a field. We pushed a change in Elasticsearch to optimize queries sorted by a date or long field that proved very efficient on time-based indices. This change takes advantage of the almost-sorted property of a shard built with time-based events to efficiently filter blocks of documents or even entire segments when looking for the oldest or newest documents (sorted by timestamp).

However, this optimization only works at the shard level so if you have an index pattern that spans years of data, Elasticsearch still needs to search every shard to return the oldest/newest documents. This issue can be mitigated by adding a range filter that restricts the search to the last N days then weeks, months but that's also the kind of expert usage that we'd like to make obsolete. Search should be fast “out of the box” if you're looking after the oldest/newest documents that match a query in an index pattern.

Following, we opened a PR to filter entire shards on sorted queries over time-based indices, which should make timestamp-based sorted queries on time-based indices cheap and fast even for a very large number of shards.

This change won't affect queries that run aggregations (users won't see any speed up). While the majority of our users run aggregation, we're also building a lot of new features under the assumption that sorted queries are fast. The current work on EQL for instance relies on sorted queries to implement efficient joins.

We are now looking at improving queries that use search_after in order to allow users to iterate quickly over a stream of sorted documents (beyond the first page).

Benchmarking Highly Concurrent Ingest Performance

Security has been looking into ingest performance with high concurrency and small doc size and as we are investigating possible Elasticsearch improvements, we turned to the performance team for standardized reproducible benchmarks. This turned into a parallelized effort with an initial solution using the Locust highly concurrent load testing framework and then starting to work on a technical spike of a new Rally load generator component based on async I/O to support vastly higher client counts, while we improved the initial Locust script making it a more stable and flexible solution.

Breaking changes:

No breaking changes to report this week!

Apache Lucene

Off-heap stored fields index

Lucene's stored fields are a document store. Elasticsearch uses it in order to store the _id, _source and _routing values when specified, as well as any user-configured fields that have store: true. This document store has two components: a data file that stores fields for all documents sequentially, and an index file that enables Lucene to know where stored fields for a given document start in the data file. Until now the content of this file was loaded in memory. It's not much memory, if you look at the nightly benchmarks, geonames only spends 3.8MB on this stored fields index for 11M documents. But it's also a pity that Lucene spends more memory on stored fields than it does on BKD trees or doc values, which are typically more performance-sensitive. The next version of Lucene will have the index read directly from disk.

Simplification of Geo API

Graduating LatLonShape and XYShape from lucene sandbox to core has opened the possibility to simplify the current Lucene geo API. This refactoring effort provides just one query implementation for all queryable geometries. As a side effect, it allows the implementation of queries with multiple geometries more efficiently. In many cases, those queries need to be rewritten as a boolean query, each query representing one of the geometries. With this new API, it would be possible to generate it as a single query.