Achieve faster cardinality aggregations via dynamic pruning


Elasticsearch 8.9 introduces a speedup to cardinality aggregations through support for dynamic pruning. This optimization requires specific conditions to kick in, but when it does, it often yields spectacular results: we have observed some cardinality aggregations run as much as 1,000x faster with this change.

For instance, computing the number of unique Kubernetes deployments monitored by the Elastic Kubernetes integration benefits from this optimization:

POST metrics-*/_search
{
  "query": { // giving an example query, but any query would work
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1d/h" } } },
        { "match": { "data_stream.dataset": "kubernetes.pod" } }
      ]
    }
  },
  "size": 0,
  "track_total_hits": false,
  "aggs": {
    "deployment_count": {
      "cardinality": {
        "field": "​​kubernetes.deployment.name"
      }
    }
  }
}

How does it work?

Dynamic pruning is the process of using index structures to dynamically reduce the set of matches that need to be evaluated when running a query. For instance, if you query for the top 10 events sorted by descending timestamp, start evaluating matches, and find 10 hits whose timestamp is in the last hour, then you can dynamically introduce a filter on the timestamp field to ignore events that are more than one hour old: they have no chance of making it to the top-10.
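
As a rough illustration, here is a minimal Python sketch of that top-10 example. This is an analogy rather than Elasticsearch's actual implementation: in Lucene, the pruning happens at the index level, so skipped documents are never visited at all. The oldest timestamp in the top-k heap plays the role of the dynamically introduced filter:

import heapq

def top_k_most_recent(events, k=10):
    # events: iterable of (doc_id, timestamp) pairs, in index order
    heap = []  # min-heap of (timestamp, doc_id); the root is the oldest of the top k
    for doc_id, ts in events:
        if len(heap) < k:
            heapq.heappush(heap, (ts, doc_id))
        elif ts > heap[0][0]:
            # newer than the dynamic cutoff: replace the oldest entry
            heapq.heapreplace(heap, (ts, doc_id))
        # else: pruned, this event has no chance of making the top k
    return sorted(heap, reverse=True)

Once the heap is full, heap[0][0] is exactly the "more than one hour old" cutoff from the example above, and it only ever gets more selective as evaluation proceeds.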

The optimization to the cardinality aggregation follows a similar idea: once a value has been seen, there is no point in looking at it again later on, since it will not influence the count of unique values of the field. So during query evaluation, the cardinality aggregation dynamically introduces a filter in the form of a disjunctive query that only matches values that haven't been seen so far. As documents with new values get collected, those values get removed from the disjunction.

For instance, imagine that you are computing the cardinality of a field that has two unique values: a and b. The following table lists all matches from the query, with the Lucene doc ID of each match in the first column and its associated value in the second column.

Doc ID   Value
3        b
10       b
12       a
19       b
30       a

When it starts evaluating the query, Elasticsearch implicitly adds a filter on a OR b to the main query. After seeing the first match, doc ID 3, value b doesn't need to be seen again, so the filter is mutated into a more selective filter that only matches value a. This saves evaluating doc ID 10, which also has b as a value, and jumps directly to the next document that has a as a value: doc ID 12. At this point, a gets removed from the filter too, and Elasticsearch knows that there is no point in evaluating more matches, since it has already seen all unique values of the field. This saves evaluating doc IDs 19 and 30.
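
Here is a minimal Python sketch of this walkthrough (illustrative only; the unseen set plays the role of the disjunctive filter, and the function and its shapes are not Elasticsearch internals):

def count_unique(matches, segment_values):
    # segment_values is known upfront from the segment's terms dictionary
    unseen = set(segment_values)  # acts as the dynamic disjunctive filter
    evaluated = []
    for doc_id, value in matches:
        if not unseen:
            break             # all values seen: exit the query early
        if value not in unseen:
            continue          # pruned by the dynamic filter
        evaluated.append(doc_id)
        unseen.remove(value)  # mutate the filter: stop matching this value
    return len(segment_values) - len(unseen), evaluated

matches = [(3, "b"), (10, "b"), (12, "a"), (19, "b"), (30, "a")]
print(count_unique(matches, {"a", "b"}))  # (2, [3, 12]): doc IDs 10, 19 and 30 were skipped

In the real implementation, the disjunction iterates the postings lists of the remaining values in the inverted index, so pruned documents are skipped at the index level rather than being checked one by one as in this loop.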

The first phase of this optimization, when a filter is dynamically introduced, already significantly reduces the number of documents that the query needs to evaluate, which in turn speeds up query evaluation. But the second phase, when the query exits early after having seen all unique values, is what triggers the most spectacular speedups, as it can skip the majority of the documents in the index. Note that this second phase will not always happen: depending on the query, some values may only exist in documents that do not match it.

When does it kick in?

Disjunctive queries don't scale well with the number of clauses, so the main limitation of this optimization is that it can only work on fields that have a relatively small cardinality. As a consequence, Elasticsearch only enables this optimization on segments that have no more than 1,024 unique values.

Furthermore, this optimization is only supported on keyword fields to take advantage of the fact that they are indexed with an inverted index and that their doc values give us the number of unique values per segment.

Finally, the cardinality aggregation must be the only aggregation and be at the top level of the aggregation tree.
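
Putting the three conditions together, here is a hedged Python sketch of the eligibility check (all names are illustrative, not Elasticsearch internals; only the keyword requirement, the top-level-only requirement, and the 1,024-value limit come from the text above):

from dataclasses import dataclass

MAX_UNIQUE_VALUES = 1024  # per-segment limit

@dataclass
class SegmentStats:
    field_type: str          # mapping type of the aggregated field
    unique_value_count: int  # reported by the keyword field's doc values

def pruning_applies(top_level_agg_types, seg):
    # illustrative check mirroring the conditions above
    return (
        top_level_agg_types == ["cardinality"]           # sole, top-level aggregation
        and seg.field_type == "keyword"                  # indexed with an inverted index
        and seg.unique_value_count <= MAX_UNIQUE_VALUES  # low per-segment cardinality
    )

print(pruning_applies(["cardinality"], SegmentStats("keyword", 37)))    # True
print(pruning_applies(["cardinality"], SegmentStats("keyword", 5000)))  # False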

Conclusion

This optimization was evaluated against dashboards of the Elastic Kubernetes integration, where it triggered noticeable speedups to dashboard loading times, especially when working with large amounts of data. In particular, the example query shared in the introduction of this blog saw a 90% reduction in latency. We hope you'll enjoy the speedup!

What else is new in Elastic 8.9? Check out the 8.9 announcement post to learn more.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.