20 12월 2013

Elasticsearch's New Aggregations

By Alex Brasetvik

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Elasticsearch's faceting feature has made it extremely popular not just for realtime search, but also for analytics. With its new aggregations framework, it'll take you even further.

Introduction

Facets come in many forms and can be processed and visualized in many ways. Superfast faceting and filtering underpins usage of search engines for analytical purposes. We are not necessarily interested in finding the top n documents, but instead in what we can learn from filtering sets of results and seeing how e.g. histograms and charts change. The massively popular combination of Elasticsearch and Kibana for data crunching and visualization is an example of this kind of usage.

While facets are great and people are finding all sorts of creative ways of using them, they have some limitations. To deal with these limitations, Elasticsearch has engineered the new aggregations framework, which brings a whole new level of awesome.

What Facets Cannot Do

The illustrations and examples that follow are simplified. They are intended to explain how faceting and aggregation can be used, and not necessarily how they are implemented under the covers. A facet accepts a set of documents (as determined by the query) and produces some sort of summary, such as the how many documents have a certain tag, or how many documents fall in a certain “bucket” in a date histogram. While crunching these numbers, it does not retain information about which documents fall into which buckets: just the tallies. So it isn’t possible to easily say for example “for every store, show me a histogram of sales performance over time” with this model, just “show me sales performance per store” or “show me sales performance over time”.

Aggregations

Aggregations can retain information about which documents go into which bucket. A “bucketing” aggregator takes a set of documents and outputs one or more buckets. These buckets also have their own sets of documents. Therefore, one aggregations’ output can be used as another one’s input! The resulting possibilities are vast. You can now compose aggregations, creating trees of aggregators. Before we jump into building vast forrests of aggregation trees we’ll start with something simple.

Let’s start by combining aggregators to create something like the existing terms_stats facet. The terms stats facet is equivalent to a combination of terms and stats aggregators. It makes it possible to get statistics about one field per value of another, e.g. price statistics per tag. There is no corresponding terms_stats-aggregation. It’s not necessary, as we can now achieve the same goal by combining terms and stats aggregations. Our query will first perform a terms-aggregation and then perform a stats-aggregation. For every bucket the terms-aggregation produces, a stats-aggregation is run on the documents in that bucket.

Facets vs. aggregations

Facets vs. aggregations

Buckets vs. Metrics

The stats-aggregation is an example of what’s called a metric aggregator, and sometimes also a “calculator”. Example metric aggregators are min, max, sum, avg, value_count, stats, and extended_stats. These take a set of documents as input and crunch some numbers for the provided field. The output of metric aggregators does not include any information linked to individual documents, just the statistical data. i.e. metric aggregators can calculate that the minimum price is 10, but will not tell you which documents have that value.

stats is a metric aggregator

stats is a metric aggregator

The other kind of aggregator is a bucketing aggregator. These aggregators produce buckets, where a bucket has an associated value and a set of documents. For example, the terms-aggregator produces a bucket per term for the field it’s aggregated on. A document can end up in multiple buckets if the document has multiple values for the field being aggregated on. If a bucket aggregator has one or more “downstream” (i.e. child) aggregators, these are run on every bucket produced.

terms is an aggregator that produces buckets

terms is an aggregator that produces buckets

Bucket Ordering

By default, the buckets for a terms aggregator are ordered by the number of documents within them, high to low. You can make the order depend on the value of a metrics sub-aggregator as well. For example, you may want to produce a list of products and order them by the revenue generated, and not just the sales count. Similarily, the histogram-aggregation orders buckets by the implied key by default, but this can be changed as well.

Ordering the buckets by computed metrics

Ordering the buckets by computed metrics

Real World Example

Honza Král presents an excellent example of a more involved aggregation in his PyCon PL 2013-presentation “Explore Your Data”. Using Stackoverflow data as an example, he shows how to structure a query that can answer the question: “What’s the number of questions per tag? And per tag, what’s the average number of comments? Who’s commenting the most, and what’s the highest score of a comment per author?”

Summary

Aggregations bring a lot of new possibilities to the table. They will undoubtedly further strengthen Elasticsearch’s position as an analytics engine. In a future post we will look more into how they are implemented under the covers, in order to get a better understanding of the performance and memory complexity involved.

While the facet framework will continue to be supported for a while, it will be deprecated as of 1.0. Thus, you should plan on moving to the aggregations framework in the not too distant future.