Aggregate all the things: New aggregations in Elasticsearch 7

The aggregations framework has been part of Elasticsearch since version 1.0, and through the years it has seen optimizations, fixes, and even a few overhauls. Since the Elasticsearch 7.0 release, quite a few new aggregations have been added, such as rare_terms, top_metrics, and auto_date_histogram. In this blog post we will explore a few of those and take a closer look at what they can do for you.

In order to test out these new aggs, we're going to set up a sample dataset in an Elasticsearch 7.9 deployment. Keeping your cluster version current is a great way to ensure you have all the latest features (and fixes), but if you're on an older version and upgrading isn't an option, you can follow along in a free trial of Elastic Cloud.

The following documents could represent an ecommerce use case where a user clicks on a product and retrieves the product details. Of course, the data is missing a lot of detail, such as a session id for each user that would let you track and monitor sessions, or even go a step further and use transforms to get further insights into your data. This post keeps things simple to make sure all the concepts are understood.

Use the Dev Tools Console within Kibana, or cURL from a command line, to index the sample data:

PUT website-analytics/_bulk?refresh 
{ "index" : {}} 
{ "product_id": "123", "@timestamp" : "2020-10-01T11:11:23.000Z", "price" : 12.34, "response_time_ms": 242 } 
{ "index" : {}} 
{ "product_id": "456", "@timestamp" : "2020-10-02T12:14:00.000Z", "price" : 20.58, "response_time_ms": 98 } 
{ "index" : {}} 
{ "product_id": "789", "@timestamp" : "2020-10-03T13:15:00.000Z", "price" : 34.16, "response_time_ms": 123 } 
{ "index" : {}} 
{ "product_id": "123", "@timestamp" : "2020-10-02T14:16:00.000Z", "price" : 12.34, "response_time_ms": 465 } 
{ "index" : {}} 
{ "product_id": "123", "@timestamp" : "2020-10-02T14:18:00.000Z", "price" : 12.34, "response_time_ms": 158 } 
{ "index" : {}} 
{ "product_id": "123", "@timestamp" : "2020-10-03T15:17:00.000Z", "price" : 12.34, "response_time_ms": 168 } 
{ "index" : {}} 
{ "product_id": "789", "@timestamp" : "2020-10-06T15:17:00.000Z", "price" : 34.16, "response_time_ms": 220 } 
{ "index" : {}} 
{ "product_id": "789", "@timestamp" : "2020-10-10T15:17:00.000Z", "price" : 34.16, "response_time_ms": 99 }

Auto-bucketing aggregations

These aggregations change the way buckets are defined. With a time-based aggregation, you usually define your buckets based on a time interval, like 1d. However, when you do not know the nature of your data, simply specifying the number of buckets you expect is often easier from a user perspective.

This is where the following two new aggregations come into play.

auto_date_histogram Aggregation

The auto_date_histogram aggregation runs on date fields and allows you to configure the number of buckets you expect to be returned. Let's try this on our small dataset:

POST website-analytics/_search?size=0 
{ 
  "aggs": { 
    "views_over_time": { 
      "auto_date_histogram": { 
        "field": "@timestamp", 
        "buckets": 3 
      } 
    } 
  } 
}

POST website-analytics/_search?size=0
{ 
  "aggs": { 
    "views_over_time": { 
      "auto_date_histogram": { 
        "field": "@timestamp", 
        "buckets": 10 
      } 
    } 
  } 
}

Running these two queries shows that the returned interval depends on the number of buckets requested. With three buckets, you should get one bucket per week, whereas with 10 buckets you should get one bucket per day.

If you need a minimum interval, you can configure one with the minimum_interval parameter; see the auto_date_histogram documentation.
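
As a minimal sketch of the minimum interval option, the request below asks for up to 100 buckets (an arbitrary number picked for illustration) and sets minimum_interval to day, so the aggregation should not choose an interval smaller than one day for our small dataset:

```console
POST website-analytics/_search?size=0
{
  "aggs": {
    "views_over_time": {
      "auto_date_histogram": {
        "field": "@timestamp",
        "buckets": 100,
        "minimum_interval": "day"
      }
    }
  }
}
```

You can compare the interval in the response against the same request without minimum_interval to see the effect.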

variable_width_histogram Aggregation

The variable_width_histogram aggregation lets you dynamically create a preconfigured number of buckets. On top of that, those buckets have a variable width, in contrast to the fixed width of a regular histogram aggregation.

POST website-analytics/_search?size=0 
{ 
  "aggs": { 
    "prices": { 
      "variable_width_histogram": { 
        "field": "price", 
        "buckets": 3 
      } 
    } 
  } 
}

As there are only three different prices in our dataset, the min, max, and key values of each bucket are the same. If you request two buckets instead, one bucket has to cover more than one price, so its values will differ.

Also keep in mind that bucket bounds are approximate.
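
For example, the two-bucket variant of the request could look like the sketch below. Only the buckets value changes; since bounds are approximate, the exact min, max, and key values returned are not shown here:

```console
POST website-analytics/_search?size=0
{
  "aggs": {
    "prices": {
      "variable_width_histogram": {
        "field": "price",
        "buckets": 2
      }
    }
  }
}
```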

A use case for this might be an ecommerce application where you would like to display price buckets as part of your faceted navigation. However, this would make your website navigation rather sensitive to outliers, so think about having a category filter in place before doing this.

Thanks to community member James for getting this aggregation into Elasticsearch and grokking the hierarchical agglomerative clustering algorithm. You can read more about it in the GitHub pull request.

Aggregations on strings

The following aggregations work on string-based fields, usually keyword fields.

rare_terms Aggregation

As an Elastic Stack user, you will likely have come across the terms aggregation already. It returns the most frequent terms in a dataset. You can change the sort order to return the least frequent terms instead, but this comes with an unbounded error, so the result is only an approximation when the data is collected from several shards across a cluster. This is because Elasticsearch tries to avoid copying all the data from different shards to a single node, as this would be expensive and slow.

The rare_terms aggregation tries to circumvent these problems by using a different implementation than the terms aggregation. Even though it still produces an approximate count, the rare_terms aggregation has a well-defined, bounded error.

To figure out which product id has been indexed the least in the above dataset, try the following:

POST website-analytics/_search?size=0 
{ 
  "aggs": { 
    "rarest_product_ids": { 
      "rare_terms": { 
        "field": "product_id.keyword" 
      } 
    } 
  } 
}

You can also play around with the max_doc_count parameter (the counterpart to min_doc_count in the terms aggregation) to change the number of buckets being returned.
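
As a sketch, the following variation raises max_doc_count from its default of 1 to 3 (an arbitrary value chosen for this dataset), so every product id that appears in at most three documents should qualify:

```console
POST website-analytics/_search?size=0
{
  "aggs": {
    "rarest_product_ids": {
      "rare_terms": {
        "field": "product_id.keyword",
        "max_doc_count": 3
      }
    }
  }
}
```

With the sample data above, that should include 456 (one document) and 789 (three documents), but not 123 (four documents).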

string_stats Aggregation

How about getting some statistics about the values of a string field in your data? Let's go and try the string_stats aggregation:

POST website-analytics/_search?size=0 
{
  "aggs": {
    "product_id_stats": {
      "string_stats": { 
        "field": "product_id.keyword", 
        "show_distribution" : true 
      } 
    } 
  } 
}

This returns statistics about the minimum, maximum, and average length of the strings in that field. By adding the show_distribution parameter, you will also see the distribution of each character found.

This is an easy way to quickly check your data for outliers that may indicate misindexed data, such as an overly long or short product id. The Shannon entropy value that is returned can also be used for purposes like finding DNS data exfiltration attempts.

Metrics based aggregations

Let's dig into the next group of aggregations, the ones that calculate values from numeric fields within buckets.

top_metrics Aggregation

You probably know the top_hits aggregation already, which returns a full hit including its source. However, if you are only interested in a single value and would like to sort by it, take a look at the top_metrics aggregation. Because it does not need the whole document, this aggregation uses less memory than top_hits and is often faster. It is frequently used to retrieve the most recent value from each bucket.

In our clickstream dataset you might be interested in the price of the latest click event:

POST website-analytics/_search 
{ 
  "size": 0, 
  "aggs": { 
    "tm": { 
      "top_metrics": { 
        "metrics": {"field": "price"}, 
        "sort": {"@timestamp": "desc"} 
      } 
    } 
  } 
}

The sorting also supports _score or geo distances. In addition, you can specify more than one metric, so you could add another field to the metrics field, which then needs to become an array.
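
To illustrate both points, here is a sketch that nests top_metrics inside a terms aggregation and requests two metrics as an array. The aggregation names by_product and latest_click are made up for this example:

```console
POST website-analytics/_search
{
  "size": 0,
  "aggs": {
    "by_product": {
      "terms": {
        "field": "product_id.keyword"
      },
      "aggs": {
        "latest_click": {
          "top_metrics": {
            "metrics": [
              {"field": "price"},
              {"field": "response_time_ms"}
            ],
            "sort": {"@timestamp": "desc"}
          }
        }
      }
    }
  }
}
```

Each product bucket should then carry the price and response time of its most recent click event.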

boxplot Aggregation

The boxplot aggregation does exactly what the name says and provides the data for a box plot:

GET website-analytics/_search 
{ 
  "size": 0, 
  "aggs": { 
    "by_date": { 
      "date_histogram": { 
        "field": "@timestamp", 
        "calendar_interval": "day", 
        "min_doc_count": 1 
      },
      "aggs": {
        "price_boxplot": {
          "boxplot": { 
            "field": "price" 
          } 
        } 
      } 
    } 
  } 
}

The above query returns a box plot for every day that has data in its daily bucket.

We will skip the t-test aggregation as the super small dataset does not allow for any useful aggregation request here. To see the value of this aggregation, you'd need a data set where you assume a change of behaviour that can be uncovered via a statistical hypothesis.

Pipeline aggregations

Next up are aggregations that run on top of other aggregations, known as pipeline aggregations. Quite a few have been added over the last year.

cumulative_cardinality Aggregation

This is a useful aggregation for finding the count of new items in your dataset.

GET website-analytics/_search 
{ 
  "size": 0, 
  "aggs": { 
    "by_day": { 
      "date_histogram": { 
        "field": "@timestamp", 
        "calendar_interval": "day" 
      }, 
      "aggs": { 
        "distinct_products": { 
          "cardinality": { 
            "field": "product_id.keyword" 
          } 
        }, 
        "total_new_products": { 
          "cumulative_cardinality": { 
            "buckets_path": "distinct_products" 
          } 
        } 
      } 
    } 
  } 
}

With the above query you get a running total of the distinct products that have been visited up to each day. The increase from one day to the next tells you how many new, previously unseen products were visited on that day. This might help you in an ecommerce setting to figure out whether your new products are actually being viewed, or whether your best sellers are staying in the top ranks and you should maybe change your marketing approach.
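
If you would like the number of newly seen products per day as its own value, one option is to add a derivative pipeline aggregation on top of the cumulative cardinality. The following is a sketch of that idea; the new_products_per_day name is made up for this example:

```console
GET website-analytics/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "distinct_products": {
          "cardinality": {
            "field": "product_id.keyword"
          }
        },
        "total_new_products": {
          "cumulative_cardinality": {
            "buckets_path": "distinct_products"
          }
        },
        "new_products_per_day": {
          "derivative": {
            "buckets_path": "total_new_products"
          }
        }
      }
    }
  }
}
```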

normalize Aggregation

Let's get an overview of which day has the most traffic on a percentage basis, with 100% being all the data that matched the query.

GET website-analytics/_search 
{ 
  "size": 0, 
  "aggs": { 
    "by_day": { 
      "date_histogram": { 
        "field": "@timestamp", 
        "calendar_interval": "day" 
      }, 
      "aggs": { 
        "normalize": { 
          "normalize": { 
            "buckets_path": "_count", 
            "method": "percent_of_sum" 
          } 
        } 
      } 
    } 
  } 
}

This returns additional information for each bucket: the percentage of the number of documents found in each bucket compared to the total documents the search returned.

You may want to check the normalize aggregation documentation as there are more method values that you can pick from, like mean or rescaling within a range.
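
As a sketch of one such alternative, the request below swaps the method for rescale_0_1, which should map the smallest daily count to 0 and the largest to 1. The normalized_count name is made up for this example:

```console
GET website-analytics/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "normalized_count": {
          "normalize": {
            "buckets_path": "_count",
            "method": "rescale_0_1"
          }
        }
      }
    }
  }
}
```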

moving_percentiles Aggregation

This pipeline aggregation works on top of the percentiles aggregation and calculates a cumulative percentile using a sliding window.

GET website-analytics/_search 
{ 
  "size": 0, 
  "aggs": { 
    "by_day": { 
      "date_histogram": { 
        "field": "@timestamp", 
        "calendar_interval": "day" 
      }, 
      "aggs": { 
        "response_time": { 
          "percentiles": { 
            "field": "response_time_ms", 
            "percents": [ 75, 99 ] 
          } 
        }, 
        "moving_pct": { 
          "moving_percentiles": { 
            "buckets_path": "response_time", 
            "window": 2 
          } 
        } 
      } 
    } 
  } 
}

Quite a bit to unpack here, so let's start. After bucketing by day, the percentiles aggregation calculates percentiles for each day bucket. The moving_percentiles pipeline agg then takes the previous two buckets and combines their percentile data to calculate the moving percentiles for that window (not a moving average). Note that by default the window does not include the current bucket; you can change which buckets are used for the calculation with the shift parameter, for example if you also want to include the current one.
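
As a sketch of the shift parameter, the request below sets shift to 1, which should move the window one bucket to the right so that it covers the current bucket and the one before it:

```console
GET website-analytics/_search
{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "response_time": {
          "percentiles": {
            "field": "response_time_ms",
            "percents": [ 75, 99 ]
          }
        },
        "moving_pct": {
          "moving_percentiles": {
            "buckets_path": "response_time",
            "window": 2,
            "shift": 1
          }
        }
      }
    }
  }
}
```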

pipeline inference Aggregation

We will skip the inference bucket aggregation, as there is a blog post planned to be released very soon, that explains this aggregation. To whet your appetite: you can run a pretrained model against the results of a parent bucket aggregation. Stay tuned!

Support for the histogram field type

This is not strictly an aggregation, but aggregations are affected by this data type, so it is worth mentioning.

You may have stumbled already over the histogram field type, allowing you to store pre-aggregated numerical data — this is used extensively in Elastic Observability for example. This special field type supports a subset of aggregations, and you will find a few aggregations that are not supported... yet. However there is more work being done to support these.

Support for geo_shape in geo aggregations

This is again not strictly about a single aggregation, but a huge step forwards and thus worth a mention. 

A huge investment has been made to make the geo_bounds, geotile_grid, geohash_grid and geo_centroid aggregations work with the geo_shape field type, in addition to the geo_point field type.

More, more, more...

Thanks for following along. If you have more questions about any aggregations go ahead and ask those in our Discuss forums. Also, there are more aggregations planned for upcoming releases. One of them is the rate aggregation, which can be used inside a date_histogram aggregation and calculates a rate of documents or a field in each bucket.

If you want to try out any of those aggregations, spin up a cluster in a free trial of Elastic Cloud, and play around with your own data!