Tech Topics

Intro to Aggregations

Many people are familiar with Elasticsearch for its search functionality. And while it is excellent at search, many organizations are using Elasticsearch for an entirely different purpose: analytics.

Beneath the surface of Elasticsearch is a powerful analytics engine, waiting to be unleashed on your data. This article series will show you a developer-centric approach to using aggregations, the analytical functionality in Elasticsearch, that can help you to analyze your data, build custom dashboards, and reap the benefits of near real-time responses.

Search & Analytics: Two Sides of the Same Coin

When searching for a document, you have a goal in mind. For instance, “Find all transactions for UserXYZ.” You don’t know in advance which specific documents will match, but you do know that some collection of documents is relevant to the query.

Search is all about retrieving a subset of documents from the entire dataset. In contrast, analytic workloads don’t care about individual documents. A dashboard won’t show individual transactions, but rather the aggregate summary of those transactions. When building analytics, you roll up your data into useful summaries like, “What was the total revenue from the transactions by UserXYZ?” Search is concerned with retrieving documents, and analytics is busy calculating summaries about those documents.

Because aggregations operate over the same Lucene indices that power search, they inherit the near real-time nature of Lucene data. This means that reports generated by aggregations update as soon as the data changes, rather than waiting for a nightly batch job to finish, as in many other systems.

Getting Started with Aggregations

Since this is a developer-focused series, we suggest you follow along at home, so that you can play with the data and the various code examples that we show. For this article, we are going to use the New York City Traffic Incident dataset. We used this dataset a while ago to demonstrate Kibana and Logstash for non-developers.

This blog series will use a range of aggregation features, some of which were introduced in later versions of Elasticsearch. To follow the entire series from start to finish, we recommend using Elasticsearch version 1.4.2. This introductory blog only deals with the basics, however, so any version >= 1.0 will work.

To follow along, you'll need to restore a Snapshot into your local cluster. The snapshot is about 200MB, and it may take some time depending on your connection (NOTE: the snapshot URL only works through the API, it won't work in your browser):

// Register the NYC Traffic Repository
PUT /_snapshot/demo_nyc_accidents
{
  "type": "url",
  "settings": {
    "url": "http://download.elasticsearch.org/demos/nycopendata/snapshot/"
  }
}
// (Optional) Inspect the repository to view available snapshots
GET /_snapshot/demo_nyc_accidents/_all
// Restore the snapshot into your cluster
POST /_snapshot/demo_nyc_accidents/demo_nyc_accidents/_restore
// Watch the download progress.
GET /nyc_visionzero/_recovery

Once your cluster has finished restoring the Snapshot, let’s perform a simple search to see what the data holds:

GET /nyc_visionzero/_search
{
  "_shards": {...},
  "hits": {
   "total": 332871,
   "max_score": 1,
   "hits": [
     {
      "_index": "nyc_visionzero",
      "_type": "logs",
      "_id": "97vKYDDLSZuDmj-B2RLzXg",
      "_score": 1,
      "_source": {
        "message": [
         "04/02/2014,23:15,QUEENS,11377,40.7372519,-73.9179831,\"(40.7372519, -73.9179831)\",48 STREET            ,50 AVENUE            ,,0,0,0,0,0,0,0,0,Unspecified,,,,,316762"
        ],
        "@version": "1",
        "@timestamp": "2014-04-03T06:15:00.000Z",
        "host": "localhost",
        "date": "04/02/2014",
        "time": "23:15",
        "borough": "QUEENS",
        "zip_code": "11377",
        "latitude": "40.7372519",
        "longitude": "-73.9179831",
        "location": "(40.7372519, -73.9179831)",
        "on_street_name": "48 STREET",
        "cross_street_name": "50 AVENUE",
        "off_street_name": null,
        "number_of_persons_injured": 0,
        "number_of_persons_killed": 0,
        "number_of_pedestrians_injured": 0,
        "number_of_pedestrians_killed": 0,
        "number_of_cyclist_injured": 0,
        "number_of_cyclist_killed": 0,
        "number_of_motorist_injured": 0,
        "number_of_motorist_killed": 0,
        "unique_key": "316762",
        "coords": [
         -73.9179831,
         40.7372519
        ],
        "contributing_factor_vehicle": "Unspecified"
      }
     },
     ...

Each document represents a single traffic “incident” in NYC. These incidents contain a variety of metadata, including the time, the street name, the latitude and longitude, the borough, the statistics about injuries, and a summary of the contributing factors. There are 300,000+ documents; plenty to play with.

Let’s start building a few simple analytics. In Elasticsearch, all analytics are built using aggregations. Aggregations are constructed similarly to queries, via a JSON-based DSL. The aggregation is appended to a search request, and both the search and the aggregation are executed simultaneously.

Here is a simple aggregation:

GET /nyc_visionzero/_search?search_type=count
{
 "aggs" : {
  "all_boroughs": {
   "terms": {
    "field": "borough"
   }
  }
 }
}

Several things to note in this example:

  • It is being executed against the _search endpoint. Aggregations are just another feature of search, and use the same API endpoint.
  • Aggregations are specified under the “aggs” parameter. You can still specify a “query” for normal search if you also want to execute a search query.
  • We are executing the search with a count search_type. This omits the fetch phase of the search, which is a performance trick if you don’t care about the search results, only the aggregation results.

So, what is this aggregation doing? It's building a list of all the boroughs in NYC, based on the boroughs included in the documents in the dataset. Here is the response:

{
  "took": 11,
  "hits": {
   "total": 332871,
   "max_score": 0,
   "hits": []
  },
  "aggregations": {
   "all_boroughs": {
     "buckets": [
      {
        "key": "BROOKLYN",
        "doc_count": 86549
      },
      {
        "key": "MANHATTAN",
        "doc_count": 76122
      },
      {
        "key": "QUEENS",
        "doc_count": 73000
      },
      {
        "key": "BRONX",
        "doc_count": 36239
      },
      {
        "key": "STATEN ISLAND",
        "doc_count": 15763
      }
     ]
   }
  }
}

Notice that the aggregation results are named “all_boroughs”, the name we defined in the aggregation request.

As you can see, there are five boroughs in NYC. Under each borough is a document count: Brooklyn had 86,000 traffic incidents, while Staten Island only had 15,000. Congrats, you ran your first aggregation!
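As the bullet points above note, aggregations can also be paired with a normal query, in which case the aggregation runs only over the documents matching that query. As a sketch (the queens_factors name is our own, and we assume the string fields are mapped not_analyzed, which the whole-phrase “STATEN ISLAND” key above suggests), this request counts incidents per contributing factor, but only for Queens:

// Aggregate contributing factors for incidents in Queens only
GET /nyc_visionzero/_search?search_type=count
{
 "query": {
  "filtered": {
   "filter": {
    "term": { "borough": "QUEENS" }
   }
  }
 },
 "aggs": {
  "queens_factors": {
   "terms": {
    "field": "contributing_factor_vehicle"
   }
  }
 }
}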

Bucketing Based on Criteria

The “all_boroughs” aggregation we ran was an example of a bucket aggregation. Bucketing aggregations define a criterion. If a document matches that criterion, it is added to the bucket. Documents can be added to multiple buckets, or to no buckets at all. When the aggregation finishes, you are left with a collection of buckets, each holding the documents that matched its criterion.

In the example above, we used a Terms Bucket. It's called a Terms Bucket because this aggregation dynamically builds buckets based on the terms in your data. There were five boroughs in our data, so we got five corresponding buckets. There are also many other bucketing aggregations at your disposal. For example, you could bucket by time using a date_histogram. This will generate one bucket per month (effectively giving you a line graph over time):

GET nyc_visionzero/_search?search_type=count
{
 "aggs" : {
  "months": {
   "date_histogram": {
    "field": "@timestamp",
    "interval": "month"
   }
  }
 }
}

Or you could bucket by how many cyclists were injured in each incident, using a histogram bucket:

GET nyc_visionzero/_search?search_type=count
{
 "aggs" : {
  "injuries": {
   "histogram": {
    "field": "number_of_cyclist_injured",
    "interval": 1
   }
  }
 }
}

You could even bucket documents based on whether they are missing a field, using the missing bucket. This will collect all documents that don’t have a value for that particular field:

GET nyc_visionzero/_search?search_type=count
{
 "aggs" : {
  "missing_borough": {
   "missing": {
    "field": "borough"
   }
  }
 }
}

There are many bucketing aggregations available. Skim the reference documentation at some point to familiarize yourself with the various buckets.
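As one more illustration (a sketch; the injury_ranges name is our own, and the field comes from the sample document above), the range bucket lets you define your own numeric boundaries, grouping incidents by how many people were injured:

// Bucket incidents into custom injury-count ranges:
// [0, 1), [1, 3), and [3, ∞)
GET /nyc_visionzero/_search?search_type=count
{
 "aggs" : {
  "injury_ranges": {
   "range": {
    "field": "number_of_persons_injured",
    "ranges": [
     { "to": 1 },
     { "from": 1, "to": 3 },
     { "from": 3 }
    ]
   }
  }
 }
}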

Metrics: More Than Just Doc Counts

You may have noticed something interesting about the bucket aggregation responses: they only list document counts. Buckets simply collect documents based on a criterion, which means the only statistic they possess is a document count.

But what if you want to calculate a value based on fields in the document, like the average price or the total revenue? These operations are performed by the second type of aggregation in Elasticsearch: metrics.

Metrics are simple mathematical operations, such as min, max, avg, sum, and percentiles. Metrics extract values from the documents to use in the calculation.

Metrics can be used to calculate a simple number. For example, we can calculate the total number of cyclists injured in NYC:

GET /nyc_visionzero/_search?search_type=count
{
 "aggs" : {
  "cyclist_injuries": {
   "sum" : {
    "field": "number_of_cyclist_injured"
   }
  }
 }
}

Which returns a result of over 7,000 injuries:

{
  "took": 8,
  "aggregations": {
   "cyclist_injuries": {
     "value": 7129
   }
  }
}

Naming Aggregations

If you look back at all the examples so far (buckets and metrics), you'll see that all aggregations must be named. This is important because it allows you to use several aggregations simultaneously. For instance, we could specify a bucket and a metric:

GET /nyc_visionzero/_search?search_type=count
{
 "aggs" : {
  "all_boroughs": {
   "terms": {
    "field": "borough"
   }
  },
  "cyclist_injuries": {
   "sum" : {
    "field": "number_of_cyclist_injured"
   }
  }
 }
}

Because we gave each aggregation its own name, the response that comes back from Elasticsearch will have two named sections with the respective results. Furthermore, these two separate aggregations are executed simultaneously in a single pass over the data, so by merging multiple aggregations into a single API call, you can build many reports at once.
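Plugging in the results we have already seen, the combined response would look roughly like this (abbreviated):

{
  "aggregations": {
   "all_boroughs": {
     "buckets": [
      {
        "key": "BROOKLYN",
        "doc_count": 86549
      },
      ...
     ]
   },
   "cyclist_injuries": {
     "value": 7129
   }
  }
}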

Conclusion

In this blog, we learned that aggregations serve a different purpose than search, but operate on the same near real-time Lucene indices under the covers. We explored how to use buckets and metrics, the building blocks of the aggregation DSL. Next week we'll learn how to use sub-aggregations to create sophisticated, multi-level reports. Stay tuned!