Tech Topics

Intro to Aggregations pt. 2: Sub-Aggregations

Welcome back! Last time on Intro to Aggregations, we explored the basic concepts of aggregations and how to start using them. We toyed with simple bucket and metric aggregations, which gave us simple analytics.

Today, we are going to learn about sub-aggregations. The most powerful feature of aggregations in Elasticsearch is the ability to embed aggregations (both buckets and metrics) inside…wait for it…other aggregations. Sub-aggregations allow you to continuously refine and separate groups of criteria of interest, then apply metrics at various levels in the aggregation hierarchy to generate your report.

The first approach we will look at involves embedding buckets inside of other buckets. If we take our first example from last week about the Five Boroughs, we can embed an additional bucket underneath to determine the top three incident contributing factors in each borough:

GET /nyc_visionzero/_search?search_type=count
{
  "aggs": {
    "all_boroughs": {
      "terms": {
        "field": "borough"
      },
      "aggs": {
        "cause": {
          "terms": {
            "field": "contributing_factor_vehicle",
            "size": 3
          }
        }
      }
    }
  }
}

In this aggregation, the original “all_boroughs” aggregation remains unchanged. A new “aggs” element has been added as a JSON sibling to “terms” (i.e., on the same level in the JSON tree). This new element tells Elasticsearch to begin a sub-aggregation. Inside it, we name our new sub-aggregation (“cause”) and use another terms bucket. This bucket’s criterion is the contributing factor field, limited to three results.

The response looks like this (truncated):

{
  "took": 29,
  "aggregations": {
    "all_boroughs": {
      "buckets": [
        {
          "key": "BROOKLYN",
          "doc_count": 86549,
          "cause": {
            "buckets": [
              {
                "key": "Unspecified",
                "doc_count": 57203
              },
              {
                "key": "Driver Inattention/Distraction",
                "doc_count": 6954
              },
              {
                "key": "Failure to Yield Right-of-Way",
                "doc_count": 4209
              }
            ]
          }
        },
        ...

Now you can see that we have our original aggregation data. We see again that Brooklyn has over 86,000 traffic incidents. But now it’s been enriched with the top three causes (“Unspecified”, “Inattention”, and “Failure to yield”), and the results of this newly added bucket are embedded under the results of the old bucket. If you were to run this example, you would see that similar data has enriched the rest of the boroughs as well, since the sub-aggregation runs for every bucket generated by the parent.
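Conceptually, the response is a tree: each parent bucket carries its own set of child buckets. As a minimal sketch (using a trimmed-down, hypothetical response rather than the full one above), here is how you might flatten that tree into plain (borough, cause, count) rows in Python:

```python
# Trimmed-down sample of a nested aggregation response (hypothetical values).
response = {
    "aggregations": {
        "all_boroughs": {
            "buckets": [
                {
                    "key": "BROOKLYN",
                    "doc_count": 86549,
                    # The sub-aggregation's results are nested inside the parent bucket.
                    "cause": {
                        "buckets": [
                            {"key": "Unspecified", "doc_count": 57203},
                            {"key": "Driver Inattention/Distraction", "doc_count": 6954},
                        ]
                    },
                },
            ]
        }
    }
}

# Walk parent buckets, then child buckets, emitting one row per (borough, cause) pair.
rows = [
    (borough["key"], cause["key"], cause["doc_count"])
    for borough in response["aggregations"]["all_boroughs"]["buckets"]
    for cause in borough["cause"]["buckets"]
]

for row in rows:
    print(row)
```

The same traversal pattern extends naturally to deeper nesting: each additional sub-aggregation level is just one more inner loop.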

And this being Elasticsearch, we can embed as much as we want. You could, for example, find out how many incidents occurred per-month, per-cause, per-borough:

GET /nyc_visionzero/_search?search_type=count
{
  "aggs": {
    "all_boroughs": {
      "terms": {
        "field": "borough"
      },
      "aggs": {
        "cause": {
          "terms": {
            "field": "contributing_factor_vehicle",
            "size": 3
          },
          "aggs": {
            "incidents_per_month": {
              "date_histogram": {
                "field": "@timestamp",
                "interval": "month"
              }
            }
          }
        }
      }
    }
  }
}

Pretty cool, huh?

Mixing Buckets and Metrics

On their own, metrics are a bit pedestrian; you may have been underwhelmed by last week’s examples. Used alone, metrics can only calculate values against the global dataset.

But when metrics are embedded under a bucket as a sub-aggregation, they become far more useful. A metric calculates its value based only on the documents residing inside its parent bucket. This allows you to set up a range of criteria and sub-criteria with buckets, then place metrics to calculate values for your reports at each level.
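In relational terms, a terms bucket with a metric underneath behaves like a GROUP BY followed by a per-group calculation. A small sketch over a handful of hypothetical documents (the field names mimic the nyc_visionzero dataset) makes the equivalence concrete:

```python
from collections import defaultdict

# Hypothetical incident documents, mimicking the nyc_visionzero fields.
docs = [
    {"borough": "BROOKLYN", "number_of_cyclist_injured": 2},
    {"borough": "BROOKLYN", "number_of_cyclist_injured": 0},
    {"borough": "QUEENS", "number_of_cyclist_injured": 1},
]

# A terms bucket on "borough" with a sum metric underneath is equivalent to
# grouping documents by borough, then summing the field within each group.
totals = defaultdict(int)
for doc in docs:
    totals[doc["borough"]] += doc["number_of_cyclist_injured"]

print(dict(totals))  # {'BROOKLYN': 2, 'QUEENS': 1}
```

The difference, of course, is that Elasticsearch does this grouping in a distributed fashion across all shards, at query time.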

As an example, let’s calculate the total number of cyclists injured in each borough:

GET /nyc_visionzero/_search?search_type=count
{
  "aggs": {
    "all_boroughs": {
      "terms": {
        "field": "borough"
      },
      "aggs": {
        "total_cyclists_injured": {
          "sum": {
            "field": "number_of_cyclist_injured"
          }
        }
      }
    }
  }
}

We are using the same terms bucket to provide a criterion (“Which borough did this incident happen in?”), then embedding a sum metric underneath it to calculate the total number of injuries. The response shows us that, although there were many incidents in each borough, relatively few of them involved cyclist injuries:

{
  "took": 19,
  "aggregations": {
    "all_boroughs": {
      "buckets": [
        {
          "key": "BROOKLYN",
          "doc_count": 86549,
          "total_cyclists_injured": {
            "value": 2691
          }
        },
        {
          "key": "MANHATTAN",
          "doc_count": 76122,
          "total_cyclists_injured": {
            "value": 2000
          }
        },
        {
          "key": "QUEENS",
          "doc_count": 73000,
          "total_cyclists_injured": {
            "value": 1271
          }
        },
        {
          "key": "BRONX",
          "doc_count": 36239,
          "total_cyclists_injured": {
            "value": 503
          }
        },
        {
          "key": "STATEN ISLAND",
          "doc_count": 15763,
          "total_cyclists_injured": {
            "value": 69
          }
        }
      ]
    }
  }
}

You’ll notice here that the metric is extracting the value of the `number_of_cyclist_injured` field and summing it for each borough, rather than just counting documents. Once you start mixing buckets and metrics, the sky’s the limit. Try running this example:

GET /nyc_visionzero/_search?search_type=count
{
  "aggs": {
    "all_boroughs": {
      "terms": {
        "field": "borough"
      },
      "aggs": {
        "causes": {
          "terms": {
            "field": "contributing_factor_vehicle",
            "size": 3
          },
          "aggs": {
            "incidents_per_month": {
              "date_histogram": {
                "field": "@timestamp",
                "interval": "month"
              },
              "aggs": {
                "cyclist_injury_avg": {
                  "avg": {
                    "field": "number_of_cyclist_injured"
                  }
                }
              }
            }
          }
        },
        "total_cyclists_injured": {
          "sum": {
            "field": "number_of_cyclist_injured"
          }
        }
      }
    },
    "top_causes": {
      "terms": {
        "field": "contributing_factor_vehicle",
        "size": 3
      }
    },
    "outside_borough": {
      "missing": {
        "field": "borough"
      },
      "aggs": {
        "total_cyclists_injured": {
          "sum": {
            "field": "number_of_cyclist_injured"
          }
        }
      }
    }
  }
}

The aggregation is a bit large, but it is simple once you break it down into components:

  • all_boroughs.causes.incidents_per_month.cyclist_injury_avg: reports average number of cyclists injured per-month, per contributing cause factor, per borough
  • all_boroughs.total_cyclists_injured: reports total cyclists injured per borough, irrespective of cause
  • top_causes: reports the top three contributing incident causes, irrespective of borough
  • outside_borough.total_cyclists_injured: reports total cyclists injured in areas outside of a borough
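The `missing` bucket used by `outside_borough` is worth a closer look: it collects documents that have no value for the given field, and the sum metric then runs over just that bucket. A minimal sketch of the semantics, over hypothetical documents (Elasticsearch also treats explicit nulls as missing):

```python
# Hypothetical incident documents; some have no borough recorded.
docs = [
    {"borough": "BRONX", "number_of_cyclist_injured": 1},
    {"number_of_cyclist_injured": 3},                    # field absent entirely
    {"borough": None, "number_of_cyclist_injured": 2},   # explicit null, also "missing"
]

# The "missing" bucket gathers documents with no value for "borough";
# the sum sub-aggregation then totals injuries over only those documents.
missing_docs = [d for d in docs if d.get("borough") is None]
outside_total = sum(d["number_of_cyclist_injured"] for d in missing_docs)

print(outside_total)  # 5
```

This makes `missing` a handy companion to a terms bucket on the same field: together they account for every document, whether or not the field is populated.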

Conclusion

That last aggregation was pretty nifty, no? This entire report runs in a single API call and takes only milliseconds to execute. It operates on fresh, near-real-time data. As new incidents stream into your system, your aggregations will immediately update to reflect the new data. This is what makes Kibana so powerful: near-real-time rollups and summaries of your data, sliced and diced as you wish.

These first two introductory articles have barely scratched the surface of aggregations in Elasticsearch. In the following articles, we will begin to dissect datasets and show how real insights can be extracted using aggregations. Stay tuned!