18 September 2014 Engineering

The Top Hits Aggregation

Aggregations are a powerful framework for building analytical information from data residing in Elasticsearch. With the release of Elasticsearch 1.3.0, a new metric aggregation named the top_hits aggregation has been added to our lengthy list of existing aggregations.

The top_hits aggregation is different than other aggregations; it keeps track of the top matching hits or documents instead of computing a metric like sum, min or average.

Let’s get to know this new aggregation by example. I’ve indexed all the questions from Programmers Stack Exchange. Let’s say we want to get some insight into programming questions regarding web-related topics. Let’s execute a simple query and use the top_hits aggregation:

{
  "query": {
    "match": {
      "body": "web"
    }
  },
  "aggs": {
    "top-questions": {
      "top_hits": {
        "size": 1
      }
    }
  },
  "size": 1
}

Finding the most relevant question that mentions the term ‘web’ will likely give us what we want, especially since we already know that many questions contain the term ‘web’ and the full result set would be a bit overwhelming. To find the most relevant question, we set the size option of both the regular hits and the top_hits aggregation to 1. By default, the top_hits aggregation returns 3 results. The response can look something like:

{
  ...
  "hits": {
    "total": 3532,
    "max_score": 2.1443942,
    "hits": [
      {
        "_index": "stack",
        "_type": "question",
        "_id": "64543",
        "_score": 2.1443942,
        "_source": {
          "body": "<p>Does a web application have to live in a browser to be called a web application? Or is a thin client that uses a web service for most of it's functionality a web applicaiton?</p>\n",
          "title": "Does a web application have to live in a browser to be called a web application?"
        }
      }
    ]
  },
  "aggregations": {
    "top-questions": {
      "hits": {
        "total": 3532,
        "max_score": 2.1443942,
        "hits": [
          {
            "_index": "stack",
            "_type": "question",
            "_id": "64543",
            "_score": 2.1443942,
            "_source": {
              "body": "<p>Does a web application have to live in a browser to be called a web application? Or is a thin client that uses a web service for most of it's functionality a web applicaiton?</p>\n",
              "title": "Does a web application have to live in a browser to be called a web application?"
            }
          }
        ]
      }
    }
  }
}

The top question returned in these hits is very generic and doesn’t really yield any insights, since we’re using the top_hits aggregation as the only aggregation. Used on its own, the top_hits aggregation just repeats what is already in the regular hits of the response. It becomes much more powerful when used as a subaggregator of a bucket aggregator like the terms or histogram aggregator.

In our previous example, we would have gotten much better insights if we had grouped the questions by programming language:

{
  "query": {
    "match": {
      "body": "web"
    }
  },
  "size": 1,
  "aggs": {
    "top-programming-languages": {
      "terms": {
        "field": "tags",
        "include": "java|javascript|python|php|ruby|perl|c#",
        "size": 10
      },
      "aggs": {
        "top-questions": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

By defining the top_hits aggregation as a subaggregator of the top-programming-languages terms aggregator, the top matching hits are now shown per bucket (or, in this case, per tag). Programmers Stack Exchange has many tags, so by including only specific programming languages we reduce noise from tags that are unrelated to programming languages.

{
  ...
  "aggregations": {
    "top-programming-languages": {
      "buckets": [
        {
          "key": "java",
          "doc_count": 285,
          "top-questions": {
            "hits": {
              "total": 285,
              "max_score": 1.5314255,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "118749",
                  "_score": 1.5314255,
                  "_source": {
                    "body": "<p>I am new to web services. I have been working with Java web applications for the past 3 years. Are there books that explain the basics of web services clearly. I have read <a href=\"http://programmers.stackexchange.com/questions/82244/starting-java-web-services-and-feeling-lost\">Starting Java Web Services and feeling lost</a> but I did not find a proper book recommendation there.</p>\n",
                    "title": "Recommendation for a book that explains web services in java"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "php",
          "doc_count": 244,
          "top-questions": {
            "hits": {
              "total": 244,
              "max_score": 1.5413362,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "49624",
                  "_score": 1.5413362,
                  "_source": {
                    "body": "<p>How can the lack of Unicode support in PHP affect a PHP web app?</p>\n",
                    "title": "What does the lack of Unicode support in PHP mean?"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "javascript",
          "doc_count": 209,
          "top-questions": {
            "hits": {
              "total": 209,
              "max_score": 1.3169197,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "65348",
                  "_score": 1.3169197,
                  "_source": {
                    "body": "<p><strong>Edited for brevity:</strong></p>\n\n<p>What software engineering tools/practices and design/architectural patterns are used in web application development? What tools and practices would large companies or large web development teams follow? What kind of design considerations need to be made when working with JavaScript?</p>\n\n<p>e.g.,</p>\n\n<ul>\n<li>Would Google, Facebook, or Yahoo use UML in their design process?</li>\n<li>Do web developers care about cyclomatic complexity?</li>\n<li>How would a global web development team diagram and document a complex design?</li>\n<li>Do most web development teams use Agile?</li>\n<li>Are there any architectural/design patterns specific to web development or JavaScript development?</li>\n</ul>\n",
                    "title": "Design/Architecture Patterns & Practices for Web Development"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "c#",
          "doc_count": 160,
          "top-questions": {
            "hits": {
              "total": 160,
              "max_score": 1.228869,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "33946",
                  "_score": 1.228869,
                  "_source": {
                    "body": "<p>The terms rapid web development gets associated with Python/Django and ROR. Why is this not the case with C# ASP.NET?</p>\n",
                    "title": "Why is C# ASP.NET generally not regarded as a rapid web development framework?"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "python",
          "doc_count": 95,
          "top-questions": {
            "hits": {
              "total": 95,
              "max_score": 1.313168,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "89313",
                  "_score": 1.313168,
                  "_source": {
                    "body": "<p>I come from a PHP background so I'm used to everything being in one place. I want to install a python program (<a href=\"http://zine.pocoo.org/\" rel=\"nofollow\">Zine</a>)so that I can hack on it. The instructions I've found install it to the system in multiple folders. I do not want to edit files that are installed in Linux system directories.</p>\n\n<p>What is the python way to layout the directory on the filesystem for a web program that under active development?</p>\n\n<p>would it be:</p>\n\n<pre><code>/web/htdocs/python/modules/app.py\n</code></pre>\n\n<p>or</p>\n\n<pre><code>/web/python/modules/app.py\n/web/htdocs/\n</code></pre>\n\n<p>or</p>\n\n<pre><code>/web/python/modules/app.py\n/web/python/htdocs/\n</code></pre>\n",
                    "title": "Python file layout for web development package hacking"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "ruby",
          "doc_count": 35,
          "top-questions": {
            "hits": {
              "total": 35,
              "max_score": 1.6018035,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "16945",
                  "_score": 1.6018035,
                  "_source": {
                    "body": "<p>From my understanding, Ruby is a language that can be used to creat desktop applications. How does this get converted into a 'Web App' - and is a 'Web App' really any different than a Web Site with interactivity? </p>\n",
                    "title": "Is a Ruby 'Web App' designed differently than a PHP 'Web Site'?"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "perl",
          "doc_count": 13,
          "top-questions": {
            "hits": {
              "total": 13,
              "max_score": 0.91885525,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "51003",
                  "_score": 0.91885525,
                  "_source": {
                    "body": "<p>I'm a front and backend .NET web developer (most solutions use MS SQL Server) and I won't be using any non-MS solutions for a while.</p>\n\n<p>Will Perl be useful for situations that require scripting in an MS product environment?</p>\n",
                    "title": "Should a .NET, JavaScript and SQL Web App developer learn Perl?"
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

The top question per programming language gives us much more relevant information about the kinds of ‘web’-related questions being asked than the first example, where we just asked for the single top question related to ‘web’.
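To work with a response like the one above in client code, the top hit of each bucket can be pulled out with a short loop. The following is a minimal Python sketch, not part of Elasticsearch or any client library: the helper name `top_titles_per_bucket` is hypothetical, and the `response` dict is a heavily trimmed copy of the example response with only two buckets and the fields the helper reads.

```python
def top_titles_per_bucket(response,
                          terms_agg="top-programming-languages",
                          hits_agg="top-questions"):
    """Map each bucket key to the title of its best-scoring hit."""
    titles = {}
    for bucket in response["aggregations"][terms_agg]["buckets"]:
        hits = bucket[hits_agg]["hits"]["hits"]
        if hits:  # a bucket's hit list is ordered, best hit first
            titles[bucket["key"]] = hits[0]["_source"]["title"]
    return titles


# Trimmed stand-in for the response shown above (assumption: only the
# fields used by the helper are kept, and only two buckets are included).
response = {
    "aggregations": {
        "top-programming-languages": {
            "buckets": [
                {
                    "key": "java",
                    "doc_count": 285,
                    "top-questions": {"hits": {"total": 285, "hits": [
                        {"_id": "118749", "_source": {
                            "title": "Recommendation for a book that explains web services in java"}}
                    ]}},
                },
                {
                    "key": "php",
                    "doc_count": 244,
                    "top-questions": {"hits": {"total": 244, "hits": [
                        {"_id": "49624", "_source": {
                            "title": "What does the lack of Unicode support in PHP mean?"}}
                    ]}},
                },
            ]
        }
    }
}

for tag, title in top_titles_per_bucket(response).items():
    print(f"{tag}: {title}")
```

The aggregation names are passed as parameters because they are chosen freely in the request; the defaults simply match the names used in this post.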

There’s a bit more going on here, so let’s take a closer look.

If you take a look at the ordering of the buckets, you will see that the php bucket is ordered after the java bucket and that the ruby bucket is second to last. However, score-wise both of these buckets should appear before the java bucket. This sorting isn’t a big deal at the moment since there are only a few buckets, but what if there were many? Also, it would be much better if the first bucket was always the one with the most relevant document.

By default, buckets are sorted by the number of documents that fall into them. The top_hits aggregation just keeps track of the most relevant documents and that is it. In order to make sure that the buckets are sorted according to score, we need to rely on the order setting of the terms aggregator.

{
  "query": {
    "match": {
      "body": "web"
    }
  },
  "size": 1,
  "aggs": {
    "top-programming-languages": {
      "terms": {
        "field": "tags",
        "include": "java|javascript|python|php|ruby|perl|c#",
        "size": 10,
        "order": {
          "max_score": "desc"
        }
      },
      "aggs": {
        "top-questions": {
          "top_hits": {
            "size": 1
          }
        },
        "max_score": {
          "max": {
            "lang": "expression",
            "script": "doc.score"
          }
        }
      }
    }
  }
}

Compared to the previous example, we added a max_score aggregation that keeps track of the highest score per bucket, and we instructed the top-programming-languages aggregator to order the buckets by max_score in descending order.

{
  "aggregations": {
    "top-programming-languages": {
      "buckets": [
        {
          "key": "ruby",
          "doc_count": 35,
          "max_score": {
            "value": 1.6018035411834717
          },
          "top-questions": {
            "hits": {
              "total": 35,
              "max_score": 1.6018035,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "16945",
                  "_score": 1.6018035,
                  "_source": {
                    "body": "<p>From my understanding, Ruby is a language that can be used to creat desktop applications. How does this get converted into a 'Web App' - and is a 'Web App' really any different than a Web Site with interactivity? </p>\n",
                    "title": "Is a Ruby 'Web App' designed differently than a PHP 'Web Site'?"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "php",
          "doc_count": 244,
          "max_score": {
            "value": 1.541336178779602
          },
          "top-questions": {
            "hits": {
              "total": 244,
              "max_score": 1.5413362,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "49624",
                  "_score": 1.5413362,
                  "_source": {
                    "body": "<p>How can the lack of Unicode support in PHP affect a PHP web app?</p>\n",
                    "title": "What does the lack of Unicode support in PHP mean?"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "java",
          "doc_count": 285,
          "max_score": {
            "value": 1.5314254760742188
          },
          "top-questions": {
            "hits": {
              "total": 285,
              "max_score": 1.5314255,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "118749",
                  "_score": 1.5314255,
                  "_source": {
                    "body": "<p>I am new to web services. I have been working with Java web applications for the past 3 years. Are there books that explain the basics of web services clearly. I have read <a href=\"http://programmers.stackexchange.com/questions/82244/starting-java-web-services-and-feeling-lost\">Starting Java Web Services and feeling lost</a> but I did not find a proper book recommendation there.</p>\n",
                    "title": "Recommendation for a book that explains web services in java"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "javascript",
          "doc_count": 209,
          "max_score": {
            "value": 1.3169196844100952
          },
          "top-questions": {
            "hits": {
              "total": 209,
              "max_score": 1.3169197,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "65348",
                  "_score": 1.3169197,
                  "_source": {
                    "body": "<p><strong>Edited for brevity:</strong></p>\n\n<p>What software engineering tools/practices and design/architectural patterns are used in web application development? What tools and practices would large companies or large web development teams follow? What kind of design considerations need to be made when working with JavaScript?</p>\n\n<p>e.g.,</p>\n\n<ul>\n<li>Would Google, Facebook, or Yahoo use UML in their design process?</li>\n<li>Do web developers care about cyclomatic complexity?</li>\n<li>How would a global web development team diagram and document a complex design?</li>\n<li>Do most web development teams use Agile?</li>\n<li>Are there any architectural/design patterns specific to web development or JavaScript development?</li>\n</ul>\n",
                    "title": "Design/Architecture Patterns & Practices for Web Development"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "python",
          "doc_count": 95,
          "max_score": {
            "value": 1.3131680488586426
          },
          "top-questions": {
            "hits": {
              "total": 95,
              "max_score": 1.313168,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "89313",
                  "_score": 1.313168,
                  "_source": {
                    "body": "<p>I come from a PHP background so I'm used to everything being in one place. I want to install a python program (<a href=\"http://zine.pocoo.org/\" rel=\"nofollow\">Zine</a>)so that I can hack on it. The instructions I've found install it to the system in multiple folders. I do not want to edit files that are installed in Linux system directories.</p>\n\n<p>What is the python way to layout the directory on the filesystem for a web program that under active development?</p>\n\n<p>would it be:</p>\n\n<pre><code>/web/htdocs/python/modules/app.py\n</code></pre>\n\n<p>or</p>\n\n<pre><code>/web/python/modules/app.py\n/web/htdocs/\n</code></pre>\n\n<p>or</p>\n\n<pre><code>/web/python/modules/app.py\n/web/python/htdocs/\n</code></pre>\n",
                    "title": "Python file layout for web development package hacking"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "c#",
          "doc_count": 160,
          "max_score": {
            "value": 1.2288689613342285
          },
          "top-questions": {
            "hits": {
              "total": 160,
              "max_score": 1.228869,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "33946",
                  "_score": 1.228869,
                  "_source": {
                    "body": "<p>The terms rapid web development gets associated with Python/Django and ROR. Why is this not the case with C# ASP.NET?</p>\n",
                    "title": "Why is C# ASP.NET generally not regarded as a rapid web development framework?"
                  }
                }
              ]
            }
          }
        },
        {
          "key": "perl",
          "doc_count": 13,
          "max_score": {
            "value": 0.9188552498817444
          },
          "top-questions": {
            "hits": {
              "total": 13,
              "max_score": 0.91885525,
              "hits": [
                {
                  "_index": "stack",
                  "_type": "question",
                  "_id": "51003",
                  "_score": 0.91885525,
                  "_source": {
                    "body": "<p>I'm a front and backend .NET web developer (most solutions use MS SQL Server) and I won't be using any non-MS solutions for a while.</p>\n\n<p>Will Perl be useful for situations that require scripting in an MS product environment?</p>\n",
                    "title": "Should a .NET, JavaScript and SQL Web App developer learn Perl?"
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

As you can see, the buckets are now sorted by the top matching document in each bucket, and the first bucket contains the best matching document. How does it work?

When ordering by a subaggregation, the terms aggregation can only sort buckets by numeric metric subaggregations. The top_hits aggregation is a metric aggregation, but not a numeric one: it produces hits instead of numbers. The additional max aggregation gives the terms aggregation a number per bucket to sort by, namely the highest score.
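The effect of the two orderings can be reproduced on the client side with the numbers from the example responses above. This is only an illustrative Python sketch of the sorting logic, not how Elasticsearch implements it; the `order_keys` helper is hypothetical, and each bucket is flattened to a plain dict with its doc_count and its max_score value.

```python
# Per-bucket figures copied from the example responses in this post.
buckets = [
    {"key": "java", "doc_count": 285, "max_score": 1.5314255},
    {"key": "php", "doc_count": 244, "max_score": 1.5413362},
    {"key": "javascript", "doc_count": 209, "max_score": 1.3169197},
    {"key": "c#", "doc_count": 160, "max_score": 1.228869},
    {"key": "python", "doc_count": 95, "max_score": 1.313168},
    {"key": "ruby", "doc_count": 35, "max_score": 1.6018035},
    {"key": "perl", "doc_count": 13, "max_score": 0.91885525},
]


def order_keys(buckets, metric):
    """Return bucket keys sorted by a numeric metric, highest first."""
    ranked = sorted(buckets, key=lambda b: b[metric], reverse=True)
    return [b["key"] for b in ranked]


# Default terms ordering: by number of documents per bucket.
by_count = order_keys(buckets, "doc_count")
# Ordering by the max sub-aggregation: by best score per bucket.
by_score = order_keys(buckets, "max_score")

print("by doc_count:", by_count)
print("by max_score:", by_score)
```

Sorting needs a single comparable number per bucket, which is what the max sub-aggregation supplies and what a list of hits cannot.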

The top_hits aggregation supports more features than just ‘size’ and can be used in other bucket aggregations as well, e.g. the date_histogram. It also supports the sort option, which allows you to sort by an arbitrary field, e.g. by a creation date, instead of by relevancy, which is the default.
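As a rough illustration of combining these options, the Python sketch below builds a request body that nests top_hits inside a date_histogram and sorts the hits by date instead of score. Several names are assumptions rather than anything from the indexed data: the `creation_date` field, the aggregation names `questions-per-month` and `latest-questions`, and the helper `latest_questions_per_month` itself. The `interval` parameter follows the 1.x-era date_histogram syntax; check the reference documentation for the version you run.

```python
import json


def latest_questions_per_month(query_text, date_field="creation_date", size=1):
    """Build a search body: monthly buckets, newest matching hits per bucket.

    date_field is a hypothetical field name; substitute your own mapping.
    """
    return {
        "query": {"match": {"body": query_text}},
        "size": 1,
        "aggs": {
            "questions-per-month": {
                "date_histogram": {"field": date_field, "interval": "month"},
                "aggs": {
                    "latest-questions": {
                        "top_hits": {
                            "size": size,
                            # Sort by date rather than the default relevancy.
                            "sort": [{date_field: {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


body = latest_questions_per_month("web")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request in the same way as the earlier examples.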

More features, options and examples can be found in the reference documentation. If you’ve got a story to share about how aggregations are making your life better, we’d love to hear it. Tell us on Twitter!