19 April 2018 Engineering

Practical BM25 - Part 1: How Shards Affect Relevance Scoring in Elasticsearch

By Shane Connelly

This is the first post in the three-part Practical BM25 series about similarity ranking (relevancy). The next post is linked at the bottom.

Background

In Elasticsearch 5.0, we switched to Okapi BM25 as our default similarity algorithm, which is what’s used to score results as they relate to a query. I won’t get too much into BM25 vs alternative measures in this blog, but if you want an introduction to the theoretical justification of BM25, you can hop on over and watch the BM25 Demystified presentation from Elastic{ON} 2016. Instead, I’m going to cover (and hopefully demystify) the practical usage of BM25 for you, including covering the parameters that are available and what affects scoring.

Do bear in mind that this blog will largely pertain to those doing scoring of text documents. That is, it’s really focused on helping our search users. If you’re indexing logs or metrics and returning results sorted by some explicit metadata/numeric order like the timestamp, this blog may serve mostly to fill curiosity.

Understanding How Shards Affect Scoring

Since I’m hoping you’ll follow along at home, one of the first things we need to get out of the way is understanding how having more than 1 shard affects scoring, as Elasticsearch uses 5 primary shards per index by default. Let’s start off by creating an index called “people.” The settings I’m providing here would be the default (and thus unnecessary to define), but I’m going to do so anyway to be explicit for demonstration purposes. I’ll use variants of my name (“Shane Connelly”) here, but feel free to replace it with a name of your choosing if you’re following along at home.

PUT people
{
  "settings": {
    "number_of_shards": 5,
    "index" : {
        "similarity" : {
          "default" : {
            "type" : "BM25"
          }
        }
    }
  }
}

And now let’s add a document and then search for it. First, we’ll just add my first name:

PUT /people/_doc/1
{
  "title": "Shane"
}
GET /people/_doc/_search
{
    "query": {
        "match": {
             "title": "Shane"
         }
      }
}

You get 1 hit back at this point, with a score of 0.2876821. We’ll dive into how this scoring is derived in a bit, but let’s first look at what happens when we add a few more documents with different variants of my full name.

PUT /people/_doc/2
{
  "title": "Shane C"
}
PUT /people/_doc/3
{
  "title": "Shane Connelly"
}
PUT /people/_doc/4
{
  "title": "Shane P Connelly"
}

And now do the same search again:

GET /people/_doc/_search
{
    "query": {
        "match": {
             "title": "Shane"
         }
      }
}

At this point, you should indeed have 4 hits, but if you look at the scores, you may be scratching your head. Documents 1 and 3 both have a score of 0.2876821 but document 2 has a score of 0.19856805 and document 4 has a score of 0.16853254. This is something that often throws new users off. Documents 2 and 3 are very similar — they both have 2 terms and they both match “shane,” but document 2 has a much lower score. You may start to assume there’s something different about the scoring of “C” vs the scoring of “Connelly” but in fact this has to do with how the documents landed in the shards.

As a reminder, Elasticsearch divides documents up into shards, and each shard holds a subset of the data. If we take a look at:

GET /_cat/shards/people?v

If you execute this, you’ll see that shard 2 has 2 documents while shards 3 and 4 have only 1 document in them (shards 0 and 1 don’t have any documents yet). This means that the total occurrences of the term “shane” is different in these different shards and that’s what’s ultimately driving the difference of the scores in this case. By default, Elasticsearch calculates scores on a per-shard basis.

People start loading just a few documents into their index and ask “why does document A have a higher/lower score than document B” and sometimes the answer is that the user has a relatively high ratio of shards to documents so that the scores are skewed across different shards. There are a few ways to get more consistent scores across shards:

  1. The more documents you load into your index, the more normalized your shards’ term statistics will become. With enough documents, you may not notice the slight differences in term statistics and thus scoring in each shard.
  2. You could use a lower shard count to reduce statistical deviations in the term frequencies. For example, if we had set number_of_shards to 1 in the index settings, we’d have very different scores. We’d see document 1 with a score of 0.13245322, document 2 and 3 with scores of 0.105360515 each, and document 4 with a score of 0.0874691. There are some tradeoffs to having various numbers of primary shards, which is discussed in our quantitative cluster sizing webinar.
  3. You can add ?search_type=dfs_query_then_fetch to the request, which first gathers the distributed term frequencies (DFS = Distributed Frequency Search) and then computes the scores using those. In fact, this returns the same score as having just 1 shard. Have a look at how the results differ with and without the “search_type” parameter:
    GET /people/_doc/_search?search_type=dfs_query_then_fetch
    {
        "query": {
            "match": {
                 "title": "Shane"
             }
          }
    }
    This provides the same as having set number_of_shards=1. You may then ask, “well if this produces more accurate scores, why isn’t it turned on by default?” The answer is that it adds an extra round trip during processing to go gather all of the statistics, and for some use cases (where scoring accuracy isn’t as important as speed), this round trip is unnecessary. Also, with enough data in the shards, the statistics can become very close to one another, making round trips unnecessary as well. If you have enough data, search_type=dfs_query_then_fetch is mostly needed only when the data between shards continues to be unevenly distributed, as is the case with some custom routing.

OK, now we have a sense of how sharding can affect our scoring (and how to adjust for it). Next, we’ll look at the BM25 algorithm and see how different variables come into play.

Continue this series with: Part 2: The BM25 Algorithm and its Variables