26 April 2014

A Gentle Intro to Function Scoring

By Andrew Cholakian

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

We'll cover the basics of scoring using functions while taking a look at some use cases where functional scoring techniques are highly useful and effective.

Introduction

At the heart of any search engine, Elasticsearch included, is the concept of scoring. Scoring can be loosely defined as finding data meeting some set of criteria and returning it sorted in order of relevance. Relevance is often implemented via an algorithm like TF-IDF, which tries to find out which documents are most textually similar to the submitted query. While TF-IDF and its cousins such as BM25 are fantastic, there are times where the question of relevance must be answered via other algorithms, or augmented with other scoring heuristics. It is here that Elasticsearch’s script_score and function_score features become useful. This article will cover the usage of these tools.

One example of a domain where text similarity is not the most important factor is geo search. If one is looking to find a good coffee shop near a given point, ranking coffee shops by how textually similar they are to the query won’t be very useful to the user, but ranking them in order of how geographically close they are may be.

Another example might be videos on a video sharing site, where search results should probably factor in the relative popularity of a video. If a pop star uploads a video with a given title, thus garnering millions of views, that video should probably outrank an unpopular video with similar textual relevance.

Diving In

Let’s dive in and see how we might approach the problem we just mentioned, ranking videos on a website using a combination of textual relevance and the videos popularity on the site. The best way to approach these problems is to think of the objectives in plain English, then convert them into an actual mathematical formula. In the case of our video site, our objectives can be listed as below.

  • Factor in the standard TF-IDF textual relevance of the video metadata (title, description, etc.)
  • Factor in the number of likes a given video has received
  • Factor in the number of views a given video has received

Given these criteria, we can come up with a naive formula, something like: score*(likes+views+1). Note that we have specified the sum of likes and views as a coefficient (adding a 1 in case both are 0). Using multiplication here is the right approach, since the absolute values of score should not be counted upon. The specific numbers returned as a score may vary across Elasticsearch versions, queries, and even across shards. There are a number of obvious problems with this naive approach, primarily that very poor matches with large numbers of likes and/or views will rank very high, despite their poor textual relevance. We need to tame the impact of large likes and views values. It is obvious that a good text match with 1 view will score lower than a poor match with 1mm views, this is a situation we must fix post-haste!

A simple way to dampen the effects of more popular videos, thus capping the influence of any one coefficient, is by using using its logarithm instead, so the above formula might be changed to:

score*log(likes+views+1)

This formula will make the algorithm quite a bit more fair, not letting ultra-popular videos dominate all results. The algorithm at this point is an OK conceptual model for our scoring algorithm, but it most likely will not work perfectly out of the box. Let’s try and set up some test data and give it a shot. In the example below we’ll create a schema for our video site, setup some test data, then run some queries against it.

# Create the index
curl -XPOST http://localhost:9200/searchtube

# Create the mappings
curl -XPUT http://localhost:9200/searchtube/video/_mapping -d '{
  "video": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer": "snowball"
      },
      "description": {
        "type": "string",
        "analyzer": "snowball"
      },
      "views": {
        "type": "integer"
      },
      "likes": {
        "type": "integer"
      },
      "created_at": {
        "type": "date"
      }
    }
  }
}'

# Create some docs
curl -XPUT http://localhost:9200/searchtube/video/1 -d '{
  title: "Sick Sad World: Cold Breeze on the Interstate",
  description: "Is your toll collector wearing pants, a skirt, or nothing but a smile? Cold Breeze on the Interstate, next on Sick, Sad World.",
  views: 500,
  likes:2,
  created_at: "2014-04-22T08:00:00"
}'

curl -XPUT http://localhost:9200/searchtube/video/2 -d '{
  title: "Sick Sad World: The Severed Pianist",
  description: "When he turned up his nose at accordion lessons, they cut off his inheritance molto allegro. The Severed Pianist, next on Sick, Sad World.",
  views: 6000,
  likes: 100,
  created_at: "2014-04-22T12:00:00"
}'

curl -XPUT http://localhost:9200/searchtube/video/3 -d '{
  title: "Sick Sad World: Avant Garde Obstetrician",
  description: "Meet the avant-garde obstetrician who has turned his cast offs into art work. Severed Umbilical cord sculpture next, on Sick, Sad World.",
  views: 100,
  likes: 130,
  created_at: "2014-04-22T23:00:00"
}'

With this data set we can test out our formula with the query below.

curl -XPOST http://localhost:9200/searchtube/_search -d '
{
  "query": {
    "function_score": {
      "query": {"match": {"_all": "severed"}},
      "script_score": {
        "script": "_score * log(doc['likes'].value + doc['views'].value + 1)"
      }
    }
  }
}'

By using the script_score option to a function_score query we were able to directly translate our mathematical formula into an Elasticsearch formula! It is possible to refine this score further, by weighting likes more than views perhaps, or by further altering the scale. In the next section, we’ll cover decay functions, and there, we’ll see how we can turn this script_score query into a cleaner array of decay functions.

Decay Functions in Elasticsearch

Functional scoring techniques are useful for more than just modifying the default Elasticsearch scoring algorithm, they can be used to completely replace it. A great example of this is a ‘trending’ search, showing items within a topic that are rapidly gaining in popularity.

Such scores cannot be based on a simple metric such as ‘likes’ or ‘views’, but must be continually adjusted based on the current time. A video receiving 1000 views in 1 hour should generally be considered ‘hotter’ than a video receiving 10000 views in 24 hours. Elasticsearch ships several decay functions that make solving this sort of problem a breeze.

For our video database, what we’d like is to show our most popular queries within the previous day, weighting our more recent videos higher. Elasticsearch has three types of ‘curves’ for dealing with this, gauss, exp, and linear. They let you map a value to a point on a curve. The linear distribution is perhaps simplest. Linear decay means that all points in time decay evenly, that the farther you get from the origin of the more the item is decayed. An exponential decay looks like a parabola, the farther away you get from the origin the more severely the values are decayed. Finally, the gaussian distribution has more of an ‘s’ on its side shape, dropping off very little for values near the origin, then falling sharply in an almost linear manner, then tapering off once again at a very low value. Graphs of these functions can be seen in the image below, and even snazzier 3d versions of them can be found in this Elasticsearch blog post on the topic, along with detailed formulas.

Linear, Gaussian, and Exponential Curves

Linear, Gaussian, and Exponential Curves

Of these various formulas, the gauss formula is the one I most often find to be useful. The gentle taper at the start of the curve means that, in our video example for instance, more recent items get clustered together at the top, and that there isn’t a big gap between an item posted 1 hour ago, and one posted 1.5 hours ago. At some point however we want it to start severely penalizing videos, which is where the more linear portion of the curve becomes useful, that’s where videos start to quickly disappear down the results page. Finally, we want it to ‘bottom out’ at some point, which is what the bottom of the curves does. This separates our results into three neat compartments with a very natural transition between all three parts. In plain English this means that very recent videos get a score penalty, moderately recent videos get a larger score penalty, increasing more and more with age, and old videos get a severe penalty.

The shape of the curve may be controlled with the origin, scale, offset, and decay. These three variables are your main tools for controlling the shape of the curve. The origin and scale parameters can be thought of as your min and max, defining a bounding box within which the curve will be defined. If we want our trending videos list to cover a full day it would be best to define the origin as the current timestamp, and the scale as 24h. The offset can be used to fully flatten the curve at the start, setting it to 1h for instance, removes all penalties for recent videos. Finally, the decay option alters how severely the document is demoted based on its position. The default decay value is 0.5, larger values will make the curve much steeper, and its effect much more pronounced.

Origin and Scale visualized

Origin and Scale visualized

We can now craft a ‘trending’ query, that will factor in recency in scoring results. Examine the query below.

{
  "query": {
    "function_score": {
        "functions": [
          {
            "gauss": {
                "created_at": {
                    "origin": "2014-04-22T23:50:00",
                    "scale": "12h",
                    "offset": "1h",
                    "decay": 0.3
                }
            }
          },
          {
            "gauss": {
              "likes": {
                  "origin": 20000,
                  "scale": 20000
              }
            }
          },
          {
            "gauss": {
              "views": {
                  "origin": 20000,
                  "scale": 20000
              }
            }
          }
        ]
    }
  }
}'

You’ll notice that this query ranks the “Sick Sad World: Avant Garde Obstetrician” episode as the first video, even though it only has 100 views, and 130 likes, while the next result has thousands of views and likes. This is due to recency being factored in. You’ll also note that we no longer have script_score query, and have refactored both likes and views into our query as additional functions using gaussian curves. By using the built-in guassian decay functions our query should execute a bit faster than a script query, and we’ll be able to tune the shape of those adjusting curves a little easier as well.

These sorts of queries can be customized further. By specifying additional functions you can change the shape of the curve any way you wish, and integrate many more additional variables. Additionally, these queries can be used for geographic queries, helping to cluster results more tightly around geo points.

For a more complete rundown of the options function score queries make available, check out this documentation on the topic!