August 15, 2018

Test-Driven Relevance Tuning of Elasticsearch using the Ranking Evaluation API

This blog post is written for engineers who are looking for ways to improve the result sets of search applications built on Elasticsearch. Its goal is to explain why you should care about relevance, which components are involved, and how you can control them.

Tuning relevance can be hard. But by the end of this blog post you’ll have a better understanding of how to tune the relevance of your search engine, and you'll see that it's not so hard with the right tools.

Let's start with a bit of theory

Let's think of a search application as a combination of data (the index), data models (mappings and analyzers) and query templates (Query DSL). Using the search system means entering queries (typically a set of parameters such as a query string and filters) and receiving results. The consumer of the results can be a person who accesses the search engine through a web-based GUI, or an automated system (I will mainly talk about users in this blog post, but the same applies to bots). Those results may or may not match the users' expectations.

The job of a developer who builds a search application is to make sure that its users find what they are looking for. Optimizing the search engine for this is what we call relevance tuning.

Relevance is typically very subjective and often depends on the individual evaluating the results of the query. Whether a user will be happy with the results, which is ultimately the only metric that matters for relevance, is hard to predict.

Every dataset is unique, as are your users. This means that nobody can tell you exactly which configuration to choose. Even if there were a database full of use cases that included feedback about what worked well, your case would still be specific and unique.

Very often, customers ask which query they should use. Of course, we can give guidelines that serve as a starting point. For example, it is best practice to use a multi_match query in an eCommerce use case. This usually works well until you discover edge cases that should yield better results. For this kind of tuning it is usually better to follow a test-driven approach, as explained in the following:

Let's consider relevance tuning to be an optimization problem.
There are lots of moving parts for a data model and a query template.
Finding the perfect combination of all those variables is the ultimate goal.

There are dozens of ways to set up a query template and hundreds of ways to configure your data model, including all the options for setting up analyzers. Sometimes relevance isn't your only concern. Query speed and required disk space are usually also important factors. But let's consider them as small constraints in our quest to optimize relevance.

To be successful you need a tool that enables you to measure and tune the relevance for your users and your search engine. Such a tool became available with version 6.2 of Elasticsearch, namely the Ranking Evaluation API.

Using the Ranking Evaluation API

With the Ranking Evaluation API, it is possible to measure the search quality of your search engine. To use this API, you’ll need an existing index, a set of queries, and a list of relevance judgements for documents returned by those queries. Relevance judgements are labels that indicate whether a document matched a query well or not.

Usually you know the queries that you use. They can be extracted either from the application code or from the slowlogs: setting the slowlog threshold to 0 seconds logs all queries. Warning: don’t leave this setting enabled in production for long, as it will impact performance. If you use search templates, it’s very straightforward to extract the queries in JSON format.
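As a rough sketch of the slowlog approach, the snippet below builds the index settings bodies for switching full query logging on and off again. It assumes the standard `index.search.slowlog.threshold.*` index settings at trace level; the helper name `slowlog_settings` is made up for this example, and you would send each body yourself with `PUT <index>/_settings`.

```python
import json

def slowlog_settings(threshold):
    """Index settings body that sets the trace-level search slowlog thresholds."""
    return {
        "index.search.slowlog.threshold.query.trace": threshold,
        "index.search.slowlog.threshold.fetch.trace": threshold,
    }

# "0s" logs every query; "-1" switches the threshold off again.
log_everything = slowlog_settings("0s")
switch_off = slowlog_settings("-1")

# Each body would be sent with: PUT <index>/_settings
print(json.dumps(log_everything, indent=2))
```

Keeping the "off" body next to the "on" body makes it harder to forget the second step on a production cluster.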

The only tedious part is gathering the relevance judgements. For each query, you’ll need to collect one or more documents and label whether or not each of them should match that query. In many cases this will be manual work. The exact format of the relevance judgements is explained in the ranking evaluation documentation.

Search quality is then measured using proven information retrieval metrics such as precision, mean reciprocal rank or discounted cumulative gain (DCG).
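If you collect judgements by hand, for example in a spreadsheet, a few lines of code can turn them into a request body. The following is a minimal sketch, assuming the request layout used in the example later in this post; the function name, the shape of the `judgements` dictionary and the `query_for` callback are choices made for this illustration, not part of the API.

```python
import json

def build_rank_eval_request(index, judgements, query_for, k=5):
    """Build a _rank_eval request body from hand-collected judgements.

    judgements: {query_id: {"text": query string, "ratings": {doc_id: rating}}}
    query_for:  callable that turns a query string into a Query DSL dict
    """
    requests = []
    for query_id, entry in judgements.items():
        ratings = [
            {"_index": index, "_id": doc_id, "rating": rating}
            for doc_id, rating in entry["ratings"].items()
        ]
        requests.append({
            "id": query_id,
            "request": {"query": query_for(entry["text"])},
            "ratings": ratings,
        })
    return {
        "requests": requests,
        "metric": {"precision": {"k": k, "relevant_rating_threshold": 1}},
    }

# One judged query: document "1" is rated relevant for "Star Trek".
judgements = {
    "star_trek": {"text": "Star Trek", "ratings": {"1": 1}},
}
body = build_rank_eval_request(
    "movies", judgements, lambda text: {"match": {"title": text}}
)
print(json.dumps(body, indent=2))
```

Keeping the judgements separate from the query template means the same labels can be reused every time the query changes.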
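To get a feeling for what these metrics express, here is a small sketch that computes them on a plain list of ratings, given in the order the documents were returned (unrated documents counted as 0). These are textbook-style definitions written for this post, not the Elasticsearch implementation; in particular, the DCG variant with a gain of 2^rating - 1 is an assumption.

```python
import math

def precision_at_k(ratings, k, threshold=1):
    """Share of the top-k retrieved documents rated at or above threshold."""
    top = ratings[:k]
    if not top:
        return 0.0  # nothing retrieved, nothing relevant
    return sum(1 for r in top if r >= threshold) / len(top)

def reciprocal_rank(ratings, threshold=1):
    """1 / rank of the first relevant document, or 0 if there is none."""
    for rank, r in enumerate(ratings, start=1):
        if r >= threshold:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(ratings_per_query, threshold=1):
    """Average reciprocal rank over several queries."""
    total = sum(reciprocal_rank(r, threshold) for r in ratings_per_query)
    return total / len(ratings_per_query)

def dcg_at_k(ratings, k):
    """Discounted cumulative gain: higher ratings count more near the top."""
    return sum(
        (2 ** r - 1) / math.log2(rank + 1)
        for rank, r in enumerate(ratings[:k], start=1)
    )

ranked = [0, 1, 1]  # first hit irrelevant, second and third relevant
print(precision_at_k(ranked, k=5))
print(reciprocal_rank(ranked))
print(dcg_at_k([3, 0, 1], k=3))
```

Precision only asks whether relevant documents were returned, while reciprocal rank and DCG also reward placing them near the top.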

The queries you pick should be a somewhat representative sample of typical queries that you will serve in production. The easiest way to pick the sample if you already have a search system in place would be to take the top 100 queries from your website analytics logs.
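Picking the most frequent queries can be sketched in a few lines, assuming you can export the raw query strings from your analytics logs as a list. The normalization shown here (lowercasing and collapsing whitespace) is only one possible choice, and `top_queries` is a name invented for this example.

```python
from collections import Counter

def top_queries(logged_queries, n=100):
    """Return the n most frequent queries as (query, count) pairs."""
    # Normalize case and whitespace so near-duplicates are counted together.
    normalized = (" ".join(q.lower().split()) for q in logged_queries)
    counts = Counter(q for q in normalized if q)  # skip empty queries
    return counts.most_common(n)

log = ["Star Wars", "star wars ", "Star Trek", "", "star  wars", "Alien"]
print(top_queries(log, n=2))
```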

Let’s assume you’re ready to go now: you have an index, query samples and relevance judgements. Here’s a little recipe for using the Ranking Evaluation API for test-driven relevance tuning:

Talk to QA, and find out which problematic queries need to be tuned.

Run the queries that need to be tuned. For this example, we have one document in our movies index, stored with the ID 1 so that the ratings below can refer to it:

PUT movies/doc/1
{
  "title": "Star Wars"
}

Let’s check if a query for “Star Trek” yields results:

GET movies/_rank_eval
{
  "requests": [
    {
      "id": "star_trek",
      "request": {
        "query": {
          "match_phrase": {
            "title": "Star Trek"
          }
        }
      },
      "ratings": [
        {
          "_index": "movies",
          "_id": "1",
          "rating": 1
        }
      ]
    }
  ],
  "metric": {
    "precision": {
      "k": 5,
      "relevant_rating_threshold": 1
    }
  }
}

Run an initial test to measure the current state of your system. Sending the request above returns:

{
  "quality_level": 0,
  "details": {
    "star_trek": {
      "quality_level": 0,
      "unknown_docs": [],
      "hits": [],
      "metric_details": {
        "precision": {
          "relevant_docs_retrieved": 0,
          "docs_retrieved": 0
        }
      }
    }
  },
  "failures": {}
}

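OK, the quality_level is 0, which is not what we want. The quality level ranges from 0 to 1, and we should aim for something close to 1. The match_phrase query requires the terms “Star” and “Trek” to appear together in that order, so “Star Wars” is not returned at all (docs_retrieved is 0). So maybe we have to use a different query.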
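In a test-driven setup you would not read the response by eye but check it in code. Here is a minimal sketch that lists the queries falling below a quality bar, using the response fields shown above; the function name and the default minimum of 0.8 are arbitrary choices for this example.

```python
def failing_queries(response, minimum=0.8):
    """Return the ids of queries whose quality_level is below minimum."""
    return sorted(
        query_id
        for query_id, detail in response["details"].items()
        if detail["quality_level"] < minimum
    )

# Trimmed copy of the response shown above.
response = {
    "quality_level": 0,
    "details": {
        "star_trek": {"quality_level": 0, "unknown_docs": [], "hits": []}
    },
    "failures": {},
}
print(failing_queries(response))
```

A test suite could assert that this list is empty and fail the build otherwise.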

Change a variable. For this example, let's change the query type.

GET movies/_rank_eval
{
  "requests": [
    {
      "id": "star_trek",
      "request": {
        "query": {
          "match": {
            "title": "Star Trek"
          }
        }
      },
      "ratings": [
        {
          "_index": "movies",
          "_id": "1",
          "rating": 1
        }
      ]
    }
  ],
  "metric": {
    "precision": {
      "k": 5,
      "relevant_rating_threshold": 1
    }
  }
}
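Changing exactly one variable is easier to get right when the variant is derived from the baseline request instead of being typed out again. The sketch below swaps only the query type and leaves the ratings and the metric untouched. It assumes every query consists of a single clause of the form {type: {field: text}}, as in this example; `with_query_type` is a hypothetical helper.

```python
import copy

def with_query_type(body, query_type):
    """Return a copy of a _rank_eval body with every query's type replaced."""
    variant = copy.deepcopy(body)  # never mutate the baseline
    for request in variant["requests"]:
        # Assumes a single-clause query such as {"match_phrase": {...}}.
        [(_, clause)] = request["request"]["query"].items()
        request["request"]["query"] = {query_type: clause}
    return variant

baseline = {
    "requests": [{
        "id": "star_trek",
        "request": {"query": {"match_phrase": {"title": "Star Trek"}}},
        "ratings": [{"_index": "movies", "_id": "1", "rating": 1}],
    }],
    "metric": {"precision": {"k": 5, "relevant_rating_threshold": 1}},
}
variant = with_query_type(baseline, "match")
print(variant["requests"][0]["request"]["query"])
```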

Run the test again.

{
  "quality_level": 1,
  "details": {
    "star_trek": {
      "quality_level": 1,
      "unknown_docs": [],
      "hits": [
        {
          "hit": {
            "_index": "movies",
            "_type": "doc",
            "_id": "1",
            "_score": 0.2876821
          },
          "rating": 1
        }
      ],
      "metric_details": {
        "precision": {
          "relevant_docs_retrieved": 1,
          "docs_retrieved": 1
        }
      }
    }
  },
  "failures": {}
}

A lot better: the match query only requires one of the terms to match, so the document is now found and the quality_level is 1. This is just an example to illustrate the workflow. (Note: a more detailed walkthrough of this process will follow in a separate blog post on this topic.)

At this point, there are three possible outcomes:
- If the overall quality increases and meets your quality SLAs, you’re done.
- If the overall quality increases but does not yet meet your SLAs, repeat the cycle until you reach the desired quality.
- If the overall quality decreases, revert the change, make a note of it for future reference and try something else.
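The decision at the end of each iteration can be written down as a tiny function. This is only a sketch of the rule described above: `target` stands for whatever quality SLA you have agreed on, and the three return labels are names made up for this example.

```python
def evaluate_change(baseline, candidate, target):
    """Classify a tuning step from two overall quality_level values."""
    if candidate < baseline:
        return "revert"   # the change made things worse
    if candidate >= target:
        return "done"     # quality meets the agreed SLA
    return "iterate"      # not worse, but not good enough yet

print(evaluate_change(baseline=0.0, candidate=1.0, target=0.9))
print(evaluate_change(baseline=0.6, candidate=0.5, target=0.9))
```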

Just remember: ranking evaluation should be repeated periodically as data, users, and queries change. Every changing variable can affect relevance (different index size, different data, different query templates, etc.). Please keep in mind that it is usually not possible to test every possible search. The number of questions (queries) and possible answers (results) is simply too large and open-ended to be evaluated scientifically with full test coverage.
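For periodic runs it helps to keep the previous response around and compare it with the new one, query by query. A minimal sketch, assuming both responses have the shape shown earlier in this post; the function name and the `tolerance` parameter are assumptions for this example.

```python
def regressions(previous, current, tolerance=0.0):
    """Map query id -> (old, new) for queries whose quality_level dropped."""
    dropped = {}
    for query_id, old_detail in previous["details"].items():
        new_detail = current["details"].get(query_id)
        if new_detail is None:
            continue  # query is no longer part of the test set
        old = old_detail["quality_level"]
        new = new_detail["quality_level"]
        if new < old - tolerance:
            dropped[query_id] = (old, new)
    return dropped

last_month = {"details": {"star_trek": {"quality_level": 1},
                          "alien": {"quality_level": 0.8}}}
today = {"details": {"star_trek": {"quality_level": 1},
                     "alien": {"quality_level": 0.4}}}
print(regressions(last_month, today))
```

The overall quality_level can hide a single query getting much worse, which is why the comparison is done per query.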

In some cases, when you use your search engine to perform matching (e.g. name matching) and look for one specific result (or very few), then you could theoretically test it completely and your evaluation result would be a lot more representative. In most cases though you can only do partial testing of a sample. This is the reason why you should try to pick a representative sample of queries and you should try to work with realistic relevance judgements.

Good luck with tuning relevance. As you start to use the Ranking Evaluation API, share your feedback and experiences with us on our Discuss forum.