Developers working on search engines often encounter the same issue: the business team is not satisfied with one particular search because the documents they expect to be at the top of the search results appear third or fourth on the list of results.
Then, when you fix that one issue, you accidentally break other queries because you couldn’t test every case manually. So how can you or your QA team tell whether a change to one query has a ripple effect on other queries? Or, even more importantly, how can you be sure that your changes actually improved a query?
Towards a systematic evaluation
Here is where judgment lists come in useful. Instead of depending on manual and subjective testing any time you make a change, you can define a fixed set of queries that are relevant for your business case, together with their relevant results.
This set becomes your baseline. Every time you implement a change, you use it to evaluate if your search actually improved or not.
The value of this approach is that it:
- Removes uncertainty: you no longer need to wonder if your changes impact other queries; the data will tell you.
- Reduces manual testing: once the judgment sets are recorded, evaluation is automatic.
- Justifies changes: you can show clear metrics that back up the benefits of a change.
How to start building your judgment list
One of the easiest ways to start is to take a representative query and manually select the relevant documents. There are two ways to express these judgments:
- Binary Judgments: Each document associated with a query gets a simple tag: relevant (usually a score of “1”) or not relevant (“0”).
- Graded Judgments: Here, each document gets a score on a multi-level scale. For example, a 0 to 4 scale, similar to a Likert scale, where 0 = “not at all relevant” and 4 = “totally relevant,” with intermediate levels like “somewhat relevant” and “relevant” in between.
Binary judgments work well when the search intent has clear limits: Should this document be in the results or not?
Graded judgments are more useful when there are grey areas: some results are better than others, so you can distinguish “very good,” “good,” and “useless” results and use metrics that value the order of the results and the user’s feedback. However, graded scales also introduce drawbacks: different reviewers may use the scoring levels differently, which makes the judgments less consistent. And because graded metrics give more weight to higher scores, even a small change (like rating something a 3 instead of a 4) can create a much bigger shift in the metric than the reviewer intended. This added subjectivity makes graded judgments noisier and harder to manage over time.
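To make the two scales concrete, here is a rough sketch of what judgment entries could look like. The format, field names, and the query are purely illustrative (not a specific Elasticsearch structure): a binary scale in the first entry and a 0 to 4 scale in the second.

```json
[
  {
    "query": "wireless headphones",
    "scale": "binary",
    "judgments": [
      { "doc_id": "doc_a", "rating": 1 },
      { "doc_id": "doc_b", "rating": 0 }
    ]
  },
  {
    "query": "wireless headphones",
    "scale": "graded_0_to_4",
    "judgments": [
      { "doc_id": "doc_a", "rating": 4 },
      { "doc_id": "doc_b", "rating": 2 },
      { "doc_id": "doc_c", "rating": 0 }
    ]
  }
]
```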
Do I need to classify the documents myself?
Not necessarily, since there are different ways to create your judgment list, each with its own advantages and disadvantages:
- Explicit Judgments: Here, SMEs go over each query/document and manually decide if (or how) relevant it is. Though this provides quality and control, it is less scalable.
- Implicit Judgments: With this method, you infer the relevant documents based on real-user behavior like clicks, bounce rate, and purchases, among others. This approach allows you to gather data automatically, but it might be biased. For example, users tend to click top results more often, even if they are not relevant.
- AI-Generated Judgments: This last option uses models (like LLMs) to automatically evaluate queries and documents, an approach often referred to as LLM juries. It’s fast and easy to scale, but the quality of the data depends on the quality of the model you’re using and how well the LLM’s training data aligns with your business interests. As with human grades, LLM juries can introduce their own biases and inconsistencies, so it’s important to validate their output against a smaller set of trusted judgments. LLMs are also probabilistic by nature, so it is not uncommon to see a model give different grades to the same result, even with the temperature parameter set to 0.
Below are some recommendations to choose the best method for creating your judgment set:
- Decide how critical the features that only users can properly judge (like price, brand, language, style, and product details) are for your use case. If they are critical, you need explicit judgments for at least part of your judgment list.
- Use implicit judgments when your search engine already has enough traffic that clicks, conversions, and dwell-time metrics reveal usage trends. You should still interpret these carefully, contrasting them with your explicit judgment sets to prevent bias (e.g., users tend to click top-ranked results more often, even if lower-ranked results are more relevant).
To address this, position debiasing techniques adjust or reweight click data to better reflect true user interest. Some approaches include:
- Results shuffling: Change the order of search results for a subset of users to estimate how position affects clicks.
- Click models, such as the Dynamic Bayesian Network (DBN) and the User Browsing Model (UBM): statistical models that estimate the probability that a click reflects real interest rather than just position, using signals like scrolling, dwell time, click sequence, and returns to the results page.
Example: Movie rating app
Prerequisites
To run this example, you need a running Elasticsearch 8.x cluster, either locally or on Elastic Cloud (Hosted or Serverless), and access to the REST API or Kibana.
Think about an app in which users can upload their opinions about movies and also search for movies to watch. As the texts are written by users themselves, they can have typos and many variations in terms of expression. So it’s essential that the search engine is able to interpret that diversity and provide helpful results for the users.
To be able to iterate on queries without impacting the overall search behavior, the business team in your company created the following binary judgment set, based on the most frequent searches:
| Query | DocID | Text |
|---|---|---|
| DiCaprio performance | doc1 | DiCaprio's performance in The Revenant was breathtaking. |
| DiCaprio performance | doc2 | Inception shows Leonardo DiCaprio in one of his most iconic roles. |
| DiCaprio performance | doc3 | Brad Pitt delivers a solid performance in this crime thriller. |
| DiCaprio performance | doc4 | An action-packed adventure with stunning visual effects. |
| sad movies that make you cry | doc5 | A heartbreaking story of love and loss that made me cry for hours. |
| sad movies that make you cry | doc6 | One of the saddest movies ever made — bring tissues! |
| sad movies that make you cry | doc7 | A lighthearted comedy that will make you laugh |
| sad movies that make you cry | doc8 | A science-fiction epic full of action and excitement. |
Creating the index:
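The original mapping isn’t reproduced here, so treat the request below as a sketch: a movies index with a single text field. The english analyzer is an assumption; it strips possessives so that, for example, “DiCaprio’s” in doc1 still matches the query term “DiCaprio.”

```json
PUT /movies
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}
```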
BULK request:
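Assuming the movies index above and the document IDs from the judgment table, the bulk request loads the eight review snippets:

```json
POST /movies/_bulk
{ "index": { "_id": "doc1" } }
{ "text": "DiCaprio's performance in The Revenant was breathtaking." }
{ "index": { "_id": "doc2" } }
{ "text": "Inception shows Leonardo DiCaprio in one of his most iconic roles." }
{ "index": { "_id": "doc3" } }
{ "text": "Brad Pitt delivers a solid performance in this crime thriller." }
{ "index": { "_id": "doc4" } }
{ "text": "An action-packed adventure with stunning visual effects." }
{ "index": { "_id": "doc5" } }
{ "text": "A heartbreaking story of love and loss that made me cry for hours." }
{ "index": { "_id": "doc6" } }
{ "text": "One of the saddest movies ever made — bring tissues!" }
{ "index": { "_id": "doc7" } }
{ "text": "A lighthearted comedy that will make you laugh" }
{ "index": { "_id": "doc8" } }
{ "text": "A science-fiction epic full of action and excitement." }
```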
Below is the Elasticsearch query the app is using:
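The exact query isn’t reproduced here; based on the minimum_should_match discussion later in this post, it is presumably a match query on the text field that requires every query term to be present. For example, for the sad movies search:

```json
GET /movies/_search
{
  "query": {
    "match": {
      "text": {
        "query": "sad movies that make you cry",
        "minimum_should_match": "100%"
      }
    }
  }
}
```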
From judgment to metrics
By themselves, judgment lists do not provide much information; they are only an expectation of the results from our queries. Where they really shine is when we use them to calculate objective metrics to measure our search performance.
Some of the most popular metrics include:
- Precision: Measures the proportion of results that are truly relevant within all search results.
- Recall: Measures the proportion of all relevant documents that the search engine retrieved within the top k results.
- Discounted Cumulative Gain (DCG): Measures the quality of the result’s ranking, considering the most relevant results should be at the top.
- Mean Reciprocal Rank (MRR): Measures the position of the first relevant result. The higher it is in the list, the higher its score.
Using the same movie rating app as an example, we’ll calculate the recall metric to see if there’s any information that is being left out of our queries.
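As a quick refresher (the notation below is ours, not taken from the Ranking Evaluation API), recall at a cutoff k is the share of a query’s relevant documents that show up in the top k results:

$$\mathrm{recall@}k = \frac{|\{\text{relevant documents}\} \cap \{\text{top-}k\ \text{results}\}|}{|\{\text{relevant documents}\}|}$$

So if a query has two relevant documents and both appear in the top k, recall is 2/2 = 1; if neither appears, it is 0/2 = 0.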
In Elasticsearch, we can use the judgment lists to calculate metrics via the Ranking Evaluation API. This API receives as input the judgment list, the query, and the metric you want to evaluate, and returns a value, which is a comparison of the query result with the judgment list.
Let’s run the evaluation against the judgment list for our two queries. We’ll send two requests to _rank_eval: one for the DiCaprio query and another for the sad movies query. Each request includes a query and its judgment list (ratings). We don’t need to grade every document, since documents not included in the ratings are treated as having no judgment. For its calculation, recall only considers the “relevant set”: the documents rated as relevant.
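The original requests aren’t reproduced here, so the calls below are a sketch. They assume the movies index defined above, a cutoff of k = 10, a relevant_rating_threshold of 1 (any rating of 1 or more counts as relevant), and that the team rated doc1 as relevant for the DiCaprio query and doc5 and doc6 for the sad movies query:

```json
GET /movies/_rank_eval
{
  "requests": [
    {
      "id": "dicaprio_performance",
      "request": {
        "query": {
          "match": {
            "text": {
              "query": "DiCaprio performance",
              "minimum_should_match": "100%"
            }
          }
        }
      },
      "ratings": [
        { "_index": "movies", "_id": "doc1", "rating": 1 }
      ]
    }
  ],
  "metric": {
    "recall": {
      "k": 10,
      "relevant_rating_threshold": 1
    }
  }
}

GET /movies/_rank_eval
{
  "requests": [
    {
      "id": "sad_movies",
      "request": {
        "query": {
          "match": {
            "text": {
              "query": "sad movies that make you cry",
              "minimum_should_match": "100%"
            }
          }
        }
      },
      "ratings": [
        { "_index": "movies", "_id": "doc5", "rating": 1 },
        { "_index": "movies", "_id": "doc6", "rating": 1 }
      ]
    }
  ],
  "metric": {
    "recall": {
      "k": 10,
      "relevant_rating_threshold": 1
    }
  }
}
```

Each response returns a metric_score, which in this setup is simply the recall for that query.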
In this case, the DiCaprio query has a recall of 1, while the sad movies query gets 0. This means that the first query retrieved all of its relevant results, while the second retrieved none of them. The average recall is therefore 0.5.
Maybe we’re being too strict with the minimum_should_match parameter: by demanding that 100% of the words in the query appear in a document, we’re probably leaving relevant results out. Let’s remove the minimum_should_match parameter so that a document is considered a match if it contains at least one word from the query.
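Continuing the sketch, the sad movies evaluation without that clause looks like this (the DiCaprio request stays as it was, since it already reached a recall of 1):

```json
GET /movies/_rank_eval
{
  "requests": [
    {
      "id": "sad_movies",
      "request": {
        "query": {
          "match": {
            "text": {
              "query": "sad movies that make you cry"
            }
          }
        }
      },
      "ratings": [
        { "_index": "movies", "_id": "doc5", "rating": 1 },
        { "_index": "movies", "_id": "doc6", "rating": 1 }
      ]
    }
  ],
  "metric": {
    "recall": {
      "k": 10,
      "relevant_rating_threshold": 1
    }
  }
}
```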
As you can see, by removing the minimum_should_match parameter from the underperforming query, we now get a recall of 1 for both queries, raising the average recall from 0.5 to 1.
In summary, removing the minimum_should_match: 100% clause gives us perfect recall for both queries.

We did it! Right?
Not so fast!
By improving recall, we open the door to a wider range of results, including some that may not be relevant at all. Every adjustment implies a trade-off, which is why you should define complete test cases and evaluate each change with more than one metric. For example, re-running the same judgment set with the precision metric, as sketched below, shows whether the broader match is pulling irrelevant documents into the results.
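Still assuming the sketched requests and ratings above, swapping the metric block turns the same judgment set into a precision check; by default, retrieved documents without a rating count as not relevant, which is exactly what exposes irrelevant results creeping in:

```json
GET /movies/_rank_eval
{
  "requests": [
    {
      "id": "sad_movies",
      "request": {
        "query": {
          "match": {
            "text": { "query": "sad movies that make you cry" }
          }
        }
      },
      "ratings": [
        { "_index": "movies", "_id": "doc5", "rating": 1 },
        { "_index": "movies", "_id": "doc6", "rating": 1 }
      ]
    }
  ],
  "metric": {
    "precision": {
      "k": 10,
      "relevant_rating_threshold": 1
    }
  }
}
```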
Using judgment lists and metrics prevents you from going in blind when making changes, since you now have data to back them up. Validation is no longer manual and repetitive, and you can test your changes against more than just one use case. Additionally, A/B testing allows you to verify live which configuration works best for your users and business case, coming full circle from technical metrics to real-world metrics.
Final recommendations for using judgment lists
Working with judgment lists is not only about measuring but also about creating a framework that allows you to iterate with confidence. To achieve this, you can follow these recommendations:
- Start small, but start. You don’t need 10,000 queries with 50 judged documents each. You only need to identify the 5–10 most critical queries for your business case and define which documents you expect to see at the top of the results. This already gives you a base. You typically want to start with the top queries plus the queries that return no results. You can also start testing with an easy-to-configure metric like precision and then work your way up in complexity.
- Validate with users. Complement the numbers with A/B testing in production. This way, you’ll know if changes that look good in the metrics are also generating a real impact.
- Keep the list alive. Your business case will evolve, and so will your critical queries. Update your judgment lists periodically to reflect new needs.
- Make it part of the flow. Integrate judgment lists into your development pipelines. Make sure each configuration change, synonym, or text analysis is automatically validated against your base list.
- Connect technical knowledge with strategy. Don’t stop at measuring technical metrics like precision or recall. Use your evaluation results to inform business outcomes.




