How to measure and improve Elasticsearch search recall: from 0.43 to 0.75 with hybrid search

Learn how to measure and improve search recall in Elasticsearch by combining BM25 lexical search with Jina AI vector embeddings, using the rank_eval API to validate the improvement with real numbers.


Lexical search using the BM25 ranking algorithm is cheap, fast, and very effective for a wide range of queries. But it has a blind spot: queries that don't share tokens with your documents. In this article, we'll measure exactly where BM25 falls short using Elasticsearch's ranking evaluation API (rank_eval), then close that gap by adding Jina AI embeddings through the Elastic Inference Service (EIS). You'll see the recall score go from 0.43 to 0.75 and understand why.

What is recall?

Recall measures, on a scale from 0 to 1, how many of the documents that your users actually want appear somewhere in your search results. If a query should surface three products and your search returns only two of them in the top 10, recall@10 = 0.67 for that query. It’s a set-based metric: It doesn’t care about the position of the relevant documents within the top k results. A relevant document in position 10 counts the same as one in position 1. High recall means that you’re not losing relevant results.


The diagram shows two sets: all relevant documents (left) and what BM25 actually retrieved (top 10, right). Only the intersection counts toward recall: prod_1 and prod_2 were found, while prod_3, prod_4, and prod_6 were missed entirely. Result: Recall@10 = 2/5 = 0.40.

Prerequisites

Let's get down to business to better understand how recall works. This demonstration uses Python. You can follow along in the companion notebook (notebook.ipynb), where every code block is a cell ready to run.

The code provided uses the following:

  • Elasticsearch 9.3+
  • Python 3.10+
  • A .env file with your Elasticsearch credentials
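
The .env file might look like this (the variable names are an assumption; match them to whatever your client setup reads):

```
ELASTICSEARCH_URL=https://your-deployment.es.example.com:443
ELASTICSEARCH_API_KEY=your-base64-api-key
```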

The dataset

We’ll use a product catalog of 1,000 products, spanning categories such as footwear, electronics, tools, and more.

Each document has four fields:

| Field | Type |
| --- | --- |
| `title` | text |
| `description` | text |
| `brand` | keyword |
| `category` | keyword |

The dataset is loaded from dataset.csv.
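
Loading it with pandas is one line (the columns are assumed from the field table above, plus an id column we'll reuse as the Elasticsearch _id):

```python
import pandas as pd

# Expected columns: id, title, description, brand, category
df = pd.read_csv("dataset.csv")
print(len(df))  # 1000 products
```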

BM25 is the default ranking algorithm in Elasticsearch and most search engines. It scores documents by how often your query terms appear in them, adjusted for document length and the frequency of those terms across the entire index. You get analyzers on top: lowercase normalization, stemming, and stopword removal. A query for "running shoes" will match "Running Shoes" and likely "run" as well.

This works well for a large class of queries:

  • "running shoes" immediately matches products with those exact tokens in the title.
  • "bluetooth speaker" surfaces portable audio products because the tokens appear verbatim.

The results are deterministic and explainable: A document ranks highly because the query terms appear in it. Debugging relevance is straightforward.
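
You can see that explainability directly: once the catalog is indexed (we do that below), the explain flag returns the full BM25 score breakdown for every hit. A minimal sketch:

```python
resp = client.search(
    index="products-bm25",       # the lexical index created below
    query={"match": {"title": "running shoes"}},
    explain=True,                # attach the scoring breakdown to each hit
    size=3,
)
for hit in resp["hits"]["hits"]:
    # hit["_explanation"] shows the per-term frequency and
    # document-length contributions to the BM25 score
    print(hit["_id"], round(hit["_score"], 2))
```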

Where it breaks

Now let’s try these queries against the same catalog:

  • "skincare routine": The word "routine" doesn’t appear in any product title. BM25 can partially match on "skincare," but face serums, body oils, and moisturizers are described using terms like "vitamin C," "retinol," or "brightening," none of which overlap with the query. Products that form a complete skincare routine are scattered across the index with no shared token to anchor them.
  • "pet travel accessories": This is a use-case grouping, not a product category. A dog sling carrier, a pet car seat, and a travel crate are all relevant, but their descriptions talk about portability, safety, and comfort rather than "travel accessories." BM25 matches "pet" broadly but has no signal to distinguish travel-specific products from the rest of the pet catalog.

This is a recall problem. The relevant documents exist in your index. BM25 just cannot find them because the user's words and the document's words do not match closely enough.

Adding synonyms helps for known cases. But you cannot enumerate every way a user might express an intent. That is where vectors come in.

Why you should measure recall

Before fixing a problem, you need to quantify it.

Recall@k counts how many of the relevant documents for a query appear somewhere in the top k results. Formally:
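
$$\mathrm{Recall@}k = \frac{|\,\text{relevant documents} \cap \text{top-}k\ \text{results}\,|}{|\,\text{relevant documents}\,|}$$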

Precision@k measures how many of the top k results are actually relevant. Formally:
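
$$\mathrm{Precision@}k = \frac{|\,\text{relevant documents} \cap \text{top-}k\ \text{results}\,|}{k}$$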

High precision means that the results you do return are good. In ecommerce, missing a relevant product (low recall) is often worse than showing a slightly imperfect result (lower precision), because a hidden product is a lost sale.

Elasticsearch's rank_eval API lets you measure both systematically. You provide a list of queries, each with a set of rated documents, and Elasticsearch computes the metrics for you across all queries.

Setting up the evaluation

The rank_eval API needs a ratings dataset: a mapping of queries to the documents that are relevant for each one, along with a relevance grade (0 = not relevant, 1 = relevant, 2 = highly relevant).

In the notebook, the judgments list looks like this (the document IDs below are illustrative):
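
```python
# Relevance judgments: which documents should come back for each query.
# Ratings: 0 = not relevant, 1 = relevant, 2 = highly relevant.
judgments = [
    {
        "id": "q1",
        "query": "running shoes",
        "ratings": {"prod_12": 2, "prod_31": 2, "prod_58": 1},
    },
    {
        "id": "q2",
        "query": "skincare routine",
        "ratings": {"prod_203": 2, "prod_207": 1, "prod_214": 1},
    },
    {
        "id": "q3",
        "query": "pet travel accessories",
        "ratings": {"prod_411": 2, "prod_418": 2, "prod_430": 1},
    },
    # q4 is another intent-based query with the same shape.
]
```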

The mix is intentional: q1 is a query that BM25 handles well (exact tokens in product titles), while q2, q3, and q4 are intent-based queries where the user's intent is expressed as a concept rather than specific product keywords.

Measuring BM25 baseline recall

First, set up the Elasticsearch client and index the raw text data:
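
A minimal sketch, assuming the .env variable names shown earlier and an id column in the CSV:

```python
import os

from dotenv import load_dotenv
from elasticsearch import Elasticsearch, helpers

load_dotenv()  # reads the .env file

client = Elasticsearch(
    os.environ["ELASTICSEARCH_URL"],
    api_key=os.environ["ELASTICSEARCH_API_KEY"],
)

# A plain lexical index: text fields for BM25, keywords for filtering.
client.indices.create(
    index="products-bm25",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "description": {"type": "text"},
            "brand": {"type": "keyword"},
            "category": {"type": "keyword"},
        }
    },
)

# Bulk-index the catalog, keeping the CSV id as the Elasticsearch _id
# so the ratings in `judgments` can reference it.
helpers.bulk(
    client,
    (
        {"_index": "products-bm25", "_id": row["id"], "_source": row.to_dict()}
        for _, row in df.iterrows()
    ),
)
```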

Now build the rank_eval request for BM25. Each request in the list combines a query with its ratings:
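
A sketch of that assembly. The to_rank_eval_requests helper is ours, so we can reuse it for the hybrid run later:

```python
def to_rank_eval_requests(index, query_builder):
    """Pair each judged query with its rated documents."""
    return [
        {
            "id": j["id"],
            "request": query_builder(j["query"]),
            "ratings": [
                {"_index": index, "_id": doc_id, "rating": rating}
                for doc_id, rating in j["ratings"].items()
            ],
        }
        for j in judgments
    ]

bm25_requests = to_rank_eval_requests(
    "products-bm25",
    lambda q: {
        "query": {"multi_match": {"query": q, "fields": ["title", "description"]}}
    },
)

resp = client.rank_eval(
    index="products-bm25",
    requests=bm25_requests,
    # recall@10: a rating of 1 or higher counts as relevant
    metric={"recall": {"k": 10, "relevant_rating_threshold": 1}},
)
print(f"BM25 Recall@10: {resp['metric_score']:.2f}")
```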

Result:
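
```
BM25 Recall@10: 0.43
```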

0.43 means that across all four queries, BM25 finds only 43% of the documents it should find. The shortfall is concentrated in the intent-based queries: "skincare routine" misses face serums and body oils because "routine" never appears in product titles, and "pet travel accessories" retrieves off-topic pet products while missing carriers and crates described in terms of portability and safety rather than "travel accessories."

This is our baseline. Now we have a number to beat.

Adding vector search with Jina embeddings

Vector search encodes documents and queries as high-dimensional embeddings: vectors of hundreds or thousands of numerical values, each capturing some feature of the data they represent. Documents with similar meaning end up close together in vector space, even if they share no words. "Gym equipment" and "dumbbell set" will be nearby because the concepts are related. Elasticsearch works well as the vector database here because it supports hybrid search, giving you both semantic understanding and keyword precision out of the box.

EIS includes out-of-the-box support for embedding models through its inference API.

Step 1: Create an inference endpoint with Jina embeddings v5

If your cluster has GPU resources (available in Elastic Cloud and Elasticsearch 9.3+), the embeddings are generated on GPU, which is significantly faster than CPU inference and removes the performance trade-off that historically made vectors expensive at scale.

Why Jina embeddings specifically? jina-embeddings-v5-text is a multilingual model (119+ languages) with a 32,000-token context window and support for task-specific Low-Rank Adaptation (LoRA) adapters. It works well for short product descriptions out of the box. You can read more about the jina-embeddings-v5-text model here.
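
A sketch of endpoint creation via the inference API. The endpoint name jina-embeddings is ours, and the exact service and model_id strings for Jina on EIS should be verified against the EIS docs for your deployment:

```python
client.inference.put(
    inference_id="jina-embeddings",
    task_type="text_embedding",
    inference_config={
        # "elastic" is the EIS service; model availability depends on
        # your deployment and region
        "service": "elastic",
        "service_settings": {"model_id": "jina-embeddings-v5-text"},
    },
)
```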

Step 2: Create the index with a semantic field

The semantic_text field type is the key here. It’s a higher-level abstraction over dense_vector: You point it at an inference endpoint, and Elasticsearch takes care of generating embeddings automatically.

The copy_to property on title and description means content from both fields flows into semantic_field for embedding, so a single vector captures the full product representation.
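
A sketch of the mapping, using the index and field names from this post:

```python
client.indices.create(
    index="products-semantic",
    mappings={
        "properties": {
            # copy_to funnels both text fields into the semantic field,
            # so one embedding covers the full product representation
            "title": {"type": "text", "copy_to": "semantic_field"},
            "description": {"type": "text", "copy_to": "semantic_field"},
            "brand": {"type": "keyword"},
            "category": {"type": "keyword"},
            "semantic_field": {
                "type": "semantic_text",
                "inference_id": "jina-embeddings",  # endpoint from Step 1
            },
        }
    },
)
```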

Step 3: Index the products

At index time, Elasticsearch calls the inference endpoint for each document and stores the resulting embedding in semantic_field. No extra code on your side.
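
Re-indexing the same dataframe into the semantic index is enough to trigger embedding generation. A sketch; smaller bulk chunks can help, since each batch waits on inference:

```python
helpers.bulk(
    client,
    (
        {"_index": "products-semantic", "_id": row["id"], "_source": row.to_dict()}
        for _, row in df.iterrows()
    ),
    chunk_size=100,  # each batch triggers embedding inference
)
```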

Hybrid search: Combining BM25 and vectors with RRF

Adding vectors improves recall, but using vectors alone risks losing precision on exact-match queries; "running shoes" should still rank verbatim matches first. Hybrid search retains the lexical component specifically to preserve that precision.

Hybrid search with Reciprocal Rank Fusion (RRF) keeps the best of both:

  • BM25 handles exact and near-exact queries with high precision.
  • Semantic search handles intent-based and multilingual queries with high recall.
  • RRF combines the two ranked lists into a single ranking.

The RRF formula assigns each document a score based on its rank in each result list:
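
$$\mathrm{score}(d) = \sum_{r \,\in\, \text{result lists}} \frac{1}{\text{rank\_constant} + \mathrm{rank}_r(d)}$$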

A document that ranks highly in both lists gets a higher combined score. The rank_constant controls how much weight lower-ranked documents receive.
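
Here's a sketch of the hybrid evaluation, reusing the to_rank_eval_requests helper from above. The rrf retriever and semantic query are standard Elasticsearch 9.x features; whether your version accepts a retriever inside a rank_eval request body is worth verifying, and if not, the same retriever block works directly in _search:

```python
def hybrid_request(q):
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    # Lexical leg: plain BM25 over the text fields
                    {"standard": {"query": {"multi_match": {
                        "query": q,
                        "fields": ["title", "description"],
                    }}}},
                    # Semantic leg: vector match against the embedded field
                    {"standard": {"query": {"semantic": {
                        "field": "semantic_field",
                        "query": q,
                    }}}},
                ],
                "rank_constant": 60,  # default; higher flattens the ranking
            }
        }
    }

hybrid_requests = to_rank_eval_requests("products-semantic", hybrid_request)
resp = client.rank_eval(
    index="products-semantic",
    requests=hybrid_requests,
    metric={"recall": {"k": 10, "relevant_rating_threshold": 1}},
)
print(f"Hybrid Recall@10: {resp['metric_score']:.2f}")
```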

Result:
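
```
Hybrid Recall@10: 0.75
```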

Hybrid improves substantially over BM25 (0.43) and preserves precision for exact-match queries like "running shoes."

Results: Before and after

Here’s the comparison between the two approaches:

| Method | Recall@10 |
| --- | --- |
| BM25 (Lexical) | 0.43 |
| Hybrid (BM25 + Vectors) | 0.75 |

Breaking it down by query: the exact-match query ("running shoes") holds steady under both methods, while the improvement comes from the intent-based queries, where semantic matching recovers the serums, carriers, and crates that BM25 missed.

Conclusion

Throughout this post, we saw that BM25 lexical search is reliable when users type exact queries, but it loses recall when they search by intent rather than keywords. Using rank_eval, we established a reproducible baseline to measure that gap with real numbers. From there, we added a semantic_text field powered by Jina embeddings and ran the evaluation again. The result: Hybrid search improved recall from 0.43 to 0.75 while preserving precision on exact-match queries, though the actual margin will depend on your query mix.

The pattern scales beyond this example: Collect judgments from your users' actual queries, run rank_eval as a baseline, add semantic_text, and measure again. You'll know exactly what improved and by how much.
