How to measure and improve Elasticsearch search recall: from 0.43 to 0.75 with hybrid search

Learn how to measure and improve search recall in Elasticsearch by combining BM25 lexical search with Jina AI vector embeddings, using the rank_eval API to validate the improvement with real numbers.


Lexical search using the BM25 ranking algorithm is cheap, fast, and very effective for a wide range of queries. But it has a blind spot: queries that don't share tokens with your documents. In this article, we'll measure exactly where BM25 falls short using Elasticsearch's ranking evaluation API (rank_eval), then close that gap by adding Jina AI embeddings through the Elastic Inference Service (EIS). You'll see the recall score go from 0.43 to 0.75 and understand why.

What is recall?

Recall measures, on a scale from 0 to 1, how many of the documents that your users actually want appear somewhere in your search results. If a query should surface three products and your search returns only two of them in the top 10, recall@10 = 0.67 for that query. It’s a set-based metric: It doesn’t care about the position of the relevant documents within the top k results. A relevant document in position 10 counts the same as one in position 1. High recall means that you’re not losing relevant results.


The diagram shows two sets: all relevant documents (left) and what BM25 actually retrieved (top 10, right). Only the intersection counts toward recall: prod_1 and prod_2 were found, while prod_3, prod_4, and prod_6 were missed entirely. Result: Recall@10 = 2/5 = 0.40.

Prerequisites

Let's get down to business to better understand how recall works. This demonstration uses Python. You can follow along in the companion notebook (notebook.ipynb), where every code block is a cell ready to run.

The code provided uses the following:

  • Elasticsearch 9.3+
  • Python 3.10+
  • A .env file with your Elasticsearch credentials
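
The .env file might look like this (the variable names are an assumption; match them to whatever your client setup reads):

```
ELASTICSEARCH_URL=https://your-deployment.es.example.com:443
ELASTICSEARCH_API_KEY=your-base64-api-key
```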

The dataset

We’ll use a product catalog of 1,000 products, spanning categories such as footwear, electronics, tools, and more.

Each document has four fields:

| Field | Type |
| --- | --- |
| `title` | text |
| `description` | text |
| `brand` | keyword |
| `category` | keyword |

The dataset is loaded from dataset.csv.
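
Loading it with pandas is one line (the columns are assumed from the field table above, plus an id column we'll reuse as the Elasticsearch _id):

```python
import pandas as pd

# Expected columns: id, title, description, brand, category
df = pd.read_csv("dataset.csv")
print(len(df))  # 1000 products
```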

BM25 is the default ranking algorithm in Elasticsearch and most search engines. It scores documents by how often your query terms appear in them, adjusted for document length and the frequency of those terms across the entire index. You get analyzers on top: lowercase normalization, stemming, and stopword removal. A query for "running shoes" will match "Running Shoes" and likely "run" as well.

This works well for a large class of queries:

  • "running shoes" immediately matches products with those exact tokens in the title.
  • "bluetooth speaker" surfaces portable audio products because the tokens appear verbatim.

The results are deterministic and explainable: A document ranks highly because the query terms appear in it. Debugging relevance is straightforward.
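
You can see that explainability directly: once the catalog is indexed (we do that below), the explain flag returns the full BM25 score breakdown for every hit. A minimal sketch:

```python
resp = client.search(
    index="products-bm25",       # the lexical index created below
    query={"match": {"title": "running shoes"}},
    explain=True,                # attach the scoring breakdown to each hit
    size=3,
)
for hit in resp["hits"]["hits"]:
    # hit["_explanation"] shows the per-term frequency and
    # document-length contributions to the BM25 score
    print(hit["_id"], round(hit["_score"], 2))
```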

Where it breaks

Now let’s try these queries against the same catalog:

  • "skincare routine": The word "routine" doesn’t appear in any product title. BM25 can partially match on "skincare," but face serums, body oils, and moisturizers are described using terms like "vitamin C," "retinol," or "brightening," none of which overlap with the query. Products that form a complete skincare routine are scattered across the index with no shared token to anchor them.
  • "pet travel accessories": This is a use-case grouping, not a product category. A dog sling carrier, a pet car seat, and a travel crate are all relevant, but their descriptions talk about portability, safety, and comfort rather than "travel accessories." BM25 matches "pet" broadly but has no signal to distinguish travel-specific products from the rest of the pet catalog.

This is a recall problem. The relevant documents exist in your index. BM25 just cannot find them because the user's words and the document's words do not match closely enough.

Adding synonyms helps for known cases. But you cannot enumerate every way a user might express an intent. That is where vectors come in.

Why you should measure recall

Before fixing a problem, you need to quantify it.

Recall@k counts how many of the relevant documents for a query appear somewhere in the top k results. Formally:
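
$$\mathrm{Recall@}k = \frac{|\,\text{relevant documents} \cap \text{top-}k\ \text{results}\,|}{|\,\text{relevant documents}\,|}$$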

Precision@k measures how many of the top k results are actually relevant. Formally:
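
$$\mathrm{Precision@}k = \frac{|\,\text{relevant documents} \cap \text{top-}k\ \text{results}\,|}{k}$$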

High precision means that the results you do return are good. In ecommerce, missing a relevant product (low recall) is often worse than showing a slightly imperfect result (lower precision), because a hidden product is a lost sale.

Elasticsearch's rank_eval API lets you measure both systematically. You provide a list of queries, each with a set of rated documents, and Elasticsearch computes the metrics for you across all queries.

Setting up the evaluation

The rank_eval API needs a ratings dataset: a mapping of queries to the documents that are relevant for each one, along with a relevance grade (0 = not relevant, 1 = relevant, 2 = highly relevant).

In the notebook, the judgments list looks like this (the document IDs below are illustrative):
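
```python
# Relevance judgments: which documents should come back for each query.
# Ratings: 0 = not relevant, 1 = relevant, 2 = highly relevant.
judgments = [
    {
        "id": "q1",
        "query": "running shoes",
        "ratings": {"prod_12": 2, "prod_31": 2, "prod_58": 1},
    },
    {
        "id": "q2",
        "query": "skincare routine",
        "ratings": {"prod_203": 2, "prod_207": 1, "prod_214": 1},
    },
    {
        "id": "q3",
        "query": "pet travel accessories",
        "ratings": {"prod_411": 2, "prod_418": 2, "prod_430": 1},
    },
    # q4 is another intent-based query with the same shape.
]
```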

The mix is intentional: q1 is a query that BM25 handles well (exact tokens in product titles), while q2, q3, and q4 are intent-based queries where the user's intent is expressed as a concept rather than specific product keywords.

Measuring BM25 baseline recall

First, set up the Elasticsearch client and index the raw text data:
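
A minimal sketch, assuming the .env variable names shown earlier and an id column in the CSV:

```python
import os

from dotenv import load_dotenv
from elasticsearch import Elasticsearch, helpers

load_dotenv()  # reads the .env file

client = Elasticsearch(
    os.environ["ELASTICSEARCH_URL"],
    api_key=os.environ["ELASTICSEARCH_API_KEY"],
)

# A plain lexical index: text fields for BM25, keywords for filtering.
client.indices.create(
    index="products-bm25",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "description": {"type": "text"},
            "brand": {"type": "keyword"},
            "category": {"type": "keyword"},
        }
    },
)

# Bulk-index the catalog, keeping the CSV id as the Elasticsearch _id
# so the ratings in `judgments` can reference it.
helpers.bulk(
    client,
    (
        {"_index": "products-bm25", "_id": row["id"], "_source": row.to_dict()}
        for _, row in df.iterrows()
    ),
)
```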

Now build the rank_eval request for BM25. Each request in the list combines a query with its ratings:
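
A sketch of that assembly. The to_rank_eval_requests helper is ours, so we can reuse it for the hybrid run later:

```python
def to_rank_eval_requests(index, query_builder):
    """Pair each judged query with its rated documents."""
    return [
        {
            "id": j["id"],
            "request": query_builder(j["query"]),
            "ratings": [
                {"_index": index, "_id": doc_id, "rating": rating}
                for doc_id, rating in j["ratings"].items()
            ],
        }
        for j in judgments
    ]

bm25_requests = to_rank_eval_requests(
    "products-bm25",
    lambda q: {
        "query": {"multi_match": {"query": q, "fields": ["title", "description"]}}
    },
)

resp = client.rank_eval(
    index="products-bm25",
    requests=bm25_requests,
    # recall@10: a rating of 1 or higher counts as relevant
    metric={"recall": {"k": 10, "relevant_rating_threshold": 1}},
)
print(f"BM25 Recall@10: {resp['metric_score']:.2f}")
```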

Result:
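
```
BM25 Recall@10: 0.43
```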

0.43 means that across all four queries, BM25 finds only 43% of the documents it should find. The shortfall is concentrated in the intent-based queries: "skincare routine" misses face serums and body oils because "routine" never appears in product titles, and "pet travel accessories" retrieves off-topic pet products while missing carriers and crates described in terms of portability and safety rather than "travel accessories."

This is our baseline. Now we have a number to beat.

Adding vector search with Jina embeddings

Vector search encodes documents and queries as high-dimensional embeddings: vectors of hundreds or thousands of numerical values, each capturing some feature of the data they represent. Documents with similar meaning end up close together in vector space, even if they share no words. "Gym equipment" and "dumbbell set" will be nearby because the concepts are related. Elasticsearch works well as the vector database here because it supports hybrid search, giving you both semantic understanding and keyword precision out of the box.

EIS includes out-of-the-box support for embedding models through its inference API.

Step 1: Create an inference endpoint with Jina embeddings v5

If your cluster has GPU resources (available in Elastic Cloud and Elasticsearch 9.3+), the embeddings are generated on GPU, which is significantly faster than CPU inference and removes the performance trade-off that historically made vectors expensive at scale.

Why Jina embeddings specifically? jina-embeddings-v5-text is a multilingual model (119+ languages) with a 32,000-token context window and support for task-specific Low-Rank Adaptation (LoRA) adapters. It works well for short product descriptions out of the box. You can read more about the jina-embeddings-v5-text model here.
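
A sketch of endpoint creation via the inference API. The endpoint name jina-embeddings is ours, and the exact service and model_id strings for Jina on EIS should be verified against the EIS docs for your deployment:

```python
client.inference.put(
    inference_id="jina-embeddings",
    task_type="text_embedding",
    inference_config={
        # "elastic" is the EIS service; model availability depends on
        # your deployment and region
        "service": "elastic",
        "service_settings": {"model_id": "jina-embeddings-v5-text"},
    },
)
```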

Step 2: Create the index with a semantic field

The semantic_text field type is the key here. It’s a higher-level abstraction over dense_vector: You point it at an inference endpoint, and Elasticsearch takes care of generating embeddings automatically.

The copy_to property on title and description means content from both fields flows into semantic_field for embedding, so a single vector captures the full product representation.
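
A sketch of the mapping, using the index and field names from this post:

```python
client.indices.create(
    index="products-semantic",
    mappings={
        "properties": {
            # copy_to funnels both text fields into the semantic field,
            # so one embedding covers the full product representation
            "title": {"type": "text", "copy_to": "semantic_field"},
            "description": {"type": "text", "copy_to": "semantic_field"},
            "brand": {"type": "keyword"},
            "category": {"type": "keyword"},
            "semantic_field": {
                "type": "semantic_text",
                "inference_id": "jina-embeddings",  # endpoint from Step 1
            },
        }
    },
)
```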

Step 3: Index the products

At index time, Elasticsearch calls the inference endpoint for each document and stores the resulting embedding in semantic_field. No extra code on your side.
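
Re-indexing the same dataframe into the semantic index is enough to trigger embedding generation. A sketch; smaller bulk chunks can help, since each batch waits on inference:

```python
helpers.bulk(
    client,
    (
        {"_index": "products-semantic", "_id": row["id"], "_source": row.to_dict()}
        for _, row in df.iterrows()
    ),
    chunk_size=100,  # each batch triggers embedding inference
)
```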

Hybrid search: Combining BM25 and vectors with RRF

Adding vectors improves recall, but using vectors alone risks losing precision on exact-match queries; "running shoes" should still rank verbatim matches first. Hybrid search retains the lexical component specifically to preserve that precision.

Hybrid search with Reciprocal Rank Fusion (RRF) keeps the best of both:

  • BM25 handles exact and near-exact queries with high precision.
  • Semantic search handles intent-based and multilingual queries with high recall.
  • RRF combines the two ranked lists into a single ranking.

The RRF formula assigns each document a score based on its rank in each result list:
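
$$\mathrm{score}(d) = \sum_{r \,\in\, \text{result lists}} \frac{1}{\text{rank\_constant} + \mathrm{rank}_r(d)}$$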

A document that ranks highly in both lists gets a higher combined score. The rank_constant controls how much weight lower-ranked documents receive.
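
Here's a sketch of the hybrid evaluation, reusing the to_rank_eval_requests helper from above. The rrf retriever and semantic query are standard Elasticsearch 9.x features; whether your version accepts a retriever inside a rank_eval request body is worth verifying, and if not, the same retriever block works directly in _search:

```python
def hybrid_request(q):
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    # Lexical leg: plain BM25 over the text fields
                    {"standard": {"query": {"multi_match": {
                        "query": q,
                        "fields": ["title", "description"],
                    }}}},
                    # Semantic leg: vector match against the embedded field
                    {"standard": {"query": {"semantic": {
                        "field": "semantic_field",
                        "query": q,
                    }}}},
                ],
                "rank_constant": 60,  # default; higher flattens the ranking
            }
        }
    }

hybrid_requests = to_rank_eval_requests("products-semantic", hybrid_request)
resp = client.rank_eval(
    index="products-semantic",
    requests=hybrid_requests,
    metric={"recall": {"k": 10, "relevant_rating_threshold": 1}},
)
print(f"Hybrid Recall@10: {resp['metric_score']:.2f}")
```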

Result:
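
```
Hybrid Recall@10: 0.75
```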

Hybrid improves substantially over BM25 (0.43) and preserves precision for exact-match queries like "running shoes."

Results: Before and after

Here’s the comparison between the two approaches:

| Method | Recall@10 |
| --- | --- |
| BM25 (Lexical) | 0.43 |
| Hybrid (BM25 + Vectors) | 0.75 |

Breaking it down by query: the exact-match query ("running shoes") holds steady under both methods, while the improvement comes from the intent-based queries, where semantic matching recovers the serums, carriers, and crates that BM25 missed.

Conclusion

Throughout this post, we saw that BM25 lexical search is reliable when users type exact queries, but it loses recall when they search by intent rather than keywords. Using rank_eval, we established a reproducible baseline to measure that gap with real numbers. From there, we added a semantic_text field powered by Jina embeddings and ran the evaluation again. The result: Hybrid search improved recall from 0.43 to 0.75 while preserving precision on exact-match queries, though the actual margin will depend on your query mix.

The pattern scales beyond this example: Collect judgments from your users' actual queries, run rank_eval as a baseline, add semantic_text, and measure again. You'll know exactly what improved and by how much.
