Semantic Reranking in Elasticsearch with Retrievers

What is semantic reranking?

Semantic reranking is a method that lets us keep the speed and efficiency of fast retrieval methods while layering semantic search on top. It also lets us add semantic search capabilities to existing Elasticsearch installations right away.

With the advancement of machine learning-powered semantic search we have more and more tools at our disposal for finding matches quickly from millions of documents. However, like cramming for a final exam, optimizing for speed means making some tradeoffs, and that usually comes at a loss in fidelity.

To offset this, we see some tools emerging and becoming increasingly available at the other end of the spectrum. These are much slower, but can tell how closely a document matches a query with much more accuracy.

Semantic reranking hierarchy illustration

To explain some key terms: reranking is the process of reordering a set of retrieved documents in order to improve search relevance. In semantic reranking this is done with the help of a reranker machine learning model, which calculates a relevance score between the input query and each document.

Semantic reranking overview illustration

Rerankers typically operate on the top K results, a narrowed-down window of relevant candidates fulfilling the search query, since reranking a large list of documents would be extremely costly.

Why is semantic reranking important?

Semantic reranking is an important refinement layer for search users for a couple of reasons.

First, users expect more from their search: the right result should not merely be in the top ten hits or on the first page, it should be the top answer. It's like that old search joke - the best place to hide a secret is on the second page of search results. Except today it's even narrower: anything below the top one, two, or maybe three results is likely to get discarded.

This applies even more so for RAG (Retrieval Augmented Generation) - those Generative AI use cases need a tight context window. The best document could be the 4th result, but if you're only feeding in the top three, you aren't going to get the right answer, and the model could hallucinate.

On top of that, Generative AI use cases work best with an effective cutoff. You could define a minimum score or count up to which the results are considered "good", but this is hard to do without consistent scoring.

Semantic reranking solves these problems by reordering the documents so that the most relevant ones come out on top. It provides usable, normalized and well-calibrated scores, so you can measure how closely your results match your query. As a result, you more reliably get accurate top results to feed to your large language model, and you can cut off results where the score drops sharply within the top K hits, which reduces the risk of hallucinations.
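As a rough illustration of the cutoff idea, here is a minimal Python sketch. The function name `cut_off`, the `min_score` and `max_drop` parameters, and their default values are assumptions made for this example; they are not part of any Elastic API, and suitable thresholds depend on your data and model.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (document label, relevance score)


def cut_off(results: List[Scored], min_score: float = 0.5, max_drop: float = 0.3) -> List[Scored]:
    """Keep leading results until the score falls below min_score
    or drops by more than max_drop compared to the previous result.

    Assumes `results` is already sorted by score, highest first.
    """
    kept: List[Scored] = []
    for doc, score in results:
        if score < min_score:
            break  # below the absolute threshold
        if kept and kept[-1][1] - score > max_drop:
            break  # big dropoff compared to the previous hit
        kept.append((doc, score))
    return kept


# Illustrative (label, score) pairs, sorted by score in descending order.
reranked = [
    ("Washington, D.C.", 0.99838966),
    ("Capital punishment", 0.587174),
    ("Carson City", 0.061199225),
]
print(cut_off(reranked))
```

With these thresholds only the first result survives, because the gap between the first and second score is larger than `max_drop`. Only the surviving documents would then be passed to the LLM as context.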

So how do we perform reranking?

The rerank inference type

Elastic recently introduced inference endpoints and related APIs. This feature allows us to use certain services, such as built-in or 3rd party machine learning models, to perform inference tasks. Supported inference tasks come in various shapes - for example a sparse_embedding task is where an ML model (such as ELSER) receives some text and generates a weighted set of terms, whereas a text_embedding task creates vector embeddings from the input.

Elastic Serverless - and the upcoming 8.14 release - adds a new task type: rerank. In the first iteration rerank supports integrating with Cohere's Rerank API. This means you can now create an inference endpoint in Elastic, supply your Cohere API key, and enjoy semantic reranking out of the box!
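For orientation, creating such an endpoint might look roughly like the sketch below. It only builds the request and prints it; nothing is sent to a cluster. The `cohere` service name and the `api_key`, `model_id`, `top_n` and `return_documents` settings reflect our understanding of the inference API, so treat them as assumptions and check the reference documentation for your version. The helper name `build_rerank_endpoint_request` is hypothetical, the endpoint name and model ID are example values, and the API key is a placeholder.

```python
import json
from typing import Any, Dict, Tuple


def build_rerank_endpoint_request(
    endpoint_id: str, api_key: str, model_id: str, top_n: int = 10
) -> Tuple[str, str, Dict[str, Any]]:
    """Return (method, path, body) for creating a rerank inference endpoint.

    The body layout is an assumption based on the inference API docs.
    """
    path = f"_inference/rerank/{endpoint_id}"
    body = {
        "service": "cohere",
        "service_settings": {"api_key": api_key, "model_id": model_id},
        "task_settings": {"top_n": top_n, "return_documents": True},
    }
    return "PUT", path, body


method, path, body = build_rerank_endpoint_request(
    endpoint_id="cohere-rerank-v3-model",  # example endpoint name
    api_key="<YOUR_COHERE_API_KEY>",       # placeholder, never hard-code real keys
    model_id="rerank-english-v3.0",        # example Cohere model ID
)
print(method, path)
print(json.dumps(body, indent=2))
```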

Let's see that in action with an example taken from the Cohere blog.

Assuming you have set up your rerank inference endpoint in Elastic with the Cohere Rerank v3 model, we can pass a query and an array of input text. As we can see, the short passages all relate to the word "capital", but not necessarily to the meaning of the location of the seat of government, which is what the query is looking for:

POST _inference/rerank/cohere-rerank-v3-model
{
  "query": "What is the capital of the USA?",
  "input": [
    "Carson City is the capital city of the American state of Nevada. At the 2010 United States Census, Carson City had a population of 55,274.",
    "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states.",
    "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan.",
    "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district.",
    "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas.",
    "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."
  ]
}

The rerank task responds with an array of scores and document indices:

{
  "rerank": [
    { "index": "3", "relevance_score": "0.99838966" },
    { "index": "1", "relevance_score": "0.587174" },
    { "index": "0", "relevance_score": "0.061199225" },
    { "index": "2", "relevance_score": "0.032283258" },
    { "index": "4", "relevance_score": "0.015365342" },
    { "index": "5", "relevance_score": "0.0040072887" }
  ]
}

The topmost entry tells us that the highest relevance score, 99.8%, belongs to the 4th document of the original list ("index": 3 with zero-based indexing), a.k.a. "Washington, D.C. ...". The rest of the documents are semantically less relevant to the original query.
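To make the index mapping concrete, here is a small Python sketch that pairs each entry of the response with its input passage. The passages are shortened to their first sentence, and `attach_scores` is a hypothetical helper name. The values are converted with `int()` and `float()` so the code works whether they arrive as strings, as printed above, or as numbers.

```python
from typing import Any, Dict, List, Tuple

# First sentence of each input passage, in the order they were sent.
passages = [
    "Carson City is the capital city of the American state of Nevada.",
    "Capital punishment (the death penalty) has existed in the United States since before the United States was a country.",
    "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States.",
    "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States.",
    "Charlotte Amalie is the capital and largest city of the United States Virgin Islands.",
    "North Dakota is a state in the United States.",
]

# The rerank response shown above.
response = {
    "rerank": [
        {"index": "3", "relevance_score": "0.99838966"},
        {"index": "1", "relevance_score": "0.587174"},
        {"index": "0", "relevance_score": "0.061199225"},
        {"index": "2", "relevance_score": "0.032283258"},
        {"index": "4", "relevance_score": "0.015365342"},
        {"index": "5", "relevance_score": "0.0040072887"},
    ]
}


def attach_scores(texts: List[str], rerank_response: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Pair every response entry with the input text its index points to."""
    return [
        (texts[int(entry["index"])], float(entry["relevance_score"]))
        for entry in rerank_response["rerank"]
    ]


ranked = attach_scores(passages, response)
for text, score in ranked:
    print(f"{score:.4f}  {text[:60]}")
```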

This reranking inference step is an important puzzle piece of an optimized search experience, and now we are ready to place it in the puzzle board!

Reranking search results today - through your application

One way of harnessing the power of semantic reranking is to implement a workflow like this in a search application:

  1. A user enters a query in your app's UI.
  2. The search engine component retrieves a set of documents that match this query. This can be done using any retrieval strategy: lexical (BM25), vector search (e.g. kNN) or a method that combines the two, such as Reciprocal Rank Fusion (RRF).
  3. The application takes the top K documents, extracts the text field we are querying against from each document, then sends this list of texts to the rerank inference endpoint, which is configured to use Cohere.
  4. The inference endpoint passes the documents and the query to Cohere.
  5. The result is a list of scores and indices to match each score. Your app takes these scores, assigns them to the documents, and reorders them by this score in a descending order. This effectively moves the semantically most relevant documents to the top.
  6. If this flow is used in RAG to provide sources to a generative LLM (for example, to summarize an answer), the model is much more likely to work with the right context when it generates its answer.
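
The steps above can be sketched in Python as follows. To keep the example self-contained and runnable without a cluster, the search engine and the inference endpoint are replaced by two stand-in functions, `fake_retrieve` and `fake_rerank`. Both are hypothetical, and the word-overlap scoring inside `fake_rerank` is only a toy substitute for a real reranker model. In a real application these would be calls to the _search API and to the rerank inference endpoint.

```python
import re
from typing import Any, Callable, Dict, List

Hit = Dict[str, Any]
RetrieveFn = Callable[[str], List[Hit]]
RerankFn = Callable[[str, List[str]], List[Dict[str, Any]]]


def fake_retrieve(query: str) -> List[Hit]:
    """Stand-in for step 2: pretend the search engine returned these hits in this order."""
    return [
        {"_id": "a", "_source": {"text": "Capital punishment is legal in some states."}},
        {"_id": "b", "_source": {"text": "Carson City is the capital of Nevada."}},
        {"_id": "c", "_source": {"text": "Washington, D.C. is the capital of the United States."}},
    ]


def fake_rerank(query: str, texts: List[str]) -> List[Dict[str, Any]]:
    """Stand-in for steps 3-4: toy word-overlap score, NOT a real reranker model."""
    query_words = set(re.findall(r"\w+", query.lower()))
    scored = []
    for index, text in enumerate(texts):
        text_words = set(re.findall(r"\w+", text.lower()))
        overlap = len(query_words & text_words) / max(len(query_words), 1)
        scored.append({"index": index, "relevance_score": overlap})
    return scored


def search_with_rerank(
    query: str, retrieve: RetrieveFn, rerank: RerankFn, top_k: int = 10, field: str = "text"
) -> List[Hit]:
    """Steps 2-5: retrieve, send the top_k texts to the reranker, reorder by the new score."""
    hits = retrieve(query)[:top_k]
    texts = [hit["_source"][field] for hit in hits]
    scored = rerank(query, texts)
    # Copy each hit and overwrite its score with the reranker's relevance score.
    reranked = [{**hits[entry["index"]], "_score": entry["relevance_score"]} for entry in scored]
    reranked.sort(key=lambda hit: hit["_score"], reverse=True)
    return reranked


results = search_with_rerank("capital of the United States", fake_retrieve, fake_rerank, top_k=3)
for hit in results:
    print(hit["_id"], hit["_score"], hit["_source"]["text"])
```

Passing the retrieval and reranking steps in as functions keeps the reordering logic independent of any particular client library, which also makes it easy to test.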

Semantic reranking through your application illustration

This works great, but it involves many steps, data massaging, and complex processing logic with many moving parts. Can we simplify this?

Reranking search results tomorrow - with retrievers

Let's spend a minute talking about retrievers. A retriever is a new type of abstraction in the _search API that is more than just a simple query. It's a building block for an end-to-end search flow for fetching hits and potentially modifying the documents' scores and their order.

Retrievers can be used in a pipeline pattern, where each retriever unit does something different in the search process. For example, we can configure a first-stage retriever to fetch documents, then pass the results to a second-stage retriever that combines them with other results, trims the number of candidates, and so on. As a final stage, a retriever can update the relevance score of documents.

Soon we'll be adding new reranking capabilities with retrievers, text similarity reranker retriever being the first one. This will perform reranking on top K hits by calling a rerank inference endpoint. The workflow will be simplified into a single API call that hides all the complexity!

This is what the previously described multi-stage workflow looks like as a single retriever query:

Semantic reranking with retriever illustration

The text_similarity_reranker retriever is configured with the following details:

  • Nested retriever
  • Reranker inference configuration
  • Additional controls, such as minimum score cutoff for eliminating irrelevant hits

Below is an example text_similarity_reranker query. Let's dissect it to understand each part better!

POST my-index/_search
{
  "retriever": { // Retriever query
    "text_similarity_reranker": { // Outermost retriever will perform reranking
      "retriever": {
        "standard": { // First-stage retriever is a standard Elasticsearch query
          "query": {
            "match": { // BM25 matching
              "text": "What is the capital of the USA?"
            }
          }
        }
      },
      "field": "text", // Document field to send to reranker
      "window_size": 100, // Reranking will work on top K hits
      "inference_id": "cohere-rerank-v3-model", // Inference endpoint
      "inference_text": "What is the capital of the USA?",
      "min_score": 0.6 // Minimum relevance score
    }
  }
}

The request defines a retriever query as the root property. The outermost retriever executes last; in this case it's a text_similarity_reranker. It specifies a standard first-stage retriever, which is responsible for fetching some documents. The standard retriever accepts an Elasticsearch query, which is a BM25 match in the example.

The text similarity reranker is pointed at the text document field, which contains the text used for semantic reranking. The top 100 documents will be sent for reranking to the cohere-rerank-v3-model rerank inference endpoint we have configured with Cohere. Only documents that receive a relevance score of at least 0.6 (60%) in the reranking process will be returned.
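
If you assemble this request in application code, note that the query text appears twice: once in the first-stage match and once as `inference_text`. A small helper can keep the two in sync. The sketch below only builds and prints the request body. The function name `build_rerank_search_body` is hypothetical, and the parameter names mirror the example request above, so they may change before the feature is released.

```python
import json
from typing import Any, Dict


def build_rerank_search_body(
    query_text: str,
    field: str = "text",
    inference_id: str = "cohere-rerank-v3-model",
    window_size: int = 100,
    min_score: float = 0.6,
) -> Dict[str, Any]:
    """Build a _search body that wraps a BM25 match in a text_similarity_reranker."""
    return {
        "retriever": {
            "text_similarity_reranker": {
                # First stage: a standard retriever running a BM25 match query.
                "retriever": {"standard": {"query": {"match": {field: query_text}}}},
                "field": field,                # document field sent to the reranker
                "window_size": window_size,    # rerank only the top K hits
                "inference_id": inference_id,  # rerank inference endpoint
                "inference_text": query_text,  # same text as the first-stage query
                "min_score": min_score,        # drop hits below this relevance score
            }
        }
    }


search_body = build_rerank_search_body("What is the capital of the USA?")
print(json.dumps(search_body, indent=2))
```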

The response has exactly the same structure as that of a regular search query. The _score property is the semantic relevance score from the reranking process, and _rank refers to the ranking order of documents.

{
  "took": 213,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.99838966,
    "hits": [
      {
        "_index": "my-index",
        "_id": "W7CDBo8BJDa_bRWhW1KH",
        "_score": 0.99838966,
        "_rank": 1,
        "_source": {
          "text": "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district."
        }
      }
    ]
  }
}

Semantic reranking with retrievers will be available shortly in a coming Elastic release.


Semantic reranking is an incredibly powerful tool for boosting the performance of a search experience or a RAG tool. It can be used as a direct inference call, in the context of a search experience, or as part of a simplified search flow with retrievers. Users can pick and choose the set of tools that works best for their use case and context.

Happy reranking! 😃
