Improving text expansion performance using token pruning

This blog talks about an exciting new enhancement to ELSER performance, just released with Elasticsearch 8.13.0!

The strategy behind token pruning

We've already talked in great detail about lexical and semantic search in Elasticsearch and text similarity search with vector fields. These articles offer great, in-depth explanations of how vector search works.

We've also talked in the past about reducing retrieval costs by optimizing retrieval with ELSER v2. While ELSER is limited to 512 tokens per inference field, it can still produce a large number of unique tokens for multi-term queries. The result is a very large disjunction query that returns many more documents than an individual keyword search would - in fact, queries that expand to a large number of tokens may match most or all of the documents in an index!

[Figure: text expansion illustration]

Now, let's take a more detailed look at an example using ELSER v2. Using the infer API, we can view the predicted values for the phrase "Is Pluto a planet?":

POST /_ml/trained_models/.elser_model_2_linux-x86_64/_infer
{
  "docs":[{"text_field": "is Pluto a planet?"}]
}

This returns the following inference results:

{
  "inference_results": [
    {
      "predicted_value": {
        "pluto": 3.014208,
        "planet": 2.6253395,
        "planets": 1.7399588,
        "alien": 1.1358738,
        "mars": 0.8806293,
        "genus": 0.8014013,
        "europa": 0.6215426,
        "a": 0.5890018,
        "asteroid": 0.5530223,
        "neptune": 0.5525891,
        "universe": 0.5023148,
        "venus": 0.47205976,
        "god": 0.37106854,
        "galaxy": 0.36435634,
        "discovered": 0.3450894,
        "any": 0.3425274,
        "jupiter": 0.3314228,
        "planetary": 0.3290833,
        "particle": 0.30925226,
        "moon": 0.29885328,
        "earth": 0.29008925,
        "geography": 0.27968466,
        "gravity": 0.26251012,
        "astro": 0.2522782,
        "biology": 0.2520054,
        "aliens": 0.25142986,
        "island": 0.25103575,
        "species": 0.2500962,
        "uninhabited": 0.23360424,
        "orbit": 0.2327767,
        "existence": 0.21717428,
        "physics": 0.2001011,
        "nuclear": 0.1603676,
        "space": 0.15076339,
        "asteroids": 0.14343098,
        "astronomy": 0.10858688,
        "ocean": 0.08870865,
        "some": 0.065543786,
        "science": 0.051665734,
        "satellite": 0.042373143,
        "ari": 0.024783766,
        "list": 0.019822711,
        "poly": 0.018234596,
        "sphere": 0.01611787,
        "dino": 0.006902895,
        "rocky": 0.0062791444
      }
    }
  ]
}

These are the inference results that would be sent as input into a text expansion search. When we run a text expansion query, these terms eventually get joined together in one large weighted boolean query, such as:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "pluto": {
              "query": "pluto",
              "boost": 3.014208
            }
          }
        },
        {
          "match": {
            "planet": {
              "query": "planet",
              "boost": 2.6253395
            }
          }
        },
        ...
        {
          "match": {
            "dino": {
              "query": "dino",
              "boost": 0.006902895
            }
          }
        },
        {
          "match": {
            "rocky": {
              "query": "rocky",
              "boost": 0.0062791444
            }
          }
        }
      ]
    }
  }
}

Speed it up by removing tokens

Given the large number of tokens produced by ELSER text expansion, the quickest way to realize a performance improvement is to reduce the number of tokens that make it into that final boolean query. This reduces the total work that Elasticsearch invests when performing the search. We can do this by identifying non-significant tokens produced by the text expansion and removing them from the final query.

[Figure: text expansion with pruning illustration]

Non-significant tokens can be defined as tokens that meet both of the following criteria:

  1. The weight/score is so low that the token is likely not very relevant to the original term
  2. The token appears much more frequently than most tokens, indicating that it is a very common word and may not benefit the overall search results much.

We started with some default rules to identify non-significant tokens, based on internal experimentation using ELSER v2:

  • Frequency: More than 5x more frequent than the average token frequency for all tokens in that field
  • Score: Less than 40% of the best scoring token
  • Missing: If a token has a document frequency of 0, it never shows up in the field at all and can be safely pruned

If you're using text expansion with a model other than ELSER, you may need to adjust these values in order to return optimal results.

Both the token frequency threshold and the weight threshold must indicate that the token is non-significant in order for it to be pruned. This ensures we keep frequent tokens that score very highly, as well as infrequent tokens that may not score as highly.
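To make these rules concrete, here's a minimal Python sketch of the pruning decision. It assumes you already have per-token document frequencies and the field's average token frequency (Elasticsearch computes these statistics internally, per shard); the function is illustrative, not Elasticsearch's actual implementation:

def prune_tokens(tokens, doc_freqs, avg_field_freq,
                 freq_ratio_threshold=5.0, weight_threshold=0.4):
    """Split expansion tokens into (kept, pruned).

    tokens:         {token: weight} from the inference results
    doc_freqs:      {token: document frequency of the token in the field}
    avg_field_freq: average token frequency across all tokens in the field
    """
    best_weight = max(tokens.values())
    kept, pruned = {}, {}
    for token, weight in tokens.items():
        freq = doc_freqs.get(token, 0)
        if freq == 0:
            # "Missing": the token never appears in the field; safe to prune.
            pruned[token] = weight
            continue
        too_frequent = freq > freq_ratio_threshold * avg_field_freq
        low_scoring = weight < weight_threshold * best_weight
        # Both conditions must hold for the token to be considered non-significant.
        if too_frequent and low_scoring:
            pruned[token] = weight
        else:
            kept[token] = weight
    return kept, pruned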

Performance improvements

We benchmarked these changes using the MS Marco Passage Ranking benchmark. Through this benchmarking, we observed that enabling token pruning with the default values described above resulted in a 3-4x improvement in latency at the 99th percentile and above!

Relevance impact

Once we measured a real performance improvement, we wanted to validate that relevance was still reasonable. We used a small sample of queries against the MS Marco Passage Ranking dataset. We did observe an impact on relevance when pruning the tokens; however, when we added the pruned tokens back in a rescore block, relevance was close to the original non-pruned results with only a marginal increase in latency. The rescore query takes the tokens that were previously pruned and scores them against only the documents returned by the initial query, then updates those documents' scores to include the dimensions that were previously left out.
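Conceptually, the prune-then-rescore flow looks like this. This is a toy Python sketch, not Elasticsearch internals: the score function is a stand-in for Elasticsearch's token scoring, documents are modeled as sets of tokens, and the combination step reflects Elasticsearch's default rescore score_mode ("total"), which sums the two phases' scores:

def score(doc_tokens, query_tokens):
    # Weighted overlap between the query's expansion tokens and the document.
    return sum(weight for token, weight in query_tokens.items() if token in doc_tokens)

def search_with_rescore(docs, kept_tokens, pruned_tokens, window_size=100, size=10):
    # Phase 1: rank all documents using only the kept (significant) tokens.
    first_pass = sorted(docs, key=lambda d: score(d, kept_tokens), reverse=True)
    window = first_pass[:window_size]
    # Phase 2: score the pruned tokens against the window only, then add the
    # two contributions back together before taking the final top hits.
    rescored = sorted(
        window,
        key=lambda d: score(d, kept_tokens) + score(d, pruned_tokens),
        reverse=True,
    )
    return rescored[:size]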

Using a sample of 44 queries with judgments against the MS Marco Passage Ranking dataset:

Top K | Rescore Window Size | Avg rescored recall vs control | Control NDCG@K | Pruned NDCG@K | Rescored NDCG@K
------|---------------------|--------------------------------|----------------|---------------|----------------
10    | 10                  | 0.956                          | 0.653          | 0.657         | 0.657
10    | 100                 | 1                              | 0.653          | 0.657         | 0.653
10    | 1000                | 1                              | 0.653          | 0.657         | 0.653
100   | 100                 | 0.953                          | 0.51           | 0.372         | 0.514
100   | 1000                | 1                              | 0.51           | 0.372         | 0.51

Now, this is only one dataset - but it's encouraging to see these results even at a smaller scale!

How to use

Pruning configuration launched in 8.13.0 as a technical preview feature. It's optional and opt-in, so if you perform text expansion queries without specifying pruning, there will be no change to how text expansion queries are formulated - and no change in performance.

We have some examples of how to use the new pruning configuration in our text expansion query documentation.

Here's an example text expansion query with both the pruning configuration and rescore:

GET my-index/_search
{
   "query":{
      "text_expansion":{
         "ml.tokens":{
            "model_id":".elser_model_2",
            "model_text":"Is pluto a planet?",
            "pruning_config": {
              "tokens_freq_ratio_threshold": 5,
              "tokens_weight_threshold": 0.4,
              "only_score_pruned_tokens": false
            }
         }
      }
   },
   "rescore": {
      "window_size": 100,
      "query": {
         "rescore_query": {
            "text_expansion": {
               "ml.tokens": {
                  "model_id": ".elser_model_2",
                  "model_text": "Is pluto a planet?",
                  "pruning_config": {
                    "tokens_freq_ratio_threshold": 5,
                    "tokens_weight_threshold": 0.4,
                    "only_score_pruned_tokens": true
                  }
               }
            }
         }
      }
   }
}

Note that the rescore query sets only_score_pruned_tokens to true, so it adds back into the rescore algorithm only those tokens that were originally pruned.

Weighted tokens queries

We also introduced a new weighted tokens query.

There are two main use cases for this new query type:

  • Sending in your own precomputed inferences at query time rather than using the inference API
  • Fast prototyping, so you can experiment with changes (such as to your pruning configuration!)

Usage is the same; here's the equivalent query using precomputed token weights:

GET my-index/_search
{
   "query":{
      "weighted_tokens": {
      "query_expansion_field": {
        "tokens": {"pluto":3.014208,"planet":2.6253395,"planets":1.7399588,"alien":1.1358738,"mars":0.8806293,"genus":0.8014013,"europa":0.6215426,"a":0.5890018,"asteroid":0.5530223,"neptune":0.5525891,"universe":0.5023148,"venus":0.47205976,"god":0.37106854,"galaxy":0.36435634,"discovered":0.3450894,"any":0.3425274,"jupiter":0.3314228,"planetary":0.3290833,"particle":0.30925226,"moon":0.29885328,"earth":0.29008925,"geography":0.27968466,"gravity":0.26251012,"astro":0.2522782,"biology":0.2520054,"aliens":0.25142986,"island":0.25103575,"species":0.2500962,"uninhabited":0.23360424,"orbit":0.2327767,"existence":0.21717428,"physics":0.2001011,"nuclear":0.1603676,"space":0.15076339,"asteroids":0.14343098,"astronomy":0.10858688,"ocean":0.08870865,"some":0.065543786,"science":0.051665734,"satellite":0.042373143,"ari":0.024783766,"list":0.019822711,"poly":0.018234596,"sphere":0.01611787,"dino":0.006902895,"rocky":0.0062791444},
        "pruning_config": {
          "tokens_freq_ratio_threshold": 5,
          "tokens_weight_threshold": 0.4,
          "only_score_pruned_tokens": false
        }
      }
    }
   },
   "rescore": {
      "window_size": 100,
      "query": {
         "rescore_query": {
            "weighted_tokens": {
              "query_expansion_field": {
                "tokens": {"pluto":3.014208,"planet":2.6253395,"planets":1.7399588,"alien":1.1358738,"mars":0.8806293,"genus":0.8014013,"europa":0.6215426,"a":0.5890018,"asteroid":0.5530223,"neptune":0.5525891,"universe":0.5023148,"venus":0.47205976,"god":0.37106854,"galaxy":0.36435634,"discovered":0.3450894,"any":0.3425274,"jupiter":0.3314228,"planetary":0.3290833,"particle":0.30925226,"moon":0.29885328,"earth":0.29008925,"geography":0.27968466,"gravity":0.26251012,"astro":0.2522782,"biology":0.2520054,"aliens":0.25142986,"island":0.25103575,"species":0.2500962,"uninhabited":0.23360424,"orbit":0.2327767,"existence":0.21717428,"physics":0.2001011,"nuclear":0.1603676,"space":0.15076339,"asteroids":0.14343098,"astronomy":0.10858688,"ocean":0.08870865,"some":0.065543786,"science":0.051665734,"satellite":0.042373143,"ari":0.024783766,"list":0.019822711,"poly":0.018234596,"sphere":0.01611787,"dino":0.006902895,"rocky":0.0062791444},
                "pruning_config": {
                  "tokens_freq_ratio_threshold": 5,
                  "tokens_weight_threshold": 0.4,
                  "only_score_pruned_tokens": true
                }
              }
            }
         }
      }
   }
}
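If you're computing or caching inferences yourself, the client-side flow might look like the following. This is a minimal sketch using the elasticsearch-py 8.x client; the connection details are placeholders, and the index name and model ID are taken from the examples above:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: a local cluster

# Run inference once (or bring your own precomputed token weights).
inference = es.ml.infer_trained_model(
    model_id=".elser_model_2_linux-x86_64",
    docs=[{"text_field": "is Pluto a planet?"}],
)
tokens = inference["inference_results"][0]["predicted_value"]

# Reuse the same weights in a weighted_tokens query, with pruning enabled.
response = es.search(
    index="my-index",
    query={
        "weighted_tokens": {
            "query_expansion_field": {
                "tokens": tokens,
                "pruning_config": {
                    "tokens_freq_ratio_threshold": 5,
                    "tokens_weight_threshold": 0.4,
                    "only_score_pruned_tokens": False,
                },
            }
        }
    },
)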

Token pruning was released as a technical preview in 8.13.0. You can try it out in Cloud today! Be sure to head over to our discuss forums and let us know what you think.
