June 23, 2021

Improve search relevance by combining Elasticsearch stemmers and synonyms

In a previous blog, we covered how you can incorporate synonyms into your Elasticsearch-powered application. Here, I build upon that blog and show how you can combine stemmers and multi-word synonyms to take the quality of your search results to the next level.

Motivation

Imagine that you are using Elasticsearch to power a search application for finding books, and in this application you want to treat the following words as synonyms:

brainstorm
brainstorming
brainstormed
brain storm
brain storming
brain stormed
envisage
envisaging
envisaged
etc.

It is tedious and error-prone to explicitly list synonyms for every possible conjugation, declension, and inflection of a word or compound word.

However, we can shorten the synonym list by using a stemmer to reduce each word to its stem before synonyms are applied. This gives the same results as the list above while specifying only the following synonyms:

brainstorm
brain storm
envisage
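
To make the idea concrete, here is a minimal Python sketch of how stemming collapses the "brainstorm" variants from the long list into the two entries of the short one. The toy_stem and stem_phrase helpers are hypothetical and only strip "ing" and "ed"; they are a stand-in for illustration, not the algorithm used by Elasticsearch's stemmer token filter.

```python
def toy_stem(word: str) -> str:
    """Strip a trailing 'ing' or 'ed'. A toy stand-in, not a real stemmer."""
    for suffix in ("ing", "ed"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_phrase(phrase: str) -> str:
    """Stem every whitespace-separated word of a phrase."""
    return " ".join(toy_stem(word) for word in phrase.split())


long_list = [
    "brainstorm", "brainstorming", "brainstormed",
    "brain storm", "brain storming", "brain stormed",
]

# Six surface forms collapse to two stemmed entries.
short_list = sorted({stem_phrase(phrase) for phrase in long_list})
print(short_list)  # ['brain storm', 'brainstorm']
```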

Custom analyzers

In this section, I show code snippets that define custom analyzers that can be used for matching synonyms. Later on in this blog I show how to submit the analyzers to Elasticsearch.

Our previous blog goes into details on the difference between index-time and search-time synonyms. In the solution presented here, I make use of search-time synonyms.

We will create a synonym graph token filter that treats the multi-word term “brain storm” as a synonym of “brainstorm” and “envisage”, and that also treats “mind” and “brain” as synonyms. We will call this token filter my_graph_synonyms, and it will look as follows:

"filter": { 
  "my_graph_synonyms": { 
    "type": "synonym_graph", 
    "synonyms": [ 
      "mind, brain", 
      "brain storm, brainstorm, envisage" 
    ] 
  } 
}
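
Each entry in "synonyms" is a single string that lists equivalent terms separated by commas. If the synonym groups live in application code, a small helper can generate the filter definition. The following is a sketch only; synonym_graph_filter is a hypothetical helper name, not part of any Elasticsearch client.

```python
import json


def synonym_graph_filter(groups):
    """Build a synonym_graph filter definition.

    Each group is a list of equivalent terms and becomes one
    comma-separated rule string.
    """
    return {
        "type": "synonym_graph",
        "synonyms": [", ".join(group) for group in groups],
    }


groups = [
    ["mind", "brain"],
    ["brain storm", "brainstorm", "envisage"],
]

filter_settings = {"my_graph_synonyms": synonym_graph_filter(groups)}
print(json.dumps({"filter": filter_settings}, indent=2))
```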

Next we need to define two separate custom analyzers, one that will be applied to text at index-time, and another that will be applied to text at search-time.

We define an analyzer called my_index_time_analyzer, which uses the standard tokenizer, the lowercase token filter, and the stemmer token filter, as follows:

"my_index_time_analyzer": { 
  "tokenizer": "standard", 
  "filter": [ 
    "lowercase", 
    "stemmer" 
  ] 
}

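We define an analyzer called my_search_time_analyzer, which uses the same standard tokenizer, lowercase token filter, and stemmer token filter as above. However, it also includes our custom token filter my_graph_synonyms, which ensures that synonyms are matched at search-time. Because my_graph_synonyms comes after the stemmer in the chain, Elasticsearch parses the synonym rules with the preceding filters as well, so the rules are compared against stemmed tokens. The analyzer looks as follows: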

"my_search_time_analyzer": { 
  "tokenizer": "standard", 
  "filter": [ 
    "lowercase", 
    "stemmer", 
    "my_graph_synonyms" 
  ] 
}
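
The effect of this filter order can be illustrated with a toy Python simulation. This is a simplified sketch under stated assumptions: toy_stem, normalize, and expand are hypothetical helpers, the stemmer only strips "ing" and "ed", and the result is a flat set of rewritten token sequences rather than the token graph that Elasticsearch actually builds.

```python
def toy_stem(word: str) -> str:
    """Strip a trailing 'ing' or 'ed'. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> tuple:
    """Mimic the steps before the synonym filter: split, lowercase, stem."""
    return tuple(toy_stem(token) for token in text.lower().split())


RULES = ["mind, brain", "brain storm, brainstorm, envisage"]
# The rules are normalized with the same steps as the query text.
GROUPS = [[normalize(term) for term in rule.split(",")] for rule in RULES]


def expand(query: str) -> set:
    """Return the normalized query plus every single-rule synonym rewrite."""
    tokens = normalize(query)
    variants = {tokens}
    for group in GROUPS:
        for term in group:
            width = len(term)
            # Look for the term as a contiguous, ordered run of tokens.
            for start in range(len(tokens) - width + 1):
                if tokens[start:start + width] == term:
                    for alternative in group:
                        variants.add(
                            tokens[:start] + alternative + tokens[start + width:]
                        )
    return variants


print(expand("Brain Storming"))
print(expand("storm brain"))
```

In this sketch, "Brain Storming" is reduced to the tokens brain and storm before synonyms are looked up, so its variants include brainstorm and envisage. The reversed query "storm brain" only picks up the single-word mind/brain rule, because a multi-word synonym has to match its tokens in order.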

Mappings

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. Each document is a collection of fields, which each have their own data type. In this example we define the mapping for a document with a single field called my_new_text_field, which we define as text. This field will make use of my_index_time_analyzer when documents are indexed, and will make use of my_search_time_analyzer when documents are searched. The mapping looks as follows:

"mappings": { 
  "properties": { 
    "my_new_text_field": { 
      "type": "text", 
      "analyzer": "my_index_time_analyzer", 
      "search_analyzer": "my_search_time_analyzer" 
    } 
  } 
}

Bringing it together

Below we combine our custom analyzers and mappings and apply them to an index called test_index as follows:

PUT /test_index 
{ 
  "settings": { 
    "index": { 
      "analysis": { 
        "filter": { 
          "my_graph_synonyms": { 
            "type": "synonym_graph", 
            "synonyms": [ 
              "mind, brain", 
              "brain storm, brainstorm, envisage" 
            ] 
          } 
        }, 
        "analyzer": { 
          "my_index_time_analyzer": { 
            "tokenizer": "standard", 
            "filter": [ 
              "lowercase", 
              "stemmer" 
            ] 
          }, 
          "my_search_time_analyzer": { 
            "tokenizer": "standard", 
            "filter": [ 
              "lowercase", 
              "stemmer", 
              "my_graph_synonyms" 
            ] 
          } 
        } 
      } 
    } 
  }, 
  "mappings": { 
    "properties": { 
      "my_new_text_field": { 
        "type": "text", 
        "analyzer": "my_index_time_analyzer", 
        "search_analyzer": "my_search_time_analyzer" 
      } 
    } 
  } 
}
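
Before submitting settings like these, it can help to check that every filter an analyzer references is actually defined, since a misspelled filter name is an easy mistake to make. The following Python sketch does that for the "analysis" section above. The undefined_filters helper is hypothetical, and BUILTIN_FILTERS_USED_HERE is an assumption that covers only the two built-in filters used in this blog, not the full set that Elasticsearch ships with.

```python
# Assumption: only the built-in token filters that this blog uses.
BUILTIN_FILTERS_USED_HERE = {"lowercase", "stemmer"}


def undefined_filters(analysis: dict) -> dict:
    """Map each analyzer name to the filters it references but that are not defined."""
    custom = set(analysis.get("filter", {}))
    problems = {}
    for name, analyzer in analysis.get("analyzer", {}).items():
        missing = [
            f for f in analyzer.get("filter", [])
            if f not in custom and f not in BUILTIN_FILTERS_USED_HERE
        ]
        if missing:
            problems[name] = missing
    return problems


analysis = {
    "filter": {
        "my_graph_synonyms": {
            "type": "synonym_graph",
            "synonyms": ["mind, brain", "brain storm, brainstorm, envisage"],
        }
    },
    "analyzer": {
        "my_index_time_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "stemmer"],
        },
        "my_search_time_analyzer": {
            "tokenizer": "standard",
            "filter": ["lowercase", "stemmer", "my_graph_synonyms"],
        },
    },
}

print(undefined_filters(analysis))  # {}
```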

Testing our custom search-time analyzer

If we wish to see how an analyzer tokenizes and normalizes a given string, we can call the _analyze API directly as follows:

POST test_index/_analyze 
{ 
  "text" : "Brainstorm", 
  "analyzer": "my_search_time_analyzer" 
}

Testing on real documents

We can use the _bulk API to drive several documents into Elasticsearch as follows:

POST test_index/_bulk 
{ "index" : { "_id" : "1" } } 
{"my_new_text_field": "This is a brainstorm" } 
{ "index" : { "_id" : "2" } } 
{"my_new_text_field": "A different brain storm" } 
{ "index" : { "_id" : "3" } } 
{"my_new_text_field": "About brainstorming" } 
{ "index" : { "_id" : "4" } } 
{"my_new_text_field": "I had a storm in my brain" } 
{ "index" : { "_id" : "5" } } 
{"my_new_text_field": "I envisaged something like that" }
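
The body of a _bulk request is newline-delimited JSON: one action line followed by one document line, with a final newline at the end. If the documents come from application code, a body like the one above could be generated with a short Python sketch such as the following. The bulk_index_body helper is hypothetical and only builds the request body as a string; it does not send anything to Elasticsearch.

```python
import json


def bulk_index_body(docs: dict) -> str:
    """Build a newline-delimited _bulk body from a mapping of id to document."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(source))                      # document line
    # The body must be terminated by a final newline.
    return "\n".join(lines) + "\n"


docs = {
    "1": {"my_new_text_field": "This is a brainstorm"},
    "2": {"my_new_text_field": "A different brain storm"},
    "3": {"my_new_text_field": "About brainstorming"},
    "4": {"my_new_text_field": "I had a storm in my brain"},
    "5": {"my_new_text_field": "I envisaged something like that"},
}

body = bulk_index_body(docs)
print(body, end="")
```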

After driving the sample documents into test_index, we can execute the following search, which correctly returns documents #1, #2, #3, and #5:

GET test_index/_search 
{ 
  "query": { 
    "match": { 
      "my_new_text_field": "brain storm" 
    } 
  } 
}

The following search correctly returns only documents #2 and #4:

GET test_index/_search 
{ 
  "query": { 
    "match": { 
      "my_new_text_field": "brain" 
    } 
  } 
}

The following search correctly returns documents #1, #2, #3, and #5:

GET test_index/_search 
{ 
  "query": { 
    "match": { 
      "my_new_text_field": "brainstorming" 
    } 
  } 
}

The following search correctly returns documents #2 and #4:

GET test_index/_search 
{ 
  "query": { 
    "match": { 
      "my_new_text_field": "mind storm" 
    } 
  } 
}

And finally, the following search correctly returns only documents #2 and #4:

GET test_index/_search 
{ 
  "query": { 
    "match": { 
      "my_new_text_field": { 
        "query": "storm brain" 
      } 
    } 
  } 
}

Conclusion

In this blog I have demonstrated how you can combine stemmers and multi-word synonyms in Elasticsearch to improve search relevance.

If you don't yet have your own Elasticsearch cluster, you can spin up a free trial of Elastic Cloud in a few minutes, or download the Elastic Stack and run it locally. And if you prefer a pre-tuned search experience with drag-and-drop boosts and controls, check out Elastic App Search. With App Search, you can implement highly relevant search experiences with minimal effort, and it comes with an intuitive interface for tuning, curation, and analytics.