August 20, 2019 | How to

The same, but different: Boosting the power of Elasticsearch with synonyms

Using synonyms is undoubtedly one of the most important techniques in a search engineer's tool belt. While novices sometimes underestimate their importance, almost no real-life search system can work without them. At the same time, some complexities and subtleties arising from their use are sometimes underestimated, even by advanced users. Synonym filters are part of the analysis process that converts input text into searchable terms, and while they are relatively easy to get started with, their use can be quite varied and requires a deeper understanding of the underlying concepts before they can be applied successfully in a real-world scenario.

There have been several recent improvements around analysis in Elasticsearch. The most notable is probably the functionality for reloading search-time analyzers, which in turn enables search-time synonyms to be changed and reloaded. In addition to presenting this new API, this blog will answer some common questions around using synonyms and point out some frequent caveats around their use.

Why use synonyms?

To understand the usefulness and flexibility of synonyms, let’s take a quick look at how most of today's search engines work internally. Documents and queries are analyzed and reduced to their smallest units, often called tokens, which are essentially abstract symbols. Matching at search time then comes down to comparing these tokens for exact equality, which is why even a small spelling mistake (“hous”) or the plural of a word (“houses”) in a query won’t match a document containing only the singular (“house”). Stemmers and fuzzy queries address some of the most common of these problems, but they don’t bridge the gap between related concepts and ideas or between slightly different vocabulary in the documents and queries.
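To make the exact-match behavior concrete, here is a deliberately simplified Python sketch. The tokenize and matches helpers are made up for illustration and stand in for a real analyzer and a real inverted index lookup; they are not how Elasticsearch is implemented.

```python
def tokenize(text):
    """Lowercase and split on whitespace: a toy stand-in for a real analyzer."""
    return text.lower().split()


def matches(query, document):
    """A query matches only if one of its tokens is identical to a document token."""
    document_tokens = set(tokenize(document))
    return any(term in document_tokens for term in tokenize(query))


doc = "A small house by the lake"
print(matches("house", doc))   # True: identical token
print(matches("houses", doc))  # False: the plural is a different token
print(matches("hous", doc))    # False: so is the misspelling
```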

This is where synonyms shine. The Greek origins of the word are the prefix σύν (syn, “together”) and ὄνομα (ónoma, “name”). The origin of the term already shows that synonyms describe different words with exactly or nearly the same meaning in the same language or domain. In practice, this can range over general synonyms (“tired” vs. “sleepy”), abbreviations (“lb.” vs. “pound”), different spelling variations of products in ecommerce search (“iPod” vs. “i-Pod”), small language differences (like British English “lift” vs. American English “elevator”), expert vs. layperson language (“canine” vs. “dog”), or simply denoting the same concept in two ways (“universe” or “cosmos”). By providing appropriate synonym rules, the search engineer can tell the search engine which words in their domain mean similar things and should thus be treated similarly.

For a search engine it is important to know which terms in documents and queries should match, even though they look different. Since this is highly domain specific, users need to provide the appropriate rules. Synonym filters, which can be used in custom analyzers, replace tokens or add additional ones based on user-defined rules. This can happen either at index time, for example to store both variations of a word in an indexed document, or at query time, to expand the query terms and match more relevant documents. We’ll discuss some advantages and disadvantages of these two approaches a bit later on.
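As a rough illustration of what "adding tokens based on rules" means, the following Python sketch expands a token stream using comma-separated groups of equivalent terms. The helper names are hypothetical, and a real synonym filter works on a positioned token stream (synonyms are stacked at the same position) rather than on a flat list as done here.

```python
def build_synonym_map(rules):
    """Map every term in a comma-separated rule to all terms of that rule."""
    mapping = {}
    for rule in rules:
        terms = [term.strip().lower() for term in rule.split(",")]
        for term in terms:
            mapping[term] = terms
    return mapping


def expand_tokens(tokens, mapping):
    """Replace each token by its synonym group, or keep it if no rule applies."""
    expanded = []
    for token in tokens:
        expanded.extend(mapping.get(token, [token]))
    return expanded


synonym_map = build_synonym_map(["universe, cosmos", "lift, elevator"])
query_tokens = "take the lift".split()
print(expand_tokens(query_tokens, synonym_map))
# ['take', 'the', 'lift', 'elevator']
```

The same expansion could be applied to document tokens before indexing or to query tokens before searching, which is exactly the index-time vs. search-time choice discussed later.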

When to be mindful about using synonyms

Synonym filters are a very flexible tool, which leads people to overuse them in certain situations. For example, they’re sometimes used as a brute-force replacement for stemmers, with large synonym files containing grammatical variations of verbs and nouns. While this approach is possible, performance is usually worse and maintenance is harder than when using real stemmers or lemmatizers. The same goes for correcting spelling errors. If there are only a handful of very common spelling mistakes, such as in an ecommerce setting, trying to correct these using synonyms is sometimes advisable. But if the problem is more general, then fuzzy queries or character ngram techniques are more sustainable approaches. Also consider the alternatives to synonym expansion in the analysis chain. Sometimes enhancing documents in an ingest pipeline or some other client-side process is more flexible and manageable than using synonyms in the more restricted analysis process. For example, you could detect named entities in your documents using common named entity recognition (NER) frameworks and encode them as unique identifiers in your pre-processing pipeline or at ingest time. If you then apply the same process to your users' queries before sending them to Elasticsearch, you get the same effect but usually gain more control.
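The pre-processing idea can be sketched in a few lines of Python. Note the assumptions: the lookup table below is a hypothetical stand-in for the output of an NER step, and the identifier ent_0001 is invented for this example. The point is only that documents and queries go through the same function before they reach Elasticsearch.

```python
import re

# Hypothetical mention-to-identifier table; in practice this would come from
# an NER framework or a curated entity dictionary.
ENTITY_IDS = {
    "new york city": "ent_0001",
    "nyc": "ent_0001",
    "big apple": "ent_0001",
}


def normalize_entities(text, entity_ids=ENTITY_IDS):
    """Replace known entity mentions with their identifier, longest mention first."""
    result = text.lower()
    for mention in sorted(entity_ids, key=len, reverse=True):
        pattern = r"\b" + re.escape(mention) + r"\b"
        result = re.sub(pattern, entity_ids[mention], result)
    return result


# Apply the same normalization to documents (at ingest) and to queries.
print(normalize_entities("Hotels in New York City"))  # hotels in ent_0001
print(normalize_entities("NYC hotels"))               # ent_0001 hotels
```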

Also, it is tempting to use synonyms for other notions of “sameness,” like grouping certain species of animals under a common term, or even building taxonomy support for your domain. This is where things get really interesting and there is much to explore, but keep in mind synonyms are not always the best choice and can lead to your system behaving in unexpected ways if not used carefully.

Index- vs. search-time synonyms

Synonyms are used in analyzers that can be used at index time or at search time. One of the most frequent questions around the use of synonym filters in Elasticsearch is, "should I use them at index time, search time, or both?" Let’s look at applying synonym filtering at index time first. This means terms in indexed documents are replaced or expanded once and for all, and the result is persisted in the search index.

Index-time synonyms have several disadvantages:

The index might get bigger, because all synonyms must be indexed.
Search scoring, which relies on term statistics, might suffer because synonyms are also counted, and the statistics for less common words become skewed.
Synonym rules can’t be changed for existing documents without reindexing.

The last two, especially, are significant disadvantages. The only potential advantage of index-time synonyms is performance: you pay the cost of the expansion once, up front, instead of performing it again for every query and potentially ending up with more query terms to match. This, however, usually isn’t a real issue in practice.
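To see why term statistics get skewed, consider this toy Python sketch that counts document frequencies for a three-document corpus with and without index-time expansion. The corpus and helper names are invented for illustration, and real scoring involves more than document frequency, but the effect on a rare term is easy to see.

```python
from collections import Counter


def document_frequency(docs):
    """Count in how many documents each token occurs."""
    frequencies = Counter()
    for tokens in docs:
        frequencies.update(set(tokens))
    return frequencies


def expand_tokens(tokens, synonyms):
    """Index-time expansion: every token is replaced by its synonym group."""
    expanded = []
    for token in tokens:
        expanded.extend(synonyms.get(token, [token]))
    return expanded


docs = [
    ["the", "universe", "expands"],
    ["our", "universe"],
    ["the", "cosmos"],
]
synonyms = {
    "universe": ["universe", "cosmos"],
    "cosmos": ["universe", "cosmos"],
}

plain = document_frequency(docs)
expanded = document_frequency([expand_tokens(doc, synonyms) for doc in docs])

print(plain["cosmos"], expanded["cosmos"])      # 1 3
print(plain["universe"], expanded["universe"])  # 2 3
```

After expansion, the rare term “cosmos” looks just as common as “universe,” so a scoring function that rewards rare terms can no longer tell them apart.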

Using synonyms in search-time analyzers, on the other hand, avoids most of the above-mentioned problems:

The index size is unaffected.
The term statistics in the corpus stay the same.
Changes in the synonym rules don’t require reindexing of documents.

These advantages usually outweigh the only disadvantage of having to perform the synonym expansion each time at query time and potentially having more terms to match. On top of that, search-time synonym expansion allows for using the more sophisticated synonym_graph token filter, which can handle multi-word synonyms correctly and is designed to be used as part of a search analyzer only.
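As a sketch of what a search-only setup with synonym_graph could look like, the following Python snippet builds an index settings body containing a multi-word rule. The filter name, analyzer name, and rules are assumptions chosen for illustration; the resulting JSON would be sent as the body of an index creation request, and the analyzer would then be referenced as a field's search_analyzer.

```python
import json

# Hypothetical names and rules, for illustration only.
settings_body = {
    "settings": {
        "index": {
            "analysis": {
                "filter": {
                    "graph_synonyms": {
                        "type": "synonym_graph",
                        # Multi-word synonyms are the case synonym_graph is meant for.
                        "synonyms": ["ny, new york", "usa, united states of america"],
                    }
                },
                "analyzer": {
                    "search_synonym_analyzer": {
                        "tokenizer": "whitespace",
                        "filter": ["lowercase", "graph_synonyms"],
                    }
                },
            }
        }
    }
}

print(json.dumps(settings_body, indent=2))
```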

In general, the advantages of using synonyms at search time usually outweigh any slight performance gain you might get when using them at index time.

However, there used to be another caveat when using search-time synonyms. Although changing the synonym rules doesn’t require reindexing the documents, in order to change them you had to close and reopen the index temporarily. This was necessary because analyzers are instantiated at index creation time, when a node is restarted, or when a closed index is reopened. In order to make changes to a synonym rule file visible to the index, one had to first update the file on all nodes, then close and reopen the index. But this is no longer the case.

Synonyms, reloaded

Starting with Elasticsearch 8.10, you can use the new synonyms API for updating synonyms. See this blog post for more details. If you haven't upgraded to 8.10 yet, but you are on version 7.3 or higher, you can still avoid reopening indices when making synonym changes. We added a new endpoint that makes it possible to trigger reloading of analyzer resources on demand. Calling this new endpoint will reload all analyzers of an index that contain components marked as updateable. Marking a component as updateable, in turn, means it can only be used at search time.

For synonym filters, marking them as updateable and calling the reload API makes changes to the synonyms configuration file on each node visible to the analysis process. Updating synonym rules that are part of the filter definition (via the synonyms parameter) isn’t possible, but those should be mostly used for ad-hoc testing purposes. In any case, configuring synonyms using a configuration file has several advantages:

They are easier to manage! In a production system, there can be many synonym rules, and since those affect search relevance a lot, they should be treated as an integral part of the configuration that needs to be version controlled and tested with any update.
Synonyms are often derived from other sources or created by an algorithm running on your data. Reading from files skips the need to put them into the filter configuration.
The same synonym file can be used in different filters.
Larger synonym rule sets take up a lot of memory in the Elasticsearch cluster state, which stores meta information about index settings. To avoid growing the cluster state unnecessarily, it is advisable to store larger synonym rule sets in configuration files.
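If your synonyms come from another source or an algorithm, generating the rule file can be a small script. The sketch below formats rules in the two notations used in this post (comma-separated equivalent terms and “=>” replace rules). The function name and the sample data are assumptions, and distributing the resulting file to the config directory of every node is left out.

```python
def format_rules(equivalents, replacements):
    """Render synonym rules as text, one rule per line.

    equivalents:  list of term groups, e.g. [["universe", "cosmos"]]
    replacements: list of (source, target) pairs, e.g. [("i-pod", "ipod")]
    """
    lines = [", ".join(group) for group in equivalents]
    lines.extend(f"{source} => {target}" for source, target in replacements)
    return "\n".join(lines) + "\n"


rules_text = format_rules(
    equivalents=[["universe", "cosmos"], ["lift", "elevator"]],
    replacements=[("i-pod", "ipod")],
)
print(rules_text, end="")

# Writing the file is then a one-liner (the path is an assumption):
# pathlib.Path("my_synonyms.txt").write_text(rules_text, encoding="utf-8")
```

Because the output is plain text, it can be version controlled and reviewed like any other configuration change.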

For demonstration purposes, let’s assume you put a my_synonyms.txt file into the config directory of each of your Elasticsearch nodes, and that it initially contains only the following rule:

universe, cosmos

Next, we need to define an analyzer that references this file in a synonym filter:

PUT /synonym_test
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "whitespace",
            "filter": ["my_synonyms"]
          }
        },
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms_path": "my_synonyms.txt",
            "updateable": true
          }
        }
      }
    }
  }
}

Note that we marked the synonym filter as updateable. This is important because only updateable filters are reloaded when we call the new reloading endpoint. It also has the side effect that analyzers containing updateable filters can no longer be used at index time. But first, let’s check that synonyms are applied correctly by running a quick test through the _analyze endpoint:

GET /synonym_test/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "cosmos"
}

This should return two tokens, one of them being “universe,” as expected. Let’s add another rule to the my_synonyms.txt file by adding a second line:

lift, elevator

This is the point where you previously had to close and reopen the index for these changes to show up. Now you can simply call the new endpoint:

POST /synonym_test/_reload_search_analyzers

The request doesn’t require a body, and it can target one or more indices using the usual index wildcard patterns. The response includes information about which analyzers have been reloaded and which nodes have been affected:

{
  [...],
  "reload_details": [{
    "index": "synonym_test",
    "reloaded_analyzers": ["synonym_analyzer"],
    "reloaded_node_ids": ["FXbmbgG_SsOrNRssrYcPow"]
  }]
}

Running the above _analyze request on the term “lift” now also returns “elevator” as a second synonym token.
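If you want to trigger the reload from a script rather than from Console, a minimal Python sketch using only the standard library could look like this. The base URL http://localhost:9200, the missing authentication, and the helper names are assumptions. The snippet only builds the request and parses a sample response shaped like the one above; the lines that would actually send it are left commented out.

```python
import json
import urllib.request


def build_reload_request(base_url, index_pattern):
    """Build (but do not send) a POST request to the reload endpoint."""
    url = f"{base_url.rstrip('/')}/{index_pattern}/_reload_search_analyzers"
    return urllib.request.Request(url, method="POST")


def summarize_reload(response_body):
    """Map each index in the response to the analyzers reported as reloaded."""
    details = json.loads(response_body).get("reload_details", [])
    return {detail["index"]: detail["reloaded_analyzers"] for detail in details}


request = build_reload_request("http://localhost:9200", "synonym_*")
print(request.get_method(), request.full_url)

# To send it against a running cluster (assumed reachable without auth):
# with urllib.request.urlopen(request) as response:
#     print(summarize_reload(response.read()))

sample = (
    '{"reload_details": [{"index": "synonym_test", '
    '"reloaded_analyzers": ["synonym_analyzer"], '
    '"reloaded_node_ids": ["FXbmbgG_SsOrNRssrYcPow"]}]}'
)
print(summarize_reload(sample))
```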

A few things to note, though. As mentioned above, a filter that is marked as updateable can only be used at search time, so the correct way of using the synonym analyzer we defined above on a field would be:

POST /synonym_test/_mapping
{
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "standard",
      "search_analyzer": "synonym_analyzer"
    }
  }
}

Also, reloading works only for synonyms that are loaded from files; changing the synonyms defined via settings in a filter is not supported. Lastly, in practice you need to make sure to apply updates to synonym files across all nodes of your cluster. If the analyzers on different nodes see different versions of the file, you might get differing search results depending on which node handles a search. If you notice such inconsistencies for queries involving a synonym, the first thing to check is that your synonym files are the same on each node, and then retrigger the reload.
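One simple way to check that the files are in sync is to compare checksums. The Python sketch below assumes you have already collected the file contents per node by whatever means fits your deployment (that part is not shown); the node names and the helper function are hypothetical. It reports the nodes whose file differs from the most common version.

```python
import hashlib
from collections import Counter


def find_out_of_sync(files_by_node):
    """Return the nodes whose file content differs from the most common version.

    If several versions are equally common, the one seen first is treated
    as the reference, so inspect the result manually in that case.
    """
    digests = {
        node: hashlib.sha256(content).hexdigest()
        for node, content in files_by_node.items()
    }
    reference = Counter(digests.values()).most_common(1)[0][0]
    return sorted(node for node, digest in digests.items() if digest != reference)


files = {
    "node-1": b"universe, cosmos\nlift, elevator\n",
    "node-2": b"universe, cosmos\nlift, elevator\n",
    "node-3": b"universe, cosmos\n",  # missed the latest update
}
print(find_out_of_sync(files))  # ['node-3']
```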

In summary, the new _reload_search_analyzers endpoint allows you to quickly revise and change query-time synonyms without the need to reopen your indices. For example, by examining your query logs you can determine whether users search with different terms than those in the indexed documents, and then add the corresponding synonyms on the fly. Adding synonyms can have unexpected side effects on relevance scoring though, so it is advisable to perform some sort of testing (be it A/B testing or something like the ranking evaluation API) before applying changes directly in production.

Being a part of the (analysis) chain gang

Another frequently asked question around synonym filters is their behavior in more complex analysis chains. In most scenarios you will put some common character or token filters in front of your synonym filter, such as a lowercase filter. This means all tokens passing the analysis chain will be lowercased before applying the synonym filter. Does this mean the input synonyms in your synonym rules need to be lowercased as well in order to match? Let's try this simple example:

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym_analyzer": {
            "tokenizer": "whitespace",
            "filter": ["lowercase", "my_synonyms"]
          }
        },
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": ["Eins, Uno, One", "Cosmos => Universe"]
          }
        }
      }
    }
  }
}

GET /test_index/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "one"
}

You can verify that the lowercase input text gets expanded to three tokens in the above example, which shows that the lowercasing is also applied to the synonym filter rules. The right-hand side of replace rules like the “Cosmos => Universe” rule above is rewritten as well, as you can see from the lowercase output of:

GET /test_index/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "cosmos"
}
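Conceptually, the rules are passed through the preceding filters when they are parsed. The following Python sketch mimics this for a single lowercase step; parse_rule is a made-up helper that only handles the two rule notations used in this post, and the ordering of tokens in real _analyze output may differ.

```python
def parse_rule(rule, normalize):
    """Parse one synonym rule, normalizing every term on both sides."""
    if "=>" in rule:
        left, right = rule.split("=>")
        inputs = [normalize(term.strip()) for term in left.split(",")]
        outputs = [normalize(term.strip()) for term in right.split(",")]
    else:
        inputs = [normalize(term.strip()) for term in rule.split(",")]
        outputs = inputs
    return {term: outputs for term in inputs}


rules = {}
for rule in ["Eins, Uno, One", "Cosmos => Universe"]:
    rules.update(parse_rule(rule, str.lower))

print(rules["one"])     # ['eins', 'uno', 'one']
print(rules["cosmos"])  # ['universe']
```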

In general, synonym filters rewrite the terms in their rules using the tokenizer and filters that precede them in the analysis chain. However, there are some notable exceptions: several filters that output stacked tokens (such as common_grams or the phonetic filter) are not allowed to precede synonym filters, and Elasticsearch throws an error if you try to use them that way. Others, like the word compound filters or synonym filters themselves, are skipped during this rule rewriting when they precede another synonym filter in the chain. The latter rule is important to make chaining of synonym filters possible. We will see this in action in the following example.

So what happens if you put two or more synonym filters in a row? Will the output of the former be the input of the latter, making chaining of synonym filters somewhat of a transitive operation? Let’s try the following example:

PUT /synonym_chaining
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "first_synonyms": {
            "type": "synonym",
            "synonyms": ["a => b", "e => f"]
          },
          "second_synonyms": {
            "type": "synonym",
            "synonyms": ["b => c", "d => e"]
          }
        },
        "analyzer": {
          "synonym_analyzer": {
            "filter": [
              "first_synonyms",
              "second_synonyms"
            ],
            "tokenizer": "whitespace"
          }
        }
      }
    }
  }
}

GET /synonym_chaining/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "a"
}

The output token is “c”, which shows that both filters are applied in consecutive order, with the first filter replacing “a” with “b”, and the second replacing that result with “c”. If instead you try “d” as input, it gets replaced with “e” (no rule in the first filter applies, so the second filter’s rule does). But if you use “e” as input, the token gets replaced with “f” in the first filter, leaving the second filter nothing to match on.
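The chaining behavior is easy to reproduce in a toy Python model in which each filter is a plain dictionary of replace rules, applied one after the other. This is only a simulation of the example above, with made-up helper names, not the actual filter implementation.

```python
def apply_filter(tokens, rules):
    """Replace each token according to the rules, or keep it unchanged."""
    return [rules.get(token, token) for token in tokens]


def analyze(text, filters):
    """Whitespace-tokenize the text, then run the filters in order."""
    tokens = text.split()
    for rules in filters:
        tokens = apply_filter(tokens, rules)
    return tokens


first_synonyms = {"a": "b", "e": "f"}
second_synonyms = {"b": "c", "d": "e"}

for text in ["a", "d", "e"]:
    print(text, "->", analyze(text, [first_synonyms, second_synonyms]))
# a -> ['c']
# d -> ['e']
# e -> ['f']
```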

Remember the exceptions to rewriting rules against preceding token filters that we just talked about? If the second_synonyms filter in the example above had applied the rules of the first filter to its own rule set, it would have changed its d => e rule to d => f (because the preceding filter's e => f rule would have been applied). This behavior used to be a source of confusion in earlier versions of Elasticsearch, and it is the reason why synonym filters are now skipped when processing the synonym rules of a following filter. Chaining works as described here in version 6.6 and later.

Back to the future

In this short blog, we just scratched the surface of what you can accomplish using synonyms and tried to answer some frequent questions around their usage. Synonyms are a powerful tool that can be leveraged to increase the recall of your search system, but there are many subtleties that are important to know and experiment with, especially in conjunction with systematic relevance testing.

The new synonyms API added in 8.10 and the reload search-time analyzers API added in Elasticsearch 7.3 make this kind of experimentation easier: you can update synonym rules that are applied at search time without closing and reopening your indices or taking them offline. This, however, is only one step in a series of improvements that we want to introduce to make managing synonyms across a large cluster friendlier for users. Let us know what you think, and drop us some feedback or questions in our Discuss forum. Until then, happy analyzing*!

* Ironically, there is no synonym for this usage of “analyzing”...