
Dictionary update behavior for CJK language analyzers

Every year, many visitors come to Japan to see sakura (cherry blossoms) in spring. If you plan to visit Japan, you may want to search for "cherry blossom" in Japanese, or "さくら". When you search for "さくら" in Elasticsearch, you are using full-text search, and full-text search requires analyzers. As an earlier blog post shows, the default standard analyzer is not the best choice for CJK (Chinese, Japanese, and Korean) text, so we need to use language analyzers. For language analyzers, dictionaries play an important role in deciding how a tokenizer splits text into tokens.

For example, in the Japanese analyzer Kuromoji, by default, the word 東京スカイツリー will be divided into three parts: 東京, スカイ and ツリー. However, スカイツリー, which means Skytree, is a meaningful word referring to the Tokyo Skytree broadcasting tower in Japan. In this case, sky and tree should not be split into separate tokens. To achieve meaningful tokenization (word breaking), we can use the user dictionary provided in the Kuromoji tokenizer. Besides the user dictionary, there are also synonym dictionaries that can be applied for fine-grained tuning.
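As a rough illustration, the index settings for a Kuromoji tokenizer with a user dictionary might look like the sketch below. It is written as a Python dictionary that serializes to the JSON body of a create-index request. The names kuromoji_user_dict and my_analyzer and the file name userdict_ja.txt are placeholders chosen for this example, and the dictionary entry assumes the text, tokens, readings, part-of-speech layout used by the Kuromoji plugin.

```python
import json

# One user dictionary entry, assumed format:
#   <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
# A line like this would live in the user dictionary file on each node.
USER_DICT_ENTRY = "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"


def build_index_settings(user_dictionary_file: str) -> dict:
    """Return create-index settings that wire a user dictionary into Kuromoji."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "tokenizer": {
                        "kuromoji_user_dict": {  # placeholder name
                            "type": "kuromoji_tokenizer",
                            "user_dictionary": user_dictionary_file,
                        }
                    },
                    "analyzer": {
                        "my_analyzer": {  # placeholder name
                            "type": "custom",
                            "tokenizer": "kuromoji_user_dict",
                        }
                    },
                }
            }
        }
    }


settings = build_index_settings("userdict_ja.txt")
print(json.dumps(settings, ensure_ascii=False, indent=2))
```

With an entry like this, 東京スカイツリー would be expected to come out as the two tokens 東京 and スカイツリー instead of three.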

Recently we have received questions related to the behavior of dictionary updates in Elasticsearch indices. In this blog, we’d like to go over some general concepts to help users gain a better understanding of dictionary updates and improve their search experience.

This article is primarily oriented towards CJK analyzers, but the basic concept is the same for other languages as well.

Background knowledge

Analyzers work by applying, in order, zero or more character filters, exactly one tokenizer, and then zero or more token filters.
By default, analyzer settings are applied at both indexing time and search time. The target of index time analysis is the source data, and the target of search time analysis is the query string. Thus, it's important to be aware that changes to the dictionary also affect both indexing and searching.
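The ordering can be pictured with a toy pipeline. This is only a sketch of the concept, not how Elasticsearch implements analysis: the three stage functions are stand-ins invented for this example, and a real CJK tokenizer would consult a dictionary instead of splitting on whitespace.

```python
import unicodedata
from typing import List


def char_filter(text: str) -> str:
    # Character filter: normalize the raw text (full-width letters -> ASCII).
    return unicodedata.normalize("NFKC", text)


def tokenizer(text: str) -> List[str]:
    # Tokenizer: split on whitespace (a real CJK tokenizer uses a dictionary).
    return text.split()


def token_filter(tokens: List[str]) -> List[str]:
    # Token filter: lowercase every token.
    return [t.lower() for t in tokens]


def analyze(text: str) -> List[str]:
    # The stages always run in this order.
    return token_filter(tokenizer(char_filter(text)))


# The same analysis is applied to the document at index time
# and to the query string at search time.
indexed_tokens = analyze("Ｔｏｋｙｏ Skytree")
query_tokens = analyze("tokyo")
print(indexed_tokens, query_tokens)
```

A query matches only when its tokens line up with the tokens stored at index time, which is why a change in any stage affects both sides.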

Dictionary update behavior

In Elasticsearch, the analyzer reads the dictionary when the index is opened or loaded, and in general it will not read the dictionary again after that. Thus, for a dictionary update to take effect, you need to restart the nodes that hold the dictionary files, or you can simply call the _close and then the _open API on the target index.
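A minimal sketch of the close-then-open sequence, using only the Python standard library. The base URL (a local cluster without security) and the index name my-index are assumptions for illustration. The builder function only constructs the requests, while reopen_index actually sends them and needs a reachable cluster.

```python
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:9200"  # assumption: local, unsecured cluster


def build_reopen_requests(index: str, base_url: str = BASE_URL) -> list:
    """Build the two requests that close and then reopen an index."""
    return [
        Request(f"{base_url}/{index}/_close", method="POST"),
        Request(f"{base_url}/{index}/_open", method="POST"),
    ]


def reopen_index(index: str) -> None:
    """Send the requests in order; this needs a reachable cluster."""
    for request in build_reopen_requests(index):
        with urlopen(request) as response:
            print(request.full_url, response.status)


requests_to_send = build_reopen_requests("my-index")  # hypothetical index name
for r in requests_to_send:
    print(r.get_method(), r.full_url)
```

Keep in mind that a closed index cannot serve reads or writes until it is reopened.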

However, it should be noted here that the dictionary update will not be applied to documents that have already been indexed. This is because these documents were indexed using the analyzer before the dictionary was updated. For details on how to reflect the change in existing indexed documents, see "How to reflect dictionary updates in existing indices" below.

Updates will take effect on documents that are indexed after the change, and on search queries as well. The search text will be newly tokenized according to the updated content defined in the dictionary.

Impact on search results

Depending on the situation, a dictionary update may or may not change search results. In the worst case, a query that used to match returns no hits.

The good side: there is no impact on search results

Suppose we add the following new definition to a synonym dictionary so that the terms japan and sushi are treated as synonyms:

japan, sushi

In documents indexed before the update, japan and sushi would be registered as separate, unrelated tokens because the synonym rule did not exist yet. But when a user searches for either japan or sushi, the search term is looked up in the updated synonym dictionary at search time and expanded to both terms, so results for both are returned.
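The effect can be reproduced with a toy inverted index. This is a simplified model under the assumption that synonyms are expanded on the query string at search time; the documents and the synonyms mapping are made up for the example.

```python
from collections import defaultdict

# Documents indexed BEFORE the synonym rule existed: one token per word.
docs = {1: "japan travel guide", 2: "best sushi in town"}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        inverted_index[token].add(doc_id)

# Synonym rule added afterwards and applied only to the query string.
synonyms = {"japan": {"japan", "sushi"}, "sushi": {"japan", "sushi"}}


def search(query: str) -> set:
    """Return ids of documents matching any query token or its synonyms."""
    hits = set()
    for token in query.split():
        for term in synonyms.get(token, {token}):
            hits |= inverted_index.get(term, set())
    return hits


print(search("japan"))
```

Both documents are found even though neither was re-indexed, because the old tokens japan and sushi are still present in the index and the query is expanded to cover them.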

The flip side: there is an impact on search results

As mentioned at the top of this article, some CJK language analyzers (e.g., Japanese Kuromoji analyzer, Korean Nori analyzer, Chinese IK analyzer) have a feature called a user dictionary (or a similar term), which can be used for tuning tokenization.

For instance, in the Japanese Kuromoji analyzer, the user can update the dictionary to prevent 東京大学 from being divided into the two tokens 東京 and 大学 (the default behavior) and keep 東京大学 as one token.

In this scenario, in existing documents indexed before the update, 東京大学 would be stored as two tokens, 東京 and 大学, and there would be no single 東京大学 token. Thus, after the dictionary update, when the user searches for 東京大学, the query will be performed with the single token 東京大学, and will return no hits.
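The mismatch can be sketched with a toy model. The greedy longest-match tokenizer below is a stand-in for illustration only, not Kuromoji's actual algorithm, and the one-document index is invented for the example.

```python
from collections import defaultdict


def tokenize(text, dictionary):
    """Greedy longest-match tokenizer (a toy stand-in, not Kuromoji)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def build_index(docs, dictionary):
    """Map each token to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text, dictionary):
            index[token].add(doc_id)
    return index


def search(index, query, dictionary):
    """Return ids of documents containing every query token."""
    token_hits = [index.get(t, set()) for t in tokenize(query, dictionary)]
    return set.intersection(*token_hits) if token_hits else set()


old_dict = {"東京", "大学"}
new_dict = old_dict | {"東京大学"}
docs = {1: "東京大学"}

index = build_index(docs, old_dict)           # indexed before the update
before = search(index, "東京大学", old_dict)  # query tokens: 東京, 大学
after = search(index, "東京大学", new_dict)   # query token: 東京大学
reindexed = search(build_index(docs, new_dict), "東京大学", new_dict)
print(before, after, reindexed)
```

Here before contains document 1, after is empty because the stored tokens no longer line up with the query token, and reindexed finds the document again once it has been analyzed with the updated dictionary.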

How to reflect dictionary updates in existing indices

In Elasticsearch, when updating an existing indexed document, the document is deleted and re-created internally, at which time the language analyzer processes the document again. Therefore, we can update the existing document to let the dictionary update take effect. A handy way to do this is to use the Update By Query API to update whole indices or specific parts of an index. The document will then be analyzed again based on the updated dictionary.
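As a sketch, an Update By Query request without a script re-indexes each matching document with its source unchanged, which is enough to run it through the analyzer again. The base URL, the index name my-index, and the category field in the filter are assumptions for illustration. The builder only constructs the request, and sending it with urlopen requires a reachable cluster.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:9200"  # assumption: local, unsecured cluster


def build_update_by_query(index, query=None, base_url=BASE_URL) -> Request:
    """Build an _update_by_query request; with no query it covers the whole index."""
    # conflicts=proceed keeps going when a document changed during the run.
    url = f"{base_url}/{index}/_update_by_query?conflicts=proceed"
    if query is None:
        return Request(url, method="POST")
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Whole index (hypothetical index name).
full = build_update_by_query("my-index")

# Only part of the index, selected by a hypothetical keyword field.
partial = build_update_by_query("my-index", {"term": {"category": "university"}})

print(full.get_method(), full.full_url)
print(partial.data.decode("utf-8"))
# To actually run it: urlopen(full)
```

On a large index this rewrites every matching document, so it may be worth limiting the query to the documents that contain the affected words.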

We hope this brief walkthrough was helpful in understanding how dictionary updates behave and how to apply them to existing indices. You can read our documentation for more details regarding the settings of the different analyzers (Japanese Kuromoji analyzer, Korean Nori analyzer, Chinese IK analyzer). If you have any concerns or questions, please feel free to raise them on our discuss forum. Our engineers are happy to help.