How to implement Japanese full-text search in Elasticsearch

Full-text search is a common, but very important, part of a great search experience. However, it can be difficult to implement in some languages, and Japanese is one of them. In this blog, we'll explore the challenges of implementing full-text search in Japanese and demonstrate some ways you can overcome these challenges in Elasticsearch.

So, what is full-text search?

Per Wikipedia, here is the formal definition:

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). 

In non-technical terms, full-text search is what powers a lot of the digital experiences you have today. It's the type of search that will try to find a word or phrase anywhere it could be hiding in a dataset. So when you are shopping online and search for "phone," full-text search would bring back any product that: has been classified as a phone, has phone in the description, has phone in the manufacturer's name, etc. This includes matches on "telephone," "saxophone," etc. 

This type of search is extremely convenient for finding what you want as quickly as possible. It's also extremely inconvenient if the search engine isn't properly configured and managed. When you search for "phone," you're probably not searching for a saxophone, and a well-maintained search engine knows that. And this is where things get interesting!

Precision and recall

Precision and recall are the two common ways to measure the quality of a full-text search system. Precision represents "how small the search noise is" and recall represents "how small the search omissions are." For example, a precise search may only return results for "phone" when an item is classified as a phone. A search with high recall would return all items that have the word "phone" in them, even saxophones.

There is always a trade-off between precision and recall, so you need to decide for yourself what's best for your use case, then test and adjust it incrementally. 
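
To make the two measures concrete, here is a minimal Python sketch. The product names and both result sets are made up for illustration and are not taken from a real search engine; they only mirror the "phone" example above.

```python
# Toy catalog: the products a shopper searching for "phone" actually wants.
relevant = {"smartphone", "telephone", "flip phone"}

# Two hypothetical result sets for the query "phone".
strict_results = {"smartphone"}  # only items classified as phones
loose_results = {"smartphone", "telephone", "flip phone", "saxophone", "headphone stand"}

def precision(results, relevant):
    """Share of returned items that are relevant (little noise means high precision)."""
    return len(results & relevant) / len(results) if results else 0.0

def recall(results, relevant):
    """Share of relevant items that were returned (few omissions means high recall)."""
    return len(results & relevant) / len(relevant) if relevant else 0.0

for name, results in [("strict", strict_results), ("loose", loose_results)]:
    print(name, round(precision(results, relevant), 2), round(recall(results, relevant), 2))
# strict 1.0 0.33
# loose 0.6 1.0
```

The strict search returns no noise but misses two phones, while the loose search finds every phone at the cost of a saxophone and a headphone stand.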

Inverted index and analyze

You are probably familiar with the index in the back of a book: the useful terms are sorted and listed together with page numbers so you know where to find each term. Elasticsearch uses a similar structure, the inverted index, for full-text search.

At index time, text strings are analyzed and added to the inverted index, which makes them very easy to find. The query string is analyzed at search time too. Different filters (e.g., lowercase, stop words, stemming) can be applied during analysis. Using the same analyzer for both indexing and search is the common case, but for special use cases, such as the Japanese full-text search discussed in this blog, different analyzers can be used for each.

This string analysis process creates a detailed index of words and phrases, making the data full-text searchable.
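
As a rough illustration of the idea, the following Python sketch builds a toy inverted index. The `analyze` function, the stop word list, and the three documents are assumptions made up for this example; real Elasticsearch analyzers and Lucene index structures are far more sophisticated.

```python
from collections import defaultdict

def analyze(text):
    """Very small analyzer: lowercase, split on whitespace, drop stop words."""
    stop_words = {"the", "a", "of"}
    return [tok for tok in text.lower().split() if tok not in stop_words]

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The Phone of the Year",
    2: "A saxophone case",
    3: "Phone case",
}
index = build_inverted_index(docs)

def search(query):
    """Analyze the query the same way, then intersect the posting lists."""
    postings = [set(index.get(term, [])) for term in analyze(query)]
    return sorted(set.intersection(*postings)) if postings else []

print(search("PHONE"))       # [1, 3]
print(search("phone case"))  # [3]
```

Because the query goes through the same `analyze` step as the documents, "PHONE" still finds the lowercase term "phone" in the index.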

What’s different about full-text search in Japanese?

That's a good question. Before I answer it, let me ask it again… 

日本語での全文検索は英語と何が違いますか?

That's the same question, but it looks very different. And this is why full-text search is different too.

Word breaks don’t depend on whitespace

To analyze a text string, an analyzer is required. In most European languages (including English), words are separated with whitespace, which makes it easy to divide a sentence into words. However, in Japanese, individual words are not separated with whitespace. So how do we do it?

Two methods to analyze Japanese words

Since Japanese words cannot be split on whitespace, the inverted index is mainly created with one of the following two methods.

  • n-gram analysis: Separate text strings by N characters
  • Morphological analysis: Divide into meaningful words using a dictionary

However, neither method is enough on its own:

  • In n-gram, the index tends to be bloated. Processing based on part-of-speech information is impossible and there are many meaningless divisions. (Fewer search omissions, but more search noise.)
  • Morphological analysis is weak with new words (unknown words). In a dictionary-based case, words that are not in the dictionary cannot be detected. (Less search noise but more search omissions.)

As you can see, this is exactly the above-mentioned trade-off.
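
The short Python sketch below illustrates why whitespace splitting fails for Japanese and what n-gram analysis produces instead. The helper `char_ngrams` is a hypothetical name introduced here, not an Elasticsearch API.

```python
def char_ngrams(text, n=2):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

english = "University of Tokyo"
japanese = "東京大学"

print(english.split())        # ['University', 'of', 'Tokyo']
print(japanese.split())       # ['東京大学'] (no whitespace, so no word breaks)
print(char_ngrams(japanese))  # ['東京', '京大', '大学']
```

Note the middle gram "京大": it happens to be a common abbreviation of Kyoto University, a word the writer never intended. This is the kind of meaningless division, and therefore search noise, that n-gram analysis introduces.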

Japanese full-text search needs both!

Based on the above analysis gaps, Japanese full-text search should use both analysis types, with the strengths of each making up for their different weaknesses.

  • Fewer search omissions when using n-gram analysis
  • Less search noise when using morphological analysis

With these analyses working together, full-text search is possible.

Implementing Japanese full-text search using both morphological and n-gram analysis with Elasticsearch

Elasticsearch can combine both kinds of analysis, which makes it possible to implement Japanese full-text search that emphasizes both precision and recall.

Implementation summary and example

In short, you will need to define two types of analyzers for n-gram and morphological analysis in your index mappings and settings, and then assign them to each field. At index time, you'll create the inverted indices via these two types of analyzers. At search time, you'll search both of the two inverted indices.

Before going into the detailed explanation, let's first look at an example of implementing full-text search in Japanese. 

Example implementation requirements

  • Japanese full-text search is possible. For example, if you search for "東京大学" (University of Tokyo), documents including "東京大学" will be displayed. Also, if you search for "米国," (US) documents that include "米国" will be displayed.
  • In addition, users can search for synonyms. For example, if you search for "東京大学," documents including "東大" (short way to say "東京大学") will be displayed. Also, if you search for "米国," documents that include "アメリカ" (America) will be displayed.

Preparation for example implementation

  • Install the analysis-icu and analysis-kuromoji plugins for analysis.
  • Create an index for full-text search. In this example, the index name is my_fulltext_search. Then use my_field as the field for full-text search.

Example: Configuration of index settings and mappings

PUT my_fulltext_search
{ 
  "settings": { 
    "analysis": { 
      "char_filter": { 
        "normalize": { 
          "type": "icu_normalizer", 
          "name": "nfkc", 
          "mode": "compose" 
        } 
      }, 
      "tokenizer": { 
        "ja_kuromoji_tokenizer": { 
          "mode": "search", 
          "type": "kuromoji_tokenizer", 
          "discard_compound_token": true, 
          "user_dictionary_rules": [ 
            "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞" 
          ] 
        }, 
        "ja_ngram_tokenizer": { 
          "type": "ngram", 
          "min_gram": 2, 
          "max_gram": 2, 
          "token_chars": [ 
            "letter", 
            "digit" 
          ] 
        } 
      }, 
      "filter": { 
        "ja_index_synonym": { 
          "type": "synonym", 
          "lenient": false, 
          "synonyms": [ 
          ] 
        }, 
        "ja_search_synonym": { 
          "type": "synonym_graph", 
          "lenient": false, 
          "synonyms": [ 
            "米国, アメリカ", 
            "東京大学, 東大" 
          ] 
        } 
      }, 
      "analyzer": { 
        "ja_kuromoji_index_analyzer": { 
          "type": "custom", 
          "char_filter": [ 
            "normalize" 
          ], 
          "tokenizer": "ja_kuromoji_tokenizer", 
          "filter": [ 
            "kuromoji_baseform", 
            "kuromoji_part_of_speech", 
            "ja_index_synonym", 
            "cjk_width", 
            "ja_stop", 
            "kuromoji_stemmer", 
            "lowercase" 
          ] 
        }, 
        "ja_kuromoji_search_analyzer": { 
          "type": "custom", 
          "char_filter": [ 
            "normalize" 
          ], 
          "tokenizer": "ja_kuromoji_tokenizer", 
          "filter": [ 
            "kuromoji_baseform", 
            "kuromoji_part_of_speech", 
            "ja_search_synonym", 
            "cjk_width", 
            "ja_stop", 
            "kuromoji_stemmer", 
            "lowercase" 
          ] 
        }, 
        "ja_ngram_index_analyzer": { 
          "type": "custom", 
          "char_filter": [ 
            "normalize" 
          ], 
          "tokenizer": "ja_ngram_tokenizer", 
          "filter": [ 
            "lowercase" 
          ] 
        }, 
        "ja_ngram_search_analyzer": { 
          "type": "custom", 
          "char_filter": [ 
            "normalize" 
          ], 
          "tokenizer": "ja_ngram_tokenizer", 
          "filter": [ 
            "ja_search_synonym", 
            "lowercase" 
          ] 
        } 
      } 
    } 
  }, 
  "mappings": { 
    "properties": { 
      "my_field": { 
        "type": "text", 
        "search_analyzer": "ja_kuromoji_search_analyzer", 
        "analyzer": "ja_kuromoji_index_analyzer", 
        "fields": { 
          "ngram": { 
            "type": "text", 
            "search_analyzer": "ja_ngram_search_analyzer", 
            "analyzer": "ja_ngram_index_analyzer" 
          } 
        } 
      } 
    } 
  } 
}

Example: Data preparation

As mentioned in the above requirements, prepare documents that include the words "米国" (US), "アメリカ" (America), "東京大学" (University of Tokyo), and "東大" (short way to say "東京大学").

# Data preparation
POST _bulk
{"index": {"_index": "my_fulltext_search", "_id": 1}}
{"my_field": "アメリカ"}
{"index": {"_index": "my_fulltext_search", "_id": 2}}
{"my_field": "米国"}
{"index": {"_index": "my_fulltext_search", "_id": 3}}
{"my_field": "アメリカの大学"}
{"index": {"_index": "my_fulltext_search", "_id": 4}}
{"my_field": "東京大学"}
{"index": {"_index": "my_fulltext_search", "_id": 5}}
{"my_field": "帝京大学"}
{"index": {"_index": "my_fulltext_search", "_id": 6}}
{"my_field": "東京で夢の大学生活"}
{"index": {"_index": "my_fulltext_search", "_id": 7}}
{"my_field": "東京大学で夢の生活"}
{"index": {"_index": "my_fulltext_search", "_id": 8}}
{"my_field": "東大で夢の生活"}
{"index": {"_index": "my_fulltext_search", "_id": 9}}
{"my_field": "首都圏の大学 東京"}

Example: Search query preparation and search result

A. Search query and results with keyword "米国"

If you search for "米国" (US), documents that include "米国" will be displayed, along with documents that include its synonym "アメリカ" (America).

# search query 
GET my_fulltext_search/_search
{ 
  "query": { 
    "bool": { 
      "must": [ 
        { 
          "multi_match": { 
            "query": "米国", 
            "fields": [ 
              "my_field.ngram^1" 
            ], 
            "type": "phrase" 
          } 
        } 
      ], 
      "should": [ 
        { 
          "multi_match": { 
            "query": "米国", 
            "fields": [ 
              "my_field^1" 
            ], 
            "type": "phrase" 
          } 
        } 
      ] 
    } 
  } 
} 
# result 
{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 10.823313,
    "hits" : [
      {
        "_index" : "my_fulltext_search",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 10.823313,
        "_source" : {
          "my_field" : "米国"
        }
      },
      {
        "_index" : "my_fulltext_search",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 9.0388565,
        "_source" : {
          "my_field" : "アメリカ"
        }
      },
      {
        "_index" : "my_fulltext_search",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 7.0624585,
        "_source" : {
          "my_field" : "アメリカの大学"
        }
      }
    ]
  }
}

B. Search query and results with keyword "東京大学"

If you search for "東京大学" (University of Tokyo), documents that include "東京大学" will be displayed, along with documents that include "東大" (short way to say "東京大学").

# search query 
GET my_fulltext_search/_search
{ 
  "query": { 
    "bool": { 
      "must": [ 
        { 
          "multi_match": { 
            "query": "東京大学", 
            "fields": [ 
              "my_field.ngram^1" 
            ], 
            "type": "phrase" 
          } 
        } 
      ], 
      "should": [ 
        { 
          "multi_match": { 
            "query": "東京大学", 
            "fields": [ 
              "my_field^1" 
            ], 
            "type": "phrase" 
          } 
        } 
      ] 
    } 
  } 
} 
# result 
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 8.391829,
    "hits" : [
      {
        "_index" : "my_fulltext_search",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 8.391829,
        "_source" : {
          "my_field" : "東京大学"
        }
      },
      {
        "_index" : "my_fulltext_search",
        "_type" : "_doc",
        "_id" : "8",
        "_score" : 6.73973,
        "_source" : {
          "my_field" : "東大で夢の生活"
        }
      },
      {
        "_index" : "my_fulltext_search",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 5.852869,
        "_source" : {
          "my_field" : "東京大学で夢の生活"
        }
      }
    ]
  }
}

Let’s take a deeper look at the mappings and analyzers we used.

Mapping configuration review

Here are the main decisions we made with our mappings:

  • In mapping design, first we need to prepare two fields for n-gram and morphological analysis. As mentioned above, this blog uses a field called my_field for morphological analysis. Also, under my_field, we use a multi-field called ngram (my_field.ngram) for n-gram analysis.
  • We configured analyzers for morphological analysis and n-gram analysis. We will explore the details of these analyzers later.
  • Regarding the analyzer for morphological analysis, as the processing contents at index time and search time are different, we defined an analyzer for index side (ja_kuromoji_index_analyzer) and an analyzer for search side (ja_kuromoji_search_analyzer).
  • The same applies to the n-gram analyzer. We defined an analyzer for index side (ja_ngram_index_analyzer) and an analyzer for search side (ja_ngram_search_analyzer).
  "mappings": { 
    "properties": { 
      "my_field": { 
        "type": "text", 
        "search_analyzer": "ja_kuromoji_search_analyzer", 
        "analyzer": "ja_kuromoji_index_analyzer", 
        "fields": { 
          "ngram": { 
            "type": "text", 
            "search_analyzer": "ja_ngram_search_analyzer", 
            "analyzer": "ja_ngram_index_analyzer" 
          } 
        } 
      } 
    } 
  }

Description of morphological analysis

  • The design of the morphological analysis is mainly based on the kuromoji tokenizer.
  • Some normalization is also required before and after tokenization.
  • In addition, we placed a synonym graph token filter on the search side.

Configuration for the index analyzer (ja_kuromoji_index_analyzer) 

  • We used the ICU normalizer (nfkc) in char_filter to absorb the variation of full-width and half-width notation for alphanumeric and katakana characters in the search target. For example, full-width "１" will be converted to half-width "1."
  • We define a custom tokenizer called ja_kuromoji_tokenizer. We will explore this later.
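  • Regarding token filters, we use the default ones included in the kuromoji analyzer. The complete configuration above additionally lists ja_index_synonym, an index-side synonym filter whose rule list is empty, which this excerpt omits.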
        "ja_kuromoji_index_analyzer": { 
          "type": "custom", 
          "char_filter": [ 
            "normalize" 
          ], 
          "tokenizer": "ja_kuromoji_tokenizer", 
          "filter": [ 
            "kuromoji_baseform", 
            "kuromoji_part_of_speech", 
            "cjk_width", 
            "ja_stop", 
            "kuromoji_stemmer", 
            "lowercase" 
          ] 
        }

Configuration for the tokenizer (ja_kuromoji_tokenizer) used in index analyzer

  • We used the kuromoji tokenizer with search mode to divide words into smaller pieces.
  • The discard_compound_token option is available from version 7.9 (>=7.9). We set this option to true in this blog. The purpose of this is to avoid the synonym expansion failure due to the existence of the original compound. Please note that in versions before 7.9 (<7.9), this option is not supported. In that case, setting lenient to true in the synonym graph token filter is required to ignore the failure. Please read the discard_compound_token and kuromoji_tokenizer documentation for more details regarding how this option works if needed.
  • We defined a user dictionary. In this blog, we defined it inline, using the sample rule described in the kuromoji tokenizer documentation.
        "ja_kuromoji_tokenizer": { 
          "mode": "search", 
          "type": "kuromoji_tokenizer", 
          "discard_compound_token": true, 
          "user_dictionary_rules": [ 
            "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞" 
          ] 
        }

Configuration for the search analyzer (ja_kuromoji_search_analyzer) 

  • The search analyzer is designed in almost the same way as the index analyzer. The only difference is that we put a synonym graph token filter on the search side.

The reason we use synonyms is that searching with a synonym dictionary is very common in Japanese full-text search use cases. The reason we put it only on search side is described in our Boosting the power of Elasticsearch with synonyms blog:

"the advantages of using synonyms at search time usually outweigh any slight performance gain you might get when using them at index time."

        "ja_kuromoji_search_analyzer": { 
          "type": "custom", 
          "char_filter": [ 
            "normalize" 
          ], 
          "tokenizer": "ja_kuromoji_tokenizer", 
          "filter": [ 
            "kuromoji_baseform", 
            "kuromoji_part_of_speech", 
            "ja_search_synonym", 
            "cjk_width", 
            "ja_stop", 
            "kuromoji_stemmer", 
            "lowercase" 
          ] 
        }

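Below is the definition of the synonym filter. To handle multi-word synonyms correctly, we use the synonym_graph token filter. The lenient option controls whether synonym rules that cannot be parsed are ignored. The complete configuration above sets it to false, which is possible because discard_compound_token is true; on versions before 7.9, set it to true, as in the excerpt below, to ignore a potential synonym expansion failure.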

        "ja_search_synonym": { 
          "type": "synonym_graph", 
          "lenient": true, 
          "synonyms": [ 
            "米国, アメリカ", 
            "東京大学, 東大" 
          ] 
        }
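
To illustrate what search-time synonym expansion does conceptually, here is a minimal Python sketch. It is not how the synonym_graph filter works internally: the helper names are hypothetical, and plain substring matching stands in for real token matching. The rules and documents are copied from the example above.

```python
# Each entry lists equivalent terms, mirroring the "synonyms" array above.
SYNONYM_RULES = ["米国, アメリカ", "東京大学, 東大"]

def build_synonym_map(rules):
    """Map every term to the full set of terms it is equivalent to."""
    synonym_map = {}
    for rule in rules:
        group = {term.strip() for term in rule.split(",")}
        for term in group:
            synonym_map[term] = group
    return synonym_map

def expand_query(term, synonym_map):
    """Return the term plus its synonyms; unknown terms expand to themselves."""
    return sorted(synonym_map.get(term, {term}))

def search(term, docs, synonym_map):
    """Return IDs of documents containing the term or any of its synonyms."""
    variants = expand_query(term, synonym_map)
    return [doc_id for doc_id, text in docs.items() if any(v in text for v in variants)]

docs = {1: "アメリカ", 2: "米国", 3: "アメリカの大学", 4: "東京大学", 5: "帝京大学"}
synonym_map = build_synonym_map(SYNONYM_RULES)
print(search("米国", docs, synonym_map))  # [1, 2, 3]
```

Only the query is expanded here; the stored documents are untouched, so the synonym rules can be changed without reindexing.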

Configuration for the tokenizer (ja_kuromoji_tokenizer) used in search analyzer

Commonly, it is fine to use the same tokenizer for search analyzer. For special scenarios, such as those that require using different user dictionaries, users can also customize a different tokenizer from the one used in index analyzer.

Description of n-gram analysis

The design of the n-gram analysis is mainly based on the ngram tokenizer. Some normalization is required before and after tokenization. Also, we placed a synonym graph token filter on the search side.

Configuration for index analyzer (ja_ngram_index_analyzer) 

  • We used the ICU normalizer (nfkc) in char_filter to absorb the variation of full-width and half-width notation for alphanumeric and katakana characters in the search target. For example, full-width "１" will be converted to half-width "1."
  • We used the lowercase token filter to normalize the alphabet to lowercase. For example, "TOKYO" will be converted to "tokyo."
        "ja_ngram_index_analyzer": { 
          "type": "custom", 
          "char_filter": [ 
            "normalize" 
          ], 
          "tokenizer": "ja_ngram_tokenizer", 
          "filter": [ 
            "lowercase" 
          ] 
        }
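
The effect of these two steps can be approximated outside Elasticsearch with Python's standard library. This is only a sketch: `unicodedata` stands in for the ICU normalizer, the function name is hypothetical, and in Elasticsearch the char filter runs before the tokenizer while lowercase runs after it.

```python
import unicodedata

def normalize_for_index(text):
    """Apply NFKC normalization, then lowercase."""
    return unicodedata.normalize("NFKC", text).lower()

print(normalize_for_index("１"))          # full-width digit becomes '1'
print(normalize_for_index("ｱﾒﾘｶ"))        # half-width katakana becomes 'アメリカ'
print(normalize_for_index("ＴＯＫＹＯ"))  # full-width Latin becomes 'tokyo'
```

Because the same normalization is applied to both documents and queries, a user typing half-width katakana still matches text stored in full-width katakana.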

Configuration for the tokenizer (ja_ngram_tokenizer) used in index analyzer

In the tokenizer design, we use bi-grams; in other words, we set both the minimum and maximum length of a gram to 2 characters. This is because bi-grams are the most common choice for Japanese. However, bi-grams produce many short, frequently occurring tokens, which can sometimes hurt performance. In such performance-prioritized situations, you can use tri-grams (grams of length 3) for particular fields instead. Furthermore, to make the tokens more meaningful, we only keep letters and digits (token_chars).

        "ja_ngram_tokenizer": { 
          "type": "ngram", 
          "min_gram": 2, 
          "max_gram": 2, 
          "token_chars": [ 
            "letter", 
            "digit" 
          ] 
        }

Bi-gram tokenization example

# request 
GET my_fulltext_search/_analyze
{ 
  "tokenizer": "ja_ngram_tokenizer", 
  "text": ["日本語"] 
} 
# response 
{ 
  "tokens" : [ 
    { 
      "token" : "日本", 
      "start_offset" : 0, 
      "end_offset" : 2, 
      "type" : "word", 
      "position" : 0 
    }, 
    { 
      "token" : "本語", 
      "start_offset" : 1, 
      "end_offset" : 3, 
      "type" : "word", 
      "position" : 1 
    } 
  ] 
}
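
If you want to reason about the tokens without a running cluster, the behavior shown in the response above can be approximated in a few lines of Python. This is a simplified sketch: `ngram_tokens` is a hypothetical helper, Python's `isalpha` and `isdigit` stand in for the letter and digit classes of token_chars, and the position field is omitted.

```python
def ngram_tokens(text, n=2):
    """Emit n-grams with offsets, never spanning non-letter/non-digit characters."""
    tokens = []
    start = 0
    # Append a sentinel space so the final run of characters is flushed too.
    for i, ch in enumerate(text + " "):
        if ch.isalpha() or ch.isdigit():
            continue
        # text[start:i] is a maximal run of letters/digits; slide a window over it.
        for j in range(start, i - n + 1):
            tokens.append({"token": text[j:j + n], "start_offset": j, "end_offset": j + n})
        start = i + 1
    return tokens

print([t["token"] for t in ngram_tokens("日本語")])
# ['日本', '本語']
print([t["token"] for t in ngram_tokens("首都圏の大学 東京")])
# ['首都', '都圏', '圏の', 'の大', '大学', '東京']
```

In the second call, no gram spans the space between "大学" and "東京", which is the effect of restricting token_chars to letters and digits.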

Configuration for the search analyzer (ja_ngram_search_analyzer) 

The search analyzer is designed in almost the same way as the index analyzer. The only difference is that we put the synonym graph token filter (ja_search_synonym) on the search side. The reason for this difference is the same as for the morphological analysis above.

        "ja_ngram_search_analyzer": { 
          "type": "custom", 
          "char_filter": [ 
            "normalize" 
          ], 
          "tokenizer": "ja_ngram_tokenizer", 
          "filter": [ 
            "ja_search_synonym", 
            "lowercase" 
          ] 
        }

Configuration for the tokenizer (ja_ngram_tokenizer) used in search analyzer

It is fine to use the same tokenizer for the search analyzer.

Query design for Japanese full-text search

The main idea is to search both the morphological analysis field and the n-gram analysis field at the same time. For the n-gram analysis field, we use a bool query with a must clause to guarantee that every result matches. For the morphological analysis field, we use a bool query with a should clause to boost the score, so that more relevant documents are ranked higher.

The reason we use the must clause, which means "mandatory," for the n-gram field is that we would like to guarantee precision by screening out irrelevant results with the n-gram query. An alternative is to use the should clause for the n-gram field as well, which does not strictly limit the search range and therefore increases recall.

As for further improvement, the boost value of each field should be tuned based on what you learn from testing. Hence we start with equal boosts, my_field.ngram^1 and my_field^1, in our example.

In addition to the search query on one target field (with both morphological and n-gram analysis) that we show in this example, search on multiple fields (e.g., search title and body at once) is also a common use case. In such scenarios, users can add target fields to the fields parameter inside multi_match query. Users can also tune different boost values for different fields based on the requirements.

One more thing to consider is word order, which you can enforce with a match_phrase query or a multi_match query with type: phrase.

The following shows an example where the search keyword is "東京大学" (University of Tokyo).

GET my_fulltext_search/_search
{ 
  "query": { 
    "bool": { 
      "must": [ 
        { 
          "multi_match": { 
            "query": "東京大学", 
            "fields": [ 
              "my_field.ngram^1" 
            ], 
            "type": "phrase" 
          } 
        } 
      ], 
      "should": [ 
        { 
          "multi_match": { 
            "query": "東京大学", 
            "fields": [ 
              "my_field^1" 
            ], 
            "type": "phrase" 
          } 
        } 
      ] 
    } 
  } 
}
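
When you search several fields or experiment with boosts, it can help to generate the request body rather than write it by hand. The following Python sketch shows one way to do that. `build_japanese_query` and its parameters are hypothetical names, and it assumes every target field has an ngram multi-field like my_field.ngram.

```python
import json

def build_japanese_query(keyword, fields, ngram_boost=1, morph_boost=1, strict=True):
    """Build the bool query used in this post for one or more text fields.

    strict=True puts the n-gram clause in "must"; strict=False moves it to
    "should" together with the morphological clause.
    """
    ngram_clause = {
        "multi_match": {
            "query": keyword,
            "fields": [f"{field}.ngram^{ngram_boost}" for field in fields],
            "type": "phrase",
        }
    }
    morph_clause = {
        "multi_match": {
            "query": keyword,
            "fields": [f"{field}^{morph_boost}" for field in fields],
            "type": "phrase",
        }
    }
    if strict:
        bool_query = {"must": [ngram_clause], "should": [morph_clause]}
    else:
        bool_query = {"should": [ngram_clause, morph_clause], "minimum_should_match": 1}
    return {"query": {"bool": bool_query}}

body = build_japanese_query("東京大学", ["my_field"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

With the default arguments, the generated body is equivalent to the query shown above. Passing strict=False produces the recall-oriented variant, in which a document only needs to match one of the two clauses.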

Relevance tuning

Another way to improve search quality is to add relevance tuning, such as a freshness boost, to the query so that the most useful documents become more visible in the search results. Relevance is a big, complex (and fun!) topic that we won't go into in this blog. But if you'd like to start exploring tuning options to determine what's right for your use case, I'd encourage you to read our function score documentation, explore the theory behind relevance in Elasticsearch: The Definitive Guide, or check out some of the many blogs written on relevance.
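
As a starting point only, a freshness boost could look like the Python sketch below, which wraps a query in a function_score query with a Gaussian decay function. The date field published_at is an assumption (the example index in this blog has no date field), and the scale and decay values are arbitrary placeholders to tune for your data.

```python
import json

def with_freshness_boost(query, date_field="published_at", scale="30d", decay=0.5):
    """Wrap a query in function_score with a Gaussian decay on a date field."""
    return {
        "query": {
            "function_score": {
                "query": query,
                "functions": [
                    # The score multiplier falls to `decay` once a document is `scale` old.
                    {"gauss": {date_field: {"origin": "now", "scale": scale, "decay": decay}}}
                ],
                "boost_mode": "multiply",
            }
        }
    }

base_query = {"match_phrase": {"my_field": "東京大学"}}
print(json.dumps(with_freshness_boost(base_query), ensure_ascii=False, indent=2))
```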

Conclusion

In this article, we examined why full-text search is much more difficult to implement in Japanese than in English. We then explored some considerations for implementing Japanese full-text search. Then we went through a detailed example of implementation, and wrapped up with thoughts on query design.

If topics like this interest you, I encourage you to read our Implementing Japanese autocomplete suggestions in Elasticsearch blog post. Just like we saw with full-text search, implementing autocomplete in Japanese has many considerations that English doesn't.

If you’d like to try this for yourself but don’t have an Elasticsearch cluster running, you can spin up a free trial of Elastic Cloud, add the analysis-icu and analysis-kuromoji plugins, and use the example configurations and data from this blog. When you do, please feel free to give us your feedback on our forums. Enjoy your Japanese full-text search journey!