24 May 2018 Engineering

How to Search Chinese, Japanese, and Korean Text with Elasticsearch 6.2 - Part 1: Analyzers

By Kiju Kim

Ninety-two countries took part in the Pyeongchang Winter Olympics last winter, and 49 took part in the Pyeongchang Winter Paralympics. Elastic has people in 34 countries. People from so many countries may well write email in their native languages. Suppose you want to find an email using the Korean keyword “올림픽대회” (Olympic Games).

The guide at https://www.elastic.co/guide/en/elasticsearch/guide/current/language-intro.html explains several aspects of handling documents written in multiple languages, but its examples are mostly in European languages, which don’t require any custom plugins. This post introduces custom plugins (language analyzers) for Korean, Japanese, and Chinese, and suggests an approach that uses multi-fields to index and search multi-language text.

Standard Analyzer: The Default Analyzer

Full-text search relies on analyzers to split text into searchable tokens. The default is the standard analyzer, which is often a poor fit for Chinese, Japanese, or Korean text. The examples below apply the standard analyzer to Korean, Japanese, and Chinese text in turn. The texts are excerpts from https://www.pyeongchang2018.com/ko/about-the-games

[Korean]

POST _analyze
{
  "analyzer": "standard",
  "text": "제23회 동계올림픽대회는 대한민국 강원도 평창에서 2018년 2월 9일부터 25일까지 17일간 개최됩니다. 대한민국 평창은 세 번의 도전 끝에 지난 2011년 7월 6일 열린 제123차 IOC 총회에서 과반 표를 획득하며 2018년 동계올림픽 개최지로 선정되었습니다. 이로써 대한민국에서는 1988년 서울 올림픽 이후 30년 만에, 평창에서 개∙폐회식과 대부분의 설상 경기가 개최되며, 강릉에서는 빙상 종목 전 경기가, 그리고 정선에서는 알파인 스키 활강 경기가 개최될 예정입니다."
}
=>
{
  "tokens": [
    {
      "token": "제23회",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "동계올림픽대회는",
      "start_offset": 5,
      "end_offset": 13,
      "type": "<HANGUL>",
      "position": 1
    },
…

Korean is an agglutinative language, and a noun (e.g. “동계올림픽대회”) is usually followed by a postposition (e.g. “는”). Postpositions should be separated or removed for better search, but the standard analyzer keeps the noun and the postposition together in a single token.
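To make the consequence concrete, here is a minimal Python sketch. It assumes that a search term matches only when it equals an indexed token exactly, which is a simplification of how a term is looked up, and it reuses only the first two tokens of the response above. The names token_texts and term_matches are made up for this illustration.

```python
import json

# Abridged copy of the _analyze response above: only the first two tokens.
STANDARD_RESPONSE = json.loads("""
{
  "tokens": [
    {"token": "제23회", "start_offset": 0, "end_offset": 4,
     "type": "<ALPHANUM>", "position": 0},
    {"token": "동계올림픽대회는", "start_offset": 5, "end_offset": 13,
     "type": "<HANGUL>", "position": 1}
  ]
}
""")


def token_texts(analyze_response):
    """Return the token strings of an _analyze response, in order."""
    return [t["token"] for t in analyze_response["tokens"]]


def term_matches(term, analyze_response):
    """Assumption: a term matches only if it equals a token exactly."""
    return term in set(token_texts(analyze_response))
```

With these two tokens, term_matches("동계올림픽대회", STANDARD_RESPONSE) is False, because the only indexed form still carries the postposition, while term_matches("동계올림픽대회는", STANDARD_RESPONSE) is True.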

[Japanese]

POST _analyze
{
  "analyzer": "standard",
  "text": "第23回冬季オリンピック大会は大韓民国江原道平昌で2018年2月9日から25日までの17日間、開催されます。大韓民国・平昌は三度の挑戦の末、2011年7月7日に開かれた第123回IOC総会で過半数票を獲得し、2018年冬季オリンピック及びパラリンピックの開催地に選ばれました。これにより1988年ソウルオリンピック開催後30年の時を経てついに、大韓民国で最初の冬季パラリンピックの舞台が繰り広げられます。平昌で開・閉会式とほぼ全ての雪上競技が開催され、江陵では氷上種目全競技が、そして旌善ではアルペンスキー滑降競技が開催される予定です。"
}
=>
{
  "tokens": [
    {
      "token": "第",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "23",
      "start_offset": 1,
      "end_offset": 3,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "回",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "冬",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "季",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "オリンピック",
      "start_offset": 6,
      "end_offset": 12,
      "type": "<KATAKANA>",
      "position": 5
    },
    {
      "token": "大",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "会",
      "start_offset": 13,
      "end_offset": 14,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "は",
      "start_offset": 14,
      "end_offset": 15,
      "type": "<HIRAGANA>",
      "position": 8
    },
…

Japanese is an agglutinative language, too, and a noun (e.g. “冬季オリンピック大会”) is usually followed by a postposition (e.g. “は”). Here the standard analyzer does split off the postposition, but it keeps it as a token instead of removing it. To make matters worse, Japanese nouns are often written with a mix of Kanji and Kana, and the standard analyzer splits every Kanji character into its own single-character token, so “冬季” and “大会” no longer exist as terms.

[Chinese]

POST _analyze
{
  "analyzer": "standard",
  "text": "第23届冬季奥运会将于2018年2月9日-25日在韩国江原道平昌展开。韩国平昌在第三次申奥之后,于2011年7月6日召开的第123届国际奥委会全会上被选定为2018年冬季奥运会的主办地。由此,韩国自1988年举办首尔夏季奥运会以后,时隔30年,将首次举办冬季奥运会。该届冬奥会的开·闭幕式以及大部分的雪上运动将在平昌进行,而所有冰上运动将在江陵、高山滑雪滑降比赛则将在旌善进行。"
}
=>
{
  "tokens": [
    {
      "token": "第",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "23",
      "start_offset": 1,
      "end_offset": 3,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "届",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "冬",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "季",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "奥",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "运",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "会",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
...

Again, the standard analyzer doesn’t recognize that “冬季奥运会” means “Winter Olympics” and splits it into single-character tokens.

Language Specific Analyzers: Custom Plugins

Language-specific analyzers give a better search experience. Popular analyzers for Japanese, Chinese, and Korean include kuromoji, smartcn, and openkoreantext-analyzer, respectively. They ship as plugins, and you must install them on every node in the cluster before running the following examples. Because openkoreantext-analyzer is not hosted in the Elastic repository, you must install it either from its full URL or from a file you have downloaded first.
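Since the plugins have to be present on every node, it can help to verify that before running the examples. The following is a minimal Python sketch, not an official tool. It assumes you have already fetched the parsed JSON of GET _cat/plugins?format=json, where each row carries a node name under "name" and a plugin name under "component". The function name missing_plugins and the node names used with it are hypothetical, and the plugin name for openkoreantext-analyzer is left out because it depends on the build you install.

```python
def missing_plugins(node_names, cat_plugins_rows, required):
    """Map each node name to the set of required plugins it lacks.

    node_names: every node in the cluster. A node with no plugins at all
        presumably has no row in the _cat/plugins output, so the node
        list has to be supplied separately.
    cat_plugins_rows: list of dicts with "name" (node) and "component".
    required: plugin names that must be installed on every node.
    """
    installed = {name: set() for name in node_names}
    for row in cat_plugins_rows:
        installed.setdefault(row["name"], set()).add(row["component"])

    required = set(required)
    return {
        node: required - plugins
        for node, plugins in installed.items()
        if required - plugins  # keep only nodes that lack something
    }


# Hypothetical example data for a three-node cluster.
EXAMPLE_ROWS = [
    {"name": "node-1", "component": "analysis-kuromoji"},
    {"name": "node-1", "component": "analysis-smartcn"},
    {"name": "node-2", "component": "analysis-kuromoji"},
]
EXAMPLE_REQUIRED = ["analysis-kuromoji", "analysis-smartcn"]
```

Calling missing_plugins(["node-1", "node-2", "node-3"], EXAMPLE_ROWS, EXAMPLE_REQUIRED) reports that node-2 lacks analysis-smartcn and that node-3 lacks both plugins. An empty result means every listed node has every required plugin.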

[Korean]

POST _analyze
{
  "analyzer": "openkoreantext-analyzer",
  "text": "제23회 동계올림픽대회는 대한민국 강원도 평창에서 2018년 2월 9일부터 25일까지 17일간 개최됩니다. 대한민국 평창은 세 번의 도전 끝에 지난 2011년 7월 6일 열린 제123차 IOC 총회에서 과반 표를 획득하며 2018년 동계올림픽 개최지로 선정되었습니다. 이로써 대한민국에서는 1988년 서울 올림픽 이후 30년 만에, 평창에서 개∙폐회식과 대부분의 설상 경기가 개최되며, 강릉에서는 빙상 종목 전 경기가, 그리고 정선에서는 알파인 스키 활강 경기가 개최될 예정입니다."
}
=>
{
  "tokens": [
    {
      "token": "제",
      "start_offset": 0,
      "end_offset": 1,
      "type": "Noun",
      "position": 0
    },
    {
      "token": "23회",
      "start_offset": 1,
      "end_offset": 4,
      "type": "Number",
      "position": 1
    },
    {
      "token": "동계올림픽",
      "start_offset": 5,
      "end_offset": 10,
      "type": "Noun",
      "position": 2
    },
    {
      "token": "대회",
      "start_offset": 10,
      "end_offset": 12,
      "type": "Noun",
      "position": 3
    },
...

You can see that the postpositions are now removed: “동계올림픽대회는” yields the nouns “동계올림픽” and “대회”, without the trailing “는”.

[Japanese]

POST _analyze
{
  "analyzer": "kuromoji",
  "text": "第23回冬季オリンピック大会は大韓民国江原道平昌で2018年2月9日から25日までの17日間、開催されます。大韓民国・平昌は三度の挑戦の末、2011年7月7日に開かれた第123回IOC総会で過半数票を獲得し、2018年冬季オリンピック及びパラリンピックの開催地に選ばれました。これにより1988年ソウルオリンピック開催後30年の時を経てついに、大韓民国で最初の冬季パラリンピックの舞台が繰り広げられます。平昌で開・閉会式とほぼ全ての雪上競技が開催され、江陵では氷上種目全競技が、そして旌善ではアルペンスキー滑降競技が開催される予定です。"
}
=>
{
  "tokens": [
    {
      "token": "第",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "23",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "回",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "冬季",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 3
    },
    {
      "token": "オリンピック",
      "start_offset": 6,
      "end_offset": 12,
      "type": "word",
      "position": 4
    },
    {
      "token": "大会",
      "start_offset": 12,
      "end_offset": 14,
      "type": "word",
      "position": 5
    },
...

The postpositions are removed, and Kanji words such as “冬季” and “大会” are kept whole instead of being broken into single characters.

[Chinese]

POST _analyze
{
  "analyzer": "smartcn",
  "text": "第23届冬季奥运会将于2018年2月9日-25日在韩国江原道平昌展开。韩国平昌在第三次申奥之后,于2011年7月6日召开的第123届国际奥委会全会上被选定为2018年冬季奥运会的主办地。由此,韩国自1988年举办首尔夏季奥运会以后,时隔30年,将首次举办冬季奥运会。该届冬奥会的开·闭幕式以及大部分的雪上运动将在平昌进行,而所有冰上运动将在江陵、高山滑雪滑降比赛则将在旌善进行。"
}
=>
{
  "tokens": [
    {
      "token": "第",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "23",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "届",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "冬季",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 3
    },
    {
      "token": "奥运会",
      "start_offset": 6,
      "end_offset": 9,
      "type": "word",
      "position": 4
    },
...

You can see that the characters that form a word are kept together, for example “冬季” and “奥运会”.
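If you want to rerun these comparisons outside the console, the same POST _analyze call can be made over HTTP. The following is a minimal Python sketch that uses only the standard library. The base URL http://localhost:9200 and the function names build_analyze_request and analyze are assumptions for illustration, and the sketch ignores authentication and error handling.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build (but do not send) the HTTP equivalent of POST _analyze."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, analyzer, text):
    """Send the request and return the token strings from the response."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [t["token"] for t in payload["tokens"]]


# Usage, assuming a node is reachable at the URL below and the
# smartcn plugin is installed:
#
# for name in ("standard", "smartcn"):
#     print(name, analyze("http://localhost:9200", name, "第23届冬季奥运会"))
```

Building the request separately from sending it keeps the first function easy to inspect without a running cluster.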

Now we can analyze Korean, Japanese, and Chinese text. [How to Search Chinese, Japanese, and Korean Text - Part 2: Multi-fields] shows how to index and query such text.