
How to Search Chinese, Japanese, and Korean Text with Elasticsearch 6.2 - Part 2: Multi-fields

How to Search Chinese, Japanese, and Korean Text with Elasticsearch 6.2 - Part 1: Analyzers covered the issues around searching documents written in multiple languages and introduced language analyzers. In this part we look at the problems of indexing multi-language documents into a single field and at how multi-fields solve them. The sample texts are excerpts from https://www.pyeongchang2018.com/en/about-the-games.


Single Field with the Standard Analyzer

We can index text in any language into a single field using the standard analyzer, which is the default analyzer for text fields.


DELETE mono
PUT /mono
{  
  "mappings": {
    "docs": {
      "properties": {
        "body": {
          "type": "text"
        }
      }
    }
  }
}

PUT /mono/docs/1 
{
  "body" : "세계인의 축제, 제23회 동계올림픽대회는 대한민국 강원도 평창에서 2018년 2월 9일부터 25일까지 17일간 개최됩니다. 대한민국 평창은 세 번의 도전 끝에 지난 2011년 7월 6일 열린 제123차 IOC 총회에서 과반 표를 획득하며 2018년 동계올림픽 개최지로 선정되었습니다. 이로써 대한민국에서는 1988년 서울 올림픽 이후 30년 만에, 평창에서 개∙폐회식과 대부분의 설상 경기가 개최되며, 강릉에서는 빙상 종목 전 경기가, 그리고 정선에서는 알파인 스키 활강 경기가 개최될 예정입니다."
}
PUT /mono/docs/2 
{
  "body" : "The XXIII Olympic Winter Games will be held for 17 days from 9 to 25 February 2018 in PyeongChang, Gangwon Province, the Republic of Korea. PyeongChang was selected as the host city of the 2018 Olympic Winter Games after receiving a majority vote at the 123rd IOC Session held on 6 July 2011 after three consecutive bids. The Olympic Winter Games will be held in Korea for the first time in 30 years after the Seoul Olympic Games in 1988. PyeongChang will be the stage for the Opening and Closing Ceremonies and most snow sports. Alpine speed events will take place in Jeongseon, and all ice sports will be competed in the coastal city of Gangneung."
}
PUT /mono/docs/3
{
  "body" : "第23届冬季奥运会将于2018年2月9日-25日在韩国江原道平昌展开。韩国平昌在第三次申奥之后,于2011年7月6日召开的第123届国际奥委会全会上被选定为2018年冬季奥运会的主办地。由此,韩国自1988年举办首尔夏季奥运会以后,时隔30年,将首次举办冬季奥运会。该届冬奥会的开·闭幕式以及大部分的雪上运动将在平昌进行,而所有冰上运动将在江陵、高山滑雪滑降比赛则将在旌善进行。"
}
PUT /mono/docs/4
{
  "body" : "世界の人々の祝祭、第23回冬季オリンピック大会は大韓民国江原道平昌で2018年2月9日から25日までの17日間、開催されます。大韓民国・平昌は三度の挑戦の末、2011年7月7日に開かれた第123回IOC総会で過半数票を獲得し、2018年冬季オリンピック及びパラリンピックの開催地に選ばれました。これにより1988年ソウルオリンピック開催後30年の時を経てついに、大韓民国で最初の冬季パラリンピックの舞台が繰り広げられます。平昌で開・閉会式とほぼ全ての雪上競技が開催され、江陵では氷上種目全競技が、そして旌善ではアルペンスキー滑降競技が開催される予定です。"
}

Searching these documents works for some languages but not for others. Korean and Japanese attach postpositions (particles) to words, and the standard analyzer does not strip them. Unless an analyzer separates the postposition from an eojeol (a space-delimited Korean word unit), a query for the bare word may not match.


English

GET /mono/_search
{
  "query": {
    "multi_match": {
      "query": "Olympic Games",
      "fields": [
        "body"
      ]
    }
  }
}
=>
{
 ...
  "hits": {
    "total": 1,
    "max_score": 2.3414338,
    "hits": [
      {
        "_index": "mono",
        "_type": "docs",
        "_id": "2",
        "_score": 2.3414338,
        "_source": {
          "body": "The XXIII Olympic Winter Games will be held for 17 days from 9 to 25 February 2018 in PyeongChang, Gangwon Province, the Republic of Korea. PyeongChang was selected as the host city of the 2018 Olympic Winter Games after receiving a majority vote at the 123rd IOC Session held on 6 July 2011 after three consecutive bids. The Olympic Winter Games will be held in Korea for the first time in 30 years after the Seoul Olympic Games in 1988. PyeongChang will be the stage for the Opening and Closing Ceremonies and most snow sports. Alpine speed events will take place in Jeongseon, and all ice sports will be competed in the coastal city of Gangneung."
        }
      }
    ]
  }
}

Japanese

POST /mono/_search
{
  "query": {
    "multi_match": {
      "query": "オリンピック大会",
      "fields": [
        "body"
      ]
    }
  }
}
=>
{
...
  "hits": {
    "total": 2,
    "max_score": 2.8541033,
    "hits": [
      {
        "_index": "mono",
        "_type": "docs",
        "_id": "4",
        "_score": 2.8541033,
        "_source": {
          "body": "世界の人々の祝祭、第23回冬季オリンピック大会は大韓民国江原道平昌で2018年2月9日から25日までの17日間、開催されます。大韓民国・平昌は三度の挑戦の末、2011年7月7日に開かれた第123回IOC総会で過半数票を獲得し、2018年冬季オリンピック及びパラリンピックの開催地に選ばれました。これにより1988年ソウルオリンピック開催後30年の時を経てついに、大韓民国で最初の冬季パラリンピックの舞台が繰り広げられます。平昌で開・閉会式とほぼ全ての雪上競技が開催され、江陵では氷上種目全競技が、そして旌善ではアルペンスキー滑降競技が開催される予定です。"
        }
      },
      {
        "_index": "mono",
        "_type": "docs",
        "_id": "3",
        "_score": 0.82374656,
        "_source": {
          "body": "第23届冬季奥运会将于2018年2月9日-25日在韩国江原道平昌展开。韩国平昌在第三次申奥之后,于2011年7月6日召开的第123届国际奥委会全会上被选定为2018年冬季奥运会的主办地。由此,韩国自1988年举办首尔夏季奥运会以后,时隔30年,将首次举办冬季奥运会。该届冬奥会的开·闭幕式以及大部分的雪上运动将在平昌进行,而所有冰上运动将在江陵、高山滑雪滑降比赛则将在旌善进行。"
        }
      }
    ]
  }
}

Chinese

POST /mono/_search
{
  "query": {
    "multi_match": {
      "query": "奥运会",
      "fields": [
        "body"
      ]
    }
  }
}
=>
{
...
  "hits": {
    "total": 2,
    "max_score": 1.6035628,
    "hits": [
      {
        "_index": "mono",
        "_type": "docs",
        "_id": "3",
        "_score": 1.6035628,
        "_source": {
          "body": "第23届冬季奥运会将于2018年2月9日-25日在韩国江原道平昌展开。韩国平昌在第三次申奥之后,于2011年7月6日召开的第123届国际奥委会全会上被选定为2018年冬季奥运会的主办地。由此,韩国自1988年举办首尔夏季奥运会以后,时隔30年,将首次举办冬季奥运会。该届冬奥会的开·闭幕式以及大部分的雪上运动将在平昌进行,而所有冰上运动将在江陵、高山滑雪滑降比赛则将在旌善进行。"
        }
      },
      {
        "_index": "mono",
        "_type": "docs",
        "_id": "4",
        "_score": 0.9687751,
        "_source": {
          "body": "世界の人々の祝祭、第23回冬季オリンピック大会は大韓民国江原道平昌で2018年2月9日から25日までの17日間、開催されます。大韓民国・平昌は三度の挑戦の末、2011年7月7日に開かれた第123回IOC総会で過半数票を獲得し、2018年冬季オリンピック及びパラリンピックの開催地に選ばれました。これにより1988年ソウルオリンピック開催後30年の時を経てついに、大韓民国で最初の冬季パラリンピックの舞台が繰り広げられます。平昌で開・閉会式とほぼ全ての雪上競技が開催され、江陵では氷上種目全競技が、そして旌善ではアルペンスキー滑降競技が開催される予定です。"
        }
      }
    ]
  }
}

Korean

POST /mono/_search
{
  "query": {
    "multi_match": {
      "query": "올림픽대회",
      "fields": [
        "body"
      ]
    }
  }
}
=>
{
...
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

The English, Japanese, and Chinese queries return results, but the Korean query does not.

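English is an inflectional language without postpositions, so the standard analyzer handles it well. Japanese is an agglutinative language, but the standard analyzer splits the text into single Kanji (“大”, “会”) and Hiragana (“は”) characters and keeps Katakana runs (“オリンピック”) together (see “Standard Analyzer: The Default Analyzer” in Part 1: Analyzers), so Elasticsearch can still find the Japanese document (_id: 4). The Chinese document (_id: 3) also matches on “会” and is included in the result with a lower score (0.82374656). For the same reason, the Chinese query “奥运会” also returns the Japanese document through the shared character “会”. Korean is agglutinative as well, but the standard analyzer keeps each eojeol as one token, so “동계올림픽대회는” is indexed as a single term and the query “올림픽대회” does not match it.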
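The behaviour described above can be imitated with a small Python sketch. This is a deliberately simplified stand-in, not Lucene's actual tokenizer: it assumes that Han and Hiragana characters become one token each, that Katakana and Hangul runs stay together, and that a document matches when it shares at least one term with the query. The names `script_of`, `toy_tokenize`, and `shared_terms` are hypothetical helpers introduced only for this illustration.

```python
import unicodedata


def script_of(ch):
    """Return a coarse script label for a character, or None for separators."""
    name = unicodedata.name(ch, "")
    if name.startswith("HANGUL"):
        return "hangul"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "han"
    if ch.isalnum():
        return "other"
    return None  # whitespace, punctuation, symbols


def toy_tokenize(text):
    """Simplified imitation of the tokenization described in the article."""
    tokens, current, current_script = [], "", None
    for ch in text:
        script = script_of(ch)
        if script in ("han", "hiragana"):
            # Assumption: these characters are emitted one token per character.
            if current:
                tokens.append(current)
            current, current_script = "", None
            tokens.append(ch)
        elif script is None:
            # Separators end the current token.
            if current:
                tokens.append(current)
            current, current_script = "", None
        elif script == current_script:
            current += ch
        else:
            # A change of script starts a new token.
            if current:
                tokens.append(current)
            current, current_script = ch, script
    if current:
        tokens.append(current)
    return [t.lower() for t in tokens]


def shared_terms(query, document):
    """Query terms that also occur as terms in the document, in query order."""
    doc_terms = set(toy_tokenize(document))
    return [t for t in toy_tokenize(query) if t in doc_terms]


if __name__ == "__main__":
    print(shared_terms("올림픽대회", "동계올림픽대회는 대한민국"))
    print(shared_terms("オリンピック大会", "冬季オリンピック大会は"))
    print(shared_terms("オリンピック大会", "第23届冬季奥运会"))
```

Under these assumptions the Korean query shares no term with the Korean text because the whole eojeol is one term, while the Japanese query shares terms with both the Japanese and the Chinese text.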


Multi-fields with Language-specific Analyzers

We can define multi-fields on the field, one sub-field per language, and assign a language-specific analyzer to each. The kuromoji (Japanese), smartcn (Chinese), and openkoreantext-analyzer (Korean) plugins must be installed first.


DELETE /test
PUT /test
{
  "mappings": {
    "docs": {
      "properties": {
        "body": {
          "type": "text",
          "fields": {
            "korean_field": {
              "analyzer": "openkoreantext-analyzer",
              "type": "text"
            },
            "japanese_field": {
              "analyzer": "kuromoji",
              "type": "text"
            },
            "chinese_field": {
              "analyzer": "smartcn",
              "type": "text"
            }
          }
        }
      }
    }
  }
}
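If the set of languages is likely to grow, the mapping above can be generated rather than written by hand. The following Python sketch only builds the request bodies as dictionaries; it does not talk to a cluster. The helper names (`build_mapping`, `search_fields`, `build_query`) and the `LANGUAGE_ANALYZERS` table are hypothetical and introduced for illustration; the sub-field and analyzer names are copied from the mapping above.

```python
import json

# Sub-field name -> analyzer name, as used in the mapping above.
LANGUAGE_ANALYZERS = {
    "korean_field": "openkoreantext-analyzer",
    "japanese_field": "kuromoji",
    "chinese_field": "smartcn",
}


def build_mapping(field="body", doc_type="docs"):
    """Build the index body with one analyzed sub-field per language."""
    sub_fields = {
        name: {"type": "text", "analyzer": analyzer}
        for name, analyzer in LANGUAGE_ANALYZERS.items()
    }
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    # The main field has no explicit analyzer, so it uses the default.
                    field: {"type": "text", "fields": sub_fields}
                }
            }
        }
    }


def search_fields(field="body"):
    """Main field plus every language sub-field, as dotted paths."""
    return [field] + [f"{field}.{name}" for name in LANGUAGE_ANALYZERS]


def build_query(text, field="body"):
    """Build a multi_match body that searches all of those fields."""
    return {"query": {"multi_match": {"query": text, "fields": search_fields(field)}}}


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_query("Olympic Games"), indent=2, ensure_ascii=False))
```

Keeping the analyzer table in one place means the mapping and the list of fields to search cannot drift apart when a language is added.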

When we index a document, its text is analyzed by all four analyzers: the standard analyzer for the main “body” field and one language-specific analyzer for each sub-field.


PUT /test/docs/1
{
  "body" : "세계인의 축제, 제23회 동계올림픽대회는 대한민국 강원도 평창에서 2018년 2월 9일부터 25일까지 17일간 개최됩니다. 대한민국 평창은 세 번의 도전 끝에 지난 2011년 7월 6일 열린 제123차 IOC 총회에서 과반 표를 획득하며 2018년 동계올림픽 개최지로 선정되었습니다. 이로써 대한민국에서는 1988년 서울 올림픽 이후 30년 만에, 평창에서 개∙폐회식과 대부분의 설상 경기가 개최되며, 강릉에서는 빙상 종목 전 경기가, 그리고 정선에서는 알파인 스키 활강 경기가 개최될 예정입니다."
}
PUT /test/docs/2
{
  "body" : "The XXIII Olympic Winter Games will be held for 17 days from 9 to 25 February 2018 in PyeongChang, Gangwon Province, the Republic of Korea. PyeongChang was selected as the host city of the 2018 Olympic Winter Games after receiving a majority vote at the 123rd IOC Session held on 6 July 2011 after three consecutive bids. The Olympic Winter Games will be held in Korea for the first time in 30 years after the Seoul Olympic Games in 1988. PyeongChang will be the stage for the Opening and Closing Ceremonies and most snow sports. Alpine speed events will take place in Jeongseon, and all ice sports will be competed in the coastal city of Gangneung."
}
PUT /test/docs/3
{
  "body" : "第23届冬季奥运会将于2018年2月9日-25日在韩国江原道平昌展开。韩国平昌在第三次申奥之后,于2011年7月6日召开的第123届国际奥委会全会上被选定为2018年冬季奥运会的主办地。由此,韩国自1988年举办首尔夏季奥运会以后,时隔30年,将首次举办冬季奥运会。该届冬奥会的开·闭幕式以及大部分的雪上运动将在平昌进行,而所有冰上运动将在江陵、高山滑雪滑降比赛则将在旌善进行。"
}
PUT /test/docs/4
{
  "body" : "世界の人々の祝祭、第23回冬季オリンピック大会は大韓民国江原道平昌で2018年2月9日から25日までの17日間、開催されます。大韓民国・平昌は三度の挑戦の末、2011年7月7日に開かれた第123回IOC総会で過半数票を獲得し、2018年冬季オリンピック及びパラリンピックの開催地に選ばれました。これにより1988年ソウルオリンピック開催後30年の時を経てついに、大韓民国で最初の冬季パラリンピックの舞台が繰り広げられます。平昌で開・閉会式とほぼ全ての雪上競技が開催され、江陵では氷上種目全競技が、そして旌善ではアルペンスキー滑降競技が開催される予定です。"
}

We can use multi_match queries so that a document is returned when either the main field (English) or one of the language-specific sub-fields matches.


Korean

POST /test/_search
{
  "query": {
    "multi_match": {
      "query": "올림픽대회",
      "fields": [
        "body",
        "body.korean_field",
        "body.chinese_field",
        "body.japanese_field"
      ]
    }
  }
}
=>
{
...
  "hits": {
    "total": 1,
    "max_score": 2.207397,
    "hits": [
      {
        "_index": "test",
        "_type": "docs",
        "_id": "1",
        "_score": 2.207397,
        "_source": {
          "body": "세계인의 축제, 제23회 동계올림픽대회는 대한민국 강원도 평창에서 2018년 2월 9일부터 25일까지 17일간 개최됩니다. 대한민국 평창은 세 번의 도전 끝에 지난 2011년 7월 6일 열린 제123차 IOC 총회에서 과반 표를 획득하며 2018년 동계올림픽 개최지로 선정되었습니다. 이로써 대한민국에서는 1988년 서울 올림픽 이후 30년 만에, 평창에서 개∙폐회식과 대부분의 설상 경기가 개최되며, 강릉에서는 빙상 종목 전 경기가, 그리고 정선에서는 알파인 스키 활강 경기가 개최될 예정입니다."
        }
      }
    ]
  }
}

English

POST /test/_search
{
  "query": {
    "multi_match": {
      "query": "Olympic Games",
      "fields": [
        "body",
        "body.korean_field",
        "body.chinese_field",
        "body.japanese_field"
      ]
    }
  }
}
=>
{
...
  "hits": {
    "total": 1,
    "max_score": 2.3665998,
    "hits": [
      {
        "_index": "test",
        "_type": "docs",
        "_id": "2",
        "_score": 2.3665998,
        "_source": {
          "body": "The XXIII Olympic Winter Games will be held for 17 days from 9 to 25 February 2018 in PyeongChang, Gangwon Province, the Republic of Korea. PyeongChang was selected as the host city of the 2018 Olympic Winter Games after receiving a majority vote at the 123rd IOC Session held on 6 July 2011 after three consecutive bids. The Olympic Winter Games will be held in Korea for the first time in 30 years after the Seoul Olympic Games in 1988. PyeongChang will be the stage for the Opening and Closing Ceremonies and most snow sports. Alpine speed events will take place in Jeongseon, and all ice sports will be competed in the coastal city of Gangneung."
        }
      }
    ]
  }
}

Both Japanese and Chinese texts may contain Chinese characters, so a query may return both. You can either show all results or keep only the one with the highest score.


Japanese

POST /test/_search
{
  "query": {
    "multi_match": {
      "query": "オリンピック大会",
      "fields": [
        "body",
        "body.korean_field",
        "body.chinese_field",
        "body.japanese_field"
      ]
    }
  }
}
=>
{
...
  "hits": {
    "total": 2,
    "max_score": 7.3598623,
    "hits": [
      {
        "_index": "test",
        "_type": "docs",
        "_id": "4",
        "_score": 7.3598623,
        "_source": {
          "body": "世界の人々の祝祭、第23回冬季オリンピック大会は大韓民国江原道平昌で2018年2月9日から25日までの17日間、開催されます。大韓民国・平昌は三度の挑戦の末、2011年7月7日に開かれた第123回IOC総会で過半数票を獲得し、2018年冬季オリンピック及びパラリンピックの開催地に選ばれました。これにより1988年ソウルオリンピック開催後30年の時を経てついに、大韓民国で最初の冬季パラリンピックの舞台が繰り広げられます。平昌で開・閉会式とほぼ全ての雪上競技が開催され、江陵では氷上種目全競技が、そして旌善ではアルペンスキー滑降競技が開催される予定です。"
        }
      },
      {
        "_index": "test",
        "_type": "docs",
        "_id": "3",
        "_score": 0.82374656,
        "_source": {
          "body": "第23届冬季奥运会将于2018年2月9日-25日在韩国江原道平昌展开。韩国平昌在第三次申奥之后,于2011年7月6日召开的第123届国际奥委会全会上被选定为2018年冬季奥运会的主办地。由此,韩国自1988年举办首尔夏季奥运会以后,时隔30年,将首次举办冬季奥运会。该届冬奥会的开·闭幕式以及大部分的雪上运动将在平昌进行,而所有冰上运动将在江陵、高山滑雪滑降比赛则将在旌善进行。"
        }
      }
    ]
  }
}

Chinese

POST /test/_search
{
  "query": {
    "multi_match": {
      "query": "奥运会",
      "fields": [
        "body",
        "body.korean_field",
        "body.chinese_field",
        "body.japanese_field"
      ]
    }
  }
}
=>
{
...
  "hits": {
    "total": 2,
    "max_score": 1.6035628,
    "hits": [
      {
        "_index": "test",
        "_type": "docs",
        "_id": "3",
        "_score": 1.6035628,
        "_source": {
          "body": "第23届冬季奥运会将于2018年2月9日-25日在韩国江原道平昌展开。韩国平昌在第三次申奥之后,于2011年7月6日召开的第123届国际奥委会全会上被选定为2018年冬季奥运会的主办地。由此,韩国自1988年举办首尔夏季奥运会以后,时隔30年,将首次举办冬季奥运会。该届冬奥会的开·闭幕式以及大部分的雪上运动将在平昌进行,而所有冰上运动将在江陵、高山滑雪滑降比赛则将在旌善进行。"
        }
      },
      {
        "_index": "test",
        "_type": "docs",
        "_id": "4",
        "_score": 0.9687751,
        "_source": {
          "body": "世界の人々の祝祭、第23回冬季オリンピック大会は大韓民国江原道平昌で2018年2月9日から25日までの17日間、開催されます。大韓民国・平昌は三度の挑戦の末、2011年7月7日に開かれた第123回IOC総会で過半数票を獲得し、2018年冬季オリンピック及びパラリンピックの開催地に選ばれました。これにより1988年ソウルオリンピック開催後30年の時を経てついに、大韓民国で最初の冬季パラリンピックの舞台が繰り広げられます。平昌で開・閉会式とほぼ全ての雪上競技が開催され、江陵では氷上種目全競技が、そして旌善ではアルペンスキー滑降競技が開催される予定です。"
        }
      }
    ]
  }
}
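When a query returns both the Japanese and the Chinese document, keeping only the best-scoring one can be done on the client side. The Python sketch below works on a search response that has already been parsed into a dictionary; `top_hit` is a hypothetical helper, and `SAMPLE_RESPONSE` is an abbreviated copy of the Japanese result shown earlier with the `_source` bodies left out.

```python
# Abbreviated copy of the Japanese query result above (document bodies omitted).
SAMPLE_RESPONSE = {
    "hits": {
        "total": 2,
        "max_score": 7.3598623,
        "hits": [
            {"_index": "test", "_type": "docs", "_id": "4", "_score": 7.3598623},
            {"_index": "test", "_type": "docs", "_id": "3", "_score": 0.82374656},
        ],
    }
}


def top_hit(response):
    """Return the hit with the highest _score, or None if nothing matched."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return max(hits, key=lambda hit: hit["_score"])


if __name__ == "__main__":
    best = top_hit(SAMPLE_RESPONSE)
    print(best["_id"], best["_score"])
```

Because results are sorted by score by default, taking the first hit, or adding "size": 1 to the search request, should give the same document; the explicit max() only makes the intent visible.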

Is there a way to get only the documents written in the same language as the query keyword? Do we really need to analyze every text with four analyzers? How to Search Chinese, Japanese, and Korean Text with Elasticsearch 6.2 - Part 3: Language Detector introduces the language detector and describes how it can improve on this approach.