IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

CJK width token filter

Normalizes width differences in CJK (Chinese, Japanese, and Korean) characters as follows:

Folds full-width ASCII character variants into the equivalent basic Latin characters
Folds half-width Katakana character variants into the equivalent Kana characters
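As a rough illustration of the first rule: the full-width ASCII variants occupy the code points U+FF01 through U+FF5E, at a fixed offset of 0xFEE0 from their basic Latin counterparts. The Python sketch below is not Lucene's implementation, and the function name `fold_fullwidth_ascii` is hypothetical. It only shows what this kind of folding amounts to, and it does not cover the half-width Katakana rule.

```python
# Full-width ASCII variants: FULLWIDTH EXCLAMATION MARK .. FULLWIDTH TILDE
FULLWIDTH_START = 0xFF01
FULLWIDTH_END = 0xFF5E
# Distance between a full-width variant and its basic Latin equivalent
OFFSET = 0xFEE0


def fold_fullwidth_ascii(text: str) -> str:
    """Hypothetical helper: map full-width ASCII variants to basic Latin.

    Characters outside the full-width ASCII range are returned unchanged.
    """
    return "".join(
        chr(ord(ch) - OFFSET) if FULLWIDTH_START <= ord(ch) <= FULLWIDTH_END else ch
        for ch in text
    )


print(fold_fullwidth_ascii("ＡＢＣ１２３"))  # ABC123
```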

This filter is included in Elasticsearch’s built-in CJK language analyzer. It uses Lucene’s CJKWidthFilter.

This token filter can be viewed as a subset of NFKC/NFKD Unicode normalization. See the analysis-icu plugin for full normalization support.
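The relationship to NFKC can be checked outside Elasticsearch with Python's standard `unicodedata` module. This is only a sketch of Unicode normalization, not of the filter itself, and the helper name `nfkc` is hypothetical. For width variants, NFKC yields the same folded forms described above, but it also rewrites other compatibility characters that fall outside the two folding rules this page lists.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Hypothetical helper: apply NFKC normalization to a string."""
    return unicodedata.normalize("NFKC", text)


# Half-width Katakana folds to the regular Kana forms.
print(nfkc("ｼｰｻｲﾄﾞﾗｲﾅｰ"))  # シーサイドライナー

# Full-width ASCII variants fold to basic Latin.
print(nfkc("ＡＢＣ"))  # ABC

# NFKC goes beyond width folding: CIRCLED DIGIT ONE becomes "1".
print(nfkc("①"))  # 1
```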

Example

GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_width"],
  "text" : "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}

The filter produces the following token:

シーサイドライナー

Add to an analyzer

The following create index API request uses the CJK width token filter to configure a new custom analyzer.

PUT /cjk_width_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_cjk_width": {
          "tokenizer": "standard",
          "filter": [ "cjk_width" ]
        }
      }
    }
  }
}
