IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

› › ›

ICU Normalization Character Filter

edit

IMPORTANT: This documentation is no longer updated. Refer to Elastic's version policy and the latest documentation.

ICU Normalization Character Filter

edit

Normalizes characters as explained here. It registers itself as the icu_normalizer character filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default). Set the mode parameter to decompose to convert nfc to nfd or nfkc to nfkd respectively:

Which letters are normalized can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

Here are two examples, the default usage and a customised character filter:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { 
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { 
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}

	Uses the default `nfkc_cf` normalization.
	Uses the customized `nfd_normalizer` token filter, which is set to use `nfc` normalization with decomposition.

« ICU Analysis Plugin ICU Tokenizer »