IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

ICU Collation Token Filter

Collations are used for sorting documents in a language-specific word order. The icu_collation token filter is available to all indices and defaults to using the DUCET collation, which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in “phonebook” order:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "german_phonebook": {
          "type":     "icu_collation",
          "language": "de",
          "country":  "DE",
          "variant":  "@collation=phonebook"
        }
      },
      "analyzer": {
        "german_phonebook": {
          "tokenizer": "keyword",
          "filter":  [ "german_phonebook" ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": { 
          "type": "string",
          "fields": {
            "sort": { 
              "type":     "string",
              "analyzer": "german_phonebook"
            }
          }
        }
      }
    }
  }
}

GET _search 
{
  "query": {
    "match": {
      "name": "Fritz"
    }
  },
  "sort": "name.sort"
}

The `name` field uses the `standard` analyzer, and so supports full-text queries.
The `name.sort` field uses the `german_phonebook` analyzer: its `keyword` tokenizer preserves the name as a single token, and its `german_phonebook` token filter indexes the value in German phonebook sort order.
The search request queries the `name` field and sorts on the `name.sort` field.
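
As a rough way to try the mapping above, one could index a few names and sort on `name.sort`. This is only a sketch: the three names are made up for illustration, and the bulk and refresh requests assume the same pre-5.0 request style (with the `user` type in the path) that the example itself uses.

```console
# Hypothetical sample data, not part of the original example
POST /my_index/user/_bulk
{ "index": {} }
{ "name": "Mugler" }
{ "index": {} }
{ "name": "Müller" }
{ "index": {} }
{ "name": "Mueller" }

# Make the new documents visible to search
POST /my_index/_refresh

# Return all names, ordered by their collation keys
GET /my_index/user/_search
{
  "query": { "match_all": {} },
  "sort": "name.sort"
}
```

Because phonebook collation treats `ü` like `ue`, `Mueller` and `Müller` should end up next to each other and ahead of `Mugler`. Under a plain dictionary ordering, `Müller` would instead be expected after `Mugler`.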

Collation options

strength: The strength property determines the minimum level of difference considered significant during comparison. Possible values are: primary, secondary, tertiary, quaternary or identical. See the ICU Collation documentation for a more detailed explanation of each value. Defaults to tertiary unless otherwise specified in the collation.
decomposition: Possible values: no (default, but collation-dependent) or canonical. Setting this decomposition property to canonical allows the Collator to handle unnormalized text properly, producing the same results as if the text were normalized. If no is set, it is the user’s responsibility to ensure that all text is already in the appropriate form before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between faster and more complete collation behavior. Since a great many of the world’s languages do not require text normalization, most locales set no as the default decomposition mode.
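
As an illustration of these two options, a filter definition might look like the sketch below. The index name `my_index_primary` and the filter and analyzer name `german_phonebook_primary` are hypothetical. The option names and values are the ones listed above.

```console
# Hypothetical index, filter and analyzer names
PUT /my_index_primary
{
  "settings": {
    "analysis": {
      "filter": {
        "german_phonebook_primary": {
          "type":          "icu_collation",
          "language":      "de",
          "country":       "DE",
          "variant":       "@collation=phonebook",
          "strength":      "primary",
          "decomposition": "canonical"
        }
      },
      "analyzer": {
        "german_phonebook_primary": {
          "tokenizer": "keyword",
          "filter":    [ "german_phonebook_primary" ]
        }
      }
    }
  }
}
```

With `primary` strength, values that differ only in accents or case are expected to receive the same sort key. `canonical` decomposition trades some speed for correct handling of text that has not been normalized beforehand.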

The following options are expert only:

alternate: Possible values: shifted or non-ignorable. Sets the alternate handling for strength quaternary to either shifted or non-ignorable. This boils down to whether punctuation and whitespace are ignored (shifted) or not (non-ignorable).
caseLevel: Possible values: true or false (default). Whether case-level sorting is required. When strength is set to primary, this will ignore accent differences.
caseFirst: Possible values: lower or upper. Useful to control which case is sorted first when case is not ignored for strength tertiary. The default depends on the collation.
numeric: Possible values: true or false (default). Whether digits are sorted according to their numeric representation. For example, the value egg-9 is sorted before the value egg-21.
variableTop: Single character or contraction. Controls what is variable for alternate.
hiraganaQuaternaryMode: Possible values: true or false. Whether to distinguish between Katakana and Hiragana characters at quaternary strength.
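
As one example of an expert option, the sketch below enables `numeric` so that values such as egg-9 sort before egg-21. The index name `my_index_numeric` and the filter and analyzer name `numeric_collation` are hypothetical. No language is set, so the filter falls back to its default collation.

```console
# Hypothetical index, filter and analyzer names
PUT /my_index_numeric
{
  "settings": {
    "analysis": {
      "filter": {
        "numeric_collation": {
          "type":    "icu_collation",
          "numeric": true
        }
      },
      "analyzer": {
        "numeric_collation": {
          "tokenizer": "keyword",
          "filter":    [ "numeric_collation" ]
        }
      }
    }
  }
}
```

The resulting analyzer could then be attached to a sort sub-field in the same way as `name.sort` in the German phonebook example.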
