IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
CJK Bigram Token Filter
The cjk_bigram token filter forms bigrams out of the CJK
terms that are generated by the standard tokenizer
or the icu_tokenizer (see analysis-icu plugin).
By default, when a CJK character has no adjacent characters to form a bigram,
it is output in unigram form. If you always want to output both unigrams and
bigrams, set the output_unigrams flag to true. This can be used for a
combined unigram+bigram approach.
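The behaviour described above can be illustrated with a small Python model. This is only a sketch and not the filter's actual implementation: the function name cjk_bigram_sketch is hypothetical, the input is assumed to be a single run of adjacent CJK characters, and the token order shown when unigrams are enabled is an illustrative choice. Position information and script handling are left out.

```python
def cjk_bigram_sketch(run, output_unigrams=False):
    """Simplified model of bigram formation over one run of adjacent
    CJK characters (hypothetical helper, not the real filter)."""
    chars = list(run)
    # A character with no neighbour cannot form a bigram,
    # so it is emitted as a unigram.
    if len(chars) == 1:
        return chars
    tokens = []
    for i, ch in enumerate(chars):
        if output_unigrams:
            # Combined approach: emit every single character as well.
            tokens.append(ch)
        if i + 1 < len(chars):
            # Overlapping pair of this character and the next one.
            tokens.append(ch + chars[i + 1])
    return tokens


print(cjk_bigram_sketch("東京都"))                        # bigrams only
print(cjk_bigram_sketch("東京都", output_unigrams=True))  # unigrams and bigrams
print(cjk_bigram_sketch("東"))                            # lone character
```

In this model a three-character run yields two overlapping bigrams by default, and five tokens when unigrams are also requested.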
Bigrams are generated for characters in han, hiragana, katakana and
hangul, but bigrams can be disabled for particular scripts with the
ignored_scripts parameter. All non-CJK input is passed through unmodified.
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "han_bigrams" : {
                    "tokenizer" : "standard",
                    "filter" : ["han_bigrams_filter"]
                }
            },
            "filter" : {
                "han_bigrams_filter" : {
                    "type" : "cjk_bigram",
                    "ignored_scripts" : [
                        "hiragana",
                        "katakana",
                        "hangul"
                    ],
                    "output_unigrams" : true
                }
            }
        }
    }
}
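Settings like the ones above can also be generated programmatically. The following Python sketch builds the same structure from a list of scripts that should be bigrammed, and derives ignored_scripts from the four script names mentioned earlier. The helper build_cjk_bigram_settings and its parameters are hypothetical and only use the standard library. Sending the result to a cluster is not shown.

```python
import json

# The four scripts for which bigrams are generated, as listed above.
CJK_SCRIPTS = ("han", "hiragana", "katakana", "hangul")


def build_cjk_bigram_settings(name, bigram_scripts, output_unigrams=False):
    """Build index settings for a cjk_bigram filter (hypothetical helper).

    `name` is used for the analyzer; the filter is named `<name>_filter`.
    Every script not listed in `bigram_scripts` goes into ignored_scripts.
    """
    unknown = set(bigram_scripts) - set(CJK_SCRIPTS)
    if unknown:
        raise ValueError("unknown script(s): " + ", ".join(sorted(unknown)))

    filter_name = name + "_filter"
    filter_def = {"type": "cjk_bigram", "output_unigrams": output_unigrams}
    ignored = [s for s in CJK_SCRIPTS if s not in bigram_scripts]
    if ignored:
        filter_def["ignored_scripts"] = ignored

    return {
        "index": {
            "analysis": {
                "analyzer": {
                    name: {"tokenizer": "standard", "filter": [filter_name]}
                },
                "filter": {filter_name: filter_def},
            }
        }
    }


settings = build_cjk_bigram_settings("han_bigrams", ["han"], output_unigrams=True)
print(json.dumps(settings, indent=4))
```

Called with "han_bigrams" and ["han"], the helper produces the same analyzer and filter definitions as the example settings above.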