The kuromoji analyzer combines the kuromoji_tokenizer tokenizer with a series of token filters. It supports the
user_dictionary setting from the
kuromoji_tokenizer tokenizer.
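As a sketch of passing that setting through, a kuromoji-based analyzer with a user dictionary might be configured as follows (the index name, analyzer name, and dictionary file path are illustrative; the dictionary file is assumed to live in the Elasticsearch config directory):

```console
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_kuromoji_analyzer": {
            "type": "kuromoji",
            "user_dictionary": "userdict_ja.txt"
          }
        }
      }
    }
  }
}
```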
Normalize full-width characters
The kuromoji_tokenizer tokenizer uses characters from the MeCab-IPADIC
dictionary to split text into tokens. The dictionary includes some full-width
characters, such as
ｏ and ｆ. If a text contains full-width characters,
the tokenizer can produce unexpected tokens.
For example, the
kuromoji_tokenizer tokenizer converts the text
Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ to the tokens
[ culture, o, f, japan ] instead of
[ culture, of, japan ].
To avoid this, add the
icu_normalizer character filter to a custom analyzer based on the
kuromoji analyzer. The
icu_normalizer character filter converts full-width characters to their normal
equivalents.
First, duplicate the
kuromoji analyzer to create the basis for a custom
analyzer. Then add the
icu_normalizer character filter to the custom analyzer.
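A sketch of the resulting custom analyzer (the index and analyzer names are illustrative; the token filter list mirrors the chain the built-in kuromoji analyzer is documented to use, and the icu_normalizer character filter requires the analysis-icu plugin to be installed):

```console
PUT index-00001
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "kuromoji_normalize": {
            "char_filter": [ "icu_normalizer" ],
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform",
              "kuromoji_part_of_speech",
              "cjk_width",
              "ja_stop",
              "kuromoji_stemmer",
              "lowercase"
            ]
          }
        }
      }
    }
  }
}
```

With this analyzer, full-width input such as Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ is normalized before tokenization, so ｏｆ is no longer split into separate tokens.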