kuromoji_tokenizer
The kuromoji_tokenizer accepts the following settings:
- mode -
  The tokenization mode determines how the tokenizer handles compound and
  unknown words. It can be set to:

  - normal -
    Normal segmentation, no decomposition for compounds. Example output:

        関西国際空港
        アブラカダブラ

  - search -
    Segmentation geared towards search. This includes a decompounding process
    for long nouns, also including the full compound token as a synonym.
    Example output:

        関西, 関西国際空港, 国際, 空港
        アブラカダブラ

  - extended -
    Extended mode outputs unigrams for unknown words. Example output:

        関西, 国際, 空港
        ア, ブ, ラ, カ, ダ, ブ, ラ
- discard_punctuation -
  Whether punctuation should be discarded from the output. Defaults to true.

- user_dictionary -
  The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A
  user_dictionary may be appended to the default dictionary. The dictionary
  should have the following CSV format:

      <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
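To compare the three mode values side by side, a single index can define one
tokenizer per mode. The following is only a sketch; the index and tokenizer
names (kuromoji_modes, kuromoji_normal, and so on) are made up for this
example and are not part of the plugin:

```
PUT kuromoji_modes
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_normal":   { "type": "kuromoji_tokenizer", "mode": "normal" },
          "kuromoji_search":   { "type": "kuromoji_tokenizer", "mode": "search" },
          "kuromoji_extended": { "type": "kuromoji_tokenizer", "mode": "extended" }
        }
      }
    }
  }
}
```

Analyzing the same input text with each of these tokenizers should then
produce the three example outputs listed above.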
As a demonstration of how the user dictionary can be used, save the following
dictionary to $ES_HOME/config/userdict_ja.txt:
    東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
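To see how such an entry breaks down into the four fields of the CSV format
described above, here is a minimal Python sketch (outside Elasticsearch, for
illustration only):

```python
import csv
from io import StringIO

# One entry in the user dictionary format:
# <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
entry = "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"

# csv splits on commas; tokens and readings are space-separated within a field
text, tokens, readings, pos = next(csv.reader(StringIO(entry)))

print(text)              # 東京スカイツリー
print(tokens.split())    # ['東京', 'スカイツリー']
print(readings.split())  # ['トウキョウ', 'スカイツリー']
print(pos)               # カスタム名詞
```

The surface form 東京スカイツリー is thus segmented into the two custom tokens
東京 and スカイツリー, which is exactly what the analyze request below returns.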
Then create an analyzer as follows:
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"kuromoji_user_dict": {
"type": "kuromoji_tokenizer",
"mode": "extended",
"discard_punctuation": "false",
"user_dictionary": "userdict_ja.txt"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "kuromoji_user_dict"
}
}
}
}
}
}
POST kuromoji_sample/_analyze?analyzer=my_analyzer&text=東京スカイツリー
The above analyze request returns the following:
# Result
{
"tokens" : [ {
"token" : "東京",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 1
}, {
"token" : "スカイツリー",
"start_offset" : 2,
"end_offset" : 8,
"type" : "word",
"position" : 2
} ]
}