ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the standard tokenizer,
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
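To see the dictionary-based segmentation in action, you can run the _analyze API against the new index. The Thai sample text below is illustrative only; the exact tokens returned depend on the ICU dictionaries bundled with your version of the plugin.
POST icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "สวัสดีครับ"
}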
Rules customization
This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
You can customize the icu_tokenizer behavior by specifying per-script rule files. See the RBBI rules syntax reference for a more detailed explanation.
To add icu_tokenizer rules, set the rule_files setting, which should contain a comma-separated list of
code:rulefile pairs in the following format: a four-letter ISO 15924 script code,
followed by a colon, then a rule file name. Rule files are placed in the ES_HOME/config directory.
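For example, a rule_files value that covers two scripts might look like the following; the file names here are hypothetical placeholders:
"rule_files": "Latn:MyLatinRules.rbbi,Cyrl:MyCyrillicRules.rbbi"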
As a demonstration of how the rule files can be used, save the following user file to $ES_HOME/config/KeywordTokenizer.rbbi. Its single rule matches any run of characters, so the entire input is emitted as one token:
.+ {200};
Then create an analyzer to use this rule file as follows:
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}
POST icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}
The above analyze request returns the following:
{
  "tokens": [
    {
      "token": "Elasticsearch. Wow!",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}