ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the standard
tokenizer,
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
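To see the dictionary-based segmentation in action, you can run a quick analyze request against the analyzer defined above. The sample text and the expected behavior described here are illustrative rather than taken from this reference, and the exact tokens depend on the ICU dictionaries bundled with your version:
POST icu_sample/_analyze?analyzer=my_icu_analyzer&text=向日葵
With the dictionary-based approach, the Chinese word 向日葵 (sunflower) is expected to come back as a single token, whereas the standard tokenizer would emit one token per character.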
Rules customization
This functionality is in technical preview and may be changed or removed in a future release. Elastic will apply best effort to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
You can customize the icu_tokenizer
behavior by specifying per-script rule files; see the
RBBI rules syntax reference
for a more detailed explanation.
To add ICU tokenizer rules, set the rule_files
setting, which should contain a comma-separated list of
code:rulefile
pairs in the following format: a four-letter ISO 15924 script code,
followed by a colon, then a rule file name. Rule files are placed in the
ES_HOME/config directory.
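For example, rule files for several scripts can be combined in one setting. In the sketch below, the index name, tokenizer name, and rule file names are hypothetical placeholders, and the files would need to already exist in ES_HOME/config:
PUT icu_sample_multi
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_multi_script": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:MyLatinRules.rbbi,Cyrl:MyCyrillicRules.rbbi"
          }
        }
      }
    }
  }
}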
As a demonstration of how the rule files can be used, save the following user file to $ES_HOME/config/KeywordTokenizer.rbbi:
# Match the entire input as a single token, with rule status 200
.+ {200};
Then create an analyzer to use this rule file as follows:
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "icu_user_file": {
            "type": "icu_tokenizer",
            "rule_files": "Latn:KeywordTokenizer.rbbi"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "icu_user_file"
          }
        }
      }
    }
  }
}

POST icu_sample/_analyze?analyzer=my_analyzer&text=Elasticsearch. Wow!
The above analyze
request returns the following:
{ "tokens": [ { "token": "Elasticsearch. Wow!", "start_offset": 0, "end_offset": 19, "type": "<ALPHANUM>", "position": 0 } ] }