What is interesting is the algorithm that is used to identify words. The
whitespace tokenizer simply breaks on whitespace—spaces, tabs, line
feeds, and so forth—and assumes that contiguous nonwhitespace characters form a
single token. For instance:
GET /_analyze?tokenizer=whitespace
You're the 1st runner home!
This request would return the following terms: You're, the, 1st, runner, home!
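The whitespace rule is simple enough to sketch in a few lines of Python. This is only an illustration of the splitting behavior, not the tokenizer's actual implementation, and the function name `whitespace_tokenize` is a hypothetical one chosen for this example.

```python
def whitespace_tokenize(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, line feeds) and drops empty strings, so each
    # contiguous run of nonwhitespace characters becomes one token.
    return text.split()


tokens = whitespace_tokenize("You're the 1st runner home!")
print(tokens)  # ["You're", 'the', '1st', 'runner', 'home!']
```

Note that punctuation stays attached to its neighboring word (`home!`), because nothing other than whitespace is treated as a boundary.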
The standard tokenizer uses the Unicode Text Segmentation algorithm (as
defined in Unicode Standard Annex #29) to
find the boundaries between words, and emits everything in between. Its
knowledge of Unicode allows it to successfully tokenize text containing a
mixture of languages.
GET /_analyze?tokenizer=standard
You're my 'favorite'.
In this example, the apostrophe in You're is treated as part of the
word, while the single quotes in 'favorite' are not, resulting in the
following terms: You're, my, favorite.
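The position-dependent handling of punctuation can be approximated with a regular expression in Python. This is a rough sketch, not an implementation of Unicode Standard Annex #29: it assumes that an apostrophe or period is kept only when it sits between two word characters, and the names `WORD` and `approx_standard_tokenize` are hypothetical.

```python
import re

# Assumption: a token is a run of word characters, optionally continued
# by an apostrophe or period that is immediately followed by more word
# characters. Punctuation at the edge of a word is therefore dropped.
WORD = re.compile(r"\w+(?:['.]\w+)*")


def approx_standard_tokenize(text):
    # findall returns every non-overlapping match, left to right.
    return WORD.findall(text)


print(approx_standard_tokenize("You're my 'favorite'."))
```

The apostrophe inside You're has a letter on both sides, so it survives; the quotes around favorite and the final period do not. The real algorithm covers many more character classes and scripts than this pattern does.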
The uax_url_email tokenizer works in exactly the same way as the standard
tokenizer, except that it recognizes email addresses and URLs and emits them as
single tokens. The standard tokenizer, on the other hand, would try to
break them into individual words. For instance, the email address
email@example.com would result in the tokens email and example.com.
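The difference between the two tokenizers can be illustrated with a small Python sketch that tries an email or URL pattern before falling back to ordinary words. The patterns below are deliberately simplified assumptions, not the grammar the real tokenizers use, and all names (`EMAIL_OR_URL`, `WORD`, `words_only`, `emails_and_urls_first`) are hypothetical.

```python
import re

# Assumption: very loose email and URL shapes, good enough for a demo.
EMAIL_OR_URL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|https?://\S+"
# Assumption: words may contain an apostrophe or period between word characters.
WORD = r"\w+(?:['.]\w+)*"

WORD_RE = re.compile(WORD)
# Alternation is tried left to right, so emails and URLs win over plain words.
TOKEN_RE = re.compile(f"{EMAIL_OR_URL}|{WORD}")


def words_only(text):
    # Mimics the word-by-word behavior: '@' acts as a boundary.
    return WORD_RE.findall(text)


def emails_and_urls_first(text):
    # Mimics keeping email addresses and URLs as single tokens.
    return TOKEN_RE.findall(text)


text = "Contact email@example.com today"
print(words_only(text))
print(emails_and_urls_first(text))
```

With the word-only pattern the address is split at the @ sign, while the combined pattern keeps it whole. The URL pattern here would also swallow trailing punctuation, which a production tokenizer handles more carefully.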
The standard tokenizer is a reasonable starting point for tokenizing most
languages, especially Western languages. In fact, it forms the basis of most
of the language-specific analyzers. Its support for Asian languages, however,
is limited, and you should consider using the icu_tokenizer instead, which is
available in the ICU plug-in.