Breaking text into tokens is only half the job. To make those tokens more easily searchable, they need to go through a normalization process to remove insignificant differences between otherwise identical words, such as uppercase versus lowercase. Perhaps we also need to remove significant differences, to make esta, ésta, and está all searchable as the same word. Would you search for déjà vu, or just for deja vu?
This is the job of the token filters, which receive a stream of tokens from the tokenizer. You can have multiple token filters, each doing its particular job. Each receives the token stream output by the token filter before it.
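As a concrete sketch of such a chain, consider testing it with Elasticsearch's _analyze API. The request below uses the older query-string form from Elasticsearch 1.x; newer versions expect a JSON request body instead, so adjust for your version. The standard tokenizer emits the raw tokens, the lowercase token filter rewrites them, and the asciifolding token filter then receives lowercase's output and strips the diacritics:

GET /_analyze?tokenizer=standard&filters=lowercase,asciifolding
DÉJÀ Vu!

This request would emit the tokens deja and vu: the tokenizer produces DÉJÀ and Vu, lowercase turns them into déjà and vu, and asciifolding folds déjà into deja. Because each filter consumes the previous filter's output, the order in which you list the filters can change the result.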