Text analysis enables Elasticsearch to perform full-text search, where the search returns all relevant results rather than just exact matches.
If you search for
Quick fox jumps, you probably want the document that
A quick brown fox jumps over the lazy dog, and you might also want
documents that contain related words like
fast fox or
Analysis makes full-text search possible through tokenization: breaking a text down into smaller chunks, called tokens. In most cases, these tokens are individual words.
If you index the phrase
the quick brown fox jumps as a single string and the
user searches for
quick fox, it isn’t considered a match. However, if you
tokenize the phrase and index each word separately, the terms in the query
string can be looked up individually. This means they can be matched by searches
fox brown, or other variations.
Tokenization enables matching on individual terms, but each token is still matched literally. This means:
A search for
Quickwould not match
quick, even though you likely want either term to match the other
foxesshare the same root word, a search for
foxeswould not match
foxor vice versa.
A search for
jumpswould not match
leaps. While they don’t share a root word, they are synonyms and have a similar meaning.
To solve these problems, text analysis can normalize these tokens into a standard format. This allows you to match tokens that are not exactly the same as the search terms, but similar enough to still be relevant. For example:
Quickcan be lowercased:
foxescan be stemmed, or reduced to its root word:
leapare synonyms and can be indexed as a single word:
To ensure search terms match these words as intended, you can apply the same
tokenization and normalization rules to the query string. For example, a search
Foxes leap can be normalized to a search for
Text analysis is performed by an analyzer, a set of rules that govern the entire process.
Elasticsearch includes a default analyzer, called the standard analyzer, which works well for most use cases right out of the box.
If you want to tailor your search experience, you can choose a different built-in analyzer or even configure a custom one. A custom analyzer gives you control over each step of the analysis process, including:
- Changes to the text before tokenization
- How text is converted to tokens
- Normalization changes made to tokens before indexing or search