Stemming is the process of reducing a word to its root form. This ensures variants of a word match during a search.
For example, walking and walked can be stemmed to the same root word:
walk. Once stemmed, an occurrence of either word would match the other in a
search.
Stemming is language-dependent but often involves removing prefixes and suffixes from words.
In some cases, the root form of a stemmed word may not be a real word. For
example, jumping and jumpiness can both be stemmed to jumpi. While jumpi
isn’t a real English word, it doesn’t matter for search; if all variants of a
word are reduced to the same root form, they will match correctly.
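In Elasticsearch, stemming is handled by stemmer token filters. These token filters can be categorized based on how they stem words:
- Algorithmic stemmers, which stem words based on a set of rules
- Dictionary stemmers, which stem words by looking them up in a dictionary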
Because stemming changes tokens, we recommend using the same stemmer token filters during index and search analysis.
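To illustrate why the same stemming should be applied at index and search time, here is a minimal Python sketch. It is not Elasticsearch code: `toy_stem`, `analyze`, `build_index`, and `search` are hypothetical helpers, and the suffix rules are deliberately crude.

```python
from collections import defaultdict

def toy_stem(token: str) -> str:
    """Toy stemmer: strip one of a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text: str, stem: bool) -> list[str]:
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens] if stem else tokens

def build_index(docs: dict[int, str], stem: bool) -> dict[str, set[int]]:
    """Map each analyzed token to the IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text, stem):
            index[token].add(doc_id)
    return index

def search(index: dict[str, set[int]], query: str, stem: bool) -> set[int]:
    """Return the IDs of documents matching any analyzed query token."""
    hits: set[int] = set()
    for token in analyze(query, stem):
        hits |= index.get(token, set())
    return hits

docs = {1: "She walked home", 2: "He is walking"}
index = build_index(docs, stem=True)  # documents are stemmed at index time

same_analysis = search(index, "walked", stem=True)    # query becomes "walk"
mixed_analysis = search(index, "walked", stem=False)  # query stays "walked"
```

In this sketch, `same_analysis` contains both documents, while `mixed_analysis` is empty: the index only holds the stemmed token, so an unstemmed query token finds nothing.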
Algorithmic stemmers apply a series of rules to each word to reduce it to its
root form. For example, an algorithmic stemmer for English may remove the
-s and -es suffixes from the end of plural words.
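As a rough illustration of rule-based stemming, the following Python sketch strips plural suffixes. The function name `strip_plural` and its rule set are assumptions made for this example; real algorithmic stemmers such as Porter apply many more rules.

```python
def strip_plural(word: str) -> str:
    """Toy rule-based stemmer: remove a trailing -es or -s from a word."""
    # Rules are tried in order; the first one that applies wins.
    if word.endswith(("ches", "shes", "sses", "xes", "zes")):
        return word[:-2]  # "boxes" -> "box"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]  # "cats" -> "cat"
    return word  # no rule applies, for example "glass"

stems = {w: strip_plural(w) for w in ["cats", "boxes", "dishes", "glass"]}
```

Each rule only inspects and trims the text of the word itself, which is why this style of stemmer needs no external data.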
Algorithmic stemmers have a few advantages:
- They require little setup and usually work well out of the box.
- They use little memory.
- They are typically faster than dictionary stemmers.
However, most algorithmic stemmers only alter the existing text of a word. This means they may not work well with irregular words that don’t contain their root form, such as:
- be, are, and am
- mouse and mice
- foot and feet
The following token filters use algorithmic stemming:
- stemmer, which provides algorithmic stemming for several languages, some with additional variants.
- kstem, a stemmer for English that combines algorithmic stemming with a built-in dictionary.
- porter_stem, our recommended algorithmic stemmer for English.
- snowball, which uses Snowball-based stemming rules for several languages.
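As a sketch of how one of these filters might be wired into a custom analyzer, the following Python snippet builds an index settings body. The analyzer name `my_english_analyzer` is hypothetical, and we assume the body would be sent when creating an index.

```python
import json

# Hypothetical analyzer name; porter_stem is one of the filters listed above.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first so the stemmer sees normalized tokens.
                    "filter": ["lowercase", "porter_stem"],
                }
            }
        }
    }
}

# Serialize the settings as the JSON body of a create-index request.
request_body = json.dumps(index_settings, indent=2)
```

Swapping `porter_stem` for `kstem` or `snowball` in the filter list would select a different algorithmic stemmer, assuming its defaults suit your language.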
Dictionary stemmers look up words in a provided dictionary, replacing unstemmed word variants with stemmed words from the dictionary.
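The lookup idea can be sketched in a few lines of Python. The dictionary contents and the `dictionary_stem` helper are hypothetical, and real dictionary stemmers also apply prefix and suffix rules rather than a plain word-to-stem table.

```python
# Hypothetical miniature dictionary mapping word variants to their stems.
STEM_DICTIONARY = {
    "mice": "mouse",
    "feet": "foot",
    "walked": "walk",
    "walking": "walk",
}

def dictionary_stem(word: str) -> str:
    """Return the stem listed for the word, or the word itself if absent."""
    return STEM_DICTIONARY.get(word.lower(), word)

stems = [dictionary_stem(w) for w in ["mice", "Feet", "walking", "jumped"]]
```

Here "jumped" is returned unchanged because it has no dictionary entry, which shows how coverage depends entirely on the dictionary.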
In theory, dictionary stemmers are well suited for:
- Stemming irregular words
- Discerning between words that are spelled similarly but not related conceptually, such as organ and organization, or broker and broken
In practice, algorithmic stemmers typically outperform dictionary stemmers. This is because dictionary stemmers have the following disadvantages:
Dictionary quality
A dictionary stemmer is only as good as its dictionary. To work well, a dictionary must include a significant number of words, be updated regularly, and change with language trends. Often, by the time a dictionary is made available, it is incomplete and some of its entries are already outdated.
Size and performance
Dictionary stemmers must load all words, prefixes, and suffixes from their dictionaries into memory. This can use a significant amount of RAM. Low-quality dictionaries may also be less efficient with prefix and suffix removal, which can slow the stemming process significantly.
You can use the hunspell token filter to perform dictionary stemming.
If available, we recommend trying an algorithmic stemmer for your language
before using the hunspell token filter.
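For illustration, a settings body that configures a hunspell filter might look like the following Python sketch. The filter and analyzer names are hypothetical, and we assume an en_US Hunspell dictionary has already been installed on each node.

```python
import json

# Hypothetical names; assumes en_US Hunspell dictionary files are installed.
hunspell_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_en_us_dict_stemmer": {
                    "type": "hunspell",
                    "locale": "en_US",  # selects which dictionary to load
                }
            },
            "analyzer": {
                "my_dictionary_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_en_us_dict_stemmer"],
                }
            },
        }
    }
}

# Serialize the settings as the JSON body of a create-index request.
request_body = json.dumps(hunspell_settings, indent=2)
```

Because the dictionary is loaded into memory, it is worth checking heap usage after enabling a filter like this one.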
Sometimes stemming can produce shared root words that are spelled similarly but
not related conceptually. For example, a stemmer may reduce both skies and
skiing to the same root word: ski.
To prevent this and better control stemming, you can use the following token filters: