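For the sake of completeness, we will finish this chapter by explaining how to index stemmed words into the same field as unstemmed words. As an example, analyzing the sentence The quick foxes jumped would produce the following terms:
Pos 1: (the) Pos 2: (quick) Pos 3: (foxes,fox) Pos 4: (jumped,jump)
The stemmed and unstemmed forms occupy the same position.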
Read Is Stemming in situ a Good Idea before using this approach.
To achieve stemming in situ, we will use the
keyword_repeat token filter, which, like the
keyword_marker token filter (see
Preventing Stemming), marks each term as a keyword to prevent the subsequent
stemmer from touching it. However, it also repeats the term in the same
position, and this repeated term is stemmed.
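The behavior described above can be imitated in a few lines of Python. This is only a sketch of the idea, not how Elasticsearch implements it: `toy_stem` is a hypothetical stand-in for a real stemmer that strips a handful of English suffixes, and the tuple layout `(text, position, is_keyword)` is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, int, bool]  # (text, position, is_keyword)


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keyword_repeat(tokens: List[Token]) -> List[Token]:
    """Emit each token twice at the same position: once as a keyword, once not."""
    out: List[Token] = []
    for text, pos, _ in tokens:
        out.append((text, pos, True))   # protected copy, left unstemmed
        out.append((text, pos, False))  # repeated copy, open to stemming
    return out


def stemmer(tokens: List[Token]) -> List[Token]:
    """Stem every token that is not marked as a keyword."""
    return [(t if kw else toy_stem(t), pos, kw) for t, pos, kw in tokens]


def by_position(tokens: List[Token]) -> Dict[int, List[str]]:
    """Group term texts by their position for easy inspection."""
    grouped: Dict[int, List[str]] = {}
    for text, pos, _ in tokens:
        grouped.setdefault(pos, []).append(text)
    return grouped


words = "The quick foxes jumped".lower().split()
tokens = [(w, i, False) for i, w in enumerate(words, start=1)]
terms = by_position(stemmer(keyword_repeat(tokens)))
print(terms)
```

Every position ends up holding two terms, including positions where stemming changed nothing.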
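Using the keyword_repeat token filter alone would result in the following:
Pos 1: (the,the) Pos 2: (quick,quick) Pos 3: (foxes,fox) Pos 4: (jumped,jump)
For the and quick, the stemmed and unstemmed forms are the same, and so are repeated needlessly.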
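To prevent the useless repetition of terms that are the same in their stemmed
and unstemmed forms, we add the
unique token filter into the mix, configured with only_on_same_position so that it removes duplicates only when they occur in the same position.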
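A minimal sketch of such an analyzer definition follows, written as a Python dict that can be serialized to JSON and sent as the body of an index-creation request. The names `unique_stem` and `in_situ` are labels chosen for this sketch, and `porter_stem` is assumed as the stemmer; any other stemmer filter could take its place.

```python
import json

# Index settings for an in situ stemming analyzer (a sketch, not a recommendation).
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical filter name; "unique" is the built-in filter type.
                "unique_stem": {
                    "type": "unique",
                    # Remove duplicates only when they share a position.
                    "only_on_same_position": True,
                }
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "in_situ": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "keyword_repeat",  # index each term twice, once as a keyword
                        "porter_stem",     # stem the non-keyword copy
                        "unique_stem",     # drop copies that stemming left unchanged
                    ],
                }
            },
        }
    }
}

request_body = json.dumps(index_settings, indent=2)
print(request_body)
```

The order of the filters matters: keyword_repeat has to come before the stemmer, and the unique filter has to come after it.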
People like the idea of stemming in situ: “Why use an unstemmed field and a stemmed field if I can just use one combined field?” But is it a good idea? The answer is almost always no. There are two problems.
The first is the inability to separate exact matches from inexact matches. In
this chapter, we have seen that words with different meanings are often
conflated to the same stem word:
organs and organization both stem to organ.
In Using Language Analyzers, we demonstrated how to combine a query on a stemmed field (to increase recall) with a query on an unstemmed field (to improve relevance). When the stemmed and unstemmed fields are separate, the contribution of each field can be tuned by boosting one field over another (see Prioritizing Clauses). If, instead, the stemmed and unstemmed forms appear in the same field, there is no way to tune your search results.
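For contrast, a search request against separate fields might look like the sketch below, again written as a Python dict that serializes to the JSON request body. The field names `title` (stemmed) and `title.std` (unstemmed) and the boost of 2 are assumptions made for illustration; the point is only that each field carries its own weight, which a single combined field cannot offer.

```python
import json

# Query both fields and give exact (unstemmed) matches more weight.
search_body = {
    "query": {
        "multi_match": {
            "query": "organization",
            "type": "most_fields",
            "fields": [
                "title",        # hypothetical stemmed field: broad recall
                "title.std^2",  # hypothetical unstemmed field, boosted
            ],
        }
    }
}

print(json.dumps(search_body, indent=2))
```

Raising or lowering the boost after the caret shifts the balance between exact and stemmed matches.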
The second issue has to do with how the
relevance score is calculated. In
What Is Relevance?, we explained that part of the calculation depends on the
inverse document frequency — how often a word appears in all the documents
in our index.
Using in situ stemming for a document that contains the text
jump jumped jumps would result in these terms:
Pos 1: (jump) Pos 2: (jumped,jump) Pos 3: (jumps,jump)
The words jumped and jumps appear once each and so would have the correct IDF, whereas
jump appears three times, greatly reducing its value as a search term in
comparison with the unstemmed forms.
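The effect on IDF can be made concrete with a small calculation. The sketch below makes several assumptions: a hypothetical four-document corpus, a toy suffix-stripping function in place of a real stemmer, and the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)). Because IDF is driven by how many documents contain a term, the three word forms are spread across separate documents here.

```python
import math
from typing import Set


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def in_situ_terms(text: str) -> Set[str]:
    """Index both the original and the stemmed form of every word."""
    terms: Set[str] = set()
    for word in text.lower().split():
        terms.add(word)
        terms.add(toy_stem(word))
    return terms


# Hypothetical corpus, one string per document.
docs = ["jump", "he jumped", "she jumps", "quick fox"]
indexed = [in_situ_terms(d) for d in docs]


def doc_freq(term: str) -> int:
    """Number of documents whose indexed terms include this term."""
    return sum(1 for terms in indexed if term in terms)


def idf(term: str) -> float:
    """Classic Lucene-style inverse document frequency."""
    return 1.0 + math.log(len(docs) / (doc_freq(term) + 1))


for term in ("jump", "jumped", "jumps"):
    print(term, doc_freq(term), round(idf(term), 3))
```

In this corpus the stem jump is present in every document that contains any of the three forms, so its IDF comes out lower than that of jumped or jumps.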
For these reasons, we recommend against using stemming in situ.