We are working on updating this book for the latest version. Some content might be out of date.
The example we
showed in Formatting Synonyms, used
u s a as a synonym. Why
did we use that instead of
U.S.A.? The reason is that the
filter sees only the terms that the previous token filter or tokenizer has
Imagine that we have an analyzer that consists of the
lowercase token filter followed by a
synonym token filter. The
analysis process for the text
U.S.A. would look like this:
original string → "U.S.A." standard tokenizer → (U),(S),(A) lowercase token filter → (u),(s),(a) synonym token filter → (usa)
If we had specified the synonym as
U.S.A., it would never match anything
because, by the time
my_synonym_filter sees the terms, the periods have been
removed and the letters have been lowercased.
This is an important point to consider. What if we want to combine synonyms
with stemming, so that
leap are all indexed as the single term
could place the synonyms
filter before the stemmer and list all inflections:
"jumps,jumped,leap,leaps,leaped => jump"
But the more concise way would be to place the synonyms filter after the stemmer, and to list just the root words that would be emitted by the stemmer:
"leap => jump"
Normally, synonym filters are placed after the
lowercase token filter and so
all synonyms are
written in lowercase, but sometimes that can lead to odd
conflations. For instance, a
CAT scan and a
cat are quite different, as
PET (positron emission tomography) and a
pet. For that matter, the
Little is distinct from the adjective
little (although if a
sentence starts with the adjective, it will be uppercased anyway).
If you need use case to distinguish between word senses, you will need to
place your synonym filter before the
lowercase filter. Of course, that means
that your synonym rules would need to list all of the case variations that you
want to match (for example,
Instead of that, you could have two synonym filters: one to catch the case-sensitive synonyms and one for all the case-insensitive synonyms. For instance, the case-sensitive rules could look like this:
"CAT,CAT scan => cat_scan" "PET,PET scan => pet_scan" "Johnny Little,J Little => johnny_little" "Johnny Small,J Small => johnny_small"
And the case-insensitive rules could look like this:
"cat => cat,pet" "dog => dog,pet" "cat scan,cat_scan scan => cat_scan" "pet scan,pet_scan scan => pet_scan" "little,small"
The case-sensitive rules would
CAT scan but would match only the
CAT scan. For this reason, we have the odd-looking rule
scan in the case-insensitive list to catch bad replacements.
You can see how quickly it can get complicated. As always, the
is your friend—use it to check that your analyzers are configured
correctly. See Testing Analyzers.