Token filter that generates bigrams for frequently occurring terms. Single terms are still indexed. It can be used as an alternative to the Stop Token Filter when we don’t want to completely ignore common terms.
For example, assuming "the", "is" and "a" are common words, the text "the quick brown is a fox" will be tokenized as "the", "the_quick", "quick", "brown", "brown_is", "is", "is_a", "a", "a_fox", "fox".
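The bigram generation described above can be sketched in a few lines of Python. This is only an illustrative model, not the filter's actual implementation: the function name `common_grams` and the underscore separator are assumptions taken from the example output, and the input is assumed to be already split on whitespace. The sketch keeps every single term, including the common ones, and adds a bigram for each adjacent pair in which at least one term is common.

```python
def common_grams(tokens, common_words):
    """Index-mode sketch: keep every single term and add a bigram for
    each adjacent pair in which at least one term is a common word."""
    common = set(common_words)
    out = []
    for i, term in enumerate(tokens):
        out.append(term)  # single terms are always kept
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if term in common or nxt in common:
                out.append(f"{term}_{nxt}")  # bigram joined with "_"
    return out


tokens = "the quick brown is a fox".split()
print(common_grams(tokens, ["the", "is", "a"]))
# ['the', 'the_quick', 'quick', 'brown', 'brown_is', 'is', 'is_a', 'a', 'a_fox', 'fox']
```

Note that no bigram is produced for "quick brown", because neither term is in the common word list.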
When query_mode is enabled, the token filter removes common words and
single terms followed by a common word. This parameter should be enabled
in the search analyzer.
For example, the query "the quick brown is a fox" will be tokenized as "the_quick", "quick", "brown_is", "is_a", "a_fox", "fox".
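The query-mode rule can be modelled the same way. The sketch below follows the description given here (drop common words and any single term followed by a common word) rather than the filter's real implementation, so its handling of edge cases may differ from the actual filter. The function name `common_grams_query` is hypothetical, and whitespace-split input is assumed.

```python
def common_grams_query(tokens, common_words):
    """Query-mode sketch: emit bigrams as in index mode, but drop common
    words and single terms that are followed by a common word."""
    common = set(common_words)
    out = []
    for i, term in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Keep a single term only if it is not common and the term
        # after it (if any) is not common either.
        if term not in common and (nxt is None or nxt not in common):
            out.append(term)
        # Emit a bigram whenever either member of the pair is common.
        if nxt is not None and (term in common or nxt in common):
            out.append(f"{term}_{nxt}")
    return out


tokens = "the quick brown is a fox".split()
print(common_grams_query(tokens, ["the", "is", "a"]))
# ['the_quick', 'quick', 'brown_is', 'is_a', 'a_fox', 'fox']
```

Here "brown" is dropped because it is followed by the common word "is", while "quick" survives because the term after it is not common.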
The following settings can be set:
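common_words: A list of common words to use.
common_words_path: A path (either relative to the config location, or absolute) to a list of common words.
ignore_case: If true, common words matching will be case insensitive.
query_mode: Generates bigrams, then removes common words and single terms followed by a common word (defaults to false).
Note that either the common_words or the common_words_path field is required.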
Here is an example:
index :
    analysis :
        analyzer :
            index_grams :
                tokenizer : whitespace
                filter : [common_grams]
            search_grams :
                tokenizer : whitespace
                filter : [common_grams_query]
        filter :
            common_grams :
                type : common_grams
                common_words: [a, an, the]
            common_grams_query :
                type : common_grams
                query_mode: true
                common_words: [a, an, the]