WARNING: Version 5.2 of Elasticsearch has passed its EOL date.

This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.

« Hunspell Token Filter Normalization Token Filter »

› › ›

Common Grams Token Filter

edit

IMPORTANT: This documentation is no longer updated. Refer to Elastic's version policy and the latest documentation.

Common Grams Token Filter

edit

Token filter that generates bigrams for frequently occurring terms. Single terms are still indexed. It can be used as an alternative to the Stop Token Filter when we don’t want to completely ignore common terms.

For example, the text "the quick brown is a fox" will be tokenized as "the", "the_quick", "quick", "brown", "brown_is", "is_a", "a_fox", "fox". Assuming "the", "is" and "a" are common words.

When query_mode is enabled, the token filter removes common words and single terms followed by a common word. This parameter should be enabled in the search analyzer.

For example, the query "the quick brown is a fox" will be tokenized as "the_quick", "quick", "brown_is", "is_a", "a_fox", "fox".

The following are settings that can be set:

Setting	Description
`common_words`	A list of common words to use.
`common_words_path`	A path (either relative to `config` location, or absolute) to a list of common words. Each word should be in its own "line" (separated by a line break). The file must be UTF-8 encoded.
`ignore_case`	If true, common words matching will be case insensitive (defaults to `false`).
`query_mode`	Generates bigrams then removes common words and single terms followed by a common word (defaults to `false`).

Note, common_words or common_words_path field is required.

Here is an example:

index :
    analysis :
        analyzer :
            index_grams :
                tokenizer : whitespace
                filter : [common_grams]
            search_grams :
                tokenizer : whitespace
                filter : [common_grams_query]
        filter :
            common_grams :
                type : common_grams
                common_words: [a, an, the]
            common_grams_query :
                type : common_grams
                query_mode: true
                common_words: [a, an, the]

« Hunspell Token Filter Normalization Token Filter »