Stop token filter
Removes stop words from a token stream.
When not customized, the filter removes the following English stop words by default:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
In addition to English, the stop filter supports predefined stop word lists for several languages. You can also specify your own stop words as an array or file.
The stop filter uses Lucene’s StopFilter.
Example
The following analyze API request uses the stop filter to remove the stop words "a" and "the" from "a quick fox jumps over the lazy dog":
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stop" ],
  "text": "a quick fox jumps over the lazy dog"
}
The filter produces the following tokens:
[ quick, fox, jumps, over, lazy, dog ]
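To make the behavior concrete, here is a minimal Python sketch of what the request above does conceptually. It is not Lucene's StopFilter implementation: the whitespace split stands in for the standard tokenizer (an assumption that only holds for simple, punctuation-free text), and the names ENGLISH_STOP_WORDS and remove_stop_words are hypothetical.

```python
# Default English stop words, as listed at the top of this page.
ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})


def remove_stop_words(tokens, stop_words=ENGLISH_STOP_WORDS):
    """Return the tokens that are not in stop_words (exact, case-sensitive match)."""
    return [token for token in tokens if token not in stop_words]


# str.split() is only a stand-in for the standard tokenizer; the two agree
# for simple text like this sentence, but not in general.
tokens = "a quick fox jumps over the lazy dog".split()
print(remove_stop_words(tokens))
# prints ['quick', 'fox', 'jumps', 'over', 'lazy', 'dog']
```

Note that the match is case sensitive in this sketch, mirroring the filter's default of ignore_case being false.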
Add to an analyzer
The following create index API request uses the stop filter to configure a new custom analyzer.
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "stop" ]
        }
      }
    }
  }
}
Configurable parameters
- stopwords - (Optional, string or array of strings) Language value, such as _arabic_ or _thai_. Defaults to _english_.
  Each language value corresponds to a predefined list of stop words in Lucene. See Stop words by language for supported language values and their stop words.
  Also accepts an array of stop words.
  For an empty list of stop words, use _none_.
- stopwords_path - (Optional, string) Path to a file that contains a list of stop words to remove.
  This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each stop word in the file must be separated by a line break.
- ignore_case - (Optional, Boolean) If true, stop word matching is case insensitive. For example, if true, a stop word of "the" matches and removes "The", "THE", or "the". Defaults to false.
- remove_trailing - (Optional, Boolean) If true, the last token of a stream is removed if it’s a stop word. Defaults to true.
  This parameter should be false when using the filter with a completion suggester. This ensures that a query like "green a" matches and suggests "green apple" while still removing other stop words.
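The interaction between these parameters can be sketched in Python. This is a simplified model under stated assumptions, not the filter's actual implementation: load_stop_words and apply_stop_filter are hypothetical names, skipping blank lines in the file is an assumption, and the remove_trailing=False branch simply keeps the final token, which may be a coarser rule than the one the filter applies internally.

```python
import os
import tempfile


def load_stop_words(path):
    """Read a stopwords_path-style file: UTF-8, one stop word per line."""
    with open(path, encoding="utf-8") as handle:
        # Assumption: surrounding whitespace and blank lines are ignored.
        return [line.strip() for line in handle if line.strip()]


def apply_stop_filter(tokens, stopwords, ignore_case=False, remove_trailing=True):
    """Simplified model of the ignore_case and remove_trailing parameters."""
    if ignore_case:
        stop_set = {word.lower() for word in stopwords}
    else:
        stop_set = set(stopwords)

    def is_stop_word(token):
        return (token.lower() if ignore_case else token) in stop_set

    kept = []
    last_index = len(tokens) - 1
    for index, token in enumerate(tokens):
        if index == last_index and not remove_trailing:
            kept.append(token)  # assumption: the final token is always kept
        elif not is_stop_word(token):
            kept.append(token)
    return kept


# Write a small stop word file to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "stopwords.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("a\nthe\n")
    words = load_stop_words(path)

tokens = ["The", "green", "a"]
print(apply_stop_filter(tokens, words))
# prints ['The', 'green']
print(apply_stop_filter(tokens, words, ignore_case=True))
# prints ['green']
print(apply_stop_filter(tokens, words, ignore_case=True, remove_trailing=False))
# prints ['green', 'a']
```

The last call models the completion suggester case described above: the trailing "a" survives, so it can still be matched as a prefix of "apple", while the leading "The" is removed as usual.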
Customize
To customize the stop filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following request creates a custom case-insensitive stop filter that removes stop words from the _english_ stop words list:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true
        }
      }
    }
  }
}
You can also specify your own list of stop words. For example, the following request creates a custom case-insensitive stop filter that removes only the stop words "and", "is", and "the":
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is", "the" ]
        }
      }
    }
  }
}
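If you generate such requests from code, the settings body can be assembled programmatically. The following Python sketch builds the same JSON body as the request above using only the standard library. The helper build_stop_filter_settings is hypothetical, and sending the body to a cluster is deliberately left out.

```python
import json


def build_stop_filter_settings(filter_name, stopwords=None, ignore_case=False):
    """Build index settings that use a custom stop filter in the default analyzer."""
    stop_filter = {"type": "stop", "ignore_case": ignore_case}
    if stopwords is not None:
        # Omitting "stopwords" leaves the documented default, _english_.
        stop_filter["stopwords"] = stopwords
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {
                        "tokenizer": "whitespace",
                        "filter": [filter_name],
                    }
                },
                "filter": {filter_name: stop_filter},
            }
        }
    }


body = build_stop_filter_settings(
    "my_custom_stop_words_filter",
    stopwords=["and", "is", "the"],
    ignore_case=True,
)
print(json.dumps(body, indent=2))
```

Calling the helper without a stopwords argument yields the body of the earlier case-insensitive _english_ example instead.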
Stop words by language
The following list contains supported language values for the stopwords parameter and a link to their predefined stop words in Lucene.
- _arabic_ - Arabic stop words
- _armenian_ - Armenian stop words
- _basque_ - Basque stop words
- _bengali_ - Bengali stop words
- _brazilian_ (Brazilian Portuguese) - Brazilian Portuguese stop words
- _bulgarian_ - Bulgarian stop words
- _catalan_ - Catalan stop words
- _cjk_ (Chinese, Japanese, and Korean) - CJK stop words
- _czech_ - Czech stop words
- _danish_ - Danish stop words
- _dutch_ - Dutch stop words
- _english_ - English stop words
- _estonian_ - Estonian stop words
- _finnish_ - Finnish stop words
- _french_ - French stop words
- _galician_ - Galician stop words
- _german_ - German stop words
- _greek_ - Greek stop words
- _hindi_ - Hindi stop words
- _hungarian_ - Hungarian stop words
- _indonesian_ - Indonesian stop words
- _irish_ - Irish stop words
- _italian_ - Italian stop words
- _latvian_ - Latvian stop words
- _lithuanian_ - Lithuanian stop words
- _norwegian_ - Norwegian stop words
- _persian_ - Persian stop words
- _portuguese_ - Portuguese stop words
- _romanian_ - Romanian stop words
- _russian_ - Russian stop words
- _sorani_ - Sorani stop words
- _spanish_ - Spanish stop words
- _swedish_ - Swedish stop words
- _thai_ - Thai stop words
- _turkish_ - Turkish stop words