Keep types token filter

Keeps or removes tokens of a specific type. For example, you can use this filter to change 3 quick foxes to quick foxes by keeping only <ALPHANUM> (alphanumeric) tokens.

Token types

Token types are set by the tokenizer when converting characters to tokens. Token types can vary between tokenizers.

For example, the standard tokenizer can produce a variety of token types, including <ALPHANUM>, <HANGUL>, and <NUM>. Simpler tokenizers, like the lowercase tokenizer, only produce the word token type.

Certain token filters can also add token types. For example, the synonym filter can add the <SYNONYM> token type.

Some tokenizers don’t support this token filter because they don’t set the token type attribute. These include the keyword, simple_pattern, and simple_pattern_split tokenizers.

This filter uses Lucene’s TypeTokenFilter.
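To check which types a tokenizer assigns, you can read the type field that the analyze API returns for each token. The following is a minimal sketch in plain Ruby rather than a definitive recipe. The helper name token_types is hypothetical, and sample_response is a hand-written stand-in for an analyze API response to 3 quick foxes with the standard tokenizer (offsets and positions omitted), not captured output.

```ruby
# Hypothetical helper: pairs each token's text with its type.
# Expects a hash shaped like an analyze API response ('tokens' => [...]).
def token_types(response)
  response['tokens'].map { |t| [t['token'], t['type']] }
end

# Hand-written stand-in for a response; not output from a real cluster.
sample_response = {
  'tokens' => [
    { 'token' => '3',     'type' => '<NUM>' },
    { 'token' => 'quick', 'type' => '<ALPHANUM>' },
    { 'token' => 'foxes', 'type' => '<ALPHANUM>' }
  ]
}

# Print one "text<TAB>type" line per token.
token_types(sample_response).each do |text, type|
  puts "#{text}\t#{type}"
end
```

The type strings printed here are the values you would pass in the filter's types parameter.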

Include example

The following analyze API request uses the keep_types filter to keep only <NUM> (numeric) tokens from 1 quick fox 2 lazy dogs.

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      {
        type: 'keep_types',
        types: [
          '<NUM>'
        ]
      }
    ],
    text: '1 quick fox 2 lazy dogs'
  }
)
puts response

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ]
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}

The filter produces the following tokens:

[ 1, 2 ]

Exclude example

The following analyze API request uses the keep_types filter to remove <NUM> tokens from 1 quick fox 2 lazy dogs. Note that the mode parameter is set to exclude.

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      {
        type: 'keep_types',
        types: [
          '<NUM>'
        ],
        mode: 'exclude'
      }
    ],
    text: '1 quick fox 2 lazy dogs'
  }
)
puts response

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ],
      "mode": "exclude"
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}

The filter produces the following tokens:

[ quick, fox, lazy, dogs ]

Configurable parameters

types
(Required, array of strings) List of token types to keep or remove.

mode
(Optional, string) Indicates whether to keep or remove the specified token types. Valid values are:

include
(Default) Keep only the specified token types.

exclude
Remove the specified token types.
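
The effect of the types and mode parameters can be modeled in a few lines of plain Ruby. This is only a sketch of the documented behavior, not Lucene's TypeTokenFilter implementation. The Token struct and the keep_types method are hypothetical names, and the token types are hand-labeled to match what the standard tokenizer assigns in the examples above.

```ruby
# Hypothetical token record: the token text plus its assigned type.
Token = Struct.new(:text, :type)

# Models the documented behavior: 'include' keeps only the listed
# types, 'exclude' removes them. Any other mode is rejected.
def keep_types(tokens, types:, mode: 'include')
  case mode
  when 'include' then tokens.select { |t| types.include?(t.type) }
  when 'exclude' then tokens.reject { |t| types.include?(t.type) }
  else raise ArgumentError, "mode must be 'include' or 'exclude'"
  end
end

# Hand-labeled tokens for the text "1 quick fox 2 lazy dogs".
tokens = [
  Token.new('1', '<NUM>'),
  Token.new('quick', '<ALPHANUM>'),
  Token.new('fox', '<ALPHANUM>'),
  Token.new('2', '<NUM>'),
  Token.new('lazy', '<ALPHANUM>'),
  Token.new('dogs', '<ALPHANUM>')
]

kept = keep_types(tokens, types: ['<NUM>']).map(&:text)
removed = keep_types(tokens, types: ['<NUM>'], mode: 'exclude').map(&:text)

puts kept.inspect     # ["1", "2"]
puts removed.inspect  # ["quick", "fox", "lazy", "dogs"]
```

The two results match the token lists shown in the include and exclude examples.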

Customize and add to an analyzer

To customize the keep_types filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following create index API request uses a custom keep_types filter to configure a new custom analyzer. The custom keep_types filter keeps only <ALPHANUM> (alphanumeric) tokens.

response = client.indices.create(
  index: 'keep_types_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'standard',
            filter: [
              'extract_alpha'
            ]
          }
        },
        filter: {
          extract_alpha: {
            type: 'keep_types',
            types: [
              '<ALPHANUM>'
            ]
          }
        }
      }
    }
  }
)
puts response

PUT keep_types_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "extract_alpha" ]
        }
      },
      "filter": {
        "extract_alpha": {
          "type": "keep_types",
          "types": [ "<ALPHANUM>" ]
        }
      }
    }
  }
}
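
To try the custom analyzer, you could send an analyze API request against the new index. The sketch below only builds and prints the request arguments, so it runs without a cluster. The sample text is an assumption, and the commented-out call assumes a configured client object like the one used in the Ruby examples above.

```ruby
require 'json'

# Arguments for an analyze API request that runs my_analyzer from the
# keep_types_example index. The sample text is an assumption.
analyze_request = {
  index: 'keep_types_example',
  body: {
    analyzer: 'my_analyzer',
    text: '3 quick foxes'
  }
}

# Print the request body as JSON, as it would appear in a console request.
puts JSON.pretty_generate(analyze_request[:body])

# With a configured client (assumed, not created here):
# response = client.indices.analyze(**analyze_request)
# puts response
```

Because the custom filter keeps only <ALPHANUM> tokens, this request should return quick and foxes and drop 3.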