Pattern replace token filter

Uses a regular expression to match and replace token substrings.

The pattern_replace filter uses Java’s regular expression syntax. By default, the filter replaces matching substrings with an empty substring (""). Replacement substrings can use Java’s $g syntax, where g is a capture group number (for example, $1 for the first group), to reference capture groups from the original token text.

A poorly written regular expression may run slowly or throw a StackOverflowError, causing the node running the expression to exit suddenly.

This filter uses Lucene’s PatternReplaceFilter.

Example

The following analyze API request uses the pattern_replace filter to prepend watch to the substring dog in foxes jump lazy dogs.

response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'pattern_replace',
        pattern: '(dog)',
        replacement: 'watch$1'
      }
    ],
    text: 'foxes jump lazy dogs'
  }
)
puts response

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "(dog)",
      "replacement": "watch$1"
    }
  ],
  "text": "foxes jump lazy dogs"
}

The filter produces the following tokens.

[ foxes, jump, lazy, watchdogs ]
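
To see why only the last token changes, the request can be approximated locally. The following is a minimal Ruby sketch, not the Lucene implementation: it assumes that splitting on whitespace stands in for the whitespace tokenizer, and the helper name pattern_replace_tokens is hypothetical. Ruby writes the group reference as \1 where Java's replacement syntax uses $1, and the two regular expression engines differ in other details.

```ruby
# Rough local approximation of the analyze request above.
# Assumption: String#split stands in for the whitespace tokenizer, and
# Ruby's regex engine stands in for Java's (the two differ in details).
def pattern_replace_tokens(text, pattern, replacement)
  # Apply the replacement to each token independently, as a token filter would.
  text.split.map { |token| token.gsub(pattern, replacement) }
end

# Ruby's '\1' plays the role of Java's "$1" in the replacement string.
tokens = pattern_replace_tokens('foxes jump lazy dogs', /(dog)/, 'watch\1')
puts tokens.inspect
```

Only dogs contains the substring dog, so the other three tokens pass through unchanged.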

Configurable parameters

all: (Optional, Boolean) If true, all substrings matching the pattern parameter’s regular expression are replaced. If false, the filter replaces only the first matching substring in each token. Defaults to true.
pattern: (Required, string) Regular expression, written in Java’s regular expression syntax. The filter replaces token substrings matching this pattern with the substring in the replacement parameter.
replacement: (Optional, string) Replacement substring. Defaults to an empty substring ("").
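
The interaction between the three parameters can be illustrated with a small Ruby sketch. The helper pattern_replace below is hypothetical and only mirrors the parameter names and defaults listed above; it assumes that Ruby's gsub and sub approximate all set to true and false respectively.

```ruby
# Hypothetical helper mirroring the filter's three parameters and defaults.
# Assumption: gsub approximates all: true, sub approximates all: false.
def pattern_replace(token, pattern:, replacement: '', all: true)
  all ? token.gsub(pattern, replacement) : token.sub(pattern, replacement)
end

puts pattern_replace('a-b-c', pattern: /-/)                   # every match removed
puts pattern_replace('a-b-c', pattern: /-/, all: false)       # first match only
puts pattern_replace('a-b-c', pattern: /-/, replacement: '_') # custom replacement
```

With the defaults, every hyphen is removed; with all set to false, only the first one is.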

Customize and add to an analyzer

To customize the pattern_replace filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

The following create index API request configures a new custom analyzer using a custom pattern_replace filter, my_pattern_replace_filter.

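The my_pattern_replace_filter filter uses the regular expression [£|€] to match and remove the currency symbols £ and €. Because | is a literal character inside a character class, this pattern also matches |; the class [£€] would match only the two currency symbols. The filter’s all parameter is false, meaning only the first matching symbol in each token is removed.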

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'keyword',
            filter: [
              'my_pattern_replace_filter'
            ]
          }
        },
        filter: {
          my_pattern_replace_filter: {
            type: 'pattern_replace',
            pattern: '[£|€]',
            replacement: '',
            all: false
          }
        }
      }
    }
  }
)
puts response

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "my_pattern_replace_filter"
          ]
        }
      },
      "filter": {
        "my_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "[£|€]",
          "replacement": "",
          "all": false
        }
      }
    }
  }
}
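
The effect of this analyzer on sample input can be sketched locally in Ruby. This is an approximation under stated assumptions, not output from Elasticsearch: the hypothetical helper my_analyzer_sketch assumes the keyword tokenizer emits the whole input as a single token, and uses String#sub to model all set to false.

```ruby
# Assumption: the keyword tokenizer emits the whole input as one token,
# so the filter sees the full string. String#sub models all: false.
def my_analyzer_sketch(text)
  [text.sub(/[£|€]/, '')]
end

puts my_analyzer_sketch('£€100').inspect # only the first symbol is removed
puts my_analyzer_sketch('10|20').inspect # inside [...], | is a literal character
```

The second call shows that the character class also removes a pipe character, which may or may not be intended.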
