﻿---
title: Pattern analyzer
description: The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators, not the tokens...
url: https://www.elastic.co/docs/reference/text-analysis/analysis-pattern-analyzer
products:
  - Elasticsearch
---

# Pattern analyzer
The `pattern` analyzer uses a regular expression to split the text into terms. The regular expression should match the **token separators**, not the tokens themselves. The default regular expression is `\W+`, which matches any run of one or more non-word characters.
<admonition title="Beware of Pathological Regular Expressions">
  The pattern analyzer uses [Java Regular Expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly. Read more about [pathological regular expressions and how to avoid them](https://www.regular-expressions.info/catastrophic.html).
</admonition>


## Example output

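The following request body, sent to the analyze API (`_analyze`), runs the `pattern` analyzer on a sample sentence:
```json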
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

The above sentence would produce the following terms:
```text
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
```


## Configuration

The `pattern` analyzer accepts the following parameters:
<definitions>
  <definition term="pattern">
    A [Java regular expression](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html). Defaults to `\W+`.
  </definition>
  <definition term="flags">
    Java regular expression [flags](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary). Flags should be pipe-separated, for example `"CASE_INSENSITIVE|COMMENTS"`.
  </definition>
  <definition term="lowercase">
    Whether terms should be lowercased. Defaults to `true`.
  </definition>
  <definition term="stopwords">
    A pre-defined stop words list like `_english_` or an array containing a list of stop words. Defaults to `_none_`.
  </definition>
  <definition term="stopwords_path">
    The path to a file containing stop words.
  </definition>
</definitions>

See the [Stop Token Filter](https://www.elastic.co/docs/reference/text-analysis/analysis-stop-tokenfilter) for more information about stop word configuration.
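
As a rough illustration of how these parameters combine, the settings below sketch an analyzer that treats the word "and" (in any letter case, via the `CASE_INSENSITIVE` flag) as the token separator. The analyzer name `and_separated` and the pattern itself are assumptions made up for this example; only the parameter names come from the list above.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "and_separated": {
          "type":      "pattern",
          "pattern":   "\\s+and\\s+",
          "flags":     "CASE_INSENSITIVE",
          "lowercase": true,
          "stopwords": "_english_"
        }
      }
    }
  }
}
```

With settings like these, a text such as `Salt AND Pepper and thyme` would be expected to yield the terms `salt`, `pepper`, and `thyme`: the pattern matches the separators, the terms are then lowercased, and none of them is an English stop word.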

## Example configuration

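In this example, the `pattern` analyzer is configured to split email addresses on non-word characters or on underscores (`\W|_`) and to lowercase the result. The first JSON body defines the analyzer in the index settings; the second is an analyze request that uses it: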
```json

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", <1>
          "lowercase": true
        }
      }
    }
  }
}


{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
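```

1. The backslashes in the pattern must be escaped (doubled) when the pattern is written as a JSON string.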

The above example produces the following terms:
```text
[ john, smith, foo, bar, com ]
```


### CamelCase tokenizer

The following more complicated example splits CamelCase text into tokens:
```json

{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}


{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
```

The above example produces the following terms:
```text
[ moose, x, ftp, class, 2, beta ]
```

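The regex above is easier to understand when split across lines and annotated:
```text
  ([^\p{L}\d]+)                 # swallow non-letters and non-digits,
| (?<=\D)(?=\d)                 # or a non-digit followed by a digit,
| (?<=\d)(?=\D)                 # or a digit followed by a non-digit,
| (?<=[ \p{L} && [^\p{Lu}]])    # or a non-uppercase letter
  (?=\p{Lu})                    #   followed by an uppercase letter,
| (?<=\p{Lu})                   # or an uppercase letter
  (?=\p{Lu}                     #   followed by an uppercase letter
    [\p{L}&&[^\p{Lu}]]          #   and then a non-uppercase letter
  )
```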


## Definition

The `pattern` analyzer consists of:
<definitions>
  <definition term="Tokenizer">
    - [Pattern Tokenizer](https://www.elastic.co/docs/reference/text-analysis/analysis-pattern-tokenizer)
  </definition>
  <definition term="Token Filters">
    - [Lower Case Token Filter](https://www.elastic.co/docs/reference/text-analysis/analysis-lowercase-tokenfilter)
    - [Stop Token Filter](https://www.elastic.co/docs/reference/text-analysis/analysis-stop-tokenfilter) (disabled by default)
  </definition>
</definitions>

If you need to customize the `pattern` analyzer beyond its configuration parameters, recreate it as a `custom` analyzer and modify it, usually by adding token filters. The following settings recreate the built-in `pattern` analyzer and can serve as a starting point for further customization:
```json

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type":       "pattern",
          "pattern":    "\\W+" <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase"       <2>
          ]
        }
      }
    }
  }
}
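```

1. The default pattern is `\W+`, which splits on non-word characters. This is where you would change it.
2. Additional token filters would be added after `lowercase`.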
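
As one possible customization of the kind described above, the rebuilt analyzer could be extended with a stop token filter. This is a minimal sketch: the names `english_stop` and `rebuilt_pattern_with_stop` are illustrative assumptions, and the choice of the `_english_` stop word list is arbitrary.

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type":    "pattern",
          "pattern": "\\W+"
        }
      },
      "filter": {
        "english_stop": {
          "type":      "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "rebuilt_pattern_with_stop": {
          "type":      "custom",
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      }
    }
  }
}
```

Placing `english_stop` after `lowercase` mirrors the order of the token filters listed for the built-in analyzer, so stop words are matched against already lowercased terms.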