Pattern Analyzeredit

An analyzer of type pattern that can flexibly separate text into terms via a regular expression. Accepts the following settings:

The following are settings that can be set for a pattern analyzer type:

lowercase

Should terms be lowercased or not. Defaults to true.

pattern

The regular expression pattern, defaults to \W+.

flags

The regular expression flags.

stopwords

A list of stopwords to initialize the stop filter with. Defaults to an empty stopword list Check Stop Analyzer for more details.

IMPORTANT: The regular expression should match the token separators, not the tokens themselves.

Flags should be pipe-separated, eg "CASE_INSENSITIVE|COMMENTS". Check Java Pattern API for more details about flags options.

Pattern Analyzer Examplesedit

In order to try out these examples, you should delete the test index before running each example.

Whitespace tokenizeredit
DELETE test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=whitespace&text=foo,bar baz

# "foo,bar", "baz"
Non-word character tokenizeredit
DELETE test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nonword": {
          "type": "pattern",
          "pattern": "[^\\w]+" 
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"

GET /test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
CamelCase tokenizeredit
DELETE test

PUT /test?pretty=1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"

The regex above is easier to understand as:

  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )