Pattern Analyzer

An analyzer of type pattern that can flexibly separate text into terms via a regular expression. It accepts the following settings:

Setting      Description

lowercase    Should terms be lowercased or not. Defaults to true.

pattern      The regular expression pattern. Defaults to \W+.

flags        The regular expression flags.

stopwords    A list of stopwords to initialize the stop filter with.
             Defaults to an empty stopword list. Added in 1.0.0.RC1:
             previously it defaulted to the English stopwords list. See
             the Stop Analyzer for more details.

IMPORTANT: The regular expression should match the token separators, not the tokens themselves.

Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS". See the Java Pattern API for more details about flag options.
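Outside Elasticsearch, the separator-matching behaviour can be sketched with Python's re.split, which likewise consumes whatever the pattern matches and returns the pieces in between. This is only an approximation: the analyzer uses Java's java.util.regex, whose syntax and flags differ slightly from Python's.

```python
import re

# The pattern matches the token separators, not the tokens themselves:
# everything it consumes is discarded, and the pieces left between
# matches become the terms. \W+ is the analyzer's default pattern.
text = "foo,bar baz"
tokens = [t.lower() for t in re.split(r"\W+", text) if t]
print(tokens)  # ['foo', 'bar', 'baz']
```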

Pattern Analyzer Examples

To try out these examples, delete the test index before running each one:

    curl -XDELETE localhost:9200/test
Whitespace tokenizer
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "whitespace":{
                        "type": "pattern",
                        "pattern": "\\s+"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
    # "foo,bar", "baz"
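The same split can be sketched in plain Python (an approximation of the Java regex the analyzer actually uses): because \s+ matches only whitespace, the comma is kept inside the first term.

```python
import re

# \s+ matches runs of whitespace only, so "foo,bar" survives as one term.
tokens = re.split(r"\s+", "foo,bar baz")
print(tokens)  # ['foo,bar', 'baz']
```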
Non-word character tokenizer
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "nonword":{
                        "type": "pattern",
                        "pattern": "[^\\w]+"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
    # "foo,bar baz" becomes "foo", "bar", "baz"

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
    # "type_1","type_4"
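Note that [^\w]+ is equivalent to the default \W+, and that \w includes the underscore, which is why type_1 stays whole. A rough Python equivalent (again approximating the Java regex):

```python
import re

# [^\w]+ splits on runs of characters that are not letters, digits or
# underscore, so "type_1" and "type_4" each remain a single term.
tokens = re.split(r"[^\w]+", "type_1-type_4")
print(tokens)  # ['type_1', 'type_4']
```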
CamelCase tokenizer
    curl -XPUT 'localhost:9200/test?pretty=1' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "camel":{
                        "type": "pattern",
                        "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
        MooseX::FTPClass2_beta
    '
    # "moose","x","ftp","class","2","beta"

The regex above is easier to understand as:

      ([^\p{L}\d]+)                  # swallow non letters and numbers,
    | (?<=\D)(?=\d)                  # or non-number followed by number,
    | (?<=\d)(?=\D)                  # or number followed by non-number,
    | (?<=[\p{L}&&[^\p{Lu}]])        # or lower case
      (?=\p{Lu})                     #   followed by upper case,
    | (?<=\p{Lu})                    # or upper case
      (?=\p{Lu}                      #   followed by upper case
        [\p{L}&&[^\p{Lu}]]           #   then lower case
      )
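For experimentation outside Elasticsearch, the expression can be approximated with Python's re module. The stdlib re has no \p{...} classes or && character-class intersection, so this ASCII-only sketch substitutes [a-z] and [A-Z]; it reproduces the example output above but would not handle non-ASCII letters the way the Java original does.

```python
import re

# ASCII-only approximation of the camel-case separator pattern above.
CAMEL = re.compile(r"""
      [^a-zA-Z\d]+               # swallow non letters and numbers,
    | (?<=\D)(?=\d)              # or non-number followed by number,
    | (?<=\d)(?=\D)              # or number followed by non-number,
    | (?<=[a-z])(?=[A-Z])        # or lower case followed by upper case,
    | (?<=[A-Z])(?=[A-Z][a-z])   # or upper case followed by upper, lower
""", re.VERBOSE)

tokens = [t.lower() for t in CAMEL.split("MooseX::FTPClass2_beta") if t]
print(tokens)  # ['moose', 'x', 'ftp', 'class', '2', 'beta']
```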