Word delimiter graph token filter
Splits tokens at non-alphanumeric characters. The word_delimiter_graph filter
also performs optional token normalization based on a set of rules. By default,
the filter uses the following rules:
- Split tokens at non-alphanumeric characters. The filter uses these characters as delimiters. For example: Super-Duper → Super, Duper
- Remove leading or trailing delimiters from each token. For example: XL---42+'Autocoder' → XL, 42, Autocoder
- Split tokens at letter case transitions. For example: PowerShot → Power, Shot
- Split tokens at letter-number transitions. For example: XL500 → XL, 500
- Remove the English possessive ('s) from the end of each token. For example: Neil's → Neil
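The default rules above can be approximated in a few lines of plain Ruby. This is only an illustrative sketch, not Lucene's WordDelimiterGraphFilter: the helper name `approximate_word_delimiter` is hypothetical, it handles ASCII letters and digits only, and it ignores positions, offsets, and every configurable parameter.

```ruby
# Rough, ASCII-only approximation of the filter's default rules.
# `approximate_word_delimiter` is a hypothetical helper for illustration;
# it is not part of Elasticsearch or Lucene.
def approximate_word_delimiter(text)
  # Remove the English possessive ('s) when it follows a letter and is
  # itself followed by a delimiter or the end of the input.
  stripped = text.gsub(/(?<=[A-Za-z])'[sS](?=[^A-Za-z0-9]|\z)/, '')

  # Split at runs of non-alphanumeric characters. Dropping empty strings
  # also discards leading and trailing delimiters.
  chunks = stripped.split(/[^A-Za-z0-9]+/).reject(&:empty?)

  # Split each chunk at lowercase-to-uppercase and letter-number
  # transitions. An uppercase run followed by lowercase stays together.
  chunks.flat_map { |chunk| chunk.scan(/[A-Z]*[a-z]+|[A-Z]+|[0-9]+/) }
end

p approximate_word_delimiter("Neil's-Super-Duper-XL500--42+AutoCoder")
# => ["Neil", "Super", "Duper", "XL", "500", "42", "Auto", "Coder"]
```

To see what the real filter produces for a given input, use the analyze API.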
The word_delimiter_graph filter uses Lucene’s
WordDelimiterGraphFilter.
The word_delimiter_graph filter was designed to remove punctuation from
complex identifiers, such as product IDs or part numbers. For these use cases,
we recommend using the word_delimiter_graph filter with the
keyword tokenizer.
Avoid using the word_delimiter_graph filter to split hyphenated words, such as
wi-fi. Because users often search for these words both with and without
hyphens, we recommend using the
synonym_graph filter instead.
Example
The following analyze API request uses the
word_delimiter_graph filter to split Neil's-Super-Duper-XL500--42+AutoCoder
into normalized tokens using the filter’s default rules:
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'word_delimiter_graph'
    ],
    text: "Neil's-Super-Duper-XL500--42+AutoCoder"
  }
)
puts response

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter_graph" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
The filter produces the following tokens:
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
Add to an analyzer
The following create index API request uses the
word_delimiter_graph filter to configure a new
custom analyzer.
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter_graph" ]
        }
      }
    }
  }
}
Avoid using the word_delimiter_graph filter with tokenizers that remove
punctuation, such as the standard tokenizer.
This could prevent the word_delimiter_graph filter from splitting tokens
correctly. It can also interfere with the filter’s configurable parameters, such
as catenate_all or
preserve_original. We
recommend using the keyword or
whitespace tokenizer instead.
Configurable parameters
- adjust_offsets
  (Optional, Boolean) If true, the filter adjusts the offsets of split or catenated tokens to better reflect their actual position in the token stream. Defaults to true.
  Set adjust_offsets to false if your analyzer uses filters, such as the trim filter, that change the length of tokens without changing their offsets. Otherwise, the word_delimiter_graph filter could produce tokens with illegal offsets.
- catenate_all
  (Optional, Boolean) If true, the filter produces catenated tokens for chains of alphanumeric characters separated by non-alphabetic delimiters. For example: super-duper-xl-500 → [ superduperxl500, super, duper, xl, 500 ]. Defaults to false.
  Setting this parameter to true produces multi-position tokens, which are not supported by indexing.
  If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.
  When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
- catenate_numbers
  (Optional, Boolean) If true, the filter produces catenated tokens for chains of numeric characters separated by non-alphabetic delimiters. For example: 01-02-03 → [ 010203, 01, 02, 03 ]. Defaults to false.
  Setting this parameter to true produces multi-position tokens, which are not supported by indexing.
  If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.
  When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
- catenate_words
  (Optional, Boolean) If true, the filter produces catenated tokens for chains of alphabetical characters separated by non-alphabetic delimiters. For example: super-duper-xl → [ superduperxl, super, duper, xl ]. Defaults to false.
  Setting this parameter to true produces multi-position tokens, which are not supported by indexing.
  If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.
  When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
- generate_number_parts
  (Optional, Boolean) If true, the filter includes tokens consisting of only numeric characters in the output. If false, the filter excludes these tokens from the output. Defaults to true.
- generate_word_parts
  (Optional, Boolean) If true, the filter includes tokens consisting of only alphabetical characters in the output. If false, the filter excludes these tokens from the output. Defaults to true.
- ignore_keywords
  (Optional, Boolean) If true, the filter skips tokens with a keyword attribute of true. Defaults to false.
- preserve_original
  (Optional, Boolean) If true, the filter includes the original version of any split tokens in the output. This original version includes non-alphanumeric delimiters. For example: super-duper-xl-500 → [ super-duper-xl-500, super, duper, xl, 500 ]. Defaults to false.
  Setting this parameter to true produces multi-position tokens, which are not supported by indexing.
  If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.
- protected_words
  (Optional, array of strings) Array of tokens the filter won’t split.
- protected_words_path
  (Optional, string) Path to a file that contains a list of tokens the filter won’t split.
  This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each token in the file must be separated by a line break.
- split_on_case_change
  (Optional, Boolean) If true, the filter splits tokens at letter case transitions. For example: camelCase → [ camel, Case ]. Defaults to true.
- split_on_numerics
  (Optional, Boolean) If true, the filter splits tokens at letter-number transitions. For example: j2se → [ j, 2, se ]. Defaults to true.
- stem_english_possessive
  (Optional, Boolean) If true, the filter removes the English possessive ('s) from the end of each token. For example: O'Neil's → [ O, Neil ]. Defaults to true.
- type_table
  (Optional, array of strings) Array of custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.
  For example, the following array maps the plus (+) and hyphen (-) characters as alphanumeric, which means they won’t be treated as delimiters:
  [ "+ => ALPHA", "- => ALPHA" ]
  Supported types include:
  - ALPHA (Alphabetical)
  - ALPHANUM (Alphanumeric)
  - DIGIT (Numeric)
  - LOWER (Lowercase alphabetical)
  - SUBWORD_DELIM (Non-alphanumeric delimiter)
  - UPPER (Uppercase alphabetical)
- type_table_path
  (Optional, string) Path to a file that contains custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.
  For example, the contents of this file may contain the following:
  # Map the $, %, '.', and ',' characters to DIGIT
  # This might be useful for financial data.
  $ => DIGIT
  % => DIGIT
  . => DIGIT
  \\u002C => DIGIT

  # in some cases you might not want to split on ZWJ
  # this also tests the case where we need a bigger byte[]
  # see https://en.wikipedia.org/wiki/Zero-width_joiner
  \\u200D => ALPHANUM
  Supported types include:
  - ALPHA (Alphabetical)
  - ALPHANUM (Alphanumeric)
  - DIGIT (Numeric)
  - LOWER (Lowercase alphabetical)
  - SUBWORD_DELIM (Non-alphanumeric delimiter)
  - UPPER (Uppercase alphabetical)
  This file path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each mapping in the file must be separated by a line break.
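Several parameters above recommend placing the flatten_graph filter after this filter when a catenating configuration is used in an index analyzer. The sketch below assembles such a create index request body as a plain Ruby hash and prints it as JSON. The names `catenating_index_settings`, `my_index_analyzer`, and `my_catenating_filter` are hypothetical and chosen only for illustration. The hash is not sent to a cluster here.

```ruby
require 'json'

# Builds a settings body in which flatten_graph runs directly after a
# word_delimiter_graph filter that has catenate_all enabled.
# All analyzer and filter names are illustrative assumptions.
def catenating_index_settings
  {
    settings: {
      analysis: {
        analyzer: {
          my_index_analyzer: {
            tokenizer: 'keyword',
            # Order matters: flatten the graph after it is produced.
            filter: ['my_catenating_filter', 'flatten_graph']
          }
        },
        filter: {
          my_catenating_filter: {
            type: 'word_delimiter_graph',
            catenate_all: true
          }
        }
      }
    }
  }
end

puts JSON.pretty_generate(catenating_index_settings)
```

The printed JSON could serve as the body of a create index request. Flattening may lose some of the graph information, so it is worth checking the resulting tokens with the analyze API before relying on this setup.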
Customize
To customize the word_delimiter_graph filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.
For example, the following request creates a word_delimiter_graph
filter that uses the following rules:
- Split tokens at non-alphanumeric characters, except the hyphen (-) character.
- Remove leading or trailing delimiters from each token.
- Do not split tokens at letter case transitions.
- Do not split tokens at letter-number transitions.
- Remove the English possessive ('s) from the end of each token.
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_custom_word_delimiter_graph_filter" ]
        }
      },
      "filter": {
        "my_custom_word_delimiter_graph_filter": {
          "type": "word_delimiter_graph",
          "type_table": [ "- => ALPHA" ],
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": true
        }
      }
    }
  }
}
Differences between word_delimiter_graph and word_delimiter
Both the word_delimiter_graph and
word_delimiter filters produce tokens
that span multiple positions when any of the following parameters are true:
- catenate_all
- catenate_numbers
- catenate_words
- preserve_original
However, only the word_delimiter_graph filter assigns multi-position tokens a
positionLength attribute, which indicates the number of positions a token
spans. This ensures the word_delimiter_graph filter always produces valid
token graphs.
The word_delimiter filter does not assign multi-position tokens a
positionLength attribute. This means it produces invalid graphs for streams
including these tokens.
While indexing does not support token graphs containing multi-position tokens,
queries, such as the match_phrase query, can
use these graphs to generate multiple sub-queries from a single query string.
To see how token graphs produced by the word_delimiter and
word_delimiter_graph filters differ, check out the following example.
Example
Both the word_delimiter and word_delimiter_graph filters produce the same token
graph for PowerShot2000 when the following parameters are false:
- catenate_all
- catenate_numbers
- catenate_words
- preserve_original
This graph does not contain multi-position tokens. All tokens span only one position.
word_delimiter_graph graph with a multi-position token
The word_delimiter_graph filter produces the following token graph for
PowerShot2000 when catenate_words is true.
This graph correctly indicates the catenated PowerShot token spans two
positions.
word_delimiter graph with a multi-position token
When catenate_words is true, the word_delimiter filter produces
the following token graph for PowerShot2000.
Note that the catenated PowerShot token should span two positions but only
spans one in the token graph, making it invalid.
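The effect of the missing positionLength attribute can be sketched by enumerating the paths through each token graph. The Ruby below is a simplified model, not Elasticsearch or Lucene code: the `Token` struct, the `token_paths` helper, and the exact position values are assumptions inferred from the descriptions above, with the catenated PowerShot token spanning two positions in one stream and one position in the other.

```ruby
# Simplified token model. position is where the token starts and
# position_length is how many positions it spans. Values are assumed
# from the descriptions above, not read from a real analyzer.
Token = Struct.new(:term, :position, :position_length)

# Returns every sequence of terms that walks the graph from `pos`
# to the final position `last`.
def paths_from(tokens, pos, last)
  return [[]] if pos == last
  tokens.select { |t| t.position == pos }.flat_map do |t|
    paths_from(tokens, pos + t.position_length, last).map { |rest| [t.term] + rest }
  end
end

def token_paths(tokens)
  last = tokens.map { |t| t.position + t.position_length }.max
  paths_from(tokens, 0, last)
end

# word_delimiter_graph with catenate_words: PowerShot spans two positions.
graph_tokens = [
  Token.new('PowerShot', 0, 2),
  Token.new('Power', 0, 1),
  Token.new('Shot', 1, 1),
  Token.new('2000', 2, 1)
]

# word_delimiter with catenate_words: PowerShot spans only one position.
legacy_tokens = [
  Token.new('PowerShot', 0, 1),
  Token.new('Power', 0, 1),
  Token.new('Shot', 1, 1),
  Token.new('2000', 2, 1)
]

p token_paths(graph_tokens)
# => [["PowerShot", "2000"], ["Power", "Shot", "2000"]]
p token_paths(legacy_tokens)
# => [["PowerShot", "Shot", "2000"], ["Power", "Shot", "2000"]]
```

In this model, the second stream yields the path PowerShot, Shot, 2000, which does not correspond to the original text. A query that walks the graph to build sub-queries could be misled in the same way.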