Elision token filter
Removes specified elisions from
the beginning of tokens. For example, you can use this filter to change
l'avion to avion.
When not customized, the filter removes the following French elisions by default:
l', m', t', qu', n', s', j', d', c', jusqu', quoiqu',
lorsqu', puisqu'
Customized versions of this filter are included in several of Elasticsearch’s built-in language analyzers, including the Catalan, French, Irish, and Italian analyzers.
This filter uses Lucene’s ElisionFilter.
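The behavior described above can be approximated in a few lines of Ruby. This is only a sketch of the documented effect, not Lucene's ElisionFilter code: the `strip_elision` helper and the constant names are hypothetical, and the sketch assumes that both the ASCII apostrophe and the typographic apostrophe mark an elision.

```ruby
# Illustrative sketch only: mimics the documented behavior of the elision
# filter, not Lucene's ElisionFilter implementation.

# Default French elisions listed above (without the trailing apostrophe).
DEFAULT_ARTICLES = %w[l m t qu n s j d c jusqu quoiqu lorsqu puisqu].freeze

# Apostrophe characters treated as elision markers in this sketch
# (assumption: both the ASCII and the typographic apostrophe count).
APOSTROPHES = ["'", "\u2019"].freeze

# Hypothetical helper: strips a leading elision plus its apostrophe from a token.
def strip_elision(token, articles = DEFAULT_ARTICLES)
  index = token.index(Regexp.union(APOSTROPHES))
  return token if index.nil?

  # Only remove the prefix when it is one of the configured elisions.
  prefix = token[0...index]
  articles.include?(prefix) ? token[(index + 1)..-1] : token
end

tokens = ["l'avion", "j\u2019examine", "près", "aujourd'hui"]
stripped = tokens.map { |t| strip_elision(t) }
puts stripped.inspect
```

In this sketch, l'avion and j’examine lose their elisions, while aujourd'hui is left alone because aujourd is not in the list.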
Example
The following analyze API request uses the elision
filter to remove j' from j’examine près du wharf:
response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      'elision'
    ],
    text: 'j’examine près du wharf'
  }
)
puts response
GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["elision"],
  "text" : "j’examine près du wharf"
}
The filter produces the following tokens:
[ examine, près, du, wharf ]
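To get from an analyze API response to a token list like the one above, you can collect the token field of each entry. The sketch below works on a hypothetical, trimmed-down response body; a real response carries more fields per token, and the variable names are made up for illustration.

```ruby
require 'json'

# Hypothetical, trimmed-down response body: a real analyze API response
# carries more fields per token than the single "token" key shown here.
sample_body = '{"tokens":[{"token":"examine"},{"token":"près"},{"token":"du"},{"token":"wharf"}]}'

# Collect only the token text from each entry.
terms = JSON.parse(sample_body).fetch('tokens').map { |entry| entry.fetch('token') }
puts terms.join(', ')
```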
Add to an analyzer
The following create index API request uses the
elision filter to configure a new
custom analyzer.
response = client.indices.create(
  index: 'elision_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          whitespace_elision: {
            tokenizer: 'whitespace',
            filter: [
              'elision'
            ]
          }
        }
      }
    }
  }
)
puts response
PUT /elision_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_elision": {
          "tokenizer": "whitespace",
          "filter": [ "elision" ]
        }
      }
    }
  }
}
Configurable parameters
articles
    (Required*, array of string) List of elisions to remove.
    To be removed, the elision must be at the beginning of a token and be immediately followed by an apostrophe. Both the elision and apostrophe are removed.
    For custom elision filters, either this parameter or articles_path must be specified.
articles_path
    (Required*, string) Path to a file that contains a list of elisions to remove.
    This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each elision in the file must be separated by a line break.
    To be removed, the elision must be at the beginning of a token and be immediately followed by an apostrophe. Both the elision and apostrophe are removed.
    For custom elision filters, either this parameter or articles must be specified.
articles_case
    (Optional, Boolean) If true, elision matching is case insensitive. If false, elision matching is case sensitive. Defaults to false.
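The file format expected by articles_path (UTF-8, one elision per line) can be illustrated with a short Ruby sketch. The `load_articles` helper and the file name are hypothetical, and trimming whitespace and skipping blank lines are assumptions of this sketch rather than documented Elasticsearch behavior.

```ruby
require 'tempfile'

# Hypothetical helper: reads one elision per line from a UTF-8 file.
# Trimming whitespace and skipping blank lines are assumptions of this sketch.
def load_articles(path)
  File.readlines(path, encoding: 'UTF-8').map(&:strip).reject(&:empty?)
end

# Stand-in for a file under the config location; the name is made up.
file = Tempfile.new(['example_elisions', '.txt'])
file.write("l\nm\nqu\n")
file.close

articles = load_articles(file.path)
puts articles.inspect
file.unlink
```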
Customize
To customize the elision filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.
For example, the following request creates a custom case-insensitive elision
filter that removes the l', m', t', qu', n', s',
and j' elisions:
response = client.indices.create(
  index: 'elision_case_insensitive_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: [
              'elision_case_insensitive'
            ]
          }
        },
        filter: {
          elision_case_insensitive: {
            type: 'elision',
            articles: [
              'l',
              'm',
              't',
              'qu',
              'n',
              's',
              'j'
            ],
            articles_case: true
          }
        }
      }
    }
  }
)
puts response
PUT /elision_case_insensitive_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "elision_case_insensitive" ]
        }
      },
      "filter": {
        "elision_case_insensitive": {
          "type": "elision",
          "articles": [ "l", "m", "t", "qu", "n", "s", "j" ],
          "articles_case": true
        }
      }
    }
  }
}
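The effect of articles_case on a custom list like this one can be sketched locally in Ruby. This is an illustration, not Elasticsearch's implementation: the `remove_elision` helper and its `ignore_case` keyword are hypothetical, and the sketch assumes the configured elisions are written in lowercase.

```ruby
# Illustrative sketch of articles_case, not Elasticsearch's implementation.
CUSTOM_ARTICLES = %w[l m t qu n s j].freeze

# Hypothetical helper: removes a leading elision and its apostrophe.
# With ignore_case set, the prefix is lowercased before the lookup,
# which assumes the configured elisions are written in lowercase.
def remove_elision(token, articles, ignore_case: false)
  prefix, apostrophe, rest = token.partition(/['\u2019]/)
  return token if apostrophe.empty?

  candidate = ignore_case ? prefix.downcase : prefix
  articles.include?(candidate) ? rest : token
end

puts remove_elision("L'avion", CUSTOM_ARTICLES, ignore_case: true)
puts remove_elision("L'avion", CUSTOM_ARTICLES, ignore_case: false)
puts remove_elision("d'accord", CUSTOM_ARTICLES, ignore_case: true)
```

In this sketch, L'avion is reduced to avion only when case is ignored, and d'accord is never changed because d is not in the custom list.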