HTML strip character filter
editHTML strip character filter
editStrips HTML elements from a text and replaces HTML entities with their decoded
value (e.g, replaces & with &).
The html_strip filter uses Lucene’s
HTMLStripCharFilter.
Example
editThe following analyze API request uses the
html_strip filter to change the text <p>I'm so <b>happy</b>!</p> to
\nI'm so happy!\n.
GET /_analyze
{
"tokenizer": "keyword",
"char_filter": [
"html_strip"
],
"text": "<p>I'm so <b>happy</b>!</p>"
}
The filter produces the following text:
[ \nI'm so happy!\n ]
Add to an analyzer
editThe following create index API request uses the
html_strip filter to configure a new
custom analyzer.
PUT /my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"html_strip"
]
}
}
}
}
}
Configurable parameters
edit-
escaped_tags -
(Optional, array of strings)
Array of HTML elements without enclosing angle brackets (
< >). The filter skips these HTML elements when stripping HTML from the text. For example, a value of[ "p" ]skips the<p>HTML element.
Customize
editTo customize the html_strip filter, duplicate it to create the basis for a new
custom character filter. You can modify the filter using its configurable
parameters.
The following create index API request
configures a new custom analyzer using a custom
html_strip filter, my_custom_html_strip_char_filter.
The my_custom_html_strip_char_filter filter skips the removal of the <b>
HTML element.
PUT my-index-000001
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_custom_html_strip_char_filter"
]
}
},
"char_filter": {
"my_custom_html_strip_char_filter": {
"type": "html_strip",
"escaped_tags": [
"b"
]
}
}
}
}
}