HTML strip character filter

Strips HTML elements from text and replaces HTML entities with their decoded value (e.g., replaces &amp; with &).

The html_strip filter uses Lucene’s HTMLStripCharFilter.

Example

The following analyze API request uses the html_strip filter to change the text <p>I&apos;m so <b>happy</b>!</p> to \nI'm so happy!\n.

response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    char_filter: [
      'html_strip'
    ],
    text: '<p>I&apos;m so <b>happy</b>!</p>'
  }
)
puts response

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip"
  ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

The filter produces the following text:

[ \nI'm so happy!\n ]
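
To make the transformation above concrete, here is a minimal pure-Ruby sketch that approximates it. It is not Lucene's HTMLStripCharFilter: the `approximate_html_strip` function, the `BLOCK_TAGS` list, and the small `ENTITIES` table are assumptions made for illustration, chosen so that `<p>` becomes a newline and `<b>` is simply dropped, as the output above suggests.

```ruby
# Simplified approximation of html_strip, for illustration only.
# Assumption: block-level tags become a newline, all other tags are dropped.
BLOCK_TAGS = %w[p div br li h1 h2 h3 h4 h5 h6].freeze

# Assumption: a tiny entity table; the real filter decodes many more entities.
ENTITIES = {
  'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'apos' => "'"
}.freeze

def approximate_html_strip(text)
  # Replace each tag: a newline for block-level tags, nothing for the rest.
  without_tags = text.gsub(%r{</?([A-Za-z][A-Za-z0-9]*)[^>]*>}) do
    BLOCK_TAGS.include?(Regexp.last_match(1).downcase) ? "\n" : ''
  end

  # Decode the named entities in the table; leave unknown ones untouched.
  without_tags.gsub(/&([A-Za-z]+);/) do |entity|
    ENTITIES.fetch(Regexp.last_match(1), entity)
  end
end

puts approximate_html_strip('<p>I&apos;m so <b>happy</b>!</p>').inspect
# prints "\nI'm so happy!\n"
```

Entities are decoded after tags are removed, so an escaped tag such as `&lt;b&gt;` ends up as literal text rather than being stripped.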

Add to an analyzer

The following create index API request uses the html_strip filter to configure a new custom analyzer.

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'keyword',
            char_filter: [
              'html_strip'
            ]
          }
        }
      }
    }
  }
)
puts response

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  }
}

Configurable parameters

escaped_tags
(Optional, array of strings) Array of HTML elements without enclosing angle brackets (< >). The filter skips these HTML elements when stripping HTML from the text. For example, a value of [ "p" ] skips the <p> HTML element.
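
Because escaped_tags expects bare element names, a small helper can guard against passing "<b>" instead of "b". The sketch below is hypothetical: `html_strip_definition` is not part of any Elasticsearch client, it only builds the hash you would place under `char_filter` in the index settings.

```ruby
require 'json'

# Hypothetical helper (not an Elasticsearch client API): builds an html_strip
# char_filter definition and accepts tags written as either "<b>" or "b".
def html_strip_definition(escaped_tags)
  # Remove angle brackets and slashes so only the element name remains.
  names = escaped_tags.map { |tag| tag.delete('<>/').strip }
  { type: 'html_strip', escaped_tags: names.uniq }
end

definition = html_strip_definition(['<b>', 'p'])
puts JSON.pretty_generate(definition)
```

The resulting hash can be used as the value of a named entry under `settings.analysis.char_filter` when creating the index.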

Customize

To customize the html_strip filter, duplicate it to create the basis for a new custom character filter. You can modify the filter using its configurable parameters.

The following create index API request configures a new custom analyzer using a custom html_strip filter, my_custom_html_strip_char_filter.

The my_custom_html_strip_char_filter filter skips the removal of the <b> HTML element.

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'keyword',
            char_filter: [
              'my_custom_html_strip_char_filter'
            ]
          }
        },
        char_filter: {
          my_custom_html_strip_char_filter: {
            type: 'html_strip',
            escaped_tags: [
              'b'
            ]
          }
        }
      }
    }
  }
)
puts response

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_custom_html_strip_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_custom_html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": [
            "b"
          ]
        }
      }
    }
  }
}
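
One way to check the custom filter is to run sample text through my_analyzer with the analyze API on the new index. The sketch below assumes `client` is an `Elasticsearch::Client` from the elasticsearch gem connected to a running cluster. The wrapper name `analyze_with_custom_filter` is hypothetical. Since the filter skips the removal of <b>, the `<b>` tags should remain in the returned text while `<p>` is still stripped.

```ruby
# Hypothetical wrapper: sends text through my_analyzer on my-index-000001.
# Assumption: `client` responds to indices.analyze like Elasticsearch::Client.
def analyze_with_custom_filter(client, text)
  client.indices.analyze(
    index: 'my-index-000001',
    body: { analyzer: 'my_analyzer', text: text }
  )
end

# Usage (requires a running cluster and the index created above):
#   response = analyze_with_custom_filter(client, '<p>I&apos;m so <b>happy</b>!</p>')
#   puts response
```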