Using the Attachment Processor in a Pipelineedit

Table 1. Attachment options

Name Required Default Description

field

yes

-

The field to get the base64 encoded field from

target_field

no

attachment

The field that will hold the attachment information

indexed_chars

no

100000

The number of chars being used for extraction to prevent huge fields. Use -1 for no limit.

properties

no

all properties

 Array of properties to select to be stored. Can be content, title, name, author, keywords, date, content_type, content_length, language

ignore_missing

no

false

If true and field does not exist, the processor quietly exits without modifying the document

For example, this:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/my_type/my_id

Returns this:

{
  "found": true,
  "_index": "my_index",
  "_type": "my_type",
  "_id": "my_id",
  "_version": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}

To specify only some fields to be extracted:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "content", "title" ]
      }
    }
  ]
}

Extracting contents from binary data is a resource intensive operation and consumes a lot of resources. It is highly recommended to run pipelines using this processor in a dedicated ingest node.