November 17, 2016 Engineering

The Future of Attachments for Elasticsearch and .NET

By Russ Cam

For a long time, Elasticsearch has supported the indexing of attachments through the mapper-attachments plugin. Installing this plugin provides the capability to index Word documents, PDFs and many other text-based document formats. The plugin extracts the content of each file, along with metadata such as content type, author and keywords, and makes it searchable in Elasticsearch.

The Not Too Distant Past

Working with the mapper-attachments plugin has historically been a slightly awkward affair with NEST, the official .NET Elasticsearch high level client. From NEST 2.3.3 onwards, an Attachment type is available to make working with attachments a much smoother experience. In this post, we'll walk through some typical use cases for the plugin and the Attachment type with Elasticsearch 5.0, and introduce the ingest-attachment processor plugin, one of the processors available for the ingest node. Since the mapper-attachments plugin is deprecated in 5.0 and will be removed in 6.0, the ingest-attachment processor plugin is the recommended way to index attachments in Elasticsearch 5.0.

Installation

To get started with indexing attachments, the first step is to install the mapper-attachments plugin. For the purposes of this post, I'm going to use Elasticsearch 5.0. As with all plugins in Elasticsearch, installation is handled by calling the elasticsearch-plugin.bat script within the Elasticsearch bin directory

elasticsearch-plugin.bat install mapper-attachments

Once the plugin has been installed successfully, it becomes available when the node is started, or shut down and restarted if it was already running. If you're using the plugin with a version of Elasticsearch prior to 2.2, a specific version of the mapper-attachments plugin will be needed; consult the legacy documentation to understand which version needs to be installed for your environment.

Document Definition and Mapping

Once the plugin is installed and our node is running, we're ready to index our first attachment. To keep things simple, we'll use a short Word document saved in .docx format whose content is the following

[Screenshot of the Word document: attachment_screenshot.png]

Our document Plain Old CLR Object (POCO) type looks like the following

public class Document
{
  public int Id { get; set; }
  public string Path { get; set; }
  public Attachment Attachment { get; set; }
}

It contains an id to uniquely identify the document, a path specifying where the original file is located on a file share and finally, the attachment that will be indexed.

Now that we have a POCO type definition for the document, let's create an index and a mapping for it. For working with Elasticsearch 5.0 from .NET, we can use the 5.x release candidate of NEST

var documentsIndex = "documents";
var connectionSettings = new ConnectionSettings()
  .InferMappingFor<Document>(m => m
    .IndexName(documentsIndex)
  );
var client = new ElasticClient(connectionSettings);
var indexResponse = client.CreateIndex(documentsIndex, c => c
  .Settings(s => s
    .Analysis(a => a
      .Analyzers(ad => ad
        .Custom("windows_path_hierarchy_analyzer", ca => ca
          .Tokenizer("windows_path_hierarchy_tokenizer")
        )
      )
      .Tokenizers(t => t
        .PathHierarchy("windows_path_hierarchy_tokenizer", ph => ph
          .Delimiter('\\')
        )
      )
    )
  )
  .Mappings(m => m
    .Map<Document>(mp => mp
      .AutoMap()
      .AllField(all => all
        .Enabled(false)
      )
      .Properties(ps => ps
        .Text(s => s
          .Name(n => n.Path)
          .Analyzer("windows_path_hierarchy_analyzer")
        )
        .Attachment(a => a
          .Name(n => n.Attachment)
          .NameField(nf => nf
            .Name(n => n.Attachment.Name)
            .Store()
          )
          .FileField(ff => ff
            .Name(n => n.Attachment.Content)
            .Store()
          )
          .ContentTypeField(ct => ct
            .Name(n => n.Attachment.ContentType)
            .Store()
          )
          .ContentLengthField(clf => clf
            .Name(n => n.Attachment.ContentLength)
            .Store()
          )
          .DateField(df => df
            .Name(n => n.Attachment.Date)
            .Store()
          )
          .AuthorField(af => af
            .Name(n => n.Attachment.Author)
            .Store()
          )
          .TitleField(tf => tf
            .Name(n => n.Attachment.Title)
            .Store()
          )
          .KeywordsField(kf => kf
            .Name(n => n.Attachment.Keywords)
            .Store()
          )
        )
      )
    )
  )
);

The connection settings use a neat feature of the NEST client that allows a POCO type to be associated with a particular index name. When Document is specified as the generic type for a request (index, search, etc.), NEST uses the index name inferred from the connection settings if no index name is specified on the request itself. Document type names can be inferred in the same way if you want to use a type name other than the camel cased POCO name that NEST infers by default.
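As a minimal sketch of the type name inference mentioned above, the same InferMappingFor call can also carry a type name. The type name "file_document" is purely hypothetical and chosen for illustration; the rest mirrors the connection settings already shown.

```csharp
using Nest;

public class Document
{
    public int Id { get; set; }
    public string Path { get; set; }
    public Attachment Attachment { get; set; }
}

public static class InferenceSketch
{
    public static ConnectionSettings CreateSettings()
    {
        // With no arguments, ConnectionSettings targets http://localhost:9200
        return new ConnectionSettings()
            .InferMappingFor<Document>(m => m
                // Index used whenever a request for Document omits an index name
                .IndexName("documents")
                // Hypothetical type name, used instead of the default inferred one
                .TypeName("file_document")
            );
    }
}
```

A client created with new ElasticClient(InferenceSketch.CreateSettings()) would then use both names for Document requests that don't specify them explicitly.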

The mapping for the Document type defines a custom analyzer for the Path property that uses the path_hierarchy tokenizer to provide search across path hierarchies. Since this example is running on Windows, the tokenizer uses the \ character as the path delimiter. Additionally, the _all field has been disabled within the mapping as it is not needed in our example. Finally, the metadata fields that we are interested in are mapped for the attachment type.
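To get a feel for what the path_hierarchy tokenizer does with the \ delimiter, here is a rough emulation written with plain .NET types. It is only an approximation for paths that do not begin with the delimiter; how the real tokenizer treats leading or doubled delimiters (as in a UNC path like \\share) is not modelled here, and the drive-letter path is a hypothetical example.

```csharp
using System;
using System.Collections.Generic;

public static class PathHierarchySketch
{
    // Emits one token per path level; each token is a progressively longer prefix.
    public static List<string> Tokenize(string path, char delimiter)
    {
        var tokens = new List<string>();
        var parts = path.Split(delimiter);
        var current = "";
        for (var i = 0; i < parts.Length; i++)
        {
            // The first segment stands alone; later segments extend the previous token
            current = i == 0 ? parts[i] : current + delimiter + parts[i];
            tokens.Add(current);
        }
        return tokens;
    }

    public static void Main()
    {
        // Prints:
        // c:
        // c:\documents
        // c:\documents\examples
        // c:\documents\examples\example_one.docx
        foreach (var token in Tokenize(@"c:\documents\examples\example_one.docx", '\\'))
        {
            Console.WriteLine(token);
        }
    }
}
```

Because every parent folder ends up as its own token, a search for a folder can match the files stored beneath it.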

After the create index request is executed, the index will be created. The mapping for the Document type can be inspected with the following

var mappingResponse = client.GetMapping<Document>();

This returns the following, demonstrating that the mapping has been created as expected

{
  "documents" : {
    "mappings" : {
      "document" : {
        "_all" : {
          "enabled" : false
        },
        "properties" : {
          "attachment" : {
            "type" : "attachment",
            "fields" : {
              "content" : {
                "type" : "text",
                "store" : true
              },
              "author" : {
                "type" : "text",
                "store" : true
              },
              "title" : {
                "type" : "text",
                "store" : true
              },
              "name" : {
                "type" : "text",
                "store" : true
              },
              "date" : {
                "type" : "date",
                "store" : true
              },
              "keywords" : {
                "type" : "text",
                "store" : true
              },
              "content_type" : {
                "type" : "text",
                "store" : true
              },
              "content_length" : {
                "type" : "float",
                "store" : true
              },
              "language" : {
                "type" : "text"
              }
            }
          },
          "id" : {
            "type" : "integer"
          },
          "path" : {
            "type" : "text",
            "analyzer" : "windows_path_hierarchy_analyzer"
          }
        }
      }
    }
  }
}
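By default, NEST does not throw when Elasticsearch rejects a request; the outcome is reported on the response object instead. A small helper along the following lines can make failures more visible while experimenting. The helper name and its exception type are our own choices for illustration, not part of NEST.

```csharp
using System;
using Nest;

public static class ResponseChecks
{
    // Throws if the response indicates that the request did not succeed.
    public static void EnsureValid(IResponse response, string operation)
    {
        if (response.IsValid) return;

        // DebugInformation summarises the request, the response and any server error
        throw new InvalidOperationException(
            operation + " failed: " + response.DebugInformation);
    }
}
```

For example, ResponseChecks.EnsureValid(indexResponse, "create index") could be called straight after the create index request above.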

Indexing and Searching our first Attachment

Now that the index and mapping are in place, it's time to index the attachment

var directory = Directory.GetCurrentDirectory();
var base64File = Convert.ToBase64String(File.ReadAllBytes(Path.Combine(directory, "example_one.docx")));
client.Index(new Document
{
  Id = 1,
  Path = @"\\share\documents\examples\example_one.docx",
  Attachment = new Attachment
  {
    Content = base64File
  }
});

This is equivalent to the following curl request

curl -XPUT "http://localhost:9200/documents/document/1" -d'
{
  "id": 1,
  "path": "\\\\share\\documents\\examples\\example_one.docx",
  "attachment": "... base64 encoded attachment ..."
}'

Once indexed, searching the content of the attachment is a straightforward affair

var searchResponse = client.Search<Document>(s => s
  .Query(q => q
    .Match(m => m
      .Field(a => a.Attachment.Content)
      .Query("NEST")
    )
  )
);

Using NEST, a document field within Elasticsearch can be referenced using a member access lambda expression against the respective POCO type property name. The search result returned for the query is as follows

{
  "took" : 31,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2568969,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "document",
        "_id" : "1",
        "_score" : 0.2568969,
        "_source" : {
          "id" : 1,
          "path" : "\\\\share\\documents\\examples\\example_one.docx",
          "attachment" : "... base64 encoded attachment ..."
        }
      }
    ]
  }
}

and a Document instance constructed from the _source can be accessed via searchResponse.Documents.First().

We can also search on metadata extracted from the attachment

searchResponse = client.Search<Document>(s => s
  .StoredFields(f => f
    .Field(d => d.Attachment.Content)
    .Field(d => d.Attachment.ContentType)
    .Field(d => d.Attachment.ContentLength)
    .Field(d => d.Attachment.Author)
    .Field(d => d.Attachment.Title)
    .Field(d => d.Attachment.Date)
  )
  .Query(q => q
    .Match(m => m
      .Field(a => a.Attachment.ContentType)
      .Query("application")
    )
  )
);

Since all of the fields in the attachment have store enabled in the mapping, the extracted metadata field values can be returned as above using the stored_fields parameter on the search request. Setting the content field with store enabled in the mapping is useful in scenarios where you want to retrieve the extracted content or perform highlighting on it.

The search result for the previous query is as follows

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.25316024,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "document",
        "_id" : "1",
        "_score" : 0.25316024,
        "fields" : {
          "attachment.date" : [
            "2016-08-30T05:50:00.000Z"
          ],
          "attachment.content" : [
            "The Present and Future of Attachments\n\nThis is a sample document to demonstrate indexing attachments using NEST and the new Attachment type\n\n"
          ],
          "attachment.content_type" : [
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
          ],
          "attachment.author" : [
            "Russ Cam"
          ],
          "attachment.content_length" : [
            11726.0
          ]
        }
      }
    ]
  }
}

It is also possible to construct Attachment instances from the field values of each hit using

var attachments = searchResponse.Hits.Select(h =>
  new Attachment
  {
    Author = h.Fields.ValueOf<Document, string>(d => d.Attachment.Author),
    Content = h.Fields.ValueOf<Document, string>(d => d.Attachment.Content),
    ContentLength = h.Fields.ValueOf<Document, long?>(d => d.Attachment.ContentLength),
    ContentType = h.Fields.ValueOf<Document, string>(d => d.Attachment.ContentType),
    Date = h.Fields.ValueOf<Document, DateTime?>(d => d.Attachment.Date),
    Title = h.Fields.ValueOf<Document, string>(d => d.Attachment.Title)
  }
);

The Attachment type within NEST takes care of accessing the correct values from the fields property in the response, based on member access lambda expressions on the properties of the Attachment type.
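Since the content field is stored, highlighting on the extracted content is also an option, as mentioned earlier. The following is a minimal sketch of how such a request might look; the class and method names are hypothetical, and the client is assumed to be configured with the inferred index name as before.

```csharp
using Nest;

public class Document
{
    public int Id { get; set; }
    public string Path { get; set; }
    public Attachment Attachment { get; set; }
}

public static class HighlightSketch
{
    public static ISearchResponse<Document> SearchWithHighlights(IElasticClient client, string text)
    {
        return client.Search<Document>(s => s
            .Query(q => q
                .Match(m => m
                    .Field(d => d.Attachment.Content)
                    .Query(text)
                )
            )
            // Ask for highlighted fragments from the extracted attachment content
            .Highlight(h => h
                .Fields(f => f
                    .Field(d => d.Attachment.Content)
                )
            )
        );
    }
}
```

The highlighted fragments should then be available on each hit in the response, keyed by field name.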

Explicit Metadata fields

The mapper-attachments plugin also allows explicit metadata fields to be sent at index time, along with the base64 encoded attachment. This can be useful in cases where you don't want to rely on a metadata value extracted from the attachment. For example, we may be indexing Microsoft Word documents in both the older .doc and newer .docx formats and wish to explicitly control the content type for the latter to align it with the content type of the former. The NEST Attachment type can handle this for us

client.Index(new Document
{
  Id = 1,
  Path = @"\\share\documents\examples\example_one.docx",
  Attachment = new Attachment
  {
    Content = base64File,
    ContentType = "application/msword"
  }
});

Search can then be performed on content type

var searchResponse = client.Search<Document>(s => s
  .Query(q => q
    .Match(m => m
      .Field(a => a.Attachment.ContentType)
      .Query("msword")
    )
  )
);

The attachment in the document _source now contains two properties, _content and _content_type. These are the names used for the content and the content type, respectively, when explicit metadata fields are sent.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.25811607,
    "hits" : [ {
      "_index" : "documents",
      "_type" : "document",
      "_id" : "1",
      "_score" : 0.25811607,
      "_source" : {
        "id" : 1,
        "path" : "\\\\share\\documents\\examples\\example_one.docx",
        "attachment" : {
          "_content" : "... base64 encoded attachment ...",
          "_content_type" : "application/msword"
        }
      }
    } ]
  }
}

Again, the Attachment type takes care of deserializing the source of an attachment into an Attachment instance, with properties set to those explicitly specified in the source.

Gotchas

Whilst the mapper-attachments plugin works well, there are some gotchas to be aware of. For example, if no attachment fields are mapped with store enabled and a document is indexed with only the base64 encoded attachment sent, then a search query that requests metadata fields using the stored_fields parameter will return the original base64 encoded attachment as the value for each requested field. One can exclude the original base64 encoded content from the _source document using the source exclude feature to mitigate this.

The Future

As previously mentioned, the mapper-attachments plugin is deprecated in Elasticsearch 5.0 and will be removed in 6.0. But fear not, for the future is bright! The ingest-attachment processor plugin, part of the ingest node in Elasticsearch 5.0, replaces the mapper-attachments plugin and provides a more predictable experience than its predecessor. Since the extraction process now happens before the document is indexed within an index request, the extracted metadata fields are stored in the source field and returned with the rest of the source document in a search request. Let's see an example.

Installation

Similarly to the mapper-attachments plugin, installation of the ingest-attachment plugin is handled by calling the elasticsearch-plugin.bat script within the Elasticsearch bin directory

elasticsearch-plugin.bat install ingest-attachment

Again, start or restart your node after installing this plugin. Now, with at least one ingest node in the Elasticsearch cluster, we're ready to start working with attachments.

Mappings and Pipelines

Mapping with ingest-attachment is a little different from how attachments need to be mapped with the mapper-attachments plugin. Gone is the need to map the attachment using the bespoke attachment type. Instead, we specify the field in which the base64 encoded attachment will be sent to Elasticsearch, along with an object mapping that will receive the extracted attachment metadata from the ingest-attachment processor pipeline.

With a slightly updated POCO, the mapping now looks like the following

public class Document
{
  public int Id { get; set; }
  public string Path { get; set; }
  public string Content { get; set; }
  public Attachment Attachment { get; set; }
}

var indexResponse = client.CreateIndex(documentsIndex, c => c
  .Settings(s => s
    .Analysis(a => a
      .Analyzers(ad => ad
        .Custom("windows_path_hierarchy_analyzer", ca => ca
          .Tokenizer("windows_path_hierarchy_tokenizer")
        )
      )
      .Tokenizers(t => t
        .PathHierarchy("windows_path_hierarchy_tokenizer", ph => ph
          .Delimiter('\\')
        )
      )
    )
  )
  .Mappings(m => m
    .Map<Document>(mp => mp
      .AllField(all => all
        .Enabled(false)
      )
      .Properties(ps => ps
        .Number(n => n
          .Name(nn => nn.Id)
        )
        .Text(s => s
          .Name(n => n.Path)
          .Analyzer("windows_path_hierarchy_analyzer")
        )
        .Object<Attachment>(a => a
          .Name(n => n.Attachment)
          .Properties(p => p
            .Text(t => t
              .Name(n => n.Name)
            )
            .Text(t => t
              .Name(n => n.Content)
            )
            .Text(t => t
              .Name(n => n.ContentType)
            )
            .Number(n => n
              .Name(nn => nn.ContentLength)
            )
            .Date(d => d
              .Name(n => n.Date)
            )
            .Text(t => t
              .Name(n => n.Author)
            )
            .Text(t => t
              .Name(n => n.Title)
            )
            .Text(t => t
              .Name(n => n.Keywords)
            )
          )
        )
      )
    )
  )
);

The mapping uses the text field type for all of the string properties so that they are analyzed. In fact, we can take advantage of a feature within NEST known as automapping to simplify this mapping further; automapping infers the document mapping to send to Elasticsearch from the types of the properties on the POCO. The simpler mapping is

var indexResponse = client.CreateIndex(documentsIndex, c => c
  .Settings(s => s
    .Analysis(a => a
      .Analyzers(ad => ad
        .Custom("windows_path_hierarchy_analyzer", ca => ca
          .Tokenizer("windows_path_hierarchy_tokenizer")
        )
      )
      .Tokenizers(t => t
        .PathHierarchy("windows_path_hierarchy_tokenizer", ph => ph
          .Delimiter('\\')
        )
      )
    )
  )
  .Mappings(m => m
    .Map<Document>(mp => mp
      .AutoMap()
      .AllField(all => all
        .Enabled(false)
      )
      .Properties(ps => ps
        .Text(s => s
          .Name(n => n.Path)
          .Analyzer("windows_path_hierarchy_analyzer")
        )
        .Object<Attachment>(a => a
          .Name(n => n.Attachment)
          .AutoMap()
        )
      )
    )
  )
);

As before, the mapping can be checked with

var mappingResponse = client.GetMapping<Document>();

Now that the mapping is in place, an ingest pipeline can be created to use for attachment processing

client.PutPipeline("attachments", p => p
  .Description("Document attachment pipeline")
  .Processors(pr => pr
    .Attachment<Document>(a => a
      .Field(f => f.Content)
      .TargetField(f => f.Attachment)
    )
    .Remove<Document>(r => r
      .Field(f => f.Content)
    )
  )
);

This is akin to the following curl request

curl -XPUT "http://localhost:9200/_ingest/pipeline/attachments" -d'
{
  "description": "Document attachment pipeline",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "target_field": "attachment"
      }
    },
    {
      "remove": {
        "field": "content"
      }
    }
  ]
}'

The attachment processor configuration allows control over which properties to extract, extracting all properties by default. A remove processor is also added to the ingest pipeline to remove and hence not store the base64 encoded attachment sent in the content field; since the file exists on a file share already and the extracted content will be indexed into the content field of the attachment field, keeping the original attachment content around in Elasticsearch is superfluous to our use case. In fact, we could simplify this example further by sending the base64 encoded attachment as the value of the attachment field to Elasticsearch instead of using the content field, still specifying the attachment as the target field as before, and removing the content field altogether. A series of blog posts will be diving deeper into ingest node and pipelines if you're eager for more details and use cases; for a teaser, take a look at Ingesting and Exploring Scientific Papers with the ingest-attachment processor plugin and our Elasticsearch as a service offering, Elastic Cloud.
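To illustrate the control over extracted properties mentioned above, here is a sketch of a pipeline that extracts only a subset of them. The pipeline id "attachments-content-only" is hypothetical, and we are assuming that the NEST attachment processor descriptor exposes a Properties method mirroring the processor's properties option; check the client version you are using before relying on it.

```csharp
using Nest;

public class Document
{
    public int Id { get; set; }
    public string Path { get; set; }
    public string Content { get; set; }
    public Attachment Attachment { get; set; }
}

public static class PipelineSketch
{
    // Returns true when Elasticsearch accepted the pipeline definition.
    public static bool CreateSubsetPipeline(IElasticClient client)
    {
        return client.PutPipeline("attachments-content-only", p => p
            .Description("Extract only content, title and author")
            .Processors(pr => pr
                .Attachment<Document>(a => a
                    .Field(f => f.Content)
                    .TargetField(f => f.Attachment)
                    // Assumed method: limits extraction to the named properties
                    .Properties(new[] { "content", "title", "author" })
                )
                // Drop the base64 payload once extraction has run
                .Remove<Document>(r => r
                    .Field(f => f.Content)
                )
            )
        ).IsValid;
    }
}
```

Restricting the properties keeps the stored source smaller when only a few metadata values are of interest.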

Indexing and Searching in the Brave New World

We're now ready to roll with indexing our attachment! The base64 encoded attachment is now passed in the Content field on the Document POCO and the id of the pipeline to use is also specified on the request.

var directory = Directory.GetCurrentDirectory();
var base64File = Convert.ToBase64String(File.ReadAllBytes(Path.Combine(directory, "example_one.docx")));
client.Index(new Document
{
  Id = 1,
  Path = @"\\share\documents\examples\example_one.docx",
  Content = base64File
}, i => i.Pipeline("attachments"));

For all of this to work, our Elasticsearch cluster needs to have at least one ingest node in it. If you need to process lots of attachments, dedicated ingest nodes are recommended, since the extraction process can be a resource intensive operation.

With our document indexed, searching is as straightforward as before

var searchResponse = client.Search<Document>(s => s
  .Query(q => q
    .Match(m => m
      .Field(a => a.Attachment.Content)
      .Query("NEST")
    )
  )
);

which returns the following search result

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2568969,
    "hits" : [
      {
        "_index" : "documents",
        "_type" : "document",
        "_id" : "1",
        "_score" : 0.2568969,
        "_source" : {
          "path" : "\\\\share\\documents\\examples\\example_one.docx",
          "attachment" : {
            "date" : "2016-08-30T05:48:00Z",
            "content_type" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "author" : "Russ Cam",
            "language" : "en",
            "content" : "The Present and Future of Attachments\n\nThis is a sample document to demonstrate indexing attachments using NEST and the new Attachment type",
            "content_length" : 141
          },
          "id" : 1
        }
      }
    ]
  }
}

All of the extracted metadata from the attachment appears within the _source field under the attachment field and the NEST Attachment type takes care of correctly deserializing this into an Attachment instance on our Document POCO.
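As a brief sketch of what this looks like on the .NET side, the deserialized Attachment can be read straight off each returned document. The helper below is hypothetical and simply writes a few of the extracted values to the console.

```csharp
using System;
using Nest;

public class Document
{
    public int Id { get; set; }
    public string Path { get; set; }
    public string Content { get; set; }
    public Attachment Attachment { get; set; }
}

public static class ResultPrinter
{
    public static void PrintAttachments(ISearchResponse<Document> searchResponse)
    {
        foreach (var document in searchResponse.Documents)
        {
            var attachment = document.Attachment;

            // Skip documents for which no attachment metadata was returned
            if (attachment == null) continue;

            Console.WriteLine("{0}: {1} by {2}",
                document.Path, attachment.ContentType, attachment.Author);
        }
    }
}
```

Calling ResultPrinter.PrintAttachments(searchResponse) with the response from the search above would list the path, content type and author for each matching document.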

Conclusion

PDFs, Word documents, PowerPoint presentations, Excel spreadsheets and the like, brace yourselves, Ingest is here! No longer will your content remain locked away.