Content extraction | Enterprise Search documentation [8.8]

IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

Content extraction

Connectors use the Elastic ingest attachment processor to extract file contents. The processor extracts text from files using the Apache Tika text extraction library. The logic for content extraction is defined in utils.py.
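
As a rough illustration of what the attachment processor works with, the sketch below builds a minimal pipeline body and a document whose binary content is base64-encoded, which is the form the processor reads. The field name `data` and the helper names `build_attachment_pipeline` and `build_document` are assumptions made for this example; they are not taken from utils.py or from the default Enterprise Search pipeline.

```python
import base64
import json


def build_attachment_pipeline(source_field="data"):
    """Return a pipeline body that runs the attachment processor on source_field.

    The field name "data" is an illustrative assumption, not the name used
    by the default Enterprise Search pipeline.
    """
    return {
        "description": "Extract text from base64-encoded binary content",
        "processors": [
            {"attachment": {"field": source_field}},
        ],
    }


def build_document(raw_bytes, source_field="data"):
    """Wrap raw file bytes as a document with base64-encoded content."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {source_field: encoded}


pipeline = build_attachment_pipeline()
document = build_document(b"Hello from a text file")

# Print the JSON bodies that would be sent to Elasticsearch.
print(json.dumps(pipeline, indent=2))
print(json.dumps(document))
```

With the processor's default settings, the extracted text is written under the `attachment` field of the indexed document. The example only prepares the request bodies and does not contact a cluster.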

While the processor is intended primarily for PDF and Microsoft Office formats, you can use it with any of the supported formats.

Enterprise Search uses an Elasticsearch ingest pipeline to power the web crawler’s binary content extraction. The default pipeline, ent-search-generic-ingestion, is automatically created when Enterprise Search first starts.

You can view this pipeline in Kibana. You can also customize your pipeline usage. See Index-specific ingest pipelines.

Supported file types

The following file types are supported:

.txt
.py
.rst
.html
.markdown
.json
.xml
.csv
.md
.ppt
.rtf
.docx
.odt
.xls
.xlsx
.rb
.paper
.sh
.pptx
.pdf
.doc
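
The list above can be turned into a simple pre-check before files are handed to a connector. The following is a minimal sketch, not the connectors' own logic: the names `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical, and the case-insensitive comparison is an assumption of this example.

```python
from pathlib import Path

# Extensions copied from the supported file types list.
SUPPORTED_EXTENSIONS = frozenset({
    ".txt", ".py", ".rst", ".html", ".markdown", ".json", ".xml",
    ".csv", ".md", ".ppt", ".rtf", ".docx", ".odt", ".xls", ".xlsx",
    ".rb", ".paper", ".sh", ".pptx", ".pdf", ".doc",
})


def is_supported(filename):
    """Return True if the file's final extension is in the supported list.

    Lowercasing the extension is an assumption made for this sketch.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ("report.pdf", "slides.PPTX", "backup.zip", "README"):
    print(name, is_supported(name))
```

Only the final suffix is checked, so a name such as `notes.tar.gz` is evaluated as `.gz` and rejected.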

The ingest attachment processor does not support compressed files, e.g., an archive file containing a set of PDFs. Expand the archive file and make individual uncompressed files available for the connector to process.
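
One way to prepare such an archive is to expand it into a directory before the connector runs. The sketch below assumes a ZIP archive and uses only the Python standard library; `expand_archive` is a hypothetical helper written for this example and is not part of the connectors code.

```python
import zipfile
from pathlib import Path


def expand_archive(archive_path, output_dir):
    """Extract every regular file from a ZIP archive.

    Returns the list of extracted file paths. Directory entries are skipped.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            # extract() returns the path of the file it created.
            extracted.append(Path(archive.extract(member, output_dir)))
    return extracted
```

For example, `expand_archive("reports.zip", "expanded/")` would write the archive's members under `expanded/`, where the connector can process them as individual uncompressed files. Both paths here are placeholders.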
