Content extraction


The third-party services you sync with Workplace Search, such as Dropbox or Google Drive, usually contain a wide variety of documents and file types. Workplace Search tries to extract the content of these files in order to transform each source document into a searchable document.

To make the document searchable, the Workplace Search connector tries to extract text content into fields, and images into thumbnail previews. Full text content extraction is available for many types of documents, including PDFs and most Office 365 and G Suite formats. Thumbnail extraction is available for certain image formats.

Nevertheless, you might find that the content of some documents is not extracted, or that the extraction is imperfect. The following documentation covers the file extensions and media types supported by Workplace Search, as well as how to troubleshoot unexpected results.

Content extraction limits

There are some important facts and figures to note up front:

  • The maximum file size for content extraction is 20MB. This is not configurable.
  • The resulting text will be truncated if it exceeds 100KB.
  • Encrypted documents are skipped by the extractor.
  • Content extraction from binary formats (e.g., images, audio, videos) is currently not supported.
  • Thumbnail extraction is automatically disabled when less than 2GB of heap is available. For maximum performance and stability, ensure that Enterprise Search has at least 4GB of heap.
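
The size limits above can be pictured with a minimal Python sketch. This is an illustration only, not Workplace Search code: the function names are hypothetical, and it assumes that "20MB" and "100KB" are binary units and that truncation is measured in UTF-8 bytes, neither of which the documentation specifies.

```python
# Assumption: "20MB" and "100KB" are read as binary units (MiB / KiB).
MAX_FILE_BYTES = 20 * 1024 * 1024
MAX_TEXT_BYTES = 100 * 1024


def extraction_plan(file_size: int, encrypted: bool) -> str:
    """Return 'skip' or 'extract' for a file, following the limits above."""
    if encrypted or file_size > MAX_FILE_BYTES:
        return "skip"
    return "extract"


def truncate_text(text: str, limit: int = MAX_TEXT_BYTES) -> str:
    """Truncate text so that its UTF-8 encoding fits within `limit` bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    return encoded[:limit].decode("utf-8", errors="ignore")
```

A check like this can be useful before syncing, to predict which files will be skipped or will have their extracted text cut short.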

Disabling thumbnail or full-text extraction

You can toggle off thumbnail extraction, full-text extraction, or both if you need to save RAM. However, these toggles are available only after you create the content source, and creating a content source immediately triggers a full sync, including thumbnail and full-text extraction.

To avoid any thumbnail or full-text extraction, switch the toggle immediately after creating the content source. This ensures that the toggle is registered during the initial, metadata-only phase of the first full sync.

Enabling or disabling thumbnail or full-text extraction does not interrupt or restart running jobs. Changes to these settings are picked up only by new jobs.

The toggle disables/enables any subsequent thumbnail or full-text extraction. Existing thumbnail or full-text data will not be removed.
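
One way to picture this behavior is a job that copies the settings once, when it starts. The Python sketch below is a hypothetical model, not the actual Workplace Search implementation; `ExtractionSettings` and `SyncJob` are invented names used only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExtractionSettings:
    thumbnails: bool = True
    full_text: bool = True


class SyncJob:
    """Hypothetical job that captures the settings when it is created."""

    def __init__(self, settings: ExtractionSettings) -> None:
        self.settings = settings  # immutable snapshot


current = ExtractionSettings()
running_job = SyncJob(current)

# Toggling produces a new settings value; the running job keeps its snapshot.
current = replace(current, thumbnails=False)
next_job = SyncJob(current)
```

In this model `running_job` still extracts thumbnails, while `next_job` does not, which mirrors the rule that only new jobs see the change.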

File types

Media types, also known as MIME types, are an internet standard for describing file formats.

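A media type consists of a type and a subtype, separated by a slash. The type represents the general category into which the data falls, such as video or text. The subtype identifies the exact kind of data within that category. For example, in application/pdf the type is application and the subtype is pdf.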
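
As a small illustration of the type/subtype structure, the following Python sketch splits a media type string into its two parts. The `parse_media_type` helper is a hypothetical name introduced here, and the sketch deliberately ignores any parameters such as `charset`.

```python
from typing import NamedTuple


class MediaType(NamedTuple):
    type: str
    subtype: str


def parse_media_type(value: str) -> MediaType:
    """Split 'type/subtype' and ignore any ';parameter' suffix."""
    essence = value.split(";", 1)[0].strip().lower()
    main, sep, sub = essence.partition("/")
    if not sep or not main or not sub:
        raise ValueError(f"not a media type: {value!r}")
    return MediaType(main, sub)
```

For instance, `parse_media_type("application/pdf")` yields the type `application` and the subtype `pdf`.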

Workplace Search analyzes the file itself to determine its type, since file extensions are not reliable. Workplace Search leverages the industry-standard, open-source Apache Tika toolkit for detecting and extracting text and metadata. The Elasticsearch ingest attachment plugin also uses Apache Tika and can return a number of metadata fields.
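
To see why content-based detection is more reliable than extensions, consider the simplified Python sketch below. It is not how Apache Tika works internally; it only compares a guess based on the file name with a check of a few well-known file signatures. The helper names and the short signature table are assumptions made for this example.

```python
import mimetypes
from typing import Optional

# A few well-known file signatures ("magic bytes"); far from exhaustive.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_type(head: bytes) -> Optional[str]:
    """Guess the media type from the first bytes of a file."""
    for signature, mime in SIGNATURES:
        if head.startswith(signature):
            return mime
    return None


def guess_from_name(filename: str) -> Optional[str]:
    """Guess the media type from the file extension only."""
    return mimetypes.guess_type(filename)[0]
```

A PDF that has been renamed to `report.txt` would be misclassified by `guess_from_name`, while `sniff_type` still recognizes it from its leading `%PDF-` bytes.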

You can add search checkbox filters for file extensions and MIME types in your search experience. Learn more about customizing content source filters in Workplace Search.

Full text content extraction

The following file types are supported for full text extraction:

  • .doc
  • .docx
  • .html
  • .odt
  • .one
  • .md
  • .markdown
  • .paper
  • .pdf
  • .ppt
  • .pptx
  • .rtf
  • .txt
  • .xls
  • .xlsx

Text from the following file types is normalized to reduce whitespace and minimize storage costs:

  • .md
  • .markdown
  • .paper
  • .rtf
  • .txt
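
The exact normalization rules are not documented here. As a rough sketch, assuming that normalization means collapsing each run of whitespace into a single space, it could look like the following Python. `NORMALIZED_EXTENSIONS`, `normalize_whitespace`, and `prepare_text` are hypothetical names.

```python
import re
from pathlib import PurePath

# Extensions taken from the list above.
NORMALIZED_EXTENSIONS = {".md", ".markdown", ".paper", ".rtf", ".txt"}


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def prepare_text(filename: str, text: str) -> str:
    """Normalize text only for the file types listed above."""
    if PurePath(filename).suffix.lower() in NORMALIZED_EXTENSIONS:
        return normalize_whitespace(text)
    return text
```

Under this assumption, blank lines and indentation in a `.txt` file would not survive in the indexed text, which matters if you expect search results to preserve layout.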

Workplace Search supports these MIME types for text files:

  • application/msword
  • application/pdf
  • application/vnd.openxmlformats-officedocument.presentationml.presentation
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document

Unstructured textual data has the highest likelihood of benefiting from content extraction. Structured documents, such as Excel spreadsheets, HTML files, or CSV files, do not lend themselves to well-ordered text extraction.

Google Docs, Sheets and Slides

Text content extraction from Google Docs, Sheets and Slides is also supported. These files do not have a native download format, so Workplace Search exports them as PDFs in order to extract their content.

However, extracting text from PDFs is not always a perfectly lossless process, and can lead to unexpected results in some cases.

You might be surprised that some of your PDFs are being transformed into searchable documents, while others are not. That’s because there are different types of PDFs.

The easiest way to tell if a particular PDF document supports full text extraction is to try to copy and paste from the document. If this works, Workplace Search can extract the file’s content.

If you cannot select the text, the PDF is actually an image. You will have to use a third-party OCR (optical character recognition) engine to scan the image for text, then ingest the result via a custom source. This process can be hit and miss, depending on the quality of the image and the font used.
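
The copy-and-paste test can be approximated in code: extract the text of each page with a PDF library of your choice and check whether anything comes back. The Python sketch below covers only that final check. `has_text_layer` is a hypothetical helper, and `page_texts` is assumed to hold the per-page strings returned by whichever library you use.

```python
from typing import Iterable


def has_text_layer(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True once the pages yield `min_chars` non-whitespace characters."""
    total = 0
    for text in page_texts:
        total += sum(1 for ch in text if not ch.isspace())
        if total >= min_chars:
            return True
    return False
```

A document for which this returns `False` is probably a scanned image and a candidate for OCR. Raising `min_chars` may help filter out scans that contain only a stray page number or watermark.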

Thumbnail extraction

Workplace Search provides document thumbnail previews for certain file types, to help you quickly find exactly what you need.

The Workplace Search thumbnail extractor supports specific media or MIME types:

  • image/gif
  • image/jpeg
  • image/png

Thumbnail generation can be quite memory-intensive and requires at least 2GB of JVM heap to run. Even then, thumbnail generation may be suspended if the available heap becomes insufficient.
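
Taken together, the two conditions for a thumbnail (a supported MIME type and enough heap) can be sketched as a single check. This Python is illustrative only: `can_generate_thumbnail` is a hypothetical name, and "2GB" is assumed to mean 2 GiB.

```python
THUMBNAIL_MIME_TYPES = {"image/gif", "image/jpeg", "image/png"}
MIN_HEAP_BYTES = 2 * 1024**3  # assumption: "2GB" read as 2 GiB


def can_generate_thumbnail(mime_type: str, available_heap_bytes: int) -> bool:
    """True if the type is supported and the heap threshold is met."""
    return (
        mime_type.lower() in THUMBNAIL_MIME_TYPES
        and available_heap_bytes >= MIN_HEAP_BYTES
    )
```

When thumbnails are missing, checking both conditions separately helps distinguish an unsupported format from a memory shortage.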

If document thumbnails are missing, a good first step is to increase the RAM available to your server.

Content extraction from binary formats

Searching the content of binary formats (e.g., images, audio, videos) is currently not supported. You can use a third-party content extractor and add the extracted content via a custom source.