Web crawler schema | Enterprise Search documentation [master]




Web crawler schema

The web crawler indexes search documents using the following schema. All fields are strings or arrays of strings.

additional_urls: The URLs of additional pages with the same content.
body_content: The content of the page’s <body> tag with all HTML tags removed. Truncated to crawler.extraction.body_size.limit.
domains: The domains in which this content appears.
full_html: The full HTML of the page in string form. Extraction of this field is disabled by default. When it is disabled, documents have no full_html field at all.
headings: The text of the page’s HTML headings (h1 - h6 elements). Limited by crawler.extraction.headings_count.limit.
id: The unique identifier for the page.
last_crawled_at: The date and time when the page was last crawled.
links: Links found on the page. Limited by crawler.extraction.indexed_links_count.limit.
meta_description: The page’s description, taken from the <meta name="description"> tag. Truncated to crawler.extraction.description_size.limit.
meta_keywords: The page’s keywords, taken from the <meta name="keywords"> tag. Truncated to crawler.extraction.keywords_size.limit.
title: The title of the page, taken from the <title> tag. Truncated to crawler.extraction.title_size.limit.
url: The URL of the page.
url_host: The hostname or IP from the page’s URL.
url_path: The full pathname from the page’s URL.
url_path_dir1: The first segment of the pathname from the page’s URL.
url_path_dir2: The second segment of the pathname from the page’s URL.
url_path_dir3: The third segment of the pathname from the page’s URL.
url_port: The port number from the page’s URL (as a string).
url_scheme: The scheme of the page’s URL.
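
To make the URL-derived fields concrete, here is a minimal Python sketch of how one URL might break down into url_scheme, url_host, url_port, url_path, and url_path_dir1 through url_path_dir3. It is not the crawler's own code. The helper name url_fields, the fallback to the scheme's default port when the URL has no explicit port, and the omission of url_path_dirN fields for missing segments are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumption: when a URL has no explicit port, fall back to the scheme default.
DEFAULT_PORTS = {"http": "80", "https": "443"}


def url_fields(url: str) -> dict:
    """Hypothetical helper: split a URL into schema-style string fields."""
    parts = urlsplit(url)

    # Non-empty path segments, e.g. "/blog/2021/post.html" -> ["blog", "2021", "post.html"]
    segments = [segment for segment in parts.path.split("/") if segment]

    if parts.port is not None:
        port = str(parts.port)  # the schema stores the port as a string
    else:
        port = DEFAULT_PORTS.get(parts.scheme, "")

    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname or "",
        "url_port": port,
        "url_path": parts.path,
    }

    # Assumption: only the first three segments get their own fields,
    # and a field is left out when the path has no such segment.
    for index, segment in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{index}"] = segment

    return fields


if __name__ == "__main__":
    for name, value in url_fields("https://example.com:8443/blog/2021/post.html").items():
        print(f"{name}: {value}")
```

Every value stays a string, matching the statement that all fields are strings or arrays of strings.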

In addition to these predefined fields, you can also extract custom fields via meta tags and attributes.
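
As a rough illustration of extracting custom fields from meta tags, the Python sketch below collects name/content pairs from meta tags that carry a marker class. It assumes the marker is class="elastic", as in a tag like <meta class="elastic" name="product_price" content="99.99">. That convention, the class name CustomMetaParser, and the sample field name are assumptions for illustration; this page does not confirm the exact markup the crawler expects.

```python
from html.parser import HTMLParser


class CustomMetaParser(HTMLParser):
    """Hypothetical parser: collect custom fields from marked <meta> tags."""

    MARKER_CLASS = "elastic"  # assumption: marker class identifying custom fields

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        name = attributes.get("name")
        content = attributes.get("content")
        # Keep only tags that carry the marker class and a name/content pair.
        if self.MARKER_CLASS in classes and name and content is not None:
            self.fields[name] = content


if __name__ == "__main__":
    page = """
    <html><head>
      <title>Example product</title>
      <meta name="description" content="An ordinary description tag.">
      <meta class="elastic" name="product_price" content="99.99">
    </head><body></body></html>
    """
    parser = CustomMetaParser()
    parser.feed(page)
    print(parser.fields)
```

The ordinary description tag is skipped here because it lacks the marker class; it would feed the predefined meta_description field instead.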
