IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
Web crawler schema
The web crawler indexes search documents using the following schema. All fields are strings or arrays of strings.
- additional_urls - The URLs of additional pages with the same content.
- body_content - The content of the page’s <body> tag with all HTML tags removed. Truncated to crawler.extraction.body_size.limit.
- domains - The domains in which this content appears.
- full_html - The full HTML of the page in string form. This is disabled by default. If the setting is disabled, the document will not have a full_html field at all.
- headings - The text of the page’s HTML headings (h1-h6 elements). Limited by crawler.extraction.headings_count.limit.
- id - The unique identifier for the page.
- last_crawled_at - The date and time when the page was last crawled.
- links - Links found on the page. Limited by crawler.extraction.indexed_links_count.limit.
- meta_description - The page’s description, taken from the <meta name="description"> tag. Truncated to crawler.extraction.description_size.limit.
- meta_keywords - The page’s keywords, taken from the <meta name="keywords"> tag. Truncated to crawler.extraction.keywords_size.limit.
- title - The title of the page, taken from the <title> tag. Truncated to crawler.extraction.title_size.limit.
- url - The URL of the page.
- url_host - The hostname or IP from the page’s URL.
- url_path - The full pathname from the page’s URL.
- url_path_dir1 - The first segment of the pathname from the page’s URL.
- url_path_dir2 - The second segment of the pathname from the page’s URL.
- url_path_dir3 - The third segment of the pathname from the page’s URL.
- url_port - The port number from the page’s URL (as a string).
- url_scheme - The scheme of the page’s URL.
In addition to these predefined fields, you can also extract custom fields via meta tags and attributes.
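
To make the url_* fields in the schema more concrete, the following Python sketch derives them from a page URL using the standard library. It is an illustration only, not the crawler's own code: the function name url_fields, the DEFAULT_PORTS table, the choice to fall back to the scheme's default port when the URL does not state one, and the choice to count every non-empty path segment (including a trailing file name) are all assumptions made for this example.

```python
from urllib.parse import urlsplit

# Assumption: when a URL has no explicit port, use the scheme's usual default.
DEFAULT_PORTS = {"http": "80", "https": "443"}


def url_fields(url: str) -> dict:
    """Build the url_* fields for a page URL (hypothetical helper)."""
    parts = urlsplit(url)
    fields = {
        "url": url,
        "url_scheme": parts.scheme,
        "url_host": parts.hostname or "",
        "url_path": parts.path or "/",
    }

    # The schema stores the port as a string, not a number.
    if parts.port is not None:
        fields["url_port"] = str(parts.port)
    elif parts.scheme in DEFAULT_PORTS:
        fields["url_port"] = DEFAULT_PORTS[parts.scheme]

    # Keep at most the first three non-empty path segments.
    segments = [segment for segment in parts.path.split("/") if segment]
    for index, segment in enumerate(segments[:3], start=1):
        fields[f"url_path_dir{index}"] = segment

    return fields


if __name__ == "__main__":
    for name, value in url_fields("https://example.com/docs/guide/setup/index.html").items():
        print(f"{name}: {value}")
```

In this sketch, a URL with fewer than three path segments simply omits the unused url_path_dir fields, mirroring how the schema omits full_html when that setting is disabled. Whether the crawler itself omits them or leaves them empty is not stated above.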