ℹ️ For new users, we recommend using our native Elasticsearch tools, rather than the standalone App Search product. We are actively developing new features and capabilities in the Elastic Stack to help you build powerful search applications. Refer to this blog for more information.


Content type	URL
Website	`https://example.com`
Blog	`https://example.com/blog`
Ecommerce application	`https://shop.example.com`
Ecommerce administrative dashboard	`https://shop.example.com/admin`
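
The four URLs in the table above fall under two domains. As a minimal sketch, assuming a domain is identified by its protocol and hostname, the following Python snippet derives the distinct domains from a list of content URLs. The `unique_domains` helper is hypothetical and is not part of any Elastic tooling.

```python
from urllib.parse import urlsplit

# URLs copied from the table above.
CONTENT_URLS = [
    "https://example.com",
    "https://example.com/blog",
    "https://shop.example.com",
    "https://shop.example.com/admin",
]


def unique_domains(urls):
    """Return the distinct protocol://hostname pairs, in first-seen order."""
    seen = []
    for url in urls:
        parts = urlsplit(url)
        domain = f"{parts.scheme}://{parts.netloc}"
        if domain not in seen:
            seen.append(domain)
    return seen


domains = unique_domains(CONTENT_URLS)
print(domains)  # ['https://example.com', 'https://shop.example.com']
```

Paths such as `/blog` and `/admin` do not create new domains; they are candidates for entry points or crawl rules instead.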

Problem	Description	Solution
External domain	The web crawler does not follow links that go outside the domains configured for each crawl.	Manage domains for your crawl to add any missing domains.
Disallowed path	The web crawler does not follow links whose paths are disallowed by a domain’s crawl rules or robots.txt directives.	Manage crawl rules and robots.txt files for each domain to ensure paths are allowed.
No incoming links	The web crawler cannot find pages that have no incoming links, unless you provide the path as an entry point. See Content discovery for an explanation of how the web crawler discovers content.	Add links to the content from other content that the web crawler has already discovered, or explicitly add the URL as an entry point or within a sitemap.
Nofollow links	The web crawler does not follow links marked as nofollow (for example, with `rel="nofollow"`).	Remove the nofollow marking from the link to allow content discovery.
`nofollow` robots meta tag	If a page contains a `nofollow` robots meta tag, the web crawler will not follow links from that page.	Remove the meta tag from your page.
Page too large	The web crawler does not parse HTTP responses larger than `crawler.http.response_size.limit`.	Reduce the size of your page. Or, increase the limit for your deployment. Increasing the limit may increase crawl durations and resource consumption, and could reduce crawl stability.
Too many redirects	The web crawler does not follow redirect chains longer than `crawler.http.redirects.limit`.	Reduce the number of redirects for the page. Or, increase the limit for your deployment. Increasing the limit may increase crawl durations and resource consumption, and could reduce crawl stability.
Network latency	The web crawler fails requests that exceed the following network timeouts: `crawler.http.connection_timeout`, `crawler.http.read_timeout`, `crawler.http.request_timeout`.	Reduce network latency. Or, increase these timeouts for your deployment. Increasing the timeouts may increase crawl durations and resource consumption, and could reduce crawl stability.
HTTP errors	The web crawler cannot discover and index content if it cannot fetch HTML pages from a domain. The web crawler will not index pages that respond with a `4xx` or `5xx` response code.	Fix HTTP server errors. Ensure correct HTTP response codes.
HTML errors	The web crawler cannot parse extremely broken HTML pages. In that case, the web crawler cannot index the page, and cannot discover links coming from that page.	Use the W3C markup validation service to identify and resolve HTML errors in your content.
Security	The web crawler cannot access content requiring authentication or authorization.	Remove the authentication or authorization requirement so that the web crawler can access the content.
Non-HTTP protocol	The web crawler recognizes only the HTTP and HTTPS protocols.	Publish your content at URLs using HTTP or HTTPS protocols.
Invalid SSL certificate	The web crawler will not crawl HTTPS pages with invalid certificates.	Replace invalid certificates with valid certificates.
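
Several of the problems in the table above can be checked before starting a crawl. The following Python sketch tests a URL and its HTML for three of them: a non-HTTP protocol, an external domain, and a `nofollow` robots meta tag. The `discovery_blockers` function and `RobotsMetaParser` class are hypothetical helpers written for illustration; they do not reproduce the web crawler's own logic, and they assume a domain is identified by protocol and hostname.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def discovery_blockers(url, html, allowed_domains):
    """Return the names of the table rows that apply to this URL and page."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        problems.append("Non-HTTP protocol")
    elif f"{parts.scheme}://{parts.netloc}" not in allowed_domains:
        problems.append("External domain")

    parser = RobotsMetaParser()
    parser.feed(html)
    if "nofollow" in parser.directives:
        problems.append("nofollow robots meta tag")
    return problems


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(discovery_blockers("https://example.com/blog", page, {"https://example.com"}))
```

The remaining rows, such as response size, redirect chains, and timeouts, depend on deployment settings and network behavior, so this sketch does not cover them.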

Problem	Description	Solution
Duplicate content	If your website contains pages with duplicate content, those pages are stored as a single document within your engine. The document’s `additional_urls` field indicates the URLs that contain the same content.	Use a canonical URL link tag within any document containing duplicate content.
Non-HTML content missing	Since 8.3.0, the web crawler can index non-HTML, downloadable files.	Ensure that the feature is enabled and that the MIME type of each missing file is included in the configured MIME types. For full details, see the binary content extraction reference.
`noindex` robots meta tag	The web crawler will not index pages that include a `noindex` robots meta tag.	Remove the meta tag from your page.
Page too large	The web crawler does not parse HTTP responses larger than `crawler.http.response_size.limit`.	Reduce the size of your page. Or, increase the limit for your deployment. Increasing the limit may increase crawl durations and resource consumption, and could reduce crawl stability.
Truncated fields	The web crawler truncates some fields before indexing the document, according to the following limits: `crawler.extraction.body_size.limit`, `crawler.extraction.description_size.limit`, `crawler.extraction.headings_count.limit`, `crawler.extraction.indexed_links_count.limit`, `crawler.extraction.keywords_size.limit`, `crawler.extraction.title_size.limit`.	Reduce the length of these fields within your content. Or, increase these limits for your deployment. Increasing the limits may increase crawl durations and resource consumption, and could reduce crawl stability.
Broken HTML	The web crawler cannot parse extremely broken HTML pages.	Use the W3C markup validation service to identify and resolve HTML errors in your content.
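
Two rows in the table above depend on tags embedded in the page itself: the `noindex` robots meta tag and the canonical URL link tag. The following Python sketch reports both for a given HTML document. The `indexing_hints` function and `IndexingHintParser` class are hypothetical helpers for inspecting your own pages; they are an illustration and not the web crawler's implementation.

```python
from html.parser import HTMLParser


class IndexingHintParser(HTMLParser):
    """Record a noindex robots directive and a canonical link, if present."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical_url = None

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True
        elif tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical_url = attr.get("href") or None


def indexing_hints(html):
    """Return the indexing-related hints found in an HTML document."""
    parser = IndexingHintParser()
    parser.feed(html)
    return {"noindex": parser.noindex, "canonical_url": parser.canonical_url}


sample = (
    "<html><head>"
    '<link rel="canonical" href="https://example.com/blog">'
    '<meta name="robots" content="noindex">'
    "</head><body></body></html>"
)
print(indexing_hints(sample))
```

A page that reports `noindex` as `True` matches the `noindex` row above. A missing `canonical_url` on pages with duplicate content is a hint that a canonical URL link tag could be added.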

Crawl web content

Identify web content

Create engine

Manage crawl

Manage domains

Manage entry points

Manage crawl rules

Manage robots.txt files

Manage sitemaps

Manage duplicate document handling

Embed web crawler instructions within content

Start crawl

Cancel crawl

Monitor crawl

View crawl status

View crawl request ID

View web crawler events logs

View web crawler events by crawl ID and URL

View web crawler system logs

View indexed documents

Troubleshoot crawl

Troubleshoot specific errors

Troubleshoot crawl stability

Troubleshoot content discovery

Troubleshoot content extraction and indexing

Re-crawl web content using a full crawl

Re-crawl web content using a partial crawl

Re-apply crawl rules

Automatic crawling

Schedule crawls