IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

Web crawler (beta) FAQ

The Elastic Enterprise Search web crawler is a beta feature. Beta features are subject to change and are not covered by the support SLA of general release (GA) features. Elastic plans to promote this feature to GA in a future release.

This page answers two frequently asked questions about the Enterprise Search web crawler: what functionality is supported, and what functionality is not supported.

See Web crawler (beta) reference for detailed technical information about the web crawler.

We also welcome your feedback.

What functionality is supported?

Crawling publicly accessible HTTP/HTTPS websites
Support for crawling multiple domains per engine
Robots meta tag support
Robots "nofollow" support
  Includes robots meta tags set to "nofollow" and links with rel="nofollow" attributes.
Robots.txt support
  The web crawler honors directives within robots.txt files.
Sitemap support
  The web crawler honors XML sitemaps, and fetches sitemaps identified within robots.txt files.
Basic content extraction
  The web crawler extracts content for a predefined, non-configurable set of fields from each page it visits.
"Entry points"
  Entry points allow customers to specify where the web crawler begins crawling each domain.
"Crawl rules"
  Crawl rules allow customers to control whether each URL the web crawler encounters will be visited and indexed.
Logging of each crawl
  Each log covers an entire crawl, which encompasses all domains in an engine.
User interfaces for managing domains, entry points, and crawl rules
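
To make the robots.txt, sitemap, and "nofollow" items above concrete, here is a minimal sketch of what honoring those signals means for any crawler. It uses only the Python standard library and is not the Enterprise Search crawler's implementation. The robots.txt body, the sample page, the example.com URLs, and the names LinkCollector and followable_links are all hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

# Hypothetical page: one ordinary link, one rel="nofollow" link,
# and one link into a path that robots.txt disallows.
PAGE_HTML = """\
<html><head><title>Example</title></head>
<body>
<a href="/docs">Docs</a>
<a href="/login" rel="nofollow">Log in</a>
<a href="/private/report">Report</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect hrefs while tracking both forms of "nofollow"."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set by a robots meta tag containing "nofollow"
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            directives = [d.strip().lower() for d in content.split(",")]
            if "nofollow" in directives:
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href"):
            rel_values = (attrs.get("rel") or "").lower().split()
            if "nofollow" not in rel_values:
                self.links.append(attrs["href"])


def followable_links(html, robots, base="https://example.com", agent="*"):
    """Return the hrefs a polite crawler would still be allowed to visit."""
    collector = LinkCollector()
    collector.feed(html)
    if collector.page_nofollow:
        return []  # a page-level nofollow applies to every link on the page
    return [
        href
        for href in collector.links
        if robots.can_fetch(agent, urljoin(base, href))
    ]


robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

print(followable_links(PAGE_HTML, robots))  # ['/docs']
print(robots.site_maps())  # ['https://example.com/sitemap.xml'] (Python 3.8+)
```

In this sketch, /login is skipped because of its rel="nofollow" attribute and /private/report is skipped because robots.txt disallows it. The sitemap URL is read from the same robots.txt file.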

What functionality is not supported?

Automatic or scheduled crawling
  Start crawls manually from the UI, or use the crawler API to start a crawl on demand, for example from your own scheduler.
Single-page app (SPA) support
  The crawler cannot currently crawl pages that are pure JavaScript single-page apps.
Configurable content extraction
  Content extraction is currently limited to a predefined, non-configurable set of fields.
Crawling private websites or websites behind authentication
Crawl persistence
  If a crawl stops unexpectedly before it finishes, it cannot resume where it left off. You can restart the crawl from the beginning; the crawler will not duplicate documents that it has already indexed.
Extracting content from files
  Currently, the web crawler extracts content only from HTML.
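
Because automatic crawling is not built in, one workaround for the first limitation above is to call the crawler API from an external scheduler such as cron. The following is a rough sketch under stated assumptions, not an official client. The path template in CRAWL_REQUEST_PATH is an assumption that should be checked against the Web crawler (beta) reference for your version. The base URL, engine name, and API key are hypothetical placeholders, and build_crawl_request and start_crawl are illustrative names.

```python
import urllib.request

# ASSUMPTION: path template for creating a crawl request. Verify it against
# the Web crawler (beta) reference for your version before relying on it.
CRAWL_REQUEST_PATH = "/api/as/v0/engines/{engine}/crawler/crawl_requests"


def build_crawl_request(base_url, engine, api_key, path_template=CRAWL_REQUEST_PATH):
    """Build, but do not send, a POST request that asks for a new crawl."""
    url = base_url.rstrip("/") + path_template.format(engine=engine)
    return urllib.request.Request(
        url,
        data=b"",  # empty body; the request itself is the trigger
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def start_crawl(request, timeout=30):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical deployment URL, engine name, and API key.
    request = build_crawl_request("http://localhost:3002", "my-engine", "private-xxxx")
    print(request.get_method(), request.full_url)
    # Uncomment to actually send the request against a running deployment:
    # print(start_crawl(request))
```

Building the request separately from sending it lets you inspect the URL and headers before pointing the script at a real deployment. A scheduler entry can then run the script at whatever interval suits your content.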
