The Elastic Enterprise Search web crawler is a beta feature. Beta features are subject to change and are not covered by the support SLA of general release (GA) features. Elastic plans to promote this feature to GA in a future release.
This page answers frequently asked questions about the Enterprise Search web crawler.
See the Web crawler (beta) reference for detailed technical information about the web crawler.
We also welcome your feedback.
The following functionality is supported:
- Crawling publicly accessible HTTP/HTTPS websites
- Support for crawling multiple domains per engine
- Robots meta tag support
Robots "nofollow" support
Includes robots meta tags set to "nofollow" and links with rel="nofollow" attributes.
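To illustrate the two "nofollow" forms described above, here is a minimal Python sketch that collects the links a crawler might follow from a page. It is not the Enterprise Search implementation. The `FollowableLinks` class and `followable_links` function are hypothetical names, and the sketch assumes that a page-level robots meta tag containing "nofollow" suppresses every link on that page.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Collect href values a crawler may follow (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False  # set when a robots meta tag says "nofollow"
        self._links = []

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (for example <a href>), so normalize them.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            directives = [d.strip().lower() for d in attrs.get("content", "").split(",")]
            if "nofollow" in directives:
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href"):
            rel_tokens = attrs.get("rel", "").lower().split()
            if "nofollow" not in rel_tokens:
                self._links.append(attrs["href"])

    @property
    def links(self):
        # Assumption: a page-level nofollow overrides every link on the page.
        return [] if self.page_nofollow else list(self._links)


def followable_links(html):
    parser = FollowableLinks()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `followable_links(page_html)` returns only the links that carry no rel="nofollow" attribute, or an empty list when the page itself is marked "nofollow".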
The web crawler honors directives within robots.txt files.
The web crawler honors XML sitemaps and fetches sitemaps identified within robots.txt files. You can manage additional sitemaps for a domain through the domain dashboard.
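The robots.txt and sitemap behavior described above can be approximated with the Python standard library. This is only a sketch of the general technique, not the crawler's own code. The function names and the `example-crawler` user agent used below are hypothetical, and `RobotFileParser.site_maps()` requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org XML sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def sitemap_urls(sitemap_xml):
    """Return the <loc> value of every <url> entry in an XML sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

With a parsed file in hand, `parser.can_fetch("example-crawler", url)` reports whether a URL is permitted, and `parser.site_maps()` lists any `Sitemap:` entries (or returns `None` when there are none), each of which could then be downloaded and passed to `sitemap_urls`.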
Configurable content extraction
The web crawler extracts a predefined set of fields (URL, body content, etc.) from each page it visits. In addition, the crawler supports extracting dynamic fields from meta tags.
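As a rough illustration of combining fixed fields with fields taken from meta tags, consider the Python sketch below. The field names (`url`, `title`, `body_content`) and the rule that every `<meta name=... content=...>` tag becomes a field are assumptions made for this example. See the web crawler reference for which meta tags the crawler actually extracts.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Gather title text, body text, and meta name/content pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self.meta = {}
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_body:
            self.body_parts.append(data)


def extract_fields(url, html):
    parser = PageFields()
    parser.feed(html)
    parser.close()
    document = {
        "url": url,
        "title": "".join(parser.title_parts).strip(),
        # Collapse runs of whitespace so the body reads as one line of text.
        "body_content": " ".join(" ".join(parser.body_parts).split()),
    }
    for name, content in parser.meta.items():
        # Assumption: predefined fields win over a meta tag with the same name.
        document.setdefault(name, content)
    return document
```

This sketch keeps everything inside `<body>`, including any script or style text, so a real extractor would need additional filtering.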
Entry points allow customers to specify where the web crawler begins crawling each domain.
Crawl rules allow customers to control whether each URL the web crawler encounters will be visited and indexed.
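One way to picture how entry points and crawl rules interact is the Python sketch below. It assumes an ordered rule list in which the first matching rule decides the outcome, with URLs allowed by default. The `CrawlRule` type, its `policy` and `kind` values, and the `is_allowed` function are hypothetical and are not the crawler's actual rule syntax, which is described in the web crawler reference.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class CrawlRule:
    policy: str   # "allow" or "disallow" (hypothetical values)
    kind: str     # "begins", "ends", "contains", or "regex" (hypothetical values)
    pattern: str

    def matches(self, path):
        if self.kind == "begins":
            return path.startswith(self.pattern)
        if self.kind == "ends":
            return path.endswith(self.pattern)
        if self.kind == "contains":
            return self.pattern in path
        if self.kind == "regex":
            return re.search(self.pattern, path) is not None
        raise ValueError(f"unknown rule kind: {self.kind!r}")


def is_allowed(url, rules, default=True):
    """Return the policy of the first rule matching the URL path."""
    path = urlparse(url).path or "/"
    for rule in rules:
        if rule.matches(path):
            return rule.policy == "allow"
    return default


def starting_queue(entry_points, rules):
    """Entry points seed the crawl; rules still decide whether each is visited."""
    return [url for url in entry_points if is_allowed(url, rules)]
```

Because evaluation stops at the first match in this sketch, a narrow "allow" rule placed before a broad "disallow" rule carves out an exception to it.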
Logging of each crawl
Each set of logs represents an entire crawl, which encompasses all domains in an engine.
Configure the cadence for new crawls to start automatically if there isn’t an active crawl.
- User interfaces for managing domains, entry points, and crawl rules
The crawler uses Elasticsearch to maintain its state during an active crawl, allowing crawls to be migrated between instances in case of an instance failure or a restart of an Enterprise Search instance running a crawl. Each unique URL is visited only once thanks to the Seen URLs list persisted in Elasticsearch. Crawl-specific indexes are automatically cleaned up after a crawl is finished.
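The "visit each unique URL once" idea can be sketched in a few lines of Python. Here an in-memory set stands in for the Seen URLs list that the crawler persists in Elasticsearch, and a dictionary of links stands in for fetching pages. The `crawl_order` function and its parameters are hypothetical, and stripping URL fragments before comparison is an assumption of this example.

```python
from collections import deque
from urllib.parse import urldefrag


def crawl_order(entry_points, links_by_url):
    """Return URLs in breadth-first visit order, visiting each at most once.

    links_by_url maps a URL to the links found on that page; it replaces
    real HTTP fetching so the sketch stays self-contained.
    """
    seen = set()      # stand-in for the persisted Seen URLs list
    queue = deque()   # URLs discovered but not yet visited
    visited = []

    def enqueue(url):
        url = urldefrag(url).url  # treat /page and /page#section as one URL
        if url not in seen:
            seen.add(url)
            queue.append(url)

    for url in entry_points:
        enqueue(url)

    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in links_by_url.get(url, []):
            enqueue(link)

    return visited
```

Keeping `seen` and `queue` in a shared store rather than in process memory is what would let another instance resume the crawl after a failure or restart.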
The following functionality is not currently supported:
- Single-page app (SPA) support
- Crawling private websites or websites behind authentication
Extracting content from files
Currently, the web crawler extracts content only from HTML pages.