IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

Elastic web crawler

Looking for the App Search web crawler? See the App Search documentation.

To compare the web crawler with the App Search web crawler, see the reference table on this page.

Use the web crawler to programmatically discover, extract, and index searchable content from websites and knowledge bases. When you ingest data with the web crawler, a search-optimized Elasticsearch index is created to hold and sync webpage content.

The web crawler is a native Elasticsearch solution. It reads and writes directly to Elasticsearch indices in a format that enables developers to build intuitive, relevant search experiences using App Search engines and the Search UI library.

In the Kibana UI, go to Search > Content > Web crawlers to create new web crawlers and to manage and monitor crawls.

Availability and prerequisites

The Elastic web crawler was introduced in Elastic version 8.4.0.

The crawler is available to all Elastic Cloud deployments.

Your deployment must include the Elasticsearch, Kibana, and Enterprise Search services. Your Enterprise Search service should have at least 4 GB RAM per zone. See Infrastructure requirements to learn how to verify and change the RAM for your Enterprise Search service.

The web crawler is also available to self-managed deployments when the subscription requirements are satisfied. View the requirements for this feature under the Elastic Search section of the Elastic Stack subscriptions page.

Web crawler documentation

Website search tutorial: A concrete guide to building a website search experience using the crawler UI.
Managing crawls: Detailed reference for managing crawls using the Kibana UI. Learn how to:
- Manage duplicated documents
- Extract binary content, such as PDFs, from webpages
- Schedule automated crawls
Optimizing web content: Optimize your web content source files for the web crawler, to manage webpage discovery and content extraction.
Custom field values using ingest pipeline: How to customize crawler field values using an ingest pipeline.
Troubleshooting crawls: Detailed troubleshooting reference.
Web crawler events logs reference: Detailed reference for web crawler events logs.
View web crawler events logs: How to view web crawler events logs in Kibana.

Version history

The following is a list of significant changes affecting this feature:

8.5.0: The web crawler’s default ingest pipeline changes

Since version 8.5.0, newly created Elastic web crawler indices use a new default pipeline that indexes extracted binary content into the body field. This differs from the usual body_content field that HTML content is indexed into, and may result in unexpected search results. This change does not affect existing Elastic web crawler indices created prior to 8.5.0.

The following workarounds may apply:
- Search experiences that expect content only in the body_content field can be updated to search across the body field as well.
- You may "Copy and customize" the default pipeline of your crawler index, adding a set processor to copy the body field into the body_content field, or vice versa as needed.
- If you have App Search engines built on top of an Elastic web crawler index, double-check that any boosts and weights applied to the body_content field are also applied to the body field, where applicable.
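
The first two workarounds can be sketched as plain request bodies. The following Python snippet is a minimal illustration, not the pipeline or query that Elastic ships: it builds a `multi_match` query that searches both the body_content and body fields, and a `set` processor definition that uses `copy_from` to copy body into body_content. The function names are hypothetical, and the `if` condition on the processor is an assumption added so that documents without the source field are skipped. Whether the copy should overwrite an existing target value depends on your data and is left at the processor's default here.

```python
import json

# Field names taken from the version note above.
HTML_FIELD = "body_content"  # field that extracted HTML content is indexed into
BINARY_FIELD = "body"        # field that extracted binary content is indexed into (8.5.0+)


def build_search_body(query_text):
    """Build a search request body that matches against both content fields."""
    return {
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": [HTML_FIELD, BINARY_FIELD],
            }
        }
    }


def build_copy_processor(source_field=BINARY_FIELD, target_field=HTML_FIELD):
    """Build a `set` processor definition that copies source_field into target_field."""
    return {
        "set": {
            "field": target_field,
            "copy_from": source_field,
            # Assumption: only run when the source field is present on the document.
            "if": f"ctx['{source_field}'] != null",
        }
    }


if __name__ == "__main__":
    # Print both bodies so they can be inspected or pasted into a request.
    print(json.dumps(build_search_body("annual report"), indent=2))
    print(json.dumps(build_copy_processor(), indent=2))
```

The search body is intended for an Elasticsearch search request against the crawler index, and the processor definition is intended to be added to a customized copy of the index's pipeline. To copy in the other direction, call `build_copy_processor("body_content", "body")`.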
8.4.0: Web crawler is generally available (GA)

Web crawler and App Search web crawler feature comparison

Feature	App Search web crawler	Web crawler
Interface	GUI / API	GUI-only
Binary content extraction	Yes	Yes
Search	App Search search APIs	Elasticsearch search APIs
Ingest pipelines	Yes	Yes
ML inference pipelines	No	Yes
Monitoring	Yes	Yes
APM	Yes	Yes
Audit logging	Yes	No
Event logging	Yes	Yes
Public REST API	Yes	No
Content extraction rules	No	Yes
