Managing crawls in Kibana

This documentation contains all the information you need for managing crawls using the Kibana UI.

If you’re looking to run some initial, small-scale test crawls using the UI, see Getting started. If you need to learn how to optimize source files for the crawler, see Optimizing web content.

Overview

It’s important to understand the primary crawl management tools and how they influence your crawls:

  • Domains set crawl boundaries.
  • Entry points and Sitemaps set starting points within domains.
  • Crawl rules and robots.txt directives set additional rules for crawling, beyond the starting points.

Here you’ll learn about discovering content, extracting content, running manual crawls, and scheduling automated crawls.

Domains

A domain is a website or property you’d like to crawl. You must associate one or more domains with your index’s web crawler. The web crawler cannot discover and index content outside of the specified domains.

Each domain has a domain URL that identifies the domain using a protocol and hostname. The domain URL cannot include a path. If a path is provided, it will automatically be removed from the domain URL and instead added as an entry point.
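The split described above can be sketched as follows (a minimal illustration, not the crawler’s actual implementation):

```python
from urllib.parse import urlsplit

def split_domain_url(url):
    """Split a URL into a domain URL (protocol + hostname) and,
    if a path is present, an entry point."""
    parts = urlsplit(url)
    domain_url = f"{parts.scheme}://{parts.netloc}"
    entry_point = parts.path if parts.path and parts.path != "/" else None
    return domain_url, entry_point

# A URL with a path becomes a domain URL plus an entry point:
print(split_domain_url("https://example.com/blog"))
# ('https://example.com', '/blog')
# A bare domain URL yields no entry point:
print(split_domain_url("http://shop.example.com"))
# ('http://shop.example.com', None)
```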

Each unique combination of protocol and hostname is a separate domain. This can be a source of confusion. Note that each of the following is its own domain:

  • http://example.com
  • https://example.com
  • http://www.example.com
  • https://www.example.com
  • http://shop.example.com
  • https://shop.example.com

Each domain has:

  • One or more entry points.
  • One or more crawl rules.
  • Zero or one robots.txt files.
  • Zero or more sitemaps.

Manage the domains for a crawl in the Kibana UI. Add your first domain on the getting started screen. From there, you can view, add, manage, and delete domains.

Entry points and sitemaps

Entry points

Each domain must have at least one entry point. Entry points are the paths from which the crawler will start each crawl. Ensure entry points for each domain are allowed by the domain’s crawl rules and by the directives within the domain’s robots.txt file. See robots.txt files to learn about managing robots.txt files.

Add multiple entry points if some pages are not discoverable from the first entry point. For example, if your domain contains an “island” page that is not linked from other pages, add its full URL as an entry point. If your domain has many pages that are not linked from other pages, it may be easier to reference them all via a sitemap.

Sitemaps

If the website you are crawling uses sitemaps, you can specify the sitemap URLs. Note that you can choose to submit URLs to the web crawler using sitemaps, entry points, or a combination of both.

You can manage the sitemaps for a domain through the Kibana UI:

  1. Navigate to Enterprise Search → Content → Elasticsearch indices → your-index → Manage domains.
  2. Select a domain.
  3. Click Add sitemap.

From here, you can view, add, edit, and delete sitemaps. Alternatively, you can specify sitemaps within a robots.txt file. At the start of each crawl, the web crawler fetches and processes each domain’s robots.txt file and each sitemap specified within those files.
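For example, a robots.txt file can reference sitemaps using the standard Sitemap directive (the URLs here are placeholders):

```
User-agent: *
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog/sitemap.xml
```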

You may prefer sitemaps over entry points if you have already published sitemaps for other web crawlers.

See Sitemaps if you are editing and managing sitemap source files.

Crawl rules

A crawl rule is a crawler instruction to allow or disallow specific paths within a domain. Learn the basics of crawl rules in our getting started documentation. Remember that order matters: each URL is evaluated according to the first matching rule. The web crawler will crawl only those paths that are allowed by the crawl rules for the domain and by the directives within the domain’s robots.txt file. Ensure entry points for each domain are allowed.

See robots.txt files to learn about using robots.txt files to allow or disallow paths.

Crawl rule logic (rules)

The logic for each rule is as follows:

Begins with

The path pattern is a literal string except for the character *, which is a metacharacter that will match anything.

The rule matches when the path pattern matches the beginning of the path (which always begins with /).

If using this rule, begin your path pattern with /.

Ends with

The path pattern is a literal string except for the character *, which is a metacharacter that will match anything.

The rule matches when the path pattern matches the end of the path.

Contains

The path pattern is a literal string except for the character *, which is a metacharacter that will match anything.

The rule matches when the path pattern matches anywhere within the path.

Regex

The path pattern is a regular expression compatible with the Ruby language regular expression engine. In addition to literal characters, the path pattern may include metacharacters, character classes, and repetitions. You can test Ruby regular expressions using Rubular.

The rule matches when the path pattern matches the beginning of the path (which always begins with /).

If using this rule, begin your path pattern with \/ or a metacharacter or character class that matches /.

Crawl rule matching

The following table provides various examples of crawl rule matching:

URL path                  Rule         Path pattern    Match?
/foo/bar                  Begins with  /foo            YES
/foo/bar                  Begins with  /*oo            YES
/bar/foo                  Begins with  /foo            NO
/foo/bar                  Begins with  foo             NO
/blog/posts/hello-world   Ends with    world           YES
/blog/posts/hello-world   Ends with    hello-*         YES
/blog/world-hello         Ends with    world           NO
/blog/world-hello         Ends with    *world          NO
/fruits/bananas           Contains     banana          YES
/fruits/apples            Contains     banana          NO
/2020                     Regex        \/[0-9]{3,5}    YES
/20                       Regex        \/[0-9]{3,5}    NO
/2020                     Regex        [0-9]{3,5}      NO
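The matching logic for the four rule types can be sketched as follows (a simplified model for illustration; the crawler’s own implementation uses the Ruby regex engine and may differ in detail):

```python
import re

def matches(rule, pattern, path):
    """Return True if `path` matches `pattern` under the given rule type.
    In non-regex patterns, `*` matches anything; everything else is literal."""
    if rule == "regex":
        # Regex rules are matched against the beginning of the path.
        return re.match(pattern, path) is not None
    # Translate the * wildcard into a regex, escaping everything else.
    literal = ".*".join(re.escape(part) for part in pattern.split("*"))
    if rule == "begins_with":
        return re.match(literal, path) is not None
    if rule == "ends_with":
        return re.search(literal + r"\Z", path) is not None
    if rule == "contains":
        return re.search(literal, path) is not None
    raise ValueError(f"unknown rule: {rule}")

print(matches("begins_with", "/*oo", "/foo/bar"))  # True
print(matches("regex", r"[0-9]{3,5}", "/2020"))    # False: not anchored at /
```

Note how the last example fails: a regex pattern is matched from the beginning of the path, so it must account for the leading /.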

Restricting paths using crawl rules

The domain dashboard adds a default crawl rule to each domain: Allow if Regex .*. You cannot delete or re-order this rule through the dashboard.

This rule is permissive, allowing all paths within the domain. To restrict paths, use either of the following techniques:

Add rules that disallow specific paths (e.g. disallow the blog):

Policy    Rule         Path pattern
Disallow  Begins with  /blog
Allow     Regex        .*

Or, add rules that allow specific paths and disallow all others (e.g. allow only the blog):

Policy    Rule         Path pattern
Allow     Begins with  /blog
Disallow  Regex        .*
Allow     Regex        .*
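First-match-wins evaluation over an ordered rule list can be sketched like this (illustration only; “Begins with” matching is simplified to a plain prefix check):

```python
import re

def is_allowed(rules, path):
    """Evaluate ordered (policy, rule, pattern) tuples against a path.
    The first matching rule decides; order matters."""
    for policy, rule, pattern in rules:
        if rule == "begins_with":
            matched = path.startswith(pattern)  # simplified: no * wildcard
        elif rule == "regex":
            matched = re.match(pattern, path) is not None
        else:
            raise ValueError(f"unknown rule: {rule}")
        if matched:
            return policy == "allow"
    return False

# The "allow only the blog" rule set described above:
blog_only = [
    ("allow", "begins_with", "/blog"),
    ("disallow", "regex", ".*"),
    ("allow", "regex", ".*"),  # default rule; never reached after disallow .*
]
print(is_allowed(blog_only, "/blog/posts/hello"))  # True
print(is_allowed(blog_only, "/shop"))              # False
```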

When you restrict a crawl to specific paths, be sure to add entry points that allow the crawler to discover those paths. For example, if your crawl rules restrict the crawler to /blog, add /blog as an entry point. If you leave only the default entry point /, the crawl will end immediately, since / is disallowed.

Duplicate document handling

By default, the web crawler identifies groups of duplicate web documents and stores each group as a single document in your index. The document’s url and additional_urls fields represent all the URLs where the web crawler discovered the document’s content — or a sample of URLs if more than 100. The url field represents the canonical URL, or the first discovered URL if no canonical URL is defined. If you manage your site’s HTML source files, see Canonical URL link tags to learn how to embed canonical URL link tag elements in pages that duplicate the content of other pages.

The crawler identifies duplicate content intelligently, ignoring insignificant differences such as navigation, whitespace, style, and scripts. More specifically, the crawler combines the values of specific fields, and it hashes the result to create a unique "fingerprint" to represent the content of the web document.

The web crawler then checks your index for an existing document with the same content hash. If it doesn’t find one, it saves a new document to the index. If it does find one, the crawler updates that existing document instead of saving a new one, adding the additional URL at which the content was discovered.
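The deduplication flow described above can be sketched like this (a simplified model; the field names and hash choice are illustrative, not the crawler’s internals):

```python
import hashlib

def content_hash(doc, fields=("title", "body_content")):
    """Combine the values of specific fields and hash the result
    into a content fingerprint (field names are illustrative)."""
    combined = "\n".join(str(doc.get(f, "")) for f in fields)
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

def index_document(index, doc, url):
    """Create a new document, or record the URL on an existing
    duplicate. `index` maps content hash -> stored document."""
    fingerprint = content_hash(doc)
    if fingerprint in index:
        index[fingerprint]["additional_urls"].append(url)  # duplicate content
    else:
        index[fingerprint] = {**doc, "url": url, "additional_urls": []}
    return index[fingerprint]

index = {}
index_document(index, {"title": "Hi", "body_content": "Hello"}, "http://example.com/a")
index_document(index, {"title": "Hi", "body_content": "Hello"}, "http://example.com/b")
print(len(index))  # 1: both URLs map to one stored document
```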

You can manage which fields the web crawler uses to create the content hash. You can also disable this feature and allow duplicate documents.

Set the default fields for all domains using the following configuration setting: connector.crawler.extraction.default_deduplication_fields.
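For example, in the Enterprise Search configuration file (the field names shown here are illustrative):

```yaml
connector.crawler.extraction.default_deduplication_fields: ["title", "body_content"]
```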

Manage these settings for each domain within the web crawler UI.

Manage duplicate document handling

After extracting the content of a web document, the web crawler compares that content to your existing documents, to check for duplication. To compare documents, the web crawler examines specific fields.

Manage these fields for each domain within the web crawler UI:

  1. Navigate to Enterprise Search → Content → Indices → your-index-name → domain name.
  2. Locate the section named Duplicate document handling.
  3. Select or deselect the fields you’d like the crawler to use. Alternatively, allow duplicate documents for a domain by deselecting Prevent duplicate documents.

If you want to manage duplicate documents by editing your HTML content, see Canonical URL link tags.

Binary content extraction

The web crawler can extract content from downloadable binary files, such as PDF and DOCX files. To use this feature, you must:

  • Enable binary content extraction with the configuration: connector.crawler.content_extraction.enabled: true.
  • Select which MIME types should have their contents extracted. For example: connector.crawler.content_extraction.mime_types: ["application/pdf", "application/msword"].

    • The MIME type is determined by the HTTP response’s Content-Type header when downloading a given file.
    • While intended primarily for PDF and Microsoft Office formats, you can use any of the supported formats documented by Apache Tika.
    • No default mime_types are defined. You must configure at least one MIME type in order to extract non-HTML content.
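Taken together, these settings look like the following in the Enterprise Search configuration file (the MIME types shown are examples):

```yaml
connector.crawler.content_extraction.enabled: true
connector.crawler.content_extraction.mime_types:
  - "application/pdf"
  - "application/msword"
```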

The ingest attachment processor does not support compressed files, e.g., an archive file containing a set of PDFs. Expand the archive file and make individual uncompressed files available for the web crawler to process.

Enterprise Search uses an Elasticsearch ingest pipeline to power the web crawler’s binary content extraction. This pipeline, named ent_search_crawler, is automatically created when Enterprise Search first starts.

You can view and update this pipeline in Kibana or with Elasticsearch APIs.

If you make changes to the default ent_search_crawler ingest pipeline, those changes will not be overwritten when you upgrade Enterprise Search, provided you have incremented the pipeline’s version above the upgrade version.

Running manual crawls

Manual crawls are useful for testing and debugging the web crawler. Your first crawl will be manual by default.

Other use cases for manual crawls include:

  • Crawling content only once for a specific purpose: For example, crawling a website you don’t control to make it easier to search its pages.
  • Crawling content that changes infrequently: For example, it might make sense to only run manual crawls when content is updated.
  • Your team needs to closely manage usage costs: For example, you only run crawls when needed, such as after updating a website.

How to run a manual crawl

To run a manual crawl, follow these steps in the web crawler UI:

  1. Navigate to your crawler index in Content → Indices → index-name.
  2. Click on Crawl.
  3. You have 3 options for manual crawls:

    • Crawl all domains on this index
    • Crawl with custom settings
    • Reapply crawl rules

Crawl with custom settings

Set up a one-time crawl with custom settings. In Getting started with the web crawler, we recommend using this option for tests, because it allows you to further restrict which pages are crawled.

Crawl with custom settings gives you the option to:

  • Set a maximum crawl depth, to specify how many pages deep the crawler traverses.

    • Set the value to 1, for example, to limit the crawl to only entry points.
  • Crawl select domains.
  • Define seed URLs with sitemaps and entry points.

Reapply crawl rules

If you’ve modified crawl rules, you can apply the updated rules to existing documents without running a full crawl. The web crawler will remove all existing documents that are no longer allowed by your current crawl rules. This operation is called a process crawl.

We recommend cancelling any active web crawls before re-applying crawl rules. A web crawl running concurrently with a process crawl may continue to index new documents using out-of-date configuration. Changes to crawl rule configuration apply only to documents indexed at the time of the request.

Scheduling automated crawls

In 8.4.0, 8.4.1, and 8.4.2, the Elastic web crawler does not respect a crawl schedule when one is configured. Crawls must be triggered manually in the UI.

You can schedule new crawls to start automatically. New crawls will be skipped if there is an active crawl.

For example, consider a crawl that completes on a Tuesday. If the crawl is configured to run every 7 days, the next crawl would start on the following Tuesday. If the crawl is configured to run every 3 days, then the next crawl would start on Friday.

Scheduling a crawl does not necessarily run the crawl immediately.

To manage automated crawls within the UI:

  1. Navigate to your index and select the Scheduling tab.
  2. Schedule crawls under Automated Crawl Scheduling.
  3. Toggle Crawl automatically.
  4. Set the crawl frequency.
  5. Save your settings.

The crawl schedule will perform a full crawl on every domain on this index.

Next steps

See Troubleshooting crawls to learn how to troubleshoot issues with your crawls.