Getting started with the Elastic web crawler

Looking for the App Search web crawler? See the App Search documentation.

This page is a concrete guide to getting started with the Elastic web crawler. See the general Enterprise Search getting started page for a high-level overview of all Enterprise Search tools.

This page walks you through your first targeted test crawls using the Kibana UI. See Managing crawls for a full reference.

The Elastic web crawler discovers and extracts web content, transforms it into structured documents, and indexes those documents into Elasticsearch.

Before you get started, decide which web content you’d like to crawl. The iterative feedback cycle looks like this:

  • Defining the crawl
  • Monitoring the results
  • Troubleshooting

Repeat, as necessary, every time you make changes to the crawl configuration. Get started by limiting your crawl to a single domain and a small number of documents. When you are happy with the results at a small scale, you can gradually add more complexity.

Here you will learn how to:

  • Create an Elasticsearch index
  • Add and validate domains
  • Define entry points and sitemaps
  • Define crawl rules
  • Monitor crawls and perform basic troubleshooting
  • Verify your documents

Prerequisites

To use the web crawler, you need an Elastic subscription, an Elastic deployment, and the credentials for an Elastic user who can access the web crawler UI within Kibana.

Get started quickly with a free trial on Elastic Cloud. After creating an account, you’ll have an active subscription, and you’ll be prompted to create your first deployment. When you open that deployment, you’ll already be logged in (using your Elastic Cloud user account). You’ll have access to the web crawler and all other Enterprise Search features within that deployment.

Other options are available, such as using an existing deployment, or creating a local development environment. You may need to start a trial subscription to enable all features. See Elastic subscription and Elastic deployment for details.

The web crawler was introduced in Enterprise Search 8.4, so be sure to use version 8.4 or later.

Create an Elasticsearch index

The first step is to create an Elasticsearch index. This is where the crawler will store indexed webpage content.

Follow these steps in Kibana:

  • Navigate to Enterprise Search → Content → Indices.
  • Click Create new index.
  • Select Web crawler as the ingestion method.
  • Name your new index.
  • Choose your document language.

You are now ready to add domains.
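
If you’d rather confirm the result from code, the following minimal sketch uses the Elasticsearch Python client to check that the new index exists and is still empty before the first crawl. The endpoint, API key, and index name are placeholders; recent versions of the UI typically prefix crawler index names with search-, so use the exact name shown in Kibana.

    # Minimal sketch (not from the Elastic docs): confirm the new crawler index
    # exists and is still empty before the first crawl runs.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(
        "https://my-deployment.es.example.cloud:443",  # placeholder endpoint
        api_key="YOUR_API_KEY",                        # placeholder credentials
    )

    index_name = "search-my-crawler-index"  # use the exact name shown in Kibana

    if es.indices.exists(index=index_name):
        print("documents indexed so far:", es.count(index=index_name)["count"])
    else:
        print(f"{index_name} was not found; create it in Kibana first")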

Add domains

It’s time to specify the domains you want to crawl. We recommend getting started with a single domain.

Follow these steps to validate and add domains:

  • If you’ve just created your index, you will automatically be taken to Manage domains for that index.

    To navigate there manually, go to Enterprise Search → Content → Indices → your-crawler-index → Manage domains.

  • Enter a domain to be crawled. The web crawler will not crawl any webpages outside of your defined domains.
  • Review any warnings flagged by the crawler domain validation process.
  • Inspect the domain’s robots.txt file, if it exists. The instructions within this file, also called directives, communicate which paths within that domain are disallowed for crawling (see the example below this list).
  • Add the domain.
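
For reference, a robots.txt file lives at the root of a domain (for example, https://example.com/robots.txt). The snippet below is purely illustrative; the paths are hypothetical:

    User-agent: *
    Disallow: /admin
    Disallow: /internal-search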

Define entry points and sitemaps

Entry points

Each domain must have at least one entry point. Entry points are the paths from which the crawler will start each crawl. Ensure entry points for each domain are allowed by the domain’s crawl rules, and the directives within the domain’s robots.txt file.

Add multiple entry points if some pages are not discoverable from the first one. For example, if your domain contains an “island” page that is not linked from other pages, add that page’s full URL as an entry point.

Sitemaps

If the website you are crawling uses sitemaps, you can specify the sitemap URLs. Note that you can choose to submit URLs to the web crawler using sitemaps, entry points, or a combination of both.

To add a sitemap to a domain you manage, you can specify it within a robots.txt file. At the start of each crawl, the web crawler fetches and processes each domain’s robots.txt file and each sitemap specified within those files.
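
For example, a single Sitemap directive in the domain’s robots.txt file is enough for the crawler to pick up the sitemap on its next crawl (the URL below is a placeholder):

    Sitemap: https://example.com/sitemap.xml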

You may prefer to use sitemaps instead of entry points if you have already published sitemaps for other web crawlers.

Define crawl rules

This is enough configuration to start crawling, but we recommend first defining a few crawl rules. A crawl rule is a crawler instruction to allow or disallow specific paths within a domain. It makes sense to start with test crawls across a limited subset of paths.

To define a crawl rule, follow these steps:

  1. Choose one of two policies:

    • Allow
    • Disallow
  2. Choose one of four rules:

    • Begins with
    • Ends with
    • Contains
    • Regex
  3. Define the path pattern, e.g. /blog, /troubleshooting, .*.

The Boolean logic of crawl rules can get complex fast. Start simple, and when you are comfortable with the results, gradually add complexity as necessary. For example, you might want to restrict your first crawls to a single path on the target domain. Learn how to do this in the examples below.

A few important facts about crawl rules:

  • When the web crawler discovers a new page, it evaluates the URL against the crawl rules you defined to decide whether the page should be visited.
  • Rules are evaluated in sequential order, and the first rule that matches determines the policy for the URL. Therefore, order matters.
  • If a URL matches a crawl rule, it is allowed or disallowed for crawling according to that rule’s policy.
  • If a URL doesn’t match a rule, it is evaluated against the next rule.

Default crawl rule

The domain dashboard adds a default crawl rule to each domain, which cannot be deleted or re-ordered:

  Policy      Rule           Path pattern
  Allow       Regex          .*

Examples: Restricting paths using crawl rules

Rule to disallow a specific path, e.g. /blog:

  Policy      Rule           Path pattern
  Disallow    Begins with    /blog
  Allow       Regex          .*

Rules to allow only the /blog path and disallow all others:

  Policy      Rule           Path pattern
  Allow       Begins with    /blog
  Disallow    Regex          .*
  Allow       Regex          .*

When you restrict a crawl to specific paths, be sure to add entry points that allow the crawler to discover those paths. The web crawler will crawl only those paths that are allowed by the crawl rules for the domain and the directives within the robots.txt file for the domain.
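
The following sketch only illustrates the first-match evaluation described above, applied to the “allow only /blog” example; it is not the crawler’s implementation, and the exact matching semantics of each rule type may differ in the product:

    # Illustration of first-match crawl rule evaluation (not the crawler's code).
    # Rules mirror the "allow only /blog, disallow everything else" example above.
    import re

    rules = [
        ("allow",    "begins_with", "/blog"),
        ("disallow", "regex",       ".*"),
        ("allow",    "regex",       ".*"),  # default rule, always evaluated last
    ]

    def policy_for(path: str) -> str:
        """Return the policy of the first rule that matches the path."""
        for policy, rule_type, pattern in rules:
            if rule_type == "begins_with" and path.startswith(pattern):
                return policy
            if rule_type == "ends_with" and path.endswith(pattern):
                return policy
            if rule_type == "contains" and pattern in path:
                return policy
            if rule_type == "regex" and re.fullmatch(pattern, path):
                return policy
        return "allow"  # unreachable here: the default rule matches everything

    print(policy_for("/blog/2024/launch"))  # allow    (first rule matches)
    print(policy_for("/pricing"))           # disallow (second rule matches)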

See Crawl rules for a full overview of crawl rules.

Running your first crawl

You have added and validated a domain, defined entry points, pointed to sitemaps, and set some basic crawl rules. It’s time to run your first crawl. It’s best to start with manual, one-off crawls before thinking about scheduling recurring crawls.

Crawl with custom settings is ideal for your first tests, because this allows you to further restrict the pages to be crawled.

Select the Crawl with custom settings option and follow these steps:

  • Set a maximum crawl depth, to specify how many pages deep the crawler traverses.

    • Set the value to 1, for example, to limit the crawl to only entry points.
  • Crawl select domains, to restrict the crawl to specific domains only.

This will help you define small, targeted test crawls.
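
To picture what maximum crawl depth does, here is a conceptual sketch; it is not the crawler’s implementation, and the link graph is entirely made up. Entry points sit at depth 1, the pages they link to at depth 2, and so on:

    # Conceptual sketch of "maximum crawl depth" (not the crawler's implementation).
    # Entry points are depth 1; pages linked from them are depth 2, and so on.
    from collections import deque

    link_graph = {  # hypothetical site structure standing in for real pages
        "/": ["/blog", "/pricing"],
        "/blog": ["/blog/post-1"],
        "/blog/post-1": [],
        "/pricing": [],
    }

    def crawl(entry_points, max_depth):
        visited, seen = [], set(entry_points)
        queue = deque((page, 1) for page in entry_points)
        while queue:
            page, depth = queue.popleft()
            visited.append(page)
            if depth == max_depth:
                continue  # don't follow links past the depth limit
            for link in link_graph.get(page, []):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return visited

    print(crawl(["/"], max_depth=1))  # ['/'] -- entry points only
    print(crawl(["/"], max_depth=2))  # ['/', '/blog', '/pricing']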

It’s time to launch your crawl. Next you’ll need to verify that the crawl is running, and that documents are being added to your index.

Monitor crawls

Under Recent crawl requests, you can quickly verify important information about your crawls.

Each crawl has an associated crawl request, identified by a unique ID, for example 60106315beae67d49a8e787d.

Each crawl has a status, which quickly communicates its state. If a crawl failed, you’ll need to troubleshoot. If a crawl is running, or has successfully completed, you can start verifying your documents.

Basic troubleshooting

Here we outline some basic troubleshooting techniques for issues you might encounter with your first crawls. For advanced troubleshooting tips and tricks, including how to view web crawler logs in Kibana, see Troubleshooting crawls.

Content discovery and extraction problems

You may encounter some of the following issues:

  • A domain is not being crawled: Make sure you’ve added it. Each unique combination of protocol and hostname is a separate domain (see the sketch after this list). Each of the following is its own domain:

    • http://example.com
    • https://www.example.com
    • http://shop.example.com
    • https://shop.example.com
  • A path is not being crawled: Make sure it’s not disallowed by crawl rules or by the domain’s robots.txt file.
  • A page has no incoming links: Add the URL as an entry point.
  • A page is too large: The web crawler will not index its content.
  • A page is not indexed: The web crawler cannot parse extremely broken HTML pages.
  • Duplicate content: The web crawler only indexes unique pages. See Duplicate document handling.
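
The following quick check (a sketch, not part of the product) makes the first point concrete: every URL above differs in scheme (protocol) or hostname, so each one is a separate domain to the crawler:

    # Each unique scheme + hostname pair is a separate domain to the crawler.
    from urllib.parse import urlsplit

    urls = [
        "http://example.com",
        "https://www.example.com",
        "http://shop.example.com",
        "https://shop.example.com",
    ]

    for url in urls:
        parts = urlsplit(url)
        print(f"{url:28} -> scheme={parts.scheme}, host={parts.hostname}")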

Verify your documents

Examine the fields that make up an indexed document. The web crawler transforms every HTML page it crawls into a document with the same set of fields.

If a crawl is running successfully, you’ll start to see documents being added to your Elasticsearch index almost immediately. You need to check that these documents are being indexed as expected.

Follow this checklist to verify your documents:

  • Browse the documents in your index, under Content → Indices → Documents.
  • Find a specific document, by wrapping the document’s URL in quotes. For example: "https://example.com/some/page.html".
  • Confirm that documents contain the expected content. Does each document have the content you were expecting?
  • Check for missing documents.
  • Check for the presence of documents which shouldn’t be indexed.

Your documents may contain too much, or too little, content. If you’re crawling content you control, you may need to optimize your content.

If you get too many documents, either delete the documents you don’t want or start over with a new index in your next iteration.

Kibana Dev Tools

Your documents are stored in Elasticsearch indices, so you can also explore them using Kibana Dev Tools.

Use the Elasticsearch REST API in Console to send requests and view their responses.
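
If you prefer a language client over Console, the sketch below does roughly the same thing with the Elasticsearch Python client: count the indexed documents and fetch one page by its URL. The endpoint, API key, index name, and example URL are placeholders, and field names such as title and body_content are typical of crawler documents but should be verified against your own index mapping:

    # Rough sketch of inspecting crawler output with the Python client.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(
        "https://my-deployment.es.example.cloud:443",  # placeholder endpoint
        api_key="YOUR_API_KEY",                        # placeholder credentials
    )

    index_name = "search-my-crawler-index"  # placeholder index name

    # How many pages have been indexed so far?
    print("documents indexed:", es.count(index=index_name)["count"])

    # Look for one specific page. Depending on how the url field is mapped,
    # an exact term query on a keyword field may be more appropriate; check
    # the index mapping first.
    resp = es.search(
        index=index_name,
        query={"match": {"url": "https://example.com/some/page.html"}},
        size=1,
    )

    for hit in resp["hits"]["hits"]:
        doc = hit["_source"]
        print(doc.get("url"), "|", doc.get("title"))
        print((doc.get("body_content") or "")[:200])  # first 200 characters of text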

See the Elasticsearch documentation to learn all about how to search your data.

Next steps

You’ve learned how to set up, configure, and troubleshoot your first crawls using the web crawler. You’ve run a few targeted test crawls. You know how to define basic crawl rules and how to verify your crawled documents. You’ve followed the iterative feedback cycle of Defining the crawl → Monitoring the results → Troubleshooting to gradually add complexity to your settings.

Once your documents are being crawled and indexed as required, you’re ready to dive into advanced crawler configuration topics.

Learn more:

See Engines and content sources to learn how to build search experiences using App Search engines.