Web crawler events logs reference



See View web crawler events logs to learn how to view web crawler events logs in Kibana.

The web crawler logs many events while discovering, extracting, and indexing web content.

Enterprise Search records these events using Elastic Common Schema (ECS), including a custom field set called crawler.* for crawler-specific data (like crawl_id).

This document provides a reference to these events and their fields.

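This reference first describes the fields common to all web crawler events:

  • Crawler-specific fields
  • Base fields
  • Service fields
  • Process fields
  • Host fields

The remainder of the document describes the different types of web crawler events:

  • Crawl lifecycle events
  • URL lifecycle events
  • Content ingestion events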

Fields common to all web crawler events

All web crawler events include the following common fields.

Crawler-specific fields

A unique ID of a specific crawl.

Base fields

A UTC timestamp of the event.
A unique identifier of the event.
The type of event. See the sections that follow.
A textual description of the event (useful for displaying in a UI for human consumption).

Service fields

A unique identifier of the crawler process generating the event (changes every time a process is restarted).
All events will have this set to crawler.
Current version of the Enterprise Search product.

Process fields

The PID of the crawler instance.
The ID of the thread logging the event.

Host fields

The host name where the crawler instance is deployed.
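
As a rough illustration, the check below reports which common fields are missing from a parsed event. The field names are assumptions: they are the standard ECS names that match the descriptions above, plus a guessed `crawler.crawl.id` for the crawl ID. This reference does not confirm them, and the sample event is invented.

```python
from typing import Any, Dict, List

# Assumed field names (standard ECS names plus a guessed crawler.* field).
ASSUMED_COMMON_FIELDS = (
    "@timestamp", "event.id", "event.action", "message",
    "service.ephemeral_id", "service.type", "service.version",
    "process.pid", "process.thread.id", "host.name", "crawler.crawl.id",
)


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested objects into dotted keys, e.g. {"event": {"id": 1}} -> {"event.id": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def missing_common_fields(event: Dict[str, Any]) -> List[str]:
    """Return the assumed common fields that are absent from the event."""
    flat = flatten(event)
    return [field for field in ASSUMED_COMMON_FIELDS if field not in flat]


# Invented sample event, for illustration only.
sample = {
    "@timestamp": "2024-01-01T00:00:00Z",
    "message": "Crawl started",
    "event": {"id": "evt-1", "action": "crawl-start"},
    "service": {"ephemeral_id": "svc-1", "type": "crawler", "version": "0.0.0"},
    "process": {"pid": 1234, "thread": {"id": 7}},
    "host": {"name": "crawler-host"},
    "crawler": {"crawl": {"id": "crawl-1"}},
}
print(missing_common_fields(sample))  # prints []
```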

Crawl lifecycle events

Each crawl lifecycle event records important checkpoints within the lifecycle of a specific crawl, for example: start, seed, end. Most of the event information is captured in the message field, along with the other common fields described above. The fields below provide additional details.

Each crawl lifecycle event has one of the following values for event.action:

Emitted when a crawl is started. Includes crawl configuration.
Emitted every time a crawl is seeded with a set of URLs from the outside. Includes the list of URLs submitted to the crawler.
Emitted when a crawl ends for any reason (finished, canceled, etc.).
Periodic events with a snapshot of crawler status metrics used for monitoring an active crawl over time.
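
The four lifecycle actions are enough to track whether a crawl is still active. The sketch below is one possible reading, assuming the `event.action` values named in the sections that follow (`crawl-start`, `crawl-seed`, `crawl-end`, `crawl-status`) and a list of actions already sorted by time. The state names are hypothetical.

```python
from typing import Iterable


def crawl_state(actions: Iterable[str]) -> str:
    """Derive a coarse crawl state from event.action values in time order.

    Returns one of the hypothetical states 'not-started', 'running', 'ended'.
    crawl-seed and crawl-status events do not change the state.
    """
    state = "not-started"
    for action in actions:
        if action == "crawl-start":
            state = "running"
        elif action == "crawl-end":
            state = "ended"
    return state


print(crawl_state(["crawl-start", "crawl-seed", "crawl-status"]))  # prints running
```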

Crawl start events

Set to event.
Set to start.
Set to crawl-start.
A serialized version of the crawl config.

Crawl seed events

Set to event.
Set to change.
Set to crawl-seed.
A list of URLs used to seed a crawl.

A type of the URLs being added:

  • content for generic content URLs.
  • sitemap for sitemap and sitemap-index URLs.
  • feed for RSS/ATOM feeds.

Crawl end events

Set to event.
Set to end.
Set to crawl-end.
Set to success or failure depending on how the crawl ended (for example, canceled crawls are considered failed).

Crawl status events

Set to metric.
Set to info.
Set to crawl-status.
A set of metrics describing the global state of a crawl and crawl-specific stats that may be useful to understand the state of a crawl over time.

URL lifecycle events

Each URL lifecycle event is scoped to a particular URL within a specific crawl. Each event describes what happened to the URL during the crawl, for example how and when the crawler discovered it, or why the crawler skipped it. These events carry enough detail for a human operator to understand exactly how the system discovered a specific URL, what decisions were made about it, and what the result of processing it was.

Each URL lifecycle event has one of the following values for event.action:

URL submitted to the crawl backlog for processing (from a seed list, from within the crawl, via an API, etc.).
URL fetch attempt including timing information, server response headers, HTTP code, etc.
URL discovery events. Each time the crawler discovers a URL on a page and makes a decision about it, the URL and the decision are logged.
Events logged when the crawler finishes content extraction from a URL (possibly with some basic metadata extracted from the page).
An event marking the end of URL processing.

Fields common to all URL lifecycle events

All URL lifecycle events include the following common fields:

Identification fields:

A unique identifier (hash) for the URL as it is handled by the crawler. All events for the same URL within a single crawl share the same hash.
A unique identifier of the URL that was used to discover this URL (only used for cases when a URL was discovered during a crawl and not submitted as a seed URL).
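
Because all events for the same URL within a crawl share one hash, the events can be grouped into a per-URL timeline. The sketch below assumes events have already been reduced to plain dictionaries with the hypothetical keys `url_hash`, `action`, and `timestamp` (an ISO 8601 UTC string in a uniform format, so string order matches time order). These key names are not the real field names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def url_timelines(events: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group event actions by URL hash, ordered by timestamp."""
    timelines: Dict[str, List[str]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        timelines[event["url_hash"]].append(event["action"])
    return dict(timelines)


# Invented events, deliberately out of order.
events = [
    {"url_hash": "abc", "action": "url-fetch", "timestamp": "2024-01-01T00:00:02Z"},
    {"url_hash": "abc", "action": "url-seed", "timestamp": "2024-01-01T00:00:01Z"},
    {"url_hash": "def", "action": "url-seed", "timestamp": "2024-01-01T00:00:03Z"},
    {"url_hash": "abc", "action": "url-output", "timestamp": "2024-01-01T00:00:04Z"},
]
print(url_timelines(events)["abc"])  # prints ['url-seed', 'url-fetch', 'url-output']
```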

URL details:

The full URL string.
Scheme portion of the URL.
Domain portion of the URL.
Port of the URL.
Path of the URL.
URL query string. Included when available.
URL fragment. Included when available.
Username portion of the URL. Included when available.
Password portion of the URL. Included when available.
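
The URL detail fields correspond to the standard components of a URL. As an illustration only, the sketch below splits a URL into those components with Python's `urllib.parse.urlsplit`. The dictionary keys are descriptive labels rather than the real field names, and filling in the default port for `http` and `https` is an assumption about how the crawler reports the port.

```python
from typing import Any, Dict
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def url_parts(url: str) -> Dict[str, Any]:
    """Split a URL into the components listed above."""
    parts = urlsplit(url)
    return {
        "full": url,
        "scheme": parts.scheme,
        "domain": parts.hostname,
        # Assumption: fall back to the scheme's default port when none is given.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        # Optional components are None when the URL does not include them.
        "query": parts.query or None,
        "fragment": parts.fragment or None,
        "username": parts.username,
        "password": parts.password,
    }


print(url_parts("https://user:pw@example.com:8443/docs/a?x=1#top")["port"])  # prints 8443
```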

URL seed events

These are small events used to track the flow of URLs into the crawler system and are primarily focused on tracking how a specific URL got into the backlog.

Set to event.
Set to start.
Set to url-seed.

A type of the URL being added:

  • content for generic content URLs.
  • sitemap for sitemap and sitemap-index URLs.
  • feed for RSS/ATOM feeds.

A name of the source used for seeding the crawl:

  • seed-list for seed-list URLs submitted as a part of the crawl configuration.
  • organic for URLs discovered during a crawl by following organic links.
  • redirect for pages discovered by following a redirect.
  • canonical-url for pages discovered via the canonical URL meta tag.
Set to the hash of the URL the crawler used to discover this page (only for URLs discovered during a crawl and not for entry points).
A positive number indicating the number of steps the crawler had to take from the set of seed URLs to reach this specific page.
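
The depth value can be pictured as a breadth-first distance from the seed URLs. The sketch below is not the crawler's implementation: it assumes seed URLs have depth 1 (the text above only says the number is positive) and uses a hypothetical in-memory link graph instead of real fetches.

```python
from collections import deque
from typing import Dict, Iterable, List


def crawl_depths(seeds: Iterable[str], links: Dict[str, List[str]]) -> Dict[str, int]:
    """Breadth-first traversal assigning each URL its distance from the seeds.

    Assumption: seeds start at depth 1; each followed link adds one step.
    """
    seeds = list(seeds)
    depths = {url: 1 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in depths:  # first discovery is the shortest path
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths


# Hypothetical link graph: page -> links found on that page.
links = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"]}
print(crawl_depths(["/"], links))  # prints {'/': 1, '/a': 2, '/b': 2, '/c': 3}
```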

URL fetch events

These are the primary events for troubleshooting network-layer issues with a crawl. They aim to provide enough insight into what happened during a fetch attempt and what the results were.

Each event represents a single HTTP request. If the crawler followed redirects, it logs a separate event for each request in the chain, including information about the redirect response, to help with troubleshooting redirect chains.

Set to event.
Set to access.
Set to url-fetch.

Event timing and outcome details:

The start of the HTTP request.
The end of the HTTP request.
Response timing for the HTTP request (total time it took to get the full response).
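
The timing value is the difference between the end and the start of the request. The sketch below computes it in nanoseconds, which is the unit ECS uses for `event.duration`. Whether the crawler reports its timing in that field and unit is an assumption, and the timestamps are invented.

```python
from datetime import datetime, timedelta


def duration_ns(start_iso: str, end_iso: str) -> int:
    """Return the elapsed time between two ISO 8601 timestamps in nanoseconds."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    # Integer division by one microsecond avoids floating-point rounding.
    return ((end - start) // timedelta(microseconds=1)) * 1000


print(duration_ns("2024-01-01T00:00:00.000000+00:00",
                  "2024-01-01T00:00:00.250000+00:00"))  # prints 250000000
```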

An ECS categorization field. Denotes whether the event represents a success or a failure from the perspective of the crawler:

  • failure - for all 3xx, 4xx and 5xx responses.
  • success - for all 2xx responses.
  • unknown - for network timeouts.
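
The mapping from HTTP response to outcome can be written down directly. In this sketch a timeout is represented as `None`, which is a choice made for the example. Returning `unknown` for status codes outside the 2xx to 5xx ranges is also an assumption, since the list above does not cover them.

```python
from typing import Optional


def fetch_outcome(status_code: Optional[int]) -> str:
    """Map an HTTP status code (or None for a network timeout) to an outcome."""
    if status_code is None:
        return "unknown"  # network timeout: no response was received
    if 200 <= status_code < 300:
        return "success"
    if 300 <= status_code < 600:
        return "failure"  # all 3xx, 4xx, and 5xx responses
    return "unknown"  # assumption: anything else is not classified


print(fetch_outcome(301))  # prints failure
```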

HTTP request details:

The method of the request.

HTTP response details:

The size of the response body in bytes (for successful responses only).
A string status code.

HTTP redirect details:

A Location header content for redirect responses.
Number of redirects followed so far in a redirect chain (starts with 1 on the first redirect and is increased on each subsequent redirect until a non-redirect response is received or the maximum number of redirects is reached).
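
To make the redirect counter concrete, the sketch below walks a redirect chain and produces one record per HTTP request. It reflects one reading of the description above: the counter appears on redirect responses only and starts at 1. The `responses` table stands in for real HTTP requests, and the record keys and the default limit of 10 redirects are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_with_redirects(
    url: str,
    responses: Dict[str, Tuple[int, Optional[str]]],
    max_redirects: int = 10,
) -> List[dict]:
    """Follow a redirect chain, returning one record per request.

    `responses` maps a URL to (status code, Location header or None).
    """
    records: List[dict] = []
    redirects = 0
    while True:
        status, location = responses[url]
        record = {"url": url, "status_code": str(status)}  # status logged as a string
        if status in REDIRECT_CODES and location:
            redirects += 1
            record["location"] = location
            record["redirect_count"] = redirects
            records.append(record)
            if redirects >= max_redirects:
                break  # stop once the redirect limit is reached
            url = location
            continue
        records.append(record)  # non-redirect response ends the chain
        break
    return records


responses = {"/a": (301, "/b"), "/b": (302, "/c"), "/c": (200, None)}
print([r.get("redirect_count") for r in fetch_with_redirects("/a", responses)])
# prints [1, 2, None]
```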

URL discover events

These are small events used to troubleshoot URL discovery within the crawler. Each time the crawler sees a new URL (extracted from a page, from a sitemap or from following a redirect), it logs information about it along with the decision on what will happen to the newly discovered link.

Set to event.
Set to url-discover.

Depending on the decision regarding the URL, set to one of:

  • allowed if the URL will be added to the backlog for future crawling.
  • denied if the URL will not be followed (the message field will have a human-readable explanation of why the crawler decided not to follow it).

A type of the source used for discovering the link:

  • organic for URLs discovered during a crawl by following organic links.
  • redirect for pages discovered by following a redirect.
Set to the hash of the URL the crawler used to discover this page (for URLs discovered during the crawl and not for entry points).
A positive number indicating the number of steps the crawler had to take from the set of seed URLs to reach this specific page.

A field with a code explaining the reason for skipping a URL during a crawl:

  • already_seen when this exact URL/page has already been processed in this crawl.
  • link_too_deep when we hit a crawl depth limit.
  • link_too_long when we hit a URL length limit.
  • link_with_too_many_params when we hit a limit on the number of URL parameters allowed.
  • link_with_too_many_segments when we hit a limit on the number of URL segments allowed.
  • queue_full when we hit a backlog size limit.
  • sitemap_denied when a URL is prohibited from crawling by a sitemap rule.
  • domain_filter_denied for prohibited cross-domain links.
  • page_already_visited for crawl-scoped URL de-duplication events.
  • incorrect_protocol for non-HTTP links and non-HTTPS links in HTTPS-enforced mode.
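
Several of these codes correspond to simple limit checks. The sketch below shows how such a decision could be made for a subset of the codes. The order of the checks, every limit value, and the function and parameter names are hypothetical and are not taken from the crawler.

```python
from typing import Optional, Set
from urllib.parse import parse_qsl, urlsplit


def deny_reason(
    url: str,
    depth: int,
    seen: Set[str],
    queue_size: int,
    *,
    max_depth: int = 10,        # hypothetical limits
    max_length: int = 2048,
    max_params: int = 32,
    max_segments: int = 16,
    max_queue: int = 10000,
    https_only: bool = False,
) -> Optional[str]:
    """Return a deny code for the URL, or None if it may be crawled."""
    parts = urlsplit(url)
    allowed_schemes = {"https"} if https_only else {"http", "https"}
    if parts.scheme not in allowed_schemes:
        return "incorrect_protocol"
    if url in seen:
        return "already_seen"
    if depth > max_depth:
        return "link_too_deep"
    if len(url) > max_length:
        return "link_too_long"
    if len(parse_qsl(parts.query, keep_blank_values=True)) > max_params:
        return "link_with_too_many_params"
    if len([segment for segment in parts.path.split("/") if segment]) > max_segments:
        return "link_with_too_many_segments"
    if queue_size >= max_queue:
        return "queue_full"
    return None


print(deny_reason("mailto:someone@example.com", 1, set(), 0))  # prints incorrect_protocol
```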

URL extracted events

These events are focused on the extraction portion of the crawler process and are logged to help an operator troubleshoot the process of content extraction for the pages on their domains. The primary focus here is capturing the details of the extraction process.

Each event represents a single extractor handling a single piece of content.

Set to event.
Set to url-extracted.
The name of the extractor generating the event (e.g. html).

Depending on the decision regarding the URL, set to one of:

  • allowed if the URL has been allowed to be indexed.
  • denied if the URL has not been indexed because of a crawl rule, a robots.txt rule, etc. (the message field will have a human-readable explanation of what happened).

Event timing and outcome details:

The start of the extraction process.
The end of the extraction process.
End-to-end timing for the extraction process (total time it took to get the data extracted).

An ECS categorization field. Denotes whether the event represents a success or a failure from the perspective of the crawler:

  • failure if the extraction process failed and the content will be dropped.
  • success if the extraction process succeeded (or failed in a graceful manner).

Extraction result details:

Content type for the page.
The size of the page.
The list of fields extracted.

URL output events

These events are designed to capture the results of ingestion of a single piece of content into an external system (file, App Search, etc). The main goal here is to capture any data needed to tie a URL fetched and processed by the crawler to the changes performed in the external system as a result of the crawl.

Each event represents a single output module handling a single piece of content.

Set to event.
Set to end.
Set to url-output.
The name of the output module generating the event (e.g. file, app-search).

Event timing and outcome details:

The start of the output ingestion process.
The end of the output ingestion process.
End-to-end timing for the output ingestion process (total time it took to get the data processed by the module).

An ECS categorization field. Denotes whether the event represents a success or a failure from the perspective of the crawler:

  • failure if the output ingestion process failed and the content will be dropped.
  • success if the output ingestion process succeeded (or failed in a graceful manner).
  • unknown for cases specific to an output module.

Output ingestion results (file module):

The directory where the event has been logged.
The name of the file where the event has been logged (base name without the directory).

Output ingestion results (app-search module):

The id of the engine used to ingest the content.
The name of the engine used to ingest the content.
The id of the document within the engine.
The content hash used for de-duplication purposes.

Content ingestion events

A special kind of event used to troubleshoot the ingestion process. These events are used only by complex output modules and may be enabled only in debug mode or through a special crawl config option. The goal of these events is to explain the results of the ingestion process in more detail than a URL output event can capture.

Set to event.
Set to info.
Set to ingest-progress.
The name of the output module generating the event (e.g. file, app-search).

Details on what is happening with the extraction process.

App Search logs URL-scoped events that explain how a specific piece of content from the crawler got ingested into the external system. These are important for troubleshooting cases when the crawler discovers and crawls a URL, but due to App Search de-duplication logic the content does not get ingested, etc.

An event logged by an output module to help an operator troubleshoot the ingestion process. These are pretty generic events using the message field to explain what is happening.

URL identification fields:

These are used to correlate an ingestion event to the rest of the events generated by the crawler for a specific page:

A unique identifier for the URL as it is handled by the crawler. All events for the same URL within a single crawl share the same hash, because it is calculated as the SHA-1 hash of the URL itself.
The full URL string.
Scheme portion of the URL.
Domain portion of the URL.
Port of the URL.
Path of the URL.
URL query string. Included when available.
URL fragment. Included when available.
Username portion of the URL. Included when available.
Password portion of the URL. Included when available.
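
Since the URL identifier is described as a SHA-1 hash of the URL, a matching value can in principle be computed for correlation. The sketch below hashes the UTF-8 bytes of the URL string exactly as given. Any normalization the crawler applies before hashing, and the hex encoding of the result, are not documented here and are assumptions.

```python
import hashlib


def url_hash(url: str) -> str:
    """Return the hex SHA-1 digest of the URL string (assumed UTF-8, no normalization)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# The same URL always yields the same 40-character hex digest.
print(len(url_hash("https://example.com/docs")))  # prints 40
```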