ℹ️ For new users, we recommend using our native Elasticsearch tools, rather than the standalone App Search product. We are actively developing new features and capabilities in the Elastic Stack to help you build powerful search applications. Refer to this blog for more information.

URL path	Rule	Path pattern	Match?
`/foo/bar`	Begins with	`/foo`	YES
`/foo/bar`	Begins with	`/*oo`	YES
`/bar/foo`	Begins with	`/foo`	NO
`/foo/bar`	Begins with	`foo`	NO
`/blog/posts/hello-world`	Ends with	`world`	YES
`/blog/posts/hello-world`	Ends with	`hello-*`	YES
`/blog/world-hello`	Ends with	`world`	NO
`/blog/world-hello`	Ends with	`*world`	NO
`/fruits/bananas`	Contains	`banana`	YES
`/fruits/apples`	Contains	`banana`	NO
`/2020`	Regex	`\/[0-9]{3,5}`	YES
`/20`	Regex	`\/[0-9]{3,5}`	NO
`/2020`	Regex	`[0-9]{3,5}`	NO
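
The matching behavior in the table above can be approximated with a short Python sketch. This is an illustration inferred from the table rows only, not the crawler's implementation. The function name `path_matches`, the helper `_wildcard_to_regex`, and the lowercase rule keys (`begins`, `ends`, `contains`, `regex`) are hypothetical. The sketch assumes that `*` matches any run of characters and that regex patterns are anchored at the start of the path, which is consistent with the rows shown; the table does not say whether a regex must also match to the end of the path.

```python
import re


def _wildcard_to_regex(pattern: str) -> str:
    # Escape regex metacharacters, then treat each literal "*" as "any characters".
    return ".*".join(re.escape(part) for part in pattern.split("*"))


def path_matches(path: str, rule: str, pattern: str) -> bool:
    """Return True if the URL path matches the pattern under the given rule.

    Hypothetical rule keys: "begins", "ends", "contains", "regex".
    """
    if rule == "regex":
        # Assumption: the regex is anchored at the start of the path.
        return re.match(pattern, path) is not None

    body = _wildcard_to_regex(pattern)
    if rule == "begins":
        # The pattern must match starting at the first character of the path.
        return re.match(body, path) is not None
    if rule == "ends":
        # The pattern must match up to the last character of the path.
        return re.search(body + r"\Z", path) is not None
    if rule == "contains":
        # The pattern may match anywhere in the path.
        return re.search(body, path) is not None
    raise ValueError(f"unknown rule: {rule!r}")
```

For example, `path_matches("/foo/bar", "begins", "foo")` is false because the path starts with `/`, not `f`, which mirrors the fourth row of the table.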

Policy	Rule	Path pattern
`Allow`	`Begins with`	`/blog`
`Disallow`	`Regex`	`.*`
`Allow`	`Regex`	`.*`
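
To illustrate why rule order matters for the rules listed above, the following Python sketch assumes that rules are evaluated from top to bottom and that the first matching rule decides the policy. That evaluation order is an assumption of this sketch; the names `RULES` and `policy_for` are hypothetical, and the matchers are simplified stand-ins for the crawler's own matching.

```python
import re
from typing import Callable, List, Tuple

# Each rule pairs a policy with a predicate over the URL path (hypothetical shape).
Rule = Tuple[str, Callable[[str], bool]]

RULES: List[Rule] = [
    ("allow", lambda path: path.startswith("/blog")),             # Begins with /blog
    ("disallow", lambda path: re.match(r".*", path) is not None),  # Regex .*
    ("allow", lambda path: re.match(r".*", path) is not None),     # Regex .* (never reached here)
]


def policy_for(path: str, rules: List[Rule] = RULES) -> str:
    # Assumption: the first rule whose predicate matches decides the policy.
    for policy, matches in rules:
        if matches(path):
            return policy
    raise LookupError(f"no rule matched {path!r}")
```

Under these assumptions, paths beneath `/blog` are allowed, and every other path is caught by the `Disallow` rule before the final `Allow` rule is consulted.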

Code	Description	Web crawler behavior
`2xx`	Success	The web crawler extracts and de-duplicates the page’s content into a search document. Then it indexes the document into the engine, replacing an existing document with the same content if present.
`3xx`	Redirection	The web crawler follows all redirects within configured domains recursively until it receives a `2xx`, `4xx`, or `5xx` response status code, detects a circular redirect, or reaches the configured maximum number of redirects. A circular redirect or an exceeded redirect limit is handled as a failure. Otherwise, the crawler handles the final status code as indicated in this table. Each redirect response is logged as a separate event in the event log, along with redirect information, to help troubleshoot long or circular redirect chains.
`4xx`	Permanent error	The web crawler assumes this error is permanent. It therefore does not index the document. Furthermore, if the document is present in the engine, the web crawler deletes the document.
`5xx`	Temporary error	The web crawler optimistically assumes this error will resolve in the future. If the engine already contains a document representing the page, the document is updated to indicate the page was inaccessible. After the page is inaccessible for three consecutive crawls, the document is deleted from the engine.
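
The table above can be read as a small decision procedure. The Python sketch below is one possible reading, not the crawler's code. `PageState`, `handle_page_status`, and the returned action names are hypothetical. It assumes that the document is deleted on the third consecutive `5xx` crawl and that a `5xx` response for a page with no existing document is skipped; the table does not spell out either detail.

```python
from dataclasses import dataclass


@dataclass
class PageState:
    # Hypothetical per-page bookkeeping kept between crawls.
    indexed: bool = False
    consecutive_failures: int = 0


def handle_page_status(status: int, state: PageState) -> str:
    """Map an HTTP status code for an HTML page to a crawler action name."""
    if 200 <= status < 300:
        state.indexed = True
        state.consecutive_failures = 0
        return "index"
    if 300 <= status < 400:
        # The redirect target's final status is handled by a later call.
        return "follow_redirect"
    if 400 <= status < 500:
        # Treated as permanent: never index, and delete any existing document.
        had_document = state.indexed
        state.indexed = False
        state.consecutive_failures = 0
        return "delete" if had_document else "skip"
    if 500 <= status < 600:
        if not state.indexed:
            return "skip"
        state.consecutive_failures += 1
        # Assumption: deletion happens on the third consecutive failure.
        if state.consecutive_failures >= 3:
            state.indexed = False
            return "delete"
        return "mark_inaccessible"
    raise ValueError(f"unexpected status code: {status}")
```

Calling `handle_page_status(503, state)` on three successive crawls of an indexed page would, under these assumptions, mark the document as inaccessible twice and then delete it.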

Code	Description	Web crawler behavior
`2xx`	Success	The web crawler processes and honors the directives for the domain within the robots.txt file.
`3xx`	Redirect	The web crawler follows redirects (including redirects to external domains) until it receives a `2xx`, `4xx`, or `5xx` response status code, detects a circular redirect, or reaches the configured maximum number of redirects. A circular redirect or an exceeded redirect limit is handled as a failure. Otherwise, the crawler handles the final status code as indicated in this table.
`4xx`	Permanent error	The web crawler assumes this error is permanent and no specific rules exist for the domain. It therefore allows all paths for the domain (subject to crawl rules).
`5xx`	Temporary error	The web crawler optimistically assumes this error will resolve in the future. However, in the interim, the crawler does not know which paths are allowed and disallowed. It therefore disallows all paths for the domain.
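
As a compact summary of the robots.txt table above, the following Python sketch maps a response status code to the resulting access policy for the domain. The function name `robots_txt_policy` and the returned labels are hypothetical and serve only to illustrate the table.

```python
def robots_txt_policy(status: int) -> str:
    """Map the robots.txt response status code to a domain-level policy label."""
    if 200 <= status < 300:
        # The file was fetched: parse it and honor its directives.
        return "honor_directives"
    if 300 <= status < 400:
        # Keep following redirects, then classify the final status code.
        return "follow_redirect"
    if 400 <= status < 500:
        # Treated as "no robots.txt rules exist": all paths are allowed,
        # still subject to crawl rules.
        return "allow_all"
    if 500 <= status < 600:
        # Allowed and disallowed paths are unknown, so nothing is crawled for now.
        return "disallow_all"
    raise ValueError(f"unexpected status code: {status}")
```

The asymmetry between `4xx` and `5xx` is the key point: a missing robots.txt file opens the domain to crawling, while a temporarily failing one closes it.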

Web crawler reference

Crawl

Content discovery

Crawl state

Content extraction and indexing

Binary content extraction

Duplicate document handling

Content inclusion and exclusion

Content deletion

Domain

Entry point

Sitemap

Sitemap discovery and management

Sitemap format and technical specification

Crawl rule

Crawl rule logic (rules)

Crawl rule matching

Crawl rule order

Restricting paths using crawl rules

Robots.txt file

Robots.txt file discovery and management

Robots.txt file format and technical specification

Canonical URL link tag

Robots meta tags

Meta tags and data attributes to extract custom fields

Nofollow link

Crawl status

Partial crawl

Process crawl

HTTP authentication

HTTP proxy

HTTP response status codes handling

HTML response code handling

Robots.txt file response code handling

Web crawler schema

Web crawler events logs

Web crawler configuration settings