App Search web crawler configuration

If you are looking for the Elastic web crawler configuration documentation, see Elastic web crawler in the Enterprise Search configuration documentation. To compare features with the Elastic web crawler, see Elastic web crawler overview.

crawler.http.user_agent (Elastic Cloud)

The User-Agent HTTP header used by the App Search web crawler.

crawler.http.user_agent: Elastic-Crawler (<crawler_version_number>)

When running the web crawler on Elastic Cloud, the default user agent value is Elastic-Crawler Elastic Cloud (https://www.elastic.co/guide/en/cloud/current/ec-get-help.html; <unique identifier>).

crawler.http.user_agent_platform

The user agent platform for the App Search web crawler, with identifying information. See User-Agent - Syntax in the MDN web docs.

This value is appended as a suffix to crawler.http.user_agent to form the final User-Agent header. It is blank by default.
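As an illustration, the two settings combine as follows (the version number and platform string below are hypothetical examples):

```yaml
crawler.http.user_agent: Elastic-Crawler (1.0.0)
crawler.http.user_agent_platform: (ExampleCorp intranet; admin@example.com)
# Resulting User-Agent header sent with each request:
#   Elastic-Crawler (1.0.0) (ExampleCorp intranet; admin@example.com)
```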

crawler.workers.pool_size.limit (Elastic Cloud)

The number of parallel crawls allowed per instance of Enterprise Search. By default, it is set to 2x the number of available logical CPU cores. On CPUs with hyper-threading, each physical core presents two logical cores, so the default works out to 4x the number of physical CPU cores. See Hyper-threading on Wikipedia.

crawler.workers.pool_size.limit: N

You cannot set crawler.workers.pool_size.limit to more than 8x the number of physical CPU cores available for the Enterprise Search instance.

Keep in mind that despite the setting above, you can still only have one crawl request running per engine at a time.
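For example, on a hypothetical instance with 4 physical cores and hyper-threading (8 logical cores), the default is 2 x 8 = 16 parallel crawls, and the hard ceiling is 8 x 4 = 32:

```yaml
# Default on this machine: 16; maximum allowed: 8 x 4 physical cores = 32
crawler.workers.pool_size.limit: 16
```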

Per-crawl Resource Limits

These limits guard against infinite loops and other traps common to production web crawlers. If your crawler is hitting these limits, try changing your crawl rules or the content you’re crawling. Adjust these limits as a last resort.

crawler.crawl.max_duration.limit (Elastic Cloud)

The maximum duration of a crawl, in seconds. Beyond this limit, the App Search web crawler will stop, abandoning all remaining URLs in the crawl queue.

crawler.crawl.max_duration.limit: 86400 # seconds

crawler.crawl.max_crawl_depth.limit (Elastic Cloud)

The maximum crawl depth: the number of sequential link hops the App Search web crawler will traverse starting from the given set of entry points. Beyond this limit, the web crawler will stop discovering new links.

crawler.crawl.max_crawl_depth.limit: 10

crawler.crawl.max_url_length.limit (Elastic Cloud)

The maximum number of characters within each URL to crawl. The App Search web crawler will skip URLs that exceed this length.

crawler.crawl.max_url_length.limit: 2048

crawler.crawl.max_url_segments.limit (Elastic Cloud)

The maximum number of segments within the path of each URL to crawl. The App Search web crawler will skip URLs whose paths exceed this limit. Example: the path /a/b/c/d has 4 segments.

crawler.crawl.max_url_segments.limit: 16

crawler.crawl.max_url_params.limit (Elastic Cloud)

The maximum number of query parameters within each URL to crawl. The App Search web crawler will skip URLs that exceed this limit. Example: the query string in /a?b=c&d=e has 2 query parameters.

crawler.crawl.max_url_params.limit: 32
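To illustrate how the two URL-shape limits interact, consider a hypothetical URL https://example.com/a/b/c/d?x=1&y=2 — it has 4 path segments and 2 query parameters, so it passes both of the default limits:

```yaml
crawler.crawl.max_url_segments.limit: 16  # 4 segments <= 16: OK
crawler.crawl.max_url_params.limit: 32    # 2 parameters <= 32: OK
```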

crawler.crawl.max_unique_url_count.limit (Elastic Cloud)

The maximum number of unique URLs the App Search web crawler will index during a single crawl. Beyond this limit, the web crawler will stop.

crawler.crawl.max_unique_url_count.limit: 100000

Advanced Per-crawl Limits

crawler.crawl.threads.limit (Elastic Cloud)

The number of parallel threads to use for each crawl. Increasing this value raises the throughput of the App Search web crawler at the expense of higher CPU load on the Enterprise Search and Elasticsearch instances, as well as a higher load on the website being crawled.

crawler.crawl.threads.limit: 10

crawler.crawl.url_queue.url_count.limit (Elastic Cloud)

The maximum size of the crawl frontier: the list of URLs the App Search web crawler needs to visit. The list is stored in Elasticsearch, so the limit can be increased as long as the Elasticsearch cluster has enough resources (disk space) to hold the queue index.

crawler.crawl.url_queue.url_count.limit: 100000

Per-Request Timeout Limits

crawler.http.connection_timeout (Elastic Cloud)

The maximum period to wait for a connection to be established, before the request is aborted.

crawler.http.connection_timeout: 10 # seconds

crawler.http.read_timeout (Elastic Cloud)

The maximum period of inactivity between two data packets, before the request is aborted.

crawler.http.read_timeout: 10 # seconds

crawler.http.request_timeout (Elastic Cloud)

The maximum total duration of the request, before it is aborted.

crawler.http.request_timeout: 60 # seconds
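The three timeouts apply at different stages of a request, and any one of them can abort it. The default values, annotated:

```yaml
crawler.http.connection_timeout: 10  # seconds to establish the connection
crawler.http.read_timeout: 10        # seconds of inactivity between data packets
crawler.http.request_timeout: 60     # seconds for the entire request, end to end
```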

Per-Request Resource Limits

crawler.http.response_size.limit (Elastic Cloud)

The maximum size of an HTTP response (in bytes) supported by the App Search web crawler.

crawler.http.response_size.limit: 10485760

crawler.http.redirects.limit (Elastic Cloud)

The maximum number of HTTP redirects before a request is failed.

crawler.http.redirects.limit: 10

Content Extraction Resource Limits

crawler.extraction.title_size.limit (Elastic Cloud)

The maximum size (in bytes) of the title field extracted from crawled pages.

crawler.extraction.title_size.limit: 1024

crawler.extraction.body_size.limit (Elastic Cloud)

The maximum size (in bytes) of the body content field extracted from crawled pages.

crawler.extraction.body_size.limit: 5242880

crawler.extraction.keywords_size.limit (Elastic Cloud)

The maximum size (in bytes) of the meta keywords field extracted from crawled pages.

crawler.extraction.keywords_size.limit: 512

crawler.extraction.description_size.limit (Elastic Cloud)

The maximum size (in bytes) of the meta description field extracted from crawled pages.

crawler.extraction.description_size.limit: 1024

crawler.extraction.extracted_links_count.limit (Elastic Cloud)

The maximum number of links extracted from each page for further crawling.

crawler.extraction.extracted_links_count.limit: 1000

crawler.extraction.indexed_links_count.limit (Elastic Cloud)

The maximum number of links extracted from each page and indexed in a document.

crawler.extraction.indexed_links_count.limit: 25

crawler.extraction.headings_count.limit (Elastic Cloud)

The maximum number of HTML headings to be extracted from each page.

crawler.extraction.headings_count.limit: 25

crawler.extraction.default_deduplication_fields (Elastic Cloud)

Default document fields used to compare documents during de-duplication.

crawler.extraction.default_deduplication_fields: ['title', 'body_content', 'meta_keywords', 'meta_description', 'links', 'headings']
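For example, to deduplicate on the extracted page content alone, ignoring links and headings, the list can be narrowed (a sketch, using field names from the default list):

```yaml
crawler.extraction.default_deduplication_fields: ['title', 'body_content']
```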

App Search web crawler HTTP Security Controls

crawler.security.ssl.certificate_authorities (Elastic Cloud)

A list of custom SSL Certificate Authority certificates to be used for all connections made by the App Search web crawler to your websites. These certificates are added to the standard list of CA certificates trusted by the JVM. Each item in this list can be the file name of a certificate in PEM format or a PEM-formatted certificate as a string.

crawler.security.ssl.certificate_authorities: []
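A sketch combining both supported item forms (the file path is a hypothetical placeholder, and the certificate contents are elided):

```yaml
crawler.security.ssl.certificate_authorities:
  - /path/to/internal-ca.pem  # file name of a certificate in PEM format
  - |                         # or an inline PEM-formatted certificate string
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
```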

crawler.security.ssl.verification_mode (Elastic Cloud)

Control SSL verification mode used by the App Search web crawler:

  • full - validate both the SSL certificate and the hostname presented by the server (this is the default and the recommended value)
  • certificate - only validate the SSL certificate presented by the server
  • none - disable SSL validation completely (this is very dangerous and should never be used in production deployments).

crawler.security.ssl.verification_mode: full

crawler.security.auth.allow_http (Elastic Cloud)

Allow/Disallow authenticated crawling of non-HTTPS URLs:

  • false - Do not allow crawling non-HTTPS URLs (this is the default and the recommended value)
  • true - Allow crawling non-HTTPS URLs

Enabling this setting could expose your Authorization headers to a man-in-the-middle attack and should never be used in production deployments. See https://en.wikipedia.org/wiki/Man-in-the-middle_attack for more details.
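Following the pattern of the other settings, the default (and recommended) value can be stated explicitly:

```yaml
crawler.security.auth.allow_http: false
```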

App Search web crawler DNS Security Controls

The settings in this section could make your deployment vulnerable to SSRF attacks (especially in cloud environments) from the owners of any domains you crawl. Do not enable any of these settings unless you fully control the DNS domains you access with the App Search web crawler. See Server Side Request Forgery on OWASP for more details on the SSRF attack and the risks associated with it.

crawler.security.dns.allow_loopback_access

Allow the App Search web crawler to access loopback addresses (the 127.0.0.0/8 IP range).

crawler.security.dns.allow_loopback_access: false

crawler.security.dns.allow_private_networks_access

Allow the App Search web crawler to access the private IP space: link-local, network-local addresses, etc. See Reserved IP addresses - IPv4 on Wikipedia for more details.

crawler.security.dns.allow_private_networks_access: false

App Search web crawler HTTP proxy settings

If you need the App Search web crawler to send HTTP requests through an HTTP proxy, use the following settings to provide the proxy information to Enterprise Search.

Your proxy connections are subject to the DNS security controls described in App Search web crawler DNS Security Controls. If your proxy server is running on a private address or a loopback address, you will need to explicitly allow the App Search web crawler to connect to it.

crawler.http.proxy.host (Elastic Cloud)

The host of the proxy.

crawler.http.proxy.host: example.com

crawler.http.proxy.port (Elastic Cloud)

The port of the proxy.

crawler.http.proxy.port: 8080

crawler.http.proxy.protocol (Elastic Cloud)

The protocol to be used when connecting to the proxy: http (default) or https.

crawler.http.proxy.protocol: http

crawler.http.proxy.username (Elastic Cloud)

The username portion of the Basic HTTP credentials to be used when connecting to the proxy.

crawler.http.proxy.username: kimchy

crawler.http.proxy.password (Elastic Cloud)

The password portion of the Basic HTTP credentials to be used when connecting to the proxy.

crawler.http.proxy.password: A3renEWhGVxgYFIqfPAV73ncUtPN1b
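Putting the proxy settings together (host and credentials below are hypothetical examples): if the proxy resolves to a private or loopback address, the corresponding DNS security setting from the previous section must also be enabled.

```yaml
crawler.http.proxy.host: proxy.internal.example.com
crawler.http.proxy.port: 8080
crawler.http.proxy.protocol: http
crawler.http.proxy.username: crawler-user
crawler.http.proxy.password: changeme
# Needed only if the proxy host resolves to a private address:
crawler.security.dns.allow_private_networks_access: true
```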

Advanced App Search web crawler tuning

crawler.http.compression.enabled

Enable/disable HTTP content (gzip/deflate) compression in App Search web crawler requests.

crawler.http.compression.enabled: true

crawler.http.default_encoding

Default encoding used for responses that do not specify a charset.

crawler.http.default_encoding: UTF-8