Elastic web crawler known issues

The Elastic web crawler has the following known issues:

  • The crawler does not crawl pure JavaScript single-page applications (SPAs).

    We recommend looking at dynamic rendering so that the crawler can properly index your JavaScript websites. Another option is to serve a static HTML version of your JavaScript website, using a solution such as Prerender.

  • The crawler does not support dynamic content.

    The crawler does not execute JavaScript, and it only pulls text from HTML elements.

  • URLs are indexed despite having duplicate content and a canonical URL setting.

    Canonical URL link tags are embedded within HTML source for pages that duplicate the content of other pages. Refer to Duplicate document handling for details. The crawler identifies duplicate content by hashing the content of default deduplication fields derived from the page. These fields are defined by the configuration setting connector.crawler.extraction.default_deduplication_fields.

    The web crawler checks your index for an existing document with the same content hash. Users have run into this issue when they set canonical link tags on a page whose extracted content is not exactly identical to that of the canonical page: the hashes differ, so both URLs are indexed, even though the content looks the same on inspection.

    Use the following workaround:

    You can manage which fields the web crawler uses to create the content hash. If your pages all define canonical URLs, you could safely change your deduplication field settings to include only the url field. Otherwise, you may need more fields to help check for duplicates. By default, the web crawler checks the body_content, headings, links, meta_description, meta_keywords, and title fields.
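
The hash-based comparison described above can be illustrated with a short Python sketch. This is not the crawler's actual implementation: the `content_hash` helper, the use of SHA-256, the way fields are joined, and the sample pages are all assumptions made for illustration. Only the default field names come from the list above. The sketch shows how a difference in a single field (here, `links`) produces a different hash, and how narrowing the set of hashed fields changes whether two pages count as duplicates.

```python
import hashlib

# Default deduplication fields, as listed in the text above.
DEFAULT_DEDUP_FIELDS = [
    "body_content", "headings", "links",
    "meta_description", "meta_keywords", "title",
]


def content_hash(doc, fields):
    """Hypothetical helper: hash the selected fields of a crawled page.

    The real crawler's hashing scheme may differ; SHA-256 and the
    field separator used here are assumptions.
    """
    digest = hashlib.sha256()
    for name in sorted(fields):
        value = doc.get(name, "")
        if isinstance(value, (list, tuple)):
            value = "\n".join(value)
        # Include the field name and a separator so that values from
        # different fields cannot run together.
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(str(value).encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


# Two made-up pages that look the same to a reader but differ in one link.
page_a = {
    "title": "Pricing",
    "headings": ["Pricing", "Plans"],
    "body_content": "Choose the plan that fits your team.",
    "links": ["https://example.com/signup"],
    "meta_description": "Plans and pricing",
    "meta_keywords": "pricing, plans",
}
page_b = dict(page_a, links=["https://example.com/signup?ref=footer"])

# With all default fields, the differing link changes the hash.
same_with_defaults = (
    content_hash(page_a, DEFAULT_DEDUP_FIELDS)
    == content_hash(page_b, DEFAULT_DEDUP_FIELDS)
)

# With a narrower field list that excludes `links`, the hashes match.
subset = ["title", "body_content"]
same_with_subset = content_hash(page_a, subset) == content_hash(page_b, subset)

print(same_with_defaults)  # False
print(same_with_subset)    # True
```

The field subset used here is only for illustration. Which fields to keep depends on your site; as noted above, the url field alone may be enough when every page defines a canonical URL.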