Indexing the web is hard. There’s a nearly infinite supply of misbehaving sites, misapplied (or ignored) standards, duplicate content, and corner cases to contend with. It’s a big task to create an easy-to-use web crawler that’s thorough and flexible enough to account for all the different content it encounters. In the spirit of our open source roots, we wanted to share a bit about our approach to these challenges as we build this powerful new ingestion mechanism for Elastic Enterprise Search.
A bit of history, and indexing challenges we’ve learned from along the way
We’ve come a long way since we built the first iteration of our web crawler about eight years ago (for the Swiftype Site Search product — Swiftype and Elastic joined forces in 2017). A revamped version of that web crawler has been in production on swiftype.com for the past five years, and it now handles more than a billion web pages per month. Now we’re using all that experience operating at scale to add a powerful content ingestion mechanism for the Elastic Enterprise Search solution. This new scalable and easy-to-use web crawler will allow our users to index content from any external sources, further enhancing the content ingestion picture for Elastic Enterprise Search.
The following are just a few of the interesting challenges we’ve encountered — some obvious, some not as much — while working on what seems like a simple task of indexing a bunch of HTML pages into a search index.
Dealing with duplicate content
One of the key challenges for any web crawler is the wide range of user expectations around deduplication, the process of eliminating duplicate copies of data.
On one hand, there are users who follow best practices around content presentation and URL structure. They don't have any duplicate content and, where needed, they rely on canonical URL tags, robots.txt rules, and redirects to help web crawlers handle their sites correctly. Those sites tend to work with any web crawler software out of the box.
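For sites that follow these conventions, the standard signals are straightforward for a crawler to consume. Here is a minimal Python sketch, using only the standard library, of honoring robots.txt rules and extracting a canonical URL tag (the URLs, rules, and user agent name are illustrative, not the actual crawler's behavior):

```python
from html.parser import HTMLParser
from urllib import robotparser

# Honor robots.txt rules (parsed from text here; a real crawler fetches the file).
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("my-crawler", "https://example.com/private/report"))  # False

# Extract a rel=canonical link so duplicate pages can collapse onto one URL.
class CanonicalParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

p = CanonicalParser()
p.feed('<html><head><link rel="canonical" href="https://example.com/page"></head></html>')
print(p.canonical)  # https://example.com/page
```

When a site provides these signals, the crawler can index the canonical URL once and skip every disallowed or duplicate variant.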
On the other hand are sites that abuse standards and don't follow best practices. A few large, popular content management systems and forum hosting tools are especially problematic. For reasons unknown, they abuse URL and content formatting rules to the point where a small site can easily generate millions (!!!) of seemingly unique pages. Organizations using those tools tend to require powerful content deduplication and URL filtering functionality.
To accommodate this wide range of website structures and expectations for the quality of content deduplication, we've developed a set of features that let users control what content gets indexed and how content uniqueness is defined. Our new web crawler will give users powerful controls for specifying the rules that identify unique pages, and it will ensure that duplicate pages are treated correctly for each use case.
What should we use to uniquely identify a piece of content on a website? It seems like a trivial question. Most people (and Internet bodies that create standards) tend to agree that a URL is a unique identifier of a web page. Unfortunately, the messy reality of the Internet in action breaks down this assumption rather quickly. For example:
- Pages get moved without proper redirects — and users expect the indexed content to be updated accordingly without having duplicate pieces of content in their index.
- Multiple duplicate copies of the same page get created without any canonical URL tags — and users expect correct deduplication.
We want to make sure our web crawler covers the widest possible range of content movement and update scenarios. It will use URL and content hashing techniques to keep the search index correct even when content moves, URLs change, or multiple duplicate copies of the data exist.
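One common approach to content hashing, sketched below, is to fingerprint a normalized version of a page's extracted text so that the same content reachable at several URLs maps to a single hash. This is an illustration of the general technique, not necessarily the exact algorithm the crawler uses:

```python
import hashlib
import re

def content_fingerprint(text: str) -> str:
    # Normalize before hashing: case and whitespace differences
    # should not defeat deduplication.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Two "different" pages whose visible text is the same collapse to one fingerprint.
a = content_fingerprint("Welcome to   our docs!\n")
b = content_fingerprint("welcome to our docs!")
print(a == b)  # True
```

With fingerprints like these stored alongside URLs, the crawler can notice that a page moved (same hash, new URL) or that several URLs serve one document, and keep only one copy in the index.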
Dealing with standards noncompliance
Unfortunately, the Internet is ultimately a nearly limitless collection of broken HTML pages, faulty HTTP implementations, and misbehaving web servers. Any web crawler worth its salt has to account for a variety of broken elements, like some of these memorable examples we’ve encountered:
- Huge auto-generated HTML pages with millions of links, where simply trying to parse such a page inevitably leads to a web crawler running out of memory
- Rendering of different content based on the User Agent value (aka "content cloaking"), which makes it difficult to debug and understand what is going on
- HTTP 200 response codes used for "Page not found" responses combined with a ton of broken links on the site
The experience of troubleshooting and working around hundreds of cases like these over the years has led us to build a set of defensive mechanisms and content parsing features. These features will allow the web crawler to handle almost any kind of content as long as it is accessible via HTTP.
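As one example of such a defensive mechanism, a parser can cap how many links it will extract from a single page, so a pathological auto-generated page with millions of links cannot exhaust memory. The sketch below uses Python's standard library; the cap value and class names are illustrative:

```python
from html.parser import HTMLParser

class StopParsing(Exception):
    """Raised to abort parsing once a safety limit is reached."""

class BoundedLinkExtractor(HTMLParser):
    """Collect <a href> links, but bail out at a hard cap instead of buffering them all."""

    def __init__(self, max_links=1000):
        super().__init__()
        self.max_links = max_links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                if len(self.links) >= self.max_links:
                    raise StopParsing  # stop before memory usage explodes

# Simulate a huge auto-generated page with 10,000 links.
huge_page = "".join(f'<a href="/page/{i}">x</a>' for i in range(10_000))
extractor = BoundedLinkExtractor(max_links=100)
try:
    extractor.feed(huge_page)
except StopParsing:
    pass
print(len(extractor.links))  # 100
```

The same pattern, a hard limit plus a clean abort path, applies to response body size, redirect chains, and other resources a hostile or broken page can inflate.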
Managing crawl lifecycles
Over the years of crawling customer content, we’ve realized that it’s nearly impossible to predict the duration of a given website crawl. Even for small, simple sites, we’ve had cases where a crawl ran for months and had to be stopped manually. Frequently, a random combination of issues, often rooted in a buggy implementation of standard protocols, leads to an explosion of seemingly unique site URLs.
Here are a few of the “greatest hits” we’ve encountered that led to crawl lifecycle issues:
- Sites using URLs to store the HTTP session state (e.g., old forums and a number of old Java web frameworks). Every time a web crawler would visit a page, it'd get redirected to a unique URL containing a session identifier (e.g., /some/forum?sid=123456789), and each subsequent visit would present the same page with a different unique URL.
- A relative HTML link on the "Page not found" error page combined with the site using HTTP 200 response codes for all error pages. Each incorrect URL on the site would turn into a large number of unique pages that look like /foo/foo/foo/foo....
- Websites that use URL components to control/configure content filtering/sorting in their product catalogs (e.g., /dresses/color/red/size/small), causing an explosion of seemingly unique URLs generated by combining different filters together (/small/color/red, etc.).
As a result of such complexity, clearly defining a finished crawl is tricky. Our new web crawler is going to include a number of built-in heuristics, allowing it to handle a wide range of situations where websites take a long time to index. These will include URL length and depth limits, crawl duration limits, support for content-based deduplication, and a number of rules that focus on the key content while ignoring deep-linked broken pages.
Providing crawl observability
An important aspect of any crawl operation is understanding what happened, what decisions were made by the system, and why. The goal is to see how each page got discovered, fetched from the web server, parsed, and ingested into an engine, or, when issues occur, what failed and why.
Since the new web crawler is based on Elasticsearch (the most popular platform in the world for indexing and analyzing logging data), we decided early on to build as much observability as possible into the system. Every single web crawler action and decision is captured and indexed in Elasticsearch. The logs use the standard Elastic Common Schema (ECS) structured logging format, allowing easier correlation between different logs and very detailed analysis of the events. All of the indexed events are available for analysis using the full power of Kibana and will be documented from day one.
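As an illustration of what such a structured event could look like, here is a hypothetical ECS-style log line for a single page fetch. The `@timestamp`, `event.*`, `url.full`, and `http.response.status_code` fields are standard ECS fields; the `crawler` object and all the values are invented for this sketch and may differ from what the product actually emits:

```python
import json
from datetime import datetime, timezone

# A hypothetical crawler fetch event shaped after Elastic Common Schema (ECS).
event = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "event": {"action": "url-fetch", "outcome": "success"},
    "url": {"full": "https://example.com/docs/getting-started"},
    "http": {"response": {"status_code": 200}},
    "crawler": {"crawl_id": "example-crawl-001"},  # custom non-ECS field, illustrative
}
print(json.dumps(event))
```

Because every event shares this common schema, a single Kibana query can correlate fetch outcomes, status codes, and crawl decisions across millions of pages.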
Be on the lookout for web crawler-related product announcements and updates. If you’d like to hear more about the web crawler, check out the Sprinting to a crawl presentation from our recent ElasticON Global virtual event (along with dozens of other presentations across all Elastic solutions). And, as always, you can add search experiences to your workplace, websites, and apps with Elastic Enterprise Search, starting with a free 14-day trial of Elastic Cloud (or a free download).