September 13, 2013

Indexing for Beginners, Part 1

By Morten Ingebrigtsen

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

The Missing Document on Search Engine Indexing

What is an Index?

“If you don’t find it in the index, look very carefully through the entire catalog.” – Sears, Roebuck, and Co., Consumers’ Guide 1897

The concept of an index is pretty straightforward. I’m sure you are familiar with indexes from textbooks, so we will use this as a point of reference for the following introductory explanation.

Here’s an everyday scenario: you want to make tomato soup but don’t know how. So you pick up that dusty cookbook you were given for Christmas. But hey, it happens to be a Bible-sized cookbook covering all things foody. Surely you don’t want to start on page 1 and read through most of the cookbook to find out how to make tomato soup, right? Luckily for you, some clever person in the 16th century came up with the head-smacking idea of indexing books!

Here is a sample of what a cookbook index might look like:

<code>Tartar, beef tartar……………………page 67
Tomato chutney……………………page 645
Tomato soup…………………………page 23, 78
Umami burger…………………………page 378</code>

An index in a textbook is basically a mapping between words or phrases in the book, for instance “tomato soup”, and the page or pages where you can find the word or phrase. The most widely used convention is the back-of-the-book index, sorted alphabetically. This makes it quick and easy to look up a word or phrase and then locate the content it points to.
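To make the idea of a mapping concrete, here is a minimal sketch of the sample cookbook index as a Python dictionary. The names `cookbook_index` and `look_up` are made up for this illustration, and lowercasing the phrase before the lookup is an assumption of the sketch, not something a printed index does for you.

```python
# Hypothetical in-memory version of the sample cookbook index above:
# each phrase maps to the page or pages where it can be found.
cookbook_index = {
    "beef tartar": [67],
    "tomato chutney": [645],
    "tomato soup": [23, 78],
    "umami burger": [378],
}


def look_up(phrase):
    """Return the pages listed for a phrase, or an empty list if it is absent."""
    # Assumption: lookups ignore case, so "Tomato soup" matches "tomato soup".
    return cookbook_index.get(phrase.lower(), [])


print(look_up("Tomato soup"))  # [23, 78]
print(look_up("Pea soup"))     # []
```

Finding the pages for a phrase is a single dictionary lookup; there is no need to flip through the book itself.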

Indexing by search engines works pretty much the same way. A search engine creates mappings between terms (in the example above, recipe names) and the occurrences of those terms (in the example above, the pages in the cookbook where you can find the recipes).
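As a rough sketch of how such mappings could be built automatically, the snippet below walks over a few made-up pages and records, for every word, the pages it occurs on. The page texts, the `build_index` helper, and the choice to split text on whitespace are all assumptions for illustration; a real search engine processes text far more carefully.

```python
from collections import defaultdict

# Hypothetical page contents; the page numbers follow the sample index above.
pages = {
    23: "tomato soup with basil",
    78: "roasted tomato soup",
    645: "tomato chutney",
}


def build_index(docs):
    """Map each term to the set of pages on which it occurs."""
    index = defaultdict(set)
    for page, text in docs.items():
        # Assumption: a term is simply a lowercased, whitespace-separated word.
        for term in text.lower().split():
            index[term].add(page)
    return index


index = build_index(pages)
print(sorted(index["soup"]))    # [23, 78]
print(sorted(index["tomato"]))  # [23, 78, 645]
```

Note that the terms here are single words rather than whole recipe names, which is a simplification of the cookbook analogy.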

Just as it’s a tedious and cumbersome task to read through an entire cookbook to find a specific recipe, it is equally inefficient for a computer to scan through a pile of documents or a database every time it is asked to find all occurrences of a term. For large databases or mountains of documents, this process could take several minutes, or even hours. Indexing has therefore been introduced to speed up the search process, or information retrieval (IR) process, as it is generally called.
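The difference between scanning and looking up can be sketched in a few lines. Everything here (the `documents` list and the `scan`, `build_index`, and `lookup` functions) is a hypothetical illustration, not how any particular search engine is implemented.

```python
from collections import defaultdict

# Hypothetical document collection; positions in the list act as document ids.
documents = [
    "tomato soup with basil",
    "beef tartar",
    "roasted tomato soup",
    "umami burger",
]


def scan(docs, term):
    """Read every document on every query; the work grows with the collection."""
    return [doc_id for doc_id, text in enumerate(docs) if term in text.split()]


def build_index(docs):
    """Read every document once, up front, and remember where each word occurs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word in set(text.split()):
            index[word].append(doc_id)
    return index


index = build_index(documents)


def lookup(term):
    """Answer a query from the index without touching the documents again."""
    return index.get(term, [])


print(scan(documents, "soup"))  # [0, 2]
print(lookup("soup"))           # [0, 2]
```

Both approaches return the same answer. The scan pays the cost of reading all documents on every query, whereas the index pays it once when it is built.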

Somewhat akin to the alphabetical sorting of the index in a cookbook, a search engine’s index is stored using specific formats, or data structures, that make it fast to locate search terms and their mappings. Moreover, keeping the index in memory (RAM) speeds up lookups further, because reading from memory is much faster than reading from disk. Imagine if you could memorize all the indexes of your entire stack of textbooks!
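One simple data structure that benefits from sorting is a sorted list searched with binary search, shown below using Python’s standard `bisect` module. This is only an illustrative assumption: the `entries`, `find`, and `with_prefix` names are made up, and production search engines use more elaborate structures.

```python
import bisect

# Hypothetical index entries, sorted alphabetically like a back-of-the-book index.
entries = sorted([
    ("umami burger", [378]),
    ("tomato soup", [23, 78]),
    ("beef tartar", [67]),
    ("tomato chutney", [645]),
])
terms = [term for term, _ in entries]
page_lists = [page_list for _, page_list in entries]


def find(term):
    """Binary search: halve the remaining range at each step instead of reading every entry."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return page_lists[i]
    return []


def with_prefix(prefix):
    """Sorted order keeps all terms that share a prefix next to each other."""
    start = bisect.bisect_left(terms, prefix)
    # Assumption: no term contains the character U+FFFF, so it works as an upper bound.
    end = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[start:end]


print(find("tomato soup"))    # [23, 78]
print(with_prefix("tomato"))  # ['tomato chutney', 'tomato soup']
```

Because the terms are sorted, an exact lookup takes a number of steps proportional to the logarithm of the number of terms, and related terms can be read off as one contiguous slice.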

Next: Document Parsing and Tokenization