29 octobre 2013

Indexing for Beginners, Part 3

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

What Does an Index Look Like?

In the previous article we looked at how search engines parses, analyzes and tokenizes text. In this article we will explore in more detail what an index looks like.

Terms, Words and Tokens

Read the previous article here: Document parsing and tokenization

After we have tokenized the text, we are left with a list of tokens that the search engine will store in the index. In our cookbook example, the tokens are the actual words in the recipes that we want to make searchable. As you may have noticed, we often use the terms word, term and token interchangeably and in the broadest sense when talking about indexing. For example, we use these terms for actual words, numbers, abbreviations, paths (in the example with the path-tokenizer) and so on. Strictly speaking, a token is the name of a unit that we derive from the tokenizer, and the token therefore depends on the tokenizer. A token is not necessarily a word, but a word is normally a token when dealing with text. When we store the token in the index, it is usually called a term.

The Forward Index

The most obvious way to build up an index, is to store a list of all terms for each document that we are indexing:

Document	Terms
Grandma’s tomato soup	peeled, tomatoes, carrot, basil, leaves, water, salt, stir, and, boil, …
African tomato soup	15, large, tomatoes, baobab, leaves, water, store, in, a, cool, place, …
Good ol’ tomato soup	tomato, garlic, water, salt, 400, gram, chicken, fillet, cook, for, 15, minutes, …

This is called a forward index. The forward index is pretty fast when indexing, as new documents are appended to the index and no rebuilding of the index is required. It is, however, not very efficient when querying because querying requires the search engine to look through all entries in the index for a specfic term in order to return all documents containing the term.

Inverted Index

To make the query process faster, it is much smarter to sort the index by terms, like this:

Term	Documents
baobaob	African tomato soup
basil	Grandma’s tomato soup
leaves	African tomato soup, Grandma’s tomato soup
salt	African tomato soup, Good ol’ tomato soup
tomato	African tomato soup, Good ol’ tomato soup, Grandma’s tomato soup
water	African tomato soup, Good ol’ tomato soup, Grandma’s tomato soup
…	…

This type of index is called an inverted index, namely because it is an inversion of the forward index. With the inverted index, we only have to look for a term once to retrieve a list of all documents containing the term. As mentioned in the first article in this series, conventional textbook indexing is based on inverted index.

Often both the forward and inverted index are used in search engines, where the inverted index is built by sorting the forward index by its terms.

In some search engines the index includes additional information such as frequency of the terms, e.g. how often a term occurs in each document, or the position of the term in each document. The frequency of a term is often used to calculate the relevance of a search result, whereas the position is often used to facilitate searching for phrases in a document.

In the next article, we will look at how you can build up more sophisticated indexes to speed up and improve the relevancy of your search results.

Previous: Document Parsing and Tokenization