October 8, 2013

Indexing for Beginners, Part 2

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Document Parsing and Tokenization

We continue the Indexing for beginners article with an introduction to the document parsing process and use some space to explain the concept of tokenization and tokens.

Document Parsing

Read part 1 of this article here: What is an index?

Say you are running a recipe website where visitors can add their own recipes. Obviously, you want the recipes to be searchable, so you set up a search engine to take care of this. Every time someone creates or uploads a new recipe to the website, the search engine is tasked with making the new recipe searchable. The first step in this process is is to scan and process the text in the recipe. This process is generally called document parsing. Other common names for this process are text processing, text analysis, text mining and content analysis. In the document parsing step the search engine creates a list of the words - or terms - in the recipe, accompanied by a reference - or mapping - to the recipes containing these terms. This list is basically what makes up the index. Lastly, the search engine will save the index to disk to be able to retrieve it later. Also, parts of the index will reside in memory, to make the search process faster.

Full-Text Search Engines

Search engines that scan and process the entire text in a document prior to indexing, are so-called full-text search engines. Lucene, which Elasticsearch and Solr are built on top of, is an example of a full-text search engine.

Tokenization

When the recipe text is scanned and fed into our search engine, the first thing the search engine has to do is to make sense of the data it receives. This might seem trivial to us, but for a computer it's not that simple. When you and I read a text, we automatically break up the text into characters, words and sentences. This is a process we have learned and automated through years of practice. We know that words are separated by spaces, and that sentences are separated by punctuation marks. Computers, on the other hand, has no such built-in understanding of so-called natural languages¹. For a computer, the text is just a stream of characters where a white space, a punctuation mark or the character 't' has no special meaning. Therefore, computers has to be programmed to break up text into their distinct elements, such as words and sentences. This process is called tokenization and the different chunks, usually words, that constitute the text are called tokens. Here is an example how a sentence is tokenized:

The sentence

<code>Heat slowly, stir gently while adding twenty-five grams of flour.</code>

is split into the following tokens

<code>[heat] [slowly] [stir] [gently] [while] [adding] [twenty-five] [grams] [of] [flour]</code>

We see that the tokenizer used here knows how to separate the words from each other by looking for spaces and punctuation characters. The hyphen is a bit more tricky though; should we split twenty-five into two tokens: [twenty] and [five], or should it be only one token: [twenty-five]? And what if you want to match a search for 25?

Tokenizers can be quite sophisticated, and they should be able to handle text constructs such as abbreviations (ca., kg., dr.), numbers (10 or ten, 3x10^9 or 3 billion), hyphens (twenty-five), end-of-line-hyphens (a word split across two lines), lexical hyphens (pre-, multi-), contractions (I'm, it's), compound words (motorboat or Kraftfahrzeug), and so on. Additionally, in many cases the tokenizer needs to be language aware. This is especially true when tokenizing non-phonetic languages, where one grapheme² might indicate an entire word or a syllable, for example in Chinese or Japanese.

There are many specialized tokenizers, for example CamelCase tokenizer, URL tokenizer, path tokenizer and N-gram tokenizer. These are designed to handle special language constructs, especially in relation to artificial languages like mathematics or programming languages.

Just to give you an idea how these tokenizers work, here is an example of a path tokenizer, where the path:

<code>C:\WINDOWS\system32\rundll32.exe</code>

is split into the following tokens:

<code>[c:] [c:/windows] [c:/windows/system32] [c:/windows/system32/rundll32.exe]</code>

Stop Words

Sometimes we want to avoid certain words from being indexed. For instance, in many cases it would make no sense to store the words on, for, a, the, us, who etc. in the index. Such terms are called stop words. However, in most modern search engines, stop words are not problematic, and by indexing stop words it's possible to search for the band "The Who" or exact phrases and even entire sentences.

Relevancy

Essentially, all the tokens - in our case all the words in our Tomato recipes - are stored as terms in the index, together with the mapping to all occurrences of the terms. For example, if a visitor searches for the term "tomato", the search request will return every recipes containing the word "tomato", and it will certainly include more than just tomato soup recipes.

This rudimentary way of indexing does not take into account the relevancy of the search results. In fact, one of the most important considerations when setting up a search engine is how you tell the search engine to process and analyze the text, and how to build the indexes, as this will heavily influence how you construct your queries and filters (explained later), which in turn will determine the relevancy of the search results. We will talk more about relevancy and relevance models in an upcoming article.

Previous: What is an index?

Next: What does an index look like?

Languages that we humans speak or write are generally called natural languages.↩
A grapheme is the smallest unit in a written language, for instance the character 'w' in the Latin alphabet, or the sign representing the word 'table' in Chinese.↩