October 29, 2013

Indexing for Beginners, Part 3

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

What Does an Index Look Like?

In the previous article we looked at how search engines parse, analyze and tokenize text. In this article we will explore in more detail what an index looks like.

Terms, Words and Tokens

Read the previous article here: Document parsing and tokenization

After we have tokenized the text, we are left with a list of tokens that the search engine will store in the index. In our cookbook example, the tokens are the actual words in the recipes that we want to make searchable. As you may have noticed, we often use the terms word, term and token interchangeably and in the broadest sense when talking about indexing. For example, we use these terms for actual words, numbers, abbreviations, paths (in the example with the path-tokenizer) and so on. Strictly speaking, a token is the name of a unit that we derive from the tokenizer, and the token therefore depends on the tokenizer. A token is not necessarily a word, but a word is normally a token when dealing with text. When we store the token in the index, it is usually called a term.
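To illustrate how the tokens depend on the tokenizer, here is a minimal Python sketch. Both functions are simplified stand-ins written for this example, not the tokenizers of any particular search engine, and the sample text and path are hypothetical.

```python
import re

def word_tokenizer(text):
    """Split text into lowercase word and number tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def path_tokenizer(path, sep="/"):
    """Emit one token per level of a path hierarchy (illustrative only)."""
    parts = [p for p in path.split(sep) if p]
    return [sep + sep.join(parts[:i]) for i in range(1, len(parts) + 1)]

# Hypothetical inputs: a recipe fragment and a made-up path.
word_tokens = word_tokenizer("Boil 15 large tomatoes")
path_tokens = path_tokenizer("/recipes/soups/tomato")

print(word_tokens)  # ['boil', '15', 'large', 'tomatoes']
print(path_tokens)  # ['/recipes', '/recipes/soups', '/recipes/soups/tomato']
```

The number 15 is a token just like the words around it, and none of the path tokens is a word in the ordinary sense. Once stored in the index, each of these would be called a term.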

The Forward Index

The most obvious way to build an index is to store a list of all terms for each document that we are indexing:

Document	Terms
Grandma’s tomato soup	peeled, tomatoes, carrot, basil, leaves, water, salt, stir, and, boil, …
African tomato soup	15, large, tomatoes, baobab, leaves, water, store, in, a, cool, place, …
Good ol’ tomato soup	tomato, garlic, water, salt, 400, gram, chicken, fillet, cook, for, 15, minutes, …

This is called a forward index. The forward index is fast to build, because new documents are simply appended and no rebuilding of the index is required. It is, however, not very efficient when querying, because the search engine has to look through every entry in the index for a specific term in order to return all documents containing that term.
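A forward index can be sketched in Python as a plain dictionary from document to term list. The function names are made up for this example, and the term lists contain only the terms visible in the table above (the remaining terms are elided there).

```python
# Forward index: document -> list of terms.
forward_index = {}

def add_document(doc_id, terms):
    """Indexing is a simple append; nothing already stored is rebuilt."""
    forward_index[doc_id] = list(terms)

def search(term):
    """Querying has to scan the term list of every document."""
    return [doc_id for doc_id, terms in forward_index.items() if term in terms]

add_document("Grandma's tomato soup",
             ["peeled", "tomatoes", "carrot", "basil", "leaves",
              "water", "salt", "stir", "and", "boil"])
add_document("African tomato soup",
             ["15", "large", "tomatoes", "baobab", "leaves",
              "water", "store", "in", "a", "cool", "place"])
add_document("Good ol' tomato soup",
             ["tomato", "garlic", "water", "salt", "400", "gram",
              "chicken", "fillet", "cook", "for", "15", "minutes"])

print(search("leaves"))  # ["Grandma's tomato soup", 'African tomato soup']
```

Adding a document costs almost nothing, but every call to search touches every document, so query time grows with the size of the whole collection.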

Inverted Index

To make the query process faster, it is much smarter to sort the index by terms, like this:

Term	Documents
baobab	African tomato soup
basil	Grandma’s tomato soup
leaves	African tomato soup, Grandma’s tomato soup
salt	African tomato soup, Good ol’ tomato soup, Grandma’s tomato soup
tomato	African tomato soup, Good ol’ tomato soup, Grandma’s tomato soup
water	African tomato soup, Good ol’ tomato soup, Grandma’s tomato soup
…	…

This type of index is called an inverted index, so named because it is an inversion of the forward index. With the inverted index, we only have to look up a term once to retrieve a list of all documents containing that term. As mentioned in the first article in this series, the conventional index at the back of a textbook is based on an inverted index.

Search engines often use both a forward and an inverted index, where the inverted index is built by sorting the forward index by its terms.
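The inversion can be sketched in a few lines of Python. This is a simplified illustration, not how any particular search engine stores its index: the function names are made up, and the forward index holds only a handful of terms from the cookbook example.

```python
from collections import defaultdict

# A small forward index: document -> terms (shortened for the example).
forward_index = {
    "Grandma's tomato soup": ["tomatoes", "basil", "leaves", "water", "salt"],
    "African tomato soup": ["tomatoes", "baobab", "leaves", "water"],
    "Good ol' tomato soup": ["tomato", "garlic", "water", "salt"],
}

def invert(forward):
    """Build term -> sorted list of documents from document -> terms."""
    postings = defaultdict(set)
    for doc_id, terms in forward.items():
        for term in terms:
            postings[term].add(doc_id)
    # Sort by term, and sort each document list for stable output.
    return {term: sorted(docs) for term, docs in sorted(postings.items())}

inverted_index = invert(forward_index)

def search(term):
    """A single lookup by term returns all matching documents."""
    return inverted_index.get(term, [])

print(search("leaves"))  # ['African tomato soup', "Grandma's tomato soup"]
```

Because no stemming is applied in this sketch, "tomato" and "tomatoes" end up as two separate terms. The query no longer scans documents at all; it is one dictionary lookup, at the price of doing the inversion work at indexing time.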

In some search engines the index includes additional information, such as the frequency of each term (how often the term occurs in each document) or the positions of the term in each document. The frequency of a term is often used to calculate the relevance of a search result, whereas the positions are often used to support searching for phrases in a document.
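As a rough sketch of how frequency and position could be stored, the Python example below keeps a list of positions per term and document. The token sequences are hypothetical, the function names are made up, and the phrase matching is deliberately naive.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {document: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index

def term_frequency(index, term, doc_id):
    """The frequency is simply the number of stored positions."""
    return len(index.get(term, {}).get(doc_id, []))

def phrase_search(index, phrase):
    """Return documents where the phrase terms occur at consecutive positions."""
    if not phrase:
        return []
    first, rest = phrase[0], phrase[1:]
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        for start in positions:
            if all(start + offset in index.get(term, {}).get(doc_id, [])
                   for offset, term in enumerate(rest, start=1)):
                hits.append(doc_id)
                break
    return hits

# Hypothetical token sequences, not taken from the recipes above.
docs = {
    "Grandma's tomato soup": ["stir", "and", "boil", "the", "water", "then", "stir"],
    "Good ol' tomato soup": ["boil", "and", "stir", "the", "chicken"],
}
index = build_positional_index(docs)

print(term_frequency(index, "stir", "Grandma's tomato soup"))  # 2
print(phrase_search(index, ["stir", "and", "boil"]))  # ["Grandma's tomato soup"]
```

Both documents contain all three words of the phrase, but only one contains them in that order, which is why positions are needed in addition to the plain list of documents.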

In the next article, we will look at how you can build up more sophisticated indexes to speed up and improve the relevancy of your search results.

Previous: Document Parsing and Tokenization
