24 February 2013

What is an Elasticsearch Index?

By Zachary Tong

Warning: This article contains outdated information. We recommend reading Index vs. Type instead.

What exactly is an index in Elasticsearch? Despite being a very basic question, the answer is surprisingly nuanced.

Basic Definition

An index is defined as:

An index is like a ‘database’ in a relational database. It has a mapping which defines multiple types.
An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.

Ok. So there are two concepts in that definition. First, an index is some type of data organization mechanism, allowing the user to partition data a certain way. The second concept relates to replicas and shards, the mechanism Elasticsearch uses to distribute data around the cluster.

Let’s explore the first concept, using indices to organize data.

Indices for Relations

The easiest and most familiar layout clones what you would expect from a relational database. You can (very roughly) think of an index like a database.

  • MySQL => Databases => Tables => Columns/Rows
  • Elasticsearch => Indices => Types => Documents with Properties

An Elasticsearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties (columns).

So in a car manufacturing scenario, you may have a SubaruFactory index. Within this index, you have three different types:

  • People
  • Cars
  • Spare_Parts

Each type then contains documents that correspond to that type. For example, a Subaru Impreza doc lives inside the Cars type and contains all the details about that particular car.

Searching and querying take the format of: http://localhost:9200/[index]/[type]/[operation]

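So to retrieve the Subaru document, I may do this (Elasticsearch requires index names to be lowercase, so the index appears as subarufactory in the request):

$ curl -XGET localhost:9200/subarufactory/Cars/SubaruImpreza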
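The same URL layout can be sketched as a small Python helper. This is only an illustration: the function name doc_url and the default host are assumptions made for this example, not part of any Elasticsearch client.

```python
from urllib.parse import quote


def doc_url(index, doc_type, doc_id, host="http://localhost:9200"):
    """Build a document URL of the form host/index/type/id (hypothetical helper)."""
    # Elasticsearch rejects index names with uppercase letters, so normalize here.
    parts = [index.lower(), doc_type, doc_id]
    # Percent-encode each path segment so odd characters cannot break the URL.
    return host + "/" + "/".join(quote(part, safe="") for part in parts)


url = doc_url("SubaruFactory", "Cars", "SubaruImpreza")
print(url)
```

Passing the result to any HTTP client (or to curl) issues the same GET request as the command above.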

Indices for Logging

Now, the reality is that Indices/Types are much more flexible than the Database/Table abstractions we are used to in an RDBMS. They can be considered convenient data organization mechanisms, with added performance benefits depending on how you set up your data.

To demonstrate a radically different approach, a lot of people use Elasticsearch for logging. A standard format is to assign a new index for each day. Your list of indices may look like this:

  • logs-2013-02-22
  • logs-2013-02-21
  • logs-2013-02-20

Elasticsearch allows you to query multiple indices at the same time, so it isn’t a problem to do:

$ curl -XGET 'localhost:9200/logs-2013-02-22,logs-2013-02-21/Errors/_search?q=Error%20Message'

This searches the logs from the last two days at the same time. The format suits the nature of logs: most logs are never looked at, and they are organized in a linear flow of time. Making an index per day is more logical and can offer better search performance, since a query only has to touch the indices for the days it cares about.
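As a rough sketch of how a client might work with daily indices, the Python below generates the index names for the last N days and joins them into a single multi-index search URL. The helper names, the "logs-" prefix default, and the host are assumptions for illustration. The q parameter is Elasticsearch's URI search parameter.

```python
from datetime import date, timedelta
from urllib.parse import quote


def daily_indices(end_day, days, prefix="logs-"):
    """Return one index name per day, newest first (hypothetical naming helper)."""
    return [prefix + (end_day - timedelta(days=i)).isoformat() for i in range(days)]


def search_url(indices, doc_type, query, host="http://localhost:9200"):
    """Build a URI search that spans several indices, comma-separated in the path."""
    return f"{host}/{','.join(indices)}/{doc_type}/_search?q={quote(query)}"


names = daily_indices(date(2013, 2, 22), 2)
url = search_url(names, "Errors", "Error Message")
print(names)
print(url)
```

Widening the search to a full week is then just a matter of changing the day count, and old days can be dropped by deleting whole indices.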

It is akin to partitioning an RDBMS table by time ranges, except we are creating new indices for each partition. This is a concept that an RDBMS would scoff at: a new database for each day? Crazy!

Indices are fairly lightweight data organization mechanisms, so Elasticsearch will happily let you create hundreds of indices.

Indices for Users

Another radically different approach is to create an index per user. Imagine you have some social networking site, and each user has a large amount of random data. You can create a single index for each user. Your structure may look like:

  • Zach’s Index
    • Hobbies Type
    • Friends Type
    • Pictures Type
  • Fred’s Index
    • Hobbies Type
    • Friends Type
    • Pictures Type

Notice how this setup could easily be done in a traditional RDBMS fashion (e.g. a “Users” index, with hobbies/friends/pictures as types). All users would then be thrown into a single, giant index.

Instead, it sometimes makes sense to split data apart for data organization and performance reasons. In this scenario, we are assuming each user has a lot of data, and we want them separate. Elasticsearch has no problem letting us create an index per user.
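In practice, an index per user needs a naming scheme, because index names must be lowercase and cannot contain spaces. The Python below is one possible sketch. The "user-" prefix and the function name are assumptions, and a real system would more likely key on a unique user ID, since two display names can collapse to the same slug.

```python
import re


def user_index_name(username, prefix="user-"):
    """Derive a lowercase, URL-safe index name from a username (hypothetical scheme)."""
    # Lowercase the name, then collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", username.lower()).strip("-")
    if not slug:
        raise ValueError("username has no usable characters")
    return prefix + slug


zach_index = user_index_name("Zach")
fred_index = user_index_name("Fred")
print(zach_index, fred_index)
```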

Indices for Data Distribution

The first three examples dealt entirely with how data should be logically separated, allowing it to be represented naturally and efficiently.

However, the definition of an index also includes that bit about shards and replicas. Underneath all the indices and types and documents, Elasticsearch has to store the data somewhere. The data is stored in shards, each of which is either a primary or a replica.

Each index is configured for a certain number of primary and replica shards. So taking the “User” example above, if you created an index for every user, you are also creating a set of shards for each user.
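To get a feel for how quickly shards add up, here is a back-of-the-envelope sketch in Python. The defaults of 5 primary shards and 1 replica per index are assumptions based on the Elasticsearch defaults of that era, and your own index settings may differ.

```python
def total_shards(num_indices, primaries=5, replicas=1):
    """Count all shards in the cluster: every primary plus each of its replica copies."""
    # Assumed defaults: 5 primaries and 1 replica per index.
    return num_indices * primaries * (1 + replicas)


one_user = total_shards(1)
thousand_users = total_shards(1000)
print(one_user, thousand_users)
```

With those assumed settings, a single user index accounts for 10 shards, and a thousand users account for 10,000.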

This is neither good nor bad, simply a consideration when planning your cluster. Different performance requirements benefit from different shard layouts. I’m purposefully leaving this section short, since properly covering shards will require an article of its own.

So just remember, Indices organize data logically, but they also organize data physically through the underlying shards.