16 February 2010 Engineering

Search Engine Time Machine

By Shay Banon

Building a highly available product requires some creative thinking. For distributed products, high availability has two aspects. The first is the ability to survive partial cluster failure gracefully (and to scale out when new nodes are added), and the second is handling complete cluster failure or shutdown (and bringing the cluster back up again).

The following covers how a distributed system can handle these two problems, especially in cloud environments, and what ElasticSearch does in such cases.

Partial Cluster Failure

Partial cluster failure means that one or more nodes within the cluster have failed. Supporting partial cluster failure in a distributed system means (surprise, surprise) not losing data when such events happen. This becomes even more important when working in the cloud, since control over the servers is (for good or bad) largely out of our hands. Moreover, with cloud providers introducing features like Amazon’s spot instances (the ability to bid for instances), we cannot control when our nodes run. They basically come and go as they please.

This is a question of state, and where the state is stored. I will cover the case where the state is stored locally on the node, which means that if you lose the node, you lose the data stored on it. It is important to understand that almost any system will work much faster with local state (local file system, “local” memory) than with a shared storage solution (like NFS). Also, when more than one node needs to access shared state (like NFS), locking usually needs to take place, which hinders the scalability of the system.

The most common solution to partial cluster failure is to replicate data between nodes. The ability to control the number of replicas in the cluster allows one to provide a higher degree of availability (this also relates to total cluster failure, see below). The idea of replication is to have one node perform the operation locally and then replicate the data or operation to its replicas.

In the case of ElasticSearch, an index is broken down into shards, and each shard can have zero or more replicas. Both are configurable on a per-index basis when creating an index. Here is an example of creating an index with 3 shards, each with 2 replicas:

<span class="pln">$ curl </span><span class="pun">-</span><span class="pln">XPUT http</span><span class="pun">:</span><span class="com">//localhost:9200/twitter/ -d '</span><span class="pln">
index </span><span class="pun">:</span><span class="pln">
    number_of_shards </span><span class="pun">:</span><span class="pln"> </span><span class="lit">3</span><span class="pln">
    number_of_replicas </span><span class="pun">:</span><span class="pln"> </span><span class="lit">2</span><span class="pln">
</span><span class="str">'</span>

When an operation is presented to a node, it is routed automatically to the primary shard within its replication group, which performs the operation and then replicates it to all its backups. Replication to the backups is done in parallel using non-blocking IO, which basically means that adding replicas mostly just adds traffic on the network.

Replicas accept read operations as well, meaning that the more replicas you have, the more highly available your index is, but also the more scalable it is in terms of search and get operations!
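
For illustration, here is a minimal sketch of this flow (the document type, id, and field values are just examples): the index operation below is routed to the primary shard that owns the document and replicated to its backups, while the following get request can be answered by the primary or any one of its replicas:

$ curl -XPUT http://localhost:9200/twitter/tweet/1 -d '
{
    "user" : "kimchy",
    "message" : "trying out replication"
}
'

$ curl -XGET http://localhost:9200/twitter/tweet/1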

Once we work under the assumption that node level state is volatile, we can store the index itself in a fast medium, perhaps a local file system (even SSD), JVM heap memory, or even non-JVM (or native) memory.

The choice of where to store an index in ElasticSearch is configurable on the index level. Here is an example that configures the “twitter” index to be stored in main memory:

<span class="pln">$ curl </span><span class="pun">-</span><span class="pln">XPUT http</span><span class="pun">:</span><span class="com">//localhost:9200/twitter/ -d '</span><span class="pln">
index </span><span class="pun">:</span><span class="pln">
    store</span><span class="pun">:</span><span class="pln">
        type </span><span class="pun">:</span><span class="pln"> memory
</span><span class="str">'</span>

But if node storage is considered transient, what do we do about long term persistency of the index? The answer is in the next section.

Full Cluster Failure / Long Term Persistency

Handling full cluster failure or shutdown means that both the state of the cluster (i.e. which indices exist, what their settings are, what their mapping definitions are, etc.) and the content of each index must be stored persistently elsewhere.

The solution is similar to Time Machine: we need to persist the state into long term storage. It is also similar to what data grids call Write Behind.

The idea is that short term high availability is maintained by replication, while long term persistency is done by asynchronously writing deltas of the state into long term (persistent) storage. The benefit is that real time operations are not affected by this process.

So, what happens when we restart a cluster? When it first starts up, it queries the long term storage for the cluster state (indices created, settings, mappings, etc.) and establishes it locally (creating the indices and their mappings). Each time a shard is first instantiated within its shard replication group, it also recovers its state from long term storage.
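
As a quick way to see this in action, once the cluster is back up you can ask it for its current state; the indices, settings, and mappings recovered from the gateway should show up there (this assumes the cluster state API endpoint below is available in the version you are running):

$ curl -XGET http://localhost:9200/_cluster/state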

ElasticSearch provides this capability with a module called Gateway. For example, to have the state stored in a shared file system, each node needs to start with a configuration of:

<span class="pln">gateway </span><span class="pun">:</span><span class="pln">
    type </span><span class="pun">:</span><span class="pln"> fs
    fs </span><span class="pun">:</span><span class="pln">
        location </span><span class="pun">:</span><span class="pln"> </span><span class="str">/shared/</span><span class="pln">fs</span><span class="pun">/</span>

These settings can also be passed right from the command line when starting a node (unix only):

<span class="pln">$ bin</span><span class="pun">/</span><span class="pln">elasticsearch </span><span class="pun">-</span><span class="pln">f 
        </span><span class="pun">-</span><span class="typ">Des</span><span class="pun">.</span><span class="pln">gateway</span><span class="pun">.</span><span class="pln">type</span><span class="pun">=</span><span class="pln">fs </span><span class="pun">-</span><span class="typ">Des</span><span class="pun">.</span><span class="pln">gateway</span><span class="pun">.</span><span class="pln">fs</span><span class="pun">.</span><span class="pln">location</span><span class="pun">=</span><span class="str">/shared/</span><span class="pln">fs</span><span class="pun">/</span>

The above means that the cluster state will be stored on a shared file system, and each index created will automatically store its state on the shared file system as well. But, since each index has its own index gateway, we can get clever and decide on a per-index level whether it should be stored using a gateway or not. For example, the following disables long term storage for an index:

<span class="pln">$ curl </span><span class="pun">-</span><span class="pln">XPUT http</span><span class="pun">:</span><span class="com">//localhost:9200/my_special_volatile_index/ -d '</span><span class="pln">
index </span><span class="pun">:</span><span class="pln">
    gateway</span><span class="pun">:</span><span class="pln">
        type </span><span class="pun">:</span><span class="pln"> none
</span><span class="str">'</span>

As you can see, this is really powerful. Moreover, it is a perfect fit for cloud environments. In the cloud, storing long term state is usually done using the storage services the cloud provider offers. With Amazon, for example, that means EBS or S3. Right out of the (current version) box, you can use Amazon EBS for long term storage in ElasticSearch, since it can be mounted as a regular file system. In the future, there will also be a gateway module to store the state directly on Amazon S3 and others.
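
As a sketch, assuming an EBS-backed file system is mounted at /mnt/es-gateway (the mount point here is just an example), the fs gateway configuration is the same as the one shown above:

gateway :
    type : fs
    fs :
        location : /mnt/es-gateway/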

Final Words

This is only the tip of the iceberg of how ElasticSearch replication, recovery, and the gateway work, and the future will only grow the iceberg (right back at you, global warming!). For example, what I am playing with now is the idea of treating index state like version control: being able to tag it and recover specific tags into new indices or existing ones. I hope that, at least in terms of handling cluster failures (both partial and total), ElasticSearch makes more sense now.