30 November 2015

Clustering Across Multiple Data Centers

By Pius Fung

We are frequently asked whether it is advisable to distribute an Elasticsearch cluster across multiple data centers (DCs). The short answer is "no" (for now), but there are some alternate options available described below. This blog post is intended to help you understand why this is the case, and what other options are available to you.


But Why?

The architecture and design decisions that we make in Elasticsearch are based on certain assumptions, including the assumption that nodes are located on a local network. This is the use case that we optimize and extensively test for, because this is the environment that the vast majority of our users operate in.

Network disruptions are much more common across WAN links, between geographical distributed DCs. Even if there is a dedicated link between DCs.  Elasticsearch is built to be resilient to networking disconnects, but that resiliency is intended to handle the exception, not the norm.

Running a single Elasticsearch cluster that spans multiple DCs is not a scenario we test for and there are a number of additional reasons why it is not a recommended or supported practice we will go into below. 

(Note:  On AWS, running a cluster across availability zones within a single region is supported as Amazon provides consistent high bandwidth and low latency.)

Expect the Unexpected

High Latency

Latency is a problem in distributed systems.   High latency slows indexing because the indexing request is processed on the primary shard first and then sent to all the replicas for indexing, and all cluster-wide communications (eg. cluster state updates) in Elasticsearch.

Limited or Unreliable Connectivity

If connectivity between nodes in a cluster is momentarily lost, it’s likely that remote shards will be out of date and any single update processed while in disconnected state will invalidate all content held on isolated replicas.

This means that Elasticsearch requires the copying of these out of date shards to sync up replicas from their primaries to ensure consistency of data and search responses.

Sending full shards for multiple indices may overwhelm a WAN based connection or cause considerable slowdown, leaving your cluster in a degraded state for an extended period of time.

Data Availability

Assuming the correct setting of discovery.zen.minimum_master_nodes, in the event of a network disconnect between two or more DCs, only the DC with the elected master node will remain active. This can cause many issues for applications in the different DCs which may be attempting to index new data, as the nodes not part of the active cluster will reject any attempted writes.

This also provides a challenge with cluster sizing. When the link between the two DCs is broken, the active half of the cluster will need to bear the full load of indexing and queries for all requests.

When the link is restored, these nodes will also be pushing data and documents across the network while still handling the full indexing and request load. This necessitates larger or more powerful clusters to ensure enough CPU and IOPS to maintain acceptable performance during such events.

What Are The Options ?

Here are 3 common scenarios on how this may look to give you some ideas.

Real-time Availability Across Geographic Regions

Here you would have your application code write to a replicated queuing system (e.g. Kafka, Redis, RabbitMQ) and have a process (e.g. Logstash) in each DC reading from the relevant queue and indexing documents into the local Elasticsearch cluster.

This way if network connectivity is lost between the DCs, when it is restored, the indexing will continue where it left off.

Simple Disaster Recovery (DR) with No Hot Replication Requirements

Snapshots can be used to backup indices at regular intervals (eg. to S3) and restore to the passive DC for disaster recovery. Tools such as Elasticsearch Curator make automation of the Snapshot process very simple. Restoration can either be done as soon as the Snapshot has completed or at scheduled times such as hourly or even just once daily.

The Snapshot and Restore only copies the segments files that do not already exist in the snapshot repository, so except for the initial snapshot, your backups are incremental which reduces the amount of space needed to store them.

Even if you aren’t using Snapshot and Restore to replicate data between DCs, we definitely recommend it to make sure you have a backup!

Dataset Local to Each DC

For cases where the data is local to each DC (eg. 1 dataset in Hong Kong, another in London) and there is a need to search across all of them, a tribe node in each DC can be used to give a view of both datasets.

This allows you to query both clusters as if they were one big cluster, reducing complexity in having to merge datasets.  Note that master write operations like create index should be performed against the individual clusters (see “exceptions” paragraph in the tribe node documentation linked above).

In an upcoming (separate) blog post, we will elaborate on a multiple data center architecture for a proposed set of use cases.

Final Words

As you can see, this is an area that we think a lot about, and we continue to keep these in mind as we design and develop the multi-cluster replication functionality that is on the Elasticsearch roadmap.