Resilience in larger clusters

It’s not unusual for nodes to share common infrastructure, such as network interconnects or a power supply. If so, you should plan for the failure of this infrastructure and ensure that such a failure would not affect too many of your nodes. It is common practice to group all the nodes sharing some infrastructure into zones and to plan for the failure of any whole zone at once.

Note

This document focuses on self-managed Elasticsearch deployments and describes how Elasticsearch handles zone-aware resilience internally, including behavior during network partitions, shard allocation strategies, and the role of master-eligible nodes.

This information might also be useful for other deployment types, such as Elastic Cloud on Kubernetes.

For details on how similar principles are implemented in Elastic Cloud Hosted and Elastic Cloud Enterprise, refer to Resiliency in ECH and ECE deployments.

Elasticsearch expects node-to-node connections to be reliable, have low latency, and have adequate bandwidth. Many Elasticsearch tasks require multiple round-trips between nodes. A slow or unreliable interconnect may have a significant effect on the performance and stability of your cluster.

For example, a few milliseconds of latency added to each round-trip can quickly accumulate into a noticeable performance penalty. An unreliable network may have frequent network partitions. Elasticsearch will automatically recover from a network partition as quickly as it can, but your cluster may be partly unavailable during a partition. Once the partition heals, the cluster will need to spend time and resources to resynchronize any missing data and rebalance itself. Recovering from a failure may involve copying a large amount of data between nodes, so the recovery time is often determined by the available bandwidth.
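The arithmetic behind these two effects can be sketched roughly as follows. The function names and the example figures (number of round-trips, data size, link speed) are assumptions chosen for illustration, not measurements of Elasticsearch, and the recovery estimate is only a lower bound that ignores all overhead.

```python
def added_latency_ms(round_trips: int, extra_ms_per_round_trip: float) -> float:
    """Total extra time when every sequential round-trip is slower by a fixed amount."""
    return round_trips * extra_ms_per_round_trip


def min_recovery_seconds(data_bytes: float, bandwidth_bits_per_second: float) -> float:
    """Lower bound on the time to copy data over a link, ignoring all overhead."""
    return data_bytes * 8 / bandwidth_bits_per_second


# Hypothetical task that needs 20 sequential round-trips, each 5 ms slower.
print(added_latency_ms(20, 5.0))         # 100.0

# Hypothetical recovery that copies 500 GB over a 1 Gbit/s link.
print(min_recovery_seconds(500e9, 1e9))  # 4000.0
```

In this illustration, 5 ms of extra latency becomes a tenth of a second for a single task, and the recovery cannot finish in less than about 67 minutes however fast the nodes are.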

If you’ve divided your cluster into zones, the network connections within each zone are typically of higher quality than the connections between the zones. Ensure the network connections between zones are of sufficiently high quality. You will see the best results by locating all your zones within a single data center with each zone having its own independent power supply and other supporting infrastructure. You can also stretch your cluster across nearby data centers as long as the network interconnection between each pair of data centers is good enough.

There is no specific minimum network performance required to run a healthy Elasticsearch cluster. In theory, a cluster will work correctly even if the round-trip latency between nodes is several hundred milliseconds. In practice, if your network is that slow then the cluster performance will be very poor. In addition, slow networks are often unreliable enough to cause network partitions that lead to periods of unavailability.

If you want your data to be available in multiple data centers that are further apart or not well connected, deploy a separate cluster in each data center and use cross-cluster search or cross-cluster replication to link the clusters together. These features are designed to perform well even if the cluster-to-cluster connections are less reliable or performant than the network within each cluster.

After losing a whole zone’s worth of nodes, a properly designed cluster may be functional but running with significantly reduced capacity. You may need to provision extra nodes to restore acceptable performance in your cluster when handling such a failure.

For resilience against whole-zone failures, it is important that there is a copy of each shard in more than one zone, which can be achieved by placing data nodes in multiple zones and configuring shard allocation awareness. You should also ensure that client requests are sent to nodes in more than one zone.

You should consider all node roles and ensure that each role is split redundantly across two or more zones. For instance, if you are using ingest pipelines or machine learning, you should have ingest or machine learning nodes in two or more zones. However, the placement of master-eligible nodes requires a little more care because a resilient cluster with three master-eligible nodes needs at least two of them to be available in order to function. The following sections explore the options for placing master-eligible nodes across multiple zones.

Two-zone clusters

If you have two zones, you should have a different number of master-eligible nodes in each zone so that the zone with more nodes will contain a majority of them and will be able to survive the loss of the other zone. For instance, if you have three master-eligible nodes then you may put all of them in one zone or you may put two in one zone and the third in the other zone. You should not place an equal number of master-eligible nodes in each zone. If you do, neither zone has a majority of its own, so the cluster may not survive the loss of whichever zone fails.
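The majority rule described above can be illustrated with a short sketch. This is a simplified model, not Elasticsearch's election code: it assumes every master-eligible node has a vote and ignores the automatic adjustments Elasticsearch makes to its voting configuration. The function name and zone names are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def survivable_zone_losses(master_zones: list[str]) -> dict[str, bool]:
    """For each zone, report whether a majority of master-eligible nodes remains if that zone is lost.

    master_zones holds one entry per master-eligible node, naming its zone.
    """
    total = len(master_zones)
    quorum = total // 2 + 1  # strict majority of all master-eligible nodes
    per_zone = Counter(master_zones)
    return {zone: total - count >= quorum for zone, count in per_zone.items()}


# Two nodes in zone-a and one in zone-b: only the loss of zone-b is survivable.
print(survivable_zone_losses(["zone-a", "zone-a", "zone-b"]))
# {'zone-a': False, 'zone-b': True}

# Equal numbers in each zone: in this model neither loss is survivable.
print(survivable_zone_losses(["zone-a", "zone-a", "zone-b", "zone-b"]))
# {'zone-a': False, 'zone-b': False}
```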

Two-zone clusters with a tiebreaker

The two-zone deployment described above tolerates the loss of one of its zones but not the loss of the other, because master elections are majority-based. You cannot configure a two-zone cluster so that it can tolerate the loss of either zone because this is theoretically impossible. You might expect that if either zone fails then Elasticsearch can elect a node from the remaining zone as the master, but it is impossible to tell the difference between the failure of a remote zone and a mere loss of connectivity between the zones. If both zones were capable of running independent elections, then a loss of connectivity would lead to a split-brain problem and therefore data loss. Elasticsearch avoids this and protects your data by not electing a node from either zone as master until that node can be sure that it has the latest cluster state and that there is no other master in the cluster. This may mean there is no master at all until connectivity is restored.

You can solve this by placing one master-eligible node in each of your two zones and adding a single extra master-eligible node in an independent third zone. The extra master-eligible node acts as a tiebreaker in cases where the two original zones are disconnected from each other. The extra tiebreaker node should be a dedicated voting-only master-eligible node, also known as a dedicated tiebreaker. A dedicated tiebreaker need not be as powerful as the other two nodes because it has no other roles: it will not perform any searches, coordinate any client requests, or be elected as the master of the cluster.
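The effect of the tiebreaker during a loss of connectivity can be sketched as follows. This is a simplified model under stated assumptions: every master-eligible node has one vote, a side of the partition can elect a master only if it holds a strict majority of votes and contains at least one node that is not voting-only, and the `MasterNode` class, the `side_that_can_elect` function, and the zone names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MasterNode:
    zone: str
    voting_only: bool = False  # True for a dedicated tiebreaker


def side_that_can_elect(
    nodes: list[MasterNode], partition: list[frozenset[str]]
) -> frozenset[str] | None:
    """Return the group of mutually connected zones that can elect a master, if any."""
    quorum = len(nodes) // 2 + 1  # strict majority of all master-eligible nodes
    for side in partition:
        reachable = [n for n in nodes if n.zone in side]
        has_candidate = any(not n.voting_only for n in reachable)
        if len(reachable) >= quorum and has_candidate:
            return side
    return None


two_zones = [MasterNode("zone-a"), MasterNode("zone-b")]
with_tiebreaker = two_zones + [MasterNode("zone-c", voting_only=True)]

# zone-a and zone-b lose contact and nothing breaks the tie: no master is elected.
print(side_that_can_elect(two_zones, [frozenset({"zone-a"}), frozenset({"zone-b"})]))
# None

# Same fault, but zone-a can still reach the tiebreaker in zone-c.
print(side_that_can_elect(
    with_tiebreaker, [frozenset({"zone-a", "zone-c"}), frozenset({"zone-b"})]
))
```

In the second case the side containing zone-a and the tiebreaker holds two of the three votes, so the node in zone-a can be elected while zone-b waits for connectivity to return.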

You should use shard allocation awareness to ensure that there is a copy of each shard in each zone. This means either zone remains fully available if the other zone fails.

All master-eligible nodes, including voting-only nodes, are on the critical path for publishing cluster state updates. Cluster state updates are usually independent of performance-critical workloads such as indexing or searches, but they are involved in management activities such as index creation and rollover, mapping updates, and recovery after a failure. The performance characteristics of these activities are a function of the speed of the storage on each master-eligible node, as well as the reliability and latency of the network interconnections between all nodes in the cluster. You must therefore ensure that the storage and networking available to the nodes in your cluster are good enough to meet your performance goals.

Clusters with three or more zones

If you have three zones then you should have one master-eligible node in each zone. If you have more than three zones then you should choose three of the zones and put a master-eligible node in each of these three zones. This will mean that the cluster can still elect a master even if one of the zones fails.

As always, your indices should have at least one replica in case a node fails, unless they are searchable snapshot indices. You should also use shard allocation awareness to limit the number of copies of each shard in each zone. For instance, if you have an index with one or two replicas configured then allocation awareness will ensure that the replicas of the shard are in a different zone from the primary. This means that a copy of every shard will still be available if one zone fails, so the availability of each shard will not be affected by such a failure.
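To see why spreading copies across zones matters, the following sketch checks which shards would have no surviving copy if a given zone failed. It does not call any Elasticsearch API. In a real cluster, the placement is driven by shard allocation awareness settings such as `cluster.routing.allocation.awareness.attributes`. The function name, shard labels, and zone names are hypothetical.

```python
from __future__ import annotations


def shards_lost_if_zone_fails(copy_zones: dict[str, list[str]], failed_zone: str) -> list[str]:
    """List shards whose every copy (primary and replicas) sits in the failed zone."""
    return sorted(
        shard for shard, zones in copy_zones.items()
        if all(zone == failed_zone for zone in zones)
    )


# Hypothetical placement: one primary plus one replica per shard.
placement = {
    "logs[0]": ["zone-a", "zone-b"],  # spread across zones, as awareness would arrange
    "logs[1]": ["zone-a", "zone-a"],  # both copies in one zone, which awareness avoids
}

print(shards_lost_if_zone_fails(placement, "zone-a"))  # ['logs[1]']
print(shards_lost_if_zone_fails(placement, "zone-b"))  # []
```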

Summary

The cluster will be resilient to the loss of any zone as long as:

The cluster health status is green.
There are at least two zones containing data nodes.
Every index that is not a searchable snapshot index has at least one replica of each shard, in addition to the primary.
Shard allocation awareness is configured to avoid concentrating all copies of a shard within a single zone.
The cluster has at least three master-eligible nodes, spread evenly across at least three zones. At least two of these nodes are not voting-only master-eligible nodes.
Clients are configured to send their requests to nodes in more than one zone or are configured to use a load balancer that balances the requests across an appropriate set of nodes. The Elastic Cloud service provides such a load balancer.
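The conditions above can be expressed as a simple checklist, sketched here under stated assumptions. The `ClusterSummary` class and its fields are hypothetical and would have to be filled in from your own inventory or monitoring. "Spread evenly" is interpreted here as per-zone counts of master-eligible nodes that differ by at most one. A load balancer counts as sending requests to every zone it targets.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class ClusterSummary:
    """Hypothetical snapshot of the facts the checklist needs."""
    health: str                           # "green", "yellow", or "red"
    data_node_zones: set[str]             # zones that contain data nodes
    min_replicas: int                     # fewest replicas on any index that is not a searchable snapshot
    awareness_configured: bool            # shard allocation awareness is set up
    master_nodes: list[tuple[str, bool]]  # (zone, is_voting_only) per master-eligible node
    client_target_zones: set[str]         # zones that receive client requests


def unmet_conditions(c: ClusterSummary) -> list[str]:
    """Return the checklist items that the cluster does not satisfy."""
    problems = []
    if c.health != "green":
        problems.append("cluster health is not green")
    if len(c.data_node_zones) < 2:
        problems.append("data nodes are in fewer than two zones")
    if c.min_replicas < 1:
        problems.append("an index has no replicas")
    if not c.awareness_configured:
        problems.append("shard allocation awareness is not configured")
    per_zone = Counter(zone for zone, _ in c.master_nodes)
    full_masters = sum(1 for _, voting_only in c.master_nodes if not voting_only)
    # Assumption: "evenly" means per-zone counts differ by at most one.
    evenly_spread = bool(per_zone) and max(per_zone.values()) - min(per_zone.values()) <= 1
    if len(c.master_nodes) < 3 or full_masters < 2 or len(per_zone) < 3 or not evenly_spread:
        problems.append("master-eligible nodes do not meet the placement rule")
    if len(c.client_target_zones) < 2:
        problems.append("clients send requests to fewer than two zones")
    return problems


# Two zones plus a dedicated tiebreaker in a third zone.
summary = ClusterSummary(
    health="green",
    data_node_zones={"zone-a", "zone-b"},
    min_replicas=1,
    awareness_configured=True,
    master_nodes=[("zone-a", False), ("zone-b", False), ("zone-c", True)],
    client_target_zones={"zone-a", "zone-b"},
)
print(unmet_conditions(summary))  # []
```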