Engineering

# A new era for cluster coordination in Elasticsearch

One of the reasons why Elasticsearch has become so widely popular is due to how well it scales from just a small cluster with a few nodes to a large cluster with hundreds of nodes. At its heart is the cluster coordination subsystem. Elasticsearch version 7 contains a new cluster coordination subsystem that offers many benefits over earlier versions. This article covers the improvements to this subsystem in version 7. It describes how to use the new subsystem, how the changes affect upgrades from version 6, and how these improvements prevent you from inadvertently putting your data at risk. It concludes with a taste of the theory describing how the new subsystem works.

## What is cluster coordination?

An Elasticsearch cluster can perform many tasks that require a number of nodes to work together. For example, every search must be routed to all the right shards to ensure that its results are accurate. Every replica must be updated when you index or delete some documents. Every client request must be forwarded from the node that receives it to the nodes that can handle it. The nodes each have their own overview of the cluster so that they can perform searches, indexing, and other coordinated activities. This overview is known as the cluster state. The cluster state determines things like the mappings and settings for each index, the shards that are allocated to each node, and the shard copies that are in-sync. It is very important to keep this information consistent across the cluster. Many recent features, including sequence-number based replication and cross-cluster replication, work correctly only because they can rely on the consistency of the cluster state.

The coordination subsystem works by choosing a particular node to be the master of the cluster. This elected master node makes sure that all nodes in its cluster receive updates to the cluster state. This is harder than it might first sound, because distributed systems like Elasticsearch must be prepared to deal with many strange situations. Nodes sometimes run slowly, pause for a garbage collection, or suddenly lose power. Networks suffer from partitions, packet loss, periods of high latency, or may deliver messages in a different order from the order in which they were sent. There may be more than one such problem at once, and they may occur intermittently. Despite all this, the cluster coordination subsystem must be able to guarantee that every node has a consistent view of the cluster state.

Importantly, Elasticsearch must be resilient to the failures of individual nodes. It achieves this resilience by considering cluster-state updates to be successful after a quorum of nodes have accepted them. A quorum is a carefully-chosen subset of the master-eligible nodes in a cluster. The advantage of requiring only a subset of the nodes to respond is that some of the nodes can fail without affecting the cluster's availability. Quorums must be carefully chosen so the cluster cannot elect two independent masters which make inconsistent decisions, ultimately leading to data loss.

Typically we recommend that clusters have three master-eligible nodes so that if one of the nodes fails then the other two can still safely form a quorum and make progress. If a cluster has fewer than three master-eligible nodes, then it cannot safely tolerate the loss of any of them. Conversely if a cluster has many more than three master-eligible nodes, then elections and cluster state updates can take longer.

## Evolution or revolution?

Elasticsearch versions 6.x and earlier use a cluster coordination subsystem called Zen Discovery. This subsystem has evolved and matured over the years, and successfully powers clusters large and small. However there are some improvements we wanted to make which required some more fundamental changes to how it works.

Zen Discovery lets the user choose how many master-eligible nodes form a quorum using the discovery.zen.minimum_master_nodes setting. It is vitally important to configure this setting correctly on every node, and to update it correctly as the cluster scales dynamically. It is not possible for the system to detect if a user has misconfigured this setting, and in practice it is very easy to forget to adjust it after adding or removing nodes. Zen Discovery tries to protect against this kind of misconfiguration by waiting for a few seconds at each master election, and is generally quite conservative with other timeouts too. This means that if the elected master node fails then the cluster is unavailable for at least a crucial few seconds before electing a replacement. If the cluster cannot elect a master then sometimes it can be very hard to understand why.

For Elasticsearch 7.0 we have rethought and rebuilt the cluster coordination subsystem:

• The minimum_master_nodes setting is removed in favour of allowing Elasticsearch itself to choose which nodes can form a quorum.
• Typical master elections now take well under a second to complete.
• Growing and shrinking clusters becomes safer and easier and leaves much less room to configure the system in a way that risks losing data.
• Nodes log their status much more clearly to help diagnose why they cannot join a cluster or why a master cannot be elected.

As nodes are added or removed, Elasticsearch automatically maintains an optimal level of fault tolerance by updating the cluster's voting configuration. The voting configuration is a set of master-eligible nodes whose votes are counted when making a decision. Typically the voting configuration contains all the master-eligible nodes in the cluster. The quorums are simple majorities of the voting configuration: all cluster state updates need agreement from more than half of the nodes in the voting configuration. Since the system manages the voting configuration, and therefore its quorums, it can avoid any chance of misconfigurations that might lead to data loss even if nodes are added or removed.

If a node cannot discover a master node and cannot win an election itself then, starting in 7.0, Elasticsearch will periodically log a warning message describing its current status in enough detail to help diagnose many common problems.

Additionally, Zen Discovery had a very rare failure mode, recorded on the Elasticsearch Resiliency Status Page as "Repeated network partitions can cause cluster state updates to be lost", which can no longer occur. This item is now marked as resolved.

## How do I use this?

If you start some freshly installed Elasticsearch nodes with completely default configurations, then they will automatically seek out other nodes running on the same host and form a cluster after a few seconds. If you start more nodes on the same host, then by default they will discover and join this cluster too. This makes it just as easy to start a multi-node development cluster with Elasticsearch version 7.0 as it is with earlier versions.

This fully-automatic cluster formation mechanism works well on a single host but it is not robust enough to use in production or other distributed environments. There is a risk that the nodes might not discover each other in time and might form two or more independent clusters instead. Starting with version 7.0, if you want to start a brand new cluster that has nodes on more than one host, you must specify the initial set of master-eligible nodes that the cluster should use as voting configuration in its first election. This is known as cluster bootstrapping, and is only required the very first time the cluster forms. Nodes that have already joined a cluster store the voting configuration in their data folders and reuse that after a restart, and freshly started nodes that are joining an existing cluster can receive this information from the cluster's elected master.

You bootstrap a cluster by setting the cluster.initial_master_nodes setting to the names or IP addresses of the initial set of master-eligible nodes. You can provide this setting on the command line or in the elasticsearch.yml file of one or more of the master-eligible nodes. You will also need to configure the discovery subsystem so that the nodes know how to find each other.

If initial_master_nodes is not set then brand-new nodes will start up expecting to be able to discover an existing cluster. If a node cannot find a cluster to join then it will periodically log a warning message indicating

master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster,
and [cluster.initial_master_nodes] is empty on this node

There is no longer any special ceremony required to add new master-eligible nodes to a cluster. Simply configure the new nodes to discover the existing cluster, start them up, and the cluster will safely and automatically adapt its voting configuration when the new nodes join. It is also safe to remove nodes simply by stopping them as long as you do not stop half or more of the master-eligible nodes all at once. If you need to stop half or more of the master-eligible nodes, or you have more complex scaling and orchestration needs, then there is a more targeted scaling procedure which uses an API to adjust the voting configuration directly.

If you perform a rolling upgrade then the cluster bootstrapping happens automatically, based on the number of nodes in the cluster and any existing minimum_master_nodes setting. This means it is important to ensure this setting is set correctly before starting the upgrade. There is no need to set initial_master_nodes here since cluster bootstrapping happens automatically when performing this rolling upgrade. The version 7 master-eligible nodes will prefer to vote for version 6.7 nodes in master elections, so you can normally expect a version 6.7 node to be the elected master during the upgrade until you have upgraded every one of the master-eligible nodes.

If you perform a full-cluster restart upgrade, then you must bootstrap the upgraded cluster as described above: before starting the newly-upgraded cluster you must first set initial_master_nodes to the names or IP addresses of the master-eligible nodes.

In versions 6 and earlier there are some other settings that allow you to configure the behaviour of Zen Discovery in the discovery.zen.* namespace. Some of these settings no longer have any effect and have been removed. Others have been renamed. If a setting has been renamed then its old name is deprecated in version 7, and you should adjust your configuration to use the new names:

Old nameNew name
discovery.zen.ping.unicast.hostsdiscovery.seed_hosts
discovery.zen.hosts_providerdiscovery.seed_providers
discovery.zen.no_master_blockcluster.no_master_block

The new cluster coordination subsystem includes a new fault detection mechanism. This means that the Zen Discovery fault detection settings in the discovery.zen.fd.* namespace no longer have any effect. Most users should use the default fault detection configuration in versions 7 and later, but if you need to make any changes then you can do so using the cluster.fault_detection.* settings.

## Safety first

Versions of Elasticsearch before 7.0 sometimes allowed you to inadvertently perform a sequence of steps that unsafely put the consistency of the cluster at risk. In contrast, versions 7.0 and later will make you fully aware that you might be doing something unsafe and will require confirmation that you really do want to proceed.

For example, an Elasticsearch 7.0 cluster will not automatically recover if half or more of the master-eligible nodes are permanently lost. It is common to have three master-eligible nodes in a cluster, allowing Elasticsearch to tolerate the loss of one of them without downtime. If two of them are permanently lost then the remaining node cannot safely make any further progress.

Versions of Elasticsearch before 7.0 would quietly allow a cluster to recover from this situation. Users could bring their cluster back online by starting up new, empty, master-eligible nodes to replace any number of lost ones. An automated recovery from the permanent loss of half or more of the master-eligible nodes is unsafe, because none of the remaining nodes are certain to have a copy of the latest cluster state. This can lead to data loss. For example, a shard copy may have been removed from the in-sync set. If none of the remaining nodes know this, then that stale shard copy might be allocated as a primary. The most dangerous part of this was that users were completely unaware that this sequence of steps had put their cluster at risk. It might be weeks or months before a user notices any inconsistency.

In Elasticsearch 7.0 and later this kind of unsafe activity is much more restricted. Clusters will prefer to remain unavailable rather than take this kind of risk. In the rare situation where there are no backups, it is still possible to perform this kind of unsafe operation if absolutely necessary. It just takes a few extra steps to confirm that you are aware of the risks and to avoid the chance of doing an unsafe operation by accident.

If you have lost half or more of the master-eligible nodes, then the first thing to try is to bring the lost master-eligible nodes back online. If the nodes' data directories are still intact, then the best thing to do is to start new nodes using these data directories. If this is possible then the cluster will safely form again using the most up-to-date cluster state.

The next thing to try is to restore the cluster from a recent snapshot. This brings the cluster into a known-good state but loses any data written since you took the snapshot. You can then index any missing data again, since you know what the missing time period is. Snapshots are incremental so you can perform them quite frequently. It is not unusual to take a snapshot every 30 minutes to limit the amount of data lost in such a recovery.

If neither of these recovery actions is possible then the last resort is the elasticsearch-node unsafe recovery tool. This is a command-line tool that a system administrator can run to perform unsafe actions such as electing a stale master from a minority. By making the steps that can break consistency very explicit, Elasticsearch 7.0 eliminates the risk of unintentionally causing data loss through a series of unsafe operations.

## How does it work?

If you are familiar with the theory of distributed systems you may recognise cluster coordination as an example of a problem that can be solved using distributed consensus.  Distributed consensus was not very widely understood when the development of Elasticsearch started, but the state of the art has advanced significantly in recent years.

Zen Discovery adopted many ideas from distributed consensus algorithms, but did so organically rather than strictly following the model that the theory prescribes. It also has very conservative timeouts, making it sometimes recover very slowly after a failure. The introduction of a new cluster coordination subsystem in 7.0 gave us the opportunity to follow the theoretical model much more closely.

Distributed coordination is known to be a difficult problem to solve correctly. We heavily relied on formal methods to validate our designs up-front, with automated tooling providing strong guarantees in terms of correctness and safety. You can find the formal specifications of Elasticsearch's new cluster coordination algorithm in our public Elasticsearch formal-models repository. The core safety module of the algorithm is simple and concise and there is a direct one-to-one correspondence between the formal model and the production code in the Elasticsearch repository.

If you are acquainted with the family of distributed consensus algorithms that includes Paxos, Raft, Zab and Viewstamped Replication (VR), then the core safety module will look familiar. It models a single rewritable register and uses a notion of a master term that parallels the ballots of Paxos, the terms of Raft and the views of VR. The core safety module and its formal model also covers cluster bootstrapping, persistence across node restarts, and dynamic reconfiguration. All of these features are important to ensure that the system behaves correctly in all circumstances.

Around this theoretically-robust core we built a liveness layer to ensure that, no matter what failures happen to the cluster, once the network is restored and enough nodes are online a master will be elected and will be able to publish cluster state updates. The liveness layer uses a number of state-of-the-art techniques to avoid many common issues. The election scheduler is adaptive, altering its behavior according to network conditions to avoid an excess of contested elections. A Raft-style pre-voting round suppresses unwinnable elections before they start, avoiding disruption by rogue nodes. Lag detection prevents nodes from disrupting the cluster if they fall too far behind the master. Active bidirectional fault detection ensures that the nodes in the cluster can always mutually communicate. Most cluster state updates are published efficiently as small diffs, avoiding the need to copy the whole cluster state from node to node. Leaders that are gracefully terminated will expressly abdicate to a successor of their choosing, reducing downtime during a deliberate failover by avoiding the need for a full election. We developed testing infrastructure to efficiently simulate the effects of pathological disruptions that could last for seconds, minutes, or hours, allowing us to verify that the cluster always recovers quickly once the disruption is resolved.

### Why not Raft?

A question we are commonly asked is why we didn't simply "plug in" a standard distributed consensus algorithm such as Raft. There are quite a number of well-known algorithms, each offering different trade-offs. We carefully evaluated and drew inspiration from all the literature we could find. One of our early proofs-of-concepts used a protocol much closer to Raft. We learned from this experience that the changes required to fully integrate it with Elasticsearch turned out to be quite substantial. Many of the standard algorithms also prescribe some design decisions that would be suboptimal for Elasticsearch. For example:

• They are frequently structured around a log of operations, whereas Elasticsearch's cluster coordination is more naturally based directly on the cluster state itself. This enables vital optimisations such as batching (combining related operations into a single broadcast) more simply than would be possible if it were operation-based.
• They often have a fairly restricted ability to grow or shrink clusters, requiring a sequence of steps to achieve many maintenance tasks, whereas Elasticsearch's cluster coordination can safely perform arbitrary reconfigurations in a single step. This simplifies the surrounding system by avoiding problematic intermediate states.
• They often focus heavily on safety, leave open the details of how they guarantee liveness, and do not describe how the cluster should react if it finds a node to be unhealthy. Elasticsearch's health checks are complex, having been used and refined in the field for many years, and it was important for us to preserve its existing behaviour. In fact it took much less effort to implement the system's safety properties than to guarantee its liveness. The majority of the implementation effort focussed on the liveness properties of the system.
• One of the project goals was to support a zero-downtime rolling upgrade from a 6.7 cluster running Zen Discovery to a version 7 cluster running with the new coordination subsystem. It did not seem feasible to adapt any of the standard algorithms into one that allowed for this kind of rolling upgrade.

A full, industrial-strength implementation of a distributed consensus algorithm takes substantial effort to develop and must go beyond what the academic literature outlines. It is inevitable that customizations will be needed in practice, but coordination protocols are complex and any customizations carry the risk of introducing errors. It ultimately made the most sense from an engineering perspective to treat these customizations as if developing a new protocol.

## Summary

Elasticsearch 7.0 ships with a new cluster coordination subsystem that is faster, safer, and simpler to use. It supports zero-downtime rolling upgrades from 6.7, and provides the basis for resilient data replication. To try out the new cluster coordination subsystem, download the latest 7.0 beta release, consult the docs, take it for a spin, and give us some feedback.