Fault tolerance (high availability)

Fault tolerance for Elastic Cloud Enterprise is based around the concept of availability zones. An availability zone contains resources available to an Elastic Cloud Enterprise installation that are isolated from other availability zones to safeguard against potential failure. In a fault tolerant Elasticsearch cluster, the nodes of a cluster are spread across two or three availability zones to ensure that the cluster can survive the failure of an availability zone. An availability zone could be a rack, a server zone, a zone on a cloud platform such as AWS or GCP, or some other logical constraint that supports the requirement that you could lose the entire availability zone and yet your cluster would still be up and running, because other availability zones are unaffected and can handle the workload.

All hosts that you install Elastic Cloud Enterprise on belong to an availability zone. You can specify which zone a runner should belong to with the --availability-zone parameter when you install . If you are following our deployment recommendations, the runners in your installation will end up distributed evenly across three availability zones.

When you create or change an Elasticsearch cluster, you can select between one and three availability zones, similar to what our Elastic Cloud hosted offering supports. In broad terms:

  • A single availability zone is suitable for testing and development
  • Two availability zones are suitable for production use (with a tiebreaker)
  • Three availability zones are great for mission critical environments

Elasticsearch nodes are started in each of the availability zones when your cluster is provisioned. When deploying clusters in multiple availability zones, shard allocation awareness ensures that primary shards and their replica shards are spread across different zones to minimize the risk of losing all shard copies at the same time.

Planning for a fault-tolerant installation with multiple availability zones means avoiding any single point of failure that could bring down Elastic Cloud Enterprise. For example, if you create an installation that uses a single physical rack with multiple availability zones placed on the same rack, that rack becomes a potential single point of failure. If the rack suffers an outage or a hardware failure, your entire cluster could be forced offline. To make such an installation more fault tolerant, you should create availability zones that are not dependent on the same physical rack. Similarly, you need to plan for sufficient capacity when an availability zone goes down: If your cluster is barely keeping up with its workload when all availability zones are up, the loss of one availability zone will likely cause the remaining Elasticsearch nodes in your cluster to be overwhelmed.

The main difference between Elastic Cloud Enterprise installations that include two or three availability zones is that three availability zones enable Elastic Cloud Enterprise to create clusters with a tiebreaker. If you have only two availability zones in total in your installation, no tiebreaker is created.

Tiebreakers are used in distributed clusters to avoid cases of split brain, where a cluster splits into multiple, autonomous parts that continue to handle requests independently of each other, at the risk of affecting cluster consistency and data loss. A split-brain scenario is avoided by making sure that a minimum number of master-eligible nodes must be present in order for any part of the cluster to elect a master node and accept user requests. To prevent multiple parts of a cluster from being eligible, there must be a quorum-based majority of (n/2)+1 nodes, where n is the number of nodes in the cluster. The minimum number of master nodes to reach quorum in a two-node cluster is the same as for a three-node cluster: two nodes must be available.

If you have only two availability zones in total in your installation, Elastic Cloud Enterprise disables the creation of a cluster with two availability zones. The reason is simple: If you could create a cluster with nodes in both of these availability zones, there would be nowhere reliable to put the tiebreaker. The tiebreaker would have a 50:50 chance of being part of the surviving availability zone in case of a zone failure, which is not enough to guarantee a quorum for every possible failure. If the availability zone that fails happens to contain the tiebreaker, the remaining availability zone would control less than the required (n/2)+1 nodes in the cluster. Without a quorum, the remaining nodes would not be able to process user requests. Effectively, such clusters provide little or no advantage in fault tolerance over clusters in a single availability zone which is why we disable their creation.

When you create a cluster with nodes in two availability zones when a third zone is available, Elastic Cloud Enterprise can create a tiebreaker in the third availability zone to help establish quorum in case of loss of an availability zone. The extra tiebreaker node that helps to provide quorum does not have to be a full-fledged and expensive node, as it does not hold data. For example: By tagging allocators hosts in Elastic Cloud Enterprise, can you create a cluster with eight nodes each in zones ece-1a and ece-1b, for a total of 16 nodes, and one tiebreaker node in zone ece-1c.This cluster can lose any of the three availability zones whilst maintaining quorum, which means that the cluster can continue to process user requests, provided that there is sufficient capacity available when an availability zone goes down.

By default, each node in an Elasticsearch cluster is a master-eligible node and a data node. In larger clusters, such as production clusters, it’s a good practice to split the roles, so that master nodes are not handling search or indexing work. When you create a cluster, you can specify to use dedicated master-eligible nodes, one per availability zone.