Elastic Cloud Enterprise supports a wide range of configurations. With such flexibility comes great freedom, but also the first rule of deployment planning: Your host machines need to be matched to the workload that you plan to run.
Generally, you can make it easy for Elastic Cloud Enterprise to perform well by using higher-end host machines and by substantially provisioning CPU and disk resources. Deploy host machines that provide 128 to 256 GB of RAM: you will need fewer hosts and your management overhead will be lower.
Based on our experience with the Elastic Cloud hosted offering, here is what we recommend:
- For search workloads, use high-performance SSDs drives and a RAM-to-storage ratio for users of 1:16 (or even 1:8). These specifications support merge operations and reindexing over time, and fast external sorting when data is too large to fit into memory all at once. Hosts that meet these requirements include the i2 instances available on AWS, for example.
- For logging workloads, switch to a RAM-to-storage ratio of 1:48 to 1:96, as data sizes are typically much larger compared to the RAM needed for logging. A cost effective solution might be to step down from SSDs to spinning media, such as high-performance server disks.
None of these recommendations are cast in stone and are not intended to recommend a particular vendor: if they do not suit your particular deployment, plan accordingly.
Regardless of your planned workload, you can likely extract higher performance from your Elastic Cloud Enterprise deployment by avoiding these practices:
- Don’t under-provision CPU resources, especially in multi-tenant environments. Some of your users might require substantial CPU resources.
- Don’t over-provision containers for mixed workloads on the assumption that applications won’t need their full resources all the time. Elasticsearch instances do work in the background, even if you are not actively running queries. For example, over-provisioning can impact background merges for search-based workloads and read/write operations for logging workloads.
Fault tolerance for Elastic Cloud Enterprise is based around the concept of availability zones. An availability zone contains resources available to an Elastic Cloud Enterprise installation that are isolated from other availability zones to safeguard against potential failure. In a fault tolerant cluster, the nodes of a cluster are spread across two or three availability zones to ensure that the cluster can survive the failure of an availability zone. An availability zone could be a rack, a server zone or some other logical constraint that supports the requirement that you could lose the entire availability zone and yet your cluster would still be up and running, because other availability zones are unaffected and can handle the workload.
All hosts that you install Elastic Cloud Enterprise on belong to an availability zone. You can specify which zone a runner should belong to with the
--availability-zone parameter when you install . If you are following our topology recommendations, the runners in your installation will end up distributed evenly across three availability zones.
When you create or change an Elasticsearch cluster, you can select between one and three availability zones, similar to what our Elastic Cloud hosted offering supports. In broad terms:
- A single availability zone is suitable for testing and development
- Two availability zones are suitable for production use (with a tiebreaker)
- Three availability zones are great for mission critical environments
Elasticsearch nodes are started in each of the availability zones when your cluster is provisioned. When deploying clusters in multiple availability zones, shard allocation awareness ensures that primary shards and their replica shards are spread across different zones to minimize the risk of losing all shard copies at the same time.
Planning for a fault-tolerant installation with multiple availability zones means avoiding any single point of failure that could bring down Elastic Cloud Enterprise. For example, if you create an installation that uses a single physical rack with multiple availability zones placed on the same rack, that rack becomes a potential single point of failure. If the rack suffers an outage or a hardware failure, your entire cluster could be forced offline. To make such an installation more fault tolerant, you should create availability zones that are not dependent on the same physical rack. Similarly, you need to plan for sufficient capacity when an availability zone goes down: If your cluster is barely keeping up with its workload when all availability zones are up, the loss of one availability zone will likely cause the remaining Elasticsearch nodes in your cluster to be overwhelmed.
The main difference between Elastic Cloud Enterprise installations that include two or three availability zones is that three availability zones enable Elastic Cloud Enterprise to create clusters with a tiebreaker. If you have only two availability zones in total in your installation, no tiebreaker is created.
Tiebreakers are used in distributed clusters to avoid cases of split brain, where a cluster splits into multiple, autonomous parts that continue to handle requests independently of each other, at the risk of affecting cluster consistency and data loss. A split-brain scenario is avoided by making sure that a minimum number of master-eligible nodes must be present in order for any part of the cluster to elect a master node and accept user requests. To prevent multiple parts of a cluster from being eligible, there must be a quorum-based majority of (n/2)+1 nodes, where n is the number of nodes in the cluster. The minimum number of master nodes to reach quorum in a two-node cluster is the same as for a three-node cluster: two nodes must be available.
If you have only two availability zones in total in your installation, Elastic Cloud Enterprise disables the creation of a cluster with two availability zones. The reason is simple: If you could create a cluster with nodes in both of these availability zones, there would be nowhere reliable to put the tiebreaker. The tiebreaker would have a 50:50 chance of being part of the surviving availability zone in case of a zone failure, which is not enough to guarantee a quorum for every possible failure. If the availability zone that fails happens to contain the tiebreaker, the remaining availability zone would control less than the required (n/2)+1 nodes in the cluster. Without a quorum, the remaining nodes would not be able to process user requests. Effectively, such clusters provide little or no advantage in fault tolerance over clusters in a single availability zone which is why we disable their creation.
When you create a cluster with nodes in two availability zones when a third zone is available, Elastic Cloud Enterprise can create a tiebreaker in the third availability zone to help establish quorum in case of loss of an availability zone. The extra tiebreaker node that helps to provide quorum does not have to be a full-fledged and expensive node, as it does not hold data. For example: By tagging allocators hosts in Elastic Cloud Enterprise, can you create a cluster with eight nodes each in zones
ece-1b, for a total of 16 nodes, and one tiebreaker node in zone
ece-1c.This cluster can lose any of the three availability zones whilst maintaining quorum, which means that the cluster can continue to process user requests, provided that there is sufficient capacity available when an availability zone goes down.
By default, each node in an Elasticsearch cluster is a master-eligible node and a data node. In larger clusters, such as production clusters, it’s a good practice to split the roles, so that master nodes are not handling search or indexing work. When you create a cluster, you can specify to use dedicated master-eligible nodes, one per availability zone.
When you install Elastic Cloud Enterprise on the first host, it is assigned many different runner roles, such as the role of allocator, coordinator, director, and proxy. This role assignment is required to bring up your cluster initially. In a production environment, some of these roles need to be separated, as their loads scale differently and can create conflicting demands when placed on the resource. There are also certain security implications that are addressed by separating roles.
Roles that should not be held by the same runner:
- Allocators and coordinators
- Coordinators and proxies
If this separation of roles is not possible, fewer hosts that provide substantial hardware resources could be used. For example, if you have only three hosts but these hosts are heavily provisioned, sharing roles might be feasible.
Some roles are safe for runners to hold at the same time:
- Directors and coordinators
Elastic Cloud Enterprise is designed to be used in conjunction with at least one load balancer. A load balancer is not included with Elastic Cloud Enterprise, so you need to provide one yourself and place it in front of the Elastic Cloud Enterprise proxies. The exact number of load balancers depends on the utilization rate for your clusters. In a highly available installation, use at least two load balancers per availability zone in your installation.
The load balancer should also handle HTTPS encryption and decryption, that is, TLS/SSL should terminate at the load balancer.
Load balancers require that inbound traffic is open on the ports used by Elasticsearch, Kibana, and the transport client. To learn more, see the networking prerequisites.
By default, Elastic Cloud Enterprise uses the external ip.es.io service provided by Elastic to resolve virtual cluster host names in compliance with RFC1918. The service works by resolving host names of the form
. In the case of Elastic Cloud Enterprise, each cluster is assigned a virtual host name of the form
..ip.es.io:, such as
https://6dfc65aae62341e18a8b7692dcc97188.8.131.52.132.ip.es.io:9243. The ip.es.io service simply resolves the virtual host name of the cluster to the proxy address which is specified during installation, 10.8.156.132 in our example, so that client requests are sent to the proxy. The proxy then extracts the cluster ID from the virtual host name of the cluster and uses its internal routing table to route the request to the right allocator.
The ip.es.io service is provided to help you evaluate Elastic Cloud Enterprise without having to set up DNS records for your environment. If you do not use the ip.es.io service for your production environment, you must set up a wildcard DNS record. You typically set up a wildcard DNS record that resolves to the proxy host or to a load balancer if you set up multiple proxies fronted by a load balancer. You can create both a wildcard DNS entry for your endpoints and a wildcard TLS/SSL certificate, so that you can create multiple clusters without the need for further DNS or TSL/SSL modifications. Simply configure your DNS to point to your load balancers and install your certificates on them, so that communication with the cluster is secure.
A wildcard certificate is enabled based on the CNAME record that is generated for each cluster. For more information on modifying the CNAME record, see Configure endpoints. The CNAME also determines the endpoint URLs are displayed in the Cloud UI.
Elastic Cloud Enterprise gives 50% of the available memory to the Elasticsearch JVM heap, while leaving the other 50% for the operating system. This memory won’t go unused, as Lucene is designed to leverage the underlying OS for caching in-memory data structures, meaning that Lucene will happily gobble up whatever is left over. The ideal heap size is somewhere below 32 GB, as heap sizes above 32 GB become less efficient.
What these recommendations mean is that on a 64 GB host, the ideal setup is to dedicate 32 GB to the Elasticsearch heap and 32 GB to the OS. If you provision a 128 GB host, you create at least two 64 GB nodes, each node with 32 GB reserved for the Elasticsearch heap and 32 GB reserved for the OS.
For more information about why heap sizes, memory for the OS, and the 32 GB maximum for JVMs matter, see Heap: Sizing and Swapping.
Elasticsearch scales to whatever capacity you need, pure and simple. If you need more processing capacity because your allocators are close being to maxed out, install Elastic Cloud Enterprise on additional hosts and assign the allocator role. The constructor immediately recognizes and provisions these additional resources so that they can start processing user requests.
ECE version 1.0.x takes a fill first approach to using up all available space on previously used allocators before starting to use new ones. A future version of ECE will offer a distribute first approach that ensures the distribution of Elasticsearch clusters across available allocators, including any new ones that you might have added recently.
Proxies redirecting user requests need to scale based on the ingest and search demands on Elasticsearch or Kibana instances. The load on proxies also depends on the algorithm specified by your load balancer. If you find that CPU and memory on your existing proxies are being maxed out regularly, it is time to add another proxy.
Coordinators and directors scale based on the overall load on the infrastructure of the Elastic Cloud Enterprise installation. In production environments, we recommend that you always use at least three directors and coordinators, preferably separating these roles by placing them on separate runners. Even a large-scale deployment will most likely not need to scale to a number that exceeds five.
To learn more, see adding capacity.
Some additional security considerations apply when you are planning your Elastic Cloud Enterprise installation. To learn more, see Secure Elastic Cloud Enterprise.