Elastic Cloud Enterprise supports a wide range of configurations. With such flexibility comes great freedom, but also the first rule of deployment planning: Your deployment needs to be matched to the workloads that you plan to run on your Elasticsearch clusters and Kibana instances.
Generally, you can make it easy for Elastic Cloud Enterprise to perform well by using higher-end host machines and by provisioning CPU and disk resources generously.
The ECE management services provided by the coordinators and directors require fast SSD storage to work correctly. For smaller deployments that co-locate the ECE management services with proxies and allocators on the same hosts, this means that you must use fast SSD storage for your entire deployment. If SSD-only storage is not feasible, some of the ECE management services need to be separated.
For the allocators hosting your Elasticsearch clusters and Kibana instances, deploy host machines that provide 128 to 256 GB of RAM: you will need fewer hosts and your management overhead will be lower. Based on our experience with the Elastic Cloud hosted offering, here is what we recommend for different types of workloads:
- For search workloads, use high-performance SSDs and give users a RAM-to-storage ratio of 1:16 (or even 1:8). These specifications support merge operations and reindexing over time, and fast external sorting when data is too large to fit into memory all at once. Hosts that meet these requirements include the i3 instances available on AWS, for example.
- For logging workloads, switch to a RAM-to-storage ratio of 1:48 to 1:96, as data sizes are typically much larger compared to the RAM needed for logging. A cost-effective solution might be to step down from SSDs to spinning media, such as high-performance server disks.
These recommendations are not cast in stone, nor are they intended to endorse a particular vendor: if they do not suit your particular deployment, plan accordingly. You can also consult Elastic to find the optimal setup for your use case.
Elastic Cloud Enterprise lets you configure multiple RAM-to-storage ratios, called node configurations, to support these workloads. To learn more, see Manage Node Configurations.
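To get a feel for what these ratios mean in practice, the following minimal sketch multiplies a RAM size by the storage side of a 1:N ratio. The function name `storage_gb` and the `RATIOS` table are made up for this illustration, and applying a ratio to the RAM of a whole host is a simplifying assumption.

```python
def storage_gb(ram_gb: int, storage_per_ram: int) -> int:
    """Return the storage in GB implied by a 1:N RAM-to-storage ratio."""
    if ram_gb <= 0 or storage_per_ram <= 0:
        raise ValueError("RAM size and ratio must be positive")
    return ram_gb * storage_per_ram


# Storage side of the 1:N ratios recommended above (hypothetical table name).
RATIOS = {"search": (8, 16), "logging": (48, 96)}

for workload, (low, high) in RATIOS.items():
    for ram in (128, 256):
        print(
            f"{workload}: {ram} GB RAM -> "
            f"{storage_gb(ram, low)} to {storage_gb(ram, high)} GB of storage"
        )
```

For example, 256 GB of RAM at 1:16 corresponds to 4096 GB of storage, while the same RAM at 1:96 corresponds to 24576 GB, which is one reason why cheaper media can become attractive for logging workloads.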
Regardless of your planned workload, you can likely extract higher performance from your Elastic Cloud Enterprise deployment by avoiding these practices:
- Don’t under-provision CPU resources, especially in multi-tenant environments. Some of your users might require substantial CPU resources.
- Don’t over-provision containers for mixed workloads on the assumption that applications won’t need their full resources all the time. Elasticsearch instances do work in the background, even if you are not actively running queries. For example, over-provisioning can impact background merges for search-based workloads and read/write operations for logging workloads.
For production environments, our deployment recommendations for a good baseline always include:
- Three availability zones for high availability.
- Three runners that hold the director and coordinator roles, spread across three availability zones.
- As many runners that hold the allocator role as needed for the workload you’re planning to run, spread across three availability zones.
- Enough runners that hold the proxy role to handle the volume of user requests, or at least one per availability zone for high availability. In the playbook examples, we always use three.
Fault tolerance for Elastic Cloud Enterprise is based around the concept of availability zones. An availability zone contains resources available to an Elastic Cloud Enterprise installation that are isolated from other availability zones to safeguard against potential failure. In a fault-tolerant cluster, the nodes of a cluster are spread across two or three availability zones to ensure that the cluster can survive the failure of an availability zone. An availability zone could be a rack, a server zone, a zone on a cloud platform such as AWS or GCP, or some other logical constraint. What matters is that you could lose the entire availability zone and your cluster would still be up and running, because other availability zones are unaffected and can handle the workload.
All hosts that you install Elastic Cloud Enterprise on belong to an availability zone. You can specify which zone a runner should belong to with the --availability-zone parameter when you install Elastic Cloud Enterprise. If you are following our deployment recommendations, the runners in your installation will end up distributed evenly across three availability zones.
When you create or change an Elasticsearch cluster, you can select between one and three availability zones, similar to what our Elastic Cloud hosted offering supports. In broad terms:
- A single availability zone is suitable for testing and development
- Two availability zones are suitable for production use (with a tiebreaker)
- Three availability zones are great for mission-critical environments
Elasticsearch nodes are started in each of the availability zones when your cluster is provisioned. When deploying clusters in multiple availability zones, shard allocation awareness ensures that primary shards and their replica shards are spread across different zones to minimize the risk of losing all shard copies at the same time.
Planning for a fault-tolerant installation with multiple availability zones means avoiding any single point of failure that could bring down Elastic Cloud Enterprise. For example, if you create an installation that uses a single physical rack with multiple availability zones placed on the same rack, that rack becomes a potential single point of failure. If the rack suffers an outage or a hardware failure, your entire cluster could be forced offline. To make such an installation more fault tolerant, you should create availability zones that are not dependent on the same physical rack. Similarly, you need to plan for sufficient capacity when an availability zone goes down: If your cluster is barely keeping up with its workload when all availability zones are up, the loss of one availability zone will likely cause the remaining Elasticsearch nodes in your cluster to be overwhelmed.
The main difference between Elastic Cloud Enterprise installations that include two or three availability zones is that three availability zones enable Elastic Cloud Enterprise to create clusters with a tiebreaker. If you have only two availability zones in total in your installation, no tiebreaker is created.
Tiebreakers are used in distributed clusters to avoid cases of split brain, where a cluster splits into multiple, autonomous parts that continue to handle requests independently of each other, at the risk of compromising cluster consistency and losing data. A split-brain scenario is avoided by making sure that a minimum number of master-eligible nodes must be present in order for any part of the cluster to elect a master node and accept user requests. To prevent multiple parts of a cluster from being eligible, there must be a quorum-based majority of (n/2)+1 nodes, where n is the number of nodes in the cluster and n/2 is rounded down. The minimum number of master nodes to reach quorum in a two-node cluster is the same as for a three-node cluster: two nodes must be available.
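The quorum arithmetic can be checked with a few lines of Python. This is only a sketch of the (n/2)+1 formula with the division rounded down; the function name `quorum` is chosen for this illustration and is not part of any Elastic API.

```python
def quorum(nodes: int) -> int:
    """Smallest majority of n nodes: floor(n / 2) + 1."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


for n in (1, 2, 3, 4, 5):
    needed = quorum(n)
    # Nodes that can be lost while a majority is still present.
    print(f"{n} nodes: quorum {needed}, tolerates {n - needed} failure(s)")
```

A two-node cluster and a three-node cluster both need two nodes for quorum, but only the three-node cluster can lose a node and keep a majority.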
If you have only two availability zones in total in your installation, Elastic Cloud Enterprise disables the creation of a cluster with two availability zones. The reason is simple: If you could create a cluster with nodes in both of these availability zones, there would be nowhere reliable to put the tiebreaker. The tiebreaker would have a 50:50 chance of being part of the surviving availability zone in case of a zone failure, which is not enough to guarantee a quorum for every possible failure. If the availability zone that fails happens to contain the tiebreaker, the remaining availability zone would control fewer than the required (n/2)+1 nodes in the cluster. Without a quorum, the remaining nodes would not be able to process user requests. Effectively, such clusters provide little or no advantage in fault tolerance over clusters in a single availability zone, which is why we disable their creation.
When you create a cluster with nodes in two availability zones when a third zone is available, Elastic Cloud Enterprise can create a tiebreaker in the third availability zone to help establish quorum in case of loss of an availability zone. The extra tiebreaker node that helps to provide quorum does not have to be a full-fledged and expensive node, as it does not hold data. For example: By tagging allocator hosts in Elastic Cloud Enterprise, you can create a cluster with eight nodes in each of two zones, one of which is ece-1b, for a total of 16 nodes, and one tiebreaker node in zone ece-1c. This cluster can lose any of the three availability zones whilst maintaining quorum, which means that the cluster can continue to process user requests, provided that there is sufficient capacity available when an availability zone goes down.
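The difference between the two layouts can be sketched by removing one zone at a time and checking whether the surviving nodes still form a majority. The sketch assumes that every node, including the tiebreaker, is master-eligible. The zone name ece-1a is an assumed placeholder for the first zone, and `survives_zone_loss` is a hypothetical helper, not an Elastic Cloud Enterprise function.

```python
def survives_zone_loss(nodes_per_zone: dict[str, int]) -> dict[str, bool]:
    """For each zone, report whether quorum survives the loss of that zone."""
    total = sum(nodes_per_zone.values())
    needed = total // 2 + 1  # quorum of (n/2)+1, rounded down
    return {
        zone: total - count >= needed
        for zone, count in nodes_per_zone.items()
    }


# Eight nodes in each of two zones plus a tiebreaker in a third zone.
three_zones = {"ece-1a": 8, "ece-1b": 8, "ece-1c": 1}
# Only two zones: the tiebreaker has to share a zone with data nodes.
two_zones = {"ece-1a": 8, "ece-1b": 9}

print(survives_zone_loss(three_zones))
print(survives_zone_loss(two_zones))
```

With three zones, 17 nodes need a quorum of nine, and at least nine nodes remain whichever zone fails. With two zones, losing the zone that holds the tiebreaker leaves eight nodes, one short of quorum.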
By default, each node in an Elasticsearch cluster is a master-eligible node and a data node. In larger clusters, such as production clusters, it’s a good practice to split the roles, so that master nodes are not handling search or indexing work. When you create a cluster, you can choose to use dedicated master-eligible nodes, one per availability zone.
When you install Elastic Cloud Enterprise on the first host, it is assigned many different runner roles, such as the role of allocator, coordinator, director, and proxy. This role assignment is required to bring up your cluster initially. In a production environment, some of these roles need to be separated, as their loads scale differently and can create conflicting demands when placed on the same hosts. There are also certain security implications that are addressed by separating roles.
Roles that should not be held by the same runner:
- Allocators and coordinators
- Allocators and directors
- Coordinators and proxies
If this separation of roles is not possible, fewer hosts that provide substantial hardware resources with fast SSD storage can be used, but we recommend this setup only for development, test, and small-scale use cases. For example, even if you have only three hosts, sharing roles might be feasible in some cases. To learn more about how such a setup can work, see Example: Create a First Baseline Installation (Small). If SSD-only storage is not feasible, you must separate the ECE management services provided by the coordinators and directors from your proxies and allocators and place them on different hosts that use fast SSD storage.
If you decide to use spinning disks with a small installation, you must not assign the director role to hosts that also hold the allocator role. The inherent latency of disk seek speeds can affect the performance of ZooKeeper running on hosts with the director role, which in turn can affect the stability of your installation. Do not assign the director and the allocator role to the same hosts when using spinning disks, even if this means having to use additional host machines so that you can separate these roles.
Some roles are safe for runners to hold at the same time:
- Directors and coordinators (the ECE management services)
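If you keep an inventory of which roles each runner holds, the separation rules above lend themselves to a simple check. The following is a minimal sketch under that assumption; the role strings, the `CONFLICTS` set, and the `conflicting_pairs` helper are illustrative names, not part of Elastic Cloud Enterprise.

```python
from itertools import combinations

# Role pairs that should not be held by the same runner, per the list above.
CONFLICTS = {
    frozenset({"allocator", "coordinator"}),
    frozenset({"allocator", "director"}),
    frozenset({"coordinator", "proxy"}),
}


def conflicting_pairs(roles: set[str]) -> list[tuple[str, str]]:
    """Return the role pairs on one runner that violate the separation rules."""
    found = []
    for a, b in combinations(sorted(roles), 2):
        if frozenset((a, b)) in CONFLICTS:
            found.append((a, b))
    return found


# The first host initially holds all four roles.
print(conflicting_pairs({"allocator", "coordinator", "director", "proxy"}))
# Directors and coordinators can share a runner.
print(conflicting_pairs({"director", "coordinator"}))
```

An empty result means that the runner holds no conflicting combination. The check does not cover the storage-related conditions, such as the rule about spinning disks.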
For an example of how a completed Elastic Cloud Enterprise installation separates these roles, see our Example: A Large Installation with Separate Management Services and Proxies. To learn more about how you can assign roles, see Assign Roles.
Elastic Cloud Enterprise is designed to be used in conjunction with at least one load balancer. A load balancer is not included with Elastic Cloud Enterprise, so you need to provide one yourself and place it in front of the Elastic Cloud Enterprise proxies. The exact number of load balancers depends on the utilization rate for your clusters. In a highly available installation, use at least two load balancers per availability zone in your installation.
The load balancer should pass through the TCP stream of IP packets to the proxies, which expect TLS/SSL traffic. This is called TCP mode or stream mode in commonly available load balancers.
Load balancers require that inbound traffic is open on the ports used by Elasticsearch, Kibana, and the transport client. To learn more, see the networking prerequisites.
By default, Elastic Cloud Enterprise uses the external ip.es.io service provided by Elastic to resolve virtual cluster host names in compliance with RFC1918. The service works by resolving host names that place an IP address in front of the ip.es.io suffix to that IP address. In the case of Elastic Cloud Enterprise, each cluster is assigned a virtual host name that consists of the cluster ID, the proxy IP address, the ip.es.io suffix, and a port, such as https://6dfc65aae62341e18a8b7692dcc97184.10.8.156.132.ip.es.io:9243. The ip.es.io service simply resolves the virtual host name of the cluster to the proxy address which is specified during installation, 10.8.156.132 in our example, so that client requests are sent to the proxy. The proxy then extracts the cluster ID from the virtual host name of the cluster and uses its internal routing table to route the request to the right allocator.
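To illustrate the structure of such a virtual host name, the following sketch splits a host name into the cluster ID and the embedded proxy address. It is not how the ip.es.io service or the proxy is implemented. The layout (cluster ID first, then an IPv4 address, then the ip.es.io suffix) is inferred from the example with proxy address 10.8.156.132, and `split_virtual_host` is a hypothetical helper.

```python
import ipaddress

SUFFIX = ".ip.es.io"


def split_virtual_host(host: str) -> tuple[str, ipaddress.IPv4Address]:
    """Split a virtual host name into (cluster ID, embedded proxy address)."""
    if not host.endswith(SUFFIX):
        raise ValueError(f"not an ip.es.io host name: {host}")
    parts = host[: -len(SUFFIX)].split(".")
    if len(parts) != 5:
        raise ValueError("expected a cluster ID followed by four IPv4 octets")
    cluster_id = parts[0]
    # IPv4Address raises ValueError if the octets are not a valid address.
    address = ipaddress.IPv4Address(".".join(parts[1:]))
    return cluster_id, address


cluster_id, proxy = split_virtual_host(
    "6dfc65aae62341e18a8b7692dcc97184.10.8.156.132.ip.es.io"
)
print(cluster_id)        # the part the proxy would use for routing
print(proxy)             # the address the name resolves to
print(proxy.is_private)  # True for RFC 1918 addresses such as 10.x.x.x
```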
The ip.es.io service is provided to help you evaluate Elastic Cloud Enterprise without having to set up DNS records for your environment. If you do not use the ip.es.io service for your production environment, you must set up a wildcard DNS record. You typically set up a wildcard DNS record that resolves to the proxy host or to a load balancer if you set up multiple proxies fronted by a load balancer. You can create both a wildcard DNS entry for your endpoints and a wildcard TLS/SSL certificate, so that you can create multiple clusters without the need for further DNS or TLS/SSL modifications. Simply configure your DNS to point to your load balancers and install your certificates on them, so that communication with the cluster is secure.
A wildcard certificate is enabled based on the CNAME record that is generated for each cluster. For more information on modifying the CNAME record, see Configure endpoints. The CNAME also determines how the endpoint URLs are displayed in the Cloud UI.
Elasticsearch scales to whatever capacity you need, pure and simple. If you need more processing capacity because your allocators are close to being maxed out, install Elastic Cloud Enterprise on additional hosts and assign the allocator role. The constructor immediately recognizes and provisions these additional resources so that they can start processing user requests.
ECE version 1.0.x takes a fill-first approach, using up all available space on previously used allocators before starting to use new ones. A future version of ECE will offer a distribute-first approach that ensures the distribution of Elasticsearch clusters across available allocators, including any new ones that you might have added recently.
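The difference between the two placement approaches can be illustrated with a toy model. This is not the actual ECE placement logic: it assumes that each allocator is described only by its free capacity in GB, and the allocator names and both function names are made up for the example.

```python
from typing import Optional


def fill_first(free_gb: dict[str, int], needed_gb: int) -> Optional[str]:
    """Pick the fullest allocator (least free capacity) that still has room."""
    candidates = [(free, name) for name, free in free_gb.items() if free >= needed_gb]
    return min(candidates)[1] if candidates else None


def distribute_first(free_gb: dict[str, int], needed_gb: int) -> Optional[str]:
    """Pick the emptiest allocator (most free capacity) that has room."""
    candidates = [(free, name) for name, free in free_gb.items() if free >= needed_gb]
    if not candidates:
        return None
    # Most free capacity first; ties are broken by allocator name.
    return min(candidates, key=lambda c: (-c[0], c[1]))[1]


# allocator-3 was added recently and is still empty (hypothetical numbers).
free = {"allocator-1": 16, "allocator-2": 48, "allocator-3": 64}

print(fill_first(free, 8))        # allocator-1
print(distribute_first(free, 8))  # allocator-3
```

In this model, fill-first keeps packing the allocators that are already in use, so a newly added allocator only receives instances once the others can no longer fit them, whereas distribute-first starts using the new allocator right away.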
Proxies redirecting user requests need to scale based on the ingest and search demands on Elasticsearch or Kibana instances. The load on proxies also depends on the algorithm specified by your load balancer. If you find that CPU and memory on your existing proxies are being maxed out regularly, it is time to add another proxy.
Coordinators and directors scale based on the overall load on the infrastructure of the Elastic Cloud Enterprise installation. In production environments, we recommend that you always use at least three directors and coordinators, preferably separating these roles by placing them on separate runners. Even a large-scale deployment will most likely not need to scale to a number that exceeds five.
To learn more, see adding capacity.
Some additional security considerations apply when you are planning your Elastic Cloud Enterprise installation. To learn more, see Secure Elastic Cloud Enterprise.