May 8, 2015

Virtualization and Elasticsearch: Best Practices

To have a better understanding of the challenges we may deal with when using Elasticsearch in a virtualized environment, we need to change the focus from conventional hardware problems to a more complex view. The purpose of this article is to uncover some common issues you might experience using Elasticsearch in virtual environments.

A Brief History

Long before Elasticsearch appeared, the concept of virtualization was taking its place as a first-class citizen in computing. Virtualization refers to the act of creating a virtual (not an actual) version of something, including, among others, virtual computer hardware platforms, operating systems, storage devices, or computer network resources.

Virtualization was born in the late 1960s and early 1970s, when IBM created the CP-40/CMS (Conversational Monitor System) as a method of logically dividing the system resources provided by mainframe computers between different applications. Afterwards, the meaning of the term broadened to what it is today: full virtual machine (VM) implementations and control of processing, network and memory, all working together seamlessly in the cloud.

Existing Platforms

There are various platforms for running Elasticsearch in virtual environments, and each works differently. Generally, the three main platforms we see used for Elasticsearch are:

Cloud Solutions

Amazon Elastic Compute Cloud (EC2): An Amazon service with a simple web interface that allows you to create and configure virtual machines in the cloud.

Microsoft Azure: Microsoft's competitive alternative to Amazon EC2, which offers similar features and functionality.

On-Premise Solution

VMware vSphere/ESX: Creates full virtual clusters by installing on top of physical servers and partitioning them into VMs.

Finally, as a different way to handle our Elasticsearch virtualized infrastructure, Found by Elastic is a hosted and fully managed Elasticsearch Software as a Service (SaaS). Found provides a fast, scalable, reliable and easy to operate search service hosted for you in the cloud.

The Architecture

As an example of how complex a virtualized architecture can be, and of all the points we have to understand to manage Elasticsearch in a virtual environment, we can take a brief look at VMware's vSphere architecture. VMware vSphere is used to transform entire datacenters into a single cloud computing infrastructure, virtualizing and aggregating the main physical hardware resources across multiple systems and providing virtual resources to the datacenter.

VMware vSphere consists of multiple component layers such as:

Infrastructure Services - VMware vCompute, VMware vStorage and VMware vNetwork. VMware ESX and ESXi are both hypervisors that run on physical servers, abstract away the processor, manage storage in virtual environments and simplify networking.

Application Services - Ensure availability, security and scalability for applications.

VMware vCenter Server - A single application that takes control of the datacenter, providing access control, performance monitoring and configurations.

Clients - Different types of clients to access the VMware vSphere datacenter, where we can create and access an Elasticsearch node.

Although the architecture is complex, no matter which virtualization solution we use, we will have tools that make it very easy to manage an entire datacenter or cluster. Those tools can help us to easily allocate storage and networking to the physical nodes, parcel out resource allocation (CPU, memory, disk and network bandwidth) as needed, monitor datacenter status, and more. The tools will allow us to configure and set up Elasticsearch in a virtual environment exactly as required depending on our needs. Regardless, we need to take care around some issues that can crop up with CPU, memory and disk utilization.

Handling Resources

There are various ways to achieve the goal of running Elasticsearch in a virtualized environment. Each platform and solution, whether it is cloud-based or not, has its own complexity and difficulty in configuring and running. Handling resources is the key area for achieving success.

CPU

Every virtualization solution has limits regarding CPU usage. A physical processor core can support up to 32 virtual CPUs (vCPUs) in both vSphere 6 and Azure, and 36 vCPUs in Amazon EC2. As we increase CPU allocation on cloud providers, we will increase the cost for each instance.

Elasticsearch uses Java, so we will need to handle a Java Virtual Machine (JVM) within our virtual environment. A good approach for JVMs is to have a minimum of two CPUs, one to handle garbage collection and JVM administration, and the other to handle the application processing.
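As a rough illustration of the two-CPU guideline, the sketch below checks how many CPUs the guest operating system reports. The function name has_enough_vcpus and its parameters are hypothetical, chosen only for this example, and the count seen inside a VM depends on how the hypervisor exposes vCPUs.

```python
import os


def has_enough_vcpus(cpu_count=None, minimum=2):
    """Return True when at least `minimum` CPUs are visible to this VM.

    `cpu_count` can be passed explicitly (for example, a value read from a
    provisioning tool); otherwise the count reported by the OS is used.
    """
    if cpu_count is None:
        # os.cpu_count() may return None when the count cannot be determined.
        cpu_count = os.cpu_count() or 0
    return cpu_count >= minimum


if __name__ == "__main__":
    print("Enough vCPUs for the JVM guideline:", has_enough_vcpus())
```

A check like this could run as part of a provisioning script before Elasticsearch is started on a new VM.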

A good way to handle CPU usage is to monitor CPU utilization inside the VM using Marvel. If Elasticsearch is using a lot of CPU resources inside the VM, it may be worth considering increasing the number of available vCPUs.

Memory

As well as CPU limits, there are limits for the amount of RAM we can allocate on a host depending on the provider: up to 6 TB on vSphere, 244 GB on Amazon EC2, and 112 GB on Azure. As we increase memory usage, we will generally see an increase in costs.

CPU and disk usage can be affected by reaching memory limits. You might want to monitor the host and VM status with Marvel to find out whether you need to do something to decrease memory pressure, such as refactoring Elasticsearch queries or increasing the amount of memory on the host.

Java objects reside in the Java heap. The amount of memory given to the heap largely determines whether our Elasticsearch cluster behaves well or badly. When the heap starts to fill, the Java garbage collector will start running. It is a best practice to allocate half of the total amount of memory to the heap, leaving the rest for the operating system and its file system cache. In addition, we have detailed information in our documentation on how to limit memory usage.
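The half-of-memory rule is simple arithmetic, and a minimal sketch of it follows. The function names and the optional cap_mb parameter are assumptions made for this example and are not part of any Elasticsearch API. The "8192m" style string matches the format accepted by the ES_HEAP_SIZE environment variable in Elasticsearch 1.x releases. Later versions configure the heap differently, so treat the output as illustrative.

```python
def recommended_heap_mb(total_ram_mb, cap_mb=None):
    """Return half of the VM's RAM in MB, optionally limited by `cap_mb`.

    `cap_mb` is a hypothetical knob for this sketch: pass it when you want
    to keep the heap under an upper bound of your choosing.
    """
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    heap_mb = total_ram_mb // 2
    if cap_mb is not None:
        heap_mb = min(heap_mb, cap_mb)
    return heap_mb


def heap_setting(total_ram_mb, cap_mb=None):
    """Format the heap size as a JVM-style size string such as '8192m'."""
    return f"{recommended_heap_mb(total_ram_mb, cap_mb)}m"


if __name__ == "__main__":
    # A VM with 16 GB of RAM gets an 8 GB heap under the half-of-memory rule.
    print(heap_setting(16 * 1024))
```

Keeping the calculation in one place makes it easy to apply the same rule to every VM size used in the cluster.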

Disk

Disk utilization is similar on a host and a VM. We need to eliminate disk contention as we do in any environment. If a set of disks in the host is being overused, meaning that the average I/O utilization is close to 100%, we might see an impact on all the virtual environments that are using the same disks. Disk resources can also be impacted by "noisy neighbors", which are generally larger VMs running on or against the same hardware, thereby consuming resources in negative and surprising ways.

Backing up your Elasticsearch cluster, or creating snapshots for individual indices as well as entire clusters, is incredibly important! By making backups from the VM, we can ensure that we have a starting point to continue from in the case of failure. Creating snapshots or backups from VMs has some cost and may have an impact on the VM response time, so we may also affect Elasticsearch's responsiveness by doing such operations. Plus, it is just good practice to have a backup and snapshot policy for your clusters.
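For snapshots taken by Elasticsearch itself (as opposed to VM-level backups), the snapshot REST API is the usual route. The sketch below only builds the two HTTP requests involved, registering a shared file system repository and creating a snapshot, without sending them. The helper names, the repository name my_backup, the snapshot name, the URL and the location path are all placeholders assumed for this example.

```python
import json
from urllib.request import Request


def register_fs_repository(base_url, repo, location):
    """Build a PUT request that registers a shared file system repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return Request(
        f"{base_url}/_snapshot/{repo}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_snapshot(base_url, repo, snapshot, indices=None):
    """Build a PUT request that snapshots the given indices (or all of them)."""
    body = {}
    if indices:
        # The snapshot API accepts a comma-separated list of index names.
        body["indices"] = ",".join(indices)
    return Request(
        f"{base_url}/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = create_snapshot("http://localhost:9200", "my_backup", "snapshot_1")
    print(req.get_method(), req.full_url)
```

To actually send a request, pass it to urllib.request.urlopen against a reachable cluster. The repository location must be a path that every node can access.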

Elasticsearch disk usage depends on each use case. We recommend running stress and performance tests on the server to understand how much disk we need to allocate for the cluster to work well. As with CPU and memory, some cloud solutions can become pricey as you increase the disk allocation.

Network

Configuring the network is usually straightforward. There are plenty of possible configurations depending on which cloud provider you choose and what your needs are. You can share the network with the host, or create an independent network to use on your VM.

You may consider creating a virtual private network (VPN) to isolate the cluster, as well as to secure it.

The Perils of Virtualization

In addition to the areas outlined above, there are a few other places where we can run into trouble running Elasticsearch in a virtualized environment.

As an example, we can look at one of the latest bugs fixed in Ubuntu. A simple bug in the Ubuntu kernel (version 3.13) was causing a failure in the transport connection thread on EC2 when the network load increased. Consequently, Elasticsearch indexing, query operations and administrative commands started to fail on EC2 instances running Ubuntu.

The problem was caused by a combination of gather-scatter and the maximum transmission unit (MTU) limit on the network interfaces. The solution was either to update Ubuntu's kernel version and restart the EC2 instance, or to disable gather-scatter.
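On Linux, scatter-gather offload is commonly toggled per interface with ethtool. The sketch below builds that command and, optionally, runs it. The helper names and the interface name eth0 are assumptions for this example, and running the command requires root privileges and an installed ethtool binary.

```python
import subprocess


def disable_scatter_gather_cmd(interface):
    """Return the ethtool command that turns scatter-gather off."""
    return ["ethtool", "-K", interface, "sg", "off"]


def disable_scatter_gather(interface):
    """Run the command; raises CalledProcessError if ethtool fails."""
    subprocess.run(disable_scatter_gather_cmd(interface), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so nothing changes by accident.
    print(" ".join(disable_scatter_gather_cmd("eth0")))
```

A setting changed this way does not survive a reboot, so it would need to be reapplied at boot if upgrading the kernel is not an option.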

Another example is the problem that we might encounter when working in a cluster with limited resources on the VMs and losing one of the nodes. Shards that were allocated on that specific node will be relocated to another node, without any check to see whether the new node has enough resources to handle the new shards. In order to limit this problem, we can use forced awareness. Forced awareness allows us to force the allocation of new shards into specific zones that we define in the configuration.
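Forced awareness is driven by two cluster settings, and the sketch below assembles them into a body suitable for the cluster update settings API (PUT /_cluster/settings). The function name, the attribute name zone and the zone values are placeholders assumed for this example. Each node must also declare its own value for the attribute in its configuration, and the exact node-level setting name depends on the Elasticsearch version.

```python
import json


def forced_awareness_settings(attribute, values):
    """Build a cluster settings body that enables forced awareness.

    `attribute` is the node attribute to be aware of (for example "zone");
    `values` lists every value the cluster should expect for it.
    """
    prefix = "cluster.routing.allocation.awareness"
    return {
        "persistent": {
            f"{prefix}.attributes": attribute,
            f"{prefix}.force.{attribute}.values": ",".join(values),
        }
    }


if __name__ == "__main__":
    print(json.dumps(forced_awareness_settings("zone", ["zone1", "zone2"]), indent=2))
```

With both zones listed, losing every node in one zone leaves the affected replicas unassigned instead of piling them onto the surviving zone.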

Finally, as it is very common to have more than one VM on the same hardware, we can use shard allocation awareness to avoid the risk of losing data in a virtualized environment by preventing primary and replica shards from being located on the same hardware, rack or zone. We can then force each replica shard to be allocated on another VM that is not on the same hardware as the primary one.
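To make the risk concrete, the sketch below takes a list of shard copies and a mapping from VM to physical host, and reports any shard whose copies share hardware. It is a standalone illustration of the rule, not Elasticsearch's allocation logic. The function name, the input shapes and the node and host names are all hypothetical.

```python
from collections import defaultdict


def copies_sharing_hardware(shard_copies, node_to_host):
    """Find shards with more than one copy on the same physical host.

    shard_copies: iterable of (index_name, shard_number, node_name) tuples,
                  one per primary or replica copy.
    node_to_host: dict mapping each node (VM) name to its physical host.
    Returns {(index_name, shard_number): {host: [node names]}}.
    """
    by_shard = defaultdict(lambda: defaultdict(list))
    for index, shard, node in shard_copies:
        by_shard[(index, shard)][node_to_host[node]].append(node)

    violations = {}
    for shard_id, by_host in by_shard.items():
        for host, nodes in by_host.items():
            if len(nodes) > 1:
                violations.setdefault(shard_id, {})[host] = sorted(nodes)
    return violations


if __name__ == "__main__":
    hosts = {"vm-a": "host-1", "vm-b": "host-1", "vm-c": "host-2"}
    copies = [("logs", 0, "vm-a"), ("logs", 0, "vm-b"),
              ("logs", 1, "vm-a"), ("logs", 1, "vm-c")]
    print(copies_sharing_hardware(copies, hosts))
```

In this sample layout, both copies of shard 0 sit on VMs backed by host-1, so a single hardware failure would take out the primary and its replica together.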

Conclusion

There are many different possibilities for using Elasticsearch in a virtualized environment. Choosing which is the best will involve analyzing and deciding on some technical and financial tradeoffs. Consider the best choice for your solution: you want to have a configuration that allows you to use all the resources available not only effectively, but also efficiently.