Cloud Enterprise - The Architecture

In today's blog post we would like to give you an overview of Elastic Cloud Enterprise and its architecture.

Introduction

We'll start by describing what Elastic Cloud Enterprise is and how it differs from our current Software-as-a-Service offering — Elastic Cloud.

Elastic offers a hosted version of the Elastic Stack named Elastic Cloud. It allows you to run Elasticsearch and Kibana in the cloud. No need to set up the infrastructure or work out the management details. Provisioning and scaling clusters is just a few clicks away. Behind the scene the clusters are hosted on AWS. We've had a lot of great traction with this offering to date — companies of various sizes and profiles love the ease of use, security, and having the latest version of Elasticsearch in a monitored and managed cloud environment.

elastic-cloud-enterprise.svgUsing a public cloud is a bit trickier for large enterprises. Either because enterprises deal with regulated or sensitive data that cannot leave internal networks; or because of the investments they have already made in existing on premises infrastructure. Yet, the rapid adoption of the Elastic Stack within different lines of business within these enterprises leads to proliferation of separate clusters managed by different teams. And consequently to a zoo of versions, configurations, and usage patterns.

Centralizing the management of these clusters can not only enforce uniform versioning, data governance, backup, and user management policies but also reduce the total cost by increasing the hardware utilization. As we can see, an enterprise with a large number of Elasticsearch installations can hugely benefit from a centralized cloud-like approach, such as the one that is present in Elastic Cloud.

We happen to have the right solution to address the challenge — we have decided to package our SaaS platform and to make it available as a product. Enter Elastic Cloud Enterprise.

Architecture

Elastic Cloud Enterprise shares most of its codebase with our Elastic Cloud SaaS offering. The key tenets of the architecture are:

  • Service-oriented architecture
  • Containerization using Docker
  • Reliance on ZooKeeper for cluster-state coordination
  • No assumptions about the underlying machines or VMs

Let us discuss the points in more detail.

Services

We have avoided the monolithic approach from day one. The service-oriented architecture has various benefits.

It allows us to scale the platform easily. It supports the notion of different services having different reliability and performance requirements as each service can be scaled separately.

The services have well-defined behavior accessible via an API. This eases the operational management and allows us to change and improve one service without affecting all the other services.

Each service is deployed independently in its own Docker container. This, combined with fine-grained permissions to read and write application state, makes the whole installation more secure. Even if a service is compromised, the damage is contained to a single container plus part of the application state. For example, in our cloud service we assume that any Elasticsearch cluster node can be compromised at any time due to a yet undiscovered vulnerability. Even if attackers can compromise a cluster node they cannot break out of their containers and the host that hosts this container. They can write no application state and they can read only the part of the state that is related to their own cluster.

Diagram - Core Services and Connections

The above diagram depicts the core services and the connections between them.

Proxy

Proxy is the first component that a user's request hits. It maps a cluster id passed in the request url to the container to the actual cluster nodes. The association of cluster id to a container is stored in ZooKeeper, but the proxy caches it. That means that even in the rare event of ZooKeeper downtime the platform can still service the requests to existing clusters.

The Proxy is intelligent — if you have a highly available cluster, so that your nodes are spread across two or three availability zones, and if one of the zones goes down then the proxy will not route any requests there. It keeps track of the state and availability of the zones.

It helps with no-downtime scaling and upgrades. Before we perform an upgrade a snapshot is taken. Then new nodes with new configuration or new quota are spun up. The data is migrated to the new nodes using standard Elasticsearch features. Finally, when the migration is complete the proxy switches the traffic to the new nodes and disconnects the old ones.

Allocator

Allocators let you scale the Cloud Enterprise installation. They run on all the machines that we want to host the Elasticsearch and Kibana nodes on. Containers with Elasticsearch cluster nodes are then run on the machines managed by allocators.

An allocator advertises the resources of the underlying machine in ZooKeeper. It controls the lifecycle of cluster nodes — it creates new containers and starts Elasticsearch nodes in these containers when asked; it makes sure to restart a node in case it becomes unresponsive; finally it kills a node if it's no longer needed.

We use Docker containers to guarantee shares of resources for the underlying clusters. This allows us to mitigate the noisy neighbor effect where one busy cluster can overwhelm the entire host.

Constructor

An allocator is an agent — it manages the container and Elasticsearch nodes, but it only responds to explicit requests. Basically, it needs a service to direct and tell it what to do. The constructor provides such a service.

The constructor monitors new requests from admin console (see below), calculates what needs to be changed and writes it to ZooKeeper nodes which are monitored by the allocators. Its job is also to assign cluster nodes to proper allocators.

The constructor is location aware, meaning that it understands the topology of the allocators within a region. If you select a cluster plan with high-availability it will place cluster nodes within different availability zones to ensure that the cluster can survive downtime of the whole zone. Additionally, it maximizes the utilization of underlying allocators to make sure that we don't need to spin up extra hardware for new clusters.

Cloud UI

Provides the UI and API to manage and monitor the clusters. It consists of both the administrative part — which can be used to monitor the overall health of the platform and managed clusters, and the end user's part, which users can use to provision and configure their clusters.

Containerization

All of the services are deployed as Docker containers. This simplifies the operational effort plus makes it easy to provision a similar environment for development and staging. Each cluster node is also run within a Docker container — to make sure that all of the nodes have access to a guaranteed share of the host resources.

Containerization is great for security — we assume that any cluster can be compromised and give its container no access to the platform. The same is true for the services — each one can only read and/or write these parts of the system state that are relevant to it. Even if some services are compromised the attacker won't get hold of the keys to the rest of them and will not compromise the whole platform.

Stunnels

We want the Docker containers to communicate securely with one another. Usually the way to provide secure communication is to use Transport Layer Security. Not all the services or components that we use offer a native support for TLS. Stunnel to the rescue. We use it to tunnel all the traffic between the containers and hence to make sure that it is not possible to eavesdrop the conversation even when someone else has access to the underlying cloud or network infrastructure.

ZooKeeper

ZooKeeper is a distributed, strongly consistent data store. It offers a file system-like structure, where each node is both a folder (it can have sub items) and a file (it holds data). These nodes are called znodes to differentiate them from the physical nodes that ZooKeeper runs on. ZooKeeper is designed to remain consistent even in the event of network partitions — a write operation is rejected unless it can be confirmed by a majority of ZooKeeper servers. The writes are linear. It is possible to set watches on znodes so ZooKeeper can serve as an event bus — one service can notify another by writing to an observed znode.

Znodes can have associated Access Control Lists (ACLs). We use this feature to give a fine-grained access to the system state for the various services. For example, the constructor can write cluster plans, but the allocators can only read them.

ZooKeeper is central to our solution as it is the source of truth — it stores the state of the Elastic Cloud Enterprise installation and the state of all of the clusters running on that installation. It is also the event bus coordinating all the other services.

No assumptions about the underlying infrastructure

Because all the services are containerized we can support a wide range of configurations. Elastic Cloud Enterprise can be deployed on public or private clouds or even on bare metal hardware.

Our only prerequisites are to use Linux (RHEL, CentOS or Ubuntu) with a recent kernel that can run Docker. The installation script needs either internet connectivity to access the Elastic Docker registry or access to an internal Docker registry with cached images of our services. We rely on the Docker daemon and we deliver the services as Docker images, but we do not rely on any specific container orchestration solution.

Not taking an opinionated view on the underlying infrastructure allows Elastic Cloud Enterprise to work with a wide range of deployment strategies.

Summary

At this point we've released two private alphas. We are getting great feedback about the technology. If you want to take part in the testing please fill in this form. We hope to release the next version shortly.

Elastic Stack is becoming increasingly popular. It is being used for new and innovative use cases, which results in a proliferation of clusters. This poses new challenges for enterprises where centralizing the management and creating internal centers of excellence may provide great benefits.

With Elastic Cloud we have a battle-tested technology that makes it not only possible, but also easy to manage thousands of clusters. With Elastic Cloud Enterprise we bring this technology and experience to your own data center or internal cloud.