UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.
The Zen discovery plugin might seem like magic, but replacing it is actually not that hard. Getting down and dirty with Zen might very well pay off in your particular setup.
The responsibility of the discovery plugin is to provide an Elasticsearch node with up-to-date information about the cluster. This boils down to the following behaviour:
- On non-master nodes:
  - Elect a master when no master is known.
  - Receive the cluster state from the master and update the local ClusterService.
- On the master:
  - Publish changes to the cluster state.
  - Detect nodes leaving and joining the cluster and update the cluster state accordingly.
This is done by providing an implementation of the discovery interface and communicating with the ClusterService. The essential parts of the discovery interface are:
DiscoveryNode localNode();
void publish(ClusterState clusterState);
The localNode method holds no surprises: it returns the DiscoveryNode representing the local instance of Elasticsearch. The publish method is where half the fun happens: when the ClusterService on the master node receives an updated state, it uses the publish method to notify the other nodes. How and where to publish the ClusterState is not specified, but Elasticsearch relies on all other nodes participating in the cluster receiving the update. In ZenDiscovery, for example, the publish method first updates its internal state with the new cluster state (e.g. which nodes to send heartbeats to) and then transmits the cluster state as a message via the transport protocol to all other nodes. Depending on your motivation for replacing ZenDiscovery, you could for instance update a global third-party directory with the cluster state instead of transmitting it directly. Bear in mind that it is important that all nodes have the same view of the cluster; whatever method you use must transmit the cluster state reliably and efficiently.
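To make the two steps of publishing concrete, here is a minimal sketch in plain Java. It does not use the real Elasticsearch classes: SketchClusterState, SketchTransport and SketchDiscovery are hypothetical stand-ins invented for illustration, and returning the list of unacknowledged nodes is an assumption of this sketch (the interface method shown above returns void).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for ClusterState: a version plus the ids of the known nodes.
final class SketchClusterState {
    final long version;
    final List<String> nodeIds;

    SketchClusterState(long version, List<String> nodeIds) {
        this.version = version;
        this.nodeIds = Collections.unmodifiableList(new ArrayList<>(nodeIds));
    }
}

// Hypothetical transport: delivers a state to one node and reports whether it was acknowledged.
interface SketchTransport {
    boolean send(String nodeId, SketchClusterState state);
}

// Hypothetical discovery implementation showing only localNode and publish.
final class SketchDiscovery {
    private final String localNodeId;
    private final SketchTransport transport;
    private volatile SketchClusterState current;

    SketchDiscovery(String localNodeId, SketchTransport transport) {
        this.localNodeId = localNodeId;
        this.transport = transport;
    }

    String localNode() {
        return localNodeId;
    }

    SketchClusterState currentState() {
        return current;
    }

    // Returns the ids of the nodes that did not acknowledge the new state.
    List<String> publish(SketchClusterState state) {
        // Step 1: update internal state first (e.g. which nodes to monitor).
        current = state;

        // Step 2: transmit the state to every other node in the cluster.
        List<String> unacknowledged = new ArrayList<>();
        for (String nodeId : state.nodeIds) {
            if (nodeId.equals(localNodeId)) {
                continue; // the master does not send the state to itself
            }
            if (!transport.send(nodeId, state)) {
                unacknowledged.add(nodeId);
            }
        }
        return unacknowledged;
    }
}
```

Swapping the transport for one that writes to a shared directory would be one way to explore the third-party directory idea without touching the publish logic itself.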
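The other half of the fun is what the discovery plugin does on its own. As the discovery interface extends LifecycleComponent, the plugin is started and stopped together with the node, which lets it run its own background tasks such as: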
- node discovery in the event of becoming the master
- master election in case the master goes offline
- network/node fault detection to ensure other nodes are reachable
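As an illustration of the last item, the following is a minimal sketch of timeout-based fault detection. SketchFaultDetector is a hypothetical class, not part of Elasticsearch, and the sketch assumes that some other component sends the pings and reports each response together with a timestamp. Time is passed in explicitly so the logic stays deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical fault detector: a node is considered failed when no ping
// response has been recorded for it within the configured timeout.
final class SketchFaultDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeenMillis = new HashMap<>();

    SketchFaultDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a node answers a ping.
    void recordPingResponse(String nodeId, long nowMillis) {
        lastSeenMillis.put(nodeId, nowMillis);
    }

    // Removes and returns (sorted by id) the nodes whose last response is older than the timeout.
    List<String> removeTimedOut(long nowMillis) {
        List<String> failed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeenMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > timeoutMillis) {
                failed.add(entry.getKey());
                it.remove(); // stop monitoring the node once it is reported as failed
            }
        }
        Collections.sort(failed);
        return failed;
    }
}
```

On the master, the nodes returned by removeTimedOut would be the ones to drop from the cluster state before publishing it again.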
We have now briefly discussed what a discovery plugin might do and what is expected of it: the what and the how. Let’s discuss the why. The short answer to why you would implement another discovery plugin is that you are not happy with ZenDiscovery. But why are you not happy with it? After all, it is the most commonly used discovery plugin. Why do you think you can make a better plugin than the core developers did? The answer lies in the question. The solution for everybody has to work with any network in any environment. Hence it can only do discovery through multicast or a preconfigured list, it has to handle leader election on its own (one of the tough problems in distributed systems), and its node failure detection is limited to ping timeouts. In your environment you may have resources that could greatly simplify one or more of these challenges.
In addition to ZenDiscovery, Elasticsearch is bundled with LocalDiscovery, which assumes all nodes run in the same JVM and hence does node discovery through a singleton object and avoids the network altogether for cluster state publishing. By installing the cloud-aws plugin you can use EC2Discovery, which does node discovery through the EC2 API. Finally, there is the sonian-zookeeper-plugin, which uses ZooKeeper for node discovery and leader election. At Found, we have done something similar to the sonian plugin to improve leader election and node discovery using ZooKeeper, but it is currently not publicly available as it is too tightly integrated with our other infrastructure.
If you are curious about leader election in general and the challenges of achieving consensus in a distributed system, you should read up on Paxos. That being said, it is usually best not to reinvent the wheel. Depending on the resources available in your environment, you can usually find someone who has already made a good implementation, such as the Curator framework's leader latch if you are already using ZooKeeper.
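To give a feel for why a coordination service simplifies election, here is a minimal in-memory sketch of the classic ZooKeeper election recipe: every participant registers under an increasing sequence number, and the participant holding the lowest live sequence number is the leader. SketchElection is a hypothetical class that only imitates the idea in a single process. It is not Curator's API, and it leaves out everything that makes the real thing hard, such as sessions, watches and network partitions.

```java
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical in-memory imitation of "lowest sequence number wins" election.
final class SketchElection {
    // Sorted by sequence number, so the first entry is always the current leader.
    private final TreeMap<Long, String> participants = new TreeMap<>();
    private long nextSequence = 0;

    // Registers a participant and returns the sequence number it was assigned.
    long join(String nodeId) {
        long sequence = nextSequence++;
        participants.put(sequence, nodeId);
        return sequence;
    }

    // Removes a participant, e.g. because its session expired or it shut down.
    void leave(long sequence) {
        participants.remove(sequence);
    }

    // The leader is the participant with the lowest sequence number, if any.
    Optional<String> leader() {
        if (participants.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(participants.firstEntry().getValue());
    }
}
```

Note that a node that leaves and rejoins is placed at the back of the queue, so leadership does not flap back to it as soon as it returns.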