January 14, 2015

Interfacing with Elasticsearch: Picking a Client

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

When you start using Elasticsearch, it's easy to get confused by all the different ways to connect to it and by why one of them should be preferred over another. In this article we'll provide an overview of the different client types available and give some pointers on when one should be chosen over another.

Introduction

Having lots of options when picking a client can be both great and confusing. It’s great because there’s likely to be a client out there that meets your requirements, but it’s confusing because there are a lot of clients to choose from, with a lot of overlap between them, and a decision still needs to be made.

Many languages have official clients written and supported by Elasticsearch, which make a great default choice, but knowing the available options and their different characteristics makes the decision a lot easier.

Different Client Types

By default, Elasticsearch comes with support for two protocols:

HTTP: A RESTful API
Native Elasticsearch binary protocol: a custom protocol developed by Elasticsearch for inter-node communication.

It’s also possible to extend Elasticsearch with support for different protocols via plugins. There are a few official plugins around.

Using the native protocol from anything other than Java is not recommended, as it would entail implementing a lot of custom serialization.

Transport Client

The Transport client is one of the native ways to connect to Elasticsearch. It is part of the official Elasticsearch distribution and thus requires your client to be written in Java (or at least run on the JVM) as well.

It’s very fast and runs natively on the JVM. The serialization is efficient and there’s little to no overhead in messages and operations sent to/from your Elasticsearch instances. It does require keeping the Elasticsearch server and client versions somewhat synchronized. Prior to Elasticsearch 1.0, the client and server had to run exactly the same version, but newer versions (1.0 and later) support communication between different versions. It’s also beneficial to run the same JVM update version on both the clients and the servers, because of how exceptions are serialized and other potential minor differences between updates.

There’s currently no support for encryption or authentication, but Shield has been announced to cover these needs. For using the Transport client on Found.no hosted clusters, you can use our custom transport module, which takes care of encryption, authentication and keep-alives.

Node Client

The Node client is very similar to the Transport client: it’s part of the official Elasticsearch distribution, requires your client to run on the JVM and so on, but there are some significant differences as well.

Where the cluster is largely indifferent about whether or not a Transport client has connected to one of its nodes, a Node client is considered part of the cluster. This means that the presence of the Node client is stored in the cluster state, and that all the other nodes in the cluster will attempt to establish a few TCP connections back to the client. This may be a significant drawback if your cluster is large or you’re using several clients.

This may seem a bit excessive, but it’s currently required in order to enable the server nodes to propagate changes to the cluster state to the client. The end result is that the Node client always has an up-to-date cluster state and a connection to every other node in the Elasticsearch cluster, which enables it to route operations locally, be the coordinator of its own requests and so on. This skips a network hop for each and every request and results in less work for the remaining nodes in the cluster.

Having the cluster state locally also costs a bit of memory, as well as CPU resources to apply incoming updates to the cluster state. Additionally, the Node client loads quite a few more modules from Elasticsearch, which increases the memory footprint.

HTTP Clients

HTTP is well supported in most programming languages, and this is the most common way to connect to Elasticsearch. If you’re going with HTTP, there is still one important choice to make: use an existing Elasticsearch HTTP-based library, or simply create a small wrapper for the operations you need using the HTTP client of your choice.

Since HTTP is a general-purpose protocol and supports a wide variety of use cases, a few important things need to be implemented by the client: connection pooling and keep-alives. Connection pooling is required in order to avoid paying the TCP connection establishment cost for each request. This is even more important if the client uses HTTPS, which comes with an additional encryption handshake cost. Connection pooling often requires keep-alive support as well, since we would like to avoid connections being broken due to idling.
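To make the idea of a small wrapper with connection reuse concrete, here is a minimal sketch in Python using only the standard library. It keeps a single persistent connection, which is the simplest form of pooling, and it is not thread-safe. The names TinySearchClient and build_search_request are hypothetical, and the host and port are assumed to be a local node on the default port 9200.

```python
import http.client
import json


def build_search_request(index, query):
    """Return (method, path, body, headers) for a search against one index."""
    body = json.dumps({"query": query})
    headers = {"Content-Type": "application/json"}
    return "POST", "/%s/_search" % index, body, headers


class TinySearchClient:
    """Hypothetical wrapper that reuses one TCP connection across requests."""

    def __init__(self, host="localhost", port=9200, timeout=10):
        # Nothing is connected yet: http.client opens the socket lazily on
        # the first request. It speaks HTTP/1.1, where connections are
        # persistent by default, so later requests reuse the same socket.
        self._conn = http.client.HTTPConnection(host, port, timeout=timeout)

    def search(self, index, query):
        method, path, body, headers = build_search_request(index, query)
        self._conn.request(method, path, body=body, headers=headers)
        response = self._conn.getresponse()
        # The body must be read in full before the connection can be reused.
        payload = response.read()
        return response.status, json.loads(payload)

    def close(self):
        self._conn.close()
```

With a node running, client = TinySearchClient() followed by repeated client.search("logs", {"match_all": {}}) calls would pay the connection establishment cost only once, as long as the server keeps the connection open. A production wrapper would also need several connections for concurrent use, retries, and handling of idle connections that the server or a proxy has dropped, which is the kind of detail an existing client library handles for you.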

While it might not be initially obvious that connection establishment is significant, consider that establishing a TCP connection requires a three-way handshake. Put simply, with a ping time of 50 milliseconds, it will take around 75 ms to establish a connection, in addition to the time it takes to acquire and release local resources (handling client ports, connection management, etc.). This is without considering the time it takes to handle the request/response at both ends (e.g. serialization). Without connection pooling, this time is added on top of each and every request. For HTTPS, which we recommend for security and privacy, the connection establishment overhead can sometimes be measured in seconds, which is even more noticeable. Considering the basic advice that end users’ response times have to be under 100 ms in order to be perceived as “instant”, even the non-encrypted overhead makes this limit almost impossible to stay within.
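The arithmetic above can be sketched as a tiny model. It assumes that the ping time is the full round-trip time and that the three-way handshake costs one and a half round trips, which is how the 75 ms figure comes out. TLS handshakes, local resource handling and the request itself are deliberately left out, and the function names are made up for this illustration.

```python
def handshake_ms(ping_ms):
    """Three-way handshake (SYN, SYN-ACK, ACK) modelled as 1.5 round trips."""
    return 1.5 * ping_ms


def total_overhead_ms(ping_ms, requests, reuse_connection):
    """Connection establishment time summed over a number of requests."""
    # With reuse, only the first request pays for a handshake.
    handshakes = min(1, requests) if reuse_connection else requests
    return handshakes * handshake_ms(ping_ms)


print(handshake_ms(50))                                     # 75.0
print(total_overhead_ms(50, 100, reuse_connection=False))   # 7500.0
print(total_overhead_ms(50, 100, reuse_connection=True))    # 75.0
```

Under these assumptions, 100 requests over fresh connections spend 7.5 seconds on handshakes alone, while a reused connection pays the 75 ms once.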

The official (non-Java) clients written and supported by Elasticsearch all use HTTP under the hood to communicate with Elasticsearch. Our general recommendation is to use the official clients that wrap the HTTP API if possible because they take care of handling all these details.

HTTP client implementations can be quite fast, and some of them even compete with the speed of the native protocol. The HTTP API of Elasticsearch is widely used and has considerable community support. Performance depends a lot on the client library, however, and the library often needs to be configured or tweaked to get the most out of it.

Other Protocols

As it’s possible to create a new client interface to Elasticsearch simply by writing a plugin, a lot of other protocols are supported. There are official plugins for using the Memcached and Apache Thrift protocols this way. The degree of encryption supported depends on the client (for example, Thrift supports SSL, but requires some additional setup for both the client and the server).

This system is quite flexible and you can conceivably use any protocol that supports the request/response pattern, but a significant drawback of using these protocols is that there are fewer users using them, so it’s harder to get community support. There’s also the risk that the lesser-used protocols may become deprecated without any guarantee of future support, so users should only use these protocols if this risk is acceptable and the protocol and its corner cases are well known in-house. For these reasons we generally don’t recommend taking this approach.

Conclusion

It’s easy to spend a lot of time figuring out the differences between the myriad of protocols and clients to use with Elasticsearch, but the choice is actually pretty simple: if possible, use a high-performance HTTP client that you are comfortable with or an official language binding.

If you’re using Java, the Transport client should be chosen over the Node client unless the performance gain from using a Node client turns out to be large enough to warrant the additional network complexity. Use benchmarks to verify the performance gains.

When using other non-Java JVM-based languages (e.g. Scala, Clojure, Groovy, JRuby and so on), it might be worthwhile to explore more language-native wrappers on top of either the Transport or Node clients. This may make the interaction with Elasticsearch a lot more natural in the source code. Be aware that a wrapper may not expose all the functionality of the underlying client, which could mean more poking around in order to get to it, so consider the tradeoffs compared to using the official clients here as well.