December 14, 2016 Engineering

State of the official Elasticsearch Java clients

By Luca Cavanna

Java programmers have two choices when communicating with Elasticsearch: they can use either the REST API over HTTP, or the internal Java API used by Elasticsearch itself for node-to-node communication.

So, what's the difference between these two APIs? When a user sends a REST request to an Elasticsearch node, the coordinating node parses the JSON body and transforms it into its corresponding Java object. From then on, the request is sent to other nodes in the cluster in a binary format (the Java API) using the transport networking layer. A Java user uses the Transport Client to build these Java objects directly in their application, then makes requests using the same binary format passed across the transport layer, which skips the JSON parsing step that REST requires.

What are the problems with this approach?

This solution is quite powerful, and didn’t require us to write specific Java client code for Elasticsearch, as the Java API was already used and maintained internally. The Java API is also theoretically more performant than REST, as it skips the parsing step and allows clients to use the binary protocol. Benchmarks, however, show that the performance of the HTTP client is close enough to that of the Transport Client that the difference can be pretty much disregarded.

Over time, we came to realize that the Java API has some downsides:

Backwards compatibility

We are very careful with backwards compatibility on the REST layer where breaking changes are made only in major releases. On the other hand, we make breaking changes to Elasticsearch’s internal classes all the time, which is necessary in order to move the project forward. Those changes result in changes to the binary format, which we compensate for by having version-specific serialization code.  It is this compatibility layer that allows you to do a rolling upgrade of an Elasticsearch cluster.
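To make the idea of version-specific serialization code more concrete, here is a minimal sketch using only the JDK. Everything in it is hypothetical and for illustration only: `PingRequest`, its fields, and the wire versions `V1` and `V2` are not real Elasticsearch classes or version ids. It assumes a field that was added in a later wire version and is only written or read when the peer understands that version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical request type, not a real Elasticsearch class.
final class PingRequest {
    // Illustrative wire versions, not real Elasticsearch version ids.
    static final int V1 = 1;
    static final int V2 = 2; // assumed: V2 added the timeoutMillis field
    static final long DEFAULT_TIMEOUT_MILLIS = 30_000L;

    final String nodeId;
    final long timeoutMillis;

    PingRequest(String nodeId, long timeoutMillis) {
        this.nodeId = nodeId;
        this.timeoutMillis = timeoutMillis;
    }

    // Serialize for a peer that speaks the given wire version.
    byte[] writeTo(int wireVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(nodeId);
            if (wireVersion >= V2) {
                // Older peers do not know this field, so it is only sent to V2+.
                out.writeLong(timeoutMillis);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialize bytes that were written for the given wire version.
    static PingRequest readFrom(byte[] data, int wireVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String nodeId = in.readUTF();
            // When the field is absent on the wire, fall back to a default.
            long timeout = wireVersion >= V2 ? in.readLong() : DEFAULT_TIMEOUT_MILLIS;
            return new PingRequest(nodeId, timeout);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, a request sent to a `V1` peer silently loses its timeout, which mirrors how an older participant limits the features that are available across the connection.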

That said, running a mixed version cluster is something we only recommend during the upgrade process, not as the status quo. Having older versions of nodes or clients in the cluster limits the support of newer features, as the older client simply doesn't know how to write or read requests in the newer binary format.  

When you upgrade your cluster, you should upgrade all nodes and clients to have the same version. The requirement to upgrade all Java clients makes upgrades harder, because it affects all the applications that communicate with Elasticsearch. And all of those changes we made to the internal Java API? Your application has to be adapted to cope with them. Fortunately, Java has a compiler which complains when an interface has changed, so updating your Java app shouldn't be too complicated. That said, it can still be a pain in the neck to do for every upgrade.

The REST interface is much more stable and can be upgraded out of step with the Elasticsearch cluster.

JVM version

We also recommend that the client and the server are on the same Java version. This used to be a strict requirement before Elasticsearch 2.0, when we used Java serialization for exceptions.  These days, having exactly the same Java version probably isn't as important as it used to be, but given how low level the binary format of the Java API is, it is advisable to use the same JVM version on all nodes and clients.

The REST client can use the same version of the JVM that is used by your application.

Dependencies

The Java API is not published as a separate artifact, which means that your project has to depend on the whole of Elasticsearch, including all the server code that is not really needed on the client side. This means you have some strange dependencies like Lucene and Log4j 2, which you may not need in your application and which can actually end up conflicting with libraries that you have on your classpath. This problem has been reduced with the recent removal of the Guava dependency, but it still exists.

The low-level REST client today has minimal dependencies: Apache HTTP Async Client and its transitive dependencies (Apache HTTP Client, Apache HTTP Core, Apache HTTP Core NIO, Apache Commons Codec and Apache Commons Logging).

Security

Ideally, we would like to have a single entry-point to the cluster for users: the REST API, which can be secured via HTTPS. The transport layer should only be used for internal node-to-node communication. Knowing that users can only enter via the REST layer and not via the transport layer would greatly simplify how we write code, give us more freedom to add new features and also simplify how an Elasticsearch cluster can be secured.

We reached the point where the disadvantages of the Java API outweighed its main benefit, which is not having to write and maintain a separate client.

Low-level Java REST client

This is why we released our first Java REST Client with version 5.0.0. It was the first step towards making the Java client work in the same way as all the other language clients. The first release only included what we call a low-level client, which has the following features:

  • compatibility with any Elasticsearch version
  • load balancing across all available nodes
  • failover in case of node failures and upon specific response codes
  • failed connection penalization
  • persistent connections
  • trace logging of requests and responses
  • optional automatic discovery of cluster nodes (also known as sniffing)

The documentation is available here.
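As a rough illustration of how load balancing, failover and failed connection penalization can fit together, here is a minimal JDK-only sketch. It is not the client's actual implementation: the `HostSelector` class, its method names, and the doubling penalty are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical selector for illustration; the real client's internals may differ.
final class HostSelector {
    private final List<String> hosts;
    private final Map<String, Long> deadUntilMillis = new HashMap<>();
    private final Map<String, Integer> failures = new HashMap<>();
    private final long baseTimeoutMillis;
    private int next = 0;

    HostSelector(List<String> hosts, long baseTimeoutMillis) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        this.hosts = new ArrayList<>(hosts);
        this.baseTimeoutMillis = baseTimeoutMillis;
    }

    // Round-robin over hosts that are not currently penalized. If every host
    // is penalized, return the one whose penalty expires first.
    synchronized String nextHost(long nowMillis) {
        for (int i = 0; i < hosts.size(); i++) {
            String candidate = hosts.get(next);
            next = (next + 1) % hosts.size();
            Long deadUntil = deadUntilMillis.get(candidate);
            if (deadUntil == null || deadUntil <= nowMillis) {
                return candidate;
            }
        }
        // Reaching this point means every host has a penalty entry.
        String soonest = hosts.get(0);
        for (String host : hosts) {
            if (deadUntilMillis.get(host) < deadUntilMillis.get(soonest)) {
                soonest = host;
            }
        }
        return soonest;
    }

    // Penalize a failed host; the penalty doubles with each consecutive failure
    // (assumed policy, capped at 2^10 times the base timeout).
    synchronized void markFailure(String host, long nowMillis) {
        int count = failures.merge(host, 1, Integer::sum);
        long penalty = baseTimeoutMillis << Math.min(count - 1, 10);
        deadUntilMillis.put(host, nowMillis + penalty);
    }

    // A successful request clears the host's failure history.
    synchronized void markSuccess(String host) {
        failures.remove(host);
        deadUntilMillis.remove(host);
    }
}
```

A caller would ask for `nextHost` before each request, call `markFailure` when the connection fails or a response code signals that the node should be avoided, and retry on the next host. Time is passed in explicitly here only to keep the sketch easy to test.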

It is called the low-level client because it does little to help Java users build requests or parse responses. It handles path and query-string construction for requests, but it treats JSON request and response bodies as opaque byte arrays which have to be handled by the user.
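The division of labour described above can be sketched with the JDK alone. The `RawRequest` class below is hypothetical and is not the low-level client's API; it only shows the shape of the idea: the helper assembles the path and an encoded query string, while the JSON body is whatever bytes the caller supplies and is never inspected.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical holder for illustration; not part of the real client.
final class RawRequest {
    final String method;
    final String uri;
    final byte[] body; // opaque to the helper: never parsed or validated

    private RawRequest(String method, String uri, byte[] body) {
        this.method = method;
        this.uri = uri;
        this.body = body;
    }

    // Builds the endpoint plus query string; the JSON body is passed through as bytes.
    static RawRequest of(String method, String endpoint, Map<String, String> params, String jsonBody) {
        StringBuilder uri = new StringBuilder(endpoint);
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            uri.append(separator)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            separator = '&';
        }
        byte[] body = jsonBody == null ? new byte[0] : jsonBody.getBytes(StandardCharsets.UTF_8);
        return new RawRequest(method, uri.toString(), body);
    }

    // Form-style encoding of a query-string key or value.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

With a helper like this, producing the JSON for a search body and parsing the JSON that comes back remain entirely the caller's job, which is the gap a higher-level client is meant to fill.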

The next step is releasing a high-level client that accepts proper request objects, takes care of their marshalling, and returns parsed response objects. We’ve been considering a few different approaches on how to get there. We thought about starting from scratch: having a client with minimal dependencies and its own requests and responses. That would be a nice greenfield project to work on, but it would require a big effort and make migrating applications from the Java API quite a headache.

We decided against this, especially as this approach would considerably delay the first release of the high-level REST client. Instead, the initial version of the high-level Java REST client will still depend on Elasticsearch, like the Java API does today. This will allow us to reuse the existing Java API requests and responses, which will ease migration for users.

We will not implement the existing Client interface, though, because we have plans to improve this interface. Client calls in your application will therefore need to be migrated, but requests and responses should stay exactly the same as today. The first release will not support all the APIs, but we are planning to support the most important ones to start with: index, bulk, get, delete and search.

The Java REST client is the future for Java users of Elasticsearch. Please get involved and try out the high-level client as soon as it becomes available, as your feedback will help us improve it faster. As soon as the REST client is feature complete and mature enough to replace the Java API entirely, we will deprecate and finally remove the transport client and the Java API.