24 June 2014

Java Clients for Elasticsearch

DON'T PANIC. This article contains outdated information. Get the latest from The State of the Official Elasticsearch Java Clients post, instead.

One of the important aspects of Elasticsearch is that it is programming language independent. All of the APIs for indexing, searching and monitoring can be accessed using HTTP and JSON so it can be integrated in any language that has those capabilities. Nevertheless Java, the language Elasticsearch and Lucene are implemented in, is very dominant. In this post I would like to show you some of the options for integrating Elasticsearch with a Java application.
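Before turning to the dedicated clients, here is a minimal sketch of what "HTTP and JSON" means in practice, using only the HTTP client that ships with Java 11 and later. The class name RestSearchSketch, the index name jug and the address http://localhost:9200 are assumptions for illustration; the request is only built, not sent, so the example runs without a cluster.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class RestSearchSketch {

    // Builds (but does not send) a search request against the REST API.
    static HttpRequest searchRequest(String baseUrl, String index, String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index + "/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // Assumed query: find documents matching "java" in a hypothetical index "jug"
        String body = "{\"query\": {\"query_string\": {\"query\": \"java\"}}}";
        HttpRequest request = searchRequest("http://localhost:9200", "jug", body);
        // Sending it would be a call to HttpClient.newHttpClient().send(...)
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Everything the clients in this post do on top of this is convenience: building the JSON, handling connections and mapping the results back to Java objects.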

The Native Client

The obvious first choice is to look at the client Elasticsearch provides natively. Unlike other solutions, there is no separate jar file that contains just the client API; instead, you add the whole Elasticsearch application as a dependency. This is partly due to the way the client connects to Elasticsearch: it doesn’t use the REST API but joins the cluster as a node. This node normally doesn’t hold any data but is aware of the state of the cluster.

The node client integrates with your Elasticsearch cluster

On the right side we can see two normal nodes, each containing two shards. Each node of the cluster, including our application’s client node, has access to the cluster state as indicated by the cylinder icon. That way, when requesting a document that resides on one of the shards of Node 1 your client node already knows that it has to ask Node 1. This saves a potential hop that would occur when asking Node 2 for the document that would then route your request to Node 1 for you.
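To make the saved hop concrete, here is a minimal sketch of the lookup a client node can do locally once it knows the cluster state. This is not Elasticsearch code: the class ClusterStateSketch, the node names and the use of String.hashCode() are assumptions for illustration, and Elasticsearch uses its own hash function and a richer routing table.

```java
import java.util.Map;

final class ClusterStateSketch {

    // shard number -> node that currently holds the shard (hypothetical layout)
    private final Map<Integer, String> shardToNode;
    private final int numberOfShards;

    ClusterStateSketch(Map<Integer, String> shardToNode) {
        this.shardToNode = Map.copyOf(shardToNode);
        this.numberOfShards = shardToNode.size();
    }

    // Simplified routing: hash of the document id modulo the number of shards
    int shardFor(String documentId) {
        return Math.floorMod(documentId.hashCode(), numberOfShards);
    }

    // With the cluster state at hand the target node is known without asking anyone
    String nodeFor(String documentId) {
        return shardToNode.get(shardFor(documentId));
    }

    public static void main(String[] args) {
        // Two nodes with two shards each, as in the figure
        ClusterStateSketch state = new ClusterStateSketch(
                Map.of(0, "node1", 1, "node1", 2, "node2", 3, "node2"));
        String id = "talk-42";
        System.out.println(id + " -> shard " + state.shardFor(id) + " on " + state.nodeFor(id));
    }
}
```

A client without this map has to send the request to some node and let that node forward it, which is the extra hop described above.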

Creating a client node in code is easy. You can use the NodeBuilder to get access to the Client interface. This then has methods for all of the API functionality, e.g. for indexing and searching data.

    // Start a node that joins the cluster as a client and holds no data
    Client client = NodeBuilder.nodeBuilder()
            .client(true)
            .node()
            .client();

    // Recreate the index from scratch; INDEX is a String constant holding the index name
    boolean indexExists = client.admin().indices().prepareExists(INDEX).execute().actionGet().isExists();
    if (indexExists) {
        client.admin().indices().prepareDelete(INDEX).execute().actionGet();
    }
    client.admin().indices().prepareCreate(INDEX).execute().actionGet();

    // Search all documents, returning only the title and category fields
    SearchResponse allHits = client.prepareSearch(INDEX)
            .addFields("title", "category")
            .setQuery(QueryBuilders.matchAllQuery())
            .execute().actionGet();

You can see that after having the client interface we can issue index and search calls to Elasticsearch. The fluent API makes the code very readable. Note that the final actionGet() call on the operations is caused by the asynchronous nature of Elasticsearch and is not related to the HTTP operation. Each operation returns a Future that provides access to the result once it is available.
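The execute-then-wait pattern can be sketched with the plain java.util.concurrent classes. This is an analogy, not the Elasticsearch API: the class AsyncCallSketch and its execute method are hypothetical stand-ins, and Future.get() plays the role that actionGet() plays in the code above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

final class AsyncCallSketch {

    // Stand-in for a request's execute(): returns at once, the work finishes later
    static CompletableFuture<String> execute(String request) {
        return CompletableFuture.supplyAsync(() -> "response to " + request);
    }

    public static void main(String[] args) throws Exception {
        Future<String> future = execute("search");
        // The caller is free to do other work here while the request is running
        String response = future.get(); // blocks until the result is available
        System.out.println(response);   // prints "response to search"
    }
}
```

Calling the blocking method right after execute(), as the indexing code does, turns the asynchronous call back into a synchronous one, which is often what you want in simple applications.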

Most of the operations are available using dedicated builders and methods but you can also use the generic jsonBuilder() that can construct arbitrary JSON objects for you.

An alternative to the node client we have seen above is the TransportClient. It doesn’t join the cluster but connects to an existing one using the transport module (the layer that is also used for inter-node communication). This can be useful if maintaining the cluster state in your application is a problem, e.g. when you have tight memory constraints or restart your application a lot.

The TransportClient can be created by passing in the addresses (host and transport port) of one or more nodes of your cluster:

    Client client = new TransportClient()
            .addTransportAddress(new InetSocketTransportAddress("localhost", 9300))
            .addTransportAddress(new InetSocketTransportAddress("localhost", 9301));

If the property client.transport.sniff is enabled, the TransportClient also retrieves the addresses of the other nodes in the cluster and uses them in a round-robin fashion.
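Round robin simply means that each request goes to the next node in the list, wrapping around at the end. A minimal sketch of that selection, with the hypothetical class RoundRobinSketch and two assumed local addresses; it says nothing about how the TransportClient implements it internally:

```java
import java.net.InetSocketAddress;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class RoundRobinSketch {

    private final List<InetSocketAddress> nodes;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinSketch(List<InetSocketAddress> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = List.copyOf(nodes);
    }

    // Returns the next node; floorMod keeps the index valid even if the counter overflows
    InetSocketAddress next() {
        int index = Math.floorMod(counter.getAndIncrement(), nodes.size());
        return nodes.get(index);
    }

    public static void main(String[] args) {
        RoundRobinSketch rotation = new RoundRobinSketch(List.of(
                InetSocketAddress.createUnresolved("localhost", 9300),
                InetSocketAddress.createUnresolved("localhost", 9301)));
        for (int i = 0; i < 4; i++) {
            System.out.println(rotation.next().getPort()); // prints 9300, 9301, 9300, 9301
        }
    }
}
```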

The native Java client is the perfect solution if you need to have all of the features of Elasticsearch available. New functionality is automatically available with every release. You can either use the node client that will save you some hops or the TransportClient that communicates with an existing cluster.

If you’d like to learn more about the two kinds of clients you can have a look at this article on using Elasticsearch from Java or this post on networking on the Found blog.

Note: Elasticsearch service providers that have built a highly secure platform and service, e.g. implementing security measures such as ACLs and encryption, do not support unmodified clients. For more details regarding Found’s requirements, see Found Elasticsearch Transport Module.

Jest

If you need a lightweight client in your application (in terms of jar size or memory consumption), there is a nice alternative. Jest provides an implementation of the Elasticsearch REST API using the Apache HttpComponents project.

The API of Jest is very similar to the Elasticsearch API. It uses a fluent API with lots of specialized builders. All of the interaction happens using the JestClient that can be created using a factory:

    JestClientFactory factory = new JestClientFactory();
    factory.setHttpClientConfig(new HttpClientConfig.Builder("http://localhost:9200")
            .multiThreaded(true)
            .build());
    JestClient client = factory.getObject();

When it comes to communicating with Elasticsearch you have two options: you can either write your requests as JSON strings, following the JSON API of Elasticsearch, or you can reuse the builder classes of Elasticsearch. If it’s not a problem to have the Elasticsearch dependency on your classpath, the builders can lead to cleaner code. This is how you recreate an index using Jest, deleting it first if it already exists:

    boolean indexExists = client.execute(new IndicesExists.Builder("jug").build()).isSucceeded();
    if (indexExists) {
        client.execute(new DeleteIndex.Builder("jug").build());
    }
    client.execute(new CreateIndex.Builder("jug").build());

And this is how a search query can be executed.

    String query = "{\n"
            + "    \"query\": {\n"
            + "        \"filtered\" : {\n"
            + "            \"query\" : {\n"
            + "                \"query_string\" : {\n"
            + "                    \"query\" : \"java\"\n"
            + "                }\n"
            + "            }\n"
            + "        }\n"
            + "    }\n"
            + "}";
    Search.Builder searchBuilder = new Search.Builder(query).addIndex("jug").addType("talk");
    SearchResult result = client.execute(searchBuilder.build());

You can see that concatenating the query can become complex so if you have the option to use the Elasticsearch builders you should try it.
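To illustrate why building the structure beats concatenating strings, here is a small sketch that serializes nested maps to JSON using only the JDK. The class JsonSketch is hypothetical and deliberately limited (maps and strings only, no escaping of control characters); it is not a replacement for the Elasticsearch builders or a real JSON library.

```java
import java.util.Map;
import java.util.StringJoiner;

final class JsonSketch {

    // Serializes nested maps; every other value is written as a JSON string
    static String toJson(Object value) {
        if (value instanceof Map) {
            StringJoiner joiner = new StringJoiner(",", "{", "}");
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                joiner.add(quote(String.valueOf(entry.getKey())) + ":" + toJson(entry.getValue()));
            }
            return joiner.toString();
        }
        return quote(String.valueOf(value));
    }

    // Escapes backslashes and double quotes only
    static String quote(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        // The same query as above, expressed as a nested structure
        Object query = Map.of("query",
                Map.of("filtered",
                        Map.of("query",
                                Map.of("query_string",
                                        Map.of("query", "java")))));
        System.out.println(toJson(query));
    }
}
```

The nesting is now checked by the compiler instead of by counting braces and escaped quotes, which is the same benefit the Elasticsearch builder classes give you.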

The really great thing about Jest is that you can use Java Beans directly for indexing and searching. Suppose we have a bean Talk with several properties; we can then index instances of it in bulk in the following way:

    Bulk.Builder bulkIndexBuilder = new Bulk.Builder();
    for (Talk talk : talks) {
        // One index action per bean; Jest serializes the bean to JSON
        bulkIndexBuilder.addAction(new Index.Builder(talk).index("jug").type("talk").build());
    }
    client.execute(bulkIndexBuilder.build());

Given the SearchResult we have seen above we can then also retrieve our talk instances directly from the Elasticsearch results:

    List<Hit<Talk, Void>> hits = result.getHits(Talk.class);
    for (Hit<Talk, Void> hit : hits) {
        Talk talk = hit.source;
        log.info(talk.getTitle());
    }

Besides the execute method we have used so far there is also an async variant that returns a Future.

The structure of the Jest API is really nice, and you will find your way around it quickly. The possibility to index and retrieve Java Beans in your application makes it a good alternative to the native client. But there is also one thing I absolutely don’t like: it throws too many checked exceptions, e.g. a plain Exception on the central execute method of the JestClient. Also, there might be cases where the Jest client doesn’t immediately offer all of the functionality of newer Elasticsearch versions. Nevertheless, it offers a really nice way to access your Elasticsearch instance using the REST API.
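One common way to live with a method that declares a plain Exception is to funnel all calls through a small helper that rethrows checked exceptions as unchecked ones. The following is a sketch of that idea using java.util.concurrent.Callable, whose call() method also declares Exception; the class UncheckedSketch and the choice of IllegalStateException are my assumptions, not part of Jest.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class UncheckedSketch {

    // Runs the call and converts any checked exception into an unchecked one
    static <T> T unchecked(Callable<T> call) {
        try {
            return call.call();
        } catch (RuntimeException e) {
            throw e; // already unchecked, pass it through untouched
        } catch (Exception e) {
            throw new IllegalStateException("client call failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(unchecked(() -> "ok")); // prints "ok"
        try {
            unchecked(() -> { throw new IOException("connection refused"); });
        } catch (IllegalStateException e) {
            // prints "wrapped: connection refused"
            System.out.println("wrapped: " + e.getCause().getMessage());
        }
    }
}
```

With such a helper a Jest call could be written as unchecked(() -> client.execute(action)), so the calling code does not need a try/catch block around every request.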

For more information on Jest you can consult the project documentation on GitHub. There is also a nice article on Elasticsearch at IBM developerWorks that demonstrates some of the features using Jest.

Spring Data Elasticsearch

The Spring Data project is a set of APIs that provide access to multiple data stores with a similar look and feel. It doesn’t try to use one API for everything, so the characteristics of a certain data store can still be available. The project supports many stores, Spring Data JPA and Spring Data MongoDB being among the more popular. Starting with the latest GA release, the implementation of Spring Data Elasticsearch is also officially part of the Spring Data release.

Spring Data Elasticsearch goes even one step further than the Jest client when it comes to indexing Java Beans. With Spring Data Elasticsearch you annotate your data objects with a @Document annotation that you can also use to determine index settings like the name, the number of shards or the number of replicas. One of the attributes of the class needs to be an id, either by annotating it with @Id or by using one of the automatically found names id or documentId. The other properties of your document can come with or without annotations: without an annotation a property is mapped automatically by Elasticsearch, while the @Field annotation lets you provide a custom mapping. The following class uses the standard mapping for speakers but a custom one for the title.

    @Document(indexName = "talks")
    public class Talk {
        @Id
        private String path;
        @Field(type = FieldType.String, index = FieldIndex.analyzed, indexAnalyzer = "german", searchAnalyzer = "german")
        private String title;
        private List<String> speakers;
        @Field(type = FieldType.Date)
        private Date date;
        // getters and setters omitted
    }

There are two ways to use the annotated data objects: either using a repository or using the more flexible template support. The ElasticsearchTemplate uses the Elasticsearch Client and provides a custom layer for manipulating data in Elasticsearch, similar to the popular JdbcTemplate or RestTemplate. The following code indexes a document and uses a GET request to retrieve it again.

    IndexQuery query = new IndexQueryBuilder().withIndexName("talks").withId("/tmp").withObject(talk).build();
    String id = esTemplate.index(query);
    GetQuery getQuery = new GetQuery();
    getQuery.setId(id);
    Talk queriedObject = esTemplate.queryForObject(getQuery, Talk.class);

Note that none of the classes used in this example are part of the Elasticsearch API. Spring Data Elasticsearch implements a completely new abstraction layer on top of the Elasticsearch Java client.

The second way to use Spring Data Elasticsearch is by using a Repository, an interface you can extend. There is a general interface CrudRepository available for all Spring Data projects that provides methods like findAll(), count(), delete(...) and exists(...). PagingAndSortingRepository provides additional support for, what a surprise, paging and sorting.

For adding specialized queries to your application you can extend the ElasticsearchCrudRepository and declare custom methods in it. What might come as a surprise at first: You don’t have to implement a concrete instance of this interface, Spring Data automatically creates a proxy for you that contains the implementation. What kind of query is executed is determined by the name of the method, which can be something like findByTitleAndSpeakers(String title, String speaker). Besides the naming convention you can also annotate the methods with an @Query annotation that contains the native JSON query or you can even implement the method yourself.
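If the idea of an interface without an implementation sounds like magic, the following sketch shows the underlying mechanism with a JDK dynamic proxy. It is a toy and not how Spring Data is implemented: the class DerivedQuerySketch, the TalkRepository interface and the textual "query" it returns are assumptions for illustration, and only method names of the form findBy...And... are handled.

```java
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

final class DerivedQuerySketch {

    // Hypothetical repository interface; nobody writes an implementation of it
    interface TalkRepository {
        String findByTitleAndSpeakers(String title, String speaker);
    }

    // Turns "findByTitleAndSpeakers" plus its arguments into "title=... AND speakers=..."
    static String describe(String methodName, Object[] args) {
        String[] properties = methodName.substring("findBy".length()).split("And");
        List<String> clauses = new ArrayList<>();
        for (int i = 0; i < properties.length; i++) {
            String property = Character.toLowerCase(properties[i].charAt(0)) + properties[i].substring(1);
            clauses.add(property + "=" + args[i]);
        }
        return String.join(" AND ", clauses);
    }

    // Creates the proxy; every call on the interface ends up in describe(...)
    static TalkRepository createRepository() {
        return (TalkRepository) Proxy.newProxyInstance(
                TalkRepository.class.getClassLoader(),
                new Class<?>[] {TalkRepository.class},
                (proxy, method, args) -> describe(method.getName(), args));
    }

    public static void main(String[] args) {
        TalkRepository repository = createRepository();
        // prints "title=Search AND speakers=Jane"
        System.out.println(repository.findByTitleAndSpeakers("Search", "Jane"));
    }
}
```

A real implementation would translate the parsed method name into an Elasticsearch query and map the hits back to Talk objects, but the principle of deriving behavior from the method name at runtime is the same.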

Spring Data Elasticsearch provides a lot of functionality and can be a good choice if you are already using Spring or even Spring Data. Some of the functionality of Elasticsearch might not be available at first or might be more difficult to use because of the custom abstraction layer.

More

Besides the clients we have seen in this post, there are more available for the JVM. The Groovy Client wraps the Java API in Groovy, and Elastisch is a client that implements the Elasticsearch API in a Clojure way. Have a look at the clients supported by the community to see more, e.g. several clients for Scala.

Some of the clients might be a little behind, so if you need the newest features it might be best to choose the native client. But of course all of these projects are open source, so if you need a feature, why not implement it yourself and contribute it back to the project?
