14 August 2014 Engineering

Elasticsearch Hadoop 2.0.1 and 2.1.Beta1 Released

By Costin Leau

I am happy to announce Elasticsearch for Apache Hadoop releases 2.0.1 and 2.1.Beta1.

2.0.1 is the latest stable release; it fixes several bugs, improves compatibility across various Hadoop distributions, and tracks the latest library updates.

2.1.Beta1 is the first preview release from the development branch, focused mainly on the emerging real-time components in the Hadoop ecosystem; in particular, Beta1 provides native integration with Apache Spark.

Native Apache Spark support


Elasticsearch for Apache Hadoop 2.0 added support for Apache Spark, through its Map/Reduce functionality. Beta1 goes way beyond that, providing a dedicated native Spark RDD (or Resilient Distributed Dataset) for Elasticsearch, for both Java and Scala. Thus, one can easily execute searches in Elasticsearch and transparently feed the results back to Spark for transformation.

For example, using the artists example from the reference documentation, counting the performers starting with “me” is a one-liner:

import org.elasticsearch.spark._
val sc = new SparkContext(new SparkConf())
val number = sc.esRDD("radio/artists", "?q=me*").count()

Dedicated Java and Scala APIs

In Beta1, with the introduction of the Spark module, Elasticsearch for Apache Hadoop has grown beyond Java and now also embraces Scala (Apache Spark itself is written in Scala). As such, one will find the typical Scala patterns in place (like “pimp my library” – gotta love the name), as well as support for Scala collections and objects as returned by the RDD.
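For readers unfamiliar with the pattern, “pimp my library” is plain Scala: an implicit conversion grafts new methods onto a type you do not own, which is how a method like esRDD can appear on SparkContext once the import is in scope. A self-contained toy version (the names here are illustrative, not the connector’s own):

```scala
// An implicit class adds methods to an existing type -- the same mechanism
// the Spark module uses to make esRDD available on SparkContext.
implicit class RichString(s: String) {
  def shout: String = s.toUpperCase + "!"
}

// The call compiles because the implicit conversion is in scope.
println("radio".shout)  // prints RADIO!
```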

However, Java users are not forgotten: the Spark module provides a dedicated RDD for Java which returns java.util collections and proper JDK types – the equivalent of its Scala brethren, but for Java.

Since the infrastructure is the same, one will get the same end result regardless of which API is used. In fact, the two APIs are fully compatible and one can even use both in the same application, at the same time. To wit, here’s an example of the Java RDD leveraging Java 8 lambda expressions for conciseness, to filter some entries (please ignore the fact that one can and should do the filtering directly through Elasticsearch):

import org.elasticsearch.spark.java.api.JavaEsSpark;
JavaSparkContext jsc = ...
JavaRDD<Map<String, Object>> esRDD = JavaEsSpark.esRDD(jsc, "radio/artists", "?q=me*");
JavaRDD<Map<String, Object>> filtered = esRDD.filter(
            m -> m.values().stream().anyMatch(v -> v.toString().contains("mega")));

Notice how the returned collection is used as is, without any conversion of any sort.

Index arbitrary RDDs to Elasticsearch

In a similar vein, through the Spark module any RDD can be saved to Elasticsearch as long as its structure maps to a document (one can easily transform the data or plug in their own serializer if needed):

val game = Map("media_type" -> "game", "title" -> "FF VI", "year" -> "1994")
val book = Map("media_type" -> "book", "title" -> "A Clash of Kings", "year" -> "1999")
val cd = Map("media_type" -> "music", "title" -> "Surfing With The Alien")
val sc = new SparkContext(...)
sc.makeRDD(Seq(game, book, cd)).saveToEs("my-collection/{media_type}")

The attentive user may have noticed the pattern usage in the index definition – the Spark module supports all the functionality of es-hadoop, such as the dynamic writing above, writing of raw JSON, scripting, or customizing the mapping. In fact, all the options are supported as the underlying code-base is the same – whether you are using Java, Scala or the Map/Reduce layer.
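To make the dynamic-writing idea concrete, here is a toy sketch of how such a pattern could be resolved per document – a simplified illustration of the concept, not the connector’s actual implementation:

```scala
// Resolve an index pattern such as "my-collection/{media_type}" against a
// document, substituting each {field} with the document's value for that field.
def resolvePattern(pattern: String, doc: Map[String, String]): String =
  "\\{([^}]+)\\}".r.replaceAllIn(pattern, m => doc(m.group(1)))

val game = Map("media_type" -> "game", "title" -> "FF VI")
println(resolvePattern("my-collection/{media_type}", game))  // prints my-collection/game
```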

Modular design – pick only what you need

While part of Elasticsearch for Apache Hadoop, the Spark module is self-contained and can be used either on its own or alongside the Map/Reduce, Hive, Pig and Cascading integrations. One can use it in standalone mode or against a YARN cluster.
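As a minimal illustration of wiring the Spark module to a cluster – a configuration sketch, where the host and port values are placeholders and local mode stands in for a real deployment (for YARN, one would submit the application via spark-submit instead):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Point the connector at the Elasticsearch cluster; es.nodes and es.port
// are standard es-hadoop settings, the values here are placeholders.
val conf = new SparkConf()
  .setAppName("es-spark-demo")
  .setMaster("local[*]")           // standalone/local mode
  .set("es.nodes", "localhost")
  .set("es.port", "9200")
val sc = new SparkContext(conf)
```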

We cannot wait to get your feedback on 2.1.Beta1 – you can find it in the usual spots and the new features explained in the documentation.

And by the way, if you are interested in data exploration, search, and identifying anomalies in real-time on your Hadoop data, you are warmly invited to our upcoming webinar hosted by yours truly, next week, on Wednesday, August 20th. The webinar will showcase some of the ways in which Elasticsearch for Apache Hadoop can help. Please register here.