October 8, 2014 News

Elasticsearch Hadoop 2.0.2 and 2.1.Beta2 Released

By Costin Leau

I am pleased to announce Elasticsearch for Apache Hadoop releases 2.0.2 and 2.1.Beta2. (If you haven't been following our story so far, es-hadoop is our connector that serves up real-time search & analytics for your Hadoop deployments.)

2.0.2 is the latest stable release. It contains several bug fixes and is a recommended upgrade for all existing users.

2.1.Beta2 is the second preview from the development branch. Besides the usual bug fixes, it brings a number of new features and improvements, most notably Apache Storm and Spark SQL support.

Spark SQL Support

2.1.Beta2 extends our native Spark support through Spark SQL integration. One can save SchemaRDDs to Elasticsearch or materialize them based on indices or queries (effectively creating views).

For example, finding all the "Smiths" is a one-liner:

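import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._
...
// sqlContext is an existing SQLContext (its setup is elided above)
// load the documents in spark/people that match the query as a SchemaRDD
val people = sqlContext.esRDD("spark/people", "?q=Smith")
// check the associated schema
people.printSchema()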
// root
//  |-- name: string (nullable = true)
//  |-- surname: string (nullable = true)
//  |-- age: long (nullable = true)

The returned SchemaRDD carries both the data and its associated schema, so it can be queried further through Spark SQL.

Writing to Elasticsearch looks strikingly similar, as any SchemaRDD can be indexed. For this example, let's use the Java support:

import org.apache.spark.sql.api.java.*;
import org.elasticsearch.spark.sql.java.api.JavaEsSparkSQL;
// sqlContext is an existing JavaSQLContext
JavaSchemaRDD people = sqlContext.parquetFile("people.dat");
// filter data using SQL
people.registerTempTable("people");
JavaSchemaRDD teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
// index it to Elasticsearch
JavaEsSparkSQL.saveToEs(teenagers, "spark/teens");

Again, it's just a one liner to save the data to Elasticsearch.

In addition to the Spark SQL support, the Spark module has had several improvements: the existing RDDs have been enhanced to PairRDDs and the code base has been upgraded to Spark 1.1 while maintaining backwards compatibility with Spark 1.0.

CDH 5.1 Certified

Speaking of Spark, we are glad to report that es-hadoop is now officially certified for CDH 5.1 (in addition to CDH 5.0), this time including the Spark category. We track our releases against Hadoop's releases to make sure our product evolves in step with its ecosystem, giving our users peace of mind that es-hadoop will simply work out of the box.

Apache Storm Integration

2.1.Beta2 makes Apache Storm a first-class citizen. (And, by the way, congrats to the Storm team for graduating from the Apache Incubator to a top-level project.) es-hadoop brings real-time search and analytics to Storm's stream data processing platform through dedicated, native Bolt and Spout implementations that index data coming from Storm topologies and stream query results into them.

To index data to Elasticsearch simply use EsBolt:

TopologyBuilder builder = new TopologyBuilder();
builder.setBolt("esBolt", new EsBolt("twitter/tweets"));

Executing queries in Elasticsearch for Storm is yet another one-liner:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("es-spout", new EsSpout("twitter/tweets", "?q=nfl*"), 5);
builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("es-spout");

That's it!

Under the covers, es-hadoop uses its parallelized infrastructure to map the Spout and Bolt instances across the index shards, in what we call a partition-to-partition architecture.
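To make the idea of mapping instances across shards more concrete, here is a minimal sketch in plain Java. It is not es-hadoop code: the ShardAssignment class, its assign method, and the round-robin strategy are all assumptions made for illustration, and the connector's actual assignment logic may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of es-hadoop): spreads shard ids
// over a fixed number of Spout/Bolt tasks in round-robin order.
class ShardAssignment {

    // Returns one list of shard ids per task; index i holds the shards for task i.
    static List<List<Integer>> assign(int shardCount, int taskCount) {
        if (shardCount < 0 || taskCount <= 0) {
            throw new IllegalArgumentException("shardCount must be >= 0 and taskCount > 0");
        }
        List<List<Integer>> perTask = new ArrayList<>();
        for (int task = 0; task < taskCount; task++) {
            perTask.add(new ArrayList<>());
        }
        // Shard s goes to task (s mod taskCount), so the load stays balanced.
        for (int shard = 0; shard < shardCount; shard++) {
            perTask.get(shard % taskCount).add(shard);
        }
        return perTask;
    }
}
```

With this sketch, calling ShardAssignment.assign(5, 2) gives the first task shards 0, 2 and 4 and the second task shards 1 and 3, so each task talks only to its own subset of shards.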

Low-latency, high-performance patterns like micro-batching and tick tuples are supported to provide excellent throughput out of the box and to closely integrate the real-time capabilities of Storm and Elasticsearch.
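The combination of micro-batching and tick tuples can be sketched in plain Java, without any Storm or es-hadoop dependency. The MicroBatcher class below is hypothetical, as are its method names: it buffers documents, flushes when the buffer reaches a size limit, and also flushes on a periodic tick so that a slow stream does not leave documents waiting. How EsBolt actually implements this is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of micro-batching with tick-driven flushes.
class MicroBatcher<T> {
    private final int maxEntries;            // flush once this many items are buffered
    private final Consumer<List<T>> sink;    // receives each completed batch (e.g. a bulk request)
    private final List<T> buffer = new ArrayList<>();

    MicroBatcher(int maxEntries, Consumer<List<T>> sink) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be > 0");
        }
        this.maxEntries = maxEntries;
        this.sink = sink;
    }

    // Called for every regular item; flushes when the batch is full.
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= maxEntries) {
            flush();
        }
    }

    // Called on a periodic tick; pushes out whatever is pending.
    void onTick() {
        flush();
    }

    // Hands a copy of the buffer to the sink and clears it; no-op when empty.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

The size limit bounds memory use and keeps bulk requests efficient under load, while the tick bounds how long a document can wait when traffic is light.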

Elasticsearch 1.4 Repository Support

Elasticsearch 1.4 Beta 1 was released last week, bringing significant enhancements, especially in the area of resilience. Among them, the snapshot and restore infrastructure has been revisited, and the new version is supported by 2.1.Beta2. (For Elasticsearch 1.0 through 1.3, please use es-hadoop 2.0.x.)

Elasticsearch Comes to NYC

If you happen to be in NYC next week and are interested in Elasticsearch, we'd love to talk to you!
Join us for the meetup on Oct 15th at Twitter (please RSVP, seats are limited) or, if you are attending Strata NYC, please stop by our booth. Many thanks to Twitter for hosting us!

We look forward to your feedback on 2.1.Beta2. The binaries are available on the download page, and the new features are explained in the reference documentation. As always, you can file bugs or feature requests on GitHub.