Elasticsearch Hadoop 2.0.2 and 2.1.Beta2 Released
I am pleased to announce Elasticsearch for Apache Hadoop releases 2.0.2 and 2.1.Beta2. (If you haven't been following our story so far, es-hadoop is our connector that serves up real-time search & analytics for your Hadoop deployments.)
2.0.2 is the latest stable release; it contains several bug fixes and is a recommended upgrade for all existing users.
2.1.Beta2 is the second preview from the development branch. Besides the usual bug fixes, it brings a number of new features and improvements, most notably Apache Storm and Spark SQL support.
Spark SQL Support
2.1 Beta2 extends our native Spark support through Spark SQL integration. One can save SchemaRDDs to Elasticsearch or materialize them based on indices or queries (effectively creating views).
For example, finding all the "Smith"s is a one-liner:
    import org.apache.spark.sql.SQLContext
    import org.elasticsearch.spark.sql._
    ...
    val people = sqlContext.esRDD("spark/people", "?q=Smith")
    // check the associated schema
    println(people.schema)
    // root
    //  |-- name: string (nullable = true)
    //  |-- surname: string (nullable = true)
    //  |-- age: long (nullable = true)
The data and its associated schema are loaded into the returned SchemaRDD and, through Spark SQL, can be queried further using SQL.
Writing to Elasticsearch looks strikingly similar, as any SchemaRDD can be indexed. For this example, let's use the Java support:
    import org.apache.spark.sql.api.java.*;
    import org.elasticsearch.spark.sql.java.api.JavaEsSparkSQL;
    ...
    // sqlContext is an existing JavaSQLContext
    JavaSchemaRDD people = sqlContext.parquetFile("people.dat");
    // filter data using SQL
    people.registerTempTable("people");
    JavaSchemaRDD teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");
    // index it to Elasticsearch
    JavaEsSparkSQL.saveToEs(teenagers, "spark/teens");
Again, saving the data to Elasticsearch is just a one-liner.
In addition to the Spark SQL support, the Spark module has had several improvements: the existing RDDs have been enhanced to PairRDDs and the code base has been upgraded to Spark 1.1 while maintaining backwards compatibility with Spark 1.0.
CDH 5.1 Certified
Happy to announce that
@elasticsearch for Apache Hadoop 2.1 is now certified on Cloudera 5, including support for Apache Spark!
— Cloudera Connect (@ClouderaConnect)
October 2, 2014
Speaking of Spark, we are glad to report that es-hadoop is now officially certified for CDH 5.1 (in addition to CDH 5.0), this time including the Spark category. We align our releases with Hadoop's releases to make sure our product evolves in step with its ecosystem, giving our users peace of mind that es-hadoop will simply work out of the box.
Apache Storm Integration
2.1 Beta2 makes Apache Storm a first-class citizen. (And, by the way, congrats to the Storm team for graduating from the Apache Incubator to a top-level project.) es-hadoop brings real-time search and analytics to Storm's stream data processing platform through dedicated native Bolt and Spout implementations that ingest data from Storm topologies and fan out queries to them.
To index data to Elasticsearch, simply use EsBolt:
    TopologyBuilder builder = new TopologyBuilder();
    builder.setBolt("esBolt", new EsBolt("twitter/tweets"));
Executing queries in Elasticsearch for Storm is yet another one-liner:
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("es-spout", new EsSpout("twitter/tweets", "?q=nfl*"), 5);
    builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("es-spout");
That's it!
Under the covers, es-hadoop uses its parallelized infrastructure to map the Spout and Bolt instances across the index shards for what we call a partition-to-partition architecture.
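To make the idea of spreading instances across shards more concrete, here is a minimal, purely conceptual sketch in plain Java. It is not es-hadoop's actual assignment logic: the class ShardAssignment and its round-robin strategy are assumptions made for illustration only. It simply shows how a fixed number of index shards could be divided among a number of parallel tasks so that each task works on its own partition of the data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of es-hadoop: spreads shard ids over tasks round-robin.
class ShardAssignment {

    // Returns, for each task index, the list of shard ids that task would handle.
    static List<List<Integer>> assign(int numShards, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        List<List<Integer>> perTask = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            perTask.add(new ArrayList<>());
        }
        // Shard i goes to task (i mod numTasks), so the load stays roughly even.
        for (int shard = 0; shard < numShards; shard++) {
            perTask.get(shard % numTasks).add(shard);
        }
        return perTask;
    }

    public static void main(String[] args) {
        // Five shards divided between two tasks.
        System.out.println(assign(5, 2)); // prints [[0, 2, 4], [1, 3]]
    }
}
```

With more tasks than shards, some tasks in this sketch simply receive no shards, which is one reason parallelism is usually chosen with the shard count in mind.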
Low-latency, high-performance patterns such as micro-batching and tick tuples are supported to provide excellent throughput out of the box and to closely integrate the real-time capabilities of Storm and Elasticsearch.
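For readers unfamiliar with these patterns, the following plain-Java sketch shows the general idea behind micro-batching combined with a periodic tick. It is an illustration under stated assumptions, not the EsBolt implementation: the class MicroBatcher, its maxEntries parameter and the onTick method are hypothetical names. Entries are buffered and sent as one batch when the buffer is full, and a periodic tick flushes whatever is pending so that a slow stream does not leave data sitting in the buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical class for illustration only; es-hadoop's EsBolt has its own implementation.
class MicroBatcher<T> {
    private final int maxEntries;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    MicroBatcher(int maxEntries, Consumer<List<T>> sink) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        this.maxEntries = maxEntries;
        this.sink = sink;
    }

    // Buffer one entry and flush as soon as the batch is full.
    void add(T entry) {
        buffer.add(entry);
        if (buffer.size() >= maxEntries) {
            flush();
        }
    }

    // Called on a periodic tick: send whatever is pending, even a partial batch.
    void onTick() {
        flush();
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Hand a copy to the sink so clearing the buffer does not affect the batch.
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    public static void main(String[] args) {
        List<List<String>> sent = new ArrayList<>();
        MicroBatcher<String> batcher = new MicroBatcher<>(2, sent::add);
        batcher.add("a");
        batcher.add("b"); // size limit reached, first batch is sent
        batcher.add("c");
        batcher.onTick(); // the tick sends the partial batch
        System.out.println(sent); // prints [[a, b], [c]]
    }
}
```

The trade-off is the usual one: larger batches mean fewer round trips, while the tick bounds how long an entry can wait before it is sent.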
Elasticsearch 1.4 Repository Support
Elasticsearch 1.4 Beta 1 was released last week, bringing significant enhancements, especially in the area of resilience. Among them, the snapshot and restore infrastructure has been reworked, and 2.1 Beta2 supports the new version. (For Elasticsearch 1.0 – 1.3, please use es-hadoop 2.0.x.)
Elasticsearch Comes to NYC
If you happen to be in NYC next week and are interested in Elasticsearch, we'd love to talk to you!
Join us for the meetup (please RSVP – seats are limited) on Oct 15th at Twitter, or, if you are attending Strata NYC, please stop by our booth. Many thanks to Twitter for hosting us!
We look forward to your feedback on 2.1.Beta2 – the binaries are available on the download page and the new features are explained in the reference documentation. As always, you can file bugs or feature requests on GitHub.