Several months ago, we announced on the mailing list the transformation of elasticsearch-hadoop into the go-to place for all things related to Elasticsearch and Hadoop. Today, we have published the first milestone (1.3.0.M1) based on the new code-base that has been in the works for the last few months.
Native integration with Hadoop
Using Elasticsearch in Hadoop has never been easier. Thanks to the deep API integration, interacting with Elasticsearch is similar to interacting with HDFS resources. And since Hadoop is more than just vanilla Map/Reduce, elasticsearch-hadoop provides support for Apache Hive, Apache Pig and Cascading in addition to plain Map/Reduce.
To wit, writing data to Elasticsearch is just Hadoop as usual:
-- load data from HDFS into Pig using a schema
A = LOAD 'in/artists.dat' USING PigStorage() AS (id:long, name, url:chararray, picture:chararray);
-- ETL the data to your heart's content
B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
-- save the result to Elasticsearch
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.ESStorage();
Or, if you prefer:
JobConf conf = new JobConf();
// index used for storing data
conf.set("es.resource", "radio/artists");
// use the dedicated output format
conf.setOutputFormat(ESOutputFormat.class);
...
JobClient.runJob(conf);
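Cascading users get the same treatment. A minimal sketch of writing through a Tap, assuming the ESTap sink from the project's Cascading integration (the HDFS source tap and pipe name here are illustrative, and the exact constructor signatures may differ between versions):

```java
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import org.elasticsearch.hadoop.cascading.ESTap;

// read delimited artist data from HDFS (illustrative source tap)
Tap in = new Hfs(new TextDelimited(), "in/artists.dat");
// sink the tuples straight into the Elasticsearch index
Tap out = new ESTap("radio/artists");
Pipe copy = new Pipe("copy-to-es");
// wire source, sink and pipe together and run the flow on Hadoop
new HadoopFlowConnector().connect(in, out, copy).complete();
```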
Reading is just as easy:
CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.ESStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists/_search?q=me*');

-- stream data from Elasticsearch
SELECT * FROM artists;
Note the use of the search query to retrieve only the data that we are interested in, which is all it takes to have a real-time engine available in Hadoop.
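The same query-as-resource idea carries over to plain Map/Reduce reads. A hedged sketch mirroring the write example above, assuming the dedicated ESInputFormat counterpart to ESOutputFormat (the mapper wiring is elided, as in the earlier snippet):

```java
JobConf conf = new JobConf();
// the search query doubles as the input resource --
// only matching documents are streamed into the job
conf.set("es.resource", "radio/artists/_search?q=me*");
// use the dedicated input format to read from Elasticsearch
conf.setInputFormat(ESInputFormat.class);
...
JobClient.runJob(conf);
```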
Pure Map/Reduce Model
The dedicated Elasticsearch Input/OutputFormat, Tap, Storage and SerDe take care of the heavy lifting so you do not have to fiddle with data conversion or with communicating with Elasticsearch. More importantly, the Map/Reduce model is ported over: by using Elasticsearch's distributed architecture and shard allocation, elasticsearch-hadoop operations are executed in parallel.
In other words, no matter the environment, whenever a query is triggered, elasticsearch-hadoop dynamically determines the number of shards backing the target index and uses a dedicated task per shard to query the data in parallel. The results are aggregated (reduced) and returned to the user transparently.
Moreover, Hadoop has full insight into the underlying data topology, so it can run tasks alongside the data, a great performance boost in deployments where the Elasticsearch and Hadoop clusters are co-located.
Read / Write data transparently
Whatever library is used, elasticsearch-hadoop automatically handles type conversion to and from JSON, whether the data arrives as Hadoop Writables, through a Hive SerDe or as Pig data types. And while the latest data types are supported (such as DateTime in Pig and Decimal in Hive), elasticsearch-hadoop remains backwards compatible with older releases.
To minimize memory usage, type conversion happens on the fly, and for efficiency data is written in bulk and read through streaming. By taking advantage of Elasticsearch capabilities, mappings are created automatically whether or not the data has a schema associated with it.
A job’s dependencies can easily increase its size to several megabytes, complicating its deployment; at only ~150KB, elasticsearch-hadoop can be easily embedded and requires no extra JARs outside Hadoop itself. At runtime, it relies on HTTP/REST for communication so it can be used without any changes to existing networks.
For the upcoming releases we plan to add support for advanced mappings (such as parent/child and nested documents), an HDFS implementation for the new snapshot/restore feature in Elasticsearch 1.0, and push-down operations, to name but a few.
We are eager for feedback, and this post has only scratched the surface of the features available. Check out the documentation for in-depth details.