May 27, 2014 Engineering

Elasticsearch for Apache Hadoop 2.0 GA Released

By Costin Leau

I am elated to announce the release of Elasticsearch for Apache Hadoop 2.0 GA.

Elasticsearch for Apache Hadoop, affectionately known as es-hadoop, enables Hadoop users and data-hungry businesses to enhance their workflows with a full-blown search and analytics engine in real time. es-hadoop is open source and works across different Hadoop, Cascading, Apache Hive and Apache Pig versions and across multiple Hadoop distributions, whether vanilla Apache Hadoop, Cloudera CDH, Hortonworks HDP, MapR or Pivotal.

No dependencies, all the functionality.

Native Integration with Hadoop

Using Elasticsearch with Hadoop has never been easier. Thanks to our deep API integration, interacting with Elasticsearch is similar to interacting with HDFS resources, whatever the environment used, from plain Map/Reduce, to Cascading plans, Pig scripts and Hive queries.

For each environment, es-hadoop provides a native interface that one can use to read, write and query Elasticsearch transparently. The dedicated Map/Reduce Input/OutputFormat, Cascading Sink/Tap, Hive StorageHandler/SerDe and Pig Load/StoreFunctions take care of the heavy lifting, so you do not have to fiddle with data conversion or network communication with Elasticsearch. If your data happens to be JSON, that's fine by us; es-hadoop supports that too.
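Whichever interface is used, the connection is described through a handful of settings. The sketch below is an illustration rather than part of es-hadoop: `hadoop_args` is a hypothetical helper that renders settings as Hadoop `-D key=value` options, and the node address and index name are assumptions. The keys `es.nodes`, `es.resource` and `es.input.json` are es-hadoop settings as we understand them; check the configuration reference for your version.

```python
# Hypothetical helper (not part of es-hadoop): render settings as
# Hadoop generic options of the form "-D key=value".
def hadoop_args(settings):
    args = []
    for key in sorted(settings):
        args += ["-D", f"{key}={settings[key]}"]
    return args


settings = {
    "es.nodes": "localhost:9200",  # assumed Elasticsearch address
    "es.resource": "logs/access",  # assumed target index/type
    "es.input.json": "true",       # input records are already JSON
}

ARGS = hadoop_args(settings)
```

The resulting list can be appended to a job submission command; setting `es.input.json` tells the connector to pass documents through as they are instead of converting them.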

Pure Map/Reduce model

Most importantly, the Map/Reduce model in Hadoop is mapped on top of your Elasticsearch cluster; by leveraging Elasticsearch's distributed architecture, each es-hadoop operation scales out, being executed in parallel across the target shards. In other words, whenever a write or a read is issued, es-hadoop dynamically determines the number of shards used for the target index and, for each one, uses a dedicated task to push/pull the data in parallel, enabling the operation to scale out with the data.
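The one-task-per-shard idea can be shown with a toy model. This is a conceptual sketch only, not es-hadoop's implementation: shards are plain in-memory lists, threads stand in for Hadoop tasks, and `plan_tasks`, `read_shard` and `parallel_read` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_tasks(shard_ids):
    """Create one task descriptor per shard of the target index."""
    return [{"task": i, "shard": s} for i, s in enumerate(shard_ids)]


def read_shard(task, shards):
    """Pull every document from the single shard owned by this task."""
    return list(shards[task["shard"]])


def parallel_read(shards):
    """Read all shards concurrently, one worker per shard."""
    tasks = plan_tasks(sorted(shards))
    with ThreadPoolExecutor(max_workers=len(tasks) or 1) as pool:
        parts = pool.map(lambda t: read_shard(t, shards), tasks)
        # map() yields results in task order, so output is deterministic.
        return [doc for part in parts for doc in part]
```

Because the number of tasks follows the number of shards, adding shards to an index adds parallelism without any change to the job itself.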

Moreover, es-hadoop has full insight into the data topology used underneath so it can run its tasks co-located with the data, a great performance boost in deployments where Elasticsearch and Hadoop clusters run side by side.
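To make the co-location idea concrete, here is a toy placement routine. It is a sketch under simplifying assumptions, not the scheduling logic of Hadoop or es-hadoop: each shard lives on exactly one host, host names are made up, and `assign_tasks` is a hypothetical function. In a real job the preferred hosts are only hints that the Hadoop scheduler tries to honor.

```python
def assign_tasks(shard_hosts, worker_hosts):
    """Map each shard to a worker, preferring the host that holds the shard.

    shard_hosts:  dict of shard id -> host storing that shard
    worker_hosts: set of hosts that can run tasks
    """
    workers = sorted(worker_hosts)
    assignment = {}
    for i, (shard, host) in enumerate(sorted(shard_hosts.items())):
        if host in worker_hosts:
            # Data-local: the task runs next to the shard it reads or writes.
            assignment[shard] = host
        else:
            # No local worker: fall back to round-robin over the workers.
            assignment[shard] = workers[i % len(workers)]
    return assignment
```

When both clusters share the same machines, every shard finds a local worker and no document has to cross the network between nodes.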

Portability

es-hadoop is actively tested against various Hadoop distributions (such as vanilla Hadoop, CDH, HDP, MapR and Pivotal). Whether you are using Hadoop 1.x or 2.x, the so-called old (org.apache.hadoop.mapred) or the new (org.apache.hadoop.mapreduce) API, vanilla Hadoop or a vendor distribution, we invest heavily in ensuring that es-hadoop works reliably no matter your Hadoop environment.

Operational Ease

es-hadoop provides a single binary (~350 KB jar) for its entire feature set, and those interested in saving a few KBs can use the modular jars. Without any dependencies, each jar can be used as is, easily embedded inside Hadoop jobs or provisioned throughout the cluster. At runtime, the firewall-friendly HTTP/REST protocol is used, and HTTP and SOCKS proxies (with or without authentication) are supported for those running in locked-down networks.
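For locked-down networks, the proxy is typically described through job settings. The sketch below assumes the `es.net.proxy.http.*` and `es.net.proxy.socks.*` key families from the es-hadoop configuration reference as we recall them; verify the exact names for your version. `proxy_settings` is a hypothetical helper, and the host and credentials are placeholders.

```python
def proxy_settings(kind, host, port, user=None, password=None):
    """Build proxy settings for an HTTP or SOCKS proxy.

    The key names are an assumption based on the es-hadoop
    configuration reference; confirm them before relying on this.
    """
    if kind not in ("http", "socks"):
        raise ValueError("kind must be 'http' or 'socks'")
    prefix = f"es.net.proxy.{kind}"
    settings = {f"{prefix}.host": host, f"{prefix}.port": str(port)}
    if user is not None:
        # Authentication is optional; add credentials only when given.
        settings[f"{prefix}.user"] = user
        settings[f"{prefix}.pass"] = password or ""
    return settings
```

For example, `proxy_settings("socks", "proxy.example.internal", 1080)` yields only the host and port keys, while passing `user` and `password` adds the two credential keys.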

Production-ready, at Scale

We are happy to report that es-hadoop is being used in multiple data-intensive environments. In a recent example, a large financial institution that stores all of its raw access logs in Hadoop – billions of documents – has been using es-hadoop to index the data into Elasticsearch and then visualize it using Kibana. This approach gives the customer near real-time visibility into their data through Kibana, while still letting them run batch-oriented jobs over all their raw data when needed.

By combining Hadoop and Elasticsearch, organizations gain a scalable, distributed platform that enables fast search and data discovery across tremendous amounts of information. And through es-hadoop, this is easier than ever.

But don't take our word for it: download es-hadoop 2.0 GA, try it out and let us know what you think!