08 April 2014

Elasticsearch for Apache Hadoop 1.3 M3 is out

By Costin Leau

I am happy to announce that Elasticsearch for Apache Hadoop 1.3 M3 has been released. Elasticsearch for Apache Hadoop provides a single jar that enables real-time search and analytics across different Hadoop, Cascading, Hive and Pig versions and across multiple Hadoop distributions, whether it is vanilla Apache Hadoop, CDH, HDP, MapR or Pivotal. No dependencies, all the functionality.

Besides a healthy dose of bug fixes, the last (planned) milestone in 1.3 adds a series of improvements and features, such as:

Multi-index Writes

When indexing data, it is common to split it into different buckets based on its content. For example, log data is typically indexed per date (per day, week or month), which makes the data and its life-cycle easy to handle and manage.
es-hadoop 1.3 M3 brings this functionality to the table, allowing data in Hadoop to be indexed in real time based on its content, regardless of whether you are using Map/Reduce, Cascading, Hive or Pig. If we want to index our media based on its type, we can simply define the following target resource:

es.resource = "my-collection/{media_type}"

At runtime, the {media_type} field is extracted from each document, the actual index/type is resolved, and the data is dispatched accordingly. The value extraction happens transparently whether you are using a Cascading tuple or a Hive table, and whether you are passing in native types or raw JSON. See this section of the user guide for more information.
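To illustrate the mechanics, here is a minimal sketch in plain Java (a hypothetical helper for illustration, not es-hadoop code) of how such a pattern can resolve against each document at write time:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResourceResolver {

    // Hypothetical helper showing the idea: each {field} placeholder
    // in the resource pattern is replaced with that field's value
    // from the document being written.
    static String resolve(String pattern, Map<String, String> doc) {
        Matcher m = Pattern.compile("\\{([^}]+)\\}").matcher(pattern);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            m.appendReplacement(sb, Matcher.quoteReplacement(doc.get(m.group(1))));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = Map.of("media_type", "book", "title", "Moby-Dick");
        // a "book" document is routed to the my-collection/book index/type
        System.out.println(resolve("my-collection/{media_type}", doc));
    }
}
```

With es-hadoop itself, none of this code is needed: the pattern in es.resource is all you write, and the resolution happens inside the connector.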

Support for automatic time-based formatting is scheduled for 1.3-RC1, allowing logstash-like patterns such as es.resource = "my-collection/{timestamp:YYYY.MM.dd}".


Job Statistics

Getting insight into how your jobs perform is crucial for maximizing performance, tracking behavior and diagnosing issues. That is why in M3 we have added over 15 metrics that extensively cover the entire I/O spectrum of es-hadoop. Simply upgrade, and the stats are automatically logged to the console for each Map/Reduce job:

Elasticsearch Hadoop Counters
	Bulk Retries=0
	Bulk Retries Total Time(ms)=0
	Bulk Total Time(ms)=518
	Bulk Writes=20
	Bytes Accepted=159129
	Bytes Read=79921
	Bytes Retried=0
	Bytes Written=159129
	Documents Accepted=993
	Documents Read=0
	Documents Retried=0
	Documents Written=993
	Network Retries=0
	Network Total Time(ms)=937
	Node Retries=0
	Scroll Reads=0
	Scroll Total Time(ms)=0

All the stats are exposed through the Hadoop infrastructure and are available through the standard APIs. No extra steps or configuration are needed; everything is already included.

Mapping Typo Suggestions

Typos happen to everyone (to some more often than to others), and it can be quite annoying to discover that your query is incorrect because there is no naem or adress in your data. es-hadoop tries to help out:

WARN main mr.EsInputFormat - Field(s) [naem, adress] not found in the Elasticsearch mapping specified; did you mean [name, location.address]?

This validation can be enforced if you wish to prevent your potentially expensive job from running with typos.
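As a sketch, turning the warning into a hard failure would look something like the following, assuming the es.field.read.validate.presence setting from the es-hadoop configuration reference (treat the exact property name and value as assumptions to verify against your version's documentation):

```
# abort the job instead of merely warning about unknown fields (property name assumed)
es.field.read.validate.presence = strict
```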

Proxy (HTTP and SOCKS) support

If your network has access restrictions, with M3 you can use both HTTP and SOCKS proxies to transparently route connections from Hadoop to Elasticsearch (and back :)). Both open and authenticated proxies are supported. Plus, the proxying is scoped, so es-hadoop connections do not interfere with the rest of the JVM.
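A minimal configuration sketch, assuming the es.net.proxy.* property names from the es-hadoop configuration reference (verify them against the documentation for your version):

```
# route es-hadoop traffic through an HTTP proxy...
es.net.proxy.http.host = proxy.example.com
es.net.proxy.http.port = 3128

# ...or through a SOCKS proxy
es.net.proxy.socks.host = proxy.example.com
es.net.proxy.socks.port = 1080
```

Because the settings are scoped to es-hadoop, other connections made by the same JVM are left untouched.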

Same binary for both Hadoop 1.x and 2.x

es-hadoop supports both Hadoop 1.x and 2.x environments. Since Hadoop 2.x is not binary compatible with Hadoop 1.x (for the org.apache.hadoop.mapreduce package), this resulted in two separate jars, one for each environment. Needless to say, it was easy to mix the two up. In M3, we introduce a single jar that works across both Hadoop environments!

Modular jars

Speaking of binaries, in M3 we also introduced one jar per module. es-hadoop as a whole takes about 300 KB and gives you integration with Map/Reduce, Hive, Pig and Cascading. Yet we understand that some folks might want to use Elasticsearch just with Cascalog, or only do real-time search with Hive. For this reason, each integration is now available as a dedicated jar with its own Maven POM: slimmer in size and able to run stand-alone, with no other jars required.
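As a sketch, pulling in just the Map/Reduce module might look like the following Maven dependency (the artifact id and version shown are assumptions; check the published artifacts for the exact coordinates):

```xml
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop-mr</artifactId>
  <version>1.3.0.M3</version>
</dependency>
```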

We have also expanded the configuration options for each integration: a single job can now read from and write to different Elasticsearch indices; the Cascading integration gains per-Tap configuration and Lingual support (run ANSI SQL in Hadoop against Elasticsearch); and exception reporting and handling have been improved.

Feedback welcome

We are quite excited about the upcoming 1.3 release and we hope you are too! So go ahead, take M3 for a spin and let us know what you think!