Elasticsearch for Apache Hadoop (ES-Hadoop) 2.2 RC1 has been released, just in time to celebrate the start of 2016.
Packing a significant number of bug fixes and enhancements, this release candidate is the last step towards a full general availability release for the current development branch. As always, the artifacts are available at the download page or Maven.
ES-Hadoop 2.2 RC1 introduces support for the just-released Spark 1.6. In particular, the connector now lets Spark skip pushed-down filters that would otherwise be processed again in Spark despite having already been handled by the connector. For large result sets, this is an important optimization, since the same rows are no longer filtered twice.
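To make the idea concrete, here is a minimal, self-contained sketch of splitting filters into those a source handles itself and those the engine still has to apply. It is not connector code: the `Filter` type, the `PUSHABLE_OPS` set and `split_filters` are hypothetical names used only for illustration. In Spark 1.6 the corresponding hook on a relation is, to the best of our knowledge, `unhandledFilters`.

```python
from typing import NamedTuple


class Filter(NamedTuple):
    """Hypothetical stand-in for a Spark SQL data source filter."""
    op: str        # e.g. "eq", "in", "custom"
    field: str
    value: object


# Assumption: operators this toy source can evaluate on its own.
PUSHABLE_OPS = {"eq", "in", "gt", "lt"}


def split_filters(filters):
    """Return (handled, unhandled).

    Handled filters are evaluated at the source; only the unhandled
    ones need to be evaluated again by the engine.
    """
    handled = [f for f in filters if f.op in PUSHABLE_OPS]
    unhandled = [f for f in filters if f.op not in PUSHABLE_OPS]
    return handled, unhandled


filters = [Filter("eq", "user", "kimchy"), Filter("custom", "message", "n/a")]
handled, unhandled = split_filters(filters)
```

Here the engine would re-check only the `custom` filter; the `eq` filter is trusted to have been applied at the source.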
The push-down translation has also been improved, in particular for IN filters, which are now matched more accurately depending on whether they contain raw terms or values (such as dates or timestamps).
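As a rough illustration of the distinction, the sketch below translates an IN filter into an Elasticsearch-style query fragment. It is an assumption-laden sketch rather than the connector's actual translation: `translate_in` and its `strict` flag are hypothetical, and the rule used (exact `terms` for dates, timestamps or strict mode; analyzed `match` queries otherwise) is only one plausible reading of "raw terms vs. values".

```python
from datetime import date, datetime


def translate_in(field, values, strict=False):
    """Translate an IN filter into a query fragment (illustrative only)."""
    # Drop nulls, which cannot be matched meaningfully.
    values = [v for v in values if v is not None]
    if not values:
        return None

    # Assumption: dates/timestamps (and strict mode) are matched as exact terms.
    temporal = all(isinstance(v, (date, datetime)) for v in values)
    if strict or temporal:
        terms = [v.isoformat() if isinstance(v, (date, datetime)) else v
                 for v in values]
        return {"terms": {field: terms}}

    # Otherwise each value goes through an analyzed match query, OR-ed together.
    return {"bool": {"should": [{"match": {field: v}} for v in values]}}
```

For example, `translate_in("when", [date(2016, 1, 1)])` yields a `terms` query on the ISO-formatted date, while `translate_in("tag", ["big data"])` yields a `bool`/`should` wrapper around a single `match` query.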
Speaking of Spark SQL, the schema declaration has been improved to handle multi-valued/array fields in a simple and elegant fashion (whether the fields are nested or not).
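A minimal sketch of what such a declaration might look like from the Python side, assuming the connector's `es.read.field.as.array.include` setting is the relevant option. The field names and the `blog/post` resource are placeholders, and `array_field_options` is a hypothetical helper.

```python
def array_field_options(fields):
    """Build reader options that mark `fields` as arrays.

    Assumption: the connector reads a comma-separated list of field
    names from `es.read.field.as.array.include`.
    """
    return {"es.read.field.as.array.include": ",".join(fields)}


# Placeholder field names; a dotted name stands for a nested field.
options = array_field_options(["tags", "comments.votes"])

# With PySpark, the options could then be passed along the lines of:
#   sqlContext.read.format("org.elasticsearch.spark.sql") \
#       .options(**options).load("blog/post")
```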
In addition, the connector configuration is now sanitized and safely passed throughout a Spark job. This addresses a subtle bug in which properties specified only on the command line were discarded during a job stage, causing abnormal behavior.
The YARN module received a batch of updates: it has been upgraded to Elasticsearch 2.1.x and now allows JVM system properties to be passed directly to the child containers.
The repository HDFS plugin has seen a lot of activity. For Elasticsearch 2.0 and 2.1, it currently requires the JVM security manager to be disabled (as Hadoop is significantly greedier than Elasticsearch itself in terms of permissions). Starting with Elasticsearch 2.2, thanks to the security improvements, the plugin can customize its own codebase grants.
Please note that the migration of the plugin to Elasticsearch core has already started and is currently scheduled for Elasticsearch 2.3.
More about that in a future blog post!
The WAN/cloud feature has seen a lot of uptake, which has exposed the connector to more varied network topologies and configurations. This led to a number of fixes in the way ES-Hadoop handles Elasticsearch clusters that mix hostnames and IPs (typically with network publishing enabled) and in the translation between the two. Overall, the connector now picks up more information about its environment, reducing the amount of extra configuration required from the user.
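The hostname-to-IP translation mentioned above can be sketched in a few lines. This is not how the connector does it internally; `resolve_nodes` is a hypothetical helper, and the sketch assumes plain hostnames or IPv4 addresses in `host:port` form.

```python
import socket


def resolve_nodes(nodes, resolver=socket.gethostbyname):
    """Map each 'host[:port]' entry to 'ip[:port]'.

    Entries that cannot be resolved are kept unchanged, so a
    misconfigured name does not silently disappear from the list.
    """
    resolved = []
    for node in nodes:
        host, _, port = node.partition(":")
        try:
            ip = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            ip = host
        resolved.append(f"{ip}:{port}" if port else ip)
    return resolved
```

Passing the resolver as a parameter keeps the function easy to exercise without network access, for instance with a small lookup function in place of real DNS.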
Looking forward to 2016!