Product release

Elasticsearch for Apache Hadoop 7.0.0 released

We’re excited to announce the release of Elasticsearch for Apache Hadoop (a.k.a. ES-Hadoop) 7.0.0, built against Elasticsearch 7.0.0.

Our eyes to the future

So long Java 6 and 7

For a long while, we have supported ES-Hadoop releases on Java 6 and 7. This year Java turns 24 years old and has already seen the GA of Java 12. They grow up so fast…

So much becomes easier with more recent versions of Java, and the longer we stay behind, the greater the risk of getting stuck on older versions. One change landing in 7.0 is that Elasticsearch’s low-level Java REST client is bumping its minimum required Java version to 8. We see this as a good opportunity to follow suit: starting in 7.0, ES-Hadoop requires Java 8 or higher to run.

Fond farewell to Cascading

Over the last couple of years, we've seen usage statistics for the Cascading integration drop, whether measured by download numbers, community posts, or open issues. Additionally, ES-Hadoop only ever supported the Cascading 2 line; Cascading is already well into its 3.x line, yet we've seen little interest or pressure to move to any of the newer versions.

Each integration we support incurs testing and maintenance costs as new features are added. Unfortunately, the Cascading integration no longer delivers enough benefit to our user base to justify that cost, and the effort to maintain it is better spent on features and fixes in other areas.

Due to these factors, the Cascading integration was deprecated in 6.7.0 and has now been removed in the 7.0.0 release.

Hello to simpler partitioning

Back in 5.0, we added the ability to divide an index’s shards into smaller “slices” for parallel reading. This allowed ES-Hadoop to create more input splits for a job instead of defaulting to one split per shard of the input index. Unfortunately, over the last few years we’ve seen this change cause confusion and, in extreme cases, losses in read performance. While the default of 100k documents per input split is fine in many general cases, we found that as the system is taxed, that limit often needs to be raised or disabled altogether.

In 7.0, ES-Hadoop returns to its original default behavior of one input partition per shard. Users can still slice these partitions further with the es.input.max.docs.per.partition setting, but it is now empty by default, meaning no additional splitting is performed. We hope this change improves the getting-started and tuning experience for users going forward.
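
To make the tuning knob concrete, here is a minimal sketch of a read from Spark, assuming the ES-Hadoop connector is on the classpath; the app name, node address, and index name are placeholders, not values from this release.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("es-partitioning-example")   // placeholder app name
      .getOrCreate()

    // With es.input.max.docs.per.partition unset, 7.0 defaults to one
    // input partition per shard. Setting it opts back into slicing,
    // capping each partition at the given number of documents.
    val docs = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")                 // placeholder address
      .option("es.input.max.docs.per.partition", "100000")
      .load("my-index")                                     // placeholder index

    println(s"Input partitions: ${docs.rdd.getNumPartitions}")

Leaving the setting out entirely gives you the new one-partition-per-shard default; set it only when your shards are large enough that finer-grained parallelism pays off.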

Tell us how we did!

We love hearing your feedback and suggestions. If you have a great idea for a new feature or enhancement, or if you have any questions, stop by our forums or submit an issue on GitHub!