October 26, 2016

Elasticsearch for Apache Hadoop 5.0.0

Elasticsearch for Apache Hadoop, affectionately known as ES-Hadoop, enables Hadoop users and data-hungry businesses to enhance their workflows with a full-blown search and analytics engine, in real time. And now, the moment you’ve been waiting for. Drumroll please. Developers and data scientists, I am pleased to present to you Elasticsearch for Apache Hadoop 5.0.0!

After several early access releases, a boatload of feedback posts, and plenty of waiting, it has finally arrived! The Elastic Stack has made it to 5.0, and crossing the finish line along with it is ES-Hadoop! This release contains a substantial number of stability improvements, bug fixes, and shiny new features that we hope all of you will enjoy. And so, without further ado…

What’s New in ES-Hadoop 5.0?

Out with the Old … In with the New!

Sometimes you need to step backward to move forward. We’ve bumped up the versions for a handful of integrations. In doing so, we’ve removed support for some older versions. If you are using the older versions, it would be best to update them before moving to ES-Hadoop 5.0 for maximum compatibility.

Hello Hive 1.0, Goodbye Hive 0.13 and 0.14

Hive 1.0 has been out for quite a while and the majority of distributions have already moved to it. As such, support for Hive 0.13 and Hive 0.14 (two releases that were plagued by serious issues) has now been dropped, which also cleans up the code base.

Hello Storm 1.x, Goodbye Storm 0.9

Storm support has been upgraded to 1.0.x. As this version is not backwards compatible with Storm 0.9.x, support for the 0.9.x line had to be dropped.

Hello Spark 2.0, Goodbye Spark 1.0-1.2

Our support for Spark has been updated with the recent release of Spark 2.0. This version of Spark is not backwards compatible with any previous Spark versions, so we have decided to keep support for Spark 1.3-1.6 as a separate compatibility artifact. SparkSQL was originally released in Spark 1.0-1.2 as an alpha component. It became stable in Spark 1.3, but its API changed significantly along the way. Supporting three very different versions of Spark is a bit much. Because of this, support for Spark 1.0-1.2 has been removed.

HDFS Repository

The HDFS Repository has experienced a substantial upgrade and is now part of Elasticsearch proper. Because of this upgrade, we have removed it from the ES-Hadoop project. Note that the HDFS plugin in Elasticsearch 5.0 is not just conveniently packaged but also better integrated. Among these improvements is no longer needing to disable the JVM SecurityManager - an option that isn’t even available anymore.

(Hadoop/Spark) + Slice API = More Parallel

We’ve added support for Elasticsearch’s new Scroll Slicing functionality. Now you can state the maximum number of documents you wish to see per input task and the framework will attempt to subdivide input splits to increase your computing parallelism. Isn’t sharing beautiful?
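As a rough sketch of the idea, the snippet below splits one shard’s documents into sliced-scroll search bodies. The setting name `es.input.max.docs.per.partition` is the connector option for this cap, and the `slice` clause is Elasticsearch’s sliced-scroll syntax. The `plan_slices` helper, the document counts, and the rounding are illustrative assumptions, not the connector’s actual code.

```python
import json
import math


def plan_slices(shard_doc_count, max_docs_per_partition, query=None):
    """Return one search body per input partition (illustrative only)."""
    if max_docs_per_partition <= 0:
        raise ValueError("max_docs_per_partition must be positive")
    query = query or {"match_all": {}}
    # Assumption: round up so the average slice stays at or under the cap;
    # the real connector may round differently.
    num_slices = max(1, math.ceil(shard_doc_count / max_docs_per_partition))
    if num_slices == 1:
        # Elasticsearch requires slice.max to be greater than 1, so a single
        # partition uses a plain scroll with no "slice" clause.
        return [{"query": query}]
    return [
        {"slice": {"id": i, "max": num_slices}, "query": query}
        for i in range(num_slices)
    ]


# Hypothetical job settings: cap each input task at 100,000 documents.
settings = {"es.input.max.docs.per.partition": "100000"}

bodies = plan_slices(
    shard_doc_count=250_000,  # made-up shard size for the example
    max_docs_per_partition=int(settings["es.input.max.docs.per.partition"]),
)
for body in bodies:
    print(json.dumps(body))
```

Each body would back one input task, so a shard that previously mapped to a single task can be read by several in parallel.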

Ingest Node

We heard about this cool new feature called the Ingest Node that was available in the alpha releases and coming out in Elasticsearch v5.0. We thought “Oh man, we ingest stuff, this node ingests stuff. We need to schedule a brunch with it immediately to trade gossip.” With the release of ES-Hadoop 5.0 you can now specify an ingest pipeline to send your data to, as well as target only ingest nodes to cut down on unnecessary traffic. We’re still waiting to hear back from you about brunch, Ingest Node. Call us!
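For illustration, here is how those two options might look in a job’s configuration. `es.ingest.pipeline` and `es.nodes.ingest.only` are the connector settings involved; the pipeline name, index name, host, and the `ingest_write_conf` helper are made up for the example.

```python
def ingest_write_conf(resource, pipeline, nodes="localhost:9200", ingest_only=True):
    """Build ES-Hadoop write settings that route documents through an ingest pipeline."""
    if not pipeline:
        raise ValueError("pipeline name is required")
    return {
        "es.nodes": nodes,                # hypothetical cluster address
        "es.resource": resource,          # hypothetical index/type to write to
        "es.ingest.pipeline": pipeline,   # pipeline applied to every document
        # Send requests only to nodes that have the ingest role enabled.
        "es.nodes.ingest.only": "true" if ingest_only else "false",
    }


conf = ingest_write_conf("logs/entry", pipeline="parse-apache")
for key in sorted(conf):
    print(f"{key}={conf[key]}")
```

A dictionary of this shape is what a PySpark job would typically pass as the `conf` argument when writing through the connector’s Hadoop output format; the same keys apply to the other integrations.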

Native Support for Spark Streaming

Spark is pretty fast, but sometimes you need your data even faster. We loved hearing that some of you were using ES-Hadoop with Spark Streaming, but we also felt the same heartache about the limitations that you were running into. We decided to do something about it. ES-Hadoop now natively supports consuming DStreams from Spark Streaming! We’ve also included fixes for the most commonly reported Spark Streaming issue: running out of connection resources during small processing windows. May your TIME_WAITs be few, and your Spark Streaming jobs live long and prosper.

Fast Acting Bug Repellant

Computers are hard. We thank our lucky stars every day that our friends in the community are so helpful when it comes to reporting issues. When you open up your copy of ES-Hadoop, you’ll find a fresh batch of bug fixes already applied. These include fixes for overwriting data with SparkSQL, memory leaks in the network code, sub-fields in your mapping named “properties”, and a bunch more. If we listed them all out here there would be no room for anything else. Cheers to the bug hunters out there! This one’s for you.

Feedback

As always, we love to hear from our users about what we’re doing well and what needs improving. So drop us a line sometime on Twitter, GitHub, or the forum. Operators are standing by.

Special Thanks

We on the ES-Hadoop team would like to especially thank all of the early adopters for aiding us through the last few months of alpha and beta releases. 5.0 is the best release that it can be thanks to all of you. Stay classy.
