Elasticsearch for Apache Hadoop 2.1 RC1 is out

The first release candidate of Elasticsearch for Apache Hadoop (ES-Hadoop) 2.1 is here.

Exactly 50 days after Beta4, RC1 completes the features scheduled for 2.1, improves the stability of the code and polishes the documentation.

The goodies are described below; however we understand if you want to grab the binaries right away from the download page or Maven.

Spark SQL Push Down

2.1 RC1 not only supports the just-released Spark 1.4 but also introduces full push-down support for Spark SQL. That is, ES-Hadoop translates Spark SQL into Elasticsearch Query DSL, so the operations are pushed down to the storage layer and executed there efficiently, with only the needed results returned to Spark.

For example:

SELECT reason FROM trips WHERE id>=1 AND airport="OTP"

is translated into:

{
  "query" : {
    "filtered" : {
      "query" : {
        "match_all" : {}
      },
      "filter" : {
        "and" : [{
            "query" : {
              "match" : {
                "airport" : "OTP"
              }
            }
          }, {
            "range" : {
              "id" : {
                "gte" : 1
              }
            }
          }
        ]
      }
    }
  }
}

This significantly increases performance and minimizes the amount of I/O, memory and CPU required when working against Elasticsearch. Note that the translation applies even if the user already has a query defined (on the DataFrame or the RDD), and it can be configured to work on analyzed (default) or not-analyzed terms, so whatever the Elasticsearch index configuration, one can get started right away.
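To make the translation step more concrete, here is a minimal, self-contained sketch of how a tree of filters could be turned into a Query DSL filter clause. It is not ES-Hadoop's actual implementation: the `Filter` case classes are simplified, hypothetical stand-ins for Spark SQL's filter types, `PushDownSketch` and its `strict` flag are names made up for this illustration, and the choice of a `match` query for analyzed fields versus a `term` filter for not-analyzed ones is an assumption about how the two modes could differ.

```scala
// Hypothetical, simplified stand-ins for Spark SQL filters;
// these are NOT the real Spark or ES-Hadoop classes.
sealed trait Filter
final case class EqualTo(field: String, value: Any) extends Filter
final case class GreaterThanOrEqual(field: String, value: Any) extends Filter
final case class And(left: Filter, right: Filter) extends Filter

object PushDownSketch {
  // Render a literal as JSON: strings are quoted and escaped, everything else uses toString.
  private def literal(v: Any): String = v match {
    case s: String => "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
    case other     => other.toString
  }

  // Translate one filter into an Elasticsearch 1.x filter clause.
  // Assumption: strict = false targets analyzed fields (match query),
  // strict = true targets not-analyzed fields (term filter).
  def translate(f: Filter, strict: Boolean = false): String = f match {
    case EqualTo(field, value) =>
      if (strict) s"""{"term":{"$field":${literal(value)}}}"""
      else s"""{"query":{"match":{"$field":${literal(value)}}}}"""
    case GreaterThanOrEqual(field, value) =>
      s"""{"range":{"$field":{"gte":${literal(value)}}}}"""
    case And(l, r) =>
      s"""{"and":[${translate(l, strict)},${translate(r, strict)}]}"""
  }

  // Wrap the filter clause in a filtered query over match_all.
  def toQuery(f: Filter, strict: Boolean = false): String =
    s"""{"query":{"filtered":{"query":{"match_all":{}},"filter":${translate(f, strict)}}}}"""

  def main(args: Array[String]): Unit = {
    // WHERE id >= 1 AND airport = "OTP"
    val where = And(GreaterThanOrEqual("id", 1), EqualTo("airport", "OTP"))
    println(toQuery(where))
    println(toQuery(where, strict = true))
  }
}
```

Running `main` prints one compact JSON document per mode; the first has the same overall shape as the filtered query shown earlier, with the range clause keyed by the `id` field.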

Moreover, in terms of usage, ES-Hadoop implements not only all of the Spark SQL Filters API but also the *Relation interfaces, so all Spark SQL operations (including table creation, insertion and SaveModes) are fully supported:

CREATE TABLE esIndex USING org.elasticsearch.spark.sql OPTIONS (path "spark/docs")
INSERT OVERWRITE TABLE esIndex SELECT * FROM existingSparkTable
// Spark 1.4 style (Scala)
import org.apache.spark.sql.SaveMode
val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.mode(SaveMode.ErrorIfExists)
  .format("org.elasticsearch.spark.sql").save("users/colors")

Note that the excellent Spark compatibility (1.0-1.4) is maintained.

Security improvements

On the security front, the SSL/TLS handshake has been improved so that protocol errors are better diagnosed and properly exposed to the user. In addition, ES-Hadoop officially supports PKI (Public Key Infrastructure) authentication and, to ease setup, features a dedicated documentation page on Security.

Also, the HDFS snapshot/restore plugin has been improved to work better in Kerberos environments, and the FileSystem acquisition has been reworked to minimize the number of client connections within the cluster.

Last but not least, the various libraries have been upgraded while maintaining backwards compatibility. This includes upgrading to Apache Storm 0.9.5, Apache Hive 1.2, Apache Pig 0.15 (pay attention if you are using it in a Hadoop 1.x environment) and, of course, Elasticsearch 1.6 (do upgrade to it).

Feedback

Let us know what you think about RC1! We love to hear from you on GitHub, Twitter or the forums (IRC works too).

Cheers,