28 April 2015

Elasticsearch Hadoop 2.1.0.Beta4 released

By Costin Leau

The 4th Beta of Elasticsearch for Apache Hadoop (aka es-hadoop) has been released. This release adds a plethora of new features and enhancements to the connector in various areas:

Results returned as JSON

Since Beta4, data read from Elasticsearch can be returned in JSON format, in a highly efficient manner, through the es.output.json property. This is useful for serialization purposes such as saving information to disk or sending it over the wire. As always, the feature is available in all library integrations.
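As a rough illustration for the Spark integration, the sketch below reads documents as raw JSON through esJsonRDD (mentioned further down in this post). The node address localhost:9200 and the spark/index resource are assumptions made for the example, not values from the release.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esJsonRDD to SparkContext

object ReadAsJson {
  // Assumed resource (index/type), used only for illustration
  val resource = "spark/index"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-json-read")
      .setMaster("local[*]")
      .set("es.nodes", "localhost:9200") // assumed local node
    val sc = new SparkContext(conf)
    try {
      // Each element is a (document id, raw JSON source) pair
      val docs = sc.esJsonRDD(resource)
      docs.take(5).foreach { case (id, json) => println(id + " -> " + json) }
    } finally {
      sc.stop()
    }
  }
}
```

Since the documents arrive as plain strings, they can be written to disk or forwarded as-is, without being converted to Scala collections first.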

Obtain document metadata

Additionally, it is now possible to return the metadata for each document, alongside the actual source information. Whether one is interested in an index type or a document version or the remaining time-to-live, this information is now made available without any extra network cost.
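A minimal sketch of what this might look like from Spark, assuming the es.read.metadata setting and the connector's default _metadata field name; the spark/index resource and the local node are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds esRDD to SparkContext

object ReadWithMetadata {
  // Ask the connector to return document metadata next to the source
  val readCfg = Map("es.read.metadata" -> "true")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-read-metadata")
      .setMaster("local[*]")
      .set("es.nodes", "localhost:9200") // assumed local node
    val sc = new SparkContext(conf)
    try {
      // esRDD yields (document id, document as a Map) pairs
      val docs = sc.esRDD("spark/index", readCfg)
      docs.take(5).foreach { case (id, doc) =>
        // "_metadata" is assumed to be the field holding the metadata
        val meta = doc.get("_metadata")
        println(id + " -> " + meta)
      }
    } finally {
      sc.stop()
    }
  }
}
```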

Inclusion / Exclusion of fields

On the mapping front, it is now possible to specify which fields are included in or excluded from the data about to be written to Elasticsearch. This is handy not only for quick transformations of the data but also for specifying document metadata without storing it:

# extracting the id from the field called 'uuid'
es.mapping.id = uuid
# specifying a parent with id '123'
es.mapping.parent = <123>
# combine include / exclude for complete control
# include
es.mapping.include = u*, foo.*
# exclude
es.mapping.exclude = *.description
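The same settings can also be passed per write from Spark. The sketch below is one possible arrangement, not the only one: it uses the uuid field as the document id and then excludes it (together with any nested description field) from the stored source. The sample documents, the spark/index resource and the local node are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

object WriteWithMapping {
  // Use 'uuid' as the id, then keep it (and nested descriptions) out of the source
  val writeCfg = Map(
    "es.mapping.id" -> "uuid",
    "es.mapping.exclude" -> "uuid, *.description"
  )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-mapping-include-exclude")
      .setMaster("local[*]")
      .set("es.nodes", "localhost:9200") // assumed local node
    val sc = new SparkContext(conf)
    try {
      // Hypothetical documents; only their shape matters here
      val docs: Seq[Map[String, Any]] = Seq(
        Map("uuid" -> "a-1", "title" -> "first",
          "details" -> Map("description" -> "verbose text", "size" -> 3)),
        Map("uuid" -> "a-2", "title" -> "second",
          "details" -> Map("description" -> "more text", "size" -> 5))
      )
      sc.makeRDD(docs).saveToEs("spark/index", writeCfg)
    } finally {
      sc.stop()
    }
  }
}
```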

Client-node routing

For clusters in restricted environments, it is now possible to use the connector through client nodes only. That is, rather than accessing the cluster data nodes directly, the connector will use the client nodes instead (which do need to have the HTTP(S) port opened) and ask those to do the work on its behalf. This will impact parallelism, as the connector will not communicate directly with the data nodes; however, unless a lot of data is read or written and locality matters, the performance penalty is insignificant.
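A minimal configuration sketch, assuming the es.nodes.client.only setting is what enables this mode; the client-1 and client-2 host names are hypothetical.

```scala
import org.apache.spark.SparkConf

object ClientOnlyRouting {
  // Point the connector at client nodes only (hypothetical host names)
  val settings = Map(
    "es.nodes" -> "client-1:9200,client-2:9200",
    "es.nodes.client.only" -> "true"
  )

  // Build a SparkConf carrying the connector settings
  def conf: SparkConf =
    new SparkConf().setAppName("es-client-only").setAll(settings)
}
```

Any job created from this configuration would then send its requests through the listed client nodes rather than contacting the data nodes.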

Spark improvements

The various libraries have been upgraded and enhanced; however, by far the most updates were applied to the Spark integration. Spark 1.2 and 1.3 are officially supported in es-hadoop Beta4, both Core and SQL. Unfortunately, due to some breaking changes in Spark SQL, Elasticsearch Hadoop now provides two different versions: one for Spark 1.0-1.2 and another for Spark 1.3 (and hopefully higher). Core users can migrate between them transparently; those using Spark SQL, however, need to adapt from SchemaRDD to the newly introduced DataFrame API.

Speaking of Spark SQL, the DataSource API is now supported in both styles (Spark SQL 1.2 and 1.3), so one can use the connector in a fully declarative fashion:

val dataframe = sql.load("spark/index", "org.elasticsearch.spark.sql")

Furthermore, through the DataSource API, the connector understands the Spark SQL operations applied to it and is thus able to push them down to Elasticsearch, optimizing the queries made.
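To make this more concrete, here is a small sketch against the Spark 1.3 DataFrame API. The name and age fields are hypothetical, and whether a particular filter is actually translated into an Elasticsearch query depends on the connector version and the operation involved.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PushDownSketch {
  // DataSource name taken from the snippet above
  val source = "org.elasticsearch.spark.sql"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-pushdown")
      .setMaster("local[*]")
      .set("es.nodes", "localhost:9200") // assumed local node
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      val df = sqlContext.load("spark/index", source)
      // The filter and projection are candidates for push-down to Elasticsearch
      val adults = df.filter(df("age") >= 18).select("name", "age")
      adults.show()
    } finally {
      sc.stop()
    }
  }
}
```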

Various other enhancements were made: saveToEsWithMeta (to allow metadata to be specified separately for each document, at runtime), esJsonRDD (to return the data in unprocessed JSON format), out-of-the-box indexing of Scala case classes and JavaBeans, and binaries not just for Scala 2.10 (the default) but also for Scala 2.11 (make sure to use the artifact with the _2.11 suffix).
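As a sketch of the per-document metadata feature: to the best of my knowledge the Scala method is called saveToEsWithMeta and the metadata keys come from the Metadata enum in org.elasticsearch.spark.rdd. The airport documents, the airports/2015 resource and the local node are sample values only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._               // adds saveToEsWithMeta to pair RDDs
import org.elasticsearch.spark.rdd.Metadata._  // ID, TTL, VERSION, ...

object WriteWithMeta {
  // Assumed target resource (index/type)
  val resource = "airports/2015"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-write-with-meta")
      .setMaster("local[*]")
      .set("es.nodes", "localhost:9200") // assumed local node
    val sc = new SparkContext(conf)
    try {
      // Sample documents
      val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
      val muc = Map("iata" -> "MUC", "name" -> "Munich")
      // Metadata specified separately for each document
      val otpMeta = Map(ID -> 1, TTL -> "3h")
      val mucMeta = Map(ID -> 2, VERSION -> "23")
      // The key of each pair carries the metadata, the value carries the document
      sc.makeRDD(Seq((otpMeta, otp), (mucMeta, muc))).saveToEsWithMeta(resource)
    } finally {
      sc.stop()
    }
  }
}
```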

Strata London

Next week, Elastic will be attending Strata London. If you're interested in Elasticsearch or the ELK stack, please drop by our booth (#408). Furthermore, you're cordially invited on Thursday at 10:55 to hear "Search evolved" by yours truly.

Also join us for the meetup (please RSVP, seats are limited) on Wednesday, May 6th, in London at the new Elastic office.

We look forward to your feedback on Elasticsearch Hadoop 2.1.0.Beta4. The binaries are available on the download page and the new features are explained in the reference documentation. As always, you can file bugs or feature requests on GitHub.