I am happy to announce that the second milestone of Elasticsearch for Apache Hadoop 1.3 (also known as es-hadoop) has been released. M2 brings several new major features to the table, including:
Support for Elasticsearch 1.0 (RC1+)
es-hadoop supports Elasticsearch 1.0 RC1 or higher while preserving compatibility with the Elasticsearch 0.90.x line. The code base automatically detects the target Elasticsearch version and uses the appropriate features.
es-hadoop 1.3 M1 enabled new indexes to be created in Elasticsearch directly from Hadoop jobs. M2 takes this feature several steps forward, enabling both index creation and updating. Furthermore, one can specify all the metadata options if needed: document parent, time-to-live, routing and timestamp.
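As a rough illustration of the write options described above, the sketch below assembles the kind of job settings involved. The property names (`es.resource`, `es.write.operation`, `es.mapping.*`) are quoted from the es-hadoop configuration reference and should be verified against the version in use; the helper `write_settings`, the index name and the field names are hypothetical.

```python
# Illustrative only: verify the property names against the es-hadoop
# configuration reference; the index and field names below are made up.
ALLOWED_OPERATIONS = {"index", "create", "update"}
ALLOWED_METADATA = {"id", "parent", "ttl", "routing", "timestamp"}


def write_settings(resource, operation="index", **metadata_fields):
    """Build a dict of es-hadoop style settings for a write job.

    Each keyword argument names a metadata option and the document field
    whose value should be used for it.
    """
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    settings = {"es.resource": resource, "es.write.operation": operation}
    for option, field in metadata_fields.items():
        if option not in ALLOWED_METADATA:
            raise ValueError(f"unknown metadata option: {option}")
        settings[f"es.mapping.{option}"] = field
    return settings


# Update existing documents, taking the id and routing from document fields.
settings = write_settings(
    "radio/artists", operation="update", id="artist_id", routing="country"
)
```

In a real job these key/value pairs would be set on the job configuration rather than kept in a dictionary.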
Higher-level abstractions on top of Map/Reduce, such as Cascading, Apache Hive and Apache Pig, provide mapping and data types on top of raw data for easier manipulation. With es-hadoop, one can also define field aliases, decoupling the Hadoop library's structural declaration from the underlying document fields for cleaner syntax and improved readability.
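To make the idea of field aliasing concrete, here is a small standalone sketch. It does not use es-hadoop itself; the `alias:field` string format is modeled on the `es.mapping.names` setting, which is an assumption to check against the documentation, and the helper and field names are invented for the example.

```python
def parse_aliases(spec):
    """Parse 'alias:field, alias2:field2' into a dict of alias -> field."""
    aliases = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue
        alias, _, target = pair.partition(":")
        aliases[alias.strip()] = target.strip()
    return aliases


def to_document(row, aliases):
    """Rename aliased columns to their document field names; keep the rest."""
    return {aliases.get(name, name): value for name, value in row.items()}


# The Hadoop-side declaration uses 'ts' and 'link'; the documents use
# '@timestamp' and 'url_123'.
aliases = parse_aliases("ts:@timestamp, link:url_123")
doc = to_document(
    {"ts": "2014-01-20", "link": "http://example.org", "name": "x"}, aliases
)
```

The table or tuple definition can then use names that are valid and readable in Hive or Pig, while the documents keep whatever field names they already have.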
Performance is a top priority for es-hadoop. As such, M2 loads only the data it needs from Elasticsearch; rather than loading the entire source document, es-hadoop retrieves just the required fields, a technique known as projection. This reduces network, CPU and memory usage.
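The effect of projection can be sketched in a few lines. This is a conceptual example only, not how es-hadoop implements it: in es-hadoop the field selection happens on the Elasticsearch side, whereas the hypothetical `project` helper below simply shows how much smaller a projected document is.

```python
import json


def project(source, fields):
    """Keep only the requested fields of a document."""
    return {name: source[name] for name in fields if name in source}


full = {
    "name": "Radiohead",
    "url": "http://example.org/radiohead",
    "bio": "x" * 1000,  # stand-in for a large field the job never reads
    "year": 1985,
}
slim = project(full, ["name", "year"])

# Serialized sizes give a feel for the network and memory savings.
full_size = len(json.dumps(full))
slim_size = len(json.dumps(slim))
```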
Besides supporting the Map/Reduce, Cascading, Hive and Pig data types, M2 also allows JSON data to be indexed. Don’t forget that es-hadoop can now extract the document metadata (such as id) if instructed to do so.
Parallel writes and ingest mitigation
Users with high ingestion volumes will be happy to hear that M2 provides significant updates on the write front. Now, all writes to Elasticsearch are parallelized across the target shards; this prevents network bottlenecks as the load is spread across multiple nodes. When Elasticsearch is overloaded, regardless of the reason, 1.3 M2 automatically retries only the rejected payload before ingesting further data. This temporary throttling prevents data spillage.
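The back-pressure behavior can be approximated with the following sketch. It is a simplified model under stated assumptions, not the es-hadoop implementation: `send` is a hypothetical callable that submits a batch and returns the entries the cluster rejected, and the retry limit and wait time are arbitrary.

```python
import time


def bulk_write(send, docs, max_retries=3, wait_seconds=0.0):
    """Write docs, resending only the entries that were rejected.

    Returns the number of retries that were needed. No new data is
    accepted until the current batch has been fully written.
    """
    pending = list(docs)
    retries = 0
    while pending:
        rejected = send(pending)
        if not rejected:
            break
        if retries == max_retries:
            raise RuntimeError(f"{len(rejected)} entries still rejected")
        retries += 1
        time.sleep(wait_seconds)  # give the cluster time to catch up
        pending = rejected  # retry the rejected entries only
    return retries


# A fake cluster that accepts at most two entries per request.
accepted = []


def fake_send(batch):
    accepted.extend(batch[:2])
    return batch[2:]


retries = bulk_write(fake_send, ["a", "b", "c", "d", "e"])
```

Because only the rejected entries are resent, documents that were already accepted are not written twice.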
In a similar vein, es-hadoop 1.3 M2 provides automatic cluster discovery and, in case of network errors, automatic fallback to the rest of the available nodes.
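The fallback logic amounts to trying the next known node when one fails. The sketch below illustrates the pattern only; `transport` is a hypothetical callable standing in for the actual network call, and the node addresses are made up.

```python
def send_with_failover(nodes, request, transport):
    """Try each node in turn and return (node, response) from the first
    one that answers; raise if every node fails."""
    last_error = None
    for node in nodes:
        try:
            return node, transport(node, request)
        except ConnectionError as err:
            last_error = err  # remember the failure and try the next node
    raise ConnectionError("all nodes failed") from last_error


# A fake transport in which the first node is unreachable.
def fake_transport(node, request):
    if node == "es1:9200":
        raise ConnectionError(f"{node} unreachable")
    return f"{request} handled by {node}"


node, response = send_with_failover(
    ["es1:9200", "es2:9200", "es3:9200"], "bulk", fake_transport
)
```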
Snapshot/Restore for HDFS
Last but not least, 1.3 M2 introduces snapshot and restore support for HDFS through a separate, umbrella project (elasticsearch-repository-hdfs). Once installed in the Elasticsearch cluster, this standalone plugin enables any DFS storage (from the omnipresent HDFS to pluggable implementations such as Amazon S3 or Google Cloud Storage) to be used for backing up and restoring data within running Elasticsearch clusters.
P.S. If you are migrating from M1, you might want to read up on the configuration changes in M2.