We’re excited to announce the release of Elasticsearch for Apache Hadoop (aka ES-Hadoop) 6.5.0 built against Elasticsearch 6.5.0.
Getting ready for 7.0
Java 6 and 7 are deprecated
Java 7’s public updates ended in April of 2015. The Hadoop ecosystem has long since moved on to Java 8, and it’s high time that ES-Hadoop followed suit. In 6.5.0, ES-Hadoop will log a warning when starting up on any version of Java below Java 8. In 7.0.0, the classes will be compiled to target Java 8 instead of Java 6, as they have been in the past.
Changes to max docs per partition
In Elasticsearch 5.0.0, the concept of sliced scrolls was introduced. Clients can request that a scroll operation be split into smaller slices, with only one slice consumed per request. This allows multiple clients to read the contents of a scroll query at the same time. We added support for this feature in ES-Hadoop when it came out: ES-Hadoop samples the data before reading it to determine a reasonable number of slices for each shard. In some cases, this sampling round can take longer than the job would take if it simply read the data from each shard. Additionally, the default value for the maximum number of documents per partition is often either too low or misunderstood as being exact. In 6.5.0 we are deprecating the default value of the es.input.max.docs.per.partition setting, and in 7.0.0 it will be left unset by default. If you find that setting this value improves performance for your jobs, set it explicitly before upgrading to 7.0.0.
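If you want to keep today’s partitioning behavior after the default goes away, you can pass the setting explicitly when reading. A minimal PySpark sketch, assuming a reachable Elasticsearch cluster and the ES-Hadoop connector on the classpath (the index name and the partition size below are placeholders, not recommendations):

```python
# Hypothetical PySpark read through the ES-Hadoop connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-read-example").getOrCreate()

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      # Set the max docs per partition explicitly instead of relying on
      # the (now deprecated) default; 100000 is purely illustrative.
      .option("es.input.max.docs.per.partition", "100000")
      .load("my-index"))  # "my-index" is a placeholder index name
```

Running this requires a live cluster, so treat it as a configuration sketch rather than a standalone program.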
Field type promotion on read
Some may remember that in 6.0 we overhauled ES-Hadoop’s mapping handling code. This allowed us to resolve multiple indices to a single combined mapping instead of picking a random mapping and ignoring everything else. This change raised a question: what should happen if two mappings contain the same field name but with different field types? Now, in 6.5.0, instead of throwing an error when this is discovered, ES-Hadoop will attempt to promote the fields to the most common shared type. When resolving mappings that have conflicting numeric field types, the final field type that ES-Hadoop uses when reading will be a numeric type wide enough to fit the maximum value of either field. If a numeric value conflicts with a textual value, ES-Hadoop treats both fields as a string or keyword. We hope this smooths over the snags users have hit when reading multiple indices whose mappings have changed over time.
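To make the promotion rules concrete, here is a simplified Python illustration of the behavior described above. This is not ES-Hadoop’s actual implementation (which is written in Java and covers many more types); the type names and ordering are assumptions for the sketch:

```python
# Illustrative sketch of numeric type promotion: wider types come later,
# so the widest of two conflicting numeric types wins.
NUMERIC_ORDER = ["byte", "short", "integer", "long", "float", "double"]

def promote(a: str, b: str) -> str:
    """Resolve two conflicting mapped field types to one readable type."""
    if a == b:
        return a
    if a in NUMERIC_ORDER and b in NUMERIC_ORDER:
        # Pick the type that can fit the max value from either field.
        return NUMERIC_ORDER[max(NUMERIC_ORDER.index(a),
                                 NUMERIC_ORDER.index(b))]
    # A numeric/textual (or otherwise mixed) conflict falls back to keyword.
    return "keyword"

print(promote("integer", "long"))  # -> long
print(promote("long", "text"))     # -> keyword
```

The fallback to keyword mirrors the blog’s rule that a numeric field conflicting with a textual one is read as a string.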
This release contains a healthy handful of bug fixes, including improvements to client handling code when overwriting empty dataframes in Spark, better handling of internal data encoding, fixes to reading metadata values, and more. As always, you can find the full list of fixed items in our release notes.