When using elasticsearch-hadoop, it is important to be aware of the following Hadoop configurations, which influence how Map/Reduce tasks are executed and, in turn, how elasticsearch-hadoop behaves.
Unfortunately, these settings need to be set up manually before the job or script is configured: elasticsearch-hadoop is called too late in the life-cycle, after the tasks have already been dispatched, and as such cannot influence the execution anymore.
As most of the tasks in a job are coming to a close, speculative execution will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities.
-- Yahoo! developer network
In other words, speculative execution is an optimization, enabled by default, that allows Hadoop to create duplicates of the tasks it considers hung or slowed down. When crunching data or reading resources, duplicate tasks are harmless and at most waste computation resources; when writing data to an external store, however, they can cause data corruption through duplicates or unnecessary updates. The speculative execution behavior can be triggered by external factors (such as network or CPU load, which in turn cause false positives) even in stable environments (virtualized clusters are particularly prone to this), and it has a direct impact on data. For this reason, elasticsearch-hadoop disables this optimization for data safety.
Please check your library settings and disable this feature. If you encounter more data than expected, double and triple check this setting.
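The effect on an external store can be illustrated with a toy model. The sketch below is not Hadoop or elasticsearch-hadoop code: `AppendOnlyStore`, `runAttempt` and `storedDocs` are hypothetical names, and the store is assumed to append every document it receives without any identifier-based deduplication. It also assumes the worst case, in which the speculative copy runs to completion alongside the original attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SpeculativeWriteDemo {

    /** Hypothetical stand-in for an external store that appends every document it receives. */
    static final class AppendOnlyStore {
        private final List<String> docs = new ArrayList<>();

        void index(String doc) {
            docs.add(doc);
        }

        int size() {
            return docs.size();
        }
    }

    /** One task attempt: writes every record of its input split to the store. */
    static void runAttempt(List<String> split, AppendOnlyStore store) {
        for (String record : split) {
            store.index(record);
        }
    }

    /** Runs the given number of attempts over the same split and returns the stored document count. */
    static int storedDocs(List<String> split, int attempts) {
        AppendOnlyStore store = new AppendOnlyStore();
        for (int i = 0; i < attempts; i++) {
            runAttempt(split, store);
        }
        return store.size();
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList("doc-1", "doc-2", "doc-3");
        // One attempt stores each record once.
        System.out.println("single attempt: " + storedDocs(split, 1));
        // A redundant attempt over the same split stores every record a second time.
        System.out.println("with speculative copy: " + storedDocs(split, 2));
    }
}
```

In this model three input records become six stored documents as soon as a second attempt processes the same split, which is the "more data than expected" symptom.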
Speculative execution can be disabled for the map and reduce phases (we recommend disabling it in both) by setting the following two properties to false:
mapred.map.tasks.speculative.execution
mapred.reduce.tasks.speculative.execution
One can either set the properties by name manually on the job configuration:
jobConf.setSpeculativeExecution(false);
// or
configuration.setBoolean("mapred.map.tasks.speculative.execution", false);
configuration.setBoolean("mapred.reduce.tasks.speculative.execution", false);
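or by passing them as arguments on the command line (the jar comes first; the -D options are picked up only if the job parses generic options, for example through ToolRunner):
$ bin/hadoop jar <jar> -Dmapred.map.tasks.speculative.execution=false \
      -Dmapred.reduce.tasks.speculative.execution=false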
Apache Hive has its own setting for speculative execution, namely hive.mapred.reduce.tasks.speculative.execution. It is enabled by default, so change it to false in your scripts:
set hive.mapred.reduce.tasks.speculative.execution=false;
Note that although this setting has been deprecated in Hive 0.10 and may trigger a warning, you should still double check that speculative execution is actually disabled.
Out of the box, Spark has speculative execution disabled. Double check this is the case through the spark.speculation setting (false to disable it, true to enable it).
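As a way to double check these settings before submitting a job, the following minimal sketch reports which speculative execution properties are still effectively enabled. It is an illustration only: a plain `Map` stands in for whatever configuration object your job uses, `SpeculationCheck` and `stillEnabled` are hypothetical names, and the defaults are the ones described in this section (enabled for the Hadoop and Hive properties, disabled for Spark).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SpeculationCheck {

    // Property name -> value assumed when the property is unset (defaults as described in this section).
    private static final Map<String, Boolean> DEFAULTS = new LinkedHashMap<>();
    static {
        DEFAULTS.put("mapred.map.tasks.speculative.execution", true);
        DEFAULTS.put("mapred.reduce.tasks.speculative.execution", true);
        DEFAULTS.put("hive.mapred.reduce.tasks.speculative.execution", true);
        DEFAULTS.put("spark.speculation", false);
    }

    /** Returns those of the given properties that still leave speculative execution enabled. */
    static List<String> stillEnabled(Map<String, String> settings, String... keys) {
        List<String> enabled = new ArrayList<>();
        for (String key : keys) {
            Boolean defaultValue = DEFAULTS.get(key);
            if (defaultValue == null) {
                throw new IllegalArgumentException("Unknown property: " + key);
            }
            String raw = settings.get(key);
            // An unset property falls back to its default; anything other than "true" parses as false.
            boolean value = (raw == null) ? defaultValue : Boolean.parseBoolean(raw.trim());
            if (value) {
                enabled.add(key);
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("mapred.map.tasks.speculative.execution", "false");
        // The reduce-side property is left unset, so its default (enabled) applies.
        System.out.println(stillEnabled(settings,
                "mapred.map.tasks.speculative.execution",
                "mapred.reduce.tasks.speculative.execution"));
    }
}
```

Run as-is, the example reports only the reduce-side property, since the map-side one was explicitly set to false. Passing just the keys relevant to your stack (Hadoop, Hive or Spark) keeps the check from flagging properties that do not apply to the job.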