Architectureedit

At the core, elasticsearch-hadoop integrates two distributed systems: Hadoop, a batch-oriented computing platform and Elasticsearch, a real-time search and analytics engine. From a high-level view both provide a computational component: Hadoop through Map/Reduce and Elasticsearch through its search and aggregation.

elasticsearch-hadoop goal is to connect these two entities so that they can transparently benefit from each other.

Map/Reduce and Shardsedit

A critical component for scalability is parallelism or splitting a task into multiple, smaller ones that execute at the same time, on different nodes in the cluster. The concept is present in both Hadoop through its splits (the number of parts in which a source or input can be divided) and Elasticsearch through shards (the number of parts in which a index is divided into).

In short, roughly speaking more input splits means more tasks that can read at the same time, different parts of the source. More shards means more buckets from which to read an index content (at the same time).

As such, elasticsearch-hadoop uses splits and shards as the main drivers behind the number of tasks executed within the Hadoop and Elasticsearch clusters as they have a direct impact on parallelism.

Hadoop splits as well as Elasticsearch shards play an important role regarding a system behavior - we recommend familiarizing with the two concepts to get a better understanding of your system runtime semantics.

Reading from Elasticsearchedit

Shards play a critical role when reading information from Elasticsearch. Since it acts as a source, elasticsearch-hadoop will create one Hadoop InputSplit per Elasticsearch shard; that is given a query that works against index I, elasticsearch-hadoop will dynamically discover the number of shards backing I and then for each shard will create an input split (which will determine the number of Hadoop tasks to be executed). With the default settings, Elasticsearch uses 5 primary shards per index which will result in the same number of tasks on the Hadoop side for each query.

elasticsearch-hadoop does not query the same shards - it iterates through all of them (primaries and replicas) using a round-robin approach. To avoid data duplication, only one shard is used from each shard group (primary and replicas).

A common concern (read optimization) for improving performance is to increase the number of shards and thus increase the number of tasks on the Hadoop side. Unless such gains are demonstrated through benchmarks, we recommend against such a measure since in most cases, an Elasticsearch shard can easily handle data streaming to a Hadoop task.

Writing to Elasticsearchedit

Writing to Elasticsearch is driven by the number of input splits (or Hadoop tasks) available. elasticsearch-hadoop detects the number of (primary) shards where the write will occur and distribute the writes between these. The more splits are available, the more mappers/reducers can write data in parallel to Elasticsearch.

Data co-locationedit

Whenever possible, elasticsearch-hadoop shares the Elasticsearch cluster information with Hadoop to facilitate data co-location. In practice, this means whenever data is read from Elasticsearch, the source nodes IPs are passed on to Hadoop to optimize task execution. If co-location is desired/possible, hosting the Elasticsearch and Hadoop clusters within the same rack will provide significant network savings.