The Search AI Company

Build tailored experiences with Elastic.

Elastic Search AI Platform overview

Scale your business with Elastic Partners

Partner overview

ELK Stack

Search and analytics, data ingestion, and visualization – all at your fingertips.

ELK Stack overview

By developers, for developers

Elastic Cloud

Unlock the power of real-time insights with Elastic on your preferred cloud provider.

Elastic Cloud overview

Generative AI

Prototype and integrate with LLMs faster using search AI.

Generative AI overview

Search

Discover a world of AI possibilities — built with the power of search.

Search Labs

Search overview

Security

Protect, investigate, and respond to cyber threats with AI-driven security analytics.

Security Labs

Security overview

Observability

Unify app and infrastructure visibility to proactively resolve issues.

Observability Labs

Observability overview

By solution

See how customers search, solve, and succeed — all on one Search AI Platform.

All customer stories

Industries

Exceed customer expectations and go to market faster.

Industries overview

Customer spotlight

Cisco saves 5,000 support engineer hours per month

Sitecore automates 96 percent of security workflows with Elastic

Comcast transforms customer experiences with Elastic Observability

Research

Stay at the forefront of innovation with technical tips from the experts.

Build

Code with other developers to create a better Elastic, together.

Learn

Unleash the possibilities of your data and grow your skill set.

Connect

Keep informed about the latest tech and news from Elastic.

Have questions?

New

The executive guide to generative AI

About us Partners Support|Login

Elasticsearch for Apache Hadoop Cloud or restricted environments

In an ideal setup, elasticsearch-hadoop achieves best performance when Elasticsearch and Hadoop are fully accessible from every other, that is each node on the Hadoop side can access every node inside the Elasticsearch cluster. This allows maximum parallelism between the two system and thus, as the clusters scale out so does the communication between them.

However not all environments are setup like that, in particular cloud platforms such as Amazon Web Services, Microsoft Azure or Google Compute Engine or dedicated Elasticsearch services like Cloud that allow computing resources to be rented when needed. The typical setup here is for the spawned nodes to be started in the cloud, within a dedicated private network and be made available over the Internet at a dedicated address. This effectively means the two systems, Elasticsearch and Hadoop/Spark, are running on two separate networks that do not fully see each other (if at all); rather all access to it goes through a publicly exposed gateway.

Running elasticsearch-hadoop against such an Elasticsearch instance will quickly run into issues simply because the connector once connected, will discover the cluster nodes, their IPs and try to connect to them to read and/or write. However as the Elasticsearch nodes are using non-routeable, private IPs and are not accessible from outside the cloud infrastructure, the connection to the nodes will fail.

There are several possible workarounds for this problem:

Collocate the two clusters

The obvious solution is to run the two systems on the same network. This will get rid of all the network logistics hops, improve security (no need to open up ports) and most importantly have huge performance benefits since everything will be co-located. Even if your data is not in the cloud, moving a piece of it to quickly iterate on it can yield large benefits. The data will likely go back and forth between the two networks anyway so you might as well do so directly, in bulk.

Make the cluster accessible

As the nodes are not accessible from outside, fix the problem by assigning them public IPs. This means that at least one of the clusters (likely Elasticsearch) will be fully accessible from outside which typically is a challenge both in terms of logistics (large number of public IPs required) and security.

Use a dedicated series of proxies/gateways

If exposing the cluster is not an option, one can chose to use a proxy or a VPN so that the Hadoop cluster can transparently access the Elasticsearch cluster in a different network. By using an indirection layer, the two networks can transparently communicate with each other. Do note that usually this means the two networks would know how to properly route the IPs from one to the other. If the proxy/VPN solution does not handle this automatically, Elasticsearch might help through its network settings in particular network.host and network.publish_host which control what IP the nodes bind and in particular publish or advertise to their clients. This allows a certain publicly-accessible IP to be broadcasted to the clients to allow access to a node, even if the node itself does not run on that IP.

Configure the connector to run in WAN mode

Introduced in 2.2, elasticsearch-hadoop can be configured to run in WAN mode that is to restrict or completely reduce its parallelism when connecting to Elasticsearch. By setting es.nodes.wan.only, the connector will limit its network usage and instead of connecting directly to the target resource shards, it will make connections to the Elasticsearch cluster only through the nodes declared in es.nodes settings. It will not perform any discovery, ignore data or client nodes and simply make network call through the aforementioned nodes. This effectively ensures that network access happens only through the declared network nodes.

Last but not least, the further the clusters are and the more data needs to go between them, the lower the performance will be since each network call is quite expensive.