Tribe Node

At Elasticsearch, we like to think differently about data and how to manage it. We know that companies can innovate only if they truly understand all aspects of their business through their data.

We see Elasticsearch being used for many different use cases within individual organizations: logging critical visitor data in one division, analyzing financial transactions in another, and deriving insights from social data in yet another. Often, these departments are spread out across the globe. What if, instead of each division having its own data silo, we could produce one coherent view of data from across the entire organization? What new insights could be gained if we were able to connect all of this data? To answer this question, we created the Tribe node.

The Tribe node connects to multiple Elasticsearch clusters and allows you to view them as if they were one big cluster. You may remember this functionality as part of the promise of Federated Search. What makes the Tribe node unique is that it doesn’t impose any restrictions on cross-cluster search or, for that matter, on any other core APIs.

Technically, the Tribe node is quite simple: it connects to multiple clusters and registers to receive their cluster state events. Whenever a cluster state changes, the Tribe node merges the states of all the connected clusters into a single global cluster state that all the different APIs can use. This means that searching across 10 shards in a single cluster, or across 10 shards where 6 of them live on one cluster and the other 4 on another, is exactly the same operation.

The Tribe node supports almost all APIs, with the exception of meta-level APIs such as the Create Index API. Meta-level APIs must be processed by the elected master node, but the Tribe node effectively has no single elected master. Instead, operations like creating an index must be executed on the individual cluster.

Another important point to note when using the Tribe node: index names must be unique across all clusters. Because the cluster states from multiple clusters are merged into a single global cluster state, if an index with the same name exists on two clusters then one of the two indices will simply be ignored.
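Depending on the Elasticsearch version you are running, a tribe.on_conflict setting may be available to control which copy wins when an index name clashes; a minimal sketch, using the tribe names from the example below (ldn and hk):

# elasticsearch.yml on the Tribe node: prefer the ldn copy on a name clash
# (other documented values: any, which picks one arbitrarily, and drop)
tribe.on_conflict: prefer_ldn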

Using the Tribe node couldn’t be simpler. Here is a demonstration of how to test it out by running two clusters on your local machine:

# first, start a single node LDN cluster
bin/elasticsearch --cluster.name ldn

# second, start another single node HK cluster
bin/elasticsearch --cluster.name hk

Elasticsearch makes it easy to run multiple instances on the same machine for testing purposes. We have just started two nodes, each belonging to a different cluster: the first node (the ldn cluster) uses ports 9300/9200 (transport/HTTP) and the second node (the hk cluster) uses ports 9301/9201.
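If you want to double-check that each node joined the cluster you expect, the cluster health API reports the cluster name; a quick sanity check (responses omitted):

# each endpoint should report its own cluster name
curl 'localhost:9200/_cluster/health?pretty'   # "cluster_name" : "ldn"
curl 'localhost:9201/_cluster/health?pretty'   # "cluster_name" : "hk"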

# now, let's start a Tribe node that connects to both clusters
bin/elasticsearch --tribe.ldn.cluster.name ldn --tribe.hk.cluster.name hk
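The same settings can of course live in the configuration file instead of on the command line; a minimal elasticsearch.yml sketch for the Tribe node, assuming the cluster names above:

# elasticsearch.yml for the Tribe node
tribe:
  ldn:
    cluster.name: ldn
  hk:
    cluster.name: hk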

This Tribe node comes up on ports 9302/9202. Now, let’s see which nodes are part of the Tribe node’s cluster state:

# see all the nodes that are part of the Tribe node (9202) state
curl 'localhost:9202/_cluster/state/nodes?pretty'

# response
{
  "nodes" : {
    "lykJOKu2Shaa0v4jjt9d4g" : {
      "name" : "Man-Eater/hk",
      "transport_address" : "inet[/10.12.1.196:9303]",
      "attributes" : {
        "tribe.name" : "hk",
        "client" : "true",
        "data" : "false"
      }
    },
    "fuFL42E1S_GSe_miEQKOvg" : {
      "name" : "Man-Eater/ldn",
      "transport_address" : "inet[/10.12.1.196:9304]",
      "attributes" : {
        "tribe.name" : "ldn",
        "client" : "true",
        "data" : "false"
      }
    },
    "I8iGiHehQES9G-ZWbw2roQ" : {
      "name" : "Stygyro",
      "transport_address" : "inet[/10.12.1.196:9300]",
      "attributes" : {
        "tribe.name" : "ldn"
      }
    },
    "4rVSDX2vQNe4b6WGMqhH7A" : {
      "name" : "Sepulchre",
      "transport_address" : "inet[/10.12.1.196:9301]",
      "attributes" : {
        "tribe.name" : "hk"
      }
    },
    "s7QC7w2gTpyu_pDM14JitQ" : {
      "name" : "Man-Eater",
      "transport_address" : "inet[/10.12.1.196:9302]",
      "attributes" : {
        "client" : "true",
        "data" : "false"
      }
    }
  }
}

Let me explain the above output, as it helps to illustrate how the Tribe node works:

  • Stygyro: This is the first node we started (9300). The Tribe node automatically added an attribute to it called tribe.name (ldn) to show which cluster it belongs to.
  • Sepulchre: The second node we started (9301). The Tribe node automatically added tribe.name (hk) to show which cluster it belongs to.
  • Man-Eater: The Tribe node we started (9302).
  • Man-Eater/hk: The internal client node that the Tribe node uses to connect to the hk cluster.
  • Man-Eater/ldn: The internal client node that the Tribe node uses to connect to the ldn cluster.
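For a more compact, human-readable view of the same membership, the _cat API (available since Elasticsearch 1.0) is worth a try:

# compact view of all the nodes the Tribe node knows about
curl 'localhost:9202/_cat/nodes?v'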

Now, let’s see how we can interact with the nodes and the Tribe node. We’ll start by creating two indices, one on each cluster. Remember, since index creation is a meta-level operation, those indices need to be created explicitly on each individual cluster.

# Create index ldn_index on ldn cluster (9200) directly
curl -XPUT localhost:9200/ldn_index

# Create index hk_index on hk cluster (9201) directly
curl -XPUT localhost:9201/hk_index

Once created, we can easily index data through the Tribe node (9202) to the respective indices:

curl -XPUT localhost:9202/ldn_index/data/1 -d '{
  "desc" : "heya from ldn"
}'

curl -XPUT localhost:9202/hk_index/data/1 -d '{
  "desc" : "heya from hk"
}'

Note: the Tribe node automatically redirected each indexing request to the correct cluster.
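You can verify this redirection by fetching each document directly from its own cluster; a quick check, reusing the documents we just indexed:

# each document lives on the cluster that owns its index
curl 'localhost:9200/ldn_index/data/1?pretty'
curl 'localhost:9201/hk_index/data/1?pretty'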

Now, let’s do a simple search that spans all the data across all clusters and all indices:

# execute search
curl 'localhost:9202/_search?pretty'

# response
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "ldn_index",
      "_type" : "data",
      "_id" : "1",
      "_score" : 1.0, "_source" : {
  "desc" : "heya from ldn"
}
    }, {
      "_index" : "hk_index",
      "_type" : "data",
      "_id" : "1",
      "_score" : 1.0, "_source" : {
  "desc" : "heya from hk"
}
    } ]
  }
}

See how the Tribe node fanned out our search request across both clusters and merged the results from each into a single global result set? Any other core operation (get, suggest, percolation, etc.) is also fully supported by the Tribe node.
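For example, a get request through the Tribe node is transparently routed to whichever cluster holds the index:

# fetch a document through the Tribe node; it is served by the ldn cluster
curl 'localhost:9202/ldn_index/data/1?pretty'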

I will end with one note on a Kibana feature that was added in order to better support the Tribe node.

Imagine having two different logging clusters, fed by Logstash. The cluster in Hong Kong creates index names with an hk[time] pattern, and the cluster in London uses an ldn[time] pattern. Kibana now allows you to specify multiple index patterns when querying, which makes it easy to explore data that exists on either cluster, or on both of them, using the Tribe node to create a single coherent view. This single view allows you to put your data into context, making it easier to spot patterns that would otherwise be missed.
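As an illustration, the same multi-pattern trick works against the Tribe node’s search API directly; the hk* and ldn* wildcards below assume indices following the naming above:

# search the indices of both logging clusters in a single request
curl 'localhost:9202/hk*,ldn*/_search?pretty'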

Hopefully, this post gives you a glimpse of how the Tribe node can be used. We look forward to your feedback on the mailing list or Twitter!