11 April 2017 Engineering

Multi data path bug in Elasticsearch 5.3.0

By Clinton Gormley

If you use a custom data path with Elasticsearch 5.3.0, you may be subject to a bug that can cause data loss unless it is handled properly.

The bug is triggered when both of the following are true:

  • default.path.data is configured on the command line, which the RPM and Debian packages do by default.
  • path.data is configured in the elasticsearch.yml file as an array containing one or more paths.

The default.path.data command line setting is used to tell Elasticsearch which default data path to use unless path.data is configured either in the config file or on the command line. The bug occurs because path.data, when specified as an array, is merged with default.path.data instead of replacing it.

How to tell if you are affected

First, this bug affects Elasticsearch 5.3.0 only. You can tell if you are affected by comparing the expected list of data paths with those returned by the _nodes API.

For example, imagine your elasticsearch.yml file contains something like the following:

path.data:
  - /mnt/path_1
  - /mnt/path_2
  - /mnt/path_3
        

or

path.data: [ /mnt/path_1, /mnt/path_2, /mnt/path_3 ]
        

Retrieve the path settings for all nodes with the following request:

curl -XGET "http://localhost:9200/_nodes?pretty&filter_path=nodes.*.settings.path.data"
        

which returns a response like this:

{
  "nodes": {
    "GrrMUWcCTlKprhcvROUIoQ": {
      "settings": {
        "path": {
          "data": [
            "/var/lib/elasticsearch",
            "/mnt/path_1",
            "/mnt/path_2",
            "/mnt/path_3"
          ]
        }
      }
    }
  }
}
        

You see the extra /var/lib/elasticsearch entry? That is coming from default.path.data. If an extra entry is present, then you are affected and need to take action.
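
The comparison described above can be scripted. The following is a minimal sketch, not an official tool: the function name extra_data_paths is made up for this example, and the commented usage assumes that jq is installed and that every node reports path.data as an array.

```shell
# Print every path in the "actual" list that is not in the "expected" list.
# Both arguments are newline-separated lists of data paths.
extra_data_paths() {
  expected=$1
  actual=$2
  printf '%s\n' "$actual" | while IFS= read -r path; do
    [ -n "$path" ] || continue
    # -x matches whole lines only, -F treats the path as a fixed string.
    if ! printf '%s\n' "$expected" | grep -qxF -- "$path"; then
      printf '%s\n' "$path"
    fi
  done
}

# Hypothetical usage (assumes jq and a node listening on localhost:9200):
#   expected='/mnt/path_1
#   /mnt/path_2
#   /mnt/path_3'
#   actual=$(curl -s "http://localhost:9200/_nodes?filter_path=nodes.*.settings.path.data" \
#     | jq -r '.nodes[].settings.path.data[]')
#   extra_data_paths "$expected" "$actual"
```

Any path it prints is one that Elasticsearch is using but that you did not configure in path.data.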

The impact of the bug

There are two possible outcomes from this bug.

  • When multiple data paths are specified, Elasticsearch allocates each shard to one of the data paths. This means that you may have one or more shards located in the path specified in default.path.data.
  • If you are running more than one node on a single machine, the second node may refuse to start because the default.path.data path is already locked by the first node.

Working around the bug

This fix can be applied node-by-node with rolling restarts. It does not require a full cluster restart. Do not try to apply this fix on a running node.

Stop the first Elasticsearch node, then:

Fix the configuration

Change your path.data configuration to use a comma-separated string instead of an array, as shown in this example:

path.data: /mnt/path_1,/mnt/path_2,/mnt/path_3
            

This will overwrite the default.path.data setting instead of merging settings.

Fix data directories

Next, you will need to move any data from the path specified in default.path.data to one or more of the other data paths.

For instance, assuming your default.path.data is set to /var/lib/elasticsearch, you will need to copy any data in that path to one of the other configured paths: /mnt/path_1, /mnt/path_2, or /mnt/path_3.

If one of the other paths has sufficient space to hold all of the contents of /var/lib/elasticsearch then you can copy all the data to a single path as follows:

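cp -vr /var/lib/elasticsearch/. /mnt/path_1/

The trailing /. on the source path is important: it makes cp copy the contents of /var/lib/elasticsearch into /mnt/path_1 rather than creating /mnt/path_1/elasticsearch.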

If there is too much data in /var/lib/elasticsearch to fit in a single path, then you can copy individual indices to different paths.

First, list the indices in /var/lib/elasticsearch:

ls /var/lib/elasticsearch/nodes/0/indices/
            

In our example there are three indices which need to be copied:

lL6xLDIrSfSqysFp-fnk8g/ lW0CidrcR9aIBYdI-wbyBg/ n6EnZ0MMSMmSVl4ktoX9ig/
            

To copy the lL6xLDIrSfSqysFp-fnk8g/ index to /mnt/path_1, we need to create the correct path in case it doesn’t already exist:

mkdir -p /mnt/path_1/nodes/0/indices/lL6xLDIrSfSqysFp-fnk8g/
            

Then copy the index:

cp -vr /var/lib/elasticsearch/nodes/0/indices/lL6xLDIrSfSqysFp-fnk8g/. /mnt/path_1/nodes/0/indices/lL6xLDIrSfSqysFp-fnk8g/
            

Repeat this process for all remaining indices.
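
If there are many indices, the mkdir and cp steps can be wrapped in a small helper. This is a sketch only: copy_index is a hypothetical name, and it assumes the nodes/0/indices layout shown above. The source is given as `<dir>/.` so that cp copies the directory's contents rather than nesting the directory inside the destination.

```shell
# Copy one index directory from one data path to another.
#   $1 = source data path       (e.g. /var/lib/elasticsearch)
#   $2 = index directory name   (as listed by ls above)
#   $3 = destination data path  (e.g. /mnt/path_2)
copy_index() {
  src="$1/nodes/0/indices/$2"
  dest="$3/nodes/0/indices/$2"
  if [ ! -d "$src" ]; then
    echo "no such index directory: $src" >&2
    return 1
  fi
  # Create the destination in case it doesn't already exist.
  mkdir -p "$dest"
  # "/." copies the contents of $src into $dest.
  cp -vr "$src/." "$dest/"
}

# Hypothetical usage, spreading the remaining example indices over two paths:
#   copy_index /var/lib/elasticsearch lW0CidrcR9aIBYdI-wbyBg /mnt/path_2
#   copy_index /var/lib/elasticsearch n6EnZ0MMSMmSVl4ktoX9ig /mnt/path_3
```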

Restart and check

Finally, restart the Elasticsearch node and check that the path.data settings for this node are correct, using the same _nodes request as above:

curl -XGET "http://localhost:9200/_nodes?pretty&filter_path=nodes.*.settings.path.data"
            

Check the cluster health and make sure that the status is either yellow or green:

curl -XGET "http://localhost:9200/_cat/health?v"
            

If the status is yellow, wait for it to turn green before continuing this same process with the next node. Once cluster health has returned to green, you can delete the contents of /var/lib/elasticsearch.
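
Waiting for green can also be scripted by polling the same _cat/health endpoint. The sketch below is an illustration under assumptions: the function names, the retry count, and the delay are arbitrary, and the node is assumed to listen on localhost:9200. The h=status parameter asks the cat API for the status column only.

```shell
# Print the current cluster status: green, yellow, or red.
get_status() {
  curl -s "http://localhost:9200/_cat/health?h=status" | tr -d '[:space:]'
}

# Poll until the cluster is green.
#   $1 = maximum number of attempts (default 60)
#   $2 = seconds to sleep between attempts (default 5)
# Returns 0 once green is seen, 1 if the attempts run out.
wait_for_green() {
  max_tries=${1:-60}
  delay=${2:-5}
  tries=0
  status=unknown
  while [ "$tries" -lt "$max_tries" ]; do
    status=$(get_status)
    if [ "$status" = green ]; then
      return 0
    fi
    tries=$((tries + 1))
    sleep "$delay"
  done
  echo "gave up waiting for green; last status: $status" >&2
  return 1
}

# Hypothetical usage: wait_for_green 120 5 && echo "safe to move on to the next node"
```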

If the status is red, then you have most likely missed some data when copying from default.path.data. You can check which shards are not recovering correctly using the _cat/shards API:

curl -XGET "http://localhost:9200/_cat/shards?v"
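
On a cluster with many shards, it can help to filter that output down to the shards that have not started. A minimal sketch, assuming the columns are requested in the order index, shard, prirep, state; the function name unstarted_shards is made up for this example.

```shell
# Read _cat/shards rows on stdin (columns: index shard prirep state)
# and print only the rows whose state is not STARTED.
unstarted_shards() {
  awk '$4 != "STARTED"'
}

# Hypothetical usage (no "v" flag, so there is no header row to filter out):
#   curl -s "http://localhost:9200/_cat/shards?h=index,shard,prirep,state" | unstarted_shards
```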
            

Going forward

Elasticsearch 5.3.1 and above will come with a bug fix for this bad configuration merging. It will also check whether you may have suffered from this bug in the past by looking at your current settings and the contents of the default.path.data path to see whether it contains any shard data. If it finds data there, the node will refuse to start.

To solve this issue, you will need to make sure that the path listed in default.path.data is empty. To be on the safe side, either rename the directory or move the data to a new directory rather than just deleting the directory. Once your cluster is green, you can safely delete the backup copy.
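
Before upgrading, you can check whether the default path still holds index directories. The sketch below is only a rough approximation of such a check, not the logic Elasticsearch itself uses; has_index_data is a hypothetical name, and the nodes/<N>/indices layout is assumed from the examples above.

```shell
# Succeed (exit 0) if the given data path contains at least one
# directory under nodes/<N>/indices/, and fail (exit 1) otherwise.
has_index_data() {
  for dir in "$1"/nodes/*/indices/*/; do
    # If the glob matches nothing it stays literal and fails this test.
    if [ -d "$dir" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical usage, with the node stopped; the .bak name is arbitrary:
#   if has_index_data /var/lib/elasticsearch; then
#     mv /var/lib/elasticsearch /var/lib/elasticsearch.bak
#   fi
```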