March 26, 2014

Snapshot and Restore

By Konrad Beiske

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Behind the Scenes

Let's take a closer look at Elasticsearch's snapshot and restore module and the files used to store snapshots, exemplified with snapshots on S3.

Introduction

With the new Snapshot and Restore API introduced in Elasticsearch 1.0, you can create snapshots of your data and store them in a repository. In essence, a snapshot is a fancy word for backup, and combined with the ability to restore, snapshots are also useful backups. In this article we will look into the files used by Snapshot & Restore, the anatomy of a repository, and some of its implications. The aim is to uncover which Elasticsearch and Lucene features paved the way for this cool new feature.

I will include all Elasticsearch commands used, so no previous experience with the Snapshot & Restore API is required, but a general idea of how indexes, shards and Lucene segments are related will be beneficial. The short version of that story is that Elasticsearch divides all its indexes into shards and each shard is a Lucene index which in turn is built up of segments that either reside in memory or on disk. One very important feature of those segments is that they are immutable. Elasticsearch utilizes this in many ways, for example in relation to filter caches. For a better introduction to these topics I recommend reading Elasticsearch from the bottom up.

Nomenclature

A snapshot is a copy of all the cluster data and may contain both indexes and cluster settings. A snapshot resides within a repository. Several repositories may be defined for a cluster and each repository has a type. The two types available in core Elasticsearch are fs and url. The fs type requires a shared file system that is mounted on every instance in the cluster. The url type only requires the repository files to be readable from a URL; however, it is read-only. In particular, if you are running Elasticsearch on Amazon EC2, you may want to store your snapshots on Amazon S3. The elasticsearch-cloud-aws plugin allows you to do this by providing a repository type for S3.

Creating Your First Repository and Snapshot

For this demonstration I will reuse a cluster from a previous article. As detailed in the image below it has three indexes with one shard each. The biggest one, the oslo3 index, receives about one hundred documents every five minutes.

Indexes and their segments

We define a repository with the curl command below. As with all snapshot and restore commands, the endpoint used is _snapshot followed by the repository name.

curl -XPUT https://-eu-west-1.foundcluster.com:9243/_snapshot/myRepo -d'
{
"type": "s3",
"settings": {
  "bucket": "myBucket",
  "region": "eu-west-1",
  "base_path": "myCluster"
}
}'

With the elasticsearch-cloud-aws plugin installed on the cluster I can specify type s3. The only required settings are the bucket name and the access credentials. The latter are specified in elasticsearch.yml. The next version of the plugin is expected to allow repository-specific credentials. More settings are described in the plugin documentation.

Once the repository is created we can create the first snapshot with the following command:

curl -XPUT https://-eu-west-1.foundcluster.com:9243/_snapshot/myRepo/test?wait_for_completion=true

Two things are worth noting about the above command: every snapshot needs a unique name, as specified in the URL, and, unless otherwise specified, snapshots are created asynchronously. If you want to check the progress of a snapshot you can do a GET to the same URL.
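If you leave out wait_for_completion, a script has to poll for the result itself. The following is a minimal Python sketch of such a polling loop, not something taken from the Elasticsearch client libraries. It assumes that the GET response contains a "snapshots" list whose first entry has a "state" field that reads "IN_PROGRESS" until the snapshot finishes. The function name wait_for_snapshot and the fetch_status callable are hypothetical, and the HTTP call is left to the caller so that the loop can be tried without a cluster.

```python
import time


def wait_for_snapshot(fetch_status, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll until the snapshot leaves the IN_PROGRESS state and return the final state.

    fetch_status is a hypothetical callable that performs the GET request and
    returns the decoded JSON body as a dict.
    """
    for _ in range(max_polls):
        info = fetch_status()
        # Assumed response shape: {"snapshots": [{"snapshot": ..., "state": ...}]}
        state = info["snapshots"][0]["state"]
        if state != "IN_PROGRESS":
            return state
        sleep(poll_seconds)
    raise TimeoutError("snapshot still in progress after %d polls" % max_polls)


# Try the loop with canned responses instead of a real cluster.
canned = iter([
    {"snapshots": [{"snapshot": "test", "state": "IN_PROGRESS"}]},
    {"snapshots": [{"snapshot": "test", "state": "SUCCESS"}]},
])
final_state = wait_for_snapshot(lambda: next(canned), sleep=lambda seconds: None)
print(final_state)
```

Passing the sleep function as a parameter is only there to keep the example fast to run; in a real script the default time.sleep would be used.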

Looking at the Files Created

Having connected to the bucket using an S3 client I find that the snapshot has created the following files:

myCluster/index
myCluster/indices/kibana-int/0/__0
myCluster/indices/kibana-int/0/__1
myCluster/indices/kibana-int/0/__2
myCluster/indices/kibana-int/0/__3
myCluster/indices/kibana-int/0/snapshot-test
myCluster/indices/kibana-int/snapshot-test
myCluster/indices/my_index/0/__0
myCluster/indices/my_index/0/__1
myCluster/indices/my_index/0/__2
myCluster/indices/my_index/0/__3
myCluster/indices/my_index/0/snapshot-test
myCluster/indices/my_index/snapshot-test
myCluster/indices/oslo3/0/__0
myCluster/indices/oslo3/0/__1
myCluster/indices/oslo3/0/__10
myCluster/indices/oslo3/0/__11
myCluster/indices/oslo3/0/__12
myCluster/indices/oslo3/0/__13
myCluster/indices/oslo3/0/__14
myCluster/indices/oslo3/0/__15
myCluster/indices/oslo3/0/__16
myCluster/indices/oslo3/0/__17
myCluster/indices/oslo3/0/__18
myCluster/indices/oslo3/0/__19
myCluster/indices/oslo3/0/__1a
myCluster/indices/oslo3/0/__1b
myCluster/indices/oslo3/0/__1c
myCluster/indices/oslo3/0/__1d
myCluster/indices/oslo3/0/__1e
myCluster/indices/oslo3/0/__1f
myCluster/indices/oslo3/0/__1g
#skipped 162 similar lines
myCluster/indices/oslo3/0/__w
myCluster/indices/oslo3/0/__x
myCluster/indices/oslo3/0/__y
myCluster/indices/oslo3/0/__z
myCluster/indices/oslo3/0/snapshot-test
myCluster/indices/oslo3/snapshot-test
myCluster/metadata-test
myCluster/snapshot-test

From this rather lengthy list we can deduce the following structure in every repository:

  • index
  • indices
    • <index_name>
      • <shard_number>
        • <segment_id>
        • snapshot-<snapshot_name>
      • snapshot-<snapshot_name>
  • metadata-<snapshot_name>
  • snapshot-<snapshot_name>
A Second Snapshot

To get a better understanding of the files, we index some more data and then create a second snapshot. By checking the timestamps of the files, it’s clear that the following files were created or modified as part of the new snapshot:

myCluster/index
myCluster/indices/kibana-int/0/__4
myCluster/indices/kibana-int/0/snapshot-test2
myCluster/indices/kibana-int/snapshot-test2
myCluster/indices/my_index/0/__4
myCluster/indices/my_index/0/snapshot-test2
myCluster/indices/my_index/snapshot-test2
myCluster/indices/oslo3/0/__55
myCluster/indices/oslo3/0/__56
myCluster/indices/oslo3/0/__57
myCluster/indices/oslo3/0/__58
myCluster/indices/oslo3/0/__59
myCluster/indices/oslo3/0/__5a
myCluster/indices/oslo3/0/__5b
myCluster/indices/oslo3/0/__5c
myCluster/indices/oslo3/0/__5d
#Skipped 90 similar lines
myCluster/indices/oslo3/0/__7w
myCluster/indices/oslo3/0/__7x
myCluster/indices/oslo3/0/__7y
myCluster/indices/oslo3/0/__7z
myCluster/indices/oslo3/0/__80
myCluster/indices/oslo3/0/__81
myCluster/indices/oslo3/0/__82
myCluster/indices/oslo3/0/__83
myCluster/indices/oslo3/0/__84
myCluster/indices/oslo3/0/__85
myCluster/indices/oslo3/0/__86
myCluster/indices/oslo3/0/__87
myCluster/indices/oslo3/0/snapshot-test2
myCluster/indices/oslo3/snapshot-test2
myCluster/metadata-test2
myCluster/snapshot-test2

By comparing the two lists we find that the only preexisting file in the bucket that has been modified is the myCluster/index file. Let’s have a look at its contents:

{"snapshots":["test","test2"]}

It’s not a big file, nevertheless it contains the names of all the snapshots in the repository.

Furthermore, there are many more files created in each snapshot for the oslo3 index. This is to be expected, as the oslo3 index is the only one receiving data regularly, causing new segments to be created and old segments to be merged. The Kibana index is a different story. As this index is used to store Kibana dashboards, it seldom gets any new data. After creating our previous two snapshots, the following files for the Kibana index exist in the bucket:

#test:
myCluster/indices/kibana-int/0/__0
myCluster/indices/kibana-int/0/__1
myCluster/indices/kibana-int/0/__2
myCluster/indices/kibana-int/0/__3
myCluster/indices/kibana-int/0/snapshot-test
myCluster/indices/kibana-int/snapshot-test
#test2:
myCluster/indices/kibana-int/0/__4
myCluster/indices/kibana-int/0/snapshot-test2
myCluster/indices/kibana-int/snapshot-test2

Now, this is actually interesting as it points at one of the really handy features of snapshot and restore: the snapshots are incremental. Let’s examine further and shed some light on how this is implemented.

First, by comparing the two files myCluster/indices/kibana-int/snapshot-test and myCluster/indices/kibana-int/snapshot-test2 we find that they’re identical JSON documents, and by adding a little whitespace formatting it’s easy to recognize the index settings and mappings:

{
  "kibana-int":{
    "version":2,
    "state":"open",
    "settings":{
      "index.number_of_replicas":"1",
      "index.version.created":"900499",
      "index.number_of_shards":"1"
    },
    "mappings":[
      {
        "dashboard": {
          "properties": {
            "dashboard": {"type":"string"},
            "group": {"type":"string"},
            "title": {"type":"string"},
            "user": {"type":"string"}
          }
        }
      }
    ],
    "aliases":{}
  }
}

Moving on, we investigate the files for shard 0 and start by comparing snapshot-test and snapshot-test2. This time the files are a little different:

myCluster/indices/kibana-int/0/snapshot-test

{
  "name" : "test",
  "index-version" : 5,
  "files" : [ {
    "name" : "__0",
    "physical_name" : "_0.cfs",
    "length" : 8037,
    "checksum" : "trhzg",
    "part_size" : 104857600
  }, {
    "name" : "__1",
    "physical_name" : "_0.cfe",
    "length" : 314,
    "checksum" : "14i5z7r",
    "part_size" : 104857600
  }, {
    "name" : "__2",
    "physical_name" : "_0.si",
    "length" : 270,
    "checksum" : "19azdai",
    "part_size" : 104857600
  }, {
    "name" : "__3",
    "physical_name" : "segments_5",
    "length" : 107,
    "part_size" : 104857600
  } ]
}

myCluster/indices/kibana-int/0/snapshot-test2

{
  "name" : "test2",
  "index-version" : 6,
  "files" : [ {
    "name" : "__0",
    "physical_name" : "_0.cfs",
    "length" : 8037,
    "checksum" : "trhzg",
    "part_size" : 104857600
  }, {
    "name" : "__1",
    "physical_name" : "_0.cfe",
    "length" : 314,
    "checksum" : "14i5z7r",
    "part_size" : 104857600
  }, {
    "name" : "__2",
    "physical_name" : "_0.si",
    "length" : 270,
    "checksum" : "19azdai",
    "part_size" : 104857600
  }, {
    "name" : "__4",
    "physical_name" : "segments_6",
    "length" : 107,
    "part_size" : 104857600
  } ]
}

From these files we can deduce that the first snapshot uses the files __0, __1, __2 and __3 and that the second snapshot uses the files __0, __1, __2 and __4. This is how the incremental feature of snapshots is implemented on the file side. If you wonder how Elasticsearch is able to compare these files to the files it has on disk for each segment, there is a hint in the physical_name attributes and their extensions: cfs, cfe and si are all file types used by Lucene segments. This implies that the core building blocks of indexes in snapshots are the same as on disk. Since Lucene segments are immutable, making an incremental snapshot simply becomes a matter of copying the missing segment files to the repository and creating a record of which files are used by the new snapshot.
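The comparison of the two shard-level files can also be done mechanically. The Python sketch below reads two such JSON documents and reports which repository files they share and which belong to only one of them. The function names are hypothetical, and the sample documents are abbreviated versions of the two files shown above, keeping only the name and physical_name fields.

```python
import json


def referenced_files(shard_snapshot_text):
    """Return the repository file names listed in a shard-level snapshot file."""
    return {entry["name"] for entry in json.loads(shard_snapshot_text)["files"]}


def compare_shard_snapshots(older_text, newer_text):
    """Split the files of two shard snapshots into shared and exclusive sets."""
    older = referenced_files(older_text)
    newer = referenced_files(newer_text)
    return {
        "shared": sorted(older & newer),
        "only_older": sorted(older - newer),
        "only_newer": sorted(newer - older),
    }


snapshot_test = json.dumps({"name": "test", "files": [
    {"name": "__0", "physical_name": "_0.cfs"},
    {"name": "__1", "physical_name": "_0.cfe"},
    {"name": "__2", "physical_name": "_0.si"},
    {"name": "__3", "physical_name": "segments_5"},
]})
snapshot_test2 = json.dumps({"name": "test2", "files": [
    {"name": "__0", "physical_name": "_0.cfs"},
    {"name": "__1", "physical_name": "_0.cfe"},
    {"name": "__2", "physical_name": "_0.si"},
    {"name": "__4", "physical_name": "segments_6"},
]})
comparison = compare_shard_snapshots(snapshot_test, snapshot_test2)
print(comparison)
```

For the Kibana index, the three segment files are shared and only the segments_N file differs between the two snapshots.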

Snapshots of shard 0 in kibana-int index

Deletes

Let’s assume we no longer need the test snapshot after having created the test2 snapshot. As you might have guessed from the previously described file structure, and unlike most incremental backup solutions, snapshots in Elasticsearch have no significant ordering or dependency on one another, even though they are incremental. Hence, we are able to delete the first snapshot without affecting the later one, as long as we do the deletes through the Elasticsearch API and don’t start picking files on our own. Demo time:

curl -XDELETE https://-eu-west-1.foundcluster.com:9243/_snapshot/myRepo/test

Looking at the segment for the Kibana index again, only the following files remain:

myCluster/indices/kibana-int/0/__0
myCluster/indices/kibana-int/0/__1
myCluster/indices/kibana-int/0/__2
myCluster/indices/kibana-int/0/__4
myCluster/indices/kibana-int/0/snapshot-test2
Shard 0 in kibana-int index after deleting test

One obvious implication of this is that deleting a snapshot might not result in freeing up as much disk space as was consumed when the snapshot was created, if another snapshot was created in the meantime. Another implication is that it’s not trivial to define how much disk space is consumed by a snapshot. This is probably why the API does not currently expose any size used per snapshot. A neat feature I would like to see introduced, is a dry run of sorts for deletes, where one could select one or more snapshots and have Elasticsearch calculate the amount of disk space released if they were to be deleted.
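Until such a feature exists, a rough estimate can be derived from the shard-level snapshot files themselves. The Python sketch below is not an Elasticsearch API. It takes the file lists of every snapshot of one shard and sums the lengths of the files that are referenced only by the snapshots selected for deletion. The function name reclaimable_bytes is hypothetical, the calculation covers a single shard, and it ignores the small snapshot and metadata files.

```python
def reclaimable_bytes(shard_snapshots, to_delete):
    """Estimate the bytes freed in one shard if the given snapshots were deleted.

    shard_snapshots maps a snapshot name to its list of file entries, each a
    dict with "name" and "length", as found in the shard-level snapshot files.
    """
    still_needed = set()
    for snapshot_name, files in shard_snapshots.items():
        if snapshot_name not in to_delete:
            still_needed.update(entry["name"] for entry in files)

    freed = {}  # keyed by file name so a file shared by two deleted snapshots counts once
    for snapshot_name in to_delete:
        for entry in shard_snapshots[snapshot_name]:
            if entry["name"] not in still_needed:
                freed[entry["name"]] = entry["length"]
    return sum(freed.values())


# File names and lengths from the two kibana-int shard snapshots shown earlier.
shard_snapshots = {
    "test": [
        {"name": "__0", "length": 8037},
        {"name": "__1", "length": 314},
        {"name": "__2", "length": 270},
        {"name": "__3", "length": 107},
    ],
    "test2": [
        {"name": "__0", "length": 8037},
        {"name": "__1", "length": 314},
        {"name": "__2", "length": 270},
        {"name": "__4", "length": 107},
    ],
}
print(reclaimable_bytes(shard_snapshots, {"test"}))
print(reclaimable_bytes(shard_snapshots, {"test", "test2"}))
```

With these numbers, deleting test alone would free only the 107 bytes of __3, while deleting both snapshots would free every file in the shard.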

Merges

Knowing that segments are the building blocks of indexes and snapshots, I suspect that a segment merge might result in having to copy the new merged segment even if there are no changes to the documents or the index settings. I will test this hypothesis by stopping the traffic to the oslo3 index, creating a snapshot, optimizing the index down to one segment (it currently has about 17 segments) and creating another snapshot.

curl -XPUT https://-eu-west-1.foundcluster.com:9243/_snapshot/myRepo/oslo3-before-merge?wait_for_completion=true -d '{
"indices": "oslo3",
"include_global_state": false
}'

curl -XPOST https://-eu-west-1.foundcluster.com:9243/oslo3/_optimize?max_num_segments=1

curl -XPUT https://-eu-west-1.foundcluster.com:9243/_snapshot/myRepo/oslo3-after-merge?wait_for_completion=true -d '{
"indices": "oslo3",
"include_global_state": false
}'

In chronological order, the above commands resulted in the following files in the bucket:

myCluster/indices/oslo3/0/__az
myCluster/indices/oslo3/0/snapshot-oslo3-before-merge
myCluster/indices/oslo3/0/__b0
myCluster/indices/oslo3/0/__b1
myCluster/indices/oslo3/0/__b2
myCluster/indices/oslo3/0/__b3
myCluster/indices/oslo3/0/__b4
myCluster/indices/oslo3/0/__b5
myCluster/indices/oslo3/0/__b6
myCluster/indices/oslo3/0/__b7
myCluster/indices/oslo3/0/__b8.part0
myCluster/indices/oslo3/0/__b8.part1
myCluster/indices/oslo3/0/__b8.part2
myCluster/indices/oslo3/0/__b9
myCluster/indices/oslo3/0/__ba
myCluster/indices/oslo3/0/__bb
myCluster/indices/oslo3/0/__bc
myCluster/indices/oslo3/0/__bd
myCluster/indices/oslo3/0/snapshot-oslo3-after-merge

As suspected, the merge forced the following snapshot to copy the new segment even though the documents, strictly speaking, already existed within the repository. One advantage of this approach is that the segments don’t have to be merged again when restoring the snapshot. Nonetheless, it also implies that even if you never delete any documents from your index and do snapshots frequently, your repository will contain a lot of redundant data - that is, if you don’t delete old snapshots. In other words, the fact that snapshots are incremental does not imply that they will never contain any redundant data. The deduplication only works with complete segments and is not able to track which segments were merged together. For a really nice demonstration of how segment merges occur in Lucene while indexing I recommend this video.
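The effect of a merge on file-level deduplication can be illustrated with a few lines of Python. This is a simplified model and not Elasticsearch's actual implementation: a local file counts as already stored only if the repository holds a file with the same physical name, length and checksum. The function name files_to_copy and all file names, lengths and checksums in the example are made up for the illustration.

```python
def files_to_copy(local_files, repo_files):
    """Return the physical names of local files that the repository lacks."""
    def identity(entry):
        return (entry["physical_name"], entry["length"], entry.get("checksum"))

    stored = {identity(entry) for entry in repo_files}
    return [entry["physical_name"] for entry in local_files
            if identity(entry) not in stored]


# Hypothetical shard: two small segments are already in the repository.
before_merge = [
    {"physical_name": "_0.cfs", "length": 4000, "checksum": "aaa"},
    {"physical_name": "_1.cfs", "length": 5000, "checksum": "bbb"},
]
# After optimizing, the same documents live in one new segment file.
after_merge = [
    {"physical_name": "_2.cfs", "length": 8800, "checksum": "ccc"},
]

unchanged = files_to_copy(before_merge, before_merge)
merged = files_to_copy(after_merge, before_merge)
print(unchanged)
print(merged)
```

An unchanged shard needs no new files, whereas the merged segment has to be uploaded in full because nothing in the repository matches it.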

Restore

Backups have limited value if you are unable to restore them, but to make things a little more interesting, let’s do the restore to a different cluster. I create a new cluster through the Found console, select Elasticsearch version 1.0 and the same region as my previous cluster. I proceed to create the same repository in the new cluster and restore with the following commands:

curl -XPUT https://-eu-west-1.foundcluster.com:9243/_snapshot/myRepo -d'
{
"type": "s3",
"settings": {
  "bucket": "myBucket",
  "region": "eu-west-1",
  "base_path": "myCluster"
}
}'

curl -XPOST https://-eu-west-1.foundcluster.com:9243/_snapshot/myRepo/oslo3-after-merge/_restore

Before taking this to the extreme and having all your clusters configured with the same repository, here’s a word of caution. This usage is not mentioned anywhere in the official documentation and there are a number of potential race conditions that can happen with multiple systems accessing the same remote file system. It’s highly likely that the Elasticsearch developers have considered this use case and have plans to endorse it in the future. My guess as to why it hasn’t been documented is that they have yet to implement all the safeguards and test it properly. There is, however, a way to reduce the number of potential problems, and that is to never have more than one cluster with write access to any given repository. By first creating one repository per cluster and then making it available to the other clusters as a url repository, read-only access would be enforced.

Another constraint which is worth noting about restores is that if the index already exists in the cluster it must be closed first. One possible workaround is to rename the index as it is restored, as described in the official documentation.
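As a sketch of what such a renaming restore could look like, the Python snippet below builds a request body for the _restore endpoint using the rename_pattern and rename_replacement parameters from the official documentation. The helper name restore_request_body and the chosen pattern are only examples, and the snippet stops at printing the JSON rather than sending it to a cluster.

```python
import json


def restore_request_body(indices, rename_pattern, rename_replacement):
    """Build the JSON body for a restore that renames the restored indexes."""
    return json.dumps({
        "indices": indices,
        "rename_pattern": rename_pattern,          # regular expression matched against index names
        "rename_replacement": rename_replacement,  # may refer to capture groups, e.g. $1
    }, indent=2)


# Restore oslo3 as restored_oslo3 so that the existing index can stay open.
body = restore_request_body("oslo3", "(.+)", "restored_$1")
print(body)
```

The printed document would be passed with -d to the same _restore URL used above.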

Possible Use Cases

There are many use cases for snapshot and restore and the most obvious one is of course backups, both ad hoc before large changes, but also ones scheduled on a regular basis. Obviously, the scheduling is something you will have to implement on your own, just don’t forget to make a retention policy as well. You can even use snapshot and restore for point-in-time recovery, but before you start automating a snapshot every minute, remember that while a shard is being snapshotted, all segment merges are postponed, and if you end up with too many segments your search performance will suffer. In all likelihood, the Elasticsearch developers decided to limit a cluster to only one snapshot or restore operation at a time for the very same reasons that I recommend not allowing multiple clusters to have write access to a repository.
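A retention policy can be as simple as keeping the most recent N snapshots. The Python sketch below picks the snapshots to delete under that rule. It assumes that the snapshot names sort chronologically, for example because they end in an ISO date. The function name and the nightly- naming scheme are hypothetical, and the actual deletes should still go through the Elasticsearch API as described earlier.

```python
def snapshots_to_delete(snapshot_names, keep_last):
    """Return the snapshot names that fall outside a keep-the-last-N policy.

    Assumes names sort chronologically, e.g. "nightly-2014-03-26".
    """
    if keep_last < 0:
        raise ValueError("keep_last must not be negative")
    ordered = sorted(snapshot_names)
    return ordered[:max(0, len(ordered) - keep_last)]


names = ["nightly-2014-03-26", "nightly-2014-03-24", "nightly-2014-03-25"]
expired = snapshots_to_delete(names, keep_last=2)
print(expired)
```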

The coolest thing in snapshot and restore, at least in my opinion, is the possibility to easily duplicate data from production clusters to development or test environments. This is one of those things that is never a problem at the beginning of projects - simply because you just don’t have that much data! - but eventually you get to the point where it takes days to transfer all the data from the production environment to a separate system to start tracking down that bug… Once that happens you really wish you had some way of keeping the development system up to date with what goes on in production. Snapshot and restore to the rescue!