29 January 2016 Engineering

Cluster cloning in the cloud

By Konrad Beiske

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

One of the advantages of using a SaaS offering is the flexibility of on-demand scaling. With Found this means you can create and delete clusters as you like and only pay for the hours they're running. Building on the snapshot and restore API of Elasticsearch, Found makes it easy to duplicate a cluster. In this blog post we will see how this works and how it may be used in a few different scenarios. Finally, we will take a brief look at how it is implemented, which will also show why the clone operation has absolutely no performance impact on the source cluster.

Use cases

A common denominator in all of these use cases is that they benefit from the isolation that a separate cluster provides, whether that is isolation of hardware resources or data resources.

Ad hoc analytics

For most logging and metrics use cases it is cost prohibitive to keep all the data in memory, even if that is what provides the best performance for aggregations. Cloning the relevant data to an ad hoc analytics cluster that can be discarded after use is a very cost-effective way to experiment with your data, without any risk of hurting performance in production.

Test upgrades

The safest way to check that both your indexes and your application are ready for the next Elasticsearch version is to copy the indexes to a new cluster and test the entire upgrade path: first upgrade Elasticsearch, then make sure that your application still works.

Enable your developers

Realistic test data is crucial for uncovering unexpected errors early in the development cycle. What can be more realistic than actual data from the production cluster? Giving the developer team access to experiment with real production data is not only great for breaking down silos, but may also provide new insights about the domain. A safe and isolated playground is the essential benefit for this use case.

Test Mapping changes

Mapping changes almost always require reindexing, and unless your data volume is trivial, that also takes some time. Tweaking the parameters to achieve the best performance usually takes a little trial and error. While this use case could also be handled by running the scan and scroll query directly against the source cluster, it is worth noting that a long-lived scroll has the side effect of blocking merges, even if the scan query itself is very lightweight.
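To make the reindexing part concrete, here is a minimal sketch of the transformation at the heart of a scan/scroll reindex: turning one page of scroll hits into the newline-delimited body expected by the bulk API. The function name and the shape of the hits are illustrative, not part of any client library.

```python
import json

def scroll_page_to_bulk(hits, target_index):
    """Turn one page of scan/scroll hits into an NDJSON body for the
    _bulk endpoint. Each hit becomes an action line followed by its
    source document."""
    lines = []
    for hit in hits:
        lines.append(json.dumps({"index": {"_index": target_index,
                                           "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

A reindex loop would repeatedly fetch a scroll page from the source, feed it through a function like this, and POST the result to the target cluster's `_bulk` endpoint.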

Integration testing

Test your application against a real live Elasticsearch instance with actual data. If you automate this, you could also aggregate performance metrics from the tests and use those metrics to detect if a change in your application has introduced a performance degradation.

How to clone a cluster

The steps required:

  1. Prepare the target cluster (the cluster which the snapshot will be copied to).
  2. Select a snapshot from the source cluster.
  3. Issue the restore.

Prepare the target cluster

In order for the restore to be successful the target cluster must meet the following conditions:

  • Same region
  • Compatible Elasticsearch version
  • Index resources used
    • Plugins for custom types
    • Synonym files used by analyzer
    • Dictionary files used by analyzer
  • No conflicting indexes

Both clusters have to be in the same region in order to have access to each other's backups. In most cases any Elasticsearch version equal to or higher than the version of the source cluster will be compatible. The exception is when upgrading more than one major version at a time, as Elasticsearch then might not have a Lucene version capable of reading the indexes. The same thing happens when an index originally created on an Elasticsearch version older than the current version has segments written in an old Lucene format, but this is usually not a problem as the format is upgraded automatically during merges.
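The version rule above can be expressed as a small check. This is a rough sketch under the assumption stated in the text (restore works to an equal or newer version, but not across more than one major version); it is not an official compatibility matrix.

```python
def restore_compatible(source, target):
    """Rough check of whether a snapshot taken on `source` can be
    restored to `target`, with versions as (major, minor, patch) tuples.

    The target must be at least as new as the source, but no more than
    one major version ahead, since Elasticsearch might otherwise lack a
    Lucene version capable of reading the source's segments."""
    if target < source:
        return False
    return target[0] - source[0] <= 1
```

For example, a 1.7.5 snapshot can be restored to a 2.1.1 cluster, but not to a 5.x cluster without an intermediate upgrade and merge cycle.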

The target cluster is not required to have all the plugins and dictionaries of the source cluster, just the ones required to open the cloned indexes. In practice this means any dictionary or synonym file referenced by a custom analyzer in any of the indexes, a.k.a. user bundles on the cluster config page. Slightly less common, but just as hard a requirement, is any index that uses a custom field type provided by a plugin like mapper-attachments or murmur3.

The final requirement is that the target cluster doesn't have any conflicting indexes defined. If any of the indexes to be copied already exist in the target cluster, they must be deleted or closed before issuing the restore. Alternatively, you may use a rename pattern when issuing the restore to have the indexes restored under different names.
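As a sketch of what conflict detection and renaming look like, the helper below computes the name each index would be restored under and flags conflicts. The function is hypothetical; note that Elasticsearch's restore parameters use `$1` for capture groups where Python's `re` module uses `\1`, which the sketch papers over with a simple substitution.

```python
import re

def restore_names(snapshot_indexes, target_indexes,
                  rename_pattern=None, rename_replacement=None):
    """Map each snapshot index to the name it will be restored under,
    and collect names that clash with indexes already on the target.

    `rename_pattern`/`rename_replacement` mimic the Elasticsearch
    restore parameters ($1-style group references)."""
    plan, conflicts = {}, []
    for name in snapshot_indexes:
        restored = name
        if rename_pattern is not None:
            # Translate $1-style references to Python's \1 style.
            restored = re.sub(rename_pattern,
                              rename_replacement.replace("$", "\\"), name)
        plan[name] = restored
        if restored in target_indexes:
            conflicts.append(restored)
    return plan, conflicts
```

With a pattern of `(.+)` and a replacement of `restored_$1`, an existing `logs` index on the target no longer blocks restoring the snapshot's `logs` index.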

Selecting the snapshot

The cluster cloning capability on Found is really an extension of the automated backup service, which takes a snapshot every 30 minutes. To select a snapshot, sign in to the console at https://found.elastic.co and go to the clusters page.

clone-cluster-cluster-overview.png

There you will find a “Snapshots” item for each of your clusters in the left-hand menu. The snapshots page lists each snapshot by timestamp. Clicking on one takes you to the restore snapshot page.

Issue a restore

Once you’ve selected a snapshot to restore from, you should see a form with the fields indexes, rename pattern, rename replacement and cluster.

clone-cluster-restore.png

The first three correspond to the equivalent parameters of the restore command in Elasticsearch. The cluster parameter specifies the target cluster of the restore operation. If you don't specify a cluster, the default behavior is to restore to the same cluster the snapshot was made from. If you don't see your cluster in the dropdown list, make sure that it is running a suitable Elasticsearch version and has had time to start.
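For readers who prefer the Elasticsearch API directly, the form essentially fills in the body of a `POST /_snapshot/<repository>/<snapshot>/_restore` request. The helper below sketches how such a body might be assembled, including only the parameters that were actually set; the function itself is illustrative, not part of any client.

```python
import json

def build_restore_body(indices=None, rename_pattern=None,
                       rename_replacement=None):
    """Build the JSON body for the Elasticsearch restore API,
    omitting parameters that were left blank in the form."""
    body = {}
    if indices is not None:
        body["indices"] = indices
    if rename_pattern is not None:
        body["rename_pattern"] = rename_pattern
        body["rename_replacement"] = rename_replacement
    return json.dumps(body)
```

For example, `build_restore_body("logs-*", "(.+)", "restored_$1")` produces a body that restores every matching index under a `restored_` prefix.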

Things included and not included

Indexes and index settings are included, but cluster settings are not. This decision is a consequence of the fact that the indexes may be restored to a cluster with a very different plan, but it can be an important distinction to be aware of if performance testing reveals different results between different clusters.

Another consideration for performance comparisons is that copying indexes from one cluster to another with the snapshot and restore API is not the same as copying indexes with a scan/scroll query and bulk indexing. The former copies Lucene segments and the latter JSON documents. If the source index has segments in an old Lucene format, these are not modified, but copied as they are.

Under the hood

For every restore to a different cluster issued by the Found console, three things happen:

  • The settings of the found-snapshots repository in the source cluster are inspected.
  • A snapshot repository with the same S3 bucket and path is created in the target cluster, but the repository's id is the id of the source cluster and not found-snapshots.
  • The new repository is used to issue a restore on the target cluster.

One important takeaway here is that the data is not transferred from the source cluster when the restore is issued, but copied from the backup in S3. This means that issuing a restore to a different cluster will not have any performance implication on the source cluster.

After the restore is complete, the repository in the target cluster is not removed, making it easy to issue another restore through the Elasticsearch API. If you would like to include this initialization in a script, the steps required are described in our documentation on sharing a repository across clusters.

Conclusion

On the Found team we aim to provide all the cool features of Elasticsearch in combination with the ease and flexibility of a good SaaS offering. If you have any suggestions, we would love to hear them on Discuss.