How do Elasticsearch snapshots work?
Elastic offers many instructor-led, in-person and virtual live trainings, as well as on-demand trainings. Our flagship courses are Elasticsearch Engineer, Data Analysis with Kibana, and Elastic Observability Engineer. All of these courses lead to certifications.
We recently released the latest version of Elasticsearch Engineer training in response to increased demand and new features. This course is designed for both new Elasticsearch users and Elasticsearch professionals. It begins with the basics for getting started with the Elastic Stack, then quickly dives deep into topics ranging from optimizing search performance to building efficient clusters. View the detailed course outline to find out more about what you’ll learn. All lessons include hands-on labs.
During the instructor-led “Elasticsearch Engineer” training, one of the most common questions we get when teaching snapshots is, “How can each snapshot be logically independent?” In this blog post, I will explain this in detail.
What is a snapshot?
A snapshot is a backup of a running Elasticsearch cluster. You can use snapshots to:
- Regularly back up a cluster with no downtime
- Recover data after deletion or a hardware failure
- Transfer data between clusters
- Reduce your storage costs by using searchable snapshots in the hot, cold and frozen data tiers
Deduplication of snapshots
To back up an index, a snapshot makes a copy of the index’s segments and stores them in the snapshot repository.
Indices are made up of shards. Each Elasticsearch shard is a Lucene index. Each Lucene index is divided into smaller units called segments. When you add new documents to your index, Lucene creates a new segment and writes to it. From time to time, Lucene merges smaller segments into a larger one.
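To make the idea of immutable segments concrete, here is a minimal sketch in Python. It is a toy model, not how Lucene is implemented: a segment is represented as a tuple of documents, a shard as a list of segments, and the function names `index_documents` and `merge_segments` are hypothetical. Real Lucene buffers documents in memory before flushing a segment to disk, which this sketch ignores.

```python
# Toy model only: a segment is an immutable tuple of documents,
# and a shard is a list of such segments. Real Lucene segments are files on disk.

def index_documents(segments, docs):
    """Indexing never edits an existing segment; it adds a new one."""
    return segments + [tuple(docs)]


def merge_segments(segments):
    """Merging writes one new, larger segment that replaces the smaller ones."""
    merged = tuple(doc for segment in segments for doc in segment)
    return [merged]


shard = []
shard = index_documents(shard, ["doc1", "doc2"])  # first segment
shard = index_documents(shard, ["doc3"])          # second segment
print(len(shard))  # prints 2

shard = merge_segments(shard)
print(shard)       # prints [('doc1', 'doc2', 'doc3')]
```

The point of the sketch is that neither operation modifies a segment in place: both produce a new segment, which is what makes segment-level comparison between a shard and a repository possible.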
Since segments are immutable, the snapshot only needs to copy any new segments created since the repository’s last snapshot.
Each snapshot is also logically independent. When you delete a snapshot, Elasticsearch only deletes the segments used exclusively by that snapshot. Elasticsearch doesn’t delete segments that are still used by other snapshots in the repository.
Let’s go through this example to get a better understanding.
- Suppose we take a snapshot (snap1) of a simple index with one shard (shard0) and two segments, A and B.
- Some time later, as new documents are indexed, a new segment C is created in shard0.
- A second snapshot (snap2) will only copy the missing segment(s) to the repository.
- Some time later, segments A, B, and C are merged, creating a new segment D.
- When creating a new snapshot (snap3), the new segment D is copied to the repository.
- Deleting a snapshot (snap1) only deletes segments in the repository that are no longer referenced by any other snapshot.
- In this case, no segments are deleted from the repository.
- Only after snap2 is also deleted are segments A, B, and C deleted from the repository.
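The walkthrough above can be replayed with a small reference-tracking model. This is a sketch under simplifying assumptions, not Elasticsearch's actual repository format: the class `ToyRepository` and its methods are hypothetical, it covers a single shard, and segments are identified only by their labels.

```python
class ToyRepository:
    """Toy model of a snapshot repository for a single shard (hypothetical)."""

    def __init__(self):
        self.segments = set()   # segments physically stored in the repository
        self.snapshots = {}     # snapshot name -> segments it references

    def create_snapshot(self, name, shard_segments):
        """Record the snapshot and return the segments that had to be copied."""
        copied = set(shard_segments) - self.segments
        self.segments |= copied
        self.snapshots[name] = frozenset(shard_segments)
        return copied

    def delete_snapshot(self, name):
        """Drop the snapshot and return the segments nothing else references."""
        del self.snapshots[name]
        still_referenced = set().union(*self.snapshots.values())
        removed = self.segments - still_referenced
        self.segments -= removed
        return removed


repo = ToyRepository()
copied1 = repo.create_snapshot("snap1", {"A", "B"})
copied2 = repo.create_snapshot("snap2", {"A", "B", "C"})
copied3 = repo.create_snapshot("snap3", {"D"})  # A, B, and C were merged into D
removed1 = repo.delete_snapshot("snap1")
removed2 = repo.delete_snapshot("snap2")

print(sorted(copied1))        # prints ['A', 'B']
print(sorted(copied2))        # prints ['C']
print(sorted(copied3))        # prints ['D']
print(sorted(removed1))       # prints []
print(sorted(removed2))       # prints ['A', 'B', 'C']
print(sorted(repo.segments))  # prints ['D']
```

Deleting snap1 removes nothing because snap2 still references A and B. Once snap2 is gone too, only snap3 remains, and it references D alone, so A, B, and C can be removed.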
In this blog post, I explained how snapshots are automatically deduplicated with the help of some graphics. For more information, please feel free to read through the official documentation.
The Elastic Stack is versatile enough to tackle any use case. Want to learn how to harness the power of that versatility? Become an Elastic expert through free, paid, and private training, or through a training subscription. Our instructor-led virtual classes are offered globally, in time zones that make learning convenient for you. Enhance your professional visibility and push aside technical boundaries within your company by becoming Elastic certified.
Reach out to us at firstname.lastname@example.org with any questions.
Originally published May 9, 2023; updated November 16, 2023