This blog post tries to explain our thought processes around how we work in Elasticsearch and Apache Lucene to address resiliency. It ended up being a much longer post than I intended, but I hope you will find it well worth the read.
The first question is: why would we even care about resiliency in Elasticsearch? Is it not "just" a search system, one that simply mirrors a primary "source of truth" data source? Surely you can always reindex the data?
Well, to a degree, this is correct. Many systems out there have their "single source of truth" outside of Elasticsearch. If your SSoT is another datastore or flat files in S3 or HDFS, you can always reindex the data. But, the fact that you can reindex the data doesn't mean that we are happy with the fact that you *have to do this*. We have users of Elasticsearch storing petabytes of data - reindexing all of that data would take an unacceptably long time! For that reason, we strive to ensure that Elasticsearch doesn’t lose data, ever.
A major step towards keeping data safe was the addition of the snapshot/restore API. While it was possible to do backups in Elasticsearch before, it was fiddly and inefficient. With the incremental backups available in snapshot/restore, you can now recover your data even when your cluster suffers massive failure, be it hardware or user induced (think DELETE /*). But restoring data still takes time. We also want to ensure that your live cluster is resilient and never loses data.
How do we address the resiliency aspects of Elasticsearch? Our process is quite simple. We work through known issues based on how likely they are to occur, and how broad the scope of the fix is.
If we suspect there is a bug in Apache Lucene, we put all our resources on finding and fixing it. Quickly! Lucene is at the core of Elasticsearch. We feel deeply responsible for helping Lucene move forward (see Investing In Lucene); it’s a very high priority for us. Lucene is a widely used library and, outside of the context of Elasticsearch, our fixing anything related to resiliency in Lucene has a karma multiplier that goes beyond just Elasticsearch.
We do not take this responsibility lightly.
A lot also happens somewhat behind the scenes. For example, a few months ago we were getting reports of users losing data in Elasticsearch. Among them was GitHub, one of our customers. We had a few engineers fully dedicated to figuring out why it happened. It took us a few months to nail down the problem. We started by examining each step within Elasticsearch itself – when shards are allocated, when shards are deleted – hardening each step to add resiliency.
In the end, it turned out to be a bug in Lucene which could result in a whole index being deleted! We were super excited that we had managed to track down this bug and commit the fix (LUCENE-4870).
We’re doing the same thing with the upcoming Lucene 4.8 release. So far there have been two bugs in Lucene 4.8 that we have managed to uncover and fix: LUCENE-5574 (wipes index), LUCENE-5570 (zero byte files).
That’s not to say that there aren't bugs in Elasticsearch itself. A few months ago, we found that using multiple data paths in Elasticsearch could result in Elasticsearch wrongly reporting a corrupted index. Again, this bug was one that took us quite some time to find, but it was a high priority for us.
How do we make sure our system is resilient? There are two aspects to it. The first is our test infrastructure. One of the amazing things that happened in Apache Lucene, and that helped make it super resilient, was the adoption of randomized testing (see Mike's post for background). Think of randomized testing as test infrastructure that is "predictably irrational": tests usually stagnate towards stability, while randomized testing turns your test infrastructure into a living organism that keeps probing conditions you never thought to write down.
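The actual Lucene/Elasticsearch test framework is written in Java; as a hypothetical Python sketch of the core idea, a randomized test draws its inputs from a seeded random source, so every run explores different conditions while staying exactly reproducible from the printed seed:

```python
import random

def run_randomized_test(seed=None):
    # Pick a fresh seed unless one is given to reproduce a past failure.
    seed = seed if seed is not None else random.randrange(2**32)
    rng = random.Random(seed)
    print(f"running with seed={seed}")

    # Randomize the conditions the code under test sees.
    doc_count = rng.randint(1, 1000)
    docs = [rng.getrandbits(16) for _ in range(doc_count)]

    # The property under test: sorting is idempotent for any input.
    once = sorted(docs)
    twice = sorted(once)
    assert once == twice, f"failed with seed={seed}"
    return seed

seed = run_randomized_test()
# A failure report includes the seed; rerunning with it replays the
# exact same conditions so the bug can be debugged deterministically.
run_randomized_test(seed)
```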
Another part of the test infrastructure is what Simon has us all fondly referring to as “evil tests.” These are not Chaos Monkey tests, these are Rabid-Monkey-on-Acid tests! For example, Lucene now has a Directory (effectively a file system) that simulates faulty hardware when doing fsync calls.
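Lucene's fault-injecting Directory is Java internals; as a loose Python analogy (the class and function names here are made up for illustration), a file wrapper can fail fsync calls at random, and the test then checks that the code under test survives the failure:

```python
import os
import random

class FlakyFsync:
    """Wraps a file and fails fsync with a given probability: a toy
    analogy to Lucene's fault-injecting Directory, not its real API."""
    def __init__(self, path, rng, failure_rate=0.5):
        self.f = open(path, "wb")
        self.rng = rng
        self.failure_rate = failure_rate

    def write(self, data):
        self.f.write(data)

    def fsync(self):
        if self.rng.random() < self.failure_rate:
            raise OSError("simulated fsync failure")
        self.f.flush()
        os.fsync(self.f.fileno())

def durable_write(path, data, rng, retries=5):
    # Code under test: it must survive fsync failures by retrying,
    # rather than silently losing the write.
    for _ in range(retries):
        try:
            w = FlakyFsync(path, rng)
            w.write(data)
            w.fsync()
            return True
        except OSError:
            continue
    return False
```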
Anybody who follows Elasticsearch closely has probably noticed how much work Simon and others in our team have done to bring this amazing randomized testing infrastructure to Elasticsearch as well. Today, each test run in Elasticsearch is significantly different (in a good way) from the previous one, exposing our code to a huge range of conditions which us mere humans would not have considered. When a failure appears, we can replicate the exact conditions for that test run in order to find the problem.
We have also added the ability to simulate many failure modes within our own system. A good example is the
SearchWithRandomExceptionsTests test case in Elasticsearch (another “evil” test). It introduces (predictably) random failures to try to break our system. If Elasticsearch doesn’t cope with the failure, then we harden it. Testing network problems from outside is hard, so we have added the ability to simulate network partitions (or misbehaving nodes), to more easily strengthen our distributed model (more on that later).
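The real SearchWithRandomExceptionsTests is Java; the pattern it follows looks roughly like this hypothetical sketch: inject seeded random exceptions into an operation, then assert that the system either succeeds or fails cleanly, never leaving partial state behind:

```python
import random

class Store:
    def __init__(self):
        self.docs = {}

    def index(self, doc_id, doc, rng, failure_rate=0.3):
        # Randomly fail before and after the critical section.
        if rng.random() < failure_rate:
            raise RuntimeError("injected failure before write")
        self.docs[doc_id] = doc
        if rng.random() < failure_rate:
            # Roll back so a failure never leaves a partial write visible.
            del self.docs[doc_id]
            raise RuntimeError("injected failure after write")

rng = random.Random(42)
store = Store()
ok = failed = 0
for i in range(1000):
    try:
        store.index(i, f"doc-{i}", rng)
        ok += 1
    except RuntimeError:
        failed += 1

# Invariant: every id is either fully indexed or fully absent.
assert ok + failed == 1000
assert len(store.docs) == ok
```

If a run of the loop ever violates the invariant, hardening the rollback path is exactly the kind of fix such a test forces.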
The second aspect is reviewing how Elasticsearch and Lucene behave today, analyzing the results, and making them more resilient wherever possible. For instance, it always bothered me that checksumming was not an integral aspect of Lucene (it was done at a higher level, in Elasticsearch, for recovery purposes), which made file corruption difficult to detect. Now, with the additional people we have dedicated to Lucene, we have the capacity to address it. (Add that to the more on that later list, too.)
So, more concretely, what are our plans to address making Elasticsearch and Lucene more resilient? Here is our top priority list as of today. It tends to change based on feedback from users:
Today, we can’t easily tell if a Lucene index is corrupted or not. Lucene 4.8 has added checksums to all files, with checksum validation on reads of small files and some large files, with an option for a more expensive merge process that also validates checksums on all files. We need to make sure we have more checksum verification, at the very least during merges on all files.
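The checksums Lucene 4.8 added are CRC32 values written into each file's footer. A simplified sketch of the scheme (the framing here is illustrative, not Lucene's actual file format):

```python
import struct
import zlib

def write_with_checksum(path, payload):
    # Append a CRC32 of the payload as a fixed-size big-endian footer.
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    with open(path, "wb") as f:
        f.write(payload)
        f.write(struct.pack(">I", crc))

def verify_checksum(path):
    with open(path, "rb") as f:
        data = f.read()
    payload, footer = data[:-4], data[-4:]
    (expected,) = struct.unpack(">I", footer)
    actual = zlib.crc32(payload) & 0xFFFFFFFF
    if actual != expected:
        raise IOError(f"checksum mismatch in {path}: "
                      f"expected {expected:#x}, got {actual:#x}")
    return payload
```

A merge that validates checksums would, in these terms, call `verify_checksum` on each input file before trusting its bytes.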
In the same way as checksums are used in Lucene, we can improve how we do checksumming in our transaction log. Today we are too lenient.
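Per-record checksums in a transaction log apply the same idea at a finer grain. A sketch, with a hypothetical framing format (not Elasticsearch's actual translog layout): each record carries its own CRC32, so replay can distinguish a clean end of log, a torn tail write, and genuine corruption:

```python
import struct
import zlib

def append_op(log, op_bytes):
    # Frame each operation as [length][crc32][payload].
    crc = zlib.crc32(op_bytes) & 0xFFFFFFFF
    log.write(struct.pack(">II", len(op_bytes), crc))
    log.write(op_bytes)

def replay(log):
    """Yield operations; stop at a torn tail, raise on corruption."""
    while True:
        header = log.read(8)
        if len(header) < 8:
            return  # clean end of log, or torn header from a crash
        length, crc = struct.unpack(">II", header)
        payload = log.read(length)
        if len(payload) < length:
            return  # torn write at the tail: replay what we have
        if zlib.crc32(payload) & 0xFFFFFFFF != crc:
            raise IOError("corrupt translog record")
        yield payload
```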
Validate Checksum on Snapshot
Snapshot/Restore is a new feature in Elasticsearch 1.x. During a snapshot operation, we can actually validate the checksum while reading the data from disk, and reject the snapshot (and failover to a replica) if the checksum doesn't match.
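In pseudocode terms, the snapshot path becomes "verify while reading, and fail over to a replica on mismatch". A hypothetical sketch of that control flow (all names here are invented for illustration):

```python
import zlib

def snapshot_shard(copies, expected_crc, repository):
    """Snapshot the first shard copy whose checksum matches, failing
    over to the next copy (e.g. a replica) on a mismatch."""
    for name, data in copies:
        if zlib.crc32(data) & 0xFFFFFFFF == expected_crc:
            repository.append((name, data))  # only verified data is stored
            return name
    raise IOError("all copies of the shard are corrupt")
```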
Validate Checksum on Recovery
In the same way as for snapshotting, we can improve shard recovery on replicas by checksumming the Lucene segments as they are copied over from the primary.
We plan on assigning a sequence number to operations that occur on primary shards. This is a really interesting feature that will lay the groundwork for many future features in Elasticsearch. The most obvious one is speeding up the replica recovery process when a node is restarted. Currently we have to copy every segment that differs which, over time, means every segment! Sequence numbers will allow us to copy over only the data that has really changed.
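A hypothetical sketch of that idea: the primary stamps each operation with a monotonically increasing sequence number, and a restarted replica asks only for the operations after the last one it has, instead of copying whole segments:

```python
class Primary:
    def __init__(self):
        self.seq_no = 0
        self.ops = []  # ordered (seq_no, op) history

    def apply(self, op):
        self.seq_no += 1
        self.ops.append((self.seq_no, op))
        return self.seq_no

    def ops_since(self, seq_no):
        # Everything the caller hasn't seen yet.
        return [entry for entry in self.ops if entry[0] > seq_no]

class Replica:
    def __init__(self):
        self.checkpoint = 0  # highest sequence number applied locally
        self.ops = []

    def catch_up(self, primary):
        missing = primary.ops_since(self.checkpoint)
        self.ops.extend(missing)
        if missing:
            self.checkpoint = missing[-1][0]
        return len(missing)
```

After a restart, `catch_up` transfers only the operations past the replica's checkpoint, which is the whole point of the optimization.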
There is a bug in Elasticsearch that can cause split brain when the cluster is faced with a partial partition, i.e. some nodes can see nodes that other nodes can’t see (2488). This bug has been open for a while and has received a lot of attention recently. I would like to talk about how we are dealing with this problem in particular.
First, network disconnections are actually a relatively rare problem with Elasticsearch deployments. A more common problem is an unresponsive node, due to very heavy load or prolonged garbage collections. A node which becomes unresponsive for a long period can appear to be dead. By improving node stability, we also improve cluster stability.
In Elasticsearch, we actively promote the use of dedicated master nodes in critical clusters to make sure that there are 3 nodes whose only role is to be master, a lightweight operational responsibility. By reducing the amount of resource intensive work that these nodes do (no data ingestion or search), we greatly reduce the chance of cluster instability.
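In the 1.x-era configuration format, a dedicated master node looks roughly like this in `elasticsearch.yml` (three nodes configured this way, with a majority quorum of two):

```yaml
# On each of the three dedicated master nodes:
node.master: true    # eligible to be elected master
node.data: false     # holds no shards, so no indexing or search load

# With three master-eligible nodes, require a majority of two to
# elect a master, which guards against split brain on a clean partition:
discovery.zen.minimum_master_nodes: 2
```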
Even with that, we found that with large clusters (our users keep on pushing us!), or large cluster states (many thousands of indices), those master nodes were still under an unacceptable load. We put a lot of work into improving that: making our shard allocation algorithm 100x faster and more lightweight, using our network stack more efficiently when publishing large cluster states, batching frequent updates to mappings in the cluster state, and so on. All of this work has been backported all the way back to the latest 0.90.x releases, and to recent 1.x releases.
This reflects the approach that we take to improving resiliency in Elasticsearch. We focus on the problems that end up causing most of the failures we actually see in the field (and our project is quite popular, with more than 500,000 downloads a month, so we see a lot of those), and address those first. Other improvements such as reduced memory usage with compressed fielddata structures, better cache recycling and reducing garbage collection have all indirectly helped to improve node (and thus cluster) stability.
Back to the issue. In order to properly fix it, and potential future issues, we first have to be able to replicate the problem reliably. With our improved testing infrastructure, we now have a "mock transport" which allows us to simulate the condition and reproduce the problem. With a lightweight test for the bug now in place, we can start to address it. We have a good idea of what is needed to fix this issue and have created a branch in Elasticsearch to implement and test the solution. Expect it to be pulled into master in the near future.
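The real mock transport lives in Elasticsearch's Java test framework; as a hypothetical Python analogy, a test transport can drop messages between chosen nodes to create exactly the partial partition the bug requires:

```python
class MockTransport:
    """Delivers messages between nodes unless a simulated partition
    blocks the link: a toy analogy, not Elasticsearch's test API."""
    def __init__(self):
        self.blocked = set()   # directed (src, dst) pairs that drop traffic
        self.delivered = []

    def partition(self, a, b):
        # Sever the link in both directions.
        self.blocked.update({(a, b), (b, a)})

    def heal(self, a, b):
        self.blocked.difference_update({(a, b), (b, a)})

    def send(self, src, dst, msg):
        if (src, dst) in self.blocked:
            return False  # silently dropped, as on a real partition
        self.delivered.append((src, dst, msg))
        return True

# A partial partition: node1 and node2 cannot talk to each other,
# but both can still reach node3 -- the condition behind the bug.
t = MockTransport()
t.partition("node1", "node2")
assert not t.send("node1", "node2", "ping")
assert t.send("node1", "node3", "ping")
assert t.send("node2", "node3", "ping")
```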
The partial-partition split brain is a real problem that we need to fix. Having said that, in practice it is actually very rare. Personally, with all the deployments that I have assisted with, I have never seen this bug. Users often attribute their problem to this bug, but it almost always turns out to be something else. That’s not to say that this bug is not important. It is. But the frequency with which it manifests, especially with the optimized resource usage of recent Elasticsearch versions and with dedicated master nodes, is very, very low. By addressing resource usage first, we have helped many more users than we would have been able to if we had focused just on the split-brain issue.
Last, a frequent question is why we don't use an external system to solve the problem. For example, our discovery module is quite pluggable, so why not just plug in ZooKeeper and use it instead of something we build ourselves? This, to me, is at the heart of our approach to resiliency. ZooKeeper, an excellent product, is a separate external system with its own communication layer. By having the discovery module in Elasticsearch use its own infrastructure, specifically the communication layer, we can use the “liveness” of the cluster to assess its health. Operations happen on the cluster all the time: reads, writes, cluster state updates. By tying our assessment of health to the same infrastructure, we can actually build a more reliable, more responsive system. We are not saying that we won’t have a formal ZooKeeper integration for those who already use ZooKeeper elsewhere. But this will sit next to a hardened, built-in discovery module.
By the way, for a live Q&A session that touches on many of these aspects, please watch the Q&A I gave a few months ago at the NYC Elasticsearch meetup, starting at minute 55 here.
If you look at the change log of the last Linux release, or the last Java release, it makes you wonder “how the hell did my app ever work with those bugs?” In the end, though, our systems work and work well. Elasticsearch and Lucene are massively popular, and many people use them daily to build amazing products. We continuously invest in fixing ::allthethings::; we hope this blog post gave you some insight into our approach.