
Elastic Cloud Outage: Root Cause and Impact Analysis

Summary

We recently had some partial outages on Elastic Cloud, our hosted and fully managed Elasticsearch service. We've completed our investigation and analysis of what happened and verified that our fixes will hold. The reliability of our cloud service is extremely important to us, so rather than leaving this as a small note on our status page, we wanted to share it more publicly.


On March 31st and April 10th, we had outages affecting some clusters running Elasticsearch 1.7.1 in the us-east-1 region.


This post summarises our understanding of the events, explains our priorities, and describes our mitigations, both short- and long-term.


First and foremost, we want to apologise to the customers who experienced downtime because of this unfortunate incident. Our platform's reliability is our bread and butter. As detailed below, we're using everything we learned from this incident to make our platform stronger and more resilient for everyone.


No data was lost. The nature of the incident was that the Elasticsearch clusters themselves were healthy, but the routing info our proxies held for some clusters got corrupted, so the proxies were unable to forward requests to those clusters.
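
To illustrate what "unable to forward" means here, below is a minimal, hypothetical sketch of a proxy-side routing lookup. The class and method names are placeholders, not our actual proxy code; the point is simply that when the routing entry for a cluster is empty or missing, there is nowhere to send the request, even though the cluster itself is healthy.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a proxy's in-memory routing table; not our actual proxy code.
public class RoutingTable {

    // clusterId -> list of "host:port" endpoints for that cluster's nodes
    private final Map<String, List<String>> routes = new ConcurrentHashMap<>();

    // Called when routing info for a cluster is (re)published.
    public void update(String clusterId, List<String> endpoints) {
        routes.put(clusterId, endpoints);
    }

    // Returns an endpoint to forward to, or empty if the routing info is
    // missing or corrupted -- in which case the proxy has to fail the request
    // even though the Elasticsearch cluster itself is healthy.
    public Optional<String> pickEndpoint(String clusterId) {
        List<String> endpoints = routes.get(clusterId);
        if (endpoints == null || endpoints.isEmpty()) {
            return Optional.empty(); // nothing to route to: the request fails
        }
        return Optional.of(endpoints.get(0)); // real code would load-balance
    }
}
```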


This incident affected certain clusters running Elasticsearch version 1.7.1 because of a bug in that version (explained later). All clusters susceptible to this bug have been upgraded across all regions.


Impact

On March 31st, between 10:45 and 11:30 UTC, 16% of clusters in us-east-1 were unroutable for a minute, with 4% of clusters having problems for up to 45 minutes. Peak request error rates hit 7%.


On April 10th, between 03:45 and 05:00 UTC, approximately 4% of clusters in us-east-1 experienced connectivity problems. Peak request error rates hit 3% in this time window.


Later on April 10th, starting at 19:49 UTC, approximately 20% of clusters in us-east-1 experienced connectivity problems for a 30-second period. We saw an overall 3% error rate for requests in this time window. About 4% of clusters continued to have connectivity issues beyond this 30-second window. All clusters were fixed by 21:24 UTC.


After the first outage on April 10th, cluster creation and modification were disabled. They were re-enabled around 22:00 UTC, when we started upgrading affected clusters to 1.7.5.


Root Cause

What happened behind the scenes was that our Apache ZooKeeper cluster lost quorum, for the first time in more than three years. After recent maintenance, a heap-space misconfiguration on the new nodes resulted in high memory pressure on the ZooKeeper quorum nodes, causing ZooKeeper to spend almost all of its CPU time on garbage collection. When an auxiliary service that watches a large part of the ZooKeeper database reconnected, it pushed ZooKeeper over the edge, which in turn caused other services to reconnect, resulting in a thundering-herd effect that exacerbated the problem.


When this first happened on March 31st, the quorum quickly re-established itself, and we suspected a network glitch as the cause. A number of 1.7.1 clusters were disconnected; we fixed them and opened an internal ticket to upgrade them. After the first April 10th outage, we saw an increase in CPU usage and moved ZooKeeper to more powerful instances. During the second outage on April 10th, we identified the memory misconfiguration, and the quorum has been stable since. We then proceeded to upgrade the affected 1.7.1 clusters to 1.7.5.


ZooKeeper is our core coordination service and is essential to our operations. It's so essential that we've made sure we can live without it. While it's typically rock solid (and even in this case, the loss of quorum was our fault), we learned in our beta days that even a single single-point-of-failure is one too many. Almost everything in our backend copes well with losing ZooKeeper connectivity, with the exception of the 1.7.1 clusters containing the bug.


With ZooKeeper, there is a concept of a session, which can span multiple connections as long as the connection is re-established within a 30-second grace period. There is also a concept of ephemeral nodes, which exist only as long as the session that created them. Elasticsearch nodes connect to ZooKeeper and write an ephemeral node with their routing info, which the proxies cache in their in-memory routing tables, both for performance and to be able to route even when ZooKeeper is unavailable. There was a bug where a loss of the session (and not just a reconnect) would cause 1.7.1 clusters to write empty routing info for subsequent sessions, which made the cluster unroutable. The loss of quorum caused a widespread loss of sessions, and therefore incorrect routing info to be written. This bug was a regression in a custom plugin we install in Elasticsearch, and it was not present in any other version. The plugin was built with an old version of Apache Curator, which is why only 1.7.1 clusters were affected.
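
As a rough sketch of the mechanism described above, the hypothetical Curator snippet below publishes a node's routing info as an ephemeral znode and re-publishes it when the session is lost. The paths, payload, and class names are illustrative only and are not the actual plugin code. The key distinction is between a brief reconnect within the grace period (session and ephemeral node survive) and a full session loss (the ephemeral node is gone and must be re-created, with correct, non-empty data, under the new session).

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of publishing routing info as an ephemeral znode;
// paths and payloads are placeholders, not the actual plugin code.
public class RoutingPublisher {

    private final CuratorFramework client;
    private final String path;        // e.g. "/clusters/<clusterId>/nodes/<nodeId>"
    private final byte[] routingInfo; // e.g. "10.0.0.12:9300" -- must never be empty
    private final AtomicBoolean sessionLost = new AtomicBoolean(false);

    public RoutingPublisher(String connectString, String path, String routingInfo) {
        this.client = CuratorFrameworkFactory.newClient(
                connectString, new ExponentialBackoffRetry(1000, 3));
        this.path = path;
        this.routingInfo = routingInfo.getBytes(StandardCharsets.UTF_8);
    }

    public void start() throws Exception {
        client.getConnectionStateListenable().addListener((c, newState) -> {
            if (newState == ConnectionState.LOST) {
                // The session expired: ZooKeeper has deleted the ephemeral node.
                sessionLost.set(true);
            } else if (newState == ConnectionState.RECONNECTED
                    && sessionLost.getAndSet(false)) {
                // New session: re-create the node with the same, non-empty
                // routing info. Writing empty data at this point is what made
                // the affected 1.7.1 clusters unroutable.
                try {
                    publish();
                } catch (Exception e) {
                    // retry/backoff omitted in this sketch
                }
            }
            // A brief reconnect within the grace period keeps the session and
            // the ephemeral node, so there is nothing to do in that case.
        });
        client.start();
        client.blockUntilConnected();
        publish();
    }

    private void publish() throws Exception {
        client.create()
              .creatingParentsIfNeeded()
              .withMode(CreateMode.EPHEMERAL) // disappears with the session
              .forPath(path, routingInfo);
    }
}
```

In the affected plugin, the equivalent of that re-publish step produced empty routing data for the new session, which is why only a full session loss, and not a brief reconnect, triggered the problem.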


While mitigating the problem, we worked on ensuring the stability of the platform as a whole, which was at risk. We called in additional engineers to assist in fixing the 1.7.1 clusters while we worked on stabilising the quorum.


Mitigations

To reduce the consequences of such an incident, all clusters on 1.7.1 have been upgraded to 1.7.5. The bug turned an internal problem into an external one, which it should not have. We are very wary of forced upgrades, as any upgrade risks bringing regressions with it, but certain bugs and issues causing severe stability or security problems can justify one. Hopefully you'll appreciate the stability and our ability to keep our customers' clusters running more than the potential downside of upgrading on rare occasions like this. Keep in mind that we will only upgrade within a minor version with no known breaking changes.


We have verified all ZooKeeper configurations to ensure the correct amount of heap space is provided. We also quadrupled the CPU and memory available to ZooKeeper, and reduced the memory ZooKeeper itself uses by around 40% through cleanup, such as removing information about deleted clusters and some historic data for running clusters.
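
As an illustration of the kind of check involved, the sketch below reads a JVM's configured maximum heap over JMX and compares it against an expected minimum. The host, port, and threshold are placeholders and this is not our actual tooling; it assumes remote JMX is enabled on the ZooKeeper process.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical heap-size check for a ZooKeeper JVM; host/port and the
// expected value are placeholders, not our real configuration.
public class HeapCheck {
    public static void main(String[] args) throws Exception {
        long expectedMinHeapBytes = 4L * 1024 * 1024 * 1024; // example: 4 GiB
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://zk-host.example.com:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            CompositeData heap = (CompositeData) mbs.getAttribute(
                    new ObjectName("java.lang:type=Memory"), "HeapMemoryUsage");
            long maxHeap = (Long) heap.get("max");
            if (maxHeap < expectedMinHeapBytes) {
                System.err.printf("Heap too small: %d bytes (expected >= %d)%n",
                        maxHeap, expectedMinHeapBytes);
                System.exit(1);
            }
            System.out.printf("Heap OK: %d bytes%n", maxHeap);
        } finally {
            connector.close();
        }
    }
}
```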


Some services need to scale for “black swan” events where everything goes wrong at the same time. While ZooKeeper used to average at <1% CPU and peak at 20% CPU when backups ran, it now peaks at <5% CPU in the steady state, a state we had maintained continuously for three years until this point.


We’ve expanded our ZooKeeper cluster’s observer layer to soak up the heaviest usage so the main quorum and its voting members aren’t affected. Additionally, we’ve done a thorough investigation of everything we have that uses ZooKeeper and its impact on the servers in terms of connection, memory, and CPU usage and growth. While we’ve identified some things we need to improve, we’ve also gained confidence that the mitigations we’ve deployed will be sufficient in the medium to long term.
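
For context on the observer layer: ZooKeeper observers receive the replicated state and serve reads and watches locally, but never vote, so heavy read and watch traffic can be pointed at them without adding load to the voting ensemble. Below is a minimal, hypothetical sketch of how a read-heavy auxiliary service could be pointed only at observer nodes; the hostnames are placeholders, not our real topology.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Hypothetical: a read/watch-heavy service connects only to observer nodes,
// so its load never lands on the voting members of the quorum.
// Hostnames are placeholders, not our real topology.
public class ObserverClient {
    public static CuratorFramework create() {
        String observersOnly =
                "zk-observer-1.example.com:2181,"
              + "zk-observer-2.example.com:2181,"
              + "zk-observer-3.example.com:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                observersOnly, new ExponentialBackoffRetry(1000, 3));
        client.start();
        return client;
    }
}
```

On the server side, observers are declared as such in the ensemble configuration, so they replicate state and forward any writes to the leader without ever participating in quorum votes.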