Elastic Cloud is on the tail end of eliminating a mix of memory issues that has caused problems for a lot of low-memory nodes, and in some rare cases even large nodes. Following the memory problems, we experienced connectivity issues on a handful of servers in eu-west-1 that affected any cluster with at least one node on these impacted servers.
After investigation, we found that there were a number of small problems that, when combined, created larger issues. This post attempts to summarise the range of things that contributed to the problems, and the extent of testing we’re ramping up to avoid repeating our mistakes. You might learn a thing or two about Linux, Docker, memory accounting, glibc, JVM settings, Netty, and Elasticsearch as well. We sure have.
Scope of the memory problem
Elastic Cloud runs tens of thousands of Elasticsearch nodes. These nodes run on servers with 30-244 GiB of memory and 4-32 cores, which we call our “allocator pool”. A single server can host a lot of Elasticsearch nodes, each running in a container with a reserved slice of memory and a (boostable) slice of the CPU. The servers run a small number of different kernel and Docker versions, and we’ll get back to why.
The memory issues can be categorised as:
- High memory pressure from Elasticsearch causing increased GC-load and latencies, and eventually the JVM running out of heap space. There were several things that could lead to this.
- Growth in the JVM’s non-heap memory, eating away memory intended for page cache and possibly causing kernel-level OOM-reaping.
- Because we used long-running containers, kernel memory was being incorrectly accounted, causing the kernel to kill processes in a container almost as soon as the node started.
All of these could affect a single node, but just one would be sufficient to make the node unreliable.
Whenever a process is restarted due to memory issues, we log the event and notify the account owner and operational contacts (rate-limited to at most one email per 6 hours). While Elasticsearch keeps getting better at guarding against running out of memory, it’s not uncommon for an overloaded cluster to run out of memory.
Thus, we know that in the period we had the most issues, approximately 1% of the running clusters were affected. That’s a lot of clusters, but as most mitigations affect every cluster and some required restarting Elasticsearch and/or upgrading, we needed to proceed carefully so as not to cause problems for the majority of nodes, which had no issues.
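The notification rule mentioned above (at most one email per 6 hours) can be sketched as follows. This is a minimal illustration, not the actual Elastic Cloud implementation: the `NotificationLimiter` class, the per-cluster keying, and the in-memory dictionary are all assumptions made for the example.

```python
from datetime import datetime, timedelta


class NotificationLimiter:
    """Allow at most one notification per key within a fixed window.

    Hypothetical helper: keying by cluster id is an assumption.
    """

    def __init__(self, window=timedelta(hours=6)):
        self.window = window
        self._last_sent = {}  # cluster id -> time of the last notification

    def should_notify(self, cluster_id, now):
        last = self._last_sent.get(cluster_id)
        if last is not None and now - last < self.window:
            # Still inside the quiet period: the event is logged, no email.
            return False
        self._last_sent[cluster_id] = now
        return True


limiter = NotificationLimiter()
start = datetime(2017, 1, 1, 0, 0)
first = limiter.should_notify("cluster-a", start)
second = limiter.should_notify("cluster-a", start + timedelta(hours=5))
third = limiter.should_notify("cluster-a", start + timedelta(hours=6))
```

Here `first` and `third` are allowed while `second` is suppressed, so a node that restarts repeatedly is still logged every time but does not flood the account owner’s inbox.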
On environment variety
Elastic Cloud is based on the acquisition of Found, which launched a hosted Elasticsearch service in 2012. Having managed lots of containers since before Docker even existed and container schedulers were buzzwords, we have a lot of experience in how container features can cause Linux to crash or, sometimes worse, cause nodes to slow down to a crawl. Even the 4.4 kernel series in Ubuntu LTS recently had OOM issues.
With the exception of security patches, we’re typically very slow when it comes to upgrading Linux and Docker: issues with these components can severely hurt our reliability or create significant ops workload to clean up containers that are not being created or destroyed correctly. Docker is a fast-moving technology, and generally only the most recent versions receive security patches. This makes it important to keep up, and we were gradually increasing the number of servers running more recent versions as we gained confidence in them.
Our server fleet is also composed of servers of varying size. Smaller servers limit the blast radius if there’s an issue, while larger servers are necessary to host the beefier clusters. Depending on available capacity during provisioning, a small 1 GB node can end up on a massive server. A node will be allotted the same CPU time regardless of the number of cores available, so performance differences are small between servers. There are settings that rely on the core count, however, and we didn’t properly cover all the bases of settings that look at cores. This could pose problems for a small node landing on a large server.
Having run the 3.19 series of the kernel for a long time without issues, it took some time before we suspected the kernel could be the issue. This is described in more detail later, but we’ve found that Docker ≥1.12 has problems on Linux <4.4.
To start with, we turned over every stone related to Elasticsearch and the JVM.
Elasticsearch, Lucene, and the JVM
Elasticsearch needs both heap space and page cache to perform well. A cluster with 1 GB memory on Cloud gets a little less than half the memory for heap space, to leave memory for page cache and non-heap JVM usage.
There were a few issues in the early 5.0 releases that could cause a small node to OOM quickly. One involved the buffer space the S3 snapshotter could use, which was changed to 100 MB: once segments grew large enough to fill that buffer, it would consume about 20% of a 1 GB node’s available space, quickly leading to issues. That was quickly identified and remedied, and every cluster was upgraded to apply the fix.
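To put the 20% figure in context, here is a rough back-of-the-envelope calculation. It assumes that the “available space” is the JVM heap and that the heap is exactly half of the container memory (the real figure is a little less than half); both are simplifications for illustration, and the function name is hypothetical.

```python
MIB = 1024 * 1024


def buffer_share_of_heap(container_bytes, buffer_bytes, heap_fraction=0.5):
    """Fraction of the heap a single buffer of the given size would occupy.

    heap_fraction=0.5 is an assumed upper bound, not the exact Cloud setting.
    """
    heap_bytes = container_bytes * heap_fraction
    return buffer_bytes / heap_bytes


# A 1 GB node with a 100 MB snapshot buffer.
share = buffer_share_of_heap(1024 * MIB, 100 * MIB)
print(f"{share:.1%}")
```

With these assumptions the buffer takes up just under a fifth of the heap, before Elasticsearch has done any indexing or searching at all.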
However, we still saw Elasticsearch 5.x use a lot more non-heap memory than 2.x, which we eventually attributed to Elasticsearch 5 upgrading to Netty 4. Disabling Netty’s pooled allocator and recycler further reduced non-heap memory.
That still wasn’t enough. Some nodes kept on OOM-ing, but now at the hands of the kernel’s OOM-reaper, which triggers if a process in a container with limited memory exceeds its limit. Increased non-heap usage would normally result in reduced performance, not in processes getting killed by the kernel OOM-reaper, so we knew we still had an issue.
We found more tweaks that improved the memory usage issues:
- Upgrading to JVM 8 turned tiered compilation on by default, something not really necessary for Elasticsearch. This would eat up 48 MB of memory for additional code caches.
- glibc’s memory allocator could waste a lot of memory on servers with many cores, which hits a small node running on a large server particularly hard. A colleague coming in from Prelert has described the interactions of the JVM and virtual memory on Linux as they relate to that change.
- There were a number of small fixes in Elasticsearch between 5.0.0 and 5.2.2 that helped with memory usage. For example, not closing a stream could leak memory.
- We reduced the number of GC threads the JVM allocates.
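Two of the tweaks above deal with defaults that scale with the host’s core count rather than with the container’s size. The sketch below shows why that matters for a small node on a large server. The formulas are assumptions that do not come from this post: they are the commonly documented defaults for 64-bit glibc (up to 8 malloc arenas per core, each reserving address space in chunks of up to 64 MiB) and for HotSpot 8’s parallel GC thread count. Treat the numbers as upper bounds on reserved address space, not as measured resident memory.

```python
def glibc_default_arena_limit(cores, arenas_per_core=8):
    """Assumed default cap on malloc arenas for 64-bit glibc."""
    return arenas_per_core * cores


def glibc_arena_reservation_mib(cores, arena_mib=64):
    """Worst-case address space (MiB) reserved if every arena is in use."""
    return glibc_default_arena_limit(cores) * arena_mib


def hotspot_default_parallel_gc_threads(cores):
    """Assumed HotSpot 8 default: all cores up to 8, then 5/8 of the rest."""
    if cores <= 8:
        return cores
    return 8 + (cores - 8) * 5 // 8


for cores in (4, 32):
    print(
        cores,
        glibc_arena_reservation_mib(cores),
        hotspot_default_parallel_gc_threads(cores),
    )
```

The same 1 GB node would see an eightfold difference in potential arena reservations between a 4-core and a 32-core server. Knobs such as the `MALLOC_ARENA_MAX` environment variable and the `-XX:ParallelGCThreads` JVM flag exist to pin these values; the specific values used on Elastic Cloud are not stated here.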
We are also expanding our test suites to include tests that specifically address long-running containers under load, and measure memory usage.
Kernel and Docker bugs
After much debugging, we found that specific combinations of kernel version and Docker version create a major problem with memory accounting. In our case, combining kernel version 3.19 with Docker version 1.12 exposed this bug. We had been running the 3.19 kernel for a long time, and it wasn’t immediately obvious that the kernel was a contributing factor to memory issues.
The core of the issue is that Docker 1.12 turns on kmem accounting. In 3.x versions of the Linux kernel, this causes problems because of a slab shrinker issue. In a nutshell, this causes the kernel to think that there is more kernel memory used than there actually is, and it starts killing processes to reclaim memory. Eventually it kills the JVM, which obviously hurts the cluster. There is a fix for the slab shrinker issue in kernel versions >= 4.0. Our testing led us to combine kernel version 4.4 with Docker 1.12. This combination solved the kmem accounting problems.
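A fleet-wide check for the risky combination could look something like the sketch below. It is a hypothetical helper, not part of any Elastic or Docker tooling: the function names are made up, and the thresholds simply encode the rule of thumb from this post (Docker ≥1.12 on Linux <4.4).

```python
def parse_version(text):
    """Turn '3.19.0-79-generic' or '1.12.6' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first component with no leading number
        parts.append(int(digits))
    return tuple(parts)


def risky_kmem_combination(kernel_release, docker_version):
    """True if Docker enables kmem accounting on a kernel older than 4.4."""
    docker_enables_kmem = parse_version(docker_version)[:2] >= (1, 12)
    kernel_too_old = parse_version(kernel_release)[:2] < (4, 4)
    return docker_enables_kmem and kernel_too_old


old_fleet = risky_kmem_combination("3.19.0-79-generic", "1.12.6")
new_fleet = risky_kmem_combination("4.4.0-66-generic", "1.12.6")
```

On a live host, the kernel release string could come from Python’s `platform.release()`; how the Docker version is collected is left out, since it depends on how the servers are managed.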
As you can see, there were a number of issues that combined to create a kind of “perfect storm” of memory issues. We are now at a point where we’re convinced we’ve identified all of the major issues and are a long way toward addressing them throughout our fleet.
The share of affected clusters in our SaaS environment was around 1%. While this seems like a small number, we’re committed to reaching out to affected customers and offering explanations and help. Although this issue affected clusters of all sizes, smaller clusters were the fastest to be affected because they had so little memory to begin with. Since trial customers tend to run smaller clusters, we’ll be contacting trial customers who were active during the affected time period and offering new or extended trials.