Post-mortem: Memory Leak Hunting on Windows

At Elastic we stand next to our users to help them design their deployments, architect their solutions, and troubleshoot the problems that come up in day-to-day operations. Nasty bugs surface now and then, and sometimes identifying their root cause can be a painful and frustrating exercise, going through debug logs, heap/thread dumps, OS metrics and failed reproduction attempts.

In this blog post, we will go through the story of our investigation into a bug that has been impacting a large number of users and their Logstash deployments, and tell how our support and development teams at Elastic joined forces to solve it.

It all started with several reports of abnormal memory consumption coming from our users, both to our support team and through GitHub issues; see 3722 and 3754.

In this case, an analysis of the JVM heap usage over time using JConsole revealed no significant anomalies in heap memory consumption, just the classic healthy “sawtooth”, where garbage collection periodically frees memory:

healthy JVM garbage collection

Interestingly though, the total memory footprint of the process on Windows would keep growing over time, well beyond the configured JVM limit (1GB).

"2GB? I set Xmx to 1GB!"

So where do we start to understand where this memory is being used?

Reproducing the problem

When there are no steps to follow to recreate a problem, the only approach that usually gives results is to recreate an environment as close as possible, in terms of OS, application configuration, and versions, to the one where the problem has been showing its symptoms: every little detail can make the difference.

Using the same configuration and log files as one of our customers, we set up Logstash on a Windows Server 2012 instance and started monitoring this test environment, observing its behavior as new files to be processed were automatically generated.

As reported by the users impacted by this bug, the high memory consumption of the Logstash process would only develop over time, after several hours of usage; this is a common pattern when dealing with memory leaks. However, as seen in the JConsole picture above, no signs of a leak were observable within the Java heap.

At this point, we started looking around for alternative, more comprehensive memory monitoring and troubleshooting tools, as it became evident after hours of tests that what we were looking for was not something JConsole could show us.

VMMap

VMMap is part of Microsoft Sysinternals and is a process virtual and physical memory analysis utility. It shows a breakdown of a process's committed virtual memory types as well as the amount of physical memory (working set) assigned by the operating system to those types.

Using VMMap attached to the test Logstash instance, we got a very granular breakdown of the process's memory usage; this allowed us to go beyond the simple heap/off-heap distinction available in JConsole and keep our investigation going.

What jumped out fairly quickly was the amount of memory reported as “Unusable”:

What is this..Unusable?!

Unusable Memory

Windows' virtual memory manager has a 64KB allocation granularity, which means that allocating a smaller chunk results in "wasted" virtual memory. In itself this isn't a problem, but such a large amount of Unusable memory indicates that there is a high number of smaller-than-64KB allocations which aren't being freed. Selecting "Unusable Memory" in VMMap confirmed this:

small chunks of unusable memory

All these 56KB chunks meant that each of them resulted from an 8KB allocation (the 64KB granularity minus the 8KB actually requested), which we were able to confirm in "Private Data".

allocated chunks

In VMMap this section refers to memory allocated privately by the process, including both JVM-related memory structures and other non-JVM memory usage (e.g. through native allocations). Since JConsole and other JVM heap analysis tools showed low memory consumption, this pointed to off-heap allocations.
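To get an intuition for how this granularity turns into "Unusable" memory, here is a minimal sketch, our illustration rather than part of the original investigation, that reserves many 8KB blocks directly through the Win32 VirtualAlloc call using Ruby's standard Fiddle binding on a 64-bit Windows Ruby. Each 8KB region starts on a fresh 64KB boundary, so the rest of every granule shows up as Unusable in VMMap:

    require 'fiddle'

    kernel32 = Fiddle.dlopen('kernel32.dll')
    virtual_alloc = Fiddle::Function.new(
      kernel32['VirtualAlloc'],
      [Fiddle::TYPE_VOIDP, Fiddle::TYPE_SIZE_T, Fiddle::TYPE_INT, Fiddle::TYPE_INT],
      Fiddle::TYPE_VOIDP
    )

    MEM_COMMIT_RESERVE = 0x3000 # MEM_COMMIT | MEM_RESERVE
    PAGE_READWRITE     = 0x04

    # Each call reserves a new region aligned to the 64KB allocation granularity
    # but only uses 8KB of it; the remaining 56KB of the granule is "Unusable".
    regions = Array.new(10_000) do
      virtual_alloc.call(nil, 8 * 1024, MEM_COMMIT_RESERVE, PAGE_READWRITE)
    end

    sleep # keep the process alive so it can be inspected in VMMap

Attach VMMap to this process and the Unusable counter grows alongside Private Data, which is roughly the pattern we were seeing in Logstash.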

At this stage, after days of testing and a non-negligible amount of frustration, we started seeing the light at the end of the tunnel: we had found a smoking gun, the continuous 8KB allocations outside the JVM heap which were never freed and were easily spotted through their "Unusable Memory" counterpart.

Narrowing down the leak root cause

Now that there was reproducible evidence of the problem (the increase in "Unusable" memory), we started progressively reducing the scope of the test while ensuring VMMap would still report the off-heap allocations.

Where to start?

The first task was to find the minimal set of plugins, ideally just one, that caused the problem. All the pipelines that leaked used a file input, many filters (mutate, grok, date, etc.) and an elasticsearch output, so we first removed all the filters and confirmed the leak still happened, then replaced the elasticsearch output with stdout and confirmed the leak again.
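The stripped-down pipeline looked roughly like this (the path is just a placeholder for the directory where the generated log files landed):

    input {
      file {
        path => "C:/logstash-test/*.log"
        start_position => "beginning"
      }
    }

    output {
      stdout { }
    }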

This meant we had a very good candidate for the root cause: the logstash-input-file plugin!

Suspect: Logstash file input

This plugin uses a library called filewatch, which takes care of observing a list of files, reporting changes and keeping track of read progress for each file.

Knowing that the leak came from the file input plugin, and that filewatch is responsible for most of its logic, we hacked together a small Ruby script that used filewatch directly to watch a few files. As a reminder, at this point it was easy and quick to tell whether the leaky code was still being exercised, because any "Unusable" allocations were immediately evident in VMMap. This meant we no longer needed stress tests running for long periods of time, and each hypothesis could be confirmed in seconds.
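The script was roughly along these lines; this is a reconstruction based on the filewatch API of that era (FileWatch::Tail#tail and #subscribe), with a placeholder path:

    require 'filewatch/tail'

    tail = FileWatch::Tail.new
    tail.tail("C:/logstash-test/*.log")

    # subscribe loops forever, yielding each new line of every watched file;
    # we simply discard the data and watch the process's memory in VMMap
    tail.subscribe do |path, line|
    end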

Running this script with only filewatch also triggered the leak so we were on the right path!

Suspect: FileWatch ruby library

FileWatch has two functionalities:

  • Watch: produce notifications of file changes (like modification, deletion, truncation)
  • Tail: use Watch's notifications to read contents from watched files

Since the test script used both, the next step was to skip the Tail functionality entirely and just have a script monitoring a list of files using FileWatch::Watch, doing nothing with the notifications. Again, the memory leak was evident in VMMap!

Going back to the information previously found in VMMap, we could see that the leak built up from multiple small allocations, and that the data found in "Private Data" contained names of files. Looking at the FileWatch::Watch code, we can see that at its center is Ruby's File#stat method, which is used to retrieve information about a file such as its permissions, size, modification time and more. At this point we understood this was a serious bug because…

Suspect: JRuby

A simple command-line test that continuously requested information for a path was used to check for the presence of the leak in JRuby itself:

jruby -e 'loop { File.stat("bin") }'

Watching the JRuby instance with VMMap made it clear we had found the culprit: VMMap showed a huge increase in off-heap allocations that weren't being freed.

Identifying the JRuby version that introduced the regression

GitHub issues reported that Logstash 1.4.4 (JRuby 1.7.17) and 1.5.0 (JRuby 1.7.19) were not affected, which meant the memory leak hadn't always been present in Logstash.

To make the lives of the JRuby core developers easier, we narrowed down the version of JRuby that had introduced the leak, using the command-line loop shown above as a reliable reproduction test.

JRuby 1.7.19 showed no abnormal volume of "Unusable" memory allocations after 5 minutes:

good!

In JRuby 1.7.20, after the same 5 minutes, the story was completely different, showing that 1.7.20 had introduced the leak:

bad!

The Fix

Fortunately the JRuby core team is very active on IRC and, as soon as we presented the case in the #jruby channel, Tom Enebo and others came forward and started investigating the leak. Issue jruby/jruby#3446 was created as a result, and we ended up serving as testers for the proposed fixes. Thanks to an awesome effort by the JRuby team, the bug was traced back to jnr-posix, a POSIX emulation layer for Java. In about two weeks, JRuby 9.0.4.0 and 1.7.23 were released with a fix for the memory leak! So all that remains is…

New Logstash releases!

Stay tuned for this week's releases of Logstash 1.5.6 and 2.1.1 which will include JRuby 1.7.23!