
February 13, 2017

We are out of memory (or: Why systemd process limits ruined my day)

Daniel Mitterdorfer

It all started on a shiny winter day: While we were analyzing build failures in our Jenkins-based CI farm, this one caught our attention:

ERROR   3.63s J0 | TransportTasksActionTests.testTasksDescriptions <<< FAILURES!
   > Throwable #1: java.lang.OutOfMemoryError: unable to create new native thread
   > 	at __randomizedtesting.SeedInfo.seed([8961F9A3D408A4A6:9639C7A793F2E13]:0)
   > 	at java.lang.Thread.start0(Native Method)
   > 	at java.lang.Thread.start(Thread.java:714)
   > 	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
   > 	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
   > 	at org.elasticsearch.transport.MockTcpTransport$MockChannel.loopRead(MockTcpTransport.java:342)
   > 	at org.elasticsearch.transport.MockTcpTransport.connectToChannels(MockTcpTransport.java:213)
   > 	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:475)
   > 	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:440)
   > 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:310)
   > 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:297)
   > 	at org.elasticsearch.action.admin.cluster.node.tasks.TaskManagerTestCase.connectNodes(TaskManagerTestCase.java:228)
   > 	at org.elasticsearch.action.admin.cluster.node.tasks.TransportTasksActionTests.testTasksDescriptions(TransportTasksActionTests.java:484)
   > 	at java.lang.Thread.run(Thread.java:745)

Our initial investigation revealed that this failure occurred in every build of Elasticsearch, but only on our build slaves running SUSE Linux Enterprise 12 SP2. Other builds on these build slaves, such as our Lucene builds, were unaffected.

To reproduce the issue, we started a Gradle build right from the command line on the affected machine, and it built just fine. So the question remained: Why is the JVM unable to create a new OS thread? Do we, for some reason, create an excessive number of threads during the build? Let's add a test to our test suite that just spawns threads until the JVM dies:

import org.elasticsearch.test.ESTestCase;

public class ThreadTests extends ESTestCase {
    public void testCreateThreads() throws Exception {
        int i = 0;
        while (true) {
            logger.info("Starting thread [{}]", i++);
            Thread t = new Thread(new Idler());
            t.start();
        }
    }

    private static class Idler implements Runnable {
        @Override
        public void run() {
            try {
                Thread.sleep(100000000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}

When we ran it via Gradle on the affected system, the test had created roughly 20,000 threads by the time the JVM finally died. However, the number of threads reported by our "professional" thread monitoring solution (watch 'for pid in `jps -q` ; do echo -n "$pid " && ps huH p $pid | wc -l ; done' ;) ) while Jenkins was running never exceeded a few hundred threads.
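For comparison with the ps-based monitor above, here is a minimal sketch of how a JVM could report its own thread count from the inside. The class ThreadCountProbe and its method names are hypothetical and not part of the Elasticsearch build; the sketch assumes a Linux system where /proc/self/status contains a "Threads:" field.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper for illustration; not part of the Elasticsearch build.
class ThreadCountProbe {

    // Extracts the value of the "Threads:" field from the lines of /proc/<pid>/status.
    // Returns -1 if the field is missing.
    static int parseThreadCount(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("Threads:")) {
                return Integer.parseInt(line.substring("Threads:".length()).trim());
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Live Java threads as seen by the JVM itself.
        int jvmThreads = ManagementFactory.getThreadMXBean().getThreadCount();
        // Native threads of this process as seen by the kernel (Linux only).
        int osThreads = parseThreadCount(Files.readAllLines(Paths.get("/proc/self/status")));
        System.out.println("JVM threads: " + jvmThreads + ", OS threads: " + osThreads);
    }
}
```

The OS-level number is usually somewhat higher than the JVM-level one, because it also counts native threads such as garbage collector and compiler threads.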

Time to open Pandora's box and find out under which conditions the JVM throws this error. The C++ implementation of the native method java.lang.Thread.start0 is the function JVM_StartThread in jvm.cpp. Towards the end of this function, we can see that an OutOfMemoryError is thrown if the JVM is unable to create a new native thread.

We'll leave out the details, but a native thread on Linux will be created by os::create_thread. After analysis, we came up with two error conditions:

Upon native thread creation, the JVM checks on some platforms (one of them being SuSE Linux) whether the memory addresses of stack and heap are below a certain margin, called the thread safety margin. If this safety check fails, the JVM will terminate with an OutOfMemoryError. It can be disabled by setting -XX:ThreadSafetyMargin=0.
The native thread is ultimately created by pthread_create. Whenever pthread_create returns an error code, the JVM will raise an OutOfMemoryError.
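Seen from Java code, both conditions surface in the same way: Thread.start() throws an OutOfMemoryError. The following sketch shows how a caller could tell this case apart from heap exhaustion. NativeThreadFailure is a hypothetical helper, and matching on the substring "native thread" is an assumption based on the message in the stack trace above; the exact wording may differ between JVM versions.

```java
// Hypothetical helper for illustration only.
class NativeThreadFailure {

    // True if the error message points at native thread creation rather than heap exhaustion.
    // Assumption: the message contains "native thread", as in the stack trace above.
    static boolean isNativeThreadExhaustion(OutOfMemoryError e) {
        String message = e.getMessage();
        return message != null && message.contains("native thread");
    }

    // Tries to start the given thread; returns false if the JVM could not create the native thread.
    static boolean tryStart(Thread t) {
        try {
            t.start();
            return true;
        } catch (OutOfMemoryError e) {
            if (isNativeThreadExhaustion(e)) {
                return false;
            }
            throw e; // a different kind of OutOfMemoryError, e.g. heap exhaustion
        }
    }
}
```

Note that the error alone does not say which of the two conditions was hit, which is why we had to eliminate them one by one.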

As we were just investigating and needed to reduce the failure paths to check, we started by disabling the thread safety margin check by adding -XX:ThreadSafetyMargin=0 to the JVM options of our build. Needless to say, it was not that easy and disabling the check did not change anything.

So we needed to analyze under which conditions pthread_create can fail. After reading the man page of pthread_create, we concluded that we must hit a system limit and checked:

ulimit -a which reported a limit of 64140 maximum user processes.
/proc/sys/kernel/pid_max which reported 32768. This is the value at which PIDs wrap, so the highest PID on the system is 32767. Since every thread consumes an ID from this range and ulimit allows 64140 processes per user, this number was too low, and we increased it temporarily to 131072.
/proc/sys/kernel/threads-max which reported 128280. As this is above the limit of maximum user processes, we kept the value.
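The same three checks can be sketched in Java, reading the values from /proc instead of calling ulimit. KernelLimits and its methods are hypothetical names for illustration; the sketch assumes Linux, where /proc/self/limits has a "Max processes" row with the soft limit in the first value column.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper for illustration only.
class KernelLimits {

    // Reads a file under /proc/sys/kernel that contains a single number.
    static long readNumber(String path) throws IOException {
        byte[] content = Files.readAllBytes(Paths.get(path));
        return Long.parseLong(new String(content, StandardCharsets.UTF_8).trim());
    }

    // Extracts the soft "Max processes" limit from the lines of /proc/<pid>/limits.
    // Returns Long.MAX_VALUE for "unlimited" and -1 if the row is missing.
    static long parseMaxProcesses(List<String> limitsLines) {
        for (String line : limitsLines) {
            if (line.startsWith("Max processes")) {
                String soft = line.substring("Max processes".length()).trim().split("\\s+")[0];
                return "unlimited".equals(soft) ? Long.MAX_VALUE : Long.parseLong(soft);
            }
        }
        return -1;
    }

    // The reasoning from the list above: the PID range should not be smaller
    // than the number of processes a single user may create.
    static boolean pidSpaceTooSmall(long pidMax, long maxUserProcesses) {
        return pidMax < maxUserProcesses;
    }

    public static void main(String[] args) throws IOException {
        long pidMax = readNumber("/proc/sys/kernel/pid_max");
        long threadsMax = readNumber("/proc/sys/kernel/threads-max");
        long maxProcesses = parseMaxProcesses(Files.readAllLines(Paths.get("/proc/self/limits")));
        System.out.println("pid_max = " + pidMax + ", threads-max = " + threadsMax
            + ", max user processes = " + maxProcesses);
        System.out.println("PID space too small: " + pidSpaceTooSmall(pidMax, maxProcesses));
    }
}
```

With the values from our machine, pidSpaceTooSmall(32768, 64140) is true, which is why we raised pid_max.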

All in all, these numbers seemed fine. But the build was still failing. Time to go a level deeper.

pthread_create in glibc ultimately uses the system call clone. Poking through the kernel documentation, we found an interesting feature, namely the cgroup process number controller, which allows limiting the number of processes in a cgroup. This led us to look at our process hierarchy. PID 932 is the Jenkins slave process:

systemd(1)─┬─agetty(1642)
           .
           .
           .

           │
           └─runsvdir(903)─┬─runsv(922)─┬─java(932)─┬─{java}(976)

And indeed, the cgroup of the runsvdir service, which contains runsv and the Jenkins slave process, had a very conservative limit of 512 tasks (see systemd issue 3211):

cat /sys/fs/cgroup/pids/system.slice/runsvdir.service/pids.max
512
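A process can also find this limit for itself. The sketch below looks up the pids cgroup of the current process in /proc/self/cgroup and then reads pids.current and pids.max. CgroupPidsProbe is a hypothetical helper; it assumes a cgroup v1 layout with the pids controller mounted at /sys/fs/cgroup/pids, as on our build slave, and will not find anything on systems that use a different layout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration only.
class CgroupPidsProbe {

    // Each line of /proc/<pid>/cgroup has the form "hierarchy-ID:controller-list:cgroup-path".
    // Returns the cgroup path of the hierarchy that has the pids controller attached, if any.
    static Optional<String> pidsCgroupPath(List<String> cgroupLines) {
        for (String line : cgroupLines) {
            String[] parts = line.split(":", 3);
            if (parts.length == 3 && Arrays.asList(parts[1].split(",")).contains("pids")) {
                return Optional.of(parts[2]);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        Optional<String> cgroup = pidsCgroupPath(Files.readAllLines(Paths.get("/proc/self/cgroup")));
        if (!cgroup.isPresent()) {
            System.out.println("No pids controller found for this process.");
            return;
        }
        // Assumption: cgroup v1 with the pids controller mounted at /sys/fs/cgroup/pids.
        Path dir = Paths.get("/sys/fs/cgroup/pids" + cgroup.get());
        for (String file : new String[] {"pids.current", "pids.max"}) {
            Path p = dir.resolve(file);
            String value = Files.isReadable(p)
                ? new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim()
                : "n/a";
            System.out.println(p + " = " + value);
        }
    }
}
```

Comparing pids.current with pids.max while a build is running shows how close the service is to its limit.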

For testing purposes, we increased the limit to 4096, started the build again, and it finally turned green.

Note that to raise the limit persistently, you need to define TasksMax in the affected service's configuration file or set DefaultTasksMax for all services in the global systemd config file.

The image at the top of the post has been created by Kristel Rae Barton and is licensed as public domain (original source).