Change is an inevitable part of life, and it is everywhere: people in our lives, our relationships, our environment — and JVM defaults. And the latter can bite you if you are not very careful.
The other day we had a very interesting support case: a customer had upgraded a cluster from Elasticsearch 1.7 to 2.4, and after some time they experienced very high CPU usage on some of the nodes in a large cluster. This is definitely one of the tougher kinds of issues to debug. Our team analyzed piles of log files, stared at diagnostic dumps, visualized shard balances across the cluster, and reviewed the cluster configuration.
Then it finally hit us: the startup scripts in their environment set the JVM option -XX:ReservedCodeCacheSize=64m. Now you may wonder what this JVM option does. Glad you've asked! Let me take you on a short detour into the guts of the JVM. When a Java application starts, the JVM runs it in interpreted mode. After some time, the JVM detects that certain methods are called very often and compiles them to native machine code that is optimized for the platform the application is running on. This machine code needs to be stored somewhere in memory, and that part of memory is called the code cache. So the JVM option -XX:ReservedCodeCacheSize=64m sets the size of the code cache to 64 MB.
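If you want to peek at the code cache on your own machine, the JVM exposes it through the standard management API as a memory pool. Here is a minimal sketch (the class name CodeCacheInfo is just for illustration; the exact pool names vary by JVM version, e.g. a single "Code Cache" pool on Java 7/8 versus segmented "CodeHeap" pools on Java 9 and later):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class CodeCacheInfo {
    public static void main(String[] args) {
        // The code cache shows up among the JVM's memory pools. Depending on
        // the JVM version it is named "Code Cache" or split into several
        // "CodeHeap '...'" pools, so we match on "code".
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().toLowerCase().contains("code")) {
                System.out.printf("%s: used=%d KB, max=%d KB%n",
                        pool.getName(),
                        pool.getUsage().getUsed() / 1024,
                        pool.getUsage().getMax() / 1024);
            }
        }
    }
}
```

Run it with and without -XX:ReservedCodeCacheSize=64m on the command line and you should see the reported maximum change accordingly.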
The JVM has not one but two JIT compilers: C1, which does only basic optimizations but compiles very quickly, and C2, which optimizes heavily but takes up more system resources. Which one should we choose? The JVM engineers had a trick up their sleeve: the JVM will use them both! This feature is called tiered compilation. Each method is first run by the interpreter; then, if it is invoked often enough, it gets compiled by C1, and finally, if it is called even more often, by C2.
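You can observe this promotion indirectly through the CompilationMXBean, which reports the cumulative time the JVM has spent in the JIT compilers. A small sketch, with an arbitrary hot method and iteration counts chosen just for illustration (running it with -XX:+PrintCompilation additionally prints each method as it is compiled, including its tier):

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class TieredDemo {
    // An arbitrary method to make "hot": after enough invocations the JVM
    // promotes it from the interpreter to C1-compiled and then C2-compiled code.
    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // On HotSpot, compilation time monitoring is supported.
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        long before = jit.getTotalCompilationTime(); // ms spent in JIT so far

        long blackhole = 0;
        for (int i = 0; i < 200_000; i++) {
            blackhole += sumOfSquares(1_000); // invoke often enough to trigger compilation
        }

        long after = jit.getTotalCompilationTime();
        System.out.println("result: " + blackhole);
        System.out.println("additional JIT compilation time (ms): " + (after - before));
    }
}
```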
On Java 7, tiered compilation was off by default, and the default code cache size was 48 MB. This was a good default: depending on the underlying hardware, the JVM would choose either C1 or C2, but it would not use both of them. In Java 8, tiered compilation was turned on by default and, consequently, the default reserved code cache size was increased to 256 MB. This increase was necessary because when methods first get compiled by C1 and then by C2, a lot more native code is produced.
Back to our story: originally our customer ran on Java 7, so this setting actually increased the reserved code cache size from 48 MB to 64 MB. However, when they migrated to Java 8, they kept the flag, but tiered compilation was now enabled, which turned the setting into a significant decrease from the new default of 256 MB. Consequently, depending on which methods got compiled on the respective nodes, the code cache filled up. When that happens, the JVM compiles no more new methods and instead runs them only in the interpreter, which results in vastly reduced performance, exactly what we saw on the cluster nodes whose code cache was full. So we just had to remove the flag from the startup script and everything went back to normal. We also recommend not changing this setting, as the default value is sufficient for Elasticsearch.
Moral of the story: while we continuously strive to reduce the complexity within Elasticsearch, it is still an application running on top of the JVM. So it is not sufficient to understand only Elasticsearch; you need to look at all levels: from the hardware to the operating system, the JVM, and up to the application layer. These complex systems change all the time, and a seemingly simple change of a JVM flag can turn into a significant performance problem down the road.