16 July 2015 Engineering

Apache Lucene helps squash JVM bugs

By Michael McCandless

A few weeks ago, Rory O'Donnell from the OpenJDK team notified us on the Lucene developers' list that a new JDK 1.9 snapshot build (b66) was available.

This happens every few weeks and is a familiar routine for us by now: we quickly upgraded the Lucene Jenkins jobs at both Elasticsearch's and Uwe Schindler's build servers so that Lucene's extensive randomized tests would stress the new JDK snapshot. But then we were suddenly bombarded by all sorts of exotic failures like this one!

Robert dug into the intermittent failures and found that they often happened when the test used Lucene's "spoon feeding reader" (MockReaderWrapper). This helpful test-only class wraps any incoming java.io.Reader and randomly chops up the incoming large blocks of characters into small randomly sized chunks, much like how you would spoon-feed a baby. Its purpose is to tickle any buffering bugs such as this classic, still-open Xerces-J Unicode bug.
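To make the spoon-feeding idea concrete, here is a minimal sketch of such a wrapper. It is not Lucene's actual MockReaderWrapper: the class name SpoonFeedReader, the maxChunk parameter, and the readAll helper are all hypothetical, chosen only for illustration. The only thing it relies on is that the java.io.Reader contract allows read(char[], int, int) to return fewer characters than requested.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Random;

/** Hypothetical sketch of a spoon-feeding reader; NOT Lucene's MockReaderWrapper. */
class SpoonFeedReader extends FilterReader {
  private final Random random;
  private final int maxChunk; // assumed knob: largest chunk ever handed back

  SpoonFeedReader(Reader in, Random random, int maxChunk) {
    super(in);
    if (maxChunk < 1) {
      throw new IllegalArgumentException("maxChunk must be >= 1");
    }
    this.random = random;
    this.maxChunk = maxChunk;
  }

  @Override
  public int read(char[] buf, int off, int len) throws IOException {
    if (len == 0) {
      return 0;
    }
    // Ask the wrapped reader for a random, small number of chars (1..limit),
    // no matter how many the caller requested.
    int limit = Math.min(len, maxChunk);
    int chunk = 1 + random.nextInt(limit);
    return in.read(buf, off, chunk);
  }

  /** Drains a reader the way a well-behaved consumer should: loop until -1. */
  static String readAll(Reader reader) throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[64];
    int n;
    while ((n = reader.read(buf, 0, buf.length)) != -1) {
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }
}
```

A consumer that assumes one read call fills its whole buffer will break under a wrapper like this, while a consumer that loops until -1, as readAll does, sees exactly the same characters as before. Seeding the Random makes a failing chunk pattern reproducible.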

Lucene has also had various exciting buffering bugs in its tokenizers in the past, but this time MockReaderWrapper caught a bug in the JVM, specifically in System.arraycopy!

Robert eventually boiled the failing test down to a small test case which finally led to this OpenJDK issue. The issue was quickly fixed (thank you!), but have a look at how it was fixed to see just how hairy it is for the JVM to implement the seemingly innocent System.arraycopy! This is like pulling off the volume knob on your car radio only to discover it has a small nuclear reactor inside.
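For a flavor of what a randomized check around System.arraycopy can look like, here is a small sketch that compares it against a plain loop over random sizes and offsets. It is not Robert's test case and makes no claim to reproduce the OpenJDK issue; the class ArrayCopyCheck and its methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical differential check; NOT the test case from the OpenJDK issue. */
class ArrayCopyCheck {

  /** Reference copy: an element-by-element loop applied to a clone of dest. */
  static char[] manualCopy(char[] src, int srcPos, char[] dest, int destPos, int len) {
    char[] out = dest.clone();
    for (int i = 0; i < len; i++) {
      out[destPos + i] = src[srcPos + i];
    }
    return out;
  }

  /** Returns how many random trials disagreed with the reference copy. */
  static int check(long seed, int iterations) {
    Random random = new Random(seed);
    int mismatches = 0;
    for (int iter = 0; iter < iterations; iter++) {
      int size = 1 + random.nextInt(256);
      char[] src = new char[size];
      for (int i = 0; i < size; i++) {
        src[i] = (char) random.nextInt(Character.MAX_VALUE + 1);
      }
      char[] dest = new char[size];
      // Pick a length and offsets that always stay inside both arrays.
      int len = random.nextInt(size + 1);
      int srcPos = random.nextInt(size - len + 1);
      int destPos = random.nextInt(size - len + 1);

      char[] expected = manualCopy(src, srcPos, dest, destPos, len);
      System.arraycopy(src, srcPos, dest, destPos, len);
      if (!Arrays.equals(expected, dest)) {
        mismatches++;
      }
    }
    return mismatches;
  }
}
```

On a correct JVM, check should report zero mismatches for any seed. Keeping the seed as a parameter means a failing run can be replayed exactly, which is the first step toward boiling a failure down to a small test case.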

This collaboration between the OpenJDK team and Lucene developers is win/win: new versions of OpenJDK (and of course Oracle's JDK, nearly the same thing) get more extensive testing before being unleashed on the world, and Lucene users gain some confidence that there are no specific Java bugs causing horrible things like silent index corruption, such as this nasty Java 1.6.0 bug from the past.

You can see all the exotic JVM bugs Lucene's tests have uncovered recently. The Lucene community also keeps a partial list of unfortunately already released JVM bugs affecting us, and Uwe gave a fun Berlin Buzzwords talk delving into some of the more exciting bugs.

There is one Oracle developer who really stands out in resolving the scary JVM bugs we discover: on behalf of Lucene committers, I'd like to extend a warm thank you to Vladimir Kozlov. We are perpetually in awe of Vladimir because somehow, with even the most cryptic and difficult Lucene test failures, iterating with Dawid or Robert or Uwe or sometimes all three, Vladimir can stare at heaps and heaps of assembly code created by the HotSpot compiler and understand and fix the JVM bugs. We are not sure how he does it, but he always does!

The silver lining

Things have not always been so rosy.

This disastrous bug for Lucene's users in Oracle's first Java 1.7.0 release caused all sorts of havoc a few years ago, which Uwe describes in this excellent blog post.

But the silver lining in this unfortunate event was the closer collaboration and squashed bugs we see today, not just in Lucene but also many other projects that Rory notifies on new JDK snapshot builds.

Still, things can and should be better.

Inexplicably, only OpenJDK committers ("author status") can open or comment on OpenJDK issues. This is insane and self-defeating: why on earth would any serious open-source project want to put false barriers for users to report and iterate on problems? This can only hurt the quality of your software.

We have no choice but to work around this commercial silliness by resorting to ad-hoc emails to OpenJDK team members and lists when our tests find problems, such as Dawid's response here on this open JDK 9 bug, which is still causing occasional SEGVs in Lucene tests today. Robert had to resort to an email (and pastebin) to the OpenJDK hotspot-compiler-dev list instead of opening an issue himself. We are allowed to report incidents at Oracle's Java bugs database, but they remain invisible until approved or somehow moved to OpenJDK's bugs system. Dark ages!

Second, isolating new JVM bugs is horribly time-consuming. For example, a new JDK bug still lurks in the latest JDK 9 build, but only Dawid has had time to dig in a bit and try to isolate it to a small test case (more volunteers welcome!). It's spooky that even now (b72 at time of writing), we still don't have a recent Java 9 early-access build that consistently passes Lucene's tests. The frequent build failure emails to the Lucene developers' list for bugs that are not in fact Lucene's cause additional noise and confusion for most readers on that list, who don't necessarily understand that we are testing early-access Java builds. At times we feel like a strange extension of Oracle's QA team!

IBM's J9 JDK joins the fun

IBM has its own J9 JDK, and we used to include it in Lucene's test rotation, but there were too many JVM bugs causing test failures, such as this mis-compilation of the FST.pack method. Back then, we never succeeded in getting IBM's attention to resolve them.

But then this recent Elasticsearch discussion led to a renewed effort, and thanks again to Robert we now have a number of specific J9 bugs affecting Lucene.

Interaction with J9 developers is even more commercially limited than with OpenJDK, since J9 is closed-source and there is not even a public issue tracking system for us to see the progress on issues, let alone open and comment on them. So instead of seeing how things are being fixed, as we could above with the tricky System.arraycopy bug, we see cryptic comments like this one. Still, this is better than nothing, and beggars can't be choosers.

I hope that some time soon we can declare that J9 won't crash or corrupt Lucene indices.

Overall, it's wonderful that Lucene's exhaustive randomized tests are so effective at finding not only Lucene bugs, but also bugs in the various JVM implementations. We've come a long way since the buggy 1.7.0 Oracle JDK release, and juicy bugs are being discovered and squashed. Even so, it's not clear this tenuous process is scalable going forward, given the unnecessary friction in how outside users report issues and the sizable time required to isolate new JVM issues. This is time taken away from, say, building a search engine!