Hunting Tricky Apache Lucene Bugs
The series of events leading to these bug fixes is quite illustrative, not only in how seriously we take index corruption issues (no matter how rare) but also in how randomized tests and extensive continuous integration Jenkins jobs alert us to potential issues and how debugging tricky test failures in healthy open source projects works these days.
The story begins with Simon, noticing a spooky-looking test failure in one of Elasticsearch's Jenkins jobs, asking me to dig. The test in question, OldIndexBackwardsCompatibilityTests, is a great test: it unpacks old indices from pre-built ZIP files across all prior Elasticsearch releases and then verifies that the latest version of Elasticsearch starts up and searches and indexes against these indices properly.
Almost always the test passes, but every so often it would fail, with a scary
_N.si FileNotFoundException exception showing the index had somehow become corrupt. The corruption seemed to happen when Lucene 4.10.3 first kissed a 3.x index
IndexWriter's first commit to the index).
Like Lucene, Elasticsearch tests are heavily randomized, but provide a reproducible seed to recreate the failure. Unfortunately, when I ran the test with this seed, with the same Java version and JVM options, it refused to fail. Grrr.
This sometimes happens, e.g. if the test relies on how concurrent threads are scheduled, so I fired up a dedicated Python script to "beast" the test. The script efficiently runs a single test case over and over, bypassing the normally costly Maven dependency checking and compilation steps and just executing the test directly across multiple JVMs. Still, it would not fail for me, after more than 24 hours of beasting. Why not?
Unable to reproduce the failure, but still seeing it occasionally fail in Jenkins, we retrenched.
First, multiple developers stared at the admittedly hairy 3.x upgrade code in in Lucene, and we uncovered LUCENE-6279 which could possibly explain the failure we were seeing; at least it produced the same exception. Excited that we had possibly found the issue, we opened an Elasticsearch issue to track it.
But then, on further digging, Elasticsearch was not re-using index shard directories in this test, so that could not explain the failure. Still, this was a nice separate bug to uncover and fix and this shows a common pattern in open-source software: a test failure, or a new bug, causes sudden excitement and scrutiny by multiple developers on the code in question, and this often uncovers further issues to fix and the project moves forward. We saw a similar pattern with LUCENE-6299, also fixed in Lucene 4.10.4.
Retrenching again, we fell back on looking for any patterns to the 3 or 4 failures we had seen, and one thing we noticed is it was always the 0.20.6 test index (Lucene 3.6.2) that failed, despite other 3.x indices in the test. Why?
One unusual thing about that particular test index was that it was much larger than the others, containing many more segments, which confused me, and I asked about this on the issue. Robert quickly replied, explaining that he had regenerated that one index to make it "beefier" to better test another issue he had fixed in the past. Perhaps something unique about this index tickled a bug that the other indices didn't?
Simon also suggested turning on verbose file deletion logging for this test, a feature we recently added to Elasticsearch for precisely this kind of situation.
Then we waited for the next failure, which came a couple days later, and it clearly indicated where Lucene was incorrectly deleting the file!
Armed with that important information, and then going back and staring at that code some more, we finally isolated the situation that could cause the corruption: it only happens on Lucene 4.x’s first kiss to a 3.x index, if concurrent merges are also running and if the threads are scheduled by the JVM "just right". This explains why only the 0.20.6 index would fail: it was the only 3.x index with many segments that needing merging! It also explains why the test would not easily reproduce: it relied on the specific timing of concurrent merges.
Finally, we were able to make a small dedicated Lucene test case showing the corruption, and a simple fix for the bug. Because the bug is serious (causes corruption), and there were also plenty of other good bug fixes pending for 4.10.4, we then triggered a new Lucene release, now folded into Elasticsearch releases 1.4.5 and 1.5.0.