How Elastic traced a formatting bug in Elasticsearch


Building new features generates a lot of excitement, but fixing bugs is an equally important part of building a robust and trustworthy platform. We take even small bugs seriously, and sometimes a small bug leads somewhere surprising. This is the story of how following up on a rare test failure led to a subtle but real bug in how Java formats date-time values.

How we got here

Elasticsearch has a feature called Composite Aggregation, which can be used to paginate through aggregation results. It uses a fairly standard pattern: each response includes an "after key", a human-readable key (as opposed to a hash value, like the scroll API uses) representing the last bucket the aggregation returned in that call. The caller passes that key back on the next call, and the aggregation picks up where it left off. Having a human-readable key helps a lot in debugging, and allows for some fancy tricks if you're feeling cheeky. But it also requires formatting.

We got some strange bug reports where composite aggregations would "get stuck": the caller would pass in the after key composite had returned, but get back the same page and the same after key they had already received. What gives?

Date formatting is what gives. It turns out some date formats weren't "round tripping" correctly; that is, the formatted string didn't parse back to the original value. At first glance it might seem like every format should round trip correctly, but in practice that isn't true. For example, in #72556, the caller sent the sensible-looking format YYYY-MM-dd, but uppercase YYYY is the week-based year; the ordinary calendar year is lowercase yyyy. Java is happy to print a week-based year followed by a month, since it can compute both from the internal representation. But when parsing, it doesn't know how to place a month within a week-based year, and instead of throwing, it defaults everything it can't resolve to 1. So we'd always get back January 1st of whatever year we had originally formatted.
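
You can see the mismatch with nothing but the JDK. The sketch below is only an illustration of the YYYY vs. yyyy confusion; plain java.time refuses to parse the week-based pattern back into a date, whereas Elasticsearch's own formatter defaults the fields it can't resolve, as described above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Plain-JDK illustration of why YYYY-MM-dd can't round trip.
public class WeekBasedYearDemo {
    public static void main(String[] args) {
        DateTimeFormatter weekBased = DateTimeFormatter.ofPattern("YYYY-MM-dd"); // YYYY = week-based year
        DateTimeFormatter calendar  = DateTimeFormatter.ofPattern("yyyy-MM-dd"); // yyyy = calendar year

        LocalDate date = LocalDate.of(2021, 3, 15);
        System.out.println(date.format(weekBased)); // formatting works: prints "2021-03-15"

        try {
            // Parsing does not: a month can't be placed inside a week-based year.
            LocalDate.parse("2021-03-15", weekBased);
        } catch (DateTimeParseException e) {
            System.out.println("plain java.time refuses to parse it: " + e.getMessage());
        }

        System.out.println(LocalDate.parse("2021-03-15", calendar)); // round trips fine
    }
}
```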

It’s hard to look at a format in isolation and tell whether it will round trip correctly, especially since the answer may depend on the value being formatted. So we did the next best thing. When we go to write the after key in Composite, we test whether we can round trip the exact value we’re about to write. If we can’t round trip the value exactly, we fail the query. This prevents a key we can’t parse from creating more problems down the line, like causing a client to miss data or loop forever (following the principle of “Crash Early”¹).
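
As a rough illustration (this is a sketch of the idea, not Elasticsearch's actual implementation), the check amounts to formatting the exact value we're about to return, parsing it back, and failing loudly if anything was lost:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class RoundTripGuard {
    // Sketch of the round-trip guard: format the value, parse it back, compare.
    static void ensureRoundTrips(DateTimeFormatter format, long epochMillis) {
        Instant original = Instant.ofEpochMilli(epochMillis);
        String formatted = format.format(original);
        Instant parsed = format.parse(formatted, Instant::from);
        if (!parsed.equals(original)) {
            throw new IllegalArgumentException("date format cannot round trip [" + epochMillis
                    + "]: it formats to [" + formatted + "], which parses back to ["
                    + parsed.toEpochMilli() + "]");
        }
    }

    public static void main(String[] args) {
        long millis = 1_614_556_800_123L; // 2021-03-01T00:00:00.123Z

        // ISO_INSTANT keeps the milliseconds and the offset, so it round trips.
        ensureRoundTrips(DateTimeFormatter.ISO_INSTANT, millis);

        // A second-granularity pattern silently drops the milliseconds, so it fails the check.
        ensureRoundTrips(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC), millis);
    }
}
```

In the sketch, the second-granularity pattern trips the guard because the milliseconds don't survive the trip back.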

We also had a bug around using the epoch_seconds format with composite, which we wanted to support. So as a prerequisite to the round trip check, we added what turned out to be a pretty easy fix for that, along with some testing around it. One of my favorite things about working at Elastic is our robust testing toolkit. In this case, it made it very easy to write tests that randomize their sample data, and since it was easy, that’s what I did. It turns out that was a good idea.
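
To give a flavor of what that looks like, here is a stripped-down sketch of a randomized round-trip test using plain java.util.Random and JDK formatters. Elasticsearch's test framework adds reproducible seeds and much richer randomization helpers, so treat this as the shape of the idea rather than the real thing.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomizedRoundTripTest {
    public static void main(String[] args) {
        Random random = new Random();
        List<String> zones = new ArrayList<>(ZoneId.getAvailableZoneIds());
        for (int i = 0; i < 1000; i++) {
            // Pick a random time zone and a random instant between 1970 and roughly 2100.
            ZoneId zone = ZoneId.of(zones.get(random.nextInt(zones.size())));
            Instant value = Instant.ofEpochSecond(Math.floorMod(random.nextLong(), 4_102_444_800L));
            DateTimeFormatter formatter = DateTimeFormatter.ISO_OFFSET_DATE_TIME.withZone(zone);

            // Assert the formatted value parses back to exactly the same instant.
            Instant parsed = formatter.parse(formatter.format(value), Instant::from);
            if (!parsed.equals(value)) {
                throw new AssertionError("round trip failed for " + value + " in zone " + zone);
            }
        }
        System.out.println("all round trips succeeded");
    }
}
```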

An unexpected test failure

We merged the fixes above, and everything was fine for a few weeks. Then the test failed. It was strange: the failure reproduced locally sometimes, but not always. Eventually, we realized two things. First, the test was flaky depending on whether the JVM in question had received the 2020b timezone database update, which corrected the rules for Antarctica/Macquarie, the zone the test had randomly selected. Second, that update moved Antarctica/Macquarie into a time zone that observes Daylight Saving Time. Anything DST-related is immediately suspect, and we quickly reproduced the failure with any DST transition where the local clock moves backward.
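
As an aside (this wasn't part of the original investigation), the JDK can tell you directly what its bundled tzdata thinks of a zone. The snippet below asks the zone rules whether Antarctica/Macquarie has any upcoming transition, which is exactly what the 2020b update changed.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.zone.ZoneOffsetTransition;

public class TzdataCheck {
    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Antarctica/Macquarie");
        ZoneOffsetTransition next = zone.getRules().nextTransition(Instant.now());
        // With 2020b-era rules the zone observes DST again, so a transition is printed;
        // with older tzdata it is a fixed offset and this prints null.
        System.out.println("next transition: " + next);
    }
}
```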

Once we knew how to reproduce the bug, I wrote a unit test case that did exactly that and threw it in the debugger. I knew we had some custom code around epoch values, and that was my first suspect. We have an anonymous class method, MILLIS#resolve(), which looked promising. That’s where we convert milliseconds-since-epoch, which Elasticsearch likes to use internally for dates, to seconds plus nanoseconds, which Java likes to use internally for dates. My theory was that we were somehow messing this up. So I put in some breakpoints and stepped through that method, but at the end of it, the fieldValues map (which is essentially the scratch pad for Java’s date parsing) had the right values in it. I hit the “resume program” button to run to my next breakpoint. When I got there, I looked at the fieldValues map again, and my correct value had changed! Now it showed the incorrect value.
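
For context, the conversion that method performs amounts to something like the following. This is a sketch, not Elasticsearch's code; it mirrors what Instant.ofEpochMilli does.

```java
public class MillisSplit {
    public static void main(String[] args) {
        long epochMillis = -1L; // one millisecond before the epoch, to show the edge case
        long instantSeconds = Math.floorDiv(epochMillis, 1000L);            // -> ChronoField.INSTANT_SECONDS
        long nanoOfSecond = Math.floorMod(epochMillis, 1000L) * 1_000_000L; // -> ChronoField.NANO_OF_SECOND
        // floorDiv/floorMod (rather than / and %) keep pre-1970 values correct.
        System.out.println(instantSeconds + "s + " + nanoOfSecond + "ns"); // prints: -1s + 999000000ns
    }
}
```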

Whatever remains, however unlikely

It’s hard to suspect that a core component of a widely used, extremely well tested system is the root cause of a bug. You know in your heart of hearts that it’s probably not broken, and you’re probably just calling it wrong. I think every senior developer has probably cried wolf about a library being broken at some point, and felt quite silly to learn they were doing it wrong. I wanted to be fully confident this wasn’t something we were doing.

I took a look at java.time.format.Parsed#resolve(). Hmm, what’s this ResolverStyle parameter? It has a strict mode and we’re not using it. That must be the problem. But it wasn’t. So I threw some breakpoints on it and ran the debugger again. Just like before, our MILLIS#resolve() code put the right value for INSTANT_SECONDS in the map. Then we entered Parsed#resolveFields(), and something strange happened. It took INSTANT_SECONDS out of the map and created a zoned date time. That date time had a time zone, America/New_York, but it didn’t have a UTC offset. Strange. Without the offset, it had converted our unambiguous INSTANT_SECONDS value into an ambiguous one, that is, a local time that occurs twice. For example, in the America/New_York time zone the local time 2020-11-01 01:00:00 occurs twice, once at UTC-4 and once at UTC-5; given only the time and the zone, it isn’t clear which is meant. A few lines later, resolveInstant() put INSTANT_SECONDS back in the map, but with the wrong value. Since the zoned date time was ambiguous, it simply picked the earlier of the two possible instants, which was wrong in this case (and also explained why transitions that move the clock forward didn’t break). There was our bug.
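
To see the ambiguity in isolation, here is a small plain-JDK snippet using the same America/New_York example:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class AmbiguousLocalTime {
    public static void main(String[] args) {
        ZoneId newYork = ZoneId.of("America/New_York");
        LocalDateTime repeated = LocalDateTime.of(2020, 11, 1, 1, 0);

        // Both offsets are valid for this wall-clock time: prints [-04:00, -05:00]
        System.out.println(newYork.getRules().getValidOffsets(repeated));

        // Given only the local time and the zone, java.time resolves the overlap
        // to the earlier offset (-04:00), i.e. the earlier of the two instants.
        System.out.println(ZonedDateTime.of(repeated, newYork));
    }
}
```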

From there it wasn’t hard to recreate the problem using just the core JDK. I had to switch from milliseconds to seconds, but otherwise it followed very logically. I tried a few adjacent cases and they didn’t have this issue. I stuffed my work in a gist and floated it around the team to see if anyone saw something I’d missed. I got some feedback on how best to present the issue, and submitted a JDK bug, my first ever (achievement unlocked?). Two weeks later, they fixed it.
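
The gist itself isn't reproduced here, but a reconstruction along the lines described above looks roughly like this; on a JDK that predates the fix, the parsed instant comes back an hour earlier than the original.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

// A reconstruction (not the actual gist) of a JDK-only reproduction of the bug:
// format an unambiguous epoch-second value with a zoned formatter, parse it back,
// and watch the resolver rebuild it from the ambiguous local time.
public class ParsedResolveBug {
    public static void main(String[] args) {
        ZoneId newYork = ZoneId.of("America/New_York");

        // A formatter that prints nothing but epoch seconds, with a zone attached.
        DateTimeFormatter epochSeconds = new DateTimeFormatterBuilder()
                .appendValue(ChronoField.INSTANT_SECONDS)
                .toFormatter()
                .withZone(newYork);

        // 2020-11-01 01:30 in the second (standard time, -05:00) pass through the
        // repeated hour: a perfectly unambiguous instant.
        Instant original = ZonedDateTime.of(2020, 11, 1, 1, 30, 0, 0, newYork)
                .withLaterOffsetAtOverlap()
                .toInstant();

        String text = epochSeconds.format(original);
        Instant parsed = epochSeconds.parse(text, Instant::from);

        System.out.println("original: " + original);
        System.out.println("parsed:   " + parsed);
        // On a JDK without the fix, "parsed" is an hour earlier than "original":
        // the resolver dropped the exact instant and defaulted to the earlier offset.
    }
}
```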

Lessons Learned

There are a few takeaways from this experience. First, always test your edge cases, and DST transitions are definitely an edge case for anything involving time zones. We could have caught this sooner if we had. Second, test your contracts. We found this because we said to ourselves, “The formatter should be able to round trip,” and then we tested that. Third, randomized testing is amazing: it finds edge cases you never thought to look for. Finally, sometimes the library really is broken. Extraordinary claims require extraordinary evidence, but that doesn’t mean they’re wrong.

If this kind of deep dive bug hunting appeals to you, check out our careers page and see if you’d like to come work with us.

___________________________

1 See The Pragmatic Programmer by Andrew Hunt & Dave Thomas, page 120