13 September 2016 Engineering

The tale of caching and why it matters…

By Simon Willnauer and Shay Banon

This is the first post in a three-part series about Instant Aggregations. See how the story progresses in The Great Query Refactoring: Thou Shalt Only Parse Once and wraps up in Instant Aggregations: Rewriting Queries for Fun and Profit. Enjoy the trilogy!

It’s early 2013 on an unusually sunny day in Amsterdam, and a group of people are meeting around table soccer and ping pong tables for what we call a company all-hands. Just recently Rashid Khan, one of the big characters behind Kibana, joined Elastic, and we are still just a handful of engineers. I’m hacking around trying to get checksums to work for recovery, listening to a conversation between Shay and Rashid. It’s Kibana’s initial dashboard slowness that causes this intense conversation. Even though Kibana fires up almost identical searches each time you open the home page, Elasticsearch has to recompute everything from scratch. Someone might ask, no caching, eh? True!

A closer look under the hood shows that the searches are almost the same, but are subject to this annoying property of time: it never stands still. If you have used Kibana yourself you might have noticed that the default filter is always based on the current time (NOW), going backwards for a defined time range. In other words, you never fire exactly the same query twice: the request changes every millisecond.
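To make the problem concrete, here is a small Python sketch (illustrative only, not Elasticsearch code) of why a NOW-relative range defeats any cache keyed on the raw request bytes: once `now` is resolved to a concrete timestamp, the serialized request is different on every call.

```python
import json
import time

def kibana_style_query(now_ms):
    """A Kibana-style search over the last 15 minutes, relative to 'now'.
    Resolving 'now' to a concrete timestamp makes the request bytes
    differ on every call, so a byte-level cache key never repeats."""
    return json.dumps({
        "query": {
            "range": {
                "@timestamp": {
                    "gte": now_ms - 15 * 60 * 1000,  # now-15m
                    "lt": now_ms,                    # now
                }
            }
        }
    }, sort_keys=True).encode("utf-8")

key_a = kibana_style_query(int(time.time() * 1000))
time.sleep(0.002)  # two milliseconds later...
key_b = kibana_style_query(int(time.time() * 1000))
assert key_a != key_b  # same dashboard, yet a different cache key
```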

Then the discussion got serious: Rashid and Shay started talking about caching and adding REST-level primitives to control the cache key. Time to stop working on checksums; I gotta get involved! If you are trying to solve one of the hardest problems in computer science and the discussion is heading towards letting the user control it, you are either a really brave engineer or all the other options would require you to be a hell of a brave engineer! The discussion continued for a while and ideas basically went through the roof. You might have experienced this in your day-to-day job before. Luckily, we had so many other problems to solve at that time that we just dropped the ball on it for a while.

Fast forward: it’s October 1st and my calendar says “The Dudes are in Berlin”, meaning that Shay and a bunch of other team leads were coming into town for some planning sessions. That’s usually an intense time in a distributed company like Elastic, since we don’t meet in person more than twice per year. After three days of discussions, brainstorming and arguing, Shay and I went out for Schnitzel to this awesome Austrian place near my house. Honestly, neither Shay nor I was really up for any more discussions, but suddenly the caching thing came up again. I don’t blame anybody; I’m not sure who opened that particular can of worms.

Anyway, this time we came up with a plan! Admittedly not low-hanging fruit, but something that could actually work well, is fully transparent, easy to test, and can be disabled if it’s not working. You noticed that escape hatch, didn’t you? Caching is hard, but let me explain what we had in mind. Bear with me, I’m going to take a big swing:

Elasticsearch is based on Apache Lucene™, which is built around a point-in-time view of an index. Such a snapshot is basically a set of write-once index segments, each holding a subset of the indexed documents. The software construct we use to represent such a snapshot is what we call a “top-level” IndexReader. We know that, unless the top-level reader changes, queries are idempotent or, in other words, cacheable. In Elasticsearch there is exactly one Lucene index per shard, so we can simplify things to use one top-level reader per shard.

Now, if we could identify the outer bounds of an index for any date field, we could also make much better decisions: if, for instance, all documents or no documents at all would match a certain filter, we could rewrite the query to match-all or match-no-docs respectively. If we could manage that, then we could put queries that appear to be un-cacheable into the request cache we had added basically just before that Schnitzel brainstorming session. The request cache uses Lucene’s top-level reader as well as the search request’s binary representation (the plain, unmodified JSON, YAML or CBOR bytes) as a combined cache key. This allowed us to minimize the space requirement while also minimizing the number of objects needed to represent a cache entry.
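The rewrite idea can be sketched in a few lines. This is a simplification in Python, not Elasticsearch’s actual implementation: given the minimum and maximum values a shard actually holds for a date field, a half-open range query can sometimes be replaced by a shard-level match-all or match-none, neither of which mentions `now` anymore.

```python
def rewrite_range(query_from, query_to, shard_min, shard_max):
    """Sketch of the rewrite decision for a [query_from, query_to)
    range against a shard whose date field spans [shard_min, shard_max]."""
    if query_to <= shard_min or query_from > shard_max:
        return "match_none"   # no document on this shard can match
    if query_from <= shard_min and query_to > shard_max:
        return "match_all"    # every document on this shard matches
    return "range"            # partial overlap: keep the original range

# A shard holding one day of logs: 2016-09-12T00:00:00Z .. 23:59:59.999Z
shard_min, shard_max = 1473638400000, 1473724799999

# A 'last 7 days' query issued on 2016-09-13 fully covers that day:
assert rewrite_range(1473206400000, 1473811200000,
                     shard_min, shard_max) == "match_all"
# A query for an older week misses the shard entirely:
assert rewrite_range(1472000000000, 1473000000000,
                     shard_min, shard_max) == "match_none"
```

The payoff is that the rewritten form is stable: a match-all against an unchanged top-level reader produces the same result every time, which is exactly what the request cache needs.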

Back to the idea: its main driver was a new utility in Lucene that allows us to fetch the upper and lower bounds of a field from a top-level reader. With this information we could do anything you could possibly imagine with a query that has time properties. You can imagine that when two engineers get excited over Schnitzel and caching is involved, it ain’t gonna end well.

With all that excitement I went and asked Mike McCandless to give the idea a shot. Mike pulled off a prototype quite quickly. As usual, after all that excitement we had to face reality, but the basic idea worked, YAY! Well, reality was that the prototype worked, but it was far from anything that could go into production any time soon. We had to realize that to effectively modify queries in a generic and safe way (both are good properties to have if you want to use something as a cache key) we needed an intermediate representation for our entire request constructs.

At this point we had no way to parse the request, modify it, and write it back out into a byte representation that could be used as a cache key. This was kind of a downer for all of us, since we had to face the fact that each of these ~70 query, aggregation, suggest, highlight, sort, etc. classes had to be refactored in order to read, normalize, modify and write back its values. What could possibly go wrong?
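Why an intermediate representation matters can be shown with a tiny sketch (again illustrative Python, not the actual refactoring): parse the raw request into an object tree, normalize it, and serialize it back into one canonical byte form. Two requests that differ only in whitespace or key order then produce the same cache key, and a rewritten query can be serialized the same way.

```python
import json

def canonical_key(raw_request: bytes) -> bytes:
    """Parse -> normalize -> re-serialize. The canonical bytes, not the
    user's raw bytes, become the cache key, so semantically identical
    requests collapse onto one entry."""
    parsed = json.loads(raw_request)  # bytes -> object tree
    return json.dumps(parsed, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")

a = b'{"query": {"match_all": {}}, "size": 0}'
b = b'{"size":0,"query":{"match_all":{}}}'
assert a != b                                 # raw bytes differ...
assert canonical_key(a) == canonical_key(b)   # ...canonical form does not
```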

Is this worth the trouble?

After that prototype and the reality check that our code wasn’t ready, we needed to talk about the exciting possibilities again. When you look at your code and realize you basically have to invest six months’ worth of engineering time just to make the first step towards a new feature, you think twice about whether it’s worth it.

That said, with the immense growth of time-based data and the way that data is structured on a macro level, it became obvious why it would make sense to invest in a solution done the right way. A typical logging installation might have daily indices spanning a period of a week or a month. Searches typically span all indices in that time range but, except for the current daily index, all the other indices are static: they no longer receive any changes and therefore maintain the same point-in-time snapshot. This means we’d get 100% cache hits on those shards if queries covered each index’s entire day. In other words, we could reduce the search workload for these machines dramatically, and Kibana dashboards would appear instantly.
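Putting the pieces together for the daily-indices case, here is a hypothetical sketch (Python, illustrative only): every fully covered daily index can be rewritten to match-all, whose cache key no longer depends on `now`, so repeated dashboard queries hit the cache there; only the live index still does real work.

```python
DAY = 86_400_000  # milliseconds per day

def shard_rewrites(query_from, query_to, daily_index_starts):
    """For each daily index, decide whether a [from, to) time-range
    query can be rewritten. 'match_all' on a static index means every
    repeat of the dashboard query is a cache hit on that shard."""
    decisions = {}
    for start in daily_index_starts:
        end = start + DAY
        if query_from <= start and query_to >= end:
            decisions[start] = "match_all"   # fully covered, cacheable
        elif query_to <= start or query_from >= end:
            decisions[start] = "match_none"  # out of range entirely
        else:
            decisions[start] = "range"       # partial: still computed
    return decisions

# Five daily indices; 'now' falls in the middle of the last day.
starts = [d * DAY for d in range(5)]
now = 4 * DAY + DAY // 2
result = shard_rewrites(now - 3 * DAY, now, starts)
assert result[starts[4]] == "range"      # the live index keeps the range
assert result[starts[2]] == "match_all"  # older days are fully covered
assert result[starts[0]] == "match_none" # too old, out of the window
```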

Well, these were convincing arguments, so now it was time to get things started. In March 2015 we started prototyping how we could represent queries and write them back out. Nobody knew that it would take another 12 months for all the needed parts to come together, and yet another 6 months for this feature to land in an alpha release. I won’t tell the rest of the story, because I want to leave that to the hardworking folks who implemented all these changes. So, stay tuned for the upcoming articles in this series.