21 janvier 2016

Elasticsearch Queries, or Term Queries are Really Fast!

I remember a couple years ago I was at a convention talking to folks about Elasticsearch or Lucene and one of them said something like "Everyone knows that if you really want to search fast you need to just use term queries." He was certainly right about term queries being fast but I really wanted to run some numbers and see just how right he was. And that felt like a good time to write a blog post to help with the "everyone knows" part.

Term queries are really fast

Term queries are pretty fast! On my development desktop with a single spinning disk I get 8,000 searches containing a single term query a second. Once you start adding multiple terms to the query it gets faster if they are ANDed together and slower if they are ORed together. This is because OR queries find more hits and scoring and counting those hits takes more time.

We can control for this using terminate_after. Term Queries: terminate_after If terminate_after is set then each shard stops searching as soon as it has hit that many results. If the document that would have had the best score wasn't hit because the shard terminated early then it's just not returned. If you are sorting by something other than score this is still a problem because the documents aren't encountered in a useful order. It also breaks the total hit count because each shard stops counting after terminate_after hits. On the other hand it really improves the performance of unselective queries so it's worth thinking about even though it'll only be appropriate for somewhat niche use cases. Values of terminate_after less than 10,000 make ORs faster than ANDs and values greater than 10,000 make ANDs faster than ORs. In my dataset. With my two term queries. This is because ORs accept every candidate document so they have to find fewer candidates.

Testing methodology

Now a note about testing methodology: this isn't super scientific. I'm using a rather old dump of English Wikipedia's search index for its content namespaces (available here, the file looks like enwiki-$date-cirrussearch-content.json.gz). How to load your own copy would be a great topic for another blog post.... Anyway! I'm using wrk for generating the actual load which uses a Lua script to generate the queries. The terms I'm using for the queries come from /usr/share/dict/american-english. I'm running both wrk and Elasticsearch on the same machine. I'm using a single shard and all the default arguments for Elasticsearch 2.1.1 including its 1GB heap. So I have about 30GB of free memory for Linux to use for the page cache. Every query I run has size=0 so I don't spend time loading the _source, highlighting, or any of that stuff that a real search has to do. After loading the index I _optimized it which is totally cheating. A real production system can't optimize an index that is always changing and Wikipedia is always changing. But the index is 124GB and my development desktop's SSD only has 101GB of space total so I'm having to use my spinning drive, a 1TB WD Blue spinning at 7200RPM. If I don't optimize I very quickly saturate the poor disk and my tests just show that I need to buy a better disk. Which doesn't make for a great blog post. I want to talk about the CPU overhead of these queries and optimize does a good job of making that possible. But take everything here with a grain or two of salt. Benchmark for yourself. My goal is to put you in the right ballpark.

I use query_string for all of my queries because it's convenient. I don't believe you should use query_string for queries sent by users because it's too powerful and too brittle but it is super convenient for my lua query generator to just generate a string query rather than worry about generating JSON.

Various query types

All being said, have a look at the graph above which compares some different types of queries. A couple of things: first and foremost I've shifted this graph to log scale. Sorry! Without doing that the fast queries just overwhelm the slow queries and you can't see anything. The horizontal axis is how frequently the term comes from a list of "common" words. All queries contain just two words. This doesn't use terminate_after. Finally, I'm using a home grown notation for the query types. Its short and super readable if you are me but in case you aren't:

AND means ANDing term queries together.
OR means ORing term queries together.
AND*$p means ANDing prefix queries where the length is $p.
AND~$f means ANDing fuzzy queries with fuzziness $f.
PHRASE means terms are wrapped in a phrase query.

A couple things jump out at me:

As the terms get more and more common the whole thing gets slower and slower. This is caused by the Elasticsearch having to do more work to either hit the documents or reject them.
Phrase queries are really fast when all the terms are uncommon - almost as fast as just ANDing the terms together. But they suffer quite a bit as the parts of the phrase get more and more common. This shouldn't be shocking: phrase queries only visit the term positions if the document contains all the terms so if the terms are rare they have to visit very few documents. When the document does have all the terms they have walk the positions lists of each of the terms in parallel which requires quite a bit of work, including reading those lists from disk. If you are trying at home to estimate how expensive phrase queries will be I'd err closer to the right hand side of the graph. Real people are more likely to do phrase queries for things that might form a phrase. My test script is as likely to search for "blighted icepicks" as it is to search for "good job". Do your own benchmarks with your own data and your own query logs if possible.
Other than unselective phrase queries there really is no competition for ANDing terms together. Prefix queries of at least 5 terms is about an order of magnitude slower and everything else is slower still.
Fuzzy queries with fuzziness=AUTO are surprisingly zippy. I don't know why they become progressively slow as their terms get more and more common. fuzziness=2 is devastatingly slow, potentially because English has so many short words and applying a fuzziness of 2 to a 3 letter word finds tons of terms. Its worth mentioning that I run these queries with all the default settings and that the rewrite parameter might affect their their run times.
Prefix queries are really dependent on the length of the prefix. Obviously short prefixes find more documents, but they have more overhead in general because they have to walk more of the terms dictionary. Prefix queries where the prefix is just a single letter are so slow I had to leave them out of the graph. I have yet to measure it but anything that allows a leading wildcard is going to be pretty nasty as well! I have yet to test it but it stands to reason that the actual prefix matters as well. More common prefixes will generate more terms to lookup and more hits, taking more time.

`terminate_after` makes them faster

Above is the same chart of different query types, this time with terminate_after set to 10000. I picked 10000 because its pretty that inflection point between AND and OR which just seemed fun. I also kept the same range on the horizontal axis. Here is what I noticed:

Performance is better all over the place. Not in every query but in a ton of them.
As expected OR ties with AND 100% uncommon terms but is faster than AND for more common terms. This is because it has to do less work to hit the 10,000 results.
Phrase queries suffer the most as the frequency of common terms goes up. They spend a ton of effort rejecting documents so they get the least benefit from terminate_after because they are very selective.

Configured queries

You'll notice that I ignored phrase slop when talking about phrase queries. Its because the slop value doesn't really matter. A slop of 0 is slightly faster than a slop of 1 but beyond that all slops are just as fast as one another.

i also skipped over fuzziness of 1 in the graphs above and only had AUTO and 2. Its not super clear to me why, but 1 has the same performance as AUTO.

As expected the longer the prefix on the prefix query the faster it finishes. Here is as good a place as any to describe how I build these prefix queries because its important that you know: I use the random word picker to get a word, slice it down to the prefix length or smaller if it was fewer code points than the prefix length, and then I make a prefix query out of it. This mixes in shorter prefix queries with longer ones which artificially depresses the scores, especially closer to the right hand side of these graphs which have more common, shorter words. For a while I was using the word picker over and over again until I got a long enough word. This artificially increased the scores of long prefixes because they found far fewer documents than the term queries I was comparing them against because the term queries contained more shorter, more common words. The moral of this story is that graphs hide lots of complexity.

Conclusion

So my advice:

Query the terms you want directly if possible. That means to use queries like match and term instead of match_phrase and prefix.
Prefer AND to OR if you can get away with it.
Use fuzzy queries if you must but prefer fuzziness=AUTO over other fuzziness values. fuzziness=2 is super slow.
Prefix and wildcard style queries are OK if the prefix is long. They get progressively worse as the prefix gets shorter. 5 characters is about the same as fuzziness=AUTO, 3 as fuzziness=1, and 1 is just plain horrible.
Stay away from phrase queries if at all possible. Their worst case performance is the worst.
If you must use phrase queries then use whatever phrase slop is appropriate for your use case. A slop of 0 is only marginally faster than any other slop so its not worth worrying about it.

It's worth thinking about whether terminate_after is appropriate for your use case. While it provides a nice boost it does so at the cost of not necessarily returning the best result and not getting an exact count of results. It stops looking at the results on each shard once it hits the limit so it's quite possible that the highest scoring result is won't have been inspected. Its probably limited to the niche use cases where you can tell users "Sorry but your query was too general so we aren't sure that we found the best results for it. Please make it more specific by adding more terms."

There are other options for limiting the performance impact of slow queries. Wrapping phrase queries in a rescore is quite common and probably worth yet another blog post with more pretty graphs. Another option is to use the shingle token filter to create a query with phrase-ish characteristics. You can also limit the number of hits when there are many OR clauses using minimum_should_match, common_terms_query, and stop words which are all knobs worth tweaking that I simply ran out of time to cover in this post.