May 23, 2017 Engineering

In which order are my Elasticsearch queries/filters executed?

By Adrien Grand

We often get questions about the order in which filters are executed, whether filters get executed before or after queries, etc. Those are indeed important questions: the recipe for quickly executing a query is often related to running the cheap bits before the expensive ones. You might have heard or read in the past that filters are executed before queries. While this statement is a good way to convey the fact that we do not compute the score on documents that do not match filters, the truth is a bit more complicated. Also the phrasing of the question suggests that we iterate over all documents and check whether queries/filters match one by one, while again the truth is more subtle: our index structures can help efficiently skip documents that can't possibly match.

How does a query/filter run by the way?

I can't give you insights into the order in which queries get executed before explaining how query execution works. Queries and filters expose the following operations:

  • cost(): Approximation of how many documents the query/filter matches.
  • docID(): The current doc ID, initialized at -1.
  • advance(target): Find the first document beyond target that might match.
  • nextDoc(): Find the next (in doc id order) doc that might match. This is an optimized version of advance(docID() + 1).
  • matches(): Check whether the current document matches.
  • matchCost(): An estimation of the cost of calling matches.
  • score(): Compute the score of the current document.

This is a bit complex, but there are good reasons for all these operations to exist! The most important thing to notice at that stage is that matching documents happens in two phases. There is first an approximation which allows to efficiently iterate over a superset of the documents that match the query, those are the nextDoc/advance operations. And then, there is a verification phase that aims at verifying whether the current document actually matches with the matches operation. The goal of this design is to make sure we only start running the costly bits after we reached agreement between all approximations, which should be cheap. Also, something interesting to note is that the only difference between a query and a filter is that we never call score() on filters.

To better understand what those operations do, let me give you simple examples:

Term queries

Term queries are the most efficient queries that Elasticsearch supports: their matches are pre-computed in the inverted index structure.

  • cost(): Return the document frequency, which is encoded in the inverted index.
  • advance(target): Skip to the first match that is greater than or equal to target, using skip lists.
  • nextDoc(): Read the next entry from the postings list.
  • matches(): Always returns true.
  • matchCost(): Return 0.
  • score(): Compute the score of the current document based on index stats.

Disjunctions (a OR b OR ...)

  • cost(): Return the sum of the costs of sub clauses.
  • advance(target): Call advance(target) on all sub clauses and return the minimum of the results.
  • nextDoc(): Call nextDoc() on all sub clauses and return the minimum of the results.
  • matches(): Iterate over sub clauses that are positioned over the current doc id, and return true as soon as one of them returns true for matches(). Return false if none of them matches.
  • matchCost(): Return the sum of the match costs of sub clauses, weighted by their cost (because we are more likely to call matches() on clauses that match many documents than on clauses that match a couple documents).
  • score(): Return the sum of the scores of all sub clauses that match the current document.

Since disjunctions are essentially about merging sorted iterators, we use a heap for efficiency.

Conjunctions (a AND b AND ...)

  • cost(): Return the minimum of the costs of sub clauses.
  • advance(target):
    • Call advance(target) on the clause that has the least cost() value. This returns the next candidate C.
    • Iterate over other clauses in ascending cost() order and call advance(C). If they all return C, then we have a match. Otherwise the returned doc ID gives us a new candidate, that we again need to validate against other clauses. Repeat until a match is found.
  • nextDoc(): Like advance(target) except that we initially call nextDoc() on the clause that has the least cost() value.
  • matches(): Iterate over all clauses in ascending order of matchCost() and return false as soon as one of them does not match. Return true otherwise.
  • matchCost(): Return the sum of the match costs of sub clauses.
  • score(): Return the sum of the scores of all sub clauses that match the current document.

This is sometimes referred to as "leapfrog" given how we alternatively advance clauses until we find common matches.

Phrase queries

Phrase queries are interesting because they are the reason why we got the matches() and matchCost() operations in the first place. Phrase queries are essentially a conjunction, where we perform some additional operations on a per document basis in order to check whether they match or not.

  • cost(): Same as conjunctions.
  • advance(target): Same as conjunctions.
  • nextDoc(): Same as conjunctions.
  • matches(): Iterate over positions for the current document until terms are found at consecutive positions, we do not need to iterate further.
  • matchCost(): The formula is a bit complicated, but in short this is proportional with the average frequency of the terms that are used in the phrase, ie. the total number of times they exist in the index divided by the total number of documents that contain these terms.
  • score(): Keep iterating over positions in order to find the number of times that the phrase occurs in the current document, and use this phrase frequency as a basis for the score computation.

This also gives you some insights into why filters perform better than queries. In this case, not only can filters skip the score computation, but they can also stop iterating over positions as soon as one match is found, they do not need to count them all.

Back to execution order

If you read back the description of how conjunctions run, execution order for nextDoc() and advance(target) is decided based on cost(), and execution order for matches() is decided based on matchCost().

So if you search for the AND quick AND fox, we will first look at index statistics to find which one of the terms is the rarest, iterate over documents that contain this term and check whether they contain other terms as well.

Now a more complicated example: imagine that you search for "the fox" AND "lazy dog" and terms have the following index statistics:

Term Doc frequency Average term frequency
the 100 24
fox 10 5
lazy 40 3
dog 20 10

"the fox" has a cost of min(100,10)=10 so we will execute its approximation before "lazy dog" which has a cost of min(40, 20)=20. However once we reach agreement between both approximations, ie. documents that contain all 4 terms the, fox, lazy and dog, then we will call matches() on "lazy dog" first, since it has 3+10=13 positions per document in total while "the fox" has 24+5=29 of them. As you can see, there is no simple answer to "which query runs first"!

Conclusion

Hopefully this blog post gave you some insights into how query execution works and how Elasticsearch decides on which bits to execute first. Metadata from the inverted index like term frequencies and doc frequencies are not only useful for scoring, but also for figuring out the optimal execution order. Before I leave you, here are some frequently asked questions that you might find interseting:

  • Q: Does the order in which I put my queries/filters in the query DSL matter?
  • A: No, because they will be automatically reordered anyway based on their respective costs and match costs.
  • Q: Do filters get executed before or after queries?
  • A: Neither, really. Everything is interleaved, regardless of whether they are queries of filters. Conjunctions get executed in a way where the clause with the least cost is used to lead iteration and other clauses are advanced to check whether they match too. However we only compute scores when we verified that all clauses match.
  • Q: How can I check which query/filter got executed first?
  • A: We don't really expose this information, which is very internal. However if you check the output of the profile API, you can count how many times nextDoc/advance have been called on the one hand, and matches on the other hand. Query nodes that have the higher counts have been run first.