Optimizing sort queries in Elasticsearch for faster results

blog-thumb-search-charts.png

It is very common in Elasticsearch to request results to be sorted by a certain field. We have invested a great deal of time and effort to optimize sort queries to make them faster for our users. This blog will describe some of the sort optimizations for numeric and date fields.

How sorted queries work

When you want to find documents matching a filter and request results to be sorted by a certain field, Elasticsearch would examine this field’s doc values for all documents matched by the filter and choose the top K documents with the top values. In the worst case of a very broad filter (e.g. match_all query), we have to examine and compare doc values for all documents in an index. For big indices, this may take a substantial amount of time.

One way to optimize sort queries on a particular field is to use index sorting and sort the whole index in this field. If the index is sorted by a field, its doc values are also sorted. So to get top K documents sorted by a field, we simply need to take the first K documents, and don’t even need to examine the rest, which makes sorted queries super fast.

Doc values for a field <X> (left) and the same doc values for the field<X> in an index sorted by the field <X> (right).

Doc values for a field (left) and the same doc values for the field in an index sorted by the field (right).

Index sorting is a great solution, but you can do index sorting only in a single way. Index sorting is not useful for sort queries that use different sort criteria, such as descending vs. ascending, different fields, or different combinations from the ones defined in an index sorting definition. So other more flexible approaches to speed up sort queries were needed.

Optimizing numeric sort queries with distance_feature query

In the past, we made significant speedups in term-based queries sorted by _score by storing in for each block of documents its maximum impact – a combination of term frequency and document length. During query time we can quickly judge if a block of documents is competitive by looking at its maximum impact. If a block is not competitive, we can skip this whole block of documents, which makes queries significantly faster.

We thought we could apply a similar approach to speed up sort queries on numeric or date fields. It turned out to be possible by substituting sort with a distance_feature query. distance_feature query is an interesting query that returns top K documents closest to a given origin. If we use the minimum value for the field as the origin, we get top K documents sorted by the ascending order. Using the maximum value as the origin will yield us top K documents in the descending order.

The most interesting property of the distance_feature query for us was that it can efficiently skip non-competitive blocks of documents. It does this by relying on the properties of BKD trees which are used in Elasticsearch for indexing numeric and date fields. Similar to how a postings index for a text field is divided into blocks of documents, a BKD index is divided into cells with each cell knowing its min and max values. Thus, a distance_feature query just by examining cells’ min and max values can efficiently skip non-competitive cells of documents. For this sort optimization to work, a numeric or date field needs to be both indexed and have doc values.

By substituting sorting on doc values with a distance_feature query we could achieve great speedups (on some datasets up to 35x gains). We introduced this sort optimization on date and long fields in Elasticsearch 7.6.

Optimizing sort queries with search_after

We were happy to see these speedups, but we still did not have a good solution for sorting with a search_after parameter. Sorting with search_after is very common, as users are often interested not only in the first page of results, but also in subsequent pages. We have decided that instead of our current approach of rewriting sort queries in Elasticsearch, a better solution would be to make comparators and collectors within Lucene do this sort optimization and skip non-competitive documents. As comparators and collectors in Lucene already deal with search_after, this would allow us to have a solution for this problem as well. The same Elasticsearch code that distance_feature query was using to skip blocks of non-competitive documents was added to Lucene’s numeric comparators.

We have introduced this sort optimization with search_after parameter in Elasticsearch 7.16. We immediately saw great speedups (up to 10x) on some of our nightly performance benchmarks:

We're hiring

Work for a global, distributed team where finding someone like you is just a Zoom meeting away. Flexible work with impact? Development opportunities from the start?