10. Februar 2013

Understanding "Query Then Fetch" vs "DFS Query Then Fetch"

In our last article on starts-with phrase matching, we ran into a situation where the returned scores were suspicious. As a refresher, here is the query in question:

$ curl -XGET localhost:9200/startswith/test/_search?pretty -d '{
        "query": {
        "match_phrase_prefix": {
           "title": {
             "query": "d",
             "max_expansions": 5
           }
         }
       }
     }' | grep title

      "_score" : 1.0, "_source" : {"title":"drunk"}
      "_score" : 0.30685282, "_source" : {"title":"dzone"}
      "_score" : 0.30685282, "_source" : {"title":"data"}
      "_score" : 0.30685282, "_source" : {"title":"drive"}

See how the document “drunk” receives a score of 1.0, while the rest have a score of 0.3? Shouldn’t these docs all have the same score, since they match the query for “d” equally the same? The answer is yes, but there is a very good reason for this scoring discrepancy.

Relevancy Scoring

Part of the scoring algorithm used by Elasticsearch (and Lucene underneath) includes “Term Frequency – Inverse Document Frequency” statistics to help calculate relevancy of documents in the index.

A lot has been written on the subject of TF-IDF, but it basically says “the more a term appears in a document, the more relevant this document is. But the relevancy is dampened by how often the term appears in the entire index”.

Rare terms are only present in a few documents, which means any query matching a rare term becomes highly relevant. Conversely, common terms are found everywhere, so their relevancy to the query is low.

Elasticsearch faces an interesting dilemma when you execute a search. Your query needs to find all the relevant documents…but these documents are scattered around any number of shards in your cluster.

Each shard is basically a Lucene index, which maintains its own TF and DF statistics. A shard only knows how many times “pineapple” appears within the shard, not the entire cluster.

But the relevancy algorithm uses TF-IDF…doesn’t it need to know how the TF and DF for the entire index, not for each shard?

Default search type: Query Then Fetch

The answer is yes and no. By default, Elasticsearch will use a search type called “Query Then Fetch“. The way it works is as follows:

Send the query to each shard
Find all matching documents and calculate scores using local Term/Document Frequencies
Build a priority queue of results (sort, pagination with from/to, etc)
Return metadata about the results to requesting node. Note, the actual document is not sent yet, just the scores
Scores from all the shards are merged and sorted on the requesting node, docs are selected according to query criteria
Finally, the actual docs are retrieved from individual shards where they reside.
Results are returned to the client

This system usually works fine. In most cases, your index has “enough” documents to smooth out the Term/Document frequency statistics. So while each shard may not have complete knowledge of frequencies across the cluster, results are “good enough” because the frequencies are fairly similar everywhere.

But in the case of our query mentioned at the beginning of this article, the default search type sometimes fails.

DFS Query Then Fetch

In the last article, we built an index without specifying shard count – ElasticSearch used the default of 5 shards. We then inserted a measly five documents into the index and demanded ES return relevant results and accurate scores. Not quite fair, is it?

The scoring discrepancies were caused by the Query Then Fetch search type. Each shard only contained 1 or 2 documents (the hashing algorithm used by ES ensures a relatively random distribution). When we asked Elastic to compute scores, each shard only had a tiny view of the five-doc index…so scores were inaccurate.

Luckily, Elasticsearch doesn’t leave you out to dry. If you have a situation where this scoring discrepancy is problematic, ES provides a search type called “DFS Query Then Fetch”. The procedures is almost identical to Query then Fetch, except it performs a pre-query to calculate global document frequencies.

Prequery each shard asking about Term and Document frequencies
Send the query to each shard
Find all matching documents and calculate scores using global Term/Document Frequencies calculated from the prequery.
Build a priority queue of results (sort, pagination with from/to, etc)
Return metadata about the results to requesting node. Note, the actual document is not sent yet, just the scores
Scores from all the shards are merged and sorted on the requesting node, docs are selected according to query criteria
Finally, the actual docs are retrieved from individual shards where they reside.
Results are returned to the client

If we apply this new search type to our previous query, we get scoring results that make sense (e.g. they are all identical):

$ curl -XGET 'localhost:9200/startswith/test/_search?pretty=true&search_type=dfs_query_then_fetch' -d '{
        "query": {
        "match_phrase_prefix": {
           "title": {
             "query": "d",
             "max_expansions": 5
           }
         }
       }
     }' | grep title

      "_score" : 1.9162908, "_source" : {"title":"dzone"}
      "_score" : 1.9162908, "_source" : {"title":"data"}
      "_score" : 1.9162908, "_source" : {"title":"drunk"}
      "_score" : 1.9162908, "_source" : {"title":"drive"}

Conclusion

Of course, better accuracy doesn’t come for free. The prequery causes an extra round-trip between the shards, which could cause a performance hit depending on size of the index, number of shards, query rate, etc etc. And in most cases, it is totally unnecessary…having “enough” data solves the problem for you.

But sometimes you’ll run into strange scoring situations, and in those cases, it’s useful to know how to tweak the search execution plan with DFS Query then Fetch.