April 23, 2014

Multi-Field Search Just Got Better

The match query is the go-to query for matching on a single field. It understands the field mapping and uses the appropriate analyzer for the field. It can match any word or require all words (with operator set to "or" or "and"), or it can match a minimum number or percentage of words with minimum_should_match. It can also do fuzzy matching and phrase or proximity matching. In short, it is very flexible and very powerful.
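To make the operator and minimum_should_match options concrete, here is a toy model of how many query words a field must contain in order to match. It is only a sketch: the function names are made up for this post, analysis is reduced to lowercasing and splitting on whitespace, and only positive integers and positive percentages (rounded down) are handled.

```python
def required_matches(num_terms, operator="or", minimum_should_match=None):
    """Return how many query terms must be present for a field to match."""
    if operator == "and":
        return num_terms
    if minimum_should_match is None:
        return 1
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        # A percentage is converted to a count and rounded down.
        return max(1, (num_terms * percent) // 100)
    return min(num_terms, int(minimum_should_match))


def field_matches(query, field_text, operator="or", minimum_should_match=None):
    """Toy stand-in for a match query against one field."""
    # Simplified "analysis": lowercase and split on whitespace.
    query_terms = query.lower().split()
    field_terms = set(field_text.lower().split())
    found = sum(1 for term in query_terms if term in field_terms)
    needed = required_matches(len(query_terms), operator, minimum_should_match)
    return found >= needed
```

With the field text "The quick red fox", the query "quick brown fox" matches with the default "or" operator and with a minimum_should_match of "75%" (two of three words), but not with the "and" operator.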

Multi-field search, on the other hand, is hard. Elasticsearch provides the multi_match query, which makes multi-field search look simple:

{
    "multi_match": {
        "query":    "quick brown fox",
        "fields": [ "title", "body" ]
    }
}

But in reality, it is not as simple as it looks. Unless you understand how the multi_match query works, you will often use it incorrectly and get suboptimal results. Elasticsearch v1.1.0 added some new features to the multi_match query which make multi-field search much more powerful and easier to use.

Types of multi-field search

How you search across multiple fields depends on how your data is indexed and the type of search that you need. There are three main scenarios:

Best matching field

When searching across multiple fields for a single “concept”, you want to look for as many words as possible within the same field. For instance, “brown fox” in a single field is more meaningful than “brown” in one field and “fox” in the other. In other words, you’re looking for the single best matching field.

This type of query can be executed by running a match query against each field, and choosing the relevance _score from the best matching field, using the dis_max query:

{
    "dis_max": {
        "queries": [
            { "match": { "title": "quick brown fox" }},
            { "match": { "body":  "quick brown fox" }}
        ]
    }
}

The multi_match query accepts a type parameter which tells it how to execute the query. The default type is best_fields, which results in exactly the same dis_max query as we have above:

{
    "multi_match": {
        "query":    "quick brown fox",
        "fields": [ "title", "body" ],
        "type":     "best_fields"
    }
}

This query, as written above, will choose the single best matching field, but will ignore other lesser matches. We can still take these secondary matches into account by specifying the tie_breaker parameter:

{
    "multi_match": {
        "query":       "quick brown fox",
        "fields":    [ "title", "body" ],
        "type":        "best_fields",
        "tie_breaker": 0.2
    }
}

The above query will still use the _score from the best matching field, but will also add in the _score from any other matching fields, multiplied by 0.2.
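The effect of tie_breaker on the final _score can be sketched in a few lines. The function below is a hypothetical helper, not part of any client library. It takes the per-field scores that the individual match clauses would have produced and combines them the way the text describes: the best score, plus every other matching score multiplied by the tie breaker.

```python
def dis_max_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores: best score plus tie_breaker times the rest."""
    matching = [score for score in field_scores if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    # The remaining matches contribute only a fraction of their score.
    return best + tie_breaker * (sum(matching) - best)
```

For example, with scores of 2.0 for title and 1.0 for body, the result is 2.0 without a tie breaker and 2.2 with a tie_breaker of 0.2. A tie_breaker of 1.0 simply adds all of the scores together.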

Most matching fields

Often we index the same text with several different analyzers, perhaps as stemmed and unstemmed, with synonyms, with shingles for proximity matching, with edge-ngrams for autocomplete, and so on. In this case, we want to query all of the fields and add up the _score from each match to find the documents with the most matching fields.

We could write such a query by wrapping individual match clauses with a bool query:

{
    "bool": {
        "should": [
            { "match": { "title":          "quick brown fox" }},
            { "match": { "title.stemmed":  "quick brown fox" }},
            { "match": { "title.synonym":  "quick brown fox" }},
            { "match": { "title.shingle":  "quick brown fox" }},
            { "match": { "title.edge_ng":  "quick brown fox" }}
        ]
    }
}

This is the same query that would be executed by the multi_match query when the type parameter is set to most_fields:

{
    "multi_match": {
        "query":    "quick brown fox",
        "fields": [ "title", "title.*" ],
        "type":     "most_fields"
    }
}

You can give extra “weight” to one or more fields by specifying a boost on that field, using the caret (^) syntax:

{
    "multi_match": {
        "query":    "quick brown fox",
        "fields": [ "title^2", "title.*" ],
        "type":     "most_fields"
    }
}

In the above query, the title field is twice as important as the other fields.
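As a rough illustration of how a most_fields query relates to the bool query shown earlier, the sketch below expands a list of field specifications into should clauses and turns the caret syntax into a per-clause boost. The helper names are invented for this example, and it assumes the field names are already fully spelled out: wildcards such as title.* are not expanded, because that would require access to the mapping.

```python
def parse_field(spec):
    """Split 'title^2' into ('title', 2.0); no caret means a boost of 1.0."""
    name, _, boost = spec.partition("^")
    return name, float(boost) if boost else 1.0


def most_fields_as_bool(query, fields):
    """Build a bool/should query with one match clause per field."""
    should = []
    for spec in fields:
        name, boost = parse_field(spec)
        clause = {"query": query}
        if boost != 1.0:
            # Only emit a boost when the caret syntax asked for one.
            clause["boost"] = boost
        should.append({"match": {name: clause}})
    return {"bool": {"should": should}}
```

Calling most_fields_as_bool("quick brown fox", ["title^2", "title.stemmed"]) produces two match clauses, with a boost of 2.0 on the title clause only.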

Cross field matching

Finally, we often need to search for entities whose data is spread across multiple fields, such as when we search for "John Smith" in the first_name and last_name fields of a user object. In this case, we want to find as many individual words as possible in any field. The most_fields approach may appear to be the answer here, but there are several reasons why it will not give good results.

Both best_fields and most_fields are field-centric queries — they match each field separately. This means that:

The operator and minimum_should_match parameters apply to each field, rather than to each word in any field. Requiring both John and Smith with the and operator would match no documents unless both words occur in the same field, which they normally never do.
With the most_fields approach, if the same word appears in multiple fields, it will be counted multiple times, instead of just being counted once.
Term frequencies in each field are different. Imagine we had a user whose name was “Smith Jones”. Smith as a last name is very common, but as a first name it is very uncommon, and rare terms carry more weight. A most_fields query for “Peter Smith” may well return the “Smith Jones” user as the first result, as the high weight of Smith-as-a-first-name trumps all documents with Smith-as-a-last-name.
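The first of these problems is easy to reproduce with a toy model. The sketch below compares a field-centric "and" (all words must appear in one field) with a word-centric "and" (each word may appear in any field). The function names are hypothetical, and analysis is again reduced to lowercasing and splitting on whitespace.

```python
def tokens(text):
    """Simplified analysis: lowercase and split on whitespace."""
    return set(text.lower().split())


def field_centric_and(query, doc, fields):
    """Match only if a single field contains every query word."""
    terms = tokens(query)
    return any(terms <= tokens(doc.get(field, "")) for field in fields)


def term_centric_and(query, doc, fields):
    """Match if every query word appears in at least one of the fields."""
    all_terms = set()
    for field in fields:
        all_terms |= tokens(doc.get(field, ""))
    return tokens(query) <= all_terms
```

For the user {"first_name": "John", "last_name": "Smith"}, the field-centric check for "John Smith" fails, while the word-centric check succeeds.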

One solution to this problem is just to index the data from first_name and last_name into the single field full_name, which we can do automatically with this mapping:

{
    "first_name": { "type": "string", "copy_to": "full_name" },
    "last_name":  { "type": "string", "copy_to": "full_name" },
    "full_name":  { "type": "string"                         }
}

Then we can just query the full_name field with a simple match query. That said, it is often useful to be able to achieve the same thing across multiple fields. Elasticsearch v1.1.0 added the new word-centric cross_fields execution type which allows you to do just that:

{
    "multi_match": {
        "query":    "Peter Smith",
        "fields": [ "first_name", "last_name" ],
        "type":     "cross_fields"
    }
}

The cross_fields approach first analyzes the query string into individual terms, then it looks for each term in any field, much like this:

{
    "bool": {
        "should": [
            { "dis_max": {
                "queries": [
                    { "term": { "first_name": "peter" }},
                    { "term": { "last_name":  "peter" }}
            ]}},
            { "dis_max": {
                "queries": [
                    { "term": { "first_name": "smith" }},
                    { "term": { "last_name":  "smith" }}
            ]}}
        ]
    }
}

The operator and minimum_should_match parameters would work as you expect, as each word is queried (and so can be counted) separately. But this still leaves the problem of term frequencies. In the above query, Smith-as-a-first-name would still score higher than Smith-as-a-last-name.
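The per-word expansion shown above can be generated mechanically. The following sketch assumes the query string has already been analyzed into terms, and it uses must clauses to express the "and" operator. That choice is an illustration of how a per-word requirement could be written, not a description of what Elasticsearch builds internally, and the function name is made up for this post.

```python
def term_centric_query(terms, fields, operator="or"):
    """Build one dis_max per term, each looking for that term in every field."""
    # "and" requires every term somewhere; "or" makes each term optional.
    occur = "must" if operator == "and" else "should"
    clauses = [
        {"dis_max": {"queries": [{"term": {field: term}} for field in fields]}}
        for term in terms
    ]
    return {"bool": {occur: clauses}}
```

term_centric_query(["peter", "smith"], ["first_name", "last_name"]) returns the same structure as the bool query above. Because each word gets its own clause, a requirement such as "both words must match" applies to words rather than to fields.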

In fact, the cross_fields approach doesn’t use dis_max queries. Instead it uses a special blended query which combines the term frequency of Smith-as-a-first-name with the term frequency of Smith-as-a-last-name and uses that value for both fields. In other words, it treats first_name and last_name as if they were one big field.
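To see why blending helps, consider a toy calculation of how much weight a term receives in each field. The sketch below rests on several loudly stated assumptions: it uses the classic formula 1 + ln(numDocs / (docFreq + 1)) as the weight, it assumes purely for illustration that blending means using the highest document frequency across the fields, and the document counts in the usage note are invented. It is not the actual implementation of the blended query.

```python
import math


def idf(doc_freq, num_docs):
    """Classic inverse document frequency: rarer terms get a higher weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def per_field_idf(doc_freqs, num_docs):
    """Field-centric weighting: each field uses its own document frequency."""
    return {field: idf(df, num_docs) for field, df in doc_freqs.items()}


def blended_idf(doc_freqs, num_docs):
    """Assumed blending: every field shares the highest document frequency."""
    shared = idf(max(doc_freqs.values()), num_docs)
    return {field: shared for field in doc_freqs}
```

Suppose that, out of 1,000 users, "smith" appears in the first_name of 2 and in the last_name of 200. With per-field weights, the first_name match scores much higher than the last_name match. With the blended weights, both fields give "smith" the same weight, so the unusual first name no longer dominates.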

It has certain advantages over the one-big-field approach:

It is a search-time solution, so nothing has to be set up at index time.
The index will be smaller without the copy_to field.
Individual fields can be boosted, which can’t be done with the copy_to field.
Each field preserves its own length-norm, which gives more weight to shorter fields like the title field.

Note about analysis

All fields used in a cross_fields query should use the same analyzer so that they all produce the same list of query terms. If fields with different analyzers are queried, then they will be grouped together by analyzer. Each group will be queried with the cross_fields approach, then the scores from all groups will be combined with a bool query.

Alternatively, you can force the same analyzer across all fields by specifying an analyzer in the query:

{
    "multi_match": {
        "query":    "Quick brown fox",
        "fields": [ "title", "body" ],
        "type":     "cross_fields",
        "analyzer": "standard"
    }
}
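The grouping of fields by analyzer described in this section can be pictured with a small sketch. It assumes a hypothetical dictionary that maps each field name to the name of its analyzer, and the function itself is invented for illustration. Passing a forced analyzer, as the analyzer parameter does in the query, collapses everything into a single group.

```python
from collections import defaultdict


def group_fields_by_analyzer(field_analyzers, forced_analyzer=None):
    """Group field names by analyzer; a forced analyzer yields one group."""
    groups = defaultdict(list)
    for field, analyzer in field_analyzers.items():
        # A forced analyzer overrides whatever the field would normally use.
        groups[forced_analyzer or analyzer].append(field)
    return dict(groups)
```

Given first_name and last_name with the standard analyzer and title with the english analyzer, the function returns two groups. With forced_analyzer="standard" it returns a single group containing all three fields, which is the situation where the word-centric approach applies to every field at once.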

Conclusion

The cross_fields feature is a really important addition to Elasticsearch. It adds functionality that was impossible to replicate on the client side. You can read more about this topic in the Multi-field search chapter in our upcoming book: The Definitive Guide to Elasticsearch.