May 31, 2016 Technical

Lost in Translation: Boolean Operations and Filters in the Bool Query

By Tyler Fontaine

With Elasticsearch 5.0 on the horizon, a number of query types deprecated in 2.x will be removed. Many of them are replaced by functionality of the bool query, so here’s a quick guide on how to move away from filtered queries and from the and, or, and not queries, along with a general look at how to express boolean logic with the bool query.

For the examples used in this article, let's assume this scenario: A school surveyed its students on their preferences for fruit snacks. Now you're putting together an application so administrators can view this data. Because of the many grades, categories, and other features of the data, you may find you have some arbitrarily complex boolean logic to deal with.

Matching Boolean Operations with the Bool Query Fields

Let's get to the heart of these boolean operations and how they'd look without the and, or, not queries. In the bool query, we have the following fields:

  • must
  • must_not
  • should
  • filter

Must is analogous to the boolean AND, must_not is analogous to the boolean NOT, and should is roughly equivalent to the boolean OR. Note that should isn't exactly like a boolean OR, but we can use it to that effect. And we’ll take a look at filter later on.

Boolean AND and NOT are easy, so let's look at those two first. If you want documents where preference_1 = Apples AND preference_2 = Bananas, the bool query looks like this:

{
  "query" : {
    "bool" : {
      "must": [{
          "match": {
              "preference_1": "Apples"
          }
      }, {
          "match": {
              "preference_2": "Bananas"
          }
      }]
    }
  }
}

If you want documents where preference_1 != Apples:

{
  "query" : {
    "bool" : {
      "must_not": {
          "match": {
              "preference_1": "Apples"
          }
      }
    }
  }
}

But what about OR? That's where the should parameter comes in. If you want the set of documents where preference_1 = Apples OR preference_1 = Raspberries:

{
  "query" : {
    "bool" : {
      "should": [{
          "match": {
              "preference_1": "Apples"
          }
      }, {
          "match": {
              "preference_1": "Raspberries"
          }
      }]
    }
  }
}
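
For requests that still use the deprecated and, or, and not clauses, the three mappings above can be applied mechanically. The following is a minimal Python sketch, not an official migration tool: the helper name to_bool is hypothetical, and it assumes the old clauses arrive either as a bare list or wrapped in "filters" (for and, or) and either bare or wrapped in "filter" (for not). Check those shapes against the requests your application actually sends.

```python
import json


def to_bool(clause):
    """Recursively rewrite deprecated and/or/not clauses as bool clauses."""
    if not isinstance(clause, dict) or len(clause) != 1:
        return clause
    name, body = next(iter(clause.items()))
    if name in ("and", "or"):
        # Assumed shapes: a bare list, or {"filters": [...]} with extra options.
        children = body["filters"] if isinstance(body, dict) else body
        field = "must" if name == "and" else "should"
        return {"bool": {field: [to_bool(child) for child in children]}}
    if name == "not":
        # Assumed shapes: {"not": {...}} or {"not": {"filter": {...}}}.
        inner = body.get("filter", body)
        return {"bool": {"must_not": [to_bool(inner)]}}
    return clause  # leaf clauses such as match or term are left untouched


old_style = {
    "or": [
        {"and": [
            {"match": {"preference_1": "Apples"}},
            {"match": {"preference_2": "Bananas"}},
        ]},
        {"not": {"match": {"preference_1": "Raspberries"}}},
    ]
}

new_style = to_bool(old_style)
print(json.dumps({"query": new_style}, indent=2))
```

Each old clause becomes its own bool query, so a converted or keeps its "at least one must match" behavior because its should clauses never share a bool with must or filter clauses.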

These can all be combined into much more complex boolean logic, because bool queries nest easily. Let's look at this boolean logic: (preference_1 = Apples AND preference_2 = Bananas) OR (preference_1 = Apples AND preference_2 = Cherries) OR preference_1 = Grapefruit

In this case, you are searching for documents where preference_1 is Apples and preference_2 is either Bananas or Cherries, OR where preference_1 is Grapefruit, regardless of what preference_2 equals.

That logic translates into a query that looks like this:

{
    "query": {
        "bool": {
            "should": [{
                "bool": {
                    "must": [{
                        "match": {
                            "preference_1": "Apples"
                        }
                    }, {
                        "match": {
                            "preference_2": "Bananas"
                        }
                    }]
                }
            }, {
                "bool": {
                    "must": [{
                        "match": {
                            "preference_1": "Apples"
                        }
                    }, {
                        "match": {
                            "preference_2": "Cherries"
                        }
                    }]
                }
            }, {
                "match": {
                    "preference_1": "Grapefruit"
                }
            }]
        }
    }
}

To break this down a bit further:

Note that the whole query is wrapped in a should, which provides the OR between the three clauses, and each compound clause is its own nested bool query.

So the (preference_1 = Apples AND preference_2 = Bananas) piece is:

{
    "bool": {
        "must": [{
            "match": {
                "preference_1": "Apples"
            }
        }, {
            "match": {
                "preference_2": "Bananas"
            }
        }]
    }
}

And because this is all wrapped in a should, the next clause in the chain is joined with an OR.

So the (preference_1 = Apples AND preference_2 = Cherries) piece is another bool:

{
    "bool": {
        "must": [{
            "match": {
                "preference_1": "Apples"
            }
        }, {
            "match": {
                "preference_2": "Cherries"
            }
        }]
    }
}

And then finally the single term:

{
    "match": {
        "preference_1": "Grapefruit"
    }
}
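
The same decomposition can be sketched in Python, building each piece separately and then chaining the pieces together. The helper names match, all_of, and any_of are hypothetical and exist only for this illustration; they return plain dictionaries that serialize to the JSON shown above.

```python
import json


def match(field, value):
    """A single match clause."""
    return {"match": {field: value}}


def all_of(*clauses):
    """Boolean AND: every clause goes into must."""
    return {"bool": {"must": list(clauses)}}


def any_of(*clauses):
    """Boolean OR: every clause goes into should."""
    return {"bool": {"should": list(clauses)}}


# (preference_1 = Apples AND preference_2 = Bananas)
apples_and_bananas = all_of(match("preference_1", "Apples"),
                            match("preference_2", "Bananas"))
# (preference_1 = Apples AND preference_2 = Cherries)
apples_and_cherries = all_of(match("preference_1", "Apples"),
                             match("preference_2", "Cherries"))
# preference_1 = Grapefruit
grapefruit = match("preference_1", "Grapefruit")

# The outer OR wraps the three pieces in a single should.
query = {"query": any_of(apples_and_bananas, apples_and_cherries, grapefruit)}
print(json.dumps(query, indent=4))
```

Naming each piece after the boolean expression it represents keeps the nesting readable when the logic grows.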

The query can be arbitrarily complex to fit your particular boolean requirement. Each piece can be broken down into its elementary boolean expressions, then chained together as shown above, to make sure you're retrieving the right documents. It’s also worth noting that you can set minimum_should_match to a value you choose. This is the main difference between should and the boolean OR. In query context, when a bool query has no must or filter clause, minimum_should_match defaults to 1, so at least one should clause has to match. When a must or filter clause is present, the default drops to 0 and the should clauses only influence scoring unless you set minimum_should_match yourself. If you would like more than one should clause to match for a document to be returned, you can increase this value.
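
To make the role of minimum_should_match concrete, here is a toy in-memory matcher in Python. It is only a sketch under stated assumptions: match and term are reduced to exact equality (no analysis, no scoring), minimum_should_match is modeled as a plain integer, and the default follows the rule documented for the bool query in query context (at least one should clause is required only when there is no must or filter clause). The function name matches is hypothetical and this is not how Elasticsearch is implemented.

```python
def _as_list(clauses):
    """Bool fields accept either a single clause or a list of clauses."""
    return clauses if isinstance(clauses, list) else [clauses]


def matches(query, doc):
    """Toy matcher for a small subset of the query DSL (query context only)."""
    kind, body = next(iter(query.items()))
    if kind in ("match", "term"):
        field, value = next(iter(body.items()))
        return doc.get(field) == value  # simplification: exact equality
    if kind != "bool":
        raise ValueError("unsupported clause: " + kind)
    required = _as_list(body.get("must", [])) + _as_list(body.get("filter", []))
    excluded = _as_list(body.get("must_not", []))
    optional = _as_list(body.get("should", []))
    # Assumed default: 1 without must/filter clauses, 0 otherwise.
    minimum = body.get("minimum_should_match", 0 if required else 1)
    hits = sum(matches(clause, doc) for clause in optional)
    return (all(matches(clause, doc) for clause in required)
            and not any(matches(clause, doc) for clause in excluded)
            and (not optional or hits >= minimum))


should_clauses = [
    {"match": {"preference_1": "Apples"}},
    {"match": {"preference_2": "Cherries"}},
]
student = {"preference_1": "Apples", "preference_2": "Bananas", "grade": "2"}
other = {"preference_1": "Kiwis", "preference_2": "Limes", "grade": "2"}

either = {"bool": {"should": should_clauses}}
both = {"bool": {"should": should_clauses, "minimum_should_match": 2}}
with_filter = {"bool": {"should": should_clauses,
                        "filter": {"term": {"grade": "2"}}}}
strict = {"bool": {"should": should_clauses,
                   "filter": {"term": {"grade": "2"}},
                   "minimum_should_match": 1}}

print(matches(either, student), matches(both, student))
print(matches(with_filter, other), matches(strict, other))
```

In this model the student who chose Apples and Bananas satisfies the plain OR but not the variant that requires both should clauses, and a grade 2 student who matches no should clause is still returned by the filtered query until minimum_should_match is set explicitly.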

Filtered Queries

Filtered queries have also been deprecated in 2.x, and their replacement is the filter field in the bool query. So let's take our boolean logic from before: (preference_1 = Apples AND preference_2 = Bananas) OR (preference_1 = Apples AND preference_2 = Cherries) OR preference_1 = Grapefruit.

Let’s say an administrator is focusing on grade 2, so they want to see only the results for that grade. It is just a matter of adding a filter to the bool query to get only those documents. Keep in mind that filter clauses do not contribute to the relevance score. Only the query clauses set the score, so a filter narrows the result set without changing how the remaining documents are ranked, and the scoring may not be what you expect if you assumed the filter would influence it. Also note that once a bool query in query context has a filter clause, its should clauses are no longer required to match by default, so set minimum_should_match to 1 if at least one of them must still match.

Here’s what the filter looks like:

"filter" : {
    "term": {
        "grade": "2"
    }
}

So the whole query looks like this:

{
    "query": {
        "bool": {
            "should": [{
                "bool": {
                    "must": [{
                        "match": {
                            "preference_1": "Apples"
                        }
                    }, {
                        "match": {
                            "preference_2": "Bananas"
                        }
                    }]
                }
            }, {
                "bool": {
                    "must": [{
                        "match": {
                            "preference_1": "Apples"
                        }
                    }, {
                        "match": {
                            "preference_2": "Cherries"
                        }
                    }]
                }
            }, {
                "match": {
                    "preference_1": "Grapefruit"
                }
            }],
            "minimum_should_match": 1,
            "filter" : {
                "term": {
                   "grade": "2"
                }
            }
        }
    }
}
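
If your application still sends the deprecated filtered query, the move to bool is mostly mechanical: the query part becomes a must clause and the filter part becomes a filter clause. The following Python sketch illustrates that rewrite for a single top-level filtered query. The function name filtered_to_bool is hypothetical, and the sketch does not descend into nested clauses.

```python
import json


def filtered_to_bool(query):
    """Rewrite {"filtered": {"query": q, "filter": f}} as a bool query."""
    if set(query) != {"filtered"}:
        return query  # not a filtered query, leave it untouched
    body = query["filtered"]
    bool_body = {}
    if "query" in body:
        bool_body["must"] = body["query"]      # scored part
    if "filter" in body:
        bool_body["filter"] = body["filter"]   # unscored part
    return {"bool": bool_body}


old_style = {
    "filtered": {
        "query": {"match": {"preference_1": "Apples"}},
        "filter": {"term": {"grade": "2"}},
    }
}

new_style = filtered_to_bool(old_style)
print(json.dumps({"query": new_style}, indent=4))
```

Because the scored part lands in must, it stays required after the rewrite, so this particular conversion does not need minimum_should_match.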

As you can see, the bool query is quite versatile, but translating complex boolean operations into the bool query syntax can be a challenge. Hopefully, this has given you a better idea of how each of the bool query fields maps to a traditional boolean operator, and how you can chain them together to express complex boolean logic for better search results.

Hopefully it has also given you some ideas for replacing filtered queries and the and, or, and not queries with the bool query, so that you are ready for 5.0 when it is released. As ever, please make sure you review all breaking changes and thoroughly test against your specific use case prior to upgrading.