Filters

Many applications need to let users customize queries in ways that go beyond what search queries alone can do. In this chapter you are going to learn about filtering, a technique that restricts a search query to the subset of documents in an index that satisfy a given condition.

Introduction to Boolean Queries

Before you can implement filters you have to understand how compound queries are implemented in Elasticsearch.

A compound query allows an application to combine two or more individual queries, so that they execute together, and if appropriate, return a combined set of results. The standard way to create compound queries in Elasticsearch is to use a Boolean query.

A boolean query acts as a wrapper for two or more individual queries or clauses. There are four different ways to combine queries:

  • bool.must: the clause must match. If multiple clauses are given, all must match (similar to an AND logical operation).
  • bool.should: when used without must, at least one clause should match (similar to an OR logical operation). When combined with must, each matching clause boosts the relevance score of the document.
  • bool.filter: only documents that match the clause(s) are considered search result candidates. Filter clauses do not affect the relevance score.
  • bool.must_not: only documents that do not match the clause(s) are considered search result candidates.
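
As a rough illustration of how the four clause types fit together, the sketch below builds a single bool query as a Python dictionary. The search terms and the particular combination of clauses are made up for this example and are not part of the tutorial application.

```python
# Illustrative only: a bool query that uses all four clause types.
# The search text and field values are assumptions for this example.
example_query = {
    'bool': {
        # must: every clause here has to match, and contributes to the score
        'must': [
            {'match': {'content': 'work from home'}},
        ],
        # should: optional clauses that raise the score when they match
        'should': [
            {'match': {'name': 'policy'}},
        ],
        # filter: has to match, but is a yes/no test that does not affect the score
        'filter': [
            {'term': {'category.keyword': {'value': 'sharepoint'}}},
        ],
        # must_not: documents matching these clauses are excluded
        'must_not': [
            {'match': {'summary': 'archived'}},
        ],
    }
}

print(sorted(example_query['bool'].keys()))
```

A dictionary like this could be passed as the query argument of a search request, in the same way as the queries shown in the rest of this chapter.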

As you can probably guess from the above, boolean queries involve a fair amount of complexity and can be used in a variety of ways. In this chapter you are going to learn how to combine the multi-match full-text search clause implemented in the previous chapters with a filter that restricts results to one category of documents. Recall that the dataset used with this tutorial includes a category field that can be set to sharepoint, teams or github.

Adding a Filter to a Query

The multi-match query that is currently implemented in the tutorial application uses the following structure:

{
    'multi_match': {
        'query': "query text here",
        'fields': ['name', 'summary', 'content'],
    }
}

To add a filter that restricts this search to a specific category, the query must be expanded as follows:

{
    'bool': {
        'must': [{
            'multi_match': {
                'query': "query text here",
                'fields': ['name', 'summary', 'content'],
            }
        }],
        'filter': [{
            'term': {
                'category.keyword': {
                    'value': "category to filter"
                }
            }
        }]
    }
}

Let's look at the new components in this query in detail.

First of all, the multi_match query has been moved inside a bool.must clause. The bool.must clause is usually the place where the base query is defined. Note that must accepts a list of queries to search for, so this allows multiple base-level queries to be combined when desired.

The filtering is implemented in a bool.filter section, using a new query type, the term query. Using a match or multi_match query for a filter is not a good idea, because these are full-text search queries. For the purpose of filtering, the query must return an absolute true or false answer for each document and not a relevance score like the match queries do.

The term query performs an exact search for a value in a given field. This type of query is useful to search for identifiers, labels, tags, or as in this case, categories.

The term query does not work well with fields that are indexed for full-text search. By default, string fields are assigned a type of text, and have their contents analyzed and separated into individual words before they are indexed. Elasticsearch also assigns these string fields a secondary sub-field of type keyword, which indexes the field contents as a whole, making it more appropriate for filtering with the term query. By using a field name of category.keyword in the filter portion of the query, the keyword typed variant of the field is used instead of the default text one.
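
To make the distinction between the two types more concrete, the fragment below sketches the mapping that Elasticsearch's default dynamic mapping typically generates for a string field such as category. It is written as a Python dictionary for illustration only; the mapping of your own index may differ if it was created with explicit mappings.

```python
# Sketch of the default dynamic mapping for a string field (illustrative).
category_mapping = {
    'category': {
        'type': 'text',               # analyzed, used for full-text search
        'fields': {
            'keyword': {              # addressed in queries as category.keyword
                'type': 'keyword',    # indexed as a single, unanalyzed value
                'ignore_above': 256,  # longer values are skipped by this sub-field
            }
        }
    }
}

# The sub-field name is what gives the filter its field name.
subfield = next(iter(category_mapping['category']['fields']))
print(f'category.{subfield}')
```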

Specifying a Filter

Before the filtered query can be implemented, it is necessary to add a way for end users to enter a desired filter. The solution implemented in this tutorial will look for a category:<category-name> pattern in the text of the search query. Let's add a function called extract_filters() to app.py to look for filter expressions:

import re  # add to the imports at the top of app.py if it is not already there


def extract_filters(query):
    filters = []

    filter_regex = r'category:([^\s]+)\s*'
    m = re.search(filter_regex, query)
    if m:
        filters.append({
            'term': {
                'category.keyword': {
                    'value': m.group(1)
                }
            }
        })
        query = re.sub(filter_regex, '', query).strip()

    return {'filter': filters}, query

The function accepts the query entered by the user and returns a tuple with the filters that were found in the query, and the modified query after the filters were removed. To look for the filter pattern it uses a regular expression. The function is designed to be expanded with additional filters.

When a filter is found, the filters list is extended with a corresponding filter expression, which in this case is based on the term query, as discussed above.

To better understand how this function works, start a Python session (make sure the virtual environment is activated first) and run the following code:

from app import extract_filters
extract_filters('this is the search text category:sharepoint')

The returned tuple from the function should be:

({'filter': [{'term': {'category.keyword': {'value': 'sharepoint'}}}]}, 'this is the search text')

What remains is to change the handle_search() function to send an updated query that combines the full-text search expression with a filter, if one is given by the user. Below is the new version of this function:

@app.post('/')
def handle_search():
    query = request.form.get('query', '')
    filters, parsed_query = extract_filters(query)
    from_ = request.form.get('from_', type=int, default=0)

    results = es.search(
        query={
            'bool': {
                'must': {
                    'multi_match': {
                        'query': parsed_query,
                        'fields': ['name', 'summary', 'content'],
                    }
                },
                **filters
            }
        },
        size=5,
        from_=from_
    )
    return render_template('index.html', results=results['hits']['hits'],
                           query=query, from_=from_,
                           total=results['hits']['total']['value'])

The query has now been changed to send a bool expression, and the search expression was moved inside a must section under it. The extract_filters() function returns the filter portion of the query in the form it needs to be sent to Elasticsearch, so it is inserted in the query dictionary also under the top-level bool key.
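
The **filters line relies on Python's dictionary unpacking. The standalone sketch below uses hard-coded values in place of a call to extract_filters(), and shows the shape of the dictionary that ends up being sent as the query.

```python
# Hard-coded stand-ins for the values returned by extract_filters().
filters = {
    'filter': [{'term': {'category.keyword': {'value': 'sharepoint'}}}]
}
parsed_query = 'work from home'

query = {
    'bool': {
        'must': {
            'multi_match': {
                'query': parsed_query,
                'fields': ['name', 'summary', 'content'],
            }
        },
        **filters,  # adds the 'filter' key next to 'must'
    }
}

print(list(query['bool'].keys()))  # prints ['must', 'filter']
```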

Try a search query such as work from home category:sharepoint to see how only documents from the given category are returned.

Range Filters

Elasticsearch supports a variety of filters besides the term filter. Another one that is commonly used is the range filter, which works with numbers and dates. Let's add a year filter that can be used to restrict results based on the year they were last updated, which is given in the updated_at field.

Below is an updated version of the extract_filters() function that looks for both category:<category> and year:<yyyy> as filters:

def extract_filters(query):
    filters = []

    filter_regex = r'category:([^\s]+)\s*'
    m = re.search(filter_regex, query)
    if m:
        filters.append({
            'term': {
                'category.keyword': {
                    'value': m.group(1)
                }
            },
        })
        query = re.sub(filter_regex, '', query).strip()

    filter_regex = r'year:([^\s]+)\s*'
    m = re.search(filter_regex, query)
    if m:
        filters.append({
            'range': {
                'updated_at': {
                    'gte': f'{m.group(1)}||/y',
                    'lte': f'{m.group(1)}||/y',
                }
            },
        })
        query = re.sub(filter_regex, '', query).strip()

    return {'filter': filters}, query

This version adds a second regular expression to find year:yyyy in the query string. It creates a range filter for the updated_at field, and sets the low and high bounds of the range to the year that is given after the colon, which is captured in the regular expression match as m.group(1).

There is a small complication, because the updated_at field contains full dates, while this filter only needs to look at the year. Luckily, when the range filter is used with a date field, the bounds of the range can be enhanced with date math. The ||/y suffix that is added to the gte (lower bound) and lte (upper bound) parameters of the range tells Elasticsearch to round the given value to the year: the lower bound is rounded down to the start of the year and the upper bound is rounded up to the end of it, so that both can be compared against the full dates stored in the field.
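
To visualize what the rounding does, the sketch below computes the equivalent bounds in plain Python. The year_bounds() function is a hypothetical helper written for this explanation; it is not part of the tutorial application or of the Elasticsearch client, and it only mirrors the rounding that Elasticsearch performs on its side.

```python
from datetime import datetime


def year_bounds(year):
    """Return the (first, last) instants of a year, to millisecond precision.

    This mirrors how a value rounded with /y is interpreted in a range query:
    gte rounds down to the start of the year, lte rounds up to the end of it.
    """
    start = datetime(year, 1, 1)
    end = datetime(year, 12, 31, 23, 59, 59, 999000)
    return start, end


start, end = year_bounds(2020)
print(start.isoformat())                       # 2020-01-01T00:00:00
print(end.isoformat(timespec='milliseconds'))  # 2020-12-31T23:59:59.999
```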

With this change, you can enter a query such as year:2020 work from home to see results from the requested year only. The query can include the two filters as well, for example year:2020 category:teams work from home.

The Match-All Query

Before moving on to a new topic, try entering only a filter in the search query text field, for example category:github. Unfortunately this does not return any results, but the expected behavior in this case would be to receive all the results that match the requested category.

What happens is that the extract_filters() function returns a tuple with the filter(s) in the first element and an empty query string in the second element. The multi_match query receives the empty string, and returns an empty list of results, because nothing matches an empty string.

To address this special case, the multi_match query can be replaced with match_all when the search text is empty. The version of the handle_search() function below adds logic to do this. Update the function in app.py.

@app.post('/')
def handle_search():
    query = request.form.get('query', '')
    filters, parsed_query = extract_filters(query)
    from_ = request.form.get('from_', type=int, default=0)

    if parsed_query:
        search_query = {
            'must': {
                'multi_match': {
                    'query': parsed_query,
                    'fields': ['name', 'summary', 'content'],
                }
            }
        }
    else:
        search_query = {
            'must': {
                'match_all': {}
            }
        }

    results = es.search(
        query={
            'bool': {
                **search_query,
                **filters
            }
        },
        size=5,
        from_=from_
    )
    return render_template('index.html', results=results['hits']['hits'],
                           query=query, from_=from_,
                           total=results['hits']['total']['value'])

With this version, you can ask for all the documents that match a category. Note how all the results that are returned come back with the same score of 1.0, because there are no search terms to compute scores.
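
If you want to check the query-building logic without a running Elasticsearch instance, one option is to move it into a small pure function. The build_query() helper below is a hypothetical refactoring made for illustration; it is not part of the tutorial application.

```python
def build_query(parsed_query, filters):
    """Return the bool query for the given search text and filters.

    Hypothetical helper: falls back to match_all when there is no search text.
    """
    if parsed_query:
        must = {
            'multi_match': {
                'query': parsed_query,
                'fields': ['name', 'summary', 'content'],
            }
        }
    else:
        must = {'match_all': {}}
    return {'bool': {'must': must, **filters}}


# A search that contains only a filter, as in the category:github example.
only_filter = build_query('', {'filter': [
    {'term': {'category.keyword': {'value': 'github'}}}
]})
print(only_filter['bool']['must'])  # prints {'match_all': {}}
```

The dictionary returned by such a function could then be passed as the query argument of es.search(), leaving the rest of handle_search() unchanged.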
