August 26, 2013

Stop Stopping Stop Words: a Look at Common Terms Query

Stop words are a fundamental part of Information Retrieval lore. Common wisdom dictates that you should identify and remove stop words from your index. This increases both performance (fewer terms in your dictionary) and more relevant search results.

Elasticsearch supports stop word removal through the Stop token filter, but a new query was recently added which makes this filter unnecessary: Common Terms Query.

Background

Before we go much farther, Why would you want to remove stop words anyway? In most cases, stop words add little semantic value to a sentence. They are filler words that help sentences flow better, but provide very little context on their own. Stop words are usually words like “to”, “I”, “has”, “the”, “be”, “or”, etc etc.

Even worse than being void of useful information…they are everywhere! Stop words are the most frequent words in the English language. Because of this, most sentences share a similar percentage of stop words. Stop words bloat your index without providing any extra value.

If they are both common and lacking in much useful information, why not remove them?

Removing stop words helps decrease the size of your index as well as the size of your query. Fewer terms is always a win with regards to performance. And since stop words are semantically empty, relevance scores are unaffected.

Stop words are useless…

…unless you actually need them. The classic example is a phrase from Shakespeare:

“To be or not to be.”

This sentence is composed entirely of stop words, and would therefore not be represented in your index. That’s a problem. At this point, people usually start maintaining multiple mappings of their data: one with stop words removed, and one that has stop words intact. They are boosted differently, to favor phrases with no stop words.

This procedure, however, eliminates the benefits of stop word removal. Instead of decreasing the number of terms being searched (and indexed), we just doubled it by indexing the field in two different ways!

Enter Common Terms

The new Common Terms Query is designed to fix this situations, and it does so through a very clever mechanism. At a high level, Common Terms analyzes your query, identifies which words are “important” and performs a search using just those words. Only after documents are matched with important words are the “unimportant” words considered

The motivation behind Common Terms is to leverage the power of stop word removal (faster searches) without eliminating stop words entirely (because they can contribute to score sometimes). You can have your cake and eat it too!

How Common Terms works

Let’s first look at how you construct a Common query:

{
  "common": {
    "body": {
      "query":            "this is bonsai cool",
      "cutoff_frequency": 0.001
    }
  }
}

Note: Common Terms has also been incorporated into the Match query and can be enabled by setting `cutoff_frequency` to a value like 0.001

Several things happen when you execute this query:

The query is sent to each shard in the search
At each shard, Elasticsearch looks at the document frequencies for each term
If a term has a document frequency less than 0.1% (0.001) then it is considered “low frequency”. Otherwise, it is moved to a secondary “high frequency” list
The “low frequency” list is rewritten into a Boolean Should clause (logical OR). In this example, it would contain “bonsai”, “cool”
Any documents that match the low frequency list are then scored against the remaining high frequency list (“this”, “is”)

Internally, the query is rewritten into roughly this representation:

{
  "bool": {
    "must": [
      { "term": { "body": "bonsai"}},
      { "term": { "body": "cool"}}
    ],
    "should": [
      { "term": { "body": "this"}}
      { "term": { "body": "is"}}
    ]
  }
}

It should be easy to see how this can greatly affect performance. The low frequency words act as a pre-filter, drastically reducing the number of terms that need to be evaluated/scored against the full index.

Adaptive Stop Word Evaluation

With traditional stop word schemes, you must first create a list of stop words. Every domain is unique when it comes to stop words: there are no “pre-made” stop word lists on the internet.

As an example, consider the word “video“. For most businesses, “video” is an important word – it shouldn’t be removed. But if you are Youtube, “video” is probably mentioned in thousands of places…it is definitely a stop word in this context. Traditional stop word removal would need a human to sit down, compile a list of domain-specific stop words, add it to Elasticsearch and then routinely maintain the list with additions/deletions.

In contrast, Common Terms adapts to your domain. Words with high frequency will automatically be recognized as stop words. Different companies, using the same query, will get different stop word behavior due to the unique set of documents in their index.

Customizing Common Terms

Common Terms allows considerable customization. Since it is effectively creating two Bool queries, there are a number of options you can twiddle. Let’s look at an example that uses all the options:

{
  "common": {
    "body": {
      "query":             "nelly the elephant not as a cartoon",
      "cutoff_frequency":   0.001,
      "low_freq_operator":  "or",
      "high_freq_operator": "or",
      "minimum_should_match": {
          "low_freq" : "60%",
          "high_freq" : "20%"
       }
    }
  }
}

In this query, we are specifying that both the high- and low-frequency term booleans should use an “Or” operator (e.g. go into a Should clause). Alternatively, we could make one or both of these options an “And” (go into a Must clause).

We are also specifying how many of the terms we want to match, for both high- and low-frequencies. This can be either a percentage or an exact value. By fiddling with the cuttoff and minimum_should_match values, you can very easily hone in on the behavior that you want.

Revisiting Shakespeare

Let’s revisit that Shakespearean phrase we saw earlier. What happens when we query for “To be or not to be” with Common Terms?

{
  "common": {
    "body": {
      "query":            "To be or not to be",
      "cutoff_frequency": 0.001
    }
  }
}

In most corpuses, all those terms are going to be well over 1%, so they are all going to be categorized as highly frequent terms. What does Common Terms do in this situation?

Since all terms are highly frequent and there are no low frequency terms, Elasticsearch converts the high-frequency list into a Must clause:

{
  "bool": {
    "must": [
      { "term": { "body": "to"}},
      { "term": { "body": "be"}},
      { "term": { "body": "or"}},
      { "term": { "body": "not"}}
    ]
  }
}

The theory is that, individually, these terms carry very little semantic information and will match too many documents. But when they are all required, they become much more exclusive and match a limited subset of documents.

Sometimes this approach can be too strict, so it may make sense to relax the requirements with a minimum_should_match:

{
  "common": {
    "body": {
      "query":            "To be or not to be",
      "cutoff_frequency": 0.001,
      "high_freq_operator": "or",
      "minimum_should_match": {
          "high_freq" : "70%"
       }
    }
  }
}

Which will convert the Must clause into a Should that must match 70% of the terms. This is more restrictive than a simple OR, but less restrictive than an AND.

Conclusion

The Common Terms Query is exciting, since it is truly a win-win. You gain the speed of stop word removal, but maintain the precision of leaving stop words in the index. Stop words will still have their uses, but we envision the majority of queries converting to a Common query.

It is a powerful optimization that doesn’t sacrifice relevance for speed. Give it a shot with your data and see if it helps boost performance!