March 7, 2016 · Technical

Phrase Queries: a world without Stopwords

By Gabriel Moskovicz

Analysis and query processing can be confusing when you start out with Elasticsearch. It is one of the most common problems that users run into, and one clear example is phrase matching. In this article we will dig into some of the complexity of phrase queries while working with languages and stopwords.

The key to working with Elasticsearch is understanding what a query is doing. We also need to understand that the art of searching starts when a document is indexed: all of its field content is analyzed and finally indexed into Lucene. Most of the time we think only about how to query our data source, but in Elasticsearch we should first think about the insight we want to get from our data, so that we can discover the best way to accomplish this. To achieve our goals in Elasticsearch we need to think about the entire picture: indexing and querying.

Indexing means not only adding the structured document to Elasticsearch, but also analyzing each of its fields: each field is analyzed, and the result of that analysis is what Elasticsearch actually indexes.

But what is analysis? By definition, it is the process of breaking a complex topic (or substance) into smaller parts in order to gain a better understanding of it. The word comes from the Ancient Greek ἀνάλυσις (analysis, “a breaking up”, from ana- “up, throughout” and lysis “a loosening”).

In concrete terms, Elasticsearch analysis is a process that consists of the following steps:

  • Tokenizing a block of text into individual terms.
  • Normalizing those terms to improve their “searchability”.

For more information and details about the full analysis process please visit The Definitive Guide.
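As a rough illustration (a toy sketch, not Elasticsearch's actual implementation), the two steps above can be written in Python with a simple word tokenizer and a lowercasing normalizer:

```python
import re

def tokenize(text):
    """Step 1: break a block of text into individual terms."""
    return re.findall(r"\w+", text)

def normalize(terms):
    """Step 2: lowercase each term to improve its searchability."""
    return [term.lower() for term in terms]

print(normalize(tokenize("This fox is brown")))
# ['this', 'fox', 'is', 'brown']
```

Real analyzers chain many more steps (stemming, stopword removal, and so on), but they all follow this tokenize-then-normalize shape.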

Once the document is indexed, we can execute queries to retrieve results based on different conditions. Query execution and matching are strongly related to the way we index data, which is why we say that Elasticsearch is not only about querying, but about creating a solution that suits user needs.

Languages and a world with Stopwords

The definition of a stopword is very simple: stopwords are the most common words in a language. Because they are so common, they appear frequently in our phrases and in most of our text fields. Some examples of English stopwords are: “a”, “and”, “but”, “how”, “or”, and “what”. Usually, when we search for something, we want to exclude them. This can be tricky: excluding words is easy, but it impacts the results of certain queries.
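A minimal sketch of stopword removal, using a small hypothetical stopword list (Lucene's actual English list is longer):

```python
# Abbreviated, assumed stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "but", "how", "is", "or", "the", "this", "what"}

def remove_stopwords(terms):
    """Drop the most common words of the language from a token stream."""
    return [term for term in terms if term not in STOPWORDS]

print(remove_stopwords(["this", "fox", "is", "brown"]))
# ['fox', 'brown']
```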


The art of Phrasing

In linguistic analysis, a phrase is a group of words (or possibly a single word) that functions as a constituent in the syntax of a sentence: a single unit within a grammatical hierarchy. A phrase is composed of different words, one followed by the other, in a certain order. While two phrases can have the same meaning, the order of the words creates a completely different phrase. As an example, “This fox is brown” is totally different from “Is the fox brown”. Why is this? Because each word has a position within the phrase, so the meanings of the sentences may be similar, but the word chains are completely different.

As explained in the introduction, the indexing process in Elasticsearch executes the analysis process. Analyzing phrases in Elasticsearch is simple, and can help explain why two phrases are different. Elasticsearch provides the Analyze API, which can be used with a simple set of parameters to understand what a specific analysis process is doing. In the following examples we will use the English analyzer that is predefined in Elasticsearch to understand why the following phrases are different.

First example

GET _analyze?text=This fox is brown&analyzer=english 
{
  "tokens": [
    {
      "token": "fox",
      "start_offset": 5,
      "end_offset": 8,
      "type": "",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "",
      "position": 3
    }
  ]
}

Second example

GET _analyze?text=Is the fox brown&analyzer=english
{
  "tokens": [
    {
      "token": "fox",
      "start_offset": 7,
      "end_offset": 10,
      "type": "",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 11,
      "end_offset": 16,
      "type": "",
      "position": 3
    }
  ]
}

The output that we get from this API includes not only the tokens generated by the analysis process, but also each token's position and character offsets, which is very useful for understanding the different phrases. In both examples the only tokens generated are fox and brown; however, you can see that the position of each token within the phrase is different. Note that in the first example the positions of the two tokens differ by 2, because the stopword between them was removed but its position was preserved.
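This position bookkeeping can be simulated with a toy analyzer in Python (an assumed, abbreviated stopword list, not the real English analyzer): stopwords are dropped, but each surviving token keeps its original position:

```python
import re

# Abbreviated stopword list, assumed for this example.
STOPWORDS = {"a", "is", "the", "this"}

def analyze(text):
    """Toy English-style analysis: lowercase, drop stopwords,
    but keep each surviving token's original position."""
    tokens = []
    for position, word in enumerate(re.findall(r"\w+", text)):
        term = word.lower()
        if term not in STOPWORDS:
            tokens.append((term, position))
    return tokens

print(analyze("This fox is brown"))  # [('fox', 1), ('brown', 3)]
print(analyze("Is the fox brown"))   # [('fox', 2), ('brown', 3)]
```

The output mirrors what the Analyze API returned above: the same two tokens, but with different positions in each phrase.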

Searching our Phrases

Searching for content is very straightforward: one can easily search for all documents or fields that contain certain words. However, searching for full phrases is a completely different problem. Not only are the words themselves important, but also the word chain: the position of each word in the chain matters for the phrase to match. This means that extra conditions need to be met in order for a document to match a certain phrase.
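These extra conditions can be sketched as a simplified positional phrase match (a toy model of matching on term positions, not Lucene's actual implementation): every query term must be present, and the relative distances between their positions must match:

```python
def phrase_match(query_tokens, doc_tokens):
    """Check whether the document contains the query terms with the
    same relative positions. Tokens are (term, position) pairs."""
    doc_positions = {}
    for term, pos in doc_tokens:
        doc_positions.setdefault(term, []).append(pos)
    first_term, first_pos = query_tokens[0]
    # Try each occurrence of the first query term as a starting point.
    for start in doc_positions.get(first_term, []):
        if all(start + (pos - first_pos) in doc_positions.get(term, [])
               for term, pos in query_tokens[1:]):
            return True
    return False

# "Is the fox brown" analyzed: fox at position 2, brown at position 3
print(phrase_match([("fox", 0), ("brown", 1)], [("fox", 2), ("brown", 3)]))  # True
# "This fox is brown" analyzed: fox at position 1, brown at position 3
print(phrase_match([("fox", 0), ("brown", 1)], [("fox", 1), ("brown", 3)]))  # False
```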

To get a better understanding of the problems we can run into, we will index some documents, creating an index that uses the English analyzer for a string field. Here is the code snippet for this:

# Create the index, with a test_type that contains a text field that uses the english analyzer
PUT test
{
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}

# Index the first document
POST test/test_type
{
  "text": "This fox is brown"
}

# Index the second document
POST test/test_type
{
  "text": "Is the fox brown"
}

Now we are ready to execute some phrase searches. In the following example, I will search for the phrase “fox brown”, expecting to get both results. The following simple query is the one you can execute to retrieve these results:

POST test/_search
{
  "query": {
    "match_phrase": {
      "text": "fox brown"
    }
  }
}

However, when we execute this search we can verify that only a single document matches. The actual result of this query is:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.38356602,
    "hits": [
      {
        "_index": "test",
        "_type": "test_type",
        "_id": "AVMtW9pP50vx9-KwTUbS",
        "_score": 0.38356602,
        "_source": {
          "text": "Is the fox brown"
        }
      }
    ]
  }
}

Why is this? Because in the analyzed query “fox brown”, brown comes exactly one position after fox; hence the only document that will match this query is the one that contains the word fox immediately followed by brown. This is one of the common mistakes we make when searching for phrases: expecting results that we do not actually get. Another example would be searching for “fox in brown”. We would expect this phrase to find no results; however, let's take a deeper look:

# The following is the query to execute
POST test/_search
{
  "query": {
    "match_phrase": {
      "text": "fox in brown"
    }
  }
}

And the result will be:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.38356602,
    "hits": [
      {
        "_index": "test",
        "_type": "test_type",
        "_id": "AVMtW9W050vx9-KwTUbR",
        "_score": 0.38356602,
        "_source": {
          "text": "This fox is brown"
        }
      }
    ]
  }
}

So we now see that “fox in brown” matches “This fox is brown”, which we did not expect. Why is this document matching now? Let's go back to the analysis of our document. At the beginning of this article, we analyzed “This fox is brown”, resulting in two tokens with specific positions:

  • Fox with position 1
  • Brown with position 3

When we search for the phrase “fox in brown”, the query text is analyzed with the same English analyzer. Since “in” is a stopword, it is removed during analysis, just as “is” was removed before, but its position is preserved. The analyzed query is therefore:

  • Fox with position 0
  • Brown with position 2

So we are searching for documents that:

  • Contain both Fox and Brown
  • Have Brown exactly two positions after Fox

“This fox is brown” (fox at position 1, brown at position 3) meets both conditions, which is why it matches. This is how stopword removal impacts the results of all our queries.
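Putting it all together, a small Python sketch (toy analyzer with an assumed stopword list) shows why the analyzed query “fox in brown” lines up with “This fox is brown” but not with “Is the fox brown”:

```python
import re

# Abbreviated, assumed stopword list for illustration.
STOPWORDS = {"in", "is", "the", "this"}

def analyze(text):
    # Lowercase, drop stopwords, keep each surviving token's position.
    return [(w.lower(), p) for p, w in enumerate(re.findall(r"\w+", text))
            if w.lower() not in STOPWORDS]

def gaps(tokens):
    # Distances of each token from the first surviving token.
    first = tokens[0][1]
    return [p - first for _, p in tokens]

query = analyze("fox in brown")       # [('fox', 0), ('brown', 2)]
doc_1 = analyze("This fox is brown")  # [('fox', 1), ('brown', 3)]
doc_2 = analyze("Is the fox brown")   # [('fox', 2), ('brown', 3)]

print(gaps(query) == gaps(doc_1))  # True: brown is two positions after fox
print(gaps(query) == gaps(doc_2))  # False: brown is only one position after fox
```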

The conclusion of this article is that we need to consider the entire process in order to understand whether the queries we create will match the information we need. For this reason, we should keep an eye on the way we index our data and verify the entire process. The art of searching in Elasticsearch is not only about crafting the exact queries, but about combining them with a smart analysis process that complements those queries. After all, Elasticsearch is all about analyzing (indexing) and querying.