June 27, 2016 Technical

Elasticsearch Percolator Continues to Evolve

By Martijn van Groningen

In 5.0 the percolator is much more flexible and has many improvements; for example, it can skip evaluating most queries. All of this is part of the second major refactoring of the percolator since Elasticsearch 1.0.0, the release that made the percolator scale with the number of shards and nodes in your cluster. However, the underlying mechanism of the percolator hadn’t changed since this feature was released back in version 0.15.0. If you didn’t make use of query metadata tagging, the execution time of the percolator was always linear in the number of percolator queries, because all percolator queries had to be evaluated every time. The main purpose of this refactoring was to address this, so that in many cases not all percolator queries have to be evaluated when percolating a document.

The slowest part of percolating is verifying whether a percolator query actually matches the document being percolated. When percolating, the document being percolated gets indexed into a temporary in-memory index. Prior to 5.0, all percolator queries needed to be executed on this in-memory index in order to verify whether they matched. So the idea is that the fewer queries need to be verified against the in-memory index, the faster the percolator executes.

Percolator field mapper

It is no longer required to index percolator queries in the special .percolator type under the query field. Any field and type (in any index) can contain percolator queries. Instead, before indexing percolator queries, you must configure the percolator field mapping in the type you’re going to index percolator queries into. So let’s take a look at how this looks now:

curl -XPUT "http://localhost:9200/news" -d'
{
  "mappings": {
    "alert": {
      "properties": {
        "query": {
          "type": "percolator"
        }
      }
    },
    "item": {
      "properties": {
        "body" : {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}'

We created an index named news, which has two mappings. The first mapping, alert, is for the percolator query documents; in this case the query must be defined inside the query field. The second mapping, item, is for the documents being percolated. We need to define this mapping up front; otherwise the queries that are going to be indexed wouldn’t be analyzed correctly, and the process used to skip evaluating percolator queries relies on this analysis. After this we can index the following document, which holds a match query:

curl -XPUT "http://localhost:9200/news/alert/1" -d'
{
  "query" : {
    "match" : {
      "body" : "space fire"
    }
  }
}'

Besides storing the actual query, the percolator field mapper extracts all terms from the query and indexes them separately into an auxiliary indexed field that is part of the query field.

In the above example the following terms will get extracted and indexed: body:space and body:fire. During percolation all the terms from the document to be percolated are extracted. A query is built from these terms, so that the percolator can query this auxiliary field to find candidate percolator queries that may match the document being percolated. Potentially many percolator queries that don’t match this candidate query are never evaluated against the in-memory index, which reduces the time it takes to execute the entire percolate request. It is safe to ignore these percolator queries, because if a query’s terms don’t appear in the document being percolated then that query will never match anyway. This is a big win.
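The pre-selection step can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not how Elasticsearch implements it: the `ToyPercolator` class, its method names, and the regex tokenizer are hypothetical, there is no stemming, and each percolator query is simplified to "all of these terms must appear in one field".

```python
import re
from collections import defaultdict


def analyze(text):
    # Hypothetical stand-in for a real analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyPercolator:
    def __init__(self):
        self.queries = {}                    # query id -> (field, required terms)
        self.terms_index = defaultdict(set)  # "field:term" -> ids of queries using it
        self.verified = 0                    # number of expensive verifications run

    def register(self, query_id, field, text):
        # Store the query and index its terms separately (the "auxiliary field").
        terms = set(analyze(text))
        self.queries[query_id] = (field, terms)
        for term in terms:
            self.terms_index[f"{field}:{term}"].add(query_id)

    def candidates(self, document):
        # Cheap step: look up every document term in the auxiliary terms index.
        found = set()
        for field, text in document.items():
            for term in analyze(text):
                found |= self.terms_index.get(f"{field}:{term}", set())
        return found

    def percolate(self, document):
        # Expensive step: verify only the candidates against the document.
        matches = []
        for query_id in sorted(self.candidates(document)):
            field, required = self.queries[query_id]
            self.verified += 1
            if required <= set(analyze(document.get(field, ""))):
                matches.append(query_id)
        return matches


percolator = ToyPercolator()
percolator.register("1", "body", "space fire")
percolator.register("2", "body", "stock market")
percolator.register("3", "body", "space station")
print(percolator.percolate({"body": "A fire broke out in space"}))
```

In this model query 2 shares no term with the document, so it is never verified; query 3 is a candidate because of "space" but fails verification.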

The percolator field mapper can’t extract terms from all queries. When extraction fails, the percolator query gets marked and will always be evaluated when percolating. Most term-based queries (like match, multi_match, terms, and most span queries) and compound queries (like bool, dis_max and constant_score queries) are supported.
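The fallback for queries without extractable terms can be sketched as follows. The extraction rules here are deliberately conservative assumptions for illustration and are not Elasticsearch's actual rules: the hypothetical `extract_terms` only understands `match` with a plain string and `bool` with list-valued `must`, `should`, and `filter` clauses, and treats everything else (a `range` query is used as the example) as unsupported.

```python
import re


def extract_terms(query):
    """Return a set of "field:term" strings, or None if extraction is not possible."""
    kind, body = next(iter(query.items()))
    if kind == "match":
        field, text = next(iter(body.items()))
        return {f"{field}:{t}" for t in re.findall(r"[a-z0-9]+", str(text).lower())}
    if kind == "bool" and "must_not" not in body:
        terms = set()
        for occur in ("must", "should", "filter"):
            for clause in body.get(occur, []):
                sub_terms = extract_terms(clause)
                if sub_terms is None:
                    return None  # one unsupported clause marks the whole query
                terms |= sub_terms
        return terms
    return None  # query type unknown to this toy extractor


def needs_verification(query, document_terms):
    """Decide whether a percolator query must be verified for this document."""
    terms = extract_terms(query)
    if not terms:
        return True  # marked query (or no terms at all): always evaluated
    return bool(terms & document_terms)


match_query = {"match": {"body": "space fire"}}
range_query = {"range": {"price": {"gte": 10}}}
document_terms = {"body:stock", "body:market"}
print(needs_verification(match_query, document_terms))
print(needs_verification(range_query, document_terms))
```

The `match` query can be skipped for this document, while the `range` query has to be evaluated every time because nothing could be extracted from it.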

If you’re wondering how you can figure out which queries the percolator field mapper was able to extract the terms for, you can just execute a query as described here.

Also, the percolator will no longer load the percolator queries as Lucene queries into memory; they are instead read from disk. Before 5.0, if you had thousands of percolator queries they’d take up megabytes of precious JVM heap space, putting pressure on JVM garbage collection and, if you weren’t careful, leading to the infamous JVM out of memory error. Back then loading the percolator queries into memory made sense, because all the percolator queries were evaluated all the time, so we made executing each one as fast as possible. Now, with pre-selection, only percolator queries that are likely to match are evaluated. We decided to trade speed for stability, removing the caching to free up memory. The speed loss is more than paid for by skipping most queries in most cases.

Also, updates made to a percolator query are no longer visible in real time by default. This is because Elasticsearch relies on a query to select the percolator queries to execute, and the search index that query runs against is only updated by the refresh cycle. If you want changes to a percolator query to be visible in real time, you need to run a refresh as part of the index, delete, or update request. Alternatively, you can use the new wait on refresh option that is available on all write APIs.
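As a rough illustration, the helper below builds (but does not send) an index request with a refresh setting. It assumes that the "wait on refresh" option refers to the `refresh=wait_for` request parameter and that an immediate refresh is requested with `refresh=true`; the function name `build_index_request` is hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_index_request(base_url, index, doc_type, doc_id, body, refresh=None):
    """Build a PUT request for a percolator query document.

    refresh=None      -> default behaviour, visible after the next refresh cycle
    refresh="true"    -> assumed: refresh immediately as part of the request
    refresh="wait_for" -> assumed: respond once a refresh has made the change visible
    """
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    if refresh is not None:
        url += "?" + urlencode({"refresh": refresh})
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(
    "http://localhost:9200", "news", "alert", "1",
    {"query": {"match": {"body": "space fire"}}},
    refresh="wait_for",
)
print(request.get_method(), request.full_url)
# To actually send it against a running cluster:
# urllib.request.urlopen(request)
```

Forcing a refresh on every write is more expensive than waiting for the next scheduled one, so the waiting variant is probably the better default for bulk registration of queries.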

Indices holding .percolator types created before upgrading to Elasticsearch 5.0 will continue to work. However, because these .percolator types don’t have the new percolator field mapper, the percolator will need to evaluate all the queries all the time. So it is strongly recommended to reindex the percolator queries in these indices into a new index. We will drop support for the .percolator type in the next major version of Elasticsearch.
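One possible way to do this migration is the Reindex API. The sketch below only builds the request body; it assumes the destination index has already been created with a percolator field mapping (like the news index above), and the source index name `old_news` and the helper name are hypothetical.

```python
import json


def build_reindex_body(source_index, dest_index, dest_type):
    """Body for a `POST _reindex` request that copies legacy percolator queries.

    Assumption: the legacy queries live in the `.percolator` type of
    `source_index`, and `dest_type` in `dest_index` maps the `query`
    field as type `percolator`.
    """
    return {
        "source": {"index": source_index, "type": ".percolator"},
        "dest": {"index": dest_index, "type": dest_type},
    }


print(json.dumps(build_reindex_body("old_news", "news", "alert"), indent=2))
```

The mappings of the fields that the queries refer to also need to exist in the new index before reindexing, for the same analysis reasons described earlier.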

Percolate query

Another big change is that the percolate and multi percolate APIs have been superseded by the search and multi search APIs with the percolate query.

First let’s look at percolating a document in Elasticsearch 2.x and before via the percolate API:

curl -XGET "http://localhost:9200/news/item/_percolate" -d'
{
  "doc": {
    "body": "An unmanned cargo ship pulled away from the ISS to experiment on how big fires grow in space, an important test for astronaut safety ...",
    "category": "science"
  }
}'

The percolate API would then respond like this:

{
  "total": 1,
  "matches": [
    {
      "_index": "news",
      "_id": "1"
    }
  ]
}

Percolating a document via the search API in Elasticsearch 5.0 and onwards:

curl -XGET "http://localhost:9200/news/_search" -d'
{
  "query": {
    "percolate": {
      "field" : "query",
      "document_type" : "item",
      "document": {
        "body": "An unmanned cargo ship pulled away from the ISS to experiment on how big fires grow in space, an important test for astronaut safety ...",
        "category": "science"
      }
    }
  }
}'

The search API would then respond as follows:

{
  "hits": {
    "total": 1,
    "max_score": 0.5260227,
    "hits": [
      {
        "_index": "news",
        "_type": "alert",
        "_id": "1",
        "_score": 0.5260227,
        "_source": {
          "query": {
            "match": {
              "body": "space fire"
            }
          }
        }
      }
    ]
  }
}

As you can see, the percolate API response is very minimalistic compared to the search API response. The percolate API basically just returns ids, whereas the search API also returns the source of the percolator query and a score.

Aside from returning more information, the move to the search API and its infrastructure was huge. This refactoring alone implemented seven requested features (pagination support, returning _source and others), while still exposing the same functionality the percolate APIs did.

Also, because the percolator is now a query, it has become much more flexible. You’re free to use the percolate query inside any other query, for example in a must_not clause of a bool query, or to define multiple percolate queries inside a bool query. For example, you can percolate two documents at the same time via a bool query with two percolate queries in its should clauses:

curl -XGET "http://localhost:9200/news/_search" -d'
{
  "query": {
    "bool": {
      "should": [
        {
          "percolate": {
            "field": "query",
            "document_type": "item",
            "document": {
              ...
            }
          }
        },
        {
          "percolate": {
            "field": "query",
            "document_type": "item",
            "document": {
              ...
            }
          }
        }
      ]
    }
  }
}'

In the above case this would return the percolator queries that match either document or both.
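The must_not variant mentioned earlier can be sketched in the same spirit. The helper names and the two sample documents below are hypothetical; the body that is printed is meant to be sent to the same news/_search endpoint and should return percolator queries that match the first document but not the second.

```python
import json


def percolate_clause(document, field="query", document_type="item"):
    # One percolate query clause, using the same keys as the requests above.
    return {
        "percolate": {
            "field": field,
            "document_type": document_type,
            "document": document,
        }
    }


def match_first_but_not_second(first_doc, second_doc):
    # bool query: must match the first document, must not match the second.
    return {
        "query": {
            "bool": {
                "must": [percolate_clause(first_doc)],
                "must_not": [percolate_clause(second_doc)],
            }
        }
    }


search_body = match_first_but_not_second(
    {"body": "fire in space"},      # hypothetical sample document
    {"body": "stock market news"},  # hypothetical sample document
)
print(json.dumps(search_body, indent=2))
```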

Scoring in the percolator has also changed completely. In Elasticsearch 2.x and before, the percolate API didn’t return a score unless a query was specified. That query was meant to query the percolator query’s metadata, and therefore the score didn’t say anything about how well the percolator query matched the document being percolated. The score that the percolate query emits for each percolator query is now based on the score the in-memory Lucene index computes.

Note that if your application is using the percolate or mpercolate APIs, you can still use these APIs after you’ve upgraded to 5.0. However, these APIs have been deprecated and will no longer exist as of Elasticsearch 6.0. This gives you the opportunity to first upgrade to Elasticsearch 5.0 and then migrate to the percolate query via either the search or msearch API. Behind the scenes, the percolate and mpercolate APIs transform the percolate request into a search request and redirect it to the search API or msearch API.

Improving the percolator doesn’t stop here. The percolator field mapper in particular will continue to get improvements, so that at search time the percolate query will need to fall back to the in-memory index for match verification less often. We would love for you to try out the latest alpha or beta release in order to test drive the improved percolator. Happy percolating!