25 November 2013 Engineering

Redesigned Percolator

The percolator is essentially search in reverse, which can be confusing at first for many people. This post will help clear up that confusion and give more information on the redesigned percolator. We have added many more features to it to help users work with percolated documents and queries more easily.

In normal search systems, you store your data as documents and then send your questions as queries. The search results are a list of documents that matched your query.

With the percolator, this is reversed. First, you store the queries and then you send your 'questions' as documents. The percolator results are a list of queries that matched the document.

So what can the percolator do for you? The percolator can be used for a number of use cases, but the most common is alerting and monitoring. By registering queries in Elasticsearch, your data can be monitored in real-time. If data with certain properties is being indexed, the percolator can tell you which queries this data matches.

For example, imagine a user "saving" a search. As new documents are added to the index, documents are percolated against this saved query and the user is alerted when new documents match. The percolator can also be used for data classification and user query feedback.

So how does it work in Elasticsearch? Data and queries are two separate things, but both are expressed in JSON. By using this common property, we can send queries to Elasticsearch and they will be stored as documents... but that alone isn't enough. Elasticsearch needs to treat them as queries, not documents. The current percolator system stores queries in a dedicated _percolator index. Whenever a query is registered with the percolator index, the query part is extracted and turned into a real query, which is kept around for later usage via the percolate api. The percolate api is the equivalent of the search api for percolating. The percolate api is also fully real-time, meaning that as soon as a query is registered with the percolator, it is available for use. Let's take a look at the percolator as it is already available in many Elasticsearch releases:

# Register a query for <= 0.90.x percolator.
curl -XPUT 'localhost:9200/_percolator/my-index/my-id' -d '{
    "query" : {
        "match" : {
            "body" : "coffee"
        }
    }
}'

# <= 0.90.x percolate api
curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
    "doc" : {
        "title" : "Coffee percolator",
        "body" : "A coffee percolator is a type of ..."
    }
}'

The _percolator system index used to register a query, as shown in the first request in the above sample, is a single-primary-shard index that has a replica shard on each data node, and these settings are fixed. The last request in the above sample sends a percolate request to Elasticsearch and will yield the following result:

{
    "ok" : true,
    "matches" : ["my-id", ...]
}

The image below illustrates how a client executes a percolate request. It doesn't matter where the percolate request is executed, because each data node has a _percolator shard (p1 squares) that sits next to all the other shards (other squares) a data node may have.

The current percolator has been around since Elasticsearch version 0.15.0, and since then more and more people have started using it with ever larger numbers of queries. The percolator works fine until a certain number of queries are stored in it, but beyond that the percolate execution time increases to a level where people are less comfortable using it in production. It scales roughly linearly with the number of registered queries.

In order to get around this limitation, queries can be partitioned across multiple indices, or the percolate query mechanism can be used to reduce percolator execution time. Even with these workarounds, the current percolator has fundamental scaling limits, which become obvious when people percolate massive numbers of queries. So we had to go back to the drawing board and come up with a different solution for scaling the percolator. The good news is that we did, and the alternative solution is already included in our first 1.0 beta release!

So what has changed in the percolator? The core of the percolator didn't change that much. You still register percolator queries as in the old system, and percolate documents with the percolate api. What has changed is where the percolator queries are stored. We dropped the reserved _percolator index; instead, percolator queries are now stored in any index under the .percolator type. This means that any index can be used as a percolator index, which allows for more flexibility when it comes to scaling. The single-shard index restriction has been removed.

Each percolator index can be configured with the number of shards necessary to hold your percolator queries. The percolate api has been changed to execute on any index and to parallelize execution across all the shards within that index. The percolate api can now percolate documents against multiple indices and aliases. It also supports routing and preference, just like the search api.
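As a sketch of what this flexibility looks like in practice (the index names, settings, and routing value below are hypothetical, assuming an Elasticsearch 1.0 node on localhost):

```shell
# Create a regular index to hold percolator queries; unlike the old
# _percolator index, the shard and replica counts are up to you.
curl -XPUT 'localhost:9200/my-queries-index' -d '{
    "settings" : {
        "number_of_shards" : 3,
        "number_of_replicas" : 1
    }
}'

# Percolate a document against the queries registered in several
# indices at once, with a routing value, just like the search api.
curl -XGET 'localhost:9200/my-index1,my-index2/my-type/_percolate?routing=user1' -d '{
    "doc" : {
        "body" : "a fresh pot of coffee"
    }
}'
```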

This image illustrates how the redesigned percolator works. Here the index a has a .percolator type and holds the registered queries. It is just an ordinary index with 3 primary shards (green squares) and a replica shard for each primary shard (white squares). The registered queries are divided between the shards, and the percolate request is executed in parallel on each node that holds these shards.

Obviously, this large-scale redesign means breaking backwards compatibility with the currently released percolator. Breaking changes are a good time to make improvements to existing structures; for example, the percolator response has changed with some new usability improvements. If you're already using the percolator, you can just upgrade to 1.0 and re-import your queries from the _percolator index to an index of your choice. Here's some code to illustrate the new functionality:

# Register a query with the redesigned percolator
curl -XPUT 'localhost:9200/my-index1/.percolator/my-id' -d '{
    "query" : {
        "match" : {
            "body" : "coffee"
        }
    }
}'

# Execute a percolate request with the redesigned percolator
curl -XGET 'localhost:9200/my-index1/my-type/_percolate' -d '{
    "doc" : {
        "title" : "Coffee percolator",
        "body" : "A coffee percolator is a type of ..."
    }
}'
The last request in the above sample will yield the following result:

{
    "took" : 19,
    "_shards" : {
        "total" : 2,
        "successful" : 2,
        "failed" : 0
    },
    "count" : 4,
    "matches" : [
        {
            "_index" : "my-index1",
            "_id" : "my-id"
        },
        ...
    ]
}

The first major change to the percolate response is that we now include a response header. Just like with the search api, it shows how many shards the request should have gone to, how many shards were actually visited, and how many shards failed. The biggest change in the response is how percolator query matches are represented. Instead of listing all matching query ids in a string array, there is now a match object that encapsulates the query id and the concrete index the query resides in. The _index part is important now because the same query id can be used in multiple indices. We also include a total count in the response, which tells you how many queries matched the percolated document. By default all matching query ids are returned, but that is controllable.
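For instance, the number of matches returned can be capped with a size option in the percolate request body (a sketch reusing the my-index1 index from the earlier example):

```shell
# Return at most 10 matching queries; the "count" field in the
# response still reflects the total number of matching queries.
curl -XGET 'localhost:9200/my-index1/my-type/_percolate' -d '{
    "doc" : {
        "body" : "coffee"
    },
    "size" : 10
}'
```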

The new features are described in our reference guide. So, are you curious about the redesigned percolator? Try out our latest beta release and tell us about it!