04 February 2013 Engineering

Starts-With Phrase Matching

By Zachary Tong

WARNING: This article contains outdated information. We no longer recommend taking its advice.

“Starts-with” functionality is common in search applications. Either you want to return results as the user types (ala Google Instant) or you simply want to search partial phrases and return any matches. This can be accomplished in Elasticsearch several ways.

In this article, we are going to explore phrase matching at query time, instead of building the functionality directly into the index using shingles or nGrams. In particular, we are going to focus on `match_phrase_prefix` query to do the heavy lifting.

This query takes the normal `match` query and adds both phrase support and a fuzzy prefix capability. The phrase matching comes from the ability to look at token offsets, which lets the query know when tokens follow each other in a phrase. The prefix capability takes the last token of your query and expands it into new query tokens.

For example, if your query is “dog f”, then `match_phrase_prefix` will expand this into new queries:

  • dog fight
  • dog food
  • dog freight
  • dog farce
  • [...]

Handy, right?
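The expansion step can be sketched in a few lines of Python. This is purely illustrative, not Elasticsearch internals; the term list and the helper name are hypothetical, but the idea is the same: treat the trailing token as a prefix and expand it against the terms in the index, capped by `max_expansions`.

```python
# Illustrative sketch of phrase-prefix expansion (NOT Elasticsearch code).
# The term dictionary below is made up for the example.

def phrase_prefix_expand(query, index_terms, max_expansions=5):
    tokens = query.lower().split()
    *phrase, prefix = tokens
    # Expand the trailing token into every indexed term that starts with it.
    expansions = [t for t in sorted(index_terms) if t.startswith(prefix)]
    # Each expansion yields a candidate phrase query, up to max_expansions.
    return [" ".join(phrase + [t]) for t in expansions[:max_expansions]]

terms = {"fight", "food", "freight", "farce", "fun", "cat"}
print(phrase_prefix_expand("dog f", terms))
# → ['dog farce', 'dog fight', 'dog food', 'dog freight', 'dog fun']
```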

Getting Started

Let’s go ahead and insert some data into a newly created index. All the commands necessary to follow along are located in a Gist. Our data looks like this:

$ curl -XGET localhost:9200/startswith/test/_search?pretty -d '{
  "query": {
    "match_all": {}
  }
}' | grep title
{"title":"data"}
{"title":"drive"}
{"title":"drunk"}
{"title":"river dog"}
{"title":"dzone"}

Our goal is to do a phrase match on the single character “d”. Ideally, we would retrieve all the titles that start with “d”, but not return the [river dog] entry because it starts with an “r”.

Let’s go ahead and try the `match_phrase_prefix` query and see what happens:

$ curl -XGET localhost:9200/startswith/test/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "d",
        "max_expansions": 5
      }
    }
  }
}' | grep title
"_score" : 0.70710677, "_source" : {"title":"data"}
"_score" : 0.70710677, "_source" : {"title":"drive"}
"_score" : 0.70710677, "_source" : {"title":"dzone"}
"_score" : 0.44194174, "_source" : {"title":"river dog"}
"_score" : 0.30685282, "_source" : {"title":"drunk"}

Hmm… well that didn’t work. Why not? Because we didn’t specify a mapping for this new index, Elasticsearch simply used the default standard analyzer. And if you run the standard analyzer on [river dog], you’ll end up with this:

$ curl -XGET 'localhost:9200/startswith/_analyze?analyzer=standard&pretty' -d 'river dog' | grep token
"token" : "river",
"token" : "dog",

And that’s where the problem comes from. The standard analyzer splits the text into individual word tokens, so [river dog] becomes [river] and [dog]. The `match_phrase_prefix` query doesn’t match [river], but it does match [dog]. You can see this reflected in the score of “river dog”: roughly half that of the other entries, because only one of its two tokens matched.
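A toy sketch in Python (not Elasticsearch internals) makes the failure easy to see: once the field is split into word tokens, the prefix “d” happily matches the second token of [river dog]:

```python
# Why "river dog" matched: with a word-splitting analyzer, the field
# becomes two tokens, and the prefix "d" matches the second one.

def word_tokens(text):
    # Rough stand-in for the standard analyzer, which splits text into
    # words (it also strips punctuation, lowercases, etc.)
    return text.lower().split()

tokens = word_tokens("river dog")
print(tokens)                                  # ['river', 'dog']
print(any(t.startswith("d") for t in tokens))  # True -> document matches
```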

The solution, as is so often the case, comes in the form of a mapping. We want Elasticsearch to treat the entire field as a single token rather than parsing it into individual tokens. We could simply set the field to `not_analyzed`, but that would make matching case sensitive.

Instead, we are going to use the <a href="/guide/en/elasticsearch/reference/current/analysis-keyword-tokenizer.html">keyword tokenizer</a> plus the <a href="/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenfilter.html">lowercase filter</a>.

The keyword tokenizer is simple, and difficult to appreciate until you run into a situation like this one. It takes the entire field and emits it as a single token. In fact, it doesn’t really tokenize at all; it just returns what you give it.

The `lowercase` filter is pretty self-explanatory, I should think :)
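The combined effect of the two can be sketched as a single Python function (illustrative only; the real work happens inside Elasticsearch’s analysis chain). The whole field becomes one lowercased token, so a “starts with” check now applies to the entire phrase rather than to each word:

```python
# Sketch of keyword tokenizer + lowercase filter (NOT Elasticsearch code).

def keyword_lowercase(text):
    # keyword tokenizer: emit the whole field as one token;
    # lowercase filter: normalize case so matching is case-insensitive.
    return [text.lower()]

for title in ["Drive", "river dog"]:
    token = keyword_lowercase(title)[0]
    print(title, "->", token, "| starts with 'd':", token.startswith("d"))
```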

Go ahead and delete the old index, recreate it with the following mapping and re-insert all the data.

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_startswith": {
            "tokenizer": "keyword",
            "filter": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "search_analyzer": "analyzer_startswith",
          "index_analyzer": "analyzer_startswith",
          "type": "string"
        }
      }
    }
  }
}

Now, when we execute the same `match_phrase_prefix` query from above, we get much better results:

$ curl -XGET localhost:9200/startswith/test/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "d",
        "max_expansions": 5
      }
    }
  }
}' | grep title
"_score" : 1.0, "_source" : {"title":"drunk"}
"_score" : 0.30685282, "_source" : {"title":"dzone"}
"_score" : 0.30685282, "_source" : {"title":"data"}
"_score" : 0.30685282, "_source" : {"title":"drive"}

Huzzah! The incorrect [river dog] entry is no longer present in our search results. Let’s add another character to our query, making the search for “dr” instead of just “d”. Does it still work?

$ curl -XGET localhost:9200/startswith/test/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "dr",
        "max_expansions": 5
      }
    }
  }
}' | grep title
"_score" : 1.0, "_source" : {"title":"drunk"}
"_score" : 0.30685282, "_source" : {"title":"drive"}

Yep! Works perfectly, only returning the two documents that start with “dr”.

You may notice that one document (“drunk”) receives a score of 1 while the rest have scores of 0.306… There is a good reason for this, but it is outside the scope of this article. Suffice it to say that it is an artifact of the low document count in the index. You can ignore this peculiarity for now.

Conclusion

So, what have we learned? If you want “starts with” functionality at query time, you still need an appropriate mapping at index time. In this article, we looked at how a simple keyword tokenizer (plus a lowercase filter) can ensure that the `match_phrase_prefix` query returns the right results, by keeping our fields from being broken into multiple terms.