January 15, 2013 Engineering

Searching with Shingles

By Zachary Tong

In this article, I want to introduce Shingles. Shingles are effectively word-nGrams. Given a stream of tokens, the shingle filter will create new tokens by concatenating adjacent terms.

For example, given the phrase “Shingles is a viral disease”, a shingle filter might produce:

  • Shingles is
  • is a
  • a viral
  • viral disease

The shingle filter allows you to adjust min_shingle_size and max_shingle_size, so you can create new shingle tokens of any size.
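
To make the idea concrete, here is a minimal Python sketch of what shingling does to a token list. It is only an illustration, not how the filter is implemented in Lucene; the function name `shingles` and its `min_size`/`max_size` arguments are made up for this example and merely mirror the two settings above.

```python
def shingles(tokens, min_size=2, max_size=2):
    """Return word n-grams ("shingles") built from adjacent tokens."""
    out = []
    for start in range(len(tokens)):
        # Grow a shingle from this position, one word at a time.
        for size in range(min_size, max_size + 1):
            end = start + size
            if end <= len(tokens):
                out.append(" ".join(tokens[start:end]))
    return out


tokens = "Shingles is a viral disease".split()
bigrams = shingles(tokens)
print(bigrams)
# ['Shingles is', 'is a', 'a viral', 'viral disease']
```

Raising `max_size` to 3 would additionally produce three-word shingles such as "Shingles is a".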

Do you see why these are awesome? Shingles effectively give you the ability to pre-bake phrase matching. By building phrases into the index, you avoid assembling them at query time, which saves some processing for every search.

The downside is that you have larger indices and potentially more memory usage if you try to facet/sort on the shingled field.

You can get similar functionality from nGrams, but the overhead is even higher since a single phrase can generate a large number of nGrams. In many cases, search relevance is perfectly fine using shingles over nGrams. Subjectively, I’d argue that shingles give better results.

The only feature you lose is the ability to tolerate spelling mistakes…but there are better ways to handle that behavior (which I’ll cover in a later article).

Using Shingles

Ok, let’s dive into how shingles are actually used. First, check out this gist – I’ve written out a sample mapping that you can create in your Elasticsearch cluster.

The mapping is pretty simple: a single field named `title`, an analyzer and two filters. Let’s work through this piece by piece.

"mappings":{
   "product":{
      "properties":{
         "title":{
            "search_analyzer":"analyzer_shingle",
            "index_analyzer":"analyzer_shingle",
            "type":"string"
         }
      }
   }
}

The mapping section is simple. We define a type (`product`) and a field (`title`). The field specifies the `analyzer_shingle` analyzer for both search and index.

The fun starts to happen in the analyzer:

"analyzer_shingle":{
   "tokenizer":"standard",
   "filter":["standard", "lowercase", "filter_stop", "filter_shingle"]
}

ES first tokenizes an input text, then filters. So this analyzer first applies the standard tokenizer, then walks through the standard filter, lowercase filter and our two custom filters: `filter_stop` and `filter_shingle`.
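
The order of that chain matters. Below is a rough Python sketch of "tokenize first, then run the filters in order". The regex tokenizer is a crude stand-in for the standard tokenizer, and the stop list and every function name here are hypothetical, chosen only for illustration.

```python
import re

STOPWORDS = {"to", "a", "the"}  # hypothetical, tiny lowercase stop list


def tokenize(text):
    # Crude stand-in for the standard tokenizer: split on non-word characters.
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    # Case-sensitive comparison against the lowercase stop list.
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters):
    # Tokenize once, then feed the token list through each filter in order.
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


in_order = analyze("Test text To see shingles", [lowercase, remove_stopwords])
swapped = analyze("Test text To see shingles", [remove_stopwords, lowercase])
print(in_order)  # ['test', 'text', 'see', 'shingles']
print(swapped)   # ['test', 'text', 'to', 'see', 'shingles']
```

In this sketch, putting the stop step before lowercasing lets the capitalized "To" slip through, which is why the lowercase filter comes first in the analyzer above.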

First, `filter_stop`:

"filter_stop":{
   "type":"stop",
   "enable_position_increments":"false"
}

`filter_stop` isn’t actually a custom filter; we just need to adjust a default setting of the `stop` filter. As the stop filter walks through a token stream and removes stopwords (such as “I”, “to”, “are”, etc.), it records the positions they occupied as gaps in the token stream.

In the case of shingles, this is bad. The shingle filter sees these “skipped” positions and inserts an underscore in their place. With the default setting, the resulting tokens look like this:

$ curl -XGET 'localhost:9200/test/_analyze?analyzer=analyzer_shingle&pretty' -d 'Test text to see shingles' | grep "token"

"tokens" : [ {
    "token" : "test",
    "token" : "test text",
    "token" : "test text _",
    "token" : "test text _ see",
    "token" : "test text _ see shingles",
    "token" : "text",
    "token" : "text _",
    "token" : "text _ see",
    "token" : "text _ see shingles",
    "token" : "_ see",
    "token" : "_ see shingles",
    "token" : "see",
    "token" : "see shingles",
    "token" : "shingles",

To get rid of this behavior, we set the `enable_position_increments` property equal to false. Our new token stream will look like this:

$ curl -XGET 'localhost:9200/test/_analyze?analyzer=analyzer_shingle&pretty' -d 'Test text to see shingles' | grep "token"

"tokens" : [ {
    "token" : "test",
    "token" : "test text",
    "token" : "test text see",
    "token" : "test text see shingles",
    "token" : "text",
    "token" : "text see",
    "token" : "text see shingles",
    "token" : "see",
    "token" : "see shingles",
    "token" : "shingles",

Much better! The next, and last, filter is our shingle filter:

"filter_shingle":{
   "type":"shingle",
   "max_shingle_size":5,
   "min_shingle_size":2,
   "output_unigrams":"true"
}

Fairly self-explanatory. `min_shingle_size` and `max_shingle_size` control the length of the new shingle tokens. `min_shingle_size` has to be at least 2, but you can still emit unigrams (single-word tokens) by setting `output_unigrams` to true.
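
If you want to see how the two filters interact, here is a toy Python simulation of the stop and shingle steps. It is a sketch under stated assumptions, not Lucene's implementation: the `keep_gaps` flag is a hypothetical stand-in for `enable_position_increments`, the stop list contains only "to", and the keyword arguments simply mirror the settings in `filter_shingle`.

```python
STOPWORDS = {"to"}  # hypothetical one-word stop list
FILLER = "_"        # what the shingle filter writes into a skipped position


def stop_filter(tokens, keep_gaps):
    out = []
    for token in tokens:
        if token in STOPWORDS:
            if keep_gaps:
                out.append(None)  # remember that a position was skipped
        else:
            out.append(token)
    return out


def shingle_filter(tokens, min_size=2, max_size=5, output_unigrams=True):
    words = [FILLER if t is None else t for t in tokens]
    out = []
    for start in range(len(words)):
        # A skipped position is never emitted as a unigram of its own.
        if output_unigrams and tokens[start] is not None:
            out.append(words[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                out.append(" ".join(words[start:start + size]))
    return out


tokens = "test text to see shingles".split()
with_gaps = shingle_filter(stop_filter(tokens, keep_gaps=True))
without_gaps = shingle_filter(stop_filter(tokens, keep_gaps=False))
print(with_gaps)
print(without_gaps)
```

For this input, `with_gaps` matches the underscore-laden token list shown earlier and `without_gaps` matches the cleaned-up one.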

I'm a fan of shingles because they give you both exact-match and phrase matching. An exact match will hit all the shingled tokens and boost the score appropriately, while other queries can still hit parts of the phrase. And since shingles are stored as tokens in the index, their TF-IDF is calculated and rarer phrase matches enjoy a bigger score boost than more common phrases.
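
A small Python sketch shows why rarer phrases score higher once shingles are ordinary index terms. The four titles are invented, and the `idf` function is modelled on the classic Lucene formula, 1 + ln(N / (df + 1)); treat the exact scoring in your own cluster as an assumption, since it depends on the similarity in use.

```python
import math
from collections import Counter

# Hypothetical product titles, for illustration only.
titles = [
    "red running shoes",
    "blue running shoes",
    "green running shoes",
    "red leather wallet",
]


def bigrams(title):
    words = title.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


# Document frequency: in how many titles does each two-word shingle appear?
doc_freq = Counter()
for title in titles:
    doc_freq.update(set(bigrams(title)))


def idf(term):
    # Inverse document frequency: the fewer documents contain the term,
    # the larger the value.
    return 1.0 + math.log(len(titles) / (doc_freq[term] + 1))


print(idf("running shoes"))  # common phrase, lower weight
print(idf("red leather"))    # rare phrase, higher weight
```

Here "running shoes" appears in three of the four titles and "red leather" in only one, so a match on "red leather" carries more weight.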

I used a stopword filter in my analyzer to get rid of pesky stopwords, but there's no reason you need to. Some applications need stopwords. Similarly, I tokenized with the standard tokenizer, but a whitespace tokenizer, possibly combined with the `word_delimiter` token filter, may be more appropriate for you.