Searching with Shingles

In this article, I want to introduce shingles. Shingles are effectively word n-grams: given a stream of tokens, the shingle filter creates new tokens by concatenating adjacent terms.

For example, given the phrase “Shingles is a viral disease”, a shingle filter might produce:

Shingles is
is a
a viral
viral disease

The shingle filter allows you to adjust min_shingle_size and max_shingle_size, so you can create new shingle tokens of any size.
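
To make the idea concrete, here is a minimal Python sketch of what shingling does to a token list. It is not the Lucene implementation; the `shingles` function and its parameters are names I made up for illustration, and it splits on whitespace rather than using a real tokenizer.

```python
def shingles(tokens, min_size=2, max_size=2):
    """Return word n-grams (shingles) built from min_size..max_size adjacent tokens."""
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            # Only emit windows that fit entirely inside the token list.
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


tokens = "Shingles is a viral disease".split()
print(shingles(tokens))
# ['Shingles is', 'is a', 'a viral', 'viral disease']

# Widening the size range adds three-word shingles alongside the two-word ones.
print(shingles(tokens, min_size=2, max_size=3))
```

With both sizes set to 2 you get the four tokens listed above; raising the maximum to 3 adds the three-word windows as well.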

Do you see why these are awesome? Shingles effectively give you the ability to pre-bake phrase matching. By building phrases into the index, you avoid assembling them at query time and save some processing time.

The downside is that you have larger indices and potentially more memory usage if you try to facet/sort on the shingled field.

You can get similar functionality from nGrams, but the overhead is even higher since a single phrase can generate a large number of nGrams. In many cases, search relevance is perfectly fine using shingles over nGrams. Subjectively, I’d argue that shingles give better results.
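
A rough way to see the overhead difference is to count tokens. The sketch below assumes the nGrams in question are character n-grams built per word, and the gram sizes of 2 and 3 are arbitrary picks for illustration; both helper functions are hypothetical, not Elasticsearch APIs.

```python
def word_shingles(text, size=2):
    """Two-word shingles over a whitespace-split phrase."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def char_ngrams(text, min_n=2, max_n=3):
    """Character n-grams of each word, for every length from min_n to max_n."""
    grams = []
    for word in text.split():
        for n in range(min_n, max_n + 1):
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


phrase = "shingles is a viral disease"
print(len(word_shingles(phrase)))  # 4
print(len(char_ngrams(phrase)))    # 32
```

Even on a five-word phrase the character n-grams outnumber the shingles eight to one, and the gap grows with longer words and wider gram ranges.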

The only feature you lose is the ability to tolerate spelling mistakes…but there are better ways to handle that behavior (which I’ll cover in a later article).

Using Shingles

Ok, let’s dive into how shingles are actually used. First, check out this gist – I’ve written out a sample mapping that you can create in your Elasticsearch cluster.

The mapping is pretty simple: a single field named `title`, an analyzer, and two filters. Let’s work through this piece by piece.

"mappings":{
   "product":{
      "properties":{
         "title":{
            "search_analyzer":"analyzer_shingle",
            "index_analyzer":"analyzer_shingle",
            "type":"string"
         }
      }
   }
}

The mapping section is simple. We define a type (`product`) and a field (`title`). The field specifies the `analyzer_shingle` analyzer for both search and index.

The fun starts to happen in the analyzer:

"analyzer_shingle":{
   "tokenizer":"standard",
   "filter":["standard", "lowercase", "filter_stop", "filter_shingle"]
}

ES first tokenizes an input text, then filters. So this analyzer first applies the standard tokenizer, then walks through the standard filter, lowercase filter and our two custom filters: `filter_stop` and `filter_shingle`.

First, `filter_stop`:

"filter_stop":{
   "type":"stop",
   "enable_position_increments":"false"
}

`Filter_stop` isn’t actually a custom filter – we just need to adjust a default setting of the `stop` filter. As the stop filter walks through a token stream and removes stopwords (such as “I”, “to”, “are”, etc), it records these locations in the token stream.

In the case of shingles, this is bad. The shingle filter will see these “skipped” tokens and insert an underscore in their place. Resulting tokens will look like this:

$ curl -XGET 'localhost:9200/test/_analyze?analyzer=analyzer_shingle&pretty' -d 'Test text to see shingles' | grep "token"

"tokens" : [ {
    "token" : "test",
    "token" : "test text",
    "token" : "test text _",
    "token" : "test text _ see",
    "token" : "test text _ see shingles",
    "token" : "text",
    "token" : "text _",
    "token" : "text _ see",
    "token" : "text _ see shingles",
    "token" : "_ see",
    "token" : "_ see shingles",
    "token" : "see",
    "token" : "see shingles",
    "token" : "shingles",

To get rid of this behavior, we set the `enable_position_increments` property equal to false. Our new token stream will look like this:

$ curl -XGET 'localhost:9200/test/_analyze?analyzer=analyzer_shingle&pretty' -d 'Test text to see shingles' | grep "token"

"tokens" : [ {
    "token" : "test",
    "token" : "test text",
    "token" : "test text see",
    "token" : "test text see shingles",
    "token" : "text",
    "token" : "text see",
    "token" : "text see shingles",
    "token" : "see",
    "token" : "see shingles",
    "token" : "shingles",

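If you want to play with the two behaviors without a cluster, here is a toy Python model of them. It is only a sketch: the one-word stopword list, the `None` placeholder for a recorded gap, and both function names are my own assumptions, and the real Lucene filters handle more edge cases than this does.

```python
STOPWORDS = {"to"}  # tiny illustrative stopword list


def remove_stopwords(words, keep_gaps):
    """Drop stopwords; when keep_gaps is True, leave None where one was removed."""
    out = []
    for word in words:
        if word not in STOPWORDS:
            out.append(word)
        elif keep_gaps:
            out.append(None)
    return out


def shingle(slots, min_size=2, max_size=5, filler="_"):
    """Emit each unigram followed by its shingles, writing the filler for gaps."""
    tokens = []
    for start, first in enumerate(slots):
        if first is not None:
            tokens.append(first)  # a gap never becomes a unigram
        for size in range(min_size, max_size + 1):
            window = slots[start:start + size]
            if len(window) == size:  # skip windows that run off the end
                tokens.append(" ".join(filler if s is None else s for s in window))
    return tokens


words = "Test text to see shingles".lower().split()
print(shingle(remove_stopwords(words, keep_gaps=True)))
print(shingle(remove_stopwords(words, keep_gaps=False)))
```

The first call prints the 14 tokens with the `_` filler and the second prints the 10 clean tokens, matching the two listings above.
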
Much better! The next, and last, filter is our shingle filter:

"filter_shingle":{
   "type":"shingle",
   "max_shingle_size":5,
   "min_shingle_size":2,
   "output_unigrams":"true"
}

Fairly self-explanatory. `min_shingle_size` and `max_shingle_size` control the length of the new shingle tokens. `min_shingle_size` has to be at least 2, but you can enable the output of unigrams (aka single-word tokens) with a separate setting, `output_unigrams`.
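
For reference, here is a sketch that stitches the fragments above into one index-creation body and checks that the analyzer only names filters that are actually defined. The nesting under `settings.analysis` is my assumption about how the pieces fit together (the full mapping lives in the gist, which isn't reproduced here), and the stop filter is keyed as `filter_stop`, the name the analyzer refers to. Option names are kept exactly as written in this article; newer Elasticsearch releases have changed some of them, so check against your version.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "analyzer_shingle": {
                    "tokenizer": "standard",
                    "filter": ["standard", "lowercase", "filter_stop", "filter_shingle"],
                }
            },
            "filter": {
                "filter_stop": {"type": "stop", "enable_position_increments": "false"},
                "filter_shingle": {
                    "type": "shingle",
                    "max_shingle_size": 5,
                    "min_shingle_size": 2,
                    "output_unigrams": "true",
                },
            },
        }
    },
    "mappings": {
        "product": {
            "properties": {
                "title": {
                    "search_analyzer": "analyzer_shingle",
                    "index_analyzer": "analyzer_shingle",
                    "type": "string",
                }
            }
        }
    },
}

# Sanity check: every filter the analyzer names is either one of the
# built-ins used here or defined in the "filter" section.
analysis = index_body["settings"]["analysis"]
builtin = {"standard", "lowercase"}
used = analysis["analyzer"]["analyzer_shingle"]["filter"]
missing = [f for f in used if f not in builtin and f not in analysis["filter"]]
print(missing)  # []

print(json.dumps(index_body, indent=2))
```

A mismatch between the names in the analyzer's `filter` list and the keys in the `filter` section is an easy mistake to make, which is why the check is there.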

I'm a fan of shingles because they give you both exact-match and phrase matching. An exact match will hit all the shingled tokens and boost the score appropriately, while other queries can still hit parts of the phrase. And since shingles are stored as tokens in the index, their TF-IDF is calculated and rarer phrase matches enjoy a bigger score boost than more common phrases.
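
To illustrate the rarity effect, here is a small sketch that computes an inverse document frequency for two-word shingles. The product titles are made up, and the formula `1 + ln(N / (df + 1))` is the classic Lucene-style IDF, used here as an assumption; the similarity your Elasticsearch version applies may differ.

```python
import math


def bigrams(title):
    """Set of two-word shingles in a lowercased, whitespace-split title."""
    words = title.lower().split()
    return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}


# Hypothetical product titles, purely for illustration.
titles = [
    "red running shoes",
    "blue running shoes",
    "running shoes for trails",
    "red leather wallet",
]

# Document frequency: in how many titles does each shingle appear?
doc_freq = {}
for title in titles:
    for s in bigrams(title):
        doc_freq[s] = doc_freq.get(s, 0) + 1


def idf(shingle):
    """Classic Lucene-style IDF (an assumption, see the note above)."""
    return 1 + math.log(len(titles) / (doc_freq[shingle] + 1))


print(idf("running shoes"))   # appears in 3 titles, so 1 + ln(4/4) = 1.0
print(idf("leather wallet"))  # appears in 1 title, so 1 + ln(4/2), about 1.69
```

The shingle that shows up in only one title gets the larger weight, which is the boost for rarer phrases described above.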

I used a stopword filter in my analyzer to get rid of pesky stopwords, but there's no reason you need to. Some applications need stopwords. Similarly, I tokenized with Standard, but a Whitespace or Word-Delimited tokenizer may be more appropriate for you.
