All About Analyzers, Part One

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

Introduction

Choosing the right analyzer for an Elasticsearch query can be as much art as science. Analyzers are the special algorithms that determine how a string field in a document is transformed into terms in an inverted index. In this article we'll survey various analyzers, each of which showcases a very different approach to parsing text.

Ten tokenizers, thirty-one token filters, and three character filters ship with the Elasticsearch distribution: a truly overwhelming number of options. This number can be increased further still through plugins, making the choices even harder to wrap one's head around. Combinations of these tokenizers, token filters, and character filters create what's called an analyzer. There are eight standard analyzers defined, but really, they are simply convenient shortcuts for arranging tokenizers, token filters, and character filters yourself. While reaching an understanding of this multitude of options may sound difficult, becoming reasonably competent in the use of analyzers is merely a matter of time and practice. Once the basic mechanisms behind analysis are understood, these tools are relatively easy to reason about and compose.

The Components of an Analyzer

Inside an analyzer is a small processing pipeline consisting of the following phases: 1) Character filtering, 2) Tokenization, and 3) Token Filtering. The ultimate goal of an analyzer is, of course, to convert a string into a series of tokens. An example analyzer is illustrated in the diagram below; try to follow along with it as you read the remainder of this paragraph. The execution flow starts with a string entering the analyzer. This string is first run through optional character filters, each of which transforms the string in a specific way, say lowercasing the text or substituting words, and outputs a transformed string. The string output of the character filters is then passed into a tokenizer, the only required component in an analyzer, which emits a list of tokens. Each token contains both a string value and a position number indicating where in the token stream it is located. Finally, these tokens are optionally passed through token filters which can further alter the tokens.

Analyzer Pipeline

Elasticsearch ships with a handful of default analyzers. Custom analyzers can be configured in the index settings, either when an index is created or later via the settings API (the index must be closed before its analysis settings can be updated), or as node-wide defaults in the Elasticsearch configuration. The configuration for an example custom analyzer can be seen in the code sample below.

PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "customHTMLSnowball": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop",
            "snowball"
          ]
        }
      }
    }
  }
}

The intent of the analyzer above is as follows:

  • Remove all HTML tags from the source text via the html_strip char filter.
  • Break up the text along word boundaries and remove punctuation with the standard tokenizer.
  • Lowercase all the tokens.
  • Remove tokens that are stopwords, such as 'the', 'and' and other irrelevant common words.
  • Stem all the tokens using the snowball token filter.

Let's follow the data flow of this example analyzer given the source text "The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>." If we pipe this text through the analyzer, it takes the path shown in the diagram that follows.

A Custom Analyzer
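
To see the pipeline in action, the analyzer can be exercised through the _analyze API once the settings above are applied to my-index. The request below is a sketch: older Elasticsearch versions accept the text as a plain request body or as a query parameter, while newer versions expect a JSON body, so adjust to your version. The output is abbreviated and illustrative.

GET /my-index/_analyze?analyzer=customHTMLSnowball
The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>.

// Output (abbreviated): tags stripped, stop words dropped, remaining tokens stemmed
{
  "tokens": [
    {"token": "two", "position": 2, ...},
    {"token": "lazi", "position": 3, ...},
    {"token": "dog", "position": 4, ...},
    ...
  ]
}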

Picking an Analyzer

The hardest part of building an analyzer is determining the correct components to use. In the following sections we'll build our familiarity with analyzers by looking at some common use cases.

Searching Natural Language

Stemming

When searching bodies of natural language, things like blog posts, magazine articles, and legal documents, it is often good to use a stemming token filter, such as the included stemmer, snowball, kstem, or porter_stem token filters. These filters help normalize the spelling of words, for instance translating the tokens 'sing', 'sings', and 'singing' all into the single stem 'sing', as the snowball analyzer does. You can see this in action by executing the following request to test out the analyzer. Also notice that the stop words 'they' and 'are' are dropped by the default snowball analyzer.

GET http://localhost:9200/_analyze?text=I%20sing%20he%20sings%20they%20are%20singing&analyzer=snowball
// Output (abbreviated)
{
  "tokens": [
    {"token": "i", "position": 1, ...},
    {"token": "sing", "position": 2, ...},
    {"token": "he", "position": 3, ...},
    {"token": "sing", "position": 4, ...},
    {"token": "sing", "position": 7, ...}
  ]
}

Stemming algorithms exploit the fact that searches rarely need to distinguish between "singing" and "sings". It may, however, be the case that such differences do matter. As we'll see, for searches involving content like human names, product names, and so on, stemming can make little sense. As a simple example, the phrases 'fly fishing' and 'flying fish' both stem to 'fli fish' with a snowball stemmer; a terrible result. In a travel guide this could be a real problem. Stemming is not the be-all and end-all of tokenization strategies. In cases where exact spelling matters it may be better to do without a stemmer and use a simpler strategy, only filtering the tokens through a lowercase filter, as sketched below. Simpler strategies also lend themselves to some of the more exotic query types, like fuzzy queries, which can provide unexpected results when coupled with complex analyzers. Finally, remember that you can combine multiple queries with tools like the bool and dis_max queries.
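
For illustration, a minimal lowercase-only analyzer of the kind described above might be defined as follows; the analyzer name is purely illustrative.

PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "lowercaseOnly": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Tokens pass through unchanged apart from case, so exact spellings like 'singing' and 'sings' remain distinct terms in the index.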

Internationalization

Some languages, like German, Finnish, and Korean, use compound words, which can be problematic for search engines. Where in English one might use the phrase "Ballet Dancer", in German the words "Ballett" and "Tänzerin" are combined into the single word "Balletttänzerin". A search for "Ballett" would not match the compound word unless it is split on the proper boundary. To solve this issue the words must be decompounded, for instance via a plugin like the Elasticsearch Word Decompounder Analysis Plugin, which splits such words up along those boundaries.
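
As a sketch of what decompounding looks like in practice, the dictionary_decompounder token filter that ships with Elasticsearch splits tokens against a supplied word list. The filter and analyzer names, and the tiny word list, are just for illustration; the plugin mentioned above works along the same lines but can use hyphenation patterns instead of an exhaustive list.

PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["ballett", "tänzerin"]
        }
      },
      "analyzer": {
        "german_compound": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  }
}

Analyzed this way, "Balletttänzerin" yields the tokens "balletttänzerin", "ballett", and "tänzerin", so a query for "Ballett" now matches. Note that the lowercase filter runs before the decompounder because the word list is lowercase.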

Optimizing Phrase Searches with Shingles

In an application where phrases are frequently used in queries, the use of a shingle analyzer can both boost performance and improve result quality. Whereas the analyzers we've seen up to now break up text on either letter or word boundaries, the shingle token filter takes the token stream and emits groups of adjacent words as single tokens. It is, essentially, an nGram token filter that works at the word level instead of the character level. So the sentence "Beware the Ides of March", processed using a shingle token filter with max_shingle_size: 3 and min_shingle_size: 1, would yield the tokens "Beware", "Beware the", "Beware the Ides", "the", "the Ides", "the Ides of", and so on.
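
In Elasticsearch this is configured as a shingle token filter inside a custom analyzer. A sketch follows; in practice the single-word tokens usually come from the filter's output_unigrams option (enabled by default) rather than from a min_shingle_size of 1, and all names here are illustrative.

PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "filter": {
        "my_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_shingles"]
        }
      }
    }
  }
}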

Generating shingles increases the amount of storage and memory needed for an index, but comes with two benefits. First, performance for phrase queries can be significantly improved, since a phrase can be matched against a single multi-word term rather than having to be assembled from multiple individual terms at query time. Second, scoring becomes more accurate, since the TF/IDF statistics are computed for the actual phrase rather than for its individual terms. For more on searching with shingles, check out the post titled Search with Shingles from the Elasticsearch blog, Slow Queries and Common Words from the Large Scale Search blog, and this Stack Overflow response from Lucene developer Robert Muir.
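
A common pattern is to index the same text both plainly and with shingles, then let the shingle field reward documents where the query words appear together and in order. A sketch, assuming a title field that also has a title.shingles sub-field analyzed with shingles (like the analyzer sketched above):

GET /my-index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "title": "beware the ides of march" }
      },
      "should": {
        "match": { "title.shingles": "beware the ides of march" }
      }
    }
  }
}

Documents only have to match the plain field, but those that also match multi-word shingles are scored higher.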

While shingles can provide a massive performance boost, they can also cause some performance problems of their own, primarily because there are far more unique combinations of words than there are individual words. Setting omit_term_freq_and_positions to true for a shingle field can greatly reduce the size of the index and speed up searches. Since a phrase search usually matches a single shingle, there isn't a huge amount of value in maintaining position info. That being said, phrase searches longer than the maximum shingle size become impossible without this data, so be sure that an appropriate shingle size is set.

Searching Textual Tokens

Some types of text are harder to classify. Usernames on a social site, category names on a shopping site, tags for blog posts, and other such items require different strategies from the approaches to natural language we previously discussed. Techniques such as stemming can lead to false positives, and even merge terms that should be kept separate. In these cases it's often important to preserve the uniqueness of the tokens.

Searching Tokens Exactly

The simplest strategy for dealing with such tokens is to disable analysis completely by specifying "index": "not_analyzed" in the mapping for that field. The field will then only match the exact text as stored in the document. As an example of using a not_analyzed field, let's assume we have a CMS with various roles for its users: "writer", "publisher", and "admin". We may want to be able to search users by role. In this case the user interface will probably have a dropdown select box that can be used to pick an exact user type, eliminating the possibility of typos or much variation in input. This is an ideal situation in which to mark the field as not_analyzed.
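
A sketch of such a mapping and a query against it might look like the following; the user type and role field are illustrative (and recent Elasticsearch versions drop mapping types and use the keyword field type instead of not_analyzed strings).

PUT /my-index/_mapping/user
{
  "properties": {
    "role": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}

// A term query then matches the stored value exactly
GET /my-index/_search
{
  "query": {
    "term": { "role": "publisher" }
  }
}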

Searching Tokens Imprecisely

In the previous example the search use case was perfectly suited to the constrained input imposed by the GUI. Such a system would fare poorly when faced with more freeform user input. Usernames are a common use case that serves well as an illustration of this point. In a username search, once again, stemming algorithms make little sense. We plainly do need some analysis, however. At the very least, a query for "CmdrTaco" should match the user "cmdrtaco"; in other words, a lowercase token filter seems to be in order at the very minimum. We might also want to show the user a list of the best possible matches; it'd be nice if a search for 'cmrdtaco' still matched, so handling misspellings would be a great feature here.

For a username search such as we've described, there are two approaches that are good starting points. A trivial analyzer composed of a keyword tokenizer, which emits the entire input stream as a single token, combined with only a lowercase token filter, will provide the required output. This output handles match queries, prefix queries, and match queries with a fuzziness value well. For more information on fuzzy searches specifically, see How to Use Fuzzy Searches in Elasticsearch.
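
A sketch of that first approach, paired with a match query that tolerates misspellings; the analyzer name and the username field are assumptions, and the field must be mapped with this analyzer for the query to behave as described.

PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "username_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

// 'cmrdtaco' is within edit distance 2 of 'cmdrtaco', so this still matches
GET /my-index/_search
{
  "query": {
    "match": {
      "username": {
        "query": "cmrdtaco",
        "fuzziness": 2
      }
    }
  }
}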

Using NGrams for Advanced Token Searches

An additional strategy might be to index usernames with an nGram or edgeNGram tokenizer. These tokenizers break up text into configurably sized tuples of letters. For instance, the word "news", run through a min_gram: 1, max_gram: 2 nGram tokenizer, would be broken up into the tokens "n", "e", "w", "s", "ne", "ew", and "ws". This sort of analysis does really well when it comes to imprecise matching. Remember that by default a match query will analyze the input query, get a list of terms back out, and finally run a giant "OR" query of the terms involved. This means that the documents with the largest number of matching terms will be returned first. While this does mean a large number of false positives will be returned, it also means that the most relevant results will be at the top.
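
A configuration along those lines might look like this; the tokenizer and analyzer names are illustrative, and newer Elasticsearch versions spell the tokenizer types ngram and edge_ngram.

PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "tokenizer": {
        "username_ngram": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 2
        }
      },
      "analyzer": {
        "username_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "username_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

A plain match query against a field indexed this way ranks the usernames that share the most grams with the query highest.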

Using Phonetic Analyzers

Phonetic analyzers are a powerful tool for dealing with things like real names and usernames. These can be used by installing the elasticsearch-analysis-phonetic plugin. The goal of a phonetic analyzer, like metaphone or soundex, is to convert the source text into a series of tokens that represent syllabic sounds. This lets users find things that sound like the query text. For instance, when searching a list of names, one might want a query for 'schmit' to match the user with the name 'schmidt'. The spelling difference doesn't matter because both words sound the same.

An example of a search with a metaphone component can be seen in this example on Play. Notice that the search for 'schmit' matches both the names 'schmidt' and 'schmitt'. The use of an exact matching 'not_analyzed' field is critical for ensuring that exact matches always come first. In the second query in the example, where there is an exact match for 'schmidt', the exact match winds up being scored much higher to ensure its primacy in the results. For a more complete treatment of this sort of search see this forum thread on name searches by Elasticsearch developer Clinton Gormley.
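
The mf_analyzer referenced in the mapping below isn't spelled out in this article, but with the phonetic plugin installed it could be defined roughly like this; the encoder choice and filter name are assumptions.

PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "filter": {
        "my_metaphone": {
          "type": "phonetic",
          "encoder": "metaphone",
          "replace": false
        }
      },
      "analyzer": {
        "mf_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_metaphone"]
        }
      }
    }
  }
}

With replace set to false, both the original token and its phonetic encoding are indexed, so queries can match on sound while the original spelling remains searchable.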

Combining Multiple Approaches

There's no reason why only a single approach need be taken for a given search. Using Elasticsearch's multi_field option, one may index a single _source field multiple times, as shown in the example below, taken from the metaphone Play mentioned above.

// Document mapping
{
  "properties": {
    "name": {
      "type": "multi_field",
      "fields": {
        "name_metaphone": {
          "type": "string",
          "analyzer": "mf_analyzer"
        },
        "name_exact": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

In this example we can search either the name_metaphone or the name_exact field to get different results for the same incoming data. There's no reason to restrict oneself to a single approach: by using bool and dis_max queries, multiple query types may be combined to suit your specific dataset.
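
As a sketch of such a combination, a dis_max query could search the phonetic sub-field while boosting exact matches. The field paths assume the multi_field sub-fields are addressed as name.name_metaphone and name.name_exact; adjust them if your mapping exposes the sub-fields by their short names.

GET /my-index/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "name.name_metaphone": "schmit" } },
        { "match": { "name.name_exact": { "query": "schmit", "boost": 10 } } }
      ]
    }
  }
}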

Going Forward

There's much more to cover on this topic! If you'd like to learn more, check out the second article in this series: All About Analyzers, Part 2. Becoming proficient with analyzers requires a lot of experimentation, and as far as that goes, there's no quicker way to play around with different analyzers than Play, our Elasticsearch 'fiddle'. In fact, I used it to test out the examples in this article.