February 18, 2014

All About Analyzers, Part Two

UPDATE: This article refers to our hosted Elasticsearch offering by an older name, Found. Please note that Found is now known as Elastic Cloud.

This is a follow-up article in which we continue to survey various analyzers, each of which showcases a very different approach to parsing text.

This is the second of two articles about analyzers. If you haven't yet read the first article on this topic, “All About Analyzers, Part 1”, it is strongly recommended that you read that first. In this article we'll be diving a little deeper into the catalog of Elasticsearch analyzers. The intention of these articles is to take users on a guided tour of some of the most useful analyzers. There's no need to remember everything here, but if you know all these tools are available you'll be able to create much better queries using Elasticsearch.

Building Custom Analyzers on Top of the Keyword Tokenizer

The keyword tokenizer is the simplest tokenizer available: it simply returns its input as a single token. Generally, if this is the only behavior desired, it is better to designate the field as index: not_analyzed. However, the keyword tokenizer is vitally useful when constructing an analyzer solely out of token filters, since it passes the full input through to the required token filter(s).

As an example, let's say we want to be able to search city names without regard for case, and also allow some leeway for users who don't usually type funky characters like 'ã'. We could index city names lowercased and with an ASCII folding filter, which converts characters like 'ã' into 'a'. We really don't want to break city names up on whitespace if we're looking for exact matches only. In this case we'll want to use a keyword tokenizer, because all the real work gets done by a 'lowercase' token filter and an 'asciifolding' token filter. So, our analyzer would look something like the example below.

{
  "analysis": {
    "analyzer": {
      "city_name": {
        "type": "custom",
        "tokenizer": "keyword",
        "filter": ["lowercase", "asciifolding"]
      }
    }
  }
}
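To make the pass-through behavior concrete, here is a rough Python sketch of what such an analyzer does to a city name. It is only an approximation: the `ascii_fold` helper is a stand-in built on Unicode decomposition, which is an assumption of this sketch and not how Elasticsearch's asciifolding filter is implemented. Characters such as 'ø' or 'ß' would not be folded by it.

```python
import unicodedata

def ascii_fold(text):
    # Approximation of 'asciifolding': decompose characters, then drop
    # the combining marks (accents, tildes, rings) that were split off.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def city_name_analyze(text):
    # Keyword tokenizer: the whole input becomes a single token.
    token = text
    # 'lowercase' filter followed by the (approximated) 'asciifolding' filter.
    token = ascii_fold(token.lower())
    return [token]

print(city_name_analyze("São Paulo"))
print(city_name_analyze("SAO PAULO"))
```

Both spellings end up as the same single token, so an exact match on the analyzed field succeeds regardless of case or accents.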

The Reverse Token Filter

The reverse token filter does what it says on the tin: it takes tokens in and reverses them. 'Cat' becomes 'taC', for instance. The primary utility of this token filter is enabling fast suffix searches. One practical use case would be to search file names by extension.

Remember, in Elasticsearch, and in most every database, the underlying data structure is some sort of binary-tree-like thing. Walking a binary tree is fast - \(\mathcal{O}(\log n)\) - when searching for a prefix or an exact match. Most other search types require a full scan of every piece of data in the tree. Elasticsearch can speed up some of these slow searches with fancy automatons, but they're still far slower. Hence our interest in turning all search problems into prefix search problems for speed and scalability.

Somewhat confusingly, to search by suffix, your query should use either the match_phrase_prefix or the prefix query type, since we're searching against a reversed version of the source text. It is highly recommended that the match_phrase_prefix query be used for this case, as it analyzes the query text, while the prefix query does not.

To see why it is generally better to use an analyzed query like match_phrase_prefix, let's take a look at this suffix search. Notice that we've created a custom analyzer with the content below.

"fnLowerReverse": {
  "type": "custom",
  "tokenizer": "keyword",
  "filter": ["lowercase", "reverse"]
}

So, our text for a filename like Foo.tar.gz will be indexed as zg.rat.oof. We'll want to perform a prefix search of zg.rat. to match the extension. The problem with the plain prefix query is that we must have our application perform this transformation automatically since the prefix query does not have an analyzer. I think we can all agree that of the two queries below, the first is far superior.

// Better way
{
    "query": {
        "match_phrase_prefix": {
            "lowerReverse": ".tar.gz"
        }
    }
}
// Worse way
{
    "query": {
        "prefix": {
            "lowerReverse": "zg.rat."
        }
    }
}

Also, you may notice that another risk in using the second query is that if you decide to change your analysis settings, your app will need to keep in step, which could be tricky or near impossible if complicated filters like the asciifolding filter are added in.
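The idea of turning a suffix search into a prefix search can be sketched in a few lines of Python. This is an illustration only: a sorted list searched with `bisect` stands in for the real term dictionary, and the helper names `fn_lower_reverse` and `suffix_search` are made up for this example.

```python
import bisect

def fn_lower_reverse(text):
    # Keyword tokenizer + 'lowercase' + 'reverse': one reversed, lowercased token.
    return text.lower()[::-1]

def suffix_search(sorted_terms, suffix):
    # Analyze the query the same way as the indexed text, then do a prefix scan.
    prefix = fn_lower_reverse(suffix)
    start = bisect.bisect_left(sorted_terms, prefix)  # binary search, O(log n)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        matches.append(term[::-1])  # un-reverse for display
    return matches

filenames = ["Foo.tar.gz", "notes.txt", "backup.TAR.GZ", "image.png"]
index = sorted(fn_lower_reverse(name) for name in filenames)

print(suffix_search(index, ".tar.gz"))
```

Because the query string goes through the same transformation as the indexed text, the caller can pass '.tar.gz' unchanged, which is the convenience the analyzed query provides.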

When to Avoid the Reverse Token Filter

While the reverse token filter is great, it is not necessarily the best option in every situation where a suffix match is needed. Let's take the case of domain names, and searching by TLD. While a suffix match could work, there's a much better strategy: breaking the domain name up into a set of component parts. So, when you encounter a problem you think could be solved by using the reverse filter, think about whether you really have a suffix problem, or if it is in fact the case that your text isn't broken up into the proper tokens. A good metric here is that you'll want to use reverse for when there is no clear place to put token boundaries. In the next section we will cover the use of patterns to break up text.

Working with Patterns

There are times when you'll want to split up structured, possibly machine-generated, text along specific boundaries. Classic examples are domain names and IP addresses, which are both highly structured. Another example would be INI files, with their key=value syntax. Working with these sorts of formats can be simplified through the use of pattern-based analysis tools.

The Pattern Tokenizer

For starters, there's the pattern tokenizer, which can work in one of two modes. In its default mode the provided regexp is used to split the source text. For instance, a pattern of '-+' would split up the source text on sequences of dashes, turning the text 'foo and bar--and baz-bot blah' into ['foo and bar', 'and baz', 'bot blah'].

The pattern tokenizer can also be flipped around to isolate tokens, rather than split them, with the group option, which specifies the index of the regexp's capture group. Let's say we wanted to pull all numbers from a piece of text. Given the text 'Foo 82 Bar 90 Bat 21' we could isolate all numbers with the pattern '[0-9]+' and by setting the group option to '0', which tokenizes all matched text. If capture groups, indicated with '()' in a regexp, are used, the group option specifies the index of the group to be matched.
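Both modes can be imitated with Python's `re` module. The sketch below is not the tokenizer's implementation. The function name `pattern_tokenize` is hypothetical, and using `group=-1` to mean "split mode" is an assumption chosen to mirror the tokenizer's default.

```python
import re

def pattern_tokenize(text, pattern, group=-1):
    regex = re.compile(pattern)
    if group >= 0:
        # Isolate mode: emit the chosen capture group of every match
        # (group 0 is the whole match).
        return [m.group(group) for m in regex.finditer(text) if m.group(group)]
    # Split mode: emit the text between matches, skipping empty pieces.
    tokens, last = [], 0
    for m in regex.finditer(text):
        if m.start() > last:
            tokens.append(text[last:m.start()])
        last = m.end()
    if last < len(text):
        tokens.append(text[last:])
    return tokens

print(pattern_tokenize("foo and bar--and baz-bot blah", "-+"))
print(pattern_tokenize("Foo 82 Bar 90 Bat 21", "[0-9]+", group=0))
```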

The Pattern Capture Token Filter

Also available is the pattern_capture token filter. This token filter differs substantially from the pattern tokenizer. For one thing, this filter's patterns parameter accepts a list of regexps, allowing you to match text in multiple ways. It also supports passing through the original tokens alongside the ones it emits via its preserve_original option.

This can be useful when you want to isolate sub-parts of other tokens. For instance, when indexing financial documents, you may want to index terms like 'FY2013' in richer ways. A search for '2013' should probably match 'FY2013', and so should a search for 'FY2013'. We can solve this tricky problem using a pattern_capture token filter.

{
  "filter": {
    "financialFilter": {
      "type": "pattern_capture",
      "patterns": ["FY([0-9]{2,4})"],
      "preserve_original": true
    }
  }
}

When working with the pattern_capture token filter it is important to remember that a token is not just a piece of text, but a piece of text and a position. The new tokens will have the same token position as the source token. In the above example, the phrase 'Truly, FY2008 was bad' will be tokenized to [“Truly”¹, “FY2008”², “2008”², “was”³, “bad”⁴]. Note that the token positions have been marked in superscript, and that “FY2008” and “2008” share the same position. This is important for the operation of phrase queries, where token position proximity determines the scope of a phrase.

Another great use for the pattern_capture filter is isolating Twitter-like hashtags and mentions, e.g. '#elasticsearch' or an '@'-prefixed username. Isolating these is fairly simple with a regexp like ([@#]\\w+). Be sure to double escape backslashes in your JSON when declaring these regexps for Elasticsearch, as a single '\' will only escape the JSON formatting. This behavior should be handled automatically by your client library, but some may misbehave.
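The relationship between captured tokens and positions can be sketched as follows. This is a simplified model, not the filter's real implementation. It assumes an earlier tokenizer has already split the phrase and dropped punctuation, and it assumes that a token matching none of the patterns is passed through unchanged even when `preserve_original` is false.

```python
import re

def pattern_capture(tokens, patterns, preserve_original=True):
    compiled = [re.compile(p) for p in patterns]
    out = []
    for position, token in enumerate(tokens, start=1):
        captures = []
        for regex in compiled:
            for match in regex.finditer(token):
                for text in match.groups():
                    if text and text != token and text not in captures:
                        captures.append(text)
        emitted = captures
        if preserve_original or not captures:
            # Keep the source token; captured tokens share its position.
            emitted = [token] + captures
        out.extend((text, position) for text in emitted)
    return out

tokens = ["Truly", "FY2008", "was", "bad"]
print(pattern_capture(tokens, ["FY([0-9]{2,4})"]))
```

In the printed result 'FY2008' and '2008' carry the same position number, which is why a phrase query for 'FY2008 was' and one for '2008 was' can both match.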
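The double escaping is easiest to see by letting a JSON library do the serialization. In this sketch the filter name `social_filter` and the handle `@someuser` are made up for illustration. Python's `re` is used only to show what the regexp matches, and its `\w` is assumed to be close enough to the Java regexp flavor for this example.

```python
import json
import re

# The regexp itself contains a single backslash.
hashtag_pattern = r"([@#]\w+)"

settings = {
    "filter": {
        "social_filter": {  # hypothetical filter name
            "type": "pattern_capture",
            "patterns": [hashtag_pattern],
            "preserve_original": True,
        }
    }
}

# Serializing to JSON doubles the backslash automatically.
payload = json.dumps(settings)
print(payload)

# What the pattern pulls out of a piece of text.
print(re.findall(hashtag_pattern, "Indexing tweets about #elasticsearch by @someuser"))
```

If you build the JSON body by hand instead of through a serializer, the doubled backslash has to be typed explicitly.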

Lastly, there is the pattern_replace token filter which simply replaces a pattern in a given token with another piece of text. This filter being straightforward, its use will be left as an exercise for the reader in the name of brevity.

Working with Synonyms

Indexing word synonyms in Elasticsearch can be accomplished with the synonym token filter. Synonyms provide a way to map one token to multiple tokens, and are subject to token position concerns similar to those of the pattern_capture filter discussed previously. A given position in the tokenized output may contain multiple tokens for multiple synonyms. The operation of the synonym token filter varies depending on the synonym rules supplied to it: it will either add or replace tokens based on these rules.

A simple example of using synonyms would be for a recipe website. One might want to synonymize some UK spellings of words to their American equivalents. For example, the herb 'coriander' is known as 'cilantro' in the USA. We can make searches for both 'coriander' and 'cilantro' work via Elasticsearch synonyms. To achieve this, we'll need to start by creating a synonyms dictionary for use by our analyzer. Elasticsearch supports two different formats for synonym dictionaries, Solr and WordNet. We'll use the simpler Solr syntax in our example dictionary file below.

# This is a Solr synonym file
# Save this in a file uk_us_food_names.txt

# The '=>' will perform replacement of the original token, replacing the token on the left with the one on the right
coriander => cilantro
rocket => arugula
chips => fries

# using a comma ',' will cause both tokens to show up in the stream at the same position
taters, potatoes
candy, sweets

# This syntax will replace either of the terms on the left with the one on the right
mayo, mayonnaise => sandwich lube

# This syntax will replace the term on the left with both terms on the right
knife => slicer, chopper

If you plan on aggregating on a field with synonyms in place, strongly consider using the '=>' syntax to fully replace tokens. This can prevent tons of duplicates showing up in aggregates. Additionally, from a query perspective, so long as the query is run through the same synonym token filter as the document (automatic when using many query types), there is no advantage to having both 'sides' of the synonym indexed. A single token can capture the meaning of the word just fine. However, if using non-analyzed filters or queries, it may be useful to have both terms in the index.

One other advantage of the '=>' synonym filter is that many relationships are not bidirectional. All carrots are vegetables, for instance, but not all vegetables are carrots. So, a mapping like 'carrot => vegetable, carrot' is useful, while 'carrot, vegetable' will cause massive headaches for users.
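The difference between the ',' and '=>' rule styles can be illustrated with a small Python model. This is a deliberate simplification and not the synonym filter's actual algorithm: it only handles single-word terms, it treats comma-only rules as fully expanding, and the function names are made up for this sketch.

```python
def parse_solr_synonyms(lines):
    rules = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by those on the right.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalent terms: every term expands to the whole group.
            sources = targets = [t.strip() for t in line.split(",")]
        for source in sources:
            rules[source] = targets
    return rules

def apply_synonyms(tokens, rules):
    out = []
    for position, token in enumerate(tokens, start=1):
        # All synonyms for a token are emitted at that token's position.
        for term in rules.get(token, [token]):
            out.append((term, position))
    return out

rules = parse_solr_synonyms([
    "# sample rules",
    "coriander => cilantro",
    "taters, potatoes",
    "carrot => vegetable, carrot",
])

print(apply_synonyms(["fresh", "coriander"], rules))
print(apply_synonyms(["taters"], rules))
print(apply_synonyms(["carrot"], rules))
print(apply_synonyms(["vegetable"], rules))
```

Note that 'vegetable' passes through untouched under the one-way 'carrot => vegetable, carrot' rule, so a search for vegetables does not get rewritten into a search for carrots.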

These synonym lists need to actually live somewhere to be used by an analyzer of course. There are two different methods by which synonyms can be managed in Elasticsearch. For very small synonym lists, synonyms can be set in the index settings, using a syntax like the one below.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "product_name": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["my_synonyms"]
        }
      },
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "foo => bar",
            "baz => bot, blah"
          ]
        }
      }
    }
  }
}

For larger synonym lists, more than a few hundred lines for instance, it is much better to use the synonyms_path option rather than the synonyms option. This option points to a file local to the Elasticsearch instance, which is superior for two reasons. First, having large amounts of data in the 'settings' for an index can slow down a cluster, since index settings are constantly being replicated between nodes in the cluster. Second, having large amounts of data returned via the settings API is quite cumbersome for clients.

Synonym files must be replicated exactly between hosts. If this is not carefully managed, bad things can happen. At Found, this is done automatically when you upload a custom bundle with a synonyms file in it. If you manage your own servers, be sure to use a configuration management tool like Ansible, Chef, or Puppet to synchronize these files between hosts; otherwise you may find that synonyms are interpreted differently depending on which host is handling which requests.

Additionally, when planning to use synonyms it is important to understand that if the synonyms are updated, all old documents will need to be re-indexed to ensure that they accurately reflect the new synonyms. For this reason, be careful when planning out a synonyms implementation.

ASCII Folding

The asciifolding token filter is often useful when dealing with Unicode text. It is sometimes desirable to collapse Unicode characters into their nearest ASCII equivalents, for instance normalizing the characters 'aáåä' into 'a'. The asciifolding token filter accomplishes this, converting characters in tokens into the ASCII range (the Unicode Basic Latin block) wherever an equivalent exists. Be careful when using it with international users, as it can produce confusing results.

In Elasticsearch 1.1.0 this will support a preserve_original setting, which will index each token twice, once folded and once as the original value. This can provide the best of both worlds, at the expense of some storage costs.
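The effect of such a preserve_original setting can be pictured with the short sketch below. It is an approximation under stated assumptions: folding is imitated with Unicode decomposition, not the filter's own mapping table, and the emission order (folded token first, then the original) is an arbitrary choice made for this example.

```python
import unicodedata

def fold(token):
    # Stand-in for ASCII folding: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def ascii_folding(tokens, preserve_original=False):
    out = []
    for position, token in enumerate(tokens, start=1):
        folded = fold(token)
        out.append((folded, position))
        if preserve_original and folded != token:
            # The original shares the folded token's position.
            out.append((token, position))
    return out

print(ascii_folding(["\u00c5re", "cafe"], preserve_original=True))
```

Only tokens that actually change when folded are emitted twice, which is where the extra storage cost comes from.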

Handling Rich Taxonomies with the Path Hierarchy Tokenizer

The path_hierarchy tokenizer enables one of the more unorthodox and exciting capabilities of Elasticsearch: handling hierarchical taxonomies. This tokenizer is simple. It takes a string like '/a/path/somewhere' and tokenizes it into '/a', '/a/path', '/a/path/somewhere'. This simple technique is useful in a variety of situations.

The most obvious use of this tokenizer is to make aggregating file/directory names easy. By faceting on these tokens you can easily see, for instance, how many files are under the path '/usr/bin'. In fact, by aggregating on a field indexed with the path_hierarchy tokenizer and sorting by frequency, one will get back the parent directory containing the most files.

The path_hierarchy tokenizer isn't limited to breaking up file-path-like things, however. It can be used with domain names, for instance, by setting its delimiter option to '.' and setting reverse to true, since domain names are organized right to left in terms of hierarchy.
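A small Python model shows both the forward and the reversed behavior of path_hierarchy, along with the counting trick described above. The function below is a sketch written for this article's examples. It ignores the other options the real implementation offers, and the sample file paths are invented.

```python
from collections import Counter

def path_hierarchy(text, delimiter="/", reverse=False):
    parts = text.split(delimiter)
    if reverse:
        # Right-to-left hierarchy: drop leading components one at a time.
        tokens = [delimiter.join(parts[i:]) for i in range(len(parts))]
    else:
        # Left-to-right hierarchy: grow the path one component at a time.
        tokens = [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]
    return [t for t in tokens if t]  # skip the empty token before a leading '/'

print(path_hierarchy("/a/path/somewhere"))
print(path_hierarchy("www.example.com", delimiter=".", reverse=True))

# Counting tokens across documents mimics aggregating on the field.
files = ["/usr/bin/python", "/usr/bin/vim", "/usr/lib/libc.so"]
counts = Counter(tok for f in files for tok in path_hierarchy(f))
print(counts["/usr/bin"])
print(counts.most_common(1))
```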

When All Else Fails

This article is by no means a complete accounting of Elasticsearch analyzers. We didn't even have the space to talk about character filters, or to cover other interesting analysis tools, like the compound word token filters. It is highly recommended that readers take the time to scan the Elasticsearch docs for analysis to at least have some basic knowledge of what other tools are out there.

There is no 'complete' set of analyzers to learn. The ones bundled with Elasticsearch are only common solutions to common problems. You may find that your specific application needs additional analysis that cannot be accomplished with the suite of analyzers included in Elasticsearch.

When you run out of pre-built options there are two basic solutions: 1) do some of your analysis in your application code, or 2) write a custom Elasticsearch analyzer. The first option isn't as bad as it may sound. If, for instance, your source data is stored in some wonky format that can't be parsed by Elasticsearch's analyzers, there's no shame in converting it before it goes into Elasticsearch. If, however, you have multiple applications hitting Elasticsearch, and especially if you need queries analyzed via the same process as data is indexed, I'd strongly consider writing a proper analyzer yourself. Lastly, remember that there are a number of plugins for Elasticsearch and Lucene that extend their analysis capabilities. Your problem may already have been solved by someone else. Happy searching!