July 8, 2014 | How to

Quick Tips: regex filter buckets

When you help people build applications with Elasticsearch every day, you run into a lot of unique requirements and scenarios. All of us here at Elasticsearch have an ever-growing “bag of tricks”, and it only seems fair to share those tricks with you. We hope you find them interesting, and perhaps useful to your application.

A while ago on Twitter, someone was asking if aggregations could be used to categorize data based on irregular product codes. They were combining data from several legacy systems, so the product codes were erratic and not internally consistent.

There were two classes of product codes:

AB123: Two letters followed by three digits
A99999: One letter followed by five digits

The goal was to determine how many of each type existed, and how many were missing certain metadata tags. Many people are stumped by situations like this, since it is initially unclear how to group product codes that have nothing in common except the structural format of their ID.

Option 1: Pre-parse with Grok and Logstash

The best solution is to deal with inconsistencies like this at the input level. For example, you can build a Grok filter in Logstash to identify the two different patterns and tag them appropriately. Once tagged, it is trivial to do aggregations on the structured document data. If a new code shows up in your data that doesn’t match either pattern, Logstash will add a _grokparsefailure tag to the event.

This has the benefit of improving search results, since only properly tagged documents will be visible to search. You can go back and fix and reindex the “broken” data at your leisure, iteratively improving your search results through better-tagged data.

Option 2: Regex to the rescue

But things are not always this simple. What if you just indexed 10TB of data and realized you forgot to include appropriate pre-parsing? Re-indexing would be a pain, or potentially impossible. We need a solution that operates on your existing data.

The key is to use a regular expression and filter buckets. A filter bucket will hold all documents matching its filtering criteria. If we place a regexp filter inside the bucket, we can find all product IDs matching a certain pattern.
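To make the matching rule concrete, here is a small client-side sketch in Python. It only illustrates the idea and is not how Elasticsearch executes the filter. It rests on two assumptions about the setup: that product_code is indexed with the default analyzer, so its terms are lowercased, and that the pattern has to match the whole term (Lucene regular expressions are anchored), which `re.fullmatch` mimics here. The bucket names, the helper `bucket_for`, and the non-matching value AB1234 are made up for the illustration.

```python
import re

# Hypothetical bucket names mapped to the patterns for the two code formats.
PATTERNS = {
    "XX999": re.compile(r"[a-z]{2}[0-9]{3}"),
    "X99999": re.compile(r"[a-z][0-9]{5}"),
}

def bucket_for(product_code):
    """Return the name of the first bucket whose pattern matches, else None."""
    # Assumption: the field is analyzed, so the indexed term is lowercase.
    term = product_code.lower()
    for name, pattern in PATTERNS.items():
        # fullmatch mimics anchored matching: the whole term must fit the pattern.
        if pattern.fullmatch(term):
            return name
    return None

for code in ["AB123", "A99999", "AB1234"]:
    print(code, "->", bucket_for(code))
```

A value such as AB1234 lands in neither bucket, because a partial match is not enough.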

Once documents have been sorted into one of the filter buckets, we can apply other bucketing and metrics to derive statistics.

Let’s take a look at a very simple example. First we index some data with mixed product codes:

PUT /test/data/_bulk
{"index":{}}
{"product_code" : "AB123"}
{"index":{}}
{"product_code" : "XY345"}
{"index":{}}
{"product_code" : "AZ987"}
{"index":{}}
{"product_code" : "ZZ192"}
{"index":{}}
{"product_code" : "A99999"}
{"index":{}}
{"product_code" : "A12345"}
{"index":{}}
{"product_code" : "A98765"}
{"index":{}}
{"some_other_field" : "xyz"}
{"index":{}}
{"some_other_field" : "123"}

Then we can run a very simple aggregation to sort out the various codes:

GET /test/data/_search?search_type=count
{
  "aggs" : {
    "total_count" : {
      "global" : {}
    },
    "XX999" : {
      "filter" : {
        "regexp" : {
          "product_code" : "[a-z]{2}[0-9]{3}"
        }
      }
    },
    "X99999" : {
      "filter" : {
        "regexp" : {
          "product_code" : "[a-z]{1}[0-9]{5}"
        }
      }
    },
    "no_format" : {
      "missing" : {
        "field" : "product_code"
      }
    }
  }
}

This query returns a document count for each of the two regex patterns. The patterns use lowercase letters because, assuming the default mapping, product_code is analyzed and its terms are lowercased at index time; the regexp filter runs against those indexed terms and must match a whole term. The missing bucket, which will get a nice performance boost in Elasticsearch v1.3.0, shows all documents that don’t have any product code at all. From this base, it is easy to add extra metrics, such as the average price of each product or the average number sold each day for the last month.
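As a sketch of adding such metrics, the snippet below nests sub-aggregations inside each filter bucket. It is written as a Python dict that serializes to the JSON body of the search request. The fields price and tags are assumptions for illustration (the sample data above has neither), and `regex_bucket` is a hypothetical helper; avg and missing are standard aggregations.

```python
import json

def regex_bucket(field, pattern, sub_aggs=None):
    """Build a filter bucket holding documents whose `field` matches `pattern`."""
    bucket = {"filter": {"regexp": {field: pattern}}}
    if sub_aggs:
        # Sub-aggregations are computed only over the documents in this bucket.
        bucket["aggs"] = sub_aggs
    return bucket

# Assumed fields, not present in the sample data: `price` and `tags`.
sub_aggs = {
    "avg_price": {"avg": {"field": "price"}},
    "untagged": {"missing": {"field": "tags"}},
}

body = {
    "aggs": {
        "XX999": regex_bucket("product_code", "[a-z]{2}[0-9]{3}", sub_aggs),
        "X99999": regex_bucket("product_code", "[a-z]{1}[0-9]{5}", sub_aggs),
    }
}

# Print the JSON to send as the body of GET /test/data/_search?search_type=count
print(json.dumps(body, indent=2))
```

Each regex bucket would then report its own average price and its own count of documents without tags, which covers the original question of how many codes of each type were missing metadata.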

The takeaway tip

The key to this tip is the filter bucket. This bucket accepts any Elasticsearch filter, which means you can construct the same kind of complex filtering operations which you already use in search requests. And because these are filters, they enjoy all the performance benefits inherent to filters.

Starting in Elasticsearch version 1.3.0, you will also have access to the filters bucket (note the plural). Although functionally equivalent to using multiple filter buckets, the filters bucket simplifies the syntax for applying multiple filters at the same time.
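For comparison, here is a rough sketch of how the two regexp buckets might collapse into a single filters bucket. It assumes the request shape of an outer filters aggregation containing a filters object keyed by bucket name, so check the reference documentation for your version before relying on it. The aggregation name by_format and the helper `filters_bucket` are hypothetical.

```python
import json

def filters_bucket(field, patterns):
    """Build one `filters` aggregation with a named regexp filter per pattern."""
    named_filters = {
        name: {"regexp": {field: pattern}} for name, pattern in patterns.items()
    }
    return {"filters": {"filters": named_filters}}

body = {
    "aggs": {
        # `by_format` is a hypothetical aggregation name.
        "by_format": filters_bucket(
            "product_code",
            {"XX999": "[a-z]{2}[0-9]{3}", "X99999": "[a-z]{1}[0-9]{5}"},
        )
    }
}

# Print the JSON body for the search request.
print(json.dumps(body, indent=2))
```

Any sub-aggregations attached to by_format would be applied to every named bucket, instead of being repeated per filter.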

So next time you need to aggregate some statistics, but are stumped by the irregular or inconsistent nature of your data, remember the filter bucket (and the upcoming filters bucket).