July 8, 2014

Quick Tips: regex filter buckets

When you help people build applications with Elasticsearch every day, you run into a lot of unique requirements and scenarios. All of us here at Elasticsearch have an ever-growing “bag of tricks”, and it only seems fair to share those tricks with you. We hope you find them interesting, and perhaps useful to your application.

A while ago on Twitter, someone was asking if aggregations could be used to categorize data based on irregular product codes. They were combining data from several legacy systems, so the product codes were erratic and not internally consistent.

There were two classes of product codes:

AB123: Two letters followed by three digits
A99999: One letter followed by five digits

The goal was to determine how many of each type existed, and how many were missing certain meta-data tags. Many people are stumped by situations like this, since it is initially unclear how to categorize two groups of product codes that are only related to each other by the structural format of their ID.

Option 1: Pre-parse with Grok and Logstash

The best solution is to deal with inconsistencies like this at the input level. For example, you can build a Grok filter in Logstash to identify the two different patterns and tag them appropriately. Once tagged, it is trivial to do aggregations on the structured document data. If a new product code shows up that matches neither pattern, Logstash will add a _grokparsefailure tag to the event, so the outliers are easy to find.

This has the benefit of improving search results, since only properly tagged documents will be visible to search. You can go back and fix or reindex the “broken” data at your leisure and iteratively improve your search results through better-tagged data.

Option 2: Regex to the rescue

But things are not always this simple. What if you just indexed 10TB of data and realized you forgot to include appropriate pre-parsing? Re-indexing would be a pain, or potentially impossible. We need a solution that operates on your existing data.

The key is to use a regular expression and filter buckets. A filter bucket will hold all documents matching its filtering criteria. If we place a regexp filter inside the bucket, we can find all product IDs matching a certain pattern.

Once documents have been sorted into one of the filter buckets, we can apply other bucketing and metrics to derive statistics.

Let’s take a look at a very simple example. First we index some data with mixed product codes:

PUT /test/data/_bulk
{"index":{}}
{"product_code" : "AB123"}
{"index":{}}
{"product_code" : "XY345"}
{"index":{}}
{"product_code" : "AZ987"}
{"index":{}}
{"product_code" : "ZZ192"}
{"index":{}}
{"product_code" : "A99999"}
{"index":{}}
{"product_code" : "A12345"}
{"index":{}}
{"product_code" : "A98765"}
{"index":{}}
{"some_other_field" : "xyz"}
{"index":{}}
{"some_other_field" : "123"}

Then we can run a very simple aggregation to sort out the various codes:

GET /test/data/_search?search_type=count
{
  "aggs" : {
    "total_count" : {
      "global" : {}
    },
    "XX999" : {
      "filter" : {
        "regexp" : {
          "product_code" : "[a-z]{2}[0-9]{3}"
        }
      }
    },
    "X99999" : {
      "filter" : {
        "regexp" : {
          "product_code" : "[a-z]{1}[0-9]{5}"
        }
      }
    },
    "no_format" : {
      "missing" : {
        "field" : "product_code"
      }
    }
  }
}

This query returns a document count for each of the two regex patterns. The patterns use lowercase letters because, with the default mapping, the standard analyzer lowercases each product code at index time, and the regexp filter matches against the indexed terms rather than the original text. The missing bucket, which will get a nice performance boost in Elasticsearch v1.3.0, counts all documents that have no product code at all. From this base, it is easy to add extra metrics, such as the average price of each product or the average number sold each day over the last month.
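As a rough sketch of adding such metrics, sub-aggregations can be nested inside a filter bucket. The request body below is meant for the same search endpoint as the aggregation above. It assumes two hypothetical fields that do not exist in the sample data, a numeric price field and a tags field, and the names avg_price and missing_tags are only illustrative:

```json
{
  "aggs": {
    "XX999": {
      "filter": {
        "regexp": { "product_code": "[a-z]{2}[0-9]{3}" }
      },
      "aggs": {
        "avg_price": {
          "avg": { "field": "price" }
        },
        "missing_tags": {
          "missing": { "field": "tags" }
        }
      }
    }
  }
}
```

Each sub-aggregation is computed only over the documents that landed in the XX999 bucket, so the same pair of sub-aggregations can be repeated under the X99999 bucket to compare the two formats.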

The takeaway tip

The key to this tip is the filter bucket. This bucket accepts any Elasticsearch filter, which means you can construct the same kind of complex filtering operations which you already use in search requests. And because these are filters, they enjoy all the performance benefits inherent to filters.

Starting in Elasticsearch version 1.3.0, you will also have access to the filters bucket (note the plural). Although functionally equivalent to using multiple filter buckets, the filters bucket simplifies the syntax for applying multiple filters at the same time.
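For comparison, here is a sketch of how the two regexp filters from the earlier example might look as a single filters bucket. It follows the named-filters syntax of the filters aggregation in 1.3.0, so it will not run on earlier versions. The outer name product_code_formats is an arbitrary label chosen for this example:

```json
{
  "aggs": {
    "product_code_formats": {
      "filters": {
        "filters": {
          "XX999": {
            "regexp": { "product_code": "[a-z]{2}[0-9]{3}" }
          },
          "X99999": {
            "regexp": { "product_code": "[a-z]{1}[0-9]{5}" }
          }
        }
      }
    }
  }
}
```

The response contains one bucket per named filter, and any sub-aggregations declared alongside the filters bucket are applied to every one of those buckets, which avoids repeating them for each pattern.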

So next time you need to aggregate some statistics, but are stumped by the irregular or inconsistent nature of your data, remember the filter bucket (and the upcoming filters bucket).
