Structuring Elasticsearch data with grok on ingest for faster analytics

As well as being a search engine, Elasticsearch is also a powerful analytics engine. However, in order to take full advantage of the near real-time analytics capabilities of Elasticsearch, it is often useful to add structure to your data as it is ingested into Elasticsearch. The reasons for this are explained very well in our schema on write vs. schema on read blog post, and for the remainder of this blog series, when I talk about structuring data, I am referring to schema on write.

Because structuring data can be so important, in this three-part blog series we will explore the following:

But first … ECS

As a side note, if you are going to put in the effort to structure your data, you should consider structuring your data so that it conforms to the Elastic Common Schema (ECS). ECS helps users to more easily visualize, search, drill down, and pivot through their data. ECS also streamlines the implementation of automated analysis methods, including machine learning-based anomaly detection and alerting. ECS can be used in a wide variety of use cases, including logging, security analytics, and application performance monitoring.

An example of adding structure to unstructured data

It is not uncommon to see documents sent to Elasticsearch that are similar to the following:

{ 
    "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff" 
} 

The message field in the above document contains unstructured data. It is a series of words and numbers that, as a single string, is not well suited to near real-time analytics.

In order to take full advantage of the powerful analytics capabilities of Elasticsearch, we should parse the message field to extract relevant data. For example, we could extract the following fields from the message: 

"host.ip": "55.3.244.1"  
"http.request.method": "GET" 
"url.original": "/index.html" 
"http.request.bytes": 15824 
"event.duration": 0.043 

Adding such a structure will allow you to unleash the full power (and speed) of Elasticsearch on your data. Let’s take a look at how we can use grok to structure your data.

Using grok to structure data

Grok is a tool that can be used to extract structured data out of a given text field within a document. You define a field to extract data from, as well as the grok pattern for the match. Grok sits on top of regular expressions. However, unlike regular expressions, grok patterns are made up of reusable patterns, which can themselves be composed of other grok patterns. 
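To make the relationship between grok and regular expressions more concrete, here is a minimal Python sketch of the idea. It is not how Elasticsearch implements grok. The PATTERNS table holds simplified stand-ins for the built-in patterns (IPV4 and URIPATH are used in place of IP and URIPATHPARAM), and grok_to_regex and grok_match are hypothetical helper names introduced only for this illustration.

```python
import re

# Simplified stand-ins for grok's built-in patterns (assumptions for
# illustration; the real definitions are more thorough).
PATTERNS = {
    "IPV4": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\b\w+\b",
    "URIPATH": r"/\S*",
    "GREEDYDATA": r".*",
}

# Matches %{NAME} and %{NAME:field} references inside a grok expression.
GROK_REF = re.compile(r"%\{(\w+)(?::([\w.]+))?\}")


def grok_to_regex(expression):
    """Translate a grok expression into a regex plus a group-to-field map."""
    fields = {}

    def replace(match):
        name, field = match.group(1), match.group(2)
        body = PATTERNS[name]
        if field is None:
            # No field name: match the text but do not capture it.
            return "(?:" + body + ")"
        # Python group names cannot contain dots, so use g0, g1, ...
        group = "g" + str(len(fields))
        fields[group] = field
        return "(?P<" + group + ">" + body + ")"

    return GROK_REF.sub(replace, expression), fields


def grok_match(expression, text):
    """Return a dict of extracted fields, or None if the text does not match."""
    regex, fields = grok_to_regex(expression)
    m = re.search(regex, text)
    if m is None:
        return None
    return {fields[group]: value for group, value in m.groupdict().items()}


result = grok_match(
    "%{IPV4:host.ip} %{WORD:http.request.method} %{URIPATH:url.original} %{GREEDYDATA}",
    "55.3.244.1 GET /index.html 15824 0.043 other stuff",
)
print(result)
```

Each named reference becomes a named capture group, and the reusable pattern names are simply looked up and substituted, which is what makes a grok expression easier to read than the equivalent raw regular expression.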

As a quick note to Logstash users, Elasticsearch has grok processing capabilities just like Logstash. In this blog, we will be using the ingest node grok processor, not the Logstash grok filter. It is relatively straightforward to convert between ingest node grok patterns and Logstash grok patterns.

Before going into the details of how to build and debug your own grok patterns, let’s take a quick look at what a grok pattern looks like, how it can be used in an ingest pipeline, and how it can be simulated. Don’t worry if you don’t fully understand the details of the grok expression yet, as these details will be discussed in depth later in this blog.

In the previous section we presented an example document that looks as follows:

{ 
    "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff" 
} 

The desired structured data can be extracted from the message field of this example document by using the following grok expression: 

%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA} 

We can then use this expression in an ingest pipeline. Here’s an example of a pipeline called example_grok_pipeline which contains this grok pattern inside a grok processor:

PUT _ingest/pipeline/example_grok_pipeline 
{ 
  "description": "A simple example of using Grok", 
  "processors": [ 
    { 
      "grok": { 
        "field": "message", 
        "patterns": [ 
          "%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA}" 
        ] 
      } 
    } 
  ] 
} 

This pipeline can be simulated with the following command:

POST _ingest/pipeline/example_grok_pipeline/_simulate 
{ 
    "docs": [ 
    { 
      "_source": { 
        "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff" 
      } 
    } 
  ] 
} 

Which responds with a structured document that looks as follows: 

{ 
  "docs" : [ 
    { 
      "doc" : { 
        "_index" : "_index", 
        "_type" : "_doc", 
        "_id" : "_id", 
        "_source" : { 
          "host" : { 
            "ip" : "55.3.244.1" 
          }, 
          "http" : { 
            "request" : { 
              "method" : "GET", 
              "bytes" : 15824 
            } 
          }, 
          "message" : "55.3.244.1 GET /index.html 15824 0.043 other stuff", 
          "event" : { 
            "duration" : 0.043 
          }, 
          "url" : { 
            "original" : "/index.html" 
          } 
        }, 
        "_ingest" : { 
          "timestamp" : "2020-06-24T22:41:47.153985Z" 
        } 
      } 
    } 
  ] 
} 

Voila!

This document contains the original unstructured message field, and it also contains additional fields which have been extracted from the message. We now have a document that contains structured data!
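Notice that the dotted field names in the grok expression, such as host.ip, show up as nested objects in the response. As a rough illustration of that mapping (a sketch only, with nest being a hypothetical helper name and not part of any Elasticsearch API), the following Python snippet expands dotted keys into nested dictionaries:

```python
def nest(flat):
    """Turn {"host.ip": "..."} style keys into nested dictionaries."""
    nested = {}
    for dotted_key, value in flat.items():
        node = nested
        # Everything before the last dot names a parent object.
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


doc = nest({
    "host.ip": "55.3.244.1",
    "http.request.method": "GET",
    "http.request.bytes": 15824,
    "url.original": "/index.html",
    "event.duration": 0.043,
})
print(doc)
```

Both http.request.method and http.request.bytes end up under the same http.request object, just as they do in the simulated response above.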

Grokking out your ingest pipelines 

In the above example we simulated execution of an ingest pipeline that contains our grok pattern, but didn’t actually run it on any real documents. An ingest pipeline is designed to process documents at ingest time, as described in the ingest node documentation. One way to execute an ingest pipeline is by including a pipeline name when using the PUT command, as follows: 

PUT example_index/_doc/1?pipeline=example_grok_pipeline 
{ 
  "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff" 
} 

And the document that has been written can be seen by executing:

GET example_index/_doc/1 

Which will respond with the following:

{ 
  "_index" : "example_index", 
  "_type" : "_doc", 
  "_id" : "1", 
  "_version" : 2, 
  "_seq_no" : 2, 
  "_primary_term" : 1, 
  "found" : true, 
  "_source" : { 
    "host" : { 
      "ip" : "55.3.244.1" 
    }, 
    "http" : { 
      "request" : { 
        "method" : "GET", 
        "bytes" : 15824 
      } 
    }, 
    "message" : "55.3.244.1 GET /index.html 15824 0.043 other stuff", 
    "event" : { 
      "duration" : 0.043 
    }, 
    "url" : { 
      "original" : "/index.html" 
    } 
  } 
} 

Alternatively (and likely preferably), the ingest pipeline can be applied by default to all documents that are written to a given index by adding it to the index settings:

PUT example_index/_settings 
{ 
  "index.default_pipeline": "example_grok_pipeline" 
} 

After adding the pipeline to the settings, any documents that are written to example_index will automatically have the example_grok_pipeline applied to them. 

Testing the pipeline

We can verify that the pipeline is working as expected by writing a new document to example_index as follows:

PUT example_index/_doc/2 
{ 
  "message": "66.3.244.1 GET /index.html 500 0.120 new other stuff" 
}  

And the document that has been written can be seen by executing:

GET example_index/_doc/2 

Which, as expected, will return the document that we just wrote. This document has the new fields that were extracted from the message field:

{ 
  "_index" : "example_index", 
  "_type" : "_doc", 
  "_id" : "2", 
  "_version" : 3, 
  "_seq_no" : 2, 
  "_primary_term" : 1, 
  "found" : true, 
  "_source" : { 
    "host" : { 
      "ip" : "66.3.244.1" 
    }, 
    "http" : { 
      "request" : { 
        "method" : "GET", 
        "bytes" : 500 
      } 
    }, 
    "message" : "66.3.244.1 GET /index.html 500 0.120 new other stuff", 
    "event" : { 
      "duration" : 0.12 
    }, 
    "url" : { 
      "original" : "/index.html" 
    } 
  } 
} 

Understanding the grok pattern

In the previous section, we presented an example document with the following structure:

{ 
    "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff" 
} 

And we then used the following grok pattern to extract structured data from the message field: 

"%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA}" 

As described in the Grok Processor documentation, the syntax for grok patterns comes in three forms, %{SYNTAX:SEMANTIC}, %{SYNTAX}, and %{SYNTAX:SEMANTIC:TYPE}, all of which we can see in the above grok pattern.  

  • The SYNTAX is the name of the pattern that will match your text. Built-in SYNTAX patterns can be seen on GitHub.
  • The SEMANTIC is the name of the field that will store the data that matches the SYNTAX pattern. 
  • The TYPE is the data type to which you wish to cast your named field.
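The three forms can be told apart mechanically. The following Python sketch (an illustration under stated assumptions, not the grok processor's own parser) splits each reference into its SYNTAX, SEMANTIC, and TYPE parts and applies a cast. The describe and cast helpers and the CASTS table are hypothetical names, and the table only covers the two types used in this post.

```python
import re

# Matches %{SYNTAX}, %{SYNTAX:SEMANTIC}, and %{SYNTAX:SEMANTIC:TYPE}.
GROK_REF = re.compile(
    r"%\{(?P<syntax>\w+)(?::(?P<semantic>[\w.]+))?(?::(?P<type>\w+))?\}"
)

# Hypothetical cast table covering only the two types used in this post.
CASTS = {"int": int, "double": float}


def describe(expression):
    """List the (SYNTAX, SEMANTIC, TYPE) parts of each grok reference."""
    return [
        (m["syntax"], m["semantic"], m["type"])
        for m in GROK_REF.finditer(expression)
    ]


def cast(value, type_name):
    """Cast a captured string; values without a TYPE stay strings."""
    return CASTS[type_name](value) if type_name else value


parts = describe(
    "%{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA}"
)
print(parts)
print(cast("15824", "int"), cast("0.043", "double"), cast("GET", None))
```

Without a TYPE, the extracted value stays a string, which is why the int and double suffixes matter for fields that you later want to aggregate numerically.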

Identifying IP addresses

The first part of our grok pattern is the following:

%{IP:host.ip} 

This declaration matches an IP address (corresponding to the IP grok pattern) and stores it in a field called host.ip. For our example data, this will extract a value of 55.3.244.1 from the message field and will store it in the host.ip field.

If we want more details on the IP grok pattern, we can look into the grok patterns on GitHub, and we will see the following definition: 

IP (?:%{IPV6}|%{IPV4}) 

This means that the IP pattern will match one of the IPV6 or IPV4 grok patterns. To understand what the IPV6 and IPV4 patterns are, once again we can look into the grok patterns on GitHub to see their definitions, and so on. 
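This "pattern made of patterns" lookup can be sketched in a few lines of Python. Only the IP entry below mirrors the definition quoted above. The IPV4 and IPV6 entries are heavily reduced stand-ins (assumptions for illustration, far less strict than the real definitions), and expand is a hypothetical helper name.

```python
import re

# Only the IP line mirrors the definition quoted above; IPV4 and IPV6
# are reduced stand-ins for the much longer built-in definitions.
PATTERNS = {
    "IP": r"(?:%{IPV6}|%{IPV4})",
    "IPV4": r"\d{1,3}(?:\.\d{1,3}){3}",
    "IPV6": r"[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4}){7}",
}

GROK_REF = re.compile(r"%\{(\w+)\}")


def expand(pattern):
    """Recursively replace %{NAME} references until only regex remains."""
    return GROK_REF.sub(
        lambda m: "(?:" + expand(PATTERNS[m.group(1)]) + ")", pattern
    )


ip_regex = re.compile(expand("%{IP}"))
print(ip_regex.pattern)
print(bool(ip_regex.fullmatch("55.3.244.1")))
```

Expanding %{IP} first substitutes the IP definition, then substitutes IPV6 and IPV4 inside it, leaving a single regular expression that accepts either address family.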

Identifying the HTTP request method

The next part of the grok pattern is a single whitespace character followed by the following expression:

%{WORD:http.request.method} 

This portion of the grok expression extracts the word GET from the contents of the message and stores it into the http.request.method field. If we want to understand the definition of the WORD pattern, we can look at the grok patterns on GitHub.

One can do the same kind of analysis to understand the patterns that match the url.original, http.request.bytes, and event.duration fields, which would be great practice for anyone learning grok (hint… hint…).

Identifying the rest with GREEDYDATA

Finally, the last statement in the grok pattern is the following:

%{GREEDYDATA} 

This expression does not have a SEMANTIC part, which means that the matching data is not stored into any field. Additionally, the GREEDYDATA grok pattern will consume as much text as it can, which means that in our example it will match everything after the event.duration field. The GREEDYDATA expression will come in handy when debugging complex grok patterns, as discussed in the upcoming parts of this blog series.
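The effect of an unnamed, greedy tail can be seen with a plain regular expression. In the Python sketch below, the trailing .* plays the role of GREEDYDATA, while the ip and method groups are simplified stand-ins chosen for this illustration and are not the real IP and WORD definitions.

```python
import re

message = "55.3.244.1 GET /index.html 15824 0.043 other stuff"

# The trailing (?:.*) plays the role of %{GREEDYDATA}: it consumes the rest
# of the line but, being unnamed, contributes no field to the result.
pattern = re.compile(r"(?P<ip>\S+) (?P<method>\w+) (?:.*)")

m = pattern.match(message)
print(m.group(0) == message)  # the whole message was consumed
print(m.groupdict())          # only the named parts were stored
```

The match spans the entire message, yet only the named groups appear in the extracted fields, which mirrors how the text matched by %{GREEDYDATA} is dropped from the resulting document.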

Stay tuned

In this post, we’ve explored structuring data within an ingest pipeline using grok patterns and grok processors. This (along with the ingest node documentation) should be enough for you to start structuring your own data. So try it out locally, or spin up a 14-day free trial of Elastic Cloud and give it a whirl on us.

If you're ready to learn how to structure your data exactly the way you want it, no matter what your use case is, check out our next blog on constructing new grok patterns.