As well as being a search engine, Elasticsearch is also a powerful analytics engine. However, in order to take full advantage of the near real-time analytics capabilities of Elasticsearch, it is often useful to add structure to your data as it is ingested into Elasticsearch. The reasons for this are explained very well in our schema on write vs. schema on read blog post, and for the remainder of this blog series, when I talk about structuring data, I am referring to schema on write.
Because structuring data can be so important, in this three-part blog series we will explore the following:
- How to add structure to unstructured documents by using an ingest node with the Grok Processor.
- How to construct new grok patterns.
- How to debug errors in grok patterns. This part will also cover related topics, such as publicly available grok patterns, and briefly mention the Dissect Processor as a possible alternative to grok.
But first … ECS
As a side note, if you are going to put in the effort to structure your data, you should consider structuring your data so that it conforms to the Elastic Common Schema (ECS). ECS helps users to more easily visualize, search, drill down, and pivot through their data. ECS also streamlines the implementation of automated analysis methods, including machine learning-based anomaly detection and alerting. ECS can be used in a wide variety of use cases, including logging, security analytics, and application performance monitoring.
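For instance, the web log line that we will parse below could be mapped onto ECS field names roughly like this (a minimal sketch; ECS defines many more fields than are shown here):
{
  "@timestamp": "2020-06-24T22:41:47.153Z",
  "host": { "ip": "55.3.244.1" },
  "http": { "request": { "method": "GET" } },
  "url": { "original": "/index.html" }
}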
An example of adding structure to unstructured data
It is not uncommon to see documents sent to Elasticsearch that are similar to the following:
{ "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff" }
The message field in the above document contains unstructured data. It is a series of words and numbers that are not appropriate for near real-time analytics.
In order to take full advantage of the powerful analytics capabilities of Elasticsearch, we should parse the message field to extract relevant data. For example, we could extract the following fields from the message:
"host.ip": "55.3.244.1" "http.request.method": "GET" "url.original": "/index.html" "http.request.bytes": 15824 "event.duration": 0.043
Adding such structure will allow you to unleash the full power (and speed) of Elasticsearch on your data. Let’s take a look at how we can use grok to structure your data.
Using grok to structure data
Grok is a tool that can be used to extract structured data out of a given text field within a document. You define a field to extract data from, as well as the grok pattern for the match. Grok sits on top of regular expressions. However, unlike regular expressions, grok patterns are made up of reusable patterns, which can themselves be composed of other grok patterns.
As a quick note to Logstash users, Elasticsearch has grok processing capabilities just like Logstash. In this blog, we will be using the ingest node grok processor, not the Logstash grok filter. It is relatively straightforward to convert between ingest node grok patterns and Logstash grok patterns.
Before going into details of how to build and debug your own grok patterns, let’s take a quick look at what a grok pattern looks like, how it can be used in an ingest pipeline, and how it can be simulated. Don’t worry if you don’t fully understand the details of the grok expression yet, as these details will be discussed in-depth later in this blog.
In the previous section we presented an example document that looks as follows:
{ "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff" }
The desired structured data can be extracted from the message field in this document by using the following grok expression:
%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA}
We can then use this expression in an ingest pipeline. Here’s an example of a pipeline called example_grok_pipeline, which contains this grok pattern inside a grok processor:
PUT _ingest/pipeline/example_grok_pipeline
{
  "description": "A simple example of using Grok",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA}"
        ]
      }
    }
  ]
}
This pipeline can be simulated with the following command:
POST _ingest/pipeline/example_grok_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
      }
    }
  ]
}
This responds with a structured document that looks as follows:
{ "docs" : [ { "doc" : { "_index" : "_index", "_type" : "_doc", "_id" : "_id", "_source" : { "host" : { "ip" : "55.3.244.1" }, "http" : { "request" : { "method" : "GET", "bytes" : 15824 } }, "message" : "55.3.244.1 GET /index.html 15824 0.043 other stuff", "event" : { "duration" : 0.043 }, "url" : { "original" : "/index.html" } }, "_ingest" : { "timestamp" : "2020-06-24T22:41:47.153985Z" } } } ] }
Voila!
This document contains the original unstructured message field, and it also contains additional fields that have been extracted from the message. We now have a document that contains structured data!
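As a side note, once a pipeline grows to contain several processors, the simulate API also accepts a verbose parameter, which reports what the document looks like after each individual processor; something like the following:
POST _ingest/pipeline/example_grok_pipeline/_simulate?verbose=true
{
  "docs": [
    {
      "_source": {
        "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
      }
    }
  ]
}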
Grokking out your ingest pipelines
In the above example we simulated execution of an ingest pipeline that contains our grok pattern, but didn’t actually run it on any real documents. An ingest pipeline is designed to process documents at ingest time, as described in the ingest node documentation. One way to execute an ingest pipeline is by including a pipeline name when using the PUT command, as follows:
PUT example_index/_doc/1?pipeline=example_grok_pipeline
{
  "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff"
}
The document that has just been written can be viewed by executing:
GET example_index/_doc/1
This will respond with the following:
{ "_index" : "example_index", "_type" : "_doc", "_id" : "1", "_version" : 2, "_seq_no" : 2, "_primary_term" : 1, "found" : true, "_source" : { "host" : { "ip" : "55.3.244.1" }, "http" : { "request" : { "method" : "GET", "bytes" : 15824 } }, "message" : "55.3.244.1 GET /index.html 15824 0.043 other stuff", "event" : { "duration" : 0.043 }, "url" : { "original" : "/index.html" } } }
Alternatively (and likely preferably), the ingest pipeline can be applied by default to all documents that are written to a given index by adding it to the index settings:
PUT example_index/_settings
{
  "index.default_pipeline": "example_grok_pipeline"
}
After adding the pipeline to the settings, any documents that are written to example_index will automatically have the example_grok_pipeline applied to them.
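If the index does not exist yet, the same setting can instead be supplied when the index is created, so that even the very first document goes through the pipeline. A sketch, assuming example_index is being created from scratch:
PUT example_index
{
  "settings": {
    "index.default_pipeline": "example_grok_pipeline"
  }
}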
Testing the pipeline
We can verify that the pipeline is working as expected by writing a new document to example_index as follows:
PUT example_index/_doc/2
{
  "message": "66.3.244.1 GET /index.html 500 0.120 new other stuff"
}
The document that has just been written can be viewed by executing:
GET example_index/_doc/2
As expected, this returns the document that we just wrote, which now contains the new fields that were extracted from the message field:
{ "_index" : "example_index", "_type" : "_doc", "_id" : "2", "_version" : 3, "_seq_no" : 2, "_primary_term" : 1, "found" : true, "_source" : { "host" : { "ip" : "66.3.244.1" }, "http" : { "request" : { "method" : "GET", "bytes" : 500 } }, "message" : "66.3.244.1 GET /index.html 500 0.120 new other stuff", "event" : { "duration" : 0.12 }, "url" : { "original" : "/index.html" } } }
Understanding the grok pattern
In the previous section, we presented an example document with the following structure:
{ "message": "55.3.244.1 GET /index.html 15824 0.043 other stuff" }
We then used the following grok pattern to extract structured data from the message field:
"%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA}"
As described in the Grok Processor documentation, the syntax for grok patterns comes in three forms: %{SYNTAX:SEMANTIC}, %{SYNTAX}, and %{SYNTAX:SEMANTIC:TYPE}, all of which appear in the above grok pattern and are broken down just after the list below.
- The SYNTAX is the name of the pattern that will match your text. The built-in SYNTAX patterns can be seen on GitHub.
- The SEMANTIC is the name of the field that will store the data that matches the SYNTAX pattern.
- The TYPE is the data type that you wish to cast your named field to.
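Mapping these three forms back onto the parts of our example pattern:
%{IP:host.ip}                       %{SYNTAX:SEMANTIC}: match an IP address and store it in host.ip
%{NUMBER:http.request.bytes:int}    %{SYNTAX:SEMANTIC:TYPE}: match a number, store it in http.request.bytes, and cast it to an integer
%{GREEDYDATA}                       %{SYNTAX}: match the remaining text without storing it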
Identifying IP addresses
The first part of our grok pattern is the following:
%{IP:host.ip}
This declaration matches an IP address (corresponding to the IP grok pattern) and stores it in a field called host.ip. For our example data, this will extract a value of 55.3.244.1 from the message field and will store it in the host.ip field.
If we want more details on the IP grok pattern, we can look into the grok patterns on GitHub, where we will see the following definition:
IP (?:%{IPV6}|%{IPV4})
This means that the IP pattern will match either the IPV6 or the IPV4 grok pattern. To understand what the IPV6 and IPV4 patterns are, we can once again look into the grok patterns on GitHub to see their definitions, and so on.
Identifying the http request method
The next part of the grok pattern is a single whitespace character followed by the following expression:
%{WORD:http.request.method}
This portion of the grok expression extracts the word GET from the contents of the message and stores it in the http.request.method field. If we want to understand the definition of the WORD pattern, we can again look at the grok patterns on GitHub.
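At the time of writing, that definition is simply a word-boundary match, along the lines of:
WORD \b\w+\b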
One can do the same kind of analysis to understand the patterns that match the url.original, http.request.bytes, and event.duration fields, which would be great practice for anyone learning grok (hint… hint…).
Identifying the rest with GREEDYDATA
Finally, the last statement in the grok pattern is the following:
%{GREEDYDATA}
This expression does not have a SEMANTIC part, which means that the matching data is not stored in any field. Additionally, the GREEDYDATA grok pattern will consume as much text as it can, which means that in our example it will match everything after the event.duration field. The GREEDYDATA expression will come in handy when debugging complex grok patterns, as discussed in the upcoming parts of this blog series.
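For example, while debugging a pattern it can be handy to temporarily give GREEDYDATA a SEMANTIC of its own so that the unmatched tail of the message is stored where you can inspect it. A sketch, using a hypothetical field name of leftover:
%{IP:host.ip} %{WORD:http.request.method} %{URIPATHPARAM:url.original} %{NUMBER:http.request.bytes:int} %{NUMBER:event.duration:double} %{GREEDYDATA:leftover}
For our example message, the leftover field would then contain "other stuff".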
Stay tuned
In this post, we’ve explored structuring data within an ingest pipeline using grok patterns and grok processors. This (along with the ingest node documentation) should be enough for you to start structuring your own data. So try it out locally, or spin up a 14-day free trial of Elastic Cloud and give it a whirl on us.
If you're ready to learn how to structure your data exactly the way you want it — no matter what your use case is — check out our next blog on constructing new grok patterns.