14 October 2014 Engineering

Little Logstash Lessons - Part I: Using grok and mutate to type your data

By Aaron Mildenstein

Logstash is an event processing pipeline, which features a rich ecosystem of plugins, allowing users to push data in, manipulate it, and then send it to various backends.

One of those plugins is grok. Grok is currently the best way for Logstash to parse unstructured log data into structured fields that Elasticsearch can query effectively. Mutate, another popular plugin, allows the user to manipulate Logstash event data in many useful ways.

Why type my data?

Elasticsearch is so much more powerful than just its full-text search use case. It can also calculate a variety of statistics on numerical data in near real-time. Kibana can then take the results of these calculations and plot them in charts and dashboards. If the data is not typed properly, Elasticsearch will be unable to perform these calculations. Typing your data properly takes a little thought and effort, but yields amazing results!

JSON, Strings, and Numbers

All documents sent to Elasticsearch must be in JSON format, but Logstash takes care of transforming your data into JSON documents for you. When Elasticsearch receives a JSON document, it will do its best to guess what type of data each field contains. The list of core types and a thorough description of each one can be found in the Elasticsearch mapping documentation.

Consider the following JSON document, patterned as a Logstash event:

{
    "@timestamp": "2014-10-07T20:11:45.000Z",
    "@version": "1",
    "count": 2048,
    "average": 1523.33,
    "host": "elasticsearch.com"
}

This document has five fields: @timestamp, @version, count, average, and host.

In the JSON I am sending, @timestamp and host are string fields, and count and average are numeric fields, but @version is a strange hybrid. Its value looks like a number, but because it is inside double quotes (") it is considered a string within this JSON document.

If I were to send this document to Elasticsearch to be indexed:

curl -XPOST localhost:9200/logstash-2014.10.07/logs/1 -d '
{
    "@timestamp": "2014-10-07T20:11:45.000Z",
    "@version": "1",
    "count": 2048,
    "average": 1523.33,
    "host": "elasticsearch.com"
}'

and then check the mapping…

curl 'localhost:9200/logstash-2014.10.07/_mapping?pretty'
{
  "logstash-2014.10.07" : {
    "mappings" : {
      "logs" : {
        "properties" : {
          "@timestamp" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "@version" : {
            "type" : "string"
          },
          "average" : {
            "type" : "double"
          },
          "count" : {
            "type" : "long"
          },
          "host" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

average is type double, count is type long, Elasticsearch successfully guessed that @timestamp is a date field, and host is type string.

Because @version was sent as a string, it remains type string.

It is important to understand that unless you type (or cast) your data accordingly, Logstash sends all values to Elasticsearch as strings.

Coercing a data type in Logstash

There are currently two ways to coerce Logstash to send numeric values: grok and mutate. You can also coerce types at the Elasticsearch level.

grok

Using grok to parse unstructured data into structured data can be a daunting task on its own. It can be downright confusing to tokenize numeric data into a field (let’s call it num) with the grok pattern %{NUMBER:num} only to find that Elasticsearch thinks num is a string field. Part of the confusion stems from the fact that grok treats the source data as a string since it is a regular expression engine. Because grok sees whatever is input as a string, without further information the output will also be a string. What we need is a way to tell grok and Logstash that the resulting value should be numeric.

The official documentation for grok explains it like this:

Optionally you can add a data type conversion to your grok pattern. By default all semantics are saved as strings. If you wish to convert a semantic’s data type, for example change a string to an integer then suffix it with the target data type. For example %{NUMBER:num:int} which converts the num semantic from a string to an integer. Currently the only supported conversions are int and float.

So, if I add :int into my grok field definition, suddenly my value will be cast as an integer. Caveat: The NUMBER grok pattern will also detect numbers with decimal values. If you cast a number with a decimal as :int, it will truncate the decimal value leaving only the whole number portion.

In that case, the number can be cast as :float, meaning floating-point value.
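As a rough sketch of how this looks in a configuration, the filter below applies both conversions in one pattern. The log line format and the field names (client, method, request, bytes, duration) are assumptions chosen for illustration, modeled on a line such as 55.3.244.1 GET /index.html 15824 0.043; adjust the pattern to your own data.

```
filter {
  grok {
    # bytes is a whole number, so :int is safe; duration has a decimal
    # portion, so :float is used to avoid truncation.
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes:int} %{NUMBER:duration:float}" }
  }
}
```

With a pattern like this, bytes and duration should leave Logstash as JSON numbers, while client, method, and request remain strings.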

mutate

Like grok, mutate also allows you to type your fields. The mutate filter currently allows you to convert a field into an integer, a float, or a string. Using our previous example of field num, the configuration would look something like this:

filter {
  mutate {
    convert => { "num" => "integer" }
  }
}

As with grok, converting a value with a decimal to type integer will truncate the decimal portion, leaving only the whole number.
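If a field may carry a decimal portion, converting it to float instead avoids the truncation. Here is a minimal sketch, assuming events that already contain the count and average fields from the earlier JSON example, but as strings:

```
filter {
  mutate {
    # Whole numbers can safely become integers; values with a decimal
    # portion are converted to float so nothing is truncated.
    convert => {
      "count"   => "integer"
      "average" => "float"
    }
  }
}
```

Entries in a Logstash configuration hash are separated by whitespace rather than commas, so each field gets its own line here.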

Checking our work

If I were to check the mapping in Elasticsearch after typing the field num to an integer, I would find something unusual.

curl 'localhost:9200/logstash-2014.10.07/_mapping?pretty'
...
          "num" : {
            "type" : "long"
          }
...

It says that it’s type long. But I typed it as integer! What happened?

The answer is in the JSON. Logstash can only send JSON to Elasticsearch, and JSON has a single number type: it does not distinguish an integer from a long. If you recall our first example in this post, we sent a document with the field count with a value of 2048, and Elasticsearch mapped that as type long as well. The same thing happened here, only with the field name num. Unless you tell Elasticsearch what to expect, it can only make an educated guess about how to interpret the JSON you send it.
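To confirm that Logstash itself is emitting a number rather than a string, independent of how Elasticsearch later maps it, one option is a throwaway pipeline that prints events to the console. This is only a sketch: it assumes you type a bare number on stdin as the test input, and it reuses the num field from the grok example.

```
input {
  # Read test lines typed at the terminal
  stdin { }
}

filter {
  grok {
    # Capture the number and cast it to an integer
    match => { "message" => "%{NUMBER:num:int}" }
  }
}

output {
  # rubydebug pretty-prints each event to the console
  stdout { codec => rubydebug }
}
```

In the printed event, a numeric value appears without quotes, while a string value appears inside double quotes, so a quick glance at num shows whether the cast took effect.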

Next steps: Typing in Elasticsearch

So how do we do typing in Elasticsearch? That will have to wait until the next issue of Little Logstash Lessons.

I hope you enjoyed today’s Little Logstash Lesson, where we learned how to re-type our numeric values inside Logstash! Stay tuned for our next lesson where we’ll learn how to take our typing to the next level in Elasticsearch.