03 September 2015 Engineering

Make Your Config Cleaner and Your Log Processing Faster with Logstash Metadata

By Pier-Hugues Pellerin

With the release of Logstash 1.5 we have added the ability to add metadata to an event. The difference between regular event data and metadata is that metadata is not serialized by any outputs. This means any metadata you add is transient in the Logstash pipeline and will not be included in the output. Using this feature, one can add custom data to an event, perform additional filtering or add conditionals based on the metadata while the event flows through the Logstash pipeline. This will simplify your configuration and remove the need to define temporary fields.

To access the metadata fields you can use the standard field syntax:

[@metadata][foo]
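
As a minimal sketch of this syntax, a filter can set a metadata field and then branch on it. The field name foo, the value bar, and the tag name are placeholders chosen for illustration, not part of any plugin:

```
filter {
  # Store a transient value under @metadata; outputs will not serialize it.
  mutate {
    add_field => { "[@metadata][foo]" => "bar" }
  }
  # Metadata fields can be used in conditionals like regular fields.
  if [@metadata][foo] == "bar" {
    mutate { add_tag => [ "foo_is_bar" ] }
  }
}
```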

Use Cases

Let us consider some use cases to illustrate the power of metadata. In all our use cases, we will be using the stdout output with the rubydebug codec to check our transformations, so make sure you define the output codec with the metadata option set to true.

Note: The rubydebug codec used in the stdout output is currently the only way to see what is in @metadata at output time.

output { 
  stdout { 
    codec  => rubydebug {
      metadata => true
    }
  }
}
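
To see the difference for yourself, a small test pipeline along the following lines should do. It is only a sketch: the note field and its value are made up for illustration. The first stdout prints the event including @metadata, while the second serializes the same event as JSON, where the metadata should be absent:

```
input {
  stdin { }
}
filter {
  # Hypothetical field, added only to have something in @metadata.
  mutate {
    add_field => { "[@metadata][note]" => "transient value" }
  }
}
output {
  # Shows @metadata because the metadata option is enabled.
  stdout {
    codec => rubydebug { metadata => true }
  }
  # Regular serialization of the same event.
  stdout {
    codec => json_lines
  }
}
```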

Date filter

Since logs arrive in a wide variety of formats, grok is used to extract the timestamp, and the date filter to convert it to ISO8601 and overwrite the @timestamp field with the timestamp from the log event. Users frequently forget to remove the source timestamp field after the conversion and overwrite, though.

Here's a rough example of how the new @metadata field could be used with the date filter and prevent a temporary timestamp field from making it into Elasticsearch:

  grok {
    match => {
      "message" => '%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:[@metadata][timestamp]}\] "%{WORD:verb} %{DATA:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:response:int} (?:-|%{NUMBER:bytes:int}) %{QS:referrer} %{QS:agent}'
    }
  }
  date {
    match => [ "[@metadata][timestamp]", "dd/MMM/YYYY:HH:mm:ss Z" ]
  }

Before Logstash 1.5, you would remove the redundant timestamp field by adding a remove_field line to the date filter. In theory, that extra removal step is slower than capturing the value directly into @metadata, so using the @metadata field can also act as a small performance booster.
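
For comparison, the older approach can be sketched roughly as follows. This assumes the stock COMBINEDAPACHELOG grok pattern, which captures the log time into a regular field named timestamp; that field then has to be removed explicitly once the date filter has parsed it:

```
filter {
  grok {
    # The stock pattern stores the log time in the regular "timestamp" field.
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    # Explicit cleanup, only applied when the date was parsed successfully.
    remove_field => [ "timestamp" ]
  }
}
```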

The @metadata field acts like a normal field, and you can perform all the usual operations and filtering on it. Use it as a scratchpad when you don't need to persist the information.

# Log sample:
# 213.113.233.227 - server=A id=1234 memory_load=300 error_code=13 payload=12 event_start=1417193566 event_stop=1417793586
input { 
  file {
    sincedb_path => '/dev/null'
    path => "/source/test.log"
    start_position => 'beginning'
  }
}
filter { 
  grok {
    match => {
      "message" => "%{IP:ip} - %{DATA:[@metadata][components]}$" 
    }
  }
  kv { source => "[@metadata][components]" }
  date {
    match => ["event_start", "UNIX"]
    target => "event_start"
  }
  date {
    match => ["event_stop", "UNIX"]
    target => "event_stop"
  }
  ruby {
    code => "event['@metadata']['duration'] = event['event_stop'] - event['event_start']"
  }
  if [@metadata][duration] > 100 {
    mutate { 
      add_tag => "slow_query" 
      add_field => { "[@metadata][speed]" =>  "slow_query" }
    }
  } else {
    mutate { 
      add_field => { "[@metadata][speed]" =>  "normal" }
    }
  }
}
output { 
  stdout { 
    codec  => rubydebug { metadata => true }
  }
}
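
Metadata set in the filter section can drive conditionals in the output section as well. As a sketch, the output above could be swapped for something like the following, which routes events on [@metadata][speed]. The file path is a placeholder chosen for illustration:

```
output {
  if [@metadata][speed] == "slow_query" {
    # Hypothetical destination for slow events.
    file { path => "/tmp/slow_queries.log" }
  } else {
    stdout {
      codec => rubydebug { metadata => true }
    }
  }
}
```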

Elasticsearch output

Some plugins make use of metadata themselves, such as the elasticsearch input, which can keep the document information in a predefined @metadata field. This information is available to the rest of the Logstash pipeline, but will not be persisted in Elasticsearch documents.

input {
  elasticsearch {
    host => "localhost"
    # Store ES document metadata (_index, _type, _id) in metadata
    docinfo_in_metadata => true
  }
}
output {
  elasticsearch {
    document_id => "%{[@metadata][_id]}"
    index => "transformed-%{[@metadata][_index]}"
    type => "%{[@metadata][_type]}"
  }
}
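
Because the document information lives in @metadata, it can also be used in conditionals between the input and the output. A minimal sketch, assuming the input has populated [@metadata][_index] as configured above, and using a made-up tag name:

```
filter {
  # Tag only the documents read from indices whose name starts with "logstash-".
  if [@metadata][_index] =~ /^logstash-/ {
    mutate { add_tag => [ "from_logstash_index" ] }
  }
}
```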

Create your own id from your event data

Out of the box, Elasticsearch provides an efficient way to create unique IDs for every document that you insert. In most cases, you should let Elasticsearch generate the IDs. However, there are scenarios where you would want to generate a unique identifier in Logstash based on the content of the event. Using IDs based on event data lets Elasticsearch perform de-duplication. In our example, we will generate the IDs using the logstash-filter-fingerprint plugin with the default hash method (SHA1).

To test it, use the following JSON event with this configuration:

{ "IP": "127.0.0.1", "message": "testing generated id" }

input {
 stdin { codec => json }
}
filter {
  fingerprint {
    source => ["IP", "@timestamp", "message"]
    target => "[@metadata][generated_id]"
    key => "my-key"
  }
}
output {
  elasticsearch {
    protocol => "http"
    host => "127.0.0.1"
    document_id => "%{[@metadata][generated_id]}"
  }
  stdout {
    codec => rubydebug { metadata => true }
  }
}

As in the previous examples, we use the fieldref syntax to access generated_id in the @metadata hash. The Elasticsearch output uses this value as the document ID, but the intermediate generated_id field is not saved as part of the _source inside Elasticsearch. If you query for the specific document using the generated ID, you should see a document similar to this one, containing only the saved information.

# curl -XGET "http://localhost:9200/logstash*/_search?q=_id:5f5b8e63da13c17405e940b5e8db703a19cd4485&pretty=1"
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 35,
    "successful" : 35,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "logstash-2015.09.03",
      "_type" : "logs",
      "_id" : "5f5b8e63da13c17405e940b5e8db703a19cd4485",
      "_score" : 1.0,
      "_source":{"IP":"127.0.0.1","message":"testing generated id","@version":"1","@timestamp":"2015-09-03T20:27:25.206Z","host":"sashimi"}
    } ]
  }
}

Similarly, you can reference @metadata fields inside strings in your configuration, just like any other field:

"from server: %{[@metadata][source]}"
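
In context, such a string could appear in a mutate filter, for instance. This is only a sketch: it assumes an earlier step has set [@metadata][source], and the origin field name is made up for illustration:

```
filter {
  mutate {
    # The metadata value is copied into the string when the event is processed.
    add_field => { "origin" => "from server: %{[@metadata][source]}" }
  }
}
```

Here origin is a regular field, so it reaches the outputs, while [@metadata][source] itself does not.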

Conclusion

As you have seen in the examples above, metadata provides a simple yet convenient way to store intermediate results. This makes configuration less complex, because you don't have to use remove_field explicitly. It also avoids storing unnecessary fields in Elasticsearch, which helps reduce the size of your index. Metadata is a powerful addition to your Logstash toolset. Start using this feature today in your configuration!