September 3, 2015

Make Your Config Cleaner and Your Log Processing Faster with Logstash Metadata

With the release of Logstash 1.5 we have added the ability to add metadata to an event. The difference between regular event data and metadata is that metadata is not serialized by any outputs. This means any metadata you add is transient in the Logstash pipeline and will not be included in the output. Using this feature, one can add custom data to an event, perform additional filtering or add conditionals based on the metadata while the event flows through the Logstash pipeline. This will simplify your configuration and remove the need to define temporary fields.

To access the metadata fields you can use the standard field syntax:

[@metadata][foo]
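
As a minimal sketch of that syntax in use, the filter below stores a value under @metadata and then branches on it. The field name env, the value "staging", and the tag are hypothetical and chosen only for illustration.

```
filter {
  # Store a transient value; it will not be serialized by outputs.
  mutate {
    add_field => { "[@metadata][env]" => "staging" }
  }
  # Metadata fields can be tested in conditionals like any other field.
  if [@metadata][env] == "staging" {
    mutate { add_tag => ["staging"] }
  }
}
```

Only the tag would reach the output here; the [@metadata][env] field itself stays inside the pipeline.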

Use Cases

Let us consider some use cases to illustrate the power of metadata. In all of them, we will use the stdout output with the rubydebug codec to check our transformations, so make sure you define the codec with the metadata option set to true.

Note: The rubydebug codec used in the stdout output is currently the only way to see what is in @metadata at output time.

output { 
  stdout { 
    codec  => rubydebug {
      metadata => true
    }
  }
}

Date filter

Since logs arrive in a wide variety of formats, grok is used to extract the timestamp, and the date filter converts it to ISO8601 and overwrites the @timestamp field with the timestamp from the log event. Users frequently forget to remove the source timestamp field after the conversion, though.

Here's a rough example of how the new @metadata field could be used with the date filter and prevent a temporary timestamp field from making it into Elasticsearch:

  grok {
    match => {
      "message" => '%{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:[@metadata][timestamp]}\] "%{WORD:verb} %{DATA:request} HTTP/%{NUMBER:httpversion}" %{NUMBER:response:int} (?:-|%{NUMBER:bytes:int}) %{QS:referrer} %{QS:agent}'
    }
  }
  date {
    match => [ "[@metadata][timestamp]", "dd/MMM/YYYY:HH:mm:ss Z" ]
  }

Before Logstash 1.5, you would capture the timestamp into a regular field and then remove that redundant field by adding a remove_field option to the date filter. Theoretically, that is a slower operation than never creating the field in the first place, which makes the @metadata field a potential performance booster as well.
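
For comparison, a pre-1.5 configuration might have looked something like the sketch below. It assumes the stock COMBINEDAPACHELOG pattern, which captures the log timestamp into a regular field named timestamp, and it relies on remove_field to drop that field once the date filter has matched.

```
filter {
  grok {
    # COMBINEDAPACHELOG stores the log timestamp in a regular "timestamp" field.
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
    # Applied only when the date filter succeeds; forgetting this line
    # leaves the temporary field in every indexed document.
    remove_field => [ "timestamp" ]
  }
}
```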

The @metadata field acts like a normal field, and you can perform all the usual operations and filtering on it. Use it as a scratchpad for information you don't need to persist.

# Log sample:
# 213.113.233.227 - server=A id=1234 memory_load=300 error_code=13 payload=12 event_start=1417193566 event_stop=1417793586

input { 
  file {
    sincedb_path => '/dev/null'
    path => "/source/test.log"
    start_position => 'beginning'
  }
}
filter { 
  grok {
    match => {
      "message" => "%{IP:ip} - %{DATA:[@metadata][components]}$" 
    }
  }
  kv { source => "[@metadata][components]" }
  date {
    match => ["event_start", "UNIX"]
    target => "event_start"
  }
  date {
    match => ["event_stop", "UNIX"]
    target => "event_stop"
  }
  ruby {
    code => "event['@metadata']['duration'] = event['event_stop'] - event['event_start']"
  }
  if [@metadata][duration] > 100 {
    mutate { 
      add_tag => "slow_query" 
      add_field => { "[@metadata][speed]" =>  "slow_query" }
    }
  } else {
    mutate { 
      add_field => { "[@metadata][speed]" =>  "normal" }
    }
  }
}
output { 
  stdout { 
    codec  => rubydebug { metadata => true }
  }
}
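
Because [@metadata][speed] is available to conditionals but is never serialized, it can also drive routing. The sketch below is one possible variation on the output section above; the file path is a made-up example.

```
output {
  if [@metadata][speed] == "slow_query" {
    # Hypothetical destination for slow events only.
    file { path => "/tmp/slow_queries.log" }
  } else {
    stdout {
      codec => rubydebug { metadata => true }
    }
  }
}
```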

Elasticsearch output

Some plugins make use of metadata themselves, such as the elasticsearch input. It lets you keep the document information in a predefined @metadata field. This information is available to the rest of the Logstash pipeline, but will not be persisted in Elasticsearch documents.

input {
  elasticsearch {
    host => "localhost"
    # Store ES document metadata (_index, _type, _id) in metadata
    docinfo_in_metadata => true
  }
}
output {
  elasticsearch {
    document_id => "%{[@metadata][_id]}"
    index => "transformed-%{[@metadata][_index]}"
    type => "%{[@metadata][_type]}"
  }
}

Create your own id from your event data

Out of the box, Elasticsearch provides an efficient way to create a unique ID for every document that you insert. In most cases, you should let Elasticsearch generate the IDs. However, there are scenarios where you would want to generate a unique identifier in Logstash based on the content of the event. Using IDs based on event data lets Elasticsearch perform de-duplication. In our example, we will generate the IDs using the logstash-filter-fingerprint plugin with its default hash method (SHA1).

To test it, use the following JSON event with this configuration:

{ "IP": "127.0.0.1", "message": "testing generated id"}

input {
 stdin { codec => json }
}
filter {
  fingerprint {
    source => ["IP", "@timestamp", "message"]
    target => "[@metadata][generated_id]"
    key => "my-key"
  }
}
output {
  elasticsearch {
    protocol => "http"
    host => "127.0.0.1"
    document_id => "%{[@metadata][generated_id]}"
  }
  stdout {
    codec => rubydebug { metadata => true }
  }
}

As in the previous examples, we are using the fieldref syntax to access generated_id in the @metadata hash. The Elasticsearch output will use this value as the document ID, but the intermediate generated_id field will not be saved as part of the _source inside Elasticsearch. If you query for the specific document using the generated ID, you should see a document similar to the following, showing the saved information.

# curl -XGET "http://localhost:9200/logstash*/_search?q=_id:5f5b8e63da13c17405e940b5e8db703a19cd4485&pretty=1"

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 35,
    "successful" : 35,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "logstash-2015.09.03",
      "_type" : "logs",
      "_id" : "5f5b8e63da13c17405e940b5e8db703a19cd4485",
      "_score" : 1.0,
      "_source":{"IP":"127.0.0.1","message":"testing generated id","@version":"1","@timestamp":"2015-09-03T20:27:25.206Z","host":"sashimi"}
    } ]
  }
}

Similarly, you can reference @metadata fields with the sprintf format in your configuration, just like any other field:

"from server: %{[@metadata][source]}"
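
To show where such a string might appear, here is a small sketch. It assumes an earlier filter has already populated [@metadata][source]; the target field name origin is hypothetical.

```
filter {
  mutate {
    # The sprintf reference is resolved per event; only "origin" is persisted.
    add_field => { "origin" => "from server: %{[@metadata][source]}" }
  }
}
```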

Conclusion

As you have seen in the examples above, the addition of metadata provides a simple, yet convenient way to store intermediate results. This makes configuration less complex -- you don't have to use remove_field explicitly. Also, we can reduce storage of unnecessary fields in Elasticsearch which helps reduce the size of your index. Metadata is a powerful addition to your Logstash toolset. Start using this feature today in your configuration!
