January 25, 2018 Engineering

Moving Custom Ruby Code out of the Logstash Pipeline

By João Duarte

If you're a user of Logstash it is very likely that you have found yourself in a situation where none of the existing filter plugins are capable of performing the transformations or validations you need for your data pipeline. In these situations Logstash provides you with two options: writing your own filter plugin or resorting to the ruby filter.

Writing a plugin has multiple benefits: it encapsulates the logic into a Ruby gem that can be easily distributed to others; it can be installed, updated and removed through the Logstash plugin manager; it allows you to use additional third-party dependencies; and it can be tested in isolation. On the other hand, many users aren't familiar with the Ruby ecosystem and make a (perfectly valid) decision not to invest their time in this option.

The alternative of using the ruby filter consists of placing a block of Ruby code, wrapped in quotes, in the filter section of your pipeline configuration file; the code is executed for each event. For example, to enrich each event with the size of one of its fields, you can do:

input { # ... }
filter {
  ruby {
    code => 'size = event.get("description").size;
             event.set("description_size", size)'
  }
}
output { # ... }

This is simple, doesn't require knowledge of the Ruby ecosystem, and the code ships with the pipeline configuration. While this solution isn't wrong, we can foresee two major ways in which a small ruby filter like this can evolve: the need to reuse the code in multiple places and the growing complexity of the script itself.

Reusing the code of a ruby filter

Imagine that you need to enrich an event with the size of the "description" field and also the "extra" field. Copy/Paste to the rescue:

input { # ... }
filter {
  ruby {
    code => 'size = event.get("description").size;
             event.set("description_size", size)'
  }
  ruby {
    code => 'size = event.get("extra").size;
             event.set("extra_size", size)'
  }
}
output { # ... }

You might be thinking that you could write that in a single block of Ruby by having a list of fields to measure, but these ruby filters can occur in different conditional branches for different kinds of data going through your pipeline. Once you need to reuse this functionality, you'll have multiple nearly identical pieces of Ruby code repeated in the filter section, and if there's a bug in the logic you'll have to fix the same thing in multiple places (and hopefully not forget one... or two).
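For reference, the single-block variant could look roughly like the sketch below. `StubEvent`, `FIELDS_TO_MEASURE` and `add_sizes` are hypothetical names introduced only so the snippet runs outside Logstash; the stub mimics just the `get` and `set` calls used in the inline examples and is not the real Logstash event class. The `nil` guard is an addition that the inline filters above do not have.

```ruby
# Hypothetical stand-in for a Logstash event; only get/set are mimicked.
class StubEvent
  def initialize(data = {})
    @data = data.dup
  end

  def get(field)
    @data[field]
  end

  def set(field, value)
    @data[field] = value
  end
end

# Fields to measure; this list is an assumption for illustration.
FIELDS_TO_MEASURE = ["description", "extra"].freeze

# Same idea as the two inline filters, folded into one loop.
def add_sizes(event)
  FIELDS_TO_MEASURE.each do |field|
    value = event.get(field)
    next if value.nil? # skip fields this event doesn't carry
    event.set("#{field}_size", value.size)
  end
  event
end

event = add_sizes(StubEvent.new("description" => "hello", "extra" => "hi there"))
puts event.get("description_size") # prints 5
puts event.get("extra_size")       # prints 8
```

This works while every event takes the same path, but the field list is still hard-coded inside the pipeline, which is exactly what becomes awkward once different branches need different fields.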

Growing complexity of the Ruby code

Suppose now that you need a bit more than just the size of the field. To protect yourself from bad data being ingested, you want to ensure the field is there, and tag the event if it's missing. Oh, and by the way, it would be cool to add a field with a qualitative notion of how big the string is. Oh, and also drop events of a certain size class. Our two-line script has grown to something like this:

input { # ... }
filter {
  ruby {
    code => 'if event.get("description").nil?
               event.tag("description_not_found")
             else
               size = event.get("description").size
               event.set("description_size", size)
               size_class = if size >= 50000
                 "tremendous"
               elsif size >= 200
                 "sizeable"
               elsif size >= 1
                 "tiny"
               end
               event.set("description_size_class", size_class)
               event.cancel if size_class == "tremendous"
             end'
  }
}
output { # ... }

Looking at the code, one must wonder if it actually works. It probably does. Probably.
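One way to find out is to run the same logic outside Logstash. The sketch below assumes a hypothetical `StubEvent` class that mimics only the `get`, `set`, `tag` and `cancel` methods used above; it is not the real Logstash event API, and `classify_description` and `cancelled?` are illustrative names.

```ruby
# Hypothetical stand-in for a Logstash event (not the real class).
class StubEvent
  def initialize(data = {})
    @data = data.dup
    @cancelled = false
  end

  def get(field)
    @data[field]
  end

  def set(field, value)
    @data[field] = value
  end

  # Append a tag to the "tags" field, creating it if needed.
  def tag(value)
    (@data["tags"] ||= []) << value
  end

  def cancel
    @cancelled = true
  end

  def cancelled?
    @cancelled
  end
end

# The inline filter body, wrapped in a method so it can be called directly.
def classify_description(event)
  description = event.get("description")
  if description.nil?
    event.tag("description_not_found")
    return event
  end

  size = description.size
  event.set("description_size", size)
  size_class = if size >= 50000
    "tremendous"
  elsif size >= 200
    "sizeable"
  elsif size >= 1
    "tiny"
  end
  event.set("description_size_class", size_class)
  event.cancel if size_class == "tremendous"
  event
end

missing = classify_description(StubEvent.new("extra" => "x"))
puts missing.get("tags").inspect # prints ["description_not_found"]

small = classify_description(StubEvent.new("description" => "hello"))
puts small.get("description_size_class") # prints tiny

huge = classify_description(StubEvent.new("description" => "a" * 50000))
puts huge.cancelled? # prints true
```

Running it also surfaces an edge case that is easy to miss inline: an empty string has size 0, falls through every branch, and ends up with a `nil` size class.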

Mixing reuse of code and growing complexity

Combining the complexity increase and reuse, it's easy to see how one can end up with a block of 30 lines of Ruby code repeated 7 times in a Logstash pipeline. Thankfully, there's a solution to both of these problems in version 3.1.0 of the ruby filter plugin: file-based scripting.

File-Based Scripting

This new feature of the ruby filter allows you to reduce the large blocks of inline Ruby code shown above to the following:

filter {
  ruby {
    path => "/etc/logstash/compute_string_size.rb"
    script_params => {
      "source_field" => "description"
      "sizes" => {
        50000 => "tremendous"
        200 => "sizeable"
        1 => "tiny"
      }
      "drop_size" => "tremendous"
    }
  }
}

So what's in the "compute_string_size.rb" file? The inline logic we had before, but with a few tweaks to make it more generic, plus actual tests:

# register accepts the hashmap passed to "script_params"
# it runs once at startup
def register(params)
  @source_field = params["source_field"]
  @sizes = params["sizes"]
  @drop_size = params["drop_size"]
end

# filter runs for every event
# return the list of events to be passed forward
# returning empty list is equivalent to event.cancel
def filter(event)
  # tag if field isn't present
  if event.get(@source_field).nil?
    event.tag("#{@source_field}_not_found")
    return [event]
  end

  # set string size
  size = event.get(@source_field).size
  event.set("#{@source_field}_size", size)
  
  # calculate and tag size class
  return [event] unless @sizes
  size_class = size_class(size)
  event.set("#{@source_field}_size_class", size_class)

  # drop if it's the right size class
  size_class == @drop_size ? [] : [event]
end

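# it's possible to have auxiliary methods
# @sizes must be ordered from the largest lower bound to the smallest
def size_class(size)
  @sizes.each do |lower_bound, size_class|
    return size_class if size >= lower_bound.to_i
  end
  nil # no class matched; without this, each would return @sizes itself
end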

# testing!!
test "when field exists" do
  parameters { { "source_field" => "field_A" } }
  in_event { { "field_A" => "hello" } }
  expect("the size is computed") {|events| events.first.get("field_A_size") == 5 }
end

test "when field doesn't exist" do
  parameters { { "source_field" => "field_A" } }
  in_event { { "field_B" => "hello" } }
  expect("tags as not found") {|events| events.first.get("tags").include?("field_A_not_found") }
end

test "when drop size is set" do
  parameters do
    { "source_field" => "field_A",
      "sizes" => { 50 => "big", 5 => "medium", 1 => "small" },
      "drop_size" => "medium" }
  end
  in_event { { "field_A" => "a kind of medium sized string" } }
  expect("drops events of a certain size class") {|events| events.empty? }
end
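Note that the `size_class` helper in the script above returns the first entry whose lower bound is satisfied, so it depends on the `sizes` hash being written from the largest bound to the smallest. A minimal sketch of an order-independent lookup follows; `lookup_size_class` is a hypothetical helper name, and the `sizes` argument is assumed to have the same shape as in the pipeline configuration above.

```ruby
# Hypothetical helper: pick the label with the largest lower bound that
# does not exceed size, regardless of the order the hash was written in.
def lookup_size_class(sizes, size)
  # to_i mirrors the original script, in case keys arrive as strings.
  candidates = sizes.map { |lower_bound, label| [lower_bound.to_i, label] }
  reachable = candidates.select { |lower_bound, _label| size >= lower_bound }
  best = reachable.max_by { |lower_bound, _label| lower_bound }
  best && best.last # nil when size is below every bound
end

sizes = { 1 => "tiny", 50000 => "tremendous", 200 => "sizeable" }
puts lookup_size_class(sizes, 250)       # prints sizeable
puts lookup_size_class(sizes, 0).inspect # prints nil
```

The trade-off is a little extra work per event in exchange for not having to remember an ordering rule in every pipeline that uses the script.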

You can find more information about building these scripts in the documentation. Even though the code is lengthier than its inline counterpart, it's easy to see the benefits of this approach:

Reusability

Thinking back to the previous example of computing the string size of multiple fields, the source field, list of size classes and which class to drop are now configurable. This means we can reuse this script in multiple places, rewriting the dual usage example as:

input { # ... }
filter {
  ruby {
    path => "/etc/logstash/compute_string_size.rb"
    script_params => { "source_field" => "description" }
  }
  ruby {
    path => "/etc/logstash/compute_string_size.rb"
    script_params => { "source_field" => "extra" }
  }
}
output { # ... }

Testing

We now have tests that check that the calculation of the string size, the tagging when the field is missing and the usage of the size classes all work correctly. We can confirm the tests pass by using Logstash's "test and exit" flag:

% bin/logstash -e "filter { ruby { path => 'compute_string_size.rb' } }" -t
[2018-01-23T23:56:01,805][INFO ][logstash.filters.ruby.script] Test run complete {:script_path=>"compute_string_size.rb", :results=>{:passed=>4, :failed=>0, :errored=>0}}
Configuration OK
[2018-01-23T23:56:01,819][INFO ][logstash.runner          ] Using config.test_and_exit mode. Config Validation Result: OK. Exiting Logstash

Sharing

That's right! Now that we've encapsulated the sizing of strings in a neat little ruby file, we can share it! I opened a repository to collect scripts I've created so far: https://github.com/jsvd/logstash-filter-ruby-scripts, and I welcome scripts created by others!

Limitations

Even though file-based scripting in the ruby filter provides code reuse and testability, there are still limitations compared to full-blown Logstash plugins. The main one is lack of support for third-party libraries: if you need the custom Ruby code to use a library that doesn't ship with Logstash, it's not possible to load it through the ruby filter.

If you're planning to use, or are already using, centralized pipeline management, there is currently no way to push other artifacts, such as these Ruby scripts, through the management UI. This is being tracked and addressed, but it is a limitation at the time of writing.

Conclusion

The new Logstash ruby filter 3.1.x provides users with a way to remove the inline Ruby code from their pipeline configs. Inline code is typically hard to maintain and hard to validate for correctness. The file-based scripting feature addresses both problems by allowing scripts to be parametrized and tests to be written in the same file as the script itself.

This feature is available by default in Logstash since version 6.1. On earlier versions, you can update the ruby plugin by running `bin/logstash-plugin update logstash-filter-ruby`.