08 October 2015

Logstash configuration tuning

By Robin Clarke

Logstash is a powerful beast and when it’s firing on all cylinders to crunch data, it can use a lot of resources. The goal of this blog post is to provide a methodology to optimise your configuration and allow Logstash to get the most out of your hardware.

As always, there is no golden rule for optimisation that works for everybody. The best settings are heavily influenced by

  • your data (size and complexity of the documents, mapping used, etc.)
  • your hardware (CPUs, memory, disks, over-allocation, etc.)

… which is why your mileage may vary, and it is always best to determine the right configuration for your data yourself.


I have split this up into six steps, which should be carried out in order for best results. Later on you will be able to repeat individual steps to optimise individual sections.

For your test setup, it is important that Logstash has the full resources of the machine available and is not sharing them (e.g. with a Redis or Elasticsearch node running on the same machine, or as a virtual machine on an over-allocated host), otherwise your test results may vary greatly from one run to the next.


Step 1: the sample data

To be able to tell if any change to the system or configuration has had a positive (or negative) impact on your throughput, you must be able to run repeatable tests, and that means sample data. You need data which is representative of what your production system will see, and a volume of data which allows the system to warm up. You will have to do some testing to see how much is enough, but you can be sure that 10 documents will not give any meaningful results. Try starting off with 1GB of data and see how far that gets you.


Step 2: metrics

The Logstash metrics filter plugin is your friend here.  Add this to your configuration to collect metrics:

filter {
    metrics {
        meter => "documents"
        add_tag => "metric"
        flush_interval => 60
    }
}

And this to output the metrics:

output {
    if "metric" in [tags] {
        stdout {
            codec => line {
                format => "1m rate: %{documents.rate_1m} ( %{documents.count} )"
            }
        }
    }
}

Of course, as with any system… measuring it will influence the system being measured, but the metrics plugin is very light, and the additional load should be linear and not be influenced by the size/complexity of your data.

Assuming you already have a complete Logstash configuration which you want to optimise, this would be a good time to take a baseline measurement of what your current documents/minute rate is so we can see at the end how much improvement you actually made.


Step 3: Optimising filters

To optimise your filters you need to ensure that your inputs and outputs are as fast as possible: use the file input to read from your sample data, and send all the documents to the null output (the Logstash equivalent of /dev/null) to ensure that there is no bottleneck with reading or outputting your documents.

input {
    file {
        path => [ "/path/to/your.log" ]
        start_position  => "beginning"
        sincedb_path => "/dev/null"
    }
}
...
output {
    null{}
}

Note the start_position and sincedb_path settings: together they allow you to read the file from the beginning every time you start Logstash without having to delete sincedb files.


The first thing to note here is that the Logstash command line argument --filterworkers (or -w) only influences the number of worker threads for filters and has no influence whatsoever on the inputs or outputs. The current default value for filter workers is 1 (this may be changed in future releases), so set --filterworkers to the number of cores that you have on your machine to make sure you have all the resources available for your filters.

If you changed this setting, take a new baseline measurement now, so that you can tell the effect of this change apart from the effect of the filter changes that follow.


You will probably see a warm-up phase at the beginning, but it should soon settle to a steady rate with little variation (assuming your documents are relatively homogeneous).  After 5 minutes you should expect it to be warmed up and displaying representative rates.
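One way to judge when the warm-up is over is to print a longer-window rate next to the 1-minute rate and wait until the two converge. The sketch below is a variation of the Step 2 output. It assumes that your version of the metrics filter also emits a documents.rate_5m field alongside documents.rate_1m, and that it uses the same dotted field naming as the Step 2 example; check both against your plugin version.

```logstash
output {
    if "metric" in [tags] {
        stdout {
            codec => line {
                # rate_5m is assumed to exist next to rate_1m; once the two
                # values are close to each other the rate has settled
                format => "1m rate: %{documents.rate_1m} 5m rate: %{documents.rate_5m} ( %{documents.count} )"
            }
        }
    }
}
```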

The following are some examples I tried, to illustrate the method and some results. Please note that the results I got may differ greatly from those you get with your data. Do not assume that a configuration change which increased throughput on my hardware with my data will do the same for yours: you must test it yourself.


Conditionals

Is it more efficient to do two boolean comparisons

if [useragent][name] == "Opera" or [useragent][name] == "Chrome" {
   drop {}
}

or a compact regular expression?

if [useragent][name] =~ /^(?:Opera|Chrome)$/ {
    drop {}
}

In my tests, the first gets a rate of 2600 documents/minute and the regex only manages 2000. As expected, regular expressions are more powerful, but also more expensive.

Counter example: if your boolean comparison has 100 “or” clauses, but you can solve the same problem with a very simple regex, you may find the regex to be the faster solution.
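A third variant worth putting through the same test is the in operator of Logstash conditionals, which checks a field against a list of values. This is a sketch only: it was not part of the tests above, so there is no rate to quote for it, and it may or may not beat either of the other versions on your data.

```logstash
# Drop the event if the user agent name equals any value in the list
if [useragent][name] in ["Opera", "Chrome"] {
    drop {}
}
```

With a long list of values this keeps the condition readable, which is a reason to try it even if the rate turns out to be the same.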

Type casting outside of grok

Which is more efficient?

grok {
    match => {
        "message" => "CustomerID%{INT:CustomerID:int}"
    }
    # tag_on_failure is an option of grok itself, not a key of match
    tag_on_failure => []
}


or

grok {
    match => {
        "message" => "CustomerID%{INT:CustomerID}"
    }
    # tag_on_failure is an option of grok itself, not a key of match
    tag_on_failure => []
}
mutate {
    convert => { "CustomerID" => "integer" }
}

In my tests, the first gets a rate of 17500 documents/minute while the second manages only 14000, so yes: for my data it pays to cast types within grok where you can.


Conditional before grok or not

From the syslog logs I have on my machine I am interested (who knows why…) in capturing the IP address and renewal time from this dhclient line:

Sep  9 07:50:43 es-rclarke dhclient: bound to 10.10.10.89 -- renewal in 426936 seconds.

In my sample file, dhclient entries make up only 3% of all the log entries, and the renewal entries make up only 8% of those…

The lazy way to write it would be:

filter {
    grok {
        match => {
            "message" => "%{SYSLOGLINE}"
        }
        overwrite => [ "message" ]
    }
    grok {
        match => {
            "message" => "bound to %{IPV4:[dhclient][address]} -- renewal in %{INT:[dhclient][renewal]:int} seconds."
        }
        tag_on_failure => []
    }
}

But how much performance would we gain by running the second grok only on the dhclient messages?

filter {
    grok {
        match => {
            "message" => "%{SYSLOGLINE}"
        }
        overwrite => [ "message" ]
    }
    if [program] == "dhclient" {
        grok {
            match => {
                "message" => "bound to %{IPV4:[dhclient][address]} -- renewal in %{INT:[dhclient][renewal]:int} seconds."
            }
            tag_on_failure => []
        }
    }
}

In this example, surprisingly little: 12100 documents/minute without the conditional, and 14000 with it.


But the point is…

Test, test, test!  Keep on reviewing your filters, think of ways to optimise, and if you have an idea put it to the test.  If someone tells you “always do XYZ in your Logstash configurations - it’s faster!”, test it with your data on your hardware to be sure it is better for you.  You will be surprised how much performance you can gain by optimising your filters, and how that translates to less hardware being purchased.


Step 4: Optimising inputs

To optimise your inputs you need to remove all filters (with the exception of the metrics filter), and again send all the documents to the null output to ensure that there is no bottleneck with processing or outputting your documents.

The intention here is to identify the (theoretical) maximum throughput of Logstash on your system with this input, and moreover to put your input source under load to discover its weak spots.  

As with the filter optimisation, try different settings and compare the results. For example, try changing the number of threads if your input has this setting (you will probably get the best results if you set this to the number of cores on the machine, but never more than that). And of course monitor your source to ensure that the limiting factor is not Logstash.

If possible, put your input source (e.g. a Redis server) on separate hardware from your Logstash, otherwise they could both be contending for resources and not give you repeatable or reliable results.
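As a sketch of such an input-only test, the configuration below reads from a Redis list with a configurable number of threads, keeps only the metrics filter, and discards everything except the metric events. The host name and list key are hypothetical placeholders, and the thread count of 4 is just a starting value to vary between runs.

```logstash
input {
    redis {
        host => "redis.example.com"   # hypothetical host, on separate hardware
        data_type => "list"
        key => "logstash"             # hypothetical list key
        threads => 4                  # vary between runs, up to the number of cores
    }
}
filter {
    # no filters other than metrics
    metrics {
        meter => "documents"
        add_tag => "metric"
        flush_interval => 60
    }
}
output {
    if "metric" in [tags] {
        stdout {
            codec => line {
                format => "1m rate: %{documents.rate_1m} ( %{documents.count} )"
            }
        }
    } else {
        # discard the documents themselves
        null {}
    }
}
```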


Step 5: Testing outputs

To optimise your output, it is best to prepare an input file by reading in from your source, applying all the filters you want to have in production, and writing out to a file using the json_lines codec.  You can then use this file as an input, again with no filters other than the metrics filter, and output to your target, so that no load from filters influences your tests.  Ensure you have enough data: any test involving Elasticsearch should run long enough for it to warm up, fill its caches and have garbage collection running normally (probably about 30 minutes, but check your Marvel stats to be sure).
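The two phases could look something like the following sketch. All paths are hypothetical placeholders. The first configuration is run once to capture fully filtered events:

```logstash
# prepare.conf (hypothetical name): write filtered events out as JSON lines
input {
    file {
        path => [ "/path/to/your.log" ]
        start_position => "beginning"
        sincedb_path => "/dev/null"
    }
}
filter {
    # ... all the filters you want to have in production ...
}
output {
    file {
        path => "/path/to/filtered.json"
        codec => json_lines
    }
}
```

The second configuration replays that file into the output under test. Note the assumption here: because the file input already splits its input on newlines, this sketch decodes each line with the json codec rather than json_lines.

```logstash
# replay.conf (hypothetical name): read the prepared events back in
input {
    file {
        path => [ "/path/to/filtered.json" ]
        start_position => "beginning"
        sincedb_path => "/dev/null"
        codec => json
    }
}
filter {
    metrics {
        meter => "documents"
        add_tag => "metric"
        flush_interval => 60
    }
}
output {
    # ... metric output from Step 2, plus the output you want to test ...
}
```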

Again, as with optimising the inputs and filters, modify the configuration options on your output (most notably the “workers” option on the Elasticsearch output, which will probably be best at the number of cores your machine has) and determine the best configuration, all the while monitoring your output target (e.g. the Elasticsearch cluster) to ensure that it is not the bottleneck.  As with the input optimisation, it is important that your output target is not on the same hardware as your Logstash agent, otherwise they could be contending for resources.
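A minimal sketch of an output section for these runs is shown below. The connection settings are deliberately left out, since they depend on your cluster and on the version of the Elasticsearch output you are running, and the value of 4 is only a starting point to vary from one run to the next.

```logstash
output {
    if "metric" in [tags] {
        # keep the metric events out of the cluster and print them instead
        stdout {
            codec => line {
                format => "1m rate: %{documents.rate_1m} ( %{documents.count} )"
            }
        }
    } else {
        elasticsearch {
            # connection settings omitted: add the ones for your cluster here
            workers => 4   # vary between runs, up to the number of cores
        }
    }
}
```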


Step 6: Putting it all together

Now that you have optimised your input, filters and output it is time to put your configuration together.  I usually find it easier to manage if I keep all of my Logstash configuration files in one directory, and name them so that they will be read in the correct order by the Logstash agent, e.g.

├── 100-input.conf
├── 200-metrics.conf
├── 500-filter.conf
└── 900-output.conf

It is important to note that while you can have input, filter and output sections in any order within your files, Logstash will first concatenate all the files in alphanumeric filename order.  That is, if you want one filter to be applied before another, you must name its file accordingly.
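As an illustration, 200-metrics.conf from the listing above could hold both the metrics filter and its output from Step 2 in a single file. This is a sketch of one possible layout, not a required one. Because of the name order, the metrics filter here would be applied before anything in 500-filter.conf.

```logstash
# 200-metrics.conf: a filter section and an output section in one file
filter {
    metrics {
        meter => "documents"
        add_tag => "metric"
        flush_interval => 60
    }
}
output {
    if "metric" in [tags] {
        stdout {
            codec => line {
                format => "1m rate: %{documents.rate_1m} ( %{documents.count} )"
            }
        }
    }
}
```

With this layout the outputs in 900-output.conf will also see the metric events, so you may want to wrap them in a conditional on the "metric" tag if those events should not reach your production target.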