
Using parallel Logstash pipelines to improve persistent queue throughput

By default, Logstash uses in-memory bounded queues between pipeline stages (inputs → pipeline workers) to buffer events. However, to protect against data loss during abnormal termination, Logstash has a persistent queue feature that can be enabled to store the message queue on disk. The queue sits between the input and filter stages as follows:

input → persistent queue → filter + output
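
Persistent queues are enabled on a per-pipeline basis via the queue.type setting. As a minimal sketch, the queue could be enabled for all pipelines in logstash.yml as shown below; the path and size values are illustrative, not recommendations:

logstash.yml:

queue.type: persisted
path.queue: "/path/to/queue"   # illustrative path; defaults to path.data/queue
queue.max_bytes: 4gb           # illustrative value; default is 1024mb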

According to the persistent queue blog post, Logstash persistent queues should have a small impact on overall throughput. While this is generally true for use cases where the pipeline is CPU bound, there are other cases where persistent queues may cause a substantial drop in Logstash performance. In this blog post, I discuss when this can happen and describe a technique for reducing the performance impact of enabling Logstash persistent queues.

Motivation for this blog

In one recent consulting engagement, persistent queues caused a slowdown of about 75%, from 40K events/s down to 10K events/s. Investigation showed that neither the disks nor the CPU were saturated, and standard Logstash tuning techniques, such as testing different batch sizes and adding more worker threads, were unable to remedy the slowdown.

Why persistent queues may impact Logstash performance

If the persistent queue is enabled for a pipeline, then Logstash runs a single-threaded persistent queue for that pipeline; the persistent queue does not run across multiple threads within a single pipeline. To put this another way, a single pipeline can only drive the disk with a single thread. This is true even if a pipeline has multiple inputs, as additional inputs in a single pipeline do not add disk I/O threads.

Because enabling the persistent queue adds synchronous disk I/O (which adds wait time) into the Logstash pipeline, the persistent queue may reduce throughput even if none of the resources on the associated server are saturated.

Solution for improving overall performance

If Logstash is unable to saturate the disks, then throughput may be limited by the wait time caused by synchronous disk I/O. In this case, additional persistent queue threads running in parallel may be able to drive the disk harder, which should increase the overall throughput.

This can be accomplished by running multiple (identical) Logstash pipelines in parallel within a single Logstash process, and then load balancing the input data stream across the pipelines. For example, if we are using Filebeat as the input source for Logstash, then load balancing across multiple Logstash pipelines can be done by specifying multiple Logstash outputs in Filebeat.

Below we provide an example of a standard (non-optimized) Logstash persistent queue implementation, followed by an improved implementation that consists of two pipelines running in parallel.

A simple Logstash pipeline

An example of a single Logstash pipeline receiving data from Filebeat is given below. As noted above, since this is just a single pipeline, its persistent queue is single threaded and therefore its throughput may be limited by the wait time caused by synchronous disk I/O.

Logstash pipeline:

input {
  beats {
    port => 5044
  }
}

filter {
  # Custom filters go here
}

output {
  elasticsearch {
    hosts => ["http://<elasticsearch hostname>:9200"]
    index => "<index name>" 
  }
}

An example Filebeat configuration that could drive data into the above Logstash pipeline is given below:

Filebeat configuration:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/*.log

output.logstash:
  hosts: ["<logstash hostname>:5044"]

Running multiple Logstash pipelines in parallel

In order to increase the total number of persistent queue threads driving the disk, we could run multiple Logstash pipelines in a single Logstash process. For example, Logstash could execute the following two pipelines in parallel:

Pipeline #1:

input {
  beats {
    port => 5044
  }
}
filter {
  # Custom filters go here
}
output {
  elasticsearch {
    hosts => ["http://<elasticsearch hostname>:9200"]
    index => "<index name>" 
  }
}

Pipeline #2:

input {
  beats {
    port => 5045 # Different from Pipeline #1
  }
}
filter {
  # Same as Pipeline #1
}
output { 
  # Same as Pipeline #1
}
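
To run both pipelines in the same Logstash process, they would be declared in Logstash's pipelines.yml. A minimal sketch is shown below; the pipeline IDs and configuration file paths are illustrative, and queue.type is set per pipeline to enable the persistent queue:

pipelines.yml:

- pipeline.id: beats_5044
  path.config: "/etc/logstash/conf.d/pipeline_5044.conf"   # contains Pipeline #1
  queue.type: persisted
- pipeline.id: beats_5045
  path.config: "/etc/logstash/conf.d/pipeline_5045.conf"   # contains Pipeline #2
  queue.type: persisted

Each pipeline gets its own persistent queue on disk, which is what allows the parallel pipelines to drive the disk with multiple threads.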

Given the above Logstash pipelines, Filebeat can be configured to load balance between them. If our Logstash instance is running as just described, with Pipeline #1 listening on port 5044 and Pipeline #2 listening on port 5045, then our Filebeat instances could be configured to balance across these two Logstash pipelines as follows:

Filebeat configuration:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/*.log

output.logstash:
  hosts: ["<logstash hostname>:5044", "<logstash hostname>:5045"]
  loadbalance: true

Result

In the aforementioned consulting engagement where enabling persistent queues initially caused a performance drop of 75%, we created four identical Logstash pipelines running in parallel and balanced the Filebeat output across these four pipelines. This increased performance to 30K events/s, only 25% below the throughput without persistent queues. At this point the disks were saturated (as desired), and no further performance improvements were possible.

Conclusion

The single-threaded nature of disk I/O in Logstash persistent queues may cause performance to drop even if the underlying server’s resources are not saturated. To remedy this issue, multiple identical Logstash pipelines can be executed in parallel, and input data can be balanced across these pipelines. This increases the number of parallel threads that can simultaneously write to disk, which may improve Logstash’s overall throughput. Give it a try in your own cluster, or spin up a 14-day free trial of Elasticsearch Service, and if you have any questions, reach out in our Discuss forums.