Announcing Rally: Our benchmarking tool for Elasticsearch

Today we're excited to announce the first public release of Rally, the benchmarking tool we have been using internally in the Elasticsearch development team for a few months now.

We want to share it with the community to help you reproduce, in your own environment, the performance numbers that we publish, and to help you write your own benchmarks without worrying about lots of tiny details.

Rally's origins

Rally originates from Python scripts that drive the nightly Elasticsearch benchmarks and also Lucene's nightly benchmarks.

Elasticsearch Classic Benchmarks

The benchmarking infrastructure is a great help to avoid boiling frog problems (a.k.a. slowly decreasing performance). We are constantly improving Elasticsearch in all areas and we also care deeply about performance. Wouldn't it be great if developers could run a benchmark by themselves to see the impact of their changes early during development instead of waiting until a feature is merged to the master branch?

Developers are typically creative, so you can always write a quick and dirty script in the language of your choice, run it for your specific use case and forget about it. But how often do we really verify these numbers?

So I tried the next best thing, which was to use the existing benchmark scripts locally, but the setup involved lots of manual steps. I decided to simplify installation and usage, and Rally was born.

What can Rally do?

Over the last few months, Rally has gradually gained more and more features:

We can attach so-called telemetry devices for detailed analysis of the benchmark candidate behavior. For example, Java flight recorder has already helped us to spot different problems. Here are a few examples of what you can do with Rally and the Java flight recorder telemetry device:

Allocation Profiling

Allocation Profiling with Java Mission Control

Inspecting hot classes in Elasticsearch

Inspecting hot classes in Elasticsearch with Java Mission Control

We have also added a JIT compiler telemetry device that lets us inspect JIT compiler behavior, which allows us, for example, to analyze warm-up times during the benchmark. The graphic below shows the number of JIT compiler events during the benchmark:

JIT compilations over time

Evaluating the performance of a system as complex as Elasticsearch is also a multi-dimensional problem. A change could improve performance in one scenario, say searching log data, and still have a negative impact on full-text search. Therefore, we can define multiple benchmarks (called "tracks" in Rally). They are currently implemented directly in the Rally code base, but as the API becomes more stable, we want to separate tracks from Rally so it is easier to define your own.

As Rally stores all metrics data in Elasticsearch, we can easily visualize data with Kibana, such as the distribution of CPU usage during a benchmark:

CPU usage during a benchmark

In the beginning we add the benchmark data set to the index. After that we run search benchmarks. I bet you can clearly see the point in time where indexing is complete.

We have also started to run the nightly benchmarks in parallel and provide the results as a Kibana dashboard.


There is still a lot of work to do:

One major topic is to remove restrictions. We want to separate the benchmark definitions (called "tracks") from Rally itself and also allow more flexibility in the steps that Rally performs during the benchmark. Currently we support only a very limited scenario: first all documents are bulk-indexed, then we run a track-dependent number of queries. By default, we run a benchmark based on data from but further tracks are available (just issue esrally list tracks).

The second major topic is improving the correctness of the measurements. We take correctness seriously and are already aware of a couple of areas that need improvement. One of the major issues is that latency measurements suffer from coordinated omission. This basically means that requests that take a long time to process prevent the benchmark driver from sending further requests in the meantime, so we lose measurement samples. As a result, the reported latency percentile distribution appears to be better than it actually is.
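To make the effect of coordinated omission more tangible, here is a minimal simulation sketch. It is not how Rally measures latency; the function name simulate, the workload numbers and the target interval are all made up for illustration. It models a single client that intends to send one request every interval seconds but has to wait until the previous request has returned, and it compares latency measured from the actual send time with latency measured from the intended send time.

```python
def simulate(service_times, interval):
    """Simulate one client that intends to send a request every `interval` seconds.

    Returns two lists of latencies in seconds:
    - naive: measured from the moment the request was actually sent
    - corrected: measured from the moment the request should have been sent
    """
    naive, corrected = [], []
    clock = 0.0  # simulated time at which the client is free again
    for i, service_time in enumerate(service_times):
        intended_start = i * interval
        # The client is blocked until the previous request has returned.
        actual_start = max(intended_start, clock)
        finish = actual_start + service_time
        naive.append(finish - actual_start)
        corrected.append(finish - intended_start)
        clock = finish
    return naive, corrected


# Hypothetical workload: one slow request (1.0 s) among fast ones (0.01 s),
# with a target rate of one request every 0.1 s.
service_times = [0.01, 1.0, 0.01, 0.01, 0.01]
naive, corrected = simulate(service_times, interval=0.1)

slow_naive = sum(1 for latency in naive if latency > 0.5)
slow_corrected = sum(1 for latency in corrected if latency > 0.5)
print(f"requests slower than 0.5 s (naive):     {slow_naive}")
print(f"requests slower than 0.5 s (corrected): {slow_corrected}")
```

In this made-up workload the naive measurement sees only one slow request, whereas measuring from the intended send time shows that the requests queued up behind it were delayed as well. That is the sense in which the reported distribution looks better than it actually is.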

Third: Rally is currently limited to single-machine benchmarks, but at some point we will allow multi-machine benchmarks, and early prototypes are already promising.

Running your first benchmark

After all this talking, let's see how easy it is to run a benchmark on your own machine.

Assuming that you have installed all prerequisites and started Rally's metrics store, it is a three-step process to get your first benchmark results:

pip3 install esrally
esrally configure
esrally --pipeline=from-distribution --distribution-version=5.0.0-alpha1

If you want to learn more about Rally, just head to Rally's project page, look at some issues or help us by contributing tracks or changes to Rally itself.

The image at the top of the post has been provided by Andrew Basterfield under the CC BY-SA license (original source). The image has been colorized differently than the original but is otherwise unaltered.