Elasticsearch Testing and QA: Increasing Coverage by Randomizing Test Runs

Welcome to the final installment in our three-part series on Testing and QA for Elasticsearch. In case you missed it, you may want to read Part 1: Elasticsearch Continuous Integration and Part 2: Testing Levels of Elasticsearch. Only have a few minutes right now? No worries. You can read each piece independently and still learn a lot from each individual article.

When writing tests, developers tend to have a certain picture in their minds of how the code is supposed to work. Based on that picture they craft tests, ideally checking not only the happy path but also boundary conditions and error cases.

Often, though, our perception of what the code should be doing is not in line with what is actually going on. Here's one famous example code snippet that shows unexpected behavior:

Math.abs(Integer.MIN_VALUE) < 0

In Java, there is no int value large enough to represent the absolute value of Integer.MIN_VALUE; Math.abs overflows and returns Integer.MIN_VALUE itself, so the expression above evaluates to true.
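A minimal, self-contained snippet demonstrating the overflow (plain Java, runnable as-is):

```java
public class AbsOverflow {
    public static void main(String[] args) {
        // |Integer.MIN_VALUE| is 2147483648, one more than Integer.MAX_VALUE,
        // so Math.abs overflows and returns its argument unchanged.
        System.out.println(Math.abs(Integer.MIN_VALUE));     // prints -2147483648
        System.out.println(Math.abs(Integer.MIN_VALUE) < 0); // prints true
    }
}
```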

Apart from such "surprising" behavior, developer inexperience with the problem domain can lead to overlooking otherwise obvious test cases. For example, engineers inexperienced with geographic coordinates might forget to double-check whether their code works at the North/South Pole or at the date line, leading to problems in production systems.

One approach used by the security industry to find issues in programs is to feed the program under test with all sorts of expected and, in particular, unexpected or even invalid random input. This technique is known as fuzz testing, or fuzzing.

  • One popular framework in this space is the multi-purpose fuzzer zzuf by Sam Hocevar.
  • For .NET projects Pex can be used to check whether code breaks unexpectedly when confronted with random values.

The term Random Testing was coined by Richard Hamlet in 1994 as one type of black-box testing. In the recent past, though, it has become increasingly popular to apply the same concept to white-box testing by replacing hard-coded test values with some automated way of generating valid input data:

  • Inspired by QuickCheck from the Haskell world, property-based testing found its way into the Scala library ScalaCheck.
  • In .NET the library AutoFixture can be used for that purpose.

When using pseudo-random input - based on pre-defined constraints - the test result can be checked for correctness in multiple ways:

  • Some test results are expensive to create but cheap to check. In the case of a sorting function, implementing the sort itself might be complex, but checking whether the result is actually sorted is easy.
  • Some test results can be checked by running a slower but much easier to implement algorithm.
  • When simply changing the runtime environment without randomizing the input to a method itself, standard checks can be applied.
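The first strategy - results that are expensive to create but cheap to check - can be sketched in plain Java. The `isSorted` helper below is a hypothetical illustration, not part of any framework:

```java
import java.util.Arrays;

public class SortOracle {
    // Cheap property check: verifying sortedness is a single O(n) pass,
    // regardless of how expensive the sort itself was to implement.
    static boolean isSorted(short[] a) {
        for (int i = 1; i < a.length; i++) {
            if (a[i - 1] > a[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        short[] data = {42, -7, 0, 13};
        Arrays.sort(data); // the (potentially complex) implementation under test
        System.out.println(isSorted(data)); // prints true
    }
}
```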

Several years ago the Lucene community added support for Randomized Testing to their unit test suite. In contrast to the approaches above, the idea here is not only to feed random input to the program under test, but also to create randomized runtime environments.

When initializing each test run with a new set of randomly chosen input parameters, it also makes sense to re-run one particular test multiple times, each time with a new set of input parameters. When running tests on a developer's workstation, the number of re-runs should be limited to keep testing time down: overly long-running test suites often aren't executed by developers after making changes. When running on a continuous integration server, though, the number of iterations can easily be increased to 50 or 100 to cover more ground, especially if the test itself is cheap in terms of runtime.

A simple randomization example that runs 100 iterations, each on an array of random length between 10 and 100 filled with random short values, would look like this:

01 @Test
02 @Repeat(iterations = 100)
03 public void testSorting() {
04     int length = randomIntBetween(10, 100);
05     short[] list = new short[length];
06     for (int i = 0; i < length; i++) {
07         list[i] = randomShort();
08     }
09     Arrays.sort(list);
10     assertTrue(isSorted(list));
11 }

Line 02 defines the number of iterations to run. Each run is initialized with a different test seed leading to different values being generated in the test. On failure, this test seed is provided to the user to allow for reproducing (and fixing) what went wrong deterministically.

Line 04 defines the length of our array to be a random value between a maximum and minimum boundary. The max and min values can be omitted to generate just any integer value. Line 07 uses this notation to generate short values.
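For illustration, both the inclusive-bounds helper and the seed mechanics can be approximated with java.util.Random. Note that `randomIntBetween` below is a sketch of what such a helper might do (assuming both endpoints are inclusive), not the framework's actual implementation:

```java
import java.util.Random;

public class SeededGeneration {
    // Sketch of an inclusive-bounds helper (assumption: both the minimum
    // and the maximum can be returned).
    static int randomIntBetween(Random rng, int min, int max) {
        return min + rng.nextInt(max - min + 1);
    }

    public static void main(String[] args) {
        long seed = System.nanoTime(); // a seed like the one reported on failure
        // Two generators built from the same seed produce identical sequences;
        // this is what makes a failing run reproducible.
        Random first = new Random(seed);
        Random second = new Random(seed);
        System.out.println(randomIntBetween(first, 10, 100)
                == randomIntBetween(second, 10, 100)); // prints true
    }
}
```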

This example shows only a very limited - but already powerful - subset of the functionality provided by the Carrot Search RandomizedTesting framework that is the basis for randomized testing in both Apache Lucene and Elasticsearch. Other types of input that can be generated include, but are not limited to, random Strings (optionally limited to Unicode or ASCII characters) and values of every primitive data type available in Java. The framework also checks for threads lingering around after test execution has completed.

In addition to initializing tests with a specific number of iterations, it is also possible to make sure one specific test seed is checked on each run. This way, seeds that once uncovered a bug can be included in every test run to guard against regressions. For more extensive examples see the Carrotsearch repo on GitHub.
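As a hedged sketch, pinning a seed might look as follows using the framework's @Seeds and @Seed annotations; the hex seed value and the test body are hypothetical placeholders, so check the Carrotsearch repo for the exact, current API:

```java
import com.carrotsearch.randomizedtesting.RandomizedTest;
import com.carrotsearch.randomizedtesting.annotations.Seed;
import com.carrotsearch.randomizedtesting.annotations.Seeds;
import org.junit.Test;

public class PinnedSeedTest extends RandomizedTest {
    @Test
    // Run once with a pinned seed that previously uncovered a bug
    // (the value here is a placeholder), plus once with a fresh
    // random seed on every run.
    @Seeds({ @Seed("DEADBEEF"), @Seed })
    public void testWithPinnedSeed() {
        int length = randomIntBetween(10, 100);
        assertTrue(length >= 10 && length <= 100);
    }
}
```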

Randomization at the Java Testing Level

Not only does Elasticsearch use the randomization framework above to make unit tests more interesting, it also takes our integration tests one step further. In our previous post, we saw how to write an Elasticsearch integration test. What was not - and could not be - shown in the example, but happens in the background, is that the configuration used to spin up an integration test cluster isn't static. Instead, basic configuration options like the number of master and data nodes, the number of replicas, the transport to use, etc. are randomized on each run. Again, an exact configuration can be reproduced deterministically by initializing the test in question with a specific test seed, e.g. on test failure.

If you are using the Elasticsearch Java Testing Framework, you will not only benefit from a great number of helper methods for cluster startup, search request creation, and search result checking; the same kind of randomization on cluster boot-up will also be applied to your test setup.

Randomization at the Continuous Integration Level

Our continuous integration setup looks pretty standard at first glance. There are multiple Jenkins-managed jobs checking that the software works on all supported hardware configurations, operating systems, and JDK versions. Hardware platforms range from small EC2 instances to rather beefy bare-metal machines hosted by Hetzner. Operating systems include various Linux flavors as well as Windows versions. The trick lies in the details: on each run, a supported JDK is chosen at random, and so are the JVM configuration options. The Elasticsearch server configuration itself is randomized as well.

JDK choice can be biased towards specific versions. We work closely with the Lucene community to make sure we are using - and recommending to users - JDK versions that are known to work well and to avoid known bugs; as a result, our tests are biased towards those versions. This biasing ensures that the time and resources available for testing Elasticsearch are distributed according to how often the underlying platforms are actually used in the wild.

Impact of Randomization on Development Culture

With automated testing and continuous integration in place, it has become common advice to make sure your tests are always green before checking in. There's even a Jenkins plugin that turns the goal of having a working build into a game heavily penalizing build breakage.

When introducing randomization to any project, the space searched for problematic code increases over time. Complex parts of the code tend to cause tests to become flaky, adding noise instead of signal. In addition, pieces of your software that put the underlying platform - in our case Apache Lucene and the JVM - under pressure are bound to reveal bugs in that platform over time.

Is this a bad thing? Of course not. As Mike puts it, "Your test cases should sometimes fail." However, it is paramount to establish a development culture where even complex issues that only rarely cause the build to break are fixed quickly, especially if they impact real-world user deployments. Believing that the issues uncovered are merely false positives which cannot occur in production often leads to log files full of "this should never happen" exception messages. Double-check - or better, triple-check - your conclusion that a failure is noise. If it really is noise, disable the test configuration in question to get the build back to a stable state. Fail to do so, and your development teams will slow down in reacting to test failures, or just ignore them entirely.