How Kenna Security Speeds Up Elasticsearch Indexing at Scale - Part 1

Everyone wants their Elasticsearch cluster to index and search faster, but optimizing both at scale can take planning. In 2015, Kenna's cluster held 500 million documents, a million of which were processed every day. At the time, our poorly configured Elasticsearch cluster was the least stable piece of our infrastructure and could barely keep up as our data size grew. Today, our cluster holds 4 billion documents and we process over 200 million of them a day, with ease. Building a cluster to meet all of our indexing and searching demands was not easy. But with a lot of persistence and a few "OH CRAP!" moments, we got it done and learned a hell of a lot along the way. In this two part blog, I will share the techniques we used while building our Elasticsearch cluster at Kenna to get us to where we are today. However, before I dive into all the Elasticsearch fun, I first want to tell you a little bit about Kenna.

What is Kenna Security?

Kenna helps Fortune 500 companies manage their cybersecurity risk. The average company has 60 thousand assets. An asset is basically anything with an IP address. The average company also has 24 million vulnerabilities. A vulnerability is how you can hack an asset. With all this data it can be extremely difficult for companies to know what they need to focus on and fix first. That's where Kenna comes in. At Kenna, we take all of that data and we run it through our proprietary algorithms. Those algorithms then tell our clients which vulnerabilities pose the biggest risk to their infrastructure so they know what they need to fix first.

We initially store all of this data in MySQL, which is our source of truth. From there, we index the asset and vulnerability data into Elasticsearch:

MySQL to Elasticsearch.png

Elasticsearch is what allows our clients to really organize their data in any way they need, making search speed a top priority. At the same time, data is constantly changing in MySQL, so indexing is also extremely important in order to keep Elasticsearch insync with it.

Part 1: Speeding up Indexing at Scale

Let's start with how we were able to scale our indexing capacity. For our use case, we set a great refresh interval from the beginning, one of the best things we did early on in running Elasticsearch.

Refresh Interval

By default, Elasticsearch uses a one-second refresh interval. Refreshing an index takes up considerable resources, which takes away from other resources you could use for indexing. One of the easiest ways to speed up indexing is to increase your refresh interval. By increasing it you will decrease the number of refreshes an index does and thus free up resources for indexing. We have ours set at 30 seconds.

PUT /my_index/_settings 
    "index" : {
       "refresh_interval": "30s"

Client data updates are handled by background jobs, so waiting an extra 30 seconds for data to be searchable is not a big deal. If 30 seconds is too long for your use case, try 10 or 20. Anything greater than one will get you indexing gains.

Once we had our refresh interval set it was smooth sailing — until we got our first big client. This client had 200,000 assets and over 100 million vulnerabilities. After getting their data into MySQL, we started to index it into Elasticsearch but it was slow going. After a couple days of indexing we had to tell this client that it was going to take two weeks to index all of their data. Obviously, this wasn't going to work long term. The solution was bulk processing.

Bulk Processing

We found that indexing 1000 documents in each bulk request gave us the biggest performance boost. The size you want to make your batches will depend on your document size. For tips on finding your optimal batch size, check out Elastic's suggestions for bulk processing.

Indexing documents individually vs. Indexing in a single batch using a bulk request

Bulk processing alone got us through a solid year of growth. Then, this past year, MySQL got an overhaul and suddenly Elasticsearch couldn't keep up. During peak indexing periods, node CPU would max out and we started getting a bunch of 429 (TOO_MANY_REQUESTS) errors. This is when we became aware of threads and the role they play in indexing.

Route Your Documents

When you are indexing a set of documents the number of threads needed to complete the request depends on how many shards on which those documents belong. Let's look at two batches of documents:

It's going to take four threads to index each batch because each one needs to talk to four shards. An easy way to decrease the number of threads you need for each request is to group your documents by shard. Going back to our example, if you group the documents by shard you can cut the number of threads needed to execute the request in half.

Seems easy, right? Except, how do you know what shard a document belongs on? The answer is routing. When Elasticsearch is determining what shard to put a document on it uses this formula:

The routing value will default to the document _id or you can set it yourself. If you set it yourself, then you can use it when you are indexing documents. Let's look at that example again. Since our route value corresponds with the shard a document belongs on, we can use that to group our documents and reduce the number of threads needed to execute each request.

At Kenna, we have a parent/child relationship between assets and vulnerabilities. This means the parent, or asset_id, is used for routing. When we are indexing vulnerabilities, we group them by asset_id to reduce the number of threads needed to fulfill each request.

Vulnerabilities by Asset

Grouping documents by their routes has allowed us to ramp up our indexing considerably while keeping the cluster happy.

Recap: Speeding up Indexing at Scale

  1. Toggle your refresh interval
  2. Bulk process documents
  3. Route your documents

Planning Ahead

These techniques have gotten Kenna to the point where we can process over 200 million documents a day and still have room to grow. When your cluster is small, and you are not processing a lot of data, these small adjustments are easy to overlook. However, as you scale, I guarantee these techniques will be invaluable. Plan ahead and start implementing these indexing strategies now so you can avoid slow downs in the future.

Indexing speed is only half of the picture. At Kenna, search is also a top priority for our clients. Check out part two of this blog, where I will dive into all the ways Kenna was able to speed up its search while scaling its cluster.

Molly Struve (@molly_struve) is a Sr. Site Reliability Engineer at Kenna Security. She has been working at Kenna and with Elasticsearch for over 3 years. During that time, she helped lead the team charged with scaling Kenna's Elasticsearch cluster which now holds 4 billion documents and updates over 200 million of them a day. When she isn't wrangling Elasticsearch, she can be found fulfilling her need for speed by riding and jumping her show horses.