
# Efficient Duplicate Prevention for Event-Based Data in Elasticsearch

The Elastic Stack is used for many different use cases. One of the most common is storing and analyzing event or time-series data such as security events, logs, and metrics. These events often consist of data linked to a timestamp representing when the event occurred or was collected, and there is often no natural key available to uniquely identify the event.

For some use cases, and sometimes for specific types of data within a use case, it is important that the data in Elasticsearch is not duplicated: duplicate documents may lead to incorrect analysis and search errors. We started looking at this last year in the blog post introducing duplicate handling using Logstash, and in this one we will dig a bit deeper and address some common questions.

## Indexing into Elasticsearch

When you index data into Elasticsearch, you need to receive the response to be sure that the data has been successfully indexed. If an error, e.g. a connection error or node crash, prevents you from receiving it, you cannot be sure whether any of the data has been indexed. When clients encounter this type of scenario, the standard way to ensure delivery is to retry, which may lead to the same document being indexed more than once.

As described in the blog post on duplicate handling, it is possible to get around this by defining a unique ID for each document in the client rather than having Elasticsearch automatically assign one at indexing time. When a duplicate document is written to the same index, this will result in an update instead of the document being written a second time, which prevents duplicates.
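The difference between the two behaviours can be illustrated with a toy in-memory model. This is only a sketch: `ToyIndex` is a hypothetical stand-in for an index, not the Elasticsearch client, and the ID `event-0001` is made up for illustration.

```python
import itertools


class ToyIndex:
    """In-memory stand-in for an index: maps document ID to document."""

    def __init__(self):
        self.docs = {}
        self._auto_ids = itertools.count()

    def index(self, doc, doc_id=None):
        # Without a client-supplied ID, every request gets a fresh one,
        # so a retried request creates a second copy.
        if doc_id is None:
            doc_id = f"auto-{next(self._auto_ids)}"
        # With a client-supplied ID, a repeated write overwrites the same entry.
        self.docs[doc_id] = doc
        return doc_id


event = {"@timestamp": "2018-01-01T00:00:00Z", "message": "user login"}

auto = ToyIndex()
auto.index(event)
auto.index(event)  # retry after a lost response

explicit = ToyIndex()
explicit.index(event, doc_id="event-0001")
explicit.index(event, doc_id="event-0001")  # retry after a lost response

auto_count = len(auto.docs)
explicit_count = len(explicit.docs)
print(auto_count, explicit_count)
```

In this model the retried write with an automatically assigned ID leaves two copies, while the retried write with a client-defined ID leaves one.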

## UUIDs vs hash-based document ids

When deciding on what type of identifier to use, there are two main types to choose from.

Universally Unique Identifiers (UUIDs) are identifiers based on 128-bit numbers that can be generated across distributed systems, while for practical purposes being unique. This type of identifier generally has no dependence on the content of the event it is associated with.

In order to use UUIDs to avoid duplicates, it is essential that the UUID is generated and assigned to the event before the event passes any boundary that does not guarantee that the event is delivered exactly once. In practice, this often means that the UUID must be assigned at the point of origin. If the system where the event originates cannot generate a UUID, a different type of identifier may need to be used.
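A small sketch of why the assignment point matters. The functions `emit_event`, `at_least_once` and `assign_id_downstream` are hypothetical names used only for this illustration; the transport simply delivers every event twice to mimic a retry.

```python
import uuid


def emit_event(message):
    # Hypothetical origin: the ID is attached once, when the event is created.
    return {"id": str(uuid.uuid4()), "message": message}


def at_least_once(event, copies=2):
    # Hypothetical transport that may deliver the same event more than once.
    return [dict(event) for _ in range(copies)]


def assign_id_downstream(event):
    # Assigning the UUID after the unreliable hop: each copy gets its own ID.
    return {**event, "id": str(uuid.uuid4())}


# UUID assigned at the origin: both delivered copies carry the same ID.
delivered = at_least_once(emit_event("disk full"))
ids_from_origin = {e["id"] for e in delivered}

# UUID assigned after the unreliable hop: the copies can no longer be matched.
delivered_without_id = at_least_once({"message": "disk full"})
ids_from_downstream = {assign_id_downstream(e)["id"] for e in delivered_without_id}

print(len(ids_from_origin), len(ids_from_downstream))
```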

The other main type of identifier is one where a hash function is used to generate a numeric hash based on the content of the event. The hash function will always generate the same value for a specific piece of content, but the generated value is not guaranteed to be unique. The probability of a hash collision, which is when two different events result in the same hash value, depends on the number of events in the index as well as the type of hash function used and the length of the value it produces. A hash of at least 128 bits in length, e.g. MD5 (128 bits) or SHA1 (160 bits), generally provides a good compromise between length and low collision probability for a lot of scenarios. For an even lower collision probability, a longer hash like SHA256 can be used.
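To get a feel for how hash length and event count interact, the standard birthday-bound approximation can be used. The sketch below assumes the hash output behaves like uniformly random bits; the event count of one billion is an arbitrary figure chosen for illustration.

```python
import math


def collision_probability(num_events, hash_bits):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -num_events * (num_events - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(exponent)


# Approximate probability of at least one collision among a billion events.
for bits in (32, 64, 128, 256):
    print(bits, collision_probability(10**9, bits))
```

Under this approximation a 32-bit hash is practically certain to collide at that volume, while the estimate for 128 bits is vanishingly small.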

As a hash-based identifier depends on the content of the event, it is possible to assign it at a later processing stage, since the same value will be calculated wherever it is generated. This makes it possible to assign this type of ID at any point before the data is indexed into Elasticsearch, which allows for flexibility when designing an ingest pipeline.
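A minimal sketch of this property, using Python's `hashlib` rather than Logstash. The function name `content_id` and the sample log line are made up for illustration.

```python
import hashlib


def content_id(message):
    # MD5 is used here only as a fingerprint, not for security.
    return hashlib.md5(message.encode("utf-8")).hexdigest()


line = "2018-01-01T00:00:00Z host-1 sshd[42]: Accepted publickey for alice"

id_at_shipper = content_id(line)
id_at_indexer = content_id(line)      # recomputed later, e.g. after a retry
id_of_other = content_id(line + " ")  # any change in content changes the ID

print(id_at_shipper == id_at_indexer, id_at_shipper == id_of_other)
```

Note that the identifier is only stable if every stage hashes exactly the same bytes, so any field that is rewritten along the way should be hashed before it changes or left out of the hash input.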

Logstash has support for calculating UUIDs and a range of popular and common hash functions through the fingerprint filter plugin.

## Choosing an efficient document id

When Elasticsearch is allowed to assign the document identifier at indexing time, it can perform optimizations because it knows the generated identifier cannot already exist in the index. This improves indexing performance. For identifiers generated externally and passed in with the document, Elasticsearch must treat each write as a potential update and check whether the document identifier already exists in existing index segments, which requires additional work and is therefore slower.

Not all external document identifiers are created equal. Identifiers that gradually increase over time based on sorting order generally result in better indexing performance than completely random identifiers. The reason for this is that it is possible for Elasticsearch to quickly determine whether an identifier exists in older index segments based solely on the minimum and maximum identifier value in that segment rather than having to search through it. This is described in this blog post, which is still relevant even though it is getting a bit old.
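The effect of the minimum and maximum check can be sketched with a deliberately simplified model. This is not how Lucene stores or looks up identifiers; each "segment" here is reduced to the smallest and largest ID it contains, and the segment size and ID counts are arbitrary.

```python
import random


def build_segments(ids, segment_size):
    # Each "segment" is summarised by the smallest and largest ID it holds.
    chunks = [ids[i:i + segment_size] for i in range(0, len(ids), segment_size)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def segments_to_search(segments, new_id):
    # A segment can be skipped when new_id falls outside its [min, max] range.
    return sum(1 for low, high in segments if low <= new_id <= high)


rng = random.Random(0)
# Fixed-width hex strings sort in the same order as the numbers they encode.
increasing_ids = [f"{n:08x}" for n in range(10_000)]
random_ids = [f"{rng.getrandbits(32):08x}" for _ in range(10_000)]

increasing_segments = build_segments(increasing_ids, 1_000)
random_segments = build_segments(random_ids, 1_000)

next_increasing = f"{10_000:08x}"
next_random = f"{rng.getrandbits(32):08x}"

print(segments_to_search(increasing_segments, next_increasing))
print(segments_to_search(random_segments, next_random))
```

With increasing identifiers the new ID lies above every segment's maximum, so no segment needs a lookup. With random identifiers each segment's range covers almost the whole ID space, so most or all segments have to be checked.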

Hash-based identifiers and many types of UUIDs are generally random in nature. When dealing with a flow of events where each has a defined timestamp, we can use this timestamp as a prefix to the identifier to make them sortable and thereby increase indexing performance.

Creating an identifier prefixed by a timestamp also has the benefit of reducing the hash collision probability, as the hash value only has to be unique per timestamp. This makes it possible to use shorter hash values even in high ingest volume scenarios.
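A rough estimate of this effect, again using the birthday-bound approximation. The event volume, the number of distinct timestamps, the even spread of events across them, and the 64-bit hash length are all assumptions chosen for illustration.

```python
import math


def collision_probability(num_events, hash_bits):
    # Birthday-bound approximation for num_events random hash values.
    return -math.expm1(-num_events * (num_events - 1) / 2 ** (hash_bits + 1))


total_events = 10**9          # assumed volume, for illustration only
distinct_timestamps = 10**6   # assumed: events spread evenly across these
per_timestamp = total_events // distinct_timestamps
hash_bits = 64

# Hash alone: all events share one ID space.
p_hash_only = collision_probability(total_events, hash_bits)

# Timestamp prefix: only events with the same timestamp can collide.
# The union bound over all timestamps gives an upper estimate.
p_with_prefix = min(
    1.0, distinct_timestamps * collision_probability(per_timestamp, hash_bits)
)

print(p_hash_only, p_with_prefix)
```

Under these assumptions the estimate with a timestamp prefix is several orders of magnitude lower than for the hash alone, which is what leaves room for a shorter hash.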

We can create these types of identifiers in Logstash by using the fingerprint filter plugin to generate a UUID or hash and a Ruby filter to create a hex-encoded string representation of the timestamp. If we assume that we have a message field that we can hash and that the timestamp in the event has already been parsed into the @timestamp field, we can create the components of the identifier and store them in metadata like this:

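    fingerprint {
      source => "message"
      # Assumption: the target setting is missing here; this metadata field name is illustrative.
      target => "[@metadata][fingerprint]"
      method => "MD5"
      key => "test"
    }
    ruby {
      # Assumption: the filter body is missing here; this stores the epoch seconds of
      # @timestamp as a hex string in an illustrative metadata field.
      code => "event.set('[@metadata][tsprefix]', event.get('@timestamp').to_i.to_s(16))"
    }

These two fields can then be used to generate a document id in the Elasticsearch output plugin:

    elasticsearch {
      # Assumption: the setting is missing here; it concatenates the timestamp prefix and
      # the fingerprint, using the illustrative metadata field names.
      document_id => "%{[@metadata][tsprefix]}%{[@metadata][fingerprint]}"
    }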


This will result in a document id that is hex encoded and 40 characters in length (an 8-character timestamp prefix followed by the 32-character MD5 digest), e.g. 4dad050215ca59aa1e3a26a222a9bbcaced23039. A full configuration example can be found in this gist.
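For readers building the identifier outside Logstash, the same structure can be sketched in Python. This assumes the timestamp prefix is the event time in whole seconds since the epoch and that the keyed MD5 is an HMAC; the function name `document_id` is made up, and the digest is not guaranteed to match what the fingerprint filter produces for the same input.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def document_id(timestamp, message, key=b"test"):
    # Hex-encoded epoch seconds: 8 characters for present-day timestamps.
    prefix = format(int(timestamp.timestamp()), "x")
    # Keyed MD5 (HMAC) of the message: always 32 hex characters.
    digest = hmac.new(key, message.encode("utf-8"), hashlib.md5).hexdigest()
    return prefix + digest


# Timestamp taken from the first 8 characters of the example id above.
ts = datetime.fromtimestamp(int("4dad0502", 16), tz=timezone.utc)
doc_id = document_id(ts, "example log line")
print(doc_id, len(doc_id))
```

Because the prefix has a fixed width and comes first, identifiers built this way sort by event time.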

## Indexing performance implications

The impact of using different types of identifiers will depend a lot on your data, hardware, and use-case. While we can give some general guidelines, it is important to run benchmarks to determine exactly how this affects your use-case.

For optimal indexing throughput, using identifiers autogenerated by Elasticsearch is always going to be the most efficient option. As update checks are not required, indexing performance does not change much as indices and shards grow in size. It is therefore recommended to use this whenever possible.

The update checks resulting from using an external ID will require additional disk access. The impact this will have depends on how efficiently the required data can be cached by the operating system, how fast the storage is, and how well it can handle random reads. The indexing speed also often goes down as indices and shards grow and more and more segments need to be checked.

## Use of the rollover API

Traditional time-based indices rely on each index covering a specific set time period. This means that index and shard sizes can end up varying a lot if data volumes fluctuate over time. Uneven shard sizes are not desirable and can lead to performance problems.

The rollover index API was introduced to provide a flexible way to manage time-based indices based on multiple criteria, not just time. It makes it possible to roll over to a new index once the existing one reaches a specific size, document count and/or age, resulting in much more predictable shard and index sizes.

This however breaks the link between the event timestamp and the index it belongs to. When indices were based strictly on time, an event would always go to the same index no matter how late it arrived. It is this principle that makes it possible to prevent duplicates using external identifiers. When using the rollover API it is therefore no longer possible to completely prevent duplicates, even though the probability is reduced. It is possible that two duplicate events arrive on either side of a rollover and therefore end up in different indices even though they have the same timestamp, which will not result in an update.
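The following toy model illustrates the difference. The index naming schemes, the `RolloverAlias` class and the document ID are hypothetical; nothing here talks to Elasticsearch.

```python
def time_based_index(event):
    # The index name depends only on the event's own timestamp (daily indices assumed).
    return "logs-" + event["@timestamp"][:10]


class RolloverAlias:
    """Toy write alias: the target index depends on when the write arrives."""

    def __init__(self):
        self.generation = 1

    def write_index(self):
        return f"logs-{self.generation:06d}"

    def rollover(self):
        self.generation += 1


event = {"id": "4dad0502abc", "@timestamp": "2018-01-01T23:59:59Z"}

# Time-based: first attempt and late retry land on the same (index, id) pair.
time_based_copies = {(time_based_index(event), event["id"]) for _ in range(2)}

# Rollover: the alias moves between the first attempt and the retry.
alias = RolloverAlias()
rollover_copies = set()
rollover_copies.add((alias.write_index(), event["id"]))
alias.rollover()
rollover_copies.add((alias.write_index(), event["id"]))

print(len(time_based_copies), len(rollover_copies))
```

In the time-based case the retry overwrites the first copy. In the rollover case the same ID ends up in two different indices, so both copies survive.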

It is therefore not recommended to use the rollover API if duplicate prevention is a strict requirement.

## Adapting to unpredictable traffic volumes

Even if the rollover API cannot be used, there are still ways to adapt and adjust shard size if traffic volumes fluctuate and result in time-based indices that are too small or large.

If shards have ended up too large, for example due to a surge in traffic, it is possible to use the split index API to split the index into a larger number of shards. This API requires a setting to be applied at index creation, so this needs to be added through an index template.

If traffic volumes on the other hand have been too low, resulting in unusually small shards, the shrink index API can be used to reduce the number of shards in the index.
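Both APIs restrict which shard counts are reachable. The sketch below encodes those constraints as I understand them from the Elasticsearch documentation, assuming the creation-time setting in question is `index.number_of_routing_shards`; the helper names are made up, and the rules should be checked against the documentation for your version.

```python
def valid_split_targets(current_shards, routing_shards):
    # Target must be a larger multiple of the current shard count
    # and a factor of index.number_of_routing_shards.
    return [
        n
        for n in range(current_shards + 1, routing_shards + 1)
        if n % current_shards == 0 and routing_shards % n == 0
    ]


def valid_shrink_targets(current_shards):
    # Target must be a smaller factor of the current shard count.
    return [n for n in range(1, current_shards) if current_shards % n == 0]


print(valid_split_targets(5, 30))   # [10, 15, 30]
print(valid_shrink_targets(8))      # [1, 2, 4]
```

Picking a primary shard count and routing shard count with many factors in the index template keeps more options open for later adjustment.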

## Conclusions

As you have seen in this blog post, it is possible to prevent duplicates in Elasticsearch by specifying a document identifier externally prior to indexing data into Elasticsearch. The type and structure of the identifier can have a significant impact on indexing performance. This will however vary from use case to use case so it is recommended to benchmark to identify what is optimal for you and your particular scenario.