Identifying beaconing malware using Elastic


The early stages of an intrusion usually include initial access, execution, persistence, and command-and-control (C2) beaconing. When structured threats use zero-days, these first two stages are often not detected. It can often be challenging and time-consuming to identify persistence mechanisms left by an advanced adversary as we saw in the 2020 SUNBURST supply chain compromise. Could we then have detected SUNBURST in the initial hours or days by finding its C2 beacon?

The potential for beaconing detection is that it can serve as an early warning system and help discover novel persistence mechanisms in the initial hours or days after execution. This allows defenders to disrupt or evict the threat actor before they can achieve their objectives. So, while we are not quite "left of boom" by detecting C2 beaconing, we can make a big difference in the outcome of the attack by reducing its overall impact.

In this blog, we talk about a beaconing identification framework that we built using Painless and aggregations in the Elastic Stack. The framework can not only help threat hunters and analysts monitor network traffic for beaconing activity, but also provides useful indicators of compromise (IoCs) for them to start an investigation with. If you don’t have an Elastic Cloud cluster but would like to try out our beaconing identification framework, you can start a free 14-day trial of Elastic Cloud.

Beaconing — A primer

An enterprise's defense is only as good as its firewalls, antivirus, endpoint detection and intrusion detection capabilities, and SOC (Security Operations Center) — which consists of analysts, engineers, operators administrators, etc. who work round the clock to keep the organization secure. Malware however, enters enterprises in many different ways and uses a variety of techniques to go undetected. An increasingly common method being used by adversaries nowadays to evade detection is to use C2 beaconing as a part of their attack chain, given that it allows them to blend into networks like a normal user.

In networking, beaconing is a term used to describe a continuous cadence of communication between two systems. In the context of malware, beaconing is when malware periodically calls out to the attacker's C2 server to get further instructions on tasks to perform on the victim machine. The frequency at which the malware checks in and the methods used for the communications are configured by the attacker. Some of the common protocols used for C2 are HTTP/S, DNS, SSH, and SMTP, as well as common cloud services like Google, Twitter, Dropbox, etc. Using common protocols and services for C2 allows adversaries to masquerade as normal network traffic and hence evade firewalls.

While on the surface beaconing can appear similar to normal network traffic, it has some unique traits with respect to timing and packet size, which can be modeled using standard statistical and signal processing techniques.

Below is an example of a Koadic C2 beacon, which serves the malicious payload using the DLL host process. As you can see, the payload beacons consistently at an interval of 10 minutes, and the source, as well as destination packet sizes, are almost identical.

Example of a Koadic C2 beacon

It might seem like a trivial task to catch C2 beaconing if all beacons were as neatly structured and predictable as the above. All one would have to look for is periodicity and consistency in packet sizes. However, malware these days is not as straightforward.

Example of an Emotet beacon

Most sophisticated malware nowadays adds a "jitter" or randomness to the beacon interval, making the signal more difficult to detect. Some malware authors also use longer beacon intervals. The beaconing identification framework we propose accounts for some of these elusive modifications to traditional beaconing behavior.

Our approach

We’ve discussed a bit about the why and what — in this section we dig deeper into how we identify beaconing traffic. Before we begin, it is important to note that beaconing is merely a communication characteristic. It is neither good nor evil by definition. While it is true that malware heavily relies on beaconing nowadays, a lot of legitimate software also exhibits beaconing behaviour.

While we have made efforts to reduce false positives, this framework should be looked at as a means for beaconing identification to help reduce the search space for a threat hunt, not as a means for detection. That said, indicators produced by this framework, when combined with other IoCs, can potentially be used to detect on malicious activity.

The beacons we are interested in comprise traffic from a single running process on a particular host machine to one or more external IPs. Given that the malware can have both short (order of seconds) and long (order of hours or days) check-in intervals, we will restrict our attention to a time window that works reasonably for both and attempt to answer the question: “What is beaconing in my environment right now or recently?” We have also parameterized the inputs to the framework to allow users to configure important settings like time window, etc. More on this in upcoming sections.

When dealing with large data sets, such as network data for an enterprise, you need to think carefully about what you can measure, which allows you to scale effectively. Scaling has several facets, but for our purposes, we have the following requirements:

  1. Work can be parallelised over different shards of data stored on different machines
  2. The amount of data that needs to move around to compute what is needed must be kept manageable.

Multiple approaches have been suggested for detecting beaconing characteristics, but not all of them satisfy these constraints. For example, a popular choice for detecting beacon timing characteristics is to measure the interval between events. This proves to be too inefficient to use on large datasets because the events can't be processed across multiple shards.

Driven by the need to scale, we chose to detect beaconing by bucketing the data in the time window to be analyzed. We gather the event count and average bytes sent and received in each bucket. These statistics can be computed in MapReduce fashion and values from different shards can be combined at the coordinating node of an Elasticsearch query.

Furthermore, by controlling the ratio between the bucket and window lengths, the data we pass per running process has predictable memory consumption, which is important for system stability. The whole process is illustrated diagrammatically below:

Bucketing data for analysis

A key attribute of beaconing traffic is it often has similar netflow bytes for the majority of its communication. If we average the bytes over all the events that fall in a single bucket, the average for different buckets will in fact be even more similar. This is just the law of large numbers in action. A good way to measure similarity of several positive numbers (in our case these are average bucket netflow bytes) is using a statistic called the coefficient of variation (COV). This captures the average relative difference between the values and their mean. Because this is a relative value, a COV closer to 0 implies that values are tightly clustered around their mean.

We also found that occasional spikes in the netflow bytes in some beacons were inflating the COV statistic. In order to rectify this, we simply discarded low and high percentile values when computing the COV, which is a standard technique for creating a robust statistic. We threshold the value of this statistic to be significantly less than one to detect this characteristic of beacons.

For periodicity, we observed that signals displayed one of two characteristics when we viewed the bucket counts. If the period was less than the time bucket length (i.e. high frequency beacons), then the count showed little variation from bucket to bucket. If the period was longer than the time bucket length (i.e. low frequency beacons), then the signal had high autocorrelation. Let's discuss these in detail.

To test for high frequency beacons, we use a statistic called relative variance (RV). The rate of many naturally occurring phenomena are well described by a Poisson distribution. The reason for this is that if events arrive randomly at a constant average rate and the occurrence of one event doesn’t affect the chance of others occurring, then their count in a fixed time interval must be Poisson distributed.

Just to underline this point, it doesn’t matter the underlying mechanisms for that random delay between events (making a coffee, waiting for your software to build, etc.)— if those properties hold, their rate distribution is always the same. Therefore, we expect that the bucket counts to be Poisson distributed for much of the traffic in our network, but not for beacons, which are much more regular. A feature of the Poisson distribution is that its variance is equal to its average, i.e. its RV is 1. Loosely, this means that if the RV of our bucket counts is closer to 0, the signal is more regular than a Poisson process.

Autocorrelation is a useful statistic for understanding when a time series repeats itself. The basic idea behind autocorrelation is to compare the time series values to themselves after shifting them in time. Specifically, it is the covariance between the two sets of values (which is larger when they are more similar), normalized by dividing it by the square root of the variances of the two sets, which measures how much the values vary among themselves.

This process is illustrated schematically below. We apply this to the time series comprising the bucket counts: if the signal is periodic then the time bucketed counts must also repeat themselves. The nice thing about autocorrelation from our perspective is that it is capable of detecting any periodic pattern. For example, the events don’t need to be regularly spaced but might repeat like two events occurring close to one another in time, followed by a long gap and so on.

We don’t know the shift beforehand that will maximize the similarity between the two sets of values, so we search over all shifts for the maximum. This, in effect, is the period of the data — the closer its autocorrelation is to one, the closer the time series is to being truly periodic. We threshold the autocorrelation close to one to test for low frequency beacons.

Finally, we noted that most beaconing malware these days incorporates jitter. How does autocorrelation deal with this? Well first off, autocorrelation isn’t a binary measure — it is a sliding scale: the closer the value is to 1 the more similar the two sets of values are to one another. Even if they are not identical but similar it can still be close to one. In fact, we can do better than this by modelling how random jitter affects autocorrelation and undoing its effect. Provided the jitter isn’t too large, the process to do this turns out to be about as complex as just finding the maximum autocorrelation.

In our implementation, we’ve made the percentage configurable, although one would always use a small-ish percentage to avoid flagging too much traffic as periodic. If you'd like to dig into the gory details of our implementation, all the artifacts are available as a GitHub release in our detection rules repository.

How do we do this using Elasticsearch?

Elasticsearch has some very powerful tools for ad hoc data analysis. The scripted metric aggregation is one of them. The nice thing about this aggregation is that it allows you to write custom Painless scripts to derive different metrics about your data. We used the aggregation to script out the beaconing tests.

In a typical environment, the cardinality of the distinct processes running across endpoints is rather high. Trying to run an aggregation that partitions by every running process is therefore not feasible. This is where another feature of the Elastic Stack comes in handy. A transform is a complex aggregation which paginates through all your data and writes results to a destination index.

There are various basic operations available in transforms, one of them being partitioning data at scale. In our case, we partitioned our network event logs by host and process name and ran our scripted metric aggregation against each host-process name pair. The transform also writes out various beaconing related indicators and statistics. A sample document from the resulting destination index is as follows:

  • We're hiring

    Work for a global, distributed team where finding someone like you is just a Zoom meeting away. Flexible work with impact? Development opportunities from the start?