An Elastic approach to large-scale dynamic malware analysis

This research reveals insights into some of the large-scale malware analysis performed by Elastic Security Labs, and complements research related to the Detonate framework.


Introduction

In previous publications, we have written about Detonate: how we built it and how we use it within Elastic for malware analysis. This publication delves deeper into using Detonate for dynamic large-scale malware analysis.

At a high level, Detonate runs malware and other potentially malicious software in a controlled (i.e., sandboxed) environment where the full suite of Elastic Security capabilities is enabled. For more information about Detonate, check out Click, Click… Boom! Automating Protections Testing with Detonate.

A significant portion of the data generated during execution consists of benign and duplicate information. When conducting dynamic malware analysis at scale, managing this vast amount of low-value data is a considerable challenge. To address it, we leveraged several Elastic ingest pipelines to filter the noise out of our datasets. This allowed us to conveniently analyze large volumes of malware data and identify several malicious behaviors that we were already interested in.

This research examines the concept of ingest pipelines, exploring their different types and applications, and how to implement them. We will then walk through a comprehensive workflow incorporating these ingest pipelines. We will discuss our scripts and the methods that we created in order to automate the entire process. Finally, we will present our results and discuss how the workflow shared in this publication can be leveraged by others to obtain similar outcomes.

Overview

In order to accomplish our large-scale malware analysis goals, we required effective data management. An overview of the chained ingest pipelines and processors that we built is shown below:

Ingest pipeline process overview

In summary, we fingerprint known good binaries and store those fingerprints in an enrich index. We do the same thing when we detonate malware or an unknown binary, using a comparison of those fingerprints to quickly filter out low-value data.

Ingest pipelines

Ingest pipelines are a powerful feature that allows you to preprocess and transform data before indexing it into Elasticsearch. They provide a way to perform various actions on incoming documents, such as enriching the data, modifying fields, extracting information, or applying data normalization. Ingest pipelines can be customized to meet specific data processing requirements. Our objective was to create a pipeline that differentiates known benign documents from a dataset containing both benign and malicious records. We ingested large benign and malicious datasets into separate namespaces and built pipelines to normalize the data, calculate fingerprints, and add a specific label based on certain criteria. This label helps differentiate between known benign and unknown data.

Normalization

Normalization is the process of organizing and transforming data into a consistent and standardized format. When dealing with lots of different data, normalization becomes important to ensure consistency, improve search and analysis capabilities, and enable efficient data processing.

The goal is to remove values that are unique per host or per run, so that otherwise identical documents become directly comparable. For example, we normalize away the unique 6-character string that the Elastic Agent adds to its "/opt/Elastic/Agent/data/" directory at installation. This ensures data from different Elastic Agents is fully comparable, leading to more filtering opportunities in later pipeline phases.

To accomplish this, we leveraged the gsub processor, which allowed us to apply regex-based transformations to fields within the ingest pipeline. We performed pattern matching and substitution operations to normalize event data, such as removing special characters, converting text to lowercase, or replacing certain patterns with standardized values.
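As a minimal sketch (the pipeline name normalize_events, the process.executable field, and the exact pattern are illustrative assumptions rather than our exact production configuration), the Elastic Agent path from the example above could be normalized like this:

```
PUT _ingest/pipeline/normalize_events
{
  "description": "Replace per-agent random values with static placeholders",
  "processors": [
    {
      "gsub": {
        "field": "process.executable",
        "pattern": "/opt/Elastic/Agent/data/elastic-agent-[a-z0-9]{6}/",
        "replacement": "/opt/Elastic/Agent/data/elastic-agent-XXXXXX/",
        "ignore_missing": true
      }
    }
  ]
}
```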

By analyzing our dataset, we identified a set of candidate fields that required normalization, and created a simple Python script to generate a list of gsub processors from pairs of matching patterns and replacement values. The script that we leveraged can be found on GitHub. Using the output of the script, we can use Dev Tools to create a pipeline containing the generated gsub processors.

Prior to utilizing the normalization pipeline, documents would contain random 6-character strings unique to every single Elastic Agent. An example is displayed below.

Document before normalization

After ingesting and manipulating the documents through the normalization pipeline, the result looks like the following.

Document after normalization

When all documents are normalized, we can continue with the fingerprint calculation process.

Fingerprint calculation

Fingerprint calculations are commonly used to generate a unique identifier for documents based on their content. The fingerprint ingest pipeline provides a convenient way to generate such identifiers by computing a hash value based on the specified fields and options, allowing for efficient document deduplication and comparison. The pipeline offers various options, including algorithms (such as MD5 or SHA-1), target fields for storing the generated fingerprints, and the ability to include or exclude specific fields in the calculation.

We needed to calculate the fingerprints of documents ingested into Elasticsearch from several sources and integrations, such as endpoint, Auditd Manager, Packetbeat, and File Integrity Monitoring. To calculate the fingerprints, we first needed to specify which fields to include in the calculation. Because different data sources use different fields, it was important to create a separate processor for each data type. For our use case, we ended up creating a different fingerprint processor for each of the following event categories:

Fingerprint ingest processors per event category

By specifying a condition, we ensure that each processor only runs on its corresponding dataset.

Event filter example

The fields included in these processors are of the utmost importance: a field that is less static than expected, or one that is frequently empty, can render the fingerprint useless. For example, when working with network data, it might initially make sense to include the protocol, destination IP, destination port, source IP, and source port. However, this leads to too much noise, because the socket opened on a system uses an ephemeral source port, which results in many unique fingerprints for otherwise identical network traffic. Other fields that tend to change include file sizes, version numbers, and specific text fields that are not being parsed. Normalization sometimes preserves fields that aren't useful for fingerprinting, and the more specific the fingerprint, the less useful it tends to be. Fingerprinting by file hash illustrates this: adding a single space to a file produces a new hash, which would break an existing hash-based fingerprint of the file.

Field selection is a tedious process but vital for good results. For a specific integration, like Auditd Manager, we can find the exported fields on GitHub and pick the ones that seem useful for our purposes. An example of the processor that we used for auditd_manager can be found in the image below.

Example of the event's fingerprint fields used for the calculation.
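To make this more concrete, a hedged sketch of such a conditional fingerprint processor is shown below; the pipeline name, the chosen field list, and the hashing method are illustrative assumptions based on the exported Auditd Manager and ECS fields, not necessarily the exact set we used:

```
PUT _ingest/pipeline/fingerprint_events
{
  "description": "Per-data-source fingerprint calculation",
  "processors": [
    {
      "fingerprint": {
        "if": "ctx.event?.dataset == 'auditd_manager.auditd'",
        "fields": [
          "event.action",
          "auditd.data.syscall",
          "process.executable",
          "process.args",
          "user.id"
        ],
        "target_field": "fingerprint",
        "method": "SHA-1",
        "ignore_missing": true
      }
    }
  ]
}
```

Additional fingerprint processors with their own conditions and field lists follow the same pattern for the other event categories.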

Enrichment process

The enrich ingest pipeline is used for enriching incoming documents with additional information from external data sources. It allows you to enrich your data by performing lookups against an index or data set, based on specific criteria. Common use cases for the enrich ingest pipeline include augmenting documents with data from reference datasets (such as geolocation or customer information) and enriching logs with contextual information (like threat intelligence labels).

For this project we leveraged enrich pipelines to add a unique identifier to ingested documents that met certain criteria described within an enrich policy. To accomplish this, we first ingested a large and representative batch of benign data using a combination of the normalization and fingerprint calculation pipelines. Once the ingestion was completed, we created several enrich policies and executed them through the execute enrich policy API. Executing these enrich policies creates a set of new .enrich-* system indices, whose contents are later used by the pipelines that ingest the mixed (benign and malicious) data.

This will make more sense with an example workflow. To leverage the enrich ingest pipeline, we first need to create enrich policies. Because we are dealing with different data sources - network data looks very different from Auditd Manager data, for example - we have to create one enrich policy per data type. In an enrich policy, we can use a query to specify which documents to include in the enrich index and which to exclude. An example enrich policy that adds all Auditd Manager data to the enrich index, except for data matching three specific match phrases, is displayed below.

Creation of the enrich policy

We are leveraging the "fingerprint" field, which is calculated by the fingerprint processor, as our match field. This will create an index filled with benign fingerprints to be used as the enrich index within the enrich pipeline.
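For reference, a hedged sketch of what such a policy can look like follows; the policy name, the source index (assumed to be the data stream of our benign namespace), and the single must_not clause standing in for the three match phrases in the image are illustrative assumptions:

```
PUT _enrich/policy/benign-auditd_manager-policy
{
  "match": {
    "indices": "logs-auditd_manager.auditd-benign",
    "match_field": "fingerprint",
    "enrich_fields": ["fingerprint"],
    "query": {
      "bool": {
        "must_not": [
          { "match_phrase": { "process.executable": "/path/to/excluded/binary" } }
        ]
      }
    }
  }
}
```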

After creating this policy, we have to execute it so that it reads the source index, applies the inclusion and exclusion query against the match field, and creates the new .enrich-* system index. We do this by sending a POST request to the _execute API.

Example API request to execute the enrich policy
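In Dev Tools, the request is simply the following (using the illustrative policy name from the sketch above):

```
POST _enrich/policy/benign-auditd_manager-policy/_execute?wait_for_completion=false
```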

We set wait_for_completion=false so that the request does not time out, which can happen when the dataset is large; Elasticsearch builds the enrich index in the background. When we navigate to index management and include hidden indices, we can see that the index was created successfully.

The newly created .enrich-* system index

We now have a list of known benign fingerprints, which we will use within our enrich pipeline to filter our mixed dataset. Our enrich pipeline will once again use a condition to differentiate between data sources. An overview of our enrich processors is displayed below.

Enrich ingest pipeline

Focusing on Auditd Manager, we built an enrich processor whose condition checks whether the document's dataset is auditd_manager.auditd. If it matches, the processor references the enrich policy we created for that dataset and uses the fingerprint field to match and enrich incoming documents. If the fingerprint is known within the enrich indices we created, the "enrich_label" field containing the fingerprint is added to the document. See the processor below.

Configuration example
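A hedged sketch of that processor is shown below, as one entry in the processors array of an enrich pipeline we will call enrich_events for illustration; the policy name follows the earlier assumption:

```
{
  "enrich": {
    "if": "ctx.event?.dataset == 'auditd_manager.auditd'",
    "policy_name": "benign-auditd_manager-policy",
    "field": "fingerprint",
    "target_field": "enrich_label",
    "ignore_missing": true
  }
}
```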

Once a document originating from the auditd_manager.auditd dataset comes through, the enrich processor is executed, followed by a script processor. The script processor allows us to run inline or stored scripts on incoming documents. We leverage this functionality to check each document in the pipeline for the "enrich_label" field: if it is present, we set a new boolean field called "known_benign" to true and remove the "enrich_label" and "enriched_fingerprint" fields; if it is not, we set "known_benign" to false. This allows us to easily filter our mixed dataset in Kibana.

Script processor
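A minimal sketch of such a script processor follows; the Painless source is our reconstruction of the logic described above, not a copy of the production script:

```
{
  "script": {
    "lang": "painless",
    "source": """
      // If the enrich processor added the label, the fingerprint is known benign
      if (ctx.containsKey('enrich_label')) {
        ctx['known_benign'] = true;
        ctx.remove('enrich_label');
        ctx.remove('enriched_fingerprint');
      } else {
        ctx['known_benign'] = false;
      }
    """
  }
}
```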

When using the “test pipeline” feature by adding a document that contains the “enrich_label”, we can see that the “fingerprint” and the “known_benign” fields are set.

Testing the pipeline with a benign document

For documents that do not receive the "enrich_label" field, only the fingerprint is set and "known_benign" is set to false.
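The same check can also be performed from Dev Tools with the simulate API; the pipeline name and test document below are illustrative:

```
POST _ingest/pipeline/enrich_events/_simulate
{
  "docs": [
    {
      "_source": {
        "event": { "dataset": "auditd_manager.auditd" },
        "fingerprint": "<a fingerprint present in the enrich index>"
      }
    }
  ]
}
```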

Working with these enrich policies requires some setup, but once they are well structured they can truly filter out a lot of noise. Because doing this manually is a lot of work, we created some simple Python scripts to partially automate the process. We will go into more detail shortly about how to automate the creation of these enrich policies, their execution, and the creation of the enrich pipeline.

Ingest pipeline chaining

The pipeline processor provides a way to chain multiple ingest pipelines together. By chaining pipelines, we create a sequence of operations that collectively shapes the incoming data into the form we want, covering our needs for data normalization, fingerprint calculation, and data enrichment.

In our work with Detonate, we ended up creating two chained ingest pipelines. The first processes benign data and consists of the normalization and fingerprint calculation pipelines. The second processes malicious data and consists of the normalization, fingerprint calculation, and enrichment pipelines. An example of this would be the following:

Pipeline ingest pipeline
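A hedged sketch of these two chained pipelines, reusing the illustrative pipeline names from the previous sections, could look like this:

```
PUT _ingest/pipeline/process_benign_events
{
  "processors": [
    { "pipeline": { "name": "normalize_events" } },
    { "pipeline": { "name": "fingerprint_events" } }
  ]
}

PUT _ingest/pipeline/process_detonation_events
{
  "processors": [
    { "pipeline": { "name": "normalize_events" } },
    { "pipeline": { "name": "fingerprint_events" } },
    { "pipeline": { "name": "enrich_events" } }
  ]
}
```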

With the pipelines in place, we need to ensure that they are actually being used when ingesting data. To accomplish this, we leverage component templates.

Component templates

Component templates are reusable configurations that define the settings and mappings for specific types of Elasticsearch components. They provide a convenient way to define and manage consistent configurations across multiple components, simplifying the management and maintenance of resources.

When you first start using any Fleet integrations, you will notice that a lot of component templates are created by default. These are tagged as "managed", meaning that their configuration can't be changed.

Component template overview

In order to accommodate users who want to post-process events ingested via the Fleet-managed Agent, all index templates call out to a final component template whose name ends in @custom.

Custom component template overview
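As a hedged sketch (the template name and field types are assumptions), adding mappings through such a @custom component template can look like this:

```
PUT _component_template/logs-auditd_manager.auditd@custom
{
  "template": {
    "mappings": {
      "properties": {
        "fingerprint": { "type": "keyword" },
        "known_benign": { "type": "boolean" }
      }
    }
  }
}
```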

The settings you put in these components will never be changed by updates. In our use case, we use these templates to add mappings for the enrichment fields. Most of the data that is ingested via Fleet and its integrations goes through an ingest pipeline, and these pipelines follow the same pattern in order to accommodate user customizations. Take, for example, the following ingest pipeline:

Example of a Fleet-managed ingest pipeline

We can see that it is managed by Fleet and tied to a specific version (e.g., 8.8.0) of the integration. The pipeline ends by calling the @custom pipeline, ignoring it if it doesn't exist.

We want to add our enrichment data to the documents using the enrichment pipelines we described in the previous section. This can now simply be done by creating the @custom pipeline and having that call out to the enrichment pipeline.

Example of the created custom ingest pipeline
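A hedged sketch of such a @custom pipeline follows; the pipeline name assumes the Auditd Manager dataset, and process_local_events is the chained pipeline referenced in the automation section below:

```
PUT _ingest/pipeline/logs-auditd_manager.auditd@custom
{
  "processors": [
    {
      "pipeline": {
        "name": "process_local_events",
        "ignore_missing_pipeline": true
      }
    }
  ]
}
```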

Automating the process

To create the gsub processors, ingest pipelines, and enrich policies, we used three Python scripts, which we showcase in the next sections. If you choose to integrate these scripts, remember that you will need to adjust them to match your own environment.

Creating the gsub ingest pipelines

To create a gsub pipeline that replaces the given random paths with static ones, we used a Python script that takes several fields and patterns as input and prints a JSON object which can be used with the pipeline creation API.

Create Custom Pipelines

After setting up the gsub pipeline, we leveraged a second Python script that searches for all Fleet-managed configurations that call an @custom ingest pipeline. It then creates the appropriate pipelines, after which all the custom pipelines point to the process_local_events pipeline.

Generate Enrichment Processors

Finally, we created a third Python script that will handle the creation of enrichment processors in four steps.

  1. The cleanup process: an enrich policy cannot be deleted while it is referenced by an ingest pipeline, so during testing and development we simply delete and recreate the ingest pipeline. This is of course not recommended for production environments.
  2. Create the enrich policies: the script creates every individual policy.
  3. Execute the policies: this starts the creation of the hidden enrich system indices. Note that the policy execution takes longer than the script itself, since the script does not wait for the command to complete; Elasticsearch creates the enrich indices in the background.
  4. Re-create the ingest pipeline: after the enrich policies have been executed, we can re-create the ingest pipeline that uses the enrichments.

After executing these three scripts, the whole setup is completed, and malicious data can be ingested into the correct namespace.

Results and limitations

Our benign dataset includes 53,267,892 documents generated by executing trusted binaries on a variety of operating systems and collecting events from high-value data sources. Using this normalized benign dataset, we calculated the fingerprints and created the enrich policies per data type.

With this setup in place, we detonated 332 samples. After removing the Elastic Agent metrics and endpoint alerts from the datasets, we ended up with a mixed dataset containing a total of 41,710,279 documents.

Results prior to filtering on known_benign = false

After filtering on known_benign = false, we end up with 1,321,949 documents, a decrease of 96.83% in document count.

Results after filtering on known_benign = false

The table below presents an overview of each data source and its corresponding number of documents before and after filtering on our “known_benign” field.

Results of filtering out benign data

We can see that we managed to successfully filter most data sources down by a significant percentage. Additionally, the numbers presented in the "after" column still include malicious data that we do want to capture. For example, several of the malware samples were ransomware, which tends to create a lot of file events. Likewise, all of the HTTP traffic originated from malware samples trying to connect to their C2s. The auditd_manager and fim.event datasets include many of the syscalls and file changes performed by the samples.

While building out this pipeline, several lessons were learned. First of all, as mentioned before, adding one wrong field to the fingerprint calculation can cause the whole dataset to generate lots of noise. This can be seen by adding source.port to the Packetbeat fingerprint calculation, which causes the endpoint.events.network and all network_traffic-* datasets to increase drastically.

The second lesson we learned is that it is important to have not only a representative dataset but also a large one. The two go hand in hand: a small dataset, or one that does not exhibit behavior very similar to the data that will be ingested later, makes the pipelines less than half as effective.

Finally, some data sources are better suited to this filtering approach than others. For example, for system.syslog and system.auth events, most of the fields within the document (except the message field) are always the same. Since we cannot apply this approach to unstructured data, such as plain-text message fields, a fingerprint built from the remaining fields would filter out 99% of the events.

Visualizing results

Kibana offers many great options to visualize large datasets. We chose to leverage the Lens functionality within Kibana to search through our malicious dataset. By filtering on known_benign = false, using the count of fingerprint as a metric, and sorting in ascending order, we can immediately see different malware samples executing different tasks. An example of file events is shown below.

Using Lens to visualize malicious file events

Within this table, we can see:

  • Suspicious files being created in the /dev/shm/ directory
  • "HOW_TO_DECRYPT.txt" file creations, indicating the creation of a ransom message
  • Files being changed to contain new random file extensions, indicating the ransomware encryption process
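For readers who prefer Dev Tools over Lens, a roughly equivalent query is sketched below; the index pattern (assuming the detonation data lives in a dedicated namespace) and the event.category filter are assumptions:

```
GET logs-*-detonate/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "known_benign": false } },
        { "term": { "event.category": "file" } }
      ]
    }
  },
  "aggs": {
    "rare_fingerprints": {
      "terms": {
        "field": "fingerprint",
        "order": { "_count": "asc" },
        "size": 25
      }
    }
  }
}
```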

When looking into file integrity monitoring events, we can also very easily distinguish benign events from malicious events by applying the same filter.

Using Lens to visualize malicious symlink events

Right away we notice the creation of symlinks for a linux.service and bot.service, as well as several run control symlinks to establish persistence on the system.

Looking at network connections, we can see connection_attempted events from malicious samples to potential C2 servers on several uncommon ports.

Using Lens to visualize malicious network connections

Finally, looking at auditd manager syscall events, we can see the malware opening files such as cmdline and maps and attempting to change the permissions of several files.

Using Lens to visualize malicious syscalls

Overall, in our opinion the data cleaning results are very promising and allow us to more efficiently conduct dynamic malware analysis on a large scale. The process can always be further optimized, so feel free to take advantage of our approach and fine tune it to your specific needs.

Beyond Dynamic Malware Analysis

In the previous sections we described our exact use case for leveraging fingerprint and enrich ingest pipelines. Other than malware analysis, there are many other fields that can reap the benefits of a workflow similar to the one outlined above. Several of these applications and use cases are described below:

  • Forensics and Security: Fingerprinting can be employed in digital forensics and security investigations to identify and link related artifacts or events. It helps in tracing the origin of data, analyzing patterns, and identifying potential threats or anomalies in log files, network traffic, or system events. Researchers over at Microsoft leveraged fuzzy hashing in previous research to detect malicious web shell traffic.
  • Identity Resolution: Fingerprinting can be used to uniquely identify individuals or entities across different data sources. This is useful in applications like fraud detection, customer relationship management, and data integration, where matching and merging records based on unique identifiers is crucial.
  • Data Deduplication: Fingerprinting can help identify and eliminate duplicate records or documents within a dataset. By comparing fingerprints, you can efficiently detect and remove duplicate entries, ensuring data integrity and improving storage efficiency. Readers interested in data deduplication use cases might find great value in pre-built tools such as Logslash to achieve this goal.
  • Content Management: Fingerprinting can be used in content management systems to detect duplicate or similar documents, images, or media files. It aids in content deduplication, similarity matching, and content-based searching by improving search accuracy and enhancing the overall user experience.
  • Media Identification: Fingerprinting techniques are widely used in media identification and recognition systems. By generating unique fingerprints for audio or video content, it becomes possible to identify copyrighted material, detect plagiarism, or enable content recommendation systems based on media similarity.

Conclusion

There are many different approaches to dynamic malware analysis. This blog post explored some of these options by leveraging the powerful capabilities offered by Elastic. Our aim was to present a new method of dynamic malware analysis while also broadening your understanding and knowledge of the built-in functionality within Elastic.

Elastic Security Labs is the threat intelligence branch of Elastic Security dedicated to creating positive change in the threat landscape. Elastic Security Labs provides publicly available research on emerging threats with an analysis of strategic, operational, and tactical adversary objectives, then integrates that research with the built-in detection and response capabilities of Elastic Security.

Follow Elastic Security Labs on Twitter @elasticseclabs and check out our research at www.elastic.co/security-labs/.