<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Security Labs - Articles by Apoorva Joshi</title>
        <link>https://www.elastic.co/es/security-labs</link>
        <description>Trusted security news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Thu, 05 Mar 2026 22:21:01 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Security Labs - Articles by Apoorva Joshi</title>
            <url>https://www.elastic.co/es/security-labs/assets/security-labs-thumbnail.png</url>
            <link>https://www.elastic.co/es/security-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Using LLMs and ESRE to find similar user sessions]]></title>
            <link>https://www.elastic.co/es/security-labs/using-llms-and-esre-to-find-similar-user-sessions</link>
            <guid>using-llms-and-esre-to-find-similar-user-sessions</guid>
            <pubDate>Tue, 19 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In our previous article, we explored using the GPT-4 Large Language Model (LLM) to condense Linux user sessions. In the context of the same experiment, we dedicated some time to examining sessions that shared similarities. Similar sessions can subsequently help analysts identify related suspicious activity.]]></description>
            <content:encoded><![CDATA[<h2>Using LLMs and ESRE to find similar user sessions</h2>
<p>In our <a href="https://www.elastic.co/es/security-labs/using-llms-to-summarize-user-sessions">previous article</a>, we explored using the GPT-4 Large Language Model (LLM) to condense complex Linux user sessions into concise summaries. We highlighted the key takeaways from our experiments, shedding light on the nuances of data preprocessing, prompt tuning, and model parameter adjustments. In the context of the same experiment, we dedicated some time to examining sessions that shared similarities. Similar sessions can subsequently help analysts identify related suspicious activity. We explored the following methods to find similarities in user sessions:</p>
<ul>
<li>To uncover similar user profiles and sessions, one approach was to group sessions according to the actions executed by users; we accomplished this by instructing the Large Language Model (LLM) to categorize user sessions into predefined categories</li>
<li>Additionally, we harnessed the capabilities of <a href="https://www.elastic.co/es/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a> (Elastic’s retrieval model for semantic search) to execute a semantic search on the model summaries derived from the session summarization experiment</li>
</ul>
<p>This research focuses on our experiments using GPT-4 for session categorization and <a href="https://www.elastic.co/es/elasticsearch/elasticsearch-relevance-engine">ESRE</a> for semantic search.</p>
<h2>Leveraging GPT for Session Categorization</h2>
<p>We consulted a security research colleague with domain expertise to define nine categories for our dataset of 75 sessions. These categories generalize the main behaviors and significant features observed in the sessions. They include the following activities:</p>
<ul>
<li>Docker Execution</li>
<li>Network Operations</li>
<li>File Searches</li>
<li>Linux Command Line Usage</li>
<li>Linux Sandbox Application Usage</li>
<li>Pip Installations</li>
<li>Package Installations</li>
<li>Script Executions</li>
<li>Process Executions</li>
</ul>
<h2>Lessons learned</h2>
<p>For our experiments, we used a GPT-4 deployment in Azure AI Studio with a token limit of 32k. To explore the potential of the GPT model for session categorization, we conducted a series of experiments, directing the model to categorize sessions by inputting the same JSON summary document we used for the <a href="https://www.elastic.co/es/security-labs/using-llms-to-summarize-user-sessions">session summarization process</a>.</p>
<p>This effort included multiple iterations, during which we concentrated on enhancing prompts and <a href="https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api">Few-Shot</a> Learning. As for the model parameters, we maintained a <a href="https://txt.cohere.com/llm-parameters-best-outputs-language-ai/">Temperature of 0</a> in an effort to make the outputs less diverse.</p>
<h3>Prompt engineering</h3>
<p><em>Takeaway:</em> Including explanations for categories in the prompts does not impact the model's performance.</p>
<p>The session categorization component was introduced as an extension of the session summarization prompt. We explored the effect of incorporating a contextual explanation for each category into the prompt. Interestingly, appending this illustrative context did not significantly change the model's performance compared to prompts without the supplementary information.</p>
<p>Below is a template we used to guide the model's categorization process:</p>
<pre><code>You are a cybersecurity assistant, who helps Security analysts in summarizing activities that transpired in a Linux session. A summary of events that occurred in the session will be provided in JSON format. No need to explicitly list out process names and file paths. Summarize the session in ~3 paragraphs, focusing on the following: 
- Entities involved in the session: host name and user names.
- Overview of any network activity. What major source and destination ips are involved? Any malicious port activity?
- Overview of any file activity. Were any sensitive files or directories accessed?
- Highlight any other important process activity
- Looking at the process, network, and file activity, what is the user trying to do in the session? Does the activity indicate malicious behavior?

Also, categorize the below Linux session in one of the following 9 categories: Network, Script Execution, Linux Command Line Utility, File search, Docker Execution, Package Installations, Pip Installations, Process Execution and Linux Sandbox Application.

A brief description for each Linux session category is provided below. Refer to these explanations while categorizing the sessions.
- Docker Execution: The session involves command with docker operations, such as docker-run and others
- Network: The session involves commands with network operations
- File Search: The session involves file operations, pertaining to search
- Linux Command Line Utility: The session involves linux command executions
- Linux Sandbox Application: The session involves a sandbox application activity. 
- Pip Installations: The session involves python pip installations
- Package Installations: The session involves package installations or removal activities. This is more of apt-get, yum, dpkg and general command line installers as opposed to any software wrapper
- Script Execution: The session involves bash script invocations. All of these have pointed custom infrastructure script invocations
- Process Execution: The session focuses on other process executions and is not limited to linux commands. 
 ###
 Text: {your input here}
</code></pre>
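<p>As a rough illustration of how a prompt like this can be driven programmatically, here is a minimal sketch assuming the <code>openai</code> 1.x Python client against an Azure OpenAI deployment; the helper names and deployment wiring are placeholders, not the code used in the experiment:</p>

```python
# Sketch only: build the categorization prompt and send it to an Azure
# OpenAI GPT-4 deployment with Temperature 0. Names are placeholders.
import json

CATEGORIES = [
    "Network", "Script Execution", "Linux Command Line Utility",
    "File Search", "Docker Execution", "Package Installations",
    "Pip Installations", "Process Execution", "Linux Sandbox Application",
]

def build_prompt(session_snapshot):
    """Append the aggregated JSON session snapshot to the instructions."""
    instructions = (
        "Categorize the below Linux session in one of the following "
        "9 categories: " + ", ".join(CATEGORIES) + ".\n###\nText: "
    )
    return instructions + json.dumps(session_snapshot)

def categorize(client, deployment, session_snapshot):
    # client would be an openai.AzureOpenAI instance; Temperature 0 keeps
    # the output as deterministic as the model allows.
    response = client.chat.completions.create(
        model=deployment,
        temperature=0,
        messages=[{"role": "user", "content": build_prompt(session_snapshot)}],
    )
    return response.choices[0].message.content
```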
<h3>Few-shot tuning</h3>
<p><em>Takeaway:</em> Adding examples for each category improves accuracy.</p>
<p>In parallel, we investigated whether including one example for each category in the above prompt would improve the model's performance. This strategy resulted in a significant enhancement, boosting the model's accuracy by 20%.</p>
<h2>Evaluating GPT Categories</h2>
<p>Evaluating the GPT categories is crucial for measuring the quality and reliability of the outcomes. To evaluate the categorization results, we compared the model's categories against the human categorization assigned by the security expert (referred to as &quot;Ground_Truth&quot; in the below image). We calculated the total accuracy as the number of successful matches over the total number of sessions.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image2.png" alt="Evaluating Session Categories" /></p>
<p>We observed that GPT-4 faced challenges when dealing with samples bearing multiple categories. However, when assigning a single category, it aligned with the human categorization in 56% of cases. The &quot;Linux Command Line Utility&quot; category posed a particular challenge, accounting for 47% of the false negatives and often being misclassified as &quot;Process Execution&quot; or &quot;Script Execution.&quot; This discrepancy arose from the closely related definitions of the &quot;Linux Command Line Utility&quot; and &quot;Process Execution&quot; categories; the prompts may also have lacked information, such as process command line arguments, that could have served as a valuable distinguishing factor between these categories.</p>
<p>Given the results from our evaluation, we conclude that we either need to tune the descriptions for each category in the prompt or provide more examples to the model via few-shot training. Additionally, it's worth considering whether GPT is the most suitable choice for classification, particularly within the context of the prompting paradigm.</p>
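<p>The match-based accuracy computation described above amounts to something like the following; the session IDs and labels here are invented for illustration:</p>

```python
# Toy accuracy computation: count exact matches between the model's
# category and the expert's ground-truth label. The data is made up.
ground_truth = {"s1": "Network", "s2": "File Search", "s3": "Docker Execution"}
model_output = {"s1": "Network", "s2": "Process Execution", "s3": "Docker Execution"}

matches = sum(
    1 for sid, label in ground_truth.items() if model_output.get(sid) == label
)
accuracy = matches / len(ground_truth)
print(f"accuracy = {accuracy:.0%}")
```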
<h2>Semantic search with ELSER</h2>
<p>We also wanted to try <a href="https://www.elastic.co/es/guide/en/machine-learning/current/ml-nlp-elser.html#ml-nlp-elser">ELSER</a>, the Elastic Learned Sparse EncodeR for semantic search. Semantic search focuses on contextual meaning, rather than strictly exact keyword inputs, and ELSER is a retrieval model trained by Elastic that enables you to perform semantic search and retrieve more relevant results.</p>
<p>We tried some examples of semantic search questions on the session summaries. The session summaries were stored in an Elasticsearch index, and it was simple to download the ELSER model following an <a href="https://www.elastic.co/es/guide/en/machine-learning/current/ml-nlp-elser.html#ml-nlp-elser">official tutorial</a>. The tokens generated by ELSER are stored in the index, as shown in the image below:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image1.png" alt="Tokens generated by ELSER" /></p>
<p>Afterward, semantic search on the index was overall able to retrieve the most relevant events. Semantic search queries about the events included:</p>
<ul>
<li>Password related – yielding 1Password related logs</li>
<li>Java – yielding logs that used Java</li>
<li>Python – yielding logs that used Python</li>
<li>Non-interactive session</li>
<li>Interactive session</li>
</ul>
<p>An example of semantic search can be seen in the Dev Tools console through a <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/8.9/semantic-search-elser.html#text-expansion-query">text_expansion query</a>.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image5.png" alt="Example screenshot of using semantic search with the Elastic dev tools console" /></p>
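<p>For reference, a <code>text_expansion</code> request of the kind shown in the screenshot looks like the following in the Dev Tools console. The index name <code>session-summaries</code> and the token field <code>ml.tokens</code> are assumptions based on the ELSER tutorial, not necessarily the exact names used in the experiment:</p>

```
GET session-summaries/_search
{
  "query": {
    "text_expansion": {
      "ml.tokens": {
        "model_id": ".elser_model_1",
        "model_text": "object oriented language"
      }
    }
  }
}
```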
<p>Some takeaways are:</p>
<ul>
<li>For semantic search, the prompt template can cause the summary to contain too many unrelated keywords. For example, because we wanted every summary to include an assessment of whether or not the session should be considered &quot;malicious&quot;, that specific word was always included in the resulting summary. Hence, the summaries of benign and malicious sessions alike contained the word &quot;malicious&quot; through sentences like &quot;This session is malicious&quot; or &quot;This session is not malicious&quot;. This could have impacted the accuracy.</li>
<li>Semantic search seemed unable to differentiate effectively between certain closely related concepts, such as interactive vs. non-interactive sessions. A small number of specific terms may not carry enough weight in the core meaning of the session summary for semantic search to distinguish them.</li>
<li>Semantic search works better than <a href="https://link.springer.com/referenceworkentry/10.1007/978-0-387-39940-9_921">BM25</a> for cases where the user doesn’t specify the exact keywords. For example, searching for &quot;Python&quot; or &quot;Java&quot; related logs and summaries is equally effective with both ELSER and BM25. However, ELSER could retrieve more relevant data when searching for “object oriented language” related logs. In contrast, using a keyword search for “object oriented language” doesn’t yield relevant results, as shown in the image below.</li>
</ul>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image4.png" alt="Semantic search can yield more relevant results when keywords aren’t matching" /></p>
<h2>What's next</h2>
<p>We are currently looking into further improving summarization via <a href="https://arxiv.org/pdf/2005.11401.pdf">retrieval augmented generation (RAG)</a>, using tools in the <a href="https://www.elastic.co/es/guide/en/esre/current/index.html">Elastic Search and Relevance Engine</a> (ESRE). In the meantime, we’d love to hear about your experiments with LLMs, ESRE, etc. If you'd like to share what you're doing or run into any issues during the process, please reach out to us on our <a href="https://ela.st/slack">community Slack channel</a> and <a href="https://discuss.elastic.co/c/security">discussion forums</a>.</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/photo-edited-03@2x.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Using LLMs to summarize user sessions]]></title>
            <link>https://www.elastic.co/es/security-labs/using-llms-to-summarize-user-sessions</link>
            <guid>using-llms-to-summarize-user-sessions</guid>
            <pubDate>Mon, 11 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this publication, we will talk about lessons learned and key takeaways from our experiments using GPT-4 to summarize user sessions.]]></description>
            <content:encoded><![CDATA[<h2>Using LLMs to summarize user sessions</h2>
<p>With the introduction of the <a href="https://www.elastic.co/es/guide/en/security/current/security-assistant.html">AI Assistant</a> into the Security Solution in 8.8, the Security Machine Learning team at Elastic has been exploring how to optimize Security operations with LLMs like GPT-4. User session summarization seemed like the perfect use case to start experimenting with for several reasons:</p>
<ul>
<li>User session summaries can help analysts quickly decide whether a particular session's activity is worth investigating or not</li>
<li>Given the diversity of data that LLMs like GPT-4 are trained on, it is not hard to imagine that they have already been trained on <a href="https://en.wikipedia.org/wiki/Man_page">man pages</a>, and other open Security content, which can provide useful context for session investigation</li>
<li>Session summaries could potentially serve as a good supplement to the <a href="https://www.elastic.co/es/guide/en/security/current/session-view.html">Session View</a> tool, which is available in the Elastic Security Solution as of 8.2.</li>
</ul>
<p>In this publication, we will talk about lessons learned and key takeaways from our experiments using GPT-4 to summarize user sessions.</p>
<p>In our <a href="https://www.elastic.co/es/security-labs/using-llms-and-esre-to-find-similar-user-sessions">follow-on research</a>, we dedicated some time to examine sessions that shared similarities. These similar sessions can subsequently aid the analysts in identifying related suspicious activities.</p>
<h2>What is a session?</h2>
<p>In Linux, and other Unix-like systems, a &quot;user session&quot; refers to the period during which a user is logged into the system. A session begins when a user logs into the system, either via graphical login managers (GDM, LightDM) or via command-line interfaces (terminal, SSH).</p>
<p>When the Linux kernel starts, it creates a special process called the &quot;init&quot; process, which is responsible for starting configured services such as databases, web servers, and remote access services such as <code>sshd</code>. These services, and any shells or processes spawned by them, are typically encapsulated within their own sessions and tied together by a single session ID (SID).</p>
<p>The detailed and chronological process information captured by sessions makes them an extremely useful asset for alerting, compliance, and threat hunting.</p>
<h2>Lessons learned</h2>
<p>For our experiments, we used a GPT-4 deployment with a 32k token limit available via Azure AI Studio. Tokens are basic units of text or code that LLMs use to process and generate language. Our goal here was to see how far we can get with user session summarization within the prompting paradigm alone. We learned some things along the way as it related to data processing, prompt engineering, hallucinations, parameter tuning, and evaluating the GPT summaries.</p>
<h3>Data processing</h3>
<p><em>Takeaway:</em> An aggregated JSON snapshot of the session is an effective input format for summarization.</p>
<p>A session here is simply a collection of process, network, file, and alert events. The number of events in a user session can range from a handful (&lt; 10) to hundreds of thousands. Each event log itself can be quite verbose, containing several hundred fields. For longer sessions with a large number of events, one can quickly run into token limits for models like GPT-4. Hence, passing raw logs as input to GPT-4 is not as useful for our specific use case. We saw this during experimentation, even when using tabular formats such as CSV, and using a small subset of fields in the logs.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/image1.png" alt="Max token limit (32k) is reached for sessions containing a few hundred events" /></p>
<p>To get around this issue, we had to come up with an input format that retains as much of the session's context as possible, while also keeping the number of input tokens more or less constant irrespective of the length of the session. We experimented with several log de-duplication and aggregation strategies and found that an aggregated JSON snapshot of the session works well for summarization. An example document is as follows:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/image3.jpg" alt="Aggregated JSON snapshot of session activity" /></p>
<p>This JSON snapshot highlights the most prominent activities in the session using de-duplicated lists, aggregate counts, and top-N (20 in our case) most frequent terms, with self-explanatory field names.</p>
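<p>A hypothetical sketch of that aggregation step is shown below; the field names and events are illustrative, not the exact schema used in the experiment:</p>

```python
# Collapse raw session events into an aggregated snapshot: de-duplicated
# lists, aggregate counts, and the top-N most frequent terms.
from collections import Counter

TOP_N = 20

def snapshot(events):
    names = [e["process_name"] for e in events if "process_name" in e]
    return {
        "event_count": len(events),
        "unique_process_names": sorted(set(names)),
        "top_process_names": [n for n, _ in Counter(names).most_common(TOP_N)],
    }

events = [
    {"process_name": "bash"},
    {"process_name": "curl"},
    {"process_name": "bash"},
]
print(snapshot(events))
```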
<h3>Prompt engineering</h3>
<p><em>Takeaway:</em> Few-shot tuning with high-level instructions worked best.</p>
<p>Apart from data processing, most of our time during experimentation was spent on prompt tuning. We started with a basic prompt and found that the model had a hard time connecting the dots to produce a useful summary:</p>
<pre><code>You are an AI assistant that helps people find information.
</code></pre>
<p>We then tried providing very detailed instructions in the prompt but noticed that the model ignored some of the instructions:</p>
<pre><code>You are a cybersecurity assistant, who helps Security analysts in summarizing activities that transpired in a Linux session. A summary of events that occurred in the session will be provided in JSON format. No need to explicitly list out process names and file paths. Summarize the session in ~3 paragraphs, focusing on the following: 
- Entities involved in the session: host name and user names.
- Overview of any network activity. What major source and destination ips are involved? Any malicious port activity?
- Overview of any file activity. Were any sensitive files or directories accessed?
- Highlight any other important process activity
- Looking at the process, network, and file activity, what is the user trying to do in the session? Does the activity indicate malicious behavior?
</code></pre>
<p>With the above prompt, the model did not reliably adhere to the 3-paragraph request, and it also listed process names and file paths, which it was explicitly told not to do.</p>
<p>Finally, we landed on the following prompt that provided high-level instructions for the model:</p>
<pre><code>Analyze the following Linux user session, focusing on:      
- Identifying the host and user names      
- Observing activities and identifying key patterns or trends      
- Noting any indications of malicious or suspicious behavior such as tunneling or encrypted traffic, login failures, access to sensitive files, large number of file creations and deletions, disabling or modifying Security software, use of Shadow IT, unusual parent-child process executions, long-running processes
- Conclude with a comprehensive summary of what the user might be trying to do in the session, based on the process, network, and file activity     
 ###
 Text: {your input here}
</code></pre>
<p>We also noticed that the model follows instructions more closely when they're provided in user prompts rather than in the system prompts (a system prompt is the initial instruction to the model telling it how it should behave and the user prompts are the questions/queries asked by a user to the model). After the above prompt, we were happy with the content of the summaries, but the output format was inconsistent, with the model switching between paragraphs and bulleted lists. We were able to resolve this with <a href="https://arxiv.org/pdf/2203.04291.pdf">few-shot tuning</a>, by providing the model with two examples of user prompts vs. expected responses.</p>
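<p>The few-shot layout can be sketched as a chat message list with two worked example turns; the instruction and example texts below are placeholders:</p>

```python
# Build a few-shot chat message list: a short system prompt, two worked
# examples as user/assistant pairs, then the session to summarize.
def few_shot_messages(instructions, examples, session_json):
    messages = [{"role": "system", "content": "You are a cybersecurity assistant."}]
    for example_input, example_summary in examples:
        messages.append({"role": "user", "content": instructions + "\n###\nText: " + example_input})
        messages.append({"role": "assistant", "content": example_summary})
    messages.append({"role": "user", "content": instructions + "\n###\nText: " + session_json})
    return messages

msgs = few_shot_messages(
    "Analyze the following Linux user session...",
    [("example session 1", "Example summary 1"), ("example session 2", "Example summary 2")],
    '{"event_count": 42}',
)
print(len(msgs))  # 1 system + 2 examples x 2 turns + 1 final user turn = 6
```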
<h3>Hallucinations</h3>
<p><em>Takeaway:</em> The model occasionally hallucinates while generating net new content for the summaries.</p>
<p>We observed that the model does not typically <a href="https://arxiv.org/pdf/2110.10819.pdf">hallucinate</a> while summarizing facts that are immediately apparent in the input such as user and host entities, network ports, etc. Occasionally, the model hallucinates while summarizing information that is not obvious, for example, in this case summarizing the overall user intent in the session. Some relatively easy avenues we found to mitigate hallucinations were as follows:</p>
<ul>
<li>Prompt the model to focus on specific behaviors while summarizing</li>
<li>Re-iterate that the model should fact-check its output</li>
<li>Set the <a href="https://learnprompting.org/docs/basics/configuration_hyperparameters">temperature</a> to a low value (less than or equal to 0.2) to get the model to generate less diverse responses, hence reducing the chances of hallucinations</li>
<li>Limit the response length, thus reducing the opportunity for the model to go off-track — this works especially well if the length of the texts to be summarized is more or less constant, which it was in our case</li>
</ul>
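<p>Two of these mitigations map directly onto request parameters; below is a sketch in <code>openai</code> 1.x chat-completions style, with a placeholder deployment name:</p>

```python
# Hypothetical request parameters reflecting the mitigations above:
# a low temperature for less diverse output and a capped response length.
request_params = {
    "model": "gpt-4-32k",   # Azure deployment name (placeholder)
    "temperature": 0.2,     # low temperature reduces diversity
    "max_tokens": 400,      # cap response length to limit drift
}
# These would be passed alongside the messages to
# client.chat.completions.create(**request_params, messages=...)
# on an openai.AzureOpenAI client.
```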
<h3>Parameter tuning</h3>
<p><em>Takeaway:</em> Temperature = 0 does not guarantee determinism.</p>
<p>For summarization, we explored tuning parameters such as <a href="https://txt.cohere.com/llm-parameters-best-outputs-language-ai/">Temperature and Top P</a>, to get deterministic responses from the model. Our observations were as follows:</p>
<ul>
<li>Tuning both together is not recommended, and it's also difficult to observe the effect of each when combined</li>
<li>Solely setting the temperature to a low value (&lt; 0.2) without altering Top P is usually sufficient</li>
<li>Even setting the temperature to 0 does not result in fully deterministic outputs given the inherent non-deterministic nature of floating point calculations (see <a href="https://community.openai.com/t/a-question-on-determinism/8185">this</a> post from OpenAI for a more detailed explanation)</li>
</ul>
<h2>Evaluating GPT Summaries</h2>
<p>As with any modeling task, evaluating the GPT summaries was crucial in gauging the quality and reliability of the model outcomes. In the absence of standardized evaluation approaches and metrics for text generation, we decided to do a qualitative human evaluation of the summaries, as well as a quantitative evaluation using automatic metrics such as <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE-L</a>, <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a>, <a href="https://en.wikipedia.org/wiki/METEOR">METEOR</a>, <a href="https://arxiv.org/abs/1904.09675">BERTScore</a>, and <a href="https://aclanthology.org/2020.eval4nlp-1.2/">BLANC</a>.</p>
<p>For qualitative evaluation, we had a Security Researcher write summaries for a carefully chosen (to get a good distribution of short and long sessions) set of 10 sessions, without any knowledge of the GPT summaries. Three evaluators were asked to compare the GPT summaries against the human-generated summaries using three key criteria:</p>
<ul>
<li>Factuality:  Examine if the model summary retains key facts of the session as provided by Security experts</li>
<li>Authenticity: Check for hallucinations</li>
<li>Consistency: Check the consistency of the model output, i.e., that all responses share a stable format and produce the same level of detail</li>
</ul>
<p>Finally, each of the 10 summaries was assigned a final rating of &quot;Good&quot; or &quot;Bad&quot; based on a majority vote to combine the evaluators' choices.</p>
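<p>The majority-vote step can be sketched as follows; the session IDs and votes are invented for illustration:</p>

```python
# Toy majority vote: each summary gets three evaluator votes, and the most
# common vote becomes the final rating. The votes here are made up.
votes = {
    "session-01": ["Good", "Good", "Bad"],
    "session-02": ["Bad", "Bad", "Good"],
}

final_rating = {sid: max(set(v), key=v.count) for sid, v in votes.items()}
print(final_rating)
```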
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/image2.png" alt="Summarization evaluation matrix" /></p>
<p>While we recognize the small dataset size for evaluation, our qualitative assessment showed that GPT summaries aligned with human summaries 80% of the time. For the GPT summaries that received a &quot;Bad&quot; rating, the summaries didn't retain certain important facts because the aggregated JSON document only kept the top-N terms for certain fields.</p>
<p>The automated metrics didn't seem to match human preferences, nor did they reliably measure summary quality due to the structural differences between human and LLM-generated summaries, especially for reference-based metrics.</p>
<h2>What's next</h2>
<p>We are currently looking into further improving summarization via <a href="https://arxiv.org/pdf/2005.11401.pdf">retrieval augmented generation (RAG)</a>, using tools in the <a href="https://www.elastic.co/es/guide/en/esre/current/index.html">Elastic Search and Relevance Engine (ESRE)</a>. We also experimented with using LLMs to categorize user sessions. Stay tuned for Part 2 of this blog to learn more about those experiments!</p>
<p>In the meantime, we’d love to hear about your experiments with LLMs, ESRE, etc. If you'd like to share what you're doing or run into any issues during the process, please reach out to us on our <a href="https://ela.st/slack">community Slack channel</a> and <a href="https://discuss.elastic.co/c/security">discussion forums</a>. Happy experimenting!</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/photo-edited-01@2x.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Identifying beaconing malware using Elastic]]></title>
            <link>https://www.elastic.co/es/security-labs/identifying-beaconing-malware-using-elastic</link>
            <guid>identifying-beaconing-malware-using-elastic</guid>
            <pubDate>Wed, 01 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog, we walk users through identifying beaconing malware in their environment using our beaconing identification framework.]]></description>
            <content:encoded><![CDATA[<p>The early stages of an intrusion usually include initial access, execution, persistence, and command-and-control (C2) beaconing. When structured threats use zero-days, these first two stages are often not detected. It can often be challenging and time-consuming to identify persistence mechanisms left by an advanced adversary as we saw in the <a href="https://www.elastic.co/es/blog/elastic-security-provides-free-and-open-protections-for-sunburst">2020 SUNBURST supply chain compromise</a>. Could we then have detected SUNBURST in the initial hours or days by finding its C2 beacon?</p>
<p>The appeal of beaconing detection is that it can serve as an early warning system and help discover novel persistence mechanisms in the initial hours or days after execution. This allows defenders to disrupt or evict the threat actor before they can achieve their objectives. So, while we are not quite &quot;left of boom&quot; by detecting C2 beaconing, we can make a big difference in the outcome of the attack by reducing its overall impact.</p>
<p>In this blog, we talk about a beaconing identification framework that we built using Painless and aggregations in the Elastic Stack. The framework can not only help threat hunters and analysts monitor network traffic for beaconing activity, but also provides useful indicators of compromise (IoCs) for them to start an investigation with. If you don’t have an Elastic Cloud cluster but would like to try out our beaconing identification framework, you can start a <a href="https://cloud.elastic.co/registration">free 14-day trial</a> of Elastic Cloud.</p>
<h2><strong>Beaconing — A primer</strong></h2>
<p>An enterprise's defense is only as good as its firewalls, antivirus, endpoint detection and intrusion detection capabilities, and SOC (Security Operations Center) — which consists of analysts, engineers, operators, administrators, etc. who work round the clock to keep the organization secure. Malware, however, enters enterprises in many different ways and uses a variety of techniques to go undetected. An increasingly common method used by adversaries to evade detection is C2 beaconing as a part of their attack chain, given that it allows them to blend into networks like a normal user.</p>
<p>In networking, beaconing is a term used to describe a continuous cadence of communication between two systems. In the context of malware, beaconing is when malware periodically calls out to the attacker's C2 server to get further instructions on tasks to perform on the victim machine. The frequency at which the malware checks in and the methods used for the communications are configured by the attacker. Some of the common protocols used for C2 are HTTP/S, DNS, SSH, and SMTP, as well as common cloud services like Google, Twitter, Dropbox, etc. Using common protocols and services for C2 allows adversaries to masquerade as normal network traffic and hence evade firewalls.</p>
<p>While on the surface beaconing can appear similar to normal network traffic, it has some unique traits with respect to timing and packet size, which can be modeled using standard statistical and signal processing techniques.</p>
<p>Below is an example of a Koadic C2 beacon, which serves the malicious payload using the DLL host process. As you can see, the payload beacons consistently at an interval of 10 minutes, and the source, as well as destination packet sizes, are almost identical.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/1-koadic-beacon.png" alt="Example of a Koadic C2 beacon" /></p>
<p>It might seem like a trivial task to catch C2 beaconing if all beacons were as neatly structured and predictable as the above. All one would have to look for is periodicity and consistency in packet sizes. However, malware these days is not as straightforward. Below is an example of an Emotet beacon:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/2-emotet-beacon.jpg" alt="Example of an Emotet beacon" /></p>
<p>Most sophisticated malware nowadays adds a &quot;jitter&quot; or randomness to the beacon interval, making the signal more difficult to detect. Some malware authors also use longer beacon intervals. The beaconing identification framework we propose accounts for some of these elusive modifications to traditional beaconing behavior.</p>
<h2><strong>Our approach</strong></h2>
<p>We’ve discussed a bit about the why and what — in this section we dig deeper into how we identify beaconing traffic. Before we begin, it is important to note that beaconing is merely a communication characteristic. It is neither good nor evil by definition. While it is true that malware heavily relies on beaconing nowadays, a lot of legitimate software also exhibits beaconing behavior.</p>
<p>While we have made efforts to reduce false positives, this framework should be looked at as a means of beaconing identification that helps reduce the search space for a threat hunt, not as a means of detection. That said, indicators produced by this framework, when combined with other IoCs, can potentially be used to detect malicious activity.</p>
<p>The beacons we are interested in comprise traffic from a single running process on a particular host machine to one or more external IPs. Given that malware can have both short (order of seconds) and long (order of hours or days) check-in intervals, we will restrict our attention to a time window that works reasonably for both and attempt to answer the question: “What is beaconing in my environment right now or recently?” We have also parameterized the inputs to the framework to allow users to configure important settings such as the time window. More on this in upcoming sections.</p>
<p>When dealing with large data sets, such as network data for an enterprise, you need to think carefully about what you can measure, which allows you to scale effectively. Scaling has several facets, but for our purposes, we have the following requirements:</p>
<ol>
<li>Work can be parallelized over different shards of data stored on different machines.</li>
<li>The amount of data that needs to move around to compute what is needed must be kept manageable.</li>
</ol>
<p>Multiple approaches have been suggested for detecting beaconing characteristics, but not all of them satisfy these constraints. For example, a popular choice for detecting beacon timing characteristics is to measure the interval between consecutive events. This proves too inefficient on large datasets, because computing inter-event intervals requires gathering and ordering all of a series' events in one place, so the work cannot be spread across shards.</p>
<p>Driven by the need to scale, we chose to detect beaconing by bucketing the data in the time window to be analyzed. We gather the event count and average bytes sent and received in each bucket. These statistics can be computed in MapReduce fashion and values from different shards can be combined at the coordinating node of an Elasticsearch query.</p>
<p>Furthermore, by controlling the ratio between the bucket and window lengths, the data we pass per running process has predictable memory consumption, which is important for system stability. The whole process is illustrated diagrammatically below:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/3-bucketing-data.jpg" alt="Bucketing data for analysis" /></p>
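<p>A minimal Python sketch of this bucketing step is shown below (illustrative only; the production version runs as a Painless scripted metric inside Elasticsearch, with per-shard partial results combined at the coordinating node):</p>

```python
def bucket_stats(events, window_start, bucket_len, n_buckets):
    """Summarize (timestamp, netflow_bytes) events into per-bucket
    event counts and average bytes, the two statistics the
    beaconing tests are computed from."""
    counts = [0] * n_buckets
    totals = [0.0] * n_buckets
    for ts, nbytes in events:
        i = int((ts - window_start) // bucket_len)
        if 0 <= i < n_buckets:
            counts[i] += 1
            totals[i] += nbytes
    avg_bytes = [t / c if c else 0.0 for t, c in zip(totals, counts)]
    return counts, avg_bytes
```

<p>Because each shard can compute its own partial counts and byte totals, the per-bucket sums from different shards can simply be added together before the averages are taken.</p>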
<p>A key attribute of beaconing traffic is that it often has similar netflow bytes for the majority of its communication. If we average the bytes over all the events that fall in a single bucket, the averages for different buckets will in fact be even more similar. This is just the law of large numbers in action. A good way to measure the similarity of several positive numbers (in our case, the average bucket netflow bytes) is a statistic called the <a href="https://en.wikipedia.org/wiki/Coefficient_of_variation">coefficient of variation</a> (COV). This captures the average relative difference between the values and their mean. Because this is a relative value, a COV closer to 0 implies that values are tightly clustered around their mean.</p>
<p>We also found that occasional spikes in the netflow bytes in some beacons were inflating the COV statistic. In order to rectify this, we simply discarded low and high percentile values when computing the COV, which is a standard technique for creating a robust statistic. We threshold the value of this statistic to be significantly less than one to detect this characteristic of beacons.</p>
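<p>As a sketch, the truncated (robust) COV test can be written as follows; the 10% default truncation here is an assumption for illustration, not the released default:</p>

```python
import statistics

def robust_cov(values, truncate_at=0.1):
    """Coefficient of variation of average bucket bytes, computed
    after discarding the lowest and highest `truncate_at` fraction
    of values so occasional traffic spikes don't inflate it."""
    vals = sorted(values)
    k = int(len(vals) * truncate_at)
    trimmed = vals[k:len(vals) - k] if k else vals
    mean = statistics.mean(trimmed)
    if mean == 0:
        return 0.0
    return statistics.pstdev(trimmed) / mean

# Near-identical payload sizes with one spike: the trimmed COV stays
# close to 0, while the untrimmed COV is inflated by the outlier.
beacon_bytes = [512, 510, 513, 511, 512, 9000, 512, 511, 513, 510]
```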
<p>For periodicity, we observed that signals displayed one of two characteristics when we viewed the bucket counts. If the period was less than the time bucket length (i.e. high frequency beacons), then the count showed little variation from bucket to bucket. If the period was longer than the time bucket length (i.e. low frequency beacons), then the signal had high autocorrelation. Let's discuss these in detail.</p>
<p>To test for high frequency beacons, we use a statistic called <a href="https://en.wikipedia.org/wiki/Index_of_dispersion">relative variance</a> (RV). The rates of many naturally occurring phenomena are well described by a <a href="https://en.wikipedia.org/wiki/Poisson_distribution#Occurrence_and_applications">Poisson distribution</a>. The reason for this is that if events arrive randomly at a constant average rate and the occurrence of one event doesn’t affect the chance of others occurring, then their count in a fixed time interval must be Poisson distributed.</p>
<p>Just to underline this point, the underlying mechanism for the random delay between events doesn’t matter (making a coffee, waiting for your software to build, etc.) — if those properties hold, the rate distribution is always the same. Therefore, we expect the bucket counts to be Poisson distributed for much of the traffic in our network, but not for beacons, which are much more regular. A feature of the Poisson distribution is that its variance is equal to its average, i.e. its RV is 1. Loosely, this means that if the RV of our bucket counts is closer to 0, the signal is more regular than a Poisson process.</p>
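<p>The high-frequency test can then be sketched as comparing the bucket counts' variance-to-mean ratio against 1 (a pure-Python illustration, not the Painless implementation):</p>

```python
import statistics

def relative_variance(bucket_counts):
    """Variance-to-mean ratio of per-bucket event counts: ~1 for
    Poisson-like (random) traffic, near 0 for a regular beacon."""
    mean = statistics.mean(bucket_counts)
    if mean == 0:
        return 0.0
    return statistics.pvariance(bucket_counts) / mean

# A high-frequency beacon firing ~6 times per bucket looks far more
# regular than a Poisson process with the same average rate.
beacon_counts = [6, 6, 5, 6, 6, 6, 5, 6]
```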
<p><a href="https://en.wikipedia.org/wiki/Autocorrelation">Autocorrelation</a> is a useful statistic for understanding when a time series repeats itself. The basic idea behind autocorrelation is to compare the time series values to themselves after shifting them in time. Specifically, it is the covariance between the two sets of values (which is larger when they are more similar), normalized by dividing it by the square root of the variances of the two sets, which measures how much the values vary among themselves.</p>
<p>This process is illustrated schematically below. We apply this to the time series comprising the bucket counts: if the signal is periodic then the time bucketed counts must also repeat themselves. The nice thing about autocorrelation from our perspective is that it is capable of detecting any periodic pattern. For example, the events don’t need to be regularly spaced but might repeat like two events occurring close to one another in time, followed by a long gap and so on.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/4-diagramming-representation.jpg" alt="" /></p>
<p>We don’t know the shift beforehand that will maximize the similarity between the two sets of values, so we search over all shifts for the maximum. This, in effect, is the period of the data — the closer its autocorrelation is to one, the closer the time series is to being truly periodic. We threshold the autocorrelation close to one to test for low frequency beacons.</p>
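<p>A bare-bones version of this search over shifts might look like the following (a sketch only; the released implementation also corrects for jitter, as discussed next):</p>

```python
import statistics

def max_autocorrelation(counts, max_shift=None):
    """Maximum normalized autocorrelation of the bucket counts over
    candidate shifts; values close to 1 indicate a periodic signal,
    and the best shift approximates the beacon period."""
    if max_shift is None:
        max_shift = len(counts) // 2
    best = 0.0
    for shift in range(1, max_shift + 1):
        a, b = counts[:-shift], counts[shift:]
        mean_a, mean_b = statistics.mean(a), statistics.mean(b)
        # Covariance of the series with its shifted self...
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
        var_a, var_b = statistics.pvariance(a), statistics.pvariance(b)
        if var_a > 0 and var_b > 0:
            # ...normalized by the variances of the two slices.
            best = max(best, cov / (var_a * var_b) ** 0.5)
    return best
```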
<p>Finally, we noted that most beaconing malware these days incorporates jitter. How does autocorrelation deal with this? First off, autocorrelation isn’t a binary measure — it is a sliding scale: the closer the value is to 1, the more similar the two sets of values are to one another. Even if they are not identical but merely similar, it can still be close to one. In fact, we can do better than this by modeling how random jitter affects autocorrelation and undoing its effect. Provided the jitter isn’t too large, the process to do this turns out to be about as complex as just finding the maximum autocorrelation.</p>
<p>In our implementation, we’ve made the percentage configurable, although one would always use a small-ish percentage to avoid flagging too much traffic as periodic. If you'd like to dig into the gory details of our implementation, all the artifacts are available as a GitHub <a href="https://github.com/elastic/detection-rules/releases/tag/ML-Beaconing-20211216-1">release</a> in our detection rules repository.</p>
<h2><strong>How do we do this using Elasticsearch?</strong></h2>
<p>Elasticsearch has some very powerful tools for ad hoc data analysis. The <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html">scripted metric aggregation</a> is one of them. The nice thing about this aggregation is that it allows you to write custom Painless scripts to derive different metrics about your data. We used the aggregation to script out the beaconing tests.</p>
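<p>The rough shape of such an aggregation is shown below; the scripts and parameters are placeholders for illustration rather than the actual Painless code, which ships with the linked release:</p>

```python
# Skeleton of a scripted metric aggregation body. The real init/map/
# combine/reduce Painless scripts implement the time bucketing and the
# COV, relative variance, and autocorrelation tests.
beaconing_agg = {
    "scripted_metric": {
        "init_script": "...",     # allocate per-bucket count/byte arrays
        "map_script": "...",      # assign each event to a time bucket
        "combine_script": "...",  # emit per-shard partial sums
        "reduce_script": "...",   # merge shards and run the beaconing tests
        "params": {"number_buckets_in_range": 72, "time_bucket_length": "5m"},
    }
}
```

<p>The parameter values shown (72 five-minute buckets, i.e. a 6h window) are illustrative of how the bucket count and bucket length together determine the analysis window.</p>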
<p>In a typical environment, the cardinality of the distinct processes running across endpoints is rather high. Trying to run an aggregation that partitions by every running process is therefore not feasible. This is where another feature of the Elastic Stack comes in handy. A <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/current/transforms.html">transform</a> is a complex aggregation which paginates through all your data and writes results to a destination index.</p>
<p>There are various basic operations available in transforms, one of them being partitioning data at scale. In our case, we partitioned our network event logs by host and process name and ran our scripted metric aggregation against each host-process name pair. The transform also writes out various beaconing related indicators and statistics. A sample document from the resulting destination index is as follows:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/5-sample-beaconing.jpg" alt="Sample document produced by the beaconing transform" /></p>
<p>As you can see, the document contains valuable beaconing-related information about the process. First off, the beacon_stats.is_beaconing indicator says whether or not we found the process to be beaconing. If it is, as in the case above, the document will also contain important metadata, such as the frequency of the beacon. The indicator beacon_stats.periodic says whether or not the signal is a low-frequency beacon, while the indicator beacon_stats.low_count_variation indicates whether or not it is a high-frequency beacon.</p>
<p>Furthermore, the indicators beacon_stats.low_source_bytes_variation and low_destination_bytes_variation indicate whether or not the source and destination bytes sent during the beaconing communication were more or less uniform. Finally, you will also notice the beaconing_score indicator, which is a value from 1-3, representing the number of beaconing tests satisfied by the process for that time period.</p>
<p>Writing such metadata out to an index also means that you can search for different facets of beaconing software in your environment. For example, if you want to search for low frequency beaconing processes in your environment, you would query for documents where the beacon_stats.periodic indicator is true and beacon_stats.low_count_variation is false. You can also build second order analytics on top of the indexed data, such as using <a href="https://www.elastic.co/es/guide/en/kibana/current/xpack-ml-anomalies.html">anomaly detection</a> to find rare beaconing processes, or using a <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html">significant terms aggregation</a> to detect lateral movement of beaconing malware in your environment.</p>
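<p>For instance, a search for low-frequency beacons might be sketched as follows (field names taken from the sample document above; the query shape is illustrative):</p>

```python
# Find documents flagged as periodic (low-frequency beacons) that did
# not also pass the high-frequency low-count-variation test.
low_frequency_beacons = {
    "query": {
        "bool": {
            "filter": [{"term": {"beacon_stats.periodic": True}}],
            "must_not": [{"term": {"beacon_stats.low_count_variation": True}}],
        }
    }
}
```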
<p>Finally, we’ve included several dashboards for your threat hunters and analysts to use for monitoring beaconing activity in your environment. These can be found in the <a href="https://github.com/elastic/detection-rules/releases/tag/ML-Beaconing-20211216-1">release package</a> as well.</p>
<h2><strong>Tuning parameters and filtering</strong></h2>
<p>Advanced users can also tune important parameters to the scripted metric aggregation in the transforms, like jitter percentage, time window, etc. If you'd like to change the default parameters, all you would need to do is delete the transform, change the parameters, and restart it. The parameters you can tune are as follows:</p>
<ul>
<li>number_buckets_in_range: The number of time buckets we split the time window into. You need enough to ensure you get reasonable estimates for the various statistics, but too many means the transform will use more memory and compute.</li>
<li>time_bucket_length: The length of each time bucket. This controls the time window, so the larger this value the longer the time window. You might set this longer if you want to check for very low frequency beacons.</li>
<li>number_destination_ips: The number of destination IPs to gather in the results. Setting this higher increases the transform resource usage.</li>
<li>max_beaconing_bytes_cov: The maximum coefficient of variation in the payload bytes for the low source and destination bytes variance test. Setting this higher will increase the chance of detecting traffic as beaconing, so would likely increase <a href="https://en.wikipedia.org/wiki/Precision_and_recall">recall</a> for malicious C2 beacons. However, it will also reduce the <a href="https://en.wikipedia.org/wiki/Precision_and_recall">precision</a> of the test.</li>
<li>max_beaconing_count_rv: The maximum relative variance in the bucket counts for the high frequency beacon test. As with max_beaconing_bytes_cov, we suggest tuning this parameter based on the kind of tradeoff you want between precision and recall.</li>
<li>truncate_at: The lower and upper fraction of bucket values discarded when computing max_beaconing_bytes_cov and max_beaconing_count_rv. This allows you to ignore occasional changes in traffic patterns. However, if you retain too small a fraction of the data, these tests will be unreliable.</li>
<li>min_beaconing_count_autocovariance: The minimum autocorrelation of the signal for the low frequency beacon test. Lowering this value will likely result in an increase in recall for malicious C2 beacons, at the cost of reduced test precision. As with some of the other parameters mentioned above, we suggest tuning this parameter based on the kind of tradeoff you want between precision and recall.</li>
<li>max_jitter: The maximum amount by which we assume that a periodic beacon is jittered, as a fraction of its period.</li>
</ul>
<p>You can also make changes to the transform query. We currently look for beaconing activity over a 6h time range, but you can change this to a different time range. As mentioned previously, beaconing is not a characteristic specific to malware and a lot of legitimate, benign processes also exhibit beaconing-like activity.</p>
<p>In order to curb the false positive rate, we have included a starter list of filters in the transform query to exclude known benign beaconing processes that we observed during testing, and a list of IPs that fall into two categories:</p>
<ol>
<li>The source IP is local and the destination is remote</li>
<li>For certain Microsoft processes, the destination IP is in a Microsoft-owned IP block</li>
</ol>
<p>You can add to this list based on what you see in your environment.</p>
<h2><strong>Evaluation</strong></h2>
<p>In order to measure the effectiveness of our framework as a reduced search space for beaconing activity, we wanted to test two aspects:</p>
<ol>
<li>Does the framework flag actual malicious beaconing activity?</li>
<li>By how much does the framework reduce the search space for malicious beacons?</li>
</ol>
<p>In order to test the performance on malware beacons, we ran the transform on some synthetic data as well as some real malware! We set up test ranges for Emotet and Koadic, and also tested it on NOBELIUM logs we had from several months ago. The results from the real malware tests are worth mentioning here.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/6-beaconing-metadata.jpg" alt="Beaconing metadata for NOBELIUM" /></p>
<p>For NOBELIUM, the beaconing transform catches the offending process, rundll32.exe, as well as the two destination IPs, 192.99.221.77 and 83.171.237.173, which were among the main IoCs for NOBELIUM.</p>
<p>For Koadic and Emotet as well, the transform was able to flag the process as well as the known destination IPs on which the test C2 listeners were running. The characteristics of each of the beacons were different. For example, Koadic was a straightforward, high-frequency beacon that satisfied all the beaconing criteria checked in the transform, i.e. periodicity as well as low variation of source and destination bytes. Emotet was slightly trickier since it was a low-frequency beacon with a high jitter percentage, but we were able to detect it due to the low variation in the source bytes of the beacon.</p>
<p>To test the amount of reduction in search space, we ran the transform over three weeks on an internal cluster that was receiving network event logs from ~2k hosts during the testing period. We measured the reduction in search space based on the number of network event log messages, processes, and hosts an analyst or threat hunter would have to sift through before and after running the transform in order to identify malicious beacons. The numbers are as follows:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/7-search-space-reduction.jpg" alt="Search space reduction metrics as a result of the beaconing transform" /></p>
<p>While the reduction in search space is obvious, another point to note is the scale of data that the transforms are able to churn through comfortably, which becomes an important aspect to consider, especially in production environments. Additionally, we have also released dashboards (available in the <a href="https://github.com/elastic/detection-rules/releases/tag/ML-Beaconing-20211216-1">release package</a>), which track metrics like prevalence of the beaconing processes, etc. that can help make informed decisions about further filtering of the search space.</p>
<p>The released dashboards and the statistics in the table above are based on cases where the beacon_stats.is_beaconing indicator is true, i.e. beacons that satisfy at least one of the beaconing tests. However, threat hunters may want to further streamline their search by starting with the most obvious beaconing-like cases and then moving on to the less obvious ones. This can be done by filtering and searching on the beacon_stats.beaconing_score indicator instead of beacon_stats.is_beaconing, where a score of 3 indicates a typical beacon (satisfying the tests for periodicity as well as low variation in packet bytes) and a score of 1 indicates a less obvious beacon (satisfying only one of the three tests).</p>
<p>For reference, we observed the following on our internal cluster:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/Screen_Shot_2022-01-06_at_4.36.40_PM.jpg" alt="Streamlining your threat hunt using the Beaconing Score indicator" /></p>
<h2>What's next</h2>
<p>We’d love for you to try out our beaconing identification framework and give us feedback as we work on improving it. If you run into any issues during the process, please reach out to us on our <a href="https://ela.st/slack">community Slack channel</a>, <a href="https://discuss.elastic.co/c/security">discussion forums</a>, or even our <a href="https://github.com/elastic/detection-rules">open detections repository</a>. Stay tuned for Part 2 of this blog, where we’ll cover going from identifying beaconing activity to actually detecting malicious beacons!</p>
<p>Try out our beaconing identification framework with a <a href="https://cloud.elastic.co/registration">free 14-day trial</a> of Elastic Cloud.</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/blog-thumbnail-securitymaze.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Getting the Most Out of Transformers in Elastic]]></title>
            <link>https://www.elastic.co/es/security-labs/getting-the-most-out-of-transforms-in-elastic</link>
            <guid>getting-the-most-out-of-transforms-in-elastic</guid>
            <pubDate>Tue, 23 Aug 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog, we will briefly talk about how we fine-tuned a transformer model meant for a masked language modeling (MLM) task, to make it suitable for a classification task.]]></description>
            <content:encoded><![CDATA[<h2>Preamble</h2>
<p>In 8.3, our Elastic Stack Machine Learning team introduced a way to import <a href="https://www.elastic.co/es/guide/en/machine-learning/master/ml-nlp-model-ref.html">third party Natural Language Processing (NLP) models</a> into Elastic. As security researchers, we HAD to try it out on a security dataset. So we decided to build a model to identify malicious command lines by fine-tuning a pre-existing model available on the <a href="https://huggingface.co/models">Hugging Face model hub</a>.</p>
<p>Upon finding that the fine-tuned model was performing (surprisingly!) well, we wanted to see if it could replace or be combined with our previous <a href="https://www.elastic.co/es/blog/problemchild-detecting-living-off-the-land-attacks">tree-based model</a> for detecting Living off the Land (LotL) attacks. But first, we had to make sure that the throughput and latency of this new model were reasonable enough for real-time inference. This resulted in a series of experiments, the results of which we will detail in this blog.</p>
<p>In this blog, we will briefly talk about how we fine-tuned a transformer model meant for a masked language modeling (MLM) task, to make it suitable for a classification task. We will also look at how to import custom models into Elastic. Finally, we’ll dive into all the experiments we did around using the fine-tuned model for real-time inference.</p>
<h2>NLP for command line classification</h2>
<p>Before you start building NLP models, it is important to understand whether an <a href="https://www.ibm.com/cloud/learn/natural-language-processing">NLP</a> model is even suitable for the task at hand. In our case, we wanted to classify command lines as being malicious or benign. Command lines are a set of commands provided by a user via the computer terminal. An example command line is as follows:</p>
<pre><code>**move test.txt C:\**
</code></pre>
<p>The above command moves the file <strong>test.txt</strong> to the root of the <strong>C:\</strong> drive.</p>
<p>Arguments in command lines are related in the way that the co-occurrence of certain values can be indicative of malicious activity. NLP models are worth exploring here since these models are designed to understand and interpret relationships in natural (human) language, and since command lines often use some natural language.</p>
<h2>Fine-tuning a Hugging Face model</h2>
<p>Hugging Face is a data science platform that provides tools for machine learning (ML) enthusiasts to build, train, and deploy ML models using open source code and technologies. Its model hub has a wealth of models, trained for a variety of NLP tasks. You can either use these pre-trained models as-is to make predictions on your data, or fine-tune the models on datasets specific to your <a href="https://www.ibm.com/cloud/learn/natural-language-processing">NLP</a> tasks.</p>
<p>The first step in fine-tuning is to instantiate a model with the model configuration and pre-trained weights of a specific model. Random weights are assigned to any task-specific layers that might not be present in the base model. Once initialized, the model can be trained to learn the weights of the task-specific layers, thus fine-tuning it for your task. Hugging Face has a method called <a href="https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/model#transformers.PreTrainedModel.from_pretrained">from_pretrained</a> that allows you to instantiate a model from a pre-trained model configuration.</p>
<p>For our command line classification model, we created a <a href="https://huggingface.co/docs/transformers/model_doc/roberta">RoBERTa</a> model instance with encoder weights copied from the <a href="https://huggingface.co/roberta-base">roberta-base</a> model, and a randomly initialized sequence classification head on top of the encoder:</p>
<pre><code>model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
</code></pre>
<p>Hugging Face comes equipped with a <a href="https://huggingface.co/docs/transformers/v4.21.0/en/main_classes/tokenizer">Tokenizers</a> library consisting of some of today's most used tokenizers. For our model, we used the <a href="https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer">RobertaTokenizer</a> which uses <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">Byte Pair Encoding</a> (BPE) to create tokens. This tokenization scheme is well-suited for data belonging to a different domain (command lines) from that of the tokenization corpus (English text). A code snippet of how we tokenized our dataset using <strong>RobertaTokenizer</strong> can be found <a href="https://gist.github.com/ajosh0504/4560af91adb48212402300677cb65d4a#file-tokenize-py">here</a>. We then used Hugging Face's <a href="https://huggingface.co/docs/transformers/v4.21.0/en/main_classes/trainer#transformers.Trainer">Trainer</a> API to train the model, a code snippet of which can be found <a href="https://gist.github.com/ajosh0504/4560af91adb48212402300677cb65d4a#file-train-py">here</a>.</p>
<p>ML models do not understand raw text. Before using text data as inputs to a model, it needs to be converted into numbers. Tokenizers group large pieces of text into smaller, semantically useful units, such as (but not limited to) words, characters, or subwords — called tokens — which can, in turn, be converted into numbers using different encoding techniques.</p>
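<p>To make the BPE idea concrete, here is a toy merge step in plain Python (purely illustrative; the real <strong>RobertaTokenizer</strong> uses a merge table learned from a large corpus and operates on bytes):</p>

```python
from collections import Counter

def bpe_merge_step(symbols):
    """One toy Byte Pair Encoding step: find the most frequent
    adjacent pair of symbols and merge it everywhere it occurs."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return symbols, None
    (a, b), _count = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
            merged.append(a + b)  # replace the pair with one new symbol
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged, a + b
```

<p>Repeating this step builds up a vocabulary of subword units, which is why BPE copes well with out-of-domain strings like command lines.</p>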
<blockquote>
<ul>
<li>Check out <a href="https://youtu.be/_BZearw7f0w">this</a> video (2:57 onwards) to review additional pre-processing steps that might be needed after tokenization based on your dataset.</li>
<li>A complete tutorial on how to fine-tune pre-trained Hugging Face models can be found <a href="https://huggingface.co/docs/transformers/training">here</a>.</li>
</ul>
</blockquote>
<h2>Importing custom models into Elastic</h2>
<p>Once you have a trained model that you are happy with, it's time to import it into Elastic. This is done using <a href="https://www.elastic.co/es/guide/en/elasticsearch/client/eland/current/machine-learning.html">Eland</a>, a Python client and toolkit for machine learning in Elasticsearch. A code snippet of how we imported our model into Elastic using Eland can be found <a href="https://gist.github.com/ajosh0504/4560af91adb48212402300677cb65d4a#file-import-py">here</a>.<br />
You can verify that the model has been imported successfully by navigating to <strong>Model Management &gt; Trained Models</strong> via the Machine Learning UI in Kibana:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/getting-the-most-out-of-transforms-in-elastic/Imported_model_in_the_Trained_Models_UI.png" alt="Imported model in the Trained Models UI" /></p>
<h2>Using the Transformer model for inference — a series of experiments</h2>
<p>We ran a series of experiments to evaluate whether or not our Transformer model could be used for real-time inference. For the experiments, we used a dataset consisting of ~66k command lines.</p>
<p>Our first inference run with our fine-tuned <strong>RoBERTa</strong> model took ~4 hours on the test dataset. Out of the box, this is much slower than the tree-based model we were trying to beat, which took ~3 minutes for the entire dataset. It was clear that we needed to improve the throughput and latency of the PyTorch model to make it suitable for real-time inference, so we performed several experiments:</p>
<h3>Using multiple nodes and threads</h3>
<p>The latency numbers above were observed when the models were running on a single thread on a single node. If you have multiple Machine Learning (ML) nodes associated with your Elastic deployment, you can run inference on multiple nodes, and also on multiple threads on each node. This can significantly improve the throughput and latency of your models.</p>
<p>You can change these parameters while starting the trained model deployment via the <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/master/start-trained-model-deployment.html">API</a>:</p>
<pre><code>**POST \_ml/trained\_models/\\&lt;model\_id\\&gt;/deployment/\_start?number\_of\_allocations=2&amp;threa ds\_per\_allocation=4**
</code></pre>
<p><strong>number_of_allocations</strong> allows you to set the total number of allocations of a model across machine learning nodes and can be used to tune model throughput. <strong>threads_per_allocation</strong> allows you to set the number of threads used by each model allocation during inference and can be used to tune model latency. Refer to the <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/master/start-trained-model-deployment.html">API documentation</a> for best practices around setting these parameters.</p>
<p>In our case, we set <strong>number_of_allocations</strong> to <strong>2</strong>, as our cluster had two ML nodes, and <strong>threads_per_allocation</strong> to <strong>4</strong>, as each node had four allocated processors.</p>
<p>Running inference using these settings <strong>resulted in a 2.7x speedup</strong> on the original inference time.</p>
<h3>Dynamic quantization</h3>
<p>Quantizing is one of the most effective ways of improving model compute cost, while also reducing model size. The idea here is to use a reduced precision integer representation for the weights and/or activations. While there are a number of ways to trade off model accuracy for increased throughput during model development, <a href="https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html">dynamic quantization</a> helps achieve a similar trade-off after the fact, thus saving on time and resources spent on iterating over the model training.</p>
<p>Eland provides a way to dynamically quantize your model before importing it into Elastic. To do this, simply pass in quantize=True as an argument while creating the TransformerModel object (refer to the code snippet for importing models) as follows:</p>
<pre><code>**# Load the custom model**
**tm = TransformerModel(&quot;model&quot;, &quot;text\_classification&quot;, quantize=True)**
</code></pre>
<p>In the case of our command line classification model, we observed the model size drop from 499 MB to 242 MB upon dynamic quantization. Running inference on our test dataset using this model <strong>resulted in a 1.6x speedup</strong> on the original inference time, for a slight drop in model <a href="https://en.wikipedia.org/wiki/Sensitivity_and_specificity"><strong>sensitivity</strong></a> (exact numbers in the following section).</p>
<h3>Knowledge Distillation</h3>
<p><a href="https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764">Knowledge Distillation</a> is a way to achieve model compression by transferring knowledge from a large (teacher) model to a smaller (student) one while maintaining validity. At a high level, this is done by using the outputs from the teacher model at every layer, to backpropagate error through the student model. This way, the student model learns to replicate the behavior of the teacher model. Model compression is achieved by reducing the number of parameters, which is directly related to the latency of the model.</p>
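<p>We did not publish our training code, but a typical distillation objective blends a soft-target loss against the teacher's softened logits with ordinary cross-entropy on the true labels. A minimal sketch, with the temperature and weighting values chosen purely for illustration:</p>
<pre><code>import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: the student matches the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction=&quot;batchmean&quot;,
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

torch.manual_seed(0)
student = torch.randn(8, 2)          # logits for 8 samples, 2 classes
teacher = torch.randn(8, 2)          # teacher logits for the same samples
labels = torch.randint(0, 2, (8,))   # ground-truth labels
loss = distillation_loss(student, teacher, labels)
</code></pre>
<p>The temperature softens both distributions so the student also learns from the teacher's relative confidence across classes, not just its top prediction.</p>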
<p>To study the effect of knowledge distillation on the performance of our model, we fine-tuned a <a href="https://huggingface.co/distilroberta-base">distilroberta-base</a> model (following the same procedure described in the fine-tuning section) for our command line classification task and imported it into Elastic. <strong>distilroberta-base</strong> has 82 million parameters, compared to its teacher model, <strong>roberta-base</strong>, which has 125 million parameters. The model size of the fine-tuned <strong>DistilRoBERTa</strong> model turned out to be <strong>329</strong> MB, down from <strong>499</strong> MB for the <strong>RoBERTa</strong> model.</p>
<p>Upon running inference with this model, we <strong>observed a 1.5x speedup</strong> on the original inference time and slightly better model sensitivity (exact numbers in the following section) than the fine-tuned roberta-base model.</p>
<h3>Dynamic quantization and knowledge distillation</h3>
<p>We observed that dynamic quantization and model distillation both resulted in significant speedups on the original inference time. So, our final experiment involved running inference with a quantized version of the fine-tuned <strong>DistilRoBERTa</strong> model.</p>
<p>We found that this <strong>resulted in a 2.6x speedup</strong> on the original inference time, and slightly better model sensitivity (exact numbers in the following section). We also observed the model size drop from <strong>329</strong> MB to <strong>199</strong> MB after quantization.</p>
<h2>Bringing it all together</h2>
<p>Based on our experiments, dynamic quantization and model distillation resulted in significant inference speedups. Combining these improvements with distributed and parallel computing, we were further able to <strong>reduce the total inference time on our test set from four hours to 35 minutes</strong>. However, even our fastest transformer model was still several orders of magnitude slower than the tree-based model, despite using significantly more CPU resources.</p>
<p>The Machine Learning team here at Elastic is introducing an inference caching mechanism in version 8.4 of the Elastic Stack, to save time spent on performing inference on repeat samples. These are a common occurrence in real-world environments, especially when it comes to Security. With this optimization in place, we are optimistic that we will be able to use transformer models alongside tree-based models in the future.</p>
<p>A comparison of the sensitivity (true positive rate) and specificity (true negative rate) of our tree-based and transformer models shows that an ensemble of the two could potentially result in a more performant model:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Sensitivity (%)</th>
<th>False Negative Rate (%)</th>
<th>Specificity (%)</th>
<th>False Positive Rate (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tree-based</td>
<td>99.53</td>
<td>0.47</td>
<td>99.99</td>
<td>0.01</td>
</tr>
<tr>
<td>RoBERTa</td>
<td>99.57</td>
<td>0.43</td>
<td>97.76</td>
<td>2.24</td>
</tr>
<tr>
<td>RoBERTa quantized</td>
<td>99.56</td>
<td>0.44</td>
<td>97.64</td>
<td>2.36</td>
</tr>
<tr>
<td>DistilRoBERTa</td>
<td>99.68</td>
<td>0.32</td>
<td>98.66</td>
<td>1.34</td>
</tr>
<tr>
<td>DistilRoBERTa quantized</td>
<td>99.69</td>
<td>0.31</td>
<td>98.71</td>
<td>1.29</td>
</tr>
</tbody>
</table>
<p>As seen above, the tree-based model is better suited for classifying benign data while the transformer model does better on malicious samples, so a weighted average or voting ensemble could work well to reduce the total error by averaging the predictions from both models.</p>
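<p>As a sketch, a weighted-average ensemble of this kind only needs each model's per-sample malicious-class probability. The weights and probabilities below are illustrative, not drawn from our dataset:</p>
<pre><code>import numpy as np

def weighted_ensemble(tree_probs, transformer_probs, w_tree=0.5):
    # Weighted average of the malicious-class probabilities from both models.
    return w_tree * tree_probs + (1 - w_tree) * transformer_probs

# Illustrative per-sample malicious-class probabilities.
tree_probs = np.array([0.10, 0.90, 0.40])
transformer_probs = np.array([0.20, 0.95, 0.70])

combined = weighted_ensemble(tree_probs, transformer_probs, w_tree=0.6)
preds = (combined &gt;= 0.5).astype(int)
print(combined)  # [0.14 0.92 0.52]
print(preds)     # [0 1 1]
</code></pre>
<p>In practice, the weight could be tuned on a validation set, leaning toward the tree-based model for specificity and toward the transformer for sensitivity.</p>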
<h2>What's next</h2>
<p>We plan to cover our findings from inference caching and model ensembling in a follow-up blog. Stay tuned!</p>
<p>In the meantime, we’d love to hear about models you're building for inference in Elastic. If you'd like to share what you're doing or run into any issues during the process, please reach out to us on our <a href="https://ela.st/slack">community Slack channel</a> and <a href="https://discuss.elastic.co/c/security">discussion forums</a>. Happy experimenting!</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/getting-the-most-out-of-transforms-in-elastic/machine-learning-1200x628px-2021-notext.jpg" length="0" type="image/jpg"/>
        </item>
    </channel>
</rss>