<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Security Labs - Articles by Apoorva Joshi</title>
        <link>https://www.elastic.co/es/security-labs</link>
        <description>Trusted security news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Thu, 05 Mar 2026 22:21:01 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Security Labs - Articles by Apoorva Joshi</title>
            <url>https://www.elastic.co/es/security-labs/assets/security-labs-thumbnail.png</url>
            <link>https://www.elastic.co/es/security-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Using LLMs and ESRE to find similar user sessions]]></title>
            <link>https://www.elastic.co/es/security-labs/using-llms-and-esre-to-find-similar-user-sessions</link>
            <guid>using-llms-and-esre-to-find-similar-user-sessions</guid>
            <pubDate>Tue, 19 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In our previous article, we explored using the GPT-4 Large Language Model (LLM) to condense Linux user sessions. In the context of the same experiment, we dedicated some time to examining sessions that shared similarities. Similar sessions can subsequently help analysts identify related suspicious activity.]]></description>
            <content:encoded><![CDATA[<h2>Using LLMs and ESRE to find similar user sessions</h2>
<p>In our <a href="https://www.elastic.co/es/security-labs/using-llms-to-summarize-user-sessions">previous article</a>, we explored using the GPT-4 Large Language Model (LLM) to condense complex Linux user sessions into concise summaries. We highlighted the key takeaways from our experiments, shedding light on the nuances of data preprocessing, prompt tuning, and model parameter adjustments. In the context of the same experiment, we dedicated some time to examining sessions that shared similarities. Similar sessions can subsequently help analysts identify related suspicious activity. We explored the following methods to find similarities in user sessions:</p>
<ul>
<li>To uncover similar user profiles and sessions, one approach was to group sessions according to the actions executed by users; we accomplished this by instructing the Large Language Model (LLM) to categorize user sessions into predefined categories</li>
<li>Additionally, we harnessed the capabilities of <a href="https://www.elastic.co/es/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a> (Elastic’s retrieval model for semantic search) to execute a semantic search on the model summaries derived from the session summarization experiment</li>
</ul>
<p>This research focuses on our experiments using GPT-4 for session categorization and <a href="https://www.elastic.co/es/elasticsearch/elasticsearch-relevance-engine">ESRE</a> for semantic search.</p>
<h2>Leveraging GPT for Session Categorization</h2>
<p>We consulted a security research colleague with domain expertise to define nine categories for our dataset of 75 sessions. These categories generalize the main behaviors and significant features observed in the sessions. They include the following activities:</p>
<ul>
<li>Docker Execution</li>
<li>Network Operations</li>
<li>File Searches</li>
<li>Linux Command Line Usage</li>
<li>Linux Sandbox Application Usage</li>
<li>Pip Installations</li>
<li>Package Installations</li>
<li>Script Executions</li>
<li>Process Executions</li>
</ul>
<h2>Lessons learned</h2>
<p>For our experiments, we used a GPT-4 deployment in Azure AI Studio with a token limit of 32k. To explore the potential of the GPT model for session categorization, we conducted a series of experiments, directing the model to categorize sessions by inputting the same JSON summary document we used for the <a href="https://www.elastic.co/es/security-labs/using-llms-to-summarize-user-sessions">session summarization process</a>.</p>
<p>This effort included multiple iterations, during which we concentrated on enhancing prompts and <a href="https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api">Few-Shot</a> Learning. As for the model parameters, we maintained a <a href="https://txt.cohere.com/llm-parameters-best-outputs-language-ai/">Temperature of 0</a> in an effort to make the outputs less diverse.</p>
<h3>Prompt engineering</h3>
<p><em>Takeaway:</em> Including explanations for categories in the prompts does not impact the model's performance.</p>
<p>The session categorization component was introduced as an extension of the session summarization prompt. We explored the effect of incorporating a contextual explanation for each category into the prompt. Interestingly, appending this illustrative context did not significantly change the model's performance compared to prompts without the supplementary information.</p>
<p>Below is a template we used to guide the model's categorization process:</p>
<pre><code>You are a cybersecurity assistant, who helps Security analysts in summarizing activities that transpired in a Linux session. A summary of events that occurred in the session will be provided in JSON format. No need to explicitly list out process names and file paths. Summarize the session in ~3 paragraphs, focusing on the following: 
- Entities involved in the session: host name and user names.
- Overview of any network activity. What major source and destination ips are involved? Any malicious port activity?
- Overview of any file activity. Were any sensitive files or directories accessed?
- Highlight any other important process activity
- Looking at the process, network, and file activity, what is the user trying to do in the session? Does the activity indicate malicious behavior?

Also, categorize the below Linux session in one of the following 9 categories: Network, Script Execution, Linux Command Line Utility, File search, Docker Execution, Package Installations, Pip Installations, Process Execution and Linux Sandbox Application.

A brief description for each Linux session category is provided below. Refer to these explanations while categorizing the sessions.
- Docker Execution: The session involves command with docker operations, such as docker-run and others
- Network: The session involves commands with network operations
- File Search: The session involves file operations, pertaining to search
- Linux Command Line Utility: The session involves linux command executions
- Linux Sandbox Application: The session involves a sandbox application activity. 
- Pip Installations: The session involves python pip installations
- Package Installations: The session involves package installations or removal activities. This is more of apt-get, yum, dpkg and general command line installers as opposed to any software wrapper
- Script Execution: The session involves bash script invocations. All of these have pointed custom infrastructure script invocations
- Process Execution: The session focuses on other process executions and is not limited to linux commands. 
 ###
 Text: {your input here}
</code></pre>
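<p>As a rough illustration of how a prompt like this can be driven programmatically, here is a minimal sketch assuming the <code>openai</code> 1.x Python client against an Azure OpenAI deployment; the helper names and deployment wiring are placeholders, not the code used in the experiment:</p>

```python
# Sketch only: build the categorization prompt and send it to an Azure
# OpenAI GPT-4 deployment with Temperature 0. Names are placeholders.
import json

CATEGORIES = [
    "Network", "Script Execution", "Linux Command Line Utility",
    "File Search", "Docker Execution", "Package Installations",
    "Pip Installations", "Process Execution", "Linux Sandbox Application",
]

def build_prompt(session_snapshot):
    """Append the aggregated JSON session snapshot to the instructions."""
    instructions = (
        "Categorize the below Linux session in one of the following "
        "9 categories: " + ", ".join(CATEGORIES) + ".\n###\nText: "
    )
    return instructions + json.dumps(session_snapshot)

def categorize(client, deployment, session_snapshot):
    # client would be an openai.AzureOpenAI instance; Temperature 0 keeps
    # the output as deterministic as the model allows.
    response = client.chat.completions.create(
        model=deployment,
        temperature=0,
        messages=[{"role": "user", "content": build_prompt(session_snapshot)}],
    )
    return response.choices[0].message.content
```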
<h3>Few-shot tuning</h3>
<p><em>Takeaway:</em> Adding examples for each category improves accuracy.</p>
<p>In parallel, we investigated whether including one example for each category in the above prompt would improve the model's performance. This strategy resulted in a significant enhancement, boosting the model's accuracy by 20%.</p>
<h2>Evaluating GPT Categories</h2>
<p>Evaluating the GPT categories is crucial for measuring the quality and reliability of the outcomes. To evaluate the categorization results, we compared the model's categories against the human categorization assigned by the security expert (referred to as &quot;Ground_Truth&quot; in the below image). We calculated the total accuracy as the number of successful matches over the total number of sessions.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image2.png" alt="Evaluating Session Categories" /></p>
<p>We observed that GPT-4 faced challenges when dealing with samples bearing multiple categories. However, when assigning a single category, it aligned with the human categorization in 56% of cases. The &quot;Linux Command Line Utility&quot; category posed a particular challenge, accounting for 47% of the false negatives and often being misclassified as &quot;Process Execution&quot; or &quot;Script Execution.&quot; This discrepancy arose from the closely related definitions of the &quot;Linux Command Line Utility&quot; and &quot;Process Execution&quot; categories; the prompts may also have lacked information, such as process command line arguments, that could have served as a valuable distinguishing factor between these categories.</p>
<p>Given the results from our evaluation, we conclude that we either need to tune the descriptions for each category in the prompt or provide more examples to the model via few-shot training. Additionally, it's worth considering whether GPT is the most suitable choice for classification, particularly within the context of the prompting paradigm.</p>
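<p>The match-based accuracy computation described above amounts to something like the following; the session IDs and labels here are invented for illustration:</p>

```python
# Toy accuracy computation: count exact matches between the model's
# category and the expert's ground-truth label. The data is made up.
ground_truth = {"s1": "Network", "s2": "File Search", "s3": "Docker Execution"}
model_output = {"s1": "Network", "s2": "Process Execution", "s3": "Docker Execution"}

matches = sum(
    1 for sid, label in ground_truth.items() if model_output.get(sid) == label
)
accuracy = matches / len(ground_truth)
print(f"accuracy = {accuracy:.0%}")
```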
<h2>Semantic search with ELSER</h2>
<p>We also wanted to try <a href="https://www.elastic.co/es/guide/en/machine-learning/current/ml-nlp-elser.html#ml-nlp-elser">ELSER</a>, the Elastic Learned Sparse EncodeR for semantic search. Semantic search focuses on contextual meaning, rather than strictly exact keyword inputs, and ELSER is a retrieval model trained by Elastic that enables you to perform semantic search and retrieve more relevant results.</p>
<p>We tried some examples of semantic search questions on the session summaries. The session summaries were stored in an Elasticsearch index, and it was simple to download the ELSER model following an <a href="https://www.elastic.co/es/guide/en/machine-learning/current/ml-nlp-elser.html#ml-nlp-elser">official tutorial</a>. The tokens generated by ELSER are stored in the index, as shown in the image below:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image1.png" alt="Tokens generated by ELSER" /></p>
<p>Afterward, semantic search on the index was overall able to retrieve the most relevant events. Semantic search queries about the events included:</p>
<ul>
<li>Password related – yielding 1Password related logs</li>
<li>Java – yielding logs that used Java</li>
<li>Python – yielding logs that used Python</li>
<li>Non-interactive session</li>
<li>Interactive session</li>
</ul>
<p>An example of semantic search can be seen in the Dev Tools console through a <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/8.9/semantic-search-elser.html#text-expansion-query">text_expansion query</a>.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image5.png" alt="Example screenshot of using semantic search with the Elastic dev tools console" /></p>
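<p>For reference, a <code>text_expansion</code> request of the kind shown in the screenshot looks like the following in the Dev Tools console. The index name <code>session-summaries</code> and the token field <code>ml.tokens</code> are assumptions based on the ELSER tutorial, not necessarily the exact names used in the experiment:</p>

```
GET session-summaries/_search
{
  "query": {
    "text_expansion": {
      "ml.tokens": {
        "model_id": ".elser_model_1",
        "model_text": "object oriented language"
      }
    }
  }
}
```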
<p>Some takeaways are:</p>
<ul>
<li>For semantic search, the prompt template can cause the summary to contain too many unrelated keywords. For example, because we wanted every summary to include an assessment of whether or not the session should be considered &quot;malicious&quot;, that specific word was always included in the resulting summary. Hence, the summaries of benign and malicious sessions alike contained the word &quot;malicious&quot; through sentences like &quot;This session is malicious&quot; or &quot;This session is not malicious&quot;. This could have impacted the accuracy.</li>
<li>Semantic search seemed unable to differentiate effectively between certain closely related concepts, such as interactive vs. non-interactive sessions. A small number of specific terms may not carry enough weight in the core meaning of the session summary for semantic search to distinguish them.</li>
<li>Semantic search works better than <a href="https://link.springer.com/referenceworkentry/10.1007/978-0-387-39940-9_921">BM25</a> for cases where the user doesn’t specify the exact keywords. For example, searching for &quot;Python&quot; or &quot;Java&quot; related logs and summaries is equally effective with both ELSER and BM25. However, ELSER could retrieve more relevant data when searching for “object oriented language” related logs. In contrast, using a keyword search for “object oriented language” doesn’t yield relevant results, as shown in the image below.</li>
</ul>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image4.png" alt="Semantic search can yield more relevant results when keywords aren’t matching" /></p>
<h2>What's next</h2>
<p>We are currently looking into further improving summarization via <a href="https://arxiv.org/pdf/2005.11401.pdf">retrieval augmented generation (RAG)</a>, using tools in the <a href="https://www.elastic.co/es/guide/en/esre/current/index.html">Elastic Search and Relevance Engine</a> (ESRE). In the meantime, we’d love to hear about your experiments with LLMs, ESRE, etc. If you'd like to share what you're doing or run into any issues during the process, please reach out to us on our <a href="https://ela.st/slack">community Slack channel</a> and <a href="https://discuss.elastic.co/c/security">discussion forums</a>.</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/photo-edited-03@2x.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Using LLMs to summarize user sessions]]></title>
            <link>https://www.elastic.co/es/security-labs/using-llms-to-summarize-user-sessions</link>
            <guid>using-llms-to-summarize-user-sessions</guid>
            <pubDate>Mon, 11 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this publication, we will talk about lessons learned and key takeaways from our experiments using GPT-4 to summarize user sessions.]]></description>
            <content:encoded><![CDATA[<h2>Using LLMs to summarize user sessions</h2>
<p>With the introduction of the <a href="https://www.elastic.co/es/guide/en/security/current/security-assistant.html">AI Assistant</a> into the Security Solution in 8.8, the Security Machine Learning team at Elastic has been exploring how to optimize Security operations with LLMs like GPT-4. User session summarization seemed like the perfect use case to start experimenting with for several reasons:</p>
<ul>
<li>User session summaries can help analysts quickly decide whether a particular session's activity is worth investigating or not</li>
<li>Given the diversity of data that LLMs like GPT-4 are trained on, it is not hard to imagine that they have already been trained on <a href="https://en.wikipedia.org/wiki/Man_page">man pages</a>, and other open Security content, which can provide useful context for session investigation</li>
<li>Session summaries could potentially serve as a good supplement to the <a href="https://www.elastic.co/es/guide/en/security/current/session-view.html">Session View</a> tool, which is available in the Elastic Security Solution as of 8.2.</li>
</ul>
<p>In this publication, we will talk about lessons learned and key takeaways from our experiments using GPT-4 to summarize user sessions.</p>
<p>In our <a href="https://www.elastic.co/es/security-labs/using-llms-and-esre-to-find-similar-user-sessions">follow-on research</a>, we dedicated some time to examine sessions that shared similarities. These similar sessions can subsequently aid the analysts in identifying related suspicious activities.</p>
<h2>What is a session?</h2>
<p>In Linux, and other Unix-like systems, a &quot;user session&quot; refers to the period during which a user is logged into the system. A session begins when a user logs into the system, either via graphical login managers (GDM, LightDM) or via command-line interfaces (terminal, SSH).</p>
<p>When the Linux kernel starts, it creates a special process called the &quot;init&quot; process, which is responsible for starting configured services such as databases, web servers, and remote access services such as <code>sshd</code>. These services, and any shells or processes spawned by them, are typically encapsulated within their own sessions and tied together by a single session ID (SID).</p>
<p>The detailed and chronological process information captured by sessions makes them an extremely useful asset for alerting, compliance, and threat hunting.</p>
<h2>Lessons learned</h2>
<p>For our experiments, we used a GPT-4 deployment with a 32k token limit available via Azure AI Studio. Tokens are basic units of text or code that LLMs use to process and generate language. Our goal here was to see how far we can get with user session summarization within the prompting paradigm alone. We learned some things along the way as it related to data processing, prompt engineering, hallucinations, parameter tuning, and evaluating the GPT summaries.</p>
<h3>Data processing</h3>
<p><em>Takeaway:</em> An aggregated JSON snapshot of the session is an effective input format for summarization.</p>
<p>A session here is simply a collection of process, network, file, and alert events. The number of events in a user session can range from a handful (&lt; 10) to hundreds of thousands. Each event log itself can be quite verbose, containing several hundred fields. For longer sessions with a large number of events, one can quickly run into token limits for models like GPT-4. Hence, passing raw logs as input to GPT-4 is not as useful for our specific use case. We saw this during experimentation, even when using tabular formats such as CSV, and using a small subset of fields in the logs.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/image1.png" alt="Max token limit (32k) is reached for sessions containing a few hundred events" /></p>
<p>To get around this issue, we had to come up with an input format that retains as much of the session's context as possible, while also keeping the number of input tokens more or less constant irrespective of the length of the session. We experimented with several log de-duplication and aggregation strategies and found that an aggregated JSON snapshot of the session works well for summarization. An example document is as follows:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/image3.jpg" alt="Aggregated JSON snapshot of session activity" /></p>
<p>This JSON snapshot highlights the most prominent activities in the session using de-duplicated lists, aggregate counts, and top-N (20 in our case) most frequent terms, with self-explanatory field names.</p>
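<p>A hypothetical sketch of that aggregation step is shown below; the field names and events are illustrative, not the exact schema used in the experiment:</p>

```python
# Collapse raw session events into an aggregated snapshot: de-duplicated
# lists, aggregate counts, and the top-N most frequent terms.
from collections import Counter

TOP_N = 20

def snapshot(events):
    names = [e["process_name"] for e in events if "process_name" in e]
    return {
        "event_count": len(events),
        "unique_process_names": sorted(set(names)),
        "top_process_names": [n for n, _ in Counter(names).most_common(TOP_N)],
    }

events = [
    {"process_name": "bash"},
    {"process_name": "curl"},
    {"process_name": "bash"},
]
print(snapshot(events))
```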
<h3>Prompt engineering</h3>
<p><em>Takeaway:</em> Few-shot tuning with high-level instructions worked best.</p>
<p>Apart from data processing, most of our time during experimentation was spent on prompt tuning. We started with a basic prompt and found that the model had a hard time connecting the dots to produce a useful summary:</p>
<pre><code>You are an AI assistant that helps people find information.
</code></pre>
<p>We then tried providing very detailed instructions in the prompt but noticed that the model ignored some of the instructions:</p>
<pre><code>You are a cybersecurity assistant, who helps Security analysts in summarizing activities that transpired in a Linux session. A summary of events that occurred in the session will be provided in JSON format. No need to explicitly list out process names and file paths. Summarize the session in ~3 paragraphs, focusing on the following: 
- Entities involved in the session: host name and user names.
- Overview of any network activity. What major source and destination ips are involved? Any malicious port activity?
- Overview of any file activity. Were any sensitive files or directories accessed?
- Highlight any other important process activity
- Looking at the process, network, and file activity, what is the user trying to do in the session? Does the activity indicate malicious behavior?
</code></pre>
<p>With the above prompt, the model did not reliably adhere to the 3-paragraph request, and it also listed process names and file paths, which it was explicitly told not to do.</p>
<p>Finally, we landed on the following prompt that provided high-level instructions for the model:</p>
<pre><code>Analyze the following Linux user session, focusing on:      
- Identifying the host and user names      
- Observing activities and identifying key patterns or trends      
- Noting any indications of malicious or suspicious behavior such as tunneling or encrypted traffic, login failures, access to sensitive files, large number of file creations and deletions, disabling or modifying Security software, use of Shadow IT, unusual parent-child process executions, long-running processes
- Conclude with a comprehensive summary of what the user might be trying to do in the session, based on the process, network, and file activity     
 ###
 Text: {your input here}
</code></pre>
<p>We also noticed that the model follows instructions more closely when they're provided in user prompts rather than in the system prompts (a system prompt is the initial instruction to the model telling it how it should behave and the user prompts are the questions/queries asked by a user to the model). After the above prompt, we were happy with the content of the summaries, but the output format was inconsistent, with the model switching between paragraphs and bulleted lists. We were able to resolve this with <a href="https://arxiv.org/pdf/2203.04291.pdf">few-shot tuning</a>, by providing the model with two examples of user prompts vs. expected responses.</p>
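<p>The few-shot layout can be sketched as a chat message list with two worked example turns; the instruction and example texts below are placeholders:</p>

```python
# Build a few-shot chat message list: a short system prompt, two worked
# examples as user/assistant pairs, then the session to summarize.
def few_shot_messages(instructions, examples, session_json):
    messages = [{"role": "system", "content": "You are a cybersecurity assistant."}]
    for example_input, example_summary in examples:
        messages.append({"role": "user", "content": instructions + "\n###\nText: " + example_input})
        messages.append({"role": "assistant", "content": example_summary})
    messages.append({"role": "user", "content": instructions + "\n###\nText: " + session_json})
    return messages

msgs = few_shot_messages(
    "Analyze the following Linux user session...",
    [("example session 1", "Example summary 1"), ("example session 2", "Example summary 2")],
    '{"event_count": 42}',
)
print(len(msgs))  # 1 system + 2 examples x 2 turns + 1 final user turn = 6
```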
<h3>Hallucinations</h3>
<p><em>Takeaway:</em> The model occasionally hallucinates while generating net new content for the summaries.</p>
<p>We observed that the model does not typically <a href="https://arxiv.org/pdf/2110.10819.pdf">hallucinate</a> while summarizing facts that are immediately apparent in the input such as user and host entities, network ports, etc. Occasionally, the model hallucinates while summarizing information that is not obvious, for example, in this case summarizing the overall user intent in the session. Some relatively easy avenues we found to mitigate hallucinations were as follows:</p>
<ul>
<li>Prompt the model to focus on specific behaviors while summarizing</li>
<li>Re-iterate that the model should fact-check its output</li>
<li>Set the <a href="https://learnprompting.org/docs/basics/configuration_hyperparameters">temperature</a> to a low value (less than or equal to 0.2) to get the model to generate less diverse responses, hence reducing the chances of hallucinations</li>
<li>Limit the response length, thus reducing the opportunity for the model to go off-track — this works especially well if the length of the texts to be summarized is more or less constant, which it was in our case</li>
</ul>
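<p>Two of these mitigations map directly onto request parameters; below is a sketch in <code>openai</code> 1.x chat-completions style, with a placeholder deployment name:</p>

```python
# Hypothetical request parameters reflecting the mitigations above:
# a low temperature for less diverse output and a capped response length.
request_params = {
    "model": "gpt-4-32k",   # Azure deployment name (placeholder)
    "temperature": 0.2,     # low temperature reduces diversity
    "max_tokens": 400,      # cap response length to limit drift
}
# These would be passed alongside the messages to
# client.chat.completions.create(**request_params, messages=...)
# on an openai.AzureOpenAI client.
```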
<h3>Parameter tuning</h3>
<p><em>Takeaway:</em> Temperature = 0 does not guarantee determinism.</p>
<p>For summarization, we explored tuning parameters such as <a href="https://txt.cohere.com/llm-parameters-best-outputs-language-ai/">Temperature and Top P</a>, to get deterministic responses from the model. Our observations were as follows:</p>
<ul>
<li>Tuning both together is not recommended, and it's also difficult to observe the effect of each when combined</li>
<li>Solely setting the temperature to a low value (&lt; 0.2) without altering Top P is usually sufficient</li>
<li>Even setting the temperature to 0 does not result in fully deterministic outputs given the inherent non-deterministic nature of floating point calculations (see <a href="https://community.openai.com/t/a-question-on-determinism/8185">this</a> post from OpenAI for a more detailed explanation)</li>
</ul>
<h2>Evaluating GPT Summaries</h2>
<p>As with any modeling task, evaluating the GPT summaries was crucial in gauging the quality and reliability of the model outcomes. In the absence of standardized evaluation approaches and metrics for text generation, we decided to do a qualitative human evaluation of the summaries, as well as a quantitative evaluation using automatic metrics such as <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE-L</a>, <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a>, <a href="https://en.wikipedia.org/wiki/METEOR">METEOR</a>, <a href="https://arxiv.org/abs/1904.09675">BERTScore</a>, and <a href="https://aclanthology.org/2020.eval4nlp-1.2/">BLANC</a>.</p>
<p>For qualitative evaluation, we had a Security Researcher write summaries for a carefully chosen (to get a good distribution of short and long sessions) set of 10 sessions, without any knowledge of the GPT summaries. Three evaluators were asked to compare the GPT summaries against the human-generated summaries using three key criteria:</p>
<ul>
<li>Factuality:  Examine if the model summary retains key facts of the session as provided by Security experts</li>
<li>Authenticity: Check for hallucinations</li>
<li>Consistency: Check the consistency of the model output, i.e., that all responses share a stable format and produce the same level of detail</li>
</ul>
<p>Finally, each of the 10 summaries was assigned a final rating of &quot;Good&quot; or &quot;Bad&quot; based on a majority vote to combine the evaluators' choices.</p>
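<p>The majority-vote step can be sketched as follows; the session IDs and votes are invented for illustration:</p>

```python
# Toy majority vote: each summary gets three evaluator votes, and the most
# common vote becomes the final rating. The votes here are made up.
votes = {
    "session-01": ["Good", "Good", "Bad"],
    "session-02": ["Bad", "Bad", "Good"],
}

final_rating = {sid: max(set(v), key=v.count) for sid, v in votes.items()}
print(final_rating)
```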
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/image2.png" alt="Summarization evaluation matrix" /></p>
<p>While we recognize the small dataset size for evaluation, our qualitative assessment showed that GPT summaries aligned with human summaries 80% of the time. For the GPT summaries that received a &quot;Bad&quot; rating, the summaries didn't retain certain important facts because the aggregated JSON document only kept the top-N terms for certain fields.</p>
<p>The automated metrics didn't seem to match human preferences, nor did they reliably measure summary quality due to the structural differences between human and LLM-generated summaries, especially for reference-based metrics.</p>
<h2>What's next</h2>
<p>We are currently looking into further improving summarization via <a href="https://arxiv.org/pdf/2005.11401.pdf">retrieval augmented generation (RAG)</a>, using tools in the <a href="https://www.elastic.co/es/guide/en/esre/current/index.html">Elastic Search and Relevance Engine (ESRE)</a>. We also experimented with using LLMs to categorize user sessions. Stay tuned for Part 2 of this blog to learn more about those experiments!</p>
<p>In the meantime, we’d love to hear about your experiments with LLMs, ESRE, etc. If you'd like to share what you're doing or run into any issues during the process, please reach out to us on our <a href="https://ela.st/slack">community Slack channel</a> and <a href="https://discuss.elastic.co/c/security">discussion forums</a>. Happy experimenting!</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/photo-edited-01@2x.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Identifying beaconing malware using Elastic]]></title>
            <link>https://www.elastic.co/es/security-labs/identifying-beaconing-malware-using-elastic</link>
            <guid>identifying-beaconing-malware-using-elastic</guid>
            <pubDate>Wed, 01 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog, we walk users through identifying beaconing malware in their environment using our beaconing identification framework.]]></description>
            <content:encoded><![CDATA[<p>The early stages of an intrusion usually include initial access, execution, persistence, and command-and-control (C2) beaconing. When structured threats use zero-days, these first two stages are often not detected. It can often be challenging and time-consuming to identify persistence mechanisms left by an advanced adversary as we saw in the <a href="https://www.elastic.co/es/blog/elastic-security-provides-free-and-open-protections-for-sunburst">2020 SUNBURST supply chain compromise</a>. Could we then have detected SUNBURST in the initial hours or days by finding its C2 beacon?</p>
<p>The appeal of beaconing detection is that it can serve as an early warning system and help discover novel persistence mechanisms in the initial hours or days after execution. This allows defenders to disrupt or evict the threat actor before they can achieve their objectives. So, while we are not quite &quot;left of boom&quot; by detecting C2 beaconing, we can make a big difference in the outcome of the attack by reducing its overall impact.</p>
<p>In this blog, we talk about a beaconing identification framework that we built using Painless and aggregations in the Elastic Stack. The framework can not only help threat hunters and analysts monitor network traffic for beaconing activity, but also provides useful indicators of compromise (IoCs) for them to start an investigation with. If you don’t have an Elastic Cloud cluster but would like to try out our beaconing identification framework, you can start a <a href="https://cloud.elastic.co/registration">free 14-day trial</a> of Elastic Cloud.</p>
<h2><strong>Beaconing — A primer</strong></h2>
<p>An enterprise's defense is only as good as its firewalls, antivirus, endpoint detection and intrusion detection capabilities, and SOC (Security Operations Center) — which consists of analysts, engineers, operators, administrators, etc. who work round the clock to keep the organization secure. Malware, however, enters enterprises in many different ways and uses a variety of techniques to go undetected. An increasingly common method used by adversaries to evade detection is C2 beaconing as a part of their attack chain, given that it allows them to blend into networks like a normal user.</p>
<p>In networking, beaconing is a term used to describe a continuous cadence of communication between two systems. In the context of malware, beaconing is when malware periodically calls out to the attacker's C2 server to get further instructions on tasks to perform on the victim machine. The frequency at which the malware checks in and the methods used for the communications are configured by the attacker. Some of the common protocols used for C2 are HTTP/S, DNS, SSH, and SMTP, as well as common cloud services like Google, Twitter, Dropbox, etc. Using common protocols and services for C2 allows adversaries to masquerade as normal network traffic and hence evade firewalls.</p>
<p>While on the surface beaconing can appear similar to normal network traffic, it has some unique traits with respect to timing and packet size, which can be modeled using standard statistical and signal processing techniques.</p>
<p>Below is an example of a Koadic C2 beacon, which serves the malicious payload using the DLL host process. As you can see, the payload beacons consistently at an interval of 10 minutes, and the source, as well as destination packet sizes, are almost identical.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/1-koadic-beacon.png" alt="Example of a Koadic C2 beacon" /></p>
<p>It might seem like a trivial task to catch C2 beaconing if all beacons were as neatly structured and predictable as the above. All one would have to look for is periodicity and consistency in packet sizes. However, malware these days is not as straightforward. Below is an example of an Emotet beacon:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/2-emotet-beacon.jpg" alt="Example of an Emotet beacon" /></p>
<p>Most sophisticated malware nowadays adds a &quot;jitter&quot; or randomness to the beacon interval, making the signal more difficult to detect. Some malware authors also use longer beacon intervals. The beaconing identification framework we propose accounts for some of these elusive modifications to traditional beaconing behavior.</p>
<h2><strong>Our approach</strong></h2>
<p>We’ve discussed a bit about the why and what — in this section we dig deeper into how we identify beaconing traffic. Before we begin, it is important to note that beaconing is merely a communication characteristic. It is neither good nor evil by definition. While it is true that malware heavily relies on beaconing nowadays, a lot of legitimate software also exhibits beaconing behavior.</p>
<p>While we have made efforts to reduce false positives, this framework should be looked at as a means of beaconing identification that helps reduce the search space for a threat hunt, not as a means of detection. That said, indicators produced by this framework, when combined with other IoCs, can potentially be used to detect malicious activity.</p>
<p>The beacons we are interested in comprise traffic from a single running process on a particular host machine to one or more external IPs. Given that malware can have both short (order of seconds) and long (order of hours or days) check-in intervals, we will restrict our attention to a time window that works reasonably for both and attempt to answer the question: “What is beaconing in my environment right now or recently?” We have also parameterized the inputs to the framework to allow users to configure important settings such as the time window. More on this in upcoming sections.</p>
<p>When dealing with large data sets, such as network data for an enterprise, you need to think carefully about what you can measure, which allows you to scale effectively. Scaling has several facets, but for our purposes, we have the following requirements:</p>
<ol>
<li>Work can be parallelized over different shards of data stored on different machines.</li>
<li>The amount of data that needs to move around to compute what is needed must be kept manageable.</li>
</ol>
<p>Multiple approaches have been suggested for detecting beaconing characteristics, but not all of them satisfy these constraints. For example, a popular choice for detecting beacon timing characteristics is to measure the interval between consecutive events. This proves too inefficient on large datasets, because computing inter-event intervals requires gathering and ordering all of a series' events in one place, so the work cannot be spread across shards.</p>
<p>Driven by the need to scale, we chose to detect beaconing by bucketing the data in the time window to be analyzed. We gather the event count and average bytes sent and received in each bucket. These statistics can be computed in MapReduce fashion and values from different shards can be combined at the coordinating node of an Elasticsearch query.</p>
<p>Furthermore, by controlling the ratio between the bucket and window lengths, the data we pass per running process has predictable memory consumption, which is important for system stability. The whole process is illustrated diagrammatically below:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/3-bucketing-data.jpg" alt="Bucketing data for analysis" /></p>
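<p>A minimal Python sketch of this bucketing step is shown below (illustrative only; the production version runs as a Painless scripted metric inside Elasticsearch, with per-shard partial results combined at the coordinating node):</p>

```python
def bucket_stats(events, window_start, bucket_len, n_buckets):
    """Summarize (timestamp, netflow_bytes) events into per-bucket
    event counts and average bytes, the two statistics the
    beaconing tests are computed from."""
    counts = [0] * n_buckets
    totals = [0.0] * n_buckets
    for ts, nbytes in events:
        i = int((ts - window_start) // bucket_len)
        if 0 <= i < n_buckets:
            counts[i] += 1
            totals[i] += nbytes
    avg_bytes = [t / c if c else 0.0 for t, c in zip(totals, counts)]
    return counts, avg_bytes
```

<p>Because each shard can compute its own partial counts and byte totals, the per-bucket sums from different shards can simply be added together before the averages are taken.</p>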
<p>A key attribute of beaconing traffic is that it often has similar netflow bytes for the majority of its communication. If we average the bytes over all the events that fall in a single bucket, the averages for different buckets will in fact be even more similar. This is just the law of large numbers in action. A good way to measure the similarity of several positive numbers (in our case, the average bucket netflow bytes) is a statistic called the <a href="https://en.wikipedia.org/wiki/Coefficient_of_variation">coefficient of variation</a> (COV). This captures the average relative difference between the values and their mean. Because this is a relative value, a COV closer to 0 implies that values are tightly clustered around their mean.</p>
<p>We also found that occasional spikes in the netflow bytes in some beacons were inflating the COV statistic. In order to rectify this, we simply discarded low and high percentile values when computing the COV, which is a standard technique for creating a robust statistic. We threshold the value of this statistic to be significantly less than one to detect this characteristic of beacons.</p>
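<p>As a sketch, the truncated (robust) COV test can be written as follows; the 10% default truncation here is an assumption for illustration, not the released default:</p>

```python
import statistics

def robust_cov(values, truncate_at=0.1):
    """Coefficient of variation of average bucket bytes, computed
    after discarding the lowest and highest `truncate_at` fraction
    of values so occasional traffic spikes don't inflate it."""
    vals = sorted(values)
    k = int(len(vals) * truncate_at)
    trimmed = vals[k:len(vals) - k] if k else vals
    mean = statistics.mean(trimmed)
    if mean == 0:
        return 0.0
    return statistics.pstdev(trimmed) / mean

# Near-identical payload sizes with one spike: the trimmed COV stays
# close to 0, while the untrimmed COV is inflated by the outlier.
beacon_bytes = [512, 510, 513, 511, 512, 9000, 512, 511, 513, 510]
```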
<p>For periodicity, we observed that signals displayed one of two characteristics when we viewed the bucket counts. If the period was less than the time bucket length (i.e. high frequency beacons), then the count showed little variation from bucket to bucket. If the period was longer than the time bucket length (i.e. low frequency beacons), then the signal had high autocorrelation. Let's discuss these in detail.</p>
<p>To test for high frequency beacons, we use a statistic called <a href="https://en.wikipedia.org/wiki/Index_of_dispersion">relative variance</a> (RV). The rates of many naturally occurring phenomena are well described by a <a href="https://en.wikipedia.org/wiki/Poisson_distribution#Occurrence_and_applications">Poisson distribution</a>. The reason for this is that if events arrive randomly at a constant average rate and the occurrence of one event doesn’t affect the chance of others occurring, then their count in a fixed time interval must be Poisson distributed.</p>
<p>Just to underline this point, the underlying mechanism for the random delay between events doesn’t matter (making a coffee, waiting for your software to build, etc.) — if those properties hold, the rate distribution is always the same. Therefore, we expect the bucket counts to be Poisson distributed for much of the traffic in our network, but not for beacons, which are much more regular. A feature of the Poisson distribution is that its variance is equal to its average, i.e. its RV is 1. Loosely, this means that if the RV of our bucket counts is closer to 0, the signal is more regular than a Poisson process.</p>
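<p>The high-frequency test can then be sketched as comparing the bucket counts' variance-to-mean ratio against 1 (a pure-Python illustration, not the Painless implementation):</p>

```python
import statistics

def relative_variance(bucket_counts):
    """Variance-to-mean ratio of per-bucket event counts: ~1 for
    Poisson-like (random) traffic, near 0 for a regular beacon."""
    mean = statistics.mean(bucket_counts)
    if mean == 0:
        return 0.0
    return statistics.pvariance(bucket_counts) / mean

# A high-frequency beacon firing ~6 times per bucket looks far more
# regular than a Poisson process with the same average rate.
beacon_counts = [6, 6, 5, 6, 6, 6, 5, 6]
```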
<p><a href="https://en.wikipedia.org/wiki/Autocorrelation">Autocorrelation</a> is a useful statistic for understanding when a time series repeats itself. The basic idea behind autocorrelation is to compare the time series values to themselves after shifting them in time. Specifically, it is the covariance between the two sets of values (which is larger when they are more similar), normalized by dividing it by the square root of the variances of the two sets, which measures how much the values vary among themselves.</p>
<p>This process is illustrated schematically below. We apply this to the time series comprising the bucket counts: if the signal is periodic then the time bucketed counts must also repeat themselves. The nice thing about autocorrelation from our perspective is that it is capable of detecting any periodic pattern. For example, the events don’t need to be regularly spaced but might repeat like two events occurring close to one another in time, followed by a long gap and so on.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/4-diagramming-representation.jpg" alt="" /></p>
<p>We don’t know the shift beforehand that will maximize the similarity between the two sets of values, so we search over all shifts for the maximum. This, in effect, is the period of the data — the closer its autocorrelation is to one, the closer the time series is to being truly periodic. We threshold the autocorrelation close to one to test for low frequency beacons.</p>
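<p>A bare-bones version of this search over shifts might look like the following (a sketch only; the released implementation also corrects for jitter, as discussed next):</p>

```python
import statistics

def max_autocorrelation(counts, max_shift=None):
    """Maximum normalized autocorrelation of the bucket counts over
    candidate shifts; values close to 1 indicate a periodic signal,
    and the best shift approximates the beacon period."""
    if max_shift is None:
        max_shift = len(counts) // 2
    best = 0.0
    for shift in range(1, max_shift + 1):
        a, b = counts[:-shift], counts[shift:]
        mean_a, mean_b = statistics.mean(a), statistics.mean(b)
        # Covariance of the series with its shifted self...
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
        var_a, var_b = statistics.pvariance(a), statistics.pvariance(b)
        if var_a > 0 and var_b > 0:
            # ...normalized by the variances of the two slices.
            best = max(best, cov / (var_a * var_b) ** 0.5)
    return best
```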
<p>Finally, we noted that most beaconing malware these days incorporates jitter. How does autocorrelation deal with this? First off, autocorrelation isn’t a binary measure — it is a sliding scale: the closer the value is to 1, the more similar the two sets of values are to one another. Even if they are not identical but merely similar, it can still be close to one. In fact, we can do better than this by modeling how random jitter affects autocorrelation and undoing its effect. Provided the jitter isn’t too large, the process to do this turns out to be about as complex as just finding the maximum autocorrelation.</p>
<p>In our implementation, we’ve made the percentage configurable, although one would always use a small-ish percentage to avoid flagging too much traffic as periodic. If you'd like to dig into the gory details of our implementation, all the artifacts are available as a GitHub <a href="https://github.com/elastic/detection-rules/releases/tag/ML-Beaconing-20211216-1">release</a> in our detection rules repository.</p>
<h2><strong>How do we do this using Elasticsearch?</strong></h2>
<p>Elasticsearch has some very powerful tools for ad hoc data analysis. The <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html">scripted metric aggregation</a> is one of them. The nice thing about this aggregation is that it allows you to write custom Painless scripts to derive different metrics about your data. We used the aggregation to script out the beaconing tests.</p>
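<p>The rough shape of such an aggregation is shown below; the scripts and parameters are placeholders for illustration rather than the actual Painless code, which ships with the linked release:</p>

```python
# Skeleton of a scripted metric aggregation body. The real init/map/
# combine/reduce Painless scripts implement the time bucketing and the
# COV, relative variance, and autocorrelation tests.
beaconing_agg = {
    "scripted_metric": {
        "init_script": "...",     # allocate per-bucket count/byte arrays
        "map_script": "...",      # assign each event to a time bucket
        "combine_script": "...",  # emit per-shard partial sums
        "reduce_script": "...",   # merge shards and run the beaconing tests
        "params": {"number_buckets_in_range": 72, "time_bucket_length": "5m"},
    }
}
```

<p>The parameter values shown (72 five-minute buckets, i.e. a 6h window) are illustrative of how the bucket count and bucket length together determine the analysis window.</p>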
<p>In a typical environment, the cardinality of the distinct processes running across endpoints is rather high. Trying to run an aggregation that partitions by every running process is therefore not feasible. This is where another feature of the Elastic Stack comes in handy. A <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/current/transforms.html">transform</a> is a complex aggregation which paginates through all your data and writes results to a destination index.</p>
<p>There are various basic operations available in transforms, one of them being partitioning data at scale. In our case, we partitioned our network event logs by host and process name and ran our scripted metric aggregation against each host-process name pair. The transform also writes out various beaconing related indicators and statistics. A sample document from the resulting destination index is as follows:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/5-sample-beaconing.jpg" alt="Sample document produced by the beaconing transform" /></p>
<p>As you can see, the document contains valuable beaconing-related information about the process. First off, the beacon_stats.is_beaconing indicator says whether or not we found the process to be beaconing. If it is, as in the case above, the document will also contain important metadata, such as the frequency of the beacon. The indicator beacon_stats.periodic says whether or not the signal is a low-frequency beacon, while the indicator beacon_stats.low_count_variation indicates whether or not it is a high-frequency beacon.</p>
<p>Furthermore, the indicators beacon_stats.low_source_bytes_variation and low_destination_bytes_variation indicate whether or not the source and destination bytes sent during the beaconing communication were more or less uniform. Finally, you will also notice the beaconing_score indicator, which is a value from 1-3, representing the number of beaconing tests satisfied by the process for that time period.</p>
<p>Writing such metadata out to an index also means that you can search for different facets of beaconing software in your environment. For example, if you want to search for low frequency beaconing processes in your environment, you would query for documents where the beacon_stats.periodic indicator is true and beacon_stats.low_count_variation is false. You can also build second order analytics on top of the indexed data, such as using <a href="https://www.elastic.co/es/guide/en/kibana/current/xpack-ml-anomalies.html">anomaly detection</a> to find rare beaconing processes, or using a <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html">significant terms aggregation</a> to detect lateral movement of beaconing malware in your environment.</p>
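<p>For instance, a search for low-frequency beacons might be sketched as follows (field names taken from the sample document above; the query shape is illustrative):</p>

```python
# Find documents flagged as periodic (low-frequency beacons) that did
# not also pass the high-frequency low-count-variation test.
low_frequency_beacons = {
    "query": {
        "bool": {
            "filter": [{"term": {"beacon_stats.periodic": True}}],
            "must_not": [{"term": {"beacon_stats.low_count_variation": True}}],
        }
    }
}
```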
<p>Finally, we’ve included several dashboards for your threat hunters and analysts to use for monitoring beaconing activity in your environment. These can be found in the <a href="https://github.com/elastic/detection-rules/releases/tag/ML-Beaconing-20211216-1">release package</a> as well.</p>
<h2><strong>Tuning parameters and filtering</strong></h2>
<p>Advanced users can also tune important parameters to the scripted metric aggregation in the transforms, like jitter percentage, time window, etc. If you'd like to change the default parameters, all you would need to do is delete the transform, change the parameters, and restart it. The parameters you can tune are as follows:</p>
<ul>
<li>number_buckets_in_range: The number of time buckets we split the time window into. You need enough to ensure you get reasonable estimates for the various statistics, but too many means the transform will use more memory and compute.</li>
<li>time_bucket_length: The length of each time bucket. This controls the time window, so the larger this value the longer the time window. You might set this longer if you want to check for very low frequency beacons.</li>
<li>number_destination_ips: The number of destination IPs to gather in the results. Setting this higher increases the transform resource usage.</li>
<li>max_beaconing_bytes_cov: The maximum coefficient of variation in the payload bytes for the low source and destination bytes variance test. Setting this higher will increase the chance of detecting traffic as beaconing, so would likely increase <a href="https://en.wikipedia.org/wiki/Precision_and_recall">recall</a> for malicious C2 beacons. However, it will also reduce the <a href="https://en.wikipedia.org/wiki/Precision_and_recall">precision</a> of the test.</li>
<li>max_beaconing_count_rv: The maximum relative variance in the bucket counts for the high frequency beacon test. As with max_beaconing_bytes_cov, we suggest tuning this parameter based on the kind of tradeoff you want between precision and recall.</li>
<li>truncate_at: The lower and upper fraction of bucket values discarded when computing max_beaconing_bytes_cov and max_beaconing_count_rv. This allows you to ignore occasional changes in traffic patterns. However, if you retain too small a fraction of the data, these tests will be unreliable.</li>
<li>min_beaconing_count_autocovariance: The minimum autocorrelation of the signal for the low frequency beacon test. Lowering this value will likely result in an increase in recall for malicious C2 beacons, at the cost of reduced test precision. As with some of the other parameters mentioned above, we suggest tuning this parameter based on the kind of tradeoff you want between precision and recall.</li>
<li>max_jitter: The maximum amount by which we assume that a periodic beacon is jittered, as a fraction of its period.</li>
</ul>
<p>You can also make changes to the transform query. We currently look for beaconing activity over a 6h time range, but you can change this to a different time range. As mentioned previously, beaconing is not a characteristic specific to malware and a lot of legitimate, benign processes also exhibit beaconing-like activity.</p>
<p>In order to curb the false positive rate, we have included a starter list of filters in the transform query to exclude known benign beaconing processes that we observed during testing, and a list of IPs that fall into two categories:</p>
<ol>
<li>The source IP is local and the destination is remote</li>
<li>For certain Microsoft processes, the destination IP is in a Microsoft-owned IP block</li>
</ol>
<p>You can add to this list based on what you see in your environment.</p>
<h2><strong>Evaluation</strong></h2>
<p>In order to measure the effectiveness of our framework as a reduced search space for beaconing activity, we wanted to test two aspects:</p>
<ol>
<li>Does the framework flag actual malicious beaconing activity?</li>
<li>By how much does the framework reduce the search space for malicious beacons?</li>
</ol>
<p>In order to test the performance on malware beacons, we ran the transform on some synthetic data as well as some real malware! We set up test ranges for Emotet and Koadic, and also tested it on NOBELIUM logs we had from several months ago. The results from the real malware tests are worth mentioning here.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/6-beaconing-metadata.jpg" alt="Beaconing metadata for NOBELIUM" /></p>
<p>For NOBELIUM, the beaconing transform catches the offending process, rundll32.exe, as well as the two destination IPs, 192.99.221.77 and 83.171.237.173, which were among the main IoCs for NOBELIUM.</p>
<p>For Koadic and Emotet as well, the transform was able to flag the process as well as the known destination IPs on which the test C2 listeners were running. The characteristics of each of the beacons were different. For example, Koadic was a straightforward, high-frequency beacon that satisfied all the beaconing criteria checked in the transform, i.e. periodicity as well as low variation of source and destination bytes. Emotet was slightly trickier since it was a low-frequency beacon with a high jitter percentage, but we were able to detect it due to the low variation in the source bytes of the beacon.</p>
<p>To test the amount of reduction in search space, we ran the transform over three weeks on an internal cluster that was receiving network event logs from ~2k hosts during the testing period. We measured the reduction in search space based on the number of network event log messages, processes, and hosts an analyst or threat hunter would have to sift through before and after running the transform in order to identify malicious beacons. The numbers are as follows:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/7-search-space-reduction.jpg" alt="Search space reduction metrics as a result of the beaconing transform" /></p>
<p>While the reduction in search space is obvious, another point to note is the scale of data that the transforms are able to churn through comfortably, which becomes an important aspect to consider, especially in production environments. Additionally, we have also released dashboards (available in the <a href="https://github.com/elastic/detection-rules/releases/tag/ML-Beaconing-20211216-1">release package</a>), which track metrics like prevalence of the beaconing processes, etc. that can help make informed decisions about further filtering of the search space.</p>
<p>The released dashboards and the statistics in the table above are based on cases where the beacon_stats.is_beaconing indicator is true, i.e. beacons that satisfy at least one of the beaconing tests. However, threat hunters may want to further streamline their search by starting with the most obvious beaconing-like cases and then moving on to the less obvious ones. This can be done by filtering and searching on the beacon_stats.beaconing_score indicator instead of beacon_stats.is_beaconing, where a score of 3 indicates a typical beacon (satisfying the tests for periodicity as well as low variation in packet bytes) and a score of 1 indicates a less obvious beacon (satisfying only one of the three tests).</p>
<p>For reference, we observed the following on our internal cluster:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/Screen_Shot_2022-01-06_at_4.36.40_PM.jpg" alt="Streamlining your threat hunt using the Beaconing Score indicator" /></p>
<h2>What's next</h2>
<p>We’d love for you to try out our beaconing identification framework and give us feedback as we work on improving it. If you run into any issues during the process, please reach out to us on our <a href="https://ela.st/slack">community Slack channel</a>, <a href="https://discuss.elastic.co/c/security">discussion forums</a>, or even our <a href="https://github.com/elastic/detection-rules">open detections repository</a>. Stay tuned for Part 2 of this blog, where we’ll cover going from identifying beaconing activity to actually detecting malicious beacons!</p>
<p>Try out our beaconing identification framework with a <a href="https://cloud.elastic.co/registration">free 14-day trial</a> of Elastic Cloud.</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/identifying-beaconing-malware-using-elastic/blog-thumbnail-securitymaze.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Getting the Most Out of Transformers in Elastic]]></title>
            <link>https://www.elastic.co/es/security-labs/getting-the-most-out-of-transforms-in-elastic</link>
            <guid>getting-the-most-out-of-transforms-in-elastic</guid>
            <pubDate>Tue, 23 Aug 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog, we will briefly talk about how we fine-tuned a transformer model meant for a masked language modeling (MLM) task, to make it suitable for a classification task.]]></description>
            <content:encoded><![CDATA[<h2>Preamble</h2>
<p>In 8.3, our Elastic Stack Machine Learning team introduced a way to import <a href="https://www.elastic.co/es/guide/en/machine-learning/master/ml-nlp-model-ref.html">third party Natural Language Processing (NLP) models</a> into Elastic. As security researchers, we HAD to try it out on a security dataset. So we decided to build a model to identify malicious command lines by fine-tuning a pre-existing model available on the <a href="https://huggingface.co/models">Hugging Face model hub</a>.</p>
<p>Upon finding that the fine-tuned model was performing (surprisingly!) well, we wanted to see if it could replace or be combined with our previous <a href="https://www.elastic.co/es/blog/problemchild-detecting-living-off-the-land-attacks">tree-based model</a> for detecting Living off the Land (LotL) attacks. But first, we had to make sure that the throughput and latency of this new model were reasonable enough for real-time inference. This resulted in a series of experiments, the results of which we will detail in this blog.</p>
<p>In this blog, we will briefly talk about how we fine-tuned a transformer model meant for a masked language modeling (MLM) task, to make it suitable for a classification task. We will also look at how to import custom models into Elastic. Finally, we’ll dive into all the experiments we did around using the fine-tuned model for real-time inference.</p>
<h2>NLP for command line classification</h2>
<p>Before you start building NLP models, it is important to understand whether an <a href="https://www.ibm.com/cloud/learn/natural-language-processing">NLP</a> model is even suitable for the task at hand. In our case, we wanted to classify command lines as being malicious or benign. Command lines are a set of commands provided by a user via the computer terminal. An example command line is as follows:</p>
<pre><code>**move test.txt C:\**
</code></pre>
<p>The above command moves the file <strong>test.txt</strong> to the root of the <strong>C:\</strong> drive.</p>
<p>Arguments in command lines are related in the way that the co-occurrence of certain values can be indicative of malicious activity. NLP models are worth exploring here since these models are designed to understand and interpret relationships in natural (human) language, and since command lines often use some natural language.</p>
<h2>Fine-tuning a Hugging Face model</h2>
<p>Hugging Face is a data science platform that provides tools for machine learning (ML) enthusiasts to build, train, and deploy ML models using open source code and technologies. Its model hub has a wealth of models, trained for a variety of NLP tasks. You can either use these pre-trained models as-is to make predictions on your data, or fine-tune the models on datasets specific to your <a href="https://www.ibm.com/cloud/learn/natural-language-processing">NLP</a> tasks.</p>
<p>The first step in fine-tuning is to instantiate a model with the model configuration and pre-trained weights of a specific model. Random weights are assigned to any task-specific layers that might not be present in the base model. Once initialized, the model can be trained to learn the weights of the task-specific layers, thus fine-tuning it for your task. Hugging Face has a method called <a href="https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/model#transformers.PreTrainedModel.from_pretrained">from_pretrained</a> that allows you to instantiate a model from a pre-trained model configuration.</p>
<p>For our command line classification model, we created a <a href="https://huggingface.co/docs/transformers/model_doc/roberta">RoBERTa</a> model instance with encoder weights copied from the <a href="https://huggingface.co/roberta-base">roberta-base</a> model, and a randomly initialized sequence classification head on top of the encoder:</p>
<pre><code>model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
</code></pre>
<p>Hugging Face comes equipped with a <a href="https://huggingface.co/docs/transformers/v4.21.0/en/main_classes/tokenizer">Tokenizers</a> library consisting of some of today's most used tokenizers. For our model, we used the <a href="https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer">RobertaTokenizer</a> which uses <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">Byte Pair Encoding</a> (BPE) to create tokens. This tokenization scheme is well-suited for data belonging to a different domain (command lines) from that of the tokenization corpus (English text). A code snippet of how we tokenized our dataset using <strong>RobertaTokenizer</strong> can be found <a href="https://gist.github.com/ajosh0504/4560af91adb48212402300677cb65d4a#file-tokenize-py">here</a>. We then used Hugging Face's <a href="https://huggingface.co/docs/transformers/v4.21.0/en/main_classes/trainer#transformers.Trainer">Trainer</a> API to train the model, a code snippet of which can be found <a href="https://gist.github.com/ajosh0504/4560af91adb48212402300677cb65d4a#file-train-py">here</a>.</p>
<p>ML models do not understand raw text. Before using text data as inputs to a model, it needs to be converted into numbers. Tokenizers group large pieces of text into smaller, semantically useful units, such as (but not limited to) words, characters, or subwords — called tokens — which can, in turn, be converted into numbers using different encoding techniques.</p>
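<p>To make the BPE idea concrete, here is a toy merge step in plain Python (purely illustrative; the real <strong>RobertaTokenizer</strong> uses a merge table learned from a large corpus and operates on bytes):</p>

```python
from collections import Counter

def bpe_merge_step(symbols):
    """One toy Byte Pair Encoding step: find the most frequent
    adjacent pair of symbols and merge it everywhere it occurs."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return symbols, None
    (a, b), _count = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
            merged.append(a + b)  # replace the pair with one new symbol
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged, a + b
```

<p>Repeating this step builds up a vocabulary of subword units, which is why BPE copes well with out-of-domain strings like command lines.</p>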
<blockquote>
<ul>
<li>Check out <a href="https://youtu.be/_BZearw7f0w">this</a> video (2:57 onwards) to review additional pre-processing steps that might be needed after tokenization based on your dataset.</li>
<li>A complete tutorial on how to fine-tune pre-trained Hugging Face models can be found <a href="https://huggingface.co/docs/transformers/training">here</a>.</li>
</ul>
</blockquote>
<h2>Importing custom models into Elastic</h2>
<p>Once you have a trained model that you are happy with, it's time to import it into Elastic. This is done using <a href="https://www.elastic.co/es/guide/en/elasticsearch/client/eland/current/machine-learning.html">Eland</a>, a Python client and toolkit for machine learning in Elasticsearch. A code snippet of how we imported our model into Elastic using Eland can be found <a href="https://gist.github.com/ajosh0504/4560af91adb48212402300677cb65d4a#file-import-py">here</a>.<br />
You can verify that the model has been imported successfully by navigating to <strong>Model Management &gt; Trained Models</strong> via the Machine Learning UI in Kibana:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/getting-the-most-out-of-transforms-in-elastic/Imported_model_in_the_Trained_Models_UI.png" alt="Imported model in the Trained Models UI" /></p>
<h2>Using the Transformer model for inference — a series of experiments</h2>
<p>We ran a series of experiments to evaluate whether or not our Transformer model could be used for real-time inference. For the experiments, we used a dataset consisting of ~66k command lines.</p>
<p>Our first inference run with our fine-tuned <strong>RoBERTa</strong> model took ~4 hours on the test dataset. Out of the box, this is much slower than the tree-based model we were trying to beat, which took ~3 minutes for the entire dataset. It was clear that we needed to improve the throughput and latency of the PyTorch model to make it suitable for real-time inference, so we performed several experiments:</p>
<h3>Using multiple nodes and threads</h3>
<p>The latency numbers above were observed when the models were running on a single thread on a single node. If you have multiple Machine Learning (ML) nodes associated with your Elastic deployment, you can run inference on multiple nodes, and also on multiple threads on each node. This can significantly improve the throughput and latency of your models.</p>
<p>You can change these parameters while starting the trained model deployment via the <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/master/start-trained-model-deployment.html">API</a>:</p>
<pre><code>**POST \_ml/trained\_models/\\&lt;model\_id\\&gt;/deployment/\_start?number\_of\_allocations=2&amp;threa ds\_per\_allocation=4**
</code></pre>
<p><strong>number_of_allocations</strong> allows you to set the total number of allocations of a model across machine learning nodes and can be used to tune model throughput. <strong>threads_per_allocation</strong> allows you to set the number of threads used by each model allocation during inference and can be used to tune model latency. Refer to the <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/master/start-trained-model-deployment.html">API documentation</a> for best practices around setting these parameters.</p>
<p>In our case, we set <strong>number_of_allocations</strong> to <strong>2</strong>, as our cluster had two ML nodes, and <strong>threads_per_allocation</strong> to <strong>4</strong>, as each node had four allocated processors.</p>
<p>Running inference using these settings <strong>resulted in a 2.7x speedup</strong> on the original inference time.</p>
<h3>Dynamic quantization</h3>
<p>Quantizing is one of the most effective ways of improving model compute cost, while also reducing model size. The idea here is to use a reduced precision integer representation for the weights and/or activations. While there are a number of ways to trade off model accuracy for increased throughput during model development, <a href="https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html">dynamic quantization</a> helps achieve a similar trade-off after the fact, thus saving on time and resources spent on iterating over the model training.</p>
<p>Eland provides a way to dynamically quantize your model before importing it into Elastic. To do this, simply pass in quantize=True as an argument while creating the TransformerModel object (refer to the code snippet for importing models) as follows:</p>
<pre><code>**# Load the custom model**
**tm = TransformerModel(&quot;model&quot;, &quot;text\_classification&quot;, quantize=True)**
</code></pre>
<p>In the case of our command line classification model, we observed the model size drop from 499 MB to 242 MB upon dynamic quantization. Running inference on our test dataset using this model <strong>resulted in a 1.6x speedup</strong> on the original inference time, for a slight drop in model <a href="https://en.wikipedia.org/wiki/Sensitivity_and_specificity"><strong>sensitivity</strong></a> (exact numbers in the following section).</p>
<h3>Knowledge Distillation</h3>
<p><a href="https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764">Knowledge Distillation</a> is a way to achieve model compression by transferring knowledge from a large (teacher) model to a smaller (student) one while maintaining validity. At a high level, this is done by using the outputs from the teacher model at every layer, to backpropagate error through the student model. This way, the student model learns to replicate the behavior of the teacher model. Model compression is achieved by reducing the number of parameters, which is directly related to the latency of the model.</p>
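<p>We did not publish our training code, but a typical distillation objective blends a soft-target loss against the teacher's softened logits with ordinary cross-entropy on the true labels. A minimal sketch, with the temperature and weighting values chosen purely for illustration:</p>
<pre><code>import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: the student matches the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction=&quot;batchmean&quot;,
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

torch.manual_seed(0)
student = torch.randn(8, 2)          # logits for 8 samples, 2 classes
teacher = torch.randn(8, 2)          # teacher logits for the same samples
labels = torch.randint(0, 2, (8,))   # ground-truth labels
loss = distillation_loss(student, teacher, labels)
</code></pre>
<p>The temperature softens both distributions so the student also learns from the teacher's relative confidence across classes, not just its top prediction.</p>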
<p>To study the effect of knowledge distillation on the performance of our model, we fine-tuned a <a href="https://huggingface.co/distilroberta-base">distilroberta-base</a> model (following the same procedure described in the fine-tuning section) for our command line classification task and imported it into Elastic. <strong>distilroberta-base</strong> has 82 million parameters, compared to its teacher model, <strong>roberta-base</strong>, which has 125 million parameters. The model size of the fine-tuned <strong>DistilRoBERTa</strong> model turned out to be <strong>329</strong> MB, down from <strong>499</strong> MB for the <strong>RoBERTa</strong> model.</p>
<p>Upon running inference with this model, we <strong>observed a 1.5x speedup</strong> on the original inference time and slightly better model sensitivity (exact numbers in the following section) than the fine-tuned roberta-base model.</p>
<h3>Dynamic quantization and knowledge distillation</h3>
<p>We observed that dynamic quantization and model distillation both resulted in significant speedups on the original inference time. So, our final experiment involved running inference with a quantized version of the fine-tuned <strong>DistilRoBERTa</strong> model.</p>
<p>We found that this <strong>resulted in a 2.6x speedup</strong> on the original inference time, and slightly better model sensitivity (exact numbers in the following section). We also observed the model size drop from <strong>329</strong> MB to <strong>199</strong> MB after quantization.</p>
<h2>Bringing it all together</h2>
<p>Based on our experiments, dynamic quantization and model distillation resulted in significant inference speedups. Combining these improvements with distributed and parallel computing, we were further able to <strong>reduce the total inference time on our test set from four hours to 35 minutes</strong>. However, even our fastest transformer model was still several orders of magnitude slower than the tree-based model, despite using significantly more CPU resources.</p>
<p>The Machine Learning team here at Elastic is introducing an inference caching mechanism in version 8.4 of the Elastic Stack, to save time spent on performing inference on repeat samples. These are a common occurrence in real-world environments, especially when it comes to Security. With this optimization in place, we are optimistic that we will be able to use transformer models alongside tree-based models in the future.</p>
<p>A comparison of the sensitivity (true positive rate) and specificity (true negative rate) of our tree-based and transformer models shows that an ensemble of the two could potentially result in a more performant model:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Sensitivity (%)</th>
<th>False Negative Rate (%)</th>
<th>Specificity (%)</th>
<th>False Positive Rate (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Tree-based</td>
<td>99.53</td>
<td>0.47</td>
<td>99.99</td>
<td>0.01</td>
</tr>
<tr>
<td>RoBERTa</td>
<td>99.57</td>
<td>0.43</td>
<td>97.76</td>
<td>2.24</td>
</tr>
<tr>
<td>RoBERTa quantized</td>
<td>99.56</td>
<td>0.44</td>
<td>97.64</td>
<td>2.36</td>
</tr>
<tr>
<td>DistilRoBERTa</td>
<td>99.68</td>
<td>0.32</td>
<td>98.66</td>
<td>1.34</td>
</tr>
<tr>
<td>DistilRoBERTa quantized</td>
<td>99.69</td>
<td>0.31</td>
<td>98.71</td>
<td>1.29</td>
</tr>
</tbody>
</table>
<p>As seen above, the tree-based model is better suited for classifying benign data while the transformer model does better on malicious samples, so a weighted average or voting ensemble could work well to reduce the total error by averaging the predictions from both models.</p>
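<p>As a sketch, a weighted-average ensemble of this kind only needs each model's per-sample malicious-class probability. The weights and probabilities below are illustrative, not drawn from our dataset:</p>
<pre><code>import numpy as np

def weighted_ensemble(tree_probs, transformer_probs, w_tree=0.5):
    # Weighted average of the malicious-class probabilities from both models.
    return w_tree * tree_probs + (1 - w_tree) * transformer_probs

# Illustrative per-sample malicious-class probabilities.
tree_probs = np.array([0.10, 0.90, 0.40])
transformer_probs = np.array([0.20, 0.95, 0.70])

combined = weighted_ensemble(tree_probs, transformer_probs, w_tree=0.6)
preds = (combined &gt;= 0.5).astype(int)
print(combined)  # [0.14 0.92 0.52]
print(preds)     # [0 1 1]
</code></pre>
<p>In practice, the weight could be tuned on a validation set, leaning toward the tree-based model for specificity and toward the transformer for sensitivity.</p>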
<h2>What's next</h2>
<p>We plan to cover our findings from inference caching and model ensembling in a follow-up blog. Stay tuned!</p>
<p>In the meantime, we’d love to hear about models you're building for inference in Elastic. If you'd like to share what you're doing or run into any issues during the process, please reach out to us on our <a href="https://ela.st/slack">community Slack channel</a> and <a href="https://discuss.elastic.co/c/security">discussion forums</a>. Happy experimenting!</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/getting-the-most-out-of-transforms-in-elastic/machine-learning-1200x628px-2021-notext.jpg" length="0" type="image/jpg"/>
        </item>
    </channel>
</rss>