<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Security Labs - Articles by Susan Chang</title>
        <link>https://www.elastic.co/es/security-labs</link>
        <description>Trusted security news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Thu, 05 Mar 2026 22:21:01 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Security Labs - Articles by Susan Chang</title>
            <url>https://www.elastic.co/es/security-labs/assets/security-labs-thumbnail.png</url>
            <link>https://www.elastic.co/es/security-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Elastic Advances LLM Security with Standardized Fields and Integrations]]></title>
            <link>https://www.elastic.co/es/security-labs/elastic-advances-llm-security</link>
            <guid>elastic-advances-llm-security</guid>
            <pubDate>Mon, 06 May 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover Elastic’s latest advancements in LLM security, focusing on standardized field integrations and enhanced detection capabilities. Learn how adopting these standards can safeguard your systems.]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>Last week, security researcher Mika Ayenson <a href="https://www.elastic.co/es/security-labs/embedding-security-in-llm-workflows">authored a publication</a> highlighting potential detection strategies and an LLM content auditing prototype solution via a proxy implemented during Elastic’s OnWeek event series. This post highlighted the importance of research pertaining to the safety of LLM technology implemented in different environments, and the research focus we’ve taken at Elastic Security Labs.</p>
<p>Given Elastic's unique vantage point leveraging LLM technology in our platform to power capabilities such as the Security <a href="https://www.elastic.co/es/guide/en/security/current/security-assistant.html">AI Assistant</a>, our desire for more formal detection rules, integrations, and research content has been growing. This publication highlights some of the recent advancements we’ve made in LLM integrations, our thoughts around detections aligned with industry standards, and ECS field mappings.</p>
<p>We are committed to a comprehensive security strategy that protects not just the direct user-based LLM interactions but also the broader ecosystem surrounding them. This approach involves layers of security detection engineering opportunities to address not only the LLM requests/responses but also the underlying systems and integrations used by the models.</p>
<p>These detection opportunities collectively help to secure the LLM ecosystem and can be broadly grouped into five categories:</p>
<ol>
<li><strong>Prompt and Response</strong>: Detection mechanisms designed to identify and mitigate threats based on the growing variety of LLM interactions to ensure that all communications are securely audited.</li>
<li><strong>Infrastructure and Platform</strong>: Implementing detections to protect the infrastructure hosting LLMs (including wearable AI Pin devices), covering threats against stored data, processing activities, and server communication.</li>
<li><strong>API and Integrations</strong>: Detecting threats when interacting with LLM APIs and protecting integrations with other applications that ingest model output.</li>
<li><strong>Operational Processes and Data</strong>: Monitoring operational processes (including in AI agents) and data flows while protecting data throughout its lifecycle.</li>
<li><strong>Compliance and Ethical</strong>: Aligning detection strategies with well-adopted industry regulations and ethical standards.</li>
</ol>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/elastic-advances-llm-security/image4.png" alt="Securing the LLM Ecosystem: five categories" />
Securing the LLM Ecosystem: five categories</p>
<p>Another important consideration for these categories is who can best address, and who is responsible for, each category of risk pertaining to LLM systems.</p>
<p>Similar to existing <a href="https://www.cisecurity.org/insights/blog/shared-responsibility-cloud-security-what-you-need-to-know">Shared Security Responsibility</a> models, Elastic has assessed four broad categories, which will eventually be expanded upon further as we continue our research into detection engineering strategies and integrations. Broadly, this publication considers security protections that involve the following responsibility owners:</p>
<ul>
<li><strong>LLM Creators</strong>: Organizations who are building, designing, hosting, and training LLMs, such as OpenAI, Amazon Web Services, or Google</li>
<li><strong>LLM Integrators</strong>: Organizations and individuals who integrate existing LLM technologies produced by LLM Creators into other applications</li>
<li><strong>LLM Maintainers</strong>: Individuals who monitor operational LLMs for performance, reliability, security, and integrity use-cases and remain directly involved in the maintenance of the codebase, infrastructure, and software architecture</li>
<li><strong>Security Users</strong>: People who are actively looking for vulnerabilities in systems through traditional testing mechanisms and means. This may expand beyond the traditional risks discussed in <a href="https://llmtop10.com/">OWASP’s LLM Top 10</a> into risks associated with software and infrastructure surrounding these systems</li>
</ul>
<p>This broader perspective showcases a unified approach to LLM detection engineering that begins with ingesting data using native Elastic <a href="https://www.elastic.co/es/integrations">integrations</a>; in this example, we highlight the AWS Bedrock Model Invocation use case.</p>
<h2>Integrating LLM logs into Elastic</h2>
<p>Elastic integrations simplify data ingestion into Elastic from various sources, ultimately enhancing our security solution. These integrations are managed through Fleet in Kibana, allowing users to easily deploy and manage data within the Elastic Agent. Users can quickly adapt Elastic to new data sources by selecting and configuring integrations through Fleet. For more details, see Elastic’s <a href="https://www.elastic.co/es/blog/elastic-agent-and-fleet-make-it-easier-to-integrate-your-systems-with-elastic">blog</a> on making it easier to integrate your systems with Elastic.</p>
<p>The initial OnWeek work undertaken by the team involved a simple proxy solution that extracted fields from interactions with the Elastic Security AI Assistant. This prototype was deployed alongside the Elastic Stack and consumed data from a vendor solution that lacked security auditing capabilities. While this initial implementation proved conceptually interesting, it prompted the team to invest time in assessing existing Elastic integrations from one of our cloud provider partners, <a href="https://docs.elastic.co/integrations/aws">Amazon Web Services</a>. This approach ensures streamlined accessibility for our users, offering seamless, one-click integrations for data ingestion. All ingest pipelines conform to ECS/OTel normalization standards, encompassing comprehensive content, including dashboards, within a unified package. Furthermore, this strategy positions us to leverage additional existing integrations, such as Azure and GCP, for future LLM-focused integrations.</p>
<h3>Vendor selection and API capabilities</h3>
<p>When selecting which LLM providers to create integrations for, we looked at the types of fields we need to ingest for our security use cases. For the starting set of rules detailed here, we needed information such as timestamps and token counts; we found that vendors such as Azure OpenAI provided content moderation filtering on the prompts and generated content. LangSmith (part of the LangChain tooling) was also a top contender, as the data contains the type of vendor used (e.g., OpenAI, Bedrock, etc.) and all the respective metadata. However, this required that the user also have LangSmith set up. For this implementation, we decided to go with first-party supported logs from a vendor that provides LLMs.</p>
<p>As we went deeper into potential integrations, we decided to land with AWS Bedrock, for a few specific reasons. Firstly, Bedrock logging has <a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-invocation-logging.html">first-party support</a> to Amazon CloudWatch Logs and Amazon S3. Secondly, the logging is built specifically for model invocation, including data specific to LLMs (as opposed to other operations and machine learning models), including prompts and responses, and guardrail/content filtering. Thirdly, Elastic already has a <a href="https://www.elastic.co/es/integrations/data-integrations?solution=all-solutions&amp;category=aws">robust catalog</a> of integrations with AWS, so we were able to quickly create a new integration for AWS Bedrock model invocation logs specifically. The next section will dive into this new integration, which you can use to capture your Bedrock model invocation logs in the Elastic stack.</p>
<h3>Elastic AWS Bedrock model integration</h3>
<h4>Overview</h4>
<p>The new Elastic <a href="https://docs.elastic.co/integrations/aws_bedrock">AWS Bedrock</a> integration for model invocation logs provides a way to quickly collect and analyze data from AWS services, focusing specifically on model invocation. This integration provides two primary methods for log collection: Amazon S3 buckets and Amazon CloudWatch. Each method is optimized to offer robust data retrieval capabilities while considering cost-effectiveness and performance efficiency. We use these collected LLM-specific fields for detection engineering purposes.</p>
<p>Note: While this integration does not cover every proposed field, it does standardize existing AWS Bedrock fields into the gen_ai category. This approach makes it easier to maintain detection rules across various data sources, minimizing the need for separate rules for each LLM vendor.</p>
<h3>Configuring integration data collection method</h3>
<h4>Collecting logs from S3 buckets</h4>
<p>This integration allows for efficient log collection from S3 buckets using two distinct methods:</p>
<ul>
<li><strong>SQS Notification</strong>: This is the preferred collection method. It involves reading S3 notification events from an AWS Simple Queue Service (SQS) queue. This method is less costly and provides better performance compared to direct polling.</li>
<li><strong>Direct S3 Bucket Polling</strong>: This method directly polls a list of S3 objects within an S3 bucket and is recommended only when SQS notifications cannot be configured. This approach is more resource-intensive, but it provides an alternative when SQS is not feasible.</li>
</ul>
<h4>Collecting logs from CloudWatch</h4>
<p>Logs can also be collected directly from CloudWatch, where the integration taps into all log streams within a specified log group using the filterLogEvents AWS API. This method is an alternative to using S3 buckets altogether.</p>
<h4>Integration installation</h4>
<p>The integration can be set up within the Elastic Agent by following normal Elastic <a href="https://www.elastic.co/es/guide/en/fleet/current/add-integration-to-policy.html">installation steps</a>.</p>
<ol>
<li>Navigate to the AWS Bedrock integration</li>
<li>Configure the <code>queue_url</code> for SQS or <code>bucket_arn</code> for direct S3 polling.</li>
</ol>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/elastic-advances-llm-security/image2.png" alt="New AWS Bedrock Elastic Integration" /></p>
<h3>Configuring Bedrock Guardrails</h3>
<p>AWS Bedrock <a href="https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html">Guardrails</a> enable organizations to enforce security by setting policies that limit harmful or undesirable content in LLM interactions. These guardrails can be customized to include denied topics to block specific subjects and content filters to moderate the severity of content in prompts and responses. Additionally, word and sensitive information filters block profanity and mask personally identifiable information (PII), ensuring interactions comply with privacy and ethical standards. This feature helps control the content generated and consumed by LLMs and, ideally, reduces the risk associated with malicious prompts.</p>
<p>Note: other guardrail examples include Azure OpenAI’s <a href="https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython-new">content and response</a> filters, which we aim to capture in our proposed LLM standardized fields for vendor-agnostic logging.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/elastic-advances-llm-security/image1.png" alt="AWS Bedrock Guardrails" /></p>
<p>When LLM interaction content triggers these filters, the response objects are populated with <code>amazon-bedrock-trace</code> and <code>amazon-bedrock-guardrailAction</code> fields, providing details about the Guardrails outcome, and nested fields indicating whether the input matched the content filter. This response object enrichment with detailed filter outcomes improves the overall data quality, which becomes particularly effective when these nested fields are aligned with ECS mappings.</p>
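<p>For illustration, an abridged invocation log entry reflecting a Guardrails intervention might look like the following. This sketch is hypothetical: field names and values are representative of the behavior described above, not an authoritative Bedrock log schema.</p>
<pre><code>{
  &quot;timestamp&quot;: &quot;2024-05-06T12:00:00Z&quot;,
  &quot;accountId&quot;: &quot;123456789012&quot;,
  &quot;modelId&quot;: &quot;anthropic.claude-v2&quot;,
  &quot;output&quot;: {
    &quot;amazon-bedrock-guardrailAction&quot;: &quot;INTERVENED&quot;,
    &quot;amazon-bedrock-trace&quot;: {
      &quot;guardrail&quot;: {
        &quot;input&quot;: {
          &quot;contentPolicy&quot;: {
            &quot;filters&quot;: [
              { &quot;type&quot;: &quot;PROMPT_ATTACK&quot;, &quot;confidence&quot;: &quot;HIGH&quot;, &quot;action&quot;: &quot;BLOCKED&quot; }
            ]
          }
        }
      }
    }
  }
}
</code></pre>
<p>The nested trace object is what makes the downstream ECS mapping valuable: once these nested outcomes are normalized, a single detection rule can reason over interventions regardless of which filter fired.</p>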
<h3>The importance of ECS mappings</h3>
<p>Field mapping is a critical part of the process for integration development, primarily to improve our ability to write broadly scoped and widely compatible detection rules. By standardizing how data is ingested and analyzed, organizations can more effectively detect, investigate, and respond to potential threats or anomalies in logs ingested into Elastic, and in this specific case, LLM logs.</p>
<p>Our initial mapping begins by investigating fields provided by the vendor and existing gaps, leading to the establishment of a comprehensive schema tailored to the nuances of LLM operations. We then reconciled the fields to align with our OpenTelemetry <a href="https://github.com/open-telemetry/semantic-conventions/blob/main/docs/gen-ai/llm-spans.md">semantic conventions</a>. These mappings shown in the table cover various aspects:</p>
<ul>
<li><strong>General LLM Interaction Fields</strong>: These include basic but critical information such as the content of requests and responses, token counts, timestamps, and user identifiers, which are foundational for understanding the context and scope of interactions.</li>
<li><strong>Text Quality and Relevance Metric Fields</strong>: Fields measuring text readability, complexity, and similarity scores help assess the quality and relevance of model outputs, ensuring that responses are not only accurate but also user-appropriate.</li>
<li><strong>Security Metric Fields</strong>: This class of metrics is important for identifying and quantifying potential security risks, including regex pattern matches and scores related to jailbreak attempts, prompt injections, and other security concerns such as hallucination consistency and refusal responses.</li>
<li><strong>Policy Enforcement Fields</strong>: These fields capture details about specific policy enforcement actions taken during interactions, such as blocking or modifying content, and provide insights into the confidence levels of these actions, enhancing security and compliance measures.</li>
<li><strong>Threat Analysis Fields</strong>: Focused on identifying and quantifying potential threats, these fields provide a detailed analysis of risk scores, types of detected threats, and the measures taken to mitigate these threats.</li>
<li><strong>Compliance Fields</strong>: These fields help ensure that interactions comply with various regulatory standards, detailing any compliance violations detected and the specific rules that were triggered during the interaction.</li>
<li><strong>OWASP Top Ten Specific Fields</strong>: These fields map directly to the OWASP Top 10 risks for LLM applications, helping to align security measures with recognized industry standards.</li>
<li><strong>Sentiment and Toxicity Analysis Fields</strong>: These analyses are essential to gauge the tone and detect any harmful content in the response, ensuring that outputs align with ethical guidelines and standards. This includes sentiment scores, toxicity levels, and identification of inappropriate or sensitive content.</li>
<li><strong>Performance Metric Fields</strong>: These fields measure the performance aspects of LLM interactions, including response times and sizes of requests and responses, which are critical for optimizing system performance and ensuring efficient operations.</li>
</ul>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/elastic-advances-llm-security/image5.png" alt="General, quality, security, policy, and threat analysis fields" /></p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/elastic-advances-llm-security/image6.png" alt="Compliance, OWASP top 10, security tools analysis, sentiment and toxicity analysis, and performance fields" /></p>
<p>Note: See the <a href="https://gist.github.com/Mikaayenson/cf03f6d3998e16834c1274f007f2666c">gist</a> for an extended table of fields proposed.</p>
<p>These fields are mapped by our LLM integrations and ultimately used within our detections. As we continue to understand the threat landscape, we will continue to refine these fields to ensure additional fields populated by other LLM vendors are standardized and conceptually reflected within the mapping.</p>
<h3>Broader Implications and Benefits of Standardization</h3>
<p>Standardizing security fields within the LLM ecosystem (e.g., user interaction and application integration) facilitates a unified approach to the security domain. Elastic endeavors to lead the charge by defining and promoting a set of standard fields. This effort not only enhances the security posture of individual organizations but also fosters a safer industry.</p>
<p><strong>Integration with Security Tools</strong>: Standardizing responses from LLM-related security tools enriches the security analysis fields that can be shipped alongside the original LLM vendor content to a security solution. If operationally chained together in the LLM application’s ecosystem, security tools can audit each invocation request and response. Security teams can then leverage these fields to build complex detection mechanisms that can identify subtle signs of misuse or vulnerabilities within LLM interactions.</p>
<p><strong>Consistency Across Vendors</strong>: Encouraging all LLM vendors to adopt these standard fields drives a shared goal of effectively protecting applications while establishing a baseline that all industry users can adhere to. Users are encouraged to align to a common schema regardless of the platform or tool.</p>
<p><strong>Enhanced Detection Engineering</strong>: With these standard fields, detection engineering becomes more robust, and the chance of false positives decreases. Security engineers can create effective rules that identify potential threats across different models, interactions, and ecosystems. This consistency is especially important for organizations that rely on multiple LLMs or security tools and need to maintain a unified platform.</p>
<h4>Sample LLM-specific fields: AWS Bedrock use case</h4>
<p>Based on the integration’s ingestion pipeline, field mappings, and processors, the AWS Bedrock data is cleaned up, standardized, and mapped to Elastic Common Schema (<a href="https://www.elastic.co/es/guide/en/ecs/current/ecs-reference.html">ECS</a>) fields. The core Bedrock fields are then introduced under the <code>aws.bedrock</code> group which includes details about the model invocation like requests, responses, and token counts. The integration populates additional fields tailored for the LLM to provide deeper insights into the model’s interactions which are later used in our detections.</p>
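<p>As a hypothetical sketch of the result (field names are drawn from the proposed mappings discussed above; the values are invented), a single mapped invocation document could carry fields such as:</p>
<pre><code>{
  &quot;gen_ai.request.model.id&quot;: &quot;anthropic.claude-v2&quot;,
  &quot;gen_ai.prompt&quot;: &quot;Summarize this document ...&quot;,
  &quot;gen_ai.completion&quot;: &quot;Here is a summary ...&quot;,
  &quot;gen_ai.usage.prompt_tokens&quot;: 52,
  &quot;gen_ai.usage.completion_tokens&quot;: 180,
  &quot;gen_ai.user.id&quot;: &quot;arn:aws:iam::123456789012:user/app-user&quot;,
  &quot;cloud.account.id&quot;: &quot;123456789012&quot;
}
</code></pre>
<p>Because every vendor’s raw logs normalize to the same <code>gen_ai</code> fields, the detection queries in the next section do not need to know they are reading Bedrock data.</p>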
<h3>LLM detection engineering examples</h3>
<p>With the standardized fields and the Elastic AWS Bedrock integration, we can begin crafting detection engineering rules that showcase the proposed capability with varying complexity. The below examples are written using <a href="https://www.elastic.co/es/guide/en/security/8.13/rules-ui-create.html#create-esql-rule">ES|QL</a>.</p>
<p>Note: Check out the detection-rules <a href="https://github.com/elastic/detection-rules/tree/main/hunting">hunting</a> directory and <a href="https://github.com/elastic/detection-rules/tree/main/rules/integrations/aws_bedrock"><code>aws_bedrock</code></a> rules for more details about these queries.</p>
<h4>Basic detection of sensitive content refusal</h4>
<p>With current policies and standards on sensitive topics within the organization, it is important to have mechanisms in place to ensure LLMs also adhere to compliance and ethical standards. Organizations have an opportunity to monitor and capture instances where an LLM directly refuses to respond to sensitive topics.</p>
<p><strong>Sample Detection</strong>:</p>
<pre><code>from logs-aws_bedrock.invocation-*
 | WHERE @timestamp &gt; NOW() - 1 DAY
   AND (
     gen_ai.completion LIKE &quot;*I cannot provide any information about*&quot;
     AND gen_ai.response.finish_reasons LIKE &quot;*end_turn*&quot;
   )
 | STATS user_request_count = count() BY gen_ai.user.id
 | WHERE user_request_count &gt;= 3
</code></pre>
<p><strong>Detection Description</strong>: This query is used to detect instances where the model explicitly refuses to provide information on potentially sensitive or restricted topics multiple times. Combined with predefined formatted outputs, the use of specific phrases like &quot;I cannot provide any information about&quot; within the output content indicates that the model has been triggered by a user prompt to discuss something it's programmed to treat as confidential or inappropriate.</p>
<p><strong>Security Relevance</strong>: Monitoring LLM refusals helps to identify attempts to probe the model for sensitive data or to exploit it in a manner that could lead to the leakage of proprietary or restricted information. By analyzing the patterns and frequency of these refusals, security teams can investigate if there are targeted attempts to breach information security policies.</p>
<h3>Potential denial of service or resource exhaustion attacks</h3>
<p>Because LLMs are computationally and data intensive by design, they are susceptible to resource exhaustion and denial of service (DoS) attacks. High usage patterns may indicate abuse or malicious activity designed to degrade the LLM’s availability. Because prompt request size does not correlate cleanly with token count, it is important to consider that high token counts in prompts do not always result from larger request bodies. Token and character counts depend on the specific model, since each tokenizer differs and relates to how embeddings are generated.</p>
<p><strong>Sample Detection</strong>:</p>
<pre><code>from logs-aws_bedrock.invocation-*
 | WHERE @timestamp &gt; NOW() - 1 DAY
   AND (
     gen_ai.usage.prompt_tokens &gt; 8000 OR
     gen_ai.usage.completion_tokens &gt; 8000 OR
     gen_ai.performance.request_size &gt; 8000
   )
 | STATS max_prompt_tokens = max(gen_ai.usage.prompt_tokens),
         max_request_tokens = max(gen_ai.performance.request_size),
         max_completion_tokens = max(gen_ai.usage.completion_tokens),
         request_count = count() BY cloud.account.id
 | WHERE request_count &gt; 1
 | SORT max_prompt_tokens, max_request_tokens, max_completion_tokens DESC
</code></pre>
<p><strong>Detection Description</strong>: This query identifies high-volume token usage which could be indicative of abuse or an attempted denial of service (DoS) attack. Monitoring for unusually high token counts (input or output) helps detect patterns that could slow down or overwhelm the system, potentially leading to service disruptions. Given each application may leverage a different token volume, we’ve chosen a simple threshold based on our existing experience that should cover basic use cases.</p>
<p><strong>Security Relevance</strong>: This form of monitoring helps detect potential concerns with system availability and performance. It helps in the early detection of DoS attacks or abusive behavior that could degrade service quality for legitimate users. By aggregating and analyzing token usage by account, security teams can pinpoint sources of potentially malicious traffic and take appropriate measures.</p>
<h4>Monitoring for latency anomalies</h4>
<p>Latency-based metrics can be a key indicator of underlying performance issues or security threats that overload the system. By monitoring processing delays, organizations can ensure that servers are operating as efficiently as expected.</p>
<p><strong>Sample Detection</strong>:</p>
<pre><code>from logs-aws_bedrock.invocation-*
 | WHERE @timestamp &gt; NOW() - 1 DAY
 | EVAL response_delay_seconds = gen_ai.performance.start_response_time / 1000
 | WHERE response_delay_seconds &gt; 5
 | STATS max_response_delay = max(response_delay_seconds),
         request_count = count() BY gen_ai.user.id
 | WHERE request_count &gt; 3
 | SORT max_response_delay DESC
</code></pre>
<p><strong>Detection Description</strong>: This query monitors the time it takes for an LLM to start sending a response after receiving a request, focusing on the initial response latency. It identifies significant delays by comparing the actual start of the response to typical response times, highlighting instances where these delays may be abnormally long.</p>
<p><strong>Security Relevance</strong>: Anomalous latencies can be symptomatic of issues such as network attacks (e.g., DDoS) or system inefficiencies that need to be addressed. By tracking and analyzing latency metrics, organizations can ensure that their systems are running efficiently and securely, and can quickly respond to potential threats that might manifest as abnormal delays.</p>
<h2>Advanced LLM detection engineering use cases</h2>
<p>This section explores potential use cases that could be addressed with an Elastic Security integration. It assumes that these fields are fully populated and that necessary security auditing enrichment features (e.g., Guardrails) have been implemented, either within AWS Bedrock or via a similar approach provided by the LLM vendor. In combination with the available data source and Elastic integration, detection rules can be built on top of these Guardrail requests and responses to detect misuse of LLMs in deployment.</p>
<h3>Malicious model uploads and cross-tenant escalation</h3>
<p>A recent investigation into the Hugging Face Inference API revealed a significant risk where attackers could upload a maliciously crafted model to achieve arbitrary code execution. This was achieved using a Python Pickle file that, when deserialized, executed embedded malicious code. These vulnerabilities highlight the need for rigorous security measures to inspect and sanitize all inputs in AI-as-a-Service (AIaaS) platforms, from the LLM itself to the infrastructure that hosts the model and the application API integration. Refer to <a href="https://www.wiz.io/blog/wiz-and-hugging-face-address-risks-to-ai-infrastructure">this article</a> for more details.</p>
<p><strong>Potential Detection Opportunity</strong>: Use fields like <code>gen_ai.request.model.id</code> and <code>gen_ai.request.model.version</code>, along with <code>gen_ai.completion</code>, to detect interactions with anomalous models. Monitoring for unusual values or patterns in model identifiers and version numbers, and inspecting the requested content (e.g., looking for typical Python Pickle serialization techniques), may indicate suspicious behavior. Similarly, checking the same fields prior to model upload may block the upload entirely. Cross-referencing additional fields like <code>gen_ai.user.id</code> can help identify malicious cross-tenant operations performing these types of activities.</p>
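<p>As an illustrative sketch only (the Pickle-related indicator strings and grouping are assumptions, not vetted detection logic), an ES|QL query along these lines could surface suspicious serialization content:</p>
<pre><code>from logs-aws_bedrock.invocation-*
 | WHERE @timestamp &gt; NOW() - 1 DAY
   AND (
     gen_ai.completion LIKE &quot;*__reduce__*&quot;
     OR gen_ai.completion LIKE &quot;*pickle.loads*&quot;
   )
 | STATS request_count = count() BY gen_ai.user.id, gen_ai.request.model.id
 | SORT request_count DESC
</code></pre>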
<h3>Unauthorized URLs and external communication</h3>
<p>As LLMs become more integrated into operational ecosystems, their ability to interact with external capabilities like email or webhooks can be exploited by attackers. To protect against these interactions, it’s important to implement detection rules that can identify suspicious or unauthorized activities based on the model’s outputs and subsequent integrations.</p>
<p><strong>Potential Detection Opportunity</strong>: Use fields like <code>gen_ai.completion</code> and <code>gen_ai.security.regex_pattern_count</code> to triage malicious external URLs and webhooks. These regex patterns need to be predefined based on well-known suspicious patterns.</p>
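<p>A minimal sketch of such a rule, assuming the proposed <code>gen_ai.security.regex_pattern_count</code> field is populated by upstream tooling (the URL pattern and threshold here are illustrative):</p>
<pre><code>from logs-aws_bedrock.invocation-*
 | WHERE @timestamp &gt; NOW() - 1 DAY
   AND gen_ai.security.regex_pattern_count &gt; 0
   AND gen_ai.completion RLIKE &quot;.*https?://.*&quot;
 | STATS url_completion_count = count() BY gen_ai.user.id
 | WHERE url_completion_count &gt;= 2
</code></pre>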
<h4>Hierarchical instruction prioritization</h4>
<p>LLMs are increasingly used in environments where they receive instructions from various sources (e.g., <a href="https://openai.com/blog/custom-instructions-for-chatgpt">ChatGPT Custom Instructions</a>), which may not always have benign intentions. This build-your-own model workflow can lead to a range of potential security vulnerabilities, if the model treats all instructions with equal importance, and they go unchecked. Reference <a href="https://arxiv.org/pdf/2404.13208.pdf">here</a>.</p>
<p><strong>Potential Detection Opportunity</strong>: Monitor fields like <code>gen_ai.model.instructions</code> and <code>gen_ai.completion</code> to identify discrepancies between given instructions and the model’s responses, which may indicate cases where models treat all instructions with equal importance. Additionally, analyze <code>gen_ai.similarity_score</code> to discern how similar the response is to the original request.</p>
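<p>For example, assuming <code>gen_ai.similarity_score</code> is populated on a 0–1 scale (an assumption; the scale and threshold depend on the enrichment tooling), a low-similarity sketch might look like:</p>
<pre><code>from logs-aws_bedrock.invocation-*
 | WHERE @timestamp &gt; NOW() - 1 DAY
   AND gen_ai.similarity_score &lt; 0.2
 | STATS low_similarity_count = count() BY gen_ai.user.id
 | WHERE low_similarity_count &gt; 5
 | SORT low_similarity_count DESC
</code></pre>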
<h3>Extended detections featuring additional Elastic rule types</h3>
<p>This section introduces additional detection engineering techniques using some of Elastic’s rule types, Threshold, Indicator Match, and New Terms to provide a more nuanced and robust security posture.</p>
<ul>
<li><strong>Threshold Rules</strong>: Identify a high frequency of denied requests over a short period of time, grouped by <code>gen_ai.user.id</code>, that could be indicative of abuse attempts. (e.g. OWASP’s LLM04)</li>
<li><strong>Indicator Match Rules</strong>: Match indicators provided by threat intelligence, such as known-malicious LLM user IDs, against fields like <code>gen_ai.user.id</code> (e.g. <code>arn:aws:iam::12345678912:user/thethreatactor</code>)</li>
<li><strong>New Terms Rules</strong>: Detect new or unusual terms in user prompts that fall outside normal usage for the user’s role, potentially indicating new malicious behaviors.</li>
</ul>
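<p>As a rough ES|QL analogue of the Threshold rule idea above (the <code>gen_ai.policy.action</code> field and its values come from the proposed schema and are assumptions; a production Threshold rule would normally be authored through the Kibana rule type itself):</p>
<pre><code>from logs-aws_bedrock.invocation-*
 | WHERE @timestamp &gt; NOW() - 1 HOUR
   AND gen_ai.policy.action == &quot;blocked&quot;
 | STATS denied_count = count() BY gen_ai.user.id
 | WHERE denied_count &gt; 10
</code></pre>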
<h2>Summary</h2>
<p>Elastic is pioneering the standardization of LLM-based fields across the generative AI landscape to enable security detections across the ecosystem. This initiative not only aligns with our ongoing enhancements in LLM integration and security strategies but also supports our broad security framework that safeguards both direct user interactions and the underlying system architectures. By promoting a uniform language among LLM vendors for enhanced detection and response capabilities, we aim to protect the entire ecosystem, making it more secure and dependable. Elastic invites all stakeholders within the industry (creators, maintainers, integrators, and users) to adopt these standardized practices, thereby strengthening collective security measures and advancing industry-wide protections.</p>
<p>As we continue to add and enhance our integrations, starting with AWS Bedrock, we are strategizing to align other LLM-based integrations to the new standards we’ve set, paving the way for a unified experience across the Elastic ecosystem. The seamless overlap with existing Elasticsearch capabilities empowers users to leverage sophisticated search and analytics directly on the LLM data, driving existing workflows back to tools users are most comfortable with.</p>
<p>Check out the <a href="https://www.elastic.co/es/security/llm-safety-report">LLM Safety Assessment</a>, which delves deeper into these topics.</p>
<p><strong>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</strong></p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/elastic-advances-llm-security/Security Labs Images 4.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Using LLMs and ESRE to find similar user sessions]]></title>
            <link>https://www.elastic.co/es/security-labs/using-llms-and-esre-to-find-similar-user-sessions</link>
            <guid>using-llms-and-esre-to-find-similar-user-sessions</guid>
            <pubDate>Tue, 19 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In our previous article, we explored using the GPT-4 Large Language Model (LLM) to condense Linux user sessions. In the context of the same experiment, we dedicated some time to examine sessions that shared similarities. These similar sessions can subsequently aid the analysts in identifying related suspicious activities.]]></description>
            <content:encoded><![CDATA[<h2>Using LLMs and ESRE to find similar user sessions</h2>
<p>In our <a href="https://www.elastic.co/es/security-labs/using-llms-to-summarize-user-sessions">previous article</a>, we explored using the GPT-4 Large Language Model (LLM) to condense complex Linux user sessions into concise summaries. We highlighted the key takeaways from our experiments, shedding light on the nuances of data preprocessing, prompt tuning, and model parameter adjustments. In the context of the same experiment, we dedicated some time to examine sessions that shared similarities. These similar sessions can subsequently aid the analysts in identifying related suspicious activities. We explored the following methods to find similarities in user sessions:</p>
<ul>
<li>To uncover similar user profiles and sessions, one approach was to categorize sessions according to the actions executed by users; we accomplished this by instructing the Large Language Model (LLM) to categorize user sessions into predefined categories</li>
<li>Additionally, we harnessed the capabilities of <a href="https://www.elastic.co/es/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a> (Elastic’s retrieval model for semantic search) to execute a semantic search on the model summaries derived from the session summarization experiment</li>
</ul>
<p>This research focuses on our experiments using GPT-4 for session categorization and <a href="https://www.elastic.co/es/elasticsearch/elasticsearch-relevance-engine">ESRE</a> for semantic search.</p>
<h2>Leveraging GPT for Session Categorization</h2>
<p>We consulted a security research colleague with domain expertise to define nine categories for our dataset of 75 sessions. These categories generalize the main behaviors and significant features observed in the sessions. They include the following activities:</p>
<ul>
<li>Docker Execution</li>
<li>Network Operations</li>
<li>File Searches</li>
<li>Linux Command Line Usage</li>
<li>Linux Sandbox Application Usage</li>
<li>Pip Installations</li>
<li>Package Installations</li>
<li>Script Executions</li>
<li>Process Executions</li>
</ul>
<h2>Lessons learned</h2>
<p>For our experiments, we used a GPT-4 deployment in Azure AI Studio with a token limit of 32k. To explore the potential of the GPT model for session categorization, we conducted a series of experiments, directing the model to categorize sessions by inputting the same JSON summary document we used for the <a href="https://www.elastic.co/es/security-labs/using-llms-to-summarize-user-sessions">session summarization process</a>.</p>
<p>This effort included multiple iterations, during which we concentrated on enhancing prompts and <a href="https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api">Few-Shot</a> Learning. As for the model parameters, we maintained a <a href="https://txt.cohere.com/llm-parameters-best-outputs-language-ai/">Temperature of 0</a> in an effort to make the outputs less diverse.</p>
<h3>Prompt engineering</h3>
<p><em>Takeaway:</em> Including explanations for categories in the prompts does not impact the model's performance.</p>
<p>The session categorization component was introduced as an extension to the session summarization prompt. We explored the effect of incorporating contextual explanations for each category alongside the prompts. Intriguingly, our findings revealed that appending illustrative context did not significantly influence the model's performance, as compared to prompts devoid of such supplementary information.</p>
<p>Below is a template we used to guide the model's categorization process:</p>
<pre><code>You are a cybersecurity assistant, who helps Security analysts in summarizing activities that transpired in a Linux session. A summary of events that occurred in the session will be provided in JSON format. No need to explicitly list out process names and file paths. Summarize the session in ~3 paragraphs, focusing on the following: 
- Entities involved in the session: host name and user names.
- Overview of any network activity. What major source and destination ips are involved? Any malicious port activity?
- Overview of any file activity. Were any sensitive files or directories accessed?
- Highlight any other important process activity
- Looking at the process, network, and file activity, what is the user trying to do in the session? Does the activity indicate malicious behavior?

Also, categorize the below Linux session in one of the following 9 categories: Network, Script Execution, Linux Command Line Utility, File search, Docker Execution, Package Installations, Pip Installations, Process Execution and Linux Sandbox Application.

A brief description for each Linux session category is provided below. Refer to these explanations while categorizing the sessions.
- Docker Execution: The session involves command with docker operations, such as docker-run and others
- Network: The session involves commands with network operations
- File Search: The session involves file operations, pertaining to search
- Linux Command Line Utility: The session involves linux command executions
- Linux Sandbox Application: The session involves a sandbox application activity. 
- Pip Installations: The session involves python pip installations
- Package Installations: The session involves package installations or removal activities. This is more of apt-get, yum, dpkg and general command line installers as opposed to any software wrapper
- Script Execution: The session involves bash script invocations. All of these have pointed custom infrastructure script invocations
- Process Execution: The session focuses on other process executions and is not limited to linux commands. 
 ###
 Text: {your input here}
</code></pre>
<h3>Few-shot tuning</h3>
<p><em>Takeaway:</em> Adding examples for each category improves accuracy.</p>
<p>In parallel, we investigated whether including one example for each category in the above prompt would improve the model's performance. This strategy resulted in a significant enhancement, boosting the model's accuracy by 20%.</p>
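<p>The few-shot setup described above can be sketched as assembling a chat-completions message list with one (session summary, category) example pair per class ahead of the real input. The example texts below are invented placeholders, not the ones used in the experiment.</p>

```python
# Sketch of the few-shot prompt assembly: one example per category is
# appended before the session to be categorized. The example texts here
# are invented placeholders, not those used in the actual experiment.
CATEGORY_EXAMPLES = {
    "Docker Execution": "User ran docker-run to launch a container image.",
    "Network": "Session opened outbound connections on ports 443 and 8080.",
    "Pip Installations": "User invoked pip install for several packages.",
    # ... one short example per remaining category
}

def build_few_shot_messages(system_prompt: str, session_summary: str) -> list:
    """Return a chat-completions style message list with few-shot pairs."""
    messages = [{"role": "system", "content": system_prompt}]
    for category, example in CATEGORY_EXAMPLES.items():
        messages.append({"role": "user", "content": example})
        messages.append({"role": "assistant", "content": category})
    messages.append({"role": "user", "content": session_summary})
    return messages
```

<p>The resulting list is what would be passed as the <code>messages</code> argument of a chat-completions request against the GPT-4 deployment.</p>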
<h2>Evaluating GPT Categories</h2>
<p>The assessment of GPT categories is crucial in measuring the quality and reliability of the outcomes. In the evaluation of categorization results, a comparison was drawn between the model's categorization and the human categorization assigned by the security expert (referred to as &quot;Ground_Truth&quot; in the below image). We calculated the total accuracy based on the number of successful matches for categorization evaluation.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image2.png" alt="Evaluating Session Categories" /></p>
<p>We observed that GPT-4 faced challenges when dealing with samples bearing multiple categories. However, when assigning a single category, it aligned with the human categorization in 56% of cases. The &quot;Linux Command Line Utility&quot; category posed a particular challenge, accounting for 47% of the false negatives, often misclassified as &quot;Process Execution&quot; or &quot;Script Execution.&quot; This discrepancy arose from the closely related definitions of the &quot;Linux Command Line Utility&quot; and &quot;Process Execution&quot; categories. There may also have been insufficient information in the prompts, such as process command line arguments, which could have served as a valuable distinguishing factor between these categories.</p>
<p>Given the results from our evaluation, we conclude that we either need to tune the descriptions for each category in the prompt or provide more examples to the model via few-shot training. Additionally, it's worth considering whether GPT is the most suitable choice for classification, particularly within the context of the prompting paradigm.</p>
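<p>The accuracy calculation used in this evaluation (successful matches over total sessions) is straightforward; a minimal sketch:</p>

```python
# Minimal sketch of the evaluation metric: total accuracy is the fraction
# of sessions where the model's single category matches the expert's
# ground-truth label.
def categorization_accuracy(model_labels, ground_truth):
    assert len(model_labels) == len(ground_truth)
    matches = sum(m == g for m, g in zip(model_labels, ground_truth))
    return matches / len(ground_truth)
```
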
<h2>Semantic search with ELSER</h2>
<p>We also wanted to try <a href="https://www.elastic.co/es/guide/en/machine-learning/current/ml-nlp-elser.html#ml-nlp-elser">ELSER</a>, the Elastic Learned Sparse EncodeR for semantic search. Semantic search focuses on contextual meaning, rather than strictly exact keyword inputs, and ELSER is a retrieval model trained by Elastic that enables you to perform semantic search and retrieve more relevant results.</p>
<p>We tried some examples of semantic search questions on the session summaries. The session summaries were stored in an Elasticsearch index, and it was simple to download the ELSER model following an <a href="https://www.elastic.co/es/guide/en/machine-learning/current/ml-nlp-elser.html#ml-nlp-elser">official tutorial</a>. The tokens generated by ELSER are stored in the index, as shown in the image below:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image1.png" alt="Tokens generated by ELSER" /></p>
<p>Afterward, semantic search on the index was generally able to retrieve the most relevant events. Semantic search queries about the events included:</p>
<ul>
<li>Password related – yielding 1Password related logs</li>
<li>Java – yielding logs that used Java</li>
<li>Python – yielding logs that used Python</li>
<li>Non-interactive session</li>
<li>Interactive session</li>
</ul>
<p>An example of semantic search can be seen in the Dev Tools console through a <a href="https://www.elastic.co/es/guide/en/elasticsearch/reference/8.9/semantic-search-elser.html#text-expansion-query">text_expansion query</a>.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image5.png" alt="Example screenshot of using semantic search with the Elastic dev tools console" /></p>
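<p>A query like the one in the screenshot can also be built programmatically. The sketch below assembles a <code>text_expansion</code> query body; the tokens field name (<code>ml.tokens</code>) and model ID match the ELSER tutorial defaults, while the index name in the usage comment is an assumption for illustration.</p>

```python
# Sketch of a text_expansion query body like the one shown in Dev Tools.
# The tokens field and model ID follow the ELSER tutorial defaults;
# adjust them to wherever your ELSER tokens are actually stored.
def elser_query(search_text: str, tokens_field: str = "ml.tokens") -> dict:
    return {
        "query": {
            "text_expansion": {
                tokens_field: {
                    "model_id": ".elser_model_1",
                    "model_text": search_text,
                }
            }
        }
    }

# With the official Python client this would be submitted as, e.g.:
# es.search(index="session-summaries", body=elser_query("object oriented language"))
```
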
<p>Some takeaways are:</p>
<ul>
<li>For semantic search, the prompt template can cause the summary to contain too many unrelated keywords. For example, because we wanted every summary to include an assessment of whether or not the session should be considered &quot;malicious&quot;, that specific word was always included in the resulting summary. Hence, the summaries of benign and malicious sessions alike contained the word &quot;malicious&quot; through sentences like &quot;This session is malicious&quot; or &quot;This session is not malicious&quot;. This could have impacted the accuracy.</li>
<li>Semantic search seemed unable to differentiate effectively between certain related concepts, such as interactive vs. non-interactive. A small number of specific terms might not have been deemed important enough to the core meaning of the session summary for semantic search.</li>
<li>Semantic search works better than <a href="https://link.springer.com/referenceworkentry/10.1007/978-0-387-39940-9_921">BM25</a> for cases where the user doesn’t specify the exact keywords. For example, searching for &quot;Python&quot; or &quot;Java&quot; related logs and summaries is equally effective with both ELSER and BM25. However, ELSER could retrieve more relevant data when searching for “object oriented language” related logs. In contrast, using a keyword search for “object oriented language” doesn’t yield relevant results, as shown in the image below.</li>
</ul>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/image4.png" alt="Semantic search can yield more relevant results when keywords aren’t matching" /></p>
<h2>What's next</h2>
<p>We are currently looking into further improving summarization via <a href="https://arxiv.org/pdf/2005.11401.pdf">retrieval augmented generation (RAG)</a>, using tools in the <a href="https://www.elastic.co/es/guide/en/esre/current/index.html">Elastic Search and Relevance Engine</a> (ESRE). In the meantime, we’d love to hear about your experiments with LLMs, ESRE, etc. If you'd like to share what you're doing or run into any issues during the process, please reach out to us on our <a href="https://ela.st/slack">community Slack channel</a> and <a href="https://discuss.elastic.co/c/security">discussion forums</a>.</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/using-llms-and-esre-to-find-similar-user-sessions/photo-edited-03@2x.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Using LLMs to summarize user sessions]]></title>
            <link>https://www.elastic.co/es/security-labs/using-llms-to-summarize-user-sessions</link>
            <guid>using-llms-to-summarize-user-sessions</guid>
            <pubDate>Mon, 11 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this publication, we will talk about lessons learned and key takeaways from our experiments using GPT-4 to summarize user sessions.]]></description>
            <content:encoded><![CDATA[<h2>Using LLMs to summarize user sessions</h2>
<p>With the introduction of the <a href="https://www.elastic.co/es/guide/en/security/current/security-assistant.html">AI Assistant</a> into the Security Solution in 8.8, the Security Machine Learning team at Elastic has been exploring how to optimize Security operations with LLMs like GPT-4. User session summarization seemed like the perfect use case to start experimenting with for several reasons:</p>
<ul>
<li>User session summaries can help analysts quickly decide whether a particular session's activity is worth investigating or not</li>
<li>Given the diversity of data that LLMs like GPT-4 are trained on, it is not hard to imagine that they have already been trained on <a href="https://en.wikipedia.org/wiki/Man_page">man pages</a>, and other open Security content, which can provide useful context for session investigation</li>
<li>Session summaries could potentially serve as a good supplement to the <a href="https://www.elastic.co/es/guide/en/security/current/session-view.html">Session View</a> tool, which is available in the Elastic Security Solution as of 8.2.</li>
</ul>
<p>In this publication, we will talk about lessons learned and key takeaways from our experiments using GPT-4 to summarize user sessions.</p>
<p>In our <a href="https://www.elastic.co/es/security-labs/using-llms-and-esre-to-find-similar-user-sessions">follow-on research</a>, we dedicated some time to examine sessions that shared similarities. These similar sessions can subsequently aid the analysts in identifying related suspicious activities.</p>
<h2>What is a session?</h2>
<p>In Linux, and other Unix-like systems, a &quot;user session&quot; refers to the period during which a user is logged into the system. A session begins when a user logs into the system, either via graphical login managers (GDM, LightDM) or via command-line interfaces (terminal, SSH).</p>
<p>Upon starting a Linux Kernel, a special process called the &quot;init&quot; process is created, which is responsible for starting configured services such as databases, web servers, and remote access services such as <code>sshd</code>. These services, and any shells or processes spawned by them, are typically encapsulated within their own sessions and tied together by a single session ID (SID).</p>
<p>The detailed and chronological process information captured by sessions makes them an extremely useful asset for alerting, compliance, and threat hunting.</p>
<h2>Lessons learned</h2>
<p>For our experiments, we used a GPT-4 deployment with a 32k token limit available via Azure AI Studio. Tokens are basic units of text or code that LLMs use to process and generate language. Our goal here was to see how far we can get with user session summarization within the prompting paradigm alone. We learned some things along the way as it related to data processing, prompt engineering, hallucinations, parameter tuning, and evaluating the GPT summaries.</p>
<h3>Data processing</h3>
<p><em>Takeaway:</em> An aggregated JSON snapshot of the session is an effective input format for summarization.</p>
<p>A session here is simply a collection of process, network, file, and alert events. The number of events in a user session can range from a handful (&lt; 10) to hundreds of thousands. Each event log itself can be quite verbose, containing several hundred fields. For longer sessions with a large number of events, one can quickly run into token limits for models like GPT-4. Hence, passing raw logs as input to GPT-4 is not as useful for our specific use case. We saw this during experimentation, even when using tabular formats such as CSV, and using a small subset of fields in the logs.</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/image1.png" alt="Max token limit (32k) is reached for sessions containing a few hundred events" /></p>
<p>To get around this issue, we had to come up with an input format that retains as much of the session's context as possible, while also keeping the number of input tokens more or less constant irrespective of the length of the session. We experimented with several log de-duplication and aggregation strategies and found that an aggregated JSON snapshot of the session works well for summarization. An example document is as follows:</p>
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/image3.jpg" alt="Aggregated JSON snapshot of session activity" /></p>
<p>This JSON snapshot highlights the most prominent activities in the session using de-duplicated lists, aggregate counts, and top-N (20 in our case) most frequent terms, with self-explanatory field names.</p>
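<p>A minimal sketch of how such a snapshot can be built from raw events, using de-duplicated lists, aggregate counts, and top-N most frequent terms. The field names below are illustrative; the real snapshot carries many more fields.</p>

```python
# Sketch of building an aggregated session snapshot: de-duplicated lists,
# aggregate counts, and top-N most frequent terms. Field names here are
# illustrative assumptions; the real snapshot has many more fields.
from collections import Counter

def session_snapshot(events: list, top_n: int = 20) -> dict:
    processes = [e["process_name"] for e in events if "process_name" in e]
    files = [e["file_path"] for e in events if "file_path" in e]
    return {
        "event_count": len(events),
        "unique_process_names": sorted(set(processes)),
        "top_file_paths": [p for p, _ in Counter(files).most_common(top_n)],
    }
```

<p>Because the lists are de-duplicated and capped at top-N terms, the token count of the snapshot stays roughly constant regardless of how many events the session contains.</p>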
<h3>Prompt engineering</h3>
<p><em>Takeaway:</em> Few-shot tuning with high-level instructions worked best.</p>
<p>Apart from data processing, most of our time during experimentation was spent on prompt tuning. We started with a basic prompt and found that the model had a hard time connecting the dots to produce a useful summary:</p>
<pre><code>You are an AI assistant that helps people find information.
</code></pre>
<p>We then tried providing very detailed instructions in the prompt but noticed that the model ignored some of the instructions:</p>
<pre><code>You are a cybersecurity assistant, who helps Security analysts in summarizing activities that transpired in a Linux session. A summary of events that occurred in the session will be provided in JSON format. No need to explicitly list out process names and file paths. Summarize the session in ~3 paragraphs, focusing on the following: 
- Entities involved in the session: host name and user names.
- Overview of any network activity. What major source and destination ips are involved? Any malicious port activity?
- Overview of any file activity. Were any sensitive files or directories accessed?
- Highlight any other important process activity
- Looking at the process, network, and file activity, what is the user trying to do in the session? Does the activity indicate malicious behavior?
</code></pre>
<p>Based on the above prompt, the model did not reliably adhere to the 3-paragraph request and also listed out process names and file paths, which it was explicitly told not to do.</p>
<p>Finally, we landed on the following prompt that provided high-level instructions for the model:</p>
<pre><code>Analyze the following Linux user session, focusing on:      
- Identifying the host and user names      
- Observing activities and identifying key patterns or trends      
- Noting any indications of malicious or suspicious behavior such as tunneling or encrypted traffic, login failures, access to sensitive files, large number of file creations and deletions, disabling or modifying Security software, use of Shadow IT, unusual parent-child process executions, long-running processes
- Conclude with a comprehensive summary of what the user might be trying to do in the session, based on the process, network, and file activity     
 ###
 Text: {your input here}
</code></pre>
<p>We also noticed that the model follows instructions more closely when they're provided in user prompts rather than in the system prompts (a system prompt is the initial instruction to the model telling it how it should behave and the user prompts are the questions/queries asked by a user to the model). After the above prompt, we were happy with the content of the summaries, but the output format was inconsistent, with the model switching between paragraphs and bulleted lists. We were able to resolve this with <a href="https://arxiv.org/pdf/2203.04291.pdf">few-shot tuning</a>, by providing the model with two examples of user prompts vs. expected responses.</p>
<h3>Hallucinations</h3>
<p><em>Takeaway:</em> The model occasionally hallucinates while generating net new content for the summaries.</p>
<p>We observed that the model does not typically <a href="https://arxiv.org/pdf/2110.10819.pdf">hallucinate</a> while summarizing facts that are immediately apparent in the input such as user and host entities, network ports, etc. Occasionally, the model hallucinates while summarizing information that is not obvious, for example, in this case summarizing the overall user intent in the session. Some relatively easy avenues we found to mitigate hallucinations were as follows:</p>
<ul>
<li>Prompt the model to focus on specific behaviors while summarizing</li>
<li>Re-iterate that the model should fact-check its output</li>
<li>Set the <a href="https://learnprompting.org/docs/basics/configuration_hyperparameters">temperature</a> to a low value (less than or equal to 0.2) to get the model to generate less diverse responses, hence reducing the chances of hallucinations</li>
<li>Limit the response length, thus reducing the opportunity for the model to go off-track. This works especially well if the length of the texts to be summarized is more or less constant, which it was in our case</li>
</ul>
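<p>The last two mitigations translate directly into request parameters. The values below are illustrative, not the exact ones used in the experiments:</p>

```python
# Request parameters reflecting the mitigations above: a low temperature
# for less diverse output and a capped response length. Values here are
# illustrative assumptions, not the exact experimental settings.
SUMMARY_REQUEST_PARAMS = {
    "temperature": 0.2,  # <= 0.2 reduces diversity, and with it hallucinations
    "max_tokens": 512,   # cap the summary length to keep the model on track
}
```
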
<h3>Parameter tuning</h3>
<p><em>Takeaway:</em> Temperature = 0 does not guarantee determinism.</p>
<p>For summarization, we explored tuning parameters such as <a href="https://txt.cohere.com/llm-parameters-best-outputs-language-ai/">Temperature and Top P</a>, to get deterministic responses from the model. Our observations were as follows:</p>
<ul>
<li>Tuning both together is not recommended, and it's also difficult to observe the effect of each when combined</li>
<li>Solely setting the temperature to a low value (&lt; 0.2) without altering Top P is usually sufficient</li>
<li>Even setting the temperature to 0 does not result in fully deterministic outputs given the inherent non-deterministic nature of floating point calculations (see <a href="https://community.openai.com/t/a-question-on-determinism/8185">this</a> post from OpenAI for a more detailed explanation)</li>
</ul>
<h2>Evaluating GPT Summaries</h2>
<p>As with any modeling task, evaluating the GPT summaries was crucial in gauging the quality and reliability of the model outcomes. In the absence of standardized evaluation approaches and metrics for text generation, we decided to do a qualitative human evaluation of the summaries, as well as a quantitative evaluation using automatic metrics such as <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE-L</a>, <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a>, <a href="https://en.wikipedia.org/wiki/METEOR">METEOR</a>, <a href="https://arxiv.org/abs/1904.09675">BERTScore</a>, and <a href="https://aclanthology.org/2020.eval4nlp-1.2/">BLANC</a>.</p>
<p>For qualitative evaluation, we had a Security Researcher write summaries for a carefully chosen (to get a good distribution of short and long sessions) set of 10 sessions, without any knowledge of the GPT summaries. Three evaluators were asked to compare the GPT summaries against the human-generated summaries using three key criteria:</p>
<ul>
<li>Factuality:  Examine if the model summary retains key facts of the session as provided by Security experts</li>
<li>Authenticity: Check for hallucinations</li>
<li>Consistency: Check the consistency of the model output, i.e., that all responses share a stable format and produce the same level of detail</li>
</ul>
<p>Finally, each of the 10 summaries was assigned a final rating of &quot;Good&quot; or &quot;Bad&quot; based on a majority vote to combine the evaluators' choices.</p>
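<p>Combining the three evaluators' choices into a final rating by majority vote is a one-liner; with three voters and two labels, a tie is impossible:</p>

```python
# Sketch of combining three evaluators' "Good"/"Bad" ratings into one
# final label by majority vote, as described above.
from collections import Counter

def majority_rating(votes: list) -> str:
    """votes: e.g. ["Good", "Good", "Bad"] -> "Good"."""
    return Counter(votes).most_common(1)[0][0]
```
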
<p><img src="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/image2.png" alt="Summarization evaluation matrix" /></p>
<p>While we recognize the small dataset size for evaluation, our qualitative assessment showed that GPT summaries aligned with human summaries 80% of the time. For the GPT summaries that received a &quot;Bad&quot; rating, the summaries didn't retain certain important facts because the aggregated JSON document only kept the top-N terms for certain fields.</p>
<p>The automated metrics didn't seem to match human preferences, nor did they reliably measure summary quality due to the structural differences between human and LLM-generated summaries, especially for reference-based metrics.</p>
<h2>What's next</h2>
<p>We are currently looking into further improving summarization via <a href="https://arxiv.org/pdf/2005.11401.pdf">retrieval augmented generation (RAG)</a>, using tools in the <a href="https://www.elastic.co/es/guide/en/esre/current/index.html">Elastic Search and Relevance Engine (ESRE)</a>. We also experimented with using LLMs to categorize user sessions. Stay tuned for Part 2 of this blog to learn more about those experiments!</p>
<p>In the meantime, we’d love to hear about your experiments with LLMs, ESRE, etc. If you'd like to share what you're doing or run into any issues during the process, please reach out to us on our <a href="https://ela.st/slack">community Slack channel</a> and <a href="https://discuss.elastic.co/c/security">discussion forums</a>. Happy experimenting!</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/es/security-labs/assets/images/using-llms-to-summarize-user-sessions/photo-edited-01@2x.jpg" length="0" type="image/jpg"/>
        </item>
    </channel>
</rss>