<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Security Labs - Articles by Eric Forte</title>
        <link>https://www.elastic.co/security-labs</link>
        <description>Trusted security news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Thu, 05 Mar 2026 22:21:01 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Security Labs - Articles by Eric Forte</title>
            <url>https://www.elastic.co/security-labs/assets/security-labs-thumbnail.png</url>
            <link>https://www.elastic.co/security-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[The Engineer's Guide to Elastic Detections as Code]]></title>
            <link>https://www.elastic.co/security-labs/detection-as-code-timeline-and-new-features</link>
            <guid>detection-as-code-timeline-and-new-features</guid>
            <pubDate>Wed, 04 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[This post details the latest evolution of Elastic Security's Detections as Code (DaC) framework, including its development timeline, current feature highlights, and tailored implementation examples.]]></description>
            <content:encoded><![CDATA[<p>In an ever-evolving threat landscape, security operations are reaching a tipping point. As the velocity and complexity of threats increase, teams expand and managed environments multiply. Commonly, manual approaches to rule management become a bottleneck. This is where Detections as Code (DaC) steps in, not just as a tool, but as a methodology.</p>
<p>DaC as a methodology applies software development practices to the creation, management, and deployment of security detection rules. By treating detection rules as code, it enables version control, automated testing, and deployment processes, enhancing collaboration, consistency, and agility in response to threats. DaC streamlines the detection rule lifecycle, ensuring high-quality detections through peer reviews and automated tests. This methodology also supports compliance with change management requirements and fosters a mature security posture.</p>
<p>That's why we’re excited to share the latest updates to Elastic's <a href="https://github.com/elastic/detection-rules">detection-rules</a>, our open repository for writing, testing, and managing security detection rules in Elastic, that also allows you to create your own <a href="https://dac-reference.readthedocs.io/en/latest/">Detections as Code (DaC) framework</a>. Continue reading for highlighted implementation examples using extended functionality, and the announcement of Elastic's free Detections as Code Workshop.</p>
<h1>Elastic Security DaC: The journey from alpha to general availability</h1>
<p>With the functionality now provided in the <a href="https://github.com/elastic/detection-rules">detection-rules</a> repository, users can manage all their detection rules as code, review rule tunings, automatically test and validate rules, and automate rule deployment across their environments.</p>
<h2>Pre-2024: Elastic’s internal use of DaC</h2>
<p>Elastic's threat research and detection engineering teams created and used the <a href="https://github.com/elastic/detection-rules">detection-rules</a> repository to develop, test, manage, and release prebuilt rules, following DaC principles: reviewing rules as a team and automating their testing and release. The repository also includes an interactive CLI for creating rules, so engineers can start working on rules right away.</p>
<p>As the security community's interest in as-code principles grew, and with the available Elastic Security APIs already allowing users to implement their own custom Detections as Code solutions, Elastic decided to extend the <a href="https://github.com/elastic/detection-rules">detection-rules</a> repository's functionality so that our users can benefit from our tooling when creating their own DaC processes.</p>
<p>Here are the key milestones of Elastic’s user-focused DaC development from alpha to general availability.</p>
<h2>May 2024: Alpha release of new &quot;roll your own&quot; features</h2>
<p>Our detection-rules repository was adjusted for customer use, supporting management of custom rules, adaptation of the test suite to user needs, and management of actions and exceptions alongside the rules.</p>
<p>Key additions:</p>
<ul>
<li>Custom rules directory support</li>
<li>Selection of which tests to run based on your requirements</li>
<li><a href="https://www.elastic.co/docs/solutions/security/detect-and-alert/rule-exceptions">Exceptions</a> and Actions support</li>
</ul>
<p>We also published extensive <a href="https://dac-reference.readthedocs.io/en/latest/">guidance</a> for Detections as Code, with implementation examples for Elastic Security using the <a href="https://github.com/elastic/detection-rules">detection-rules</a> repository.</p>
<h2>August 2024: &quot;Roll your own&quot; features now in beta</h2>
<p>The functionality was extended to allow import and export of custom rules between Elastic Security and the repository, provide more configuration options, and extend versioning functionality to custom rules.</p>
<p>New features added:</p>
<ul>
<li>Bulk import/export of custom rules (based on Elastic Security APIs)</li>
<li>Fully configurable unit tests, validation, and schemas</li>
<li>Version lock for custom rules</li>
</ul>
<h2>March–August 2025: &quot;Roll your own&quot; features are generally available and supported</h2>
<p>Using DaC with Elastic Security 8.18 and up:</p>
<ul>
<li><a href="https://www.elastic.co/guide/en/security/8.18/whats-new.html#_customize_and_manage_prebuilt_detection_rules">Supports prebuilt rules management</a>. You can export all prebuilt rules from Elastic Security and store them alongside your custom rules.</li>
<li>Added support for filtering rules on export.</li>
</ul>
<p>Adjacent to our DaC efforts, we also released new Terraform resources (<a href="https://github.com/elastic/terraform-provider-elasticstack/releases/tag/v0.12.0">V0.12.0</a> and <a href="https://github.com/elastic/terraform-provider-elasticstack/releases/tag/v0.13.0">V0.13.0</a>) between October and December 2025, allowing Terraform users to manage detection rules and exceptions.</p>
<p>With this foundation spelled out, let's explore the powerful features that are available to streamline your detection engineering process.</p>
<h1>Detection-rules DaC functionality highlights</h1>
<p>There are a few worthwhile additions since our <a href="https://www.elastic.co/security-labs/dac-beta-release">last DaC publication</a>, which we’ll expand on below.</p>
<h2>Additional filters</h2>
<p>The <a href="https://github.com/elastic/detection-rules/blob/main/CLI.md#exporting-rules">filter functionality</a> available when exporting rules from Kibana has been extended to allow you to precisely define which rules to sync in DaC. Here are the new flags:</p>
<table>
<thead>
<tr>
<th align="center">Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center"><strong>-cro</strong></td>
<td>Filters the export to only include rules created by the user (not Elastic prebuilt rules).</td>
</tr>
<tr>
<td align="center"><strong>-eq</strong></td>
<td>Applies a query filter to the rules being exported.</td>
</tr>
</tbody>
</table>
<p>Let’s take an example where you wish to organize rules by data source and want to export the AWS rules to a specific folder. In this case, let’s filter on the tags used for data sources and export all rules with the <code>Data Source: AWS</code> tag:</p>
<pre><code># -d: write rules to the dac_test/rules folder
# -sv: strip the version fields from all rules
# -cro: export only custom rules
# -eq: export only rules with the &quot;Data Source: AWS&quot; tag
python -m detection_rules kibana export-rules -d dac_test/rules -sv -cro \
  -eq 'alert.attributes.tags: &quot;Data Source: AWS&quot;'
</code></pre>
<p>See the Kibana documentation on <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-performrulesbulkaction#operation-performrulesbulkaction-body-application-json-query">query string filtering</a> for the underlying API call used here, and the <a href="https://www.elastic.co/docs/api/doc/kibana/operation/operation-findrules">list all detection rules API call</a> for examples of the available fields to construct the query filter.</p>
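<p>As a rough sketch of what the CLI wraps for you, the export goes through the Detections API bulk-action endpoint. This is an illustrative request body, not the exact code the CLI runs:</p>
<pre><code class="language-py"># Sketch: build the body for POST /api/detection_engine/rules/_bulk_action
# (illustrative only; the detection-rules CLI handles this for you).
import json


def build_export_request(query: str) -&gt; dict:
    &quot;&quot;&quot;Return a bulk-action body that exports only rules matching the query.&quot;&quot;&quot;
    return {
        &quot;action&quot;: &quot;export&quot;,
        &quot;query&quot;: query,  # the same query-string filter the -eq flag passes through
    }


body = build_export_request('alert.attributes.tags: &quot;Data Source: AWS&quot;')
print(json.dumps(body))
</code></pre>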
<h2>Custom folder structure</h2>
<p>In the detection-rules repo, we use a folder structure based on platform, integration, and MITRE ATT&amp;CK information. This helps us with our organization and rule development. This is by no means the only method of organization. You may want to organize your rules by customer, date, or source as examples. This will vary greatly depending on your use case.</p>
<p>Whether you use this export process or organize rules manually, once your rules are in a folder structure you like, you can keep that local structure even when re-exporting rules. Note that new rules still need to be placed in their desired location manually: the local rule-loading mechanism detects where existing rules live on disk in order to know where to put their updates, and if a rule is not found locally, it falls back to the specified output directory for the new rule(s). To use local rule loading when updating existing rules, pass the <code>--load-rule-loading / -lr</code> flag to the <code>kibana export-rules</code> and <code>import-rules-to-repo</code> commands. These flags make use of the local folders specified in your <code>config.yaml</code>.</p>
<p>Let’s look at an example with the rules organized in folders the following way:</p>
<p><code>rules/</code><br />
<code>- my_test_rule.toml</code><br />
<code>another_rules_dir/</code><br />
<code>- high_number_of_process_and_or_service_terminations.toml</code></p>
<p>We’ll specify the following in the <code>config.yaml</code> file:</p>
<p><code>rule_dirs:</code><br />
<code>- rules</code><br />
<code>- another_rules_dir</code></p>
<p>With the new <code>-lr</code> option, rule updates from Kibana will now use these additional paths instead of exporting directly to the specified directory.</p>
<p>Running <code>python -m detection_rules kibana --space test_local export-rules -d dac_test/rules/ -sv -ac -e -lr</code> will export rules from the <code>test_local</code> space: <code>my_test_rule.toml</code> will be written to <code>dac_test/rules/</code>, as it was already on disk there, and <code>high_number_of_process_and_or_service_terminations.toml</code> will be written to <code>dac_test/another_rules_dir/</code>.</p>
<p>This can be particularly useful if you have the same rules in different sub-folder configurations for different customers. For example, let’s say you have your rules broken down by platform and integration similar to Elastic’s prebuilt rule folder structure. For your customers, SOCs, or threat-hunting teams, having the rules organized underneath these platform/integration folders may be the most useful mechanism for them to manage the rules. However, your information security team or primary detection engineering team may want to manage the rules by initiative or rule author instead so that all the rules a particular individual or team is responsible for are organized in one place. Now with the local rule-loading flags, you can simply have two configuration files and the duplicated rules in each structure. When you are exporting updates for the rules, you would then use the environment variable to select the appropriate configuration file and export the rule updates. These updates will then be applied to the rules in place, maintaining the directory structure.</p>
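<p>One way to sketch that switch (the layout names and paths below are hypothetical; <code>CUSTOM_RULES_DIR</code> is the environment variable the repo reads to locate the active custom rules configuration):</p>
<pre><code class="language-py"># Sketch: select a duplicated rule layout by pointing CUSTOM_RULES_DIR at it
# before invoking the detection-rules CLI (the paths here are hypothetical).
import os

LAYOUTS = {
    &quot;by_platform&quot;: &quot;custom_rules/by_platform&quot;,  # SOC / threat-hunting view
    &quot;by_team&quot;: &quot;custom_rules/by_team&quot;,  # detection engineering view
}


def env_for_layout(layout: str) -&gt; dict:
    &quot;&quot;&quot;Return an environment that selects one rule layout's configuration.&quot;&quot;&quot;
    env = dict(os.environ)
    env[&quot;CUSTOM_RULES_DIR&quot;] = LAYOUTS[layout]
    return env


# e.g. subprocess.run([...export-rules args with &quot;-lr&quot;...], env=env_for_layout(&quot;by_team&quot;))
print(env_for_layout(&quot;by_team&quot;)[&quot;CUSTOM_RULES_DIR&quot;])
</code></pre>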
<h2>Miscellaneous local loading updates</h2>
<p>In addition to the above, we have added two smaller new features designed to help users who are adding local information in the detection rules TOML files and schema. These are as follows:</p>
<ol>
<li>Local date support, where the creation date from the original local file is preserved instead of being overwritten on export.</li>
<li>Upgrades to the automatic schema generation feature to inherit known field types from existing schemas.</li>
</ol>
<p>The local date component can be useful when you want more manual control over the date field in the file. Without the override, the date is based on when the Kibana rule contents were exported. With the <code>--local-creation-date</code> flag, the date will not be updated when the file contents are re-exported.</p>
<p>The automatic schema generation has been updated to inherit field types from other indices/integrations when they are present. This produces a potentially more accurate schema and reduces the need for manual updates after the fact. For example, suppose you have a rule that uses the index <code>new-integration*</code> with the following fields:</p>
<ul>
<li><code>host.os.type.new_field</code></li>
<li><code>dll.Ext.relative_file_creation_time</code></li>
<li><code>process.name.okta.thread</code></li>
</ul>
<p>Instead of each of these fields being added to the schema with a default type, their types are inherited from existing schemas. In this case, the types for <code>dll.Ext.relative_file_creation_time</code> and <code>process.name.okta.thread</code> are inherited.</p>
<pre><code>{
  &quot;new-integration*&quot;: {
    &quot;dll.Ext.relative_file_creation_time&quot;: &quot;double&quot;,
    &quot;host.os.type.new_field&quot;: &quot;keyword&quot;,
    &quot;process.name.okta.thread&quot;: &quot;keyword&quot;
  }
}
</code></pre>
<p>To see how to use this with your custom data types, see the <a href="#Custom-schemas-usage">Custom schemas usage</a> section within the Implementation examples part of this blog.</p>
<h1>Expanding on usage examples</h1>
<p>Below you will find more examples of DaC implementations. These are not focused on new functionality additions, but go deeper on topics we see discussed in the community.</p>
<p>It’s worth noting that Detections as Code features are provided as components that can be used to build a custom implementation for your chosen process and architecture. When implementing DaC in your production environment, treat it as an engineering process and follow <a href="https://dac-reference.readthedocs.io/en/latest/dac_concept_and_workflows.html#best-practices">the best practices.</a></p>
<h2>DaC implementation with GitLab</h2>
<p>When we look at implementations of DaC, they typically revolve around using some form of CI/CD product to automatically perform rule management based on a given trigger. These triggers vary considerably based on the desired setup, specifically the authoritative source of rules and the desired state of your version control system (VCS). For a much more in-depth exploration of some of these considerations, see our <a href="https://dac-reference.readthedocs.io/en/latest/core_component_syncing_rules_and_data_from_vcs_to_elastic_security.html">DaC Reference Material</a>. Below is a simple example using GitLab as the VCS provider and its built-in CI/CD pipelines.</p>
<pre><code>stages:                # Define the pipeline stages
  - sync               # Add a 'sync' stage

sync-to-production:    # Define a job named 'sync-to-production'
  stage: sync          # Assign this job to the 'sync' stage
  image: python:3.12   # Use the Python 3.12 Docker image
  variables:
    CUSTOM_RULES_DIR: $CUSTOM_RULES_DIR    # Set custom rules env var
  script:                                  # List of commands to run 
    - python -m pip install --upgrade pip  # Upgrade pip
    - pip cache purge                      # Clear pip cache
    - pip install .[dev]                   # Install package w/ dev deps
    - |  # Multi-line command to import rules                                        
      FLAGS=&quot;-d ${CUSTOM_RULES_DIR}/rules/ --overwrite -e -ac&quot;
      python -m detection_rules kibana --space production import-rules $FLAGS
  environment:
    name: production   # Specify deployment environment as 'production'
  only:
    refs:
      - main           # Run this job only on the 'main' branch
    changes:
      - '**/*.toml'    # Run this job only if .toml files have changed

</code></pre>
<p>This is very similar to the inbuilt CI/CD of other Git-based VCS providers like GitHub and Gitea, the main difference being the syntax that determines the triggering event. The DaC commands such as <code>kibana import-rules</code> would be the same regardless of VCS. In this example, we are syncing rules from our fork of the detection-rules repo to our production Kibana space. This assumes a number of prior decisions, for instance requiring unit tests to pass before merging rule updates and treating rules on <code>main</code> as production-ready. For a GitHub-based walkthrough of these considerations for this particular approach, please take a look at our <a href="https://dac-reference.readthedocs.io/en/latest/etoe_reference_example.html#demo-video">demo video</a>.</p>
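<p>For comparison, a roughly equivalent trigger in GitHub Actions might look like the sketch below. The workflow layout and the <code>CUSTOM_RULES_DIR</code> repository variable are assumptions, not a tested pipeline:</p>
<pre><code class="language-yaml">name: sync-to-production
on:
  push:
    branches: [main]
    paths: ['**/*.toml']   # run only when rule TOML files change

jobs:
  sync:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install .[dev]
      - run: python -m detection_rules kibana --space production import-rules -d &quot;${CUSTOM_RULES_DIR}/rules/&quot; --overwrite -e -ac
        env:
          CUSTOM_RULES_DIR: ${{ vars.CUSTOM_RULES_DIR }}
</code></pre>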
<h2>Custom Unit Testing tips and examples</h2>
<p>When considering DaC as a capability to add to your detection toolkit, setting up the CI/CD and base infrastructure should be considered the first step in an ongoing process to improve the quality and usefulness of your rules. One of the key benefits of having “as code” tooling is the ability to further customize it to your needs and environment.</p>
<p>One example of this is unit testing for rules. Beyond base functionality testing, some other key existing unit tests enforce Elastic-specific considerations around rule performance and optimization, as well as organization of metadata and tagging. This helps detection engineers and threat researchers remain consistent in their rule development. Building on this example, one may want to consider adding custom unit tests based on your specific needs.</p>
<p>To illustrate this, take a Security Operations Center (SOC) environment where a number of analysts are responsible for various domains and tasks. When an alert is raised in the SIEM, it may not be immediately obvious who should handle remediation, or which team(s) need to be informed of the incident. Tagging the rules with a team tag (e.g. <code>Team: Windows Servers</code>), similar to how Elastic uses tags for data sources, can give the SOC a point of contact directly in the alert for who can help with remediation.</p>
<p>In our DaC environment, we can quickly create a new testing module to enforce this on all of the custom rules (or on prebuilt rules too). For this test, we are going to enforce having a <code>Team: &lt;some name&gt;</code> tag on all production rules that are not authored by Elastic. In the detection-rules repo, testing is handled through the Python test framework <code>pytest</code>, so unit tests are organized into Python modules (files), and classes and functions within those files, under the <code>tests/</code> folder. To add tests, either add classes or functions to the existing files or create a new file. In general, we recommend creating new test files so that you can receive updates to the existing tests from Elastic without having to merge the differences.</p>
<p>We will start by creating a new python file called <code>test_custom_rules.py</code> in the <code>tests/</code> directory with the following contents:</p>
<pre><code class="language-py"># test_custom_rules.py

&quot;&quot;&quot;Unit Tests for Custom Rules.&quot;&quot;&quot;

from .base import BaseRuleTest


class TestCustomRules(BaseRuleTest):
    &quot;&quot;&quot;Test custom rules for given criteria.&quot;&quot;&quot;

    def test_custom_rule_team_tag(self):
        &quot;&quot;&quot;Unit test that all custom rules have a Team: &lt;team_name&gt; tag.&quot;&quot;&quot;
        tag_format = &quot;Team: &lt;team_name&gt;&quot;
        for rule in self.all_rules:
            if &quot;Elastic&quot; not in rule.contents.data.author:
                tags = rule.contents.data.tags
                if tags:
                    self.assertTrue(
                        any(tag.startswith(&quot;Team: &quot;) for tag in tags),
                        f&quot;Custom rule {rule.contents.data.rule_id} does not have a {tag_format} tag&quot;,
                    )
                else:
                    raise AssertionError(
                        f&quot;Custom rule {rule.contents.data.rule_id} does not have any tags, include a {tag_format} tag&quot;
                    )
</code></pre>
<p>Now each non-Elastic rule will be required to have a tag in the specified pattern identifying the team responsible for remediation, e.g. <code>Team: Team A</code>.</p>
<h2>Custom schemas usage</h2>
<p>Elastic’s ability to bring your own data types also extends to our DaC capabilities. For example, let’s take a look at some custom schemas for network protocols. Your rules can of course query any of the diverse data you have in your stack, and you will want the same validation and testing for any custom rules on these data types too. This is where custom schemas come in handy.</p>
<p>When we validate queries, the query is parsed into its respective fields, and the types of these fields are compared against what a given schema provides (e.g. the <a href="https://www.elastic.co/docs/reference/ecs/ecs-field-reference">ECS schema</a>, the AWS integration for AWS data, etc.). Custom data types follow the same validation path, with the ability to pull from locally defined custom schemas. These schema files can be built by hand as one or more JSON files; however, if you already have some sample data in your stack, you can take advantage of it and generate your schemas automatically.</p>
<p>Assuming you already have a custom rules folder configured (if not, see the setup instructions), you can turn on automatic schema generation by adding <code>auto_gen_schema_file: &lt;path_to_your_json_file&gt;</code> to your config file. This generates a schema file in the specified location with an entry for each field and index combination. The file is updated during any command where rule contents are validated against a schema, including <code>import-rules-to-repo</code>, <code>kibana export-rules</code>, <code>view-rule</code>, and others. When using a custom rules directory and config, the schema is also automatically added to your <code>stack-schema-map.yaml</code> file.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/detection-as-code-timeline-and-new-features/image1.gif" alt="" /></p>
<p>With this power comes an increased responsibility on rule reviewers, as any field used in the query is immediately assumed to be valid and added to the schema. One way to mitigate the risk is to use a development space that has access to the data. In the PR, one can then link to a successful execution of the query, with stack-level validation of its data types. Once this is approved, one can remove the <code>auto_gen_schema_file</code> entry from the config, leaving a known-valid schema based on your custom data. This provides a baseline for other rule authors to build upon as needed and maintains the type-checking validation.</p>
<h1>Learn more about DaC and try it yourself</h1>
<p>You can experience Elastic Security's Detections as Code (DaC) functionality firsthand with our interactive <a href="https://play.instruqt.com/elastic/invite/uqlknuayvxhy">Instruqt training</a>. This training provides a straightforward way to explore core DaC features in a pre-configured test environment, eliminating the need for manual setup. Give it a try!</p>
<p>If you are implementing DaC, share your experience, ask questions, and help others in the community Slack <a href="https://elasticstack.slack.com/archives/C06TE19EP09">DaC channel</a>.</p>
<h2>Trial Elastic Security</h2>
<p>To experience the full benefits of what Elastic has to offer for detection engineers, start your Elastic Security <a href="https://cloud.elastic.co/registration">free trial</a>. Visit <a href="https://www.elastic.co/security">elastic.co/security</a> to learn more.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/security-labs/assets/images/detection-as-code-timeline-and-new-features/image2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Cups Overflow: When your printer spills more than Ink]]></title>
            <link>https://www.elastic.co/security-labs/cups-overflow</link>
            <guid>cups-overflow</guid>
            <pubDate>Sat, 28 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic Security Labs discusses detection and mitigation strategies for vulnerabilities in the CUPS printing system, which allow unauthenticated attackers to exploit the system via IPP and mDNS, resulting in remote code execution (RCE) on UNIX-based systems such as Linux, macOS, BSDs, ChromeOS, and Solaris.]]></description>
            <content:encoded><![CDATA[<h2>Update October 2, 2024</h2>
<p>The following packages introduced out-of-the-box (OOTB) rules to detect the exploitation of these vulnerabilities. Please check your &quot;Prebuilt Security Detection Rules&quot; integration versions or visit the <a href="https://www.elastic.co/guide/en/security/current/prebuilt-rules-downloadable-updates.html">Downloadable rule updates</a> site.</p>
<ul>
<li>Stack Version 8.15 - Package Version 8.15.6+</li>
<li>Stack Version 8.14 - Package Version 8.14.12+</li>
<li>Stack Version 8.13 - Package Version 8.13.18+</li>
<li>Stack Version 8.12 - Package Version 8.12.23+</li>
</ul>
<h2>Key takeaways</h2>
<ul>
<li>On September 26, 2024, security researcher Simone Margaritelli (@evilsocket) disclosed multiple vulnerabilities affecting the <code>cups-browsed</code>, <code>libcupsfilters</code>, and <code>libppd</code> components of the CUPS printing system, impacting versions &lt;= 2.0.1.</li>
<li>The vulnerabilities allow an unauthenticated remote attacker to exploit the printing system via IPP (Internet Printing Protocol) and mDNS to achieve remote code execution (RCE) on affected systems.</li>
<li>The attack can be initiated over the public internet or a local network, targeting UDP port 631, which is exposed by <code>cups-browsed</code> without any authentication requirements.</li>
<li>The vulnerability chain includes the <code>foomatic-rip</code> filter, which permits the execution of arbitrary commands through the <code>FoomaticRIPCommandLine</code> directive, a known (<a href="https://nvd.nist.gov/vuln/detail/CVE-2011-2697">CVE-2011-2697</a>, <a href="https://nvd.nist.gov/vuln/detail/CVE-2011-2964">CVE-2011-2964</a>) but unpatched issue since 2011.</li>
<li>Systems affected include most GNU/Linux distributions, BSDs, ChromeOS, and Solaris, many of which have the <code>cups-browsed</code> service enabled by default.</li>
<li>Based on the title of the publication, “Attacking UNIX Systems via CUPS, Part I”, Margaritelli likely expects to publish further research on the topic.</li>
<li>Elastic has provided protections and guidance to help organizations detect and mitigate potential exploitation of these vulnerabilities.</li>
</ul>
<h2>The CUPS RCE at a glance</h2>
<p>On September 26, 2024, security researcher Simone Margaritelli (@evilsocket) <a href="https://www.evilsocket.net/2024/09/26/Attacking-UNIX-systems-via-CUPS-Part-I/">uncovered</a> a chain of critical vulnerabilities in the CUPS (Common Unix Printing System) utilities, specifically in components like <code>cups-browsed</code>, <code>libcupsfilters</code>, and <code>libppd</code>. These vulnerabilities — identified as <a href="https://www.cve.org/CVERecord?id=CVE-2024-47176">CVE-2024-47176</a>, <a href="https://www.cve.org/CVERecord?id=CVE-2024-47076">CVE-2024-47076</a>, <a href="https://www.cve.org/CVERecord?id=CVE-2024-47175">CVE-2024-47175</a>, and <a href="https://www.cve.org/CVERecord?id=CVE-2024-47177">CVE-2024-47177</a> — affect widely adopted UNIX systems such as GNU/Linux, BSDs, ChromeOS, and Solaris, exposing them to remote code execution (RCE).</p>
<p>At the core of the issue is the lack of input validation in the CUPS components, which allows attackers to exploit the Internet Printing Protocol (IPP). Attackers can send malicious packets to the target's UDP port <code>631</code> over the Internet (WAN) or spoof DNS-SD/mDNS advertisements within a local network (LAN), forcing the vulnerable system to connect to a malicious IPP server.</p>
<p>For context, IPP is an application-layer protocol used to send and receive print jobs over the network. These communications include information regarding the state of the printer (paper jams, low ink, etc.) and the state of any jobs. IPP is supported across all major operating systems, including Windows, macOS, and Linux. When a printer is available, it broadcasts (via DNS-SD/mDNS) a message stating that the printer is ready, including its Uniform Resource Identifier (URI). When Linux workstations receive this message, many default configurations will automatically add and register the printer for use within the OS. As such, the malicious printer in this case is automatically registered and made available for print jobs.</p>
<p>Upon connecting, the malicious server returns crafted IPP attributes that are injected into PostScript Printer Description (PPD) files, which are used by CUPS to describe printer properties. These manipulated PPD files enable the attacker to execute arbitrary commands when a print job is triggered.</p>
<p>One of the major vulnerabilities in this chain involves the <code>foomatic-rip</code> filter, which has been known to allow arbitrary command execution through the <code>FoomaticRIPCommandLine</code> directive. Despite being vulnerable for over a decade, it remains unpatched in many modern CUPS implementations, further exacerbating the risk.</p>
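<p>For illustration, the injected directive ends up in the attacker-generated PPD looking roughly like the fragment below (paraphrased from public proof-of-concept material; the command here is a harmless placeholder):</p>
<pre><code>*PPD-Adobe: &quot;4.3&quot;
*FoomaticRIPCommandLine: &quot;id &gt; /tmp/cups_poc&quot;
*cupsFilter2: &quot;application/pdf application/vnd.cups-postscript 0 foomatic-rip&quot;
</code></pre>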
<blockquote>
<p>While these vulnerabilities are highly critical, with a CVSS score as high as 9.9, they can be mitigated by disabling <code>cups-browsed</code>, blocking UDP port 631, and updating CUPS to a patched version. Many UNIX systems have this service enabled by default, making this an urgent issue for affected organizations to address.</p>
</blockquote>
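<p>As a quick triage sketch (Linux-only, a local check, and no substitute for patching), one can look for a UDP 631 listener by parsing <code>/proc/net/udp</code>, where local ports appear in hex (<code>0x277</code> = 631):</p>
<pre><code class="language-py"># Sketch: detect a UDP 631 listener (e.g. cups-browsed) via /proc/net/udp.
def listening_udp_ports(proc_net_udp_text: str) -&gt; set:
    &quot;&quot;&quot;Parse the local ports (hex-encoded) from /proc/net/udp-formatted text.&quot;&quot;&quot;
    ports = set()
    for line in proc_net_udp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) &gt; 1 and &quot;:&quot; in fields[1]:
            # fields[1] looks like &quot;00000000:0277&quot;; the part after the
            # colon is the local port in hexadecimal
            ports.add(int(fields[1].rsplit(&quot;:&quot;, 1)[1], 16))
    return ports

if __name__ == &quot;__main__&quot;:
    with open(&quot;/proc/net/udp&quot;) as f:
        print(&quot;UDP 631 listener present:&quot;, 631 in listening_udp_ports(f.read()))
</code></pre>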
<h2>Elastic’s POC analysis</h2>
<p>Elastic’s Threat Research Engineers initially located the original proof-of-concept written by @evilsocket, which had been leaked. However, we chose to utilize the <a href="https://github.com/RickdeJager/cupshax/blob/main/cupshax.py">cupshax</a> proof of concept (PoC) based on its ability to execute locally.</p>
<p>To start, the PoC made use of a custom Python class that was responsible for creating and registering the fake printer service on the network using mDNS/ZeroConf. This is mainly achieved by creating a ZeroConf service entry for the fake Internet Printing Protocol (IPP) printer.</p>
<p>Upon execution, the PoC broadcasts a fake printer advertisement and listens for IPP requests. When a vulnerable system sees the broadcast, the victim automatically requests the printer's attributes from a URL provided in the broadcast message. The PoC responds with IPP attributes including the <code>FoomaticRIPCommandLine</code> parameter, which is known for its history of CVEs. The victim generates and saves a <a href="https://en.wikipedia.org/wiki/PostScript_Printer_Description">PostScript Printer Description</a> (PPD) file from these IPP attributes.</p>
<p>At this point, continued execution requires user interaction to start a print job and choose to send it to the fake printer. Once a print job is sent, the PPD file tells CUPS how to handle the print job. The included <code>FoomaticRIPCommandLine</code> directive allows arbitrary command execution on the victim machine.</p>
<p>During our review and testing of the exploit with the Cupshax PoC, we identified several notable hurdles and key details about the vulnerable endpoints and their execution processes.</p>
<p>When running arbitrary commands to create files, we noticed that <code>lp</code> is the user and group reported for the command execution; this is the <a href="https://wiki.debian.org/SystemGroups#:~:text=lp%20(LP)%3A%20Members%20of,jobs%20sent%20by%20other%20users.">default printing group</a> on Linux systems that use CUPS utilities. Thus, the Cupshax PoC requires both the CUPS vulnerabilities and an <code>lp</code> user with sufficient permissions to retrieve and run a malicious payload. By default, the <code>lp</code> user on many systems has the permissions needed to run effective payloads such as reverse shells; however, an alternative mitigation is to restrict <code>lp</code> so that these payloads are ineffective, using native Linux controls such as AppArmor or SELinux policies alongside firewall or iptables enforcement.</p>
<p>The <code>lp</code> user in many default configurations has access to commands that are not required for the print service, for instance <code>telnet</code>. To reduce the attack surface, we recommend removing unnecessary services and adding restrictions to them where needed to prevent the <code>lp</code> user from using them.</p>
<p>We also noted that interactive reverse shells are not immediately supported through this technique, since the <code>lp</code> user does not have a login shell; however, with some creative tactics, we were still able to accomplish this with the PoC. Typical PoCs test the exploit by writing a file to <code>/tmp/</code>, which is trivial to detect in most cases. Note that the user writing this file will be <code>lp</code>, so similar behavior will be present when attackers download and save a payload on disk.</p>
<p>Alongside these observations, the parent process, <code>foomatic-rip</code>, was observed in our telemetry executing a shell, which is highly uncommon.</p>
<h2>Executing the ‘Cupshax’ PoC</h2>
<p>To demonstrate the impact of these vulnerabilities, we attempted two different scenarios: establishing a reverse shell through living-off-the-land techniques, and retrieving and executing a remote payload. These are actions that adversarial groups commonly attempt once a vulnerable system is identified. Although exploitation is in its infancy and widespread abuse has not yet been observed, future attacks will likely replicate some of the scenarios depicted below.</p>
<p>Our first attempts at running the Cupshax PoC were met with a number of minor roadblocks due to the default configuration of the <code>lp</code> user, namely restrictions around interactive logon, a capability normally reserved for users that require remote access to systems. This did not, however, impact our ability to download a remote payload, compile it, and execute it on the impacted host system:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/cups-overflow/video1.gif" alt="A remotely downloaded payload, compiled and executed on a vulnerable host" title="A remotely downloaded payload, compiled and executed on a vulnerable host" /></p>
<p>Continued testing was performed around reverse shell invocation, successfully demonstrated below:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/cups-overflow/video2.gif" alt="A reverse shell executed on a vulnerable host" title="A reverse shell executed on a vulnerable host" /></p>
<h2>Assessing impact</h2>
<ul>
<li><strong>Severity:</strong> These vulnerabilities are given CVSS scores <a href="https://x.com/evilsocket/status/1838220677389656127">controversially</a> up to 9.9, indicating a critical severity. The widespread use of CUPS and the ability to remotely exploit these vulnerabilities make this a high-risk issue.</li>
<li><strong>Who is affected?:</strong> The vulnerability affects most UNIX-based systems, including major GNU/Linux distributions and other operating systems like ChromeOS and BSDs running the impacted CUPS components. Public-facing or network-exposed systems are particularly at risk. Further guidance and notifications will likely be provided by vendors as patches become available, alongside further remediation steps. Even though CUPS usually listens on localhost, a Shodan report <a href="https://x.com/shodanhq/status/1839418045757845925">highlights</a> that over 75,000 CUPS services are exposed on the internet.</li>
<li><strong>Potential Damage:</strong> Once exploited, attackers can gain control over the system to run arbitrary commands. Depending on the environment, this can lead to data exfiltration, ransomware installation, or other malicious actions. Systems connected to printers over WAN are especially at risk since attackers can exploit this without needing internal network access.</li>
</ul>
<h2>Remediations</h2>
<p>As <a href="https://www.evilsocket.net/2024/09/26/Attacking-UNIX-systems-via-CUPS-Part-I/#Remediation">highlighted</a> by @evilsocket, there are several remediation recommendations.</p>
<ul>
<li>Disable and uninstall the <code>cups-browsed</code> service. For example, see the recommendations from <a href="https://www.redhat.com/en/blog/red-hat-response-openprinting-cups-vulnerabilities">Red Hat</a> and <a href="https://ubuntu.com/blog/cups-remote-code-execution-vulnerability-fix-available">Ubuntu</a>.</li>
<li>Ensure your CUPS packages are updated to the latest versions available for your distribution.</li>
<li>If updating isn’t possible, block UDP port <code>631</code> and DNS-SD traffic from potentially impacted hosts, and investigate the aforementioned recommendations to further harden the <code>lp</code> user and group configuration on the host.</li>
</ul>
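<p>As a quick complement to the port-blocking advice, the following Python sketch (a heuristic of our own, not an official check) probes whether anything on the local host is already bound to a given UDP port; a failed bind with <code>EADDRINUSE</code> on port <code>631</code> suggests <code>cups-browsed</code> is still listening:</p>

```python
import errno
import socket

def udp_port_in_use(port, host="0.0.0.0"):
    """Heuristic: attempt to bind the UDP port; EADDRINUSE implies an
    existing listener (e.g. cups-browsed on 631). Binding ports below
    1024 requires root, in which case EACCES is raised instead."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise
    finally:
        sock.close()
    return False
```

<p>This only inspects the local host; it does not replace blocking UDP <code>631</code> at the network boundary.</p>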
<h2>Elastic protections</h2>
<p>In this section, we look into detection and hunting queries designed to uncover suspicious activity linked to the currently published vulnerabilities. By focusing on process behaviors and command execution patterns, these queries help identify potential exploitation attempts before they escalate into full-blown attacks.</p>
<h3>cupsd or foomatic-rip shell execution</h3>
<p>The first detection rule targets processes on Linux systems that are spawned by <code>foomatic-rip</code> and immediately launch a shell. This is effective because legitimate print jobs rarely require shell execution, making this behavior a strong indicator of malicious activity. Note: A shell may not always be an adversary’s goal if arbitrary command execution is possible.</p>
<pre><code>process where host.os.type == &quot;linux&quot; and event.type == &quot;start&quot; and
 event.action == &quot;exec&quot; and process.parent.name == &quot;foomatic-rip&quot; and
 process.name in (&quot;bash&quot;, &quot;dash&quot;, &quot;sh&quot;, &quot;tcsh&quot;, &quot;csh&quot;, &quot;zsh&quot;, &quot;ksh&quot;, &quot;fish&quot;) 
 and not process.command_line like (&quot;*/tmp/foomatic-*&quot;, &quot;*-sDEVICE=ps2write*&quot;)
</code></pre>
<p>This query managed to detect all 33 PoC attempts that we performed:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/cups-overflow/image6.png" alt="" /></p>
<p><a href="https://github.com/elastic/detection-rules/blob/a3e89a7fabe90a6f9ce02b58d5a948db8d231ee5/rules/linux/execution_cupsd_foomatic_rip_shell_execution.toml">https://github.com/elastic/detection-rules/blob/a3e89a7fabe90a6f9ce02b58d5a948db8d231ee5/rules/linux/execution_cupsd_foomatic_rip_shell_execution.toml</a></p>
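<p>For readers without a stack handy, the rule's core logic can be approximated in Python. The event dictionaries and field names below are simplified stand-ins for ECS fields, not the production schema:</p>

```python
from fnmatch import fnmatch

SHELLS = {"bash", "dash", "sh", "tcsh", "csh", "zsh", "ksh", "fish"}
BENIGN = ["*/tmp/foomatic-*", "*-sDEVICE=ps2write*"]

def is_suspicious(event):
    """Approximate the EQL rule: flag a shell exec'd by foomatic-rip,
    unless the command line matches a known-benign print-job pattern."""
    return (
        event.get("parent") == "foomatic-rip"
        and event.get("name") in SHELLS
        and not any(fnmatch(event.get("command_line", ""), pat) for pat in BENIGN)
    )
```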
<h3>Printer user (lp) shell execution</h3>
<p>This detection rule assumes that the default printer user (<code>lp</code>) handles the printing processes. By specifying this user, we can narrow the scope while broadening the parent process list to include <code>cupsd</code>. Although there's currently no indication that RCE can be exploited through <code>cupsd</code>, we cannot rule out the possibility.</p>
<pre><code>process where host.os.type == &quot;linux&quot; and event.type == &quot;start&quot; and
 event.action == &quot;exec&quot; and user.name == &quot;lp&quot; and
 process.parent.name in (&quot;cupsd&quot;, &quot;foomatic-rip&quot;, &quot;bash&quot;, &quot;dash&quot;, &quot;sh&quot;, 
 &quot;tcsh&quot;, &quot;csh&quot;, &quot;zsh&quot;, &quot;ksh&quot;, &quot;fish&quot;) and process.name in (&quot;bash&quot;, &quot;dash&quot;, 
 &quot;sh&quot;, &quot;tcsh&quot;, &quot;csh&quot;, &quot;zsh&quot;, &quot;ksh&quot;, &quot;fish&quot;) and not process.command_line 
 like (&quot;*/tmp/foomatic-*&quot;, &quot;*-sDEVICE=ps2write*&quot;)
</code></pre>
<p>By focusing on the username <code>lp</code>, we broadened the scope and detected, like previously, all of the 33 PoC executions:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/cups-overflow/image5.png" alt="" /></p>
<p><a href="https://github.com/elastic/detection-rules/blob/a3e89a7fabe90a6f9ce02b58d5a948db8d231ee5/rules/linux/execution_cupsd_foomatic_rip_lp_user_execution.toml">https://github.com/elastic/detection-rules/blob/a3e89a7fabe90a6f9ce02b58d5a948db8d231ee5/rules/linux/execution_cupsd_foomatic_rip_lp_user_execution.toml</a></p>
<h3>Network connection by CUPS foomatic-rip child</h3>
<p>This rule identifies network connections initiated by child processes of <code>foomatic-rip</code>, which is a behavior that raises suspicion. Since legitimate operations typically do not involve these processes establishing outbound connections, any detected activity should be closely examined. If such communications are expected in your environment, ensure that the destination IPs are properly excluded to avoid unnecessary alerts.</p>
<pre><code>sequence by host.id with maxspan=10s
  [process where host.os.type == &quot;linux&quot; and event.type == &quot;start&quot; 
   and event.action == &quot;exec&quot; and
   process.parent.name == &quot;foomatic-rip&quot; and
   process.name in (&quot;bash&quot;, &quot;dash&quot;, &quot;sh&quot;, &quot;tcsh&quot;, &quot;csh&quot;, &quot;zsh&quot;, &quot;ksh&quot;, &quot;fish&quot;)] 
   by process.entity_id
  [network where host.os.type == &quot;linux&quot; and event.type == &quot;start&quot; and 
   event.action == &quot;connection_attempted&quot;] by process.parent.entity_id
</code></pre>
<p>By capturing the parent/child relationship, we ensure the network connections originate from the potentially compromised application.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/cups-overflow/image7.png" alt="" /></p>
<p><a href="https://github.com/elastic/detection-rules/blob/a3e89a7fabe90a6f9ce02b58d5a948db8d231ee5/rules/linux/command_and_control_cupsd_foomatic_rip_netcon.toml">https://github.com/elastic/detection-rules/blob/a3e89a7fabe90a6f9ce02b58d5a948db8d231ee5/rules/linux/command_and_control_cupsd_foomatic_rip_netcon.toml</a></p>
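<p>The sequence semantics (joining the shell execution to a later network attempt by entity ID, within the 10-second maxspan) can be sketched in Python; the event shapes below are simplified assumptions, not the rule's actual schema:</p>

```python
SHELLS = {"bash", "dash", "sh", "tcsh", "csh", "zsh", "ksh", "fish"}

def correlate(events, maxspan=10.0):
    """Pair shell executions spawned by foomatic-rip with network
    connection attempts made by the same process within `maxspan` seconds."""
    exec_times = {}  # process entity_id -> timestamp of the shell exec
    hits = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        if (ev["kind"] == "exec" and ev["parent"] == "foomatic-rip"
                and ev["name"] in SHELLS):
            exec_times[ev["entity_id"]] = ev["ts"]
        elif ev["kind"] == "connection_attempted":
            start = exec_times.get(ev["parent_entity_id"])
            if start is not None and ev["ts"] - start <= maxspan:
                hits.append(ev)
    return hits
```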
<h3>File creation by CUPS foomatic-rip child</h3>
<p>This rule detects suspicious file creation events initiated by child processes of <code>foomatic-rip</code>. As all current proofs of concept use a default testing payload that writes a file to <code>/tmp/</code>, this rule would catch that. Additionally, it can detect scenarios where an attacker downloads a malicious payload and subsequently creates a file.</p>
<pre><code>sequence by host.id with maxspan=10s
  [process where host.os.type == &quot;linux&quot; and event.type == &quot;start&quot; and 
   event.action == &quot;exec&quot; and process.parent.name == &quot;foomatic-rip&quot; and 
   process.name in (&quot;bash&quot;, &quot;dash&quot;, &quot;sh&quot;, &quot;tcsh&quot;, &quot;csh&quot;, &quot;zsh&quot;, &quot;ksh&quot;, &quot;fish&quot;)] by process.entity_id
  [file where host.os.type == &quot;linux&quot; and event.type != &quot;deletion&quot; and
   not (process.name == &quot;gs&quot; and file.path like &quot;/tmp/gs_*&quot;)] by process.parent.entity_id
</code></pre>
<p>The rule excludes <code>/tmp/gs_*</code> to account for default <code>cupsd</code> behavior, but for enhanced security, you may choose to remove this exclusion, keeping in mind that it may generate more noise in alerts.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/cups-overflow/image1.png" alt="" /></p>
<p><a href="https://github.com/elastic/detection-rules/blob/a3e89a7fabe90a6f9ce02b58d5a948db8d231ee5/rules/linux/execution_cupsd_foomatic_rip_file_creation.toml">https://github.com/elastic/detection-rules/blob/a3e89a7fabe90a6f9ce02b58d5a948db8d231ee5/rules/linux/execution_cupsd_foomatic_rip_file_creation.toml</a></p>
<h3>Suspicious execution from foomatic-rip or cupsd parent</h3>
<p>This rule detects suspicious command lines executed by child processes of <code>foomatic-rip</code> and <code>cupsd</code>. It focuses on identifying potentially malicious activities, including persistence mechanisms, file downloads, encoding/decoding operations, reverse shells, and shared-object loading via GTFOBins.</p>
<pre><code>process where host.os.type == &quot;linux&quot; and event.type == &quot;start&quot; and 
 event.action == &quot;exec&quot; and process.parent.name in 
 (&quot;foomatic-rip&quot;, &quot;cupsd&quot;) and process.command_line like (
  // persistence
  &quot;*cron*&quot;, &quot;*/etc/rc.local*&quot;, &quot;*/dev/tcp/*&quot;, &quot;*/etc/init.d*&quot;, 
  &quot;*/etc/update-motd.d*&quot;, &quot;*/etc/sudoers*&quot;,
  &quot;*/etc/profile*&quot;, &quot;*autostart*&quot;, &quot;*/etc/ssh*&quot;, &quot;*/home/*/.ssh/*&quot;, 
  &quot;*/root/.ssh*&quot;, &quot;*~/.ssh/*&quot;, &quot;*udev*&quot;, &quot;*/etc/shadow*&quot;, &quot;*/etc/passwd*&quot;,
    // Downloads
  &quot;*curl*&quot;, &quot;*wget*&quot;,

  // encoding and decoding
  &quot;*base64 *&quot;, &quot;*base32 *&quot;, &quot;*xxd *&quot;, &quot;*openssl*&quot;,

  // reverse connections
  &quot;*GS_ARGS=*&quot;, &quot;*/dev/tcp*&quot;, &quot;*/dev/udp/*&quot;, &quot;*import*pty*spawn*&quot;, &quot;*import*subprocess*call*&quot;, &quot;*TCPSocket.new*&quot;,
  &quot;*TCPSocket.open*&quot;, &quot;*io.popen*&quot;, &quot;*os.execute*&quot;, &quot;*fsockopen*&quot;, &quot;*disown*&quot;, &quot;*nohup*&quot;,

  // SO loads
  &quot;*openssl*-engine*.so*&quot;, &quot;*cdll.LoadLibrary*.so*&quot;, &quot;*ruby*-e**Fiddle.dlopen*.so*&quot;, &quot;*Fiddle.dlopen*.so*&quot;,
  &quot;*cdll.LoadLibrary*.so*&quot;,

  // misc. suspicious command lines
   &quot;*/etc/ld.so*&quot;, &quot;*/dev/shm/*&quot;, &quot;*/var/tmp*&quot;, &quot;*echo*&quot;, &quot;*&gt;&gt;*&quot;, &quot;*|*&quot;
)
</code></pre>
<p>By restricting the match to the suspicious command lines listed in the rule above, we can broaden the parent-process scope to also include <code>cupsd</code> without the fear of false positives.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/cups-overflow/image2.png" alt="" /></p>
<p><a href="https://github.com/elastic/detection-rules/blob/a3e89a7fabe90a6f9ce02b58d5a948db8d231ee5/rules/linux/execution_cupsd_foomatic_rip_suspicious_child_execution.toml">https://github.com/elastic/detection-rules/blob/a3e89a7fabe90a6f9ce02b58d5a948db8d231ee5/rules/linux/execution_cupsd_foomatic_rip_suspicious_child_execution.toml</a></p>
<h3>Elastic’s Attack Discovery</h3>
<p>In addition to prebuilt content published, <a href="https://www.elastic.co/guide/en/security/current/attack-discovery.html">Elastic’s Attack Discovery</a> can provide context and insights by analyzing alerts in your environment and identifying threats by leveraging Large Language Models (LLMs). In the following example, Attack Discovery provides a short summary and a timeline of the activity. The behaviors are then mapped to an attack chain to highlight impacted stages and help triage the alerts.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/cups-overflow/image4.png" alt="Elastic’s Attack Discovery summarizing findings for the CUPS Vulnerability" title="Elastic’s Attack Discovery summarizing findings for the CUPS Vulnerability" /></p>
<h2>Conclusion</h2>
<p>The recent CUPS vulnerability disclosure highlights the evolving threat landscape and underscores the importance of securing services like printing. With a high CVSS score, this issue calls for immediate action, particularly given how easily these flaws can be exploited remotely. Although the service is installed by default on some UNIX operating systems (depending on the supply chain), manual user interaction is needed to trigger the print job. We recommend that users remain vigilant, continue hunting, and not underestimate the risk. While the threat requires user interaction, an attacker pairing it with a spear-phishing document may coerce victims into printing through the rogue printer. Worse still, attackers may silently replace existing printers or install new ones, as <a href="https://www.evilsocket.net/2024/09/26/Attacking-UNIX-systems-via-CUPS-Part-I/#Impact">indicated</a> by @evilsocket.</p>
<p>We expect more to be revealed, as the initial disclosure was labeled part 1. Ultimately, visibility and detection capabilities remain at the forefront of defensive strategies for these systems, ensuring that attackers cannot exploit overlooked vulnerabilities.</p>
<h2>Key References</h2>
<ul>
<li><a href="https://www.evilsocket.net/2024/09/26/Attacking-UNIX-systems-via-CUPS-Part-I/">https://www.evilsocket.net/2024/09/26/Attacking-UNIX-systems-via-CUPS-Part-I/</a></li>
<li><a href="https://github.com/RickdeJager/cupshax/blob/main/cupshax.py">https://github.com/RickdeJager/cupshax/blob/main/cupshax.py</a></li>
<li><a href="https://www.cve.org/CVERecord?id=CVE-2024-47076">https://www.cve.org/CVERecord?id=CVE-2024-47076</a></li>
<li><a href="https://www.cve.org/CVERecord?id=CVE-2024-47175">https://www.cve.org/CVERecord?id=CVE-2024-47175</a></li>
<li><a href="https://www.cve.org/CVERecord?id=CVE-2024-47176">https://www.cve.org/CVERecord?id=CVE-2024-47176</a></li>
<li><a href="https://www.cve.org/CVERecord?id=CVE-2024-47177">https://www.cve.org/CVERecord?id=CVE-2024-47177</a></li>
</ul>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/security-labs/assets/images/cups-overflow/cups-overflow.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Storm on the Horizon: Inside the AJCloud IoT Ecosystem]]></title>
            <link>https://www.elastic.co/security-labs/storm-on-the-horizon</link>
            <guid>storm-on-the-horizon</guid>
            <pubDate>Fri, 20 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Wi-Fi cameras are popular due to their affordability and convenience but often have security vulnerabilities that can be exploited.]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>Wi-Fi cameras are some of the most common IoT devices found in households, businesses, and other public spaces. They tend to be quite affordable and provide users with easy access to a live video stream on their mobile device from anywhere on the planet. As is often the case with IoT devices, security tends to be overlooked in these cameras, leaving them open to critical vulnerabilities. If exploited, these vulnerabilities can have devastating effects on the cameras and the networks in which they’re deployed, and can lead to the compromise of users’ sensitive PII.</p>
<p>A recent <a href="https://www.youtube.com/watch?v=qoojLdKJvkc">Elastic ON Week</a> afforded us the opportunity to explore the attack surface of these types of devices to gain a deeper understanding of how they are being compromised. We focused primarily on performing vulnerability research on the <a href="https://www.amazon.com/Wireless-Security-Wansview-Detection-Compatible/dp/B07QKXM2D3?th=1">Wansview Q5</a> (along with the nearly identical <a href="https://www.wansview.com/q6">Q6</a>), one of the more popular and affordable cameras sold on Amazon. Wansview is a provider of security products based in Shenzhen, China, and one of Amazon's more prominent distributors of Wi-Fi cameras.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image12.png" alt="" title="image_tooltip" /></p>
<p>The Q5 offers the same basic feature set seen in most cameras:</p>
<ul>
<li>Pan / tilt / zoom</li>
<li>Night vision</li>
<li>Two-way audio</li>
<li>Video recording to SD card</li>
<li>Integration with Smart Home AI assistants (e.g. Alexa)</li>
<li>ONVIF for interoperability with other security products</li>
<li>RTSP for direct access to video feed within LAN</li>
<li>Automated firmware updates from the cloud</li>
<li>Remote technical support</li>
<li>Shared device access with other accounts</li>
<li>Optional monthly subscription for cloud storage and motion detection</li>
</ul>
<p>Like most other Wi-Fi cameras, these models require an active connection to their vendor cloud infrastructure for basic operation; without access to the Internet, they simply will not operate. Before a camera can go live, it must be paired to a <a href="https://www.youtube.com/watch?v=UiF7xKnXfC0">registered user account</a> via Wansview’s official mobile app and a standard <a href="https://youtu.be/PLMNKoO1214?si=G8sYxT3EagE3u_cw">QR code-based setup process</a>. Once this process is complete, the camera will be fully online and operational.</p>
<h2>AJCloud: A Brief Introduction</h2>
<p>Though Wansview has been in operation <a href="https://www.wansview.com/about_company">since 2009</a>, at the moment they primarily appear to be a reseller of camera products built by a separate company based in Nanjing, China: <a href="https://www.ajcloud.net">AJCloud</a>.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image19.png" alt="" title="image_tooltip" /></p>
<p>AJCloud provides vendors with access to manufactured security devices, the necessary firmware, mobile and desktop user applications, the cloud management platform, and services that connect everything together. Since AJCloud was founded in 2018, they have partnered with several vendors, both large and small, including but not limited to the following:</p>
<ul>
<li><a href="https://www.wansview.com">Wansview</a></li>
<li><a href="https://cinnado.com">Cinnado</a></li>
<li><a href="https://www.amazon.com/stores/GALAYOU/page/789538ED-82AC-43AF-B676-6622577A1982?ref_=ast_bln&amp;store_ref=bl_ast_dp_brandLogo_sto">Galayou</a></li>
<li><a href="https://www.faleemi.com">Faleemi</a></li>
<li><a href="https://www.philips.com">Philips</a></li>
<li><a href="https://www.septekon.com">Septekon</a></li>
<li><a href="https://www.smarteyegroup.com">Smarteye</a></li>
<li><a href="http://www.homeguardworld.com">Homeguard</a></li>
<li><a href="https://ipuppee.com">iPupPee</a></li>
</ul>
<p>A cursory review of mobile and desktop applications developed and published by AJCloud on <a href="https://play.google.com/store/apps/developer?id=AJCLOUD+INTERNATIONAL+INC.&amp;hl=en_US">Google Play</a>, <a href="https://apps.apple.com/us/developer/ajcloud-labs-inc/id1396464400">Apple’s App Store</a>, and the <a href="https://apps.microsoft.com/search/publisher?name=%E5%8D%97%E4%BA%AC%E5%AE%89%E5%B1%85%E4%BA%91%E4%BF%A1%E6%81%AF%E6%8A%80%E6%9C%AF%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8&amp;hl=en-us&amp;gl=US">Microsoft Store</a> reveals their ties to each of these vendors. Besides superficial company branding, these applications are identical in form and function, and they all require connectivity with the AJCloud management platform.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image26.png" alt="" title="image_tooltip" /></p>
<p>As for the cameras, it is apparent that these vendors are selling similar models with only minor modifications to the camera housing and underlying hardware.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image16.png" alt="" title="image_tooltip" /></p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image9.png" alt="" title="image_tooltip" /></p>
<p>The resemblance between the <a href="https://www.faleemi.com/product/fsc886/">Faleemi 886</a> and the <a href="https://www.youtube.com/watch?v=X5P5fGhRxAs">Wansview Q6 (1080p)</a> is obvious.</p>
<p>Reusing hardware manufacturing and software development resources likely helps to control costs and simplify logistics for AJCloud and its resellers. However, this streamlining of assets also means that security vulnerabilities discovered in one camera model would likely permeate all products associated with AJCloud.</p>
<p>Despite its critical role in bringing these devices to consumers, AJCloud has a relatively low public profile. However, IPVM researchers recently <a href="https://ipvm.com/reports/ajcloud-wansview-leak">published</a> research on a significant vulnerability (which has since been resolved) in AJCloud’s GitLab repository. This vulnerability would allow any user to access source code, credentials, certificates, and other sensitive data without requiring authentication.</p>
<p>Though total sales figures are difficult to derive for Wansview and other vendors in the Wi-Fi camera space, IPVM estimated that at least one million devices were connected to the AJCloud platform at the time of publication of their report. As camera sales <a href="https://www.statista.com/forecasts/1301193/worldwide-smart-security-camera-homes">continue to soar</a> into the hundreds of millions, it is safe to assume that more of AJCloud’s devices will be connected in homes across the world for years to come.</p>
<h2>Initial Vulnerability Research Efforts</h2>
<p>To gain a deeper understanding of the security posture of the Wansview Q5, we attacked it from multiple angles:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image23.png" alt="" title="image_tooltip" /></p>
<p>At first, our efforts were primarily focused on active and passive network reconnaissance of the camera and the <a href="https://play.google.com/store/apps/details?id=net.ajcloud.wansviewplus&amp;hl=en_US">Android version</a> of Wansview Cloud, Wansview’s official mobile app. We scanned for open ports, eavesdropped on network communications through man-in-the-middle (MitM) attacks, attempted to coerce unpredictable behavior from the cameras through intentional misconfiguration in the app, and disrupted the operation of the cameras by abusing the QR code format and physically interacting with the camera. The devices and their infrastructure were surprisingly resilient to these types of surface-level attacks, and our initial efforts yielded few noteworthy successes.</p>
<p>We were particularly surprised by our lack of success intercepting network communications on both the camera and the app. We repeatedly encountered robust security features (e.g., certificate pinning, app and OS version restrictions, and properly secured TLS connections) that disrupted our attempts.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image13.png" alt="" title="image_tooltip" /></p>
<p>Reverse engineering tools allowed us to analyze the APK much more closely, though the complexity of the code obfuscation observed within the decompiled Java source code would require an extended length of time to fully piece together.</p>
<p>Our limited initial success required us to explore further options that would provide more nuanced insight into the Q5 and how it operates.</p>
<h2>Initial Hardware Hacking</h2>
<p>To gain more insight into how the camera functioned, we decided to take a closer look at the camera firmware. While some firmware packages are available online, we wanted to take a look at the code directly and be able to monitor it and the resulting logs while the camera was running. To do this, we first took a look at the hardware diagram for the system on a chip (SoC) to see if there were any hardware avenues we might be able to leverage. The Wansview Q5 uses a <a href="https://www.cnx-software.com/2020/04/26/ingenic-t31-ai-video-processor-combines-xburst-1-mips-and-risc-v-lite-cores/">Ingenic Xburst T31 SoC</a>, its system block diagram is depicted below.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image4.png" alt="" title="image_tooltip" /></p>
<p>One avenue that stood out to us was the I2Cx3/UARTx2/SPIx2 SPI I/O block. If accessible, these I/O blocks often provide log output interfaces and/or shell interfaces, which can be used for debugging and interacting with the SoC. As this appeared promising, we then performed a hardware teardown of the camera and found what appeared to be a UART serial interface to the SoC, shown below.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image15.png" alt="" title="image_tooltip" /></p>
<p>Next, we connected a logic analyzer to see what protocol was being used over these pins, and when decoded, the signal was indeed UART.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image33.png" alt="" title="image_tooltip" /></p>
<p>With access to an exposed UART interface confirmed, we then looked to establish a shell connection to the SoC. There are a number of software mechanisms for this, but for our purposes we used the Unix utility <code>screen</code> with the baud rate detected by the logic analyzer.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image11.png" alt="" title="image_tooltip" /></p>
<p>Upon opening the connection and monitoring the boot sequence, we discovered that secure boot was not enabled, despite being supported by the SoC. We then modified the boot configuration to enter single-user mode, providing a root shell we could use to examine the firmware before the initialization processes ran, as shown below.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image29.png" alt="" title="image_tooltip" /></p>
<p>Once in single-user mode, we were able to pull the firmware files for static analysis using the <code>binwalk</code> utility, as shown below.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image32.png" alt="" title="image_tooltip" /></p>
<p>At this stage, the filesystem is generally read-only; however, we wanted to be able to make edits and instantiate only specific parts of the firmware initialization as needed, so we set up additional persistence beyond single-user-mode access. This can be done in a number of ways, but two primary methods stand out. In both approaches, one should make as few modifications to the existing configuration as possible, since minimizing the footprint also minimizes the impact on the runtime environment during dynamic analysis. The first method we used was to create a <code>tmpfs</code> partition for in-memory read/write access and mount it via <code>fstab</code>. In our case, <code>fstab</code> was already configured in a way that supported this, making it a very minimal change. See the commands and results for this approach below.</p>
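<p>For reference, the tmpfs method amounts to a single <code>fstab</code> entry plus a remount; the mount point and size below are illustrative assumptions, not the camera's actual configuration:</p>

```
# illustrative /etc/fstab entry: an in-memory read/write scratch area
tmpfs   /mnt/rw   tmpfs   defaults,size=8m   0 0
```

<p>After adding the entry, <code>mount -a</code> (or a reboot) makes the scratch area writable while the rest of the filesystem remains read-only.</p>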
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image17.png" alt="" title="image_tooltip" /></p>
<p>The second method is to pull existing user credentials and attempt to use them to log in. This approach was also successful: the password hash for the root user can be found in the <code>etc/passwd</code> file and cracked using a tool like John the Ripper. In the examples above, we transferred data and files entirely over the serial connection. The camera also has an available SD card slot that can be mounted and used to transfer files. Going forward, we will use the SD card or the local network for moving files, as the added bandwidth makes transfers faster and easier; however, serial can still be used for all communications during hardware setup and debugging if preferred.</p>
<p>We now have root-level access to the camera, providing access to the firmware and <code>dmesg</code> logs while the software is running. Using both the firmware and logs as reference, we then examined the camera’s user interfaces more closely to see if there was a good entry point we could use to gain further insight.</p>
<h2>Wansview Cloud for Windows</h2>
<p>After the mobile apps proved to be more secure than we had originally anticipated, we shifted our focus to an older version of the Wansview Cloud application built for Windows 7. This app, which is still <a href="https://www.wansview.com/support_download">available for download</a>, would provide us with direct insight into the network communications involved with cameras connected to the AJCloud platform.</p>
<p>Thanks in large part to overindulgent debug logging on the part of the developers, the Windows app spills out its secrets with a reckless abandon seldom seen in commercial software. The first sign that things are amiss is that user login credentials are logged in cleartext.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image24.png" alt="" title="image_tooltip" /></p>
<p>Reverse engineering the main executable and DLLs (which, unlike the Wansview Cloud APK, are not packed) was expedited thanks to the frequent use of verbose log messages containing unique strings. Identifying references to specific files and lines within the underlying codebase helped us quickly map out core components of the application and establish its high-level control flow.</p>
<p>Network communications, which were difficult for us to intercept on Android, are still transmitted over TLS, though they are conveniently logged to disk in cleartext. With full access to all HTTP POST request and response data (which is packed into JSON objects), there was no further need to pursue MitM attacks on the application side.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image8.png" alt="POST request to https://sdc-portal.ajcloud.net/api/v1/app-startup" title="POST request to https://sdc-portal.ajcloud.net/api/v1/app-startup" /></p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image25.png" alt="POST response from https://sdc-portal.ajcloud.net/api/v1/app-startup" title="POST response from https://sdc-portal.ajcloud.net/api/v1/app-startup" /></p>
<p>Within the POST responses, we found sensitive metadata including links to publicly accessible screen captures along with information about the camera’s location, network configuration, and its firmware version.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image1.jpg" alt="https://cam-snapshot-use1.oss-us-east-1.aliyuncs.com/f838ee39636aba95db7170aa321828a1/snapshot.jpeg" title="https://cam-snapshot-use1.oss-us-east-1.aliyuncs.com/f838ee39636aba95db7170aa321828a1/snapshot.jpeg" /></p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image10.png" alt="POST response from https://cam-gw-us.ajcloud.net/api/v1/fetch-infos" title="POST response from https://cam-gw-us.ajcloud.net/api/v1/fetch-infos" /></p>
<p>After documenting all POST requests and responses found within the log data, we began experimenting with manipulating different fields in each request in an attempt to access data not associated with our camera or account. We would eventually use a debugger to change the deviceId to that of a target camera not paired with the currently logged-in account. A camera’s deviceId doubles as its serial number and can be found printed on a sticker located on either the back or bottom of the camera.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image2.png" alt="" title="image_tooltip" /></p>
<p>We found the most appropriate target for our attack in a code section where the deviceId is first transmitted in a POST request to <a href="https://sdc-us.ajcloud.net/api/v1/dev-config">https://sdc-us.ajcloud.net/api/v1/dev-config</a>:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image31.png" alt="" title="image_tooltip" /></p>
<p>Our plan was to set a breakpoint at the instruction highlighted in the screenshot above, swap out the deviceId within memory, and then allow the app to resume execution.</p>
<p>Amazingly enough, this naive approach not only worked to retrieve sensitive data stored in the AJCloud platform associated with the target camera and the account it is tied to, but it also connected us to the camera itself. This allowed us to access its video and audio streams and remotely control it through the app as if it were our own camera.</p>
<p>Through exploiting this vulnerability and testing against multiple models from various vendors, we determined that all devices connected to the AJCloud platform could be remotely accessed and controlled in this manner. We wrote a <a href="https://github.com/elastic/camera-hacks/blob/main/windows/win_exploit.py">PoC exploit script</a> to automate this process and effectively demonstrate the ease with which this access control vulnerability within AJCloud’s infrastructure can be trivially exploited.</p>
<h2>Exploring the network communications</h2>
<p>Though we were able to build and reliably trigger an exploit against a critical vulnerability in the AJCloud platform, we would need to dig further in order to gain a better understanding of the inner workings of the apps, the camera firmware, and the cloud infrastructure.</p>
<p>As we explored beyond the POST requests and responses observed throughout the sign-in process, we noticed a plethora of UDP requests and responses from a wide assortment of IPs. Little in the way of discernible plaintext data could be found throughout these communications, and the target UDP port numbers for the outbound requests seemed to vary. Further investigation would later reveal that this UDP activity was indicative of PPPP, an IoT peer-to-peer (P2P) protocol that was analyzed and demonstrated extensively by Paul Marrapese during his <a href="https://youtu.be/Z_gKEF76oMM?si=cqCBU6iPxCyEm-xm">presentation at DEF CON 28</a>. We would later conclude that the vulnerability we discovered was exploited through modified P2P requests, which led us to further explore the critical role that P2P plays in the AJCloud platform.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image22.png" alt="" title="image_tooltip" /></p>
<p>The main purpose of P2P is to facilitate communication between applications and IoT devices, regardless of the network configurations involved. P2P primarily utilizes an approach based around <a href="https://en.wikipedia.org/wiki/UDP_hole_punching">UDP hole punching</a> to create temporary communication pathways that allow requests to reach their target either directly or through a relay server located in a more accessible network environment. The core set of P2P commands integrated into AJCloud’s apps provides access to video and audio streams as well as the microphone and pan/tilt/zoom.</p>
<h2>Advanced Hardware Hacking</h2>
<p>With our additional understanding of the P2P communications, it was now time to examine the camera itself more closely during these P2P conversations, including running the camera software in a debugger. To start, we set up the camera with a live logging output via the UART serial connection that we established earlier, shown below.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image5.png" alt="" title="image_tooltip" /></p>
<p>This provided a live look at the log messages from the applications as well as any additional logging sources we needed. From this information, we identified the primary binary that establishes communication between the camera and the cloud and provides the interfaces to access the camera via P2P.</p>
<p>This binary is locally called initApp, and it runs once the camera has been fully initialized and the boot sequence is completed. Given this, we set out to run this binary under a debugger to better evaluate the local functions. In attempting to do so, we encountered a kernel watchdog that would forcibly restart the camera if it detected that initApp was not running. This watchdog checks for writes to <code>/dev/watchdog</code>; if these writes cease, it triggers a timer that reboots the camera if the writes do not resume. This makes debugging more difficult: when one pauses the execution of initApp, the writes to the watchdog pause as well. An example of this stopping behavior is shown below:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image18.png" alt="" title="image_tooltip" /></p>
<p>To avoid this, one could simply try writing to the watchdog whenever initApp stops to prevent the reboot. However, a cleaner option is to make use of the magic close feature of the <a href="https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.txt">Linux Kernel Watchdog Driver API</a>. In short, if one writes the magic character ‘V’ to <code>/dev/watchdog</code>, the watchdog will be disabled. There are other methods of defeating the watchdog as well, but this is the one we chose for our research, as it makes it easy to enable and disable the watchdog at will.</p>
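<p>As a minimal sketch, the magic close amounts to a single one-byte write; the path is parameterized here so the routine can be exercised against an ordinary file off-device (on the camera it targets <code>/dev/watchdog</code>):</p>

```python
def magic_close(path="/dev/watchdog"):
    """Disable the watchdog via 'magic close': per the watchdog driver API,
    writing the character 'V' before closing the descriptor tells the
    driver not to reboot the system when the periodic writes stop."""
    with open(path, "w") as wd:
        wd.write("V")
```

<p>Any subsequent write to <code>/dev/watchdog</code> re-arms it, which is why we also had to account for initApp’s own watchdog writes while debugging.</p>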
<p>With the watchdog disabled, setting up to debug initApp is fairly straightforward. We wanted to run the code directly on the camera, if possible, instead of using an emulator. The architecture of the camera is Little Endian MIPS (MIPSEL). We were fortunate that pre-built GDB and GDBServer binaries functioned without modification; however, we did not know this initially, so we also set up a toolchain to compile GDBServer specifically for the camera. One technique that might be useful if you find yourself in a similar situation is to use a compiler like gcc to build a small test program for your suspected target architecture and see if it runs; see the example below.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image25.png" alt="" title="image_tooltip" /></p>
<p>In our case, since our SoC was known to us, we were fairly certain of the target architecture; however, in certain situations this may not be so simple to discover, and working from hello-world binaries can be useful to establish an initial understanding. Once we were able to compile binaries, we compiled GDBServer for our camera and used it to attach and launch initApp. Then, we connected to it from another computer on the same local network as the camera. An example of this is shown below:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image7.png" alt="" title="image_tooltip" /></p>
<p>As a note for the above example, we are using the <code>-x</code> parameter to pass in some commands for convenience, but they are not necessary for debugging. For more information on any of the files or commands, please see our <a href="https://github.com/elastic/camera-hacks/tree/main">elastic/camera-hacks</a> GitHub repo. In order for initApp to load properly, we also needed to ensure that the libraries used by the binary were accessible via the <code>PATH</code> and <code>LD_LIBRARY_PATH</code> environment variables. With this setup, we were then able to debug the binary as needed. Since we used the magic-character method of defeating the watchdog earlier, we also needed to control instances where the watchdog could be re-enabled; in most cases, we do not want this to happen. As such, we overwrote the watchdog calls in initApp so that the watchdog would not be re-enabled while we were debugging, as shown below.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image3.png" alt="" title="image_tooltip" /></p>
<p>The following video shows the full setup process from boot to running GDBServer. In the video, we also start a new initApp process, and as such, we need to kill both the original process and the <code>daemon.sh</code> shell script that will spawn a new initApp process if it is killed.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/video1.gif" alt="" /></p>
<h2>Building a P2P Client</h2>
<p>In order to further explore the full extent of capabilities which P2P provides to AJCloud IoT devices and how they can be abused by attackers, we set out to build our own standalone client. This approach would remove the overhead of manipulating the Wansview Cloud Windows app while allowing us to more rapidly connect to cameras and test out commands we derive from reverse engineering the firmware.</p>
<p>From the configuration data we obtained earlier from the Windows app logs, we knew that a client issues requests to up to three different servers as part of the connection process. These servers provide instructions to clients as to where traffic should be routed in order to access a given camera. If you would like to discover more of these servers out in the open, you can scan the Internet using the following four-byte UDP payload on port <code>60722</code>. Paul Marrapese used this technique to great effect as part of his research.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image34.png" alt="" title="image_tooltip" /></p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image6.png" alt="" title="image_tooltip" /></p>
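<p>A sketch of this scanning approach in Python — note that the actual four-byte payload is the one shown in the capture above; the <code>PROBE</code> bytes here are a stand-in placeholder, not the real values:</p>

```python
import socket

PROBE = bytes.fromhex("00000000")  # placeholder; substitute the real 4-byte PPPP probe

def probe_host(ip, port=60722, timeout=2.0):
    """Send the discovery payload over UDP and return any reply bytes, or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(PROBE, (ip, port))
        reply, _addr = sock.recvfrom(1024)
        return reply
    except socket.timeout:
        return None
    finally:
        sock.close()
```

<p>A host that answers on this port is a strong indicator of a PPPP server or relay.</p>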
<p>In order to properly establish a P2P connection, a client must first send a simple hello message (<code>MSG_HELLO</code>), which needs to be ACK’d (<code>MSG_HELLO_ACK</code>) by a peer-to-peer server. The client then queries the server (<code>MSG_P2P_REQ</code>) for a particular deviceId. If the server is aware of that device, then it will respond (<code>MSG_PUNCH_TO</code>) to the client with a target IP address and UDP port number pair. The client will then attempt to connect (<code>MSG_PUNCH_PKT</code>) to the IP and port pair along with other ports <a href="https://github.com/elastic/camera-hacks/blob/deb2abe9a7a1009c5c1b7d34584f143d5b62c82e/p2p/p2p_client.py#L247-L260">within a predetermined range</a> as part of a <a href="https://en.wikipedia.org/wiki/UDP_hole_punching">UDP hole punching</a> routine. If successful, the target will send a message (<code>MSG_PUNCH_PKT</code>) back to the client along with a final message (<code>MSG_P2P_RDY</code>) to confirm that the connection has been established.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image28.gif" alt="" title="image_tooltip" /></p>
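<p>To make the ordering concrete, the exchange can be sketched as a simple client-side message sequence (message names as above; wire formats and retries omitted):</p>

```python
# Expected (sender, message) sequence for a successful PPPP session setup.
HANDSHAKE = [
    ("client", "MSG_HELLO"),      # client greets the P2P server
    ("server", "MSG_HELLO_ACK"),  # server acknowledges the hello
    ("client", "MSG_P2P_REQ"),    # client asks about a specific deviceId
    ("server", "MSG_PUNCH_TO"),   # server returns a target IP and UDP port
    ("client", "MSG_PUNCH_PKT"),  # client hole-punches toward the target
    ("device", "MSG_PUNCH_PKT"),  # device punches back to the client
    ("device", "MSG_P2P_RDY"),    # device confirms the connection
]

def is_complete_handshake(trace):
    """Return True if an observed (sender, message) trace matches the sequence."""
    return list(trace) == HANDSHAKE
```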
<p>After connecting to a camera, we are primarily interested in sending different <code>MSG_DRW</code> packets and observing their behavior. These packets contain commands that allow us to physically manipulate the camera, view and listen to its video and audio streams, access data stored within it, or alter its configuration. The most straightforward command we started with involved panning the camera counterclockwise, which we could easily identify as a single message transmission.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image30.png" alt="" title="image_tooltip" /></p>
<p>Debug log messages on the camera allowed us to easily locate where this command was processed within the firmware.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image20.png" alt="" title="image_tooltip" /></p>
<p>Locating the source of this particular message placed us in the main routine that handles processing <code>MSG_DRW</code> messages, which provided us with critical insight into how this command is invoked and what other commands are supported by the firmware.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image14.png" alt="" title="image_tooltip" /></p>
<p>Extensive reverse engineering and testing allowed us to build a <a href="https://github.com/elastic/camera-hacks/blob/main/p2p/p2p_client.py">PoC P2P client</a> which allows users to connect to any camera on the AJCloud platform, provided they have access to its deviceId. Basic commands supported by the client include camera panning and tilting, rebooting, resetting, playing audio clips, and even crashing the firmware.</p>
<p>The most dangerous capability we were able to implement was through a command which modifies a core device configuration file: <code>/var/syscfg/config_default/app_ajy_sn.ini</code>. On our test camera, the file’s contents were originally as follows:</p>
<pre><code>[common]
product_name=Q5
model=NAV
vendor=WVC
serialnum=WVCD7HUJWJNXEKXF
macaddress=
wifimacaddress=
</code></pre>
<p>While this appears to contain basic device metadata, this file is the only means through which the camera knows how to identify itself. Upon startup, the camera reads in the contents of this file and then attempts to connect to the AJCloud platform through a series of curl requests to various API endpoints. These curl requests pass along the product name, camera model, vendor code, and serial number values extracted from the INI file as query string arguments. We used our client to deliver a message which overwrites the contents like so:</p>
<pre><code>[common]
product_name=
model=OPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~HH01
vendor=YZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~HH01
serialnum=defghijklmnopqrstuvwxyz{|}~HH01
macaddress=
wifimacaddress=
</code></pre>
<p>After the camera is reset, all curl requests issued to AJCloud platform API endpoints as part of the startup routine will fail due to the malformed data contained within the INI file. These requests will continue to be sent periodically, but they will never succeed, and the camera will remain inactive and inaccessible through any apps. Unfortunately, there is no simple way to restore the previous file contents by resetting the camera, updating its firmware, or restoring the factory settings. File modifications carried out through this command will effectively brick a camera and render it useless.</p>
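<p>To illustrate why the malformed file is so disruptive, the startup requests can be modeled as building a query string from the INI fields. The endpoint below is a hypothetical placeholder (the real API endpoints vary), but the parameter names mirror the fields described above:</p>

```python
import configparser
from urllib.parse import urlencode

def startup_query(ini_text, endpoint="https://api.example.invalid/v1/register"):
    """Parse app_ajy_sn.ini and build the identification query string that
    the firmware's startup curl requests append (endpoint is hypothetical)."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    common = cfg["common"]
    params = {
        "product_name": common.get("product_name", ""),
        "model": common.get("model", ""),
        "vendor": common.get("vendor", ""),
        "serialnum": common.get("serialnum", ""),
    }
    return endpoint + "?" + urlencode(params)
```

<p>With the original file, the serial number identifies a registered device; once the fields are garbled, every lookup against the platform fails and the camera never comes online.</p>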
<p><a href="https://drive.google.com/file/d/1oK_umHYfScza-F5RQNUGgFe3GFOt5n--/preview">Video: demonstration of the configuration overwrite</a></p>
<p>Taking a closer look at the decompiled function (<code>syscfg_setAjySnParams</code>) that overwrites the values stored in <code>app_ajy_sn.ini</code>, we can see that input parameters extracted from the <code>MSG_DRW</code> command are used to pass along string data that will overwrite the model, vendor, and serial number fields in the file. <code>memset</code> is used to overwrite three global variables, intended to store these input strings, with null bytes. <code>strcpy</code> is then used to transfer the input parameters into these globals. In each instance, this will result in bytes being copied directly from the <code>MSG_DRW</code> command buffer until it encounters a null character.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/image21.png" alt="" title="image_tooltip" /></p>
<p>Because no validation is enforced on the length of these input parameters extracted from the command, it is trivial to craft a message of sufficient length which will trigger a buffer overflow. While we did not leverage this vulnerability as part of our attack to brick the camera, this appears to be an instance where an exploit could be developed which would allow for an attacker to achieve remote code execution on the camera.</p>
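<p>The unchecked-copy pattern can be modeled in a few lines; the 16-byte field sizes below are illustrative assumptions, not the firmware’s actual layout:</p>

```python
def strcpy_into(dest, offset, src):
    """Mimic C strcpy: copy bytes from src into dest starting at offset,
    stopping only at a NUL terminator - no bounds checking, as in the firmware."""
    i = 0
    while src[i] != 0:
        dest[offset + i] = src[i]
        i += 1
    dest[offset + i] = 0  # copy the terminating NUL

# Three adjacent fixed-size fields standing in for the globals the
# decompiled function writes (sizes are assumptions for illustration).
memory = bytearray(48)   # model (16 bytes) | vendor (16) | serialnum (16)
MODEL_OFF, VENDOR_OFF = 0, 16

strcpy_into(memory, MODEL_OFF, b"NAV\x00")
strcpy_into(memory, VENDOR_OFF, b"WVC\x00")

# An attacker-supplied model string longer than its field silently
# overruns into the adjacent vendor field.
strcpy_into(memory, MODEL_OFF, b"A" * 20 + b"\x00")
```

<p>On real hardware, the same overrun can corrupt adjacent program state, which is what makes this a candidate for remote code execution rather than just data corruption.</p>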
<h2>Impact</h2>
<p>We have confirmed that a broad range of devices across several vendors affiliated with AJCloud and several different firmware versions are affected by these vulnerabilities and flaws. Overall, we successfully demonstrated our attacks against fifteen different camera products from Wansview, Galayou, Cinnado, and Faleemi. Based on our findings, it is safe to assume that all devices which operate AJCloud firmware and connect to the AJCloud platform are affected.</p>
<p>All attempts to contact both AJCloud and Wansview in order to disclose these vulnerabilities and flaws were unsuccessful.</p>
<h2>What did the vendors do right?</h2>
<p>Despite the vulnerabilities we discovered and discussed previously, there are a number of security controls that AJCloud and the camera vendors implemented well. For such a low-cost device, many best practices were followed. First, the network communications are secured well using certificate-based WebSocket authentication. In addition to adding encryption, putting many of the API endpoints behind certificate auth makes man-in-the-middle attacks significantly more challenging. Furthermore, the APKs for the mobile apps were signed and obfuscated, making manipulating these apps very time consuming.</p>
<p>Additionally, the vendors made some sound decisions with the camera hardware and firmware. The local OS for the camera is effectively limited, focusing on just the functionality needed for the product. The filesystem is configured to be read-only outside of logging, and the kernel watchdog is an effective method of ensuring uptime and reducing the risk of being stuck in a failed state. The Ingenic Xburst T31 SoC provides a capable platform with a wide range of support, including secure boot, a Power-On Reset (POR) watchdog, and a separate RISC-V processor capable of running some rudimentary machine learning on the camera input.</p>
<h2>What did the vendors do wrong?</h2>
<p>Unfortunately, there were a number of missed opportunities with these available features. Potentially the most egregious is the unauthenticated cloud access. Given the API access controls established for many of the endpoints, leaving camera user-access endpoints reachable via serial number without authentication is a huge and avoidable misstep. The P2P protocol is also vulnerable, as we showcased; however, unlike the API access issue, which should be immediately fixable, repairing the protocol may take more time. It is a very dangerous vulnerability, but somewhat more understandable, as it requires considerably more time investment to both discover and fix.</p>
<p>From the application side, the primary issue is the Windows app’s extensive debug logging, which should have been removed before public release. As for the hardware, it can be easily manipulated with physical access (exposed reset button, etc.). This is less of an issue given the target consumer audience; erring on the side of usability rather than security is expected, especially once an attacker has physical access to the device. On a similar note, secure boot should be enabled, especially given that the T31 SoC supports it. While not strictly necessary, this would make it much harder to debug the device’s source code and firmware directly, making it more difficult to discover any vulnerabilities that may be present. Ideally, it would be implemented in such a way that the bootloader could still load an unsigned OS to allow for easier tinkering and development, but would prevent the signed OS from loading until the bootloader configuration is restored. One significant flaw in the current firmware, however, is its dependence on the original serial number, which is not stored in a read-only mount point while the system is running. Manipulating the serial number should not permanently brick the device: there should either be a mechanism for requesting a new serial number (or restoring the original) should it be overwritten, or the serial number should be immutable.</p>
<h2>Mitigations</h2>
<p>Certain steps can be taken in order to reduce the attack surface and limit potential adverse effects in the event of an attack, though they vary in their effectiveness.</p>
<p>Segmenting Wi-Fi cameras and other IoT devices off from the rest of your network is a highly recommended countermeasure that will prevent attackers from pivoting laterally to more critical systems. However, this approach does not prevent an attacker from obtaining sensitive user data by exploiting the access control vulnerability we discovered in the AJCloud platform. Also, considering the ease with which we were able to demonstrate how cameras can be accessed and manipulated remotely via P2P, any device connected to the AJCloud platform remains at significant risk of compromise regardless of its local network configuration.</p>
<p>Restricting all network communications to and from these cameras would not be feasible due to how essential connectivity to the AJCloud platform is to their operation. As previously mentioned, the devices will simply not operate if they are unable to connect to various API endpoints upon startup.</p>
<p>A viable approach could be restricting communications beyond the initial startup routine. However, this would prevent remote access and control via mobile and desktop apps, which would defeat the entire purpose of these cameras in the first place. For further research in this area, please refer to “<a href="https://petsymposium.org/popets/2021/popets-2021-0075.pdf">Blocking Without Breaking: Identification and Mitigation of Non-Essential IoT Traffic</a>”, which explored this approach more in-depth across a myriad of IoT devices and vendors.</p>
<p>The best approach to securing any Wi-Fi camera, regardless of vendor, while maintaining core functionality is to flash it with alternative open-source firmware such as <a href="https://openipc.org">OpenIPC</a> or <a href="https://thingino.com">thingino</a>. Switching to open-source firmware avoids the headaches associated with forced connectivity to vendor cloud platforms by providing users with fine-grained control of device configuration and remote network accessibility. Open access to the firmware source helps ensure that critical flaws and vulnerabilities are quickly identified and patched by diligent project contributors.</p>
<h2>Key Takeaways</h2>
<p>Our research revealed several critical vulnerabilities spanning all aspects of cameras that run AJCloud firmware and connect to the AJCloud platform. Significant flaws in access control management on the platform and in the PPPP peer-to-peer protocol provide an expansive attack surface affecting millions of active devices across the world. Exploiting these flaws and vulnerabilities leads to the exposure of sensitive user data and provides attackers with full remote control of any camera connected to the AJCloud platform. Furthermore, a built-in P2P command, which intentionally provides arbitrary write access to a key configuration file, can be leveraged to either permanently disable cameras or facilitate remote code execution by triggering a buffer overflow.</p>
<p>Please visit our <a href="https://github.com/elastic/camera-hacks">GitHub repository</a> for custom tools and scripts we have built along with data and notes we have captured which we felt would provide the most benefit to the security research community.</p>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/security-labs/assets/images/storm-on-the-horizon/storm-on-the-horizon.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Now in beta: New Detection as Code capabilities]]></title>
            <link>https://www.elastic.co/security-labs/dac-beta-release</link>
            <guid>dac-beta-release</guid>
            <pubDate>Thu, 08 Aug 2024 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[<p>Exciting news! Our Detections as Code (DaC) improvements to the <a href="https://github.com/elastic/detection-rules">detection-rules</a> repo are now in beta. In May this year, we shared the Alpha stages of our research into <a href="https://www.elastic.co/blog/detections-as-code-elastic-security">Rolling your own Detections as Code with Elastic Security</a>. Elastic is working on supporting DaC in Elastic Security. While in the future DaC will be integrated within the UI, the current updates are focused on the detection rules repo on main to allow users to set up DaC quickly and get immediate value with available tests and commands integration with Elastic Security. We have a considerable amount of <a href="https://dac-reference.readthedocs.io/en/latest/index.html">documentation</a> and <a href="https://dac-reference.readthedocs.io/en/latest/etoe_reference_example.html">examples</a>, but let’s take a quick look at what this means for our users.</p>
<h2>Why DaC?</h2>
<p>From validation and automation to enhancing cross-vendor content, there are several reasons <a href="https://www.elastic.co/blog/detections-as-code-elastic-security#why-detections-as-code">previously discussed</a> to use a DaC approach for rule management. Our team of detection engineers has been using the detection-rules repo for testing and validation of our rules for some time. We can now provide the same testing and validation that we perform in a more accessible way. We aim to empower our users by adding straightforward CLI commands within our detection-rules repo to help manage rules across the full rule lifecycle between version control systems (VCS) and Kibana. This allows users to move, unit test, and validate their rules easily in a single command using CI/CD pipelines.</p>
<h2>Improving Process Maturity</h2>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image10.png" alt="" /></p>
<p>Security organizations are facing the same bottom line: we can’t rely on static out-of-the-box signatures. At its core, DaC is a methodology that applies software development practices to the creation and management of security detection rules, enabling automation, version control, testing, and collaboration in the development &amp; deployment of security detections. Unit testing, peer review, and CI/CD let software developers be confident in their processes; they help catch errors and inefficiencies before they impact customers. The same should be true in detection engineering. In keeping with this, here are some examples of the new features we now support. See our <a href="https://dac-reference.readthedocs.io/en/latest/">DaC Reference Guide</a> for complete documentation.</p>
<h3>Bulk Import and Export of Custom Rules</h3>
<p>Custom rules can now be moved in bulk to and from Kibana using the <code>kibana import-rules</code> and <code>kibana export-rules</code> commands. Additionally, one can convert them in bulk between TOML and ndjson formats using the <code>import-rules-to-repo</code> and <code>export-rules-from-repo</code> commands. In addition to rules, these commands support moving exceptions and exception lists using the appropriate flag. The benefit of the ndjson approach is that it allows engineers to manage and share a collection of rules in a single file (exported by the CLI or from Kibana), which is helpful when access to the other Elastic environment is not permitted. When moving rules using either of these methods, the rules pass through schema validation unless otherwise specified, ensuring that the rules contain the appropriate data fields. For more information on these commands, please see the <a href="https://github.com/elastic/detection-rules/blob/DAC-feature/CLI.md"><code>CLI.md</code></a> file in detection rules.</p>
<h3>Configurable Unit Tests, Validation, and Schemas</h3>
<p>With this new feature, we’ve included the ability to configure the behavior of unit tests and schema validation using configuration files. In these files, you can set specific tests to be bypassed or specify only certain tests to run, and likewise for schema validation against specific rules. You can run this validation and unit testing at any time by running <code>make test</code>. Furthermore, you can now bring your own schema (a JSON file) to our validation process. You can also specify which schemas to use against which target versions of your Stack. For example, if you have custom schemas that only apply to rules in 8.14 while a different schema should be used for 8.10, this can now be managed via a configuration file. For more information, please see our <a href="https://github.com/elastic/detection-rules/blob/DAC-feature/detection_rules/etc/_config.yaml">example configuration file</a> or use our <code>custom-rules setup-config</code> command from the detection-rules repo to generate an example for you.</p>
<h3>Custom Version Control</h3>
<p>We now provide the ability to manage custom rules using the same version lock logic that Elastic’s internal team uses to manage our rules for release. This is done through a version lock file that checks the hash of each rule’s contents and determines whether they have changed. Additionally, we provide a configuration option to disable this version lock file, allowing users to rely on an alternative means of version control, such as a git repo directly. For more information, please see the <a href="https://dac-reference.readthedocs.io/en/latest/internals_of_the_detection_rules_repo.html#rule-versioning">version control section</a> of our documentation. Note that you can still rely on Kibana’s versioning fields.</p>
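<p>To make the mechanism concrete, here is a minimal sketch of content-hash-based version locking. This is our simplified assumption of the approach, not the actual detection-rules implementation:</p>

```python
import hashlib
import json

def rule_hash(rule_contents):
    """Deterministic hash of a rule's contents (sorted keys, stable separators)."""
    canonical = json.dumps(rule_contents, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def lock_version(rule_id, rule_contents, version_lock):
    """Bump the locked version only when the rule's content hash has changed."""
    entry = version_lock.get(rule_id, {"version": 0, "sha256": None})
    new_hash = rule_hash(rule_contents)
    if entry["sha256"] != new_hash:
        entry = {"version": entry["version"] + 1, "sha256": new_hash}
        version_lock[rule_id] = entry
    return entry["version"]

lock = {}
rule = {"name": "Suspicious PowerShell", "query": "process.name : powershell.exe"}
assert lock_version("abc-123", rule, lock) == 1  # first sighting
assert lock_version("abc-123", rule, lock) == 1  # unchanged contents, same version
rule["query"] = "process.name : pwsh.exe"
assert lock_version("abc-123", rule, lock) == 2  # contents changed, version bumped
```

<p>Because only content changes bump the version, cosmetic re-exports of a rule do not create noise in the lock file.</p>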
<p>Having these systems in place provides auditable evidence for maintaining security rules. Adopting some or all of these best practices can dramatically improve quality in maintaining and developing security rules.</p>
<h3>Broader Adoption of Automation</h3>
<p>While quality is critical, security teams face growing rule sets as they respond to an ever-expanding threat landscape. It is therefore just as crucial to reduce the strain on security analysts by enabling rapid deployment and execution. Our repo offers a one-stop shop where you can set your configuration, focus on rule development, and let the automation handle the rest.</p>
<h4>Lowering the Barrier to Entry</h4>
<p>To start, simply clone or fork our detection rules repo, run <code>custom-rules setup-config</code> to generate an initial config, and import your rules. From here, you have unit tests and validation ready for use. If you are using a VCS such as GitHub, you can quickly create a CI/CD workflow to push the latest rules to Kibana and run these tests. Here is an <a href="https://dac-reference.readthedocs.io/en/latest/core_component_syncing_rules_and_data_from_vcs_to_elastic_security.html#option-1-push-on-merge">example</a> of what that could look like:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image2.png" alt="Example CI/CD Workflow" /></p>
<h3>High Flexibility</h3>
<p>While we use GitHub CI/CD to manage our release actions, we are by no means prescribing it as the only way to manage detection rules. Our CLI commands have no dependencies outside of their Python requirements. Perhaps you have already started implementing some DaC practices and are looking to take advantage of the Python libraries we provide. Whatever the case may be, we encourage you to adopt DaC principles in your workflows, and we aim to provide flexible tooling to accomplish these goals.</p>
<p>To illustrate, let’s say an organization is already managing its own rules with a VCS and has built automation to move rules back and forth between deployment environments. However, it would like to augment these movements with testing based on telemetry that it collects and stores in a database. Our DaC features already provide custom unit testing classes that can run per rule, so realizing this goal may be as simple as forking the detection rules repo and writing a single unit test. The figure below shows an example of what this could look like.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image3.png" alt="Testing and Tuning via Data Source Input Workflow" /></p>
<p>This new unit test could use our unit test classes and rule loading as scaffolding to load rules from a file or a Kibana instance. Next, one could create integration tests against each rule ID to check whether the rules produce the organization’s desired results (e.g., does the rule identify the correct behaviors?). If they do, the CI/CD tooling can proceed as originally planned. If they fail, one can use DaC tooling to move those rules to a “needs tuning” folder and/or upload them to a “Tuning” Kibana space. In this way, a hybrid of our tooling and one’s own can maintain an up-to-date Kibana space (or VCS-controlled folder) of the rules that require updates. As updates are made and issues addressed, the rules can be continually synchronized across spaces, leading to a more cohesive environment.</p>
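<p>A self-contained sketch of such a telemetry-driven gate is shown below. The rule IDs, verdict counts, and threshold are hypothetical; in the real repo, this would build on the provided unit test scaffolding and run via <code>make test</code>:</p>

```python
import unittest

# Hypothetical telemetry verdicts per rule ID, e.g. queried from a database.
TELEMETRY = {
    "abc-123": {"true_positives": 40, "false_positives": 2},
    "def-456": {"true_positives": 1, "false_positives": 90},
}

def needs_tuning(stats, max_fp_ratio=0.5):
    """Flag a rule whose telemetry shows too many false positives (or no data at all)."""
    total = stats["true_positives"] + stats["false_positives"]
    return total == 0 or stats["false_positives"] / total > max_fp_ratio

class TestRuleTelemetry(unittest.TestCase):
    def test_rules_meet_false_positive_threshold(self):
        failing = [rid for rid, stats in TELEMETRY.items() if needs_tuning(stats)]
        # In CI/CD, failing rules could instead be moved to a "needs tuning"
        # folder or uploaded to a dedicated "Tuning" Kibana space.
        self.assertEqual(failing, ["def-456"])

# Run the suite programmatically (in practice, the repo's test runner does this).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRuleTelemetry)
result = unittest.TextTestRunner(verbosity=0).run(suite)
assert result.wasSuccessful()
```
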
<p>This is just one idea of how you can take advantage of our new DaC features in your environment. In practice, there are countless ways they can be utilized.</p>
<h2>In Practice</h2>
<p>Now, let’s take a look at how we can tie these new features together into a cohesive DaC strategy. As a reminder, this is not prescriptive. Rather, this should be thought of as an optional, introductory strategy that can be built on to achieve your DaC goals.</p>
<h3>Establishing a DaC Baseline</h3>
<p>In detection engineering, we would like collaboration to be the default rather than the exception. Detection Rules is a public repo precisely with this precept in mind. Now, it can become a basis for the community and teammates to collaborate not only with us, but also with each other. Let’s use the chart below as an example of what this could look like.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image1.png" alt="DaC Baseline Workflow" /></p>
<p>Reading from left to right, we have initial planning and prioritization, followed by the threat research that drives the detection engineering. This process will look quite different for each user, so we will not spend much time describing it here. However, the outcome will largely be similar: the creation of new detection rules. These could take various forms, such as Sigma rules (more in a later blog), Elastic TOML rule files, or rules created directly in Kibana. Regardless of format, once created, these rules need to be staged, either in Kibana, your VCS, or both. From a DaC perspective, the goal is to sync the rules so that the process/automation is aware of these new additions. Furthermore, this provides the opportunity for peer review of these additions, the first stage of collaboration.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image8.png" alt="Peer Review Workflow" /></p>
<p>This will likely happen in your version control system; for instance, in GitHub one could use a PR with required approvals before merging into a main branch that acts as the authoritative source of reviewed rules. The next step is testing and validation; this step could also occur before peer review, depending on the desired implementation.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image11.png" alt="Validation to Production Workflow" /></p>
<p>By adhering to this workflow, in addition to any other internal release processes, we can reduce the risk of malformed rules and errant mistakes reaching both our customers and the community. Additionally, having the evidence artifacts (passing unit tests, schema validation, etc.) inspires confidence and gives each user control over which risks they are willing to accept.</p>
<p>Once deployed and distributed, rule performance can be monitored from Kibana. Updates to these rules can be made either directly from Kibana or through the VCS. This will largely be dependent on the implementation specifics, but in either case, these can be treated very similarly to new rules and pass through the same peer review, testing, and validation processes.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image14.png" alt="Tuning Production Deployment Workflow" /></p>
<p>As shown in the figure above, this can provide a unified method for handling rule updates, whether they come from the community, customers, or internal feedback. Since the rules ultimately exist as version-controlled files, there is a dedicated, version-controlled source of truth to merge and test against.</p>
<p>In addition to the process quality improvements, having authoritative known states can empower additional automation. For example, different customers may require different testing or different data sources. Instead of having to parse the rules manually, we provide a unified configuration experience where users can simply bring their own config and schemas and be confident that their specific requirements are met. All of this can be managed automatically via CI/CD. With a fully automated DaC setup, one can take advantage of this system entirely from VCS and Kibana without needing to write additional code. Let’s take a look at an example.</p>
<h3>Example</h3>
<p>For this example, we will act as an organization that has two Kibana spaces to manage via DaC. The first is a development space that rule authors will use to write detection rules (so let’s assume some preexisting rules are already available). Some developers will also write detection rules directly in TOML format and add them to our VCS, so we will need to keep these synchronized. Additionally, this organization wants to enforce unit testing and schema validation, with the option for peer review, on rules that will be deployed to a production space in the same Kibana instance. Finally, the organization wants all of this to occur in an automated manner, with no requirement to clone detection rules locally or write rules outside of a GUI.</p>
<p>To accomplish this, we will use a few of the new DaC features in detection rules and write some simple CI/CD workflows. In this example, we will be using GitHub. You can find a video walkthrough of this example <a href="https://dac-reference.readthedocs.io/en/latest/etoe_reference_example.html#demo-video">here</a>. As a note, if you wish to follow along, you will need to fork the detection rules repo and create an initial configuration using our <code>custom-rules setup-config</code> command. For general step-by-step instructions on how to use the DaC features, see this <a href="https://dac-reference.readthedocs.io/en/latest/etoe_reference_example.html#quick-start-example-detection-rules-cli-commands">quickstart guide</a>, which has several example commands.</p>
<h4>Development Space Rule Synchronization</h4>
<p>First, we are going to synchronize from Kibana to GitHub (VCS). To do this, we will use the <code>kibana import-rules</code> and <code>kibana export-rules</code> commands. Additionally, to keep the rule versions synchronized, we will use the locked versions file, since we want both our VCS and Kibana to be able to overwrite each other with the latest versions. This is not required for this setup; either Kibana or GitHub (VCS) could be used authoritatively instead of the locked versions file, but we will use it for convenience.</p>
<p>The first step is to create a manual dispatch trigger that pulls the latest rules from Kibana upon request. In our setup this could be done automatically; however, we want to give rule authors control over when they move their rules to the VCS, since the development space in Kibana is actively used for development and the presence of a new rule does not necessarily mean the rule is ready for VCS. The manual dispatch section could look like the following <a href="https://dac-reference.readthedocs.io/en/latest/core_component_syncing_rules_and_data_from_elastic_security_to_vcs.html#option-1-manual-dispatch-pull">example</a>:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image15.png" alt="" /></p>
<p>With this trigger in place, we can now write four additional jobs that will run on this workflow dispatch.</p>
<ol>
<li>Pull the rules from the desired Kibana space.</li>
<li>Update the version lock file.</li>
<li>Create a PR for review to merge into the main branch in GitHub.</li>
<li>Set the correct target for the PR.</li>
</ol>
<p>These jobs could look like the following, also from the same <a href="https://dac-reference.readthedocs.io/en/latest/core_component_syncing_rules_and_data_from_elastic_security_to_vcs.html#option-1-manual-dispatch-pull">example</a>:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image12.png" alt="" /></p>
<p>Now, once we run this workflow, we should expect to see a PR open with the new rules from the Kibana Dev space. We also need to synchronize rules from GitHub (VCS) to Kibana. For this, we will need to create a trigger on pull requests:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image4.png" alt="" /></p>
<p>Next, we just need to create a job that uses the <code>kibana import-rules</code> command to push the rule files from the given PR to Kibana. See the second <a href="https://dac-reference.readthedocs.io/en/latest/core_component_syncing_rules_and_data_from_vcs_to_elastic_security.html#option-1-push-on-merge">example</a> for the complete workflow file.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image5.png" alt="" /></p>
<p>With these two workflows complete we now have synchronization of rules between GitHub and the Kibana Dev space.</p>
<h3>Production Space Deployment</h3>
<p>With the Dev space synchronized, we now need to handle the prod space. As a reminder, we need to enforce unit testing and schema validation, offer peer review for PRs to main, and, on merge to main, automatically push to the prod space. To accomplish this, we will need two workflow files. The first will run unit tests on all pull requests and pushes to versioned branches. The second will push the latest rules merged to main to the prod space in Kibana.</p>
<p>The first workflow file is very simple. It has <code>push</code> and <code>pull_request</code> triggers, and its core job runs the <code>test</code> command shown below. See this <a href="https://dac-reference.readthedocs.io/en/latest/core_component_syncing_rules_and_data_from_elastic_security_to_vcs.html#sub-component-3-optional-unit-testing-rules-via-ci-cd">example</a> for the full workflow.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image5.png" alt="" /></p>
<p>With this <code>test</code> command, we perform unit tests and schema validation, using the parameters specified in our config files, on all of our custom rules. Now we just need the workflow that pushes the latest rules to the prod space. The core of this workflow is again the <code>kibana import-rules</code> command, this time using the prod space as the destination. There are also a number of additional options provided to this workflow that are not strictly necessary but are nice to have in this example, such as options to overwrite and update exceptions/exception lists as well as rules. The core job is shown below. Please see <a href="https://dac-reference.readthedocs.io/en/latest/core_component_syncing_rules_and_data_from_vcs_to_elastic_security.html#option-1-push-on-merge">this example</a> for the full workflow file.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/dac-beta-release/image7.png" alt="" /></p>
<p>And there we have it: with those four workflow files, we have a synchronized development space with rules passing through unit testing and schema validation. We have the option of peer review through pull requests, which can be made a requirement in GitHub before merges to main. On merge to main, we also have an automated push to the Kibana prod space, establishing a baseline of rules that have passed our organization’s requirements and are ready for use. All of this was accomplished without writing additional Python code, just by using our new DaC features in GitHub workflows.</p>
<h2>Conclusion</h2>
<p>Now that we’ve reached this milestone, you may be wondering what’s next. We’re planning to spend the next few cycles continuing to test edge cases and incorporating feedback from the community as part of our business-as-usual sprints. We also have a backlog of feature request considerations, so if you want to voice your opinion, check out the issues titled <code>[FR][DAC] Consideration:</code> or open a similar new issue if yours is not already recorded. This will help us prioritize the most important features for the community.</p>
<p>We’re always interested in hearing about use cases and workflows like these, so as always, reach out to us via <a href="https://github.com/elastic/detection-rules/issues">GitHub issues</a>, chat with us in our <a href="https://elasticstack.slack.com/archives/C06TE19EP09">security-rules-dac</a> Slack channel, and ask questions in our <a href="https://discuss.elastic.co/c/security/endpoint-security/80">Discuss forums</a>!</p>
            <category>security-labs</category>
<enclosure url="https://www.elastic.co/security-labs/assets/images/dac-beta-release/Security Labs Images 18.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Google Cloud for Cyber Data Analytics]]></title>
            <link>https://www.elastic.co/security-labs/google-cloud-for-cyber-data-analytics</link>
            <guid>google-cloud-for-cyber-data-analytics</guid>
            <pubDate>Thu, 14 Dec 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This article explains how we conduct comprehensive cyber threat data analysis using Google Cloud, from data extraction and preprocessing to trend analysis and presentation. It emphasizes the value of BigQuery, Python, and Google Sheets - showcasing how to refine and visualize data for insightful cybersecurity analysis.]]></description>
            <content:encoded><![CDATA[<h1>Introduction</h1>
<p>In today's digital age, the sheer volume of data generated by devices and systems can be both a challenge and an opportunity for security practitioners. Analyzing such a large volume of data to craft valuable or actionable insights into cyber attack trends requires precise tools and methodologies.</p>
<p>Before you delve into the task of data analysis, you might find yourself asking:</p>
<ul>
<li>What specific questions am I aiming to answer, and do I possess the necessary data?</li>
<li>Where is all the pertinent data located?</li>
<li>How can I gain access to this data?</li>
<li>Upon accessing the data, what steps are involved in understanding and organizing it?</li>
<li>Which tools are most effective for extracting, interpreting, or visualizing the data?</li>
<li>Should I analyze the raw data immediately or wait until it has been processed?</li>
<li>Most crucially, what actionable insights can be derived from the data?</li>
</ul>
<p>If these questions resonate with you, you're on the right path. Welcome to the world of Google Cloud, where we'll address these queries and guide you through the process of creating a comprehensive report.</p>
<p>Our approach will include several steps in the following order:</p>
<p><strong>Exploration:</strong> We start by thoroughly understanding the data at our disposal. This phase involves identifying potential insights we aim to uncover and verifying the availability of the required data.</p>
<p><strong>Extraction:</strong> Here, we gather the necessary data, focusing on the most relevant and current information for our analysis.</p>
<p><strong>Pre-processing and transformation:</strong> At this stage, we prepare the data for analysis. This involves normalizing (cleaning, organizing, and structuring) the data to ensure its readiness for further processing.</p>
<p><strong>Trend analysis:</strong> The majority of our threat findings and observations derive from this effort. We analyze the processed data for patterns, trends, and anomalies. Techniques such as time series analysis and aggregation are employed to understand the evolution of threats over time and to highlight significant cyber attacks across various platforms.</p>
<p><strong>Reduction:</strong> In this step, we distill the data to its most relevant elements, focusing on the most significant and insightful aspects.</p>
<p><strong>Presentation:</strong> The final step is about presenting our findings. Utilizing tools from Google Workspace, we aim to display our insights in a clear, concise, and visually-engaging manner.</p>
<p><strong>Conclusion:</strong> Reflecting on this journey, we'll discuss the importance of having the right analytical tools. We'll highlight how Google Cloud Platform (GCP) provides an ideal environment for analyzing cyber threat data, allowing us to transform raw data into meaningful insights.</p>
<h1>Exploration: Determining available data</h1>
<p>Before diving into any sophisticated analyses, it's necessary to prepare by establishing an understanding of the data landscape we intend to study.</p>
<p>Here's our approach:</p>
<ol>
<li><strong>Identifying available data:</strong> The first step is to ascertain what data is accessible. This could include malware phenomena, endpoint anomalies, cloud signals, etc. Confirming the availability of these data types is essential.</li>
<li><strong>Locating the data stores:</strong> Determining the exact location of our data. Knowing where our data resides – whether in databases, data lakes, or other storage solutions – helps streamline the subsequent analysis process.</li>
<li><strong>Accessing the data:</strong> It’s important to ensure that we have the necessary permissions or credentials to access the datasets we need. If we don’t, attempting to identify and request access from the resource owner is necessary.</li>
<li><strong>Understanding the data schema:</strong> Comprehending the structure of our data is vital. Knowing the schema aids in planning the analysis process effectively.</li>
<li><strong>Evaluating data quality:</strong> Just like any thorough analysis, assessing the quality of the data is crucial. We check whether the data is segmented and detailed enough for a meaningful trend analysis.</li>
</ol>
<p>This phase is about ensuring that our analysis is based on solid and realistic foundations. For a report like the <a href="http://www.elastic.co/gtr">Global Threat Report</a>, we rely on rich and pertinent datasets such as:</p>
<ul>
<li><strong>Cloud signal data:</strong> This includes data from global Security Information and Event Management (SIEM) alerts, especially focusing on cloud platforms like AWS, GCP, and Azure. This data is often sourced from <a href="https://github.com/elastic/detection-rules">public detection rules</a>.</li>
<li><strong>Endpoint alert data:</strong> Data collected from the global <a href="https://docs.elastic.co/en/integrations/endpoint">Elastic Defend</a> alerts, incorporating a variety of public <a href="https://github.com/elastic/protections-artifacts/tree/main/behavior">endpoint behavior rules</a>.</li>
<li><strong>Malware data:</strong> This involves data from global Elastic Defend alerts, enriched with <a href="https://www.elastic.co/blog/introducing-elastic-endpoint-security">MalwareScore</a> and public <a href="https://github.com/elastic/protections-artifacts/tree/main/yara">YARA rules</a>.</li>
</ul>
<p>Each dataset is categorized and enriched for context with frameworks like <a href="https://attack.mitre.org/">MITRE ATT&amp;CK</a>, Elastic Stack details, and customer insights. Storage solutions of Google Cloud Platform, such as BigQuery and Google Cloud Storage (GCS) buckets, provide a robust infrastructure for our analysis.</p>
<p>It's also important to set a data “freshness” threshold, excluding data older than 365 days for an annual report, to ensure relevance and accuracy.</p>
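<p>In practice this cutoff is applied in the query itself, but the logic amounts to a simple date comparison. A minimal sketch (the event shapes below are illustrative assumptions):</p>

```python
from datetime import datetime, timedelta, timezone

def is_fresh(event_time, now, max_age_days=365):
    """Keep only events inside the annual reporting window."""
    return now - event_time <= timedelta(days=max_age_days)

report_date = datetime(2023, 12, 14, tzinfo=timezone.utc)
events = [
    {"rule": "abc", "ts": datetime(2023, 11, 1, tzinfo=timezone.utc)},  # in window
    {"rule": "def", "ts": datetime(2022, 6, 1, tzinfo=timezone.utc)},   # too old
]
fresh = [event for event in events if is_fresh(event["ts"], report_date)]
assert [event["rule"] for event in fresh] == ["abc"]
```
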
<p>Lastly, remember to choose data that offers an unbiased perspective. Excluding or including internal data should be an intentional, strategic decision based on its relevance to your visibility.</p>
<p>In summary, selecting the right tools and datasets is fundamental to creating a comprehensive and insightful analysis. Each choice contributes uniquely to the overall effectiveness of the data analysis, ensuring that the final insights are both valuable and impactful.</p>
<h1>Extraction: The first step in data analysis</h1>
<p>Having identified and located the necessary data, the next step in our analytical journey is to extract this data from our storage solutions. This phase is critical, as it sets the stage for the in-depth analysis that follows.</p>
<h2>Data extraction tools and techniques</h2>
<p>Various tools and programming languages can be utilized for data extraction, including Python, R, Go, Jupyter Notebooks, and Looker Studio. Each tool offers unique advantages, and the choice depends on the specific needs of your analysis.</p>
<p>In our data extraction efforts, we have found the most success from a combination of <a href="https://cloud.google.com/bigquery?hl=en">BigQuery</a>, <a href="https://colab.google/">Colab Notebooks</a>, <a href="https://cloud.google.com/storage/docs/json_api/v1/buckets">buckets</a>, and <a href="https://workspace.google.com/">Google Workspace</a> to extract the required data. Colab Notebooks, akin to Jupyter Notebooks, operate within Google's cloud environment, providing a seamless integration with other Google Cloud services.</p>
<h2>BigQuery for data staging and querying</h2>
<p>In the analysis process, a key step is to &quot;stage&quot; our datasets using BigQuery. This involves utilizing BigQuery queries to create and save objects, thereby making them reusable and shareable across our team. We achieve this by employing the <a href="https://hevodata.com/learn/google-bigquery-create-table/#b2">CREATE TABLE</a> statement, which allows us to combine multiple <a href="https://cloud.google.com/bigquery/docs/datasets-intro">datasets</a> such as endpoint behavior alerts, customer data, and rule data into a single, comprehensive dataset.</p>
<p>This consolidated dataset is then stored in a BigQuery table specifically designated for this purpose. For this example, we’ll refer to it as the “Global Threat Report” dataset. This approach is applied consistently across different types of data, including both cloud signals and malware datasets.</p>
<p>The newly created data table, for instance, might be named <code>elastic.global_threat_report.ep_behavior_raw</code>. This naming convention, defined by BigQuery, helps in organizing and locating the datasets effectively, which is crucial for the subsequent stages of the extraction process.</p>
<p>An example of a BigQuery query used in this process might look like this:</p>
<pre><code>CREATE TABLE elastic.global_threat_report.ep_behavior_raw AS
SELECT * FROM ...
</code></pre>
<p><img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image8.png" alt="Diagram for BigQuery query to an exported dataset table" />
Diagram for BigQuery query to an exported dataset table</p>
<p>We also use the <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements#export_data_statement">EXPORT DATA</a> statement in BigQuery to transfer tables to other GCP services, like exporting them to Google Cloud Storage (GCS) buckets in <a href="https://parquet.apache.org/">parquet file format</a>.</p>
<pre><code>EXPORT DATA
  OPTIONS (
    uri = 'gs://**/ep_behavior/*.parquet',
    format = 'parquet',
    overwrite = true
  )
AS (
SELECT * FROM `project.global_threat_report.2023_pre_norm_ep_behavior`
)
</code></pre>
<h2>Colab Notebooks for loading staged datasets</h2>
<p><a href="https://colab.research.google.com/">Colab Notebooks</a> are instrumental in organizing our data extraction process. They allow for easy access and management of data scripts stored in platforms like GitHub and Google Drive.</p>
<p>For authentication and authorization, we use Google Workspace credentials, simplifying access to various Google Cloud services, including BigQuery and Colab Notebooks. Here's a basic example of how authentication is handled:</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image9.png" alt="Diagram for authentication and authorization between Google Cloud services" />
Diagram for authentication and authorization between Google Cloud services</p>
<p>For those new to <a href="https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/">Jupyter Notebooks</a> or dataframes, it's beneficial to spend time becoming familiar with these tools. They are fundamental in any data analyst's toolkit, allowing for efficient code management, data analysis, and structuring. Mastery of these tools is key to effective data analysis.</p>
<p>Upon creating a notebook in Google Colab, we're ready to extract our custom tables (such as <code>project.global_threat_report.ep_behavior_raw</code>) from BigQuery. This data is then loaded into Pandas DataFrames; Pandas is a Python library that facilitates data manipulation and analysis. While handling large datasets with Python can be challenging, Google Colab provides robust virtual computing resources. If needed, these resources can be scaled up through the Google Cloud <a href="https://console.cloud.google.com/marketplace/product/colab-marketplace-image-public/colab">Marketplace</a> or the Google Cloud Console, ensuring that even large datasets can be processed efficiently.</p>
<h2>Essential Python libraries for data analysis</h2>
<p>In our data analysis process, we utilize various Python libraries, each serving a specific purpose:</p>
<table>
<thead>
<tr>
<th>Library</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://docs.python.org/3/library/datetime.html">datetime</a></td>
<td>Essential for handling all operations related to date and time in your data. It allows you to manipulate and format date and time information for analysis.</td>
</tr>
<tr>
<td><a href="https://google-auth.readthedocs.io/en/master/">google.auth</a></td>
<td>Manages authentication and access permissions, ensuring secure access to Google Cloud services. It's key for controlling who can access your data and services.</td>
</tr>
<tr>
<td><a href="https://cloud.google.com/python/docs/reference/bigquery/latest">google.colab.auth</a></td>
<td>Provides authentication for accessing Google Cloud services within Google Colab notebooks, enabling a secure connection to your cloud-based resources.</td>
</tr>
<tr>
<td><a href="https://cloud.google.com/python/docs/reference/bigquery/latest">google.cloud.bigquery</a></td>
<td>A tool for managing large datasets in Google Cloud's BigQuery service. It allows for efficient processing and analysis of massive amounts of data.</td>
</tr>
<tr>
<td><a href="https://cloud.google.com/python/docs/reference/storage/latest">google.cloud.storage</a></td>
<td>Used for storing and retrieving data in Google Cloud Storage. It's an ideal solution for handling various data files in the cloud.</td>
</tr>
<tr>
<td><a href="https://docs.gspread.org/en/latest/">gspread</a></td>
<td>Facilitates interaction with Google Spreadsheets, allowing for easy manipulation and analysis of spreadsheet data.</td>
</tr>
<tr>
<td><a href="https://pypi.org/project/gspread-dataframe/">gspread.dataframe</a>.set_with_dataframe</td>
<td>Syncs data between Pandas dataframes and Google Spreadsheets, enabling seamless data transfer and updating between these formats.</td>
</tr>
<tr>
<td><a href="https://pypi.org/project/matplotlib/">matplotlib</a>.pyplot.plt</td>
<td>A module in Matplotlib library for creating charts and graphs. It helps in visualizing data in a graphical format, making it easier to understand patterns and trends.</td>
</tr>
<tr>
<td><a href="https://pandas.pydata.org/">pandas</a></td>
<td>A fundamental tool for data manipulation and analysis in Python. It offers data structures and operations for manipulating numerical tables and time series.</td>
</tr>
<tr>
<td><a href="https://pypi.org/project/pandas-gbq/">pandas.gbq</a>.to_gbq</td>
<td>Enables the transfer of data from Pandas dataframes directly into Google BigQuery, streamlining the process of moving data into this cloud-based analytics platform.</td>
</tr>
<tr>
<td><a href="https://arrow.apache.org/docs/python/index.html">pyarrow</a>.parquet</td>
<td>Allows for efficient storage and retrieval of data in the Parquet format, a columnar storage file format optimized for use with large datasets.</td>
</tr>
<tr>
<td><a href="https://seaborn.pydata.org/">seaborn</a></td>
<td>A Python visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.</td>
</tr>
</tbody>
</table>
<p>Next, we authenticate with BigQuery, and receive authorization to access our datasets as demonstrated earlier. By using Google Workspace credentials, we can easily access BigQuery and other Google Cloud services. The process typically involves a simple code snippet for authentication:</p>
<pre><code>from google.colab import auth
from google.cloud import bigquery

# Authenticate the Colab session with your Google account
auth.authenticate_user()

# Create a BigQuery client scoped to your GCP project
project_id = &quot;PROJECT_FROM_GCP&quot;
client = bigquery.Client(project=project_id)
</code></pre>
<p>With authentication complete, we can then proceed to access and manipulate our data. Google Colab's integration with Google Cloud services simplifies this process, making it efficient and secure.</p>
<h2>Organizing Colab Notebooks before analysis</h2>
<p>When working with Jupyter Notebooks, it's better to organize your notebook beforehand. Various stages of handling and manipulating data will be required, and staying organized will help you create a repeatable, comprehensive process.</p>
<p>In our notebooks, we use Jupyter Notebook headers to organize the code systematically. This structure allows for clear compartmentalization and the creation of collapsible sections, which is especially beneficial when dealing with complex data operations that require multiple steps. This methodical organization aids in navigating the notebook efficiently, ensuring that each step in the data extraction and analysis process is easily accessible and manageable.</p>
<p>Moreover, while the workflow in a notebook might seem linear, it's often more dynamic. Data analysts frequently engage in multitasking, jumping between different sections as needed based on the data or results they encounter. Furthermore, new insights discovered in one step may influence another step’s process, leading to some back and forth before finishing the notebook.
<img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image3.png" alt="" /></p>
<h2>Extracting Our BigQuery datasets into dataframes</h2>
<p>After establishing the structure of our notebook and successfully authenticating with BigQuery, our next step is to retrieve the required datasets. This process sets the foundation for the rest of the report, as the information from these sources will form the basis of our analysis, similar to selecting the key components required for a comprehensive study.</p>
<p>Here's an example of how we might fetch data from BigQuery:</p>
<pre><code>import datetime

# Build the table ID for the current year's raw endpoint behavior dataset
current_year = datetime.datetime.now().year
reb_dataset_id = f'project.global_threat_report.{current_year}_raw_ep_behavior'

# List the table rows and load them into a Pandas dataframe
reb_table = client.list_rows(reb_dataset_id)
reb_df = reb_table.to_dataframe()
</code></pre>
<p>This snippet demonstrates a typical data retrieval process. We first build the table ID for the dataset we're interested in (for the Global Threat Report, <code>project.global_threat_report.{current_year}_raw_ep_behavior</code> for the current year). Then, we use the BigQuery client to list the rows of that table and load them into a Pandas DataFrame. This DataFrame will serve as the foundation for our subsequent data analysis steps.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image4.png" alt="Colab Notebook snippet for data extraction from BigQuery into Pandas dataframe" />
Colab Notebook snippet for data extraction from BigQuery into Pandas dataframe</p>
<p>This process marks the completion of the extraction phase. We have successfully navigated BigQuery to select and retrieve the necessary datasets and load them in our notebooks within dataframes. The extraction phase is pivotal, as it not only involves gathering the data but also setting up the foundation for deeper analysis. It's the initial step in a larger journey of discovery, leading to the transformation phase, where we will uncover more detailed insights from the data.</p>
<p>In summary, this part of our data journey is about more than just collecting datasets; it's about structurally preparing them for the in-depth analysis that follows. This meticulous approach to organizing and executing the extraction phase sets the stage for the transformative insights that we aim to derive in the subsequent stages of our data analysis.</p>
<h1>Pre-processing and transformation: The critical phase of data analysis</h1>
<p>The transition from raw data to actionable insights involves a series of crucial steps in data processing. After extracting data, our focus shifts to refining it for analysis. Cybersecurity datasets often include various forms of noise, such as false positives and anomalies, which must be addressed to ensure accurate and relevant analysis.</p>
<p>Key stages in data pre-processing and transformation:</p>
<ul>
<li><strong>Data cleaning:</strong> This stage involves filling NULL values, correcting data misalignments, and validating data types to ensure the dataset's integrity.</li>
<li><strong>Data enrichment:</strong> In this step, additional context is added to the dataset. For example, incorporating third-party data, like malware reputations from sources such as VirusTotal, enhances the depth of analysis.</li>
<li><strong>Normalization:</strong> This process standardizes the data to ensure consistency, which is particularly important for varied datasets like endpoint malware alerts.</li>
<li><strong>Anomaly detection:</strong> Identifying and rectifying outliers or false positives is critical to maintain the accuracy of the dataset.</li>
<li><strong>Feature extraction:</strong> The process of identifying meaningful, consistent data points in the raw data and deriving new fields from them for analysis.</li>
</ul>
<h2>Embracing the art of data cleaning</h2>
<p>Data cleaning is a fundamental step in preparing datasets for comprehensive analysis, especially in cybersecurity. This process involves a series of technical checks to ensure data integrity and reliability. Here are the specific steps:</p>
<ul>
<li>
<p><strong>Mapping to MITRE ATT&amp;CK framework:</strong> Verify that all detection and response rules in the dataset are accurately mapped to the corresponding tactics and techniques in the MITRE ATT&amp;CK framework. This check includes looking for NULL values or any inconsistencies in how the data aligns with the framework.</p>
</li>
<li>
<p><strong>Data type validation:</strong> Confirm that the data types within the dataset are appropriate and consistent. For example, timestamps should be in a standardized datetime format. This step may involve converting string formats to datetime objects or verifying that numerical values are in the correct format.</p>
</li>
<li>
<p><strong>Completeness of critical data:</strong> Ensure that no vital information is missing from the dataset. This includes checking for the presence of essential elements like SHA256 hashes or executable names in endpoint behavior logs. The absence of such data can lead to incomplete or biased analysis.</p>
</li>
<li>
<p><strong>Standardization across data formats:</strong> Assess and implement standardization of data formats across the dataset to ensure uniformity. This might involve normalizing text formats, ensuring consistent capitalization, or standardizing date and time representations.</p>
</li>
<li>
<p><strong>Duplicate entry identification:</strong> Identify and remove duplicate entries by examining unique identifiers such as XDR agent IDs or cluster IDs. This process might involve using functions to detect and remove duplicates, ensuring the uniqueness of each data entry.</p>
</li>
<li>
<p><strong>Exclusion of irrelevant internal data:</strong> Locate and remove any internal data that might have inadvertently been included in the dataset. This step is crucial to prevent internal biases or irrelevant information from affecting the analysis.</p>
</li>
</ul>
<p>It is important to note that data cleaning or “scrubbing the data” is a continuous effort throughout our workflow. As we continue to peel back the layers of our data and wrangle it for various insights, it is expected that we identify additional changes.</p>
<h2>Utilizing Pandas for data cleaning</h2>
<p>The <a href="https://pandas.pydata.org/about/">Pandas</a> library in Python offers several functionalities that are particularly useful for data cleaning in cybersecurity contexts. Some of these methods include:</p>
<ul>
<li><code>DataFrame.isnull()</code> or <code>DataFrame.notnull()</code> to identify missing values.</li>
<li><code>DataFrame.drop_duplicates()</code> to remove duplicate rows.</li>
<li>Data type conversion methods like <code>pd.to_datetime()</code> for standardizing timestamp formats.</li>
<li>Utilizing boolean indexing to filter out irrelevant data based on specific criteria.</li>
</ul>
<p>A thorough understanding of the dataset is essential to determine the right cleaning methods. It may be necessary to explore the dataset preliminarily to identify specific areas requiring cleaning or transformation. Additional helpful methods and workflows can be found in <a href="https://realpython.com/python-data-cleaning-numpy-pandas/">this</a> Real Python blog.</p>
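<p>The methods above can be combined into a small cleaning pass. The sketch below is illustrative only, using hypothetical column names (<code>timestamp</code>, <code>sha256</code>, <code>agent_id</code>) rather than any specific dataset schema:</p>

```python
import pandas as pd

# Hypothetical raw alert data; column names are illustrative only
raw = pd.DataFrame({
    "timestamp": ["2023-01-01 10:00", "2023-01-01 10:00", None],
    "sha256": ["abc123", "abc123", "def456"],
    "agent_id": ["agent-1", "agent-1", "agent-2"],
})

# Flag rows with missing critical fields before deciding how to handle them
missing_mask = raw["timestamp"].isnull()

# Drop exact duplicate alerts (same timestamp, hash, and agent)
deduped = raw.drop_duplicates()

# Standardize timestamps into datetime objects (unparseable values become NaT)
deduped = deduped.assign(timestamp=pd.to_datetime(deduped["timestamp"]))

# Boolean indexing: keep only rows with complete critical data
clean = deduped[deduped["timestamp"].notnull() & deduped["sha256"].notnull()]
```

In practice each of these steps would be driven by what the preliminary exploration of your own dataset reveals.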
<h2>Feature extraction and enrichment</h2>
<p>Feature extraction and enrichment are core steps in data analysis, particularly in the context of cybersecurity. These processes involve transforming and augmenting the dataset to enhance its usefulness for analysis.</p>
<ul>
<li><strong>Create new data from existing:</strong> This is where we modify or use existing data to add additional columns or rows.</li>
<li><strong>Add new data from 3rd-party:</strong> Here, we use existing data as a query reference for 3rd-party RESTful APIs which respond with additional data we can add to the datasets.</li>
</ul>
<h2>Feature extraction</h2>
<p>Let’s dig into a tangible example. Imagine we're presented with a bounty of publicly available YARA signatures that Elastic <a href="https://github.com/elastic/protections-artifacts/tree/main/yara/rules">shares</a> with its community. These signatures trigger some of the endpoint malware alerts in our dataset. The rule names that appear in the raw data follow a consistent naming convention: <code>OperatingSystem_MalwareCategory_MalwareFamily</code>. These names can be deconstructed to provide more specific insights. Leveraging Pandas, we can expertly slice and dice the data. For those who prefer doing this during the dataset staging phase with BigQuery, the combination of <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#split">SPLIT</a> and <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#offset_and_ordinal">OFFSET</a> clauses can yield similar results:</p>
<pre><code>df[['OperatingSystem', 'MalwareCategory', 'MalwareFamily']] = df['yara_rule_name'].str.split('_', expand=True)
</code></pre>
<p><img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image2.png" alt="Feature extraction with our YARA data" />
Feature extraction with our YARA data</p>
<p>There are additional approaches, methods, and processes to feature extraction in data analysis. We recommend consulting your stakeholders' needs and exploring your data to determine what to extract and how.</p>
<h2>Data enrichment</h2>
<p>Data enrichment enhances the depth and context of cybersecurity datasets. One effective approach involves integrating external data sources to provide additional perspectives on the existing data. This can be particularly valuable in understanding and interpreting cybersecurity alerts.</p>
<p><strong>Example of data enrichment: Integrating VirusTotal reputation data</strong>
A common method of data enrichment in cybersecurity involves incorporating reputation scores from external threat intelligence services like <a href="https://www.virustotal.com/gui/home/search">VirusTotal</a> (VT). This process typically includes:</p>
<ol>
<li><strong>Fetching reputation data:</strong> Using an API key from VT, we can query for reputational data based on unique identifiers in our dataset, such as SHA256 hashes of binaries.</li>
</ol>
<pre><code>import requests

def get_reputation(sha256, API_KEY, URL):
    params = {'apikey': API_KEY, 'resource': sha256}
    response = requests.get(URL, params=params)
    json_response = response.json()
    
    if json_response.get(&quot;response_code&quot;) == 1:
        positives = json_response.get(&quot;positives&quot;, 0)
        return classify_positives(positives)
    else:
        return &quot;unknown&quot;
</code></pre>
<p>In this function, <code>classify_positives</code> is a custom function that classifies the reputation based on the number of antivirus engines that flagged the file as malicious.</p>
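<p>The <code>classify_positives</code> helper itself is not shown above; a minimal sketch might look like the following, where the thresholds are illustrative assumptions rather than any official guidance:</p>

```python
def classify_positives(positives: int) -> str:
    """Bucket a VirusTotal positives count into a coarse reputation label.

    The thresholds here are illustrative assumptions, not official guidance.
    """
    if positives == 0:
        return "clean"
    elif positives <= 3:
        return "suspicious"
    else:
        return "malicious"
```

With a helper like this, <code>get_reputation</code> returns one of these labels, or <code>"unknown"</code> when VirusTotal has no record of the hash.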
<ol start="2">
<li><strong>Adding reputation data to the dataset:</strong> The reputation data fetched from VirusTotal is then integrated into the existing dataset. This is done by applying the <code>get_reputation</code> function to each relevant entry in the DataFrame.</li>
</ol>
<pre><code>df['reputation'] = df['sha256'].apply(lambda x: get_reputation(x, API_KEY, URL))

</code></pre>
<p>Here, a new column named <code>reputation</code> is added to the dataframe, providing an additional layer of information about each binary based on its detection rate in VirusTotal.</p>
<p>This method of data enrichment is just one of many options available for enhancing cybersecurity threat data. By utilizing robust helper functions and tapping into external data repositories, analysts can significantly enrich their datasets. This enrichment allows for a more comprehensive understanding of the data, leading to a more informed and nuanced analysis. The techniques demonstrated here are part of a broader range of advanced data manipulation methods that can further refine cybersecurity data analysis.</p>
<h2>Normalization</h2>
<p>Especially when dealing with varied datasets in cybersecurity, such as endpoint alerts and cloud SIEM notifications, normalization may be required to get the most out of your data.</p>
<p><strong>Understanding normalization:</strong> At its core, normalization is about adjusting values measured on different scales to a common scale, ensuring that they are proportionally represented, and reducing redundancy. In the cybersecurity context, this means representing events or alerts in a manner that doesn't unintentionally amplify or reduce their significance.</p>
<p>Consider our endpoint malware dataset. When analyzing trends, say, infections based on malware families or categories, we aim for an accurate representation. However, a single malware infection on an endpoint could generate multiple alerts depending on the Extended Detection and Response (XDR) system. If left unchecked, this could significantly skew our understanding of the threat landscape. To counteract this, we consider the Elastic agents, which are deployed as part of the XDR solution. Each endpoint has a unique agent, representing a single infection instance if malware is detected. Therefore, to normalize this dataset, we would &quot;flatten&quot; or adjust it based on unique agent IDs. This means, for our analysis, we'd consider the number of unique agent IDs affected by a specific malware family or category rather than the raw number of alerts.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image6.png" alt="Example visualization of malware alert normalization by unique agents" />
Example visualization of malware alert normalization by unique agents</p>
<p>As depicted in the image above, if we chose not to normalize the malware data in preparation for trend analysis, our key findings would depict inaccurate information. This inaccuracy could stem from a plethora of data inconsistencies, such as generic YARA rules, programmatic operations that were flagged repeatedly on a single endpoint, and many more.</p>
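<p>The "flattening" by unique agent described above reduces to a one-line change in Pandas. This sketch assumes hypothetical <code>agent_id</code> and <code>malware_family</code> columns:</p>

```python
import pandas as pd

# Hypothetical alert data: one infected agent can raise many alerts
alerts = pd.DataFrame({
    "agent_id": ["a1", "a1", "a1", "a2", "a3"],
    "malware_family": ["GandCrab", "GandCrab", "GandCrab", "GandCrab", "Emotet"],
})

# Raw alert counts over-represent noisy endpoints
raw_counts = alerts.groupby("malware_family").size()

# Normalized view: count unique infected agents per family instead
normalized = alerts.groupby("malware_family")["agent_id"].nunique()
```

Here the raw count would suggest GandCrab is four times as prevalent as Emotet, while the agent-normalized view shows only two affected endpoints versus one.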
<p><strong>Diversifying the approach:</strong> On the other hand, when dealing with endpoint behavior alerts or cloud alerts (from platforms like AWS, GCP, Azure, Google Workspace, and O365), our normalization approach might differ. These datasets could have their own nuances and may not require the same &quot;flattening&quot; technique used for malware alerts.</p>
<p><strong>Conceptualizing normalization options:</strong> Remember the goal of normalization is to reduce redundancy in your data. Make sure to keep your operations as atomic as possible in case you need to go back and tweak them later. This is especially true when performing both normalization and standardization. Sometimes these can be difficult to separate, and you may have to go back and forth between the two. Analysts have a wealth of options here, from <a href="https://www.geeksforgeeks.org/data-pre-processing-wit-sklearn-using-standard-and-minmax-scaler/">Min-Max</a> scaling, where values are shifted and rescaled to range between 0 and 1, to <a href="https://www.statology.org/z-score-python/">Z-score</a> normalization (or standardization), where values are centered around zero in units of standard deviation from the mean. The choice of technique depends on the nature of the data and the specific requirements of the analysis.</p>
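<p>Both techniques reduce to a couple of lines in plain Pandas. The sketch below uses a made-up series of alert counts purely for illustration:</p>

```python
import pandas as pd

# Made-up alert counts for illustration
counts = pd.Series([10, 20, 30, 40, 100], name="alert_count")

# Min-Max scaling: shift and rescale values into the [0, 1] range
min_max = (counts - counts.min()) / (counts.max() - counts.min())

# Z-score standardization: center around zero in units of standard deviation
z_scores = (counts - counts.mean()) / counts.std()
```

Libraries like scikit-learn offer the same transforms (e.g. its scalers in `sklearn.preprocessing`), but the arithmetic itself is this simple.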
<p>In essence, normalization ensures that our cybersecurity analysis is based on a level playing field, giving stakeholders an accurate view of the threat environment without undue distortions. This is a critical step before trend analysis.</p>
<h2>Anomaly detection: Refining the process of data analysis</h2>
<p>In the realm of cybersecurity analytics, a one-size-fits-all approach to anomaly detection does not exist. The process is highly dependent on the specific characteristics of the data at hand. The primary goal is to identify and address outliers that could potentially distort the analysis. This requires a dynamic and adaptable methodology, where understanding the nuances of the dataset is crucial.</p>
<p>Anomaly detection in cybersecurity involves exploring various techniques and methodologies, each suited to different types of data irregularities. The strategy is not to rigidly apply a single method but rather to use a deep understanding of the data to select the most appropriate technique for each situation. The emphasis is on flexibility and adaptability, ensuring that the approach chosen provides the clearest and most accurate insights into the data.</p>
<h3>Statistical methods – The backbone of analysis:</h3>
<p>Statistical analysis is always an option for anomaly detection, especially with cybersecurity data. By understanding the inherent distribution and central tendencies of our data, we can highlight values that deviate from the norm. A simple yet powerful method, the Z-score, gauges the distance of a data point from the mean in terms of standard deviations.</p>
<pre><code>import numpy as np

# Count alerts per technique, then derive a Z-score for each count
counts = df['mitre_technique'].value_counts()
z_scores = np.abs((counts - counts.mean()) / counts.std())

outliers = counts[z_scores &gt; 3]  # Conventionally, a Z-score above 3 signals an outlier
</code></pre>
<p><strong>Why this matters:</strong> This method allows us to quantitatively gauge the significance of a data point's deviation. Such outliers can heavily skew aggregate metrics like mean or even influence machine learning model training detrimentally. Remember, outliers should not always be removed; it is all about context! Sometimes you may even be looking for the outliers specifically.</p>
<p><strong>Key library:</strong> While we utilize <a href="https://numpy.org/">NumPy</a> above, <a href="https://scipy.org/">SciPy</a> can also be employed for intricate statistical operations.</p>
<h3>Aggregations and sorting – Unraveling layers:</h3>
<p>Data often presents itself in layers. By starting with a high-level view and gradually diving into specifics, we can locate inconsistencies or anomalies. When we aggregate by categories such as the MITRE ATT&amp;CK tactic, and then delve deeper, we gradually uncover the finer details and potential anomalies as we go from technique to rule logic and alert context.</p>
<pre><code># Aggregating by tactics first
tactic_agg = df.groupby('mitre_tactic').size().sort_values(ascending=False)
</code></pre>
<p>From here, we can identify the most common tactics and choose the tactic with the highest count. We then filter our data for this tactic to identify the most common technique associated with it. Techniques are often more specific than tactics and thus add more explanation about what we may be observing. Following the same approach, we can then filter for this specific technique, aggregate by rule, and review that detection rule for more context. The goal here is to find “noisy” rules that may be skewing our dataset, so that the related alerts can be removed. This cycle can be repeated until outliers are removed and the percentages appear more accurate.</p>
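<p>That drill-down cycle can be expressed programmatically. The sketch below assumes hypothetical <code>mitre_tactic</code>, <code>mitre_technique</code>, and <code>detection_rule</code> columns and toy data:</p>

```python
import pandas as pd

# Hypothetical alert data for illustration
df = pd.DataFrame({
    "mitre_tactic": ["Execution", "Execution", "Execution", "Persistence"],
    "mitre_technique": ["Scripting", "Scripting", "Native API", "Cron"],
    "detection_rule": ["Rule A", "Rule A", "Rule B", "Rule C"],
})

# Step 1: the most common tactic
top_tactic = df.groupby("mitre_tactic").size().idxmax()

# Step 2: the most common technique within that tactic
tactic_df = df[df["mitre_tactic"] == top_tactic]
top_technique = tactic_df.groupby("mitre_technique").size().idxmax()

# Step 3: the noisiest rule behind that technique, a candidate for review
technique_df = tactic_df[tactic_df["mitre_technique"] == top_technique]
top_rule = technique_df.groupby("detection_rule").size().idxmax()
```

After reviewing the flagged rule's logic, you would filter its alerts out (or not) and re-run the cycle until the aggregates look trustworthy.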
<p><strong>Why this matters:</strong> This layered analysis approach ensures no stone is left unturned. By navigating from the general to the specific, we systematically weed out inconsistencies.</p>
<p><strong>Key library:</strong> Pandas remains the hero, equipped to handle data-wrangling chores with finesse.</p>
<h3>Visualization – The lens of clarity:</h3>
<p>Sometimes, the human eye, when aided with the right visual representation, can intuitively detect what even the most complex algorithms might miss. A boxplot, for instance, not only shows the central tendency and spread of data but distinctly marks outliers.</p>
<pre><code>import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.boxplot(x='Malware Family', y='Malware Score', data=df)
plt.title('Distribution of Malware Scores by Family')
plt.show()
</code></pre>
<p><img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image10.png" alt="Example visualization of malware distribution scores by family from an example dataset" />
Example visualization of malware distribution scores by family from an example dataset</p>
<p><strong>Why this matters:</strong> Visualization transforms abstract data into tangible insights. It offers a perspective that's both holistic and granular, depending on the need.</p>
<p><strong>Key library:</strong> Seaborn, built atop Matplotlib, excels at turning data into visual stories.</p>
<h3>Machine learning – The advanced guard:</h3>
<p>When traditional methods are insufficient, machine learning steps in, offering a predictive lens to anomalies. While many algorithms are designed to classify known patterns, some, like autoencoders in deep learning, learn to recreate 'normal' data, marking any deviation as an anomaly.</p>
<p><strong>Why this matters:</strong> As data complexity grows, the boundaries of what constitutes an anomaly become blurrier. Machine learning offers adaptive solutions that evolve with the data.</p>
<p><strong>Key libraries:</strong> <a href="https://scikit-learn.org/stable/">Scikit-learn</a> is a treasure trove for user-friendly, classical machine learning techniques, while <a href="https://pytorch.org/">PyTorch</a> brings the power of deep learning to the table.</p>
<p>Perfecting anomaly detection in data analysis is similar to refining a complex skill through practice and iteration. The process often involves trial and error, with each iteration enhancing the analyst's familiarity with the dataset. This progressive understanding is key to ensuring that the final analysis is both robust and insightful. In data analysis, the journey of exploration and refinement is as valuable as the final outcome itself.</p>
<p>Before proceeding to in-depth trend analysis, it's very important to ensure that the data is thoroughly pre-processed and transformed. Just as precision and reliability are essential in any meticulous task, they are equally critical in data analysis. The steps of cleaning, normalizing, enriching, and removing anomalies form the groundwork for deriving meaningful insights. Without these careful preparations, the analysis could range from slightly inaccurate to significantly misleading. It's only when the data is properly refined and free of distortions that it can reveal its true value, leading to reliable and actionable insights in trend analysis.</p>
<h1>Trend analysis: Unveiling patterns in data</h1>
<p>In the dynamic field of cybersecurity where threat actors continually evolve their tactics, techniques, and procedures (TTPs), staying ahead of emerging threats is critical. Trend analysis serves as a vital tool in this regard, offering a way to identify and understand patterns and behaviors in cyber threats over time.</p>
<p>By utilizing the MITRE ATT&amp;CK framework, cybersecurity professionals have a structured and standardized approach to analyzing and categorizing these evolving threats. This framework aids in systematically identifying patterns in attack methodologies, enabling defenders to anticipate and respond to changes in adversary behaviors effectively.</p>
<p>Trend analysis, through the lens of the MITRE ATT&amp;CK framework, transforms raw cybersecurity telemetry into actionable intelligence. It allows analysts to track the evolution of attack strategies and to adapt their defense mechanisms accordingly, ensuring a proactive stance in cybersecurity management.</p>
<h2>Beginning with a broad overview: Aggregation and sorting</h2>
<p>Commencing our analysis with a bird's eye view is paramount. This panoramic perspective allows us to first pinpoint the broader tactics in play before delving into the more granular techniques and underlying detection rules.</p>
<p><strong>Top tactics:</strong> By aggregating our data based on MITRE ATT&amp;CK tactics, we can discern the overarching strategies adversaries lean toward. This paints a picture of their primary objectives, be it initial access, execution, or exfiltration.</p>
<pre><code>top_tactics = df.groupby('mitre_tactic').size().sort_values(ascending=False)
</code></pre>
<p><strong>Zooming into techniques:</strong> Once we've identified a prominent tactic, we can then funnel our attention to the techniques linked to that tactic. This reveals the specific modus operandi of adversaries.</p>
<pre><code>chosen_tactic = 'Execution'

techniques_under_tactic = df[df['mitre_tactic'] == chosen_tactic]
top_techniques = techniques_under_tactic.groupby('mitre_technique').size().sort_values(ascending=False)
</code></pre>
<p><strong>Detection rules and logic:</strong> With our spotlight on a specific technique, it's time to delve deeper, identifying the detection rules that triggered alerts. This not only showcases what was detected, but by reviewing the detection logic, we also gain an understanding of the precise behaviors and patterns that were flagged.</p>
<pre><code>chosen_technique = 'Scripting'

rules_for_technique = techniques_under_tactic[techniques_under_tactic['mitre_technique'] == chosen_technique]

top_rules = rules_for_technique.groupby('detection_rule').size().sort_values(ascending=False)
</code></pre>
<p>This hierarchical, cascading approach is akin to peeling an onion. With each layer, we expose more intricate details, refining our perspective and sharpening our insights.</p>
<h2>The power of time: Time series analysis</h2>
<p>In the realm of cybersecurity, time isn't just a metric; it's a narrative. Timestamps, often overlooked, are goldmines of insights. Time series analysis allows us to plot events over time, revealing patterns, spikes, or lulls that might be indicative of adversary campaigns, specific attack waves, or dormancy periods.</p>
<p>For instance, plotting endpoint malware alerts over time can unveil an adversary's operational hours or spotlight a synchronized, multi-vector attack:</p>
<pre><code>import matplotlib.pyplot as plt

# Extract and plot endpoint alerts over time
df.set_index('timestamp')['endpoint_alert'].resample('D').count().plot()
plt.title('Endpoint Malware Alerts Over Time')
plt.xlabel('Time')
plt.ylabel('Alert Count')
plt.show()
</code></pre>
<p>Time series analysis doesn't just highlight &quot;when&quot; but often provides insights into the &quot;why&quot; behind certain spikes or anomalies. It aids in correlating external events (like the release of a new exploit) to internal data trends.</p>
<h2>Correlation analysis</h2>
<p>Understanding relationships between different sets of data can offer valuable insights. For instance, a spike in one type of alert could correlate with another type of activity in the system, shedding light on multi-stage attack campaigns or diversion strategies.</p>
<pre><code># Finding correlation between an increase in login attempts and data exfiltration activities
correlation_value = df['login_attempts'].corr(df['data_exfil_activity'])
</code></pre>
<p>This analysis, with the help of pandas <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html">corr</a>, can help in discerning whether multiple seemingly isolated activities are part of a coordinated attack chain.</p>
<p>Correlation also does not have to be metric-driven either. When analyzing threats, it is easy to find value and new insights by comparing older findings to the new ones.</p>
<h2>Machine learning &amp; anomaly detection</h2>
<p>With the vast volume of data, manual analysis becomes impractical. Machine learning can assist in identifying patterns and anomalies that might escape the human eye. Algorithms like <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html">Isolation Forest</a> or <a href="https://scikit-learn.org/stable/modules/neighbors.html">K-nearest neighbors</a> (KNN) are commonly used to spot deviations or clusters of related data.</p>
<pre><code>from sklearn.ensemble import IsolationForest

# Assuming 'feature_set' contains relevant metrics for analysis
clf = IsolationForest(contamination=0.05)
anomalies = clf.fit_predict(feature_set)
</code></pre>
<p>Here, the <code>anomalies</code> variable will flag data points that deviate from the norm (Isolation Forest labels them -1), helping analysts pinpoint unusual behavior swiftly.</p>
<h2>Behavioral patterns &amp; endpoint data analysis</h2>
<p>Analyzing endpoint behavioral data collected from detection rules allows us to unearth overarching patterns and trends that can be indicative of broader threat landscapes, cyber campaigns, or evolving attacker TTPs.</p>
<p><strong>Tactic progression patterns:</strong> By monitoring the sequence of detected behaviors over time, we can spot patterns in how adversaries move through their attack chain. For instance, if there's a consistent trend where initial access techniques are followed by execution and then lateral movement, it's indicative of a common attacker playbook being employed.</p>
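<p>One way to surface such progressions is to collect the ordered tactic sequence per endpoint and count recurring playbooks. This is a sketch assuming hypothetical <code>agent_id</code>, <code>timestamp</code>, and <code>mitre_tactic</code> columns:</p>

```python
import pandas as pd

# Hypothetical behavior alerts from two endpoints
events = pd.DataFrame({
    "agent_id": ["a1", "a1", "a1", "a2", "a2", "a2"],
    "timestamp": pd.to_datetime([
        "2023-01-01 09:00", "2023-01-01 09:05", "2023-01-01 09:10",
        "2023-01-02 14:00", "2023-01-02 14:03", "2023-01-02 14:07",
    ]),
    "mitre_tactic": [
        "Initial Access", "Execution", "Lateral Movement",
        "Initial Access", "Execution", "Lateral Movement",
    ],
})

# Ordered tactic sequence observed on each endpoint
sequences = (
    events.sort_values("timestamp")
    .groupby("agent_id")["mitre_tactic"]
    .agg(lambda tactics: " -> ".join(tactics))
)

# Recurring sequences across endpoints hint at a shared attacker playbook
common_playbooks = sequences.value_counts()
```

When the same sequence tops this tally across many agents, that is a strong signal of a common playbook rather than unrelated activity.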
<p><strong>Command-line trend analysis:</strong> Even within malicious command-line arguments, certain patterns or sequences can emerge. Monitoring the most frequently detected malicious arguments can give insights into favored attack tools or scripts.</p>
<p>Example:</p>
<pre><code># Most frequently detected malicious command lines
top_malicious_commands = df.groupby('malicious_command_line').size().sort_values(ascending=False).head(10)
</code></pre>
<p><strong>Process interaction trends:</strong> While individual parent-child process relationships can be malicious, spotting trends in these interactions can hint at widespread malware campaigns or attacker TTPs. For instance, if a large subset of endpoints is showing the same unusual process interaction, it might suggest a common threat.</p>
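<p>One way to quantify this, sketched here with assumed column names, is to count how many distinct hosts exhibit each parent/child process pair:</p>
<pre><code>import pandas as pd

df = pd.DataFrame({
    'host': ['h1', 'h2', 'h3', 'h1'],
    'parent_process': ['winword.exe', 'winword.exe', 'winword.exe', 'explorer.exe'],
    'child_process': ['powershell.exe', 'powershell.exe', 'powershell.exe', 'cmd.exe'],
})

# A pair seen on many distinct hosts may indicate a widespread campaign
pair_spread = (
    df.groupby(['parent_process', 'child_process'])['host']
    .nunique()
    .sort_values(ascending=False)
)
print(pair_spread)
</code></pre>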
<p><strong>Temporal behavior patterns:</strong> Just as with other types of data, the temporal aspect of endpoint behavioral data can be enlightening. Analyzing the frequency and timing of certain malicious behaviors can hint at attacker operational hours or campaign durations.</p>
<p>Example:</p>
<pre><code>import matplotlib.pyplot as plt

# Analyzing frequency of a specific malicious behavior over time
# (assumes 'timestamp' holds datetimes and 'count' holds alert counts)
monthly_data = df.pivot_table(index='timestamp', columns='tactic', values='count', aggfunc='sum').resample('M').sum()

ax = monthly_data[['execution', 'defense-evasion']].plot(kind='bar', stacked=False, figsize=(12, 6))

plt.title(&quot;Frequency of 'execution' and 'defense-evasion' Tactics Over Time&quot;)
plt.ylabel(&quot;Count&quot;)
ax.set_xticklabels([x.strftime('%B-%Y') for x in monthly_data.index])
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
</code></pre>
<p><img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image11.png" alt="Note: This image is from example data and not from the Global Threat Report" />
Note: This image is from example data and not from the Global Threat Report</p>
<p>By aggregating and analyzing endpoint behavioral data at a macro level, we don't just identify isolated threats but can spot waves, trends, and emerging patterns. This broader perspective empowers cybersecurity teams to anticipate, prepare for, and counter large-scale cyber threats more effectively.</p>
<p>While these are some examples of how to perform trend analysis, there is no right or wrong approach. Every analyst has their own preference or set of questions they or stakeholders may want to ask. Here are some additional questions or queries analysts may have for cybersecurity data when doing trend analysis.</p>
<ul>
<li>What are the top three tactics being leveraged by adversaries this quarter?</li>
<li>Which detection rules are triggering the most, and is there a common thread?</li>
<li>Are there any time-based patterns in endpoint alerts, possibly hinting at an adversary's timezone?</li>
<li>How have cloud alerts evolved with the migration of more services to the cloud?</li>
<li>Which malware families are becoming more prevalent, and what might be the cause?</li>
<li>Do the data patterns suggest any seasonality, like increased activities towards year-end?</li>
<li>Are there correlations between external events and spikes in cyber activities?</li>
<li>How does the weekday data differ from weekends in terms of alerts and attacks?</li>
<li>Which organizational assets are most targeted, and are their defenses up-to-date?</li>
<li>Are there any signs of internal threats or unusual behaviors among privileged accounts?</li>
</ul>
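<p>Many of these questions reduce to a short aggregation. For instance, the first one might be answered with a sketch like the following (the <code>timestamp</code> and <code>tactic</code> columns are assumptions):</p>
<pre><code>import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(
        ['2023-10-02', '2023-10-15', '2023-11-01',
         '2023-11-20', '2023-12-05', '2023-07-04']
    ),
    'tactic': ['execution', 'execution', 'defense-evasion',
               'persistence', 'execution', 'persistence'],
})

# Top three tactics for the most recent quarter in the dataset
quarter = df['timestamp'].dt.to_period('Q')
current = df[quarter == quarter.max()]
top_three = current['tactic'].value_counts().head(3)
print(top_three)
</code></pre>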
<p>Trend analysis in cybersecurity is a dynamic process. While we've laid down some foundational techniques and questions, there are myriad ways to approach this vast domain. Each analyst may have their preferences, tools, and methodologies, and that's perfectly fine. The essence lies in continuously evolving and adapting our approach while staying aware of the ever-changing threat landscape each exposed ecosystem faces.</p>
<h1>Reduction: Streamlining for clarity</h1>
<p>Having progressed through the initial stages of our data analysis, we now enter the next phase: reduction. This step is about refining and concentrating our comprehensive data into a more digestible and focused format.</p>
<p>Recap of the Analysis Journey So Far:</p>
<ul>
<li><strong>Extraction:</strong> The initial phase involved setting up our Google Cloud environment and selecting relevant datasets for our analysis.</li>
<li><strong>Pre-processing and transformation:</strong> At this stage, the data was extracted, processed, and transformed within our Colab notebooks, preparing it for detailed analysis.</li>
<li><strong>Trend analysis:</strong> This phase provided in-depth insights into cyber attack tactics, techniques, and malware, forming the core of our analysis.</li>
</ul>
<p>While the detailed data in our Colab Notebooks is extensive and informative for an analyst, it might be too complex for a broader audience. Therefore, the reduction phase focuses on distilling this information into a more concise and accessible form. The aim is to make the findings clear and understandable, ensuring that they can be effectively communicated and utilized across various departments or stakeholders.</p>
<h2>Selecting and aggregating key data points</h2>
<p>In order to effectively communicate our findings, we must tailor the presentation to the audience's needs. Not every stakeholder requires the full depth of collected data; many prefer a summarized version that highlights the most actionable points. This is where data selection and aggregation come into play, focusing on the most vital elements and presenting them in an accessible format.</p>
<p>Here's an example of how to use Pandas to aggregate and condense a dataset, focusing on key aspects of endpoint behavior:</p>
<pre><code>required_endpoint_behavior_cols = ['rule_name', 'host_os_type', 'tactic_name', 'technique_name']

reduced_behavior_df = (
    df.groupby(required_endpoint_behavior_cols).size()
    .reset_index(name='count')
    .sort_values(by=&quot;count&quot;, ascending=False)
    .reset_index(drop=True)
)

columns = {
    'rule_name': 'Rule Name',
    'host_os_type': 'Host OS Type',
    'tactic_name': 'Tactic',
    'technique_name': 'Technique',
    'count': 'Alerts'
}

reduced_behavior_df = reduced_behavior_df.rename(columns=columns)
</code></pre>
<p>One remarkable aspect of this code and process is the flexibility it offers. For instance, we can group our data by various data points tailored to our needs. Interested in identifying popular tactics used by adversaries? Group by the MITRE ATT&amp;CK tactic. Want to shed light on masquerading malicious binaries? Revisit extraction to add more Elastic Common Schema (ECS) fields such as file path, filter on Defense Evasion, and aggregate to reveal the commonly trodden paths. This approach ensures we create datasets that are both enlightening and not overwhelmingly rich, tailor-made for stakeholders who wish to understand the origins of our analysis.</p>
<p>This process involves grouping the data by relevant categories such as rule name, host OS type, and MITRE ATT&amp;CK tactics and techniques and then counting the occurrences. This method helps in identifying the most prevalent patterns and trends in the data.</p>
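<p>As a sketch of the Defense Evasion variation mentioned above (the <code>tactic_name</code> and <code>file_path</code> columns are illustrative assumptions), the same grouping pattern reveals commonly abused file paths:</p>
<pre><code>import pandas as pd

df = pd.DataFrame({
    'tactic_name': ['Defense Evasion', 'Defense Evasion',
                    'Execution', 'Defense Evasion'],
    'file_path': ['C:\\Windows\\Temp\\a.exe', 'C:\\Windows\\Temp\\a.exe',
                  'C:\\Users\\x\\b.exe', 'C:\\ProgramData\\c.exe'],
})

# Filter to the tactic of interest, then aggregate on file path
top_paths = (
    df[df['tactic_name'].eq('Defense Evasion')]
    .groupby('file_path').size()
    .sort_values(ascending=False)
)
print(top_paths)
</code></pre>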
<p><img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image5.png" alt="Diagram example of data aggregation to obtain reduced dataset" />
Diagram example of data aggregation to obtain reduced dataset</p>
<h2>Exporting reduced data to Google Sheets for accessibility</h2>
<p>The reduced data, now stored as a dataframe in memory, is ready to be exported. We use Google Sheets as the platform for sharing these insights because of its wide accessibility and user-friendly interface. The process of exporting data to Google Sheets is straightforward and efficient, thanks to the integration with Google Cloud services.</p>
<p>Here's an example of how the data can be uploaded to Google Sheets using Python from our Colab notebook:</p>
<pre><code>import google.auth
import gspread
from google.colab import auth
from gspread_dataframe import set_with_dataframe

auth.authenticate_user()
credentials, project = google.auth.default()
gc = gspread.authorize(credentials)
workbook = gc.open_by_key(&quot;SHEET_ID&quot;)
behavior_sheet_name = 'NAME_OF_TARGET_SHEET'
endpoint_behavior_worksheet = workbook.worksheet(behavior_sheet_name)
set_with_dataframe(endpoint_behavior_worksheet, reduced_behavior_df)
</code></pre>
<p>With a few simple lines of code, we have effectively transferred our data analysis results to Google Sheets. This approach is widely used due to its accessibility and ease of use. However, there are multiple other methods to present data, each suited to different requirements and audiences. For instance, some might opt for a platform like <a href="https://cloud.google.com/looker?hl=en">Looker</a> to present the processed data in a more dynamic dashboard format. This method is particularly useful for creating interactive and visually engaging presentations of data. It ensures that even stakeholders who may not be familiar with the technical aspects of data analysis, such as those working in Jupyter Notebooks, can easily understand and derive value from the insights.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image7.png" alt="Results in Google Sheet" /></p>
<p>This streamlined process of data reduction and presentation can be applied to different types of datasets, such as cloud SIEM alerts, endpoint behavior alerts, or malware alerts. The objective remains the same: to simplify and concentrate the data for clear and actionable insights.</p>
<h1>Presentation: Showcasing the insights</h1>
<p>After meticulously refining our datasets, we now focus on the final stage: the presentation. Here we take our datasets, now neatly organized in platforms like Google Sheets or Looker, and transform them into a format that is both informative and engaging.</p>
<h2>Pivot tables for in-depth analysis</h2>
<p>Using pivot tables, we can create a comprehensive overview of our trend analysis findings. These tables allow us to display data in a multi-dimensional manner, offering insights into various aspects of cybersecurity, such as prevalent MITRE ATT&amp;CK tactics, chosen techniques, and preferred malware families.</p>
<p>Our approach to data visualization involves:</p>
<ul>
<li><strong>Broad overview with MITRE ATT&amp;CK tactics:</strong> Starting with a general perspective, we use pivot tables to overview the different tactics employed in cyber threats.</li>
<li><strong>Detailed breakdown:</strong> From this panoramic view, we delve deeper, creating separate pivot tables for each popular tactic and then branching out into detailed analyses for each technique and specific detection rule.</li>
</ul>
<p>This methodical process helps to uncover the intricacies of detection logic and alerts, effectively narrating the story of the cyber threat landscape.</p>
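<p>Before building the sheet-based pivot tables, the broad overview can be prototyped in pandas; the column names below are assumptions for illustration:</p>
<pre><code>import pandas as pd

df = pd.DataFrame({
    'tactic': ['execution', 'execution', 'defense-evasion', 'execution'],
    'technique': ['T1059', 'T1059', 'T1036', 'T1204'],
    'host_os_type': ['windows', 'linux', 'windows', 'windows'],
})

# Alert counts per tactic, broken out by host OS
overview = pd.pivot_table(
    df, index='tactic', columns='host_os_type',
    values='technique', aggfunc='count', fill_value=0,
)
print(overview)
</code></pre>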
<p><img src="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/image1.png" alt="Diagram showcasing aggregations funnel into contextual report information" />
Diagram showcasing aggregations funnel into contextual report information</p>
<p><strong>Accessibility across audiences:</strong> Our data presentations are designed to cater to a wide range of audiences, from those deeply versed in data science to those who prefer a more straightforward understanding. The Google Workspace ecosystem facilitates the sharing of these insights, allowing pivot tables, reduced datasets, and other elements to be easily accessible to all involved in the report-making process.</p>
<p><strong>Integrating visualizations into reports:</strong> When crafting a report, for example, in Google Docs, the integration of charts and tables from Google Sheets is seamless. This integration ensures that any modifications in the datasets or pivot tables are easily updated in the report, maintaining the efficiency and coherence of the presentation.</p>
<p><strong>Tailoring the presentation to the audience:</strong> The presentation of data insights is not just about conveying information; it's about doing so in a visually appealing and digestible manner. For a more tech-savvy audience, an interactive Colab Notebook with dynamic charts and functions may be ideal. In contrast, for marketing or design teams, a well-designed dashboard in Looker might be more appropriate. The key is to ensure that the presentation is clear, concise, and visually attractive, tailored to the specific preferences and needs of the audience.</p>
<h1>Conclusion: Reflecting on the data analysis journey</h1>
<p>As we conclude, it's valuable to reflect on the territory we've navigated in analyzing cyber threat data. This journey involved several key stages, each contributing significantly to our final insights.</p>
<h2>Journey through Google's Cloud ecosystem</h2>
<p>Our path took us through several Google Cloud services, including GCP, GCE, Colab Notebooks, and Google Workspace. Each played a pivotal role:</p>
<ul>
<li><strong>Data exploration:</strong> We began with a set of cyber-related questions we wanted to answer and explored the vast datasets available to us. In this blog, we focused solely on telemetry available in BigQuery.</li>
<li><strong>Data extraction:</strong> We then extracted the raw data, utilizing BigQuery to efficiently handle large volumes. Extraction occurred both in BigQuery and from within our Colab notebooks.</li>
<li><strong>Data wrangling and processing:</strong> We leveraged the power of Python and the pandas library to clean, aggregate, and refine this data, much like a chef skillfully preparing ingredients.</li>
<li><strong>Trend analysis:</strong> We performed trend analysis on our reformed datasets with several methodologies to glean valuable insights into adversary tactics, techniques, and procedures over time.</li>
<li><strong>Reduction:</strong> Off the back of our trend analysis, we aggregated our datasets by targeted data points in preparation for presentation to stakeholders and peers.</li>
<li><strong>Transition to presentation:</strong> The ease of moving from data analytics to presentation within a web browser highlighted the agility of our tools, facilitating a seamless workflow.</li>
</ul>
<h2>Modularity and flexibility in workflow</h2>
<p>An essential aspect of our approach was the modular nature of our workflow. Each phase, from data extraction to presentation, featured interchangeable components in the Google Cloud ecosystem, allowing us to tailor the process to specific needs:</p>
<ul>
<li><strong>Versatile tools:</strong> Google Cloud Platform offered a diverse range of tools and options, enabling flexibility in data storage, analysis, and presentation.</li>
<li><strong>Customized analysis path:</strong> Depending on the specific requirements of our analysis, we could adapt and choose different tools and methods, ensuring a tailored approach to each dataset.</li>
<li><strong>Authentication and authorization:</strong> Because our entities were housed in the Google Cloud ecosystem, access to different tools, sites, data, and more was painless, ensuring smooth transitions between services.</li>
</ul>
<h2>Orchestration and tool synchronization</h2>
<p>The synergy between our technical skills and the chosen tools was crucial. This harmonization ensured that the analytical process was not only effective for this project but also set the foundation for more efficient and insightful future analyses. The tools were used to augment our capabilities, keeping the focus on deriving meaningful insights rather than getting entangled in technical complexities.</p>
<p>In summary, this journey through data analysis emphasized the importance of a well-thought-out approach, leveraging the right tools and techniques, and the adaptability to meet the demands of cyber threat data analysis. The end result is not just a set of findings but a refined methodology that can be applied to future data analysis endeavors in the ever-evolving field of cybersecurity.</p>
<h1>Call to Action: Embarking on your own data analytics journey</h1>
<p>Your analytical workspace is ready! What innovative approaches or experiences with Google Cloud or other data analytics platforms can you bring to the table? The realm of data analytics is vast and varied, and although each analyst brings a unique touch, the underlying methods and principles are universal.</p>
<p>The objective is not solely to excel in your current analytical projects but to continually enhance and adapt your techniques. This ongoing refinement ensures that your future endeavors in data analysis will be even more productive, enlightening, and impactful. Dive in and explore the world of data analytics with Google Cloud!</p>
<p>We encourage feedback and engagement on this topic! Feel free to engage with us in Elastic’s public <a href="https://elasticstack.slack.com/archives/C018PDGK6JU">#security</a> Slack channel.</p>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/security-labs/assets/images/google-cloud-for-cyber-data-analytics/photo-edited-12.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Streamlining ES|QL Query and Rule Validation: Integrating with GitHub CI]]></title>
            <link>https://www.elastic.co/security-labs/streamlining-esql-query-and-rule-validation</link>
            <guid>streamlining-esql-query-and-rule-validation</guid>
            <pubDate>Fri, 17 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[ES|QL is Elastic's new piped query language. Taking full advantage of this new feature, Elastic Security Labs walks through how to run validation of ES|QL rules for the Detection Engine.]]></description>
            <content:encoded><![CDATA[<p>One of the amazing, recently premiered <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/release-highlights.html">8.11.0 features</a>, is the Elasticsearch Query Language (<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">ES|QL</a>). As highlighted in an earlier <a href="https://www.elastic.co/blog/elasticsearch-query-language-esql">post by Costin Leau</a>, it’s a full-blown, specialized query and compute engine for Elasitcsearch. Now that it’s in technical preview, we wanted to share some options to <em>validate</em> your ES|QL queries. This overview is for engineers new to ES|QL. Whether you’re searching for insights in Kibana or investigating security threats in <a href="https://www.elastic.co/guide/en/security/current/timelines-ui.html">Timelines</a>, you’ll see how this capability is seamlessly interwoven throughout Elastic.</p>
<h2>ES|QL validation basics ft. Kibana &amp; Elasticsearch</h2>
<p>If you want to quickly validate a single query, or feel comfortable manually testing queries one-by-one, the Elastic Stack UI is all you need. After navigating to the Discover tab in Kibana, click on the &quot;<strong>Try ES|QL</strong>&quot; Technical Preview button in the Data View dropdown to load the query pane. You can also grab sample queries from the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/esql-examples.html">ES|QL Examples</a> to get up and running. Introducing non-<a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a> fields will immediately highlight errors, prioritizing syntax errors first, then unknown column errors.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image7.png" alt="" /></p>
<p>In this example, there are two syntax errors that are highlighted:</p>
<ul>
<li>the invalid syntax error on the input <code>wheres</code> which should be <code>where</code> and</li>
<li>the unknown column <code>process.worsking_directory</code>, which should be <code>process.working_directory</code>.</li>
</ul>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image3.png" alt="" /></p>
<p>After resolving the syntax error in this example, you’ll observe the Unknown column errors. Here are a couple of reasons this error may appear:</p>
<ul>
<li><strong>Fix Field Name Typos</strong>: Sometimes you simply need to fix the name as suggested in the error; consult the ECS or any integration schemas and confirm the fields are correct</li>
<li><strong>Add Missing Data</strong>: If you’re confident the fields are correct, adding data to your stack will populate the columns</li>
<li><strong>Update Mapping</strong>: You can configure <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.11/mapping.html">Mappings</a> to set explicit fields, or add new fields to an existing data stream or index using the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html">Update Mapping API</a></li>
</ul>
<h2>ES|QL warnings</h2>
<p>Not all issues will appear as errors; in some cases you’re presented with warnings and a dropdown list. Hard failures (i.e. errors) imply that the rule cannot execute, whereas warnings indicate that the rule can run, but its functions may be degraded.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image6.png" alt="" /></p>
<p>When utilizing broad ES|QL queries that span multiple indices, such as <code>logs-* | limit 10</code>, there might be instances where certain fields fail to appear in the results. This is often due to the fields being undefined in the indexed data, or not yet supported by ES|QL. In cases where the expected fields are not retrieved, it's typically a sign that the data was ingested into Elasticsearch without these fields being indexed, as per the established mappings. Instead of causing the query to fail, ES|QL handles this by returning &quot;null&quot; for the unavailable fields, serving as a warning that something in the query did not execute as expected. This approach ensures the query still runs, distinguishing it from a hard failure, which occurs when the query cannot execute at all, such as when a non-existent field is referenced.</p>
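<p>Programmatically, this null-for-missing-fields behavior can be detected from a query API response. The helper below is a sketch: it assumes the response shape of <code>columns</code> (name/type pairs) and <code>values</code> (rows), and the sample response is fabricated for illustration:</p>
<pre><code>def null_only_columns(response):
    '''Return names of columns that came back entirely null.'''
    names = [col['name'] for col in response['columns']]
    rows = response['values']
    return [
        name
        for i, name in enumerate(names)
        if rows and all(row[i] is None for row in rows)
    ]

sample = {
    'columns': [{'name': 'process.name', 'type': 'keyword'},
                {'name': 'custom.field', 'type': 'keyword'}],
    'values': [['bash', None], ['zsh', None]],
}
print(null_only_columns(sample))  # ['custom.field']
</code></pre>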
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image12.png" alt="" /></p>
<p>There are also helpful performance warnings that may appear. Providing a <code>LIMIT</code> parameter to the query will help address performance warnings. Note this example highlights that there is a default limit of 500 events returned. This limit may significantly increase once this feature is generally available.</p>
<h2>Security</h2>
<p>In an investigative workflow, security practitioners prefer to iteratively hunt for threats, which may encompass manually testing, refining, and tuning a query in the UI. Conveniently, security analysts and engineers can natively leverage ES|QL in timelines, with no need to interrupt workflows by pivoting back and forth to a different view in Kibana. You’ll receive the same errors and warnings in the same security component, which shows Elasticsearch feedback under the hood.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image1.png" alt="" /></p>
<p>In some components, you will receive additional feedback based on the context of where ES|QL is implemented. One scenario is when you create an ES|QL rule using the create new rule feature under the Detection Rules (SIEM) tab.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image8.png" alt="" /></p>
<p>For example, this query could easily be converted to an <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/eql.html">EQL</a> or <a href="https://www.elastic.co/guide/en/kibana/current/kuery-query.html">KQL</a> query as it does not leverage powerful features of ES|QL like statistics, frequency analysis, or parsing unstructured data. If you want to learn more about the benefits of queries using ES|QL check out this <a href="https://www.elastic.co/blog/elasticsearch-query-language-esql">blog by Costin</a>, which covers performance boosts. In this case, we must add <code>[metadata _id, _version, _index]</code> to the query, which informs the UI which components to return in the results.</p>
<h2>API calls? Of course!</h2>
<p>Prior to this section, all of the examples referenced creating ES|QL queries and receiving feedback directly from the UI. For illustrative purposes, the following examples leverage Dev Tools, but these calls are easily migrated to cURL commands or to any language or tool of your choice that can send an HTTP request.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image4.png" alt="" /></p>
<p>Here is the same query as previously shown throughout other examples, sent via a POST request to the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-query-api.html">query API</a> with a valid query.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image10.png" alt="" /></p>
<p>As expected, if you supply an invalid query, you’ll receive similar feedback observed in the UI. In this example, we’ve also supplied the <code>?error_trace</code> flag which can provide the stack trace if you need additional context for why the query failed validation.</p>
<p>As you can imagine, we can use the API to programmatically validate ES|QL queries. You can also still use the <a href="https://www.elastic.co/guide/en/kibana/current/create-rule-api.html">Create rule</a> Kibana API, which requires a bit more metadata associated with a security rule. However, if you want to only validate a query, the <code>_query</code> API comes in handy. From here you can use the <a href="https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html">Elasticsearch Python Client</a> to connect to your stack and validate queries.</p>
<pre><code>from elasticsearch import Elasticsearch

client = Elasticsearch(...)

data = {
    &quot;query&quot;: &quot;&quot;&quot;
        from logs-endpoint.events.*
        | keep host.os.type, process.name, process.working_directory, event.type, event.action
        | where host.os.type == &quot;linux&quot; and process.name == &quot;unshadow&quot; and event.type == &quot;start&quot;
          and event.action in (&quot;exec&quot;, &quot;exec_event&quot;)
    &quot;&quot;&quot;
}

# Execute the query
headers = {&quot;Content-Type&quot;: &quot;application/json&quot;, &quot;Accept&quot;: &quot;application/json&quot;}
response = client.perform_request(
    &quot;POST&quot;, &quot;/_query&quot;, params={&quot;pretty&quot;: True}, headers=headers, body=data
)
</code></pre>
<h2>Leverage the grammar</h2>
<p>One of the best parts of Elastic developing in the open is that the <a href="https://github.com/elastic/elasticsearch/tree/main/x-pack/plugin/esql/src/main/antlr">ANTLR ES|QL grammar</a> is also available.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image5.png" alt="" /></p>
<p>If you’re comfortable with <a href="https://www.antlr.org">ANTLR</a>, you can also download the latest JAR to build a lexer and parser.</p>
<pre><code>pip install antlr4-tools # for antlr4
git clone git@github.com:elastic/elasticsearch.git # large repo
cd elasticsearch/x-pack/plugin/esql/src/main/antlr # navigate to grammar
antlr4 -Dlanguage=Python3 -o build EsqlBaseLexer.g4 # generate lexer
antlr4 -Dlanguage=Python3 -o build EsqlBaseParser.g4 # generate parser
</code></pre>
<p>This process requires more lifting to get ES|QL validation started, but you’ll at least have a parse tree object that provides more granular control and access to the parsed fields.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image13.png" alt="" /></p>
<p>However, as you can see the listeners are stubs, which means you’ll need to build in semantics <em>manually</em> if you want to go this route.</p>
<h2>The security rule GitHub CI use case</h2>
<p>For our internal Elastic EQL and KQL query rule validation, we utilize the parsed abstract syntax tree (AST) objects of our queries to perform nuanced semantic validation across multiple stack versions. For example, having the AST allows us to validate proper field usage, verify new features are not used in older stack versions before being introduced, or even ensure related integrations are built based on the datastreams used in the query. Fundamentally, local validation allows us to streamline a broader range of support for many stack features and versions. If you’re interested in seeing more of the design and rigorous validation that we can do with the AST, check out our <a href="https://github.com/elastic/detection-rules/tree/main">detection-rules repo</a>.</p>
<p>If you do not need granular access to the specific parsed tree objects and do not need to control the semantics of ES|QL validation, then out-of-the-box APIs may be all you need to validate queries. In this use case, we want to validate security detection rules using continuous integration. Managing detection rules through systems like GitHub helps garner all the benefits of version control, like tracking rule changes, receiving feedback via pull requests, and more. Conceptually, rule authors should be able to create these rules (which contain ES|QL queries) locally and exercise the git rule development lifecycle.</p>
<p>CI checks help to ensure queries still pass ES|QL validation without having to manually check the query in the UI. Based on the examples shown thus far, you have to either stand up a persistent stack and validate queries against the API, or build a parser implementation based on the available grammar outside of the Elastic stack.</p>
<p>One approach to using a short-lived Elastic stack versus leveraging a managed persistent stack is to use the <a href="https://github.com/peasead/elastic-container">Elastic Container Project (ECP)</a>. As advertised, this project will:</p>
<p><em>Stand up a 100% containerized Elastic stack, TLS secured, with Elasticsearch, Kibana, Fleet, and the Detection Engine all pre-configured, enabled, and ready to use, within minutes.</em></p>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image11.png" alt="" /></p>
<p>With a combination of:</p>
<ul>
<li>Elastic Containers (e.g. ECP)</li>
<li>CI (e.g. Github Action Workflow)</li>
<li>ES|QL rules</li>
<li>Automation Foo (e.g. python &amp; bash scripts)</li>
</ul>
<p>You can validate ES|QL rules via CI against the <em>latest stack version</em> relatively easily, but there are some nuances involved in this approach.</p>
<p><img src="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/image2.gif" alt="" /></p>
<p>Feel free to check out the sample <a href="https://gist.github.com/Mikaayenson/7fa8f908ab7e8466178679a9a0cd9ecc">GitHub action workflow</a> if you’re interested in a high-level overview of how it can be implemented.</p>
<p><strong>Note:</strong> if you're interested in using the GitHub action workflow, check out their documentation on using GitHub <a href="https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions">secrets in Actions</a> and <a href="https://docs.github.com/en/actions/quickstart">setting up Action workflows</a>.</p>
<h2>CI nuances</h2>
<ol>
<li>Any custom configuration needs to be scripted away (e.g. setting up additional policies, <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/match-enrich-policy-type.html">enrichments</a>, etc.). In our POC, we created a step and bash script that executed a series of POST requests to our temporary CI Elastic Stack, which created the new enrichments used in our detection rules.</li>
</ol>
<pre><code>- name: Add Enrich Policy
  env:
    ELASTICSEARCH_SERVER: &quot;https://localhost:9200&quot;
    ELASTICSEARCH_USERNAME: &quot;elastic&quot;
    ELASTICSEARCH_PASSWORD: &quot;${{ secrets.PASSWORD }}&quot;
  run: |
    set -x
    chmod +x ./add_enrich.sh
    bash ./add_enrich.sh
</code></pre>
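<p>The <code>add_enrich.sh</code> script itself is not shown above. As a rough sketch, it could wrap a couple of <code>curl</code> calls against the enrich policy API; the policy name, source index, and fields below are illustrative placeholders, not the ones from our POC.</p>

```shell
#!/usr/bin/env bash
# add_enrich.sh -- sketch of creating and executing an enrich policy in CI.
# The "hosts-policy" name, source index, and field names are hypothetical.
set -euo pipefail

ES="${ELASTICSEARCH_SERVER:-https://localhost:9200}"
AUTH="${ELASTICSEARCH_USERNAME:-elastic}:${ELASTICSEARCH_PASSWORD:-changeme}"

# Body for a match-type enrich policy (see the enrich policy docs linked above)
policy_body() {
  cat <<'JSON'
{
  "match": {
    "indices": "ci-hosts-source",
    "match_field": "host.name",
    "enrich_fields": ["host.risk_score"]
  }
}
JSON
}

# PUT the policy, then execute it so the enrich index gets built
create_and_execute() {
  curl -sk -u "$AUTH" -X PUT "$ES/_enrich/policy/hosts-policy" \
    -H 'Content-Type: application/json' -d "$(policy_body)"
  curl -sk -u "$AUTH" -X POST "$ES/_enrich/policy/hosts-policy/_execute"
}
```

<p>In CI, the script would call <code>create_and_execute</code> once the temporary stack's health checks pass.</p>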
<ol start="2">
<li>
<p>Without data in our freshly deployed CI Elastic Stack, there will be many <code>Unknown Column</code> issues, as previously mentioned. One approach to address this is to build indices with the proper mappings for the queries to match. For example, if a query searches the index <code>logs-endpoint.events.*</code>, create an index called <code>logs-endpoint.events.ci</code> with the proper mappings from the integration used in the query.</p>
</li>
<li>
<p>Once the temporary stack is configured, you’ll need extra logic to iterate over all the rules and validate them using the <code>_query</code> API. For example, you can create a unit test that loops over every rule. In our detection-rules repo we do this today by leveraging <code>RuleCollection.default()</code>, which loads all rules, but here is a snippet that loads only ES|QL rules.</p>
</li>
</ol>
<pre><code># tests/test_all_rules.py
import os
import re
import unittest
from pathlib import Path

# RuleCollection and DEFAULT_RULES_DIR come from the detection-rules package

class TestESQLRules:
    &quot;&quot;&quot;Test ESQL Rules.&quot;&quot;&quot;

    @unittest.skipIf(not os.environ.get(&quot;DR_VALIDATE_ESQL&quot;),
         &quot;Test only run when DR_VALIDATE_ESQL environment variable set.&quot;)
    def test_environment_variables_set(self):
        collection = RuleCollection()

        # Iterate over all .toml files in the given directory recursively
        for rule in Path(DEFAULT_RULES_DIR).rglob('*.toml'):
            # Read file content
            content = rule.read_text(encoding='utf-8')
            # Search for the pattern
            if re.search(r'language = &quot;esql&quot;', content):
                print(f&quot;Validating {str(rule)}&quot;)
                collection.load_file(rule)
</code></pre>
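<p>As a side note on the <code>Unknown Column</code> mitigation from step 2, stubbing out an empty index with the right mappings can be scripted as well. Here is a minimal sketch: the index name matches the earlier example, but the field list is illustrative and would in practice be derived from the integration's field definitions.</p>

```python
# Build an empty CI index whose mappings cover the columns an ES|QL query uses.
# CI_FIELDS is an illustrative subset, not a full integration mapping.
CI_FIELDS = {
    "@timestamp": "date",
    "event.action": "keyword",
    "process.name": "keyword",
    "process.command_line": "wildcard",
}

def build_mappings(fields: dict) -> dict:
    """Expand dotted field names into a nested Elasticsearch mappings body."""
    props: dict = {}
    for dotted, es_type in fields.items():
        parts = dotted.split(".")
        node = props
        # Walk/create intermediate objects for each dotted segment
        for part in parts[:-1]:
            node = node.setdefault(part, {}).setdefault("properties", {})
        node[parts[-1]] = {"type": es_type}
    return {"mappings": {"properties": props}}

def create_ci_index(es_client, index: str = "logs-endpoint.events.ci") -> None:
    """es_client is an elasticsearch.Elasticsearch instance."""
    es_client.indices.create(index=index, **build_mappings(CI_FIELDS))
```

<p>With the index in place, a <code>FROM logs-endpoint.events.* | ... | LIMIT 0</code> query resolves its columns even though the index holds no documents.</p>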
<p>Each rule would run through a validator method once the file is loaded with <code>load_file</code>.</p>
<pre><code># detection_rules/rule_validator.py
import os

from elasticsearch import Elasticsearch
from semver import Version

# QueryValidator, QueryRuleData, RuleMeta, and ValidationError are defined
# elsewhere in the detection-rules codebase

class ESQLValidator(QueryValidator):
    &quot;&quot;&quot;Specific fields for ESQL query event types.&quot;&quot;&quot;

    def validate(self, data: 'QueryRuleData', meta: RuleMeta) -&gt; None:
        &quot;&quot;&quot;Validate an ESQL query while checking TOMLRule.&quot;&quot;&quot;
        if not os.environ.get(&quot;DR_VALIDATE_ESQL&quot;):
            return

        if Version.parse(meta.min_stack_version) &lt; Version.parse(&quot;8.11.0&quot;):
            raise ValidationError(f&quot;Rule minstack must be at least 8.11.0 {data.rule_id}&quot;)

        client = Elasticsearch(...)
        client.info()
        client.perform_request(&quot;POST&quot;, &quot;/_query&quot;, params={&quot;pretty&quot;: True},
                               headers={&quot;accept&quot;: &quot;application/json&quot;, 
                                        &quot;content-type&quot;: &quot;application/json&quot;},
                               body={&quot;query&quot;: f&quot;{self.query} | LIMIT 0&quot;})
</code></pre>
<p>As highlighted earlier, we can <code>POST</code> to the query API and validate, given the credentials that were set as GitHub action secrets and passed to the validation as environment variables. Note that the <code>LIMIT 0</code> is intentional: the query performs validation without returning any data. Finally, the single CI step would be a bash call to run the unit tests (e.g. <code>pytest tests/test_all_rules.py::TestESQLRules</code>).</p>
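<p>Putting it together, that final CI step might look something like the following sketch; the secret name mirrors the enrich step shown earlier.</p>

```yaml
- name: Validate ES|QL Rules
  env:
    DR_VALIDATE_ESQL: "true"
    ELASTICSEARCH_SERVER: "https://localhost:9200"
    ELASTICSEARCH_USERNAME: "elastic"
    ELASTICSEARCH_PASSWORD: "${{ secrets.PASSWORD }}"
  run: pytest tests/test_all_rules.py::TestESQLRules
```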
<ol start="4">
<li>Finally, CI leveraging containers may not scale well when validating many rules against multiple Elastic Stack versions and configurations, especially if you would like to test on a per-commit basis. In our POC, deploying a single stack took slightly over five minutes; this could increase or decrease significantly depending on your CI setup.</li>
</ol>
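<p>On the scaling point, one common mitigation is a GitHub Actions matrix, which at least parallelizes the stack versions under test. A hypothetical sketch, where <code>start_stack.sh</code> stands in for whatever script deploys your containerized stack:</p>

```yaml
jobs:
  validate-esql:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        stack-version: ["8.11.0", "8.12.0"]
    steps:
      - uses: actions/checkout@v4
      - name: Start Elastic Stack ${{ matrix.stack-version }}
        run: ./start_stack.sh "${{ matrix.stack-version }}"  # hypothetical helper
      - name: Validate ES|QL Rules
        env:
          DR_VALIDATE_ESQL: "true"
        run: pytest tests/test_all_rules.py::TestESQLRules
```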
<h2>Conclusion</h2>
<p>The Elasticsearch Query Language (ES|QL) is a specialized query and compute engine for Elasticsearch, now in technical preview. It integrates seamlessly across Elastic features like Kibana and Timelines, and it offers several options for validating ES|QL queries: users can validate queries through the Elastic Stack UI or via API calls, receiving immediate feedback on syntax or column errors.</p>
<p>Additionally, ES|QL's ANTLR grammar is <a href="https://github.com/elastic/elasticsearch/tree/d5f5d0908ff7d1bfb3978e4c57aa6ff517f6ed29/x-pack/plugin/esql/src/main/antlr">available</a> for those who prefer a more hands-on approach to building lexers and parsers. We’re exploring ways to validate ES|QL queries in an automated fashion and now it’s your turn. Just know that we’re not done exploring, so check out ES|QL and let us know if you have ideas! We’d love to hear how you plan to use it within the stack natively or in CI.</p>
<p>We’re always interested in hearing use cases and workflows like these, so as always, reach out to us via <a href="https://github.com/elastic/detection-rules/issues">GitHub issues</a>, chat with us in our <a href="http://ela.st/slack">community Slack</a>, and ask questions in our <a href="https://discuss.elastic.co/c/security/endpoint-security/80">Discuss forums</a>.</p>
<p>Check out these additional resources to learn more about ES|QL:</p>
<ul>
<li>Learn everything about <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">ES|QL</a></li>
<li>Check out the 8.11.0 release blog <a href="https://www.elastic.co/blog/whats-new-elasticsearch-platform-8-11-0">introducing ES|QL</a></li>
</ul>
]]></content:encoded>
            <category>security-labs</category>
            <enclosure url="https://www.elastic.co/security-labs/assets/images/streamlining-esql-query-and-rule-validation/photo-edited-01.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>