Terrance DeJesusEric Forte

Google Cloud for Cyber Data Analytics

Navigating the seas of cyber threat data with Google Cloud

Google Cloud for Cyber Data Analytics

Introduction

In today's digital age, the sheer volume of data generated by devices and systems can be both a challenge and an opportunity for security practitioners. Analyzing a high magnitude of data to craft valuable or actionable insights on cyber attack trends requires precise tools and methodologies.

Before you delve into the task of data analysis, you might find yourself asking:

  • What specific questions am I aiming to answer, and do I possess the necessary data?
  • Where is all the pertinent data located?
  • How can I gain access to this data?
  • Upon accessing the data, what steps are involved in understanding and organizing it?
  • Which tools are most effective for extracting, interpreting, or visualizing the data?
  • Should I analyze the raw data immediately or wait until it has been processed?
  • Most crucially, what actionable insights can be derived from the data?

If these questions resonate with you, you're on the right path. Welcome to the world of Google Cloud, where we'll address these queries and guide you through the process of creating a comprehensive report.

Our approach will include several steps in the following order:

Exploration: We start by thoroughly understanding the data at our disposal. This phase involves identifying potential insights we aim to uncover and verifying the availability of the required data.

Extraction: Here, we gather the necessary data, focusing on the most relevant and current information for our analysis.

Pre-processing and transformation: At this stage, we prepare the data for analysis. This involves normalizing (cleaning, organizing, and structuring) the data to ensure its readiness for further processing.

Trend analysis: The majority of our threat findings and observations derive from this effort. We analyze the processed data for patterns, trends, and anomalies. Techniques such as time series analysis and aggregation are employed to understand the evolution of threats over time and to highlight significant cyber attacks across various platforms.

Reduction: In this step, we distill the data to its most relevant elements, focusing on the most significant and insightful aspects.

Presentation: The final step is about presenting our findings. Utilizing tools from Google Workspace, we aim to display our insights in a clear, concise, and visually-engaging manner.

Conclusion: Reflecting on this journey, we'll discuss the importance of having the right analytical tools. We'll highlight how Google Cloud Platform (GCP) provides an ideal environment for analyzing cyber threat data, allowing us to transform raw data into meaningful insights.

Exploration: Determining available data

Before diving into any sophisticated analyses, it's necessary to prepare by establishing an understanding of the data landscape we intend to study.

Here's our approach:

  1. Identifying available data: The first step is to ascertain what data is accessible. This could include malware phenomena, endpoint anomalies, cloud signals, etc. Confirming the availability of these data types is essential.
  2. Locating the data stores: Determining the exact location of our data. Knowing where our data resides – whether in databases, data lakes, or other storage solutions – helps streamline the subsequent analysis process.
  3. Accessing the data: It’s important to ensure that we have the necessary permissions or credentials to access the datasets we need. If we don’t, attempting to identify and request access from the resource owner is necessary.
  4. Understanding the data schema: Comprehending the structure of our data is vital. Knowing the schema aids in planning the analysis process effectively.
  5. Evaluating data quality: Just like any thorough analysis, assessing the quality of the data is crucial. We check whether the data is segmented and detailed enough for a meaningful trend analysis.

This phase is about ensuring that our analysis is based on solid and realistic foundations. For a report like the Global Threat Report, we rely on rich and pertinent datasets such as:

  • Cloud signal data: This includes data from global Security Information and Event Management (SIEM) alerts, especially focusing on cloud platforms like AWS, GCP, and Azure. This data is often sourced from public detection rules.
  • Endpoint alert data: Data collected from the global Elastic Defend alerts, incorporating a variety of public endpoint behavior rules.
  • Malware data: This involves data from global Elastic Defend alerts, enriched with MalwareScore and public YARA rules.

Each dataset is categorized and enriched for context with frameworks like MITRE ATT&CK, Elastic Stack details, and customer insights. Storage solutions of Google Cloud Platform, such as BigQuery and Google Cloud Storage (GCS) buckets, provide a robust infrastructure for our analysis.

It's also important to set a data “freshness” threshold, excluding data not older than 365 days for an annual report, to ensure relevance and accuracy.

Lastly, remember to choose data that offers an unbiased perspective. Excluding or including internal data should be an intentional, strategic decision based on its relevance to your visibility.

In summary, selecting the right tools and datasets is fundamental to creating a comprehensive and insightful analysis. Each choice contributes uniquely to the overall effectiveness of the data analysis, ensuring that the final insights are both valuable and impactful.

Extraction: The first step in data analysis

Having identified and located the necessary data, the next step in our analytical journey is to extract this data from our storage solutions. This phase is critical, as it sets the stage for the in-depth analysis that follows.

Data extraction tools and techniques

Various tools and programming languages can be utilized for data extraction, including Python, R, Go, Jupyter Notebooks, and Looker Studio. Each tool offers unique advantages, and the choice depends on the specific needs of your analysis.

In our data extraction efforts, we have found the most success from a combination of BigQuery, Colab Notebooks, buckets, and Google Workspace to extract the required data. Colab Notebooks, akin to Jupyter Notebooks, operate within Google's cloud environment, providing a seamless integration with other Google Cloud services.

BigQuery for data staging and querying

In the analysis process, a key step is to "stage" our datasets using BigQuery. This involves utilizing BigQuery queries to create and save objects, thereby making them reusable and shareable across our team. We achieve this by employing the CREATE TABLE statement, which allows us to combine multiple datasets such as endpoint behavior alerts, customer data, and rule data into a single, comprehensive dataset.

This consolidated dataset is then stored in a BigQuery table specifically designated for this purpose–for this example, we’ll refer to it as the “Global Threat Report” dataset. This approach is applied consistently across different types of data, including both cloud signals and malware datasets.

The newly created data table, for instance, might be named elastic.global_threat_report.ep_behavior_raw. This naming convention, defined by BigQuery, helps in organizing and locating the datasets effectively, which is crucial for the subsequent stages of the extraction process.

An example of a BigQuery query used in this process might look like this:

CREATE TABLE elastic.global_threat_report.ep_behavior_raw AS
SELECT * FROM ...

Diagram for BigQuery query to an exported dataset table Diagram for BigQuery query to an exported dataset table

We also use the EXPORT DATA statement in BigQuery to transfer tables to other GCP services, like exporting them to Google Cloud Storage (GCS) buckets in parquet file format.

EXPORT DATA
  OPTIONS (
    uri = 'gs://**/ep_behavior/*.parquet',
    format = 'parquet',
    overwrite = true
  )
AS (
SELECT * FROM `project.global_threat_report.2023_pre_norm_ep_behavior`
)

Colab Notebooks for loading staged datasets

Colab Notebooks are instrumental in organizing our data extraction process. They allow for easy access and management of data scripts stored in platforms like GitHub and Google Drive.

For authentication and authorization, we use Google Workspace credentials, simplifying access to various Google Cloud services, including BigQuery and Colab Notebooks. Here's a basic example of how authentication is handled:

Diagram for authentication and authorization between Google Cloud services Diagram for authentication and authorization between Google Cloud services

For those new to Jupyter Notebooks or dataframes, it's beneficial to spend time becoming familiar with these tools. They are fundamental in any data analyst's toolkit, allowing for efficient code management, data analysis, and structuring. Mastery of these tools is key to effective data analysis.

Upon creating a notebook in Google Colab, we're ready to extract our custom tables (such as project.global_threat_report.ep_behavior_raw) from BigQuery. This data is then loaded into Pandas Dataframes, a Python library that facilitates data manipulation and analysis. While handling large datasets with Python can be challenging, Google Colab provides robust virtual computing resources. If needed, these resources can be scaled up through the Google Cloud Marketplace or the Google Cloud Console, ensuring that even large datasets can be processed efficiently.

Essential Python libraries for data analysis

In our data analysis process, we utilize various Python libraries, each serving a specific purpose:

LibraryDescription
datetimeEssential for handling all operations related to date and time in your data. It allows you to manipulate and format date and time information for analysis.
google.authManages authentication and access permissions, ensuring secure access to Google Cloud services. It's key for controlling who can access your data and services.
google.colab.authProvides authentication for accessing Google Cloud services within Google Colab notebooks, enabling a secure connection to your cloud-based resources.
google.cloud.bigqueryA tool for managing large datasets in Google Cloud's BigQuery service. It allows for efficient processing and analysis of massive amounts of data.
google.cloud.storageUsed for storing and retrieving data in Google Cloud Storage. It's an ideal solution for handling various data files in the cloud.
gspreadFacilitates interaction with Google Spreadsheets, allowing for easy manipulation and analysis of spreadsheet data.
gspread.dataframe.set_with_dataframeSyncs data between Pandas dataframes and Google Spreadsheets, enabling seamless data transfer and updating between these formats.
matplotlib.pyplot.pltA module in Matplotlib library for creating charts and graphs. It helps in visualizing data in a graphical format, making it easier to understand patterns and trends.
pandasA fundamental tool for data manipulation and analysis in Python. It offers data structures and operations for manipulating numerical tables and time series.
pandas.gbq.to_gbqEnables the transfer of data from Pandas dataframes directly into Google BigQuery, streamlining the process of moving data into this cloud-based analytics platform.
pyarrow.parquet.pqAllows for efficient storage and retrieval of data in the Parquet format, a columnar storage file format optimized for use with large datasets.
seabornA Python visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.

Next, we authenticate with BigQuery, and receive authorization to access our datasets as demonstrated earlier. By using Google Workspace credentials, we can easily access BigQuery and other Google Cloud services. The process typically involves a simple code snippet for authentication:

from google.colab import auth
from google.cloud import bigquery

auth.authenticate_user()
project_id = "PROJECT_FROM_GCP"
client = bigquery.Client(project=project_id)

With authentication complete, we can then proceed to access and manipulate our data. Google Colab's integration with Google Cloud services simplifies this process, making it efficient and secure.

Organizing Colab Notebooks before analysis

When working with Jupyter Notebooks, it's better to organize your notebook beforehand. Various stages of handling and manipulating data will be required, and staying organized will help you create a repeatable, comprehensive process.

In our notebooks, we use Jupyter Notebook headers to organize the code systematically. This structure allows for clear compartmentalization and the creation of collapsible sections, which is especially beneficial when dealing with complex data operations that require multiple steps. This methodical organization aids in navigating the notebook efficiently, ensuring that each step in the data extraction and analysis process is easily accessible and manageable.

Moreover, while the workflow in a notebook might seem linear, it's often more dynamic. Data analysts frequently engage in multitasking, jumping between different sections as needed based on the data or results they encounter. Furthermore, new insights discovered in one step may influence another step’s process, leading to some back and forth before finishing the notebook. |

Extracting Our BigQuery datasets into dataframes

After establishing the structure of our notebook and successfully authenticating with BigQuery, our next step is to retrieve the required datasets. This process sets the foundation for the rest of the report, as the information from these sources will form the basis of our analysis, similar to selecting the key components required for a comprehensive study.

Here's an example of how we might fetch data from BigQuery:

import datetime

current_year = datetime.datetime.now().year
reb_dataset_id = f'project.global_threat_report.{current_year}_raw_ep_behavior'
reb_table = client.list_rows(reb_dataset_id)
reb_df = reb_table.to_dataframe() 

This snippet demonstrates a typical data retrieval process. We first define the dataset we're interested in (with the Global Threat Report, project.global_threat_report.ep_behavior_raw for the current year). Then, we use a BigQuery query to select the data from this dataset and load it into a Pandas DataFrame. This DataFrame will serve as the foundation for our subsequent data analysis steps.

Colab Notebook snippet for data extraction from BigQuery into Pandas dataframe Colab Notebook snippet for data extraction from BigQuery into Pandas dataframe

This process marks the completion of the extraction phase. We have successfully navigated BigQuery to select and retrieve the necessary datasets and load them in our notebooks within dataframes. The extraction phase is pivotal, as it not only involves gathering the data but also setting up the foundation for deeper analysis. It's the initial step in a larger journey of discovery, leading to the transformation phase, where we will uncover more detailed insights from the data.

In summary, this part of our data journey is about more than just collecting datasets; it's about structurally preparing them for the in-depth analysis that follows. This meticulous approach to organizing and executing the extraction phase sets the stage for the transformative insights that we aim to derive in the subsequent stages of our data analysis.

Pre-processing and transformation: The critical phase of data analysis

The transition from raw data to actionable insights involves a series of crucial steps in data processing. After extracting data, our focus shifts to refining it for analysis. Cybersecurity datasets often include various forms of noise, such as false positives and anomalies, which must be addressed to ensure accurate and relevant analysis.

Key stages in data pre-processing and transformation:

  • Data cleaning: This stage involves filling NULL values, correcting data misalignments, and validating data types to ensure the dataset's integrity.
  • Data enrichment: In this step, additional context is added to the dataset. For example, incorporating third-party data, like malware reputations from sources such as VirusTotal, enhances the depth of analysis.
  • Normalization: This process standardizes the data to ensure consistency, which is particularly important for varied datasets like endpoint malware alerts.
  • Anomaly detection: Identifying and rectifying outliers or false positives is critical to maintain the accuracy of the dataset.
  • Feature extraction: The process of identifying meaningful, consistent data points that can be further extracted for analysis.

Embracing the art of data cleaning

Data cleaning is a fundamental step in preparing datasets for comprehensive analysis, especially in cybersecurity. This process involves a series of technical checks to ensure data integrity and reliability. Here are the specific steps:

  • Mapping to MITRE ATT&CK framework: Verify that all detection and response rules in the dataset are accurately mapped to the corresponding tactics and techniques in the MITRE ATT&CK framework. This check includes looking for NULL values or any inconsistencies in how the data aligns with the framework.

  • Data type validation: Confirm that the data types within the dataset are appropriate and consistent. For example, timestamps should be in a standardized datetime format. This step may involve converting string formats to datetime objects or verifying that numerical values are in the correct format.

  • Completeness of critical data: Ensure that no vital information is missing from the dataset. This includes checking for the presence of essential elements like SHA256 hashes or executable names in endpoint behavior logs. The absence of such data can lead to incomplete or biased analysis.

  • Standardization across data formats: Assess and implement standardization of data formats across the dataset to ensure uniformity. This might involve normalizing text formats, ensuring consistent capitalization, or standardizing date and time representations.

  • Duplicate entry identification: Identify and remove duplicate entries by examining unique identifiers such as XDR agent IDs or cluster IDs. This process might involve using functions to detect and remove duplicates, ensuring the uniqueness of each data entry.

  • Exclusion of irrelevant internal data: Locate and remove any internal data that might have inadvertently been included in the dataset. This step is crucial to prevent internal biases or irrelevant information from affecting the analysis.

It is important to note that data cleaning or “scrubbing the data” is a continuous effort throughout our workflow. As we continue to peel back the layers of our data and wrangle it for various insights, it is expected that we identify additional changes.

Utilizing Pandas for data cleaning

The Pandas library in Python offers several functionalities that are particularly useful for data cleaning in cybersecurity contexts. Some of these methods include:

  • DataFrame.isnull() or DataFrame.notnull() to identify missing values.
  • DataFrame.drop_duplicates() to remove duplicate rows.
  • Data type conversion methods like pd.to_datetime() for standardizing timestamp formats.
  • Utilizing boolean indexing to filter out irrelevant data based on specific criteria.

A thorough understanding of the dataset is essential to determine the right cleaning methods. It may be necessary to explore the dataset preliminarily to identify specific areas requiring cleaning or transformation. Additional helpful methods and workflows can be found listed in this Real Python blog.

Feature extraction and enrichment

Feature extraction and enrichment are core steps in data analysis, particularly in the context of cybersecurity. These processes involve transforming and augmenting the dataset to enhance its usefulness for analysis.

  • Create new data from existing: This is where we modify or use existing data to add additional columns or rows.
  • Add new data from 3rd-party: Here, we use existing data as a query reference for 3rd-party RESTful APIs which respond with additional data we can add to the datasets.

Feature extraction

Let’s dig into a tangible example. Imagine we're presented with a bounty of publicly available YARA signatures that Elastic shares with its community. These signatures trigger some of the endpoint malware alerts in our dataset. A consistent naming convention has been observed based on the rule name that, of course, shows up in the raw data: OperationsSystem_MalwareCategory_MalwareFamily. These names can be deconstructed to provide more specific insights. Leveraging Pandas, we can expertly slice and dice the data. For those who prefer doing this during the dataset staging phase with BigQuery, the combination of SPLIT and OFFSET clauses can yield similar results:

df[['OperatingSystem', 'MalwareCategory', 'MalwareFamily']] = df['yara_rule_name'].str.split('_', expand=True)

Feature extraction with our YARA data Feature extraction with our YARA data

There are additional approaches, methods, and processes to feature extraction in data analysis. We recommend consulting your stakeholder's wants/needs and exploring your data to help determine what is necessary for extraction and how.

Data enrichment

Data enrichment enhances the depth and context of cybersecurity datasets. One effective approach involves integrating external data sources to provide additional perspectives on the existing data. This can be particularly valuable in understanding and interpreting cybersecurity alerts.

Example of data enrichment: Integrating VirusTotal reputation data A common method of data enrichment in cybersecurity involves incorporating reputation scores from external threat intelligence services like VirusTotal (VT). This process typically includes:

  1. Fetching reputation data: Using an API key from VT, we can query for reputational data based on unique identifiers in our dataset, such as SHA256 hashes of binaries.
import requests

def get_reputation(sha256, API_KEY, URL):
    params = {'apikey': API_KEY, 'resource': sha256}
    response = requests.get(URL, params=params)
    json_response = response.json()
    
    if json_response.get("response_code") == 1:
        positives = json_response.get("positives", 0)
        return classify_positives(positives)
    else:
        return "unknown"

In this function, classify_positives is a custom function that classifies the reputation based on the number of antivirus engines that flagged the file as malicious.

  1. Adding reputation data to the dataset: The reputation data fetched from VirusTotal is then integrated into the existing dataset. This is done by applying the get_reputation function to each relevant entry in the DataFrame.
df['reputation'] = df['sha256'].apply(lambda x: get_reputation(x, API_KEY, URL))

Here, a new column named reputation is added to the dataframe, providing an additional layer of information about each binary based on its detection rate in VirusTotal.

This method of data enrichment is just one of many options available for enhancing cybersecurity threat data. By utilizing robust helper functions and tapping into external data repositories, analysts can significantly enrich their datasets. This enrichment allows for a more comprehensive understanding of the data, leading to a more informed and nuanced analysis. The techniques demonstrated here are part of a broader range of advanced data manipulation methods that can further refine cybersecurity data analysis.

Normalization

Especially when dealing with varied datasets in cybersecurity, such as endpoint alerts and cloud SIEM notifications, normalization may be required to get the most out of your data.

Understanding normalization: At its core, normalization is about adjusting values measured on different scales to a common scale, ensuring that they are proportionally represented, and reducing redundancy. In the cybersecurity context, this means representing events or alerts in a manner that doesn't unintentionally amplify or reduce their significance.

Consider our endpoint malware dataset. When analyzing trends, say, infections based on malware families or categories, we aim for an accurate representation. However, a single malware infection on an endpoint could generate multiple alerts depending on the Extended Detection and Response (XDR) system. If left unchecked, this could significantly skew our understanding of the threat landscape. To counteract this, we consider the Elastic agents, which are deployed as part of the XDR solution. Each endpoint has a unique agent, representing a single infection instance if malware is detected. Therefore, to normalize this dataset, we would "flatten" or adjust it based on unique agent IDs. This means, for our analysis, we'd consider the number of unique agent IDs affected by a specific malware family or category rather than the raw number of alerts.

Example visualization of malware alert normalization by unique agents Example visualization of malware alert normalization by unique agents

As depicted in the image above, if we chose to not normalize the malware data in preparation for trend analysis, our key findings would depict inaccurate information. This inaccuracy could be sourced from a plethora of data inconsistencies such as generic YARA rules, programmatic operations that were flagged repeatedly on a single endpoint, and many more.

Diversifying the approach: On the other hand, when dealing with endpoint behavior alerts or cloud alerts (from platforms like AWS, GCP, Azure, Google Workspace, and O365), our normalization approach might differ. These datasets could have their own nuances and may not require the same "flattening" technique used for malware alerts.

Conceptualizing normalization options: Remember the goal of normalization is to reduce redundancy in your data. Make sure to keep your operations as atomic as possible in case you need to go back and tweak them later. This is especially true when performing both normalization and standardization. Sometimes these can be difficult to separate, and you may have to go back and forth between the two. Analysts have a wealth of options for these. From Min-Max scaling, where values are shifted and rescaled to range between 0 and 1, to Z-score normalization (or standardization), where values are centered around zero and standard deviations from the mean. The choice of technique depends on the nature of the data and the specific requirements of the analysis.

In essence, normalization ensures that our cybersecurity analysis is based on a level playing field, giving stakeholders an accurate view of the threat environment without undue distortions. This is a critical step before trend analysis.

Anomaly detection: Refining the process of data analysis

In the realm of cybersecurity analytics, a one-size-fits-all approach to anomaly detection does not exist. The process is highly dependent on the specific characteristics of the data at hand. The primary goal is to identify and address outliers that could potentially distort the analysis. This requires a dynamic and adaptable methodology, where understanding the nuances of the dataset is crucial.

Anomaly detection in cybersecurity involves exploring various techniques and methodologies, each suited to different types of data irregularities. The strategy is not to rigidly apply a single method but rather to use a deep understanding of the data to select the most appropriate technique for each situation. The emphasis is on flexibility and adaptability, ensuring that the approach chosen provides the clearest and most accurate insights into the data.

Statistical methods – The backbone of analysis:

Statistical analysis is always an optional approach to anomaly detection, especially for cyber security data. By understanding the inherent distribution and central tendencies of our data, we can highlight values that deviate from the norm. A simple yet powerful method, the Z-score, gauges the distance of a data point from the mean in terms of standard deviations.

import numpy as np

# Derive Z-scores for data points in a feature
z_scores = np.abs((df['mitre_technique'] - df['mitre_technique'].mean()) / df['mitre_technique'].std())

outliers = df[z_scores > 3]  # Conventionally, a Z-score above 3 signals an outlier

Why this matters: This method allows us to quantitatively gauge the significance of a data point's deviation. Such outliers can heavily skew aggregate metrics like mean or even influence machine learning model training detrimentally. Remember, outliers should not always be removed; it is all about context! Sometimes you may even be looking for the outliers specifically.

Key library: While we utilize NumPy above, SciPy can also be employed for intricate statistical operations.

Aggregations and sorting – unraveling layers:

Data often presents itself in layers. By starting with a high-level view and gradually diving into specifics, we can locate inconsistencies or anomalies. When we aggregate by categories such as the MITRE ATT&CK tactic, and then delve deeper, we gradually uncover the finer details and potential anomalies as we go from technique to rule logic and alert context.

# Aggregating by tactics first
tactic_agg = df.groupby('mitre_tactic').size().sort_values(ascending=False)

From here, we can identify the most common tactics and choose the tactic with the highest count. We then filter our data for this tactic to identify the most common technique associated with the most common tactic. Techniques often are more specific than tactics and thus add more explanation about what we may be observing. Following the same approach we can then filter for this specific technique, aggregate by rule and review that detection rule for more context. The goal here is to find “noisy” rules that may be skewing our dataset and thus related alerts need to be removed. This cycle can be repeated until outliers are removed and the percentages appear more accurate.

Why this matters: This layered analysis approach ensures no stone is left unturned. By navigating from the general to the specific, we systematically weed out inconsistencies.

Key library: Pandas remains the hero, equipped to handle data-wrangling chores with finesse.

Visualization – The lens of clarity:

Sometimes, the human eye, when aided with the right visual representation, can intuitively detect what even the most complex algorithms might miss. A boxplot, for instance, not only shows the central tendency and spread of data but distinctly marks outliers.

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
sns.boxplot(x='Malware Family', y='Malware Score', data=df)
plt.title('Distribution of Malware Scores by Family')
plt.show()

Example visualization of malware distribution scores by family from an example dataset Example visualization of malware distribution scores by family from an example dataset

Why this matters: Visualization transforms abstract data into tangible insights. It offers a perspective that's both holistic and granular, depending on the need.

Key library: Seaborn, built atop Matplotlib, excels at turning data into visual stories.

Machine learning – The advanced guard:

When traditional methods are insufficient, machine learning steps in, offering a predictive lens to anomalies. While many algorithms are designed to classify known patterns, some, like autoencoders in deep learning, learn to recreate 'normal' data, marking any deviation as an anomaly.

Why this matters: As data complexity grows, the boundaries of what constitutes an anomaly become blurrier. Machine learning offers adaptive solutions that evolve with the data.

Key libraries: Scikit-learn is a treasure trove for user-friendly, classical machine learning techniques, while PyTorch brings the power of deep learning to the table.

Perfecting anomaly detection in data analysis is similar to refining a complex skill through practice and iteration. The process often involves trial and error, with each iteration enhancing the analyst's familiarity with the dataset. This progressive understanding is key to ensuring that the final analysis is both robust and insightful. In data analysis, the journey of exploration and refinement is as valuable as the final outcome itself.

Before proceeding to in-depth trend analysis, it's very important to ensure that the data is thoroughly pre-processed and transformed. Just as precision and reliability are essential in any meticulous task, they are equally critical in data analysis. The steps of cleaning, normalizing, enriching, and removing anomalies from the groundwork for deriving meaningful insights. Without these careful preparations, the analysis could range from slightly inaccurate to significantly misleading. It's only when the data is properly refined and free of distortions that it can reveal its true value, leading to reliable and actionable insights in trend analysis.

Trend analysis: Unveiling patterns in data

In the dynamic field of cybersecurity where threat actors continually evolve their tactics, techniques, and procedures (TTPs), staying ahead of emerging threats is critical. Trend analysis serves as a vital tool in this regard, offering a way to identify and understand patterns and behaviors in cyber threats over time.

By utilizing the MITRE ATT&CK framework, cybersecurity professionals have a structured and standardized approach to analyzing and categorizing these evolving threats. This framework aids in systematically identifying patterns in attack methodologies, enabling defenders to anticipate and respond to changes in adversary behaviors effectively.

Trend analysis, through the lens of the MITRE ATT&CK framework, transforms raw cybersecurity telemetry into actionable intelligence. It allows analysts to track the evolution of attack strategies and to adapt their defense mechanisms accordingly, ensuring a proactive stance in cybersecurity management.

Beginning with a broad overview: Aggregation and sorting

Commencing our analysis with a bird's eye view is paramount. This panoramic perspective allows us to first pinpoint the broader tactics in play before delving into the more granular techniques and underlying detection rules.

Top tactics: By aggregating our data based on MITRE ATT&CK tactics, we can discern the overarching strategies adversaries lean toward. This paints a picture of their primary objectives, be it initial access, execution, or exfiltration.

top_tactics = df.groupby('mitre_tactic').size()
 .sort_values(ascending=False)

Zooming into techniques: Once we've identified a prominent tactic, we can then funnel our attention to the techniques linked to that tactic. This reveals the specific modus operandi of adversaries.

chosen_tactic = 'Execution'

techniques_under_tactic = df[df['mitre_tactic'] == chosen_tactic]
top_techniques = techniques_under_tactic.groupby('mitre_technique').size()
 .sort_values(ascending=False)

Detection rules and logic: With our spotlight on a specific technique, it's time to delve deeper, identifying the detection rules that triggered alerts. This not only showcases what was detected, but by reviewing the detection logic, we also gain an understanding of the precise behaviors and patterns that were flagged.

chosen_technique = 'Scripting'

rules_for_technique = techniques_under_tactic[techniques_under_tactic['mitre_technique'] == chosen_technique]

top_rules = rules_for_technique
 .groupby('detection_rule').size().sort_values(ascending=False)

This hierarchical, cascading approach is akin to peeling an onion. With each layer, we expose more intricate details, refining our perspective and sharpening our insights.

The power of time: Time series analysis

In the realm of cybersecurity, time isn't just a metric; it's a narrative. Timestamps, often overlooked, are goldmines of insights. Time series analysis allows us to plot events over time, revealing patterns, spikes, or lulls that might be indicative of adversary campaigns, specific attack waves, or dormancy periods.

For instance, plotting endpoint malware alerts over time can unveil an adversary's operational hours or spotlight a synchronized, multi-vector attack:

import matplotlib.pyplot as plt

# Extract and plot endpoint alerts over time
df.set_index('timestamp')['endpoint_alert'].resample('D').count().plot()
plt.title('Endpoint Malware Alerts Over Time')
plt.xlabel('Time')
plt.ylabel('Alert Count')
plt.show()

Time series analysis doesn't just highlight "when" but often provides insights into the "why" behind certain spikes or anomalies. It aids in correlating external events (like the release of a new exploit) to internal data trends.

Correlation analysis

Understanding relationships between different sets of data can offer valuable insights. For instance, a spike in one type of alert could correlate with another type of activity in the system, shedding light on multi-stage attack campaigns or diversion strategies.

# Finding correlation between an increase in login attempts and data exfiltration activities
correlation_value = df['login_attempts'].corr(df['data_exfil_activity'])

This analysis, with the help of pandas corr, can help in discerning whether multiple seemingly isolated activities are part of a coordinated attack chain.

Correlation also does not have to be metric-driven either. When analyzing threats, it is easy to find value and new insights by comparing older findings to the new ones.

Machine learning & anomaly detection

With the vast volume of data, manual analysis becomes impractical. Machine learning can assist in identifying patterns and anomalies that might escape the human eye. Algorithms like Isolation Forest or K-nearest neighbor(KNN) are commonly used to spot deviations or clusters of commonly related data.

from sklearn.ensemble import IsolationForest

# Assuming 'feature_set' contains relevant metrics for analysis
clf = IsolationForest(contamination=0.05)
anomalies = clf.fit_predict(feature_set)

Here, the anomalies variable will flag data points that deviate from the norm, helping analysts pinpoint unusual behavior swiftly.

Behavioral patterns & endpoint data analysis

Analyzing endpoint behavioral data collected from detection rules allows us to unearth overarching patterns and trends that can be indicative of broader threat landscapes, cyber campaigns, or evolving attacker TTPs.

Tactic progression patterns: By monitoring the sequence of detected behaviors over time, we can spot patterns in how adversaries move through their attack chain. For instance, if there's a consistent trend where initial access techniques are followed by execution and then lateral movement, it's indicative of a common attacker playbook being employed.

Command-line trend analysis: Even within malicious command-line arguments, certain patterns or sequences can emerge. Monitoring the most frequently detected malicious arguments can give insights into favored attack tools or scripts.

Example:

# Most frequently detected malicious command lines
top_malicious_commands = df.groupby('malicious_command_line').size()
 .sort_values(ascending=False).head(10)

Process interaction trends: While individual parent-child process relationships can be malicious, spotting trends in these interactions can hint at widespread malware campaigns or attacker TTPs. For instance, if a large subset of endpoints is showing the same unusual process interaction, it might suggest a common threat.

Temporal behavior patterns: Just as with other types of data, the temporal aspect of endpoint behavioral data can be enlightening. Analyzing the frequency and timing of certain malicious behaviors can hint at attacker operational hours or campaign durations.

Example:

# Analyzing frequency of a specific malicious behavior over time
monthly_data = df.pivot_table(index='timestamp', columns='tactic', values='count', aggfunc='sum').resample('M').sum()

ax = monthly_data[['execution', 'defense-evasion']].plot(kind='bar', stacked=False, figsize=(12,6))

plt.title("Frequency of 'execution' and 'defense-evasion' Tactics Over Time")

plt.ylabel("Count")
ax.set_xticklabels([x.strftime('%B-%Y') for x in monthly_data.index])
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Note: This image is from example data and not from the Global Threat Report Note: This image is from example data and not from the Global Threat Report

By aggregating and analyzing endpoint behavioral data at a macro level, we don't just identify isolated threats but can spot waves, trends, and emerging patterns. This broader perspective empowers cybersecurity teams to anticipate, prepare for, and counter large-scale cyber threats more effectively.

While these are some examples of how to perform trend analysis, there is no right or wrong approach. Every analyst has their own preference or set of questions they or stakeholders may want to ask. Here are some additional questions or queries analysts may have for cybersecurity data when doing trend analysis.

  • What are the top three tactics being leveraged by adversaries this quarter?
  • Which detection rules are triggering the most, and is there a common thread?
  • Are there any time-based patterns in endpoint alerts, possibly hinting at an adversary's timezone?
  • How have cloud alerts evolved with the migration of more services to the cloud?
  • Which malware families are becoming more prevalent, and what might be the cause?
  • Do the data patterns suggest any seasonality, like increased activities towards year-end?
  • Are there correlations between external events and spikes in cyber activities?
  • How does the weekday data differ from weekends in terms of alerts and attacks?
  • Which organizational assets are most targeted, and are their defenses up-to-date?
  • Are there any signs of internal threats or unusual behaviors among privileged accounts?

Trend analysis in cybersecurity is a dynamic process. While we've laid down some foundational techniques and questions, there are myriad ways to approach this vast domain. Each analyst may have their preferences, tools, and methodologies, and that's perfectly fine. The essence lies in continuously evolving and adapting to our approach while cognizantly being aware of the ever-changing threat landscape for each ecosystem exposed to threats.

Reduction: Streamlining for clarity

Having progressed through the initial stages of our data analysis, we now enter the next phase: reduction. This step is about refining and concentrating our comprehensive data into a more digestible and focused format.

Recap of the Analysis Journey So Far:

  • Extraction: The initial phase involved setting up our Google Cloud environment and selecting relevant datasets for our analysis.
  • Pre-processing and transformation: At this stage, the data was extracted, processed, and transformed within our Colab notebooks, preparing it for detailed analysis.
  • Trend analysis: This phase provided in-depth insights into cyber attack tactics, techniques, and malware, forming the core of our analysis.

While the detailed data in our Colab Notebooks is extensive and informative for an analyst, it might be too complex for a broader audience. Therefore, the reduction phase focuses on distilling this information into a more concise and accessible form. The aim is to make the findings clear and understandable, ensuring that they can be effectively communicated and utilized across various departments or stakeholders.

Selecting and aggregating key data points

In order to effectively communicate our findings, we must tailor the presentation to the audience's needs. Not every stakeholder requires the full depth of collected data; many prefer a summarized version that highlights the most actionable points. This is where data selection and aggregation come into play, focusing on the most vital elements and presenting them in an accessible format.

Here's an example of how to use Pandas to aggregate and condense a dataset, focusing on key aspects of endpoint behavior:

required_endpoint_behavior_cols = ['rule_name','host_os_type','tactic_name','technique_name']


reduced_behavior_df = df.groupby(required_endpoint_behavior_cols).size()
 .reset_index(name='count')
 .sort_values(by="count", ascending=False)
 .reset_index(drop=True)

columns = {
    'rule_name': 'Rule Name', 
    'host_os_type': 'Host OS Type',
    'tactic_name': 'Tactic', 
    'technique_name': 'Technique', 
    'count': 'Alerts'
}

reduced_behavior_df = reduced_behavior_df.rename(columns=columns)

One remarkable aspect of this code and process is the flexibility it offers. For instance, we can group our data by various data points tailored to our needs. Interested in identifying popular tactics used by adversaries? Group by the MITRE ATT&CK tactic. Want to shed light on masquerading malicious binaries? Revisit extraction to add more Elastic Common Schema (ECS) fields such as file path, filter on Defense Evasion, and aggregate to reveal the commonly trodden paths. This approach ensures we create datasets that are both enlightening and not overwhelmingly rich, tailor-made for stakeholders who wish to understand the origins of our analysis.

This process involves grouping the data by relevant categories such as rule name, host OS type, and MITRE ATT&CK tactics and techniques and then counting the occurrences. This method helps in identifying the most prevalent patterns and trends in the data.

Diagram example of data aggregation to obtain reduced dataset Diagram example of data aggregation to obtain reduced dataset

Exporting reduced data to Google Sheets for accessibility

The reduced data, now stored as a dataframe in memory, is ready to be exported. We use Google Sheets as the platform for sharing these insights because of its wide accessibility and user-friendly interface. The process of exporting data to Google Sheets is straightforward and efficient, thanks to the integration with Google Cloud services.

Here's an example of how the data can be uploaded to Google Sheets using Python from our Colab notebook:

auth.authenticate_user()
credentials, project = google.auth.default()
gc = gspread.authorize(credentials)
workbook = gc.open_by_key("SHEET_ID")
behavior_sheet_name = 'NAME_OF_TARGET_SHEET'
endpoint_behavior_worksheet = workbook.worksheet(behavior_sheet_name)
set_with_dataframe(endpoint_behavior_worksheet, reduced_behavior_df)

With a few simple lines of code, we have effectively transferred our data analysis results to Google Sheets. This approach is widely used due to its accessibility and ease of use. However, there are multiple other methods to present data, each suited to different requirements and audiences. For instance, some might opt for a platform like Looker to present the processed data in a more dynamic dashboard format. This method is particularly useful for creating interactive and visually engaging presentations of data. It ensures that even stakeholders who may not be familiar with the technical aspects of data analysis, such as those working in Jupyter Notebooks, can easily understand and derive value from the insights.

Results in Google Sheet

This streamlined process of data reduction and presentation can be applied to different types of datasets, such as cloud SIEM alerts, endpoint behavior alerts, or malware alerts. The objective remains the same: to simplify and concentrate the data for clear and actionable insights.

Presentation: Showcasing the insights

After meticulously refining our datasets, we now focus on the final stage: the presentation. Here we take our datasets, now neatly organized in platforms like Google Sheets or Looker, and transform them into a format that is both informative and engaging.

Pivot tables for in-depth analysis

Using pivot tables, we can create a comprehensive overview of our trend analysis findings. These tables allow us to display data in a multi-dimensional manner, offering insights into various aspects of cybersecurity, such as prevalent MITRE ATT&CK tactics, chosen techniques, and preferred malware families.

Our approach to data visualization involves:

  • Broad overview with MITRE ATT&CK tactics: Starting with a general perspective, we use pivot tables to overview the different tactics employed in cyber threats.
  • Detailed breakdown: From this panoramic view, we delve deeper, creating separate pivot tables for each popular tactic and then branching out into detailed analyses for each technique and specific detection rule.

This methodical process helps to uncover the intricacies of detection logic and alerts, effectively narrating the story of the cyber threat landscape.

Diagram showcasing aggregations funnel into contextual report information Diagram showcasing aggregations funnel into contextual report information

Accessibility across audiences: Our data presentations are designed to cater to a wide range of audiences, from those deeply versed in data science to those who prefer a more straightforward understanding. The Google Workspace ecosystem facilitates the sharing of these insights, allowing pivot tables, reduced datasets, and other elements to be easily accessible to all involved in the report-making process.

Integrating visualizations into reports: When crafting a report, for example, in Google Docs, the integration of charts and tables from Google Sheets is seamless. This integration ensures that any modifications in the datasets or pivot tables are easily updated in the report, maintaining the efficiency and coherence of the presentation.

Tailoring the presentation to the audience: The presentation of data insights is not just about conveying information; it's about doing so in a visually appealing and digestible manner. For a more tech-savvy audience, an interactive Colab Notebook with dynamic charts and functions may be ideal. In contrast, for marketing or design teams, a well-designed dashboard in Looker might be more appropriate. The key is to ensure that the presentation is clear, concise, and visually attractive, tailored to the specific preferences and needs of the audience.

Conclusion: Reflecting on the data analysis journey

As we conclude, it's valuable to reflect on the territory we've navigated in analyzing cyber threat data. This journey involved several key stages, each contributing significantly to our final insights.

Journey through Google's Cloud ecosystem

Our path took us through several Google Cloud services, including GCP, GCE, Colab Notebooks, and Google Workspace. Each played a pivotal role:

Data exploration: We began with a set of cyber-related questions we wanted to answer and explored what vast datasets we had available to us. In this blog, we focused solely on telemetry being available in BigQuery. Data extraction: We began by extracting raw data, utilizing BigQuery to efficiently handle large volumes of data. Extraction occurred in both BigQuery and from within our Colab notebooks. Data wrangling and processing: The power of Python and the pandas library was leveraged to clean, aggregate, and refine this data, much like a chef skillfully preparing ingredients. Trend analysis: We then performed trend analysis on our reformed datasets with several methodologies to glean valuable insights into adversary tactics, techniques, and procedures over time. Reduction: Off the backbone of our trend analysis, we aggregated our different datasets by targeted data points in preparation for presentation to stakeholders and peers. Transition to presentation: The ease of moving from data analytics to presentation within a web browser highlighted the agility of our tools, facilitating a seamless workflow.

Modularity and flexibility in workflow

An essential aspect of our approach was the modular nature of our workflow. Each phase, from data extraction to presentation, featured interchangeable components in the Google Cloud ecosystem, allowing us to tailor the process to specific needs:

Versatile tools: Google Cloud Platform offered a diverse range of tools and options, enabling flexibility in data storage, analysis, and presentation. Customized analysis path: Depending on the specific requirements of our analysis, we could adapt and choose different tools and methods, ensuring a tailored approach to each dataset. Authentication and authorization: Due to our entities being housed in the Google Cloud ecosystem, access to different tools, sites, data, and more was all painless, ensuring a smooth transition between services.

Orchestration and tool synchronization

The synergy between our technical skills and the chosen tools was crucial. This harmonization ensured that the analytical process was not only effective for this project but also set the foundation for more efficient and insightful future analyses. The tools were used to augment our capabilities, keeping the focus on deriving meaningful insights rather than getting entangled in technical complexities.

In summary, this journey through data analysis emphasized the importance of a well-thought-out approach, leveraging the right tools and techniques, and the adaptability to meet the demands of cyber threat data analysis. The end result is not just a set of findings but a refined methodology that can be applied to future data analysis endeavors in the ever-evolving field of cybersecurity.

Call to Action: Embarking on your own data analytics journey

Your analytical workspace is ready! What innovative approaches or experiences with Google Cloud or other data analytics platforms can you bring to the table? The realm of data analytics is vast and varied, and although each analyst brings a unique touch, the underlying methods and principles are universal.

The objective is not solely to excel in your current analytical projects but to continually enhance and adapt your techniques. This ongoing refinement ensures that your future endeavors in data analysis will be even more productive, enlightening, and impactful. Dive in and explore the world of data analytics with Google Cloud!

We encourage any feedback and engagement for this topic! If you prefer to do so, feel free to engage us in Elastic’s public #security Slack channel.