Elastic Observability Labs - Articles by Stephen Brown

Bringing Your Cloud-Managed Kubernetes Audit Logs into Elasticsearch

Mon, 19 Aug 2024 00:00:00 GMT

Introduction:

Kubernetes audit logs are essential for ensuring the security, compliance, and transparency of Kubernetes clusters. However, with managed Kubernetes infrastructure, traditional audit file-based log shipping is often not supported, and audit logs are only available via the control plane API or the Cloud Provider logging facility. In this blog, we will show you how to ingest the audit logs from these other sources and still take advantage of the Elastic Kubernetes Audit Log Integration.

In this blog, we will focus on AWS as our cloud provider. When ingesting logs from AWS, you have several options:

AWS Custom Logs integration (which we will utilize in this blog)
AWS Firehose to send logs from Cloudwatch to Elastic
AWS General integration which supports many AWS sources

In part 1 of this two-part series, we will focus on properly ingesting Kubernetes audit logs, and part 2 will focus on investigation, analytics, and alerting.

Kubernetes auditing documentation describes the need for auditing in order to get answers to the questions below:

What happened?
When did it happen?
Who initiated it?
What resource did it occur on?
Where was it observed?
From where was it initiated (Source IP)?
Where was it going (Destination IP)?

Answers to the above questions become important when an incident occurs and an investigation follows. Alternatively, it could just be a log retention use case for a regulated company trying to fulfill compliance requirements.

We are giving special importance to audit logs in Kubernetes because audit logs are not enabled by default. Audit logs can take up a large amount of memory and storage. So, usually, it’s a balance between retaining/investigating audit logs against giving up resources budgeted otherwise for workloads to be hosted on the Kubernetes cluster. Another reason we’re talking about audit logs in Kubernetes is that, unlike usual container logs, after being turned on, these logs are orchestrated to write to the cloud provider’s logging service. This is true for most cloud providers because the Kubernetes control plane is managed by the cloud providers. It makes sense for cloud providers to use their built-in orchestration workflows involving the control plane for a managed service backed by their implementation of a logging framework.

Kubernetes audit logs can be quite verbose by default, so it is important to choose how much is logged while still meeting the organization’s audit requirements. This is done in the audit policy file, which is supplied to the kube-apiserver. Not every cloud-provider-hosted Kubernetes offering lets you configure the kube-apiserver directly. For example, AWS EKS allows for this logging to be done only by the control plane.

In this blog we will be using Elastic Kubernetes Service (Amazon EKS) on AWS with the Kubernetes Audit Logs that are automatically shipped to AWS CloudWatch.

A sample audit log for a secret by the name “empty-secret” created by an admin user on EKS is logged on AWS CloudWatch in the following format:

Once the audit logs show up on CloudWatch, it is time to consider how to transfer them to Elasticsearch. Elasticsearch is a great platform for creating dashboards that visualize different audit events recorded in a Kubernetes cluster. It is also a powerful tool for analyzing various audit events. For example, how many secret object creation attempts were made in an hour?
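As a rough illustration of that last question, the sketch below builds an Elasticsearch query body in Python for "secret creations in the last hour". The field names kubernetes.audit.verb and kubernetes.audit.objectRef.resource are assumptions about how the parsed audit event ends up indexed, and the function name is hypothetical; check the mappings in your own deployment before relying on them.

```python
import json


def secret_creations_query(window: str = "now-1h") -> dict:
    """Build a query body matching 'create' requests against secrets.

    The field names are assumptions about the parsed audit document layout.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"kubernetes.audit.verb": "create"}},
                    {"term": {"kubernetes.audit.objectRef.resource": "secrets"}},
                    {"range": {"@timestamp": {"gte": window}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a Dev Tools request.
    print(json.dumps(secret_creations_query(), indent=2))
```

Once the audit logs are ingested and parsed, a body like this could be sent to the _count endpoint of the audit data stream to get the number of matching events.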

Now that we have established that the Kubernetes audit logs are being logged in CloudWatch, let’s discuss how to get the logs ingested into Elasticsearch. Elasticsearch has an integration to consume logs written on CloudWatch. Using this integration with its defaults brings in the JSON from CloudWatch as is, i.e., the real audit log JSON is nested inside the wrapper CloudWatch JSON. When bringing logs to Elasticsearch, it is important that we use the Elastic Common Schema (ECS) to get the best search and analytics performance. This means that there needs to be an ingest pipeline that parses a standard Kubernetes audit JSON message and creates an ECS-compliant document in Elasticsearch. Let’s dive into how to achieve this.

Elasticsearch has a Kubernetes integration using Elastic Agent to consume Kubernetes container logs from the console and audit logs written to a file path. For a cloud-provider use case, as described above, it may not be feasible to write audit logs to a path on the Kubernetes cluster. So, how do we reuse the ECS mapping and parsing for Kubernetes audit logs that the Kubernetes integration already implements, and apply it to the CloudWatch audit logs? That is the most exciting plumbing piece! Let’s see how to do it.

What we’re going to do is:

Read the Kubernetes audit logs from the cloud provider’s logging module, in our case AWS CloudWatch, since this is where the logs reside. We will use Elastic Agent and the Elastic AWS Custom Logs integration to read the logs from CloudWatch. Note: there are several Elastic AWS integrations; we are specifically using the AWS Custom Logs integration.
Create two simple ingest pipelines (we do this for best practices of isolation and composability)
The first pipeline looks for Kubernetes audit JSON messages and then redirects them to the second pipeline
The second custom pipeline will associate the JSON message field with the correct field expected by the Elasticsearch Kubernetes Audit managed pipeline (aka the Integration) and then reroute the message to the correct data stream, kubernetes.audit_logs-default, which in turn applies all the proper mapping and ingest pipelines for the incoming message
The overall flow will be

1. Create an AWS CloudWatch integration:

a. Populate the AWS access key and secret pair values

b. In the logs section, populate the log ARN, Tags and Preserve the original event if you want to, and then Save this integration and exit from the page

2. Next, we will configure the custom ingest pipeline

We are doing this because we want to override what the generic managed pipeline does. We find the custom pipeline name by looking up the managed pipeline that is created as an asset when we install the AWS CloudWatch integration. In this case, we will be adding the custom ingest pipeline logs-aws_logs.generic@custom.

From the Dev Tools console, run the commands below. Here, we are extracting the message field from the CloudWatch JSON and putting the value in a field called kubernetes.audit. Then, we are rerouting this message to the default Kubernetes audit dataset (with its ECS mapping) that comes with the Kubernetes integration.

PUT _ingest/pipeline/logs-aws_logs.generic@custom
{
    "processors": [
      {
        "pipeline": {
          "if": "ctx.message.contains('audit.k8s.io')",
          "name": "logs-aws-process-k8s-audit"
        }
      }
    ]
}

PUT _ingest/pipeline/logs-aws-process-k8s-audit
{
  "processors": [
    {
      "json": {
        "field": "message",
        "target_field": "kubernetes.audit"
      }
    },
    {
      "remove": {
        "field": "message"
      }
    },
    {
      "reroute": {
        "dataset": "kubernetes.audit_logs",
        "namespace": "default"
      }
    }
  ]
}
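To make the effect of the two pipelines concrete, here is a minimal Python sketch that mimics them on a single document. It is only an illustration of the logic, not how Elasticsearch executes ingest pipelines; the function name and the sample event values are hypothetical, and the sketch assumes the reroute target follows the type-dataset-namespace naming of data streams.

```python
import copy
import json
from typing import Any, Dict


def process_cloudwatch_event(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Mimic the two ingest pipelines above on one document (illustration only)."""
    message = doc.get("message", "")
    # First pipeline: only Kubernetes audit events go to the second pipeline.
    if "audit.k8s.io" not in message:
        return doc

    out = copy.deepcopy(doc)
    # json processor: parse the message string into kubernetes.audit.
    out.setdefault("kubernetes", {})["audit"] = json.loads(message)
    # remove processor: drop the original message field.
    del out["message"]
    # reroute processor: point the document at the Kubernetes audit data stream.
    ds = out.setdefault("data_stream", {})
    ds.setdefault("type", "logs")
    ds["dataset"] = "kubernetes.audit_logs"
    ds["namespace"] = "default"
    out["_index"] = f'{ds["type"]}-{ds["dataset"]}-{ds["namespace"]}'
    return out


if __name__ == "__main__":
    # Hypothetical, heavily trimmed audit event wrapped the way CloudWatch delivers it.
    audit_event = {
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "verb": "create",
        "objectRef": {"resource": "secrets", "name": "empty-secret"},
    }
    wrapped = {
        "message": json.dumps(audit_event),
        "data_stream": {"type": "logs", "dataset": "aws_logs.generic", "namespace": "default"},
    }
    print(json.dumps(process_cloudwatch_event(wrapped), indent=2))
```

The key point is that the audit JSON arrives as a string inside message, so it has to be parsed into kubernetes.audit before the integration's own pipeline can work on it.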

Let’s understand this further:

When we create a Kubernetes integration, we get a managed index template called logs-kubernetes.audit_logs that writes to the pipeline called logs-kubernetes.audit_logs-1.62.2 by default
If we look into the pipeline logs-kubernetes.audit_logs-1.62.2, we see that all the processor logic is working against the field kubernetes.audit. This is the reason why our json processor in the above code snippet is creating a field called kubernetes.audit before dropping the original message field and rerouting. Rerouting is directed to the kubernetes.audit_logs dataset that is backed by the logs-kubernetes.audit_logs-1.62.2 pipeline (the dataset name can be read from the pipeline naming convention, which has the format logs-<dataset>-<version>)

3. Now let’s verify that the logs are actually flowing through and the audit message is being parsed

a. We will use Elastic Agent, enrolled through Fleet with the integration policy we created in Step 1. There are a number of ways to deploy Elastic Agent; for this exercise we will deploy using Docker, which is quick and easy.

% docker run --env FLEET_ENROLL=1 --env FLEET_URL=<> --env FLEET_ENROLLMENT_TOKEN=<>  --rm docker.elastic.co/beats/elastic-agent:8.19.11

b. Check the messages in Discover. In 8.15 there is also a new feature called Logs Explorer which provides an ability to see Kubernetes Audit logs (and container logs) with a few clicks (see image below). Voila! We can see the Kubernetes audit messages parsed!

4. Let's do a quick recap of what we did

We configured the CloudWatch integration in Elasticsearch to read Kubernetes audit logs from CloudWatch. Then, we created custom ingest pipelines to reroute the audit messages to the correct data stream, where they pick up all the out-of-the-box mappings and parsing that come with the Kubernetes Audit Logs integration.

In the next part, we’ll look at how to analyze the ingested Kubernetes Audit log data.

Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 1

Wed, 25 Sep 2024 00:00:00 GMT

Introduction:

The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and Pattern matching to detect, assess, and, where feasible, redact PII from logs that are being ingested into Elasticsearch.

In Part 1 of this blog, we will cover the following:

Review the techniques and tools we have available to manage PII in our logs
Understand the roles of NLP / NER in PII detection
Build a composable processing pipeline to detect and assess PII
Sample logs and run them through the NER Model
Assess the results of the NER Model

In Part 2 of this blog, we will cover the following:

Redact PII using NER and the redact processor
Apply field-level security to control access to the un-redacted data
Enhance the dashboards and alerts
Production considerations and scaling
How to run these processes on incoming or historical data

Here is the overall flow we will construct over the 2 blogs:

All code for this exercise can be found at: https://github.com/bvader/elastic-pii.

Tools and Techniques

There are four general capabilities that we will use for this exercise.

Named Entity Recognition Detection (NER)
Pattern Matching Detection
Log Sampling
Ingest Pipelines as Composable Processing

Named Entity Recognition (NER) Detection

NER is a sub-task of Natural Language Processing (NLP) that involves identifying and categorizing named entities in unstructured text into predefined categories such as:

Person: Names of individuals, including celebrities, politicians, and historical figures.
Organization: Names of companies, institutions, and organizations.
Location: Geographic locations, including cities, countries, and landmarks.
Event: Names of events, including conferences, meetings, and festivals.

For our PII use case, we will choose the base BERT NER model bert-base-NER that can be downloaded from Hugging Face and loaded into Elasticsearch as a trained model.

Important Note: NER / NLP Models are CPU-intensive and expensive to run at scale; thus, we will want to employ a sampling technique to understand the risk in our logs without sending the full logs volume through the NER Model. We will discuss the performance and scaling of the NER model in part 2 of the blog.

Pattern Matching Detection

In addition to using an NER, regex pattern matching is a powerful tool for detecting and redacting PII based on common patterns. The Elasticsearch redact processor is built for this use case.

Log Sampling

Considering the performance implications of NER and the fact that we may be ingesting a large volume of logs into Elasticsearch, it makes sense to sample our incoming logs. We will build a simple log sampler to accomplish this.

Ingest Pipelines as Composable Processing

We will create several pipelines, each focusing on a specific capability and a main ingest pipeline to orchestrate the overall process.

Building the Processing Flow

Logs Sampling + Composable Ingest Pipelines

The first thing we will do is set up a sampler to sample our logs. This ingest pipeline takes a sampling rate between 0 (no logs) and 10000 (all logs), which allows a sampling rate as low as ~0.01%, and marks the sampled logs with sample.sampled: true. Further processing on the logs will be driven by the value of sample.sampled. The sample.sample_rate can be set here or "passed in" from the orchestration pipeline.

The command should be run from the Kibana -> Dev Tools

The code can be found here for the following three sections of code.

logs-sampler pipeline code

# logs-sampler pipeline - part 1
DELETE _ingest/pipeline/logs-sampler
PUT _ingest/pipeline/logs-sampler
{
  "processors": [
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "if": "ctx?.sample?.sample_rate == null",
        "field": "sample.sample_rate",
        "value": 10000
      }
    },
    {
      "set": {
        "description": "Determine if keeping unsampled docs",
        "if": "ctx?.sample?.keep_unsampled == null",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "set": {
        "field": "sample.sampled",
        "value": false
      }
    },
    {
      "script": {
        "source": """ Random r = new Random();
        ctx.sample.random = r.nextInt(params.max); """,
        "params": {
          "max": 10000
        }
      }
    },
    {
      "set": {
        "if": "ctx.sample.random < ctx.sample.sample_rate",
        "field": "sample.sampled",
        "value": true
      }
    },
    {
      "drop": {
         "description": "Drop unsampled document if applicable",
        "if": "ctx.sample.keep_unsampled == false && ctx.sample.sampled == false"
      }
    }
  ]
}
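The sampling decision itself is small enough to sketch outside Elasticsearch. The Python below is a stand-in for the script and set processors, assuming the same 0 to 10000 scale; the function names are hypothetical, and the strict comparison is a choice made in this sketch so that a rate of 0 never samples and a rate of 10000 always does.

```python
import random
from typing import Optional

MAX_RATE = 10000  # same scale as the pipeline: 10000 means 100%


def is_sampled(sample_rate: int, rng: Optional[random.Random] = None) -> bool:
    """Return True when a document would be marked sample.sampled: true."""
    if rng is None:
        rng = random.Random()
    draw = rng.randrange(MAX_RATE)  # integer in 0..9999, like nextInt(10000)
    return draw < sample_rate  # strict comparison: rate 0 never samples


def sampled_fraction(sample_rate: int, n: int = 20000, seed: int = 42) -> float:
    """Estimate the fraction of n documents that would be sampled."""
    rng = random.Random(seed)
    hits = sum(is_sampled(sample_rate, rng) for _ in range(n))
    return hits / n


if __name__ == "__main__":
    for rate in (0, 1, 1000, 10000):
        print(f"sample_rate={rate:>5} -> sampled fraction ~ {sampled_fraction(rate):.4f}")
```

A rate of 1000 should land near 10% of documents, and a rate of 1 near the ~0.01% floor mentioned above.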

Now, let's test the logs sampler. We will build the first part of the composable pipeline. We will be sending logs to the logs-generic-default data stream. With that in mind, we will create the logs@custom ingest pipeline that will be automatically called using the logs data stream framework for customization. We will add one additional level of abstraction so that you can apply this PII processing to other data streams.

Next, we will create the process-pii pipeline. This is the core processing pipeline where we will orchestrate PII processing component pipelines. In this first step, we will simply apply the sampling logic. Note that we are setting the sampling rate to 1000, which is equivalent to 10% of the logs.

process-pii pipeline code

# Process PII pipeline - part 1
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing historical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    }
  ]
}

Finally, we create the logs@custom pipeline, which will simply call our process-pii pipeline based on the correct data_stream.dataset.

logs@custom pipeline code

# logs@custom pipeline - part 1
DELETE _ingest/pipeline/logs@custom
PUT _ingest/pipeline/logs@custom
{
  "processors": [
    {
      "set": {
        "field": "pipelinetoplevel",
        "value": "logs@custom"
      }
    },
    {
      "set": {
        "field": "pipelinetoplevelinfo",
        "value": "{{{data_stream.dataset}}}"
      }
    },
    {
      "pipeline": {
        "description": "Call the process-pii pipeline on the correct dataset",
        "if": "ctx?.data_stream?.dataset == 'pii'",
        "name": "process-pii"
      }
    }
  ]
}
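Before loading any data, one way to sanity-check the chain logs@custom -> process-pii -> logs-sampler is the ingest simulate API. The Python sketch below only builds the request; the helper name and the sample message are hypothetical, and the output is meant to be pasted into Kibana -> Dev Tools.

```python
import json
from typing import Any, Dict, List, Tuple


def build_simulate_request(
    messages: List[str], dataset: str = "pii"
) -> Tuple[str, Dict[str, Any]]:
    """Return the Dev Tools request line and body for a simulate call."""
    docs = [
        {
            "_source": {
                "message": message,
                "data_stream": {
                    "type": "logs",
                    "dataset": dataset,
                    "namespace": "default",
                },
            }
        }
        for message in messages
    ]
    return "POST _ingest/pipeline/logs@custom/_simulate", {"docs": docs}


if __name__ == "__main__":
    # Hypothetical log line; send several copies to see sampling vary per doc.
    request_line, body = build_simulate_request(["test message for sampling"] * 5)
    print(request_line)
    print(json.dumps(body, indent=2))
```

Each simulated document should come back with the sample.* fields set, and changing the dataset to something other than pii should leave those fields absent because process-pii is never called.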

Now, let's test to see the sampling at work.

Load the data as described in the Data Loading Appendix. Let's use the sample data first, and we will talk about how to test with your incoming or historical logs later at the end of this blog.

If you look at Observability -> Logs -> Logs Explorer with KQL filter data_stream.dataset : pii and Breakdown by sample.sampled, you should see the breakdown to be approximately 10%

At this point we have a composable ingest pipeline that is "sampling" logs. As a bonus, you can use this logs sampler for any other use cases you have as well.

Loading, Configuration, and Execution of the NER Pipeline

Loading the NER Model

You will need a Machine Learning node to run the NER model on. In this exercise, we are using Elastic Cloud Hosted Deployment on AWS with the CPU Optimized (ARM) architecture. The NER inference will run on a Machine Learning AWS c5d node. There will be GPU options in the future, but today, we will stick with CPU architecture.

This exercise will use a single c5d with 8 GB RAM with 4.2 vCPU up to 8.4 vCPU

Please refer to the official documentation on how to import an NLP-trained model into Elasticsearch for complete instructions on uploading, configuring, and deploying the model.

The quickest way to get the model is using the Eland Docker method.

The following command will load the model into Elasticsearch but will not start it. We will do that in the next step.

docker run -it --rm --network host docker.elastic.co/eland/eland \
  eland_import_hub_model \
  --url https://mydeployment.es.us-west-1.aws.found.io:443/ \
  -u elastic -p password \
  --hub-model-id dslim/bert-base-NER --task-type ner

Deploy and Start the NER Model

In general, to improve ingest performance, increase throughput by adding more allocations to the deployment. For improved search speed, increase the number of threads per allocation.

To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available here. The number of allocations must be less than the available allocated processors (cores, not vCPUs) per node.

To deploy and start the NER model, we will use the Start trained model deployment API.

We will configure the following:

4 Allocations to allow for more parallel ingestion
1 Thread per Allocation
0 Bytes Cache, as we expect a low cache hit rate
8192 Queue

# Start the model with 4 Allocations x 1 Thread, no cache, and 8192 queue
POST _ml/trained_models/dslim__bert-base-ner/deployment/_start?cache_size=0b&number_of_allocations=4&threads_per_allocation=1&queue_capacity=8192

You should get a response that looks something like this.

{
  "assignment": {
    "task_parameters": {
      "model_id": "dslim__bert-base-ner",
      "deployment_id": "dslim__bert-base-ner",
      "model_bytes": 430974836,
      "threads_per_allocation": 1,
      "number_of_allocations": 4,
      "queue_capacity": 8192,
      "cache_size": "0",
      "priority": "normal",
      "per_deployment_memory_bytes": 430914596,
      "per_allocation_memory_bytes": 629366952
    },
...
    "assignment_state": "started",
    "start_time": "2024-09-23T21:39:18.476066615Z",
    "max_assigned_allocations": 4
  }
}

The NER model has been deployed and started and is ready to be used.
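If you expect to try several allocation and thread combinations, it can help to generate the start request rather than edit it by hand. The Python sketch below rebuilds the request path used above from its four settings; the helper names are hypothetical, and it only produces strings and numbers, it does not call Elasticsearch.

```python
from urllib.parse import urlencode


def start_deployment_path(
    model_id: str,
    allocations: int,
    threads_per_allocation: int,
    queue_capacity: int,
    cache_size: str = "0b",
) -> str:
    """Build the request path for the Start trained model deployment API."""
    params = {
        "cache_size": cache_size,
        "number_of_allocations": allocations,
        "threads_per_allocation": threads_per_allocation,
        "queue_capacity": queue_capacity,
    }
    return f"_ml/trained_models/{model_id}/deployment/_start?{urlencode(params)}"


def total_threads(allocations: int, threads_per_allocation: int) -> int:
    """Threads the deployment will use in total across all allocations."""
    return allocations * threads_per_allocation


if __name__ == "__main__":
    print("POST " + start_deployment_path("dslim__bert-base-ner", 4, 1, 8192))
    print("total threads:", total_threads(4, 1))
```

Comparing the total thread count with the cores available on the Machine Learning node is a quick check before starting a deployment.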

The following ingest pipeline implements the NER model via the inference processor.

There is a significant amount of code here, but only two items are of interest for now. The rest of the code is conditional logic to drive some additional specific behavior that we will look closer at in the future.

The inference processor calls the NER model we loaded previously by its ID and passes it the text to be analyzed for PII. In this case, that is the message field, which is mapped to the text_field input the NER model expects.
The script processor loops through the array of ML predictions and replaces each identified entity in the message string with a redacted placeholder, storing the result in a new field, redact.message. This looks more complex than it really is. We will look at this a little closer in the following steps.

The code can be found here for the following three sections of code.

The NER PII Pipeline

logs-ner-pii-processor pipeline code

# NER Pipeline
DELETE _ingest/pipeline/logs-ner-pii-processor
PUT _ingest/pipeline/logs-ner-pii-processor
{
  "processors": [
    {
      "set": {
        "description": "Set to true to actually redact, false will run processors but leave original",
        "field": "redact.enable",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set to true to keep ml results for debugging",
        "field": "redact.ner.keep_result",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set to PER, LOC, ORG to skip, or NONE to not drop any replacement",
        "field": "redact.ner.skip_entity",
        "value": "NONE"
      }
    },
    {
      "set": {
        "description": "Minimum class_probability (0 to 1) an entity needs before it is replaced; 0 replaces all",
        "field": "redact.ner.minimum_score",
        "value": 0
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.message == null",
        "field": "redact.message",
        "copy_from": "message"
      }
    },
    {
      "set": {
        "field": "redact.ner.successful",
        "value": true
      }
    },
    {
      "set": {
        "field": "redact.ner.found",
        "value": false
      }
    },
    {
      "inference": {
        "model_id": "dslim__bert-base-ner",
        "field_map": {
          "message": "text_field"
        },
        "on_failure": [
          {
            "set": {
              "description": "Set 'error.message'",
              "field": "failure",
              "value": "REDACT_NER_FAILED"
            }
          },
          {
            "set": {
              "field": "redact.ner.successful",
              "value": false
            }
          }
        ]
      }
    },
    {
      "script": {
        "if": "ctx.failure != 'REDACT_NER_FAILED'",
        "lang": "painless",
        "source": """String msg = ctx['message'];
          for (item in ctx['ml']['inference']['entities']) {
          	if ((item['class_name'] != ctx.redact.ner.skip_entity) && 
          	  (item['class_probability'] >= ctx.redact.ner.minimum_score)) {  
          		  msg = msg.replace(item['entity'], '<' + 
          		  'REDACTNER-'+ item['class_name'] + '_NER>')
          	}
          }
          ctx.redact.message = msg""",
        "on_failure": [
          {
            "set": {
              "description": "Set 'error.message'",
              "field": "failure",
              "value": "REDACT_REPLACEMENT_SCRIPT_FAILED",
              "override": false
            }
          },
          {
            "set": {
              "field": "redact.successful",
              "value": false
            }
          }
        ]
      }
    },
    
    {
      "set": {
        "if": "ctx?.ml?.inference?.entities.size() > 0", 
        "field": "redact.ner.found",
        "value": true,
        "ignore_failure": true
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.pii?.found == null",
        "field": "redact.pii.found",
        "value": false
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.ner?.found == true",
        "field": "redact.pii.found",
        "value": true
      }
    },
    {
      "remove": {
        "if": "ctx.redact.ner.keep_result != true",
        "field": [
          "ml"
        ],
        "ignore_missing": true,
        "ignore_failure": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "failure",
        "value": "GENERAL_FAILURE",
        "override": false
      }
    }
  ]
}
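The replacement loop in the script processor may be easier to follow outside Painless. The Python sketch below is an approximate port of that loop, not the pipeline itself; the function name is hypothetical, and the shortened message and entity list are made up for illustration using names from the sample data shown later in this blog.

```python
from typing import Any, Dict, List


def redact_with_ner(
    message: str,
    entities: List[Dict[str, Any]],
    skip_entity: str = "NONE",
    minimum_score: float = 0.0,
) -> str:
    """Replace every NER-detected entity in message with a placeholder."""
    redacted = message
    for item in entities:
        keep_class = item["class_name"] != skip_entity
        confident = item["class_probability"] >= minimum_score
        if keep_class and confident:
            placeholder = "<REDACTNER-" + item["class_name"] + "_NER>"
            # Like the Painless String.replace, this replaces all occurrences.
            redacted = redacted.replace(item["entity"], placeholder)
    return redacted


if __name__ == "__main__":
    message = "Payment successful (user: Paul Buck), Ordered from: Costco"
    entities = [
        {"entity": "Paul Buck", "class_name": "PER", "class_probability": 0.999},
        {"entity": "Costco", "class_name": "ORG", "class_probability": 0.595},
    ]
    print(redact_with_ner(message, entities))
    print(redact_with_ner(message, entities, minimum_score=0.6))
```

With the defaults both entities are replaced; raising minimum_score to 0.6 leaves the low-confidence ORG entity untouched, which is the behavior the redact.ner.minimum_score setting controls.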

The updated PII Processor Pipeline, which now calls the NER Pipeline

process-pii pipeline code

# Updated Process PII pipeline that now call the NER pipeline
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set Sampling Rate 0 None 10000 all allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing historical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true && ctx.sample.sampled == true)",
        "name": "logs-ner-pii-processor"
      }
    }
  ]
}

Now reload the data as described in Reloading the logs.

Results

Let's take a look at the results with the NER processing in place. In the Logs Explorer with KQL query bar, execute the following query data_stream.dataset : pii and ml.inference.entities.class_name : ("PER" and "LOC" and "ORG" )

Logs Explorer should look something like this. Open the top message to see the details.

NER Model Results

Let's take a closer look at what these fields mean.

Field: ml.inference.entities.class_name
Sample Value: [PER, PER, LOC, ORG, ORG]
Description: An array of the named entity classes that the NER model has identified.

Field: ml.inference.entities.class_probability
Sample Value: [0.999, 0.972, 0.896, 0.506, 0.595]
Description: The class_probability is a value between 0 and 1, which indicates how likely it is that a given data point belongs to a certain class. The higher the number, the higher the probability that the data point belongs to the named class. This is important because in the next blog we can decide on a threshold that we will want to use to alert and redact on. You can see in this example that the model labeled a location (ME) as an ORG with a low score; we can find or filter out such cases by setting a threshold.

Field: ml.inference.entities.entity
Sample Value: [Paul Buck, Steven Glens, South Amyborough, ME, Costco]
Description: The array of entities identified that align positionally with the class_name and class_probability.

Field: ml.inference.predicted_value
Sample Value: [2024-09-23T14:32:14.608207-07:00Z] log.level=INFO: Payment successful for order #4594 (user: [Paul Buck](PER&Paul+Buck), david59@burgess.net). Phone: 726-632-0527x520, Address: 3713 [Steven Glens](PER&Steven+Glens), [South Amyborough](LOC&South+Amyborough), [ME](ORG&ME) 93580, Ordered from: [Costco](ORG&Costco)
Description: The predicted value of the model.
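Because the three arrays align positionally, applying a score threshold is a matter of zipping them together. The Python sketch below uses the sample values from the fields above; the function name and the 0.6 threshold are only examples, not recommendations.

```python
from typing import List, Tuple


def entities_above(
    class_names: List[str],
    probabilities: List[float],
    entities: List[str],
    threshold: float,
) -> List[Tuple[str, float, str]]:
    """Keep only the entities whose class_probability meets the threshold."""
    return [
        (name, prob, entity)
        for name, prob, entity in zip(class_names, probabilities, entities)
        if prob >= threshold
    ]


if __name__ == "__main__":
    class_names = ["PER", "PER", "LOC", "ORG", "ORG"]
    probabilities = [0.999, 0.972, 0.896, 0.506, 0.595]
    entities = ["Paul Buck", "Steven Glens", "South Amyborough", "ME", "Costco"]
    for row in entities_above(class_names, probabilities, entities, 0.6):
        print(row)
```

At a threshold of 0.6 the two low-scoring ORG predictions (ME and Costco) drop out, which shows the trade-off: the mislabeled ME is filtered, but so is a legitimate organization name.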

PII Assessment Dashboard

Let's take a quick look at a dashboard built to assess the PII data.

To load the dashboard, go to Kibana -> Stack Management -> Saved Objects and import the pii-dashboard-part-1.ndjson file that can be found here:

https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson

More complete instructions on Kibana Saved Objects can be found here.

After loading the dashboard, navigate to it and select the right time range and you should see something like below. It shows metrics such as sample rate, percent of logs with NER, NER Score Trends etc. We will examine the assessment and actions in part 2 of this blog.

Summary and Next Steps

In this first part of the blog, we have accomplished the following.

Reviewed the techniques and tools we have available for PII detection and assessment
Reviewed NLP / NER role in PII detection and assessment
Built the necessary composable ingest pipelines to sample logs and run them through the NER Model
Reviewed the NER results and are ready to move to the second blog

In the upcoming Part 2 of this blog, we will cover the following:

Redact PII using NER and redact processor
Apply field-level security to control access to the un-redacted data
Enhance the dashboards and alerts
Production considerations and scaling
How to run these processes on incoming or historical data

Data Loading Appendix

Code

The data loading code can be found here:

https://github.com/bvader/elastic-pii

$ git clone https://github.com/bvader/elastic-pii.git

Creating and Loading the Sample Data Set

$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker

Run the log generator

$ python generate_random_logs.py

If you do not change any parameters, this will create 10000 random logs in a file named pii.log with a mix of logs that contain and do not contain PII.

Edit load_logs.py and set the following

# The Elastic User 
ELASTIC_USER = "elastic"

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = "askdjfhasldfkjhasdf"

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = "deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ="

Then run the following command.

$ python load_logs.py

Reloading the logs

Note: To reload the logs, you can simply re-run the above command. You can run the command multiple times during this exercise and the logs will be reloaded (actually loaded again). The new logs will not collide with previous runs as there will be a unique run.id for each run, which is displayed at the end of the loading process.

$ python load_logs.py

Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 2

Tue, 22 Oct 2024 00:00:00 GMT

Introduction:

In Part 1 of this blog, we covered the following:

Review the techniques and tools we have available to manage PII in our logs
Understand the roles of NLP / NER in PII detection
Build a composable processing pipeline to detect and assess PII
Sample logs and run them through the NER Model
Assess the results of the NER Model

In Part 2 of this blog, we will cover the following:

Apply the redact regex pattern processor and assess the results
Create Alerts using ESQL
Apply field-level security to control access to the un-redacted data
Production considerations and scaling
How to run these processes on incoming or historical data

Reminder of the overall flow we will construct over the 2 blogs:

All code for this exercise can be found at: https://github.com/bvader/elastic-pii.

Part 1 Prerequisites

This blog picks up where Part 1 of this blog left off. You must have the NER model, ingest pipelines, and dashboard from Part 1 installed and working.

Loaded and configured NER Model
Installed all the composable ingest pipelines from Part 1 of the blog
Installed dashboard

You can access the complete solution for Blog 1 here. Don't forget to load the dashboard, found here.

Applying the Redact Processor

Next, we will apply the redact processor. The redact processor is a simple regex-based processor that takes a list of regex patterns, looks for them in a field, and replaces them with literals when found. The redact processor is reasonably performant and can run at scale. We will discuss this in detail in the production scaling section at the end.

Elasticsearch comes packaged with a number of useful predefined patterns that can be conveniently referenced by the redact processor. If one does not suit your needs, create a new pattern with a custom definition. The Redact processor replaces every occurrence of a match. If there are multiple matches, they will all be replaced with the pattern name.

In the code below, we leveraged some of the predefined patterns and constructed several custom patterns.

        "patterns": [
          "%{EMAILADDRESS:EMAIL_REGEX}",      << Predefined
          "%{IP:IP_ADDRESS_REGEX}",           << Predefined
          "%{CREDIT_CARD:CREDIT_CARD_REGEX}", << Custom
          "%{SSN:SSN_REGEX}",                 << Custom
          "%{PHONE:PHONE_REGEX}"              << Custom
        ]

We also replaced the PII with easily identifiable patterns we can use for assessment.

In addition, it is important to note that since the redact processor is a simple regex find and replace, it can be used against many "secrets" patterns, not just PII. There are many references for regex and secrets patterns, so you can reuse this capability to detect secrets in your logs.
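Before loading custom patterns into a pipeline, it can help to try them locally against a few sample lines. The sketch below is only an approximation of the redact processor: it uses Python's re module rather than the Grok engine Elasticsearch uses, applies the patterns one after another, and covers only the three custom definitions from this post. The function name redact_text and the sample log line are made up for illustration.

```python
import re

# Custom pattern definitions copied from the pipeline in this post.
# Order matters in this sketch: the credit card pattern runs first so the
# looser phone pattern cannot claim part of a card number.
CUSTOM_PATTERNS = {
    "CREDIT_CARD_REGEX": r"\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}",
    "SSN_REGEX": r"\d{3}-\d{2}-\d{4}",
    "PHONE_REGEX": r"(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}",
}


def redact_text(text, prefix="<", suffix=">"):
    """Replace every match of each pattern with prefix + pattern name + suffix."""
    for name, pattern in CUSTOM_PATTERNS.items():
        text = re.sub(pattern, prefix + name + suffix, text)
    return text


sample = "paid with 1234-5678-9012-3456, ssn 123-45-6789"
redacted = redact_text(sample)
print(redacted)
```

Note that the phone pattern begins with several optional pieces, including whitespace, so a match can swallow the space in front of a number. That is worth knowing when you assess the redacted output.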

The code can be found here for the following two sections of code.


# Add the PII redact processor pipeline
DELETE _ingest/pipeline/logs-pii-redact-processor
PUT _ingest/pipeline/logs-pii-redact-processor
{
  "processors": [
    {
      "set": {
        "field": "redact.proc.successful",
        "value": true
      }
    },
    {
      "set": {
        "field": "redact.proc.found",
        "value": false
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.message == null",
        "field": "redact.message",
        "copy_from": "message"
      }
    },
    {
      "redact": {
        "field": "redact.message",
        "prefix": "<REDACTPROC-",
        "suffix": ">",
        "patterns": [
          "%{EMAILADDRESS:EMAIL_REGEX}",
          "%{IP:IP_ADDRESS_REGEX}",
          "%{CREDIT_CARD:CREDIT_CARD_REGEX}",
          "%{SSN:SSN_REGEX}",
          "%{PHONE:PHONE_REGEX}"
        ],
        "pattern_definitions": {
          "CREDIT_CARD": """\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}""",
          "SSN": """\d{3}-\d{2}-\d{4}""",
          "PHONE": """(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"""
        },
        "on_failure": [
          {
            "set": {
              "description": "Set 'error.message'",
              "field": "failure",
              "value": "REDACT_PROCESSOR_FAILED",
              "override": false
            }
          },
          {
            "set": {
              "field": "redact.proc.successful",
              "value": false
            }
          }
        ]
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.message.contains('REDACTPROC')",
        "field": "redact.proc.found",
        "value": true
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.pii?.found == null",
        "field": "redact.pii.found",
        "value": false
      }
    },
    {
      "set": {
        "if": "ctx?.redact?.proc?.found == true",
        "field": "redact.pii.found",
        "value": true
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "failure",
        "value": "GENERAL_FAILURE",
        "override": false
      }
    }
  ]
}

And now, we will add the logs-pii-redact-processor pipeline to the overall process-pii pipeline.


# Updated process-pii pipeline that now calls the NER and redact processor pipelines
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set sampling rate: 0 = none, 10000 = all; allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing historical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true && ctx.sample.sampled == true)",
        "name": "logs-ner-pii-processor"
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true &&  ctx.sample.sampled == true)",
        "name": "logs-pii-redact-processor"
      }
    }
  ]
}
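The two conditionals in this pipeline are easy to misread, so here is a small plain-Python sketch of the gating logic. It assumes, based on the description string in the pipeline, that sample.sample_rate is expressed out of 10,000. The function names are ours and are not part of the pipelines.

```python
def sample_fraction(sample_rate, scale=10_000):
    """Convert the pipeline's sample_rate (0 to 10000) into a fraction of logs."""
    return sample_rate / scale


def should_run_pii_pipelines(sample):
    """Mirror the `if` condition on the NER and redact pipeline processors.

    The PII pipelines run when sampling is disabled (every document is
    processed) or when sampling is enabled and this document was sampled.
    """
    enabled = sample.get("enabled")
    return enabled is False or (enabled is True and sample.get("sampled") is True)


print(sample_fraction(1000))  # the value used in this post, i.e. 10%
print(should_run_pii_pipelines({"enabled": True, "sampled": True}))
print(should_run_pii_pipelines({"enabled": True, "sampled": False}))
print(should_run_pii_pipelines({"enabled": False}))
```

With sampling enabled, only the sampled documents pay the cost of the NER model and the redact processor; everything else passes through untouched.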

Reload the data as described in Reloading the logs. If you have not generated the logs yet, follow the instructions in the Data Loading Appendix.

Go to Discover and enter the following into the KQL bar: sample.sampled : true and redact.message: REDACTPROC. Then add redact.message to the table. You should see something like this.

If you have not already loaded the dashboard from Part 1 of the blog, load it now. It can be found here and imported using Kibana -> Stack Management -> Saved Objects -> Import.

It should look something like this now. Note that the REGEX portions of the dashboard are now active.

Checkpoint

At this point, we have the following capabilities:

Ability to sample incoming logs and apply this PII redaction
Detect and Assess PII with the NER/NLP and Pattern Matching
Assess the amount, type and quality of the PII detections

This is a great point to stop if you are just running all this once to see how it works, but we have a few more steps to make this useful in production systems.

Clean up the working and unredacted data
Update the Dashboard to work with the cleaned-up data
Apply Role Based Access Control to protect the raw unredacted data
Create Alerts
Production and Scaling Considerations
How to run these processes on incoming or historical data

Applying to Production Systems

Clean up working data and update the dashboard

And now we will add the cleanup code to the overall process-pii pipeline.

In short, we set a flag redact.enable: true that directs the pipeline to move the unredacted message field to raw.message and then move the redacted message field redact.message to the message field. We will "protect" the raw.message in the following section.

NOTE: Of course you can change this behavior if you want to completely delete the unredacted data. In this exercise we will keep it and protect it.

In addition we set redact.cleanup: true to clean up the NLP working data.

These fields allow a lot of control over what data you decide to keep and analyze.

The code can be found here for the following two sections of code.


# Updated process-pii pipeline that now calls the NER and redact processor pipelines and cleans up
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  "processors": [
    {
      "set": {
        "description": "Set true if enabling sampling, otherwise false",
        "field": "sample.enabled",
        "value": true
      }
    },
    {
      "set": {
        "description": "Set sampling rate: 0 = none, 10000 = all; allows for 0.01% precision",
        "field": "sample.sample_rate",
        "value": 1000
      }
    },
    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing historical data",
        "field": "sample.keep_unsampled",
        "value": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == true",
        "name": "logs-sampler",
        "ignore_failure": true
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true && ctx.sample.sampled == true)",
        "name": "logs-ner-pii-processor"
      }
    },
    {
      "pipeline": {
        "if": "ctx.sample.enabled == false || (ctx.sample.enabled == true &&  ctx.sample.sampled == true)",
        "name": "logs-pii-redact-processor"
      }
    },
    {
      "set": {
        "description": "Set to true to actually redact, false will run processors but leave original",
        "field": "redact.enable",
        "value": true
      }
    },
    {
      "rename": {
        "if": "ctx?.redact?.pii?.found == true && ctx?.redact?.enable == true",
        "field": "message",
        "target_field": "raw.message"
      }
    },
    {
      "rename": {
        "if": "ctx?.redact?.pii?.found == true && ctx?.redact?.enable == true",
        "field": "redact.message",
        "target_field": "message"
      }
    },
    {
      "set": {
        "description": "Set to true to clean up the working data",
        "field": "redact.cleanup",
        "value": true
      }
    },
    {
      "remove": {
        "if": "ctx?.redact?.cleanup == true",
        "field": [
          "ml"
        ],
        "ignore_failure": true
      }
    }
  ]
}
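To make the effect of the two rename processors and the remove processor concrete, here is a plain-Python sketch of what happens to a single document. It is an illustration only, not how Elasticsearch executes the pipeline. The function name apply_cleanup and the sample document are hypothetical.

```python
import copy


def apply_cleanup(doc, redact_enable=True, cleanup=True):
    """Emulate the rename and remove steps at the end of process-pii."""
    doc = copy.deepcopy(doc)  # leave the caller's document untouched
    redact = doc.setdefault("redact", {})
    redact["enable"] = redact_enable

    pii_found = redact.get("pii", {}).get("found") is True
    if pii_found and redact_enable:
        # message -> raw.message, then redact.message -> message
        doc.setdefault("raw", {})["message"] = doc.pop("message")
        doc["message"] = redact.pop("message")

    redact["cleanup"] = cleanup
    if cleanup:
        # drop the NLP working data, ignoring documents that have none
        doc.pop("ml", None)
    return doc


doc = {
    "message": "user jane@example.com logged in",
    "redact": {"message": "user <REDACTED> logged in", "pii": {"found": True}},
    "ml": {"inference": {}},
}
cleaned = apply_cleanup(doc)
print(cleaned["message"])
print(cleaned["raw"]["message"])
```

Documents where no PII was found keep their original message field and never get a raw.message field, which matters for the access control in the next section.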

Reload the data as described in Reloading the logs.

Go to Discover and enter the following into the KQL bar: sample.sampled : true and redact.pii.found: true. Then add the following fields to the table:

message,raw.message,redact.ner.found,redact.proc.found,redact.pii.found

You should see something like this

We have everything we need to move forward with protecting the PII and Alerting on it.

Load up the new dashboard that works on the cleaned-up data

To load the dashboard, go to Kibana -> Stack Management -> Saved Objects and import the pii-dashboard-part-2.ndjson file that can be found here.

The new dashboard should look like this. Note: It uses different fields under the covers since we have cleaned up the underlying data.

You should see something like this

Apply Role Based Access Control to protect the raw unredacted data

Elasticsearch natively supports role-based access control, including field- and document-level access control, which dramatically reduces the operational and maintenance complexity required to secure our application.

We will create a Role that does not allow access to the raw.message field and then create a user and assign that user the role. With that role, the user will only be able to see the redacted message, which is now in the message field, but will not be able to access the protected raw.message field.

NOTE: Since we only sampled 10% of the data in this exercise, the non-sampled message fields are not moved to raw.message, so they are still viewable. Even so, this shows the capability you can apply in a production system.

The code can be found here for the following section of code.


# Create role with no access to the raw.message field
GET _security/role/protect-pii
DELETE _security/role/protect-pii
PUT _security/role/protect-pii
{
 "cluster": [],
 "indices": [
   {
     "names": [
       "logs-*"
     ],
     "privileges": [
       "read",
       "view_index_metadata"
     ],
     "field_security": {
       "grant": [
         "*"
       ],
       "except": [
         "raw.message"
       ]
     },
     "allow_restricted_indices": false
   }
 ],
 "applications": [
   {
     "application": "kibana-.kibana",
     "privileges": [
       "all"
     ],
     "resources": [
       "*"
     ]
   }
 ],
 "run_as": [],
 "metadata": {},
 "transient_metadata": {
   "enabled": true
 }
}

# Create user stephen with protect-pii role
GET _security/user/stephen
DELETE /_security/user/stephen
POST /_security/user/stephen
{
 "password" : "mypassword",
 "roles" : [ "protect-pii" ],
 "full_name" : "Stephen Brown"
}
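To build some intuition for the field_security block in the role, the sketch below filters a list of field names using the same grant and except lists. It is a simplification: it uses Python's fnmatch wildcards as a stand-in, while Elasticsearch's own matching has more rules (for example around object fields). The function name visible_fields is ours.

```python
from fnmatch import fnmatchcase


def visible_fields(field_names, grant, except_):
    """Return the fields a role can see: granted by a pattern and not excepted."""
    allowed = []
    for name in field_names:
        granted = any(fnmatchcase(name, pattern) for pattern in grant)
        excluded = any(fnmatchcase(name, pattern) for pattern in except_)
        if granted and not excluded:
            allowed.append(name)
    return allowed


fields = ["@timestamp", "message", "raw.message", "redact.pii.found"]
print(visible_fields(fields, grant=["*"], except_=["raw.message"]))
```

Granting everything and carving out a single field keeps the role easy to maintain: new fields added to the logs stay visible, and only the unredacted copy is hidden.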

Now log in from a separate window as the new user stephen, who has the protect-pii role. Go to Discover, put redact.pii.found : true in the KQL bar, and add the message field to the table. Also, notice that the raw.message field is not available.

You should see something like this

Create an Alert when PII Detected

Now, with the pipelines processing the data, creating an alert when PII is detected is easy. If needed, review the Alerting in Kibana documentation for details.

NOTE: Reload the data if needed to have recent data.

First, we will create a simple ES|QL query in Discover.

The code can be found here.

FROM logs-pii-default
| WHERE redact.pii.found == true
| STATS pii_count = count(*)
| WHERE pii_count > 0

When you run this you should see something like this.

Now click the Alerts menu and select Create search threshold rule. We will create an alert that notifies us when PII is found.

Select a time field: @timestamp
Set the time window: 5 minutes

Assuming you loaded the data recently, when you run Test you should see something like:

pii_count : 343 Alerts generated query matched
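If it helps to reason about when the rule fires, the query plus the 5 minute window boils down to the following logic. This is a plain-Python sketch over an in-memory list of documents, not how the rule is evaluated in Kibana. The exact window boundary handling is an assumption, and the function names and sample documents are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def pii_count(docs, now, window=timedelta(minutes=5)):
    """Count documents flagged with PII whose @timestamp falls in the window."""
    start = now - window
    return sum(
        1
        for doc in docs
        if doc.get("redact.pii.found") is True and start <= doc["@timestamp"] <= now
    )


def rule_matches(docs, now, window=timedelta(minutes=5)):
    """Mirror the final WHERE pii_count > 0: match only if PII was found."""
    return pii_count(docs, now, window) > 0


now = datetime(2024, 10, 15, 2, 45, tzinfo=timezone.utc)
docs = [
    {"@timestamp": now - timedelta(minutes=1), "redact.pii.found": True},
    {"@timestamp": now - timedelta(minutes=2), "redact.pii.found": False},
    {"@timestamp": now - timedelta(minutes=9), "redact.pii.found": True},  # outside window
]
print(pii_count(docs, now), rule_matches(docs, now))
```

This is also why the alert recovers on its own: once no flagged documents fall inside the most recent window, the query returns no rows.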

Add an action when the alert is Active.

For each alert: On status changes
Run when: Query matched

Elasticsearch query rule {{rule.name}} is active:

- PII Found: true
- PII Count: {{#context.hits}} {{_source.pii_count}}{{/context.hits}}
- Conditions Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}

Add an Action for when the Alert is Recovered.

For each alert: On status changes
Run when: Recovered

Elasticsearch query rule {{rule.name}} is Recovered:

- PII Found: false
- Conditions Not Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}

When everything is set up, it should look like this. Then click Save.

You should get an Active alert that looks like this if you have recent data. I sent mine to Slack.

Elasticsearch query rule pii-found-esql is active:
- PII Found: true
- PII Count:  374
- Conditions Met: Query matched documents over 5m
- Timestamp: 2024-10-15T02:44:52.795Z
- Link: https://mydeployment123.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989

And then if you wait you will get a Recovered alert that looks like this.

Elasticsearch query rule pii-found-esql is Recovered:
- PII Found: false
- Conditions Not Met: Query did NOT match documents over 5m
- Timestamp: 2024-10-15T02:49:04.815Z
- Link: https://mydeployment123.kb.us-west-1.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989

Production Scaling

NER Scaling

As we mentioned in Part 1 of this blog, NER / NLP models are CPU-intensive and expensive to run at scale; thus, we employed a sampling technique to understand the risk in our logs without sending the full log volume through the NER model.

Please review the setup and configuration of the NER model from Part 1 of the blog.

We chose the base BERT NER model bert-base-NER for our PII case.

The metrics below are related to the model and configuration from Part 1 of the blog.

4 Allocations to allow for more parallel ingestion
1 Thread per Allocation
0 Bytes Cache, as we expect a low cache hit rate. Note: If there are many repeated logs, the cache can help, but with timestamps and other variations, the cache will not help and can even slow down the process
8192 Queue

GET _ml/trained_models/dslim__bert-base-ner/_stats
.....
           "node": {
              "0m4tq7tMRC2H5p5eeZoQig": {
.....
                "attributes": {
                  "xpack.installed": "true",
                  "region": "us-west-1",
                  "ml.allocated_processors": "5", << HERE 
.....
            },
            "inference_count": 5040,
            "average_inference_time_ms": 138.44285714285715, << HERE 
            "average_inference_time_ms_excluding_cache_hits": 138.44285714285715,
            "inference_cache_hit_count": 0,
.....
            "threads_per_allocation": 1,
            "number_of_allocations": 4,  <<< HERE
            "peak_throughput_per_minute": 1550,
            "throughput_last_minute": 1373,
            "average_inference_time_ms_last_minute": 137.55280407865988,
            "inference_cache_hit_count_last_minute": 0
          }
        ]
      }
    }

There are 3 key pieces of information above:

"ml.allocated_processors": "5" The number of physical cores / processors available
"number_of_allocations": 4 The number of allocations, which is a maximum of 1 per physical core. Note: we could have used 5 allocations, but we only allocated 4 for this exercise
"average_inference_time_ms": 138.44285714285715 The average inference time per document

The math for throughput in Inferences per Minute (IPM) per allocation (1 allocation per physical core) is pretty straightforward, since an inference uses a single core and a single thread.

Then the Inferences per Min per Allocation is simply:

IPM per allocation = 60,000 ms (in a minute) / 138ms per inference = 435

Which then lines up with the Total Inferences per Minute:

Total IPM = 435 IPM / allocation * 4 Allocations = ~1740

Suppose we want to do 10,000 IPMs, how many allocations (cores) would I need?

Allocations = 10,000 IPM / 435 IPM per allocation = 23 allocations (cores, rounded up)

Or perhaps logs are coming in at 5000 EPS and you want to do 1% Sampling.

IPM = 5000 EPS * 60sec * 0.01 sampling = 3000 IPM sampled

Then

Number of Allocations = 3000 IPM / 435 IPM per allocation = 7 allocations (cores, rounded up)
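The arithmetic above is easy to wrap in a few helper functions so you can plug in your own numbers. This is a minimal sketch: the function names are ours and not part of any Elastic API, and it follows the rounding used in this post (the measured inference time is rounded to 138 ms, and IPM per allocation is rounded to a whole number).

```python
import math

MS_PER_MINUTE = 60_000


def ipm_per_allocation(avg_inference_ms):
    """Inferences per minute a single allocation (one core, one thread) can do."""
    return round(MS_PER_MINUTE / avg_inference_ms)


def allocations_needed(target_ipm, avg_inference_ms):
    """Allocations (cores) required to sustain target_ipm, rounded up."""
    return math.ceil(target_ipm / ipm_per_allocation(avg_inference_ms))


def sampled_ipm(events_per_second, sample_percent):
    """Inferences per minute produced by sampling a stream of log events."""
    return events_per_second * 60 * sample_percent / 100


print(ipm_per_allocation(138))                          # 435
print(allocations_needed(10_000, 138))                  # 23
print(sampled_ipm(5000, 1))                             # 3000.0
print(allocations_needed(sampled_ipm(5000, 1), 138))    # 7
```

Treat the results as a starting point for sizing: real throughput also depends on document length and node load, so measure average_inference_time_ms on your own data.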

Want it faster? It turns out there is a more lightweight NER model, distilbert-NER, that is faster, but the tradeoff is a little less accuracy.

Running the logs through this model results in an inference time nearly twice as fast!

"average_inference_time_ms": 66.0263959390863

Here is some quick math:

IPM per allocation = 60,000 ms (in a minute) / 66ms per inference = 909

Suppose we want to do 25,000 IPMs, how many allocations (cores) would I need?

Allocations = 25,000 IPM / 909 IPM per allocation = 28 allocations (cores, rounded up)

Now you can apply this math to determine the correct sampling and NER scaling to support your logging use case.

Redact Processor Scaling

In short, the redact processor should scale to production loads as long as you are using appropriately sized and configured nodes and have well-constructed regex patterns.

Assessing incoming logs

If you want to test on incoming log data in a data stream, all you need to do is change the conditional in the logs@custom pipeline to apply the process-pii pipeline to the dataset you want. You can use any conditional that fits your needs.

Note: Just make sure that you have accounted for the proper scaling for the NER and redact processors, as described above in Production Scaling.

    {
      "pipeline": {
        "description" : "Call the process-pii pipeline on the correct dataset",
        "if": "ctx?.data_stream?.dataset == 'pii'", <<< HERE
        "name": "process-pii"
      }
    }
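The value in the conditional is the dataset part of the data stream name, which follows the type-dataset-namespace naming scheme. The short sketch below derives the conditional string from a data stream name. The function names are ours, and the sketch assumes a name with exactly three hyphen-separated parts, since the dataset and namespace themselves cannot contain hyphens.

```python
def parse_data_stream(name):
    """Split a data stream name of the form type-dataset-namespace."""
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected type-dataset-namespace, got {name!r}")
    type_, dataset, namespace = parts
    return {"type": type_, "dataset": dataset, "namespace": namespace}


def pii_condition(name):
    """Build the `if` condition that selects documents from this data stream."""
    dataset = parse_data_stream(name)["dataset"]
    return f"ctx?.data_stream?.dataset == '{dataset}'"


print(pii_condition("logs-pii-default"))
```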

So if, for example, your logs are coming into logs-mycustomapp-default, you would just change the conditional to:

        "if": "ctx?.data_stream?.dataset == 'mycustomapp'",

Assessing historical data

If you have a historical (already ingested) data stream or index, you can run the assessment over it using the _reindex API.

Note: Just make sure that you have accounted for the proper scaling for the NER and redact processors, as described above in Production Scaling.

There are a couple of extra steps. The code can be found here.

First, we can set the parameters to ONLY keep the sampled data, as there is no reason to make a copy of all the unsampled data. In the process-pii pipeline, there is a setting, sample.keep_unsampled, which we can set to false so that only the sampled data is kept.

    {
      "set": {
        "description": "Set to false if you want to drop unsampled data, handy for reindexing historical data",
        "field": "sample.keep_unsampled",
        "value": false <<< SET TO false
      }
    },

Second, we will create a pipeline that will reroute the data to the correct data stream to run through all the PII assessment/detection pipelines. It also sets the correct dataset and namespace.

DELETE _ingest/pipeline/sendtopii
PUT _ingest/pipeline/sendtopii
{
  "processors": [
    {
      "set": {
        "field": "data_stream.dataset",
        "value": "pii"
      }
    },
    {
      "set": {
        "field": "data_stream.namespace",
        "value": "default"
      }
    },
    {
      "reroute" : 
      {
        "dataset" : "{{data_stream.dataset}}",
        "namespace": "{{data_stream.namespace}}"
      }
    }
  ]
}

Finally, we can run a _reindex to select the data we want to test/assess. It is recommended to review the _reindex documentation before trying this. First, select the source data stream you want to assess; in this example, it is the logs-generic-default logs data stream. Note: I also added a range filter to select a specific time range. There is a bit of a "trick" that we need to use since we are rerouting the data to the data stream logs-pii-default. To do this, we just set "index": "logs-tmp-default" in the _reindex, as the correct data stream will be set in the pipeline. We must do that because reroute is a noop if it is called from/to the same data stream.

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "logs-generic-default",
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "@timestamp": {
                "gte": "now-1h/h",
                "lt": "now"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "op_type": "create",
    "index": "logs-tmp-default",
    "pipeline": "sendtopii"
  }
}
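If you plan to run this for several source data streams or time ranges, it can be handy to generate the request body from a small helper. The sketch below only builds the JSON body shown above and guards against the noop case described earlier. The function name build_reindex_body and its parameters are ours, and sending the request is left to whatever client you use.

```python
import json


def build_reindex_body(
    source_index,
    pipeline="sendtopii",
    placeholder_index="logs-tmp-default",
    target_data_stream="logs-pii-default",
    gte="now-1h/h",
    lt="now",
):
    """Build a _reindex body that sends documents through the reroute pipeline."""
    # reroute is a noop when the destination already is the target data stream,
    # so the placeholder destination must be a different name.
    if placeholder_index == target_data_stream:
        raise ValueError("placeholder_index must differ from the reroute target")
    return {
        "source": {
            "index": source_index,
            "query": {
                "bool": {
                    "filter": [{"range": {"@timestamp": {"gte": gte, "lt": lt}}}]
                }
            },
        },
        "dest": {
            "op_type": "create",
            "index": placeholder_index,
            "pipeline": pipeline,
        },
    }


body = build_reindex_body("logs-generic-default")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into Dev Tools after POST _reindex?wait_for_completion=false, or passed to the reindex call of your Elasticsearch client of choice.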

Summary

At this point, you have the tools and processes needed to assess, detect, analyze, alert on, and protect PII in your logs.

The end state solution can be found here.

In Part 1 of this blog, we accomplished the following.

Reviewed the techniques and tools we have available for PII detection and assessment
Reviewed NLP / NER role in PII detection and assessment
Built the necessary composable ingest pipelines to sample logs and run them through the NER Model
Reviewed the NER results and are ready to move to the second blog

In Part 2 of this blog, we covered the following:

Redact PII using NER and redact processor
Apply field-level security to control access to the un-redacted data
Enhance the dashboards and alerts
Production considerations and scaling
How to run these processes on incoming or historical data

So get to work and reduce risk in your logs!

Data Loading Appendix

Code

The data loading code can be found here:

https://github.com/bvader/elastic-pii

$ git clone https://github.com/bvader/elastic-pii.git

Creating and Loading the Sample Data Set

$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker

Run the log generator

$ python generate_random_logs.py

If you do not change any parameters, this will create 10000 random logs in a file named pii.log, with a mix of logs that contain and do not contain PII.

Edit load_logs.py and set the following

# The Elastic User 
ELASTIC_USER = "elastic"

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = "askdjfhasldfkjhasdf"

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = "deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ="

Then run the following command.

$ python load_logs.py

Reloading the logs

$ python load_logs.py