<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs - Articles by Stephen Brown</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted security news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Mon, 08 Jun 2026 15:18:17 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs - Articles by Stephen Brown</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Bringing Your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></title>
            <link>https://www.elastic.co/observability-labs/blog/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</link>
            <guid isPermaLink="false">bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</guid>
            <pubDate>Mon, 19 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to bring your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></description>
            <content:encoded><![CDATA[<h2>Introduction:</h2>
<p>Kubernetes audit logs are essential for ensuring the security, compliance, and transparency of Kubernetes clusters. However, with managed Kubernetes infrastructure, traditional audit file-based log shipping is often not supported, and audit logs are only available via the control plane API or the Cloud Provider logging facility. In this blog, we will show you how to ingest the audit logs from these other sources and still take advantage of the <a href="https://www.elastic.co/docs/current/integrations/kubernetes/audit-logs">Elastic Kubernetes Audit Log Integration</a>.</p>
<p>In this blog, we will focus on AWS as our cloud provider. When ingesting logs from AWS, you have several options:</p>
<ul>
<li><a href="https://www.elastic.co/docs/current/integrations/aws_logs">AWS Custom Logs integration</a> (which we will utilize in this blog)</li>
<li><a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose</a> to send logs from CloudWatch to Elastic</li>
<li><a href="https://www.elastic.co/docs/current/integrations/aws">AWS General integration</a> which supports many AWS sources</li>
</ul>
<p>In part 1 of this two-part series, we will focus on properly ingesting Kubernetes audit logs, and part 2 will focus on investigation, analytics, and alerting.</p>
<p>Kubernetes auditing <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/">documentation</a> describes the need for auditing in order to get answers to the questions below:</p>
<ul>
<li>What happened?</li>
<li>When did it happen?</li>
<li>Who initiated it?</li>
<li>What resource did it occur on?</li>
<li>Where was it observed?</li>
<li>From where was it initiated (Source IP)?</li>
<li>Where was it going (Destination IP)?</li>
</ul>
<p>Answers to the above questions become important when an incident occurs and an investigation follows. Alternatively, it could just be a log retention use case for a regulated company trying to fulfill compliance requirements. </p>
<p>We are giving special importance to audit logs in Kubernetes because audit logs are not enabled by default. Audit logs can take up a large amount of memory and storage. So, usually, it’s a balance between retaining/investigating audit logs against giving up resources budgeted otherwise for workloads to be hosted on the Kubernetes cluster. Another reason we’re talking about audit logs in Kubernetes is that, unlike usual container logs, after being turned on, these logs are orchestrated to write to the cloud provider’s logging service. This is true for most cloud providers because the Kubernetes control plane is managed by the cloud providers. It makes sense for cloud providers to use their built-in orchestration workflows involving the control plane for a managed service backed by their implementation of a logging framework.</p>
<p>Kubernetes audit logs can be quite verbose by default. Hence, it becomes important to choose selectively how much is logged so that the organization’s audit requirements are met. This is done in the <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/#audit-policy">audit policy</a> file, which is passed to the <code>kube-apiserver</code>. Not every flavor of cloud-provider-hosted Kubernetes allows you to configure the <code>kube-apiserver</code> directly. For example, AWS EKS allows this <a href="https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html">logging</a> to be done only by the control plane.</p>
<p><strong>In this blog we will be using Elastic Kubernetes Service (Amazon EKS) on AWS with the Kubernetes Audit Logs that are automatically shipped to AWS CloudWatch.</strong></p>
<p>A sample audit log for a secret by the name “empty-secret” created by an admin user on EKS  is logged on AWS CloudWatch in the following format: </p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-clougwatch-logs.png" alt="Alt text" /></p>
<p>Once the audit logs show up on CloudWatch, it is time to consider how to transfer them to Elasticsearch. Elasticsearch is a great platform for creating dashboards that visualize different audit events recorded in a Kubernetes cluster. It is also a powerful tool for analyzing various audit events. For example, how many secret object creation attempts were made in an hour? </p>
<p>Now that we have established that the Kubernetes audit logs are being logged in CloudWatch, let’s discuss how to get the logs ingested into Elasticsearch. Elasticsearch has an integration to consume logs written to CloudWatch. Using this integration with its defaults ingests the JSON from CloudWatch as is, i.e., the real audit log JSON is nested inside the wrapper CloudWatch JSON. When bringing logs into Elasticsearch, it is important that we use the <a href="https://www.elastic.co/guide/en/ecs/current/index.html">Elastic Common Schema</a> (ECS) to get the best search and analytics performance. This means that there needs to be an ingest pipeline that parses a standard Kubernetes audit JSON message and creates an ECS-compliant document in Elasticsearch. Let’s dive into how to achieve this.</p>
<p>Elasticsearch has a Kubernetes integration using Elastic Agent to consume Kubernetes container logs from the console and audit logs written to a file path. For a cloud-provider use case, as described above, it may not be feasible to write audit logs to a path on the Kubernetes cluster. So, how do we leverage the <a href="https://github.com/elastic/integrations/blob/main/packages/kubernetes/data_stream/audit_logs/fields/fields.yml">ECS field mappings designed for parsing the Kubernetes audit logs</a>, already implemented in the Kubernetes integration, to work on the CloudWatch audit logs? That is the most exciting plumbing piece! Let’s see how to do it.</p>
<h3>What we’re going to do is:</h3>
<ul>
<li>
<p>Read the Kubernetes audit logs from the cloud provider’s logging module, in our case AWS CloudWatch, since this is where the logs reside. We will use Elastic Agent and the <a href="https://www.elastic.co/docs/current/integrations/aws_logs">Elasticsearch AWS Custom Logs integration</a> to read logs from CloudWatch. <strong>Note:</strong> there are several Elastic AWS integrations; we are specifically using the AWS Custom Logs integration.</p>
</li>
<li>
<p>Create two simple ingest pipelines (we do this for best practices of isolation and composability) </p>
</li>
<li>
<p>The first pipeline looks for Kubernetes audit JSON messages and then redirects them to the second pipeline</p>
</li>
<li>
<p>The second custom pipeline will associate the JSON <code>message</code> field with the correct field expected by the Elasticsearch Kubernetes Audit managed pipeline (aka the Integration) and then <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html"><code>reroute</code></a> the message to the correct data stream, <code>logs-kubernetes.audit_logs-default</code>, which in turn applies all the proper mapping and ingest pipelines for the incoming message</p>
</li>
<li>
<p>The overall flow will be</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/overall-ingestion-flow.png" alt="Alt text" /></p>
<h3>1. Create an AWS CloudWatch integration:</h3>
<p>a.  Populate the AWS access key and secret pair values</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-1.png" alt="Alt text" /></p>
<p>b. In the logs section, populate the log ARN, Tags and Preserve the original event if you want to, and then Save this integration and exit from the page</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-2.png" alt="Alt text" /></p>
<h3>2. Next, we will configure the custom ingest pipeline</h3>
<p>We are doing this because we want to override what the generic managed pipeline does. We will retrieve the custom component name by searching for the managed pipeline created as an asset when we install the AWS CloudWatch integration. In this case, we will be adding the custom ingest pipeline <code>logs-aws_logs.generic@custom</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-logs-index-management.png" alt="Alt text" /></p>
<p>From the Dev Tools console, run the requests below. Here, we are extracting the message field from the CloudWatch JSON and putting the value in a field called <code>kubernetes.audit</code>. Then, we are rerouting this message to the default Kubernetes audit dataset that comes with the Kubernetes integration.</p>
<pre><code>PUT _ingest/pipeline/logs-aws_logs.generic@custom
{
    &quot;processors&quot;: [
      {
        &quot;pipeline&quot;: {
          &quot;if&quot;: &quot;ctx.message != null &amp;&amp; ctx.message.contains('audit.k8s.io')&quot;,
          &quot;name&quot;: &quot;logs-aws-process-k8s-audit&quot;
        }
      }
    ]
}

PUT _ingest/pipeline/logs-aws-process-k8s-audit
{
  &quot;processors&quot;: [
    {
      &quot;json&quot;: {
        &quot;field&quot;: &quot;message&quot;,
        &quot;target_field&quot;: &quot;kubernetes.audit&quot;
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: &quot;message&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: &quot;kubernetes.audit_logs&quot;,
        &quot;namespace&quot;: &quot;default&quot;
      }
    }
  ]
}
</code></pre>
<p>Let’s understand this further:</p>
<ul>
<li>
<p>When we create a Kubernetes integration, we get a managed index template called <code>logs-kubernetes.audit_logs</code> that writes to the pipeline called <code>logs-kubernetes.audit_logs-1.62.2</code> by default</p>
</li>
<li>
<p>If we look into the pipeline <code>logs-kubernetes.audit_logs-1.62.2</code>, we see that all the processor logic is working against the field <code>kubernetes.audit</code>. This is the reason why our json processor in the above code snippet is creating a field called <code>kubernetes.audit</code> before dropping the original <em>message</em> field and rerouting. Rerouting is directed to the <code>kubernetes.audit_logs</code> dataset that backs the <code>logs-kubernetes.audit_logs-1.62.2</code> pipeline (the dataset name is derived from the pipeline naming convention, which has the format <code>logs-&lt;datasetname&gt;-version</code>)</p>
</li>
</ul>
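<p>To make the data flow concrete, here is a minimal Python sketch that mimics, on a plain dictionary, what the two pipelines above do to one CloudWatch event. It is an illustration only, not how Elasticsearch implements these processors; the function names and the sample event are hypothetical.</p>

```python
import json

def process_k8s_audit(doc):
    """Mimic the json, remove, and reroute processors on a plain dict."""
    message = doc.get("message")
    # First pipeline: only Kubernetes audit events go to the second pipeline.
    if message is None or "audit.k8s.io" not in message:
        return doc
    out = dict(doc)
    # json processor: parse the raw string into kubernetes.audit.
    out["kubernetes"] = {"audit": json.loads(message)}
    # remove processor: drop the original message field.
    del out["message"]
    # reroute processor: change dataset and namespace, keep the type.
    out["data_stream"] = {
        **doc.get("data_stream", {}),
        "dataset": "kubernetes.audit_logs",
        "namespace": "default",
    }
    return out

def target_data_stream(doc):
    """Data stream names follow the type-dataset-namespace convention."""
    ds = doc["data_stream"]
    return "-".join([ds["type"], ds["dataset"], ds["namespace"]])

# Hypothetical, heavily trimmed audit event as it might arrive from CloudWatch.
sample = {
    "message": json.dumps({
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "verb": "create",
        "objectRef": {"resource": "secrets", "name": "empty-secret"},
    }),
    "data_stream": {"type": "logs", "dataset": "aws_logs.generic", "namespace": "default"},
}

result = process_k8s_audit(sample)
print(target_data_stream(result))
print(result["kubernetes"]["audit"]["verb"])
```

<p>Events whose message does not mention <code>audit.k8s.io</code> are returned unchanged, which mirrors the conditional in the <code>logs-aws_logs.generic@custom</code> pipeline.</p>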
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/ingest-pipelines.png" alt="Alt text" /></p>
<h3>3.  Now let’s verify that the logs are actually flowing through and the audit message is being parsed</h3>
<p>a. We will use Elastic Agent and enroll it using Fleet and the integration policy we created in Step 1. There are a number of ways to <a href="https://www.elastic.co/guide/en/fleet/current/install-fleet-managed-elastic-agent.html">deploy Elastic Agent</a>, and for this exercise we will deploy using Docker, which is quick and easy.</p>
<pre><code>% docker run --env FLEET_ENROLL=1 --env FLEET_URL=&lt;&lt;fleet_URL&gt;&gt; --env FLEET_ENROLLMENT_TOKEN=&lt;&lt;fleet_enrollment_token&gt;&gt;  --rm docker.elastic.co/beats/elastic-agent:8.19.13
</code></pre>
<p>b. Check the messages in Discover. In 8.15 there is also a new feature called Logs Explorer which provides an ability to see Kubernetes Audit logs (and container logs) with a few clicks (see image below). Voila! We can see the Kubernetes audit messages parsed!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/discover.jpg" alt="Alt text" /></p>
<h3>4. Let's do a quick recap of what we did</h3>
<p>We configured the CloudWatch integration in Elasticsearch to read Kubernetes audit logs from CloudWatch. Then, we created custom ingest pipelines to reroute the audit messages to the correct data stream, where they pick up all the OOTB mappings and parsing that come with the Kubernetes Audit Logs integration.</p>
<p>In the next part, we’ll look at how to analyze the ingested Kubernetes Audit log data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 1]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1</link>
            <guid isPermaLink="false">pii-ner-regex-assess-redact-part-1</guid>
            <pubDate>Wed, 25 Sep 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to detect and assess PII in your logs using Elasticsearch and NLP]]></description>
            <content:encoded><![CDATA[<h2>Introduction:</h2>
<p>The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and Pattern matching to detect, assess, and, where feasible, redact PII from logs that are being ingested into Elasticsearch.</p>
<p>In <strong>Part 1</strong> of this blog, we will cover the following:</p>
<ul>
<li>Review the techniques and tools we have available to manage PII in our logs</li>
<li>Understand the roles of NLP / NER in PII detection</li>
<li>Build a composable processing pipeline to detect and assess PII</li>
<li>Sample logs and run them through the NER Model</li>
<li>Assess the results of the NER Model</li>
</ul>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2">Part 2 of this blog</a>, we will cover the following:</p>
<ul>
<li>Redact PII using NER and the redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p>Here is the overall flow we will construct over the 2 blogs:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-overall-flow.png" alt="PII Overall Flow" /></p>
<p>All code for this exercise can be found at:
<a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a>.</p>
<h2>Tools and Techniques</h2>
<p>There are four general capabilities that we will use for this exercise.</p>
<ul>
<li>Named Entity Recognition (NER) Detection</li>
<li>Pattern Matching Detection</li>
<li>Log Sampling</li>
<li>Ingest Pipelines as Composable Processing</li>
</ul>
<h4>Named Entity Recognition (NER) Detection</h4>
<p>NER is a sub-task of Natural Language Processing (NLP) that involves identifying and categorizing named entities in unstructured text into predefined categories such as:</p>
<ul>
<li>Person: Names of individuals, including celebrities, politicians, and historical figures.</li>
<li>Organization: Names of companies, institutions, and organizations.</li>
<li>Location: Geographic locations, including cities, countries, and landmarks.</li>
<li>Event: Names of events, including conferences, meetings, and festivals.</li>
</ul>
<p>For our PII use case, we will choose the base BERT NER model <a href="https://huggingface.co/dslim/bert-base-NER">bert-base-NER</a> that can be downloaded from <a href="https://huggingface.co">Hugging Face</a> and loaded into Elasticsearch as a trained model.</p>
<p><strong>Important Note:</strong>  NER / NLP Models are CPU-intensive and expensive to run at scale; thus, we will want to employ a sampling technique to understand the risk in our logs without sending the full logs volume through the NER Model. We will discuss the performance and scaling of the NER model in part 2 of the blog.</p>
<h4>Pattern Matching Detection</h4>
<p>In addition to using an NER, regex pattern matching is a powerful tool for detecting and redacting PII based on common patterns. The Elasticsearch <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html">redact</a> processor is built for this use case.</p>
<h4>Log Sampling</h4>
<p>Considering the performance implications of NER and the fact that we may be ingesting a large volume of logs into Elasticsearch, it makes sense to sample our incoming logs. We will build a simple log sampler to accomplish this.</p>
<h4>Ingest Pipelines as Composable Processing</h4>
<p>We will create several pipelines, each focusing on a specific capability and a main ingest pipeline to orchestrate the overall process.</p>
<h2>Building the Processing Flow</h2>
<h4>Logs Sampling + Composable Ingest Pipelines</h4>
<p>The first thing we will do is set up a sampler to sample our logs. This ingest pipeline takes a sampling rate between 0 (no logs) and 10000 (all logs), which allows a sampling rate as low as ~0.01%, and marks the sampled logs with <code>sample.sampled: true</code>. Further processing on the logs will be driven by the value of <code>sample.sampled</code>. The <code>sample.sample_rate</code> can be set here or &quot;passed in&quot; from the orchestration pipeline.</p>
<p>The commands should be run from Kibana -&gt; Dev Tools.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-part-1.json">The code can be found here</a> for the following three sections of code.</p>
&lt;details open&gt;
  &lt;summary&gt;logs-sampler pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># logs-sampler pipeline - part 1
DELETE _ingest/pipeline/logs-sampler
PUT _ingest/pipeline/logs-sampler
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;if&quot;: &quot;ctx.sample?.sample_rate == null&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 10000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Determine if keeping unsampled docs&quot;,
        &quot;if&quot;: &quot;ctx.sample?.keep_unsampled == null&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;sample.sampled&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;&quot;&quot; Random r = new Random();
        ctx.sample.random = r.nextInt(params.max); &quot;&quot;&quot;,
        &quot;params&quot;: {
          &quot;max&quot;: 10000
        }
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx.sample.random &lt; ctx.sample.sample_rate&quot;,
        &quot;field&quot;: &quot;sample.sampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;drop&quot;: {
         &quot;description&quot;: &quot;Drop unsampled document if applicable&quot;,
        &quot;if&quot;: &quot;ctx.sample.keep_unsampled == false &amp;&amp; ctx.sample.sampled == false&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Now, let's test the logs sampler. We will build the first part of the composable pipeline. We will be sending logs to the logs-generic-default data stream. With that in mind, we will create the <code>logs@custom</code> ingest pipeline that will be automatically called using the logs <a href="https://www.elastic.co/guide/en/fleet/current/data-streams.html#data-streams-pipelines">data stream framework</a> for customization. We will add one additional level of abstraction so that you can apply this PII processing to other data streams.</p>
<p>Next, we will create the <code>process-pii</code> pipeline. This is the core processing pipeline where we will orchestrate PII processing component pipelines. In this first step, we will simply apply the sampling logic. Note that we are setting the sampling rate to 1000, which is equivalent to 10% of the logs.</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Process PII pipeline - part 1
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Finally, we create the <code>logs@custom</code> pipeline, which will simply call our <code>process-pii</code> pipeline based on the correct <code>data_stream.dataset</code>.</p>
&lt;details open&gt;
  &lt;summary&gt;logs@custom pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># logs@custom pipeline - part 1
DELETE _ingest/pipeline/logs@custom
PUT _ingest/pipeline/logs@custom
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;pipelinetoplevel&quot;,
        &quot;value&quot;: &quot;logs@custom&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;pipelinetoplevelinfo&quot;,
        &quot;value&quot;: &quot;{{{data_stream.dataset}}}&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;description&quot;: &quot;Call the process-pii pipeline on the correct dataset&quot;,
        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'pii'&quot;,
        &quot;name&quot;: &quot;process-pii&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Now, let's test to see the sampling at work.</p>
<p>Load the data as described here <a href="#data-loading-appendix">Data Loading Appendix</a>. Let's use the sample data first, and we will talk about how to test with your incoming or historical logs later at the end of this blog.</p>
<p>If you look at Observability -&gt; Logs -&gt; Logs Explorer with the KQL filter <code>data_stream.dataset : pii</code> and break down by <code>sample.sampled</code>, you should see that approximately 10% of the logs are sampled.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-discover-1-part-1.png" alt="PII Discover 1" /></p>
<p>At this point we have a composable ingest pipeline that is &quot;sampling&quot; logs. As a bonus, you can use this logs sampler for any other use cases you have as well.</p>
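<p>If you want to sanity-check the expected ratio outside Elasticsearch, the sampling decision can be approximated in a few lines of Python. This is a rough sketch under stated assumptions rather than the pipeline itself: <code>is_sampled</code> and <code>sampled_fraction</code> are hypothetical helper names, Python's <code>random</code> module stands in for the Painless <code>Random</code> class, and the comparison is strict, so a rate of 0 never samples and a rate of 10000 always samples.</p>

```python
import random

MAX_DRAW = 10000  # same upper bound the pipeline passes to nextInt

def is_sampled(sample_rate, rng):
    """Return True when a document would be marked sample.sampled: true."""
    draw = rng.randrange(MAX_DRAW)  # uniform integer in 0..9999
    # Strict comparison: rate 0 never samples, rate 10000 always samples.
    return sample_rate > draw

def sampled_fraction(sample_rate, docs, seed=42):
    """Fraction of docs simulated documents that end up sampled."""
    rng = random.Random(seed)
    hits = sum(is_sampled(sample_rate, rng) for _ in range(docs))
    return hits / docs

# A rate of 1000 out of 10000 should land close to 10%.
print(sampled_fraction(1000, 100_000))
```

<p>Because each document gets an independent random draw, the observed percentage fluctuates around the configured rate, which is why the breakdown in Logs Explorer is approximately, not exactly, 10%.</p>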
<h4>Loading, Configuration, and Execution of the NER Pipeline</h4>
<h5>Loading the NER Model</h5>
<p>You will need a Machine Learning node to run the NER model on. In this exercise, we are using <a href="https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html">Elastic Cloud Hosted Deployment</a> on AWS with the <a href="https://www.elastic.co/guide/en/cloud/current/ec_selecting_the_right_configuration_for_you.html">CPU Optimized (ARM)</a> architecture. The NER inference will run on a Machine Learning AWS c5d node. There will be GPU options in the future, but today, we will stick with CPU architecture.</p>
<p>This exercise will use a single c5d with 8 GB RAM with 4.2 vCPU up to 8.4 vCPU</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-ml-node-part-1.png" alt="ML Node" /></p>
<p>Please refer to the official documentation on <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-import-model.html">how to import an NLP-trained model into Elasticsearch</a> for complete instructions on uploading, configuring, and deploying the model.</p>
<p>The quickest way to get the model is using the Eland Docker method.</p>
<p>The following command will load the model into Elasticsearch but will not start it. We will do that in the next step.</p>
<pre><code class="language-bash">docker run -it --rm --network host docker.elastic.co/eland/eland \
  eland_import_hub_model \
  --url https://mydeployment.es.us-west-1.aws.found.io:443/ \
  -u elastic -p password \
  --hub-model-id dslim/bert-base-NER --task-type ner

</code></pre>
<h5>Deploy and Start the NER Model</h5>
<p>In general, to improve ingest performance, increase throughput by adding more allocations to the deployment. For improved search speed, increase the number of threads per allocation.</p>
<p>To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html">here</a>. The number of allocations must be less than the available allocated processors (cores, not vCPUs) per node.</p>
<p>To deploy and start the NER model, we will use the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/8.15/start-trained-model-deployment.html">Start trained model deployment API</a>.</p>
<p>We will configure the following:</p>
<ul>
<li>4 Allocations to allow for more parallel ingestion</li>
<li>1 Thread per Allocation</li>
<li>0 Bytes Cache, as we expect a low cache hit rate</li>
<li>8192 Queue</li>
</ul>
<pre><code># Start the model with 4 Allocations x 1 Thread, no cache, and 8192 queue
POST _ml/trained_models/dslim__bert-base-ner/deployment/_start?cache_size=0b&amp;number_of_allocations=4&amp;threads_per_allocation=1&amp;queue_capacity=8192

</code></pre>
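<p>If you script your deployments, it can help to assemble this request from named settings instead of editing a long query string by hand. The sketch below only builds the request path and does not call Elasticsearch; <code>start_deployment_path</code> is a hypothetical helper, and the parameter names are the ones used in the request above.</p>

```python
from urllib.parse import urlencode

def start_deployment_path(model_id, allocations, threads, queue_capacity, cache_size="0b"):
    """Build the path and query string for the start deployment request."""
    params = {
        "cache_size": cache_size,
        "number_of_allocations": allocations,
        "threads_per_allocation": threads,
        "queue_capacity": queue_capacity,
    }
    return "_ml/trained_models/" + model_id + "/deployment/_start?" + urlencode(params)

# Same settings as the list above: 4 allocations, 1 thread, 8192 queue, no cache.
path = start_deployment_path("dslim__bert-base-ner", 4, 1, 8192)
print("POST " + path)
```

<p>Keeping the settings as named arguments makes it easier to see what changes when you later scale the number of allocations.</p>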
<p>You should get a response that looks something like this.</p>
<pre><code class="language-bash">{
  &quot;assignment&quot;: {
    &quot;task_parameters&quot;: {
      &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
      &quot;deployment_id&quot;: &quot;dslim__bert-base-ner&quot;,
      &quot;model_bytes&quot;: 430974836,
      &quot;threads_per_allocation&quot;: 1,
      &quot;number_of_allocations&quot;: 4,
      &quot;queue_capacity&quot;: 8192,
      &quot;cache_size&quot;: &quot;0&quot;,
      &quot;priority&quot;: &quot;normal&quot;,
      &quot;per_deployment_memory_bytes&quot;: 430914596,
      &quot;per_allocation_memory_bytes&quot;: 629366952
    },
...
    &quot;assignment_state&quot;: &quot;started&quot;,
    &quot;start_time&quot;: &quot;2024-09-23T21:39:18.476066615Z&quot;,
    &quot;max_assigned_allocations&quot;: 4
  }
}
</code></pre>
<p>The NER model has been deployed and started and is ready to be used.</p>
<p>The following ingest pipeline implements the NER model via the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html">inference</a> processor.</p>
<p>There is a significant amount of code here, but only two items are of interest for now. The rest of the code is conditional logic to drive some additional specific behavior that we will look at more closely in the future.</p>
<ol>
<li>
<p>The inference processor calls the NER model we loaded previously by its ID and passes it the text to be analyzed. In this case, that is the <code>message</code> field, which is mapped to the <code>text_field</code> input the NER model analyzes for PII.</p>
</li>
<li>
<p>The script processor uses the data generated by the NER model to replace the identified PII in the message with redacted placeholders. This looks more complex than it really is: it simply loops through the array of ML predictions, replaces each one in the message string with a constant, and stores the result in a new field, <code>redact.message</code>. We will look at this a little closer in the following steps.</p>
</li>
</ol>
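<p>The replacement loop described in the second item can be sketched in Python as follows. This is a simplified stand-in for the Painless script, not the script itself: <code>redact_with_ner</code> is a hypothetical function name, and the sample message and entity list are made up to resemble the <code>entity</code>, <code>class_name</code>, and <code>class_probability</code> values the script reads from the inference results.</p>

```python
def redact_with_ner(message, entities, skip_entity="NONE", minimum_score=0.0):
    """Replace each detected entity in the message with a typed placeholder."""
    redacted = message
    for item in entities:
        keep_class = item["class_name"] != skip_entity
        confident = item["class_probability"] >= minimum_score
        if keep_class and confident:
            placeholder = "<REDACTNER-" + item["class_name"] + "_NER>"
            # Every occurrence of the entity text is replaced.
            redacted = redacted.replace(item["entity"], placeholder)
    return redacted

# Made-up predictions shaped like the entries the script loops over.
entities = [
    {"entity": "Jane Doe", "class_name": "PER", "class_probability": 0.99},
    {"entity": "Acme", "class_name": "ORG", "class_probability": 0.71},
]
message = "User Jane Doe from Acme logged in"

print(redact_with_ner(message, entities))
print(redact_with_ner(message, entities, skip_entity="ORG"))
print(redact_with_ner(message, entities, minimum_score=0.9))
```

<p>The second and third calls show how <code>redact.ner.skip_entity</code> and <code>redact.ner.minimum_score</code> narrow what gets replaced: in both cases only the person name is redacted.</p>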
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-part-2.json">The code can be found here</a> for the following three sections of code.</p>
<p>The NER PII Pipeline</p>
&lt;details open&gt;
  &lt;summary&gt;logs-ner-pii-processor pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># NER Pipeline
DELETE _ingest/pipeline/logs-ner-pii-processor
PUT _ingest/pipeline/logs-ner-pii-processor
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to actually redact, false will run processors but leave original&quot;,
        &quot;field&quot;: &quot;redact.enable&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to keep ml results for debugging&quot;,
        &quot;field&quot;: &quot;redact.ner.keep_result&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to PER, LOC, ORG to skip, or NONE to not drop any replacement&quot;,
        &quot;field&quot;: &quot;redact.ner.skip_entity&quot;,
        &quot;value&quot;: &quot;NONE&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Minimum class_probability an entity needs to be redacted, 0 redacts every detected entity&quot;,
        &quot;field&quot;: &quot;redact.ner.minimum_score&quot;,
        &quot;value&quot;: 0
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message == null&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;copy_from&quot;: &quot;message&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.ner.successful&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.ner.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;inference&quot;: {
        &quot;model_id&quot;: &quot;dslim__bert-base-ner&quot;,
        &quot;field_map&quot;: {
          &quot;message&quot;: &quot;text_field&quot;
        },
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_NER_FAILED&quot;
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.ner.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    {
      &quot;script&quot;: {
        &quot;if&quot;: &quot;ctx.failure != 'REDACT_NER_FAILED'&quot;,
        &quot;lang&quot;: &quot;painless&quot;,
        &quot;source&quot;: &quot;&quot;&quot;String msg = ctx['message'];
          for (item in ctx['ml']['inference']['entities']) {
          	if ((item['class_name'] != ctx.redact.ner.skip_entity) &amp;&amp; 
          	  (item['class_probability'] &gt;= ctx.redact.ner.minimum_score)) {  
          		  msg = msg.replace(item['entity'], '&lt;' + 
          		  'REDACTNER-'+ item['class_name'] + '_NER&gt;')
          	}
          }
          ctx.redact.message = msg&quot;&quot;&quot;,
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_REPLACEMENT_SCRIPT_FAILED&quot;,
              &quot;override&quot;: false
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.ml?.inference?.entities.size() &gt; 0&quot;, 
        &quot;field&quot;: &quot;redact.ner.found&quot;,
        &quot;value&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == null&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.ner?.found == true&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;if&quot;: &quot;ctx.redact.ner.keep_result != true&quot;,
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_missing&quot;: true,
        &quot;ignore_failure&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;GENERAL_FAILURE&quot;,
        &quot;override&quot;: false
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
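<p>To make the replacement logic in the script processor easier to follow, here is a minimal Python sketch of the same loop. It is only an illustration and not part of the pipeline: the function name <code>redact_entities</code> and its arguments are hypothetical, while the entity fields (<code>entity</code>, <code>class_name</code>, <code>class_probability</code>) and the placeholder format mirror the Painless script above.</p>

```python
# Illustrative Python equivalent of the Painless replacement script.
# The function and argument names are hypothetical; only the entity field
# names and the placeholder format are taken from the pipeline above.

def redact_entities(message, entities, skip_entity="NONE", minimum_score=0.0):
    """Replace each NER entity in message with a <REDACTNER-CLASS_NER> placeholder."""
    redacted = message
    for item in entities:
        # Mirrors redact.ner.skip_entity: skip one entity class entirely.
        if item["class_name"] == skip_entity:
            continue
        # Mirrors redact.ner.minimum_score: ignore low-confidence predictions.
        if item["class_probability"] < minimum_score:
            continue
        placeholder = "<REDACTNER-" + item["class_name"] + "_NER>"
        redacted = redacted.replace(item["entity"], placeholder)
    return redacted


sample_message = "Payment successful for order #4594 (user: Paul Buck), Ordered from: Costco"
sample_entities = [
    {"entity": "Paul Buck", "class_name": "PER", "class_probability": 0.999},
    {"entity": "Costco", "class_name": "ORG", "class_probability": 0.595},
]

if __name__ == "__main__":
    print(redact_entities(sample_message, sample_entities))
    print(redact_entities(sample_message, sample_entities, minimum_score=0.9))
```

<p>As in the Painless version, the replacement is a plain string substitution, so every occurrence of an entity's text is replaced, not only the position the model flagged.</p>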
<p>The updated PII Processor Pipeline, which now calls the NER Pipeline</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated Process PII pipeline that now calls the NER pipeline
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    }
  ]
}

</code></pre>
&lt;/details&gt;
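<p>The two settings that drive this pipeline are the sample rate and the condition that decides whether the NER pipeline runs. The short Python sketch below restates both. The function names are hypothetical, and the scale of 10000 is taken from the processor description above (0 means none, 10000 means all).</p>

```python
# Illustrative restatement of the sampling settings in process-pii.
# Function names are hypothetical; the 10000 scale comes from the
# "sample.sample_rate" description in the pipeline above.

SAMPLE_RATE_SCALE = 10000


def sample_rate_percent(sample_rate):
    """Convert a sample.sample_rate value into a percentage of logs."""
    return 100.0 * sample_rate / SAMPLE_RATE_SCALE


def should_run_pii_pipelines(sample_enabled, sampled):
    """Mirror of the 'if' condition on the logs-ner-pii-processor call.

    Run when sampling is disabled (process everything), or when sampling
    is enabled and this document was marked as sampled.
    """
    return (not sample_enabled) or (sample_enabled and sampled is True)


if __name__ == "__main__":
    print(sample_rate_percent(1000))
    print(should_run_pii_pipelines(True, False))
```

<p>With the value of 1000 used above, roughly 10% of the logs are sent through the NER model, which keeps the cost of inference down while you assess the data.</p>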
<p>Now reload the data as described in <a href="#reloading-the-logs">Reloading the logs</a>.</p>
<h3>Results</h3>
<p>Let's take a look at the results with the NER processing in place. In Logs Explorer, enter the following query in the KQL query bar:
<code>data_stream.dataset : pii and ml.inference.entities.class_name : (&quot;PER&quot; and &quot;LOC&quot; and &quot;ORG&quot; )</code></p>
<p>Logs Explorer should look something like this. Open the top message to see the details.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-discover-2-part-1.png" alt="PII Discover 2" /></p>
<h4>NER Model Results</h4>
<p>Let's take a closer look at what these fields mean.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.class_name</code><br />
<strong>Sample Value:</strong> <code>[PER, PER, LOC, ORG, ORG]</code><br />
<strong>Description:</strong> An array of the named entity classes that the NER model has identified.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.class_probability</code><br />
<strong>Sample Value:</strong> <code>[0.999, 0.972, 0.896, 0.506, 0.595]</code><br />
<strong>Description:</strong> The <code>class_probability</code> is a value between 0 and 1 that indicates how likely it is that a given entity belongs to the named class. The higher the number, the higher the probability. <strong>This is important because in the next blog we can decide on a threshold to use for alerting and redaction.</strong>
In this example, the model identified a <code>LOC</code> (<code>ME</code>) as an <code>ORG</code> with a low score. Setting a threshold lets us find or filter out such results.</p>
<p><strong>Field:</strong> <code>ml.inference.entities.entity</code><br />
<strong>Sample Value:</strong> <code>[Paul Buck, Steven Glens, South Amyborough, ME, Costco]</code><br />
<strong>Description:</strong> The array of entities identified that align positionally with the <code>class_name</code> and <code>class_probability</code>.</p>
<p><strong>Field:</strong> <code>ml.inference.predicted_value</code><br />
<strong>Sample Value:</strong> <code>[2024-09-23T14:32:14.608207-07:00Z] log.level=INFO: Payment successful for order #4594 (user: [Paul Buck](PER&amp;Paul+Buck), david59@burgess.net). Phone: 726-632-0527x520, Address: 3713 [Steven Glens](PER&amp;Steven+Glens), [South Amyborough](LOC&amp;South+Amyborough), [ME](ORG&amp;ME) 93580, Ordered from: [Costco](ORG&amp;Costco)</code><br />
<strong>Description:</strong> The predicted value of the model.</p>
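<p>Because the three <code>ml.inference.entities</code> arrays align positionally, they can be zipped together and split by score. The Python sketch below does this with the sample values shown above. The threshold of 0.6 is an arbitrary value chosen for illustration and the function name is hypothetical.</p>

```python
# Sample values copied from the field descriptions above.
class_names = ["PER", "PER", "LOC", "ORG", "ORG"]
class_probabilities = [0.999, 0.972, 0.896, 0.506, 0.595]
entities = ["Paul Buck", "Steven Glens", "South Amyborough", "ME", "Costco"]


def split_by_threshold(names, probabilities, values, threshold):
    """Pair up the positionally aligned arrays and split them by score.

    Returns (kept, dropped), each a list of (entity, class_name, score).
    """
    kept, dropped = [], []
    for name, score, value in zip(names, probabilities, values):
        target = kept if score >= threshold else dropped
        target.append((value, name, score))
    return kept, dropped


if __name__ == "__main__":
    # 0.6 is an assumed threshold, not a recommendation.
    kept, dropped = split_by_threshold(class_names, class_probabilities, entities, 0.6)
    print("kept:", kept)
    print("dropped:", dropped)
```

<p>With this assumed threshold, the low-scoring <code>ME</code> and <code>Costco</code> predictions fall below the cutoff, which shows the trade-off: a threshold removes a misclassification but can also drop a legitimate detection.</p>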
<h4>PII Assessment Dashboard</h4>
<p>Let's take a quick look at a dashboard built to assess the PII data.</p>
<p>To load the dashboard, go to Kibana -&gt; Stack Management -&gt; Saved Objects and import the <code>pii-dashboard-part-1.ndjson</code> file that can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson</a></p>
<p>More complete instructions on Kibana Saved Objects can be found <a href="https://www.elastic.co/guide/en/kibana/current/managing-saved-objects.html">here</a>.</p>
<p>After loading the dashboard, navigate to it and select the right time range and you should see something like below. It shows metrics such as sample rate, percent of logs with NER, NER Score Trends etc. We will examine the assessment and actions in part 2 of this blog.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-dashboard-1-part-1.png" alt="PII Dashboard 1" /></p>
<h2>Summary and Next Steps</h2>
<p>In this first part of the blog, we have accomplished the following.</p>
<ul>
<li>Reviewed the techniques and tools we have available for PII detection and assessment</li>
<li>Reviewed NLP / NER role in PII detection and assessment</li>
<li>Built the necessary composable ingest pipelines to sample logs and run them through the NER Model</li>
<li>Reviewed the NER results and are ready to move to the second blog</li>
</ul>
<p>In the upcoming <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2">Part 2 of this blog</a>, we will cover the following:</p>
<ul>
<li>Redact PII using NER and redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<h2>Data Loading Appendix</h2>
<h4>Code</h4>
<p>The data loading code can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a></p>
<pre><code>$ git clone https://github.com/bvader/elastic-pii.git
</code></pre>
<h4>Creating and Loading the Sample Data Set</h4>
<pre><code>$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker
</code></pre>
<p>Run the log generator</p>
<pre><code>$ python generate_random_logs.py
</code></pre>
<p>If you do not change any parameters, this will create 10000 random logs in a file named pii.log, with a mix of logs that contain and do not contain PII.</p>
<p>Edit <code>load_logs.py</code> and set the following</p>
<pre><code># The Elastic User 
ELASTIC_USER = &quot;elastic&quot;

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = &quot;askdjfhasldfkjhasdf&quot;

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = &quot;deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ=&quot;
</code></pre>
<p>Then run the following command.</p>
<pre><code>$ python load_logs.py
</code></pre>
<h4>Reloading the logs</h4>
<p><strong>Note:</strong> To reload the logs, simply re-run the above command. You can run the command multiple times during this exercise, and the logs will be loaded again each time. The new logs will not collide with previous runs because each run has a unique <code>run.id</code>, which is displayed at the end of the loading process.</p>
<pre><code>$ python load_logs.py
</code></pre>
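<p>To illustrate why repeated loads do not collide, here is a minimal Python sketch of tagging every document in a run with a unique <code>run.id</code>. This is an assumption about the approach, not the actual contents of <code>load_logs.py</code>: the function name, the use of <code>uuid4</code>, the document layout, and the <code>logs-pii-default</code> data stream name (taken from the queries used elsewhere in this blog) are all illustrative.</p>

```python
import uuid
from datetime import datetime, timezone


def build_actions(log_lines, run_id, data_stream="logs-pii-default"):
    """Yield one bulk 'create' action per log line, tagged with the run id.

    Hypothetical helper: the real loader may structure documents differently.
    """
    for line in log_lines:
        yield {
            "_op_type": "create",  # data streams only accept create operations
            "_index": data_stream,
            "_source": {
                "@timestamp": datetime.now(timezone.utc).isoformat(),
                "message": line,
                "run": {"id": run_id},
            },
        }


if __name__ == "__main__":
    run_id = str(uuid.uuid4())  # a fresh id for every run
    actions = list(build_actions(["log line one", "log line two"], run_id))
    print(f"run.id={run_id}, documents={len(actions)}")
```

<p>Actions in this shape could be passed to the <code>elasticsearch.helpers.bulk</code> helper from the Python client. Filtering on <code>run.id</code> in Discover then isolates the documents from a single run.</p>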
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-ner-regex-assess-redact-part-1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs - Part 2]]></title>
            <link>https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-2</link>
            <guid isPermaLink="false">pii-ner-regex-assess-redact-part-2</guid>
            <pubDate>Tue, 22 Oct 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to detect, assess, and redact PII in your logs using Elasticsearch, NLP and Pattern Matching]]></description>
            <content:encoded><![CDATA[<h2>Introduction:</h2>
<p>The prevalence of high-entropy logs in distributed systems has significantly raised the risk of PII (Personally Identifiable Information) seeping into our logs, which can result in security and compliance issues. This 2-part blog delves into the crucial task of identifying and managing this issue using the Elastic Stack. We will explore using NLP (Natural Language Processing) and Pattern matching to detect, assess, and, where feasible, redact PII from logs being ingested into Elasticsearch.</p>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a>, we covered the following:</p>
<ul>
<li>Review the techniques and tools we have available to manage PII in our logs</li>
<li>Understand the roles of NLP / NER in PII detection</li>
<li>Build a composable processing pipeline to detect and assess PII</li>
<li>Sample logs and run them through the NER Model</li>
<li>Assess the results of the NER Model</li>
</ul>
<p>In <strong>Part 2</strong> of this blog, we will cover the following:</p>
<ul>
<li>Apply the <code>redact</code> regex pattern processor and assess the results</li>
<li>Create Alerts using ESQL</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p>Reminder of the overall flow we will construct over the 2 blogs:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-1/pii-overall-flow.png" alt="PII Overall Flow" /></p>
<p>All code for this exercise can be found at:
<a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a>.</p>
<h3>Part 1 Prerequisites</h3>
<p>This blog picks up where <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a> left off. You must have the NER model, ingest pipelines, and dashboard from Part 1 installed and working.</p>
<ul>
<li>Loaded and configured NER Model</li>
<li>Installed all the composable ingest pipelines from Part 1 of the blog</li>
<li>Installed dashboard</li>
</ul>
<p>You can access the <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/logs-sampler-composable-pipelines-blog-1-complete.json">complete solution for Blog 1 here</a>. Don't forget to load the dashboard, found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">here</a>.</p>
<h3>Applying the Redact Processor</h3>
<p>Next, we will apply the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/redact-processor.html"><code>redact</code> processor</a>. The <code>redact</code> processor is a simple regex-based processor that takes a list of regex patterns, looks for them in a field, and replaces them with literals when found. The <code>redact</code> processor is reasonably performant and can run at scale. We will discuss this in detail at the end, in the <a href="#production-scaling">production scaling</a> section.</p>
<p>Elasticsearch comes packaged with a number of useful predefined <a href="https://github.com/elastic/elasticsearch/blob/8.15/libs/grok/src/main/resources/patterns/ecs-v1">patterns</a> that can be conveniently referenced by the <code>redact</code> processor. If one does not suit your needs, create a new pattern with a custom definition. The Redact processor replaces every occurrence of a match. If there are multiple matches, they will all be replaced with the pattern name.</p>
<p>In the code below, we leverage some of the predefined patterns and construct several custom patterns.</p>
<pre><code class="language-bash">        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL_REGEX}&quot;,      &lt;&lt; Predefined
          &quot;%{IP:IP_ADDRESS_REGEX}&quot;,           &lt;&lt; Predefined
          &quot;%{CREDIT_CARD:CREDIT_CARD_REGEX}&quot;, &lt;&lt; Custom
          &quot;%{SSN:SSN_REGEX}&quot;,                 &lt;&lt; Custom
          &quot;%{PHONE:PHONE_REGEX}&quot;              &lt;&lt; Custom
        ]
</code></pre>
<p>We also replaced the PII with easily identifiable patterns we can use for assessment.</p>
<p>In addition, it is important to note that since the redact processor is a simple regex find and replace, it can be used against many &quot;secrets&quot; patterns, not just PII. There are many references for regex and secrets patterns, so you can reuse this capability to detect secrets in your logs.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-redact-processor-1.json">The code can be found here</a> for the following two sections of code.</p>
&lt;details open&gt;
  &lt;summary&gt;redact processor pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Add the PII redact processor pipeline
DELETE _ingest/pipeline/logs-pii-redact-processor
PUT _ingest/pipeline/logs-pii-redact-processor
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.proc.successful&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;redact.proc.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message == null&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;copy_from&quot;: &quot;message&quot;
      }
    },
    {
      &quot;redact&quot;: {
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;prefix&quot;: &quot;&lt;REDACTPROC-&quot;,
        &quot;suffix&quot;: &quot;&gt;&quot;,
        &quot;patterns&quot;: [
          &quot;%{EMAILADDRESS:EMAIL_REGEX}&quot;,
          &quot;%{IP:IP_ADDRESS_REGEX}&quot;,
          &quot;%{CREDIT_CARD:CREDIT_CARD_REGEX}&quot;,
          &quot;%{SSN:SSN_REGEX}&quot;,
          &quot;%{PHONE:PHONE_REGEX}&quot;
        ],
        &quot;pattern_definitions&quot;: {
          &quot;CREDIT_CARD&quot;: &quot;&quot;&quot;\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}&quot;&quot;&quot;,
          &quot;SSN&quot;: &quot;&quot;&quot;\d{3}-\d{2}-\d{4}&quot;&quot;&quot;,
          &quot;PHONE&quot;: &quot;&quot;&quot;(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}&quot;&quot;&quot;
        },
        &quot;on_failure&quot;: [
          {
            &quot;set&quot;: {
              &quot;description&quot;: &quot;Set 'error.message'&quot;,
              &quot;field&quot;: &quot;failure&quot;,
              &quot;value&quot;: &quot;REDACT_PROCESSOR_FAILED&quot;,
              &quot;override&quot;: false
            }
          },
          {
            &quot;set&quot;: {
              &quot;field&quot;: &quot;redact.proc.successful&quot;,
              &quot;value&quot;: false
            }
          }
        ]
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.message.contains('REDACTPROC')&quot;,
        &quot;field&quot;: &quot;redact.proc.found&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == null&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: false
      }
    },
    {
      &quot;set&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.proc?.found == true&quot;,
        &quot;field&quot;: &quot;redact.pii.found&quot;,
        &quot;value&quot;: true
      }
    }
  ],
  &quot;on_failure&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;failure&quot;,
        &quot;value&quot;: &quot;GENERAL_FAILURE&quot;,
        &quot;override&quot;: false
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
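<p>If you want to experiment with the custom patterns outside Elasticsearch, the Python sketch below applies the same three regular expressions with the same prefix and suffix. It is a simplified stand-in, not the <code>redact</code> processor itself: it applies the patterns one after another, it omits the predefined <code>EMAILADDRESS</code> and <code>IP</code> Grok patterns, and it assumes Python's <code>re</code> module treats these expressions the same way as the regex engine used by Elasticsearch, which may not hold for every input.</p>

```python
import re

# Custom definitions copied from "pattern_definitions" in the pipeline above.
PATTERN_DEFINITIONS = {
    "CREDIT_CARD": r"\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}",
    "SSN": r"\d{3}-\d{2}-\d{4}",
    "PHONE": r"(\+\d{1,2}\s?)?1?\-?\.?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}",
}

# (definition, replacement name) pairs, mirroring e.g. "%{SSN:SSN_REGEX}".
PATTERNS = [
    ("CREDIT_CARD", "CREDIT_CARD_REGEX"),
    ("SSN", "SSN_REGEX"),
    ("PHONE", "PHONE_REGEX"),
]


def redact(text, prefix="<REDACTPROC-", suffix=">"):
    """Replace every match of each pattern with prefix + name + suffix."""
    for definition, name in PATTERNS:
        text = re.sub(PATTERN_DEFINITIONS[definition], prefix + name + suffix, text)
    return text


if __name__ == "__main__":
    line = "SSN: 123-45-6789, Card: 1234-5678-9012-3456, Phone: 726-632-0527x520"
    redacted = redact(line)
    print(redacted)
    # Same idea as the redact.proc.found check in the pipeline.
    print("found:", "REDACTPROC" in redacted)
```

<p>Note that the <code>PHONE</code> pattern allows an optional leading whitespace character, so a match can include the space in front of the number. Testing patterns against sample lines like this is a quick way to spot that kind of behavior before deploying them.</p>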
<p>And now, we will add the <code>logs-pii-redact-processor</code> pipeline to the overall <code>process-pii</code> pipeline.</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated Process PII pipeline that now calls the NER and Redact Processor pipelines
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp;  ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-pii-redact-processor&quot;
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
<p>Reload the data as described in <a href="#reloading-the-logs">Reloading the logs</a>. If you have not yet generated the logs, follow the instructions in the <a href="#data-loading-appendix">Data Loading Appendix</a>.</p>
<p>Go to Discover and enter the following into the KQL bar
<code>sample.sampled : true and redact.message: REDACTPROC</code> and add the <code>redact.message</code> to the table and you should see something like this.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-1-part-2.png" alt="PII Discover Blog 2 Part 1" /></p>
<p>If you have not already loaded the dashboard from Part 1, load it now. It can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-1/pii-dashboard-part-1.ndjson">here</a>; import it using Kibana -&gt; Stack Management -&gt; Saved Objects -&gt; Import.</p>
<p>It should look something like this now. Note that the REGEX portions of the dashboard are now active.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-dashboard-1-part-2.png" alt="PII Dashboards Blog 2 Part 1" /></p>
<h2>Checkpoint</h2>
<p>At this point, we have the following capabilities:</p>
<ul>
<li>Sample incoming logs and apply PII redaction</li>
<li>Detect and Assess PII with the NER/NLP and Pattern Matching</li>
<li>Assess the amount, type and quality of the PII detections</li>
</ul>
<p>This is a great point to stop if you are just running all this once to see how it works, but we have a few more steps to make this useful in production systems.</p>
<ul>
<li>Clean up the working and unredacted data</li>
<li>Update the Dashboard to work with the cleaned-up data</li>
<li>Apply Role Based Access Control to protect the raw unredacted data</li>
<li>Create Alerts</li>
<li>Production and Scaling Considerations</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<h2>Applying to Production Systems</h2>
<h3>Cleanup working data and update the dashboard</h3>
<p>And now we will add the cleanup code to the overall <code>process-pii</code> pipeline.</p>
<p>In short, we set a flag <code>redact.enable: true</code> that directs the pipeline to move the unredacted <code>message</code> field to <code>raw.message</code> and then move the redacted message field <code>redact.message</code> to the <code>message</code> field. We will &quot;protect&quot; the <code>raw.message</code> field in the following section.</p>
<p><strong>NOTE:</strong> Of course you can change this behavior if you want to completely delete the unredacted data. In this exercise we will keep it and protect it.</p>
<p>In addition we set <code>redact.cleanup: true</code> to clean up the NLP working data.</p>
<p>These fields allow a lot of control over what data you decide to keep and analyze.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-redact-processor-2.json">The code can be found here</a> for the following two sections of code.</p>
&lt;details open&gt;
  &lt;summary&gt;process-pii pipeline code with cleanup - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Updated Process PII pipeline that now calls the NER and Redact Processor pipelines and cleans up
DELETE _ingest/pipeline/process-pii
PUT _ingest/pipeline/process-pii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set true if enabling sampling, otherwise false&quot;,
        &quot;field&quot;: &quot;sample.enabled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set Sampling Rate 0 None 10000 all allows for 0.01% precision&quot;,
        &quot;field&quot;: &quot;sample.sample_rate&quot;,
        &quot;value&quot;: 1000
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == true&quot;,
        &quot;name&quot;: &quot;logs-sampler&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp; ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-ner-pii-processor&quot;
      }
    },
    {
      &quot;pipeline&quot;: {
        &quot;if&quot;: &quot;ctx.sample.enabled == false || (ctx.sample.enabled == true &amp;&amp;  ctx.sample.sampled == true)&quot;,
        &quot;name&quot;: &quot;logs-pii-redact-processor&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to actually redact, false will run processors but leave original&quot;,
        &quot;field&quot;: &quot;redact.enable&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == true &amp;&amp; ctx?.redact?.enable == true&quot;,
        &quot;field&quot;: &quot;message&quot;,
        &quot;target_field&quot;: &quot;raw.message&quot;
      }
    },
    {
      &quot;rename&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.pii?.found == true &amp;&amp; ctx?.redact?.enable == true&quot;,
        &quot;field&quot;: &quot;redact.message&quot;,
        &quot;target_field&quot;: &quot;message&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to true to clean up working data&quot;,
        &quot;field&quot;: &quot;redact.cleanup&quot;,
        &quot;value&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;if&quot;: &quot;ctx?.redact?.cleanup == true&quot;,
        &quot;field&quot;: [
          &quot;ml&quot;
        ],
        &quot;ignore_failure&quot;: true
      }
    }
  ]
}
</code></pre>
&lt;/details&gt;
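<p>The effect of the two <code>rename</code> processors and the <code>remove</code> processor can be easier to see on a concrete document. The Python sketch below applies the same steps to a dictionary. The function name is hypothetical, and the sample document is a simplified illustration rather than real pipeline output.</p>

```python
import copy


def finalize_document(doc, enable=True, cleanup=True):
    """Mimic the rename and remove steps at the end of process-pii.

    Hypothetical helper for illustration; works on a copy of the document.
    """
    doc = copy.deepcopy(doc)
    redact = doc.setdefault("redact", {})
    redact["enable"] = enable
    pii_found = redact.get("pii", {}).get("found") is True
    if pii_found and enable:
        # rename message -> raw.message (keeps the unredacted original)
        doc.setdefault("raw", {})["message"] = doc.pop("message")
        # rename redact.message -> message (the redacted text becomes visible)
        doc["message"] = redact.pop("message")
    redact["cleanup"] = cleanup
    if cleanup:
        # remove the NER working data
        doc.pop("ml", None)
    return doc


sample_doc = {
    "message": "user: Paul Buck, SSN: 123-45-6789",
    "redact": {
        "message": "user: <REDACTNER-PER_NER>, SSN: <REDACTPROC-SSN_REGEX>",
        "pii": {"found": True},
    },
    "ml": {"inference": {"entities": []}},
}

if __name__ == "__main__":
    print(finalize_document(sample_doc))
```

<p>Documents where no PII was found keep their original <code>message</code> field and never get a <code>raw.message</code> field, which is why the following section only needs to protect <code>raw.message</code>.</p>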
<p>Reload the data as described in <a href="#reloading-the-logs">Reloading the logs</a>.</p>
<p>Go to Discover and enter the following into the KQL bar
<code>sample.sampled : true and redact.pii.found: true</code> and add the following fields to the table</p>
<p><code>message</code>,<code>raw.message</code>,<code>redact.ner.found</code>,<code>redact.proc.found</code>,<code>redact.pii.found</code></p>
<p>You should see something like this
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-2-part-2.png" alt="PII Discover Part 2 Blog 2" /></p>
<p>We have everything we need to move forward with protecting the PII and Alerting on it.</p>
<p>Load up the new dashboard that works on the cleaned-up data</p>
<p>To load the dashboard, go to Kibana -&gt; Stack Management -&gt; Saved Objects and import the <code>pii-dashboard-part-2.ndjson</code> file that can be found <a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-dashboard-part-2.ndjson">here</a>.</p>
<p>Note: The new dashboard uses different fields under the covers since we have cleaned up the underlying data.</p>
<p>It should look something like this:
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-dashboard-2-part-2.png" alt="PII Dashboard Part 2 Blog 2" /></p>
<h3>Apply Role Based Access Control to protect the raw unredacted data</h3>
<p>Elasticsearch natively supports role-based access control, including field-level and document-level access control, which dramatically reduces the operational and maintenance complexity required to secure our application.</p>
<p>We will create a Role that does not allow access to the <code>raw.message</code> field and then create a user and assign that user the role. With that role, the user will only be able to see the redacted message, which is now in the <code>message</code> field, but will not be able to access the protected <code>raw.message</code> field.</p>
<p><strong>NOTE:</strong> Since we only sampled 10% of the data in this exercise, the non-sampled <code>message</code> fields are not moved to <code>raw.message</code>, so they are still viewable. This still demonstrates the capability you can apply in a production system.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-composable-pipelines-blog-2-rbac.json">The code can be found here</a> for the following section of code.</p>
&lt;details open&gt;
  &lt;summary&gt;RBAC protect-pii role and user code - click to open/close&lt;/summary&gt;
<pre><code class="language-bash"># Create role with no access to the raw.message field
GET _security/role/protect-pii
DELETE _security/role/protect-pii
PUT _security/role/protect-pii
{
 &quot;cluster&quot;: [],
 &quot;indices&quot;: [
   {
     &quot;names&quot;: [
       &quot;logs-*&quot;
     ],
     &quot;privileges&quot;: [
       &quot;read&quot;,
       &quot;view_index_metadata&quot;
     ],
     &quot;field_security&quot;: {
       &quot;grant&quot;: [
         &quot;*&quot;
       ],
       &quot;except&quot;: [
         &quot;raw.message&quot;
       ]
     },
     &quot;allow_restricted_indices&quot;: false
   }
 ],
 &quot;applications&quot;: [
   {
     &quot;application&quot;: &quot;kibana-.kibana&quot;,
     &quot;privileges&quot;: [
       &quot;all&quot;
     ],
     &quot;resources&quot;: [
       &quot;*&quot;
     ]
   }
 ],
 &quot;run_as&quot;: [],
 &quot;metadata&quot;: {},
 &quot;transient_metadata&quot;: {
   &quot;enabled&quot;: true
 }
}

# Create user stephen with protect-pii role
GET _security/user/stephen
DELETE /_security/user/stephen
POST /_security/user/stephen
{
 &quot;password&quot; : &quot;mypassword&quot;,
 &quot;roles&quot; : [ &quot;protect-pii&quot; ],
 &quot;full_name&quot; : &quot;Stephen Brown&quot;
}

</code></pre>
 &lt;/details&gt;
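<p>To make the effect of <code>field_security</code> concrete, here is a minimal Python sketch of how a <code>grant</code> / <code>except</code> pair narrows what a user can see. It is a simplified illustration only, not how Elasticsearch implements field-level security: the <code>visible_fields</code> helper and the sample document are hypothetical, and documents are modeled as flat dictionaries keyed by dotted field names.</p>
<pre><code class="language-python">from fnmatch import fnmatchcase


def visible_fields(doc, grant, except_fields):
    '''Return only the fields a role may read (simplified model).'''
    return {
        name: value
        for name, value in doc.items()
        if any(fnmatchcase(name, pattern) for pattern in grant)
        and not any(fnmatchcase(name, pattern) for pattern in except_fields)
    }


# Hypothetical sample document, not taken from the data set
doc = {
    'message': 'user REDACTED logged in',
    'raw.message': 'user jane.doe@example.com logged in',
    'redact.pii.found': True,
}

# Same grant / except lists as the protect-pii role above
role_view = visible_fields(doc, grant=['*'], except_fields=['raw.message'])
print(sorted(role_view))  # prints ['message', 'redact.pii.found']
</code></pre>
<p>Everything is granted by the wildcard, and the <code>except</code> list then removes <code>raw.message</code>, which is why that field is missing for the restricted user.</p>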
<p>Now log in from a separate window as the new user <code>stephen</code>, who has the <code>protect-pii</code> role. Go to Discover, put <code>redact.pii.found : true</code> in the KQL bar, and add the <code>message</code> field to the table. Also, notice that the <code>raw.message</code> field is not available.</p>
<p>You should see something like this:
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-discover-3-part-2.png" alt="PII Dashboard Part 2 Blog 2" /></p>
<h3>Create an Alert when PII Detected</h3>
<p>Now that the pipelines are processing the logs, creating an alert when PII is detected is easy. Review <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">Alerting in Kibana</a> in detail if needed.</p>
<p>NOTE: <a href="#reloading-the-logs">Reload</a> the data if needed to have recent data.</p>
<p>First, we will create a simple <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html">ES|QL query</a> in Discover.</p>
<p><a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-esql-alert-blog-2.txt">The code can be found here.</a></p>
<pre><code>FROM logs-pii-default
| WHERE redact.pii.found == true
| STATS pii_count = count(*)
| WHERE pii_count &gt; 0
</code></pre>
<p>When you run this you should see something like this.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-esql-1-part-2.png" alt="PII ESQL Part 1 Blog 2" /></p>
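<p>The logic of the ES|QL query above can be sketched in a few lines of Python. This is only an illustration, assuming documents are flat dictionaries; the <code>pii_alert_rows</code> helper is hypothetical. It shows why the final <code>WHERE</code> matters: <code>STATS</code> returns a single row even when the count is 0, so without the last filter the query would always return a result.</p>
<pre><code class="language-python">def pii_alert_rows(docs):
    '''Mimic the query: filter on the flag, count, keep the row only if the count is above 0.'''
    # WHERE redact.pii.found == true, then STATS pii_count = count(*)
    pii_count = sum(1 for doc in docs if doc.get('redact.pii.found') is True)
    # WHERE pii_count is greater than 0: drop the single stats row when nothing was found
    return [{'pii_count': pii_count}] if pii_count else []


# Hypothetical documents for illustration
print(pii_alert_rows([{'redact.pii.found': True}, {'redact.pii.found': False}]))
print(pii_alert_rows([{'redact.pii.found': False}]))
</code></pre>
<p>An empty result is what lets the rule report that the query did not match once no PII is found in the time window.</p>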
<p>Now click the Alerts menu and select <code>Create search threshold rule</code> to create a rule that alerts us when PII is found.</p>
<p><strong>Select a time field: @timestamp
Set the time window: 5 minutes</strong></p>
<p>Assuming you loaded the data recently, when you run <strong>Test</strong> you should see something like:</p>
<p>pii_count : <code>343</code>
Alerts generated <code>query matched</code></p>
<p>Add an action when the alert is Active.</p>
<p><strong>For each alert: <code>On status changes</code>
Run when: <code>Query matched</code></strong></p>
<pre><code>Elasticsearch query rule {{rule.name}} is active:

- PII Found: true
- PII Count: {{#context.hits}} {{_source.pii_count}}{{/context.hits}}
- Conditions Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}
</code></pre>
<p>Add an Action for when the Alert is Recovered.</p>
<p><strong>For each alert: <code>On status changes</code>
Run when: <code>Recovered</code></strong></p>
<pre><code>Elasticsearch query rule {{rule.name}} is Recovered:

- PII Found: false
- Conditions Not Met: {{context.conditions}} over {{rule.params.timeWindowSize}}{{rule.params.timeWindowUnit}}
- Timestamp: {{context.date}}
- Link: {{context.link}}
</code></pre>
<p>When everything is set up, it should look like this. Then click <code>Save</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-1-part2.png" alt="Alert Setup" /><br />
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-2-part2.png" alt="Action Alert" /><br />
<img src="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-alert-3-part2.png" alt="Action Alert" /></p>
<p>You should get an Active alert that looks like this if you have recent data. I sent mine to Slack.</p>
<pre><code>Elasticsearch query rule pii-found-esql is active:
- PII Found: true
- PII Count:  374
- Conditions Met: Query matched documents over 5m
- Timestamp: 2024-10-15T02:44:52.795Z
- Link: https://mydeployment123.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989
</code></pre>
<p>And then if you wait you will get a Recovered alert that looks like this.</p>
<pre><code>Elasticsearch query rule pii-found-esql is Recovered:
- PII Found: false
- Conditions Not Met: Query did NOT match documents over 5m
- Timestamp: 2024-10-15T02:49:04.815Z
- Link: https://mydeployment123.kb.us-west-1.aws.found.io:9243/app/management/insightsAndAlerting/triggersActions/rule/7d6faecf-964e-46da-aaba-8a2f89f33989
</code></pre>
<h3>Production Scaling</h3>
<h4>NER Scaling</h4>
<p>As we mentioned in <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1#named-entity-recognition-ner-detection">Part 1 of this blog</a>, NER / NLP models are CPU-intensive and expensive to run at scale; thus, we employed a sampling technique to understand the risk in our logs without sending the full log volume through the NER model.</p>
<p>Please review <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1#loading-configuration-and-execution-of-the-ner-pipeline">the setup and configuration of the NER</a> model from Part 1 of the blog.</p>
<p>We chose the base BERT NER model <a href="https://huggingface.co/dslim/bert-base-NER">bert-base-NER</a> for our PII case.</p>
<p>To scale ingest, we will focus on scaling the allocations for the deployed model. More information on this topic is available <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html">here</a>. The number of allocations must not exceed the available allocated processors (cores, not vCPUs) per node.</p>
<p>The metrics below are related to the model and configuration from Part 1 of the blog.</p>
<ul>
<li>4 Allocations to allow for more parallel ingestion</li>
<li>1 Thread per Allocation</li>
<li>0 Bytes Cache, as we expect a low cache hit rate.
<strong>Note:</strong> If there are many repeated logs, the cache can help, but with timestamps and other variations, the cache will not help and can even slow down the process</li>
<li>8192 Queue</li>
</ul>
<pre><code class="language-bash">GET _ml/trained_models/dslim__bert-base-ner/_stats
.....
           &quot;node&quot;: {
              &quot;0m4tq7tMRC2H5p5eeZoQig&quot;: {
.....
                &quot;attributes&quot;: {
                  &quot;xpack.installed&quot;: &quot;true&quot;,
                  &quot;region&quot;: &quot;us-west-1&quot;,
                  &quot;ml.allocated_processors&quot;: &quot;5&quot;, &lt;&lt; HERE 
.....
            },
            &quot;inference_count&quot;: 5040,
            &quot;average_inference_time_ms&quot;: 138.44285714285715, &lt;&lt; HERE 
            &quot;average_inference_time_ms_excluding_cache_hits&quot;: 138.44285714285715,
            &quot;inference_cache_hit_count&quot;: 0,
.....
            &quot;threads_per_allocation&quot;: 1,
            &quot;number_of_allocations&quot;: 4,  &lt;&lt;&lt; HERE
            &quot;peak_throughput_per_minute&quot;: 1550,
            &quot;throughput_last_minute&quot;: 1373,
            &quot;average_inference_time_ms_last_minute&quot;: 137.55280407865988,
            &quot;inference_cache_hit_count_last_minute&quot;: 0
          }
        ]
      }
    }
</code></pre>
<p>There are 3 key pieces of information above:</p>
<ul>
<li>
<p><code>&quot;ml.allocated_processors&quot;: &quot;5&quot;</code>
The number of physical cores / processors available</p>
</li>
<li>
<p><code>&quot;number_of_allocations&quot;: 4</code>
The number of allocations, with a maximum of 1 per physical core. <strong>Note</strong>: we could have used 5 allocations, but we only allocated 4 for this exercise.</p>
</li>
<li>
<p><code>&quot;average_inference_time_ms&quot;: 138.44285714285715</code>
The average inference time per document.</p>
</li>
</ul>
<p>The throughput math for inferences per minute (IPM) per allocation (1 allocation per physical core) is pretty straightforward, since an inference uses a single core and a single thread.</p>
<p>Then the Inferences per Min per Allocation is simply:</p>
<p><code>IPM per allocation = 60,000 ms (in a minute) / 138ms per inference = 435</code></p>
<p>This then lines up with the total inferences per minute:</p>
<p><code>Total IPM = 435 IPM / allocation * 4 Allocations = ~1740</code></p>
<p>Suppose we want to do 10,000 IPM. How many allocations (cores) would we need?</p>
<p><code>Allocations = 10,000 IPM / 435 IPM per allocation = 23 allocations (cores, rounded up)</code></p>
<p>Or perhaps logs are coming in at 5000 EPS and you want to do 1% Sampling.</p>
<p><code>IPM = 5000 EPS * 60sec * 0.01 sampling = 3000 IPM sampled</code></p>
<p>Then</p>
<p><code>Number of allocations = 3000 IPM / 435 IPM per allocation = 7 allocations (cores, rounded up)</code></p>
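<p>The sizing math above can be wrapped in a few small helper functions so you can plug in your own numbers. This is a minimal sketch: the function names are hypothetical, and it uses the rounded 138 ms average inference time from the stats above, assuming one thread per allocation.</p>
<pre><code class="language-python">import math

MS_PER_MINUTE = 60_000


def ipm_per_allocation(avg_inference_ms):
    '''Inferences per minute that one allocation (one core, one thread) can sustain.'''
    return MS_PER_MINUTE / avg_inference_ms


def sampled_ipm(events_per_second, sample_rate):
    '''Inferences per minute needed when only a fraction of the events is sampled.'''
    return events_per_second * 60 * sample_rate


def allocations_needed(target_ipm, avg_inference_ms):
    '''Allocations (cores) required for a target rate, rounded up.'''
    return math.ceil(target_ipm * avg_inference_ms / MS_PER_MINUTE)


print(round(ipm_per_allocation(138)))                    # 435
print(allocations_needed(10_000, 138))                   # 23
print(allocations_needed(sampled_ipm(5000, 0.01), 138))  # 7
</code></pre>
<p>Swap in the <code>average_inference_time_ms</code> reported for your own model and hardware to get a first estimate, then confirm it against the observed throughput.</p>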
<p><strong>Want Faster!</strong> It turns out there is a more lightweight NER model, <a href="https://huggingface.co/dslim/distilbert-NER">distilbert-NER</a>, that is faster, but the tradeoff is a little less accuracy.</p>
<p>Running the logs through this model results in an inference time nearly twice as fast!</p>
<p><code>&quot;average_inference_time_ms&quot;: 66.0263959390863</code></p>
<p>Here is some quick math:
<code>IPM per allocation = 60,000 ms (in a minute) / 66ms per inference = 909</code></p>
<p>Suppose we want to do 25,000 IPM. How many allocations (cores) would we need?</p>
<p><code>Allocations = 25,000 IPM / 909 IPM per allocation = 28 allocations (cores, rounded up)</code></p>
<p><strong>Now you can apply this math to determine the correct sampling and NER scaling to support your logging use case.</strong></p>
<h4>Redact Processor Scaling</h4>
<p>In short, the <code>redact</code> processor should scale to production loads as long as you are using appropriately sized and configured nodes and have well-constructed regex patterns.</p>
<h3>Assessing incoming logs</h3>
<p>If you want to test on incoming log data in a data stream, all you need to do is change the conditional in the <code>logs@custom</code> pipeline to apply the <code>process-pii</code> pipeline to the dataset you want. You can use any conditional that fits your needs.</p>
<p>Note: Just make sure that you have accounted for the proper scaling of the NER and Redact processors, as described above in <a href="#production-scaling">Production Scaling</a>.</p>
<pre><code class="language-bash">    {
      &quot;pipeline&quot;: {
        &quot;description&quot; : &quot;Call the process_pii pipeline on the correct dataset&quot;,
        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'pii'&quot;, &lt;&lt;&lt; HERE
        &quot;name&quot;: &quot;process-pii&quot;
      }
    }
</code></pre>
<p>So if, for example, your logs are coming into <code>logs-mycustomapp-default</code>, you would just change the conditional to:</p>
<pre><code>        &quot;if&quot;: &quot;ctx?.data_stream?.dataset == 'mycustomapp'&quot;,
</code></pre>
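<p>Data stream names follow the <code>type-dataset-namespace</code> scheme, so the value to use in the conditional is the middle part of the name. As a minimal sketch, the hypothetical helper below derives the conditional string from a data stream name; it assumes the name has exactly three hyphen-separated parts.</p>
<pre><code class="language-python">CONDITION_TEMPLATE = 'ctx?.data_stream?.dataset == \'{dataset}\''


def dataset_condition(data_stream_name):
    '''Build the pipeline conditional for a data stream named type-dataset-namespace.'''
    parts = data_stream_name.split('-')
    if len(parts) != 3:
        raise ValueError('expected a name of the form type-dataset-namespace')
    _stream_type, dataset, _namespace = parts
    return CONDITION_TEMPLATE.format(dataset=dataset)


print(dataset_condition('logs-mycustomapp-default'))
# prints ctx?.data_stream?.dataset == 'mycustomapp'
</code></pre>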
<h3>Assessing historical data</h3>
<p>If you have a historical (already ingested) data stream or index, you can run the assessment over it using the <code>_reindex</code> API.</p>
<p>Note: Just make sure that you have accounted for the proper scaling of the NER and Redact processors, as described above in <a href="#production-scaling">Production Scaling</a>.</p>
<p>There are a couple of extra steps:
<a href="https://github.com/bvader/elastic-pii/blob/main/elastic/blog-part-2/pii-redact-historical-data-blog-2.json">The code can be found here.</a></p>
<ol>
<li>First we can set the parameters to ONLY keep the sampled data as there is no reason to make a copy of all the unsampled data. In the <code>process-pii</code> pipeline, there is a setting <code>sample.keep_unsampled</code>, which we can set to <code>false</code>, which will then only keep the sampled data</li>
</ol>
<pre><code class="language-bash">    {
      &quot;set&quot;: {
        &quot;description&quot;: &quot;Set to false if you want to drop unsampled data, handy for reindexing historical data&quot;,
        &quot;field&quot;: &quot;sample.keep_unsampled&quot;,
        &quot;value&quot;: false &lt;&lt;&lt; SET TO false
      }
    },
</code></pre>
<ol start="2">
<li>Second, we will create a pipeline that will reroute the data to the correct data stream to run through all the PII assessment/detection pipelines. It also sets the correct <code>dataset</code> and <code>namespace</code></li>
</ol>
<pre><code class="language-bash">DELETE _ingest/pipeline/sendtopii
PUT _ingest/pipeline/sendtopii
{
  &quot;processors&quot;: [
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;data_stream.dataset&quot;,
        &quot;value&quot;: &quot;pii&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;data_stream.namespace&quot;,
        &quot;value&quot;: &quot;default&quot;
      }
    },
    {
      &quot;reroute&quot; : 
      {
        &quot;dataset&quot; : &quot;{{data_stream.dataset}}&quot;,
        &quot;namespace&quot;: &quot;{{data_stream.namespace}}&quot;
      }
    }
  ]
}
</code></pre>
<ol start="3">
<li>Finally, we can run a <code>_reindex</code> to select the data we want to test/assess. It is recommended to review the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html">_reindex</a> documentation before trying this. First, select the source data stream you want to assess; in this example, it is the <code>logs-generic-default</code> logs data stream. Note: I also added a <code>range</code> filter to select a specific time range. There is a bit of a &quot;trick&quot; that we need to use since we are rerouting the data to the data stream <code>logs-pii-default</code>. To do this, we just set <code>&quot;index&quot;: &quot;logs-tmp-default&quot;</code> in the <code>_reindex</code>, as the correct data stream will be set in the pipeline. We must do that because <code>reroute</code> is a <code>noop</code> if it is called from/to the same data stream.</li>
</ol>
<pre><code class="language-bash">POST _reindex?wait_for_completion=false
{
  &quot;source&quot;: {
    &quot;index&quot;: &quot;logs-generic-default&quot;,
    &quot;query&quot;: {
      &quot;bool&quot;: {
        &quot;filter&quot;: [
          {
            &quot;range&quot;: {
              &quot;@timestamp&quot;: {
                &quot;gte&quot;: &quot;now-1h/h&quot;,
                &quot;lt&quot;: &quot;now&quot;
              }
            }
          }
        ]
      }
    }
  },
  &quot;dest&quot;: {
    &quot;op_type&quot;: &quot;create&quot;,
    &quot;index&quot;: &quot;logs-tmp-default&quot;,
    &quot;pipeline&quot;: &quot;sendtopii&quot;
  }
}
</code></pre>
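<p>If you want to run the assessment over several sources or time ranges, the request body above can be assembled programmatically. The following is a minimal sketch with hypothetical helper names; it only builds and prints the body, and it checks that the placeholder destination differs from the data stream that the <code>sendtopii</code> pipeline reroutes to.</p>
<pre><code class="language-python">import json


def build_reindex_body(source, temp_dest, pipeline, gte, lt):
    '''Assemble a _reindex request body for a source and a time range.'''
    time_range = {'range': {'@timestamp': {'gte': gte, 'lt': lt}}}
    return {
        'source': {'index': source, 'query': {'bool': {'filter': [time_range]}}},
        'dest': {'op_type': 'create', 'index': temp_dest, 'pipeline': pipeline},
    }


def rerouted_name(dataset, namespace, stream_type='logs'):
    '''Name of the data stream the reroute processor sends documents to.'''
    return '-'.join([stream_type, dataset, namespace])


body = build_reindex_body(
    'logs-generic-default', 'logs-tmp-default', 'sendtopii', 'now-1h/h', 'now'
)
# The placeholder destination must differ from the reroute target,
# otherwise the reroute described above would be a noop.
assert body['dest']['index'] != rerouted_name('pii', 'default')
print(json.dumps(body, indent=2))
</code></pre>
<p>Because the request is sent with <code>wait_for_completion=false</code>, the response contains a task ID that you can check with <code>GET _tasks/&lt;task_id&gt;</code>.</p>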
<h2>Summary</h2>
<p>At this point, you have the tools and processes needed to assess, detect, analyze, alert on, and protect PII in your logs.</p>
<p><a href="https://github.com/bvader/elastic-pii/tree/main/elastic/blog-complete-end-solution">The end state solution can be found here</a>.</p>
<p>In <a href="https://www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1">Part 1 of this blog</a>, we accomplished the following.</p>
<ul>
<li>Reviewed the techniques and tools we have available for PII detection and assessment</li>
<li>Reviewed NLP / NER role in PII detection and assessment</li>
<li>Built the necessary composable ingest pipelines to sample logs and run them through the NER Model</li>
<li>Reviewed the NER results and are ready to move to the second blog</li>
</ul>
<p>In <strong>Part 2</strong> of this blog, we covered the following:</p>
<ul>
<li>Redact PII using NER and redact processor</li>
<li>Apply field-level security to control access to the un-redacted data</li>
<li>Enhance the dashboards and alerts</li>
<li>Production considerations and scaling</li>
<li>How to run these processes on incoming or historical data</li>
</ul>
<p><em><strong>So get to work and reduce risk in your logs!</strong></em></p>
<h2>Data Loading Appendix</h2>
<h4>Code</h4>
<p>The data loading code can be found here:</p>
<p><a href="https://github.com/bvader/elastic-pii">https://github.com/bvader/elastic-pii</a></p>
<pre><code>$ git clone https://github.com/bvader/elastic-pii.git
</code></pre>
<h4>Creating and Loading the Sample Data Set</h4>
<pre><code>$ cd elastic-pii
$ cd python
$ python -m venv .env
$ source .env/bin/activate
$ pip install elasticsearch
$ pip install Faker
</code></pre>
<p>Run the log generator</p>
<pre><code>$ python generate_random_logs.py
</code></pre>
<p>If you do not change any parameters, this will create 10000 random logs in a file named <code>pii.log</code>, with a mix of logs that contain and do not contain PII.</p>
<p>Edit <code>load_logs.py</code> and set the following</p>
<pre><code># The Elastic User 
ELASTIC_USER = &quot;elastic&quot;

# Password for the 'elastic' user generated by Elasticsearch
ELASTIC_PASSWORD = &quot;askdjfhasldfkjhasdf&quot;

# Found in the 'Manage Deployment' page
ELASTIC_CLOUD_ID = &quot;deployment:sadfjhasfdlkjsdhf3VuZC5pbzo0NDMkYjA0NmQ0YjFiYzg5NDM3ZDgxM2YxM2RhZjQ3OGE3MzIkZGJmNTE0OGEwODEzNGEwN2E3M2YwYjcyZjljYTliZWQ=&quot;
</code></pre>
<p>Then run the following command.</p>
<pre><code>$ python load_logs.py
</code></pre>
<h4>Reloading the logs</h4>
<p><strong>Note:</strong> To reload the logs, simply re-run the above command. You can run the command multiple times during this exercise and the logs will be loaded again. The new logs will not collide with previous runs, as each run has a unique <code>run.id</code>, which is displayed at the end of the loading process.</p>
<pre><code>$ python load_logs.py
</code></pre>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/pii-ner-regex-assess-redact-part-2/pii-ner-regex-assess-redact-part-2.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>