<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Elastic Observability Labs - Kubernetes</title>
        <link>https://www.elastic.co/observability-labs</link>
        <description>Trusted observability news &amp; research from the team at Elastic.</description>
        <lastBuildDate>Mon, 16 Mar 2026 06:34:53 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>Elastic Observability Labs - Kubernetes</title>
            <url>https://www.elastic.co/observability-labs/assets/observability-labs-thumbnail.png</url>
            <link>https://www.elastic.co/observability-labs</link>
        </image>
        <copyright>© 2026. Elasticsearch B.V. All Rights Reserved</copyright>
        <item>
            <title><![CDATA[Bringing Your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></title>
            <link>https://www.elastic.co/observability-labs/blog/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</link>
            <guid isPermaLink="false">bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch</guid>
            <pubDate>Mon, 19 Aug 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[How to bring your Cloud-Managed Kubernetes Audit Logs into Elasticsearch]]></description>
            <content:encoded><![CDATA[<h2>Introduction</h2>
<p>Kubernetes audit logs are essential for ensuring the security, compliance, and transparency of Kubernetes clusters. However, with managed Kubernetes infrastructure, traditional audit file-based log shipping is often not supported, and audit logs are only available via the control plane API or the Cloud Provider logging facility. In this blog, we will show you how to ingest the audit logs from these other sources and still take advantage of the <a href="https://www.elastic.co/docs/current/integrations/kubernetes/audit-logs">Elastic Kubernetes Audit Log Integration</a>.</p>
<p>In this blog, we will focus on AWS as our cloud provider. When ingesting logs from AWS, you have several options:</p>
<ul>
<li><a href="https://www.elastic.co/docs/current/integrations/aws_logs">AWS Custom Logs integration</a> (which we will utilize in this blog)</li>
<li><a href="https://www.elastic.co/observability-labs/blog/aws-kinesis-data-firehose-observability-analytics">AWS Firehose</a> to send logs from Cloudwatch to Elastic</li>
<li><a href="https://www.elastic.co/docs/current/integrations/aws">AWS General integration</a> which supports many AWS sources</li>
</ul>
<p>In part 1 of this two-part series, we will focus on properly ingesting Kubernetes audit logs, and part 2 will focus on investigation, analytics, and alerting.</p>
<p>Kubernetes auditing <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/">documentation</a> describes the need for auditing in order to get answers to the questions below:</p>
<ul>
<li>What happened?</li>
<li>When did it happen?</li>
<li>Who initiated it?</li>
<li>What resource did it occur on?</li>
<li>Where was it observed?</li>
<li>From where was it initiated (Source IP)?</li>
<li>Where was it going (Destination IP)?</li>
</ul>
<p>Answers to the above questions become important when an incident occurs and an investigation follows. Alternatively, it could just be a log retention use case for a regulated company trying to fulfill compliance requirements. </p>
<p>We are giving special importance to audit logs in Kubernetes because they are not enabled by default. Audit logs can consume a significant amount of memory and storage, so enabling them is usually a balance between retaining and investigating audit data on the one hand, and giving up resources otherwise budgeted for workloads on the cluster on the other. Another reason to pay attention to Kubernetes audit logs is that, unlike regular container logs, once turned on they are written to the cloud provider’s logging service. This is true for most cloud providers because they manage the Kubernetes control plane, so it is natural for them to route control-plane output through their own logging framework.</p>
<p>Kubernetes audit logs can be quite verbose by default. Hence, it becomes important to selectively choose how much logging is done so that the organization's audit requirements are met without excess volume. This is configured in the <a href="https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/#audit-policy">audit policy</a> file, which is submitted to the <code>kube-apiserver</code>. Not all flavors of cloud-provider-hosted Kubernetes clusters let you work with the <code>kube-apiserver</code> directly. For example, AWS EKS allows this <a href="https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html">logging</a> to be configured only through the control plane.</p>
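<p>Where you do control the <code>kube-apiserver</code>, the audit policy is a small YAML document. As a point of reference, a minimal policy might look like the sketch below; the rule choices are illustrative, not a recommendation:</p>

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the RequestReceived stage to reduce volume.
omitStages:
  - "RequestReceived"
rules:
  # Capture full request and response bodies for Secret changes.
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["secrets"]
  # For everything else, record only metadata.
  - level: Metadata
```

<p>On EKS you do not submit this file yourself; the control plane manages the audit configuration for you.</p>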
<p><strong>In this blog we will be using Amazon Elastic Kubernetes Service (Amazon EKS), with the Kubernetes audit logs that are automatically shipped to AWS CloudWatch.</strong></p>
<p>A sample audit log for a secret named “empty-secret”, created by an admin user on EKS, is logged in AWS CloudWatch in the following format:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-clougwatch-logs.png" alt="Alt text" /></p>
<p>Once the audit logs show up on CloudWatch, it is time to consider how to transfer them to Elasticsearch. Elasticsearch is a great platform for creating dashboards that visualize different audit events recorded in a Kubernetes cluster. It is also a powerful tool for analyzing various audit events. For example, how many secret object creation attempts were made in an hour? </p>
<p>Now that we have established that the Kubernetes audit logs are being written to CloudWatch, let’s discuss how to get them ingested into Elasticsearch. Elasticsearch has an integration to consume logs written to CloudWatch. Using this integration by default will ingest the CloudWatch JSON as is, i.e., the real audit log JSON stays nested inside the wrapper CloudWatch JSON. When bringing logs into Elasticsearch, it is important to use the <a href="https://www.elastic.co/guide/en/ecs/current/index.html">Elastic Common Schema</a> (ECS) to get the best search and analytics performance. This means there needs to be an ingest pipeline that parses a standard Kubernetes audit JSON message and creates an ECS-compliant document in Elasticsearch. Let’s dive into how to achieve this.</p>
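<p>To make the nesting concrete, here is a small Python sketch of the document shape involved. The field names and values below are invented for illustration; the extraction step is essentially what the <code>json</code> ingest processor will do for us later:</p>

```python
import json

# A simplified CloudWatch-wrapped document: the real Kubernetes audit
# event arrives as a JSON *string* inside the "message" field.
cloudwatch_doc = {
    "awscloudwatch.log_group": "/aws/eks/demo-cluster/cluster",  # illustrative
    "message": json.dumps({
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "verb": "create",
        "objectRef": {"resource": "secrets", "name": "empty-secret"},
    }),
}

# What the pipeline's json processor effectively does: parse the string
# and attach the result under kubernetes.audit.
audit = json.loads(cloudwatch_doc["message"])
print(audit["objectRef"]["resource"])  # secrets
```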
<p>Elastic has a Kubernetes integration that uses Elastic Agent to consume Kubernetes container logs from the console and audit logs written to a file path. For a cloud-provider use case, as described above, it may not be feasible to write audit logs to a path on the Kubernetes cluster. So, how do we leverage the <a href="https://github.com/elastic/integrations/blob/main/packages/kubernetes/data_stream/audit_logs/fields/fields.yml">ECS fields designed for parsing Kubernetes audit logs</a>, already implemented in the Kubernetes integration, for the CloudWatch audit logs? That is the most exciting plumbing piece! Let’s see how to do it.</p>
<h3>Here’s what we’re going to do:</h3>
<ul>
<li>
<p>Read the Kubernetes audit logs from the cloud provider’s logging module (in our case, AWS CloudWatch, since this is where the logs reside). We will use Elastic Agent and the <a href="https://www.elastic.co/docs/current/integrations/aws_logs">Elasticsearch AWS Custom Logs integration</a> to read the logs from CloudWatch. <strong>Note:</strong> please be aware that there are several Elastic AWS integrations; we are specifically using the AWS Custom Logs integration.</p>
</li>
<li>
<p>Create two simple ingest pipelines (we do this for best practices of isolation and composability) </p>
</li>
<li>
<p>The first pipeline looks for Kubernetes audit JSON messages and then redirects them to the second pipeline</p>
</li>
<li>
<p>The second custom pipeline will associate the JSON <code>message</code> field with the field expected by the Elasticsearch-managed Kubernetes audit pipeline (aka the integration) and then <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html"><code>reroute</code></a> the message to the correct data stream, <code>kubernetes.audit_logs-default</code>, which in turn applies all the proper mappings and ingest pipelines to the incoming message</p>
</li>
<li>
<p>The overall flow will be:</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/overall-ingestion-flow.png" alt="Alt text" /></p>
<h3>1. Create the AWS Custom Logs integration</h3>
<p>a.  Populate the AWS access key and secret pair values</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-1.png" alt="Alt text" /></p>
<p>b. In the logs section, populate the log ARN and tags (and enable “Preserve original event” if you want to), then save the integration and exit the page</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-custom-logs-integration-2.png" alt="Alt text" /></p>
<h3>2. Next, we will configure the custom ingest pipeline</h3>
<p>We are doing this because we want to customize what the generic managed pipeline does. We find the name for the custom pipeline by looking at the managed pipeline that is created as an asset when the AWS Custom Logs integration is installed. In this case we will be adding the custom ingest pipeline <code>logs-aws_logs.generic@custom</code></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/aws-logs-index-management.png" alt="Alt text" /></p>
<p>From the Dev Tools console, run the commands below. Here, we are extracting the <code>message</code> field from the CloudWatch JSON and putting the value in a field called <code>kubernetes.audit</code>. Then, we are rerouting the message to the default Kubernetes audit dataset that comes with the Kubernetes integration</p>
<pre><code>PUT _ingest/pipeline/logs-aws_logs.generic@custom
{
    &quot;processors&quot;: [
      {
        &quot;pipeline&quot;: {
          &quot;if&quot;: &quot;ctx.message != null &amp;&amp; ctx.message.contains('audit.k8s.io')&quot;,
          &quot;name&quot;: &quot;logs-aws-process-k8s-audit&quot;
        }
      }
    ]
}

PUT _ingest/pipeline/logs-aws-process-k8s-audit
{
  &quot;processors&quot;: [
    {
      &quot;json&quot;: {
        &quot;field&quot;: &quot;message&quot;,
        &quot;target_field&quot;: &quot;kubernetes.audit&quot;
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: &quot;message&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;dataset&quot;: &quot;kubernetes.audit_logs&quot;,
        &quot;namespace&quot;: &quot;default&quot;
      }
    }
  ]
}
</code></pre>
<p>Let’s understand this further:</p>
<ul>
<li>
<p>When we create a Kubernetes integration, we get a managed index template called <code>logs-kubernetes.audit_logs</code> that writes to the pipeline called <code>logs-kubernetes.audit_logs-1.62.2</code> by default</p>
</li>
<li>
<p>If we look into the pipeline <code>logs-kubernetes.audit_logs-1.62.2</code>, we see that all the processor logic works against the field <code>kubernetes.audit</code>. This is why the json processor in the above code snippet creates a field called <code>kubernetes.audit</code> before dropping the original <em>message</em> field and rerouting. Rerouting is directed to the <code>kubernetes.audit_logs</code> dataset that backs the <code>logs-kubernetes.audit_logs-1.62.2</code> pipeline (the dataset name is derived from the pipeline naming convention, which follows the format <code>logs-&lt;dataset&gt;-&lt;version&gt;</code>)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/ingest-pipelines.png" alt="Alt text" /></p>
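<p>The naming convention can be shown in a few lines of Python; <code>dataset_from_pipeline</code> is a hypothetical helper for illustration, not part of any Elastic API:</p>

```python
# Managed pipeline names follow logs-<dataset>-<version>; reroute targets
# the data stream logs-<dataset>-<namespace>.
def dataset_from_pipeline(pipeline_name: str) -> str:
    _prefix, _, rest = pipeline_name.partition("-")   # strip the "logs" prefix
    dataset, _, _version = rest.rpartition("-")       # strip the version suffix
    return dataset

pipeline = "logs-kubernetes.audit_logs-1.62.2"
dataset = dataset_from_pipeline(pipeline)
print(dataset)                    # kubernetes.audit_logs
print(f"logs-{dataset}-default")  # the data stream the reroute targets
```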
<h3>3. Now let’s verify that the logs are actually flowing through and the audit message is being parsed</h3>
<p>a. We will use Elastic Agent and enroll using Fleet and the integration policy we created in Step 1. There are a number of ways to <a href="https://www.elastic.co/guide/en/fleet/current/install-fleet-managed-elastic-agent.html">deploy Elastic Agent</a>; for this exercise we will deploy using Docker, which is quick and easy.</p>
<pre><code>% docker run --env FLEET_ENROLL=1 --env FLEET_URL=&lt;&lt;fleet_URL&gt;&gt; --env FLEET_ENROLLMENT_TOKEN=&lt;&lt;fleet_enrollment_token&gt;&gt;  --rm docker.elastic.co/beats/elastic-agent:8.19.12
</code></pre>
<p>b. Check the messages in Discover. In 8.15 there is also a new feature called Logs Explorer, which provides the ability to see Kubernetes audit logs (and container logs) with a few clicks (see image below). Voila! We can see the Kubernetes audit messages parsed!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/discover.jpg" alt="Alt text" /></p>
<h3>4. Let's do a quick recap of what we did</h3>
<p>We configured the AWS Custom Logs integration in Elasticsearch to read Kubernetes audit logs from CloudWatch. Then, we created custom ingest pipelines to reroute the audit messages to the correct data stream and apply all the OOTB mappings and parsing that come with the Kubernetes Audit Logs integration.</p>
<p>In the next part, we’ll look at how to analyze the ingested Kubernetes Audit log data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch/bringing-your-cloud-managed-kubernetes-audit-logs-into-elasticsearch.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Debugging Azure Networking for Elastic Cloud Serverless]]></title>
            <link>https://www.elastic.co/observability-labs/blog/debugging-aks-packet-loss</link>
            <guid isPermaLink="false">debugging-aks-packet-loss</guid>
            <pubDate>Thu, 05 Jun 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how Elastic SREs uncovered and resolved unexpected packet loss in Azure Kubernetes Service (AKS), impacting Elastic Cloud Serverless performance.]]></description>
            <content:encoded><![CDATA[<h2>Summary of Findings</h2>
<p>Elastic's Site Reliability Engineering team (SRE) observed unstable throughput and packet loss in Elastic Cloud Serverless running on Azure Kubernetes Service (AKS). After investigation, we identified the primary contributing factors to be RX ring buffer overflows and kernel input queue saturation on SR-IOV interfaces. To address this, we increased RX buffer sizes and adjusted the netdev backlog, which significantly improved network stability.</p>
<h2>Setting the Scene</h2>
<p><a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> is a fully managed solution that allows you to deploy and use Elastic for your use cases without managing the underlying infrastructure. Built on Kubernetes, it represents a shift in how you interact with Elasticsearch. Instead of managing clusters, nodes, data tiers, and scaling, you create serverless projects that are fully managed and automatically scaled by Elastic. This abstraction of infrastructure decisions allows you to focus solely on gaining value and insight from your data.</p>
<p>Elastic Cloud Serverless is generally available (GA) on AWS and GCP, and currently in <a href="https://www.elastic.co/guide/en/serverless/current/regions.html">Technical Preview on Azure</a>. As part of preparing Elastic Cloud Serverless for GA on Azure, we have been conducting extensive performance and scalability tests to ensure that our users get a consistent and reliable experience.</p>
<p>In this post, we’ll take you behind the scenes of a deep technical investigation into a surprising performance issue that affected Serverless Elasticsearch in our Azure Kubernetes clusters. At first, the network seemed like the least likely place to look, especially with a high-speed 100 Gb/s interface on the host backing it. But as we dug deeper, with help from the Microsoft Azure team, that’s exactly where the problem led us.</p>
<h2>Unexpected Results!</h2>
<p>While the high-level architectures and system design patterns of the major cloud providers’ systems are often similar, the implementations differ, and these differences can have dramatic impacts on a system’s performance characteristics.</p>
<p>One of the most significant differences between the different cloud providers is that the underlying hypervisor software and server hardware of the Virtual Machines can vary significantly, even between instance families of the same provider.</p>
<p>There is no way to fully abstract the hardware away from an application like Elasticsearch. Fundamentally, its performance is dictated by the CPU, memory, disks, and network interfaces on the physical server. In preparation for the Elastic Cloud Serverless GA on Azure, our Elasticsearch Performance team kicked off large-scale load testing against Serverless Elasticsearch projects running on <a href="https://docs.azure.cn/en-us/aks/what-is-aks">Azure Kubernetes Service (AKS)</a>, using <a href="https://azure.microsoft.com/en-us/blog/azure-cobalt-100-based-virtual-machines-are-now-generally-available/">ARM-based VMs</a> (we’re big fans!). Throughout this process, we relied heavily on Elastic tools to analyse system behaviour, identify bottlenecks, and validate performance under load.</p>
<p>To perform these scale and load tests, the Elasticsearch Performance team use <a href="https://github.com/elastic/rally">Rally</a>, an open-source benchmarking tool designed to measure the performance of Elasticsearch clusters. The workload (or in Rally nomenclature, ‘Track’) used for these tests was the <a href="https://github.com/elastic/rally-tracks/tree/master/github_archive">GitHub Archive Track</a>. Rally collects and sends test telemetry using the <a href="https://www.elastic.co/docs/reference/elasticsearch/clients/python">official Python client</a> to a separate Elasticsearch cluster running <a href="https://www.elastic.co/observability">Elastic Observability</a>, which allows for monitoring and analysis during these scale and load tests in real time via <a href="https://www.elastic.co/docs/explore-analyze">Kibana</a>.</p>
<p>When we looked at the results, we observed that the indexing rate (the number of docs/s) for the Serverless projects was not only much lower than we had expected for the given hardware, but the throughput was also quite unstable. There were peaks and valleys, interspersed with frequent errors, whereas we were instead expecting a stable indexing rate for the duration of the test.</p>
<p>These tests are designed to push the system to its limits, and in doing so, they surfaced unexpected behavior in the form of unstable indexing throughput and intermittent errors. This was precisely the kind of problem we'd hoped to uncover prior to going GA — giving us the opportunity to work closely with Azure.</p>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/indexing-rate-before.png" alt="Indexing Rate with Packet Loss" />
<p><em>A Kibana visualisation of Rally telemetry, showing fluctuating Elasticsearch indexing rates alongside spikes in 5xx and 4xx HTTP error responses.</em></p>
</div>
<h2>Debugging!</h2>
<p>Debugging performance issues can feel a little bit like trying to find a <a href="https://www.youtube.com/watch?v=7AO4wz6gI3Q">‘Butterfly in a Hurricane’</a>, so it’s crucial that you take a methodical approach to analysing application and system performance.</p>
<p>Using methodologies helps you to be more consistent and thorough in your debugging, and avoids missing things. We started with the <a href="https://www.brendangregg.com/usemethod.html">Utilisation Saturation and Errors (USE) Method</a>, looking at both the client and server side to identify any obvious bottlenecks in the system.</p>
<p>Elastic's Site Reliability Engineers (SREs) maintain a suite of custom <a href="https://www.elastic.co/docs/solutions/observability/get-started/what-is-elastic-observability">Elastic Observability</a> dashboards designed to visualise data collected from various <a href="https://www.elastic.co/docs/extend/integrations/what-is-an-integration">Elastic Integrations</a>. These dashboards provide deep visibility into the health and performance of Elastic Cloud infrastructure and systems.</p>
<p>For this investigation, we leveraged a custom dashboard built using metrics and log data from the <a href="https://www.elastic.co/docs/reference/integrations/system">System</a> and <a href="https://www.elastic.co/docs/reference/integrations/linux">Linux</a> Integrations:</p>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/overview-dashboard.png" alt="Node Overview Dashboard" />
<p><em>One of many Elastic Observability dashboards built and maintained by the SRE team.</em></p>
</div>
<p>Following the USE Method, these dashboards highlight resource utilisation, saturation, and errors across our systems. With their help, we quickly identified that the AKS nodes hosting the Elasticsearch pods under test were dropping thousands of packets per second.</p>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/packet-loss-before.png" alt="Node Packet Loss Before Tuning" />
<p><em>A Kibana visualisation of <a href="https://www.elastic.co/docs/reference/integrations/system">Elastic Agent's System Integration</a>, showing the rate of packet drops per second for AKS nodes.</em></p>
</div>
<p>Dropping packets forces reliable protocols, such as TCP, to retransmit any missing packets. These retransmissions can introduce significant delays, which kills the throughput of any system where client requests are only triggered upon the previous request completion (known as a <a href="https://www.usenix.org/legacy/event/nsdi06/tech/full_papers/schroeder/schroeder.pdf">Closed System</a>).</p>
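<p>A back-of-the-envelope model shows how hard retransmissions hit a closed system. The numbers below are purely illustrative, not measurements from our tests: a request normally completes in 2 ms, a retransmission timeout adds 200 ms, and 1% of requests hit one.</p>

```python
# Closed-system model: one outstanding request at a time, so throughput
# is simply 1 / mean request latency.
base_latency_s = 0.002   # normal completion time (illustrative)
rto_penalty_s = 0.200    # extra delay when a packet must be retransmitted
loss_rate = 0.01         # fraction of requests that hit a retransmission

mean_latency = base_latency_s + loss_rate * rto_penalty_s
throughput_clean = 1 / base_latency_s   # requests/s with no loss
throughput_lossy = 1 / mean_latency     # requests/s with 1% loss

print(f"no loss: {throughput_clean:.0f} req/s")  # 500 req/s
print(f"1% loss: {throughput_lossy:.0f} req/s")  # 250 req/s
```

<p>Even a 1% loss rate halves throughput in this toy model, which is why the drops mattered so much despite being a tiny fraction of total packets.</p>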
<p>To investigate further, we jumped onto one of the AKS nodes exhibiting the packet loss to check the basics. First off, we wanted to identify what type of packet drops or errors we’re seeing; is it for specific pods, or the host as a whole?</p>
<pre><code>root@aks-k8s-node-1:~# ip -s link show
2: eth0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 7c:1e:52:be:ce:5e brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    373507935420 134292481      0       0       0      15
    TX:    bytes   packets errors dropped carrier collsns
    644247778936 303191014      0       0       0       0
3: enP42266s1: &lt;BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP&gt; mtu 1500 qdisc mq master eth0 state UP mode DEFAULT group default qlen 1000
    link/ether 7c:1e:52:be:ce:5e brd ff:ff:ff:ff:ff:ff
    RX:    bytes   packets errors dropped  missed   mcast
    386782548951 307000571      0       0 5321081       0
    TX:    bytes   packets errors dropped carrier collsns
    655758630548 477594747      0       0       0       0
    altname enP42266p0s2
15: lxc0ca0ec41ecd2@if14: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f6:f5:5e:c9:4e:fb brd ff:ff:ff:ff:ff:ff link-netns cni-3f90ab53-df66-cac5-bd19-9cea4a68c29b
    RX:    bytes   packets errors dropped  missed   mcast
    627954576078  54297550      0    1600       0       0
    TX:    bytes   packets errors dropped carrier collsns
    372155326349 133538064      0    3927       0       0
</code></pre>
<p>In this output you can see the <code>enP42266s1</code> interface is showing a significant number of packets in the <code>missed</code> column. That’s interesting, sure, but what does missed actually represent? And what is <code>enP42266s1</code>?</p>
<p>To understand, let’s look at roughly what happens when a packet arrives at the NIC:</p>
<ol>
<li>A packet arrives at the NIC from the network.</li>
<li>The NIC uses DMA (Direct Memory Access) to place the packet into a receive ring buffer allocated in memory by the kernel and mapped for use by the NIC. Since our NICs support multiple hardware queues, each queue has its own dedicated ring buffer, IRQ, and NAPI context.</li>
<li>The NIC raises a hardware interrupt (IRQ) to notify the CPU that a packet is ready.</li>
<li>The CPU runs the NIC driver’s IRQ handler. The driver schedules a NAPI (New API) poll to defer packet processing to a softirq context, a mechanism in the Linux kernel that moves work out of the hard IRQ context for better batching and CPU efficiency, enabling improved scalability.</li>
<li>The NAPI poll function is executed in a softirq context (<code>NET_RX_SOFTIRQ</code>) and retrieves packets from the ring buffer. This polling continues either until the driver’s packet budget is exhausted (<code>net.core.netdev_budget</code>) or the time limit is hit (<code>net.core.netdev_budget_usecs</code>).</li>
<li>Each packet is wrapped in an <code>sk_buff</code> (socket buffer) structure, which includes metadata such as protocol headers, timestamps, and interface identifiers.</li>
<li>If the networking stack is slower than the rate at which NAPI fetches packets, excess packets are queued in a per-CPU backlog queue (via <code>enqueue_to_backlog</code>). The maximum size of this backlog is controlled by the <code>net.core.netdev_max_backlog</code> sysctl.</li>
<li>Packets are then handed off to the kernel’s networking stack for routing, filtering, and protocol-specific processing (e.g. TCP, UDP).</li>
<li>Finally, packets reach the appropriate socket receive buffer, where they are available for consumption by the user-space application.</li>
</ol>
<p>Visualised, it looks something like this:</p>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/packet-flow.png" alt="Linux Packet Flow Diagram" />
<p><em>Image © 2018 Leandro Moreira. Used under the <a href="https://opensource.org/licenses/BSD-3-Clause">BSD 3-Clause License</a>. Source: <a href="https://github.com/leandromoreira/linux-network-performance-parameters">GitHub repository</a>.</em></p>
</div>
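<p>Step 7 above (the per-CPU backlog) can be sketched as a bounded queue. This is a toy model, not kernel code; the limit of 1000 matches the common <code>net.core.netdev_max_backlog</code> default:</p>

```python
from collections import deque

def deliver(packets: int, max_backlog: int) -> int:
    """Toy model of enqueue_to_backlog: count how many packets are dropped
    when `packets` arrive in one burst before the stack drains the queue."""
    backlog = deque()
    dropped = 0
    for _ in range(packets):
        if len(backlog) >= max_backlog:
            dropped += 1          # queue full: the kernel bumps a drop counter
        else:
            backlog.append("pkt")
    return dropped

# A 2000-packet burst against a backlog limit of 1000:
print(deliver(packets=2000, max_backlog=1000))  # 1000
print(deliver(packets=500, max_backlog=1000))   # 0
```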
<p>The <code>missed</code> counter is incremented whenever the NIC tries to DMA a packet into a fully occupied <a href="https://en.wikipedia.org/wiki/Circular_buffer">ring buffer</a>. The NIC essentially &quot;misses&quot; the chance to deliver the packet to the VM’s memory. However, what’s most interesting is that this counter seldom increments for VMs. This is because Virtual NICs are usually implemented as software via the hypervisor, which typically has much more flexible memory management compared to the physical NICs and can reduce the chance of ring buffer overflow.</p>
<p>We mentioned earlier that we’re building Azure Elasticsearch Serverless on top of Azure’s AKS service, which is important to note because all of our AKS nodes use an Azure feature called <a href="https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-overview">Accelerated Networking</a>. In this setup, network traffic is delivered directly to the VM’s network interface, bypassing the hypervisor. This is enabled by <a href="https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-">single root I/O virtualization (SR-IOV)</a>, which offers much lower latency and higher throughput than traditional VM networking. Each node is physically connected to a 100 Gb/s network interface, although the SR-IOV Virtual Function (VF) exposed to the VM typically provides only a fraction of that total bandwidth.</p>
<p>Despite the VM only having a fraction of the 100 Gb/s bandwidth, microbursts are still very possible. These physical interfaces are so fast that they can transmit and receive multiple packets in just nanoseconds, far faster than most buffers or processing queues can absorb. At these timescales, even a short-lived burst of traffic can overwhelm the receiver, leading to dropped packets and unpredictable latency.</p>
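<p>A toy model makes the effect of ring sizing on microbursts visible; the burst size and drain figure below are invented for illustration:</p>

```python
def missed_packets(burst: int, ring_size: int, drained_during_burst: int) -> int:
    """Toy model: the NIC DMA-writes `burst` packets while the host frees
    only `drained_during_burst` descriptors; any overflow becomes `missed`."""
    free_slots = ring_size + drained_during_burst
    return max(0, burst - free_slots)

burst = 6000    # packets in one microburst (illustrative)
drained = 100   # descriptors freed while the burst is in flight

print(missed_packets(burst, 1024, drained))  # 4876 missed with RX=1024
print(missed_packets(burst, 8192, drained))  # 0 missed with RX=8192
```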
<p>Direct access to the SR-IOV interface means that our VMs are responsible for handling the hardware interrupts triggered by the NIC in a timely manner; if there's any delay in handling the hardware interrupt (e.g. waiting to be scheduled onto a CPU by the hypervisor), then network packets can be missed!</p>
<h2>Firstly - NIC-level Tuning</h2>
<p>Since we'd confirmed that our VMs were using SR-IOV, we established that the <code>enP42266s1</code> and <code>eth0</code> interfaces <a href="https://learn.microsoft.com/en-us/azure/virtual-network/accelerated-networking-how-it-works">were a bonded pair and acted as a single interface</a>. Knowing this, we reasoned that we should be able to adjust the ring buffer values directly using <code>ethtool</code>.</p>
<pre><code>root@aks-k8s-node-1:~# ethtool -g enP42266s1
Ring parameters for enP42266s1:
Pre-set maximums:
RX:		8192
RX Mini:	n/a
RX Jumbo:	n/a
TX:		8192
Current hardware settings:
RX:		1024
RX Mini:	n/a
RX Jumbo:	n/a
TX:		1024
</code></pre>
<p>In the output above, we were using only 1/8th of the available ring buffer descriptors. These values were set by the OS defaults, which generally aim to balance performance and resource usage. Set too low, they risk packet drops under load; set too high, they can lead to unnecessary memory consumption. We knew that the VMs were backed by a virtual function carved out of the directly attached 100 Gb/s network interface, which is fast enough to deliver microbursts that could easily overwhelm small buffers. To better absorb those short, high-intensity bursts of traffic, we increased the NIC’s RX ring buffer size from 1024 to 8192. Using a privileged DaemonSet, we rolled out the change across all of our AKS nodes by installing <a href="https://en.wikipedia.org/wiki/Udev">a <code>udev</code> rule</a> to automatically increase the buffer size:</p>
<pre><code># Match Mellanox ConnectX network cards and run ethtool to update the ring buffer settings
ENV{INTERFACE}==&quot;en*&quot;, ENV{ID_NET_DRIVER}==&quot;mlx5_core&quot;, RUN+=&quot;/sbin/ethtool -G %k rx ${CONFIG_AZURE_MLX_RING_BUFFER_SIZE} tx ${CONFIG_AZURE_MLX_RING_BUFFER_SIZE}&quot;
</code></pre>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/packet-loss-after.png" alt="AKS Node Packet Loss after RX ring buffer change" />
<p><em>A Kibana visualisation of <a href="https://www.elastic.co/docs/reference/integrations/system">Elastic Agent's System Integration</a>, showing packet loss reduced by ~99% after increasing the NIC's RX ring buffer values.</em></p>
</div>
<p>As soon as the change had been applied to all AKS nodes, we stopped ‘missing’ RX packets! Fantastic! As a result of this simple change, we observed a significant improvement in our indexing throughput and stability.</p>
<div align="center">
<img src="/assets/images/debugging-aks-packet-loss/indexing-rate-after.png" alt="Indexing rate after RX ring buffer change" />
<p><em>A Kibana visualisation of Rally telemetry, showing stable and improved Elasticsearch indexing rates after increasing the RX ring buffer size.</em></p>
</div>
<p>Job done, right? Not quite…</p>
<h2>Further improvements - Kernel-level Tuning</h2>
<p>Eagle eyed readers may have noticed two things:</p>
<ol>
<li>In the previous screenshot, despite adjusting the physical RX ring buffer values, we still observed a small number of <code>dropped</code> packets on the TX side.</li>
<li>In the original <code>ip -s link show</code> output, one of the ‘logical’ interfaces used by the Elasticsearch pod was showing <code>dropped</code> packets on both the TX and RX sides.</li>
</ol>
<pre><code>15: lxc0ca0ec41ecd2@if14: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether f6:f5:5e:c9:4e:fb brd ff:ff:ff:ff:ff:ff link-netns cni-3f90ab53-df66-cac5-bd19-9cea4a68c29b
    RX:    bytes   packets errors dropped  missed   mcast
    627954576078  54297550      0    1600       0       0
    TX:    bytes   packets errors dropped carrier collsns
    372155326349 133538064      0    3927       0       0
</code></pre>
<p>So, we continued to dig. We’d eliminated ~99% of the packet loss, and the remaining drop rate was far less significant than where we’d started, but we still wanted to understand why drops were occurring at all after adjusting the NIC’s RX ring buffer size.</p>
<p>So what does <code>dropped</code> represent, and what is this <code>lxc0ca0ec41ecd2</code> interface? <code>dropped</code> is similar to <code>missed</code>, but only occurs when packets are deliberately dropped by the kernel or network interface. Crucially though, it doesn’t tell you why a packet was dropped. As for the <code>lxc0ca0ec41ecd2</code> interface, we use the <a href="https://learn.microsoft.com/en-us/azure/aks/azure-cni-powered-by-cilium">Azure CNI Powered by Cilium</a> to provide the network functionality to our AKS clusters. Any pod spun up on an AKS node gets a ‘logical’ interface, which is a virtual ethernet (<code>veth</code>) pair that connects the pod’s network namespace with the host’s network namespace. It was here that we were dropping packets.</p>
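<p>These per-interface counters can also be read directly from <code>sysfs</code>, which is handy for scripting a quick survey across all interfaces; the script below is a small sketch that reads the same kernel statistics that <code>ip -s link show</code> reports:</p>

```shell
#!/bin/sh
# Print RX/TX drop and RX missed counters for every network interface,
# reading the same kernel statistics that `ip -s link show` reports.
for dev in /sys/class/net/*; do
  name=$(basename "$dev")
  printf '%-18s rx_dropped=%s tx_dropped=%s rx_missed=%s\n' \
    "$name" \
    "$(cat "$dev/statistics/rx_dropped")" \
    "$(cat "$dev/statistics/tx_dropped")" \
    "$(cat "$dev/statistics/rx_missed_errors")"
done
```

Running this periodically (or shipping the counters with an observability agent) makes it easy to spot which interface, physical or virtual, is accumulating drops.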
<p><img src="https://www.elastic.co/observability-labs/assets/images/debugging-aks-packet-loss/aks-node-network-topology.png" alt="AKS Node Networking Diagram" /></p>
<p>In our experience, packet drops at this layer are unusual, so we started digging deeper into the cause of the drops. There are numerous ways you can debug why a packet is being dropped, but one of the easiest is <a href="https://perfwiki.github.io/main/">to use <code>perf</code></a> to attach to the <code>skb:kfree_skb</code> tracepoint. The &quot;socket buffer&quot; (<code>skb</code>) is the primary data structure used to represent network packets in the Linux kernel. When a packet is dropped, its corresponding socket buffer is usually freed, triggering the <code>kfree_skb</code> tracepoint. Using <code>perf</code> to attach to this event allowed us to capture stack traces to analyze the cause of the drops.</p>
<pre><code># perf record -g -a -e skb:kfree_skb
</code></pre>
<p>We left this to run for ~10 minutes or so to capture as many drops as possible, and then, ‘heavily inspired’ by <a href="https://gist.github.com/bobrik/0e57671c732d9b13ac49fed85a2b2290">this GitHub Gist by Ivan Babrou</a>, we converted the stack traces into easier-to-read <a href="https://github.com/brendangregg/FlameGraph">Flamegraphs</a>:</p>
<pre><code># perf script | sed -e 's/skb:kfree_skb:.*reason:\(.*\)/\n\tfffff \1 (unknown)/' -e 's/^\(\w\+\)\s\+/kernel /' &gt; stacks.txt
cat stacks.txt | stackcollapse-perf.pl --all | perl -pe 's/.*?;//' | sed -e 's/.*irq_exit_rcu_\[k\];/irq_exit_rcu_[k];/' | flamegraph.pl --colors=java --hash --title=aks-k8s-node-1 --width=1440 --minwidth=0.005 &gt; aks-k8s-node-1.svg
</code></pre>
<div align="center">
<p><img src="/assets/images/debugging-aks-packet-loss/aks-packet-loss-flamegraph.png" alt="AKS Node Packet Loss Flamegraph" /></p>
<p><em>A Flamegraph showing the stack trace ancestry of the packet drops.</em></p>
</div>
<p>The flamegraph here shows how often different functions appeared in the stack traces for packet drops. Each box represents a function call, and wider boxes mean the function appeared more frequently in the traces. Each stack builds upward, from the earliest calls at the bottom to the latest calls at the top.</p>
<p>Firstly, we quickly discovered that, unfortunately, the <code>skb_drop_reason</code> enum <a href="https://github.com/torvalds/linux/commit/c504e5c2f9648a1e5c2be01e8c3f59d394192bd3">was only added in kernel 5.17</a> (Azure’s node image at the time used 5.15). This meant that there was no single human-readable message telling us why the packets were being dropped; all we got was <code>NOT_SPECIFIED</code>. To work out why packets were being dropped, we needed to do a little sleuthing through the stack traces to determine which code paths were being taken when a packet was dropped.</p>
<p>In the flamegraph above you can see that many of the stack traces include <code>veth</code> driver function calls (e.g. <code>veth_xmit</code>), and many end abruptly with a call to the <code>enqueue_to_backlog</code> function. When many stacks end at the same function (like <code>enqueue_to_backlog</code>) it suggests that function is a common point where packets are being dropped. If you go back to the earlier explanation of what happens when a packet arrives at the NIC, you’ll notice that in step 7 we explained:</p>
<blockquote>
<p><em>7. If the networking stack is slower than the rate at which NAPI fetches packets, excess packets are queued in a per-CPU backlog queue (via <code>enqueue_to_backlog</code>). The maximum size of this backlog is controlled by the <code>net.core.netdev_max_backlog</code> sysctl.</em></p>
</blockquote>
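<p>One way to check whether this per-CPU backlog is overflowing is <code>/proc/net/softnet_stat</code>: each row is one CPU, and the second column is a hexadecimal count of packets dropped because the backlog queue was already full. A small script, sketched below, sums it across CPUs:</p>

```shell
#!/bin/bash
# Sum the per-CPU backlog drop counters from /proc/net/softnet_stat.
# Column 1 is packets processed; column 2 counts packets dropped because
# the per-CPU backlog queue (bounded by net.core.netdev_max_backlog)
# was already full. All values are hexadecimal.
total=0
while read -r _ dropped _; do
  total=$(( total + 16#${dropped} ))
done < /proc/net/softnet_stat
echo "backlog drops across all CPUs: ${total}"
```

A non-zero, steadily growing total here is a strong hint that <code>netdev_max_backlog</code> is too small for the traffic pattern.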
<p>Using the same privileged DaemonSet method as for the RX ring buffer adjustment, we raised the <code>net.core.netdev_max_backlog</code> kernel parameter from its default of 1000 to 32768:</p>
<pre><code>/usr/sbin/sysctl -w net.core.netdev_max_backlog=32768
</code></pre>
<p>This value was based on the fact that we knew the hosts were using a 100 Gb/s SR-IOV NIC, even if each VM was allowed only a fraction of the total bandwidth. We acknowledge that it’s worth revisiting this value in the future to see whether it can be tuned further to avoid wasting memory, but at the time, “perfect was the enemy of good”.</p>
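<p>Note that <code>sysctl -w</code> is transient and lost on reboot; on our AKS nodes the DaemonSet re-applies it, but on a plain Linux host the usual way to persist such a setting is a drop-in file (the filename below is illustrative):</p>

```
# /etc/sysctl.d/90-net-backlog.conf
net.core.netdev_max_backlog = 32768
```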
<p>We re-ran the load tests and compared the three sets of results we’d collected thus far.</p>
<div align="center">
<p><img src="/assets/images/debugging-aks-packet-loss/indexing-rate-final.png" alt="Final Indexing Rate Results" /></p>
<p><em>A Kibana visualisation of Rally results, comparing impact to median throughput after each configuration change.</em></p>
</div>
<table>
<thead>
<tr>
<th>Tuning Step</th>
<th>Packet Loss</th>
<th>Median indexing throughput</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>High</td>
<td>~18,000 docs/s</td>
</tr>
<tr>
<td>+RX Buffer</td>
<td>~99% drop ↓</td>
<td>~26,000 (+ ~40% from baseline)</td>
</tr>
<tr>
<td>+Backlog &amp; +RX Buffer</td>
<td>Near zero</td>
<td>~29,000 (+ ~60% from baseline)</td>
</tr>
</tbody>
</table>
<p>Here you can see the P50 throughput in docs/s over the course of the hours-long load tests. Compared to the baseline, we saw a roughly <strong>~40%</strong> increase in throughput from adjusting the RX ring buffer values alone, and a <strong>~60%</strong> increase with both the RX ring buffer and backlog changes! Hooray!</p>
<p>A great result and one more step on our journey towards better Serverless Elasticsearch performance.</p>
<h2>Working with Azure</h2>
<p>It’s great that we were able to quickly identify and mitigate the majority of our packet loss issues, but since we were using AKS with AKS node images, it made sense to engage with Azure to understand why the defaults weren’t working for our workload.</p>
<p>We walked Azure through our investigation, mitigations, and results, and asked them to validate our mitigations. Azure Engineering confirmed that the host NICs were not discarding packets, meaning everything arriving at the host level was passed through to the hypervisor. Further investigation confirmed that no loss or discards were occurring in the Azure network fabric or inside the hypervisor – which shifted focus to the guest OS and why the guest kernel was slow to read packets off the <code>enP*</code> SR-IOV interfaces.</p>
<p>Given the complexity of our load testing scenario, which involved configuring multiple systems and tools (including <a href="https://www.elastic.co/observability">Elastic Observability</a>), we also developed a simplified reproduction of the packet loss issue using <a href="https://github.com/esnet/iperf"><code>iperf3</code></a>. This simplified test was created specifically to share with Azure for targeted analysis, complementing the broader monitoring and analysis enabled by Elastic Observability and Rally.</p>
<p>With this reproduction, Azure was able to confirm the increasing <code>missed</code> and <code>dropped</code> packet counters we had observed, and endorsed the larger RX ring buffer and <code>netdev_max_backlog</code> values as the recommended mitigations.</p>
<h2>Conclusion</h2>
<p>While cloud providers offer various abstractions to manage your resources, the underlying hardware ultimately determines your application's performance and stability. High-performance hardware often requires tuning at the operating system level, well beyond the default settings most environments ship with. In managed platforms like AKS, where Azure controls both the node images and infrastructure, it is easy to overlook the impact of low-level configurations such as network device ring buffer sizes or sysctls like <code>net.core.netdev_max_backlog</code>.</p>
<p>Our experience shows that even with the convenience of a managed Kubernetes service, performance issues can still emerge if these hardware parameters are not tuned appropriately. It was tempting to assume that high-speed 100 Gb/s network interfaces, directly attached to the VM using SR-IOV, would eliminate any chance of network-related bottlenecks. In reality, that assumption didn’t hold up.</p>
<p>Engaging early with Azure was essential, as they provided deeper visibility into the underlying infrastructure and worked with us to tune low-level, performance-critical settings. Combined with thorough load and scale testing and robust observability using tools like Elastic Observability, this collaboration helped us detect and rectify the issue early in order to deliver a consistent, reliable, and high-performing experience for our users.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/debugging-aks-packet-loss/debugging-aks-packet-loss.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using Elastic to observe GKE Autopilot clusters]]></title>
            <link>https://www.elastic.co/observability-labs/blog/observe-gke-autopilot-clusters</link>
            <guid isPermaLink="false">observe-gke-autopilot-clusters</guid>
            <pubDate>Wed, 15 Mar 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[See how deploying the Elastic Agent onto a GKE Autopilot cluster makes observing the cluster’s behavior easy. Kibana integrations make visualizing the behavior a simple addition to your observability dashboards.]]></description>
            <content:encoded><![CDATA[<p>Elastic has formally supported Google Kubernetes Engine (GKE) since January 2020, when Elastic Cloud on Kubernetes was announced. Since then, Google has expanded GKE, with new service offerings and delivery mechanisms. One of those new offerings is GKE Autopilot. Where GKE is a managed Kubernetes environment, GKE Autopilot is a mode of Kubernetes operation where Google manages your cluster configuration, scaling, security, and more. It is production ready and removes many of the challenges associated with tasks like workload management, deployment automation, and scalability rules. Autopilot lets you focus on building and deploying your application while Google manages everything else.</p>
<p>Elastic is committed to supporting Google Kubernetes Engine (GKE) in all of its delivery modes. In October, during the Google Cloud Next ‘22 event, we announced our intention to integrate and certify Elastic Agent on Anthos, Autopilot, Google Distributed Cloud, and more.</p>
<p>Since that event, we have worked together with Google to get the Elastic Agent certified for use on Anthos, but we didn’t stop there.</p>
<p>Today we are happy to <a href="https://github.com/elastic/elastic-agent/blob/autopilotdocumentaton/docs/elastic-agent-gke-autopilot.md">announce</a> that we have been certified for operation on GKE Autopilot.</p>
<h2>Hands on with Elastic and GKE Autopilot</h2>
<h3><a href="https://www.elastic.co/observability/kubernetes-monitoring">Kubernetes observability</a> has never been easier</h3>
<p>To show how easy it is to get started with Autopilot and Elastic, let's walk through deploying the Elastic Agent on an Autopilot cluster, setting it up to monitor the cluster, and observing the cluster’s behavior with Kibana integrations.</p>
<p>One of the main differences between GKE and GKE Autopilot is that Autopilot protects the system namespace “kube-system.” To increase the stability and security of a cluster, Autopilot prevents user space workloads from adding or modifying system pods. The default configuration for Elastic Agent is to install itself into the system namespace. The majority of the changes we will make here are to convince the Elastic Agent to run in a different namespace.</p>
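<p>Concretely, the tweak amounts to pointing the Agent's manifest objects at a different namespace, along the lines of the following sketch (the namespace name here is an arbitrary illustrative choice, not the exact manifest):</p>

```yaml
# Illustrative excerpt: run the Elastic Agent DaemonSet (and its
# ServiceAccount/RBAC objects) in a dedicated namespace instead of
# the Autopilot-protected kube-system namespace.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: elastic-agent   # anything other than kube-system
```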
<h2>Let’s get started with Elastic Stack!</h2>
<p>While writing this article, I used the latest version of Elastic. The best way for you to get started with Elastic Observability is to:</p>
<ol>
<li>Get an account on <a href="https://cloud.elastic.co/registration?fromURI=/home">Elastic Cloud</a> and look at this <a href="https://www.elastic.co/videos/training-how-to-series-cloud">tutorial</a> to help launch your first stack, or</li>
<li><a href="https://www.elastic.co/partners/google-cloud">Launch Elastic Cloud on your Google Account</a></li>
</ol>
<h2>Provisioning an Autopilot cluster and an Elastic stack</h2>
<p>To test the agent, I first deployed the recommended, default GKE Autopilot cluster. Elastic’s GKE integration supports kube-state-metrics (KSM), which will increase the number of reported metrics available for reporting and dashboards. Like the Elastic Agent, KSM defaults to running in the system namespace, so I modified its manifest to work with Autopilot. For my testing, I also deployed a basic Elastic stack on Elastic Cloud in the same Google region as my Autopilot cluster. I used a fresh cluster deployed on Elastic’s managed service (ESS), but the process is the same if you are using an Elastic Cloud subscription purchased through the Google marketplace.</p>
<h2>Adding Elastic Observability to GKE Autopilot</h2>
<p>Because this is a brand new deployment, Elastic suggests adding integrations to it. Let’s add the Kubernetes integration into the new deployment:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-welcome-to-elastic.png" alt="elastic agent GKE autopilot welcome" /></p>
<p>Elastic offers hundreds of integrations; filter the list by typing “kub” into the search bar (1) and then click the Kubernetes integration (2).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-integration.png" alt="elastic agent GKE autopilot kubernetes integration" /></p>
<p>The Kubernetes integration page gives you an overview of the integration and lets you manage the Kubernetes clusters you want to observe. We haven’t added a cluster yet, so I clicked “Add Kubernetes” to add the first integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-kubernetes.png" alt="elastic agent GKE autopilot add kubernetes" /></p>
<p>I changed the integration name to reflect the Kubernetes offering type and then clicked “Save and continue” to accept the integration defaults.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-kubernetes-integration.png" alt="elastic agent GKE autopilot add kubernetes integration" /></p>
<p>At this point, an Agent policy has been created. Now it’s time to install the agent. I clicked on the “Kubernetes” integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-agent-policy-1.png" alt="elastic agent GKE autopilot agent policy" /></p>
<p>Then I selected the “integration policies” tab (1) and clicked “Add agent” (2).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-add-agent.png" alt="elastic agent GKE autopilot add agent" /></p>
<p>Finally, I downloaded the full manifest for a standard GKE environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-download-manifest.png" alt="elastic agent GKE autopilot download manifest" /></p>
<p>We won’t be using this manifest directly, but it contains many of the values that we will need to deploy the agent on Autopilot in the next section.</p>
<p>The Elastic stack is ready and waiting for the Autopilot logs, metrics, and events. It’s time to connect Autopilot to this deployment using the Elastic Agent for GKE.</p>
<h2>Connect Autopilot to Elastic</h2>
<p>From the Google cloud terminal, I downloaded and edited the Elastic Agent manifest for GKE Autopilot.</p>
<pre><code class="language-bash"># Use the raw URL so curl downloads the manifest itself, not the GitHub HTML page
$ curl -L -o elastic-agent-managed-gke-autopilot.yaml \
https://raw.githubusercontent.com/elastic/elastic-agent/autopilotdocumentaton/docs/manifests/elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-cloud-shell-editor.png" alt="elastic agent GKE autopilot cloud shell editor" /></p>
<p>I used the cloud shell editor to configure the manifest for my Autopilot and Elastic clusters. For example, I updated the following:</p>
<pre><code class="language-yaml">containers:
  - name: elastic-agent
    image: docker.elastic.co/beats/elastic-agent:8.6.0
</code></pre>
<p>Here I changed the agent image tag to match the version of the Elastic stack I had installed (8.6.0).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-google-cloud.png" alt="elastic agent GKE autopilot google cloud" /></p>
<p>From the Integration manifest I downloaded earlier, I copied the values for FLEET_URL and FLEET_ENROLLMENT_TOKEN into this YAML file.</p>
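<p>In the manifest, those two values end up as environment variables on the agent container, roughly like this sketch (placeholders shown instead of real values):</p>

```yaml
env:
  - name: FLEET_URL
    value: "https://<your-fleet-server-host>:443"
  - name: FLEET_ENROLLMENT_TOKEN
    value: "<your-enrollment-token>"
```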
<p>Now it’s time to apply the updated manifest to the Autopilot instance.</p>
<p>Before I commit, I always like to see what’s going to be created (and check for syntax errors) with a dry run.</p>
<pre><code class="language-bash">$ clear
$ kubectl apply --dry-run=&quot;client&quot; -f elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-dry-run.png" alt="elastic agent GKE autopilot dry run" /></p>
<p>Everything looks good, so I’ll do it for real this time.</p>
<pre><code class="language-bash">$ clear
$ kubectl apply -f elastic-agent-managed-gke-autopilot.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-autopilot-cluster.png" alt="elastic agent GKE autopilot cluster" /></p>
<p>After several minutes, metrics will start flowing from the Autopilot cluster directly into the Elastic deployment.</p>
<h2>Adding a workload to the Autopilot cluster</h2>
<p>Observing an Autopilot cluster without a workload is boring, so I deployed a modified version of Google’s <a href="https://github.com/bshetti/opentelemetry-microservices-demo">Hipster Shop</a> (which includes OpenTelemetry reporting):</p>
<pre><code class="language-bash">$ git clone https://github.com/bshetti/opentelemetry-microservices-demo
$ cd opentelemetry-microservices-demo
$ nano ./deploy-with-collector-k8s/otelcollector.yaml
</code></pre>
<p>To get the application’s telemetry talking to our Elastic stack, I changed every exporter from the HTTP type (otlphttp/elastic) to the gRPC type (otlp/elastic). I then replaced OTEL_EXPORTER_OTLP_ENDPOINT with my APM endpoint and replaced OTEL_EXPORTER_OTLP_HEADERS with my APM bearer token.</p>
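<p>The resulting exporter section of the collector configuration looks roughly like this sketch (the endpoint and header values are placeholders, not real credentials):</p>

```yaml
exporters:
  otlp/elastic:                    # gRPC exporter, replacing otlphttp/elastic
    endpoint: "<your-apm-endpoint>:443"
    headers:
      Authorization: "Bearer <your-apm-secret-token>"
```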
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-terminal-telemetry.png" alt="elastic agent GKE autopilot terminal telemetry" /></p>
<p>Then I deployed the Hipster Shop.</p>
<pre><code class="language-bash">$ kubectl create -f ./deploy-with-collector-k8s/adservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/redis.yaml
$ kubectl create -f ./deploy-with-collector-k8s/cartservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/checkoutservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/currencyservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/emailservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/frontend.yaml
$ kubectl create -f ./deploy-with-collector-k8s/paymentservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/productcatalogservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/recommendationservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/shippingservice.yaml
$ kubectl create -f ./deploy-with-collector-k8s/loadgenerator.yaml
</code></pre>
<p>Once all of the shop’s pods were running, I deployed the OpenTelemetry collector.</p>
<pre><code class="language-bash">$ kubectl create -f ./deploy-with-collector-k8s/otelcollector.yaml
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-deployed-opentelemetry-collector.png" alt="elastic agent GKE autopilot deployed opentelemetry collector" /></p>
<h2>Observe and visualize Autopilot’s metrics</h2>
<p>Now that we have added the Elastic Agent to our Autopilot cluster and added a workload, let's take a look at some of the Kubernetes visualizations the integration provides out of the box.</p>
<p>The “[Metrics Kubernetes] Overview” is a great place to start. It provides a high-level view of the resources used by the cluster and allows me to drill into more specific dashboards that I find interesting:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-create-visualization.png" alt="elastic agent GKE autopilot create visualization" /></p>
<p>For example, the “[Metrics Kubernetes] Pods” gives me a high-level view of the pods deployed in the cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-pod.png" alt="elastic agent GKE autopilot pod" /></p>
<p>The “[Metrics Kubernetes] Volumes” gives me an in-depth view to how storage is allocated and used in the Autopilot cluster:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-filesystem-information.png" alt="elastic agent GKE autopilot filesystem information" /></p>
<h2>Creating an alert</h2>
<p>From here, I can easily discover patterns in my cluster’s behavior and even create Alerts. Here is an example of an alert to notify me if the main storage volume (called “volume”) exceeds 80% of its allocated space:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-create-rule-elasticsearch-query.png" alt="elastic agent GKE autopilot create rule" /></p>
<p>With a little work, I created this view from the standard dashboard:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-dashboard.png" alt="elastic agent GKE autopilot kubernetes dashboard" /></p>
<h2>Conclusion</h2>
<p>Today I have shown how easy it is to monitor, observe, and generate alerts on a GKE Autopilot cluster. To get more information on what is possible, see the official Elastic documentation for <a href="https://github.com/elastic/elastic-agent/blob/autopilotdocumentaton/docs/elastic-agent-gke-autopilot.md">Autopilot observability with Elastic Agent</a>.</p>
<h2>Next steps</h2>
<p>If you don’t have Elastic yet, you can get started for free with an <a href="https://www.elastic.co/cloud/elasticsearch-service/signup">Elastic Trial</a> today. Get more from Elastic and Google together with a <a href="https://console.cloud.google.com/marketplace/browse?q=Elastic&amp;utm_source=Elastic&amp;utm_medium=qwiklabs&amp;utm_campaign=Qwiklabs+to+Marketplace">Marketplace subscription</a>. Elastic does more than just integrate with GKE — check out the almost <a href="https://www.elastic.co/integrations">300 integrations</a> that Elastic provides.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/observe-gke-autopilot-clusters/blog-elastic-kubernetes-dashboard.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Unlock possibilities with native OpenTelemetry: prioritize reliability, not proprietary limitations]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-native-kubernetes-observability</link>
            <guid isPermaLink="false">elastic-opentelemetry-native-kubernetes-observability</guid>
            <pubDate>Tue, 12 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic now supports Elastic Distributions of OpenTelemetry (EDOT) deployment and management on Kubernetes, using OTel Operator. SREs can now access out-of the-box configurations and dashboards designed to streamline collector deployment, application auto-instrumentation and lifecycle management with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>OpenTelemetry (OTel) is emerging as the standard for data ingestion since it delivers a vendor-agnostic way to ingest data across all telemetry signals. Elastic Observability is leading the OTel evolution with the following announcements:</p>
<ul>
<li>
<p><strong>Native OTel Integrity:</strong> Elastic is now 100% OTel-native, retaining OTel data natively without requiring data translation. This eliminates the need for SREs to handle tedious schema conversions and develop customized views. All Elastic Observability capabilities—such as entity discovery, entity-centric insights, APM, infrastructure monitoring, and AI-driven issue analysis—now work seamlessly with native OTel data.</p>
</li>
<li>
<p><strong>Powerful end-to-end OTel-based Kubernetes observability with</strong> <a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry"><strong>Elastic Distributions of OpenTelemetry (EDOT)</strong></a><strong>:</strong> Elastic now supports EDOT deployment and management on Kubernetes via the OTel Operator, enabling streamlined EDOT collector deployment, application auto-instrumentation, and lifecycle management. With out-of-the-box OTel-based Kubernetes integration and dashboards, SREs gain instant, real-time visibility into cluster and application metrics, logs, and traces—with no manual configuration needed.</p>
</li>
</ul>
<p>For organizations, it signals our commitment to open standards, streamlined data collection, and delivering insights from native OpenTelemetry data. Bring the power of Elastic Observability to your Kubernetes and OpenTelemetry deployments for maximum visibility and performance. </p>
<h1>Fully native OTel architecture with in-depth data analysis</h1>
<p>Elastic’s OpenTelemetry-first architecture is 100% OTel-native, fully retaining the OTel data model, including OTel Semantic Conventions and Resource attributes, so your observability data remains in OpenTelemetry standards. OTel data in Elastic is also backward compatible with the Elastic Common Schema (ECS).</p>
<p>SREs now gain a holistic view of resources, as Elastic accurately identifies entities through OTel resource attributes. For example, in a Kubernetes environment, Elastic identifies containers, hosts, and services and connects these entities to logs, metrics, and traces.</p>
<p>Once OTel data is in Elastic’s scalable vector datastore, Elastic’s capabilities such as the AI Assistant, zero-config machine learning-based anomaly detection, pattern analysis, and latency correlation empower SREs to quickly analyze and pinpoint potential issues in production environments.</p>
<h1>Kubernetes insights with Elastic Distributions of OpenTelemetry (EDOT)</h1>
<p>EDOT reduces manual effort through automated onboarding and pre-configured dashboards. With EDOT and OpenTelemetry, Elastic makes Kubernetes monitoring straightforward and accessible for organizations of any size.</p>
<p>EDOT, paired with Elasticsearch, enables storage for all signal types—logs, metrics, traces, and soon profiling—while maintaining essential resource attributes and semantic conventions.</p>
<p>Elastic’s OpenTelemetry-native solution enables customers to quickly extract insights from their data rather than manage complex infrastructure to ingest data. Elastic automates the deployment and configuration of observability components to deliver a user experience focused on ease and scalability, making it well-suited for large-scale environments and diverse industry needs.</p>
<p>Let’s take a look at how Elastic’s EDOT enables visibility into Kubernetes environments.</p>
<h2>1. Simple 3-step OTel ingest with lifecycle management and auto-instrumentation </h2>
<p>Elastic leverages the upstream OpenTelemetry Operator to automate its EDOT lifecycle management—including deployment, scaling, and updates—allowing customers to focus on visibility into their Kubernetes infrastructure and applications instead of their observability infrastructure for data collection.</p>
<p>The Operator integrates with the EDOT Collector and language SDKs to provide a consistent, vendor-agnostic experience. For instance, when customers deploy a new application, they don’t need to manually configure instrumentation for various languages; the OpenTelemetry Operator manages this through auto-instrumentation, as supported by the upstream OpenTelemetry project.</p>
<p>This integration simplifies observability by ensuring consistent application instrumentation across the Kubernetes environment. Elastic’s collaboration with the upstream OpenTelemetry project strengthens this automation, enabling users to benefit from the latest updates and improvements in the OpenTelemetry ecosystem. By relying on open source tools like the OpenTelemetry Operator, Elastic ensures that its solutions stay aligned with the latest advancements in the OpenTelemetry project, reinforcing its commitment to open standards and community-driven development.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/unified-otel-based-k8s-experience.png" alt="Unified OTel-based Kubernetes Experience" /></p>
<p>The diagram above shows how the operator can deploy multiple OTel collectors, helping SREs deploy individual EDOT Collectors for specific applications and infrastructure. This configuration improves availability for OTel ingest and the telemetry is sent directly to Elasticsearch servers via OTLP.</p>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">Check out our recent blog on how to set this up</a>.</p>
<h2>2. Out-of-the-box OTel-based Kubernetes integration with dashboards</h2>
<p>Elastic delivers an OTel-based Kubernetes configuration for the OTel collector by packaging all necessary receivers, processors, and configurations for Kubernetes observability. This enables users to automatically collect, process, and analyze Kubernetes metrics, logs, and traces without the need to configure each component individually.</p>
<p>The OpenTelemetry Kubernetes Collector components provide essential building blocks, including receivers like the Kubernetes Cluster Receiver for cluster metrics and the Kubeletstats Receiver for detailed node and container metrics, along with processors for data transformation and enrichment. By packaging these components, Elastic offers a turnkey solution that simplifies Kubernetes observability and eliminates the need for users to set up and configure individual collectors or processors.</p>
<p>This pre-packaged approach, which includes <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes_otel">OTel-native Kibana assets</a> such as dashboards, allows users to focus on analyzing their observability data rather than managing configuration details. Elastic’s Unified OpenTelemetry Experience ensures that users can harness OpenTelemetry’s full potential without needing deep expertise. Whether you’re monitoring resource usage, container health, or API server metrics, users gain comprehensive observability through EDOT.</p>
<p>For more details on OpenTelemetry Kubernetes Collector components, visit <a href="https://opentelemetry.io/docs/kubernetes/collector/components/">OpenTelemetry Collector Components</a>.</p>
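<p>To make the packaging concrete, here is a minimal, illustrative collector configuration wiring those receivers and processors together. This is a sketch only, not the EDOT-shipped configuration: the file name is arbitrary, and the endpoint and credential variables are placeholders.</p>
<pre><code class="language-bash"># Sketch only: a minimal collector config combining the Kubernetes
# receivers described above. Endpoints and credentials are placeholders.
cat &lt;&lt;'EOF' &gt; otel-k8s-sketch.yaml
receivers:
  k8s_cluster:              # cluster-level metrics (pods, deployments, node conditions)
    collection_interval: 30s
  kubeletstats:             # per-node, pod, and container resource metrics
    auth_type: serviceAccount
    endpoint: https://${env:K8S_NODE_NAME}:10250
processors:
  k8sattributes: {}         # enrich telemetry with pod and namespace metadata
  batch: {}
exporters:
  elasticsearch:
    endpoint: ${env:ELASTIC_ENDPOINT}
    api_key: ${env:ELASTIC_API_KEY}
service:
  pipelines:
    metrics:
      receivers: [k8s_cluster, kubeletstats]
      processors: [k8sattributes, batch]
      exporters: [elasticsearch]
EOF
</code></pre>
<p>The actual configuration EDOT uses is packaged in Elastic's Helm values, so you normally never write this file by hand.</p>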
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/otel-based-k8s-dashboard.png" alt="OTel-based Kubernetes Dashboard" /></p>
<h2>3. Streamlined ingest architecture with OTel data and Elasticsearch</h2>
<p>Elastic’s ingest architecture minimizes infrastructure overhead by enabling users to forward trace data directly into Elasticsearch with the EDOT Collector, removing the need for the Elastic APM server. This approach:</p>
<ul>
<li>
<p>Reduces the costs and complexity associated with maintaining additional infrastructure, allowing users to deploy, scale, and manage their observability solutions with fewer resources.</p>
</li>
<li>
<p>Allows all OTel data (metrics, logs, and traces) to be ingested and stored in Elastic’s single vector database store, enabling further analysis with Elastic’s AI-driven capabilities.</p>
</li>
</ul>
<p>SREs can now reduce operational burdens while gaining the high-performance analytics and observability insights provided by Elastic.</p>
<h1>Elastic’s ongoing commitment to open source and OpenTelemetry</h1>
<p>With <a href="https://www.elastic.co/blog/elasticsearch-is-open-source-again">Elasticsearch fully open source once again</a> under the AGPL license, Elastic reinforces its deep commitment to open standards and the open source community. This aligns with Elastic’s OpenTelemetry-first approach to observability, where Elastic Distributions of OpenTelemetry (EDOT) streamline OTel ingestion and schema auto-detection, providing real-time insights for Kubernetes and application telemetry.</p>
<p>As users increasingly adopt OTel as their schema and data collection architecture for observability, Elastic’s Distribution of OpenTelemetry (EDOT), currently in tech preview, enhances standard OpenTelemetry capabilities and improves troubleshooting while also serving as a commercially supported OTel distribution. EDOT, together with Elastic’s recent contributions of the Elastic Profiling Agent and Elastic Common Schema (ECS) to OpenTelemetry, reinforces Elastic’s commitment to establishing OpenTelemetry as the industry standard.</p>
<p>Customers can now embrace open standards and enjoy the advantages of an open, extensible platform that integrates seamlessly with their environment. The end result? Reduced costs, greater visibility, and vendor independence.</p>
<h1>Getting hands-on with Elastic Observability and EDOT</h1>
<p>Ready to try out the OTel Operator with the EDOT Collector and SDKs to see how Elastic utilizes ingested OTel data in APM, Discover, Analysis, and out-of-the-box dashboards?</p>
<ul>
<li>
<p><a href="https://cloud.elastic.co/">Get an account on Elastic Cloud</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Learn about Elastic Distributions of OpenTelemetry Overview</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry">Utilize the OpenTelemetry Demo with EDOT</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">Understand how you can monitor Kubernetes with EDOT</a></p>
</li>
<li>
<p><a href="https://github.com/elastic/opentelemetry">Utilize the EDOT Operator</a> and the <a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-collector">EDOT OTel collector</a></p>
</li>
</ul>
<p>If you have your own application and want to configure it with EDOT auto-instrumentation, read the following blogs on Go, Java, PHP, and Python:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">Auto-Instrumenting Go Applications with OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution OpenTelemetry Java Agent</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-php">Elastic OpenTelemetry Distribution for PHP</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-python">Elastic OpenTelemetry Distribution for Python</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-native-kubernetes-observability/Kubecon-main-blog.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Native OTel-based K8s & App Observability in 3 Steps with Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator</link>
            <guid isPermaLink="false">elastic-opentelemetry-otel-operator</guid>
            <pubDate>Wed, 13 Nov 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic's Distributions of OpenTelemetry are now supported with the OTel Operator, providing auto instrumentation of applications with EDOT SDKs, and deployment and lifecycle management of the EDOT OTel Collector for Kubernetes Observability. Learn how to configure this in 3 easy steps]]></description>
            <content:encoded><![CDATA[<p>Elastic recently released its Elastic Distributions of OpenTelemetry (EDOT), developed to enhance the capabilities of standard OpenTelemetry distributions and improve Elastic’s existing OpenTelemetry support. EDOT helps Elastic deliver its new Unified OpenTelemetry Experience. SREs are no longer burdened with a set of tedious steps for instrumenting applications and ingesting OTel data into Observability; they get a simple, frictionless way to deploy the OTel Collector, instrument applications, and ingest all the OTel data into Elastic. The components of this experience (detailed in the overview blog) include:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions for OpenTelemetry (EDOT)</a></p>
</li>
<li>
<p>Elastic’s configuration for the OpenTelemetry Operator providing:</p>
<ul>
<li>
<p>OTel Lifecycle management for the OTel collector and SDKs</p>
</li>
<li>
<p>Auto-instrumentation of apps that developers have not manually instrumented</p>
</li>
</ul>
</li>
<li>
<p>Pre-packaged receivers, processors, exporters, and configuration for the OTel Kubernetes Collector</p>
</li>
<li>
<p>Out-of-the-box OTel-based K8S dashboards for metrics and logs</p>
</li>
<li>
<p>Discovered inventory views for services, hosts, and containers</p>
</li>
<li>
<p>Direct OTel ingest into Elasticsearch for EDOT (bypassing ingest into APM server) - all your data (logs, metrics, and traces) is now stored in Elastic’s Search AI Lake</p>
</li>
<li>
<p>All ingested OTel data is used and displayed natively in Discover, APM, Inventory, etc.</p>
</li>
</ul>
<p>In this blog we will cover how to ingest OTel for K8S and your application in 3 easy steps:</p>
<ol>
<li>
<p>Copy the install commands from the UI</p>
</li>
<li>
<p>Add the OpenTelemetry Helm charts, install the OpenTelemetry Operator with Elastic’s Helm configuration, and set your Elastic endpoint and authentication</p>
</li>
<li>
<p>Annotate the app services you want to be auto-instrumented </p>
</li>
</ol>
<p>Then you can easily see Kubernetes metrics and logs, as well as application logs, metrics, and traces, in Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/unified-otel-based-k8s-experience.png" alt="OpenTelemetry Unified Observability Experience" /></p>
<p>To follow this blog you will need to have:</p>
<ol>
<li>
<p>An account on cloud.elastic.co, with access to get the Elasticsearch endpoint and authentication (api key)</p>
</li>
<li>
<p>A non-instrumented application with services written in Go, .NET, Python, or Java, to be auto-instrumented through the OTel Operator. In this example, we will be using the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">Elastiflix</a> application.</p>
</li>
<li>
<p>A Kubernetes cluster, we used EKS in our setup</p>
</li>
<li>
<p>Helm and Kubectl loaded</p>
</li>
</ol>
<p>You can find the authentication details (API key) in the Integrations section of Elastic. More information is also available in the <a href="https://www.elastic.co/guide/en/kibana/current/api-keys.html">documentation</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-api-keys.png" alt="OpenTelemetry API Keys" /></p>
<h2>K8S and Application Observability in Elastic:</h2>
<p>Before we walk you through the steps, let's show you what is visible in Elastic.</p>
<p>Once the Operator starts the OTel Collector, you can see the following in Elastic:</p>
<h3>Kubernetes metrics:</h3>
<p>Using an out-of-the-box dashboard, you can see node metrics, overall cluster metrics, and status across pods, deployments, etc.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-dashboard.png" alt="OTel-based Kubernetes dashboard" /></p>
<h3>Discovered Inventory for Hosts, services, and containers:</h3>
<p>This can be found at Observability-&gt;Inventory on the UI</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-inventory.png" alt="OTel-based Kubernetes inventory" /></p>
<h3>Detailed metrics, logs, and processor info on hosts:</h3>
<p>This can be found at Observability-&gt;Infrastructure-&gt;Hosts</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-hosts.png" alt="OTel-based Kubernetes host metrics" /></p>
<h3>K8S and application logs in Elastic’s New Discover (called Explorer)</h3>
<p>This can be found on Observability-&gt;Discover</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-ingest-logs.png" alt="OTel-based Kubernetes logs" /></p>
<h3>Application Service views (logs, metrics, and traces):</h3>
<p>This can be found on Observability-&gt;Application</p>
<p>Then select the service and drill down into different aspects.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-traces.png" alt="OTel-based Application Java traces" /></p>
<p>Above, you can see how traces are displayed using native OTel data.</p>
<h2>Steps to install</h2>
<h3>Step 0. Follow the commands listed in the UI</h3>
<p>Under Add data-&gt;Kubernetes-&gt;Kubernetes Monitoring with EDOT</p>
<p>You will find the following instructions, which we will follow here.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-edot-operator-install.png" alt="EDOT Operator Install" /></p>
<h3>Step 1. Install the EDOT config for the OpenTelemetry Operator</h3>
<p>Run the following commands. Make sure you are already authenticated against your Kubernetes cluster, as that is where you will run the helm commands provided below.</p>
<pre><code class="language-bash"># Install helm repo needed
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts --force-update
# Install needed secrets. Provide the Elasticsearch Endpoint URL and API key you have noted in previous steps
kubectl create ns opentelemetry-operator-system
kubectl create -n opentelemetry-operator-system secret generic elastic-secret-otel \
    --from-literal=elastic_endpoint='YOUR_ELASTICSEARCH_ENDPOINT' \
    --from-literal=elastic_api_key='YOUR_ELASTICSEARCH_API_KEY'
# Install the EDOT Operator
helm install opentelemetry-kube-stack open-telemetry/opentelemetry-kube-stack --namespace opentelemetry-operator-system --create-namespace --values https://raw.githubusercontent.com/elastic/opentelemetry/refs/heads/main/resources/kubernetes/operator/helm/values.yaml --version 0.3.0
</code></pre>
<p>The values.yaml file configuration can be found <a href="https://github.com/elastic/opentelemetry/blob/main/resources/kubernetes/operator/helm/values.yaml">here</a>.</p>
<h3>Step 1b: Ensure OTel data is arriving in Elastic</h3>
<p>The simplest way to check is to go to Menu &gt; Dashboards &gt; <strong>[OTEL][Metrics Kubernetes] Cluster Overview</strong>, and ensure you see the following dashboard being populated.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-k8s-dashboard.png" alt="OTel-based Kubernetes dashboard" /></p>
<h3>Step 2: Annotate the application with auto-instrumentation</h3>
<p>For this example, we’re only going to annotate one service: the favorite-java service in the <a href="https://github.com/elastic/observability-examples/tree/main/Elastiflix">Elastiflix</a> application.</p>
<p>Use the following commands to initiate auto-instrumentation:</p>
<pre><code class="language-bash">#Annotate Java namespace
kubectl annotate namespace java instrumentation.opentelemetry.io/inject-java=&quot;opentelemetry-operator-system/elastic-instrumentation&quot;
#Restart the java-app to get the new annotation
kubectl rollout restart deployment java-app -n java
</code></pre>
<p>You can also add the annotation directly to your pod’s YAML:</p>
<pre><code class="language-yaml">metadata:
 name: my-app
 annotations:
   instrumentation.opentelemetry.io/inject-python: &quot;true&quot;
</code></pre>
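<p>One detail worth noting: for a Deployment, the annotation needs to sit on the pod template metadata (not the Deployment’s top-level metadata), because the operator’s webhook mutates pods at admission time. A hypothetical manifest, with placeholder names and image, would look like this:</p>
<pre><code class="language-bash"># Sketch: a Deployment whose pod template carries the inject annotation.
# All names and the image are hypothetical placeholders.
cat &lt;&lt;'EOF' &gt; my-app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # the operator webhook injects the EDOT Python SDK into matching pods
        instrumentation.opentelemetry.io/inject-python: &quot;true&quot;
    spec:
      containers:
        - name: my-app
          image: example.registry/my-app:latest
EOF
</code></pre>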
<p>These instructions are provided in the UI:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-edot-sdk-annotate.png" alt="Annotate Application with EDOT SDK" /></p>
<h2>Check out the service data in Elastic APM</h2>
<p>Once the OTel data is in Elastic, you can see:</p>
<ul>
<li>
<p>Out-of-the-box dashboards for OTel-based Kubernetes metrics</p>
</li>
<li>
<p>Discovered resources such as services, hosts, and containers that are part of the Kubernetes clusters</p>
</li>
<li>
<p>Kubernetes metrics, host metrics, logs, processor info, anomaly detection, and universal profiling.</p>
</li>
<li>
<p>Log analytics in Elastic Discover</p>
</li>
<li>
<p>APM features that show app overview, transactions, dependencies, errors, and more:</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-service.png" alt="Java service in Elastic APM" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/otel-java-traces.png" alt="OTel-based Application Java traces" /></p>
<h2>Try it out</h2>
<p>Elastic’s Distribution of OpenTelemetry (EDOT) transforms the observability experience by streamlining Kubernetes and application instrumentation. With EDOT, SREs and developers can bypass complex setups, instantly gain deep visibility into Kubernetes clusters, and capture critical metrics, logs, and traces—all within Elastic Observability. By following just a few simple steps, you’re empowered with a unified, efficient monitoring solution that brings your OpenTelemetry data directly into Elastic. With robust, out-of-the-box dashboards, automatic application instrumentation, and seamless integration, EDOT not only saves time but also enhances the accuracy and accessibility of observability across your infrastructure. Start leveraging EDOT today to unlock a frictionless observability experience and keep your systems running smoothly and insightfully.</p>
<p>Additional resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distributions-opentelemetry">Elastic Distributions of OpenTelemetry Overview</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/opentelemetry-demo-with-the-elastic-distributions-of-opentelemetry">OpenTelemetry Demo with Elastic Distributions</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/infrastructure-monitoring-with-opentelemetry-in-elastic-observability">Infrastructure Monitoring with OpenTelemetry in Elastic Observability</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">Auto-Instrumenting Go Applications with OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-distribution-opentelemetry-java-agent">Elastic Distribution OpenTelemetry Java Agent</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-distribution-php">Elastic OpenTelemetry Distribution for PHP</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/elastic-opentelemetry-otel-operator/OTel-operator.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[How to enable Kubernetes alerting with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/enable-kubernetes-alerting-observability</link>
            <guid isPermaLink="false">enable-kubernetes-alerting-observability</guid>
            <pubDate>Tue, 30 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In the Kubernetes world, different personas demand different kinds of insights. In this post, we’ll focus on alerting and provide an overview of how alerts in Elastic Observability can help users quickly identify Kubernetes problems.]]></description>
            <content:encoded><![CDATA[<p>In the Kubernetes world, different personas demand different kinds of insights. Developers are interested in granular metrics and debugging information. <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">SREs</a> are interested in seeing everything at once to quickly get notified when a problem occurs and spot where the root cause is. In this post, we’ll focus on alerting and provide an overview of how alerts in Elastic Observability can help users quickly identify Kubernetes problems.</p>
<h2>Why do we need alerts?</h2>
<p>Logs, metrics, and traces are just the base to build a complete <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">monitoring solution for Kubernetes clusters</a>. Their main goal is to provide debugging information and historical evidence for the infrastructure.</p>
<p>While out-of-the-box dashboards, infrastructure topology, and logs exploration through Kibana are already quite handy to perform ad-hoc analyses, adding notifications and active monitoring of infrastructure allows users to deal with problems detected as early as possible and even proactively take actions to prevent their Kubernetes environments from facing even more serious issues.</p>
<h3>How can this be achieved?</h3>
<p>By building alerts on top of their infrastructure, users can leverage the data and effectively correlate it to a specific notification, creating a wide range of possibilities to dynamically monitor and observe their Kubernetes cluster.</p>
<p>In this blog post, we will explore how users can leverage Elasticsearch’s search powers to define alerting rules in order to be notified when a specific condition occurs.</p>
<h2>SLIs, alerts, and SLOs: Why are they important for SREs?</h2>
<p>For site reliability engineers (SREs), the <a href="https://www.elastic.co/blog/elastic-observability-sre-incident-response">incident response time</a> is tightly coupled with the success of everyday work. Monitoring, alerting, and actions will help to discover, resolve, or prevent issues in their systems.</p>
<blockquote>
<ul>
<li><em>An SLA (Service Level Agreement) is an agreement you create with your users to specify the level of service they can expect.</em></li>
<li><em>An SLO (Service Level Objective) is an agreement within an SLA about a specific metric like uptime or response time.</em></li>
<li><em>An SLI (Service Level Indicator) measures compliance with an SLO.</em></li>
</ul>
</blockquote>
<p>SREs’ day-to-day tasks and projects are driven by SLOs. By ensuring that SLOs are defended in the short term and that they can be maintained in the medium to long term, we lay the basis of a stable working infrastructure.</p>
<p>Having said this, identifying the high-level categories of SLOs is crucial in order to organize the work of an SRE. Then in each category of SLOs, SREs will need the corresponding SLIs that can cover the most important cases of their system under observation. Therefore, the decision of which SLIs we will need demands additional knowledge of the underlying system infrastructure.</p>
<p>One widely used approach to categorize SLIs and SLOs is the <a href="https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals">Four Golden Signals</a> method. The categories defined are Latency, Traffic, Errors, and Saturation.</p>
<p>A more specific approach is <a href="https://thenewstack.io/monitoring-microservices-red-method/">the RED method</a> developed by Tom Wilkie, who was an SRE at Google and used the Four Golden Signals. The RED method drops the saturation category because this one is mainly used for more advanced cases — and people remember better things that come in threes.</p>
<p>Focusing on Kubernetes infrastructure operators, we will consider the following groups of infrastructure SLIs/SLOs:</p>
<ul>
<li>Group 1: Latency of the control plane (apiserver)</li>
<li>Group 2: Resource utilization of the nodes/pods (how much cpu, memory, etc. is consumed)</li>
<li>Group 3: Errors (errors on logs or events or error count from components, network, etc.)</li>
</ul>
<h2>Creating alerts for a Kubernetes cluster</h2>
<p>Now that we have a complete outline of our goal to define alerts based on SLIs/SLOs, we will dive into defining the proper alerting. Alerts can be built using <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">Kibana</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/blog-elastic-create-rule.png" alt="kubernetes create rule" /></p>
<p>See Elastic <a href="https://www.elastic.co/guide/en/kibana/current/alerting-getting-started.html">documentation</a>.</p>
<p>In this blog, we will define more complex alerts based on complex Elasticsearch queries provided by <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/watcher-getting-started.html">Watcher</a>’s functionality. <a href="https://www.elastic.co/guide/en/kibana/8.8/watcher-ui.html">Read more about Watcher</a> and how to properly use it in addition to the examples in this blog.</p>
<h3>Latency alerts</h3>
<p>For this kind of alert, we want to define the basic SLOs for a Kubernetes control plane, which will ensure that the basic control plane components can service the end users without an issue. For instance, facing high latencies in queries against the Kubernetes API Server is enough of a signal that action needs to be taken.</p>
<h3>Resource saturation</h3>
<p>The next group of alerts covers resource utilization. A node’s CPU utilization or a change in a node’s condition is critical for a cluster to ensure the smooth servicing of the workloads provisioned to run the applications that end users interact with.</p>
<h3>Error detection</h3>
<p>Last but not least, we will define alerts based on specific errors, like the network error rate, or Pod failures, like the OOMKilled situation. These are very useful indicators for SRE teams to either detect issues at the infrastructure level or notify developer teams about problematic workloads. One example that we will examine later is an application running as a Pod that constantly gets restarted because it hits its memory limit. In that case, the owners of this application will need to be notified so they can act properly.</p>
<h2>From Kubernetes data to Elasticsearch queries</h2>
<p>Having a solid plan about the alerts that we want to implement, it's time to explore the data we have collected from the Kubernetes cluster and stored in Elasticsearch. For this we will consult the list of the available data fields that are ingested using the Elastic Agent Kubernetes <a href="https://docs.elastic.co/en/integrations/kubernetes">integration</a> (the full list of fields can be found <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/exported-fields-kubernetes.html">here</a>). Using these fields we can create various alerts like:</p>
<ul>
<li>Node CPU utilization</li>
<li>Node Memory utilization</li>
<li>BW utilization</li>
<li>Pod restarts</li>
<li>Pod CPU/memory utilization</li>
</ul>
<h3>CPU utilization alert</h3>
<p>Our first example will use the CPU utilization fields to calculate the Node’s CPU utilization and create an alert. For this alert, we leverage the metrics:</p>
<pre><code class="language-yaml">kubernetes.node.cpu.usage.nanocores
kubernetes.node.cpu.capacity.cores
</code></pre>
<p>The following calculation, <code>(nodeUsage / 1000000000) / nodeCap</code>, grouped by node name, will give us the CPU utilization of each of our cluster’s nodes as a fraction between 0 and 1.</p>
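<p>As a quick sanity check of the arithmetic, here is the same formula applied to made-up numbers: a node consuming 2,500,000,000 nanocores (2.5 cores) out of a 4-core capacity.</p>
<pre><code class="language-bash"># Worked example of the bucket_script arithmetic with hypothetical values:
# usage of 2,500,000,000 nanocores on a node with 4 cores of capacity.
nodeUsage=2500000000
nodeCap=4
awk -v u=&quot;$nodeUsage&quot; -v c=&quot;$nodeCap&quot; 'BEGIN { printf &quot;%.3f\n&quot;, (u / 1000000000) / c }'
# prints 0.625, i.e. 62.5% node CPU utilization
</code></pre>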
<p>The Watcher definition that implements this query can be created with the following API call to Elasticsearch:</p>
<pre><code class="language-bash">curl -X PUT &quot;https://elastic:changeme@localhost:9200/_watcher/watch/Node-CPU-Usage?pretty&quot; -k -H 'Content-Type: application/json' -d'
{
  &quot;trigger&quot;: {
    &quot;schedule&quot;: {
      &quot;interval&quot;: &quot;10m&quot;
    }
  },
  &quot;input&quot;: {
    &quot;search&quot;: {
      &quot;request&quot;: {
        &quot;body&quot;: {
          &quot;size&quot;: 0,
          &quot;query&quot;: {
            &quot;bool&quot;: {
              &quot;must&quot;: [
                {
                  &quot;range&quot;: {
                    &quot;@timestamp&quot;: {
                      &quot;gte&quot;: &quot;now-10m&quot;,
                      &quot;lte&quot;: &quot;now&quot;,
                      &quot;format&quot;: &quot;strict_date_optional_time&quot;
                    }
                  }
                },
                {
                  &quot;bool&quot;: {
                    &quot;must&quot;: [
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;data_stream.dataset: kubernetes.node OR data_stream.dataset: kubernetes.state_node&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      }
                    ],
                    &quot;filter&quot;: [],
                    &quot;should&quot;: [],
                    &quot;must_not&quot;: []
                  }
                }
              ],
              &quot;filter&quot;: [],
              &quot;should&quot;: [],
              &quot;must_not&quot;: []
            }
          },
          &quot;aggs&quot;: {
            &quot;nodes&quot;: {
              &quot;terms&quot;: {
                &quot;field&quot;: &quot;kubernetes.node.name&quot;,
                &quot;size&quot;: &quot;10000&quot;,
                &quot;order&quot;: {
                  &quot;_key&quot;: &quot;asc&quot;
                }
              },
              &quot;aggs&quot;: {
                &quot;nodeUsage&quot;: {
                  &quot;max&quot;: {
                    &quot;field&quot;: &quot;kubernetes.node.cpu.usage.nanocores&quot;
                  }
                },
                &quot;nodeCap&quot;: {
                  &quot;max&quot;: {
                    &quot;field&quot;: &quot;kubernetes.node.cpu.capacity.cores&quot;
                  }
                },
                &quot;nodeCPUUsagePCT&quot;: {
                  &quot;bucket_script&quot;: {
                    &quot;buckets_path&quot;: {
                      &quot;nodeUsage&quot;: &quot;nodeUsage&quot;,
                      &quot;nodeCap&quot;: &quot;nodeCap&quot;
                    },
                    &quot;script&quot;: {
                      &quot;source&quot;: &quot;( params.nodeUsage / 1000000000 ) / params.nodeCap&quot;,
                      &quot;lang&quot;: &quot;painless&quot;,
                      &quot;params&quot;: {
                        &quot;_interval&quot;: 10000
                      }
                    },
                    &quot;gap_policy&quot;: &quot;skip&quot;
                  }
                }
              }
            }
          }
        },
        &quot;indices&quot;: [
          &quot;metrics-kubernetes*&quot;
        ]
      }
    }
  },
  &quot;condition&quot;: {
    &quot;array_compare&quot;: {
      &quot;ctx.payload.aggregations.nodes.buckets&quot;: {
        &quot;path&quot;: &quot;nodeCPUUsagePCT.value&quot;,
        &quot;gte&quot;: {
          &quot;value&quot;: 0.8
        }
      }
    }
  },
  &quot;actions&quot;: {
    &quot;log_hits&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.nodes.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;logging&quot;: {
        &quot;text&quot;: &quot;Kubernetes node found with high CPU usage: {{ctx.payload.key}} -&gt; {{ctx.payload.nodeCPUUsagePCT.value}}&quot;
      }
    }
  },
  &quot;metadata&quot;: {
    &quot;xpack&quot;: {
      &quot;type&quot;: &quot;json&quot;
    },
    &quot;name&quot;: &quot;Node CPU Usage&quot;
  }
}
</code></pre>
<h3>OOMKilled Pods detection and alerting</h3>
<p>Another Watcher that we will explore detects Pods that have been restarted due to an OOMKilled error. This error is quite common in Kubernetes workloads, and detecting it early is useful for informing the team that owns the workload, so they can either investigate issues that could cause memory leaks or consider increasing the resources requested for the workload itself.</p>
<p>This information can be retrieved from a query like the following:</p>
<pre><code class="language-yaml">kubernetes.container.status.last_terminated_reason: OOMKilled
</code></pre>
<p>Here is how we can create the respective Watcher with an API call:</p>
<pre><code class="language-bash">curl -X PUT &quot;https://elastic:changeme@localhost:9200/_watcher/watch/Pod-Terminated-OOMKilled?pretty&quot; -k -H 'Content-Type: application/json' -d'
{
  &quot;trigger&quot;: {
    &quot;schedule&quot;: {
      &quot;interval&quot;: &quot;1m&quot;
    }
  },
  &quot;input&quot;: {
    &quot;search&quot;: {
      &quot;request&quot;: {
        &quot;search_type&quot;: &quot;query_then_fetch&quot;,
        &quot;indices&quot;: [
          &quot;*&quot;
        ],
        &quot;rest_total_hits_as_int&quot;: true,
        &quot;body&quot;: {
          &quot;size&quot;: 0,
          &quot;query&quot;: {
            &quot;bool&quot;: {
              &quot;must&quot;: [
                {
                  &quot;range&quot;: {
                    &quot;@timestamp&quot;: {
                      &quot;gte&quot;: &quot;now-1m&quot;,
                      &quot;lte&quot;: &quot;now&quot;,
                      &quot;format&quot;: &quot;strict_date_optional_time&quot;
                    }
                  }
                },
                {
                  &quot;bool&quot;: {
                    &quot;must&quot;: [
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;data_stream.dataset: kubernetes.state_container&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      },
                      {
                        &quot;exists&quot;: {
                          &quot;field&quot;: &quot;kubernetes.container.status.last_terminated_reason&quot;
                        }
                      },
                      {
                        &quot;query_string&quot;: {
                          &quot;query&quot;: &quot;kubernetes.container.status.last_terminated_reason: OOMKilled&quot;,
                          &quot;analyze_wildcard&quot;: true
                        }
                      }
                    ],
                    &quot;filter&quot;: [],
                    &quot;should&quot;: [],
                    &quot;must_not&quot;: []
                  }
                }
              ],
              &quot;filter&quot;: [],
              &quot;should&quot;: [],
              &quot;must_not&quot;: []
            }
          },
          &quot;aggs&quot;: {
            &quot;pods&quot;: {
              &quot;terms&quot;: {
                &quot;field&quot;: &quot;kubernetes.pod.name&quot;,
                &quot;order&quot;: {
                  &quot;_key&quot;: &quot;asc&quot;
                }
              }
            }
          }
        }
      }
    }
  },
  &quot;condition&quot;: {
    &quot;array_compare&quot;: {
      &quot;ctx.payload.aggregations.pods.buckets&quot;: {
        &quot;path&quot;: &quot;doc_count&quot;,
        &quot;gte&quot;: {
          &quot;value&quot;: 1,
          &quot;quantifier&quot;: &quot;some&quot;
        }
      }
    }
  },
  &quot;actions&quot;: {
    &quot;ping_slack&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.pods.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;webhook&quot;: {
        &quot;method&quot;: &quot;POST&quot;,
        &quot;url&quot;: &quot;https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX&quot;,
        &quot;body&quot;: &quot;{\&quot;channel\&quot;: \&quot;#k8s-alerts\&quot;, \&quot;username\&quot;: \&quot;k8s-cluster-alerting\&quot;, \&quot;text\&quot;: \&quot;Pod {{ctx.payload.key}} was terminated with status OOMKilled.\&quot;}&quot;
      }
    }
  },
  &quot;metadata&quot;: {
    &quot;xpack&quot;: {
      &quot;type&quot;: &quot;json&quot;
    },
    &quot;name&quot;: &quot;Pod Terminated OOMKilled&quot;
  }
}
</code></pre>
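<p>The array_compare condition above fires when at least one term bucket satisfies the comparison (the “some” quantifier). Here is a minimal Java sketch of that semantics, with hypothetical per-pod doc_count values:</p>

```java
public class ArrayCompareSome {

    // Mirrors Watcher's array_compare with "quantifier": "some":
    // true if at least one value meets or exceeds the threshold.
    static boolean someGte(long[] values, long threshold) {
        for (long v : values) {
            if (v >= threshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] docCounts = { 0, 2, 0 }; // hypothetical per-pod bucket counts
        System.out.println(someGte(docCounts, 1)); // one pod was OOMKilled
    }
}
```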
<h3>From Kubernetes data to alerts summary</h3>
<p>So far, we have seen how to start from plain Kubernetes fields, use them in Elasticsearch queries, and build Watchers and alerts on top of them.</p>
<p>You can explore further data combinations and build queries and alerts following the examples provided here. A <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs">full list of alerts</a> is available, as well as a <a href="https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting">basic scripted way of installing them</a>.</p>
<p>These examples define simple actions that only log messages into the Elasticsearch logs. However, you can use more advanced and useful outputs, such as Slack webhooks:</p>
<pre><code class="language-json">&quot;actions&quot;: {
    &quot;ping_slack&quot;: {
      &quot;foreach&quot;: &quot;ctx.payload.aggregations.pods.buckets&quot;,
      &quot;max_iterations&quot;: 500,
      &quot;webhook&quot;: {
        &quot;method&quot;: &quot;POST&quot;,
        &quot;url&quot;: &quot;https://hooks.slack.com/services/T04SW3JHXasdfasdfasdfasdfasdf&quot;,
        &quot;body&quot;: &quot;{\&quot;channel\&quot;: \&quot;#k8s-alerts\&quot;, \&quot;username\&quot;: \&quot;k8s-cluster-alerting\&quot;, \&quot;text\&quot;: \&quot;Pod {{ctx.payload.key}} was terminated with status OOMKilled.\&quot;}&quot;
      }
    }
  }
</code></pre>
<p>The result would be a Slack message like the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/blog-elastic-k8s-cluster-alerting.png" alt="" /></p>
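<p>Watcher supports other action types besides logging and webhooks. As one more illustration, here is a sketch of an equivalent email action (the action name and recipient address are made up for this example, and it assumes an email account has been configured under xpack.notification.email in Elasticsearch):</p>

```json
"actions": {
  "email_ops": {
    "foreach": "ctx.payload.aggregations.pods.buckets",
    "max_iterations": 500,
    "email": {
      "to": "ops-team@example.com",
      "subject": "Kubernetes alert: Pod {{ctx.payload.key}} OOMKilled",
      "body": "Pod {{ctx.payload.key}} was terminated with status OOMKilled."
    }
  }
}
```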
<h2>Next steps</h2>
<p>In our next steps, we would like to make these alerts part of our Kubernetes integration, which would mean that the predefined alerts are installed when users install or enable the Kubernetes integration. At the same time, we plan to implement some of these as Kibana’s native SLIs, giving our users the option to quickly define SLOs on top of the SLIs through a friendly user interface. If you’re interested in learning more, follow the public GitHub issues and feel free to provide your feedback:</p>
<ul>
<li><a href="https://github.com/elastic/package-spec/issues/484">https://github.com/elastic/package-spec/issues/484</a></li>
<li><a href="https://github.com/elastic/kibana/issues/150050">https://github.com/elastic/kibana/issues/150050</a></li>
</ul>
<p>For those who are eager to start using Kubernetes alerting today, here is what you need to do:</p>
<ol>
<li>Make sure that you have an Elastic cluster up and running. The fastest way to deploy your cluster is to spin up a <a href="https://www.elastic.co/elasticsearch/service">free trial of Elasticsearch Service</a>.</li>
<li>Install the latest Elastic Agent on your Kubernetes cluster following the respective <a href="https://www.elastic.co/guide/en/fleet/master/running-on-kubernetes-managed-by-fleet.html">documentation</a>.</li>
<li>Install our provided alerts that can be found at <a href="https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs">https://github.com/elastic/integrations/tree/main/packages/kubernetes/docs</a> or at <a href="https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting">https://github.com/elastic/k8s-integration-infra/tree/main/scripts/alerting</a>.</li>
</ol>
<p>Of course, if you have any questions, remember that we are always happy to help on the Discuss <a href="https://discuss.elastic.co/">forums</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/enable-kubernetes-alerting-observability/alert-management.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to easily add application monitoring in Kubernetes pods]]></title>
            <link>https://www.elastic.co/observability-labs/blog/application-monitoring-kubernetes-pods</link>
            <guid isPermaLink="false">application-monitoring-kubernetes-pods</guid>
            <pubDate>Wed, 17 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog walks through installing the Elastic APM K8s Attacher and shows how to configure your system for both common and non-standard deployments of Elastic APM agents.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic® APM K8s Attacher</a> allows auto-installation of Elastic APM application agents (e.g., the Elastic APM Java agent) into applications running in your Kubernetes clusters. The mechanism uses a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/">mutating webhook</a>, which is a standard Kubernetes component, but you don’t need to know all the details to use the Attacher. Essentially, you can install the Attacher, add one annotation to any Kubernetes deployment that has an application you want monitored, and that’s it!</p>
<p>In this blog, we’ll walk through a full example from scratch using a Java application. Apart from the Java code and using a JVM for the application, everything else works the same for the other languages supported by the Attacher.</p>
<h2>Prerequisites</h2>
<p>This walkthrough assumes that the following are already installed on the system: JDK 17, Docker, Kubernetes, and Helm.</p>
<h2>The example application</h2>
<p>While the application (shown below) is a Java application, it could easily be implemented in any language: it is just a simple loop that every 2 seconds calls the method chain methodA-&gt;methodB-&gt;methodC-&gt;methodD, with methodC sleeping for 10 milliseconds and methodD sleeping for 200 milliseconds. This application was chosen so that the Elastic APM UI can clearly show that it is being monitored.</p>
<p>The Java application in full is shown here:</p>
<pre><code class="language-java">package test;

public class Testing implements Runnable {

  public static void main(String[] args) {
    new Thread(new Testing()).start();
  }

  public void run()
  {
    while(true) {
      try {Thread.sleep(2000);} catch (InterruptedException e) {}
      methodA();
    }
  }

  public void methodA() {methodB();}

  public void methodB() {methodC();}

  public void methodC() {
    System.out.println(&quot;methodC executed&quot;);
    try {Thread.sleep(10);} catch (InterruptedException e) {}
    methodD();
  }

  public void methodD() {
    System.out.println(&quot;methodD executed&quot;);
    try {Thread.sleep(200);} catch (InterruptedException e) {}
  }
}
</code></pre>
<p>We have created a Docker image containing that simple Java application; it can be pulled from the following Docker repository:</p>
<pre><code class="language-bash">docker.elastic.co/demos/apm/k8s-webhook-test
</code></pre>
<h2>Deploy the pod</h2>
<p>First we need a deployment config. We’ll call the config file webhook-test.yaml, and the contents are pretty minimal — just pull the image and run that as a pod &amp; container called webhook-test in the default namespace:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: webhook-test
  labels:
    app: webhook-test
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: webhook-test
</code></pre>
<p>This can be deployed normally using kubectl:</p>
<pre><code class="language-bash">kubectl apply -f webhook-test.yaml
</code></pre>
<p>The result is exactly as expected:</p>
<pre><code class="language-bash">$ kubectl get pods
NAME           READY   STATUS    RESTARTS   AGE
webhook-test   1/1     Running   0          10s

$ kubectl logs webhook-test
methodC executed
methodD executed
methodC executed
methodD executed
</code></pre>
<p>So far, this is just setting up a standard Kubernetes application with no APM monitoring. Now we get to the interesting bit: adding in auto-instrumentation.</p>
<h2>Install Elastic APM K8s Attacher</h2>
<p>The first step is to install the <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic APM K8s Attacher</a>. This only needs to be done once for the cluster — once installed, it is always available. Before installation, we will define where the monitored data will go. As you will see later, we can decide or change this any time. For now, we’ll specify our own Elastic APM server, which is at <a href="https://myserver.somecloud:443">https://myserver.somecloud:443</a> — we also have a secret token for authorization to that Elastic APM server, which has value MY_SECRET_TOKEN. (If you want to set up a quick test Elastic APM server, you can do so at <a href="https://cloud.elastic.co/">https://cloud.elastic.co/</a>).</p>
<p>There are two additional environment variables set for the application that are not generally needed but will help when we see the resulting UI content toward the end of the walkthrough (when the agent is auto-installed, these two variables tell the agent what name to give this application in the UI and what method to trace). Now we just need to define the custom yaml file to hold these. On installation, the custom yaml will be merged into the yaml for the Attacher:</p>
<pre><code class="language-yaml">apm:
  secret_token: MY_SECRET_TOKEN
  namespaces:
    - default
webhookConfig:
  agents:
    java:
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
</code></pre>
<p>That custom.yaml file is all we need to install the attacher (note we’ve only specified the default namespace for agent auto-installation for now — this can be easily changed, as you’ll see later). Next we’ll add the Elastic charts to Helm — this only needs to be done once; after that, all Elastic charts are available to Helm. This is the usual helm repo add command, specifically:</p>
<pre><code class="language-bash">helm repo add elastic https://helm.elastic.co
</code></pre>
<p>Now the Elastic charts are available for installation (helm search repo would show you all the available charts). We’re going to use “elastic-webhook” as the release name, resulting in the following installation command:</p>
<pre><code class="language-bash">helm install elastic-webhook elastic/apm-attacher --namespace=elastic-apm --create-namespace --values custom.yaml
</code></pre>
<p>And that’s it: we now have the Elastic APM K8s Attacher installed and set to send data to the APM server defined in the custom.yaml file! (You can confirm the installation with helm list -A if you need to.)</p>
<h2>Auto-install the Java agent</h2>
<p>The Elastic APM K8s Attacher is installed, but it doesn’t auto-install the APM application agents into every pod — that could lead to problems! Instead, the Attacher is deliberately limited to auto-installing agents into deployments that (a) are in one of the namespaces listed in the custom.yaml and (b) carry the specific annotation “co.elastic.apm/attach.”</p>
<p>So for now, restarting the webhook-test pod we created above won’t change its behavior, as it isn’t yet set to be monitored. What we need to do is add the annotation. Specifically, we add the annotation referencing the default agent configuration that was installed with the Attacher, called “java” for the Java agent (we’ll see later how an agent configuration is altered — the default configuration installs the latest agent version and leaves everything else at its defaults for that version). Adding that annotation to the webhook-test yaml gives us the new yaml file contents (the additional config is labelled (1)):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: webhook-test
  annotations: #(1)
    co.elastic.apm/attach: java #(1)
  labels:
    app: webhook-test
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: webhook-test
</code></pre>
<p>Applying this change gives us the application now monitored:</p>
<pre><code class="language-bash">$ kubectl delete -f webhook-test.yaml
pod &quot;webhook-test&quot; deleted
$ kubectl apply -f webhook-test.yaml
pod/webhook-test created
$ kubectl logs webhook-test
… StartupInfo - Starting Elastic APM 1.45.0 …
</code></pre>
<p>And since the agent is now feeding data to our APM server, we can now see it in the UI:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/webhook-test-k8s-blog.png" alt="webhook-test" /></p>
<p>Note that the agent identifies the Testing.methodB method as a trace root because of the ELASTIC_APM_TRACE_METHODS environment variable set to test.Testing#methodB in the custom.yaml — this tells the agent to specifically trace that method. The time taken by that method is available in the UI for each invocation, but we don’t yet see the sub-methods. In the next section, we’ll see how easy it is to customize the Attacher, and in doing so we’ll see more detail about the method chain being executed in the application.</p>
<h2>Customizing the agents</h2>
<p>In your systems, you’ll likely have development, testing, and production environments. You’ll want to specify the version of the agent to use rather than just pulling whatever the latest version happens to be, you’ll want debug logging on for some applications or instances, and you’ll want specific options set to specific values. This sounds like a lot of effort, but the attacher lets you make these kinds of changes in a very simple way. In this section, we’ll add a configuration that specifies all these changes, so we can see just how easy it is to configure and enable.</p>
<p>We start with the custom.yaml file we defined above. This is the file that gets merged into the Attacher’s configuration. Adding a new configuration with all the items listed in the last paragraph is easy, though first we need to decide on a name for it. We’ll call it “java-interesting” here. The new custom.yaml in full is (the first part is the same as before; the new config is simply appended):</p>
<pre><code class="language-yaml">apm:
  secret_token: MY_SECRET_TOKEN
  namespaces:
    - default
webhookConfig:
  agents:
    java:
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
    java-interesting:
      image: docker.elastic.co/observability/apm-agent-java:1.55.4
      artifact: &quot;/usr/agent/elastic-apm-agent.jar&quot;
      environment:
        ELASTIC_APM_SERVER_URL: &quot;https://myserver.somecloud:443&quot;
        ELASTIC_APM_TRACE_METHODS: &quot;test.Testing#methodB&quot;
        ELASTIC_APM_SERVICE_NAME: &quot;webhook-test&quot;
        ELASTIC_APM_ENVIRONMENT: &quot;testing&quot;
        ELASTIC_APM_LOG_LEVEL: &quot;debug&quot;
        ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED: &quot;true&quot;
        JAVA_TOOL_OPTIONS: &quot;-javaagent:/elastic/apm/agent/elastic-apm-agent.jar&quot;
</code></pre>
<p>Breaking the additional config down, we have:</p>
<ul>
<li>
<p>The name of the new config java-interesting</p>
</li>
<li>
<p>The APM Java agent image docker.elastic.co/observability/apm-agent-java</p>
<ul>
<li>With a specific version, 1.55.4, instead of latest</li>
</ul>
</li>
<li>
<p>We need to specify the agent jar location within the agent image</p>
<ul>
<li>artifact: &quot;/usr/agent/elastic-apm-agent.jar&quot;</li>
</ul>
</li>
<li>
<p>And then the environment variables:</p>
<ul>
<li>ELASTIC_APM_SERVER_URL, ELASTIC_APM_TRACE_METHODS, and ELASTIC_APM_SERVICE_NAME as before</li>
<li>ELASTIC_APM_ENVIRONMENT set to testing, useful when looking in the UI</li>
<li>ELASTIC_APM_LOG_LEVEL set to debug for more detailed agent output</li>
<li>ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED set to true, which will give us additional interesting information about the method chain being executed in the application</li>
<li>JAVA_TOOL_OPTIONS set to &quot;-javaagent:/elastic/apm/agent/elastic-apm-agent.jar&quot; to enable starting the agent — this is fundamentally how the attacher auto-attaches the Java agent</li>
</ul>
</li>
</ul>
<p>More configurations and details about configuration options are <a href="https://www.elastic.co/guide/en/apm/agent/java/current/configuration.html">here for the Java agent</a>, and <a href="https://www.elastic.co/guide/en/apm/agent/index.html">other language agents</a> are also available.</p>
<h2>The application traced with the new configuration</h2>
<p>And finally we just need to upgrade the attacher with the changed custom.yaml:</p>
<pre><code class="language-bash">helm upgrade elastic-webhook elastic/apm-attacher --namespace=elastic-apm --create-namespace --values custom.yaml
</code></pre>
<p>This is the same command as the original install, but now using upgrade. That’s it — add config to the custom.yaml and upgrade the attacher, and it’s done! Simple.</p>
<p>Of course we still need to use the new config on an app. In this case, we’ll edit the existing webhook-test.yaml file, replacing java with java-interesting, so the annotation line is now:</p>
<pre><code class="language-yaml">co.elastic.apm/attach: java-interesting
</code></pre>
<p>Applying the new pod config and restarting the pod, you can see the logs now hold debug output:</p>
<pre><code class="language-bash">$ kubectl delete -f webhook-test.yaml
pod &quot;webhook-test&quot; deleted
$ kubectl apply -f webhook-test.yaml
pod/webhook-test created
$ kubectl logs webhook-test
… StartupInfo - Starting Elastic APM 1.55.4 …
… DEBUG co.elastic.apm.agent. …
… DEBUG co.elastic.apm.agent. …
</code></pre>
<p>More interesting is the UI. Now that inferred spans is on, the full method chain is visible.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/trace-sample-k8s-blog.png" alt="trace sample" /></p>
<p>This gives the details for methodB (it takes 211 milliseconds because it calls methodC, which sleeps 10 ms, which in turn calls methodD, which sleeps 200 ms). The times for methodC and methodD are inferred rather than traced; if you needed accurate times, you would instead add those methods to trace_methods and have them traced too.</p>
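<p>The timings above can be reproduced outside Kubernetes and the APM UI. The following is a minimal, self-contained Java sketch of the same method chain that measures methodB directly (the class name is ours; the sleeps match the example application):</p>

```java
public class MethodChainTiming {

    static void pause(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { }
    }

    static void methodD() { pause(200); }
    static void methodC() { pause(10); methodD(); }
    static void methodB() { methodC(); }

    // Wall-clock duration of one methodB call, in milliseconds.
    static long timeMethodB() {
        long start = System.nanoTime();
        methodB();
        return (System.nanoTime() - start) / 1000000;
    }

    public static void main(String[] args) {
        // Roughly 210 ms plus scheduling overhead, matching the span in the UI.
        System.out.println("methodB took about " + timeMethodB() + " ms");
    }
}
```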
<h2>Note on the ECK operator</h2>
<p>The <a href="https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-overview.html">Elastic Cloud on Kubernetes operator</a> allows you to install and manage a number of other Elastic components on Kubernetes. At the time of publication of this blog, the <a href="https://www.elastic.co/guide/en/apm/attacher/current/index.html">Elastic APM K8s Attacher</a> is a separate component, and there is no conflict between these management mechanisms — they apply to different components and are independent of each other.</p>
<h2>Try it yourself!</h2>
<p>This walkthrough is easily repeated on your system, and you can make it more useful by replacing the example application with your own and the Docker registry with the one you use.</p>
<p><a href="https://www.elastic.co/observability/kubernetes-monitoring">Learn more about real-time monitoring with Kubernetes and Elastic Observability</a>.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/application-monitoring-kubernetes-pods/139689_-_Blog_Header_Banner_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[How to monitor Kafka and Confluent Cloud with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/monitor-kafka-confluent-cloud-elastic-observability</link>
            <guid isPermaLink="false">monitor-kafka-confluent-cloud-elastic-observability</guid>
            <pubDate>Mon, 03 Apr 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post will take you through best practices to observe Kafka-based solutions implemented on Confluent Cloud with Elastic Observability.]]></description>
            <content:encoded><![CDATA[<p>The blog will take you through best practices to observe Kafka-based solutions implemented on Confluent Cloud with Elastic Observability. (To monitor Kafka brokers that are not in Confluent Cloud, I recommend checking out <a href="https://www.elastic.co/blog/how-to-monitor-containerized-kafka-with-elastic-observability">this blog</a>.) We will instrument Kafka applications with <a href="https://www.elastic.co/observability/application-performance-monitoring">Elastic APM</a>, use the Confluent Cloud metrics endpoint to get data about brokers, and pull it all together with a unified Kafka and Confluent Cloud monitoring dashboard in <a href="https://www.elastic.co/observability">Elastic Observability</a>.</p>
<h2>Using full-stack Elastic Observability to understand Kafka and Confluent performance</h2>
<p>In the <a href="https://dice.viewer.foleon.com/ebooks/dice-tech-salary-report-explore/">2023 Dice Tech Salary Report</a>, Elasticsearch and Kafka are ranked #3 and #5 out of the top 12 <a href="https://dice.viewer.foleon.com/ebooks/dice-tech-salary-report-explore/salary-trends#Skills">most in-demand skills</a> at the moment, so it’s no surprise that we are seeing a large number of customers implementing data in motion with Kafka.</p>
<p><a href="https://www.elastic.co/integrations/data-integrations?search=kafka">Kafka</a> comes with some additional complexities that go beyond traditional architectures and which make observability an even more important topic. Understanding where the bottlenecks are in messaging and stream-based architectures can be tough. This is why you need a comprehensive observability solution with <a href="https://www.elastic.co/blog/aiops-use-cases-observability-operations">machine learning</a> to help you.</p>
<p>In this blog, we will explore how to get Kafka applications instrumented with <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Elastic APM</a>, how to collect performance data with JMX, and how you can use the Elasticsearch Platform to pull in data from Confluent Cloud — which is by far the easiest and most cost-effective way to implement Kafka architectures.</p>
<p>For this blog post, we will be following the code in this <a href="https://github.com/davidgeorgehope/multi-cloud">git repository</a>. There are three services here that are designed to run on two clouds, push data from one cloud to the other, and finally land it in Google BigQuery. We want to monitor all of this using Elastic Observability, giving you a complete picture of Confluent and Kafka services performance. As a teaser, this is the goal:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-producer_metrics.png" alt="kafka producer metrics" /></p>
<h2>A look at the architecture</h2>
<p>As mentioned, we have three <a href="https://www.elastic.co/observability/cloud-monitoring">multi-cloud services</a> implemented in our example application.</p>
<p>The first service is a Spring WebFlux service that runs inside AWS EKS. This service will take a message from a REST Endpoint and simply put it straight on to a Kafka topic.</p>
<p>The second service, which is also a Spring WebFlux service hosted inside Google Cloud Platform (GCP) with its <a href="https://www.elastic.co/observability/google-cloud-monitoring">Google Cloud monitoring</a>, will then pick this up and forward it to another service that will put the message into BigQuery.</p>
<p>These services are all instrumented using Elastic APM. For this blog, we have decided to use Spring config to inject and configure the APM agent. You could of course use the “-javaagent” argument to inject the agent instead if preferred.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-obsevability-aws-kafka-google-cloud.png" alt="aws kafka google cloud" /></p>
<h2>Getting started with Elastic Observability and Confluent Cloud</h2>
<p>Before we dive into the application and its configuration, you will want to get an Elastic Cloud and Confluent Cloud account. You can sign up here for <a href="https://www.elastic.co/cloud/">Elastic</a> and here for <a href="https://www.confluent.io/confluent-cloud/">Confluent Cloud</a>. There are some initial configuration steps we need to do inside Confluent Cloud, as you will need to create three topics: gcpTopic, myTopic, and topic_2.</p>
<p>When you sign up for Confluent Cloud, you will be given an option of what type of cluster to create. For this walk-through, a Basic cluster is fine (as shown) — if you are careful about usage, it will not cost you a penny.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-create-cluster.png" alt="confluent create cluster" /></p>
<p>Once you have a cluster, go ahead and create the three topics.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-topics.png" alt="confluent topics" /></p>
<p>For this walk-through, you will only need to create single partition topics as shown below:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-new-topic.png" alt="new topic" /></p>
<p>Now we are ready to set up the Elastic Cloud cluster.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-create-a-deployment.png" alt="create a deployment" /></p>
<p>One thing to note here is that when setting up an Elastic cluster, the defaults are mostly OK, with one minor tweak: under “Advanced Settings,” add capacity for machine learning.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-machine-learning-instances.png" alt="machine learning instances" /></p>
<h2>Getting APM up and running</h2>
<p>The first thing we want to do here is get our Spring Boot Webflux-based services up and running. For this blog, I have decided to implement this using the Spring Configuration, as you can see below. For brevity, I have not listed all the JMX configuration information, but you can see those details in <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/aws-multi-cloud/src/main/java/com/elastic/multicloud/ElasticApmConfig.java">GitHub</a>.</p>
<pre><code class="language-java">package com.elastic.multicloud;
import co.elastic.apm.attach.ElasticApmAttacher;
import jakarta.annotation.PostConstruct;
import lombok.Setter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

import java.util.HashMap;
import java.util.Map;

@Setter
@Configuration
@ConfigurationProperties(prefix = &quot;elastic.apm&quot;)
@ConditionalOnProperty(value = &quot;elastic.apm.enabled&quot;, havingValue = &quot;true&quot;)
public class ElasticApmConfig {

    private static final String SERVER_URL_KEY = &quot;server_url&quot;;
    private String serverUrl;

    private static final String SERVICE_NAME_KEY = &quot;service_name&quot;;
    private String serviceName;

    private static final String SECRET_TOKEN_KEY = &quot;secret_token&quot;;
    private String secretToken;

    private static final String ENVIRONMENT_KEY = &quot;environment&quot;;
    private String environment;

    private static final String APPLICATION_PACKAGES_KEY = &quot;application_packages&quot;;
    private String applicationPackages;

    private static final String LOG_LEVEL_KEY = &quot;log_level&quot;;
    private String logLevel;
    private static final Logger LOGGER = LoggerFactory.getLogger(ElasticApmConfig.class);

    @PostConstruct
    public void init() {
        LOGGER.info(environment);

        Map&lt;String, String&gt; apmProps = new HashMap&lt;&gt;(6);
        apmProps.put(SERVER_URL_KEY, serverUrl);
        apmProps.put(SERVICE_NAME_KEY, serviceName);
        apmProps.put(SECRET_TOKEN_KEY, secretToken);
        apmProps.put(ENVIRONMENT_KEY, environment);
        apmProps.put(APPLICATION_PACKAGES_KEY, applicationPackages);
        apmProps.put(LOG_LEVEL_KEY, logLevel);
        apmProps.put(&quot;enable_experimental_instrumentations&quot;, &quot;true&quot;);
        apmProps.put(&quot;capture_jmx_metrics&quot;, &quot;object_name[kafka.producer:type=producer-metrics,client-id=*] attribute[batch-size-avg:metric_name=kafka.producer.batch-size-avg]&quot;);

        ElasticApmAttacher.attach(apmProps);
    }
}
</code></pre>
<p>Now obviously this requires some dependencies, which you can see here in the Maven pom.xml.</p>
<pre><code class="language-xml">&lt;dependency&gt;
    &lt;groupId&gt;co.elastic.apm&lt;/groupId&gt;
    &lt;artifactId&gt;apm-agent-attach&lt;/artifactId&gt;
    &lt;version&gt;1.35.1-SNAPSHOT&lt;/version&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;co.elastic.apm&lt;/groupId&gt;
    &lt;artifactId&gt;apm-agent-api&lt;/artifactId&gt;
    &lt;version&gt;1.35.1-SNAPSHOT&lt;/version&gt;
&lt;/dependency&gt;
</code></pre>
<p>Strictly speaking, the agent-api dependency is not required, but it is useful if you want to add your own monitoring code (as in the example below). The agent will happily auto-instrument without it, though.</p>
<pre><code class="language-java">Span span = ElasticApm.currentSpan()
        .startSpan(&quot;external&quot;, &quot;kafka&quot;, null)
        .setName(&quot;DAVID&quot;)
        .setServiceTarget(&quot;kafka&quot;, &quot;gcp-elastic-apm-spring-boot-integration&quot;);
try (final Scope scope = span.activate()) {
    // Propagate the trace context into the Kafka record headers so the
    // consumer side can continue the distributed trace.
    span.injectTraceHeaders((name, value) -&gt; producerRecord.headers().add(name, value.getBytes()));
    return Mono.fromRunnable(() -&gt; kafkaTemplate.send(producerRecord));
} catch (Exception e) {
    span.captureException(e);
    throw e;
} finally {
    span.end();
}
</code></pre>
<p>Now we have enough code to get our agent bootstrapped.</p>
<p>To get the code from the GitHub repository up and running, you will need the following installed on your system, along with credentials for your GCP and AWS accounts.</p>
<ul>
<li>Java</li>
<li>Maven</li>
<li>Docker</li>
<li>Kubernetes CLI (kubectl)</li>
</ul>
<h3>Clone the project</h3>
<p>Clone the multi-cloud Spring project to your local machine.</p>
<pre><code class="language-bash">git clone https://github.com/davidgeorgehope/multi-cloud
</code></pre>
<h3>Build the project</h3>
<p>From each service in the project (aws-multi-cloud, gcp-multi-cloud, gcp-bigdata-consumer-multi-cloud), run the following commands to build the project.</p>
<pre><code class="language-bash">mvn clean install
</code></pre>
<p>Now you can run the Java project locally.</p>
<pre><code class="language-bash">java -jar gcp-bigdata-consumer-multi-cloud-0.0.1-SNAPSHOT.jar --spring.config.location=/Users/davidhope/application-gcp.properties
</code></pre>
<p>That will just get the Java application running locally, but you can also deploy this to Kubernetes using EKS and GKE as shown below.</p>
<h3>Create a Docker image</h3>
<p>Create a Docker image from the built project using the dockerBuild.sh script provided in the project. You may want to customize this shell script to push the built Docker image to your own Docker registry.</p>
<pre><code class="language-bash">./dockerBuild.sh
</code></pre>
<h3>Create a namespace for each service</h3>
<pre><code class="language-bash">kubectl create namespace aws
</code></pre>
<pre><code class="language-bash">kubectl create namespace gcp-1
</code></pre>
<pre><code class="language-bash">kubectl create namespace gcp-2
</code></pre>
<p>Once you have the namespaces created, you can switch the active namespace with the following command (replacing my-namespace with aws, gcp-1, or gcp-2):</p>
<pre><code class="language-bash">kubectl config set-context --current --namespace=my-namespace
</code></pre>
<h3>Configuration for each service</h3>
<p>Each service needs an application.properties file. I have put an example <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/gcp-bigdata-consumer-multi-cloud/application.properties">here</a>.</p>
<p>You will need to replace the following properties with those you find in Elastic.</p>
<pre><code class="language-bash">elastic.apm.server-url=
elastic.apm.secret-token=
</code></pre>
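<p>It is worth noting how these Spring-style keys reach the agent: Spring's relaxed binding maps kebab-case properties such as <code>elastic.apm.server-url</code> onto the camelCase fields of the <code>ElasticApmConfig</code> class shown earlier, which then hands them to the attacher under snake_case option names. Here is a minimal sketch of that key translation (illustrative only, not part of the project):</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: shows how a kebab-case property such as
// "elastic.apm.server-url" corresponds to the snake_case key
// "server_url" that the agent attacher expects.
public class ApmKeyMapping {
    // Convert the suffix after the "elastic.apm." prefix
    // from kebab-case to the agent's snake_case option name.
    static String toAgentKey(String springKey) {
        String suffix = springKey.substring("elastic.apm.".length());
        return suffix.replace('-', '_');
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("elastic.apm.server-url", "https://example.apm.endpoint"); // placeholder value
        props.put("elastic.apm.secret-token", "REDACTED");                   // placeholder value

        Map<String, String> agentProps = new LinkedHashMap<>();
        props.forEach((k, v) -> agentProps.put(toAgentKey(k), v));

        System.out.println(agentProps.containsKey("server_url"));   // true
        System.out.println(agentProps.containsKey("secret_token")); // true
    }
}
```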
<p>These can be found by going into Elastic Cloud and clicking on <strong>Services</strong> inside APM and then <strong>Add Data</strong>, which should be visible in the top right corner.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-add-data.png" alt="add data" /></p>
<p>From there you will see the following, which gives you the config information you need.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-apm-agents.png" alt="apm agents" /></p>
<p>You will need to replace the following properties with those you find in Confluent Cloud.</p>
<pre><code class="language-bash">elastic.kafka.producer.sasl-jaas-config=
</code></pre>
<p>This configuration comes from the Clients page in Confluent Cloud.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-confluent-new-client.png" alt="confluent new client" /></p>
<h3>Adding the config for each service in Kubernetes</h3>
<p>Once you have a fully configured application properties, you need to add it to your <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Kubernetes environment</a> as below.</p>
<p>From the aws namespace.</p>
<pre><code class="language-bash">kubectl create secret generic my-app-config --from-file=application.properties
</code></pre>
<p>From the gcp-1 namespace.</p>
<pre><code class="language-bash">kubectl create secret generic my-app-config --from-file=application.properties
</code></pre>
<p>From the gcp-2 namespace.</p>
<pre><code class="language-bash">kubectl create secret generic bigdata-creds --from-file=elastic-product-marketing-e145e13fbc7c.json

kubectl create secret generic my-app-config-gcp-bigdata --from-file=application.properties
</code></pre>
<h3>Create a Kubernetes deployment</h3>
<p>Create a Kubernetes deployment YAML file and add your Docker image to it. You can use the deployment.yaml file provided in the project as a template. Make sure to update the image name in the file to match the name of the Docker image you just created.</p>
<pre><code class="language-bash">kubectl apply -f deployment.yaml
</code></pre>
<h3>Create a Kubernetes service</h3>
<p>Create a Kubernetes service YAML file and add your deployment to it. You can use the service.yaml file provided in the project as a template.</p>
<pre><code class="language-bash">kubectl apply -f service.yaml
</code></pre>
<h3>Access your application</h3>
<p>Your application is now running in a Kubernetes cluster. To access it, use the service's cluster IP and port, which you can find with the following command.</p>
<pre><code class="language-bash">kubectl get services
</code></pre>
<p>Once you know where the service is exposed, you need to send it some traffic!</p>
<p>You can regularly poke the service endpoint using the following command.</p>
<pre><code class="language-bash">curl -X POST -H &quot;Content-Type: application/json&quot; -d '{&quot;name&quot;: &quot;linuxize&quot;, &quot;email&quot;: &quot;linuxize@example.com&quot;}' http://localhost:8080/api/my-objects/publish
</code></pre>
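<p>If you would rather drive the endpoint from Java, for example inside a small load-generation loop, the same request can be sketched with the JDK's built-in HTTP client. The base URL below is an assumption; substitute the IP and port that <code>kubectl get services</code> reported.</p>

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Sketch of the curl command above using the JDK's built-in HTTP client.
// The base URL is a placeholder; substitute your service's IP and port.
public class Poke {
    static HttpRequest buildRequest(String baseUrl) {
        String json = "{\"name\": \"linuxize\", \"email\": \"linuxize@example.com\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/my-objects/publish"))
                .header("Content-Type", "application/json")
                .timeout(Duration.ofSeconds(5))
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("http://localhost:8080");
        System.out.println(req.method() + " " + req.uri());
        // To actually send it:
        // HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
    }
}
```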
<p>With this up and running, you should see the following service map build out in the Elastic APM product.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-aws-elastic-apm-spring-boot.png" alt="aws elastic apm spring boot" /></p>
<p>And traces will contain a waterfall graph showing all the spans that have executed across this distributed application, allowing you to pinpoint where any issues are within each transaction.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-services.png" alt="observability services" /></p>
<h2>JMX for Kafka Producer/Consumer metrics</h2>
<p>In the previous part of this blog, we briefly touched on the JMX metric configuration you can see below.</p>
<pre><code class="language-java">&quot;capture_jmx_metrics&quot;, &quot;object_name[kafka.producer:type=producer-metrics,client-id=*] attribute[batch-size-avg:metric_name=kafka.producer.batch-size-avg]&quot;
</code></pre>
<p>We can use this “capture_jmx_metrics” setting to capture any of the Kafka producer/consumer metrics we want to monitor.</p>
<p>Check out the documentation <a href="https://www.elastic.co/guide/en/apm/agent/java/current/config-jmx.html">here</a> to understand how to configure this and <a href="https://docs.confluent.io/platform/current/kafka/monitoring.html">here</a> to see the available JMX metrics you can monitor. In the <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/gcp-bigdata-consumer-multi-cloud/src/main/java/com/elastic/multicloud/ElasticApmConfig.java">example code in GitHub</a>, we actually pull all the available metrics in, so you can check in there how to configure this.</p>
<p>One thing worth pointing out: it’s important to set the “metric_name” property shown above, as without it the metrics become quite difficult to find in Elastic Discover.</p>
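<p>If you want to capture several attributes from the same MBean, each with a friendly <code>metric_name</code>, it can help to assemble the setting programmatically. Below is a sketch; <code>record-send-rate</code> is one of Kafka's documented producer metrics, and the exact capture_jmx_metrics grammar is described in the agent documentation linked above.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: build a capture_jmx_metrics value that maps each
// Kafka producer MBean attribute to an explicit metric_name, so the
// resulting metrics are easy to locate in Discover.
public class JmxMetricConfig {
    static String captureJmxMetrics(Map<String, String> attributeToMetricName) {
        String attributes = attributeToMetricName.entrySet().stream()
                .map(e -> String.format("attribute[%s:metric_name=%s]", e.getKey(), e.getValue()))
                .collect(Collectors.joining(" "));
        return "object_name[kafka.producer:type=producer-metrics,client-id=*] " + attributes;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("batch-size-avg", "kafka.producer.batch-size-avg");
        attrs.put("record-send-rate", "kafka.producer.record-send-rate");
        System.out.println(captureJmxMetrics(attrs));
    }
}
```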
<h2>Monitoring Confluent Cloud with Elastic Observability</h2>
<p>So we now have some good monitoring set up for Kafka Producers and Consumers and we can trace transactions between services down to the lines of code that are executing. The core part of our Kafka infrastructure is hosted in Confluent Cloud. How, then, do we get data from there into our <a href="https://www.elastic.co/observability">full stack observability solution</a>?</p>
<p>Luckily, Confluent has done a fantastic job of making this easy. It provides important Confluent Cloud metrics via an open Prometheus-based metrics URL. So let's get down to business and configure this to bring data into our <a href="https://www.elastic.co/observability">observability tool</a>.</p>
<p>The first step is to configure Confluent Cloud with the MetricsViewer role. The MetricsViewer role provides service account access to the Metrics API for all clusters in an organization. It also enables service accounts to import metrics into third-party metrics platforms.</p>
<p>To assign the MetricsViewer role to a new service account:</p>
<ol>
<li>In the administration menu (☰) in the upper-right corner of the Confluent Cloud user interface, click <strong>ADMINISTRATION &gt; Cloud API keys</strong>.</li>
<li>Click <strong>Add key</strong>.</li>
<li>Click the <strong>Granular access</strong> tile to set the scope for the API key. Click <strong>Next</strong>.</li>
<li>Click <strong>Create a new one</strong> and specify the service account name. Optionally, add a description. Click <strong>Next</strong>.</li>
<li>The API key and secret are generated for the service account. You will need this API key and secret to connect to the cluster, so be sure to safely store this information. Click <strong>Save</strong>. The new service account with the API key and associated ACLs is created. When you return to the API access tab, you can view the newly-created API key to confirm.</li>
<li>Return to Accounts &amp; access in the administration menu, and in the Accounts tab, click <strong>Service accounts</strong> to view your service accounts.</li>
<li>Select the service account that you want to assign the MetricsViewer role to.</li>
<li>In the service account’s details page, click <strong>Access</strong>.</li>
<li>In the tree view, open the resource where you want the service account to have the MetricsViewer role.</li>
<li>Click <strong>Add role assignment</strong> and select the MetricsViewer tile. Click <strong>Save</strong>.</li>
</ol>
<p>Next we can head to <a href="https://www.elastic.co/observability">Elastic Observability</a> and configure the Prometheus integration to pull in the metrics data.</p>
<p>Go to the integrations page in Kibana.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-integrations.png" alt="observability integrations" /></p>
<p>Find the Prometheus integration. We are using the Prometheus integration because the Confluent Cloud metrics server can provide data in Prometheus format. Trust us, this works really well — good work, Confluent!</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-integrations-prometheus.png" alt="integrations prometheus" /></p>
<p>Add Prometheus in the next page.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-add-prometheus.png" alt="add prometheus" /></p>
<p>Configure the Prometheus integration as follows: in the hosts box, add the following URL, replacing the resource.kafka.id value with the ID of the cluster you want to monitor.</p>
<pre><code class="language-bash">https://api.telemetry.confluent.cloud:443/v2/metrics/cloud/export?resource.kafka.id=lkc-3rw3gw
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-collect-prometheus-metrics.png" alt="collect prometheus metrics" /></p>
<p>Under the advanced options, add the API key and secret you generated in the Confluent Cloud API keys step above as the username and password.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-http-config-options.png" alt="http config options" /></p>
<p>Once the Integration is created, <a href="https://www.elastic.co/guide/en/fleet/current/agent-policy.html#apply-a-policy">the policy needs to be applied</a> to an instance of a running Elastic Agent.</p>
<p>That’s it! It’s that easy to get all the data you need for a full stack observability monitoring solution.</p>
<p>Finally, let’s pull all this together in a dashboard.</p>
<h2>Pulling it all together</h2>
<p>Using Kibana to generate dashboards is super easy. If you configured everything the way we recommended above, you should find the metrics (producer/consumer/brokers) you need to create your own dashboard as per the following screenshot.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-dashboard-metrics.png" alt="dashboard metrics" /></p>
<p>Luckily, I made a dashboard for you and stored it in <a href="https://github.com/davidgeorgehope/multi-cloud/blob/main/export.ndjson">GitHub</a>. Take a look below, and use the export to import it into your own environment.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-producer_metrics.png" alt="producer metrics" /></p>
<h2>Adding the icing on the cake: machine learning anomaly detection</h2>
<p>Now that we have all the critical bits in place, we are going to add the icing on the cake: machine learning (ML)!</p>
<p>Within Kibana, let's head over to the Machine Learning tab in “Analytics.”</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-kibana-analytics.png" alt="kibana analytics" /></p>
<p>Go to the jobs page, where we’ll get started creating our first anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-create-your-first-anomaly-detection-job.png" alt="create your first anomaly detection job" /></p>
<p>The metrics data view contains what we need to create this new anomaly detection job.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-metrics.png" alt="observability metrics" /></p>
<p>Use the wizard and select a “Single Metric.”</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-use-a-wizard.png" alt="use a wizard" /></p>
<p>Use the full data.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-use-full-data.png" alt="use full data" /></p>
<p>In this example, we are going to look for anomalies in the connection count. A major deviation here is something we really do not want, as suddenly having too many or too few clients connecting to our Kafka cluster could indicate something very bad occurring.</p>
<p>Once you have selected the connection count metric, you can proceed through the wizard and eventually your ML job will be created and you should be able to view the data as per the example below.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/blog-elastic-observability-single-metric-viewer.png" alt="single metric viewer" /></p>
<p>Congratulations, you have now created a machine learning job to alert you if there are any problems with your Kafka cluster, adding <a href="https://www.elastic.co/observability/aiops">a full AIOps solution</a> to your Kafka and Confluent observability!</p>
<h2>Summary</h2>
<p>We looked at monitoring Kafka-based solutions implemented on Confluent Cloud using Elastic Observability.</p>
<p>We covered the architecture of a multi-cloud solution involving AWS EKS, Confluent Cloud, and GCP GKE. We looked at how to instrument Kafka applications with Elastic APM, use JMX for Kafka Producer/Consumer metrics, integrate Prometheus, and set up machine learning anomaly detection.</p>
<p>We went through a detailed walk-through with code snippets, configuration steps, and deployment instructions included to help you get started.</p>
<p>Interested in learning more about Elastic Observability? Check out the following resources:</p>
<ul>
<li><a href="https://www.elastic.co/virtual-events/intro-to-elastic-observability">An Introduction to Elastic Observability</a></li>
<li><a href="https://www.elastic.co/training/observability-fundamentals">Observability Fundamentals Training</a></li>
<li><a href="https://www.elastic.co/observability/demo">Watch an Elastic Observability demo</a></li>
<li><a href="https://www.elastic.co/blog/observability-predictions-trends-2023">Observability Predictions and Trends for 2023</a></li>
</ul>
<p>And sign up for our <a href="https://www.elastic.co/virtual-events/emerging-trends-in-observability">Elastic Observability Trends Webinar</a> featuring AWS and Forrester, not to be missed!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/monitor-kafka-confluent-cloud-elastic-observability/patterns-white-background-no-logo-observability_(1).png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Ingesting and analyzing Prometheus metrics with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/ingesting-analyzing-prometheus-metrics-observability</link>
            <guid isPermaLink="false">ingesting-analyzing-prometheus-metrics-observability</guid>
            <pubDate>Mon, 09 Oct 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[In this blog post, we will showcase the integration of Prometheus with Elastic, emphasizing how Elastic elevates metrics monitoring through extensive historical analytics, anomaly detection, and forecasting, all in a cost-effective manner.]]></description>
<content:encoded><![CDATA[<p>In the world of monitoring and observability, <a href="https://prometheus.io/">Prometheus</a> has grown into the de facto standard for monitoring in cloud-native environments because of its robust data collection mechanism, flexible querying capabilities, and integration with other tools for rich dashboarding and visualization.</p>
<p>Prometheus is primarily built for short-term metric storage, typically retaining data in-memory or on local disk storage, with a focus on real-time monitoring and alerting rather than historical analysis. While it offers valuable insights into current metric values and trends, it may pose economic challenges and fall short of the robust functionalities and capabilities necessary for in-depth historical analysis, long-term trend detection, and forecasting. This is particularly evident in large environments with a substantial number of targets or high data ingestion rates, where metric data accumulates rapidly.</p>
<p>Numerous organizations assess their unique needs and explore avenues to augment their Prometheus monitoring and observability capabilities. One effective approach is integrating Prometheus with Elastic®. In this blog post, we will showcase the integration of Prometheus with Elastic, emphasizing how Elastic elevates metrics monitoring through extensive historical analytics, anomaly detection, and forecasting, all in a cost-effective manner.</p>
<h2>Integrate Prometheus with Elastic seamlessly</h2>
<p>Organizations that have configured their cloud-native applications to expose metrics in Prometheus format can seamlessly transmit the metrics to Elastic by using <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-prometheus.html">Prometheus integration</a>. Elastic enables organizations to monitor their metrics in conjunction with all other data gathered through <a href="https://www.elastic.co/integrations/data-integrations">Elastic's extensive integrations</a>.</p>
<p>Go to Integrations and find the Prometheus integration.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-1-integrations.png" alt="1 - integrations" /></p>
<p>To gather metrics from Prometheus servers, the Elastic Agent is employed, with central management of Elastic agents handled through the <a href="https://www.elastic.co/guide/en/fleet/current/fleet-overview.html">Fleet server</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-2-set-up-prometheus-integration.png" alt="2 - set up integration" /></p>
<p>After enrolling the Elastic Agent in Fleet, users can choose from the following methods to ingest Prometheus metrics into Elastic.</p>
<h3>1. Prometheus collectors</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-exporters-collectors">The Prometheus collectors</a> connect to the Prometheus server and pull metrics or scrape metrics from a Prometheus exporter.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-3-prometheus-collectors.png" alt="3 - Prometheus collectors" /></p>
<h3>2. Prometheus queries</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-queries-promql">The Prometheus queries</a> execute specific Prometheus queries against <a href="https://prometheus.io/docs/prometheus/latest/querying/api/#expression-queries">Prometheus Query API</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-4-promtheus-queries.png" alt="4 - Prometheus queries" /></p>
<h3>3. Prometheus remote-write</h3>
<p><a href="https://docs.elastic.co/integrations/prometheus#prometheus-server-remote-write">The Prometheus remote_write</a> can receive metrics from a Prometheus server that has configured the <a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write">remote_write</a> setting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-5-prometheus-remote-write.png" alt="5 - Prometheus remote-write" /></p>
<p>After your Prometheus metrics are ingested, you have the option to visualize your data graphically within the <a href="https://www.elastic.co/guide/en/observability/current/explore-metrics.html">Metrics Explorer</a> and further segment it based on labels, such as hosts, containers, and more.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-10-metrics-explorer.png" alt="10 - metrics explorer" /></p>
<p>You can also query your metrics data in <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a> and explore the fields of your individual documents within the details panel.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-7-expanded-doc.png" alt="7 - expanded document" /></p>
<h2>Storing historical metrics with Elastic’s data tiering mechanism</h2>
<p>By exporting Prometheus metrics to Elasticsearch, organizations can extend the retention period and gain the ability to analyze metrics historically. Elastic optimizes data storage and access based on the frequency of data usage and the performance requirements of different data sets. The goal is to efficiently manage and store data, ensuring that it remains accessible when needed while keeping storage costs in check.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-8-hot-to-frozen.png" alt="8 - hot to frozen flow chart" /></p>
<p>After ingesting Prometheus metrics data, you have various retention options. You can set the duration for data to reside in the hot tier, which utilizes high IO hardware (SSD) and is more expensive. Alternatively, you can move the Prometheus metrics to the warm tier, employing cost-effective hardware like spinning disks (HDD) while maintaining consistent and efficient search performance. The cold tier mirrors the infrastructure of the warm tier for primary data but utilizes S3 for replica storage. Elastic automatically recovers replica indices from S3 in case of node or disk failure, ensuring search performance comparable to the warm tier while reducing disk cost.</p>
<p>The <a href="https://www.elastic.co/blog/introducing-elasticsearch-frozen-tier-searchbox-on-s3">frozen tier</a> allows direct searching of data stored in S3 or an object store, without the need for rehydration. The purpose is to further reduce storage costs for Prometheus metrics data that is less frequently accessed. By moving historical data into the frozen tier, organizations can optimize their storage infrastructure, ensuring that the recent, critical data remains in higher-performance tiers while less frequently accessed data is stored economically in the frozen tier. This way, organizations can perform historical analysis and trend detection, identify patterns and make informed decisions, and maintain compliance with regulatory standards in a cost-effective manner.</p>
<p>An alternative way to store your cloud-native metrics more efficiently is to use <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html">Elastic Time Series Data Stream</a> (TSDS). TSDS can store your metrics data more efficiently with <a href="https://www.elastic.co/blog/70-percent-storage-savings-for-metrics-with-elastic-observability">~70% less disk space</a> than a regular data stream. The <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/downsampling.html">downsampling</a> functionality will further reduce the storage required by rolling up metrics within a fixed time interval into a single summary metric. This not only assists organizations in cutting down on storage expenses for metric data but also simplifies the metric infrastructure, making it easier for users to correlate metrics with logs and traces through a unified interface.</p>
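<p>Conceptually, downsampling replaces every raw metric point inside a fixed interval with one summary document. The following toy sketch (not Elastic's implementation, just the idea) rolls raw points into per-minute summaries carrying min, max, sum, and count:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of downsampling: all raw points inside a fixed time
// interval are rolled up into a single summary (min, max, sum, count).
public class Downsample {
    record Point(long epochMillis, double value) {}
    record Summary(long bucketStart, double min, double max, double sum, long count) {
        double avg() { return sum / count; }
    }

    static List<Summary> downsample(List<Point> points, long intervalMillis) {
        Map<Long, Summary> buckets = new TreeMap<>();
        for (Point p : points) {
            long start = (p.epochMillis() / intervalMillis) * intervalMillis;
            buckets.merge(start,
                    new Summary(start, p.value(), p.value(), p.value(), 1),
                    (a, b) -> new Summary(start,
                            Math.min(a.min(), b.min()),
                            Math.max(a.max(), b.max()),
                            a.sum() + b.sum(),
                            a.count() + b.count()));
        }
        return new ArrayList<>(buckets.values());
    }

    public static void main(String[] args) {
        List<Point> raw = List.of(
                new Point(0, 10), new Point(30_000, 20), // first 1-minute bucket
                new Point(70_000, 40));                  // second 1-minute bucket
        List<Summary> out = downsample(raw, 60_000);
        System.out.println(out.size());       // 2
        System.out.println(out.get(0).avg()); // 15.0
    }
}
```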
<h2>Advanced analytics</h2>
<p>Besides <a href="https://www.elastic.co/guide/en/observability/current/explore-metrics.html">Metrics Explorer</a> and <a href="https://www.elastic.co/guide/en/kibana/current/discover.html">Discover</a>, Elasticsearch® provides more advanced analytics capabilities and empowers organizations to gain deeper, more valuable insights into their Prometheus metrics data.</p>
<p>Out of the box, Prometheus integration provides a default overview dashboard.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-9-advacned-analytics.png" alt="9 - adv analytics" /></p>
<p>From Metrics Explorer or Discover, users can also easily edit their Prometheus metrics visualization in <a href="https://www.elastic.co/kibana/kibana-lens">Elastic Lens</a> or create new visualizations from Lens.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-6-metrics-explorer.png" alt="6 - metrics explorer" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-11-green-bars.png" alt="11 - green bars" /></p>
<p>Elastic Lens enables users to explore and visualize data intuitively through dynamic visualizations. This user-friendly interface eliminates the need for complex query languages, making data analysis accessible to a broader audience. Elasticsearch also offers other powerful visualization methods with <a href="https://www.elastic.co/guide/en/kibana/current/add-aggregation-based-visualization-panels.html">aggregations</a> and <a href="https://www.youtube.com/watch?v=I8NtctS33F0">filters</a>, enabling users to perform advanced analytics on their Prometheus metrics data, including short-term and historical data. To learn more, check out the <a href="https://www.elastic.co/videos/training-how-to-series-stack">how-to series: Kibana</a>.</p>
<h2>Anomaly detection and forecasting</h2>
<p>When analyzing data, maintaining a constant watch on the screen is simply not feasible, especially when dealing with millions of time series of Prometheus metrics. Engineers frequently encounter the challenge of differentiating normal from abnormal data points, which involves analyzing historical data patterns — a process that can be exceedingly time-consuming and often exceeds human capabilities. Thus, there is a pressing need for a more intelligent approach to detect anomalies efficiently.</p>
<p>Setting up alerts may seem like an obvious solution, but relying solely on rule-based alerts with static thresholds can be problematic. What's normal on a Wednesday at 9:00 a.m. might be entirely different from a Sunday at 2:00 a.m. This often leads to complex and hard-to-maintain rules or wide alert ranges that end up missing crucial issues. Moreover, as your business, infrastructure, users, and products evolve, these fixed rules don't keep up, resulting in lots of false positives or, even worse, important issues slipping through the cracks without detection. A more intelligent and adaptable approach is needed to ensure accurate and timely anomaly detection.</p>
<p>Elastic's machine learning anomaly detection excels in such scenarios. It automatically models the normal behavior of your Prometheus data, learning trends, and identifying anomalies, thereby reducing false positives and improving mean time to resolution (MTTR). With over 13 years of development experience in this field, Elastic has emerged as a trusted industry leader.</p>
<p>The key advantage of Elastic's machine learning anomaly detection lies in its unsupervised learning approach. By continuously observing real-time data, it acquires an understanding of the data's behavior over time. This includes grasping daily and weekly patterns, enabling it to establish a normalcy range of expected behavior. Behind the scenes, it constructs statistical models that allow accurate predictions, promptly identifying any unexpected variations. In cases where emerging data exhibits unusual trends, you can seamlessly integrate with alerting systems, operationalizing this valuable insight.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-12-LPO.png" alt="12 - LPO" /></p>
<p>Machine learning's ability to project into the future, forecasting data trends one day, a week, or even a month ahead, equips engineers not only with reporting capabilities but also with pattern recognition and failure prediction based on historical Prometheus data. This plays a crucial role in maintaining mission-critical workloads, offering organizations a proactive monitoring approach. By foreseeing and addressing issues before they escalate, organizations can avert downtime, cut costs, optimize resource utilization, and ensure uninterrupted availability of their vital applications and services.</p>
<p><a href="https://www.elastic.co/guide/en/machine-learning/current/ml-ad-run-jobs.html#ml-ad-create-job">Creating a machine learning job</a> for your Prometheus data is a straightforward task with a few simple steps. Simply specify the data index and set the desired time range in the single metric view. The machine learning job will then automatically process the historical data, building statistical models behind the scenes. These models will enable the system to predict trends and identify anomalies effectively, providing valuable and actionable insights for your monitoring needs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/elastic-blog-13-creating-ML-job.png" alt="13 - create ML job" /></p>
<p>In essence, Elastic machine learning empowers us to harness the capabilities of data scientists and effectively apply them in monitoring Prometheus metrics. By seamlessly detecting anomalies and predicting potential issues in advance, Elastic machine learning bridges the gap and enables IT professionals to benefit from the insights derived from advanced data analysis. This practical and accessible approach to anomaly detection equips organizations with a proactive stance toward maintaining the reliability of their systems.</p>
<h2>Try it out</h2>
<p><a href="https://www.elastic.co/cloud/cloud-trial-overview">Start a free trial</a> on Elastic Cloud and <a href="https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-prometheus.html">ingest your Prometheus metrics into Elastic</a>. Enhance your Prometheus monitoring with Elastic Observability. Stay ahead of potential issues with advanced AI/ML anomaly detection and prediction capabilities. Eliminate data silos, reduce costs, and enhance overall response efficiency.</p>
<p>Elevate your monitoring capabilities with Elastic today!</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/ingesting-analyzing-prometheus-metrics-observability/illustration-machine-learning-anomaly-v2.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Dynamic workload discovery on Kubernetes now supported with EDOT Collector]]></title>
            <link>https://www.elastic.co/observability-labs/blog/k8s-discovery-with-EDOT-collector</link>
            <guid isPermaLink="false">k8s-discovery-with-EDOT-collector</guid>
            <pubDate>Tue, 01 Apr 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Discover how Elastic's OpenTelemetry Collector leverages Kubernetes pod annotations providing dynamic workload discovery and improves automated metric and log collection for Kubernetes clusters.]]></description>
            <content:encoded><![CDATA[<p>At Elastic, Kubernetes is one of the most significant observability use cases we focus on.
We want to provide the best onboarding experience and lifecycle management based on real-world GitOps best practices.</p>
<p>OpenTelemetry recently <a href="https://opentelemetry.io/blog/2025/otel-collector-k8s-discovery/">published a blog</a> on how to do <code>Autodiscovery based on Kubernetes Pods' annotations</code> with the OpenTelemetry Collector.</p>
<p>In this blog post, we will talk about how to use this Kubernetes-related feature of the OpenTelemetry Collector,
which is already available with the Elastic Distribution of OpenTelemetry (EDOT) Collector.</p>
<p>In addition to this feature, at Elastic we invest heavily in making OpenTelemetry the best standardized ingest solution for observability.
You might already have seen us focusing on:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/blog/ecs-elastic-common-schema-otel-opentelemetry-announcement">Semantic Conventions standardization</a></p>
</li>
<li>
<p>significant <a href="https://www.elastic.co/observability-labs/blog/elastics-collaboration-opentelemetry-filelog-receiver">log collection improvements</a></p>
</li>
<li>
<p>various other topics around <a href="https://www.elastic.co/observability-labs/blog/auto-instrumentation-go-applications-opentelemetry">instrumentation</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-donation-proposal-to-contribute-profiling-agent-to-opentelemetry">profiling</a></p>
</li>
</ul>
<p>Let's walk through a hands-on journey using the EDOT Collector, covering various use cases you might encounter in the real world and highlighting the capabilities of this powerful feature.</p>
<h2>Configuring EDOT Collector</h2>
<p>The Collector’s configuration is not our main focus here: by the nature of this feature, it stays minimal,
letting the workloads themselves define how they should be monitored.</p>
<p>To illustrate the point, here is the Collector configuration snippet that enables the feature for both logs and metrics:</p>
<pre><code class="language-yaml">receivers:
    receiver_creator/metrics:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:

    receiver_creator/logs:
      watch_observers: [k8s_observer]
      discovery:
        enabled: true
      receivers:
</code></pre>
<p>You can include the above in the EDOT’s Collector configuration, specifically the
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L339">receivers’ section</a>.</p>
<p>Since log collection in our examples will happen through the discovery feature, make sure that the static filelog receiver
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L348">configuration block</a> is removed
and its <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L193"><code>preset</code></a>
is disabled (i.e., set to <code>false</code>) to avoid log duplication.</p>
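<p>As a sketch, disabling the preset in the chart's <code>values.yaml</code> might look like the following. Note that the exact key path depends on the chart version, so treat the key names here as an assumption and verify them against your own <code>values.yaml</code>:</p>
<pre><code class="language-yaml"># values.yaml (assumed layout; verify against your chart version)
collectors:
  daemon:
    presets:
      logsCollection:
        enabled: false  # prevent collecting the same logs twice alongside discovery
</code></pre>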
<p>Make sure that the receiver creator is properly added to both the
<a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L471">logs</a> pipeline
(in addition to removing the <code>filelog</code> receiver completely)
and the <a href="https://github.com/elastic/elastic-agent/blob/v9.0.0-rc1/deploy/helm/edot-collector/kube-stack/values.yaml#L484">metrics</a> pipeline.</p>
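<p>Put together, the service pipelines might be wired roughly as follows. The processor and exporter names below are placeholders standing in for whatever your <code>values.yaml</code> already defines, not a prescription:</p>
<pre><code class="language-yaml">service:
  pipelines:
    logs:
      receivers: [receiver_creator/logs]   # filelog removed entirely
      processors: [batch]
      exporters: [elasticsearch]
    metrics:
      receivers: [receiver_creator/metrics]
      processors: [batch]
      exporters: [elasticsearch]
</code></pre>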
<p>Ensure that <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.122.0/extension/observer/k8sobserver/README.md"><code>k8sobserver</code></a>
is enabled as part of the extensions:</p>
<pre><code class="language-yaml">extensions:
  k8s_observer:
    observe_nodes: true
    observe_services: true
    observe_ingresses: true

# ...

service:
  extensions: [k8s_observer]
</code></pre>
<p>Last but not least, ensure the log files' volume is mounted properly:</p>
<pre><code class="language-yaml">volumeMounts:
 - name: varlogpods
   mountPath: /var/log/pods
   readOnly: true

volumes:
  - name: varlogpods
    hostPath:
      path: /var/log/pods
</code></pre>
<p>Once the configuration is ready, follow the <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">Kubernetes quickstart guides on how to deploy the EDOT Collector</a>.
Make sure to replace the <code>values.yaml</code> file linked in the quickstart guide with the file that includes the above-described modifications.</p>
<h3>Collecting Metrics from Moving Targets Based on Their Annotations</h3>
<p>In this example, we have a Deployment with a Pod spec that consists of two different containers.
One container runs a Redis server, while the other runs an NGINX server. Consequently, we want to provide
different hints for each of these target containers.</p>
<p>The annotation-based discovery feature supports this, allowing us to specify metrics annotations
per exposed container port.</p>
<p>Here is how the complete spec file looks:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
data:
  nginx.conf: |
    user  nginx;
    worker_processes  1;
    error_log  /dev/stderr warn;
    pid        /var/run/nginx.pid;
    events {
      worker_connections  1024;
    }
    http {
      include       /etc/nginx/mime.types;
      default_type  application/octet-stream;

      log_format  main  '$remote_addr - $remote_user [$time_local] &quot;$request&quot; '
                        '$status $body_bytes_sent &quot;$http_referer&quot; '
                        '&quot;$http_user_agent&quot; &quot;$http_x_forwarded_for&quot;';
      access_log  /dev/stdout main;
      server {
          listen 80;
          server_name localhost;

          location /nginx_status {
              stub_status on;
          }
      }
      include /etc/nginx/conf.d/*;
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        # redis container port hints
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        # nginx container port hints
        io.opentelemetry.discovery.metrics.80/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.80/scraper: nginx
        io.opentelemetry.discovery.metrics.80/config: |
          endpoint: &quot;http://`endpoint`/nginx_status&quot;
          collection_interval: &quot;30s&quot;
          timeout: &quot;20s&quot;
    spec:
      volumes:
      - name: nginx-conf
        configMap:
          name: nginx-conf
          items:
            - key: nginx.conf
              path: nginx.conf
      containers:
        - name: webserver
          image: nginx:latest
          ports:
            - containerPort: 80
              name: webserver
          volumeMounts:
            - mountPath: /etc/nginx/nginx.conf
              readOnly: true
              subPath: nginx.conf
              name: nginx-conf
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
</code></pre>
<p>When this workload is deployed, the Collector will automatically discover it and identify the specific annotations.
After this, two different receivers will be started, each responsible for one of the target containers.</p>
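<p>For illustration only, these two dynamically started receivers behave roughly as if the following static configuration had been written by hand, with <code>&lt;pod-ip&gt;</code> filled in by the observer at runtime (a sketch, not literal generated output):</p>
<pre><code class="language-yaml">receivers:
  redis:
    endpoint: &quot;&lt;pod-ip&gt;:6379&quot;
    collection_interval: 20s
    timeout: 10s
  nginx:
    endpoint: &quot;http://&lt;pod-ip&gt;:80/nginx_status&quot;
    collection_interval: 30s
    timeout: 20s
</code></pre>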
<h3>Collecting Logs from Multiple Target Containers</h3>
<p>The annotation-based discovery feature also supports log collection based on the provided annotations.
In the example below, we again have a Deployment with a Pod consisting of two different containers,
where we want to apply different log collection configurations.
We can specify annotations that are scoped to individual container names:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-logs-deployment
  labels:
    app: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
      annotations:
        io.opentelemetry.discovery.logs.lazybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.lazybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-lazybox
        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints-busybox
    spec:
      containers:
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from busybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 5s; done
        - name: lazybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs from lazybox at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 25s; done
</code></pre>
<p>The above configuration enables two different filelog receiver instances, each applying a unique parsing configuration.
This is handy when we know how to parse the logs of a specific technology, such as Apache server access logs.</p>
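<p>For instance, a container named <code>apache</code> (a hypothetical name for this sketch) could have its access logs parsed with the filelog receiver's <code>regex_parser</code> operator. The pattern below is a simplified sketch of the combined log format, not a production-ready one:</p>
<pre><code class="language-yaml">annotations:
  io.opentelemetry.discovery.logs.apache/enabled: &quot;true&quot;
  io.opentelemetry.discovery.logs.apache/config: |
    operators:
      - id: container-parser
        type: container
      # simplified combined-log pattern; extend for your real format
      - type: regex_parser
        regex: '^(?P&lt;client&gt;\S+) \S+ \S+ \[(?P&lt;timestamp&gt;[^\]]+)\] &quot;(?P&lt;request&gt;[^&quot;]*)&quot; (?P&lt;status&gt;\d+)'
</code></pre>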
<h3>Combining Both Metrics and Logs Collection</h3>
<p>In our third example, we illustrate how to define both metrics and log annotations on the same workload.
This allows us to collect both signals from the discovered workload.
Below is a Deployment with a Pod consisting of a Redis server and a BusyBox container that performs dummy log writing.
We can target annotations to the port and container levels to collect metrics from the Redis server using
the Redis receiver, and logs from the BusyBox using the filelog receiver. Here’s how:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-deployment
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        io.opentelemetry.discovery.metrics.6379/enabled: &quot;true&quot;
        io.opentelemetry.discovery.metrics.6379/scraper: redis
        io.opentelemetry.discovery.metrics.6379/config: |
          collection_interval: &quot;20s&quot;
          timeout: &quot;10s&quot;

        io.opentelemetry.discovery.logs.busybox/enabled: &quot;true&quot;
        io.opentelemetry.discovery.logs.busybox/config: |
          operators:
            - id: container-parser
              type: container
            - id: some
              type: add
              field: attributes.tag
              value: hints
    spec:
      containers:
        - image: redis
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - name: redis
              containerPort: 6379
              protocol: TCP
        - name: busybox
          image: busybox
          args:
            - /bin/sh
            - -c
            - while true; do echo &quot;otel logs at $(date +%H:%M:%S)&quot; &amp;&amp; sleep 15s; done
</code></pre>
<h3>Exploring and Analyzing Data from Dynamic Targets in Elastic</h3>
<p>Once the target Pods are discovered and the Collector has started collecting telemetry data from them,
we can then explore this data in Elastic. In Discover we can search for Redis and NGINX metrics as well as
logs collected from the BusyBox container. Here is what it looks like:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discoverlogs.png" alt="Logs Discovery" />
<img src="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/discovermetrics.png" alt="Metrics Discovery" /></p>
<h2>Summary</h2>
<p>The examples above showcase how users of our OpenTelemetry Collector can take advantage of this new feature
— one we played a major role in developing.</p>
<p>For this, we leveraged our years of experience with similar features already supported in
<a href="https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover-hints.html">Metricbeat</a>,
<a href="https://www.elastic.co/guide/en/beats/filebeat/current/configuration-autodiscover-hints.html">Filebeat</a>, and
<a href="https://www.elastic.co/guide/en/fleet/current/hints-annotations-autodiscovery.html">Elastic-Agent</a>.
This makes us extremely happy and confident, as it closes the feature gap between Elastic's specific
monitoring agents and the OpenTelemetry Collector — making it even better.</p>
<p>Interested in learning more? Visit the
<a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/receivercreator/README.md#generate-receiver-configurations-from-provided-hints">documentation</a>
and give it a try by following our <a href="https://www.elastic.co/docs/reference/opentelemetry/quickstart/">EDOT quickstart guide</a>.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/k8s-discovery-with-EDOT-collector/k8s-discovery-new.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Managing your Kubernetes cluster with Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-cluster-metrics-logs-monitoring</link>
            <guid isPermaLink="false">kubernetes-cluster-metrics-logs-monitoring</guid>
            <pubDate>Mon, 24 Oct 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[Unify all of your Kubernetes metrics, log, and trace data on a single platform and dashboard, Elastic. From the infrastructure to the application layer Elastic Observability makes it easier for you to understand how your cluster is performing.]]></description>
            <content:encoded><![CDATA[<p>As an operations engineer (SRE, IT manager, DevOps), you’re always struggling with how to manage technology and data sprawl. Kubernetes is becoming increasingly pervasive and a majority of these deployments will be in Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), or Azure Kubernetes Service (AKS). Some of you may be on a single cloud while others will have the added burden of managing clusters on multiple Kubernetes cloud services. In addition to cloud provider complexity, you also have to manage hundreds of deployed services generating more and more observability and telemetry data.</p>
<p>The day-to-day operations of understanding the status and health of your Kubernetes clusters and applications running on them, through the logs, metrics, and traces they generate, will likely be your biggest challenge. But as an operations engineer you will need all of that important data to help prevent, predict, and remediate issues. And you certainly don’t need that volume of metrics, logs and traces spread across multiple tools when you need to visualize and analyze Kubernetes telemetry data for troubleshooting and support.</p>
<p>Elastic Observability helps manage the sprawl of Kubernetes metrics and logs by providing extensive and centralized observability capabilities beyond just the logging that we are known for. Elastic Observability provides you with granular insights and context into the behavior of your Kubernetes clusters along with the applications running on them by unifying all of your metrics, log, and trace data through OpenTelemetry and APM agents.</p>
<p>Regardless of the cluster location (EKS, GKE, AKS, self-managed) or application, <a href="https://www.elastic.co/what-is/kubernetes-monitoring">Kubernetes monitoring</a> is made simple with Elastic Observability. All of the node, pod, container, application, and infrastructure (AWS, GCP, Azure) metrics, infrastructure and application logs, along with application traces are available in Elastic Observability.</p>
<p>In this blog we will show:</p>
<ul>
<li>How <a href="http://cloud.elastic.co">Elastic Cloud</a> can aggregate and ingest metrics and log data through the Elastic Agent (easily deployed on your cluster as a DaemonSet) to retrieve logs and metrics from the host (system metrics, container stats) along with logs from all services running on top of Kubernetes.</li>
<li>How Elastic Observability can bring a unified telemetry experience (logs, metrics, traces) across all your Kubernetes cluster components (pods, nodes, services, namespaces, and more).</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-ElasticAgentIntegration-1.png" alt="Elastic Agent with Kubernetes Integration" /></p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up this demonstration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>While we used GKE, you can use any location for your Kubernetes cluster.</li>
<li>We used a variant of the ever so popular <a href="https://github.com/GoogleCloudPlatform/microservices-demo">HipsterShop</a> demo application. It was originally written by Google to showcase Kubernetes across a multitude of variants available such as the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry Demo App</a>. To use the app, please go <a href="https://github.com/bshetti/opentelemetry-microservices-demo/tree/main/deploy-with-collector-k8s">here</a> and follow the instructions to deploy. You don’t need to deploy otelcollector for Kubernetes metrics to flow — we will cover this below.</li>
<li>Elastic supports native ingest from Prometheus and Fluentd, but in this blog, we are showing direct ingest from the Kubernetes cluster via Elastic Agent. There will be a follow-up blog showing how Elastic can also pull in telemetry from Prometheus or Fluentd/Fluent Bit.</li>
</ul>
<h2>What can you observe and analyze with Elastic?</h2>
<p>Before we walk through the steps on getting Elastic set up to ingest and visualize Kubernetes cluster metrics and logs, let’s take a sneak peek at Elastic’s helpful dashboards.</p>
<p>As we noted, we ran a variant of HipsterShop on GKE and deployed Elastic Agents with Kubernetes integration as a DaemonSet on the GKE cluster. Upon deployment of the agents, Elastic starts ingesting metrics from the Kubernetes cluster (specifically from kube-state-metrics) and additionally Elastic will pull all log information from the cluster.</p>
<h3>Visualizing Kubernetes metrics on Elastic Observability</h3>
<p>Here are a few Kubernetes dashboards that will be available out of the box (OOTB) on Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopMetrics-2.png" alt="HipsterShop cluster metrics on Elastic Kubernetes overview dashboard " /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopDashboard-3.png" alt="HipsterShop default namespace pod dashboard on Elastic Observability" /></p>
<p>In addition to the cluster overview dashboard and pod dashboard, Elastic has several useful OOTB dashboards:</p>
<ul>
<li>Kubernetes overview dashboard (see above)</li>
<li>Kubernetes pod dashboard (see above)</li>
<li>Kubernetes nodes dashboard</li>
<li>Kubernetes deployments dashboard</li>
<li>Kubernetes DaemonSets dashboard</li>
<li>Kubernetes StatefulSets dashboards</li>
<li>Kubernetes CronJob &amp; Jobs dashboards</li>
<li>Kubernetes services dashboards</li>
<li>More being added regularly</li>
</ul>
<p>Additionally, you can either customize these dashboards or build out your own.</p>
<h3>Working with logs on Elastic Observability</h3>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-Logging-4.png" alt="Kubernetes container logs and Elastic Agent logs" /></p>
<p>As you can see from the screens above, not only can I get Kubernetes cluster metrics, but also all the Kubernetes logs simply by using the Elastic Agent in my Kubernetes cluster.</p>
<h3>Prevent, predict, and remediate issues</h3>
<p>In addition to helping manage metrics and logs, Elastic can help you detect and predict anomalies across your cluster telemetry. Simply turn on Machine Learning in Elastic against your data and watch it help you enhance your analysis work. As you can see below, Elastic is not only a unified observability location for your Kubernetes cluster logs and metrics, but it also provides extensive machine learning capabilities to enhance your analysis and management.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-AnomalyDetection-5.png" alt="Anomaly detection across logs on Elastic Observability" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-PodIssues-6.png" alt="Analyzing issues on a Kubernetes pod with Elastic Observability " /></p>
<p>In the top graph, you see anomaly detection across logs, which shows something potentially wrong in the September 21 to 23 time period. The bottom chart digs into the details, analyzing the single kubernetes.pod.cpu.usage.node metric, which shows CPU issues early in September and again later in the month. You can do more complicated analyses on your cluster telemetry with Machine Learning using multi-metric analysis (versus the single-metric analysis I am showing above) along with population analysis.</p>
<p>Elastic gives you better machine learning capabilities to enhance your analysis of Kubernetes cluster telemetry. In the next section, let’s walk through how easy it is to get your telemetry data into Elastic.</p>
<h2>Setting it all up</h2>
<p>Let’s walk through the details of how to get metrics, logs, and traces into Elastic from a HipsterShop application deployed on GKE.</p>
<p>First, pick your favorite version of HipsterShop — as we noted above, we used a variant of the <a href="https://github.com/open-telemetry/opentelemetry-demo">OpenTelemetry-Demo</a> because it already has OpenTelemetry instrumentation built in. We slimmed it down for this blog, however, to fewer services across a few different languages.</p>
<h3>Step 0: Get an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-FreeElasticCloud-7.png" alt="" /></p>
<h3>Step 1: Get a Kubernetes cluster and load your Kubernetes app into your cluster</h3>
<p>Get your app on a Kubernetes cluster in your Cloud service of choice or local Kubernetes platform. Once your app is up on Kubernetes, you should have the following pods (or some variant) running on the default namespace.</p>
<pre><code class="language-yaml">NAME                                    READY   STATUS    RESTARTS   AGE
adservice-8694798b7b-jbfxt              1/1     Running   0          4d3h
cartservice-67b598697c-hfsxv            1/1     Running   0          4d3h
checkoutservice-994ddc4c4-p9p2s         1/1     Running   0          4d3h
currencyservice-574f65d7f8-zc4bn        1/1     Running   0          4d3h
emailservice-6db78645b5-ppmdk           1/1     Running   0          4d3h
frontend-5778bfc56d-jjfxg               1/1     Running   0          4d3h
jaeger-686c775fbd-7d45d                 1/1     Running   0          4d3h
loadgenerator-c8f76d8db-gvrp7           1/1     Running   0          4d3h
otelcollector-5b87f4f484-4wbwn          1/1     Running   0          4d3h
paymentservice-6888bb469c-nblqj         1/1     Running   0          4d3h
productcatalogservice-66478c4b4-ff5qm   1/1     Running   0          4d3h
recommendationservice-648978746-8bzxc   1/1     Running   0          4d3h
redis-cart-96d48485f-gpgxd              1/1     Running   0          4d3h
shippingservice-67fddb767f-cq97d        1/1     Running   0          4d3h
</code></pre>
<h3>Step 2: Turn on kube-state-metrics</h3>
<p>Next you will need to turn on <a href="https://github.com/kubernetes/kube-state-metrics">kube-state-metrics</a>.</p>
<p>First:</p>
<pre><code class="language-bash">git clone https://github.com/kubernetes/kube-state-metrics.git
</code></pre>
<p>Next, from the examples directory inside the kube-state-metrics directory, just apply the standard config.</p>
<pre><code class="language-bash">kubectl apply -f ./standard
</code></pre>
<p>This will turn on kube-state-metrics, and you should see a pod similar to this running in kube-system namespace.</p>
<pre><code class="language-yaml">kube-state-metrics-5f9dc77c66-qjprz                    1/1     Running   0          4d4h
</code></pre>
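<p>To verify that kube-state-metrics is actually serving data before wiring up the agent, you can port-forward the service and pull a few metrics. This sketch assumes the default service name and port (kube-state-metrics on 8080) from the standard manifests; adjust if you customized them.</p>
<pre><code class="language-bash"># Forward the kube-state-metrics service to localhost (default port 8080)
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &amp;

# Fetch the first few Prometheus-format metrics to confirm the endpoint is live
curl -s http://localhost:8080/metrics | head -n 5
</code></pre>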
<h3>Step 3: Install the Elastic Agent with Kubernetes integration</h3>
<p><strong>Add Kubernetes Integration:</strong></p>
<ol>
<li>In Elastic, go to Integrations, select the Kubernetes integration, and click Add Kubernetes.</li>
<li>Give the Kubernetes integration a name.</li>
<li>Turn on kube-state-metrics in the configuration screen.</li>
<li>Give the configuration a name in the new-agent-policy-name text box.</li>
<li>Save the configuration. The integration, along with its policy, is now created.</li>
</ol>
<p><img src="https://images.contentstack.io/v3/assets/bltefdd0b53724fa2ce/blt5a3ae745e98b9e37/635691670a58db35cbdbc0f6/ManagingKubernetes-Addk8sButton-8.png" alt="" /></p>
<p>You can read up on agent policies and how the Elastic Agent uses them <a href="https://www.elastic.co/guide/en/fleet/current/agent-policy.html">here</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-K8sIntegration-9.png" alt="" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-FleetManagement-10.png" alt="" /></p>
<ol>
<li>In the Add Agent flow, add the Kubernetes integration.</li>
<li>In the second step, select the policy you just created.</li>
<li>In the third step of the Add Agent instructions, copy and paste or download the manifest.</li>
<li>In the shell where you have kubectl running, save the manifest as elastic-agent-managed-kubernetes.yaml and run the following command.</li>
</ol>
<pre><code class="language-bash">kubectl apply -f elastic-agent-managed-kubernetes.yaml
</code></pre>
<p>You should see a number of agents come up as part of a DaemonSet in the kube-system namespace.</p>
<pre><code class="language-yaml">NAME                                                   READY   STATUS    RESTARTS   AGE
elastic-agent-qr6hj                                    1/1     Running   0          4d7h
elastic-agent-sctmz                                    1/1     Running   0          4d7h
elastic-agent-x6zkw                                    1/1     Running   0          4d7h
elastic-agent-zc64h                                    1/1     Running   0          4d7h
</code></pre>
<p>In my cluster, I have four nodes and four elastic-agents started as part of the DaemonSet.</p>
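<p>As a quick sanity check (assuming the manifest's default DaemonSet name, elastic-agent), you can confirm that one agent pod is scheduled per node and that the rollout has finished:</p>
<pre><code class="language-bash"># DESIRED, CURRENT, and READY should all equal your node count
kubectl -n kube-system get daemonset elastic-agent

# Block until every agent pod is up, or time out after two minutes
kubectl -n kube-system rollout status daemonset/elastic-agent --timeout=120s
</code></pre>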
<h3>Step 4: Look at Elastic's out-of-the-box (OOTB) dashboards for Kubernetes metrics and start discovering Kubernetes logs</h3>
<p>That is it. You should see metrics flowing into all the dashboards. To view logs for specific pods, simply go into Discover in Kibana and search for a specific pod name.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopMetrics-2.png" alt="HipsterShop cluster metrics on Elastic Kubernetes overview dashboard" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-HipsterShopDashboard-3.png" alt="Hipstershop default namespace pod dashboard on Elastic Observability" /></p>
<p>Additionally, you can browse all the pod logs directly in Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKurbenetes-PodLogs-11.png" alt="frontendService and cartService logs" /></p>
<p>In the above example, I searched for frontendService and cartService logs.</p>
<h3>Step 5: Bonus!</h3>
<p>Because we are using an OTel-based application, Elastic can even pull in the application traces. But that is a discussion for another blog.</p>
<p>Here is a quick peek at what Hipster Shop’s traces for a front end transaction look like in Elastic Observability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-CheckOutTransaction-12.png" alt="Trace for Checkout transaction for HipsterShop" /></p>
<h2>Conclusion: Elastic Observability rocks for Kubernetes monitoring</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you manage Kubernetes clusters along with the complexity of the metrics, log, and trace data it generates for even a simple deployment.</p>
<p>A quick recap of lessons learned:</p>
<ul>
<li>How <a href="http://cloud.elastic.co">Elastic Cloud</a> can aggregate and ingest telemetry data through the Elastic Agent, which is easily deployed on your cluster as a DaemonSet and retrieves metrics from the host, such as system metrics, container stats, and metrics from all services running on top of Kubernetes.</li>
<li>What Elastic brings to a unified telemetry experience (Kubernetes logs, metrics, and traces) across all your Kubernetes cluster components (pods, nodes, services, namespaces, and more).</li>
<li>How Elastic’s ML capabilities can reduce your <strong>MTTHH</strong> (mean time to happy hour).</li>
</ul>
<p>Ready to get started? <a href="https://cloud.elastic.co/registration">Register</a> and try out the features and capabilities I’ve outlined above.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-cluster-metrics-logs-monitoring/ManagingKubernetes-ElasticAgentIntegration-1.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Gain insights into Kubernetes errors with Elastic Observability logs and OpenAI]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-errors-observability-logs-openai</link>
            <guid isPermaLink="false">kubernetes-errors-observability-logs-openai</guid>
            <pubDate>Thu, 18 May 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[This blog post provides an example of how one can analyze error messages in Elasticsearch with ChatGPT using the OpenAI API via Elasticsearch.]]></description>
<content:encoded><![CDATA[<p>As we’ve shown in previous blogs, Elastic<sup>®</sup> provides a way to ingest and manage telemetry from the <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Kubernetes cluster</a> and the <a href="https://www.elastic.co/blog/opentelemetry-observability">application</a> running on it. Elastic provides out-of-the-box dashboards to help with tracking metrics, <a href="https://www.elastic.co/blog/log-management-observability-operations">log management and analytics</a>, <a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">APM functionality</a> (which also supports <a href="https://www.elastic.co/blog/opentelemetry-observability">native OpenTelemetry</a>), and the ability to analyze everything with <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps features</a> and <a href="https://www.elastic.co/what-is/elasticsearch-machine-learning?elektra=home">machine learning</a> (ML). While you can use pre-existing <a href="https://www.elastic.co/blog/improving-information-retrieval-elastic-stack-search-relevance">ML models in Elastic</a>, <a href="https://www.elastic.co/blog/aiops-automation-analytics-elastic-observability-use-cases">out-of-the-box AIOps features</a>, or your own ML models, there is a need to dig deeper into the root cause of an issue.</p>
<p>Elastic helps reduce the operational work to support more efficient operations, but users still need a way to investigate and understand everything from the cause of an issue to the meaning of specific error messages. As an operations user, if you haven’t run into a particular error before or it's part of some runbook, you will likely go to Google and start searching for information.</p>
<p>OpenAI’s ChatGPT is becoming an interesting generative AI tool that helps provide more information using the models behind it. What if you could use OpenAI to obtain deeper insights (even simple semantics) for an error in your production or development environment? You can easily tie Elastic to OpenAI’s API to achieve this.</p>
<p>Kubernetes, a mainstay in most deployments (on-prem or with a cloud service provider), requires a significant amount of expertise — even if that expertise is to manage a service like GKE, EKS, or AKS.</p>
<p>In this blog, I will cover how you can use <a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">Elastic’s watcher</a> capability to connect Elastic to OpenAI and ask it for more information about the error logs Elastic is ingesting from a Kubernetes cluster(s). More specifically, we will use <a href="https://azure.microsoft.com/en-us/products/cognitive-services/openai-service">Azure’s OpenAI Service</a>. Azure OpenAI is a partnership between Microsoft and OpenAI, so the same models from OpenAI are available in the Microsoft version.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-azure-openai.png" alt="elastic azure openai" /></p>
<p>While this blog goes over a specific example, it can be modified for other types of errors Elastic receives in logs. Whether it's from AWS, the application, databases, etc., the configuration and script described in this blog can be modified easily.</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>We used a GCP GKE Kubernetes cluster, but you can use any Kubernetes cluster service (on-prem or cloud based) of your choice.</li>
<li>We’re also running with a version of the OpenTelemetry Demo. Directions for using Elastic with OpenTelemetry Demo are <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>We also have an Azure account and <a href="https://azure.microsoft.com/en-us/products/cognitive-services/openai-service">Azure OpenAI service configured</a>. You will need to get the appropriate tokens from Azure and the proper URL endpoint from Azure’s OpenAI service.</li>
<li>We will use <a href="https://www.elastic.co/guide/en/kibana/current/devtools-kibana.html">Elastic’s dev tools</a>, the console to be specific, to load up and run the script, which is an <a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">Elastic watcher</a>.</li>
<li>We will also add a new index to store the results from the OpenAI query.</li>
</ul>
<p>Here is the configuration we will set up in this blog:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-configuration.png" alt="Configuration to analyze Kubernetes cluster errors" /></p>
<p>As we walk through the setup, we’ll also provide the alternative setup with OpenAI versus Azure OpenAI Service.</p>
<h2>Setting it all up</h2>
<p>Over the next few steps, I’ll walk through:</p>
<ul>
<li>Getting an account on Elastic Cloud and setting up your K8S cluster and application</li>
<li>Gaining Azure OpenAI authorization (alternative option with OpenAI)</li>
<li>Identifying Kubernetes error logs</li>
<li>Configuring the watcher with the right script</li>
<li>Comparing the output from Azure OpenAI/OpenAI versus ChatGPT UI</li>
</ul>
<h3>Step 0: Create an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-start-cloud-trial.png" alt="elastic start cloud trial" /></p>
<p>Once you have the Elastic Cloud login, set up your Kubernetes cluster and application. A complete step-by-step instructions blog is available <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">here</a>. This also provides an overview of how to see Kubernetes cluster metrics in Elastic and how to monitor them with dashboards.</p>
<h3>Step 1: Azure OpenAI Service and authorization</h3>
<p>When you log in to your Azure subscription and set up an instance of Azure OpenAI Service, you will be able to get your keys under Manage Keys.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-microsoft-azure-manage-keys.png" alt="microsoft azure manage keys" /></p>
<p>There are two keys for your OpenAI instance, but you only need KEY 1.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-pme-openai-keys-and-endpoint.png" alt="Used with permission from Microsoft." /></p>
<p>Additionally, you will need to get the service URL. See the image above with our service URL blanked out to understand where to get the KEY 1 and URL.</p>
<p>If you are using the standard OpenAI service rather than Azure OpenAI Service, you can get your keys at:</p>
<pre><code class="language-bash">https://platform.openai.com/account/api-keys
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-api-keys.png" alt="api keys" /></p>
<p>You will need to create a key and save it. Once you have the key, you can go to Step 2.</p>
<h3>Step 2: Identifying Kubernetes errors in Elastic logs</h3>
<p>As your Kubernetes cluster is running, <a href="https://docs.elastic.co/en/integrations/kubernetes">Elastic’s Kubernetes integration</a> running on the Elastic agent daemon set on your cluster is sending logs and metrics to Elastic. <a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">The telemetry is ingested, processed, and indexed</a>. Kubernetes logs are stored in an index called .ds-logs-kubernetes.container_logs-default-* (* is for the date), and an automatic data stream logs-kubernetes.container_logs is also pre-loaded. So while you can use some of the out-of-the-box dashboards to investigate the metrics, you can also look at all the logs in Elastic Discover.</p>
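<p>If you want to confirm from Dev Tools that container logs are arriving before building anything on top of them, a minimal search against the data stream looks like this (run in the Kibana console; the field name assumes the standard Kubernetes integration mappings):</p>
<pre><code class="language-bash">GET logs-kubernetes.container_logs/_search
{
  &quot;size&quot;: 1,
  &quot;sort&quot;: [ { &quot;@timestamp&quot;: &quot;desc&quot; } ],
  &quot;query&quot;: { &quot;exists&quot;: { &quot;field&quot;: &quot;kubernetes.pod.name&quot; } }
}
</code></pre>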
<p>While any error from Kubernetes can be daunting, the more nuanced issues occur with errors from the pods running in the kube-system namespace. Take the konnectivity-agent pod: it is essentially a network proxy agent running on the node to help establish tunnels, and it is a vital component in Kubernetes. Any error will cause the cluster to have connectivity issues and lead to a cascade of problems, so it’s important to understand and troubleshoot these errors.</p>
<p>When we filter out for error logs from the konnectivity agent, we see a good number of errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-expanded-document.png" alt="expanded document" /></p>
<p>But unfortunately, we still can’t understand what these errors mean.</p>
<p>Enter OpenAI to help us understand the issue better. Generally, you would take the error message from Discover and paste it with a question in ChatGPT (or run a Google search on the message).</p>
<p>One error in particular that we’ve run into but do not understand is:</p>
<pre><code class="language-bash">E0510 02:51:47.138292       1 client.go:388] could not read stream err=rpc error: code = Unavailable desc = error reading from server: read tcp 10.120.0.8:46156-&gt;35.230.74.219:8132: read: connection timed out serverID=632d489f-9306-4851-b96b-9204b48f5587 agentID=e305f823-5b03-47d3-a898-70031d9f4768
</code></pre>
<p>The OpenAI output is as follows:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-openai-output.png" alt="openai output" /></p>
<p>ChatGPT has given us a fairly nice set of ideas on why this rpc error is occurring against our konnectivity-agent.</p>
<p>So how can we get this output automatically for any error when those errors occur?</p>
<h3>Step 3: Configuring the watcher with the right script</h3>
<p><a href="https://www.elastic.co/guide/en/kibana/current/watcher-ui.html">What is an Elastic watcher?</a> Watcher is an Elasticsearch feature that you can use to create actions based on conditions, which are periodically evaluated using queries on your data. Watchers are helpful for analyzing mission-critical and business-critical streaming data. For example, you might watch application logs for errors causing larger operational issues.</p>
<p>Once a watcher is configured, it can be:</p>
<ol>
<li>Manually triggered</li>
<li>Run periodically</li>
<li>Created using a UI or a script</li>
</ol>
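<p>For the first option, a watcher can be executed on demand from the Dev Tools console, which is handy while iterating on the script. This sketch assumes the watcher ID used later in this blog (chatgpt_analysis); setting ignore_condition forces the actions to run even when no matching error has been found yet:</p>
<pre><code class="language-bash">POST _watcher/watch/chatgpt_analysis/_execute
{
  &quot;ignore_condition&quot;: true
}
</code></pre>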
<p>In this scenario, we will use a script, as we can modify it easily and run it as needed.</p>
<p>We’re using the DevTools Console to enter the script and test it out:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-test-script.png" alt="test script" /></p>
<p>The script is listed at the end of the blog in the <strong>appendix</strong>. It can also be downloaded <a href="https://github.com/elastic/chatgpt-error-analysis"><strong>here</strong></a>.</p>
<p>The script does the following:</p>
<ol>
<li>It runs continuously every five minutes.</li>
<li>It will search the logs for errors from the container konnectivity-agent.</li>
<li>It will take the first error’s message, transform it (re-format and clean up), and place it into a variable first_hit.</li>
</ol>
<pre><code class="language-json">&quot;script&quot;: &quot;return ['first_hit': ctx.payload.first.hits.hits.0._source.message.replace('\&quot;', \&quot;\&quot;)]&quot;
</code></pre>
<ol start="4">
<li>The error message is sent into OpenAI with a query:</li>
</ol>
<pre><code class="language-yaml">What are the potential reasons for the following kubernetes error:
  {{ctx.payload.second.first_hit}}
</code></pre>
<ol start="5">
<li>If the search yielded an error, it will then create an index and place the error message, the pod.name (which is konnectivity-agent-6676d5695b-ccsmx in our setup), and the OpenAI output into a new index called chatgpt_k8s_analyzed.</li>
</ol>
<p>To see the results, we created a new data view called chatgpt_k8s_analyzed against the newly created index:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-edit-data-view.png" alt="edit data view" /></p>
<p>In Discover, the output on the data view provides us with the analysis of the errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-analysis-of-errors.png" alt="analysis of errors" /></p>
<p>For every error the script sees in the five-minute interval, it will get an analysis of the error. Alternatively, we could use a range to analyze errors during a specific time frame; the script would just need to be modified accordingly.</p>
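<p>As a sketch of that modification, the search body in the appendix script could gain a range clause so the watcher only analyzes errors from, say, the last hour (the surrounding watcher definition stays unchanged; the one-hour window is just an illustrative value):</p>
<pre><code class="language-bash">&quot;body&quot;: {
  &quot;query&quot;: {
    &quot;bool&quot;: {
      &quot;must&quot;: [
        { &quot;match&quot;: { &quot;kubernetes.container.name&quot;: &quot;konnectivity-agent&quot; } },
        { &quot;match&quot;: { &quot;message&quot;: &quot;error&quot; } },
        { &quot;range&quot;: { &quot;@timestamp&quot;: { &quot;gte&quot;: &quot;now-1h&quot; } } }
      ]
    }
  },
  &quot;size&quot;: &quot;1&quot;
}
</code></pre>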
<h3>Step 4. Output from Azure OpenAI/OpenAI vs. ChatGPT UI</h3>
<p>As you noticed above, we got roughly the same result from the Azure OpenAI API call as we did by testing our query in the ChatGPT UI. This is because we configured the API call to use the same model as what was selected in the UI.</p>
<p>For the API call, we used the following parameters:</p>
<pre><code class="language-json">&quot;request&quot;: {
             &quot;method&quot; : &quot;POST&quot;,
             &quot;url&quot;: &quot;https://XXX.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview&quot;,
             &quot;headers&quot;: {&quot;api-key&quot; : &quot;XXXXXXX&quot;,
                         &quot;content-type&quot; : &quot;application/json&quot;
                        },
             &quot;body&quot; : &quot;{ \&quot;messages\&quot;: [ { \&quot;role\&quot;: \&quot;system\&quot;, \&quot;content\&quot;: \&quot;You are a helpful assistant.\&quot;}, { \&quot;role\&quot;: \&quot;user\&quot;, \&quot;content\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;}], \&quot;temperature\&quot;: 0.5, \&quot;max_tokens\&quot;: 2048}&quot; ,
              &quot;connection_timeout&quot;: &quot;60s&quot;,
               &quot;read_timeout&quot;: &quot;60s&quot;
                            }
</code></pre>
<p>By setting the role: system message to You are a helpful assistant and using the gpt-35-turbo portion of the URL, we are setting the API to use the gpt-3.5-turbo model, which is the same model the ChatGPT UI uses by default.</p>
<p>Additionally, for Azure OpenAI Service, you will need to set the URL to something similar to the following:</p>
<pre><code class="language-bash">https://YOURSERVICENAME.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview
</code></pre>
<p>If you use OpenAI (versus Azure OpenAI Service), the request call (against <a href="https://api.openai.com/v1/completions">https://api.openai.com/v1/completions</a>) would be as such:</p>
<pre><code class="language-json">&quot;request&quot;: {
            &quot;scheme&quot;: &quot;https&quot;,
            &quot;host&quot;: &quot;api.openai.com&quot;,
            &quot;port&quot;: 443,
            &quot;method&quot;: &quot;post&quot;,
            &quot;path&quot;: &quot;\/v1\/completions&quot;,
            &quot;params&quot;: {},
            &quot;headers&quot;: {
               &quot;content-type&quot;: &quot;application\/json&quot;,
               &quot;authorization&quot;: &quot;Bearer YOUR_ACCESS_TOKEN&quot;
                        },
            &quot;body&quot;: &quot;{ \&quot;model\&quot;: \&quot;text-davinci-003\&quot;,  \&quot;prompt\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;,  \&quot;temperature\&quot;: 1,  \&quot;max_tokens\&quot;: 512,     \&quot;top_p\&quot;: 1.0,      \&quot;frequency_penalty\&quot;: 0.0,   \&quot;presence_penalty\&quot;: 0.0 }&quot;,
            &quot;connection_timeout_in_millis&quot;: 60000,
            &quot;read_timeout_millis&quot;: 60000
          }
</code></pre>
<p>If you are interested in creating a more OpenAI-based version, you can <a href="https://elastic-content-share.eu/downloads/watcher-job-to-integrate-chatgpt-in-elasticsearch/">download an alternative script</a> and look at <a href="https://mar1.hashnode.dev/unlocking-the-power-of-aiops-with-chatgpt-and-elasticsearch">another blog from an Elastic community member</a>.</p>
<h2>Gaining other insights beyond Kubernetes logs</h2>
<p>Now that the script is up and running, you can modify it using different:</p>
<ul>
<li>Inputs</li>
<li>Conditions</li>
<li>Actions</li>
<li>Transforms</li>
</ul>
<p>Learn more on how to modify it <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-alerting.html">here</a>. Some examples of modifications could include:</p>
<ol>
<li>Look for error logs from application components (e.g., cartService, frontEnd, from the OTel demo), cloud service providers (e.g., AWS/Azure/GCP logs), and even logs from components such as Kafka, databases, etc.</li>
<li>Vary the time frame from running continuously to running over a specific <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html">range</a>.</li>
<li>Look for specific errors in the logs.</li>
<li>Query for analysis on a set of errors at once versus just one, which we demonstrated.</li>
</ol>
<p>The modifications are endless, and of course you can run this with OpenAI rather than Azure OpenAI Service.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you connect to OpenAI services (Azure OpenAI, as we showed, or even OpenAI) to better analyze an error log message instead of having to run several Google searches and hunt for possible insights.</p>
<p>Here’s a quick recap of what we covered:</p>
<ul>
<li>Developing an Elastic watcher script that can be used to find and send Kubernetes errors into OpenAI and insert them into a new index</li>
<li>Configuring Azure OpenAI Service or OpenAI with the right authorization and request parameters</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your OpenTelemetry data.</p>
<h2>Appendix</h2>
<p>Watcher script</p>
<pre><code class="language-bash">PUT _watcher/watch/chatgpt_analysis
{
    &quot;trigger&quot;: {
      &quot;schedule&quot;: {
        &quot;interval&quot;: &quot;5m&quot;
      }
    },
    &quot;input&quot;: {
      &quot;chain&quot;: {
          &quot;inputs&quot;: [
              {
                  &quot;first&quot;: {
                      &quot;search&quot;: {
                          &quot;request&quot;: {
                              &quot;search_type&quot;: &quot;query_then_fetch&quot;,
                              &quot;indices&quot;: [
                                &quot;logs-kubernetes*&quot;
                              ],
                              &quot;rest_total_hits_as_int&quot;: true,
                              &quot;body&quot;: {
                                &quot;query&quot;: {
                                  &quot;bool&quot;: {
                                    &quot;must&quot;: [
                                      {
                                        &quot;match&quot;: {
                                          &quot;kubernetes.container.name&quot;: &quot;konnectivity-agent&quot;
                                        }
                                      },
                                      {
                                        &quot;match&quot; : {
                                          &quot;message&quot;:&quot;error&quot;
                                        }
                                      }
                                    ]
                                  }
                                },
                                &quot;size&quot;: &quot;1&quot;
                              }
                            }
                        }
                    }
                },
                {
                    &quot;second&quot;: {
                        &quot;transform&quot;: {
                            &quot;script&quot;: &quot;return ['first_hit': ctx.payload.first.hits.hits.0._source.message.replace('\&quot;', \&quot;\&quot;)]&quot;
                        }
                    }
                },
                {
                    &quot;third&quot;: {
                        &quot;http&quot;: {
                            &quot;request&quot;: {
                                &quot;method&quot; : &quot;POST&quot;,
                                &quot;url&quot;: &quot;https://XXX.openai.azure.com/openai/deployments/pme-gpt-35-turbo/chat/completions?api-version=2023-03-15-preview&quot;,
                                &quot;headers&quot;: {
                                    &quot;api-key&quot; : &quot;XXX&quot;,
                                    &quot;content-type&quot; : &quot;application/json&quot;
                                },
                                &quot;body&quot; : &quot;{ \&quot;messages\&quot;: [ { \&quot;role\&quot;: \&quot;system\&quot;, \&quot;content\&quot;: \&quot;You are a helpful assistant.\&quot;}, { \&quot;role\&quot;: \&quot;user\&quot;, \&quot;content\&quot;: \&quot;What are the potential reasons for the following kubernetes error: {{ctx.payload.second.first_hit}}\&quot;}], \&quot;temperature\&quot;: 0.5, \&quot;max_tokens\&quot;: 2048}&quot; ,
                                &quot;connection_timeout&quot;: &quot;60s&quot;,
                                &quot;read_timeout&quot;: &quot;60s&quot;
                            }
                        }
                    }
                }
            ]
        }
    },
    &quot;condition&quot;: {
      &quot;compare&quot;: {
        &quot;ctx.payload.first.hits.total&quot;: {
          &quot;gt&quot;: 0
        }
      }
    },
    &quot;actions&quot;: {
        &quot;index_payload&quot; : {
            &quot;transform&quot;: {
                &quot;script&quot;: {
                    &quot;source&quot;: &quot;&quot;&quot;
                        def payload = [:];
                        payload.timestamp = new Date();
                        payload.pod_name = ctx.payload.first.hits.hits[0]._source.kubernetes.pod.name;
                        payload.error_message = ctx.payload.second.first_hit;
                        payload.chatgpt_analysis = ctx.payload.third.choices[0].message.content;
                        return payload;
                    &quot;&quot;&quot;
                }
            },
            &quot;index&quot; : {
                &quot;index&quot; : &quot;chatgpt_k8s_analyzed&quot;
            }
        }
    }
}
</code></pre>
<h3>Additional logging resources:</h3>
<ul>
<li><a href="https://www.elastic.co/getting-started/observability/collect-and-analyze-logs">Getting started with logging on Elastic (quickstart)</a></li>
<li><a href="https://www.elastic.co/guide/en/observability/current/logs-metrics-get-started.html">Ingesting common known logs via integrations (compute node example)</a></li>
<li><a href="https://docs.elastic.co/integrations">List of integrations</a></li>
<li><a href="https://www.elastic.co/blog/log-monitoring-management-enterprise">Ingesting custom application logs into Elastic</a></li>
<li><a href="https://www.elastic.co/blog/observability-logs-parsing-schema-read-write">Enriching logs in Elastic</a></li>
<li>Analyzing Logs with <a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability">Anomaly Detection (ML)</a> and <a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops">AIOps</a></li>
</ul>
<h3>Common use case examples with logs:</h3>
<ul>
<li><a href="https://youtu.be/ax04ZFWqVCg">Nginx log management</a></li>
<li><a href="https://www.elastic.co/blog/vpc-flow-logs-monitoring-analytics-observability">AWS VPC Flow log management</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-errors-observability-logs-openai">Using OpenAI to analyze Kubernetes errors</a></li>
<li><a href="https://youtu.be/Li5TJAWbz8Q">PostgreSQL issue analysis with AIOps</a></li>
</ul>
<p><em>In this blog post, we may have used third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
<p><em>Screenshots of Microsoft products used with permission from Microsoft.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-errors-observability-logs-openai/blog-elastic-configuration.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Process Kubernetes logs with ease using Elastic Streams]]></title>
            <link>https://www.elastic.co/observability-labs/blog/kubernetes-logs-elastic-streams-processing</link>
            <guid isPermaLink="false">kubernetes-logs-elastic-streams-processing</guid>
            <pubDate>Thu, 12 Mar 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to process Kubernetes logs with Elastic Streams using conditional blocks, AI-generated Grok patterns, and selective drops to reduce noise and storage cost.]]></description>
<content:encoded><![CDATA[<p>Streams is a new AI capability within Elastic Observability. Built on the Elasticsearch platform, it's designed for Site Reliability Engineers (SREs) to use logs as the primary signal for investigations, enabling faster answers and quicker issue resolution. For decades, logs have been considered too noisy, expensive, and complex to manage, and many observability vendors have treated them as a second-class citizen. Streams flips this script by transforming raw logs into your most valuable asset, surfacing not only the root cause but also the why behind it to enable instant resolution.</p>
<p>Learn more in our previous article, <a href="https://www.elastic.co/observability-labs/blog/elastic-observability-streams-ai-logs-investigations">Introducing Streams</a>.</p>
<p>Many SREs deploy on cloud-native architectures, with Kubernetes as the de facto baseline deployment platform. Yet Kubernetes logs are messy by default: a single data stream often mixes access logs, JSON payloads, health checks, and internal service chatter.</p>
<p>Elastic Streams gives you a faster path. You can isolate subsets of logs with conditionals, use AI to generate Grok patterns from real samples, and drop documents you do not need before they add storage and query cost.</p>
<h2>Why Kubernetes logs get messy fast</h2>
<p>The default Kubernetes container logs stream can contain data from many services at once. In one sample, you might see:</p>
<ul>
<li>HTTP access logs from application pods</li>
<li>Verbose worker or batch job status logs</li>
<li>Platform and container lifecycle events with different formats</li>
</ul>
<p>This is why &quot;one global parsing rule&quot; will fail: you need targeted processing logic per log shape or application type. Historically, this kind of custom processing has been error-prone and time-consuming.</p>
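<p>To make the problem concrete, here is a minimal sketch (hypothetical routing logic, not part of Streams) of what per-log-shape processing means: each line in a mixed container log stream needs a different parser.</p>

```python
import json
import re

# Hypothetical access-log shape; real patterns depend on your workloads.
ACCESS_RE = re.compile(r'^(?P<ip>\d+\.\d+\.\d+\.\d+) .* "(?P<method>[A-Z]+) (?P<path>\S+)')

def route(line: str) -> dict:
    """Classify a raw container log line and parse it accordingly."""
    line = line.strip()
    if line.startswith("{"):            # JSON payloads from application pods
        return {"type": "json", "doc": json.loads(line)}
    m = ACCESS_RE.match(line)
    if m:                               # HTTP access logs
        return {"type": "access", "doc": m.groupdict()}
    return {"type": "raw", "doc": {"message": line}}   # everything else
```

<p>Streams lets you express this kind of routing declaratively with conditional blocks instead of maintaining custom code like this yourself.</p>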
<h2>What Streams Processing changes</h2>
<p>Streams Processing (available in 9.2 and later) moves this workflow into a live, interactive experience:</p>
<ul>
<li>You build conditions and processors in the UI</li>
<li>You validate each change against sample documents before saving</li>
<li>You can use AI to generate extraction patterns from selected logs</li>
</ul>
<p>The result is a safer way to iterate on parsing logic without guessing.</p>
<h2>Walkthrough: parse custom application logs</h2>
<p>We'll start from your Kubernetes stream (<code>logs-kubernetes.containers_logs-default</code>) and create a conditional block that scopes processing to one service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/01-conditional-filter-litellm.png" alt="Conditional block filtering Kubernetes logs for litellm before parsing in Elastic Streams" /></p>
<p>Once the condition is saved, the preview automatically filters the sample data to the subset of logs that match the condition, indicated by the blue highlight.</p>
<p>Inside that block, we'll add a Grok processor and click <strong>Generate pattern</strong>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/02-generate-pattern-button.png" alt="Generate pattern button in Elastic Streams using AI to process Kubernetes logs" /></p>
<p>This agentic process uses an LLM to generate a Grok pattern for parsing the logs. By default it uses the Elastic Inference Service, but you can configure it to use your own LLM.
Review the generated pattern and accept it once it validates against the sample set.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/03-accept-generated-grok.png" alt="Accepting AI-generated Grok pattern after matching selected Uvicorn logs" /></p>
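<p>For illustration, a generated pattern for Uvicorn-style access logs might look like the following, shown here as an equivalent Python regex (the actual Grok pattern Streams generates will differ based on your samples):</p>

```python
import re

# Hypothetical equivalent of an AI-generated Grok pattern for Uvicorn access
# logs, written as a Python regex; the real pattern varies with your sample set.
UVICORN = re.compile(
    r'^(?P<level>[A-Z]+):\s+'
    r'(?P<client_ip>[\d.]+):(?P<client_port>\d+) - '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})'
)

sample = 'INFO:     10.89.0.5:43210 - "POST /chat/completions HTTP/1.1" 200 OK'
fields = UVICORN.match(sample).groupdict()   # extracts method, path, status, etc.
```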
<h2>Walkthrough: drop noisy postgres-loadgen documents</h2>
<p>Not all logs are important enough to keep around forever. For example, logs from a load-generation tool used for load testing are rarely useful for long-term analysis, so let's drop those.</p>
<p>To do this, we'll add a second conditional block for logs you intentionally do not want to index long-term.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/05-preview-selected-postgres-loadgen.png" alt="Selected tab preview of noisy postgres-loadgen documents before drop" /></p>
<p>Add a drop processor inside this block, then validate in the <strong>Dropped</strong> tab.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/07-preview-dropped-tab.png" alt="Dropped tab preview showing noisy Kubernetes logs excluded from indexing" /></p>
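<p>Conceptually, the drop block behaves like the following sketch (a hypothetical simulation of the condition plus drop processor; the field path assumes standard Kubernetes log metadata):</p>

```python
# Hypothetical drop condition mirroring the conditional block in the UI:
# discard anything emitted by the postgres-loadgen container before indexing.
def should_drop(doc: dict) -> bool:
    return doc.get("kubernetes", {}).get("container", {}).get("name") == "postgres-loadgen"

docs = [
    {"kubernetes": {"container": {"name": "postgres-loadgen"}}, "message": "tick"},
    {"kubernetes": {"container": {"name": "litellm"}}, "message": "request ok"},
]
kept = [d for d in docs if not should_drop(d)]   # only non-loadgen docs survive
```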
<h2>Save safely with live simulation</h2>
<p>One of the most useful parts of Streams is the preview-first workflow. You can inspect matched, parsed, skipped, failed, and dropped samples before making the change live.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/08-save-changes.png" alt="Save changes button after validating processing logic on live samples" /></p>
<h2>YAML mode and the equivalent API request</h2>
<p>The interactive builder works well for most edits, but advanced users can switch to YAML mode for direct control.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/11-yaml-mode.png" alt="Switching from interactive builder to YAML mode in Streams processing" /></p>
<p>You can also open <strong>Equivalent API Request</strong> to copy the payload for automation and Infrastructure as Code workflows.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/12-equivalent-api-request.png" alt="Equivalent API request panel for automating Streams processing" /></p>
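<p>As a sketch of how a copied payload could be replayed from an automation script (the URL path, payload shape, and API key below are placeholders, not the exact Streams API — copy the real values from the Equivalent API Request panel and your deployment):</p>

```python
import json
import urllib.request

def build_streams_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Assemble (but do not send) a request replaying a copied Streams payload."""
    return urllib.request.Request(
        # Placeholder path: take the real method and path from the panel.
        url=f"{base_url}/api/streams/logs-kubernetes.containers_logs-default/_ingest",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Authorization": f"ApiKey {api_key}", "Content-Type": "application/json"},
    )

req = build_streams_request("https://my-deployment.kb.example.cloud", "my-api-key", {"steps": []})
```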
<h2>A note on backwards compatibility</h2>
<p>Streams Processing builds on Elasticsearch ingest pipelines, so it works with the same ingestion model teams already use.</p>
<p>When you save processing changes, Streams appends logic through the stream processing pipeline model (for example via <code>@custom</code> conventions used by data streams). That means you can adopt conditionals, parsing, and selective dropping incrementally, without changing your Kubernetes log shippers.</p>
<h2>What's next?</h2>
<p>Streams Processing is continually gaining new processing capabilities. Check out the <a href="https://www.elastic.co/docs/solutions/observability/streams/streams">Streams documentation</a> for the latest updates.</p>
<p>Over the coming months more of this will be automated and moved to the background, reducing the manual effort required to process logs.</p>
<p>Another milestone we're working towards is offering this processing at read time rather than write time. Powered by ES|QL, this will let you iterate on your parsing logic without worrying about committing changes that are harder to revert.</p>
<p>Also try this out by getting a free trial on <a href="https://cloud.elastic.co/">Elastic Serverless</a>.</p>
<p>Happy log analytics!</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/kubernetes-logs-elastic-streams-processing/cover.svg" length="0" type="image/svg+xml"/>
        </item>
        <item>
            <title><![CDATA[Native OpenTelemetry support in Elastic Observability]]></title>
            <link>https://www.elastic.co/observability-labs/blog/native-opentelemetry-support-in-elastic-observability</link>
            <guid isPermaLink="false">native-opentelemetry-support-in-elastic-observability</guid>
            <pubDate>Wed, 13 Sep 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Elastic offers native support for OpenTelemetry by allowing for direct ingest of OpenTelemetry traces, metrics, and logs without conversion, and applying any Elastic feature against OTel data without degradation in capabilities.]]></description>
            <content:encoded><![CDATA[<p>NOTE: Since writing this blog, new OTel data ingest configurations are now available in Elastic. See recent <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">blog</a></p>
<p>OpenTelemetry is more than just becoming the open ingestion standard for observability. As one of the major Cloud Native Computing Foundation (CNCF) projects, with as many commits as Kubernetes, it is gaining support from major ISVs and cloud providers delivering support for the framework. Many global companies from finance, insurance, tech, and other industries are starting to standardize on OpenTelemetry. With OpenTelemetry, DevOps teams have a consistent approach to collecting and ingesting telemetry data providing a de-facto standard for observability.</p>
<p>Elastic<sup>®</sup> is strategically standardizing on OpenTelemetry as the main data collection architecture for observability and security. Additionally, Elastic is making a commitment to help OpenTelemetry become the de facto data collection infrastructure for the observability ecosystem. Elastic is deepening its relationship with OpenTelemetry beyond the recent contribution of Elastic Common Schema (ECS) to OpenTelemetry (OTel).</p>
<p>Elastic has supported OpenTelemetry natively since Elastic 7.14, directly ingesting OpenTelemetry protocol (OTLP) based traces, metrics, and logs.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-1-otel-config-options.png" alt="otel configuration options" /></p>
<p>In this blog, we’ll review the current OpenTelemetry support provided by Elastic, which includes the following:</p>
<ul>
<li><a href="#ingesting-opentelemetry-into-elastic"><strong>Easy ingest of distributed tracing and metrics</strong></a> for applications configured with OpenTelemetry agents for Python, NodeJS, Java, Go, and .NET</li>
<li><a href="#opentelemetry-logs-in-elastic"><strong>OpenTelemetry logs instrumentation and ingest</strong></a> using various configurations</li>
<li><a href="#opentelemetry-is-elastics-preferred-schema"><strong>Open semantic conventions</strong></a> for logs and more through ECS, which is not part of OpenTelemetry</li>
<li><a href="#elastic-observability-apm-and-machine-learning-capabilities"><strong>Machine learning based AIOps capabilities</strong></a>, such as latency correlations, failure correlations, anomaly detection, log spike analysis, predictive pattern analysis, Elastic AI Assistant support, and more, all apply to native OTLP telemetry.</li>
<li><a href="#elastic-allows-you-to-migrate-to-otel-on-your-schedule"><strong>Migrate applications to OpenTelemetry at your own speed</strong></a>. Elastic’s APM capabilities all work seamlessly even with a mix of services using OpenTelemetry and/or Elastic APM agents. You can even combine OpenTelemetry instrumentation with Elastic Agent.</li>
<li><a href="#integrated-kubernetes-and-opentelemetry-views-in-elastic"><strong>Integrated views and analysis with Kubernetes clusters</strong></a>, which most OpenTelemetry applications are running on. Elastic can highlight specific pods and containers related to each service when analyzing issues for applications based on OpenTelemetry.</li>
</ul>
<h2>Ingesting OpenTelemetry into Elastic</h2>
<p>If you’re interested in seeing how simple it is to ingest OpenTelemetry traces and metrics into Elastic, follow the steps outlined in this blog.</p>
<p>Let’s outline what Elastic provides for ingesting OpenTelemetry data. Here are all your options:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-2-flowchart.png" alt="flowchart" /></p>
<h3>Using the OpenTelemetry Collector</h3>
<p>When using the OpenTelemetry Collector, which is the most common configuration option, you simply have to add two key variables.</p>
<p>The instructions utilize a specific OpenTelemetry Collector configuration for Elastic. Essentially, the Elastic <a href="https://github.com/elastic/opentelemetry-demo/blob/main/kubernetes/elastic-helm/values.yaml">values.yaml</a> file specified in elastic/opentelemetry-demo configures the OpenTelemetry Collector to point to the Elastic APM Server using two main values:</p>
<ul>
<li><code>OTEL_EXPORTER_OTLP_ENDPOINT</code>: the URL of Elastic's APM Server</li>
<li><code>OTEL_EXPORTER_OTLP_HEADERS</code>: the Elastic authorization header</li>
</ul>
<p>These two values can be found in the OpenTelemetry setup instructions under the APM integration instructions (Integrations-&gt;APM) in your Elastic Cloud.</p>
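<p>For example, these values can be set programmatically before initializing an OpenTelemetry SDK. The endpoint and token below are placeholders — substitute the values from your own deployment:</p>

```python
import os

# Placeholder values: copy the real endpoint and secret token from the
# APM integration page (Integrations -> APM) in your Elastic Cloud deployment.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://my-deployment.apm.us-east-1.aws.cloud.es.io:443"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=Bearer my-secret-token"

# Any OpenTelemetry SDK initialized after this point picks these values up
# automatically, since they are standard OTel exporter environment variables.
```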
<h3>Native OpenTelemetry agents embedded in code</h3>
<p>If you are thinking of using OpenTelemetry libraries in your code, you can simply point the service to Elastic's APM Server, because it natively supports the OTLP protocol. No special Elastic conversion is needed.</p>
<p>To demonstrate this effectively and provide some education on how to use OpenTelemetry, we have two applications you can use to learn from:</p>
<ul>
<li><a href="https://github.com/elastic/opentelemetry-demo">Elastic’s version of OpenTelemetry demo</a>: As with all the other observability vendors, we have our own forked version of the OpenTelemetry demo.</li>
<li><a href="https://github.com/elastic/workshops-instruqt/tree/main/Elastiflix">Elastiflix:</a> This demo application is an example to help you learn how to instrument on various languages and telemetry signals.</li>
</ul>
<p>Check out our blogs on using the Elastiflix application and instrumenting with OpenTelemetry:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
</ul>
<p>We have created YouTube videos on these topics as well:</p>
<ul>
<li><a href="https://youtu.be/wMXMRsjFg-8?feature=shared">How to Manually Instrument Java with OpenTelemetry (Part 1)</a></li>
<li><a href="https://youtu.be/PX7s6RRLGaU?feature=shared">How to Manually Instrument Java with OpenTelemetry (Part 2)</a></li>
<li><a href="https://youtu.be/hXTlV_RnELc?feature=shared">Custom Java Instrumentation with OpenTelemetry</a></li>
<li><a href="https://youtu.be/E8g9u_uOFO4?feature=shared">Elastic APM - Automatic .NET Instrumentation with OpenTelemetry</a></li>
<li><a href="https://youtu.be/7J9M2JsHwRE?feature=shared">How to Manually Instrument .NET Applications with OpenTelemetry</a></li>
</ul>
<p>Given Elastic and OpenTelemetry’s vast user base, these provide a rich source of education for anyone trying to learn the intricacies of instrumenting with OpenTelemetry.</p>
<h3>Elastic Agents supporting OpenTelemetry</h3>
<p>If you’ve already deployed Elastic APM agents, you can still use them with OpenTelemetry. <a href="https://www.elastic.co/blog/opentelemetry-instrumentation-elastic-apm-agent-features">Elastic APM agents today are able to ship OpenTelemetry</a> spans as part of a trace. This means that if you have any component in your application that emits an OpenTelemetry span, it’ll be part of the trace the Elastic APM agent captures.</p>
<h2>OpenTelemetry logs in Elastic</h2>
<p>If you look at the OpenTelemetry documentation, you will see that many language logging libraries are still experimental or not yet implemented; Java is stable, per the documentation. Depending on your service’s language, and your appetite for adventure, there are several options for exporting logs from your services and applications and marrying them together in your observability backend.</p>
<p>In a previous blog, we discussed <a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 different configurations to properly get logging data into Elastic for Java</a>. The blog explores the current state of the art of OpenTelemetry logging and provides guidance on the available approaches with the following tenets in mind:</p>
<ul>
<li>Correlation of service logs with OTel-generated tracing where applicable</li>
<li>Proper capture of exceptions</li>
<li>Common context across tracing, metrics, and logging</li>
<li>Support for slf4j key-value pairs (“structured logging”)</li>
<li>Automatic attachment of metadata carried between services via OTel baggage</li>
<li>Use of an Elastic Observability backend</li>
<li>Consistent data fidelity in Elastic regardless of the approach taken</li>
</ul>
<p>Three models, which are covered in the blog, currently exist for getting your application or service logs to Elastic with correlation to OTel tracing and baggage:</p>
<ul>
<li>Output logs from your service (alongside traces and metrics) using an embedded OpenTelemetry Instrumentation library to Elastic via the OTLP protocol</li>
<li>Write logs from your service to a file scraped by the OpenTelemetry Collector, which then forwards to Elastic via the OTLP protocol</li>
<li>Write logs from your service to a file scraped by Elastic Agent (or Filebeat), which then forwards to Elastic via an Elastic-defined protocol</li>
</ul>
<p>Note that (1), in contrast to (2) and (3), does not involve writing service logs to a file prior to ingestion into Elastic.</p>
<h2>OpenTelemetry is Elastic’s preferred schema</h2>
<p>Elastic recently contributed the <a href="https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/">Elastic Common Schema (ECS) to the OpenTelemetry (OTel)</a> project, enabling a unified data specification for security and observability data within the OTel Semantic Conventions framework.</p>
<p>ECS, an open source specification, was developed with support from the Elastic user community to define a common set of fields to be used when storing event data in Elasticsearch<sup>®</sup>. ECS helps reduce management and storage costs stemming from data duplication, improving operational efficiency.</p>
<p>Similarly, OTel’s Semantic Conventions (SemConv) also specify common names for various kinds of operations and data. The benefit of using OTel SemConv is in following a common naming scheme that can be standardized across a codebase, libraries, and platforms for OTel users.</p>
<p>The merging of ECS and OTel SemConv will help advance OTel’s adoption and the continued evolution and convergence of observability and security domains.</p>
<h2>Elastic Observability APM and machine learning capabilities</h2>
<p>All of Elastic Observability’s APM capabilities are available with OTel data (read more on this in our blog, <a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry</a>):</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services</li>
<li>Transactions (traces)</li>
<li>ML correlations (specifically for latency)</li>
<li>Service logs</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-3-services.png" alt="services" /></p>
<p>In addition to Elastic’s APM and unified view of the telemetry data, you can now use Elastic’s powerful machine learning capabilities to speed up analysis and alerting, helping reduce MTTR. Here are some of the ML-based AIOps capabilities we offer:</p>
<ul>
<li><a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability"><strong>Anomaly detection:</strong></a> Elastic Observability, when turned on (<a href="https://www.elastic.co/guide/en/kibana/current/xpack-ml-anomalies.html">see documentation</a>), automatically detects anomalies by continuously modeling the normal behavior of your OpenTelemetry data — learning trends, periodicity, and more.</li>
<li><a href="https://www.elastic.co/blog/reduce-mttd-ml-machine-learning-observability"><strong>Log categorization:</strong></a> Elastic also identifies patterns in your OpenTelemetry log events quickly, so that you can take action quicker.</li>
<li><strong>High-latency or erroneous transactions:</strong> Elastic Observability’s APM capability helps you discover which attributes are contributing to increased transaction latency and identifies which attributes are most influential in distinguishing between transaction failures and successes.</li>
<li><a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops"><strong>Log spike detector</strong></a> helps identify reasons for increases in OpenTelemetry log rates. It makes it easy to find and investigate causes of unusual spikes by using the analysis workflow view.</li>
<li><a href="https://www.elastic.co/blog/observability-logs-machine-learning-aiops"><strong>Log pattern analysis</strong></a> helps you find patterns in unstructured log messages and makes it easier to examine your data.</li>
</ul>
<h2>Elastic allows you to migrate to OTel on your schedule</h2>
<p>Although OpenTelemetry supports many programming languages, the <a href="https://opentelemetry.io/docs/instrumentation/">maturity of its major functional components</a> — metrics, traces, and logs — still varies by language. Thus applications written in Java, Python, and JavaScript are good candidates to start with, as their metrics, traces, and (for Java) logs are stable.</p>
<p>For the languages that are not yet supported, you can instrument those services using Elastic agents, running your <a href="https://www.elastic.co/observability">full stack observability platform</a> in mixed mode (Elastic agents alongside OpenTelemetry agents).</p>
<p>Here is a simple example:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-4-services2.png" alt="services 2" /></p>
<p>The above shows a simple variation of our standard Elastic Agent application with one service — the newsletter-otel service — flipped to OTel. You can convert each of the remaining services as development resources allow.</p>
<p>Hence you can migrate to OpenTelemetry on your own schedule, moving individual services as the relevant languages reach a stable state.</p>
<h2>Integrated Kubernetes and OpenTelemetry views in Elastic</h2>
<p>Elastic monitors your Kubernetes cluster using the Elastic Agent, which you can deploy on the same cluster where your OpenTelemetry application is running. Hence you can not only use OpenTelemetry for your application, but Elastic can also monitor the corresponding Kubernetes cluster.</p>
<p>There are two configurations for Kubernetes:</p>
<p><strong>1. Simply deploying the Elastic Agent DaemonSet on the Kubernetes cluster.</strong> We outline this in the article entitled <a href="https://www.elastic.co/blog/kubernetes-cluster-metrics-logs-monitoring">Managing your Kubernetes cluster with Elastic Observability</a>. This pushes just the Kubernetes metrics and logs to Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-5-cloud-nodes.png" alt="elastic cloud nodes" /></p>
<p><strong>2. Deploying the Elastic Agent with not only the Kubernetes DaemonSet, but also Elastic’s APM integration, the Defend (security) integration, and the Network Packet Capture integration</strong> to provide more comprehensive Kubernetes cluster observability. We outline this configuration in the article <a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-6-flowhcart.png" alt="flowchart" /></p>
<p>Both <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualization</a> examples use the OpenTelemetry demo, and in Elastic, we tie the Kubernetes information with the application to provide you an ability to see Kubernetes information from your traces in APM. This provides a more integrated approach when troubleshooting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/elastic-blog-7-pod-deets.png" alt="pod details" /></p>
<h2>Summary</h2>
<p>In essence, Elastic's commitment goes beyond mere support for OpenTelemetry. We are dedicated to ensuring our customers not only adopt OpenTelemetry but thrive with it. Through our solutions, expertise, and resources, we aim to elevate the observability journey for every business, turning data into actionable insights that drive growth and innovation.</p>
<blockquote>
<p>Developer resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/getting-started-opentelemetry-instrumentation-sample-app">Elastiflix application</a>, a guide to instrument different languages with OpenTelemetry</li>
<li>Python: <a href="https://www.elastic.co/blog/auto-instrumentation-of-python-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-python-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Java: <a href="https://www.elastic.co/blog/auto-instrumentation-of-java-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-java-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Node.js: <a href="https://www.elastic.co/blog/auto-instrument-nodejs-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-nodejs-applications-opentelemetry">Manual-instrumentation</a></li>
<li>.NET: <a href="https://www.elastic.co/blog/auto-instrumentation-of-net-applications-opentelemetry">Auto-instrumentation</a>, <a href="https://www.elastic.co/blog/manual-instrumentation-of-net-applications-opentelemetry">Manual-instrumentation</a></li>
<li>Go: <a href="https://elastic.co/blog/manual-instrumentation-of-go-applications-opentelemetry">Manual-instrumentation</a></li>
<li><a href="https://www.elastic.co/blog/best-practices-instrumenting-opentelemetry">Best practices for OpenTelemetry</a></li>
</ul>
<p>General configuration and use case resources:</p>
<ul>
<li><a href="https://www.elastic.co/blog/opentelemetry-observability">Independence with OpenTelemetry on Elastic</a></li>
<li><a href="https://www.elastic.co/blog/implementing-kubernetes-observability-security-opentelemetry">Modern observability and security on Kubernetes with Elastic and OpenTelemetry</a></li>
<li><a href="https://www.elastic.co/blog/3-models-logging-opentelemetry-elastic">3 models for logging with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/adding-free-and-open-elastic-apm-as-part-of-your-elastic-observability-deployment">Adding free and open Elastic APM as part of your Elastic Observability deployment</a></li>
<li><a href="https://www.elastic.co/blog/custom-metrics-app-code-java-agent-plugin">Capturing custom metrics through OpenTelemetry API in code with Elastic</a></li>
<li><a href="https://www.elastic.co/virtual-events/future-proof-your-observability-platform-with-opentelemetry-and-elastic">Future-proof your observability platform with OpenTelemetry and Elastic</a></li>
<li><a href="https://www.elastic.co/blog/kubernetes-k8s-observability-elasticsearch-cncf">Elastic Observability: Built for open technologies like Kubernetes, OpenTelemetry, Prometheus, Istio, and more</a></li>
</ul>
</blockquote>
<p>Don’t have an Elastic Cloud account yet? <a href="https://cloud.elastic.co/registration">Sign up for Elastic Cloud</a> and try out the instrumentation capabilities that I discussed above. I would be interested in getting your feedback about your experience in gaining visibility into your application stack with Elastic.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/native-opentelemetry-support-in-elastic-observability/ecs-otel-announcement-2.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Tracing, logs, and metrics for a RAG based Chatbot with Elastic Distributions of OpenTelemetry]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openai-tracing-elastic-opentelemetry</link>
            <guid isPermaLink="false">openai-tracing-elastic-opentelemetry</guid>
            <pubDate>Fri, 24 Jan 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How to observe a OpenAI RAG based application using Elastic. Instrument the app, collect logs, traces, metrics, and understand how well the LLM is performing with Elastic Distributions of OpenTelemetry on Kubernetes and Docker.]]></description>
<content:encoded><![CDATA[<p>As discussed in the following post, <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-openai">Elastic added instrumentation for OpenAI based applications in EDOT</a>. The application most commonly built on LLMs is the chatbot. These chatbots not only use large language models (LLMs), but also frameworks such as LangChain, along with search-based retrieval augmented generation (RAG) to improve contextual information during a conversation. Elastic's sample <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> showcases how to use Elasticsearch with local data that has embeddings, enabling search to pull out the most contextual information during a query with a chatbot connected to an LLM of your choice. It's a great example of how to build out a RAG based application with Elasticsearch.</p>
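<p>At its core, the RAG flow such an app implements is: retrieve the most relevant indexed chunks, then assemble them into the LLM prompt. A minimal sketch, with illustrative function names and a toy keyword match standing in for the app's Elasticsearch embedding search:</p>

```python
def retrieve(query: str, corpus: dict) -> list:
    """Toy stand-in for an Elasticsearch embedding/keyword search over indexed docs."""
    return [text for text in corpus.values() if query.lower() in text.lower()]

def build_prompt(query: str, contexts: list) -> str:
    """Assemble retrieved context and the user question into an LLM prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"

corpus = {"doc1": "Vacation policy: 25 days per year.", "doc2": "Office hours: 9 to 5."}
prompt = build_prompt("vacation", retrieve("vacation", corpus))
```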
<p>This app is now also instrumented with EDOT, and you can visualize the Chatbot's traces to OpenAI, as well as relevant logs and metrics from the application. By running the app as instructed in the GitHub repo with Docker, you can see these traces on a local stack. But what about running it against Serverless, Elastic Cloud, or even Kubernetes?</p>
<p>In this blog we will walk through how to set up Elastic's RAG based Chatbot application with Elastic Cloud and Kubernetes.</p>
<h1>Prerequisites:</h1>
<p>To follow along, you will need the following prerequisites:</p>
<ul>
<li>
<p>An Elastic Cloud account (sign up now), and familiarity with Elastic's OpenTelemetry configuration. No specific version is required for Serverless; for regular Elastic Cloud, use 8.17 or later.</p>
</li>
<li>
<p>Git clone the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> and go through the <a href="https://www.elastic.co/search-labs/tutorials/chatbot-tutorial/welcome">tutorial</a> to become familiar with the application and how to bring it up using Docker.</p>
</li>
<li>
<p>An account on OpenAI with API keys</p>
</li>
<li>
<p>Kubernetes cluster to run the RAG based Chatbot app</p>
</li>
<li>
<p>The instructions in this blog are also found in <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">observability-examples</a> on GitHub.</p>
</li>
</ul>
<h1>Application OpenTelemetry output in Elastic</h1>
<h2>Chatbot-rag-app</h2>
<p>The first item that you will need to get up and running is the ChatBotApp, and once up you should see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/Chatbotapp-general.png" alt="Chatbot app main page" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/Chatbotapp-details.png" alt="Chatbot app working" /></p>
<p>As you select some of the questions, you will get a response based on the index that was created in Elasticsearch when the app initialized. Additionally, there will be queries made to LLMs.</p>
<h2>Traces, logs, and metrics from EDOT in Elastic</h2>
<p>Once you have the application running on your K8s cluster or with Docker, and Elastic Cloud up and running you should see the following:</p>
<h3>Logs:</h3>
<p>In Discover you will see logs from the Chatbot app and be able to analyze the application logs and any specific log patterns, which saves you time in analysis.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-logs.png" alt="Chatbot-logs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-logs-patterns.png" alt="Chatbot-log-patterns" /></p>
<h3>Traces:</h3>
<p>In Elastic Observability APM, you can also see the chatbot details, which include transactions, dependencies, logs, errors, etc.</p>
<p>When you look at traces, you will be able to see the chatbot interactions in the trace.</p>
<ol>
<li>
<p>You will see the end-to-end HTTP call</p>
</li>
<li>
<p>Individual calls to elasticsearch</p>
</li>
<li>
<p>Specific calls such as invoke actions, and calls to the LLM</p>
</li>
</ol>
<p>You can also get individual details of the traces, and look at logs and metrics related to that trace.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-trace.png" alt="CHatbot-traces" /></p>
<h3>Metrics:</h3>
<p>In addition to logs, and traces, any instrumented metrics will also get ingested into Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/chatbot-reg-metrics.png" alt="Chatbot app metrics" /></p>
<h1>Setting it all up with Docker</h1>
<p>To properly set up the Chatbot-app on Docker with telemetry sent to Elastic, a few things are needed:</p>
<ol>
<li>
<p>Git clone the chatbot-rag-app</p>
</li>
<li>
<p>Modify the env file as noted in the GitHub README, with the following exception:</p>
</li>
</ol>
<p>Use your Elastic Cloud's <code>OTEL_EXPORTER_OTLP_ENDPOINT</code> and <code>OTEL_EXPORTER_OTLP_HEADERS</code> values instead.</p>
<p>You can find these in Elastic Cloud under <code>integrations-&gt;APM</code>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/otel-credentials.png" alt="OTel credentials" /></p>
<p>To send the OTel instrumentation, you will need the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://123456789.apm.us-west-2.aws.cloud.es.io:443&quot;
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20xxxxx&quot;
</code></pre>
<p>Notice the <code>%20</code> in the headers. It is needed to encode the space between <code>Bearer</code> and the token.</p>
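<p>If you script your setup, you can build this header value with Python's standard library. This is just a sketch; <code>otlp_auth_header</code> is a hypothetical helper, not part of the app:</p>

```python
from urllib.parse import quote

def otlp_auth_header(token: str) -> str:
    # Percent-encode the space between "Bearer" and the token so the value
    # survives the OTLP headers parser, which splits on commas and spaces.
    return "Authorization=" + quote(f"Bearer {token}")

print(otlp_auth_header("xxxxx"))  # Authorization=Bearer%20xxxxx
```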
<ol start="3">
<li>
<p>Enable the OTel SDK by setting <code>OTEL_SDK_DISABLED=false</code></p>
</li>
<li>
<p>Set the envs for LLMs</p>
</li>
</ol>
<p>In this example we're using OpenAI, so only three variables are needed.</p>
<pre><code class="language-bash">LLM_TYPE=openai
OPENAI_API_KEY=XXXX
CHAT_MODEL=gpt-4o-mini
</code></pre>
<ol start="5">
<li>Run the docker container as noted</li>
</ol>
<pre><code class="language-bash">docker compose up --build --force-recreate
</code></pre>
<ol start="6">
<li>
<p>Play with the app at <code>localhost:4000</code></p>
</li>
<li>
<p>Then log into Elastic cloud and see the output as shown previously.</p>
</li>
</ol>
<h1>Run chatbot-rag-app on Kubernetes</h1>
<p>To set this up, follow the observability-examples repo, which contains the Kubernetes yaml files being used. These also point to Elastic Cloud.</p>
<ol>
<li>
<p>Set up the Kubernetes Cluster (we're using EKS)</p>
</li>
<li>
<p>Get the appropriate ENV variables:</p>
</li>
</ol>
<ul>
<li>
<p>Find the <code>OTEL_EXPORTER_OTLP_ENDPOINT/HEADERS</code> variables as noted in the previous section for Docker.</p>
</li>
<li>
<p>Get your OpenAI Key</p>
</li>
<li>
<p>Your Elasticsearch URL, username, and password.</p>
</li>
</ul>
<ol start="3">
<li>Follow the instructions in the following <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">github repo in observability examples</a> to run two Kubernetes yaml files.</li>
</ol>
<p>Essentially, you only need to replace the secret variables in k8s-deployment.yaml and run:</p>
<pre><code class="language-bash">kubectl create -f k8s-deployment.yaml
kubectl create -f init-index-job.yaml
</code></pre>
<p>The app needs to be running first; the init job then uses the app to initialize Elasticsearch with the indices the app needs.</p>
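<p>One way to enforce that ordering is to wait for the Deployment to become Available before creating the Job. This is a sketch using the resource names from the manifests below; adjust names and timeouts for your cluster:</p>

```shell
# Apply the app first, and block until it is Available.
kubectl create -f k8s-deployment.yaml
kubectl wait deployment/chatbot-regular --for=condition=Available --timeout=180s

# Only then create the index-initialization job, and wait for it to finish.
kubectl create -f init-index-job.yaml
kubectl wait job/init-elasticsearch-index-test --for=condition=complete --timeout=300s
```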
<p><strong><em>Init-index-job.yaml</em></strong></p>
<pre><code class="language-yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: init-elasticsearch-index-test
spec:
  template:
    spec:
      containers:
      - name: init-index
        image: ghcr.io/elastic/elasticsearch-labs/chatbot-rag-app:latest
        workingDir: /app/api
        command: [&quot;python3&quot;, &quot;-m&quot;, &quot;flask&quot;, &quot;--app&quot;, &quot;app&quot;, &quot;create-index&quot;]
        env:
        - name: FLASK_APP
          value: &quot;app&quot;
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: ES_INDEX
          value: &quot;workplace-app-docs&quot;
        - name: ES_INDEX_CHAT_HISTORY
          value: &quot;workplace-app-docs-chat-history&quot;
        - name: ELASTICSEARCH_URL
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_URL
        - name: ELASTICSEARCH_USER
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_USER
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: chatbot-regular-secrets
              key: ELASTICSEARCH_PASSWORD
        envFrom:
        - secretRef:
            name: chatbot-regular-secrets
      restartPolicy: Never
  backoffLimit: 4
</code></pre>
<p><strong><em>k8s-deployment.yaml</em></strong></p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: chatbot-regular-secrets
type: Opaque
stringData:
  ELASTICSEARCH_URL: &quot;https://yourelasticcloud.es.us-west-2.aws.found.io&quot;
  ELASTICSEARCH_USER: &quot;elastic&quot;
  ELASTICSEARCH_PASSWORD: &quot;elastic&quot;
  OTEL_EXPORTER_OTLP_HEADERS: &quot;Authorization=Bearer%20xxxx&quot;
  OTEL_EXPORTER_OTLP_ENDPOINT: &quot;https://12345.apm.us-west-2.aws.cloud.es.io:443&quot;
  OPENAI_API_KEY: &quot;YYYYYYYY&quot;

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatbot-regular
spec:
  replicas: 2
  selector:
    matchLabels:
      app: chatbot-regular
  template:
    metadata:
      labels:
        app: chatbot-regular
    spec:
      containers:
      - name: chatbot-regular
        image: ghcr.io/elastic/elasticsearch-labs/chatbot-rag-app:latest
        ports:
        - containerPort: 4000
        env:
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: &quot;service.name=chatbot-regular,service.version=0.0.1,deployment.environment=dev&quot;
        - name: OTEL_SDK_DISABLED
          value: &quot;false&quot;
        - name: OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT
          value: &quot;true&quot;
        - name: OTEL_EXPERIMENTAL_RESOURCE_DETECTORS
          value: &quot;process_runtime,os,otel,telemetry_distro&quot;
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: &quot;http/protobuf&quot;
        - name: OTEL_METRIC_EXPORT_INTERVAL
          value: &quot;3000&quot;
        - name: OTEL_BSP_SCHEDULE_DELAY
          value: &quot;3000&quot;
        envFrom:
        - secretRef:
            name: chatbot-regular-secrets
        resources:
          requests:
            memory: &quot;512Mi&quot;
            cpu: &quot;250m&quot;
          limits:
            memory: &quot;1Gi&quot;
            cpu: &quot;500m&quot;

---
apiVersion: v1
kind: Service
metadata:
  name: chatbot-regular-service
spec:
  selector:
    app: chatbot-regular
  ports:
  - port: 80
    targetPort: 4000
  type: LoadBalancer
</code></pre>
<p><strong>Open App with LoadBalancer URL</strong></p>
<p>Run the <code>kubectl get services</code> command and get the URL for the chatbot app:</p>
<pre><code class="language-bash">% kubectl get services
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP                                        PORT(S)        AGE
chatbot-regular-service   LoadBalancer   10.100.130.44   xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com   80:30748/TCP   6d23h
</code></pre>
<ol start="4">
<li>
<p>Play with the app and review the telemetry in Elastic</p>
</li>
<li>
<p>Once you go to the URL, you should see all the screens described at the beginning of this blog.</p>
</li>
</ol>
<h1>Conclusion</h1>
<p>With Elastic's Chatbot-rag-app you have an example of how to build out an OpenAI-driven RAG based chat application. However, you still need to understand how well it performs, whether it's working properly, and so on. Using OTel and Elastic’s EDOT gives you the ability to achieve this. Additionally, you will generally run this application on Kubernetes. Hopefully this blog provides an outline of how to achieve this.</p>
<p>Here are the other tracing blogs:</p>
<p>App Observability with LLM (Tracing)-</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">Observing LangChain with Langtrace and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-openlit-tracing">Observing LangChain with OpenLit Tracing</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing">Instrumenting LangChain with OpenTelemetry</a></p>
</li>
</ul>
<p>LLM Observability -</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration">Elevate LLM Observability with GCP Vertex AI Integration</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">LLM Observability on AWS Bedrock</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">LLM Observability for Azure OpenAI</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">LLM Observability for Azure OpenAI v2</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openai-tracing-elastic-opentelemetry/edot-openai-tracing.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Tracing a RAG based Chatbot with Elastic Distributions of OpenTelemetry and Langtrace]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openai-tracing-langtrace-elastic</link>
            <guid isPermaLink="false">openai-tracing-langtrace-elastic</guid>
            <pubDate>Thu, 06 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[How to observe a OpenAI RAG based application using Elastic. Instrument the app, collect logs, traces, metrics, and understand how well the LLM is performing with Elastic Distributions of OpenTelemetry on Kubernetes with Langtrace.]]></description>
<content:encoded><![CDATA[<p>Most AI-driven applications currently focus on increasing the value an end user, such as an SRE, gets from AI. The main use case is the creation of various chatbots. These chatbots not only use large language models (LLMs), but also frameworks such as LangChain, and search to improve contextual information during a conversation (Retrieval Augmented Generation). Elastic’s sample <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> showcases how to use Elasticsearch with local data that has embeddings, enabling search to pull out the most contextual information during a query with a chatbot connected to an LLM of your choice. It's a great example of how to build out a RAG based application with Elasticsearch. However, what about monitoring the application?</p>
<p>Elastic provides the ability to ingest OpenTelemetry data with native OTel SDKs, the off-the-shelf OTel Collector, or Elastic’s Distributions of OpenTelemetry (EDOT). EDOT enables you to bring in logs, metrics, and traces for your GenAI application and for K8s. However, you will also generally need libraries to help trace specific components in your application. For tracing GenAI applications you can pick from a large set of libraries:</p>
<ul>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai/opentelemetry-instrumentation-openai-v2">OpenTelemetry OpenAI Instrumentation-v2</a> - allows tracing LLM requests and logging of messages made by the OpenAI Python API library. (Note: v2 is built by OpenTelemetry; the non-v2 version is from a specific vendor, not OpenTelemetry.)</p>
</li>
<li>
<p><a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai/opentelemetry-instrumentation-vertexai">OpenTelemetry VertexAI Instrumentation</a> - allows tracing LLM requests and logging of messages made by the VertexAI Python API library</p>
</li>
<li>
<p><a href="https://docs.langtrace.ai/introduction">Langtrace</a> - a commercially available library that supports all major LLMs in one library; all traces are also OTel native.</p>
</li>
<li>
<p>Elastic’s EDOT - which recently added tracing. See this <a href="https://www.elastic.co/observability-labs/blog/openai-tracing-elastic-opentelemetry">blog</a>.</p>
</li>
</ul>
<p>As you can see, OpenTelemetry is becoming the de facto mechanism for collecting and ingesting this telemetry. Its support for GenAI is growing, but it is still early days.</p>
<p>In this blog, we will walk through how to, with minimal code, observe a RAG based chatbot application with tracing using Langtrace. We previously covered Langtrace in a <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">blog</a> to highlight tracing Langchain.</p>
<p>In this blog we use Langtrace, which supports OpenAI, Amazon Bedrock, Cohere, and others in one library.</p>
<h1>Pre-requisites:</h1>
<p>To follow along, you will need the following prerequisites:</p>
<ul>
<li>An Elastic Cloud account (sign up now), and familiarity with Elastic’s OpenTelemetry configuration. No specific version is required for Serverless; for regular Elastic Cloud, use 8.17 or later.</li>
<li>Git clone the <a href="https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app">RAG based Chatbot application</a> and go through the <a href="https://www.elastic.co/search-labs/tutorials/chatbot-tutorial/welcome">tutorial</a> on how to bring it up and become more familiar.</li>
<li>An account on your favorite LLM (OpenAI, Azure OpenAI, etc.), with API keys</li>
<li>Familiarity with EDOT, to understand how we bring in logs, metrics, and traces from the application through the OTel Collector</li>
<li>A Kubernetes cluster (I’ll be using Amazon EKS)</li>
<li>The <a href="https://docs.langtrace.ai/introduction">Langtrace</a> documentation, for reference</li>
</ul>
<h1>Application OpenTelemetry output in Elastic</h1>
<h2>Chatbot-rag-app</h2>
<p>The first item that you will need to get up and running is the ChatBotApp, and once up you should see the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-general.png" alt="Chatbot app main page" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-details.png" alt="Chatbot app working" /></p>
<p>As you select some of the questions, you will get a response based on the index that was created in Elasticsearch when the app initialized. Additionally, there will be queries made to LLMs.</p>
<h2>Traces, logs, and metrics from EDOT in Elastic</h2>
<p>Once you have OTel Collector with EDOT configuration on your K8s cluster, and Elastic Cloud up and running you should see the following:</p>
<h3>Logs:</h3>
<p>In Discover you will see logs from the Chatbot app and be able to analyze the application logs, any specific log patterns (which saves you time in analysis), and logs from K8s.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-logs.png" alt="Chatbot-logs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-log-patterns.png" alt="Chatbot-log-patterns" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-logs-detailed.png" alt="Chatbot-log-details" /></p>
<h3>Traces:</h3>
<p>In Elastic Observability APM, you can also see the chatbot details, which include transactions, dependencies, logs, errors, etc.</p>
<p>When you look at traces, you will be able to see the chatbot interactions in the trace.</p>
<ol>
<li>
<p>You will see the end-to-end HTTP call</p>
</li>
<li>
<p>Individual calls to elasticsearch</p>
</li>
<li>
<p>Specific calls such as invoke actions, and calls to the LLM</p>
</li>
</ol>
<p>You can also get individual details of the traces, and look at logs and metrics related to that trace.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/Chatbotapp-service-traces.png" alt="CHatbot-traces" /></p>
<h3>Metrics:</h3>
<p>In addition to logs, and traces, any instrumented metrics will also get ingested into Elastic.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/chatbot-reg-metrics.png" alt="Chatbot app metrics" /></p>
<h1>Setting it all up</h1>
<p>To properly set up the Chatbot-app on K8s with telemetry sent to Elastic, a few things must be done:</p>
<ol>
<li>
<p>Git clone the chatbot-rag-app and modify one of the Python files.</p>
</li>
<li>
<p>Next, create a Docker container that can be used in Kubernetes. The <a href="https://github.com/elastic/elasticsearch-labs/blob/main/example-apps/chatbot-rag-app/Dockerfile">Dockerfile</a> in the Chatbot-app is good to use.</p>
</li>
<li>
<p>Collect all needed env variables. In this example we are using OpenAI, but the files can be modified for any of the LLMs. You will have to load a few environment variables into the cluster. In the GitHub repo there is an env.example for Docker. You can pick and choose what is needed and adjust appropriately in the K8s files below.</p>
</li>
<li>
<p>Set up your K8s Cluster, and then install the OpenTelemetry collector with the appropriate yaml file and credentials. This will help collect K8s cluster logs and metrics also.</p>
</li>
<li>
<p>Utilize the two yaml files listed below to ensure you can run it on Kubernetes.</p>
</li>
</ol>
<ul>
<li>
<p>Init-index-job.yaml - initializes the index in Elasticsearch with the local corporate information</p>
</li>
<li>
<p>k8s-deployment-chatbot-rag-app.yaml - initializes the application frontend and backend.</p>
</li>
</ul>
<ol start="6">
<li>
<p>Open the app on the load balancer URL against the chatbot-app service in K8s</p>
</li>
<li>
<p>Go to Elasticsearch and look at Discover for logs; go to APM, look for your chatbot-app, and review the traces; and finally, review the metrics.</p>
</li>
</ol>
<h2>Modify the code for tracing with Langtrace</h2>
<p>Once you download and untar the app, go to the chatbot-rag-app directory:</p>
<pre><code class="language-bash">curl https://codeload.github.com/elastic/elasticsearch-labs/tar.gz/main | 
tar -xz --strip=2 elasticsearch-labs-main/example-apps/chatbot-rag-app
cd elasticsearch-labs-main/example-apps/chatbot-rag-app
</code></pre>
<p>Next, open the <code>app.py</code> file in the <code>api</code> directory and add the following:</p>
<pre><code class="language-python">from opentelemetry.instrumentation.flask import FlaskInstrumentor

from langtrace_python_sdk import langtrace

langtrace.init(batch=False)

FlaskInstrumentor().instrument_app(app)
</code></pre>
<p>into the code:</p>
<pre><code class="language-python">import os
import sys
from uuid import uuid4

from chat import ask_question
from flask import Flask, Response, jsonify, request
from flask_cors import CORS

from opentelemetry.instrumentation.flask import FlaskInstrumentor

from langtrace_python_sdk import langtrace

langtrace.init(batch=False)

app = Flask(__name__, static_folder=&quot;../frontend/build&quot;, static_url_path=&quot;/&quot;)
CORS(app)

FlaskInstrumentor().instrument_app(app)

@app.route(&quot;/&quot;)
</code></pre>
<p>The added lines bring in the Langtrace library and the OpenTelemetry Flask instrumentation. This combination provides an end-to-end trace for the HTTP call, all the way down to the calls to Elasticsearch and to OpenAI (or other LLMs).</p>
<h2>Create the docker container</h2>
<p>Use the Dockerfile that is in the chatbot-rag-app directory as is, and add the following line:</p>
<p><code>RUN pip3 install --no-cache-dir langtrace-python-sdk</code></p>
<p>into the Dockerfile:</p>
<pre><code class="language-bash">COPY requirements.txt ./requirements.txt
RUN pip3 install -r ./requirements.txt
RUN pip3 install --no-cache-dir langtrace-python-sdk
COPY api ./api
COPY data ./data

EXPOSE 4000
</code></pre>
<p>This enables the <code>langtrace-python-sdk</code> to be installed into the docker container so the langtrace libraries can be used properly.</p>
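<p>Once the Dockerfile is updated, build and push the image to a registry your Kubernetes cluster can pull from. The tag below is a placeholder, not a published image; use your own registry path:</p>

```shell
# Build the instrumented image from the chatbot-rag-app directory,
# then push it so the K8s manifests below can reference it.
docker build -t your-image-location:latest .
docker push your-image-location:latest
```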
<h2>Collecting the proper env variables:</h2>
<p>First collect the env variables from Elastic:</p>
<p>Envs for index initialization in Elastic:</p>
<pre><code class="language-bash">
ELASTICSEARCH_URL=https://aws.us-west-2.aws.found.io
ELASTICSEARCH_USER=elastic
ELASTICSEARCH_PASSWORD=elastic

# The name of the Elasticsearch indexes
ES_INDEX=workplace-app-docs
ES_INDEX_CHAT_HISTORY=workplace-app-docs-chat-history

</code></pre>
<p>The <code>ELASTICSEARCH_URL</code> can be found in cloud.elastic.co when you bring up your instance. You will need to set up the user and password in Elastic.</p>
<p>To send the OTel instrumentation, you will need the following environment variables:</p>
<pre><code class="language-bash">OTEL_EXPORTER_OTLP_ENDPOINT=&quot;https://123456789.apm.us-west-2.aws.cloud.es.io:443&quot;
OTEL_EXPORTER_OTLP_HEADERS=&quot;Authorization=Bearer%20xxxxx&quot;
</code></pre>
<p>These credentials are found in Elastic under the APM integration, under OpenTelemetry. Note the <code>%20</code> encoding the space in the bearer token.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/otel-credentials.png" alt="OTel credentials" /></p>
<p>Envs for LLMs</p>
<p>In this example we’re using OpenAI, so only three variables are needed.</p>
<pre><code class="language-bash">LLM_TYPE=openai
OPENAI_API_KEY=XXXX
CHAT_MODEL=gpt-4o-mini
</code></pre>
<p>All of these variables will be needed in the Kubernetes yamls in the next step.</p>
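<p>Since a missing setting only surfaces at deploy time, it can help to fail fast before rendering these values into the Secret. This is a hypothetical pre-flight check; the <code>missing_vars</code> helper is not part of the app:</p>

```python
import os

# The settings consumed by the k8s-deployment and init-index manifests below.
REQUIRED = [
    "ELASTICSEARCH_URL", "ELASTICSEARCH_USER", "ELASTICSEARCH_PASSWORD",
    "OTEL_EXPORTER_OTLP_ENDPOINT", "OTEL_EXPORTER_OTLP_HEADERS",
    "LLM_TYPE", "OPENAI_API_KEY", "CHAT_MODEL",
]

def missing_vars(env=os.environ):
    # Report required settings that are unset or empty.
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    for name in missing_vars():
        print(f"missing: {name}")
```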
<h2>Setup K8s cluster and load up OTel Collector with EDOT</h2>
<p>This step is outlined in the following <a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-otel-operator">blog</a>. It’s a simple three-step process.</p>
<p>This step will bring in all the K8s cluster logs and metrics and setup the OTel collector.</p>
<h2>Setup secrets, initialize indices, and start the app</h2>
<p>Now that the cluster is up, and you have your environmental variables, you will need to</p>
<ol>
<li>
<p>Install and run <code>k8s-deployment.yaml</code> with the variables filled in</p>
</li>
<li>
<p>Initialize the index</p>
</li>
</ol>
<p>Essentially run the following:</p>
<pre><code class="language-bash">kubectl create -f k8s-deployment.yaml
kubectl create -f init-index-job.yaml
</code></pre>
<p>Here are the two yamls you should use. They are also found <a href="https://github.com/elastic/observability-examples/tree/main/chatbot-rag-app-observability">here</a>.</p>
<p>k8s-deployment.yaml</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: genai-chatbot-langtrace-secrets
type: Opaque
stringData:
  OTEL_EXPORTER_OTLP_HEADERS: &quot;Authorization=Bearer%20xxxx&quot;
  OTEL_EXPORTER_OTLP_ENDPOINT: &quot;https://1234567.apm.us-west-2.aws.cloud.es.io:443&quot;
  ELASTICSEARCH_URL: &quot;YOUR_ELASTIC_SEARCH_URL&quot;
  ELASTICSEARCH_USER: &quot;elastic&quot;
  ELASTICSEARCH_PASSWORD: &quot;elastic&quot;
  OPENAI_API_KEY: &quot;XXXXXXX&quot;  

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: genai-chatbot-langtrace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: genai-chatbot-langtrace
  template:
    metadata:
      labels:
        app: genai-chatbot-langtrace
    spec:
      containers:
      - name: genai-chatbot-langtrace
        image: 65765.amazonaws.com/genai-chatbot-langtrace2:latest
        ports:
        - containerPort: 4000
        env:
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: OTEL_SDK_DISABLED
          value: &quot;false&quot;
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: &quot;service.name=genai-chatbot-langtrace,service.version=0.0.1,deployment.environment=dev&quot;
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: &quot;http/protobuf&quot;
        envFrom:
        - secretRef:
            name: genai-chatbot-langtrace-secrets
        resources:
          requests:
            memory: &quot;512Mi&quot;
            cpu: &quot;250m&quot;
          limits:
            memory: &quot;1Gi&quot;
            cpu: &quot;500m&quot;

---
apiVersion: v1
kind: Service
metadata:
  name: genai-chatbot-langtrace-service
spec:
  selector:
    app: genai-chatbot-langtrace
  ports:
  - port: 80
    targetPort: 4000
  type: LoadBalancer

</code></pre>
<p>Init-index-job.yaml</p>
<pre><code class="language-yaml">apiVersion: batch/v1
kind: Job
metadata:
  name: init-elasticsearch-index-test
spec:
  template:
    spec:
      containers:
      - name: init-index
#update your image location for chatbot rag app
        image: your-image-location:latest
        workingDir: /app/api
        command: [&quot;python3&quot;, &quot;-m&quot;, &quot;flask&quot;, &quot;--app&quot;, &quot;app&quot;, &quot;create-index&quot;]
        env:
        - name: FLASK_APP
          value: &quot;app&quot;
        - name: LLM_TYPE
          value: &quot;openai&quot;
        - name: CHAT_MODEL
          value: &quot;gpt-4o-mini&quot;
        - name: ES_INDEX
          value: &quot;workplace-app-docs&quot;
        - name: ES_INDEX_CHAT_HISTORY
          value: &quot;workplace-app-docs-chat-history&quot;
        - name: ELASTICSEARCH_URL
          valueFrom:
            secretKeyRef:
              name: genai-chatbot-langtrace-secrets
              key: ELASTICSEARCH_URL
        - name: ELASTICSEARCH_USER
          valueFrom:
            secretKeyRef:
              name: genai-chatbot-langtrace-secrets
              key: ELASTICSEARCH_USER
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: genai-chatbot-langtrace-secrets
              key: ELASTICSEARCH_PASSWORD
        envFrom:
        - secretRef:
            name: genai-chatbot-langtrace-secrets
      restartPolicy: Never
  backoffLimit: 4

</code></pre>
<h2>Open App with LoadBalancer URL</h2>
<p>Run the <code>kubectl get services</code> command and get the URL for the chatbot app:</p>
<pre><code class="language-bash">% kubectl get services
NAME                              TYPE           CLUSTER-IP      EXTERNAL-IP                                        PORT(S)        AGE
genai-chatbot-langtrace-service   LoadBalancer   10.100.130.44   xxxxxxxxx-1515488226.us-west-2.elb.amazonaws.com   80:30748/TCP   6d23h

</code></pre>
<p>Play with the app and review the telemetry in Elastic.</p>
<p>Once you go to the URL, you should see all the screens described at the beginning of this blog.</p>
<h1>Conclusion</h1>
<p>With Elastic's Chatbot-rag-app you have an example of how to build out an OpenAI-driven RAG based chat application. However, you still need to understand how well it performs, whether it's working properly, and so on. Using OTel, Elastic’s EDOT, and Langtrace gives you the ability to achieve this. Additionally, you will generally run this application on Kubernetes. Hopefully this blog provides an outline of how to achieve this.</p>
<p>Here are the other Tracing blogs:</p>
<p>App Observability with LLM (Tracing)-</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing-langtrace">Observing LangChain with Langtrace and OpenTelemetry</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-openlit-tracing">Observing LangChain with OpenLit Tracing</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elastic-opentelemetry-langchain-tracing">Instrumenting LangChain with OpenTelemetry</a></p>
</li>
</ul>
<p>LLM Observability:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/elevate-llm-observability-with-gcp-vertex-ai-integration">Elevate LLM Observability with GCP Vertex AI Integration</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-aws-bedrock">LLM Observability on AWS Bedrock</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai">LLM Observability for Azure OpenAI</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/observability-labs/blog/llm-observability-azure-openai-v2">LLM Observability for Azure OpenAI v2</a></p>
</li>
</ul>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openai-tracing-langtrace-elastic/edot-openai-tracing.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Collecting OpenShift container logs using Red Hat’s OpenShift Logging Operator]]></title>
            <link>https://www.elastic.co/observability-labs/blog/openshift-container-logs-red-hat-logging-operator</link>
            <guid isPermaLink="false">openshift-container-logs-red-hat-logging-operator</guid>
            <pubDate>Tue, 16 Jan 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to optimize OpenShift logs collected with Red Hat OpenShift Logging Operator, as well as format and route them efficiently in Elasticsearch.]]></description>
            <content:encoded><![CDATA[<p>This blog explores a possible approach to collecting and formatting OpenShift Container Platform logs and audit logs with Red Hat OpenShift Logging Operator. We recommend using Elastic® Agent for the best possible experience! We will also show how to format the logs to Elastic Common Schema (<a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a>) for the best experience viewing, searching, and visualizing your logs. All examples in this blog are based on OpenShift 4.14.</p>
<h2>Why use OpenShift Logging Operator?</h2>
<p>Many enterprise customers use OpenShift as their orchestration solution. The advantages of this approach are:</p>
<ul>
<li>
<p>It is developed and supported by Red Hat</p>
</li>
<li>
<p>It can automatically update the OpenShift cluster along with the operating system, ensuring they remain compatible</p>
</li>
<li>
<p>It can speed up development life cycles with features like source-to-image</p>
</li>
<li>
<p>It enforces enhanced security</p>
</li>
</ul>
<p>In our consulting experience, this latter aspect poses challenges and creates friction with OpenShift administrators when we try to install Elastic Agent to collect pod logs. Indeed, Elastic Agent requires the host's files to be mounted in the pod, and it also needs to run in privileged mode. (Read more about the permissions required by Elastic Agent in the <a href="https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-standalone.html#_red_hat_openshift_configuration">official Elasticsearch® Documentation</a>). While the solution we explore in this post requires similar privileges under the hood, it is managed by the OpenShift Logging Operator, which is developed and supported by Red Hat.</p>
<h2>Which logs are we going to collect?</h2>
<p>In OpenShift Container Platform, we distinguish <a href="https://docs.openshift.com/container-platform/4.14/logging/cluster-logging.html#logging-architecture-overview_cluster-logging">three broad categories of logs</a>: audit, application, and infrastructure logs:</p>
<ul>
<li>
<p><strong>Audit logs</strong> describe the list of activities that affected the system by users, administrators, and other components.</p>
</li>
<li>
<p><strong>Application logs</strong> are composed of the container logs of the pods running in non-reserved namespaces.</p>
</li>
<li>
<p><strong>Infrastructure logs</strong> are composed of container logs of the pods running in reserved namespaces like openshift*, kube*, and default along with journald messages from the nodes.</p>
</li>
</ul>
<p>For the sake of simplicity, we will consider only audit and application logs. We will describe how to format them into the shape expected by the Kubernetes integration so you can get the most out of Elastic Observability.</p>
<h2>Getting started</h2>
<p>To collect the logs from OpenShift, we must perform some preparation steps in Elasticsearch and OpenShift.</p>
<h3>Inside Elasticsearch</h3>
<p>We first <a href="https://www.elastic.co/guide/en/fleet/8.11/install-uninstall-integration-assets.html#install-integration-assets">install the Kubernetes integration assets</a>. We are mainly interested in the index templates and ingest pipelines for the logs-kubernetes.container_logs and logs-kubernetes.audit_logs data streams.</p>
<p>To format the logs received from the ClusterLogForwarder in <a href="https://www.elastic.co/guide/en/ecs/current/index.html">ECS</a> format, we will define a pipeline to normalize the container logs. The field naming convention used by OpenShift is slightly different from that used by ECS. To get a list of exported fields from OpenShift, refer to <a href="https://docs.openshift.com/container-platform/4.14/logging/cluster-logging-exported-fields.html">Exported fields | Logging | OpenShift Container Platform 4.14</a>. To get a list of exported fields of the Kubernetes integration, you can refer to <a href="https://www.elastic.co/guide/en/beats/filebeat/current/exported-fields-kubernetes-processor.html">Kubernetes fields | Filebeat Reference [8.11] | Elastic</a> and <a href="https://www.elastic.co/guide/en/observability/current/logs-app-fields.html">Logs app fields | Elastic Observability [8.11]</a>. Further, specific fields like kubernetes.annotations must be normalized by replacing dots with underscores. This operation is usually done automatically by Elastic Agent.</p>
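<p>As a quick illustration, the dots-to-underscores key sanitization that Elastic Agent normally performs, and that the Painless scripts in the pipeline replicate, can be sketched in Python (the function name is ours, for illustration only):</p>

```python
def sanitize_keys(fields):
    """Replace dots with underscores in map keys (e.g. kubernetes.annotations),
    mirroring the normalization Elastic Agent applies automatically."""
    return {k.replace(".", "_"): v for k, v in fields.items()}

# Example: an OpenShift-style annotation map becomes integration-friendly keys.
annotations = {"app.kubernetes.io/name": "nginx", "team": "obs"}
print(sanitize_keys(annotations))
```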
<pre><code class="language-bash">PUT _ingest/pipeline/openshift-2-ecs
{
  &quot;processors&quot;: [
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_id&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.uid&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_ip&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.ip&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.pod.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.namespace_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.namespace&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.namespace_id&quot;,
        &quot;target_field&quot;: &quot;kubernetes.namespace_uid&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_id&quot;,
        &quot;target_field&quot;: &quot;container.id&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;dissect&quot;: {
        &quot;field&quot;: &quot;container.id&quot;,
        &quot;pattern&quot;: &quot;%{container.runtime}://%{container.id}&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_image&quot;,
        &quot;target_field&quot;: &quot;container.image.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.container.image&quot;,
        &quot;copy_from&quot;: &quot;container.image.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;kubernetes.container_name&quot;,
        &quot;field&quot;: &quot;container.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;kubernetes.container_name&quot;,
        &quot;target_field&quot;: &quot;kubernetes.container.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.node.name&quot;,
        &quot;copy_from&quot;: &quot;hostname&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;hostname&quot;,
        &quot;target_field&quot;: &quot;host.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;level&quot;,
        &quot;target_field&quot;: &quot;log.level&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;file&quot;,
        &quot;target_field&quot;: &quot;log.file.path&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;openshift.cluster_id&quot;,
        &quot;field&quot;: &quot;orchestrator.cluster.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;dissect&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod_owner&quot;,
        &quot;pattern&quot;: &quot;%{_tmp.parent_type}/%{_tmp.parent_name}&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;lowercase&quot;: {
        &quot;field&quot;: &quot;_tmp.parent_type&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.pod.{{_tmp.parent_type}}.name&quot;,
        &quot;value&quot;: &quot;{{_tmp.parent_name}}&quot;,
        &quot;if&quot;: &quot;ctx?._tmp?.parent_type != null&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;remove&quot;: {
        &quot;field&quot;: [
          &quot;_tmp&quot;,
          &quot;kubernetes.pod_owner&quot;
          ],
          &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize kubernetes annotations&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.annotations != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.annotations.keySet());
        for(k in keys) {
          if (k.indexOf(&quot;.&quot;) &gt;= 0) {
            def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
            ctx.kubernetes.annotations[sanitizedKey] = ctx.kubernetes.annotations[k];
            ctx.kubernetes.annotations.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize kubernetes namespace_labels&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.namespace_labels != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.namespace_labels.keySet());
        for(k in keys) {
          if (k.indexOf(&quot;.&quot;) &gt;= 0) {
            def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
            ctx.kubernetes.namespace_labels[sanitizedKey] = ctx.kubernetes.namespace_labels[k];
            ctx.kubernetes.namespace_labels.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    },
    {
      &quot;script&quot;: {
        &quot;description&quot;: &quot;Normalize special Kubernetes Labels used in logs-kubernetes.container_logs to determine service.name and service.version&quot;,
        &quot;if&quot;: &quot;ctx?.kubernetes?.labels != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
        def keys = new ArrayList(ctx.kubernetes.labels.keySet());
        for(k in keys) {
          if (k.startsWith(&quot;app_kubernetes_io_component_&quot;)) {
            def sanitizedKey = k.replace(&quot;app_kubernetes_io_component_&quot;, &quot;app_kubernetes_io_component/&quot;);
            ctx.kubernetes.labels[sanitizedKey] = ctx.kubernetes.labels[k];
            ctx.kubernetes.labels.remove(k);
          }
        }
        &quot;&quot;&quot;
      }
    }
    ]
}
</code></pre>
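<p>For instance, the dissect, lowercase, and set steps above turn the kubernetes.pod_owner field (e.g. <code>ReplicaSet/my-app</code>) into a <code>kubernetes.pod.&lt;kind&gt;.name</code> field. An illustrative Python sketch of that logic (not the pipeline itself):</p>

```python
def pod_owner_fields(pod_owner):
    """Mimic the dissect + lowercase + set processors:
    'Deployment/frontend' -> {'kubernetes.pod.deployment.name': 'frontend'}."""
    parent_type, parent_name = pod_owner.split("/", 1)
    return {f"kubernetes.pod.{parent_type.lower()}.name": parent_name}

print(pod_owner_fields("ReplicaSet/my-app-7c9d8"))
```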
<p>Similarly, to handle the audit logs like the ones collected by Kubernetes, we define an ingest pipeline:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/openshift-audit-2-ecs
{
  &quot;processors&quot;: [
    {
      &quot;script&quot;: {
        &quot;source&quot;: &quot;&quot;&quot;
        def audit = [:];
        def keyToRemove = [];
        for(k in ctx.keySet()) {
          if (k.indexOf('_') != 0 &amp;&amp; !['@timestamp', 'data_stream', 'openshift', 'event', 'hostname'].contains(k)) {
            audit[k] = ctx[k];
            keyToRemove.add(k);
          }
        }
        for(k in keyToRemove) {
          ctx.remove(k);
        }
        ctx.kubernetes=[&quot;audit&quot;:audit];
        &quot;&quot;&quot;,
        &quot;description&quot;: &quot;Move all top-level audit fields under the 'kubernetes.audit' object&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;copy_from&quot;: &quot;openshift.cluster_id&quot;,
        &quot;field&quot;: &quot;orchestrator.cluster.name&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;kubernetes.node.name&quot;,
        &quot;copy_from&quot;: &quot;hostname&quot;,
        &quot;ignore_failure&quot;: true
      }
    },
    {
      &quot;rename&quot;: {
        &quot;field&quot;: &quot;hostname&quot;,
        &quot;target_field&quot;: &quot;host.name&quot;,
        &quot;ignore_missing&quot;: true
      }
    },
    {
      &quot;script&quot;: {
        &quot;if&quot;: &quot;ctx?.kubernetes?.audit?.annotations != null&quot;,
        &quot;source&quot;: &quot;&quot;&quot;
          def keys = new ArrayList(ctx.kubernetes.audit.annotations.keySet());
          for(k in keys) {
            if (k.indexOf(&quot;.&quot;) &gt;= 0) {
              def sanitizedKey = k.replace(&quot;.&quot;, &quot;_&quot;);
              ctx.kubernetes.audit.annotations[sanitizedKey] = ctx.kubernetes.audit.annotations[k];
              ctx.kubernetes.audit.annotations.remove(k);
            }
          }
          &quot;&quot;&quot;,
        &quot;description&quot;: &quot;Normalize kubernetes audit annotations field as expected by the Integration&quot;
      }
    }
  ]
}
</code></pre>
<p>The main objective of the pipeline is to mimic what Elastic Agent is doing: storing all audit fields under the kubernetes.audit object.</p>
<p>We are not going to use the conventional @custom pipeline approach because the fields must be normalized before invoking the logs-kubernetes.container_logs integration pipeline that uses fields like kubernetes.container.name and kubernetes.labels to determine the fields service.name and service.version. Read more about custom pipelines in <a href="https://www.elastic.co/guide/en/fleet/8.11/data-streams-pipeline-tutorial.html#data-streams-pipeline-one">Tutorial: Transform data with custom ingest pipelines | Fleet and Elastic Agent Guide [8.11]</a>.</p>
<p>The OpenShift Cluster Log Forwarder writes the data in the indices app-write and audit-write by default. It is possible to change this behavior, but it still tries to prepend the prefix “app” and the suffix “write”, so we opted to send the data to the default destination and use the reroute processor to send it to the right data streams. Read more about the Reroute Processor in our blog <a href="https://www.elastic.co/blog/simplifying-log-data-management-flexible-routing-elastic">Simplifying log data management: Harness the power of flexible routing with Elastic</a> and our documentation <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/reroute-processor.html">Reroute processor | Elasticsearch Guide [8.11] | Elastic</a>.</p>
<p>In this case, we want to redirect the container logs (app-write index) to logs-kubernetes.container_logs and the Audit logs (audit-write) to logs-kubernetes.audit_logs:</p>
<pre><code class="language-bash">PUT _ingest/pipeline/app-write-reroute-pipeline
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;openshift-2-ecs&quot;,
        &quot;description&quot;: &quot;Format the Openshift data in ECS&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.dataset&quot;,
        &quot;value&quot;: &quot;kubernetes.container_logs&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;destination&quot;: &quot;logs-kubernetes.container_logs-openshift&quot;
      }
    }
  ]
}



PUT _ingest/pipeline/audit-write-reroute-pipeline
{
  &quot;processors&quot;: [
    {
      &quot;pipeline&quot;: {
        &quot;name&quot;: &quot;openshift-audit-2-ecs&quot;,
        &quot;description&quot;: &quot;Format the Openshift data in ECS&quot;
      }
    },
    {
      &quot;set&quot;: {
        &quot;field&quot;: &quot;event.dataset&quot;,
        &quot;value&quot;: &quot;kubernetes.audit_logs&quot;
      }
    },
    {
      &quot;reroute&quot;: {
        &quot;destination&quot;: &quot;logs-kubernetes.audit_logs-openshift&quot;
      }
    }
  ]
}
</code></pre>
<p>Note that because app-write and audit-write do not follow the data stream naming convention, we must explicitly set the destination field in the reroute processor. The reroute processor will also fill in the <a href="https://www.elastic.co/guide/en/ecs/8.11/ecs-data_stream.html">data_stream fields</a> for us. This step is done automatically by Elastic Agent at the source.</p>
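<p>The destinations follow the data stream naming convention <code>&lt;type&gt;-&lt;dataset&gt;-&lt;namespace&gt;</code>. A small, illustrative sketch of how such a name decomposes (assuming the namespace itself contains no dash):</p>

```python
def split_data_stream(name):
    """Decompose a data stream name into the data_stream.* fields the
    reroute processor fills in: type before the first dash, namespace
    after the last dash, dataset in between."""
    ds_type, rest = name.split("-", 1)
    dataset, namespace = rest.rsplit("-", 1)
    return {"type": ds_type, "dataset": dataset, "namespace": namespace}

print(split_data_stream("logs-kubernetes.container_logs-openshift"))
```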
<p>Next, we create the indices with the default pipelines we just defined, so that incoming logs are rerouted according to our needs.</p>
<pre><code class="language-bash">PUT app-write
{
  &quot;settings&quot;: {
      &quot;index.default_pipeline&quot;: &quot;app-write-reroute-pipeline&quot;
   }
}


PUT audit-write
{
  &quot;settings&quot;: {
    &quot;index.default_pipeline&quot;: &quot;audit-write-reroute-pipeline&quot;
  }
}
</code></pre>
<p>Basically, what we did can be summarized in this picture:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/openshift-summary-blog.png" alt="openshift-summary-blog" /></p>
<p>Let us take the container logs. When the operator attempts to write to the app-write index, it invokes the default_pipeline “app-write-reroute-pipeline”, which formats the logs into ECS and reroutes them to the logs-kubernetes.container_logs-openshift data stream. This calls the integration pipeline, which in turn invokes, if it exists, the logs-kubernetes.container_logs@custom pipeline. Finally, the logs-kubernetes.container_logs pipeline may reroute the logs to another data set and namespace using the elastic.co/dataset and elastic.co/namespace annotations, as described in the Kubernetes <a href="https://docs.elastic.co/integrations/kubernetes/container-logs#rerouting-based-on-pod-annotations">integration documentation</a>, which can lead to the execution of yet another integration pipeline.</p>
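<p>The annotation-based rerouting step can be pictured with a small Python sketch. The key names are shown after the dots-to-underscores sanitization, and the defaults here are illustrative, matching the data stream used in this post:</p>

```python
def reroute_destination(annotations,
                        dataset="kubernetes.container_logs",
                        namespace="openshift"):
    """Pick the destination data stream from pod annotations, falling back
    to the current dataset/namespace when no annotation is present."""
    dataset = annotations.get("elastic_co/dataset", dataset)
    namespace = annotations.get("elastic_co/namespace", namespace)
    return f"logs-{dataset}-{namespace}"

# Without annotations, logs stay in the default data stream.
print(reroute_destination({}))
# With an annotation, they are rerouted to a different dataset.
print(reroute_destination({"elastic_co/dataset": "nginx.access"}))
```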
<h3>Create a user for sending the logs</h3>
<p>We are going to use basic authentication because, at the time of writing, it is the only supported authentication method for Elasticsearch in OpenShift logging. Thus, we need a role that allows the user to write and read the app-write and audit-write indices (required by the OpenShift agent), plus auto_configure access to logs-*-* to allow custom Kubernetes rerouting:</p>
<pre><code class="language-bash">PUT _security/role/YOURROLE
{
    &quot;cluster&quot;: [
      &quot;monitor&quot;
    ],
    &quot;indices&quot;: [
      {
        &quot;names&quot;: [
          &quot;logs-*-*&quot;
        ],
        &quot;privileges&quot;: [
          &quot;auto_configure&quot;,
          &quot;create_doc&quot;
        ],
        &quot;allow_restricted_indices&quot;: false
      },
      {
        &quot;names&quot;: [
          &quot;app-write&quot;,
          &quot;audit-write&quot;
        ],
        &quot;privileges&quot;: [
          &quot;create_doc&quot;,
          &quot;read&quot;
        ],
        &quot;allow_restricted_indices&quot;: false
      }
    ],
    &quot;applications&quot;: [],
    &quot;run_as&quot;: [],
    &quot;metadata&quot;: {},
    &quot;transient_metadata&quot;: {
      &quot;enabled&quot;: true
    }

}



PUT _security/user/YOUR_USERNAME
{
  &quot;password&quot;: &quot;YOUR_PASSWORD&quot;,
  &quot;roles&quot;: [&quot;YOURROLE&quot;]
}
</code></pre>
<h3>On OpenShift</h3>
<p>On the OpenShift Cluster, we need to follow the <a href="https://docs.openshift.com/container-platform/4.14/logging/log_collection_forwarding/log-forwarding.html">official documentation</a> of Red Hat on how to install the Red Hat OpenShift Logging and configure Cluster Logging and the Cluster Log Forwarder.</p>
<p>We need to install the Red Hat OpenShift Logging Operator, which defines the ClusterLogging and ClusterLogForwarder Resources. Afterward, we can define the Cluster Logging resource:</p>
<pre><code class="language-yaml">apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: vector
      vector: {}
</code></pre>
<p>The Cluster Log Forwarder is the resource responsible for defining a daemon set that forwards the logs to the remote Elasticsearch. Before creating it, we need to create a secret containing the Elasticsearch credentials of the user we created previously, in the same namespace where the ClusterLogForwarder will be deployed:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-password
  namespace: openshift-logging
type: Opaque
stringData:
  username: YOUR_USERNAME
  password: YOUR_PASSWORD
</code></pre>
<p>Finally, we create the ClusterLogForwarder resource:</p>
<pre><code class="language-yaml">kind: ClusterLogForwarder
apiVersion: logging.openshift.io/v1
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: remote-elasticsearch
      secret:
        name: elasticsearch-password
      type: elasticsearch
      url: &quot;https://YOUR_ELASTICSEARCH_URL:443&quot;
      elasticsearch:
        version: 8 # The default is version 6 with the _type field
  pipelines:
    - inputRefs:
        - application
        - audit
      name: enable-default-log-store
      outputRefs:
        - remote-elasticsearch
</code></pre>
<p>Note that we explicitly set the Elasticsearch version to 8; otherwise, the ClusterLogForwarder would send the _type field, which is not compatible with Elasticsearch 8. Also note that we collect only application and audit logs.</p>
<h2>Result</h2>
<p>Once the logs are collected and passed through all the pipelines, the result is very close to the out-of-the-box Kubernetes integration. There are important differences, such as the lack of host and cloud metadata, which does not seem to be collected (at least not without additional configuration). We can view the Kubernetes container logs in the logs explorer:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/openshift-summary-blog-graphs.png" alt="openshift-summary-blog-graphs" /></p>
<p>In this post, we described how you can use the OpenShift Logging Operator to collect container and audit logs. We still recommend leveraging Elastic Agent to collect all your logs: it is the best user experience you can get, with no need to maintain pipelines or transform the logs to ECS yourself. Additionally, Elastic Agent uses API keys for authentication and collects metadata, like cloud information, that allows you to do <a href="https://www.elastic.co/blog/optimize-cloud-resources-cost-apm-metadata-elastic-observability">more</a> in the long run.</p>
<p><a href="https://www.elastic.co/observability/log-monitoring">Learn more about log monitoring with the Elastic Stack</a>.</p>
<p><em>Have feedback on this blog?</em> <a href="https://github.com/herrBez/elastic-blog-openshift-logging/issues"><em>Share it here</em></a><em>.</em></p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/openshift-container-logs-red-hat-logging-operator/139687_-_Blog_Header_Banner_V1.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Optimizing Observability with ES|QL: Streamlining SRE operations and issue resolution for Kubernetes and OTel]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-kubernetes-esql</link>
            <guid isPermaLink="false">opentelemetry-kubernetes-esql</guid>
            <pubDate>Wed, 01 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[ES|QL enhances operational efficiency, data analysis, and issue resolution for SREs. This blog covers the advantages of ES|QL in Elastic Observability and how it can apply to managing issues instrumented with OpenTelemetry and running on Kubernetes.]]></description>
            <content:encoded><![CDATA[<p>As an operations engineer (SRE, IT Operations, DevOps), managing technology and data sprawl is an ongoing challenge. Simply managing the large volumes of high dimensionality and high cardinality data is overwhelming.</p>
<p>As a single platform, Elastic® helps SREs unify and correlate limitless telemetry data, including metrics, logs, traces, and profiling, into a single datastore — Elasticsearch®. By then applying the power of Elastic’s advanced machine learning (ML), AIOps, AI Assistant, and analytics, you can break down silos and turn data into insights. As a full-stack observability solution, everything from infrastructure monitoring to log monitoring and application performance monitoring (APM) can be found in a single, unified experience.</p>
<p>In Elastic 8.11, a technical preview is now available of <a href="https://www.elastic.co/blog/esql-elasticsearch-piped-query-language">Elastic’s new piped query language, ES|QL (Elasticsearch Query Language)</a>, which transforms, enriches, and simplifies data investigations. Powered by a new query engine, ES|QL delivers advanced search capabilities with concurrent processing, improving speed and efficiency, irrespective of data source and structure. Accelerate resolution by creating aggregations and visualizations from one screen, delivering an iterative, uninterrupted workflow.</p>
<h2>Advantages of ES|QL for SREs</h2>
<p>SREs using Elastic Observability can leverage ES|QL to analyze logs, metrics, traces, and profiling data, enabling them to pinpoint performance bottlenecks and system issues with a single query. SREs gain the following advantages when managing high dimensionality and high cardinality data with ES|QL in Elastic Observability:</p>
<ul>
<li><strong>Improved operational efficiency:</strong> By using ES|QL, SREs can create more actionable notifications with aggregated values as thresholds from a single query, which can also be managed through the Elastic API and integrated into DevOps processes.</li>
<li><strong>Enhanced analysis with insights:</strong> ES|QL can process diverse observability data, including application, infrastructure, business data, and more, regardless of the source and structure. ES|QL can easily enrich the data with additional fields and context, allowing the creation of visualizations for dashboards or issue analysis with a single query.</li>
<li><strong>Reduced mean time to resolution:</strong> ES|QL, when combined with Elastic Observability's AIOps and AI Assistant, enhances detection accuracy by identifying trends, isolating incidents, and reducing false positives. This improvement in context facilitates troubleshooting and the quick pinpointing and resolution of issues.</li>
</ul>
<p>ES|QL in Elastic Observability not only enhances an SRE's ability to manage the customer experience, an organization's revenue, and SLOs more effectively but also facilitates collaboration with developers and DevOps by providing contextualized aggregated data.</p>
<p>In this blog, we will cover some of the key use cases SREs can leverage with ES|QL:</p>
<ul>
<li>ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.</li>
<li>SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.</li>
<li>Actionable alerts can be easily created from a single ES|QL query, enhancing operations.</li>
</ul>
<p>I will work through these use cases by showcasing how an SRE can solve a problem in an application instrumented with OpenTelemetry and running on Kubernetes. The OpenTelemetry (OTel) demo is on an Amazon EKS cluster, with Elastic Cloud 8.11 configured.</p>
<p>You can also check out our <a href="https://www.youtube.com/watch?v=vm0pBWI2l9c">Elastic Observability ES|QL Demo</a>, which walks through ES|QL functionality for Observability.</p>
<h2>ES|QL with AI Assistant</h2>
<p>As an SRE, you are monitoring your OTel instrumented application with Elastic Observability, and while in Elastic APM, you notice some issues highlighted in the service map.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-1-services.png" alt="1 - services" /></p>
<p>Using Elastic AI Assistant, you can easily ask for analysis, and in particular, we check on what the overall latency is across the application services.</p>
<pre><code class="language-plaintext">My APM data is in traces-apm*. What's the average latency per service over the last hour? Use ESQL, the data is mapped to ECS
</code></pre>
<p>The Elastic AI Assistant generates an ES|QL query, which we run in the AI Assistant to get a list of the average latencies across all the application services. We can easily see the top four are:</p>
<ul>
<li>load generator</li>
<li>front-end proxy</li>
<li>frontendservice</li>
<li>checkoutservice</li>
</ul>
<p>With a simple natural language query in the AI Assistant, it generated a single ES|QL query that helped list out the latencies across the services.</p>
<p>Noticing that there is an issue with several services, we decide to start with the frontend proxy. As we work through the details, we see significant failures, and through <strong>Elastic APM failure correlation</strong> , it becomes apparent that the frontend proxy is not properly completing its calls to downstream services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-2-failed-transaction.png" alt="2 - failed transaction" /></p>
<h2>ES|QL insightful and contextual analysis in Discover</h2>
<p>Knowing that the application is running on Kubernetes, we investigate if there are issues in Kubernetes. In particular, we want to see if there are any services having issues.</p>
<p>We use the following query in ES|QL in Elastic Discover:</p>
<pre><code class="language-sql">from metrics-*
| where kubernetes.container.status.last_terminated_reason != &quot;&quot; and kubernetes.namespace == &quot;default&quot;
| stats reason_count=count(kubernetes.container.status.last_terminated_reason) by kubernetes.container.name, kubernetes.container.status.last_terminated_reason
| where reason_count &gt; 0
</code></pre>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-3-two-horizontal-bar-graphs.png" alt="3 - horizontal graph" /></p>
<p>ES|QL helps analyze thousands of metric events from Kubernetes and highlights two services that are restarting due to OOMKilled.</p>
<p>The Elastic AI Assistant, when asked about OOMKilled, indicates that a container in a pod was killed due to an out-of-memory condition.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-4-understanding-oomkilled.png" alt="4 - understanding oomkilled" /></p>
<p>We run another ES|QL query to understand the memory usage for emailservice and productcatalogservice.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-5-split-bar-graphs.png" alt="5 - split bar graphs" /></p>
<p>The ES|QL query shows that the average memory usage for both services is fairly high.</p>
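<p>For reference, a memory query of this shape yields that view (a sketch; the container names come from the demo application):</p>
<pre><code class="language-sql">FROM metrics-*
| WHERE kubernetes.container.name IN (&quot;emailservice&quot;, &quot;productcatalogservice&quot;)
| STATS avg_memory_pct = AVG(kubernetes.pod.memory.usage.limit.pct) BY kubernetes.container.name
</code></pre>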
<p>We can now further investigate both of these services’ logs, metrics, and Kubernetes-related data. However, before we continue, we create an alert to track heavy memory usage.</p>
<h2>Actionable alerts with ES|QL</h2>
<p>Suspecting that this specific issue might recur, we create an alert based on the ES|QL query we just ran, tracking any service that exceeds 50% memory utilization.</p>
<p>We modify the last query to find any service with high memory usage:</p>
<pre><code class="language-sql">FROM metrics*
| WHERE @timestamp &gt;= NOW() - 1 hours
| STATS avg_memory_usage = AVG(kubernetes.pod.memory.usage.limit.pct) BY kubernetes.deployment.name
| WHERE avg_memory_usage &gt; .5
</code></pre>
<p>With that query, we create a simple alert. Notice how the ES|QL query is brought into the alert. We connect this alert to PagerDuty, but we can choose from multiple connectors such as ServiceNow, Opsgenie, and email.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/elastic-blog-6-create-rule.png" alt="6 - create rule" /></p>
<p>With this alert, we can now easily monitor for any services that exceed 50% memory utilization in their pods.</p>
<h2>Make the most of your data with ES|QL</h2>
<p>In this post, we demonstrated the power ES|QL brings to analysis, operations, and reducing MTTR. In summary, the three use cases with ES|QL in Elastic Observability are as follows:</p>
<ul>
<li>ES|QL integrated with the Elastic AI Assistant, which uses public LLM and private data, enhances the analysis experience anywhere in Elastic Observability.</li>
<li>SREs can, in a single ES|QL query, break down, analyze, and visualize observability data from multiple sources and across any time frame.</li>
<li>Actionable alerts can be easily created from a single ES|QL query, enhancing operations.</li>
</ul>
<p>Elastic invites SREs and developers to experience this transformative language firsthand and unlock new horizons in their data tasks. Try it today at <a href="https://ela.st/free-trial">https://ela.st/free-trial</a> now in technical preview.</p>
<blockquote>
<ul>
<li><a href="https://www.elastic.co/demo-gallery/observability">Elastic Observability Tour</a></li>
<li><a href="https://www.elastic.co/blog/log-management-observability-operations">The power of effective log management</a></li>
<li><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Transforming Observability with the AI Assistant</a></li>
<li><a href="https://www.elastic.co/blog/esql-elasticsearch-piped-query-language">ES|QL announcement blog</a></li>
</ul>
</blockquote>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-kubernetes-esql/ES_QL_blog-720x420-05.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Independence with OpenTelemetry on Elastic]]></title>
            <link>https://www.elastic.co/observability-labs/blog/opentelemetry-observability</link>
            <guid isPermaLink="false">opentelemetry-observability</guid>
            <pubDate>Tue, 15 Nov 2022 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenTelemetry has become a key component for observability given its open standards and developer-friendly tools. See how easily Elastic Observability integrates with OTel to provide a platform that minimizes vendor lock-in and maximizes flexibility.]]></description>
<content:encoded><![CDATA[<p>The drive for faster, more scalable services is on the rise. Our day-to-day lives depend on apps: from food delivery apps that bring your favorite meal, to banking apps that manage your accounts, to apps that schedule doctor’s appointments. These apps need to grow not only from a feature standpoint but also in terms of user capacity. The scale and need for global reach drive increasing complexity for these high-demand cloud applications.</p>
<p>In order to keep pace with demand, most of these online apps and services (for example, mobile applications, web pages, SaaS) are moving to a distributed microservice-based architecture and Kubernetes. Once you’ve migrated your app to the cloud, how do you manage and monitor production, scale, and availability of the service? <a href="https://opentelemetry.io/">OpenTelemetry</a> is quickly becoming the de facto standard for instrumentation and collecting application telemetry data for Kubernetes applications.</p>
<p><a href="https://www.elastic.co/what-is/opentelemetry">OpenTelemetry (OTel)</a> is an open source project providing a collection of tools, APIs, and SDKs that can be used to generate, collect, and export telemetry data (metrics, logs, and traces) to understand software performance and behavior. OpenTelemetry recently became a CNCF incubating project and has a significant amount of growing community and vendor support.</p>
<p>While OTel provides a standard way to instrument applications with a standard telemetry format, it doesn’t provide any backend or analytics components. Using OTel libraries in applications, infrastructure, and user experience monitoring therefore gives you the flexibility to choose the <a href="https://www.elastic.co/observability">observability tool</a> that fits your needs. There is no longer any vendor lock-in for application performance monitoring (APM).</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-1.png" alt="" /></p>
<p>Elastic Observability natively supports OpenTelemetry and its OpenTelemetry protocol (OTLP) to ingest traces, metrics, and logs. All of Elastic Observability’s APM capabilities are available with OTel data. Hence the following capabilities (and more) are available for OTel data:</p>
<ul>
<li>Service maps</li>
<li>Service details (latency, throughput, failed transactions)</li>
<li>Dependencies between services</li>
<li>Transactions (traces)</li>
<li>ML correlations (specifically for latency)</li>
<li>Service logs</li>
</ul>
<p>In addition to Elastic’s APM and unified view of the telemetry data, you can now use Elastic’s powerful machine learning capabilities to accelerate analysis and alerting, helping reduce MTTR.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-2.png" alt="" /></p>
<p>Given its open source heritage, Elastic also supports other CNCF projects, such as Prometheus, Fluentd, Fluent Bit, Istio, Kubernetes (K8s), and many more.</p>
<p>This blog will show:</p>
<ul>
<li>How to get a popular OTel instrumented demo app (Hipster Shop) configured to ingest into <a href="http://cloud.elastic.co">Elastic Cloud</a> through a few easy steps</li>
<li>Some of the Elastic APM capabilities and features around OTel data, and what you can do with this data once it’s in Elastic</li>
</ul>
<p>In follow-up blogs, we will detail how to use Elastic’s machine learning with OTel telemetry data, how to instrument OTel application metrics for specific languages, how we can support Prometheus ingest through the OTel collector, and more. Stay tuned!</p>
<h2>Prerequisites and config</h2>
<p>If you plan on following this blog, here are some of the components and details we used to set up the configuration:</p>
<ul>
<li>Ensure you have an account on <a href="http://cloud.elastic.co">Elastic Cloud</a> and a deployed stack (<a href="https://www.elastic.co/guide/en/elastic-stack/current/installing-elastic-stack.html">see instructions here</a>).</li>
<li>We used the OpenTelemetry Demo. Directions for using Elastic with OpenTelemetry Demo are <a href="https://github.com/elastic/opentelemetry-demo">here</a>.</li>
<li>Make sure you have <a href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a> and <a href="https://helm.sh/">helm</a> also installed locally.</li>
<li>Additionally, we are using an OTel manually instrumented version of the application. No OTel automatic instrumentation was used in this blog configuration.</li>
<li>Cluster location: while we used Google Kubernetes Engine (GKE), you can use any Kubernetes platform of your choice.</li>
<li>While Elastic can ingest telemetry directly from OTel instrumented services, we will focus on the more traditional deployment, which uses the OpenTelemetry Collector.</li>
<li>Prometheus and Fluentd/Fluent Bit, traditionally used to pull Kubernetes data, are not used here; follow-up blogs will showcase this setup.</li>
</ul>
<p>Here is the configuration we will get set up in this blog:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-3.png" alt="Configuration to ingest OpenTelemetry data used in this blog" /></p>
<h2>Setting it all up</h2>
<p>Over the next few steps, I’ll walk through setting up an <a href="https://www.elastic.co/observability/opentelemetry">OpenTelemetry visualization</a>:</p>
<ul>
<li>Getting an account on Elastic Cloud</li>
<li>Bringing up a GKE cluster</li>
<li>Bringing up the application</li>
<li>Configuring Kubernetes OTel Collector configmap to point to Elastic Cloud</li>
<li>Using Elastic Observability APM with OTel data for improved visibility</li>
</ul>
<h3>Step 0: Create an account on Elastic Cloud</h3>
<p>Follow the instructions to <a href="https://cloud.elastic.co/registration?fromURI=/home">get started on Elastic Cloud</a>.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-otel-4.png" alt="" /></p>
<h3>Step 1: Bring up a K8S cluster</h3>
<p>We used Google Kubernetes Engine (GKE), but you can use any Kubernetes platform of your choice.</p>
<p>There are no special requirements for Elastic to collect OpenTelemetry data from a Kubernetes cluster. Any normal Kubernetes cluster on GKE, EKS, AKS, or Kubernetes compliant cluster (self-deployed and managed) works.</p>
<h3>Step 2: Load the OpenTelemetry demo application on the cluster</h3>
<p>Get your application on a Kubernetes cluster in your cloud service of choice or local Kubernetes platform. The application I am using is available <a href="https://github.com/bshetti/opentelemetry-microservices-demo/tree/main/deploy-with-collector-k8s">here</a>.</p>
<p>First clone the directory locally:</p>
<pre><code class="language-bash">git clone https://github.com/elastic/opentelemetry-demo.git
</code></pre>
<p>(Make sure you have <a href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a> and <a href="https://helm.sh/">helm</a> also installed locally.)</p>
<p>The instructions utilize a specific opentelemetry-collector configuration for Elastic. Essentially, the Elastic <a href="https://github.com/elastic/opentelemetry-demo/blob/main/kubernetes/elastic-helm/values.yaml">values.yaml</a> file in the elastic/opentelemetry-demo repository configures the opentelemetry-collector to point to the Elastic APM Server using two main values:</p>
<ul>
<li>OTEL_EXPORTER_OTLP_ENDPOINT: Elastic’s APM Server endpoint</li>
<li>OTEL_EXPORTER_OTLP_HEADERS: the Elastic authorization header</li>
</ul>
<p>These two values can be found in the OpenTelemetry setup instructions under the APM integration instructions (Integrations-&gt;APM) in your Elastic cloud.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-apm-agents.png" alt="elastic apm agents" /></p>
<p>Once you obtain these, the first step is to create a secret on the cluster with your Elastic APM server endpoint and your APM secret token, using the following command:</p>
<pre><code class="language-bash">kubectl create secret generic elastic-secret \
  --from-literal=elastic_apm_endpoint='YOUR_APM_ENDPOINT_WITHOUT_HTTPS_PREFIX' \
  --from-literal=elastic_apm_secret_token='YOUR_APM_SECRET_TOKEN'
</code></pre>
<p>Don't forget to replace:</p>
<ul>
<li>YOUR_APM_ENDPOINT_WITHOUT_HTTPS_PREFIX: your Elastic APM endpoint ( <strong>without the https:// prefix</strong> ), used as OTEL_EXPORTER_OTLP_ENDPOINT</li>
<li>YOUR_APM_SECRET_TOKEN: your Elastic APM secret token, used in OTEL_EXPORTER_OTLP_HEADERS</li>
</ul>
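<p>Behind the scenes, the Helm values wire these two settings into the collector’s OTLP exporter. The relevant fragment looks roughly like this (a sketch; the exporter name and environment variable references are illustrative):</p>
<pre><code class="language-yaml">exporters:
  otlp/elastic:
    # Elastic APM Server endpoint (OTEL_EXPORTER_OTLP_ENDPOINT)
    endpoint: &quot;${ELASTIC_APM_ENDPOINT}&quot;
    headers:
      # Elastic authorization (OTEL_EXPORTER_OTLP_HEADERS)
      Authorization: &quot;Bearer ${ELASTIC_APM_SECRET_TOKEN}&quot;
</code></pre>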
<p>Now execute the following commands:</p>
<pre><code class="language-bash"># switch to the kubernetes/elastic-helm directory
cd kubernetes/elastic-helm

# add the open-telemetry Helm repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts

# deploy the demo through helm install
helm install -f values.yaml my-otel-demo open-telemetry/opentelemetry-demo
</code></pre>
<p>Once your application is up on Kubernetes, you will have the following pods (or some variant) running on the <strong>default</strong> namespace.</p>
<pre><code class="language-bash">kubectl get pods -n default
</code></pre>
<p>Output should be similar to the following:</p>
<pre><code class="language-bash">NAME                                                  READY   STATUS    RESTARTS      AGE
my-otel-demo-accountingservice-5c77754b4f-vwph6       1/1     Running   0             5d4h
my-otel-demo-adservice-6b8b7c7dc5-mb7j5               1/1     Running   0             5d4h
my-otel-demo-cartservice-76d94b7dcd-2g4lf             1/1     Running   0             5d4h
my-otel-demo-checkoutservice-988bbdb88-hmkrp          1/1     Running   0             5d4h
my-otel-demo-currencyservice-6cf4b5f9f6-vz9t2         1/1     Running   0             5d4h
my-otel-demo-emailservice-868c98fd4b-lpr7n            1/1     Running   6 (18h ago)   5d4h
my-otel-demo-featureflagservice-8446ff9c94-lzd4w      1/1     Running   0             5d4h
my-otel-demo-ffspostgres-867945d9cf-zzwd7             1/1     Running   0             5d4h
my-otel-demo-frauddetectionservice-5c97c589b9-z8fhz   1/1     Running   0             5d4h
my-otel-demo-frontend-d85ccf677-zg9fp                 1/1     Running   0             5d4h
my-otel-demo-frontendproxy-6c5c4fccf6-qmldp           1/1     Running   0             5d4h
my-otel-demo-kafka-68bcc66794-dsbr6                   1/1     Running   0             5d4h
my-otel-demo-loadgenerator-64c545b974-xfccq           1/1     Running   1 (36h ago)   5d4h
my-otel-demo-otelcol-fdfd9c7cf-6lr2w                  1/1     Running   0             5d4h
my-otel-demo-paymentservice-7955c68859-ff7zg          1/1     Running   0             5d4h
my-otel-demo-productcatalogservice-67c879657b-wn2wj   1/1     Running   0             5d4h
my-otel-demo-quoteservice-748d754ffc-qcwm4            1/1     Running   0             5d4h
my-otel-demo-recommendationservice-df78894c7-lwm5v    1/1     Running   0             5d4h
my-otel-demo-redis-7d48567546-h4p4t                   1/1     Running   0             5d4h
my-otel-demo-shippingservice-f6fc76ddd-2v7qv          1/1     Running   0             5d4h
</code></pre>
<h3>Step 3: Open Kibana and use the APM Service Map to view your OTel instrumented Services</h3>
<p>In the Elastic Observability UI under APM, select the service map to see your services.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-APM.png" alt="elastic observability APM" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-OTEL-service-map.png" alt="elastic observability OTEL service map" /></p>
<p>If you are seeing this, then the OpenTelemetry Collector is sending data into Elastic:</p>
<p><em>Congratulations, you’ve instrumented the OpenTelemetry demo application and successfully ingested the telemetry data into Elastic!</em></p>
<h3>Step 4: What can Elastic show me?</h3>
<p>Now that the OpenTelemetry data is ingested into Elastic, what can you do?</p>
<p>First, you can view the APM service map (as shown in the previous step) — this will give you a full view of all the services and the transaction flows between services.</p>
<p>Next, you can now check out individual services and the transactions being collected.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-overview.png" alt="elastic observability frontend overview" /></p>
<p>As you can see, the frontend details are listed. Everything from:</p>
<ul>
<li>Average service latency</li>
<li>Throughput</li>
<li>Main transactions</li>
<li>Failed transaction rate</li>
<li>Errors</li>
<li>Dependencies</li>
</ul>
<p>Let’s get to the trace. In the Transactions tab, you can review all the types of transactions related to the frontend service:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-transactions.png" alt="elastic observability frontend transactions" /></p>
<p>Selecting the HTTP POST transaction, we can see the full trace with all the spans:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-observability-frontend-HTTP-POST.png" alt="Average latency for this transaction, throughput, any failures, and of course the trace!" /></p>
<p>Not only can you review the trace, but you can also analyze what is contributing to higher than normal latency for HTTP POST.</p>
<p>Elastic uses machine learning to help identify any potential latency issues across the services from the trace. It’s as simple as selecting the Latency Correlations tab and running the correlation.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-latency-correlations.png" alt="elastic observability latency correlations" /></p>
<p>This shows that the high latency transactions are occurring in checkout service with a medium correlation.</p>
<p>You can then drill down into logs directly from the trace view and review the logs associated with the trace to help identify and pinpoint potential issues.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/blog-elastic-latency-distribution.png" alt="elastic observability latency distribution" /></p>
<h3>Analyze your data with Elastic machine learning (ML)</h3>
<p>Once OpenTelemetry metrics are in Elastic, start analyzing your data through Elastic’s ML capabilities.</p>
<p>A great review of these features can be found here: <a href="https://www.elastic.co/blog/apm-correlations-elastic-observability-root-cause-transactions">Correlating APM telemetry to determine root causes in transactions</a>. And there are many more videos and blogs on <a href="https://www.elastic.co/blog/">Elastic’s Blog</a>. We’ll follow up with additional blogs on leveraging Elastic’s machine learning capabilities for OpenTelemetry data.</p>
<h2>Conclusion</h2>
<p>I hope you’ve gotten an appreciation for how Elastic Observability can help you ingest and analyze OpenTelemetry data with Elastic’s APM capabilities.</p>
<p>A quick recap of what we covered:</p>
<ul>
<li>How to get a popular OTel instrumented demo app (Hipster Shop) configured to ingest into <a href="http://cloud.elastic.co">Elastic Cloud</a>, through a few easy steps</li>
<li>Some of the Elastic APM capabilities and features around OTel data, and what you can do with this data once it’s in Elastic</li>
</ul>
<p>Ready to get started? Sign up <a href="https://cloud.elastic.co/registration">for Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your OpenTelemetry data.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/opentelemetry-observability/illustration-scalability-gear-1680x980_(1).jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Build better Service Level Objectives (SLOs) from logs and metrics]]></title>
            <link>https://www.elastic.co/observability-labs/blog/service-level-objectives-slos-logs-metrics</link>
            <guid isPermaLink="false">service-level-objectives-slos-logs-metrics</guid>
            <pubDate>Fri, 23 Feb 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in 8.12. This blog reviews this feature and how you can use it with Elastic's AI Assistant to meet SLOs.]]></description>
            <content:encoded><![CDATA[<p>In today's digital landscape, applications are at the heart of both our personal and professional lives. We've grown accustomed to these applications being perpetually available and responsive. This expectation places a significant burden on the shoulders of developers and operations teams.</p>
<p>Site reliability engineers (SREs) face the challenging task of sifting through vast quantities of data, not just from the applications themselves but also from the underlying infrastructure. In addition to data analysis, they are responsible for ensuring the effective use and development of operational tools. The growing volume of data, the daily resolution of issues, and the continuous evolution of tools and processes can detract from the focus on business performance.</p>
<p>Elastic Observability offers a solution to this challenge. It enables SREs to integrate and examine all telemetry data (logs, metrics, traces, and profiling) in conjunction with business metrics. This comprehensive approach to data analysis fosters operational excellence, boosts productivity, and yields critical insights, all of which are integral to maintaining high-performing applications in a demanding digital environment.</p>
<p>To help manage operations and business metrics, Elastic Observability's SLO (Service Level Objectives) feature was introduced in <a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">8.12</a>. This feature enables setting measurable performance targets for services, such as <a href="https://sre.google/sre-book/monitoring-distributed-systems/">availability, latency, traffic, errors, and saturation or define your own</a>. Key components include:</p>
<ul>
<li>
<p>Defining and monitoring SLIs (Service Level Indicators)</p>
</li>
<li>
<p>Monitoring error budgets indicating permissible performance shortfalls</p>
</li>
<li>
<p>Alerting on burn rates showing error budget consumption</p>
</li>
</ul>
<p>Users can monitor SLOs in real-time with dashboards, track historical performance, and receive alerts for potential issues. Additionally, SLO dashboard panels offer customized visualizations.</p>
<p>Service Level Objectives (SLOs) are generally available for our Platinum and Enterprise subscription customers.</p>
<p>In this blog, we will outline the following:</p>
<ul>
<li>
<p>What are SLOs? A Google SRE perspective</p>
</li>
<li>
<p>Several scenarios of defining and managing SLOs</p>
</li>
</ul>
<h2>Service Level Objective overview</h2>
<p>Service Level Objectives (SLOs) are a crucial component for Site Reliability Engineering (SRE), as detailed in <a href="https://sre.google/sre-book/table-of-contents/">Google's SRE Handbook</a>. They provide a framework for quantifying and managing the reliability of a service. The key elements of SLOs include:</p>
<ul>
<li>
<p><strong>Service Level Indicators (SLIs):</strong> These are carefully selected metrics, such as uptime, latency, throughput, or error rates, that represent aspects of the service that matter from an operations or business perspective. An SLI is a measure of the service level provided (latency, uptime, etc.), defined as a ratio of good events over total events, ranging between 0% and 100%.</p>
</li>
<li>
<p><strong>Service Level Objective (SLO):</strong> An SLO is the target value for a service level, measured as a percentage by an SLI. Above the threshold, the service is compliant. For example, if we use service availability as an SLI with a target of 99.9% successful responses, then any time failed responses exceed 0.1%, the SLO is out of compliance.</p>
</li>
<li>
<p><strong>Error budget:</strong> This represents the threshold of acceptable errors, balancing the need for reliability with practical limits. It is defined as 100% minus the SLO: the quantity of errors that can be tolerated.</p>
</li>
<li>
<p><strong>Burn rate:</strong> This concept relates to how quickly the service is consuming its error budget, which is the acceptable threshold for unreliability agreed upon by the service providers and its users.</p>
</li>
</ul>
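<p>As a concrete illustration of how these quantities relate (the numbers are illustrative):</p>
<pre><code class="language-plaintext">SLO target   : 99.9% availability over a 30-day window
Error budget : 100% - 99.9% = 0.1%, about 43.2 minutes of downtime in 30 days
Burn rate    : a burn rate of 2 consumes the budget twice as fast as allowed,
               exhausting it in about 15 days instead of 30
</code></pre>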
<p>Understanding these concepts and effectively implementing them is essential for maintaining a balance between innovation and reliability in service delivery. For more detailed information, you can refer to <a href="https://sre.google/workbook/slo-document/">Google's SRE Handbook</a>.</p>
<p>One main thing to remember is that SLO monitoring is <em>not</em> incident monitoring. SLO monitoring is a proactive, strategic approach designed to ensure that services meet established performance standards and user expectations. It involves tracking Service Level Objectives, error budgets, and the overall reliability of a service over time. This predictive method helps in preventing issues that could impact users and aligns service performance with business objectives.</p>
<p>In contrast, incident monitoring is a reactive process focused on detecting, responding to, and mitigating service incidents as they occur. It aims to address unexpected disruptions or failures in real time, minimizing downtime and impact on service. This includes monitoring system health, errors, and response times during incidents, with a focus on rapid response to minimize disruption and preserve the service's reputation.</p>
<p>Elastic®’s SLO capability is based directly on the Google SRE Handbook, using the definitions and semantics described there. Users can perform the following with SLOs in Elastic:</p>
<ul>
<li>
<p>Define an SLO on an SLI such as a KQL (log-based) query, service availability, service latency, a custom metric, a histogram metric, or a timeslice metric. Additionally, set the appropriate threshold.</p>
</li>
<li>
<p>Utilize occurrence versus timeslice based budgeting. Occurrence budgeting computes the number of good events over the number of total events. Timeslice budgeting breaks the overall time window into smaller slices of a defined duration and computes the number of good slices over the total slices. Timeslice targets are more accurate and useful when calculating something like a service’s SLO against agreed upon customer targets.</p>
</li>
<li>
<p>Manage all the SLOs in a singular location.</p>
</li>
<li>
<p>Trigger alerts from the defined SLO, whether the SLI drops below target, the error budget is exhausted, or the burn rate exceeds a threshold.</p>
</li>
<li>
<p>Create unique service level dashboards with SLO information for a more comprehensive view of the service.</p>
</li>
</ul>
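<p>To illustrate the difference between the two budgeting methods (the numbers are illustrative):</p>
<pre><code class="language-plaintext">Occurrences : 995,000 good events / 1,000,000 total events = 99.5% SLI
Timeslices  : a 24-hour window split into 5-minute slices gives 288 slices;
              285 good slices / 288 total slices = ~98.96% SLI
</code></pre>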
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/1-slo-blog.png" alt="Create alerts" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/2-slo-blog.png" alt="Create dashboards" /></p>
<p>SREs need to be able to manage business metrics.</p>
<h2>SLOs based on logs: NGINX availability</h2>
<p>Defining SLOs does not always mean metrics need to be used. Logs are a rich form of information, even when they have metrics embedded in them. Hence it’s useful to understand your business and operations status based on logs.</p>
<p>Elastic allows you to create an SLO based on specific fields in the log message, which don’t have to be metrics. A simple example is a multi-tier app that has a web server layer (NGINX), a processing layer, and a database layer.</p>
<p>Let’s say that your processing layer is managing a significant number of requests, and you want to ensure that the service stays up. The best way is to ensure that all http.response.status_code values are less than 500. Any status code below 500 indicates the service is up, and any errors (like 404) are user or client errors rather than server errors.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/3-slo-blog.png" alt="expanded document" /></p>
<p>If we use Discover in Elastic, we see that there are close to 2M log messages over a seven-day time frame.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/4-slo-blog.png" alt="17k" /></p>
<p>Additionally, the number of messages with http.response.status_code &gt;= 500 is minimal, around 17K.</p>
<p>Rather than creating an alert, we can create an SLO with this query:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/5-slo-blog.png" alt="edit SLO" /></p>
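<p>The good and total event filters in such a definition can be expressed in KQL roughly as follows (a sketch; field names follow ECS):</p>
<pre><code class="language-plaintext">Good events query : http.response.status_code &lt; 500
Total events query: http.response.status_code : *
</code></pre>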
<p>We chose to use occurrences as the budgeting method to keep things simple.</p>
<p>Once defined, we can see how well our SLO is performing over a seven-day time frame. We can see not only the SLO, but also the burn rate, the historical SLI, the error budget, and any specific alerts against the SLO.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/6-slo-blog.png" alt="SLOs" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/7-slo-blog.png" alt="nginx server availability " /></p>
<p>Not only do we get information about the violation, but we also get:</p>
<ul>
<li>
<p>Historical SLI (7 days)</p>
</li>
<li>
<p>Error budget burn down</p>
</li>
<li>
<p>Good vs. bad events (24 hours)</p>
</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/8-slo-blog.png" alt="Percentages" /></p>
<p>We can see how we’ve easily burned through our error budget.</p>
<p>Something must be going on with NGINX. To investigate, all we need to do is use the <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">AI Assistant</a> and its natural language interface to ask questions that help analyze the situation.</p>
<p>Let’s use Elastic’s AI Assistant to analyze the breakdown of http.response.status_code across all the logs from the past seven days. This helps us understand how many 50X errors we are getting.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/9-slo-blog.png" alt="count of http response status code" /></p>
<p>As we can see, the number of 502s is small compared to the overall number of messages, but it is still affecting our SLO.</p>
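<p>The breakdown the Assistant produces can also be reproduced by hand. Assuming the nginx logs live in a data stream matching logs-nginx.access-* (an assumption; substitute your own index pattern), a terms aggregation in Dev Tools would look like:</p>
<pre><code class="language-bash">GET logs-nginx.access-*/_search
{
  &quot;size&quot;: 0,
  &quot;query&quot;: { &quot;range&quot;: { &quot;@timestamp&quot;: { &quot;gte&quot;: &quot;now-7d&quot; } } },
  &quot;aggs&quot;: {
    &quot;status_codes&quot;: {
      &quot;terms&quot;: { &quot;field&quot;: &quot;http.response.status_code&quot; }
    }
  }
}
</code></pre>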
<p>It seems nginx is having an issue. To help resolve it, we also ask the AI Assistant how to address this error. Specifically, we ask if there is an internal runbook the SRE team has created.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/10-slo-blog.png" alt="ai assistant thread" /></p>
<p>The AI Assistant retrieves a runbook the team has added to its knowledge base. We can now follow it to analyze and try to resolve or mitigate the nginx issue.</p>
<p>While this is a simple example, there are an endless number of possibilities that can be defined based on KQL. Some other simple examples:</p>
<ul>
<li>
<p>99% of requests occur under 200ms</p>
</li>
<li>
<p>99% of log messages are not errors</p>
</li>
</ul>
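<p>Expressed as the “good event” side of a custom KQL SLI, those two examples might look like the sketch below. The field names are assumptions: an APM-style duration field in microseconds and the ECS log.level field.</p>
<pre><code class="language-bash"># 99% of requests occur under 200ms
# (transaction.duration.us is in microseconds, so 200ms = 200000)
transaction.duration.us &lt; 200000

# 99% of log messages are not errors
not log.level : &quot;error&quot;
</code></pre>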
<h2>Application SLOs: OpenTelemetry demo cartservice</h2>
<p>A common application developers and SREs use to learn about OpenTelemetry and test out Observability features is the <a href="https://github.com/elastic/opentelemetry-demo">OpenTelemetry demo</a>.</p>
<p>This demo has <a href="https://opentelemetry.io/docs/demo/feature-flags/">feature flags</a> to simulate issues. With Elastic’s alerting and SLO capability, you can also determine how well the entire application is performing and how well your customer experience is holding up when these feature flags are used.</p>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic supports OpenTelemetry by taking OTLP directly with no need for an Elastic specific agent</a>. You can send in OpenTelemetry data directly from the application (through OTel libraries) and through the collector.</p>
<p>We’ve brought up the OpenTelemetry demo on a K8S cluster (AWS EKS) and turned on the cartservice feature flag. This inserts errors into the cartservice. We’ve also created two SLOs to monitor the cartservice’s availability and latency.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/11-slo-blog.png" alt="SLOs" /></p>
<p>We can see that the cartservice’s availability is violated. As we drill down, we see that there aren’t as many successful transactions, which is affecting the SLO.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/12-slo-blog.png" alt="cartservice-otel" /></p>
<p>As we drill into the service, we can see in Elastic APM that there is a higher than normal failure rate of about 5.5% for the emptyCart service.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/13-slo-blog.png" alt="apm" /></p>
<p>We can investigate this further in APM, but that is a discussion for another blog. Stay tuned to see how we can use Elastic’s machine learning, AIOps, and AI Assistant to understand the issue.</p>
<h2>Conclusion</h2>
<p>SLOs allow you to set clear, measurable targets for your service performance, based on factors like availability, response times, error rates, and other key metrics. Hopefully with the overview we’ve provided in this blog, you can see that:</p>
<ul>
<li>
<p>SLOs can be based on logs. In Elastic, you can use KQL to find and filter on specific logs and log fields to monitor and trigger SLOs.</p>
</li>
<li>
<p>AI Assistant is a valuable, easy-to-use capability to analyze, troubleshoot, and even potentially resolve SLO issues.</p>
</li>
<li>
<p>APM service-based SLOs are easy to create and manage through integration with Elastic APM. OTel telemetry can also be used to monitor SLOs.</p>
</li>
</ul>
<p>For more information on SLOs in Elastic, check out <a href="https://www.elastic.co/guide/en/observability/current/slo.html">Elastic documentation</a> and the following resources:</p>
<ul>
<li>
<p><a href="https://www.elastic.co/guide/en/observability/8.12/slo.html">What’s new in Elastic Observability 8.12</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Introducing the Elastic AI Assistant</a></p>
</li>
<li>
<p><a href="https://www.elastic.co/blog/opentelemetry-observability">Elastic OpenTelemetry support</a></p>
</li>
</ul>
<p>Ready to get started? Sign up for <a href="https://cloud.elastic.co/registration">Elastic Cloud</a> and try out the features and capabilities I’ve outlined above to get the most value and visibility out of your SLOs.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/service-level-objectives-slos-logs-metrics/139686_-_Elastic_-_Headers_-_V1_3.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Enhancing SRE troubleshooting with the AI Assistant for Observability and your organization's runbooks]]></title>
            <link>https://www.elastic.co/observability-labs/blog/sre-troubleshooting-ai-assistant-observability-runbooks</link>
            <guid isPermaLink="false">sre-troubleshooting-ai-assistant-observability-runbooks</guid>
            <pubDate>Wed, 08 Nov 2023 00:00:00 GMT</pubDate>
            <description><![CDATA[Empower your SRE team with this guide to enriching Elastic's AI Assistant Knowledge Base with your organization's internal observability information for enhanced alert remediation and incident management.]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://www.elastic.co/blog/context-aware-insights-elastic-ai-assistant-observability">Observability AI Assistant</a> helps users explore and analyze observability data using a natural language interface, by leveraging automatic function calling to request, analyze, and visualize your data to transform it into actionable observability. The Assistant can also set up a Knowledge Base, powered by <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">Elastic Learned Sparse EncodeR</a> (ELSER) to provide additional context and recommendations from private data, alongside the large language models (LLMs) using RAG (Retrieval Augmented Generation). Elastic’s Stack — as a vector database with out-of-the-box semantic search and connectors to LLM integrations and the Observability solution — is the perfect toolkit to extract the maximum value of combining your company's unique observability knowledge with generative AI.</p>
<h2>Enhanced troubleshooting for SREs</h2>
<p>Site reliability engineers (SRE) in large organizations often face challenges in locating necessary information for troubleshooting alerts, monitoring systems, or deriving insights due to scattered and potentially outdated resources. This issue is particularly significant for less experienced SREs who may require assistance even with the presence of a runbook. Recurring incidents pose another problem, as the on-call individual may lack knowledge about previous resolutions and subsequent steps. Mature SRE teams often invest considerable time in system improvements to minimize &quot;fire-fighting,&quot; utilizing extensive automation and documentation to support on-call personnel.</p>
<p>Elastic® addresses these challenges by combining generative AI models with relevant search results from your internal data using RAG. The <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html">Observability AI Assistant's internal Knowledge Base</a>, powered by our semantic search retrieval model <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a>, can recall information at any point during a conversation, providing RAG responses based on internal knowledge.</p>
<p>This Knowledge Base can be enriched with your organization's information, such as runbooks, GitHub issues, internal documentation, and Slack messages, allowing the AI Assistant to provide specific assistance. The Assistant can also document and store specific information from an ongoing conversation with an SRE while troubleshooting issues, effectively creating runbooks for future reference. Furthermore, the Assistant can generate summaries of incidents, system status, runbooks, post-mortems, or public announcements.</p>
<p>This ability to retrieve, summarize, and present contextually relevant information is a game-changer for SRE teams, transforming the work from chasing documents and data to an intuitive, contextually sensitive user experience. The Knowledge Base (see <a href="https://www.elastic.co/guide/en/observability/current/obs-ai-assistant.html#obs-ai-requirements">requirements</a>) serves as a central repository of Observability knowledge, breaking documentation silos and integrating tribal knowledge, making this information accessible to SREs enhanced with the power of LLMs.</p>
<p>Your LLM provider may collect query telemetry when using the AI Assistant. If your data is confidential or has sensitive details, we recommend you verify the data treatment policy of the LLM connector you provided to the AI Assistant.</p>
<p>In this blog post, we will cover different ways to enrich your Knowledge Base (KB) with internal information. We will focus on a specific alert, indicating that there was an increase in logs with “502 Bad Gateway” errors that has surpassed the alert’s threshold.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-1.png" alt="1 - threshold breached" /></p>
<h2>How to troubleshoot an alert with the Knowledge Base</h2>
<p>Before the KB has been enriched with internal information, when the SRE asks the AI Assistant about how to troubleshoot an alert, the response from the LLM will be based on the data it learned during training; however, the LLM is not able to answer questions related to private, recent, or emerging knowledge. In this case, when asking for the steps to troubleshoot the alert, the response will be based on generic information.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-2.png" alt="2 - troubleshooting steps" /></p>
<p>However, once the KB has been enriched with your runbooks, when your team receives a new alert on “502 Bad Gateway” Errors, they can use AI Assistant to access the internal knowledge to troubleshoot it, using semantic search to find the appropriate runbook in the Knowledge Base.</p>
<p>In this blog, we will cover different ways to add internal information on how to troubleshoot an alert to the Knowledge Base:</p>
<ol>
<li>
<p>Ask the assistant to remember the content of an existing runbook.</p>
</li>
<li>
<p>Ask the Assistant to summarize and store in the Knowledge Base the steps taken during a conversation and store it as a runbook.</p>
</li>
<li>
<p>Import your runbooks from GitHub or another external source to the Knowledge Base using our Connector and APIs.</p>
</li>
</ol>
<p>After the runbooks have been added to the KB, the AI Assistant is now able to recall the internal and specific information in the runbooks. By leveraging the retrieved information, the LLM could provide more accurate and relevant recommendations for troubleshooting the alert. This could include suggesting potential causes for the alert, steps to resolve the issue, preventative measures for future incidents, or asking the assistant to help execute the steps mentioned in the runbook using functions. With more accurate and relevant information at hand, the SRE could potentially resolve the alert more quickly, reducing downtime and improving service reliability.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-10_at_9.52.38_AM.png" alt="3 - troubleshooting 502 Bad gateway" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-4.png" alt="4 - (5) test the backend directly" /></p>
<p>Your Knowledge Base documents will be stored in the indices <em>.kibana-observability-ai-assistant-kb-</em>*. Keep in mind that LLMs have a restriction on the amount of information the model can read and write at once, called the token limit. Imagine you're reading a book, but you can only remember a certain number of words at a time. Once you've reached that limit, you start to forget the earlier words you've read. That's similar to how a token limit works in an LLM.</p>
<p>To keep runbooks within the token limit for Retrieval Augmented Generation (RAG) models, ensure the information is concise and relevant. Use bullet points for clarity, avoid repetition, and use links for additional information. Regularly review and update the runbooks to remove outdated or irrelevant information. The goal is to provide clear, concise, and effective troubleshooting information without compromising the quality due to token limit constraints. LLMs are great for summarization, so you could ask the AI Assistant to help you make the runbooks more concise.</p>
<h2>Ask the assistant to remember the content of an existing runbook</h2>
<p>The easiest way to store a runbook into the Knowledge Base is to just ask the AI Assistant to do it! Open a new conversation and ask “Can you store this runbook in the KB for future reference?” followed by pasting the content of the runbook in plain text.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-5.png" alt="5 - new conversation - let's work on this together" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-6.png" alt="6 - new converastion" /></p>
<p>The AI Assistant will then store it in the Knowledge Base for you automatically, as simple as that.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-7.png" alt="7 - storing a runbook" /></p>
<h2>Ask the Assistant to summarize and store the steps taken during a conversation in the Knowledge Base</h2>
<p>You can also ask the AI Assistant to remember something while having a conversation. For example, after you have troubleshot an alert using the AI Assistant, you could ask it to &quot;remember how to troubleshoot this alert for next time.&quot; The AI Assistant will create a summary of the steps taken to troubleshoot the alert and add it to the Knowledge Base, effectively creating runbooks for future reference. Next time you are faced with a similar situation, the AI Assistant will recall this information and use it to assist you.</p>
<p>In the following demo, the user asks the Assistant to remember the steps that have been followed to troubleshoot the root cause of an alert, and also to ping the Slack channel when this happens again. In a later conversation with the Assistant, the user asks what can be done about a similar problem, and the AI Assistant is able to remember the steps and also reminds the user to ping the Slack channel.</p>
<p>After receiving the alert, open the AI Assistant chat and troubleshoot it. Once you have investigated, ask the AI Assistant to summarize the analysis and the steps taken to reach the root cause, so they can be remembered the next time a similar alert fires. You can also add extra instructions, such as warning the Slack channel.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-8.png" alt="8. -teal box" /></p>
<p>The Assistant will use the built-in functions to summarize the steps and store them into your Knowledge Base, so they can be recalled in future conversations.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-08_at_11.34.08_AM.png" alt="9 - Elastic assistant chat (CROP)" /></p>
<p>Open a new conversation, and ask what steps to take when troubleshooting an alert similar to the one we just investigated. The Assistant will recall the information stored in the KB that is related to the specific alert, using semantic search based on <a href="https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html">ELSER</a>, and provide a summary of the steps taken to troubleshoot it, including the final instruction to inform the Slack channel.</p>
&lt;Video vidyardUuid=&quot;p14Ss8soJDkW8YoCtKPrQF&quot; loop={true} /&gt;
<h2>Import your runbooks stored in GitHub to the Knowledge Base using APIs or our GitHub Connector</h2>
<p>You can also add proprietary data into the Knowledge Base programmatically by ingesting it (e.g., GitHub Issues, Markdown files, Jira tickets, text files) into Elastic.</p>
<p>If your organization has created runbooks that are stored in Markdown documents in GitHub, follow the steps in the next section of this blog post to index the runbook documents into your Knowledge Base.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-10.png" alt="10 - github handling 502" /></p>
<p>The steps to ingest documents into the Knowledge Base are the following:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-11.png" alt="11 - using internal knowledge" /></p>
<h3>Ingest your organization’s knowledge into Elasticsearch</h3>
<p><strong>Option 1:</strong> <strong>Use the</strong> <a href="https://www.elastic.co/guide/en/enterprise-search/current/crawler.html"><strong>Elastic web crawler</strong></a> <strong>.</strong> Use the web crawler to programmatically discover, extract, and index searchable content from websites and knowledge bases. When you ingest data with the web crawler, a search-optimized <a href="https://www.elastic.co/blog/what-is-an-elasticsearch-index">Elasticsearch® index</a> is created to hold and sync webpage content.</p>
<p><strong>Option 2: Use Elasticsearch's</strong> <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html"><strong>Index API</strong></a> <strong>.</strong> <a href="https://www.elastic.co/guide/en/cloud/current/ec-ingest-guides.html">Watch tutorials</a> that demonstrate how you can use the Elasticsearch language clients to ingest data from an application.</p>
<p><strong>Option 3: Build your own connector.</strong> Follow the steps described in this blog: <a href="https://www.elastic.co/search-labs/how-to-create-customized-connectors-for-elasticsearch">How to create customized connectors for Elasticsearch</a>.</p>
<p><strong>Option 4: Use Elasticsearch</strong> <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-content-sources.html"><strong>Workplace Search connectors</strong></a> <strong>.</strong> For example, the <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-github-connector.html">GitHub connector</a> can automatically capture, sync, and index issues, Markdown files, pull requests, and repos.</p>
<ul>
<li>Follow the steps to <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-github-connector.html#github-configuration">configure the GitHub Connector in GitHub</a> to create an OAuth App from the GitHub platform.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-12.png" alt="12 - elastic workplace search" /></p>
<ul>
<li>Now you can connect a GitHub instance to your organization. Head to your organization’s <strong>Search &gt; Workplace Search</strong> administrative dashboard, and locate the Sources tab.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/Screenshot_2023-11-08_at_10.19.19_AM.png" alt="13 - screenshot" /></p>
<ul>
<li>Select <strong>GitHub</strong> (or GitHub Enterprise) in the Configured Sources list, and follow the GitHub authentication flow as presented. Upon the successful authentication flow, you will be redirected to Workplace Search and will be prompted to select the Organization you would like to synchronize.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-14.png" alt="14 - configure and connect" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-15.png" alt="15 - how to add github" /></p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-16.png" alt="16 - github" /></p>
<ul>
<li>After configuring the connector and selecting the organization, the content should be synchronized and you will be able to see it in Sources. If you don’t need to index all the available content, you can specify the indexing rules via the API. This will help shorten indexing times and limit the size of the index. See <a href="https://www.elastic.co/guide/en/workplace-search/current/workplace-search-customizing-indexing-rules.html">Customizing indexing</a>.</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-17.png" alt="17 - source overview" /></p>
<ul>
<li>The source has created an index in Elastic with the content (Issues, Markdown Files…) from your organization. You can find the index name by navigating to <strong>Stack Management &gt; Index Management</strong> , activating the <strong>Include hidden Indices</strong> button on the right, and searching for “GitHub.”</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-18.png" alt="18 - index mgmt" /></p>
<ul>
<li>You can explore the documents you have indexed by creating a Data View and exploring it in Discover. Go to <strong>Stack Management &gt; Kibana &gt; Data Views &gt; Create data view</strong> and introduce the data view Name, Index pattern (make sure you activate “Allow hidden and system indices” in advanced options), and Timestamp field:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-19.png" alt="19 - create data view" /></p>
<ul>
<li>You can now explore the documents in Discover using the data view:</li>
</ul>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-20.png" alt="20 - data view" /></p>
<h3>Reindex your internal runbooks into the AI Assistant’s Knowledge Base index, using its semantic search pipeline</h3>
<p>Your Knowledge Base documents are stored in the indices <em>.kibana-observability-ai-assistant-kb-*</em>. To add your internal runbooks imported from GitHub to the KB, you just need to reindex the documents from the index you created in the previous step to the KB’s index. To add the semantic search capabilities to the documents in the KB, the reindex should also use the ELSER pipeline preconfigured for the KB, <em>.kibana-observability-ai-assistant-kb-ingest-pipeline</em>.</p>
<p>By creating a Data View with the KB index, you can explore the content in Discover.</p>
<p>Execute the query below in <strong>Management &gt; Dev Tools</strong>, making sure to replace the following placeholders in both “_source” and “inline”:</p>
<ul>
<li>InternalDocsIndex : name of the index where your internal docs are stored</li>
<li>text_field : name of the field with the text of your internal docs</li>
<li>timestamp : name of the field of the timestamp in your internal docs</li>
<li>public : (true or false) if true, makes a document available to all users in the defined <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Space</a> (if one is defined) or in all spaces (if none is defined); if false, the document will be restricted to the user indicated in user.name</li>
<li>(optional) space : if defined, restricts the internal document to be available in a specific <a href="https://www.elastic.co/guide/en/kibana/current/xpack-spaces.html">Kibana Space</a></li>
<li>(optional) user.name : if defined, restricts the internal document to be available for a specific user</li>
<li>(optional) &quot;query&quot; filter to index only certain docs (see below)</li>
</ul>
<pre><code class="language-bash">POST _reindex
{
    &quot;source&quot;: {
        &quot;index&quot;: &quot;&lt;InternalDocsIndex&gt;&quot;,
        &quot;_source&quot;: [
            &quot;&lt;text_field&gt;&quot;,
            &quot;&lt;timestamp&gt;&quot;,
            &quot;namespace&quot;,
            &quot;is_correction&quot;,
            &quot;public&quot;,
            &quot;confidence&quot;
        ]
    },
    &quot;dest&quot;: {
        &quot;index&quot;: &quot;.kibana-observability-ai-assistant-kb-000001&quot;,
        &quot;pipeline&quot;: &quot;.kibana-observability-ai-assistant-kb-ingest-pipeline&quot;
    },
    &quot;script&quot;: {
        &quot;inline&quot;: &quot;ctx._source.text=ctx._source.remove(\&quot;&lt;text_field&gt;\&quot;);ctx._source.namespace=\&quot;&lt;space&gt;\&quot;;ctx._source.is_correction=false;ctx._source.public=&lt;public&gt;;ctx._source.confidence=\&quot;high\&quot;;ctx._source['@timestamp']=ctx._source.remove(\&quot;&lt;timestamp&gt;\&quot;);ctx._source['user.name'] = \&quot;&lt;user.name&gt;\&quot;&quot;
    }
}
</code></pre>
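<p>Once the reindex completes, you can sanity-check that the runbooks landed in the Knowledge Base index with a quick search in Dev Tools. This is an illustrative query; the “text” field name matches the script above:</p>
<pre><code class="language-bash">GET .kibana-observability-ai-assistant-kb-*/_search
{
  &quot;query&quot;: {
    &quot;match&quot;: { &quot;text&quot;: &quot;502 Bad Gateway&quot; }
  },
  &quot;_source&quot;: [ &quot;text&quot;, &quot;public&quot;, &quot;@timestamp&quot; ]
}
</code></pre>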
<p>You may want to restrict the type of documents that you reindex into the KB. For example, to reindex only Markdown documents (like runbooks), you can add a “query” filter to the source. In the case of GitHub, runbooks are identified by the “type” field containing the string “file,” so you could add that to the reindex query as indicated below. To also include GitHub Issues, add the string “issues” to the “type” terms list:</p>
<pre><code class="language-json">&quot;source&quot;: {
        &quot;index&quot;: &quot;&lt;InternalDocsIndex&gt;&quot;,
        &quot;_source&quot;: [
            &quot;&lt;text_field&gt;&quot;,
            &quot;&lt;timestamp&gt;&quot;,
            &quot;namespace&quot;,
            &quot;is_correction&quot;,
            &quot;public&quot;,
            &quot;confidence&quot;
        ],
    &quot;query&quot;: {
      &quot;terms&quot;: {
        &quot;type&quot;: [&quot;file&quot;]
      }
    }
}
</code></pre>
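<p>For instance, to reindex both Markdown files and GitHub Issues, the terms filter would list both values:</p>
<pre><code class="language-json">    &quot;query&quot;: {
      &quot;terms&quot;: {
        &quot;type&quot;: [&quot;file&quot;, &quot;issues&quot;]
      }
    }
</code></pre>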
<p>Great! Now that the data is stored in your Knowledge Base, you can ask the Observability AI Assistant any questions about it:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/elastic-blog-21.png" alt="21 - new conversation" /></p>
&lt;Video vidyardUuid=&quot;zRxsp1EYjmR4FW4yRtSxcr&quot; loop={true} /&gt;
&lt;Video vidyardUuid=&quot;vV5md3mVtY8KxUVjSvtT7V&quot; loop={true} /&gt;
<h2>Conclusion</h2>
<p>In conclusion, leveraging internal Observability knowledge and adding it to the Elastic Knowledge Base can greatly enhance the capabilities of the AI Assistant. By manually inputting information or programmatically ingesting documents, SREs can create a central repository of knowledge accessible through the power of Elastic and LLMs. The AI Assistant can recall this information, assist with incidents, and provide tailored observability to specific contexts using Retrieval Augmented Generation. By following the steps outlined in this article, organizations can unlock the full potential of their Elastic AI Assistant.</p>
<p><a href="https://www.elastic.co/generative-ai/ai-assistant">Start enriching your Knowledge Base with the Elastic AI Assistant today</a> and empower your SRE team with the tools they need to excel. Follow the steps outlined in this article and take your incident management and alert remediation processes to the next level. Your journey toward a more efficient and effective SRE operation begins now.</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
<p><em>In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.</em></p>
<p><em>Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/sre-troubleshooting-ai-assistant-observability-runbooks/11-hand.jpg" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[How to Troubleshoot Kubernetes Pod Restarts & OOMKilled Events with Agent Builder]]></title>
            <link>https://www.elastic.co/observability-labs/blog/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder</link>
            <guid isPermaLink="false">troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder</guid>
            <pubDate>Wed, 25 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Learn how to immediately troubleshoot Kubernetes pod restarts and OOMKilled events with Elastic Agent Builder. We’ll show how to detect, analyze, and remediate failures.]]></description>
            <content:encoded><![CDATA[<h2>Initial Summary</h2>
<ul>
<li>Detect Kubernetes pod restarts and OOMKill events using Elastic Agent Builder</li>
<li>Analyze CPU and memory pressure using ES|QL over Kubernetes metrics</li>
<li>Generate troubleshooting summaries and remediation guidance</li>
</ul>
<p>This article explains how to use <a href="https://www.elastic.co/search-labs/blog/elastic-ai-agent-builder-context-engineering-introduction">Elastic Agent Builder</a> to automatically detect, analyze, and remediate Kubernetes pod failures caused by resource pressure (CPU and memory), with a focus on pods experiencing frequent restarts and OOMKilled events. Elastic Agent Builder lets you quickly create precise agents that utilize all your data with powerful tools (such as ES|QL queries), chat interfaces, and custom agents.</p>
<h2>Introduction: What is the Elastic Agent Builder?</h2>
<p>Elastic includes an embedded AI agent that you can use to get more insight from all of the logs, metrics, and traces you’ve ingested. That’s useful on its own, but you can take it one step further and streamline the process by creating tools for the agent to use.</p>
<p>Giving the agent tools means it spends less time ‘thinking’ and quickly gets to assessing what’s important to you. For example, if I have a Kubernetes environment that needs monitoring, and I want to keep an eye on pod restarts and memory and CPU usage without hanging out at the terminal, I can have Elastic alert me if something goes wrong. </p>
<p>Having an alert is great, but how do I get the bigger picture, faster? You need to know what service is having (or creating) the issues, why, and how to fix it.</p>
<h2>Assumptions</h2>
<p>This guide assumes:</p>
<ul>
<li>A running Kubernetes cluster</li>
<li>An Elastic Observability deployment</li>
<li>Kubernetes metrics indexed in Elastic</li>
</ul>
<h2>Step 1: Create a New Elastic Agent</h2>
<p>In Elastic Observability, use the top search bar to search for Agents. Create a new agent.</p>
<p>This agent is going to be the Kubernetes Pod Troubleshooter agent, designed to help users troubleshoot pod restarts and OOMKilled terminations, and to evaluate CPU or memory pressure.</p>
<p>The Kubernetes Pod Troubleshooter agent will:</p>
<ol>
<li>Identify pods that have restarted more than once</li>
<li>Filter for pods that are not in a running state</li>
<li>Retrieve the container termination reason (e.g., OOMKilled)</li>
<li>Analyze CPU and memory utilization for affected services</li>
<li>Flag resource utilization above 60% (warning) and 80% (critical)</li>
<li>Provide remediation recommendations</li>
</ol>
<p>The agent requires instructions that guide how it behaves when interacting with tools or responding to queries. These instructions can set tone, priorities, or special behaviors. The instructions below tell the agent to execute the steps outlined above.</p>
<pre><code>You will help users troubleshoot problematic pods by searching the metrics for pods that have restarted more than once and the status is not running. Pods that have the highest number of restarts will be returned to the user.
Once the containers that are not running and have restarted multiple times are found you will use their container ID or image name to look up the container status reason and reason for the last termination. You will return that reason to the user.
You will also begin basic troubleshooting steps, such as checking for insufficient cluster resources (CPU or memory) from the metrics and tools available.
Any CPU or memory utilization percentages over 60%, and definitely over 80% should be flagged to the user with remediation steps.
</code></pre>
<p>Getting answers quickly is critical when troubleshooting high-value systems and environments. Using Tools ensures that the workflow is repeatable and that you can trust the results. You also get complete oversight of the process, as the Elastic Agent outlines every step and query that it took and you can explore the results in Discover.</p>
<p>You will create custom tools that the agent runs to complete the Kubernetes troubleshooting tasks the custom instructions reference, such as: <code>look up the container status reason and reason for the last termination</code> and <code>checking for insufficient cluster resources (CPU or memory)</code>.</p>
<h2>Step 2: Create Tools - Pod Restarts</h2>
<p>The first tool takes the Kubernetes metrics and assesses whether a pod has restarted and has a last-terminated reason; if it does, the agent presents that information to the user.</p>
<p>This <code>pod-restarts</code> tool uses a custom ES|QL query that interrogates the Kubernetes metrics data coming from OTel.</p>
<p>The ES|QL query:</p>
<ol>
<li>Filters for containers that have restarted and have a reason for termination; then</li>
<li>Calculates the number of restarts; then</li>
<li>Returns the number of restarts and termination reason per service.</li>
</ol>
<pre><code>FROM metrics-k8sclusterreceiver.otel-default
| WHERE metrics.k8s.container.restarts &gt; 0
| WHERE resource.attributes.k8s.container.status.last_terminated_reason IS NOT NULL
| STATS total_restarts = SUM(metrics.k8s.container.restarts),
        reasons = VALUES(resource.attributes.k8s.container.status.last_terminated_reason) 
  BY resource.attributes.service.name
| SORT total_restarts DESC
</code></pre>
<h2>Step 3: Create Tools - Service Memory</h2>
<p>The custom tools can take input variables, which increases the speed and accuracy of the results.</p>
<p>A common reason for pods failing to schedule, or restarting often, is that the cluster or nodes are under-resourced. The <code>pod-restarts</code> tool returns services that have many restarts and OOMKilled termination reasons, which indicate memory pressure.</p>
<p>The <code>eval-pod-memory</code> tool is a custom ES|QL query that:</p>
<ol>
<li>Filters for metrics data that match the service name returned from the <code>pod-restarts</code> tool within the last 12 hours; then</li>
<li>Converts memory usage, requests, limits and utilization into megabytes; then</li>
<li>Calculates the average of each of those metrics; then</li>
<li>Groups them into one-minute buckets and sorts them.</li>
</ol>
<pre><code>FROM metrics-*
| WHERE resource.attributes.service.name == ?servicename
| WHERE @timestamp &gt;= NOW() - 12 hours
| EVAL
   memory_usage_mb = metrics.container.memory.usage / 1024 / 1024,
   memory_request_mb = metrics.k8s.container.memory_request / 1024 / 1024,
   memory_limit_mb = metrics.k8s.container.memory_limit / 1024 / 1024,
   memory_utilization_pct = metrics.k8s.container.memory_limit_utilization * 100
| STATS
   avg_memory_usage = AVG(memory_usage_mb),
   avg_memory_request = AVG(memory_request_mb),
   avg_memory_limit = AVG(memory_limit_mb),
   avg_memory_utilization = AVG(memory_utilization_pct)
   BY bucket = BUCKET(@timestamp, 1 minute)
| SORT bucket ASC
</code></pre>
<h2>Step 4: Create Tools - Service CPU</h2>
<p>As CPU usage is another common reason for pods to fail scheduling or be stuck in endless restart loops, the next tool will evaluate CPU usage, requests and limits.</p>
<p>The <code>eval-pod-cpu</code> tool is a custom ES|QL query that:</p>
<ol>
<li>Filters for metrics data that match the service name returned from the <code>pod-restarts</code> tool within the last 12 hours; then</li>
<li>Calculates the average for CPU usage, CPU request utilization and CPU limit utilization.</li>
</ol>
<pre><code>FROM metrics-kubeletstatsreceiver.otel-default
| WHERE k8s.container.name == ?servicename OR resource.attributes.k8s.container.name == ?servicename
| STATS
  avg_cpu_usage = AVG(container.cpu.usage),
  avg_cpu_request_utilization = AVG(k8s.container.cpu_request_utilization) * 100,
  avg_cpu_limit_utilization = AVG(k8s.container.cpu_limit_utilization) * 100
| LIMIT 100
</code></pre>
<h2>Step 5: Assign Tools to Kubernetes Pod Troubleshooter Agent</h2>
<p>Once all of the tools are built you need to assign them to the agent.</p>
<p>This image shows the Kubernetes Pod Troubleshooter agent with the three tools: <code>pod-restarts</code>, <code>eval-pod-cpu</code> and <code>eval-pod-memory</code> assigned to it and active.</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/kubernetes-pod-troubleshooter.png" alt="kubernetes-pod-troubleshooter" /></p>
<h2>Step 6: Test the Kubernetes Pod Troubleshooter Agent</h2>
<p>To simulate memory pressure, the OpenTelemetry demo is running inside the cluster. Artificially lowering the memory requests and limits while increasing the service load will cause pods to restart.</p>
<p>To do this with the OpenTelemetry demo in your cluster, follow these steps.</p>
<p>Reduce the cart service to one replica by scaling the deployment. Once that is complete, change the resources on the deployment by lowering the memory requests and limits as shown in this command:</p>
<pre><code>kubectl -n otel-demo scale deploy/cart --replicas=1
kubectl -n otel-demo set resources deploy/cart -c cart --requests=memory=50Mi --limits=memory=60Mi
</code></pre>
<p>The OpenTelemetry demo application comes with a load generator, which simulates requests to the demo site. Increase the users and spawn rate in the load-generator deployment, as shown in this command:</p>
<pre><code>kubectl -n otel-demo set env deploy/load-generator LOCUST_USERS=800 LOCUST_SPAWN_RATE=200 LOCUST_BROWSER_TRAFFIC_ENABLED=false
</code></pre>
<p>If you list all of your pods in the cluster or namespace, you should begin to see restarts.</p>
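<p>One quick way to spot them is to filter the RESTARTS column of the default <code>kubectl get pods</code> output (a sketch; the <code>filter_restarts</code> helper name is illustrative and assumes the default column layout):</p>

```shell
# Keep only pods whose RESTARTS column (4th in default `kubectl get pods` output)
# is greater than zero, printing the pod name and restart count.
filter_restarts() { awk '$4+0 > 0 {print $1, $4}'; }

# Against a live cluster:
# kubectl -n otel-demo get pods --no-headers | filter_restarts
```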
<p>You can now chat with the Kubernetes Pod Troubleshooter agent and ask “Are any of my Kubernetes pods having issues?”.</p>
<p>The screenshot shows the final response from the Kubernetes Pod Troubleshooter agent. It provides a problem summary of its findings from each tool, showing which services were experiencing the most restarts and memory and CPU utilization. </p>
<p>The threshold interpretations were described in the initial agent instructions, where &gt;60% utilization is a warning (sustained pressure) and &gt;80% utilization is critical (high likelihood of restarts or throttling). This aligns with findings presented by the Kubernetes Pod Troubleshooter agent, where the services that had the highest restarts were all above 90% memory utilization. The agent needs clearly defined threshold values to correctly assess the returned memory and CPU utilization values. </p>
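<p>The agent’s threshold bands can be sketched as a small shell function (the 60% and 80% cutoffs come from the agent instructions above; the function name is illustrative):</p>

```shell
# Map a whole-number utilization percentage to the agent's severity bands.
classify_utilization() {
  pct=$1
  if [ "$pct" -ge 80 ]; then
    echo critical    # high likelihood of restarts or throttling
  elif [ "$pct" -ge 60 ]; then
    echo warning     # sustained resource pressure
  else
    echo ok
  fi
}

classify_utilization 92    # critical
```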
<p>Problem summary returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/problem-summary-by-Kubernetes.png" alt="problem summary by Kubernetes" /></p>
<h2>Conclusion and Final Thoughts</h2>
<p>Elastic Agent Builder enables fast, repeatable Kubernetes troubleshooting by combining ES|QL-driven analysis with constrained AI reasoning.</p>
<p>Custom tools built on specific ES|QL queries, combined with downstream queries that take input variables from the output of previous tools, eliminate or reduce error propagation and hallucinations. By comparison, generic AI troubleshooting without purpose-built tools risks analyzing too many services that aren’t relevant to the issue at hand, which slows the thinking process and generates longer responses, increasing the likelihood of error propagation and hallucinations.</p>
<p>With the Elastic Agent Builder, you can inspect the output of every tool to explore and verify the results.</p>
<p>Having a succinct problem summary is a game-changer, bringing your attention straight to the most affected services.</p>
<p>Reasoning returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/return-pod-troubleshooter-agent.png" alt="summary-returned-kubernetes-pod-troubleshooter" /></p>
<p>Not only that, but the agent can go one step further and offer recommendations for remediation based on what outputs the tools delivered.</p>
<p>Remediation recommendation returned by the Kubernetes Pod Troubleshooter agent:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/remediation-recommendation-kubernetes-pod-troubleshooter.png" alt="remediation-recommendation-kubernetes-pod-troubleshooter" /></p>
<p>Sign up for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a> and try this out with your Kubernetes clusters.</p>
<h2>Frequently Asked Questions</h2>
<p><strong>1. When to use the Elastic Agent Builder for Troubleshooting</strong></p>
<p>Elastic Agent Builder works best for troubleshooting when:</p>
<ul>
<li>
<p>You need repeatable, auditable troubleshooting workflows</p>
</li>
<li>
<p>You want deterministic analysis instead of free-form AI responses</p>
</li>
<li>
<p>You’re investigating something that is reported in the logs or metrics (i.e. pod restarts, OOMKills, or resource pressure)</p>
</li>
<li>
<p>You want to reduce mean time to resolution (MTTR)</p>
</li>
</ul>
<p><strong>2. Do I need OpenTelemetry to use Elastic Agent Builder for Kubernetes troubleshooting?</strong> </p>
<p>No, you don’t need to use OpenTelemetry. You have two options:</p>
<ul>
<li>
<p>You can collect logs and metrics from Kubernetes using the Elastic Agent; or </p>
</li>
<li>
<p>You can collect logs, traces and metrics with the Elastic Distro for OTel (EDOT) Collector</p>
</li>
</ul>
<p>Your choice changes the field names used in the tools above: for example, <code>kubernetes.container.memory.usage.bytes</code> (Elastic Agent) versus <code>metrics.container.memory.usage</code> (EDOT).</p>
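<p>For example, the memory query from Step 3 rewritten for Elastic Agent data might look like the sketch below. The index pattern and the <code>kubernetes.container.name</code> field are assumptions based on the Kubernetes integration; only <code>kubernetes.container.memory.usage.bytes</code> is taken from the example above, so verify the field names against your own data streams:</p>

```esql
FROM metrics-kubernetes.container-*
| WHERE kubernetes.container.name == ?servicename
| EVAL memory_usage_mb = kubernetes.container.memory.usage.bytes / 1024 / 1024
| STATS avg_memory_usage = AVG(memory_usage_mb) BY bucket = BUCKET(@timestamp, 1 minute)
| SORT bucket ASC
```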
<p><strong>3. Can this agent be adapted for node-level failures?</strong> </p>
<p>Yes, Elastic has hundreds of <a href="https://www.elastic.co/docs/reference/fleet#integrations">integrations</a>, including AWS (for EKS), Azure (for AKS), Google Cloud (for GKE), as well as host operating system monitoring.</p>
<p>The queries shown above would be modified to use the corresponding fields.</p>
<p><strong>4. Can these tools be reused in automation workflows?</strong> </p>
<p>Yes, <a href="https://www.elastic.co/search-labs/blog/elastic-workflows-automation">Elastic Workflows</a> can reuse the same scripted automations and AI agents you build in Elastic. An agent can handle the initial analysis and investigation (reducing manual effort), and the workflow can continue with structured steps, such as running Elasticsearch queries, transforming data, branching on conditions and calling external APIs or tools like Slack, Jira and PagerDuty. Workflows can also be exposed to Agent Builder as reusable tools, just like the tool created in this guide.</p>
<p>For more advanced automation from a similar scenario as described in this guide, learn how to <a href="https://www.elastic.co/observability-labs/blog/agentic-cicd-kubernetes-mcp-server">integrate AI agents into GitHub Actions to monitor K8s health and improve deployment reliability via Observability</a>.</p>
<p><strong>5. Can these tools be triggered by alerts?</strong> </p>
<p>Yes, alerts can trigger <a href="https://www.elastic.co/search-labs/blog/elastic-workflows-automation">Elastic Workflows</a>, and pass the alert context to the workflow. This workflow may be integrated with an Elastic Agent, as described above.</p>
<p>Additionally, Elastic Alerts allow you to publish investigation guides alongside alerts so an SRE has all of the information they need to begin investigating. Any troubleshooting or investigative agents can be linked to from the investigation guide, meaning the SRE doesn’t have to follow manual processes outlined in an investigation guide and instead let the agent handle the manual, repetitive investigations.</p>
<p><strong>6. How can I get started with Agent Builder?</strong></p>
<p>Sign up for <a href="https://www.elastic.co/cloud/serverless">Elastic Cloud Serverless</a>, a new fully managed, stateless architecture that auto-scales no matter your data, usage, and performance needs.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/troubleshoot-kubernetes-pod-restarts-oomkilled-elastic-agent-builder/cover.png" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Using a custom agent with the OpenTelemetry Operator for Kubernetes]]></title>
            <link>https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-elastic-agents</link>
            <guid isPermaLink="false">using-the-otel-operator-for-injecting-elastic-agents</guid>
            <pubDate>Tue, 16 Jul 2024 00:00:00 GMT</pubDate>
            <content:encoded><![CDATA[<p>This is the second part of a two part series. The first part is available at <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications</a>. In that first part I walk through setting up and installing the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a>, and configuring that for auto-instrumentation of a Java application using the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>.</p>
<p>In this second part, I show how to install <em>any</em> Java agent via the OpenTelemetry operator, using the Elastic Java agents as examples.</p>
<h2>Installation and configuration recap</h2>
<p>Part 1 of this series, <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications</a>, details the installation and configuration of the OpenTelemetry operator and an Instrumentation resource. Here is an outline of the steps as a reminder:</p>
<ol>
<li>Install cert-manager, eg <code>kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml</code></li>
<li>Install the operator, eg <code>kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml</code></li>
<li>Create an Instrumentation resource</li>
<li>Add an annotation to either the deployment or the namespace</li>
<li>Deploy the application as normal</li>
</ol>
<p>In that first part, steps 3, 4 &amp; 5 were implemented for the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>. In this blog I’ll implement them for other agents, using the Elastic APM agents as examples. I assume that steps 1 &amp; 2 outlined above have already been done, ie that the operator is now installed. I will continue using the <code>banana</code> namespace for the examples, so ensure that namespace exists (<code>kubectl create namespace banana</code>). As per part 1, if you use any of the example instrumentation definitions below, you’ll need to substitute <code>my.apm.server.url</code> and <code>my-apm-secret-token</code> with the values appropriate for your collector.</p>
<h2>Using the Elastic Distribution for OpenTelemetry Java</h2>
<p>From version 0.4.0, the <a href="https://github.com/elastic/elastic-otel-java">Elastic Distribution for OpenTelemetry Java</a> includes the agent jar at the path <code>/javaagent.jar</code> in the docker image - which is essentially all that is needed for a docker image to be usable by the OpenTelemetry operator for auto-instrumentation. This means the Instrumentation resource is straightforward to define, and as it’s a distribution of the OpenTelemetry Java agent, all the OpenTelemetry environment can apply:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: elastic-otel
  namespace: banana
spec:
  exporter:
    endpoint: https://my.apm.server.url
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: docker.elastic.co/observability/elastic-otel-javaagent:1.9.0
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer my-apm-secret-token&quot;
      - name: ELASTIC_OTEL_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
      - name: ELASTIC_OTEL_SPAN_STACK_TRACE_MIN_DURATION
        value: &quot;50&quot;
</code></pre>
<p>I’ve included environment variables that switch on several features in the agent, including:</p>
<ol>
<li>ELASTIC_OTEL_INFERRED_SPANS_ENABLED to switch on the inferred spans feature described in <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">this blog</a></li>
<li>ELASTIC_OTEL_SPAN_STACK_TRACE_MIN_DURATION, which causes span stack traces to be captured automatically for any span that takes longer than this duration (the default is 5ms)</li>
</ol>
<p>Adding in the annotation ...</p>
<pre><code>metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;elastic-otel&quot;
</code></pre>
<p>... to the pod yaml gets the application traced, and displayed in the Elastic APM UI, including the inferred child spans and stack traces</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/elastic-apm-ui-with-stack-trace.png" alt="Elastic APM UI showing methodB traced with stack traces and inferred spans" /></p>
<p>The additions from the features mentioned above are circled in red - inferred spans (for methodC and methodD) bottom left, and the stack trace top right. (Note that the pod included the <code>OTEL_INSTRUMENTATION_METHODS_INCLUDE</code> environment variable set to <code>&quot;test.Testing[methodB]&quot;</code> so that traces from methodB are shown; for pod configuration see the &quot;Trying it&quot; section in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a>)</p>
<h2>Using the Elastic APM Java agent</h2>
<p>From version 1.50.0, the <a href="https://github.com/elastic/apm-agent-java">Elastic APM Java agent</a> includes the agent jar at the path <code>/javaagent.jar</code> in the docker image - which is essentially all that is needed for a docker image to be usable by the OpenTelemetry operator for auto-instrumentation. This means the Instrumentation resource is straightforward to define:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: elastic-apm
  namespace: banana
spec:
  java:
    image: docker.elastic.co/observability/apm-agent-java:1.55.4
    env:
      - name: ELASTIC_APM_SERVER_URL
        value: &quot;https://my.apm.server.url&quot;
      - name: ELASTIC_APM_SECRET_TOKEN
        value: &quot;my-apm-secret-token&quot;
      - name: ELASTIC_APM_LOG_LEVEL
        value: &quot;INFO&quot;
      - name: ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
      - name: ELASTIC_APM_LOG_SENDING
        value: &quot;true&quot;
</code></pre>
<p>I’ve included environment variables that switch on several features in the agent, including:</p>
<ul>
<li>ELASTIC_APM_LOG_LEVEL set to the default value (INFO) which could easily be switched to DEBUG</li>
<li>ELASTIC_APM_PROFILING_INFERRED_SPANS_ENABLED to switch on the inferred spans implementation equivalent to the feature described in <a href="https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry">this blog</a></li>
<li>ELASTIC_APM_LOG_SENDING which switches on sending logs to the APM UI, the logs are automatically correlated with transactions (for all common logging frameworks)</li>
</ul>
<p>Adding in the annotation ...</p>
<pre><code>metadata:
  annotations:
     instrumentation.opentelemetry.io/inject-java: &quot;elastic-apm&quot;
</code></pre>
<p>... to the pod yaml gets the application traced, and displayed in the Elastic APM UI, including the inferred child spans</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/elastic-apm-ui-with-inferred-spans.png" alt="Elastic APM UI showing methodB traced with inferred spans" /></p>
<p>(Note that the pod included the <code>ELASTIC_APM_TRACE_METHODS</code> environment variable set to <code>&quot;test.Testing#methodB&quot;</code> so that traces from methodB are shown; for pod configuration see the &quot;Trying it&quot; section in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a>)</p>
<h2>Using an extension with the OpenTelemetry Java agent</h2>
<p>Setting up an Instrumentation resource for the OpenTelemetry Java agent is straightforward and was done in <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents">part 1</a> of this two part series; as the examples above show, it’s just a matter of deciding on the docker image URL you want to use. If you want to include an <em>extension</em> in your deployment, however, that is a little more complex, though still supported by the operator. The extensions you want to include with the agent need to be in docker images, or you have to build an image that includes any extensions not already in images. You then declare the images, and the directories the extensions live in, in the Instrumentation resource. As an example, I’ll show an Instrumentation which uses version 2.5.0 of the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a> together with the <a href="https://github.com/elastic/elastic-otel-java/tree/main/inferred-spans">inferred spans extension</a> from the <a href="https://github.com/elastic/elastic-otel-java">Elastic OpenTelemetry Java distribution</a>. The distro image includes the extension at the path <code>/extensions/elastic-otel-agentextension.jar</code>. The Instrumentation resource allows either directories or file paths to be specified; here I’ll list the directory:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-plus-extension-instrumentation
  namespace: banana
spec:
  exporter:
    endpoint: https://my.apm.server.url
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
    extensions:
      - image: &quot;docker.elastic.co/observability/elastic-otel-javaagent:1.9.0&quot;
        dir: &quot;/extensions&quot;
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer my-apm-secret-token&quot;
      - name: ELASTIC_OTEL_INFERRED_SPANS_ENABLED
        value: &quot;true&quot;
</code></pre>
<p>Note that you can have multiple <code>image … dir</code> pairs, ie include multiple extensions from different images. Note also if you are testing this specific configuration that the inferred spans extension included here will be contributed to the OpenTelemetry contrib repo at some point after this blog is published, after which the extension may no longer be present in a later version of the referred image (since it will be available from the <a href="https://github.com/open-telemetry/opentelemetry-java-contrib/">contrib repo</a> instead).</p>
<h2>Next steps</h2>
<p>Here I’ve shown how to use any agent with the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a>, and configure that for your system. In particular the examples have showcased how to use the Elastic Java agents to auto-instrument Java applications running in your Kubernetes clusters, along with how to enable features, using Instrumentation resources. And you can set it up for either zero config for deployments, or for just one annotation which is generally a more flexible mechanism (you can have multiple Instrumentation resource definitions, and the deployment can select the appropriate one for its application).</p>
<p><em>The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.</em></p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-elastic-agents/blog-header-720x420.jpg" length="0" type="image/jpg"/>
        </item>
        <item>
            <title><![CDATA[Zero config OpenTelemetry auto-instrumentation for Kubernetes Java applications]]></title>
            <link>https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-java-agents</link>
            <guid isPermaLink="false">using-the-otel-operator-for-injecting-java-agents</guid>
            <pubDate>Thu, 11 Jul 2024 00:00:00 GMT</pubDate>
            <description><![CDATA[Walking through how to install and enable the OpenTelemetry Operator for Kubernetes to auto-instrument Java applications, with no configuration changes needed for deployments]]></description>
            <content:encoded><![CDATA[<p>The <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a> has a number of <a href="https://opentelemetry.io/docs/languages/java/automatic/#setup">ways to install</a> the agent into a Java application. If you are running your Java applications in Kubernetes pods, there is a separate mechanism (which under the hood uses JAVA_TOOL_OPTIONS and other environment variables) to auto-instrument Java applications. This auto-instrumentation can be achieved with zero configuration of the applications and pods!</p>
<p>The mechanism to achieve zero-config auto-instrumentation of Java applications in Kubernetes is via the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a>. This operator has many capabilities and the full documentation (and of course source) is available in the project itself. In this blog, I'll walk through installing, setting up and running zero-config auto-instrumentation of Java applications in Kubernetes using the OpenTelemetry Operator.</p>
<h2>Installing the OpenTelemetry Operator</h2>
<p>At the time of writing this blog, the OpenTelemetry Operator requires cert-manager to be installed, after which the operator itself can be installed. Installing from the web is straightforward. First install <code>cert-manager</code> (the version to install is specified in the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> documentation):</p>
<pre><code>kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml
</code></pre>
<p>Then, when the cert-manager pods are ready (<code>kubectl get pods -n cert-manager</code>) ...</p>
<pre><code>NAMESPACE      NAME                                         READY
cert-manager   cert-manager-67c98b89c8-rnr5s                1/1
cert-manager   cert-manager-cainjector-5c5695d979-q9hxz     1/1
cert-manager   cert-manager-webhook-7f9f8648b9-8gxgs        1/1
</code></pre>
<p>... you can install the OpenTelemetry Operator:</p>
<pre><code>kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
</code></pre>
<p>You can, of course, pin a specific version of the operator instead of using <code>latest</code>; here I’ve used <code>latest</code>.</p>
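<p>Once the manifest has been applied, you can check that the operator pod is up and running. By default the manifest creates an <code>opentelemetry-operator-system</code> namespace (adjust the namespace if you installed the operator elsewhere):</p>
<pre><code>kubectl get pods -n opentelemetry-operator-system
</code></pre>
<p>The operator pod should report <code>Running</code> with <code>2/2</code> containers ready before you move on to creating Instrumentation resources.</p>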
<h2>An Instrumentation resource<a id="an-instrumentation-resource"></a></h2>
<p>Now you need to add just one further Kubernetes resource to enable auto-instrumentation: an <code>Instrumentation</code> resource. I am going to use the <code>banana</code> namespace for my examples, so I have first created that namespace (<code>kubectl create namespace banana</code>). The auto-instrumentation is specified and configured by these Instrumentation resources. Here is a basic one which will allow every Java pod in the <code>banana</code> namespace to be auto-instrumented with version 2.5.0 of the <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>:</p>
<pre><code>apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: banana-instr
  namespace: banana
spec:
  exporter:
    endpoint: &quot;https://my.endpoint&quot;
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: &quot;1.0&quot;
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
    env:
      - name: OTEL_EXPORTER_OTLP_HEADERS
        value: &quot;Authorization=Bearer MyAuth&quot;
</code></pre>
<p>Creating this resource (e.g. with <code>kubectl apply -f banana-instr.yaml</code>, assuming the above YAML was saved in a file called <code>banana-instr.yaml</code>) makes the <code>banana-instr</code> Instrumentation resource available for use. (Note you will need to change <code>my.endpoint</code> and <code>MyAuth</code> to values appropriate for your collector.) You can use this instrumentation immediately by adding an annotation to any deployment in the <code>banana</code> namespace:</p>
<pre><code>metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;true&quot;
</code></pre>
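<p>For a Deployment, the annotation needs to end up on the <em>pods</em> the Deployment creates, so it belongs in the pod template's metadata rather than the Deployment's own metadata. Here is a hypothetical sketch (the image name is just a placeholder for your own Java application):</p>
<pre><code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: banana-deployment
  namespace: banana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: banana-app
  template:
    metadata:
      labels:
        app: banana-app
      annotations:
        instrumentation.opentelemetry.io/inject-java: &quot;true&quot;
    spec:
      containers:
        - name: banana-app
          image: my-registry/my-java-app:latest
</code></pre>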
<p>The <code>banana-instr</code> Instrumentation resource is not yet applied by <em>default</em> to all pods in the <code>banana</code> namespace. It is zero-config as far as the <em>application</em> is concerned, but it still requires an annotation on each <em>pod or deployment</em>. To make it fully zero-config for <em>all pods</em> in the <code>banana</code> namespace, we need to add that annotation to the namespace itself, i.e. edit the namespace (<code>kubectl edit namespace banana</code>) so that it has contents similar to:</p>
<pre><code>apiVersion: v1
kind: Namespace
metadata:
  name: banana
  annotations:
    instrumentation.opentelemetry.io/inject-java: &quot;banana-instr&quot;
...
</code></pre>
<p>Now we have a namespace that is going to auto-instrument <em>every</em> Java application deployed in the <code>banana</code> namespace with the 2.5.0 <a href="https://github.com/open-telemetry/opentelemetry-java-instrumentation/">OpenTelemetry Java agent</a>!</p>
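<p>Alternatively, instead of editing the namespace interactively, you can add the same annotation with a single command:</p>
<pre><code>kubectl annotate namespace banana instrumentation.opentelemetry.io/inject-java=banana-instr
</code></pre>
<p>Should you want to disable the default injection again later, the annotation can be removed with <code>kubectl annotate namespace banana instrumentation.opentelemetry.io/inject-java-</code> (note the trailing dash, which is kubectl's syntax for removing an annotation).</p>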
<h2>Trying it<a id="trying-it"></a></h2>
<p>There is a simple example Java application at <a href="http://docker.elastic.co/demos/apm/k8s-webhook-test">docker.elastic.co/demos/apm/k8s-webhook-test</a> which just repeatedly calls the chain <code>main-&gt;methodA-&gt;methodB-&gt;methodC-&gt;methodD</code> with some sleeps in the calls. Running this (<code>kubectl apply -f banana-app.yaml</code>) using a very basic pod definition:</p>
<pre><code>apiVersion: v1
kind: Pod
metadata:
  name: banana-app
  namespace: banana
  labels:
    app: banana-app
spec:
  containers:
    - image: docker.elastic.co/demos/apm/k8s-webhook-test
      imagePullPolicy: Always
      name: banana-app
      env: 
      - name: OTEL_INSTRUMENTATION_METHODS_INCLUDE
        value: &quot;test.Testing[methodB]&quot;
</code></pre>
<p>results in the app being auto-instrumented with no configuration changes! The resulting app shows up in any APM UI, such as Elastic APM:</p>
<p><img src="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-java-agents/elastic-apm-ui-transaction.png" alt="Elastic APM UI showing methodB traced" /></p>
<p>As you can see, for this example I also added the env var <code>OTEL_INSTRUMENTATION_METHODS_INCLUDE=&quot;test.Testing[methodB]&quot;</code> to the pod YAML so that traces from <code>methodB</code> would appear.</p>
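<p>If you want to confirm that the webhook really did mutate the pod, you can inspect the running pod directly; the injected agent path should be visible in both the init container command and the <code>JAVA_TOOL_OPTIONS</code> environment variable:</p>
<pre><code>kubectl get pod banana-app -n banana -o yaml | grep -i javaagent
</code></pre>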
<h2>The technology behind the auto-instrumentation<a id="the-technology-behind-the-auto-instrumentation"></a></h2>
<p>You don’t need to understand the underlying mechanism to use the auto-instrumentation, but for those of you who are interested, here’s a quick outline.</p>
<ol>
<li>The <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> installs a <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/">mutating webhook</a>, a standard Kubernetes component.</li>
<li>When deploying, Kubernetes first sends all definitions to the mutating webhook.</li>
<li>If the mutating webhook sees that the conditions for auto-instrumentation are met (i.e.
<ol>
<li>there is an Instrumentation resource for that namespace and</li>
<li>the correct annotation for that Instrumentation is applied to the definition in some way, either from the definition itself or from the namespace),</li>
</ol>
</li>
<li>then the mutating webhook “mutates” the definition to include the environment defined by the Instrumentation resource.</li>
<li>The environment includes the explicit values defined in the env, as well as some implicit OpenTelemetry values (see the <a href="https://github.com/open-telemetry/opentelemetry-operator/">OpenTelemetry Operator for Kubernetes</a> documentation for full details).</li>
<li>And most importantly, the operator
<ol>
<li>pulls the image defined in the Instrumentation resource,</li>
<li>extracts the file at the path <code>/javaagent.jar</code> from that image (using shell command <code>cp</code>)</li>
<li>inserts it into the pod at path <code>/otel-auto-instrumentation-java/javaagent.jar</code></li>
<li>and adds the environment variable <code>JAVA_TOOL_OPTIONS=-javaagent:/otel-auto-instrumentation-java/javaagent.jar</code>.</li>
</ol>
</li>
<li>The JVM automatically picks up that JAVA_TOOL_OPTIONS environment variable on startup and applies it to the JVM command-line.</li>
</ol>
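<p>Putting those steps together, the mutated pod spec ends up looking roughly like the following sketch (abbreviated; the exact init container and volume names are operator implementation details and may differ between operator versions):</p>
<pre><code>spec:
  initContainers:
    - name: opentelemetry-auto-instrumentation-java
      image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:2.5.0
      command: [&quot;cp&quot;, &quot;/javaagent.jar&quot;, &quot;/otel-auto-instrumentation-java/javaagent.jar&quot;]
      volumeMounts:
        - name: opentelemetry-auto-instrumentation-java
          mountPath: /otel-auto-instrumentation-java
  containers:
    - name: banana-app
      env:
        - name: JAVA_TOOL_OPTIONS
          value: -javaagent:/otel-auto-instrumentation-java/javaagent.jar
      volumeMounts:
        - name: opentelemetry-auto-instrumentation-java
          mountPath: /otel-auto-instrumentation-java
  volumes:
    - name: opentelemetry-auto-instrumentation-java
      emptyDir: {}
</code></pre>
<p>The shared <code>emptyDir</code> volume is what lets the agent jar copied by the init container be visible to the application container at JVM startup.</p>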
<h2>Next steps<a id="next-steps"></a></h2>
<p>This walkthrough can be repeated in any Kubernetes cluster to demonstrate and experiment with auto-instrumentation (you will need to create the <code>banana</code> namespace first). In part 2 of this two-part series, <a href="https://www.elastic.co/observability-labs/blog/using-the-otel-operator-for-injecting-elastic-agents">Using a custom agent with the OpenTelemetry Operator for Kubernetes</a>, I show how to install any Java agent via the OpenTelemetry Operator, using the Elastic Java agents as examples.</p>
]]></content:encoded>
            <category>observability-labs</category>
            <enclosure url="https://www.elastic.co/observability-labs/assets/images/using-the-otel-operator-for-injecting-java-agents/blog-header.png" length="0" type="image/png"/>
        </item>
    </channel>
</rss>